Skip to content

Hybrid prefix caching fixes#5502

Open
santhnm2 wants to merge 10 commits into
NVIDIA:mainfrom
santhnm2:prefix-caching-hybrid-fixes
Open

Hybrid prefix caching fixes#5502
santhnm2 wants to merge 10 commits into
NVIDIA:mainfrom
santhnm2:prefix-caching-hybrid-fixes

Conversation

@santhnm2

@santhnm2 santhnm2 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor
  • I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

This PR makes the following fixes:

  1. Stop the idle dummy_forward step from wiping the KV/Mamba prefix cache, so cross-request prefix reuse survives the gaps between requests (and at EP > 1).
  2. Cache Mamba state at the last complete block of multi-chunk prompts so a later turn can actually skip prefill on hybrid models, instead of the durable cache barely populating.
  3. Skip the entire cached prefix in a single chunk rather than one token-budget window at a time, so prefill latency scales with the uncached delta instead of the full prompt length (with a budget re-check that avoids the TokenOverflowError from add_request's minimum-compute clamp).
  4. Clamp the Mamba conv-state gather positions into range so a prefill shorter than the convolution window (e.g. a small CUDA-graph warmup bucket) no longer overruns the token axis.
  5. Honor ignore_eos in the dynamic /v1/chat/completions endpoint so forced-length generation works the same as on /v1/completions.
  6. Add cumulative prefill-compute (computed vs skipped tokens) and prefix-cache utilization stats to the engine's periodic log for diagnosing cache effectiveness.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

santhnm2 and others added 4 commits June 25, 2026 09:19
… window

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
(cherry picked from commit 98eadd8)
The /v1/chat/completions endpoint did not read ignore_eos, so it always
terminated generation at the model's EOS even when the request asked to
ignore it. This makes forced-length generation (min/exact output length)
impossible via the chat API. Mirror the /v1/completions endpoint by setting
SamplingParams.termination_id = -1 when ignore_eos is requested.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
dummy_forward() runs a full context.reset() to clear its transient one-token
state, but that reset also wipes the KV/Mamba prefix-cache allocator state.
The engine runs dummy_forward whenever it idles (between requests, and to keep
EP all-to-all collectives alive when EP > 1), so at low concurrency the prefix
cache was destroyed on every inter-request gap and never accumulated.

Add a preserve_prefix_cache flag to reset()/reset_metadata() that skips the
allocator/Mamba-slot reset (and keeps step_count monotonic for logging), and
have dummy_forward pass it. The flag is gated on enable_prefix_caching, so the
prefix-caching-disabled path is byte-identical to before.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Hybrid prefill-skip needs a cached Mamba state at the boundary a later turn
resumes from. compute_and_store_offsets only ran on a request's first prefill
chunk and framed extraction offsets against that chunk, so for prompts longer
than one chunk the last complete block (which lives in a continuation chunk)
was never extracted, and the live-state EOS path only fired for exactly
block-aligned prompts. As a result the durable Mamba cache barely populated and
cross-turn prefill-skip almost never triggered.

Make the offsets chunk-relative (chunk_start = finished_chunk_token_count +
skip_tokens) so a boundary in any chunk is extractable, guard the live-state
EOS path to the final chunk, and call compute_and_store_offsets on every
prefill chunk (slot allocation/restore stays first-chunk-only). Relies on the
existing invariant that continuation chunks carry/restore Mamba state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@santhnm2 santhnm2 force-pushed the prefix-caching-hybrid-fixes branch 2 times, most recently from 99ed138 to 2b3b77e Compare June 25, 2026 18:35
santhnm2 and others added 4 commits June 25, 2026 12:02
The chunked-prefill scheduler sized a request's chunk span to
min(remaining, token_budget), charging skipped (cached) tokens against the
per-step compute budget. A long cached prefix could therefore only be skipped
one budget-window at a time and the rest was re-prefilled across chunks, so
prefill latency scaled with prompt length instead of the uncached delta.

Size the first chunk as prefix_skip + min(remaining - prefix_skip, budget) so
the entire cached prefix is skipped at once and only the delta is computed
(add_request charges the budget against the computed length, not the span).
Add an explicit budget re-check: add_request's effective>=2 clamp can shrink
the skip and inflate the computed count, so validate the exact effective length
and defer the request on overflow (a later full-budget step admits it),
preventing a TokenOverflowError that crashed the engine under concurrency.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Add two cumulative, low-overhead metrics to the periodic engine step log
(gated on enable_prefix_caching):

- prefill (cumul): computed vs skipped prompt tokens (% skipped) -- shows how
  much prefill compute prefix caching actually saves, so a rising per-step
  latency can be attributed to attention over the growing KV context rather
  than re-prefilling.
- prefix cache util: KV blocks cached/total (+evictable) and Mamba durable
  slots used/max -- surfaces cache occupancy and Mamba-slot saturation.

The token counters are accumulated on the context per step and drained into
engine accumulators alongside the existing hit counters.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Cover the changed behavior in tests/unit_tests/inference/contexts/
test_dynamic_prefix_caching.py (TestPrefixCacheReuseFixes):

- reset(preserve_prefix_cache=True) keeps the KV hash index while a plain
  reset() clears it (the idle dummy_forward path), and the flag is a no-op
  when prefix caching is disabled.
- prefill computed/skipped token counters track a prefix-cache hit correctly.
- Mamba state extraction reaches the last complete block of a non-block-aligned
  multi-chunk prompt via chunk-relative offsets (the boundary lives in a
  continuation chunk).

The whole-chunk-skip scheduler sizing and the TokenOverflow defer guard are
exercised end-to-end by the hybrid prefix-caching e2e suite at runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
With CUDA graphs enabled the bucket list always includes a size-1 (tp_size)
graph, so a prompt shorter than d_conv is captured at a bucket whose token
layout is shorter than the conv-state gather window (d_conv positions per slot,
with unused slots using abs_position == d_conv). The test runs such a prefill
end-to-end and asserts it generates the requested number of tokens.

Also refactor TestHybridChunkedPrefillIntermediateState to seed once in
setup_class and use the shared clear_nvte_env_vars() helper instead of repeating
the seed/env boilerplate per test (drops the now-unused os import).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
@santhnm2 santhnm2 force-pushed the prefix-caching-hybrid-fixes branch from 2b3b77e to 09b5994 Compare June 25, 2026 19:03
@santhnm2 santhnm2 marked this pull request as ready for review June 25, 2026 19:59
@santhnm2 santhnm2 requested review from a team as code owners June 25, 2026 19:59
@santhnm2 santhnm2 changed the title Prefix caching hybrid fixes Hybrid prefix caching fixes Jun 25, 2026
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@santhnm2

Copy link
Copy Markdown
Contributor Author

/ok to test 61be5e0

@santhnm2

Copy link
Copy Markdown
Contributor Author

/claude review

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants