Hybrid prefix caching fixes by santhnm2 · Pull Request #5502 · NVIDIA/Megatron-LM

santhnm2 · 2026-06-25T17:08:54Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

This PR makes the following fixes:

Stop the idle dummy_forward step from wiping the KV/Mamba prefix cache, so cross-request prefix reuse survives the gaps between requests (and at EP > 1).
Cache Mamba state at the last complete block of multi-chunk prompts so a later turn can actually skip prefill on hybrid models, instead of the durable cache barely populating.
Skip the entire cached prefix in a single chunk rather than one token-budget window at a time, so prefill latency scales with the uncached delta instead of the full prompt length (with a budget re-check that avoids the TokenOverflowError from add_request's minimum-compute clamp).
Clamp the Mamba conv-state gather positions into range so a prefill shorter than the convolution window (e.g. a small CUDA-graph warmup bucket) no longer overruns the token axis.
Honor ignore_eos in the dynamic /v1/chat/completions endpoint so forced-length generation works the same as on /v1/completions.
Add cumulative prefill-compute (computed vs skipped tokens) and prefix-cache utilization stats to the engine's periodic log for diagnosing cache effectiveness.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

… window Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com> (cherry picked from commit 98eadd8)

The /v1/chat/completions endpoint did not read ignore_eos, so it always terminated generation at the model's EOS even when the request asked to ignore it. This makes forced-length generation (min/exact output length) impossible via the chat API. Mirror the /v1/completions endpoint by setting SamplingParams.termination_id = -1 when ignore_eos is requested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

dummy_forward() runs a full context.reset() to clear its transient one-token state, but that reset also wipes the KV/Mamba prefix-cache allocator state. The engine runs dummy_forward whenever it idles (between requests, and to keep EP all-to-all collectives alive when EP > 1), so at low concurrency the prefix cache was destroyed on every inter-request gap and never accumulated. Add a preserve_prefix_cache flag to reset()/reset_metadata() that skips the allocator/Mamba-slot reset (and keeps step_count monotonic for logging), and have dummy_forward pass it. The flag is gated on enable_prefix_caching, so the prefix-caching-disabled path is byte-identical to before. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Hybrid prefill-skip needs a cached Mamba state at the boundary a later turn resumes from. compute_and_store_offsets only ran on a request's first prefill chunk and framed extraction offsets against that chunk, so for prompts longer than one chunk the last complete block (which lives in a continuation chunk) was never extracted, and the live-state EOS path only fired for exactly block-aligned prompts. As a result the durable Mamba cache barely populated and cross-turn prefill-skip almost never triggered. Make the offsets chunk-relative (chunk_start = finished_chunk_token_count + skip_tokens) so a boundary in any chunk is extractable, guard the live-state EOS path to the final chunk, and call compute_and_store_offsets on every prefill chunk (slot allocation/restore stays first-chunk-only). Relies on the existing invariant that continuation chunks carry/restore Mamba state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

copy-pr-bot · 2026-06-25T17:08:59Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

The chunked-prefill scheduler sized a request's chunk span to min(remaining, token_budget), charging skipped (cached) tokens against the per-step compute budget. A long cached prefix could therefore only be skipped one budget-window at a time and the rest was re-prefilled across chunks, so prefill latency scaled with prompt length instead of the uncached delta. Size the first chunk as prefix_skip + min(remaining - prefix_skip, budget) so the entire cached prefix is skipped at once and only the delta is computed (add_request charges the budget against the computed length, not the span). Add an explicit budget re-check: add_request's effective>=2 clamp can shrink the skip and inflate the computed count, so validate the exact effective length and defer the request on overflow (a later full-budget step admits it), preventing a TokenOverflowError that crashed the engine under concurrency. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Add two cumulative, low-overhead metrics to the periodic engine step log (gated on enable_prefix_caching): - prefill (cumul): computed vs skipped prompt tokens (% skipped) -- shows how much prefill compute prefix caching actually saves, so a rising per-step latency can be attributed to attention over the growing KV context rather than re-prefilling. - prefix cache util: KV blocks cached/total (+evictable) and Mamba durable slots used/max -- surfaces cache occupancy and Mamba-slot saturation. The token counters are accumulated on the context per step and drained into engine accumulators alongside the existing hit counters. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Cover the changed behavior in tests/unit_tests/inference/contexts/ test_dynamic_prefix_caching.py (TestPrefixCacheReuseFixes): - reset(preserve_prefix_cache=True) keeps the KV hash index while a plain reset() clears it (the idle dummy_forward path), and the flag is a no-op when prefix caching is disabled. - prefill computed/skipped token counters track a prefix-cache hit correctly. - Mamba state extraction reaches the last complete block of a non-block-aligned multi-chunk prompt via chunk-relative offsets (the boundary lives in a continuation chunk). The whole-chunk-skip scheduler sizing and the TokenOverflow defer guard are exercised end-to-end by the hybrid prefix-caching e2e suite at runtime. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

With CUDA graphs enabled the bucket list always includes a size-1 (tp_size) graph, so a prompt shorter than d_conv is captured at a bucket whose token layout is shorter than the conv-state gather window (d_conv positions per slot, with unused slots using abs_position == d_conv). The test runs such a prefill end-to-end and asserts it generates the requested number of tokens. Also refactor TestHybridChunkedPrefillIntermediateState to seed once in setup_class and use the shared clear_nvte_env_vars() helper instead of repeating the seed/env boilerplate per test (drops the now-unused os import). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

copy-pr-bot · 2026-06-25T20:11:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

santhnm2 · 2026-06-25T20:11:38Z

/ok to test 61be5e0

santhnm2 · 2026-06-25T20:21:44Z

/claude review

claude

LGTM

santhnm2 and others added 4 commits June 25, 2026 09:19

Fix a bug with prefix caching when number of tokens is less than conv…

c581e87

… window Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com> (cherry picked from commit 98eadd8)

santhnm2 force-pushed the prefix-caching-hybrid-fixes branch 2 times, most recently from 99ed138 to 2b3b77e Compare June 25, 2026 18:35

santhnm2 and others added 4 commits June 25, 2026 12:02

santhnm2 force-pushed the prefix-caching-hybrid-fixes branch from 2b3b77e to 09b5994 Compare June 25, 2026 19:03

santhnm2 marked this pull request as ready for review June 25, 2026 19:59

santhnm2 requested review from a team as code owners June 25, 2026 19:59

copy-pr-bot Bot temporarily deployed to public June 25, 2026 19:59 Inactive

svcnvidia-nemo-ci added the complexity: medium label Jun 25, 2026

copy-pr-bot Bot temporarily deployed to public June 25, 2026 20:03 Inactive

santhnm2 changed the title ~~Prefix caching hybrid fixes~~ Hybrid prefix caching fixes Jun 25, 2026

Linting

15da24f

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Merge branch 'main' into prefix-caching-hybrid-fixes

61be5e0

copy-pr-bot Bot temporarily deployed to public June 25, 2026 20:12 Inactive

copy-pr-bot Bot temporarily deployed to test June 25, 2026 20:12 Inactive

copy-pr-bot Bot temporarily deployed to public June 25, 2026 20:15 Inactive

copy-pr-bot Bot temporarily deployed to public June 25, 2026 20:16 Inactive

claude Bot approved these changes Jun 25, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to public June 25, 2026 20:25 Inactive

lmcafee-nvidia approved these changes Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hybrid prefix caching fixes#5502

Hybrid prefix caching fixes#5502
santhnm2 wants to merge 10 commits into
NVIDIA:mainfrom
santhnm2:prefix-caching-hybrid-fixes

santhnm2 commented Jun 25, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 25, 2026

Uh oh!

copy-pr-bot Bot commented Jun 25, 2026

Uh oh!

santhnm2 commented Jun 25, 2026

Uh oh!

santhnm2 commented Jun 25, 2026

Uh oh!

claude Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

santhnm2 commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issue tracking

Contribution process

Pre-checks

Code review

Step 1: Mark PR as "Ready for Review"

Step 2: Final Review

Step 3: Approved

Merge

Uh oh!

copy-pr-bot Bot commented Jun 25, 2026

Uh oh!

copy-pr-bot Bot commented Jun 25, 2026

Uh oh!

santhnm2 commented Jun 25, 2026

Uh oh!

santhnm2 commented Jun 25, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

santhnm2 commented Jun 25, 2026 •

edited

Loading