Hybrid prefix caching fixes #5502
Open
santhnm2 wants to merge 10 commits into
… window Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com> (cherry picked from commit 98eadd8)
The /v1/chat/completions endpoint did not read ignore_eos, so it always terminated generation at the model's EOS even when the request asked to ignore it. This made forced-length generation (min/exact output length) impossible via the chat API. Mirror the /v1/completions endpoint by setting SamplingParams.termination_id = -1 when ignore_eos is requested. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
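A minimal sketch of the mapping described in this commit, for illustration only. The SamplingParams dataclass below is a hypothetical stand-in that carries just two fields, and sampling_params_from_chat_request is an invented helper name; neither is taken from the PR diff. Only the ignore_eos request field and the termination_id = -1 convention come from the commit message.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class SamplingParams:
    """Hypothetical stand-in with only the fields this sketch needs."""

    num_tokens_to_generate: int = 16
    termination_id: Optional[int] = None  # None: fall back to the tokenizer's EOS id


def sampling_params_from_chat_request(body: Mapping[str, Any]) -> SamplingParams:
    """Build sampling params from a chat request body (assumed to be a parsed JSON dict)."""
    params = SamplingParams(num_tokens_to_generate=int(body.get("max_tokens", 16)))
    if body.get("ignore_eos", False):
        # -1 never matches a real token id, so only the length limit ends generation.
        params.termination_id = -1
    return params
```

With this shape, a request carrying "ignore_eos": true and "max_tokens": 8 would be expected to produce exactly 8 tokens, since no sampled token can equal the termination id.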
dummy_forward() runs a full context.reset() to clear its transient one-token state, but that reset also wipes the KV/Mamba prefix-cache allocator state. The engine runs dummy_forward whenever it idles (between requests, and to keep EP all-to-all collectives alive when EP > 1), so at low concurrency the prefix cache was destroyed on every inter-request gap and never accumulated. Add a preserve_prefix_cache flag to reset()/reset_metadata() that skips the allocator/Mamba-slot reset (and keeps step_count monotonic for logging), and have dummy_forward pass it. The flag is gated on enable_prefix_caching, so the prefix-caching-disabled path is byte-identical to before. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
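The behavior described above can be sketched with a toy context. ToyContext, prefix_index, and active_tokens are hypothetical names chosen for this illustration, and the real context tracks far more state; only reset(), dummy_forward(), preserve_prefix_cache, enable_prefix_caching, and step_count are named in the commit message.

```python
class ToyContext:
    """Toy model of a context whose reset can keep prefix-cache state."""

    def __init__(self, enable_prefix_caching: bool) -> None:
        self.enable_prefix_caching = enable_prefix_caching
        self.prefix_index = {}  # block hash -> block id (stand-in for allocator state)
        self.active_tokens = 0  # transient per-step state
        self.step_count = 0

    def reset(self, preserve_prefix_cache: bool = False) -> None:
        # Transient state is always cleared.
        self.active_tokens = 0
        # The flag only takes effect when prefix caching is enabled.
        keep = preserve_prefix_cache and self.enable_prefix_caching
        if not keep:
            self.prefix_index.clear()
            self.step_count = 0

    def dummy_forward(self) -> None:
        self.active_tokens = 1  # one-token placeholder step
        self.reset(preserve_prefix_cache=True)


ctx = ToyContext(enable_prefix_caching=True)
ctx.prefix_index[0xABC] = 7
ctx.step_count = 5
ctx.dummy_forward()  # idle step: the cached prefix survives
```

Gating the flag on enable_prefix_caching inside reset() is what keeps the caching-disabled path unchanged: the flag is accepted but has no effect there.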
Hybrid prefill-skip needs a cached Mamba state at the boundary a later turn resumes from. compute_and_store_offsets only ran on a request's first prefill chunk and framed extraction offsets against that chunk, so for prompts longer than one chunk the last complete block (which lives in a continuation chunk) was never extracted, and the live-state EOS path only fired for exactly block-aligned prompts. As a result the durable Mamba cache barely populated and cross-turn prefill-skip almost never triggered. Make the offsets chunk-relative (chunk_start = finished_chunk_token_count + skip_tokens) so a boundary in any chunk is extractable, guard the live-state EOS path to the final chunk, and call compute_and_store_offsets on every prefill chunk (slot allocation/restore stays first-chunk-only). Relies on the existing invariant that continuation chunks carry/restore Mamba state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
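To make the chunk-relative framing concrete, here is a small sketch under stated assumptions. The helper name last_block_offset_in_chunk, the block_size and chunk_len parameters, and the half-open boundary test are all illustrative choices, not code from the PR; only chunk_start = finished_chunk_token_count + skip_tokens is taken from the commit message.

```python
from typing import Optional


def last_block_offset_in_chunk(
    prompt_len: int,
    block_size: int,
    finished_chunk_token_count: int,
    skip_tokens: int,
    chunk_len: int,
) -> Optional[int]:
    """Return the chunk-relative offset of the prompt's last complete block
    boundary, or None if that boundary does not fall inside this chunk."""
    chunk_start = finished_chunk_token_count + skip_tokens
    boundary = (prompt_len // block_size) * block_size  # absolute token count
    if boundary == 0:
        return None  # prompt is shorter than one block
    if chunk_start < boundary <= chunk_start + chunk_len:
        return boundary - chunk_start
    return None


# 10-token prompt, 4-token blocks, prefilled as chunks of 6 and 4 tokens.
first = last_block_offset_in_chunk(10, 4, 0, 0, 6)
second = last_block_offset_in_chunk(10, 4, 6, 0, 4)
```

In this example the last complete block ends at token 8, which lies in the second chunk at offset 2. If offsets were only computed for the first chunk, that boundary would never be reached.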
99ed138 to
2b3b77e
The chunked-prefill scheduler sized a request's chunk span to min(remaining, token_budget), charging skipped (cached) tokens against the per-step compute budget. A long cached prefix could therefore only be skipped one budget-window at a time and the rest was re-prefilled across chunks, so prefill latency scaled with prompt length instead of the uncached delta. Size the first chunk as prefix_skip + min(remaining - prefix_skip, budget) so the entire cached prefix is skipped at once and only the delta is computed (add_request charges the budget against the computed length, not the span). Add an explicit budget re-check: add_request's effective>=2 clamp can shrink the skip and inflate the computed count, so validate the exact effective length and defer the request on overflow (a later full-budget step admits it), preventing a TokenOverflowError that crashed the engine under concurrency. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
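The sizing change can be checked with a little arithmetic. The sketch below uses hypothetical function names, and its model of the effective>=2 clamp (at least two tokens are computed even when everything is cached) is an assumption made for illustration; the PR's actual add_request logic may differ.

```python
def old_chunk_span(remaining: int, budget: int) -> int:
    # Old behavior: skipped tokens are charged against the compute budget.
    return min(remaining, budget)


def new_first_chunk_span(remaining: int, prefix_skip: int, budget: int) -> int:
    # New behavior: skip the whole cached prefix, then spend budget on the delta.
    return prefix_skip + min(remaining - prefix_skip, budget)


def admit_or_defer(span: int, prefix_skip: int, budget_left: int, min_effective: int = 2) -> str:
    # Assumed clamp: the computed length never drops below min_effective
    # (capped by the span), which can exceed what the skip alone implies.
    effective = max(span - prefix_skip, min(min_effective, span))
    return "admit" if effective <= budget_left else "defer"


# 1000-token prompt, 896 tokens cached, per-step budget of 256.
old_span = old_chunk_span(1000, 256)             # 256: only part of the prefix is skipped
new_span = new_first_chunk_span(1000, 896, 256)  # 1000: skip 896, compute 104
```

Under these assumptions a fully cached 896-token prompt still needs two computed tokens, so with only one token of budget left in the step it is deferred rather than overflowing.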
Add two cumulative, low-overhead metrics to the periodic engine step log (gated on enable_prefix_caching):
- prefill (cumul): computed vs skipped prompt tokens (% skipped) -- shows how much prefill compute prefix caching actually saves, so a rising per-step latency can be attributed to attention over the growing KV context rather than re-prefilling.
- prefix cache util: KV blocks cached/total (+evictable) and Mamba durable slots used/max -- surfaces cache occupancy and Mamba-slot saturation.
The token counters are accumulated on the context per step and drained into engine accumulators alongside the existing hit counters. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
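The accumulate-then-drain pattern for the prefill counters might look roughly like this. PrefillCounters, EngineStats, absorb, and the exact log format are hypothetical; the commit message only states that per-step context counters are drained into cumulative engine accumulators and reported as computed vs skipped tokens with a percentage.

```python
from dataclasses import dataclass


@dataclass
class PrefillCounters:
    """Per-step counters kept on the context (illustrative)."""

    computed: int = 0
    skipped: int = 0

    def drain(self) -> "PrefillCounters":
        # Hand back the current values and zero the live counters.
        out = PrefillCounters(self.computed, self.skipped)
        self.computed = 0
        self.skipped = 0
        return out


class EngineStats:
    """Cumulative engine-side accumulators (illustrative)."""

    def __init__(self) -> None:
        self.computed = 0
        self.skipped = 0

    def absorb(self, step: PrefillCounters) -> None:
        self.computed += step.computed
        self.skipped += step.skipped

    def prefill_line(self) -> str:
        total = self.computed + self.skipped
        pct = 100.0 * self.skipped / total if total else 0.0
        return (
            f"prefill (cumul): computed={self.computed} "
            f"skipped={self.skipped} ({pct:.1f}% skipped)"
        )


ctx_counters = PrefillCounters()
ctx_counters.computed += 104
ctx_counters.skipped += 896
stats = EngineStats()
stats.absorb(ctx_counters.drain())
line = stats.prefill_line()
```

Draining keeps the per-step cost to a couple of integer additions, while the engine-side totals stay cumulative across steps.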
Cover the changed behavior in tests/unit_tests/inference/contexts/test_dynamic_prefix_caching.py (TestPrefixCacheReuseFixes):
- reset(preserve_prefix_cache=True) keeps the KV hash index while a plain reset() clears it (the idle dummy_forward path), and the flag is a no-op when prefix caching is disabled.
- prefill computed/skipped token counters track a prefix-cache hit correctly.
- Mamba state extraction reaches the last complete block of a non-block-aligned multi-chunk prompt via chunk-relative offsets (the boundary lives in a continuation chunk).
The whole-chunk-skip scheduler sizing and the TokenOverflow defer guard are exercised end-to-end by the hybrid prefix-caching e2e suite at runtime. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
With CUDA graphs enabled the bucket list always includes a size-1 (tp_size) graph, so a prompt shorter than d_conv is captured at a bucket whose token layout is shorter than the conv-state gather window (d_conv positions per slot, with unused slots using abs_position == d_conv). The test runs such a prefill end-to-end and asserts it generates the requested number of tokens. Also refactor TestHybridChunkedPrefillIntermediateState to seed once in setup_class and use the shared clear_nvte_env_vars() helper instead of repeating the seed/env boilerplate per test (drops the now-unused os import). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
2b3b77e to
09b5994
Contributor Author
/ok to test 61be5e0
Contributor Author
/claude review
lmcafee-nvidia
approved these changes
Jun 26, 2026
What does this PR do ?
This PR makes the following fixes:
Honor ignore_eos in the dynamic /v1/chat/completions endpoint so forced-length generation works the same as on /v1/completions.
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.
For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.