fix(kv): bit-exact partial prefix-KV reuse via chunk-aligned prefill #188
Merged
Partial prompt-prefix KV reuse forked greedy output from a cold (no-reuse) prefill: ~60% of generations took a different but equally valid token.

Root cause (proven, not a bug): the new suffix was prefilled in a different-shaped forward than cold, so bf16 matmul rounding (~0.5 logit) flipped genuine near-ties. Equal-length tails still diverged; identical prompts were bit-exact.

Fix: prefill the suffix in fixed-size, absolute-position-aligned chunks (_PREFILL_CHUNK) and align the reuse boundary DOWN to a chunk multiple, so every token is computed in the same matmul shape whether or not its prefix was cached. The result is byte-identical to cold, with no fp32. Reuse is capped at the prior request's prompt length (decode/spec tokens are written off the chunk grid). This is vLLM/PagedAttention-style block alignment; full cross-chunk attention is preserved, so there is no quality loss.

Validated lossless: off-grid bit-exactness grid (tests) + int4 server on every previously-failing case. Chunk-size sweep + 64-vs-128 head-to-head pick C=64. See benchmarks/prefix_reuse_chunking.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
squish's in-memory prompt-prefix KV reuse (default prompt-lookup decode path) let an extending request skip re-prefilling the shared prefix — a real TTFT win. But for partial reuse the greedy output was not byte-identical to a cold (no-reuse) run: ~60% of 40-token generations forked to a different (equally valid) token.
Root cause — bf16 rounding, not a bug
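Proven empirically: the new suffix was prefilled in a forward pass of a different shape than the cold run, so bf16 matmul rounding (~0.5 logit) flipped genuine near-ties. Equal-length tails still diverged, while identical prompts were bit-exact.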
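Why a differently shaped computation can flip a near-tie: floating-point addition is not associative, so reducing the same terms in a different order can round differently. The sketch below is an illustration only, using plain Python float64 rather than bf16 and no squish code; the helper names and the rival logit are made up for the example.

```python
# Illustration only: plain Python float64, not bf16 and not squish code.
# All names here are hypothetical.

def sum_left_to_right(values):
    """Accumulate in the given order."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_right_to_left(values):
    """Accumulate the same terms in the opposite order."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


def greedy_pick(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=logits.__getitem__)


terms = [0.1, 0.2, 0.3]
score_a = sum_left_to_right(terms)   # one reduction order
score_b = sum_right_to_left(terms)   # same terms, different order

rival = 0.6                          # a competing token's logit (a near-tie)
pick_a = greedy_pick([rival, score_a])
pick_b = greedy_pick([rival, score_b])

print(score_a == score_b, pick_a, pick_b)  # prints: False 1 0
```

Both sums are valid roundings of the same mathematical value, yet greedy selection picks a different index. That is the same kind of fork described above, only at a much smaller magnitude than bf16 rounding.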
Fix — absolute-position-aligned chunked prefill (no fp32)
Prefill the suffix in fixed-size, absolute-position-aligned chunks (_PREFILL_CHUNK), and align the reuse boundary down to a chunk multiple. A token at position p is then always computed in the chunk [⌊p/C⌋·C, …) with the same matmul shape and the same prior KV whether or not its prefix was cached, so the output is byte-identical to cold. This is how vLLM/PagedAttention get exact prefix reuse; squish keeps full cross-chunk attention (shared accumulating cache), so there is no quality loss.
A guard caps reuse at the prior request's prompt length: decode/spec tokens are written off the chunk grid and must not be reused (PromptPrefixCache.store/borrow).
Chunk size
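For a given chunk size C, the chunk grid and the aligned-down reuse boundary can be sketched as follows. This is a minimal sketch of the idea, not squish's implementation: `align_down`, `chunk_spans`, `reuse_boundary`, and the example lengths are hypothetical names and values chosen for illustration. It only computes which position ranges would be prefilled, and checks that a reuse run prefills exactly the same spans as a cold run from the boundary onward.

```python
from __future__ import annotations

# Hypothetical sketch: names and numbers are illustrative, not squish's API.


def align_down(n: int, chunk: int) -> int:
    """Largest multiple of `chunk` that is <= n."""
    return (n // chunk) * chunk


def chunk_spans(start: int, end: int, chunk: int) -> list[tuple[int, int]]:
    """Split [start, end) into spans that never cross an absolute chunk boundary."""
    spans = []
    pos = start
    while pos < end:
        stop = min(align_down(pos, chunk) + chunk, end)
        spans.append((pos, stop))
        pos = stop
    return spans


def reuse_boundary(shared_prefix_len: int, prior_prompt_len: int, chunk: int) -> int:
    """Cap reuse at the prior prompt length, then align down to the chunk grid."""
    return align_down(min(shared_prefix_len, prior_prompt_len), chunk)


prompt_len = 300
for C in (16, 64, 128):
    boundary = reuse_boundary(shared_prefix_len=200, prior_prompt_len=150, chunk=C)
    cold = chunk_spans(0, prompt_len, C)         # no reuse: prefill everything
    warm = chunk_spans(boundary, prompt_len, C)  # reuse: prefill only the suffix
    # Every span computed in the reuse run also appears, unchanged, in the cold run.
    assert warm == [span for span in cold if span[0] >= boundary]
    print(C, boundary, warm[0])
```

Because the boundary sits on the grid, the first reused-run span starts a full chunk, so each recomputed token sees the same span shape as in the cold run regardless of C.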
Correctness holds for any C (bit-exactness is shape-identity). C is purely a perf knob: a sweep plus a 64-vs-128 head-to-head pick C=64 (best at low overlap, ties elsewhere; below ~48 the GPU underutilises). Full data and tables are in benchmarks/prefix_reuse_chunking.md.
Validation
tests/test_prompt_prefix_cache.py
IDENTICAL
Notes
prefix_reuse_curve.py is a standalone reuse benchmark (git-diff summarization prompts) used to produce the curves.
🤖 Generated with Claude Code