fix(kv): bit-exact partial prefix-KV reuse via chunk-aligned prefill#188

Merged: konjoinfinity merged 1 commit into main from fix/chunk-aligned-reuse on Jun 29, 2026

@wesleyscholl (Collaborator)

Problem

squish's in-memory prompt-prefix KV reuse (default prompt-lookup decode path) let an extending request skip re-prefilling the shared prefix — a real TTFT win. But for partial reuse the greedy output was not byte-identical to a cold (no-reuse) run: ~60% of 40-token generations forked to a different (equally valid) token.

Root cause — bf16 rounding, not a bug

Proven empirically:

  • Equal-length tails still diverged (~0.5 logit); identical prompts were bit-exact (0.0) → it's the suffix, not an off-by-one.
  • Reuse prefills the new suffix in a small variable-length forward; cold prefills it inside the full-prompt forward. The math is identical, but the two matmul shapes round differently in bf16, enough to flip a genuine near-tie (top-2 margin ~0.1). After one flip, generation cascades.
  • fp32 eliminates it (Δlogit 0.5 → 2e-5, divergence 60% → 0%) but costs ~2×, and a 7B model's fp32 weights won't fit in 16 GB, so it was rejected.
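
The rounding effect described in the bullets above can be reproduced without a GPU. This is a toy sketch, not squish code: `to_bf16` is a hypothetical helper that emulates bf16 round-to-nearest-even in pure Python, and the two accumulation orders stand in for the different reduction orders that two matmul shapes may use. Real kernels usually accumulate in higher precision and differ in subtler ways, so it only illustrates why mathematically identical sums can round differently.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bf16 value, ties to even (hypothetical helper)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of a float32; round the discarded 16 bits.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Accumulate left to right, rounding to bf16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


terms = [256.0, 1.0, 1.0]

# (256 + 1) + 1: each +1 is a tie at bf16 spacing 2 and rounds back to 256.
left_to_right = bf16_sum(terms)

# 256 + (1 + 1): the small terms combine first, so the result is exact.
tail_first = to_bf16(terms[0] + bf16_sum(terms[1:]))

print(left_to_right, tail_first)  # 256.0 258.0
```

The same three numbers give two different answers depending only on grouping. With a top-2 logit margin of ~0.1, a discrepancy of this kind is enough to change the argmax.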

Fix — absolute-position-aligned chunked prefill (no fp32)

Prefill the suffix in fixed-size, absolute-position-aligned chunks (_PREFILL_CHUNK), and align the reuse boundary down to a chunk multiple. A token at position p is then always computed in the chunk [⌊p/C⌋·C, …) with the same matmul shape and same prior KV whether or not its prefix was cached → byte-identical to cold. This is the same block-alignment idea behind vLLM/PagedAttention prefix reuse; squish keeps full cross-chunk attention (shared accumulating cache), so no quality loss.
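
The alignment argument above can be checked with a small sketch. It is a minimal illustration, not the squish implementation: `PREFILL_CHUNK`, `align_down`, and `chunk_spans` are hypothetical names, and it assumes the final chunk is simply truncated at the end of the prompt.

```python
from typing import List, Tuple

PREFILL_CHUNK = 64  # stand-in for _PREFILL_CHUNK (C=64 in this PR)


def align_down(n: int, chunk: int = PREFILL_CHUNK) -> int:
    """Largest multiple of `chunk` that is <= n."""
    return (n // chunk) * chunk


def chunk_spans(start: int, end: int, chunk: int = PREFILL_CHUNK) -> List[Tuple[int, int]]:
    """Half-open [lo, hi) spans covering [start, end) on the absolute chunk grid."""
    if start % chunk != 0:
        raise ValueError("start must sit on the chunk grid")
    # Assumption: the last span is cut short at `end` rather than padded.
    return [(lo, min(lo + chunk, end)) for lo in range(start, end, chunk)]


prompt_len = 200
shared_len = 150                     # tokens shared with a cached prompt
reuse_len = align_down(shared_len)   # 128: boundary snapped down to the grid

cold = chunk_spans(0, prompt_len)          # no reuse: prefill everything
warm = chunk_spans(reuse_len, prompt_len)  # reuse: prefill only the suffix

# The spans the reuse path computes are exactly the tail of the cold spans.
suffix_of_cold = cold[len(cold) - len(warm):]
print(warm == suffix_of_cold)  # True
```

Because the spans coincide, every suffix token sees the same forward shape in both paths. The 22 tokens between positions 128 and 150 are recomputed rather than reused, which is the price of snapping the boundary down.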

A guard caps reuse at the prior request's prompt length — decode/spec tokens are written off the chunk grid and must not be reused (PromptPrefixCache.store/borrow).
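
A minimal sketch of that guard follows. `reusable_prefix_len` is a hypothetical standalone function; the real signatures of `PromptPrefixCache.store` and `borrow` are not shown in this PR description.

```python
def reusable_prefix_len(shared_len: int, prior_prompt_len: int, chunk: int = 64) -> int:
    """How many cached KV positions may be borrowed for a new request.

    shared_len:        tokens the new prompt shares with the cached sequence
    prior_prompt_len:  prompt length of the request that filled the cache
    """
    # Positions at or beyond prior_prompt_len were written during decode,
    # off the chunk grid, so they are never eligible for reuse.
    capped = min(shared_len, prior_prompt_len)
    # Snap down to a chunk multiple so the suffix prefill stays on the grid.
    return (capped // chunk) * chunk


# Prior request: 130-token prompt plus 70 decoded tokens (200 cached positions).
# A follow-up prompt that matches all 200 tokens may still borrow only 128.
print(reusable_prefix_len(shared_len=200, prior_prompt_len=130))  # 128
```

Without the cap, the same call would align 200 down to 192 and borrow 62 positions that were computed one token at a time during decode.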

Chunk size

Correctness holds for any C (bit-exactness is shape-identity). C is purely a perf knob: a sweep + a 64-vs-128 head-to-head pick C=64 (best at low overlap, ties elsewhere; below ~48 the GPU underutilises). Full data + tables in benchmarks/prefix_reuse_chunking.md.

Validation

  • Off-grid bit-exactness grid (shared lengths on and off the chunk grid) → 0 divergence over 40 tokens — tests/test_prompt_prefix_cache.py
  • int4 server on every previously-failing case → all IDENTICAL
  • 16/16 unit tests at C=64
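
The shape of such a bit-exactness check can be sketched as below. This is not the contents of tests/test_prompt_prefix_cache.py: `first_divergence` and `grid_lengths` are hypothetical helpers, and the step that actually generates tokens from the model is left out.

```python
from typing import List, Optional, Sequence


def first_divergence(cold: Sequence[int], reuse: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(cold, reuse)):
        if a != b:
            return i
    if len(cold) != len(reuse):
        return min(len(cold), len(reuse))
    return None


def grid_lengths(chunk: int, n_chunks: int) -> List[int]:
    """Shared-prefix lengths just below, on, and just above each chunk boundary."""
    lengths = []
    for k in range(1, n_chunks + 1):
        lengths.extend([k * chunk - 1, k * chunk, k * chunk + 1])
    return lengths


print(grid_lengths(64, 2))                            # [63, 64, 65, 127, 128, 129]
print(first_divergence([5, 9, 2, 7], [5, 9, 4, 7]))   # 2
print(first_divergence([5, 9, 2], [5, 9, 2]))         # None
```

A test would run a cold and a reuse generation for each shared length and assert that `first_divergence` returns `None` over the 40 generated tokens.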

Notes

  • Scoped to the prompt-lookup reuse path (the default deterministic traffic). Output shifts slightly vs the old single-forward path, but cold and reuse now move to chunked together and become mutually consistent.
  • prefix_reuse_curve.py is a standalone reuse benchmark (git-diff summarization prompts) used to produce the curves.

🤖 Generated with Claude Code

Partial prompt-prefix KV reuse forked greedy output from a cold (no-reuse)
prefill — ~60% of generations took a different but equally valid token. Root
cause (proven, not a bug): the new suffix was prefilled in a different-shaped
forward than cold, so bf16 matmul rounding (~0.5 logit) flipped genuine
near-ties. Equal-length tails still diverged; identical prompts were bit-exact.

Fix: prefill the suffix in fixed-size, absolute-position-aligned chunks
(_PREFILL_CHUNK) and align the reuse boundary DOWN to a chunk multiple, so every
token is computed in the same matmul shape whether or not its prefix was cached
— byte-identical to cold, no fp32. Reuse is capped at the prior request's prompt
length (decode/spec tokens are written off the chunk grid). vLLM/PagedAttention-
style block alignment; full cross-chunk attention preserved, so no quality loss.

Validated lossless: off-grid bit-exactness grid (tests) + int4 server on every
previously-failing case. Chunk-size sweep + 64-vs-128 head-to-head pick C=64.
See benchmarks/prefix_reuse_chunking.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@konjoinfinity konjoinfinity merged commit 18c7a87 into main Jun 29, 2026
29 checks passed
@konjoinfinity konjoinfinity deleted the fix/chunk-aligned-reuse branch June 29, 2026 17:58