fix(kv): bit-exact partial prefix-KV reuse via chunk-aligned prefill by wesleyscholl · Pull Request #188 · konjoai/squish

wesleyscholl · 2026-06-29T17:29:53Z

Problem

squish's in-memory prompt-prefix KV reuse (default prompt-lookup decode path) let an extending request skip re-prefilling the shared prefix — a real TTFT win. But for partial reuse the greedy output was not byte-identical to a cold (no-reuse) run: ~60% of 40-token generations forked to a different (equally valid) token.

Root cause — bf16 rounding, not a bug

Proven empirically:

Equal-length tails still diverged (~0.5 logit); identical prompts were bit-exact (0.0) → it's the suffix, not an off-by-one.
Reuse prefills the new suffix in a small variable-length forward; cold prefills it inside the full-prompt forward. The math is identical, but the two matmul shapes round differently in bf16, enough to flip a genuine near-tie (top-2 margin ~0.1). After one flip, generation cascades.
fp32 eliminates it (Δlogit 0.5 → 2e-5, 60% → 0%) but costs ~2×, and the fp32 weights of a 7B model (~28 GB) won't fit in 16 GB, so it was rejected.
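
As a toy illustration of the rounding effect described above (not squish code): the sketch below emulates bf16 rounding in pure Python and sums the same three numbers in two different groupings. The helper names `to_bf16` and `bf16_sum` are hypothetical, and the link to matmul shapes is an assumption: a different forward shape can lead a kernel to group its partial sums differently, which is the kind of reordering mimicked here.

```python
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    Hypothetical helper for illustration; NaN/inf handling is omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias so that ties go to the even bf16 mantissa,
    # then drop the low 16 bits of the float32 pattern.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_sum(values):
    """Left-to-right sum with the accumulator rounded to bf16 after each add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc

terms = [256.0, 1.0, 1.0]

# Grouping A: ((256 + 1) + 1). Each +1 is lost, because bf16 spacing at 256 is 2.
one_pass = bf16_sum(terms)

# Grouping B: 256 + (1 + 1). The small terms are combined first and survive.
two_pass = bf16_sum([terms[0], bf16_sum(terms[1:])])

print(one_pass, two_pass)  # 256.0 258.0
```

The math is the same in both groupings; only the order of rounding differs, which is enough to move a result by one bf16 step.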

Fix — absolute-position-aligned chunked prefill (no fp32)

Prefill the suffix in fixed-size, absolute-position-aligned chunks (_PREFILL_CHUNK), and align the reuse boundary down to a chunk multiple. A token at position p is then always computed in the chunk [⌊p/C⌋·C, …) with the same matmul shape and same prior KV whether or not its prefix was cached → byte-identical to cold. This is how vLLM/PagedAttention get exact prefix reuse; squish keeps full cross-chunk attention (shared accumulating cache), so no quality loss.

A guard caps reuse at the prior request's prompt length — decode/spec tokens are written off the chunk grid and must not be reused (PromptPrefixCache.store/borrow).
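
The boundary arithmetic above can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `PREFILL_CHUNK`, `aligned_reuse_len`, and `chunk_spans` are hypothetical names, and applying the prompt-length cap before aligning down is an assumed ordering.

```python
from typing import Iterator, Tuple

PREFILL_CHUNK = 64  # stand-in for _PREFILL_CHUNK (C=64 per the sweep)

def aligned_reuse_len(shared_len: int, prior_prompt_len: int,
                      chunk: int = PREFILL_CHUNK) -> int:
    """Number of cached positions that may be reused.

    Cap at the prior request's prompt length (its decode/spec tokens are
    off the chunk grid), then align DOWN to a chunk multiple.
    """
    usable = min(shared_len, prior_prompt_len)
    return (usable // chunk) * chunk

def chunk_spans(start: int, end: int,
                chunk: int = PREFILL_CHUNK) -> Iterator[Tuple[int, int]]:
    """Yield absolute-position-aligned [lo, hi) spans covering [start, end)."""
    pos = start
    while pos < end:
        # Stop at the next chunk boundary or at the end of the prompt.
        stop = min((pos // chunk + 1) * chunk, end)
        yield (pos, stop)
        pos = stop

prompt_len = 200

# Cold run: prefill everything from position 0.
cold = list(chunk_spans(0, prompt_len))

# Reuse run: 150 tokens match a prior request whose prompt was 170 long.
reuse_from = aligned_reuse_len(shared_len=150, prior_prompt_len=170)
warm = list(chunk_spans(reuse_from, prompt_len))

print(reuse_from)  # 128
print(cold)        # [(0, 64), (64, 128), (128, 192), (192, 200)]
print(warm)        # [(128, 192), (192, 200)]
```

In this sketch the reuse run issues exactly the trailing chunks of the cold run, so every suffix token sees the same chunk span in both cases.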

Chunk size

Correctness holds for any C (bit-exactness is shape-identity). C is purely a perf knob: a sweep + a 64-vs-128 head-to-head pick C=64 (best at low overlap, ties elsewhere; below ~48 the GPU underutilises). Full data + tables in benchmarks/prefix_reuse_chunking.md.

Validation

Off-grid bit-exactness grid (shared lengths on and off the chunk grid) → 0 divergence over 40 tokens — tests/test_prompt_prefix_cache.py
int4 server on every previously-failing case → all IDENTICAL
16/16 unit tests at C=64
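
For readers who want to reproduce this kind of check, a divergence test over two greedy token sequences might look like the sketch below. `first_divergence` is a hypothetical helper, not a function from tests/test_prompt_prefix_cache.py, and the token IDs are made up for illustration.

```python
from typing import Optional, Sequence

def first_divergence(cold: Sequence[int],
                     reuse: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (a, b) in enumerate(zip(cold, reuse)):
        if a != b:
            return i
    # Same common prefix but different lengths also counts as a divergence.
    if len(cold) != len(reuse):
        return min(len(cold), len(reuse))
    return None

# Made-up token IDs: identical runs report no divergence,
# a single flipped token reports its position.
cold_tokens = [11, 42, 7, 7, 93]
same_tokens = [11, 42, 7, 7, 93]
forked_tokens = [11, 42, 8, 5, 60]

print(first_divergence(cold_tokens, same_tokens))    # None
print(first_divergence(cold_tokens, forked_tokens))  # 2
```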

Notes

Scoped to the prompt-lookup reuse path (the default deterministic traffic). Output shifts slightly vs the old single-forward path, but cold and reuse now move to chunked together and become mutually consistent.
prefix_reuse_curve.py is a standalone reuse benchmark (git-diff summarization prompts) used to produce the curves.

🤖 Generated with Claude Code

Partial prompt-prefix KV reuse forked greedy output from a cold (no-reuse) prefill: ~60% of generations took a different but equally valid token.

Root cause (proven, not a bug): the new suffix was prefilled in a different-shaped forward than cold, so bf16 matmul rounding (~0.5 logit) flipped genuine near-ties. Equal-length tails still diverged; identical prompts were bit-exact.

Fix: prefill the suffix in fixed-size, absolute-position-aligned chunks (_PREFILL_CHUNK) and align the reuse boundary DOWN to a chunk multiple, so every token is computed in the same matmul shape whether or not its prefix was cached. The result is byte-identical to cold, with no fp32. Reuse is capped at the prior request's prompt length (decode/spec tokens are written off the chunk grid). vLLM/PagedAttention-style block alignment; full cross-chunk attention preserved, so no quality loss.

Validated lossless: off-grid bit-exactness grid (tests) + int4 server on every previously-failing case. Chunk-size sweep + 64-vs-128 head-to-head pick C=64. See benchmarks/prefix_reuse_chunking.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

konjoinfinity merged commit 18c7a87 into main Jun 29, 2026
29 checks passed

konjoinfinity deleted the fix/chunk-aligned-reuse branch June 29, 2026 17:58

konjoinfinity merged 1 commit into main from fix/chunk-aligned-reuse
