Eliminate persistent V cache for v_from_value_emb target layers by Upcccccc · Pull Request #13 · RiddleHe/nanochat

Upcccccc · 2026-05-21T04:12:28Z

Summary

Implements a per-layer V cache in KVCache: target layers get a None slot (no allocation), while non-target layers keep their tensor as before. A token_ids_history field lets attention rebuild the full V history from value_embeds[layer] at each step.
Adds a new attention forward branch for v_from_value_emb inference. It bypasses flash_attn_with_kvcache (which would write to v_cache), manually appends K to k_cache, and runs flash_attn_func with the gathered V.
GPTBase.forward now gathers V for the full history (T0+T positions: the cached prefix plus the new tokens) when the cache is in v_from_value_emb mode, so attention has every position's V available without persistent storage.
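The cache layout described above can be sketched in plain Python. This is a toy illustration, not the code in nanochat/engine.py: the class name ToyKVCache, the methods append_tokens and gather_v, and the use of lists in place of tensors are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Vector = List[float]


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for the per-layer cache layout (lists replace tensors)."""
    n_layers: int
    target_layers: frozenset  # layers whose V is read from the value-embedding table
    k_cache: List[List[Vector]] = field(init=False)
    v_cache: List[Optional[List[Vector]]] = field(init=False)
    token_ids_history: List[int] = field(init=False)

    def __post_init__(self) -> None:
        self.k_cache = [[] for _ in range(self.n_layers)]
        # Target layers get None: no V storage is allocated for them.
        self.v_cache = [
            None if i in self.target_layers else [] for i in range(self.n_layers)
        ]
        self.token_ids_history = []

    def append_tokens(self, token_ids: List[int]) -> None:
        # Intended to be called once per forward call, not once per layer.
        self.token_ids_history.extend(token_ids)

    def gather_v(self, layer: int, value_embeds: Dict[int, List[Vector]]) -> List[Vector]:
        # Rebuild the full V history for a target layer by table lookup.
        table = value_embeds[layer]
        return [table[t] for t in self.token_ids_history]


# Toy usage: 4 layers, the second half are target layers.
cache = ToyKVCache(n_layers=4, target_layers=frozenset({2, 3}))
value_embeds = {2: [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]}  # token id -> V row
cache.append_tokens([2, 0])  # prefill
cache.append_tokens([1])     # one decode step
v_history = cache.gather_v(2, value_embeds)
```

The point of the design is that a target layer's V depends only on the token id, so storing the integer history once is enough to recover V for every such layer.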

Verification

tests/test_v_cache_elimination.py checks two things:

KVCache allocates v_cache slots only for non-target layers (target layers → None), and token_ids_history is allocated.
Bit-exact equivalence between the cached forward (new path) and the uncached forward on the same v_from_value_emb model. Max |Δ| = 0.0000e+00 at every position, and the greedy argmax matches (" Berlin" for the "capital of Germany" test).
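A toy version of the equivalence check may help show why a zero difference is plausible. It is a sketch under stated assumptions, not the test in tests/test_v_cache_elimination.py: the attend helper, the table values, and the single-head pure-Python attention are all hypothetical, and the real test compares model outputs rather than a hand-written attention.

```python
import math


def attend(q, keys, values):
    """Single-head softmax attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


# Hypothetical value-embedding table for one target layer: token id -> V row.
ve_table = [[0.5, -1.0], [1.5, 2.0], [-0.25, 0.75]]
token_ids = [2, 0, 1, 1]
keys = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6], [0.7, 0.8]]
q = [0.9, -0.1]

# Path A: V rows are stored in a persistent v_cache as tokens arrive.
v_cache = []
for t in token_ids:
    v_cache.append(ve_table[t])
out_cached = attend(q, keys, v_cache)

# Path B: no v_cache; V is gathered from the table using the token-id history.
out_gathered = attend(q, keys, [ve_table[t] for t in token_ids])

max_abs_diff = max(abs(a - b) for a, b in zip(out_cached, out_gathered))
```

Both paths feed identical V rows into the same arithmetic in the same order, so the difference in this toy is exactly zero.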

Memory savings (empirical, d24 second_half model, B=1)

The same model is run through two inference paths (A: full v_cache; B: the new no-v_cache path). VE-table overhead is identical in both, so the delta is pure V-cache savings.

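| Context length T (tokens) | Path A (GB) | Path B (GB) | Savings (GB) | Savings (% of baseline) |
|---|---|---|---|---|
| 2K | 4.52 | 4.52 | 0.05 | 1.1% |
| 16K | 6.58 | 6.11 | 0.48 | 7.3% |
| 32K | 8.70 | 7.92 | 0.78 | 9.0% |
| 65K | 13.94 | 11.54 | 2.40 | 17.2% |
| 128K | 23.62 | 18.79 | 4.83 | 20.5% |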

Asymptotic savings: L_t / (2L) of the total KV cache, since V is half of the KV cache and only the L_t target layers (out of L) drop it. For second_half (L_t/L = 1/2) this is 25%.
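The asymptotic figure can be checked with a few lines of arithmetic. This is a minimal sketch: the function name is hypothetical, it assumes K and V occupy the same amount of memory per layer, and the layer counts 24, 12, and 8 are illustrative choices that match the stated ratios rather than values confirmed by the PR.

```python
from fractions import Fraction


def v_cache_savings_fraction(target_layers: int, total_layers: int) -> Fraction:
    """Fraction of total KV-cache memory removed when target layers store no V.

    Assumes K and V are the same size per layer, so V is half of the KV cache
    and target layers hold target_layers / total_layers of that half.
    """
    return Fraction(target_layers, 2 * total_layers)


# Illustrative layer counts (assumed): 24 layers total.
second_half = v_cache_savings_fraction(12, 24)  # L_t/L = 1/2
learn = v_cache_savings_fraction(8, 24)         # L_t/L = 1/3

second_half_pct = float(second_half) * 100
learn_pct = float(learn) * 100
```

Here second_half_pct is 25.0, and the same formula gives roughly 16.7 for an L_t/L = 1/3 configuration. The formula is a share of the KV cache only. Measured savings as a share of total memory would sit below it at short contexts, where weights and other fixed costs dominate, which is consistent with the table's percentages rising with T.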

FLOPs savings

estimate_flops already accounts for "dead c_v" weights at target layers:

| Config | L_t/L | FLOPs/token | Change vs. gpt_base |
|---|---|---|---|
| gpt_base | 0 | 4.78 × 10⁹ | (baseline) |
| v_from_value_emb_learn | 1/3 | 4.66 × 10⁹ | -2.37% |
| v_from_value_emb_second_half | 1/2 | 4.61 × 10⁹ | -3.56% |

Files

nanochat/engine.py: KVCache v_from_value_emb mode with per-layer slots
nanochat/model/gpt_base.py: new attention forward branch + outer forward maintains token_ids_history
tests/test_v_cache_elimination.py: bit-exact + structural correctness tests
scripts/inspect/profile_inference_memory.py: cross-model gpt_base vs v_from_value_emb profiler
scripts/inspect/profile_inference_memory_ablation.py: same-model A/B clean ablation
docs/v_cache_memory_analysis.md: theoretical formulas + empirical fits

Test plan

tests/test_v_cache_elimination.py passes
scripts/inspect/profile_inference_memory_ablation.py reproduces the 20.5% reduction relative to baseline at T=128K
Standard models (non-v_from_value_emb) are unaffected: the v_from_value_emb=False cache path is the legacy implementation, untouched

🤖 Generated with Claude Code

KVCache now allocates V cache slots per-layer: tensor for non-target layers, None for target layers. A token_ids_history field tracks the sequence so attention can gather V from each layer's VE_table on the fly.

CausalSelfAttention.forward adds a v_from_value_emb inference branch that bypasses flash_attn_with_kvcache (which would write to v_cache), manually appends K to k_cache, and runs flash_attn_func with V gathered from the per-layer embedding table.

GPTBase.forward updates token_ids_history at the start of forward (once per call, not per layer) and gathers full-history V for target layers when in v_from_value_emb inference mode.

Standard (non-v_from_value_emb) inference path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

RiddleHe force-pushed the yuchen/v-cache-elimination branch from 743585d to 09c4288 Compare May 21, 2026 04:13

Upcccccc wants to merge 1 commit into master from yuchen/v-cache-elimination
