Eliminate persistent V cache for v_from_value_emb target layers #13

Open: Upcccccc wants to merge 1 commit into master from yuchen/v-cache-elimination

@Upcccccc
Collaborator

Summary

  • Implements a per-layer V cache in KVCache: target layers get a None slot (no allocation), while non-target layers keep their tensor as before. A token_ids_history field lets attention rebuild the full V history from value_embeds[layer] at each step.
  • Adds a new attention forward branch for v_from_value_emb inference that bypasses flash_attn_with_kvcache (which would write to v_cache), manually appends K to k_cache, and runs flash_attn_func with gathered V.
  • GPTBase.forward now gathers V for the full history (T0+T, i.e. the T0 already-cached positions plus the T new ones) when the cache is in v_from_value_emb mode, so attention has every position's V available without persistent storage.
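
The cache layout described above can be sketched in plain Python. This is a toy stand-in under stated assumptions, not the code in nanochat/engine.py: the class name `SketchKVCache`, the methods `append_tokens` and `full_v`, and the use of nested lists instead of tensors are all hypothetical. It assumes that V at a target layer is a pure lookup of the token id in that layer's value-embedding table.

```python
from typing import Dict, List, Optional, Set


class SketchKVCache:
    """Toy per-layer cache for illustration; not nanochat's KVCache."""

    def __init__(self, n_layers: int, target_layers: Set[int]) -> None:
        # K is cached for every layer.
        self.k_cache: List[List[List[float]]] = [[] for _ in range(n_layers)]
        # V is cached only for non-target layers; target layers hold None.
        self.v_cache: List[Optional[List[List[float]]]] = [
            None if layer in target_layers else [] for layer in range(n_layers)
        ]
        # One shared token history, used to rebuild V at target layers.
        self.token_ids_history: List[int] = []

    def append_tokens(self, token_ids: List[int]) -> None:
        # Meant to be called once per forward call, not once per layer.
        self.token_ids_history.extend(token_ids)

    def full_v(
        self, layer: int, value_embeds: Dict[int, List[List[float]]]
    ) -> List[List[float]]:
        """Return V for every position seen so far at `layer`."""
        slot = self.v_cache[layer]
        if slot is not None:
            return slot  # non-target layer: V was stored as usual
        table = value_embeds[layer]  # target layer: V is a table lookup
        return [table[tok] for tok in self.token_ids_history]


# Four layers, the last two being target layers (hypothetical sizes).
cache = SketchKVCache(n_layers=4, target_layers={2, 3})
value_embeds = {
    2: [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    3: [[0.5], [1.5], [2.5]],
}
cache.append_tokens([2, 0])  # prefill
cache.append_tokens([1])     # one decode step
rebuilt = cache.full_v(2, value_embeds)
```

The point of the design is that a target layer stores only integers in the shared history, so its per-token V memory no longer grows with the head dimension.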

Verification

tests/test_v_cache_elimination.py checks two things:

  1. KVCache allocates v_cache slots only for non-target layers (target layers → None); token_ids_history is allocated.
  2. Bit-exact equivalence between cached forward (new path) and uncached forward on the same v_from_value_emb model. Max |Δ| = 0.0000e+00 at every position; greedy argmax matches (" Berlin" for the "capital of Germany" test).
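
Why the second check can be bit-exact is easiest to see on a toy example. The sketch below is not tests/test_v_cache_elimination.py, which compares cached and uncached forward passes of the real model. It is a single-head, pure-Python illustration with made-up numbers, assuming V at a target layer depends only on the token id. Under that assumption, a stored V row and a re-gathered V row are the same values, so attention sees identical inputs on both paths.

```python
import math
from typing import List


def attend(
    q: List[float], keys: List[List[float]], values: List[List[float]]
) -> List[float]:
    """Softmax attention of one query over the given (key, value) rows."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]


# Hypothetical value-embedding table: one V row per vocabulary id.
ve_table = [[0.1, -0.2], [0.7, 0.3], [-0.5, 0.9], [0.4, 0.4]]
token_ids = [3, 1, 2, 1]
keys = [[0.2, 0.1], [-0.3, 0.8], [0.5, -0.4], [0.9, 0.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]

stored_v: List[List[float]] = []  # path A: persistent V cache
history: List[int] = []           # path B: token ids only
max_abs_diff = 0.0
for t, tok in enumerate(token_ids):
    stored_v.append(ve_table[tok])
    history.append(tok)
    gathered_v = [ve_table[i] for i in history]  # rebuilt at every step
    out_a = attend(queries[t], keys[: t + 1], stored_v)
    out_b = attend(queries[t], keys[: t + 1], gathered_v)
    step_diff = max(abs(a - b) for a, b in zip(out_a, out_b))
    max_abs_diff = max(max_abs_diff, step_diff)

print(f"Max |delta| = {max_abs_diff:.4e}")
```

In the real model the two paths also go through different attention kernels, so exact equality there is an empirical result of the test rather than something this toy demonstrates.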

Memory savings (empirical, d24 second_half model, B=1)

Same model, two inference paths (A: full v_cache; B: new no-v_cache path). The VE-table overhead is identical in both, so the delta is pure V-cache savings.

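| T    | Path A (GB) | Path B (GB) | Savings (GB) | % of baseline |
|------|-------------|-------------|--------------|---------------|
| 2K   | 4.52        | 4.52        | 0.05         | 1.1%          |
| 16K  | 6.58        | 6.11        | 0.48         | 7.3%          |
| 32K  | 8.70        | 7.92        | 0.78         | 9.0%          |
| 65K  | 13.94       | 11.54       | 2.40         | 17.2%         |
| 128K | 23.62       | 18.79       | 4.83         | 20.5%         |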

Asymptotic savings: L_t / (2L) of total KV cache. For second_half (L_t/L = 1/2) this is 25%.
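
The L_t / (2L) figure follows from counting per-layer tensors, as in this small sketch. It assumes K and V have the same size at every layer. The function name and the layer counts (24 layers for d24, with 12 or 8 target layers) are illustrative assumptions rather than values read from the code.

```python
from fractions import Fraction


def kv_cache_fraction_saved(n_target_layers: int, n_layers: int) -> Fraction:
    """Fraction of the KV cache removed by dropping V at target layers."""
    # Each layer stores one K and one V tensor of equal size, so the
    # full cache holds 2 * n_layers tensors. Dropping V at the target
    # layers removes n_target_layers of them.
    return Fraction(n_target_layers, 2 * n_layers)


second_half = kv_cache_fraction_saved(12, 24)  # L_t / L = 1/2
learn = kv_cache_fraction_saved(8, 24)         # L_t / L = 1/3

print(f"second_half: {float(second_half):.1%}")
print(f"learn: {float(learn):.1%}")
```

The percentages in the table above are relative to total measured memory, which presumably also includes weights and activations, so they would be expected to approach this bound from below as T grows.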

FLOPs savings

estimate_flops already accounts for "dead c_v" weights at target layers:

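| Config                       | L_t/L | FLOPs/token | vs baseline |
|------------------------------|-------|-------------|-------------|
| gpt_base                     | 0     | 4.78 × 10⁹  |             |
| v_from_value_emb_learn       | 1/3   | 4.66 × 10⁹  | -2.37%      |
| v_from_value_emb_second_half | 1/2   | 4.61 × 10⁹  | -3.56%      |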

Files

  • nanochat/engine.py: KVCache v_from_value_emb mode with per-layer slots
  • nanochat/model/gpt_base.py: new attention forward branch + outer forward maintains token_ids_history
  • tests/test_v_cache_elimination.py: bit-exact + structural correctness tests
  • scripts/inspect/profile_inference_memory.py: cross-model gpt_base vs v_from_value_emb profiler
  • scripts/inspect/profile_inference_memory_ablation.py: same-model A/B clean ablation
  • docs/v_cache_memory_analysis.md: theoretical formulas + empirical fits

Test plan

  • tests/test_v_cache_elimination.py passes
  • scripts/inspect/profile_inference_memory_ablation.py reproduces 20.5% baseline reduction at T=128K
  • Standard models (non-v_from_value_emb) are unaffected: the v_from_value_emb=False cache path is the legacy implementation, untouched

🤖 Generated with Claude Code

KVCache now allocates V cache slots per-layer: tensor for non-target layers,
None for target layers. A token_ids_history field tracks the sequence so
attention can gather V from each layer's VE_table on the fly.

CausalSelfAttention.forward adds a v_from_value_emb inference branch that
bypasses flash_attn_with_kvcache (which would write to v_cache), manually
appends K to k_cache, and runs flash_attn_func with V gathered from the
per-layer embedding table.

GPTBase.forward updates token_ids_history at the start of forward (once per
call, not per layer) and gathers full-history V for target layers when in
v_from_value_emb inference mode.

Standard (non-v_from_value_emb) inference path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RiddleHe force-pushed the yuchen/v-cache-elimination branch from 743585d to 09c4288 on May 21, 2026 04:13