A prompt-caching proxy for llama-server + Claude Code. Normalizes Claude Code's per-request billing header so llama-server's KV cache can match prompt prefixes across turns.
Claude Code injects an x-anthropic-billing-header into the system prompt on every request:

    x-anthropic-billing-header: cc_version=2.1.63.a43; cc_entrypoint=sdk-cli; cch=e2224;

Fields in this header change per-request. Since the header appears at the very start of the system prompt, the tokenized prefix diverges at ~token 33, forcing llama-server to re-evaluate the entire prompt from scratch. For a typical turn with ~18K prompt tokens at ~300 tok/s, that's ~60 seconds of prefill per turn.
cache-proxy.py sits between your client and llama-server, doing a byte-level regex replace on the request body to normalize the entire billing header value to a fixed string. With this fix, subsequent turns reuse ~99% of the KV cache — only the new tokens need evaluation.
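The normalization step can be sketched roughly as follows. This is a minimal illustration, not the code in cache-proxy.py: the pattern `BILLING_RE`, the placeholder `FIXED_VALUE`, the function name, and the sample request bodies are all assumptions made for the example. It assumes the header value ends at the first escaped newline (a backslash in the JSON body), closing quote, or raw newline.

```python
import re

# Hypothetical pattern: the header name, then everything up to the first
# double quote, backslash (start of an escaped "\n" in JSON), or raw newline.
# The actual regex used by cache-proxy.py may differ.
BILLING_RE = re.compile(rb'(x-anthropic-billing-header:)[^"\\\n]*')
FIXED_VALUE = b" normalized"  # assumed placeholder, any constant string works


def normalize_billing_header(body: bytes) -> bytes:
    """Replace the per-request billing header value with a fixed string."""
    return BILLING_RE.sub(rb"\1" + FIXED_VALUE, body)


# Two made-up request bodies that differ only in the cch field.
turn_1 = b'{"system": "x-anthropic-billing-header: cc_version=2.1.63.a43; cc_entrypoint=sdk-cli; cch=e2224;\\nYou are a coding agent."}'
turn_2 = b'{"system": "x-anthropic-billing-header: cc_version=2.1.63.a43; cc_entrypoint=sdk-cli; cch=9f1b0;\\nYou are a coding agent."}'

# After normalization both bodies are byte-identical, so they tokenize to
# the same prefix.
print(normalize_billing_header(turn_1) == normalize_billing_header(turn_2))  # True
```

Working on raw bytes avoids parsing and re-serializing the JSON body, so nothing else in the request changes.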
Measured on the terminal-bench fix-git task with MiniMax-M2.5 (Q8_0) on Apple M3 Ultra:
| Configuration | Time | Cache working? |
|---|---|---|
| llama-server, no proxy | 8:40 | No |
| llama-server + cache-proxy | 1:50 | Yes |
With caching working, each turn after the first reuses ~99% of the KV cache. Prefill drops from ~60s/turn to <1s/turn.
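The prefill figures follow from simple arithmetic. The sketch below reuses the approximate numbers quoted in this README (~18K prompt tokens, ~300 tok/s, ~99% cache reuse); the function and constant names are illustrative only.

```python
def prefill_seconds(tokens_to_evaluate: int, tokens_per_second: float) -> float:
    """Time spent evaluating prompt tokens at a given prefill rate."""
    return tokens_to_evaluate / tokens_per_second


PROMPT_TOKENS = 18_000   # approximate prompt size per turn
PREFILL_RATE = 300.0     # approximate prefill speed in tokens/second
CACHE_REUSE = 0.99       # approximate fraction of tokens served from the KV cache

# Without a cache hit, the whole prompt is re-evaluated.
cold = prefill_seconds(PROMPT_TOKENS, PREFILL_RATE)

# With a cache hit, only the non-reused tokens are evaluated.
warm_tokens = round(PROMPT_TOKENS * (1 - CACHE_REUSE))
warm = prefill_seconds(warm_tokens, PREFILL_RATE)

print(f"no cache: {cold:.0f}s, cached: {warm:.1f}s")  # no cache: 60s, cached: 0.6s
```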
Install llama.cpp:

    brew install llama.cpp

Start llama-server:

    llama-server \
      --model /path/to/model.gguf \
      --host 0.0.0.0 \
      --port 8080 \
      --ctx-size 131072 \
      --n-gpu-layers 999 \
      --parallel 1 \
      --flash-attn on \
      --cache-type-k q8_0 \
      --cache-type-v q8_0

Start the proxy:

    python3 cache-proxy.py

Configure Claude Code (or any other client) to send requests to http://localhost:8081 instead of http://localhost:8080.
| Flag | Default | Description |
|---|---|---|
| `--port` | `8081` | Port the proxy listens on |
| `--upstream` | `http://localhost:8080` | Upstream llama-server URL |
| `--verbose` | off | Log request body size and normalization details |
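For reference, the three options in the table map naturally onto `argparse`. The sketch below is an assumption about how such a command line could be declared, not a copy of cache-proxy.py; `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the proxy options with the defaults listed in the table."""
    parser = argparse.ArgumentParser(
        description="Prompt-caching proxy for llama-server"
    )
    parser.add_argument("--port", type=int, default=8081,
                        help="Port the proxy listens on")
    parser.add_argument("--upstream", default="http://localhost:8080",
                        help="Upstream llama-server URL")
    parser.add_argument("--verbose", action="store_true",
                        help="Log request body size and normalization details")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--port", "9090"])
print(args.port, args.upstream, args.verbose)
```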
Examples:

    # Defaults: listen on 8081, forward to localhost:8080
    python3 cache-proxy.py

    # Custom port and upstream
    python3 cache-proxy.py --port 9090 --upstream http://localhost:8080

    # Verbose logging
    python3 cache-proxy.py --verbose

In llama-server's output, each request shows cache behavior:

    slot update_slots: id 0 | task 232 | new prompt, ... task.n_tokens = 17928
    slot update_slots: id 0 | task 232 | n_tokens = 17800, memory_seq_rm [17800, end)

This means 17,800 of 17,928 tokens were served from cache — only 128 new tokens needed evaluation. If you see memory_seq_rm [33, end), the cache is not working (only 33 tokens matched — the billing header is diverging).
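If you want to check this automatically, the two log lines can be parsed with a couple of regular expressions. This is a rough sketch that assumes the log format matches the lines shown above. The format may differ between llama.cpp versions, and `cache_stats` is a hypothetical helper, not part of the proxy.

```python
import re

# Patterns based only on the two sample log lines shown above.
TOTAL_RE = re.compile(r"task\.n_tokens = (\d+)")
KEPT_RE = re.compile(r"memory_seq_rm \[(\d+), end\)")


def cache_stats(log_lines):
    """Return (cached_tokens, new_tokens, hit_ratio), or None if not found."""
    total = kept = None
    for line in log_lines:
        if (m := TOTAL_RE.search(line)):
            total = int(m.group(1))
        elif (m := KEPT_RE.search(line)):
            kept = int(m.group(1))
    if total is None or kept is None:
        return None
    return kept, total - kept, kept / total


sample = [
    "slot update_slots: id 0 | task 232 | new prompt, ... task.n_tokens = 17928",
    "slot update_slots: id 0 | task 232 | n_tokens = 17800, memory_seq_rm [17800, end)",
]
cached, new, ratio = cache_stats(sample)
print(f"cached={cached} new={new} hit_ratio={ratio:.1%}")
```

A hit ratio near zero (for example, only 33 cached tokens) points to the billing header still diverging.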
llama-server flags:

| Flag | Value | Purpose |
|---|---|---|
| `--host` | `0.0.0.0` | Allow connections from Docker or other hosts |
| `--port` | `8080` | Server port (proxy sits on 8081) |
| `--ctx-size` | `131072` | 128K context window |
| `--n-gpu-layers` | `999` | Offload all layers to Metal GPU |
| `--parallel` | `1` | Single slot; prevents slot rotation that breaks the cache |
| `--flash-attn` | `on` | Flash attention on Metal for faster inference |
| `--cache-type-k` | `q8_0` | KV cache key quantization (~50% memory savings vs F16) |
| `--cache-type-v` | `q8_0` | KV cache value quantization |
| `--reasoning-budget` | `0` | Disable thinking tokens (optional, saves time if unused) |
| `--verbose` | | Log cache hit/miss stats for debugging |
- Single-threaded stdlib proxy: Uses Python's `http.server` module. Fine for single-slot llama-server (the recommended configuration for prompt caching), but not suitable for high-concurrency workloads.
- Mac/Metal only: This setup is configured and measured with llama-server GPU offload (`--n-gpu-layers`) on Apple Metal. The proxy itself runs anywhere, but practical inference speeds with large models need GPU offload.
- No HTTPS: The proxy speaks plain HTTP. Only use it on localhost or trusted networks.
MIT