cchuter/claude-cache-proxy

claude-cache-proxy

A prompt-caching proxy for llama-server + Claude Code. Normalizes Claude Code's per-request billing header so llama-server's KV cache can match prompt prefixes across turns.

The problem

Claude Code injects an x-anthropic-billing-header into the system prompt on every request:

x-anthropic-billing-header: cc_version=2.1.63.a43; cc_entrypoint=sdk-cli; cch=e2224;

Fields in this header change per-request. Since the header appears at the very start of the system prompt, the tokenized prefix diverges at ~token 33, forcing llama-server to re-evaluate the entire prompt from scratch. For a typical turn with ~18K prompt tokens at ~300 tok/s, that's ~60 seconds of prefill per turn.

The solution

cache-proxy.py sits between your client and llama-server, doing a byte-level regex replace on the request body to normalize the entire billing header value to a fixed string. With this fix, subsequent turns reuse ~99% of the KV cache — only the new tokens need evaluation.
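The normalization step can be sketched roughly as follows. This is an illustration only, not the actual contents of cache-proxy.py: the regex, the FIXED_VALUE placeholder, and the function name normalize_billing_header are assumptions made for this example. It also assumes the header sits inside a JSON string in the request body, so its value ends at the first backslash (an escaped newline) or closing quote.

```python
import re

# Assumed pattern: capture the header name, then consume the value up to the
# first double quote, backslash, or raw line break.
BILLING_RE = re.compile(rb'(x-anthropic-billing-header:[ \t]*)[^"\\\r\n]*')

# Hypothetical fixed replacement; any constant byte string would do.
FIXED_VALUE = b"cc_version=0; cc_entrypoint=normalized; cch=0;"


def normalize_billing_header(body: bytes) -> bytes:
    """Replace the per-request billing header value with a constant."""
    # A function replacement avoids backslash-escape processing of FIXED_VALUE.
    return BILLING_RE.sub(lambda m: m.group(1) + FIXED_VALUE, body)


if __name__ == "__main__":
    turn_1 = (b'{"system":"x-anthropic-billing-header: cc_version=2.1.63.a43; '
              b'cc_entrypoint=sdk-cli; cch=e2224;\\nYou are a helpful assistant."}')
    turn_2 = turn_1.replace(b"cch=e2224", b"cch=9f01c")
    # After normalization both bodies are byte-identical.
    print(normalize_billing_header(turn_1) == normalize_billing_header(turn_2))
```

Because the two bodies differ only in the billing fields, they become identical after normalization, which is what lets the tokenized prefix match across turns.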

Performance

Measured on the terminal-bench fix-git task with MiniMax-M2.5 (Q8_0) on Apple M3 Ultra:

| Configuration | Time | Cache working? |
| --- | --- | --- |
| llama-server, no proxy | 8:40 | No |
| llama-server + cache-proxy | 1:50 | Yes |

With caching working, each turn after the first reuses ~99% of the KV cache. Prefill drops from ~60s/turn to <1s/turn.
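The per-turn prefill figures follow from simple arithmetic on the approximate numbers quoted in this README (~18K prompt tokens, ~300 tok/s, ~99% reuse). The helper below is a back-of-the-envelope sketch; the function name prefill_seconds is hypothetical, and it assumes prefill throughput stays constant regardless of prompt length.

```python
def prefill_seconds(prompt_tokens: int, cached_tokens: int,
                    tokens_per_second: float) -> float:
    """Estimate prefill time for the tokens that are not already cached."""
    new_tokens = max(prompt_tokens - cached_tokens, 0)
    return new_tokens / tokens_per_second


if __name__ == "__main__":
    # No cache reuse: the whole ~18K-token prompt is re-evaluated.
    print(prefill_seconds(18_000, 0, 300.0))
    # ~99% reuse: only about 180 tokens are new.
    print(prefill_seconds(18_000, 17_820, 300.0))
```

Under these assumptions the uncached case works out to about 60 seconds and the cached case to well under one second, in line with the measurements above.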

Quick start

1. Install llama.cpp

brew install llama.cpp

2. Start llama-server

llama-server \
  --model /path/to/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

3. Start the proxy

python3 cache-proxy.py

4. Point your client at the proxy

Configure Claude Code (or any other client) to send requests to http://localhost:8081 instead of http://localhost:8080.

Configuration

| Flag | Default | Description |
| --- | --- | --- |
| --port | 8081 | Port the proxy listens on |
| --upstream | http://localhost:8080 | Upstream llama-server URL |
| --verbose | off | Log request body size and normalization details |

Examples:

# Defaults: listen on 8081, forward to localhost:8080
python3 cache-proxy.py

# Custom port and upstream
python3 cache-proxy.py --port 9090 --upstream http://localhost:8080

# Verbose logging
python3 cache-proxy.py --verbose
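For readers curious how flags like these are typically wired up, here is a minimal argparse sketch that mirrors the documented flags and defaults. It is an assumption about structure, not the script's actual source; build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(
        description="Prompt-caching proxy for llama-server (illustrative sketch)")
    parser.add_argument("--port", type=int, default=8081,
                        help="Port the proxy listens on")
    parser.add_argument("--upstream", default="http://localhost:8080",
                        help="Upstream llama-server URL")
    parser.add_argument("--verbose", action="store_true",
                        help="Log request body size and normalization details")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.port, args.upstream, args.verbose)
```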

Verifying cache hits

In llama-server's output, each request shows cache behavior:

slot update_slots: id 0 | task 232 | new prompt, ... task.n_tokens = 17928
slot update_slots: id 0 | task 232 | n_tokens = 17800, memory_seq_rm [17800, end)

This means 17,800 of 17,928 tokens were served from cache — only 128 new tokens needed evaluation. If you see memory_seq_rm [33, end), the cache is not working (only 33 tokens matched — the billing header is diverging).
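If you want to check this automatically, the two log lines can be parsed with a couple of regular expressions. The sketch below assumes the log format shown above, which may differ between llama.cpp versions; the function name cache_stats is hypothetical.

```python
import re
from typing import Optional, Tuple

# Total prompt size, taken from the "new prompt" line.
TOTAL_RE = re.compile(r"task\.n_tokens = (\d+)")
# First position that had to be re-evaluated, i.e. the number of reused tokens.
REUSED_RE = re.compile(r"memory_seq_rm \[(\d+), end\)")


def cache_stats(log_text: str) -> Optional[Tuple[int, int, int]]:
    """Return (total, reused, new) token counts, or None if the lines are absent."""
    total = TOTAL_RE.search(log_text)
    reused = REUSED_RE.search(log_text)
    if total is None or reused is None:
        return None
    total_n, reused_n = int(total.group(1)), int(reused.group(1))
    return total_n, reused_n, total_n - reused_n


if __name__ == "__main__":
    sample = (
        "slot update_slots: id 0 | task 232 | new prompt, ... task.n_tokens = 17928\n"
        "slot update_slots: id 0 | task 232 | n_tokens = 17800, "
        "memory_seq_rm [17800, end)\n"
    )
    print(cache_stats(sample))
```

A reused count close to the total indicates a cache hit; a reused count near 33 indicates the billing header is still diverging.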

Recommended llama-server flags

| Flag | Value | Purpose |
| --- | --- | --- |
| --host | 0.0.0.0 | Allow connections from Docker or other hosts |
| --port | 8080 | Server port (proxy sits on 8081) |
| --ctx-size | 131072 | 128K context window |
| --n-gpu-layers | 999 | Offload all layers to Metal GPU |
| --parallel | 1 | Single slot; prevents slot rotation that breaks the cache |
| --flash-attn | on | Flash attention on Metal for faster inference |
| --cache-type-k | q8_0 | KV cache key quantization (~50% memory savings vs F16) |
| --cache-type-v | q8_0 | KV cache value quantization |
| --reasoning-budget | 0 | Disable thinking tokens (optional, saves time if unused) |
| --verbose | (none) | Log cache hit/miss stats for debugging |

Limitations

  • Single-threaded stdlib proxy: Uses Python's http.server module. Fine for single-slot llama-server (which is the recommended configuration for prompt caching), but not suitable for high-concurrency workloads.
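  • Mac/Metal only: The flags and measurements in this README assume Apple Silicon, where --n-gpu-layers offloads layers to Metal. The proxy itself runs anywhere, and llama.cpp also has other GPU backends (such as CUDA and Vulkan), but this README covers only Metal. Without GPU offload, inference with large models is impractically slow.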
  • No HTTPS: The proxy speaks plain HTTP. Only use on localhost or trusted networks.

License

MIT
