claude-cache-proxy

A prompt-caching proxy for llama-server + Claude Code. Normalizes Claude Code's per-request billing header so llama-server's KV cache can match prompt prefixes across turns.

The problem

Claude Code injects an x-anthropic-billing-header into the system prompt on every request:

x-anthropic-billing-header: cc_version=2.1.63.a43; cc_entrypoint=sdk-cli; cch=e2224;

Fields in this header change on every request. Because the header sits at the very start of the system prompt, the tokenized prefix diverges at around token 33, and llama-server has to re-evaluate the entire prompt from scratch. For a typical turn with ~18K prompt tokens at ~300 tok/s prefill speed, that is ~60 seconds of prefill per turn.

The solution

cache-proxy.py sits between your client and llama-server. It applies a byte-level regex replacement to each request body, rewriting the entire billing header value to a fixed string. With the header normalized, subsequent turns reuse ~99% of the KV cache, so only the new tokens need evaluation.
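The normalization step can be sketched roughly as follows. This is an illustration, not the code in cache-proxy.py: the pattern, the replacement string, and the function name `normalize_billing_header` are assumptions. It assumes the header sits inside a JSON string in the request body, so its value ends at the first backslash (a JSON-escaped newline) or double quote.

```python
import re

# Hypothetical pattern: the header name plus everything up to the first
# double quote or backslash (i.e. the end of the header line inside JSON).
BILLING_HEADER = re.compile(rb'x-anthropic-billing-header:[^"\\]*')

# Hypothetical fixed replacement; any constant string works for caching.
FIXED_HEADER = b"x-anthropic-billing-header: normalized"


def normalize_billing_header(body: bytes) -> bytes:
    """Replace the per-request billing header value with a constant."""
    return BILLING_HEADER.sub(FIXED_HEADER, body)


# Two requests that differ only in the billing header fields.
turn_1 = (b'{"system": "x-anthropic-billing-header: cc_version=2.1.63.a43; '
          b'cc_entrypoint=sdk-cli; cch=e2224;\\nYou are Claude Code."}')
turn_2 = (b'{"system": "x-anthropic-billing-header: cc_version=2.1.63.a43; '
          b'cc_entrypoint=sdk-cli; cch=9f1b0;\\nYou are Claude Code."}')

# After normalization the two bodies are byte-identical.
print(normalize_billing_header(turn_1) == normalize_billing_header(turn_2))  # True
```

Because the replacement changes the body length, a proxy built this way would also need to send an updated Content-Length header upstream.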

Performance

Measured on the terminal-bench fix-git task with MiniMax-M2.5 (Q8_0) on Apple M3 Ultra:

Configuration	Time (min:sec)	Cache working?
llama-server, no proxy	8:40	No
llama-server + cache-proxy	1:50	Yes

With caching working, each turn after the first reuses ~99% of the KV cache. Prefill drops from ~60s/turn to <1s/turn.
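The per-turn prefill figures follow from simple arithmetic. A minimal sketch, using the ~18K prompt tokens and ~300 tok/s quoted earlier and assuming ~99% reuse leaves about 180 tokens to evaluate; the function name `prefill_seconds` is illustrative:

```python
def prefill_seconds(tokens_to_evaluate: int, tokens_per_second: float) -> float:
    """Time spent on prompt evaluation for one turn."""
    return tokens_to_evaluate / tokens_per_second


PROMPT_TOKENS = 18_000   # typical turn, from the problem description
PREFILL_SPEED = 300      # tokens per second
UNCACHED_TOKENS = 180    # assumption: ~1% of the prompt is new each turn

print(prefill_seconds(PROMPT_TOKENS, PREFILL_SPEED))    # 60.0 (no cache)
print(prefill_seconds(UNCACHED_TOKENS, PREFILL_SPEED))  # 0.6 (cache working)
```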

Quick start

1. Install llama.cpp

brew install llama.cpp

2. Start llama-server

llama-server \
  --model /path/to/model.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0

3. Start the proxy

python3 cache-proxy.py

4. Point your client at the proxy

Configure Claude Code (or any other client) to send requests to http://localhost:8081 instead of http://localhost:8080.

Configuration

Flag	Default	Description
`--port`	`8081`	Port the proxy listens on
`--upstream`	`http://localhost:8080`	Upstream llama-server URL
`--verbose`	off	Log request body size and normalization details

Examples:

# Defaults: listen on 8081, forward to localhost:8080
python3 cache-proxy.py

# Custom port and upstream
python3 cache-proxy.py --port 9090 --upstream http://localhost:8080

# Verbose logging
python3 cache-proxy.py --verbose

Verifying cache hits

In llama-server's output, each request shows cache behavior:

slot update_slots: id 0 | task 232 | new prompt, ... task.n_tokens = 17928
slot update_slots: id 0 | task 232 | n_tokens = 17800, memory_seq_rm [17800, end)

This means 17,800 of 17,928 tokens were served from cache — only 128 new tokens needed evaluation. If you see memory_seq_rm [33, end), the cache is not working (only 33 tokens matched — the billing header is diverging).
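To check this without reading the log by eye, the two numbers can be pulled out with a short script. This is a sketch that assumes the log lines look exactly like the excerpt above; the format may differ between llama.cpp versions, and `cache_stats` is an illustrative helper, not part of this repository.

```python
import re

# Patterns based on the two log lines shown above.
TOTAL_RE = re.compile(r"task\.n_tokens = (\d+)")
CACHED_RE = re.compile(r"memory_seq_rm \[(\d+), end\)")


def cache_stats(log_lines):
    """Return (cached, new, hit_ratio), or None if either number is missing."""
    total = cached = None
    for line in log_lines:
        match = TOTAL_RE.search(line)
        if match:
            total = int(match.group(1))
        match = CACHED_RE.search(line)
        if match:
            cached = int(match.group(1))
    if total is None or cached is None:
        return None
    return cached, total - cached, cached / total


log = [
    "slot update_slots: id 0 | task 232 | new prompt, ... task.n_tokens = 17928",
    "slot update_slots: id 0 | task 232 | n_tokens = 17800, memory_seq_rm [17800, end)",
]

cached, new, ratio = cache_stats(log)
print(cached, new)  # 17800 128
```

A hit ratio close to 1.0 means the prefix is being reused; a cached count of around 33 points to the billing header still diverging.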

Recommended llama-server flags

Flag	Value	Purpose
`--host`	`0.0.0.0`	Allow connections from Docker or other hosts
`--port`	`8080`	Server port (proxy sits on 8081)
`--ctx-size`	`131072`	128K context window
`--n-gpu-layers`	`999`	Offload all layers to Metal GPU
`--parallel`	`1`	Single slot — prevents slot rotation that breaks cache
`--flash-attn`	`on`	Flash attention on Metal for faster inference
`--cache-type-k`	`q8_0`	KV cache key quantization (~50% memory savings vs F16)
`--cache-type-v`	`q8_0`	KV cache value quantization
`--reasoning-budget`	`0`	Disable thinking tokens (optional, saves time if unused)
`--verbose`		Log cache hit/miss stats for debugging
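The ~50% figure for q8_0 in the table above can be checked from the block layout. A minimal sketch, assuming ggml's Q8_0 format stores each block of 32 values as 32 int8 quants plus one 16-bit scale (34 bytes per block), and ignoring any other memory overhead:

```python
Q8_0_BLOCK_VALUES = 32      # values per Q8_0 block
Q8_0_BLOCK_BYTES = 32 + 2   # 32 int8 quants + one fp16 scale (assumed layout)
F16_BITS = 16               # bits per value in an unquantized F16 cache


def q8_0_bits_per_value() -> float:
    """Average storage cost of one cached value in Q8_0."""
    return Q8_0_BLOCK_BYTES * 8 / Q8_0_BLOCK_VALUES


def savings_vs_f16() -> float:
    """Fraction of memory saved relative to F16."""
    return 1 - q8_0_bits_per_value() / F16_BITS


print(q8_0_bits_per_value())  # 8.5
print(savings_vs_f16())       # 0.46875
```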

Limitations

Single-threaded stdlib proxy: Uses Python's http.server module. Fine for single-slot llama-server (which is the recommended configuration for prompt caching), but not suitable for high-concurrency workloads.
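Tested on Mac/Metal only: The flags and measurements here come from Apple Silicon with Metal. The proxy itself runs anywhere, and --n-gpu-layers also works with other llama.cpp GPU backends (such as CUDA or Vulkan), but those setups are not covered here. You need GPU offload for practical inference speeds with large models.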
No HTTPS: The proxy speaks plain HTTP. Only use on localhost or trusted networks.

License

MIT
