A transparent inspector and rule engine that sits between your AI clients (Claude Code, GitHub Copilot Chat, Cursor, raw SDKs) and their upstreams (Anthropic, Ollama, LM Studio, anything OpenAI-compatible). Logs every request, surfaces conversations and tool calls, applies configurable rules, and gives you a live web UI to see what's actually happening.
Built around two ideas:

- Observability without changing your client: set `ANTHROPIC_BASE_URL` (or `OLLAMA_HOST`) to the proxy and every request, response, tool call, token count, and conversation thread is captured automatically.
- Rules that improve model behavior: pre/post-flight hooks catch loops, fix malformed tool calls, prune unused tools, prevent silent context-window truncation, route requests across models, and more.
- Inspect every request — full bodies, headers, streaming chunks, tool calls, token counts, latency. Searchable. PII-redacted by default for cross-network viewers.
- Group requests into conversations: automatic threading by `system + first_user` hash, with a turn-by-turn timeline view.
- Identify the client app: distinguishes Claude Code, VS Code Copilot Chat, Cursor, Anthropic SDK, OpenAI SDK, LangChain, browsers, etc. via header/User-Agent fingerprinting.
- Translate Anthropic ↔ OpenAI — point Claude Code at any OpenAI-compatible backend (Ollama, LM Studio, vLLM). The proxy translates request/response bodies and SSE streams in both directions.
- Shadow runs — send the same request to a primary upstream AND a local model in parallel, store both, and get an automatic side-by-side comparison page (latency delta, token delta, tool-call agreement, text similarity).
- Rule pipeline — pluggable pre-flight (block/warn/transform) and post-flight (intercept/autofix) rules with a JSON editor and quick-toggle UI.
- Auditor suggestions: the proxy analyzes recent traffic and recommends config changes (route slow-short requests off Opus, prune unused tools, bump `OLLAMA_NUM_PARALLEL`, etc.).
- MCP server: exposes the same data via Model Context Protocol so an LLM can query traffic patterns directly.
- Restart from the UI — one button to bounce the systemd-managed proxy, with health-check polling for confirmation.
- Python 3.10+
- An upstream to forward to (Ollama on `localhost:11434`, Anthropic API, etc.)
```bash
git clone https://github.com/guscatalano/AI_Proxy.git
cd AI_Proxy
python -m venv .venv
# Windows
.venv\Scripts\Activate.ps1
# Linux/macOS
source .venv/bin/activate
pip install -r requirements.txt
python proxy.py
```

UI: http://127.0.0.1:8000/__proxy/
```bash
# Claude Code (and any Anthropic SDK client)
export ANTHROPIC_BASE_URL=http://localhost:8000
claude
# OpenAI SDK / VS Code Copilot Chat / Cursor / Continue / Cline
export OPENAI_BASE_URL=http://localhost:8000/v1
# or set the equivalent in the client's settings
# Ollama-native clients (set OLLAMA_HOST)
export OLLAMA_HOST=http://localhost:8000
```

The proxy routes by path: `/v1/messages*` and `/v1/complete*` go to Anthropic (`ANTHROPIC_URL` env var, default `https://api.anthropic.com`); everything else (`/v1/chat/completions`, `/api/*`) goes to Ollama (`OLLAMA_URL` env var, default `http://localhost:11434`).
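The path-based routing described above can be sketched in a few lines. This is an illustration only, not the code in `proxy.py`: the function name `pick_upstream` and the prefix-matching approach are assumptions, while the prefixes, environment variable names, and defaults are taken from the text.

```python
import os

# Path prefixes that the README says are forwarded to Anthropic.
ANTHROPIC_PREFIXES = ("/v1/messages", "/v1/complete")


def pick_upstream(path: str, env=None) -> str:
    """Return the upstream base URL for a request path (hypothetical helper)."""
    env = os.environ if env is None else env
    if path.startswith(ANTHROPIC_PREFIXES):
        return env.get("ANTHROPIC_URL", "https://api.anthropic.com")
    # Everything else (/v1/chat/completions, /api/*) goes to Ollama.
    return env.get("OLLAMA_URL", "http://localhost:11434")
```

Note that `/v1/chat/completions` does not match the `/v1/complete` prefix, so OpenAI-style chat requests fall through to the Ollama upstream.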
| Variable | Default | What it does |
|---|---|---|
| `OLLAMA_URL` | `http://localhost:11434` | OpenAI-compat / Ollama upstream |
| `ANTHROPIC_URL` | `https://api.anthropic.com` | Anthropic upstream for `/v1/messages*` |
| `LMSTUDIO_URL` | `http://localhost:1234` | Optional LM Studio for the System tab |
| `PROXY_HOST` | `127.0.0.1` | Bind address |
| `PROXY_PORT` | `8000` | Bind port |
| `PROXY_DB` | `./proxy.db` | SQLite DB path |
| `PROXY_RULES_FILE` | `./rules.json` | One-time rules import (subsequently stored in DB) |
| `PROXY_REDACT_PII` | `1` | Redact bodies/headers from cross-subnet viewers |
| `PROXY_REDACT_SUBNET_BITS` | `24` | IPv4 subnet width for PII gating |
| `MCP_ALLOW_WRITE` | `false` | Allow `update_rules` MCP tool |
| `MCP_API_KEY` | (none) | Bearer token for MCP endpoint |
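As a rough illustration of how these variables and defaults fit together, the sketch below reads them from the environment. The names `DEFAULTS`, `as_bool`, and `load_config` are hypothetical, and the set of strings treated as "true" is an assumption; the proxy's actual parsing may differ.

```python
import os

# Defaults copied from the table above; values are strings, as in the environment.
DEFAULTS = {
    "OLLAMA_URL": "http://localhost:11434",
    "ANTHROPIC_URL": "https://api.anthropic.com",
    "LMSTUDIO_URL": "http://localhost:1234",
    "PROXY_HOST": "127.0.0.1",
    "PROXY_PORT": "8000",
    "PROXY_DB": "./proxy.db",
    "PROXY_RULES_FILE": "./rules.json",
    "PROXY_REDACT_PII": "1",
    "PROXY_REDACT_SUBNET_BITS": "24",
    "MCP_ALLOW_WRITE": "false",
}


def as_bool(value: str) -> bool:
    # Assumption: "1", "true", "yes", "on" count as enabled (case-insensitive).
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_config(env=None) -> dict:
    """Merge environment overrides onto the documented defaults."""
    env = os.environ if env is None else env
    cfg = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    cfg["PROXY_PORT"] = int(cfg["PROXY_PORT"])
    cfg["PROXY_REDACT_SUBNET_BITS"] = int(cfg["PROXY_REDACT_SUBNET_BITS"])
    cfg["PROXY_REDACT_PII"] = as_bool(cfg["PROXY_REDACT_PII"])
    cfg["MCP_ALLOW_WRITE"] = as_bool(cfg["MCP_ALLOW_WRITE"])
    cfg["MCP_API_KEY"] = env.get("MCP_API_KEY")  # no default: None when unset
    return cfg
```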
All rules are configured via a single JSON object, edited in the Audit tab → Rules Editor (or POSTed to /__proxy/api/rules). Saves apply on the next request — no restart needed.
| Rule | Phase | What it does |
|---|---|---|
| `model_router` | transform | Rewrite `model` based on conditions (`from_model`, prompt size, `has_tools`, client IP). |
| `ollama_options` | transform | Inject Ollama options (`num_ctx`, `keep_alive`, `cache_prompt`, etc.) when the client didn't set them. |
| `protocol_bridge` | transform | When an Anthropic-shape request gets routed to a non-Claude model, translate the body to OpenAI shape and route to Ollama; translate the SSE stream back on the way out. |
| `tool_pruner` | transform | Drop tool definitions the model has been offered repeatedly but never invoked in this conversation. |
| `context_overflow_guard` | transform | Estimate prompt tokens; warn / bump `num_ctx` / trim oldest messages / block when the prompt exceeds the effective context window. |
| `shadow_router` | transform | Fan out a parallel shadow request to a comparison model. Best-effort, never blocks the primary. |
| `loop_detector` | pre-flight | Block when the same tool call repeats too often. |
| `tool_failure_breaker` | pre-flight | Block when a tool has N consecutive error results. |
| `schema_validator` | post-flight | Validate tool-call args against the request's tool schemas; replace invalid calls with corrective assistant content. |
| `hallucinated_tool` | post-flight | Reject tool calls naming functions not declared in the request. |
| `tool_args_autofix` | post-flight | Fill in missing required tool-call fields from configured defaults. |
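To make the `hallucinated_tool` idea concrete, here is a minimal sketch of the underlying check: compare the tool names a model called against the names declared in the request. The function name `undeclared_tool_calls` is hypothetical and this is not the proxy's implementation; it only assumes the standard OpenAI tool shape (name nested under `function`) and the Anthropic tool shape (name at the top level).

```python
def undeclared_tool_calls(request_body: dict, tool_calls: list[dict]) -> list[str]:
    """Return the names of called tools that the request never declared."""
    declared = set()
    for tool in request_body.get("tools", []):
        # OpenAI shape nests the name under "function"; Anthropic keeps it top-level.
        name = tool.get("function", {}).get("name") or tool.get("name")
        if name:
            declared.add(name)

    undeclared = []
    for call in tool_calls:
        name = call.get("function", {}).get("name") or call.get("name")
        if name not in declared:
            undeclared.append(name)
    return undeclared
```

A post-flight rule built on a check like this could replace the offending calls with corrective assistant content whenever the returned list is non-empty.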
The Audit tab includes a one-click toggle between three modes (with auto-detection of the active mode):
- Passthrough — Anthropic requests go straight to api.anthropic.com.
- Shadow — primary still goes to Anthropic; local model runs in parallel for comparison.
- Redirect — Anthropic requests are translated to OpenAI shape and sent to the local model entirely.
```bash
sudo cp ai_proxy.service /etc/systemd/system/
# Edit the Environment= lines for your paths/ports
sudo systemctl daemon-reload
sudo systemctl enable --now ai_proxy
```

The proxy supports self-restart via the System tab's "↻ Restart proxy" button when running under systemd (it exits with code 1; `Restart=on-failure` brings it back).
```powershell
# Run as Administrator
New-Service -Name "AIProxy" `
  -BinaryPathName "C:\path\to\python.exe C:\path\to\proxy.py" `
  -DisplayName "AI Proxy" -StartupType Automatic
```

```
clients ──► [PROXY :8000] ──► upstreams
                 │
                 ├── pre-flight:  block/warn rules (loop_detector, tool_failure_breaker)
                 ├── transform:   model_router, ollama_options, protocol_bridge,
                 │                tool_pruner, context_overflow_guard
                 ├── fan-out:     shadow_router (concurrent comparison runs)
                 ├── upstream:    stream chunks captured + live token tracking
                 └── post-flight: schema_validator, hallucinated_tool, tool_args_autofix
                                  (intercept invalid tool calls, replace with corrective content)
                 ▼
            SQLite (proxy.db)
                 │
                 ├── /__proxy/     ◄── web UI (single-page, no build step)
                 └── /__proxy/mcp  ◄── MCP server (Model Context Protocol)
```
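The phase ordering in the diagram can be sketched as a small synchronous pipeline. Everything here is hypothetical (the `Verdict` type, `run_pipeline`, and the sample rules); the real proxy is async, streams responses, and also runs the shadow fan-out, none of which this sketch attempts.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    blocked: bool = False
    reason: str = ""


def run_pipeline(request, pre_flight, transforms, send_upstream, post_flight):
    """Run rules in the order shown in the diagram: pre-flight, transform, upstream, post-flight."""
    for rule in pre_flight:
        verdict = rule(request)
        if verdict.blocked:
            # A blocking pre-flight rule short-circuits before any upstream call.
            return {"blocked": True, "reason": verdict.reason}
    for rule in transforms:
        request = rule(request)
    response = send_upstream(request)
    for rule in post_flight:
        response = rule(request, response)
    return response


# Hypothetical sample rules, for illustration only.
def block_loops(req):
    return Verdict(blocked=req.get("loop", False), reason="loop detected")


def route_model(req):
    return {**req, "model": "local-model"}


def fake_upstream(req):
    return {"model": req["model"], "text": "ok"}


def mark_checked(req, resp):
    return {**resp, "checked": True}


result = run_pipeline({"model": "claude"}, [block_loops], [route_model], fake_upstream, [mark_checked])
blocked = run_pipeline({"model": "claude", "loop": True}, [block_loops], [route_model], fake_upstream, [mark_checked])
```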
- Single Python file (`proxy.py`): FastAPI + httpx, no other heavy deps.
- SQLite for storage: WAL mode, schema migrations baked in, automatic backfills for parser improvements.
- Single static file (`static/index.html`): vanilla JS, dark theme, no build step or framework.
- Streaming-aware: buffers SSE only when post-flight intercept is enabled or `protocol_bridge` is translating; otherwise passes through chunk-by-chunk.
| Endpoint | Method | Description |
|---|---|---|
| `/__proxy/api/info` | GET | Upstream URLs, port |
| `/__proxy/api/health` | GET | DB stats, process info, pre-flight overhead |
| `/__proxy/api/requests` | GET | List requests (with live token state for pending rows) |
| `/__proxy/api/requests/{id}` | GET | Full detail incl. shadows |
| `/__proxy/api/conversations` | GET | List grouped conversations |
| `/__proxy/api/conversations/{id}` | GET | Turn-by-turn timeline |
| `/__proxy/api/stats` | GET | Per-model, per-client, per-app, per-tool aggregates |
| `/__proxy/api/audit` | GET | Gate verdict log |
| `/__proxy/api/suggestions` | GET | Auto-detected config recommendations |
| `/__proxy/api/rules` | GET / POST | Rules config (live-edit) |
| `/__proxy/api/system/now` | GET | CPU, memory, GPU, loaded models |
| `/__proxy/api/restart` | POST | Self-restart (requires `X-Confirm: restart-now`) |
| `/__proxy/api/export` | GET | Markdown/JSON digest for AI review |
| `/__proxy/mcp` | POST | MCP server (JSON-RPC) |
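A small standard-library client shows how these endpoints might be called. The helper names `build_request` and `fetch_json` are hypothetical, the base URL assumes the default host and port, and the shape of the JSON responses is not documented here, so the sketch only builds the requests and leaves decoding generic.

```python
import json
import urllib.request

# Assumes the proxy is running on the default PROXY_HOST/PROXY_PORT.
BASE = "http://localhost:8000/__proxy/api"


def build_request(path, method="GET", headers=None):
    """Build (but do not send) a request for one of the endpoints in the table."""
    return urllib.request.Request(BASE + path, method=method, headers=headers or {})


def fetch_json(req, timeout=10):
    """Send a prepared request and decode the JSON body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


list_req = build_request("/requests")
# The restart endpoint requires the confirmation header named in the table.
restart_req = build_request("/restart", method="POST", headers={"X-Confirm": "restart-now"})
```

With the proxy running, `fetch_json(list_req)` would return whatever JSON the requests endpoint emits.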
Register the proxy as an MCP server in any MCP-aware client:

```bash
claude mcp add --transport http ai-proxy http://localhost:8000/__proxy/mcp
```

Available tools: `list_recent_requests`, `get_request_detail`, `list_conversations`, `get_conversation`, `get_stats`, `get_audit`, `get_suggestions`, `get_rules`, `get_system_metrics`, `export_digest` (and `update_rules` when `MCP_ALLOW_WRITE=true`). Now Claude can answer "what's been going through the proxy in the last hour?" or "show me the slowest requests today" by querying the data directly.
Reproducible UI screenshots via Playwright headless Chromium:

```bash
pip install playwright
playwright install chromium
sudo playwright install-deps chromium  # Linux only
python scripts/screenshots.py --url http://localhost:8000/__proxy/ --out docs/screenshots/
```

Generates 9 PNGs covering every tab plus request detail, conversation detail, and the shadow compare view. See `scripts/screenshots.py` for options.
- No authentication built in: bind to `127.0.0.1` for local-only or front with a reverse proxy that handles auth.
- PII redaction is on by default: viewers from a different subnet than the request originator see body/header/preview fields replaced with a placeholder. Loopback viewers always see everything. Tunable via `PROXY_REDACT_PII` and `PROXY_REDACT_SUBNET_BITS`.
- Stored data includes full request/response bodies and headers (which means API keys, since they're in headers). The DB lives at `PROXY_DB` (default `./proxy.db`); protect it accordingly.
- Anthropic API keys are stripped from headers when the proxy bridges a request to a non-Anthropic upstream (so they don't leak into Ollama logs).
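The subnet-based redaction rule above can be approximated with the standard `ipaddress` module. This is a sketch under stated assumptions, not the proxy's code: the function name `viewer_sees_full_data` is hypothetical, and treating any non-IPv4 address as cross-subnet is a guess, since the text only describes IPv4 subnet width.

```python
import ipaddress


def viewer_sees_full_data(viewer_ip, origin_ip, subnet_bits=24, redact_enabled=True):
    """Decide whether a UI viewer may see unredacted bodies and headers."""
    if not redact_enabled:  # corresponds to PROXY_REDACT_PII being off
        return True
    viewer = ipaddress.ip_address(viewer_ip)
    if viewer.is_loopback:  # loopback viewers always see everything
        return True
    origin = ipaddress.ip_address(origin_ip)
    if viewer.version != 4 or origin.version != 4:
        return False  # assumption: non-IPv4 pairs are treated as cross-subnet
    # Build the originator's subnet, e.g. 192.168.1.0/24 for PROXY_REDACT_SUBNET_BITS=24.
    network = ipaddress.ip_network(f"{origin_ip}/{subnet_bits}", strict=False)
    return viewer in network
```

Widening `subnet_bits` from 24 to 16 would let a viewer on `192.168.2.x` see requests that originated on `192.168.1.x`.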
- Proxy not starting: check Ollama is running (`curl http://localhost:11434`), check the port isn't in use (`netstat -ano | findstr :8000` on Windows, `ss -ltn | grep :8000` on Linux).
- Database errors: delete `proxy.db` and restart for a fresh DB (this discards all captured traffic). Schema migrations run automatically on startup.
- Slow shadow runs on Ollama: see the auditor's `OLLAMA_NUM_PARALLEL` suggestion in the Audit tab. Each parallel slot is a separate KV-cache; with too few slots, interleaved conversations evict each other and pay full prefill every turn.
- Conversations not grouping: the proxy hashes `system + first_user` to derive a conversation ID. Per-request volatile content (e.g., timestamps, billing headers) is normalized out automatically; if you find a client that breaks grouping, the normalizer in `_normalize_for_cid` is the place to extend.
- Tokens missing on Anthropic streams: requires `Accept-Encoding: identity` to upstream (already forced by the proxy). Older rows from before that fix get repaired automatically by the `backfill_v5`/`backfill_v7` migrations on next startup.
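The conversation-grouping idea from the list above (hash the system prompt plus the first user message, after normalizing volatile content) might look roughly like this. The names `normalize_volatile` and `conversation_id`, the SHA-256 choice, the 16-character truncation, and the timestamp pattern are all assumptions for illustration; the proxy's `_normalize_for_cid` may work differently.

```python
import hashlib
import re

# Hypothetical volatile pattern: ISO-8601 timestamps such as 2024-05-01T10:00:00Z.
_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)


def normalize_volatile(text: str) -> str:
    """Replace per-request volatile content with a stable placeholder."""
    return _TIMESTAMP.sub("<ts>", text).strip()


def conversation_id(system: str, first_user: str) -> str:
    """Derive a stable ID from the system prompt and the first user message."""
    payload = normalize_volatile(system) + "\x00" + normalize_volatile(first_user)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


a = conversation_id("You are helpful. Now: 2024-05-01T10:00:00Z", "Fix the bug")
b = conversation_id("You are helpful. Now: 2024-05-02T18:30:12Z", "Fix the bug")
c = conversation_id("You are helpful. Now: 2024-05-01T10:00:00Z", "Write docs")
```

Here `a` and `b` match because only the timestamp differs, while `c` differs because the first user message changed. A client that breaks grouping usually injects some other volatile token that the normalizer would need an extra pattern for.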
Active development. Single-file design, SQLite-backed, no build step. Pull requests welcome.






