You maintain fallback logic, provider quirks, cost trade-offs, and model churn — copied into every client. That work belongs in one place, behind one interface.
Helm API is that place: an open-source, self-hosted LLM routing gateway — nginx for the LLM world. Your app sends a normal OpenAI, Anthropic, or Gemini request. A declarative YAML config decides which model answers, fails over when a provider breaks, translates protocols both ways, and records every decision. Clients set a base_url and an API key. Nothing else.
Manage traffic as configuration, not as code.
# Your app: the same OpenAI client, just a new base_url and key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="<helm-key>")
client.chat.completions.create(model="auto", messages=[...])  # Helm classifies and routes

Change the model behind a lane? Edit one YAML line, or click in the dashboard. Your apps never notice.
Prerequisites: Docker, or Node ≥ 22 + pnpm 10 to build from source.
# 1. Clone and create your env file
git clone https://github.com/EasyMetaAu/helm-api.git && cd helm-api
cp .env.example .env
# In .env, set HELM_ADMIN_PASSWORD and at least DEEPSEEK_API_KEY
# 2. Start it
docker compose up -d
# 3. Copy the root API key — generated and printed once on first boot
docker compose logs helm | grep -i "root API key"

| What | Where |
|---|---|
| Gateway | http://localhost:8080 (status landing page at /) |
| Dashboard | http://localhost:8080/admin — HELM_ADMIN_USER / HELM_ADMIN_PASSWORD |
| API docs | GET /docs (Swagger UI) · GET /openapi.json (OpenAPI 3.1, generated from the same Zod schemas the gateway validates with) |
| Health / version | GET /healthz · GET /version |
docker-compose.yml mounts ./config and ./data — config and database survive restarts. Credentials enter via environment variables only, never the image.
| | Feature | Detail |
|---|---|---|
| 🔀 | Four client protocols | OpenAI Chat, Anthropic Messages, OpenAI Responses, Google Gemini — all streaming + non-streaming. One IR in the middle: any client reaches any backend with a consistent output shape, SSE included. |
| 🧭 | Three-layer classification | Deterministic rules (pure, zero-network, unit-tested — always on) → optional small-model eval (temperature: 0, cached, off by default — needs a configured eval model) → balanced lane as the fail-open sink. |
| 🛣️ | Lanes + policies | Requests route through lanes (economy / balanced / premium, plus task lanes coding, json, vision, tool_use), never raw provider names. First-match policies pin or cap the lane. Each lane = a primary model + a fallback chain, all in config. Opt-in Agentic Signals can promote a degraded lane within those caps. |
| 🪪 | Drop-in for fixed-model clients | A client that hard-codes a vendor model id (Claude Code's claude-opus-4-8, an SDK locked to gpt-5.5) just works — no 400 unknown model. A standard key classifies it like auto; a custom-model key can map each vendor family onto a lane via model-aliases.yaml (cap-bounded). |
| 🛡️ | Resilient execution | Circuit breaker (OPEN/HALF_OPEN + single probe), capability filter with explicit skip reasons, :free-tier 429 skipping, per-key concurrency queueing. Client disconnects are never counted as provider faults. |
| 🔐 | OAuth subscriptions | Route your Claude Pro/Max, ChatGPT Codex, and GitHub Copilot subscriptions as backends — pooled accounts, per-account model curation / egress proxy / scheduling, all hot-reloaded. (Opt-in; read the ToS warning.) |
| 🔑 | Keys with teeth | Mandatory auth; keys authenticate by SHA-256 hash; encrypted recovery material can be stored for admin reveal/rotation. Per key: lane whitelist, custom-model permission, RPM/TPM limits, usage budgets (degrade or reject), concurrency cap, memory mode. Rotate in place, revoke softly, then delete permanently. |
| 🧠 | Memory middleware | On by default: remembered context is injected before routing as a trailing turn; a background worker compresses and consolidates — compaction is auto-adaptive and zero-config (prices and context windows resolve from the model catalog; size / idle / context-pressure triggers). Summarize/merge default to deterministic local logic, with an opt-in LLM path (config.memory.llm, off by default). A forgetting/tiering layer (decay, reinforcement, retention) keeps it honest. Opt out per key or per request (x-memory-mode: off). |
| 📊 | Total observability | A redacted decision record per request: classifier, policy, lane, every provider attempt, latency, fallbacks, cost. Verbatim payload capture goes to a separate table (on by default, 30-day retention). A payload inspector opens long fields fullscreen and previews inline images, and an editable Retry button replays any captured request in its own protocol. |
| 🖥️ | Admin dashboard | SvelteKit SPA at /admin behind HTTP Basic: overview, key CRUD, lane/policy/classifier editors, system settings, drill-down request log. Edits write back to config/*.yaml (comment-preserving, atomic) and rebind live — no restart, and they survive one. Five languages. |
| 💾 | Storage | SQLite by default (one local file). Postgres / Supabase behind the same Store-port abstraction — switch with one env var. |
Roadmap: Account/customer billing is intentionally out of scope. See 09 Roadmap.
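The circuit breaker named in the feature table (OPEN/HALF_OPEN plus a single probe) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not Helm's implementation: the class name, the failure threshold, the cooldown, and the injected clock are all hypothetical.

```python
import time


class CircuitBreaker:
    """CLOSED -> OPEN after repeated failures; OPEN -> HALF_OPEN after a cooldown.
    HALF_OPEN admits exactly one probe, whose result closes or re-opens the circuit."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.cooldown_s = cooldown_s                # assumed value
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self) -> bool:
        """Return True if a request may be sent to this provider right now."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"
            self.probe_in_flight = False
        if self.state == "CLOSED":
            return True
        if self.state == "HALF_OPEN" and not self.probe_in_flight:
            self.probe_in_flight = True  # single probe: later callers are refused
            return True
        return False

    def record_success(self) -> None:
        self.state, self.failures, self.probe_in_flight = "CLOSED", 0, False

    def record_failure(self, client_disconnected: bool = False) -> None:
        if client_disconnected:
            # A client hanging up is not a provider fault; just release the probe slot.
            self.probe_in_flight = False
            return
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
            self.probe_in_flight = False
```

Injecting the clock keeps the state machine testable without sleeping; a caller would check allow() before each provider attempt and skip to the next model in the chain when it returns False.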
The gateway ships a SvelteKit console at /admin (HTTP Basic, five languages). Everything here is live — edits write back to config/*.yaml and rebind on the next request, no restart.
Every request, fully explained. Open any request to follow the whole trail: which layer classified it, the policy that applied, the lane's full candidate chain, each provider actually tried, and the cost split down to cached tokens.
A payload inspector built for debugging. With verbatim capture on, the same page loads the full request and response bodies as a collapsible tree (or Formatted / Raw):
- Read anything. Pop any oversized field — a giant system prompt, a tool schema, a continued-session summary — into a fullscreen, copyable reader instead of scrolling a wrapped cell.
- See the multimedia. A media overview at the top collects every image sent (request) and generated (response) as clickable thumbnails — no tree-digging — and inline base64 or remote images still render in place, with zoom, fit-to-window, and open-in-new-tab.
- Edit and replay. Hit Retry, tweak the body, and re-send it in its original protocol (OpenAI Chat, Anthropic, Responses, or Gemini) as an isolated, newly-traced debug run.
Pool your subscriptions. Route Claude Pro/Max, ChatGPT Codex, and GitHub Copilot logins as backends — several accounts per provider, each with its own model curation, egress proxy, priority, and live quota.
Routing is just config. Each lane is a primary model plus an ordered fallback chain — reorder, swap, or constrain it from the UI or the YAML.
All 10 admin screenshots are annotated, screen by screen, in 11 · Admin UI.
This is the design rule everything else hangs off:
- Config and credentials are fail-closed. Invalid YAML, a missing required key, an unknown store driver — the gateway refuses to start. It never runs half-configured.
- The request path is fail-open. Classification, eval, memory, cache: any optional step that stumbles degrades quietly to the balanced lane and gets logged. A client sees a structured error only when every provider in the chain is genuinely down.
And two fallbacks that are never conflated: classification fallback (undecided → balanced lane) and execution fallback (provider failed → next model in the chain). Separate mechanisms, separate decision-record fields — you can always tell which one fired.
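To make the distinction concrete, here is a minimal Python sketch that keeps the two fallbacks in separate fields of a decision record. Everything in it is hypothetical and for illustration only: the lane table, the toy classification rule, the DecisionRecord field names, and the choice of ConnectionError as the provider fault are assumptions, not Helm's schema or code.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical lane table: each lane is a primary model followed by its fallback chain.
LANES = {
    "balanced": ["model-b1", "model-b2"],
    "coding": ["model-c1", "model-b1"],
}


@dataclass
class DecisionRecord:
    lane: str = ""
    classification_fallback: bool = False  # undecided -> balanced lane
    execution_fallbacks: int = 0           # failed providers skipped in the chain
    attempts: list = field(default_factory=list)


def classify(prompt: str) -> Optional[str]:
    """Toy deterministic rule: return a lane name, or None when undecided."""
    return "coding" if "def " in prompt or "function" in prompt else None


def route(prompt: str, call_model: Callable[[str, str], str]) -> tuple[str, DecisionRecord]:
    record = DecisionRecord()
    try:
        lane = classify(prompt)
    except Exception:
        lane = None  # fail-open: a broken classifier must not fail the request
    if lane is None:
        lane = "balanced"
        record.classification_fallback = True
    record.lane = lane

    for model in LANES[lane]:
        record.attempts.append(model)
        try:
            return call_model(model, prompt), record
        except ConnectionError:
            record.execution_fallbacks += 1  # advance to the next model in the chain
    raise RuntimeError("every provider in the chain failed")
```

Because the two counters are written in different places, a record with classification_fallback set and execution_fallbacks at zero reads unambiguously as "the classifier was undecided, the first provider answered".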
Four client protocols enter one stable interface; one framework-agnostic core does the work; config drives every stage. (For the same pipeline as sequence, flow, and state diagrams, see Architecture & Data Flow.)
CLIENT ── OpenAI · Anthropic · OpenAI Responses · Google Gemini
one base_url + one Helm key · send model:"auto"
│
▼
GATEWAY apps/gateway (Hono) · thin HTTP shell — also serves /admin SPA + /docs
│ normalize any protocol ──▶ one InternalRequest (IR)
▼
CORE packages/core · the routing brain (imports no web framework)
│
├─ auth resolve sha256 key, load per-key caps · fail-closed
├─ gate rate limit (off) · usage budget (off) · fail-closed
├─ memory inject remembered context (on by default) · fail-open
├─ classify L1 rules ─uncertain→ L2 eval (off) ─→ balanced · fail-open
├─ resolve alias shim · explicit model · first-match policy
│ └─▶ lane → caps (+ signals) → fallback chain
├─ execute capability filter → circuit breaker → provider
│ └── on failure: advance to next model in the chain
└─ translate provider-native ⇄ IR ⇄ client protocol (streaming SSE)
│
▼
RESULT ── streamed/JSON response, in the client's own protocol
│
├─▶ telemetry redacted decision record + verbatim payload capture
├─▶ memory write back the turn
└─▶ upstream static API keys + OAuth subscriptions (pooled · hot-reload)
config/*.yaml drives every stage · Zod-validated · invalid config refuses to boot (fail-closed)
The core is headless by contract: routing, classification, provider execution, protocol translation, and storage live in packages/core and import no web framework — an architecture test enforces it. Hono and SvelteKit are thin, optional shells.
helm-api/
├─ apps/
│ ├─ gateway/ # Hono API + serves the dashboard + /healthz, /version
│ └─ admin/ # SvelteKit + Tailwind dashboard (static SPA)
├─ packages/
│ ├─ core/ # routing, classification, providers, protocol translation, storage ports (no framework)
│ └─ shared/ # Zod schemas + shared types (single source of truth)
├─ config/ # default lanes / policies / classifier / providers / model-aliases / … YAML
├─ docs/ # documentation (start at docs/README.md)
└─ scripts/ # sync:catalog and other build-time tools
Any OpenAI-compatible client works. Point it at Helm with a Helm key:
curl http://localhost:8080/v1/chat/completions \
-H "Authorization: Bearer $HELM_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Explain consistent hashing in two sentences."}],
"stream": true
}'

| Endpoint | Protocol | Streaming |
|---|---|---|
| POST /v1/chat/completions | OpenAI Chat Completions | ✅ |
| POST /v1/messages | Anthropic Messages | ✅ |
| POST /v1/responses | OpenAI Responses | ✅ |
| POST /v1beta/models/{model}:generateContent | Google Gemini | ✅ (via :streamGenerateContent; auth via x-goog-api-key) |
| POST /v1/images/generations | OpenAI Images API (image generation) | No (model-pinned, any key) |
| POST /v1beta/interactions | Gemini Interactions API (image generation) | No (model-pinned, any key) |
What to put in model:
| Value | What Helm does |
|---|---|
| auto (recommended) | Classifies the request and routes it to the best lane. |
| any model/lane on a standard key | Helm still classifies and routes as if you'd sent auto (never a 400); the model field doesn't pick the lane. But if the model you named is already in the chosen lane's chain, Helm serves that candidate first. |
| a pinned vendor id, e.g. claude-opus-4-8 (custom-model key) | The compatibility shim maps it onto a lane (config/model-aliases.yaml), cap-bounded by the key's lanes. |
| a lane name (premium) or exact alias (deepseek/deepseek-v4-pro) (custom-model key) | Routes straight into that lane / model, skipping classification. |
A standard key only ever needs auto. The model field never changes which lane is chosen, but when the named model already sits in that lane's chain, Helm promotes it to the front (so Claude Code pinning claude-sonnet-4-6 gets Sonnet, not the lane's primary; it falls back to the rest of the chain on failure). Pinning a lane, a vendor family, or an out-of-lane model requires a custom-model key (allow_custom_model). Lanes are operator config (lanes.yaml + dashboard).
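The promotion rule for standard keys can be sketched in a few lines of Python. The function name and the example chain are hypothetical; the sketch only illustrates the ordering described above, not how Helm implements it.

```python
def order_candidates(chain: list[str], requested_model: str) -> list[str]:
    """Return the lane's chain with the requested model moved to the front
    if, and only if, it is already a member of that chain."""
    if requested_model not in chain:
        return list(chain)  # "auto" or an out-of-lane model: order unchanged
    return [requested_model] + [m for m in chain if m != requested_model]


# Hypothetical lane contents, primary first.
chain = ["claude-opus-4-8", "claude-sonnet-4-6", "deepseek/deepseek-v4-pro"]
print(order_candidates(chain, "claude-sonnet-4-6"))
# ['claude-sonnet-4-6', 'claude-opus-4-8', 'deepseek/deepseek-v4-pro']
```

The lane itself is chosen before this step, so naming a model only reorders candidates inside the lane the classifier already picked.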
Image models are model-pinned: you name the exact model (or an image lane — see Failover below), with no classification, and any valid key works (no allow_custom_model needed; cost is bounded by the key's budget / rate limit). Operator-configured models: gpt-image-2 (OpenAI), gemini-3.1-flash-image / gemini-3-pro-image (Google "Nano Banana"). Every call is metered per image (output tokens × the model's image rate) and appears in the dashboard like any other request. Three entrypoints — match the one your SDK speaks:
1. OpenAI Images API: POST /v1/images/generations (Bearer auth), returning { "created", "data": [{ "b64_json" }], "usage" }:
curl http://localhost:8080/v1/images/generations \
  -H "Authorization: Bearer $HELM_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "gpt-image-2", "prompt": "a single red apple on a plain white background", "size": "1024x1024" }'

2. Gemini generateContent: the Gemini SDK's generate_content path. Name an image model and ask for image output; Helm routes it natively, so the response carries candidates[].content.parts[].inlineData:
curl "http://localhost:8080/v1beta/models/gemini-3.1-flash-image:generateContent" \
  -H "x-goog-api-key: $HELM_KEY" -H "Content-Type: application/json" \
  -d '{ "contents": [{ "parts": [{ "text": "a single red apple on a plain white background" }] }],
        "generationConfig": { "responseModalities": ["TEXT", "IMAGE"] } }'

3. Gemini Interactions API: POST /v1beta/interactions (the SDK's client.interactions.create). The response uses the steps[] shape, with the image at steps[].content[] ({ "type": "image", "data": … }); the SDK's interaction.output_image.data reads it:
curl http://localhost:8080/v1beta/interactions \
  -H "x-goog-api-key: $HELM_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "gemini-3.1-flash-image", "input": "a single red apple on a plain white background",
        "response_format": { "type": "image", "aspect_ratio": "1:1" } }'

The OpenAI Images endpoint serves both OpenAI and Gemini image models (Helm translates Gemini to/from generateContent). The two Gemini-native entrypoints serve only Gemini image models. gpt-image-2 on /v1beta/interactions is a 400 → use /v1/images/generations.
The same image model is often available from several providers (official upstream, ZenMux, OpenRouter…). The shipped config already groups them into image lanes — name the lane as your model and Helm tries the primary, then on a provider fault (timeout, 5xx, circuit-open) falls over to the next, using the same circuit breaker as the chat router. A deterministic client error (a 4xx invalid request — bad size, oversized image) is returned verbatim and does not trigger failover.
# config/lanes.yaml — the two shipped image lanes lead with the OFFICIAL upstream,
# then fall over to the ZenMux relay. Members must be image models
# (capabilities.outputImage) and a single kind (all gpt-image-* OR all gemini-*-image).
gpt-image:                                # request `model: "gpt-image"`
  primary: openai/gpt-image-2             # OpenAI official → ZenMux relay
  fallback: [gpt-image-2]
gemini-image:                             # request `model: "gemini-image"`
  primary: google/gemini-3.1-flash-image  # Google official → ZenMux flash → pro
  fallback: [gemini-3.1-flash-image, gemini-3-pro-image]

Image lanes work for any key on the two dedicated endpoints (/v1/images/generations, /v1beta/interactions). On the Gemini :generateContent path, naming a lane follows the normal lane rule (it requires an allow_custom_model key), so for the broadest reach, point image SDKs at the dedicated endpoints.
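The failover rule for image lanes (advance on a provider fault, return a deterministic client error untouched) can be sketched in Python as follows. The exception classes, the function name, and the call_provider callback are hypothetical stand-ins; only the lane shape mirrors the YAML above.

```python
class ProviderFault(Exception):
    """Timeout, 5xx, or circuit-open: worth trying the next provider."""


class ClientError(Exception):
    """Deterministic 4xx (bad size, oversized image): retrying elsewhere cannot help."""


def generate_with_failover(lane, request, call_provider):
    """Try each model in the lane in order; fail over only on ProviderFault."""
    faults = []
    for model in [lane["primary"], *lane["fallback"]]:
        try:
            return model, call_provider(model, request)
        except ProviderFault as exc:
            faults.append((model, str(exc)))
        # ClientError is deliberately not caught: it propagates to the caller as-is.
    raise ProviderFault(f"all providers in the lane failed: {faults}")
```

Keeping the two error types distinct is what stops a bad request from burning through every provider in the lane before the client sees the same 4xx.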
Other endpoints (full interactive docs at /docs, raw spec at /openapi.json):
| Endpoint | Auth | Purpose |
|---|---|---|
| GET / · GET /healthz · GET /version | None | Landing page · readiness · build info |
| GET /v1/models · GET /v1/models/{id} | API key | Models the key can route to (lanes + auto; concrete aliases with capabilities & pricing for custom-model keys) |
| /admin · /admin/api/* | Basic auth | Dashboard + its JSON backend (mounted only when admin is enabled) |
Everything lives in config/*.yaml, Zod-validated on load. Invalid config stops the gateway from starting. Lanes, policies, the classifier, and system settings are also editable live in the dashboard — edits persist back to the YAML files (comments preserved) and apply on the next request.
| File | Controls | Live-editable |
|---|---|---|
| server.yaml | Host / port / base path | No |
| auth.yaml | API key requirement + first-run root key | No |
| runtime.yaml | Request limits, rate-limit defaults, storage driver, opt-in signal feedback | partial |
| providers.yaml | Upstream providers + model aliases (credentials by env-var name only) | No |
| lanes.yaml | Each lane's primary model + fallback chain (quality, task, and vendor-family lanes) | ✅ persists |
| policies.yaml | First-match rules that pick or cap the lane | ✅ persists |
| classifier.yaml | Built-in rules + the optional eval model | ✅ persists |
| model-aliases.yaml | Maps a pinned vendor model id → lane / auto (compatibility shim, optional) | No |
| memory.yaml | Forgetting/tiering knobs (on in the shipped config) · optional compaction trigger overrides (compaction:) · optional LLM summarizer (llm:, off by default). A leftover observer: block from older configs refuses startup | ✅ |
| capabilities.yaml / pricing.yaml | Manual overrides on the model catalog (incl. prompt-cache read/write prices) | No |
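The fail-closed rule for configuration (an invalid file stops startup rather than yielding a half-configured gateway) can be illustrated with a small Python sketch. Helm validates with Zod schemas; the validator below is a hypothetical stand-in, and its rules (a non-empty mapping of lanes, a required primary, an optional fallback list) are assumptions about the lane shape shown earlier, not the real schema.

```python
class ConfigError(Exception):
    """Raised at load time; the caller is expected to abort startup."""


def validate_lanes(lanes: dict) -> dict:
    """Fail closed: raise on the first problem instead of returning a partial config."""
    if not isinstance(lanes, dict) or not lanes:
        raise ConfigError("lanes must be a non-empty mapping")
    for name, lane in lanes.items():
        if not isinstance(lane, dict):
            raise ConfigError(f"lane {name!r} must be a mapping")
        primary = lane.get("primary")
        if not isinstance(primary, str) or not primary:
            raise ConfigError(f"lane {name!r} needs a primary model")
        fallback = lane.get("fallback", [])  # assumed optional
        if not isinstance(fallback, list) or not all(isinstance(m, str) for m in fallback):
            raise ConfigError(f"lane {name!r}: fallback must be a list of model names")
    return lanes
```

The point of raising rather than skipping the bad lane is that a typo surfaces at boot, not as a routing surprise hours later.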
Most-used environment variables (env wins over YAML; full list in .env.example):
| Variable | Purpose |
|---|---|
| DEEPSEEK_API_KEY | Primary provider credential (required) |
| ZENMUX_API_KEY, OPENROUTER_API_KEY | Optional provider credentials (provider skipped if missing) |
| OPENAI_API_KEY, GEMINI_API_KEY | Optional: official OpenAI / Google image providers; the shipped gpt-image / gemini-image lanes lead with these and fail over to ZenMux |
| HELM_ADMIN_USER / HELM_ADMIN_PASSWORD | Dashboard login (Basic auth) |
| HELM_HOST / HELM_PORT | Server binding (default 0.0.0.0:8080) |
| HELM_STORE_DRIVER | sqlite (default) or supabase |
| HELM_STORE_URL_ENV | For supabase: the name of the env var holding the Postgres DSN |
| HELM_RATE_LIMIT_ENABLED | Turn rate limiting on (off by default) |
| HELM_OAUTH_ENC_KEY | 32-byte key encrypting recoverable API keys and stored OAuth tokens (required if any subscription provider is configured; needed for later API-key reveal) |
Storage. SQLite (better-sqlite3, a helm.db file under ./data) is the default. For Postgres/Supabase, set HELM_STORE_DRIVER=supabase and point HELM_STORE_URL_ENV at the env var holding your DSN. Unknown drivers fail closed at startup.

Credentials. Provider keys are referenced by env-var name in providers.yaml; plaintext never enters the repo or the image.
A provider can authenticate with an OAuth subscription instead of a static key: log in from the dashboard (Providers → Connect). Claude Pro/Max and ChatGPT Codex use an authorization-code paste; GitHub Copilot uses a device code. Helm stores the rotating refresh token encrypted at rest and refreshes access tokens automatically.
Set HELM_OAUTH_ENC_KEY (32 bytes: base64 or 64 hex chars) — Helm refuses to start if a subscription provider is configured without it. The same key encrypts API-key recovery material used by the admin reveal/rotate flows. Then add an oauth: { provider: anthropic | github-copilot | openai-codex } block to the provider (commented examples in config/providers.yaml; for Claude use type: anthropic).
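One way to produce a suitable value is with Python's standard library, shown here as a sketch. It assumes that standard (not URL-safe) base64 is the accepted encoding, which the text above does not specify; if in doubt, the 64-character hex form sidesteps the question.

```python
import base64
import secrets

raw = secrets.token_bytes(32)               # 32 random bytes from the OS CSPRNG
as_hex = raw.hex()                          # 64 hex characters
as_base64 = base64.b64encode(raw).decode()  # 44 base64 characters (standard alphabet)

# Paste one of the two encodings into .env; both represent the same 32 bytes.
print(f"HELM_OAUTH_ENC_KEY={as_hex}")
```

Generate the key once and keep it with your other secrets: the text above says it encrypts stored OAuth tokens and API-key recovery material, so a lost key presumably leaves that material unreadable.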
Pool several accounts per provider. Each account (Providers → Manage) gets its own:
- Models: a live allow-list, not a display filter. A removed model stops routing immediately; an uncurated model is refused (fail-closed).
- Proxy: HTTP/HTTPS/SOCKS5 egress per account, used across the entire subscription flow, so co-hosted accounts exit from distinct IPs.
- Schedule: priority (lower serves first) + a schedulable toggle; round-robin (LRU) within equal priority. Park an account to keep it connected but out of rotation.
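The scheduling rule in the last bullet can be sketched in Python. The Account fields and the logical tick counter are hypothetical; the sketch shows only the selection order (priority first, least recently used within a tie, parked accounts excluded), not Helm's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Account:
    name: str
    priority: int             # lower serves first
    schedulable: bool = True  # False = parked: connected but out of rotation
    last_used: int = 0        # logical tick of the most recent pick


def pick_account(accounts: list[Account], tick: int) -> Optional[Account]:
    """Lowest priority wins; ties go to the least recently used account."""
    eligible = [a for a in accounts if a.schedulable]
    if not eligible:
        return None
    chosen = min(eligible, key=lambda a: (a.priority, a.last_used))
    chosen.last_used = tick
    return chosen
```

With two accounts at the same priority, successive picks alternate between them, while a higher-numbered priority is only reached once every lower-numbered account is parked.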
Everything hot-reloads — connect, disconnect, curation, proxy, scheduling — next request, no restart. Helm also mirrors each official client's identity headers and sends a stable per-account device identity (never rotated mid-stream) to reduce ban-correlation risk.
⚠️ Terms of service. Routing a Claude/ChatGPT/Copilot subscription through a third-party gateway may violate the provider's ToS and can get accounts suspended. This is an opt-in feature for self-hosted personal use — you are responsible for compliance with your provider agreements. When in doubt, use a normal API key (api_key_env).
Requires Node ≥ 22 and pnpm 10.
pnpm install
pnpm dev # admin dashboard dev server (Vite) — see note below
pnpm test # Vitest unit tests
pnpm exec vitest run --coverage # unit coverage with source-only include/exclude + thresholds
pnpm test:e2e # Playwright end-to-end tests
pnpm typecheck # tsc --noEmit across the workspace
pnpm lint # Biome
pnpm build # build the gateway + dashboard
pnpm sync:catalog # refresh the generated model catalog (capabilities + pricing)
pnpm dev starts only the admin SPA. The gateway has no watch script; run it built (pnpm build then node apps/gateway/dist/index.js) or via Docker.
Tests come first: Vitest for the core, Playwright for full flows. Design decisions live in implementation-notes.md. Before a PR:
pnpm typecheck && pnpm lint && pnpm test && pnpm test:e2e

Start at docs/README.md. For a visual tour of the pipeline, read Architecture & Data Flow. The numbered specification, in order:
01 Overview · 02 Architecture · 03 Classification · 04 Routing & Lanes · 05 Protocol Translation · 06 Auth & Rate Limits · 07 Observability · 08 Memory Middleware · 09 Roadmap · 10 Deployment · 11 Admin UI · 12 Memory Forgetting & Tiering · 13 Memory Admin & MCP · 14 Memory Deep Recall · Protocol Compatibility
Helm API is a real, end-to-end implementation, not a scaffold. The full pipeline (config → auth → classify → route → execute with circuit-breaking and fallback → protocol translation → telemetry → memory) is wired and covered by an extensive Vitest suite plus Playwright e2e specs. The version badge above tracks the current release.
MIT © 2026 EasyMeta AU









