Helm API

One gateway in front of every LLM provider. Pick models by config, not code.

Open-source · self-hosted · MIT

License: MIT · Version · Node · TypeScript · Built with Hono · Admin: SvelteKit

You maintain fallback logic, provider quirks, cost trade-offs, and model churn — copied into every client. That work belongs in one place, behind one interface.

Helm API is that place: an open-source, self-hosted LLM routing gateway (nginx for the LLM world). Your app sends a normal OpenAI, Anthropic, or Gemini request. A declarative YAML config decides which model answers, fails over when a provider breaks, translates protocols both ways, and records every decision. Clients set a base_url and an API key. Nothing else.

Manage traffic as configuration, not as code.

# Your app: the same OpenAI client, just a new base_url and key.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="<helm-key>")
client.chat.completions.create(model="auto", messages=[...])   # Helm classifies and routes

Change the model behind a lane? Edit one YAML line — or click in the dashboard. Your apps never notice.

The dashboard — live traffic, token usage by model, spend, and the most recent routing decisions.

Quickstart

Prerequisites: Docker, or Node ≥ 22 + pnpm 10 to build from source.

# 1. Clone and create your env file
git clone https://github.com/EasyMetaAu/helm-api.git && cd helm-api
cp .env.example .env
#    In .env, set HELM_ADMIN_PASSWORD and at least DEEPSEEK_API_KEY

# 2. Start it
docker compose up -d

# 3. Copy the root API key — generated and printed once on first boot
docker compose logs helm | grep -i "root API key"

| What | Where |
| --- | --- |
| Gateway | http://localhost:8080 (status landing page at /) |
| Dashboard | http://localhost:8080/admin (log in with HELM_ADMIN_USER / HELM_ADMIN_PASSWORD) |
| API docs | GET /docs (Swagger UI) · GET /openapi.json (OpenAPI 3.1, generated from the same Zod schemas the gateway validates with) |
| Health / version | GET /healthz · GET /version |

docker-compose.yml mounts ./config and ./data — config and database survive restarts. Credentials enter via environment variables only, never the image.

What you get

| Feature | Detail |
| --- | --- |
| 🔀 Four client protocols | OpenAI Chat, Anthropic Messages, OpenAI Responses, Google Gemini, all streaming + non-streaming. One IR in the middle: any client reaches any backend with a consistent output shape, SSE included. |
| 🧭 Three-layer classification | Deterministic rules (pure, zero-network, unit-tested, always on) → optional small-model eval (temperature: 0, cached, off by default, needs a configured eval model) → balanced lane as the fail-open sink. |
| 🛣️ Lanes + policies | Requests route through lanes (economy / balanced / premium, plus task lanes coding, json, vision, tool_use), never raw provider names. First-match policies pin or cap the lane. Each lane = a primary model + a fallback chain, all in config. Opt-in Agentic Signals can promote a degraded lane within those caps. |
| 🪪 Drop-in for fixed-model clients | A client that hard-codes a vendor model id (Claude Code's claude-opus-4-8, an SDK locked to gpt-5.5) just works, with no 400 unknown model. A standard key classifies it like auto; a custom-model key can map each vendor family onto a lane via model-aliases.yaml (cap-bounded). |
| 🛡️ Resilient execution | Circuit breaker (OPEN/HALF_OPEN + single probe), capability filter with explicit skip reasons, :free-tier 429 skipping, per-key concurrency queueing. Client disconnects are never counted as provider faults. |
| 🔐 OAuth subscriptions | Route your Claude Pro/Max, ChatGPT Codex, and GitHub Copilot subscriptions as backends: pooled accounts, per-account model curation / egress proxy / scheduling, all hot-reloaded. (Opt-in; read the ToS warning.) |
| 🔑 Keys with teeth | Mandatory auth; keys authenticate by SHA-256 hash; encrypted recovery material can be stored for admin reveal/rotation. Per key: lane whitelist, custom-model permission, RPM/TPM limits, usage budgets (degrade or reject), concurrency cap, memory mode. Rotate in place, revoke softly, then delete permanently. |
| 🧠 Memory middleware | On by default: remembered context is injected before routing as a trailing turn; a background worker compresses and consolidates. Compaction is auto-adaptive and zero-config (prices and context windows resolve from the model catalog; size / idle / context-pressure triggers). Summarize/merge default to deterministic local logic, with an opt-in LLM path (config.memory.llm, off by default). A forgetting/tiering layer (decay, reinforcement, retention) keeps it honest. Opt out per key or per request (x-memory-mode: off). |
| 📊 Total observability | A redacted decision record per request: classifier, policy, lane, every provider attempt, latency, fallbacks, cost. Verbatim payload capture to a separate table (on by default, 30-day retention). A payload inspector reads long fields fullscreen, previews inline images, and an editable Retry button replays any captured request in its own protocol. |
| 🖥️ Admin dashboard | SvelteKit SPA at /admin behind HTTP Basic: overview, key CRUD, lane/policy/classifier editors, system settings, drill-down request log. Edits write back to config/*.yaml (comment-preserving, atomic) and rebind live: no restart needed, and they survive one. Five languages. |
| 💾 Storage | SQLite by default (one local file). Postgres / Supabase behind the same Store-port abstraction; switch with one env var. |
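
The "Resilient execution" row names a circuit breaker with OPEN and HALF_OPEN states and a single probe. The sketch below shows that general pattern in Python. It is not Helm's implementation (which lives in packages/core). The class name, the failure threshold, and the cooldown are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: CLOSED -> OPEN after repeated faults; HALF_OPEN admits one probe."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.cooldown_s = cooldown_s                # hypothetical default
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow(self):
        """Return True if a request may be sent to this provider right now."""
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.cooldown_s:
                return False
            # Cooldown elapsed: move to HALF_OPEN and let one probe through.
            self.state = "HALF_OPEN"
            self.probe_in_flight = False
        if self.probe_in_flight:
            return False  # a probe is already running; keep everyone else out
        self.probe_in_flight = True
        return True

    def record_success(self):
        self.state = "CLOSED"
        self.failures = 0
        self.probe_in_flight = False

    def record_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
            self.probe_in_flight = False
```

In this sketch a client disconnect would call neither record_success nor record_failure, which matches the rule that disconnects are never counted as provider faults.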

Roadmap: Account/customer billing is intentionally out of scope. See 09 Roadmap.

Inside the dashboard

The gateway ships a SvelteKit console at /admin (HTTP Basic, five languages). Everything here is live — edits write back to config/*.yaml and rebind on the next request, no restart.

Every request, fully explained. Open any request to follow the whole trail: which layer classified it, the policy that applied, the lane's full candidate chain, each provider actually tried, and the cost split down to cached tokens.

Request trail — classifier verdict, lane candidate chain, provider attempts, and cost breakdown

A payload inspector built for debugging. With verbatim capture on, the same page loads the full request and response bodies as a collapsible tree (or Formatted / Raw):

  • Read anything. Pop any oversized field — a giant system prompt, a tool schema, a continued-session summary — into a fullscreen, copyable reader instead of scrolling a wrapped cell.
  • See the multimedia. A media overview at the top collects every image sent (request) and generated (response) as clickable thumbnails — no tree-digging — and inline base64 or remote images still render in place, with zoom, fit-to-window, and open-in-new-tab.
  • Edit and replay. Hit Retry, tweak the body, and re-send it in its original protocol (OpenAI Chat, Anthropic, Responses, or Gemini) as an isolated, newly-traced debug run.

Pool your subscriptions. Route Claude Pro/Max, ChatGPT Codex, and GitHub Copilot logins as backends — several accounts per provider, each with its own model curation, egress proxy, priority, and live quota.

Subscription providers — pooled OAuth accounts with per-account quota, proxy, schedule, and status

Routing is just config. Each lane is a primary model plus an ordered fallback chain — reorder, swap, or constrain it from the UI or the YAML.

Lanes editor — primary model and ordered fallback chain per lane

See every admin screen (all 10 screenshots):

  • Dashboard: traffic, spend, token usage, recent decisions
  • Requests: the filterable request log
  • Request trail: the full per-request decision trail
  • Lanes: primary + ordered fallback chain per lane
  • Classifier: eval toggle, threshold, rule weights
  • Providers: pooled OAuth subscription accounts
  • Memory: facts & reflections by scope or key
  • Policies: first-match rules that pick or cap the lane
  • API Keys: per-key caps, limits, budgets, memory mode
  • Settings: payload capture, rate limits, queue, DB maintenance

Each screen is annotated in 11 · Admin UI.

Two failure disciplines

This is the design rule everything else hangs off:

  • Config and credentials are fail-closed. Invalid YAML, a missing required key, an unknown store driver — the gateway refuses to start. It never runs half-configured.
  • The request path is fail-open. Classification, eval, memory, cache — any optional step that stumbles degrades quietly to the balanced lane and gets logged. A client sees a structured error only when every provider in the chain is genuinely down.

And two fallbacks that are never conflated: classification fallback (undecided → balanced lane) and execution fallback (provider failed → next model in the chain). Separate mechanisms, separate decision-record fields — you can always tell which one fired.
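
To make the distinction concrete, here is a minimal Python sketch of a router that records the two fallbacks in separate fields. The record field names (classification_fallback, execution_fallbacks, served_by) and the function signatures are hypothetical; Helm's real decision record may be shaped differently.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    lane: str = ""
    classification_fallback: bool = False                     # undecided -> balanced lane
    execution_fallbacks: list = field(default_factory=list)   # models that failed before the answer
    served_by: str = ""


def route(classify, chain_for, call_model, request):
    record = DecisionRecord()
    try:
        lane = classify(request)      # may return None when undecided
    except Exception:
        lane = None                   # fail-open: a broken classifier never blocks the request
    if lane is None:
        lane = "balanced"
        record.classification_fallback = True
    record.lane = lane

    for model in chain_for(lane):
        try:
            response = call_model(model, request)
        except ConnectionError:       # stand-in for a provider fault
            record.execution_fallbacks.append(model)
            continue
        record.served_by = model
        return response, record
    raise RuntimeError("every provider in the chain failed")
```

Because the two mechanisms write to different fields, a record can show either one firing, both, or neither.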

Architecture

Four client protocols enter one stable interface; one framework-agnostic core does the work; config drives every stage. (For the same pipeline as sequence, flow, and state diagrams, see Architecture & Data Flow.)

CLIENT ── OpenAI · Anthropic · OpenAI Responses · Google Gemini
          one base_url + one Helm key · send model:"auto"
             │
             ▼
GATEWAY   apps/gateway (Hono) · thin HTTP shell — also serves /admin SPA + /docs
             │   normalize any protocol  ──▶  one InternalRequest (IR)
             ▼
CORE      packages/core · the routing brain (imports no web framework)
             │
             ├─ auth        resolve sha256 key, load per-key caps        · fail-closed
             ├─ gate        rate limit (off) · usage budget (off)        · fail-closed
             ├─ memory      inject remembered context (on by default)    · fail-open
             ├─ classify    L1 rules ─uncertain→ L2 eval (off) ─→ balanced · fail-open
             ├─ resolve     alias shim · explicit model · first-match policy
             │                  └─▶ lane → caps (+ signals) → fallback chain
             ├─ execute     capability filter → circuit breaker → provider
             │                  └── on failure: advance to next model in the chain
             └─ translate   provider-native  ⇄  IR  ⇄  client protocol (streaming SSE)
             │
             ▼
RESULT ── streamed/JSON response, in the client's own protocol
             │
             ├─▶ telemetry   redacted decision record + verbatim payload capture
             ├─▶ memory      write back the turn
             └─▶ upstream    static API keys + OAuth subscriptions (pooled · hot-reload)

config/*.yaml drives every stage · Zod-validated · invalid config refuses to boot (fail-closed)

The core is headless by contract: routing, classification, provider execution, protocol translation, and storage live in packages/core and import no web framework — an architecture test enforces it. Hono and SvelteKit are thin, optional shells.
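
The "normalize any protocol → one InternalRequest" step in the diagram can be pictured with a small Python sketch. The IR field names below (protocol, system, turns, stream) are invented for illustration; Helm's actual InternalRequest is defined by its own schemas. The sketch handles only plain-string message content, which is a simplification.

```python
def from_openai_chat(body):
    """OpenAI Chat carries the system prompt as a message with role 'system'."""
    system = [m["content"] for m in body["messages"] if m["role"] == "system"]
    turns = [m for m in body["messages"] if m["role"] != "system"]
    return {
        "protocol": "openai-chat",
        "model": body["model"],
        "system": "\n".join(system),
        "turns": [{"role": m["role"], "text": m["content"]} for m in turns],
        "stream": bool(body.get("stream", False)),
    }


def from_anthropic_messages(body):
    """Anthropic Messages carries the system prompt as a top-level 'system' field."""
    return {
        "protocol": "anthropic-messages",
        "model": body["model"],
        "system": body.get("system", ""),
        "turns": [{"role": m["role"], "text": m["content"]} for m in body["messages"]],
        "stream": bool(body.get("stream", False)),
    }
```

Once both requests share one shape, classification and routing only have to understand that shape, and the original protocol is needed again only when translating the response back.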

helm-api/
├─ apps/
│  ├─ gateway/   # Hono API + serves the dashboard + /healthz, /version
│  └─ admin/     # SvelteKit + Tailwind dashboard (static SPA)
├─ packages/
│  ├─ core/      # routing, classification, providers, protocol translation, storage ports (no framework)
│  └─ shared/    # Zod schemas + shared types (single source of truth)
├─ config/       # default lanes / policies / classifier / providers / model-aliases / … YAML
├─ docs/         # documentation (start at docs/README.md)
└─ scripts/      # sync:catalog and other build-time tools

Calling the gateway

Any OpenAI-compatible client works. Point it at Helm with a Helm key:

curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer $HELM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Explain consistent hashing in two sentences."}],
    "stream": true
  }'

| Endpoint | Protocol | Streaming |
| --- | --- | --- |
| POST /v1/chat/completions | OpenAI Chat Completions | ✅ |
| POST /v1/messages | Anthropic Messages | ✅ |
| POST /v1/responses | OpenAI Responses | ✅ |
| POST /v1beta/models/{model}:generateContent | Google Gemini | ✅ (via :streamGenerateContent; auth via x-goog-api-key) |
| POST /v1/images/generations | OpenAI Images API (image generation) | No (model-pinned, any key) |
| POST /v1beta/interactions | Gemini Interactions API (image generation) | No (model-pinned, any key) |
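
With "stream": true on /v1/chat/completions, a client reads server-sent events. The Python sketch below assumes Helm emits the standard OpenAI chunk format (data: lines carrying JSON with choices[].delta.content, ended by data: [DONE]), which is what a reply "in the client's own protocol" suggests. The function name is illustrative.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from OpenAI-style SSE lines until the [DONE] sentinel."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separators and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)
```

The function takes any iterable of text lines, so it can be fed from an HTTP client's line iterator or from a captured payload while debugging.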

What to put in model:

| Value | What Helm does |
| --- | --- |
| auto (recommended) | Classifies the request and routes it to the best lane. |
| any model/lane on a standard key | Helm still classifies and routes as if you'd sent auto (never a 400); the model field doesn't pick the lane. But if the model you named is already in the chosen lane's chain, Helm serves that candidate first. |
| a pinned vendor id, e.g. claude-opus-4-8, on a custom-model key | The compatibility shim maps it onto a lane (config/model-aliases.yaml), cap-bounded by the key's lanes. |
| a lane name (premium) or exact alias (deepseek/deepseek-v4-pro), on a custom-model key | Routes straight into that lane / model, skipping classification. |

A standard key only ever needs auto. The model field never changes which lane is chosen — but when the named model already sits in that lane's chain, Helm promotes it to the front (so Claude Code pinning claude-sonnet-4-6 gets Sonnet, not the lane's primary; it falls back to the rest of the chain on failure). Pinning a lane, a vendor family, or an out-of-lane model requires a custom-model key (allow_custom_model). Lanes are operator config (lanes.yaml + dashboard).
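
The promotion rule for standard keys is small enough to sketch in Python. The function name and the example chain are hypothetical; only the behavior (promote the named model if it is already in the lane's chain, otherwise leave the chain alone) comes from the description above.

```python
def order_candidates(chain, requested_model):
    """Return the lane's chain with the requested model moved to the front if it is a member."""
    if requested_model in chain:
        return [requested_model] + [m for m in chain if m != requested_model]
    return list(chain)


# Hypothetical lane chain, for illustration only.
chain = ["claude-opus-4-8", "claude-sonnet-4-6", "deepseek/deepseek-v4-pro"]
print(order_candidates(chain, "claude-sonnet-4-6"))
print(order_candidates(chain, "auto"))
```

The lane itself is still chosen by classification; this step only reorders candidates inside the lane that was already picked.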

Image generation

Image models are model-pinned: you name the exact model (or an image lane — see Failover below), with no classification, and any valid key works (no allow_custom_model needed; cost is bounded by the key's budget / rate limit). Operator-configured models: gpt-image-2 (OpenAI), gemini-3.1-flash-image / gemini-3-pro-image (Google "Nano Banana"). Every call is metered per image (output tokens × the model's image rate) and appears in the dashboard like any other request. Three entrypoints — match the one your SDK speaks:

1. OpenAI Images API: POST /v1/images/generations (Bearer auth), returning { "created", "data": [{ "b64_json" }], "usage" }:

curl http://localhost:8080/v1/images/generations \
  -H "Authorization: Bearer $HELM_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "gpt-image-2", "prompt": "a single red apple on a plain white background", "size": "1024x1024" }'

2. Gemini generateContent — the Gemini SDK's generate_content path. Name an image model and ask for image output; Helm routes it natively, so the response carries candidates[].content.parts[].inlineData:

curl "http://localhost:8080/v1beta/models/gemini-3.1-flash-image:generateContent" \
  -H "x-goog-api-key: $HELM_KEY" -H "Content-Type: application/json" \
  -d '{ "contents": [{ "parts": [{ "text": "a single red apple on a plain white background" }] }],
        "generationConfig": { "responseModalities": ["TEXT", "IMAGE"] } }'

3. Gemini Interactions API: POST /v1beta/interactions (the SDK's client.interactions.create). Response is the steps[] shape, with the image at steps[].content[] ({ "type": "image", "data": … }); the SDK's interaction.output_image.data reads it:

curl http://localhost:8080/v1beta/interactions \
  -H "x-goog-api-key: $HELM_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "gemini-3.1-flash-image", "input": "a single red apple on a plain white background",
        "response_format": { "type": "image", "aspect_ratio": "1:1" } }'

The OpenAI Images endpoint serves both OpenAI and Gemini image models (Helm translates Gemini to/from generateContent). The two Gemini-native entrypoints serve only Gemini image models. gpt-image-2 on /v1beta/interactions is a 400 → use /v1/images/generations.
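
On the client side, the /v1/images/generations response shape shown earlier ({ "created", "data": [{ "b64_json" }], "usage" }) decodes with the standard library. This is a minimal Python sketch; the function name is illustrative, and it assumes the first entry in data holds the image you want.

```python
import base64


def first_image_bytes(response):
    """Decode the first image from a { "data": [{ "b64_json": ... }] } response body."""
    items = response.get("data") or []
    if not items or "b64_json" not in items[0]:
        raise ValueError("response contains no b64_json image")
    return base64.b64decode(items[0]["b64_json"])
```

Writing the returned bytes to a file opened in binary mode gives a viewable image.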

Image failover across providers

The same image model is often available from several providers (official upstream, ZenMux, OpenRouter…). The shipped config already groups them into image lanes — name the lane as your model and Helm tries the primary, then on a provider fault (timeout, 5xx, circuit-open) falls over to the next, using the same circuit breaker as the chat router. A deterministic client error (a 4xx invalid request — bad size, oversized image) is returned verbatim and does not trigger failover.

# config/lanes.yaml — the two shipped image lanes lead with the OFFICIAL upstream,
# then fall over to the ZenMux relay. Members must be image models
# (capabilities.outputImage) and a single kind (all gpt-image-* OR all gemini-*-image).
gpt-image:                          # request `model: "gpt-image"`
  primary: openai/gpt-image-2       # OpenAI official → ZenMux relay
  fallback: [gpt-image-2]
gemini-image:                       # request `model: "gemini-image"`
  primary: google/gemini-3.1-flash-image   # Google official → ZenMux flash → pro
  fallback: [gemini-3.1-flash-image, gemini-3-pro-image]

Image lanes work for any key on the two dedicated endpoints (/v1/images/generations, /v1beta/interactions). On the Gemini :generateContent path, naming a lane follows the normal lane rule — it requires an allow_custom_model key — so for the broadest reach, point image SDKs at the dedicated endpoints.
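
The failover rule for image lanes (provider faults advance the chain, deterministic client errors do not) can be written as a small predicate. This Python sketch uses hypothetical names and parameters. It deliberately leaves 429 responses as non-failover because the text above does not say how image lanes treat them.

```python
def should_fail_over(status, timed_out=False, circuit_open=False):
    """Return True for provider faults (timeout, 5xx, circuit-open); False otherwise."""
    if timed_out or circuit_open:
        return True
    if status is not None and 500 <= status <= 599:
        return True
    # Everything else, including a 4xx invalid request such as a bad size,
    # is returned to the client as-is.
    return False
```

Keeping this decision in one function makes it easy to see why a request with an invalid size fails once instead of failing once per provider.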

Other endpoints (full interactive docs at /docs, raw spec at /openapi.json):

| Endpoint | Auth | Purpose |
| --- | --- | --- |
| GET / · GET /healthz · GET /version | none | Landing page · readiness · build info |
| GET /v1/models · GET /v1/models/{id} | API key | Models the key can route to (lanes + auto; concrete aliases with capabilities & pricing for custom-model keys) |
| /admin · /admin/api/* | Basic auth | Dashboard + its JSON backend (mounted only when admin is enabled) |

Configuration

Everything lives in config/*.yaml, Zod-validated on load. Invalid config stops the gateway from starting. Lanes, policies, the classifier, and system settings are also editable live in the dashboard — edits persist back to the YAML files (comments preserved) and apply on the next request.

| File | Controls | Live-editable |
| --- | --- | --- |
| server.yaml | Host / port / base path | |
| auth.yaml | API key requirement + first-run root key | |
| runtime.yaml | Request limits, rate-limit defaults, storage driver, opt-in signal feedback | partial |
| providers.yaml | Upstream providers + model aliases (credentials by env-var name only) | |
| lanes.yaml | Each lane's primary model + fallback chain (quality, task, and vendor-family lanes) | ✅ persists |
| policies.yaml | First-match rules that pick or cap the lane | ✅ persists |
| classifier.yaml | Built-in rules + the optional eval model | ✅ persists |
| model-aliases.yaml | Maps a pinned vendor model id → lane / auto (compatibility shim, optional) | |
| memory.yaml | Forgetting/tiering knobs (on in the shipped config) · optional compaction trigger overrides (compaction:) · optional LLM summarizer (llm:, off by default). A leftover observer: block from older configs refuses startup | |
| capabilities.yaml / pricing.yaml | Manual overrides on the model catalog (incl. prompt-cache read/write prices) | |
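
The fail-closed rule for config can be illustrated with a validator over the lane shape used in lanes.yaml (a primary plus a fallback list). Helm does this with Zod schemas in TypeScript; the Python below is only a sketch of the idea. ConfigError, validate_lanes, and treating fallback as optional are assumptions.

```python
class ConfigError(Exception):
    """Raised when config is invalid; the caller should refuse to start."""


def validate_lanes(lanes):
    """Raise ConfigError on the first malformed lane; return the lanes unchanged when valid."""
    if not isinstance(lanes, dict) or not lanes:
        raise ConfigError("lanes must be a non-empty mapping")
    for name, lane in lanes.items():
        if not isinstance(lane, dict):
            raise ConfigError(f"lane {name!r} must be a mapping")
        primary = lane.get("primary")
        if not isinstance(primary, str) or not primary:
            raise ConfigError(f"lane {name!r} needs a primary model")
        fallback = lane.get("fallback", [])  # assumption: an absent fallback means an empty chain
        if not isinstance(fallback, list) or not all(isinstance(m, str) for m in fallback):
            raise ConfigError(f"lane {name!r}: fallback must be a list of model names")
    return lanes
```

Running a check like this before the server binds its port is what turns a typo in YAML into a startup error instead of a half-configured gateway.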

Most-used environment variables (env wins over YAML; full list in .env.example):

| Variable | Purpose |
| --- | --- |
| DEEPSEEK_API_KEY | Primary provider credential (required) |
| ZENMUX_API_KEY, OPENROUTER_API_KEY | Optional provider credentials (provider skipped if missing) |
| OPENAI_API_KEY, GEMINI_API_KEY | Optional: official OpenAI / Google image providers; the shipped gpt-image / gemini-image lanes lead with these and fail over to ZenMux |
| HELM_ADMIN_USER / HELM_ADMIN_PASSWORD | Dashboard login (Basic auth) |
| HELM_HOST / HELM_PORT | Server binding (default 0.0.0.0:8080) |
| HELM_STORE_DRIVER | sqlite (default) or supabase |
| HELM_STORE_URL_ENV | For supabase: the name of the env var holding the Postgres DSN |
| HELM_RATE_LIMIT_ENABLED | Turn rate limiting on (off by default) |
| HELM_OAUTH_ENC_KEY | 32-byte key encrypting recoverable API keys and stored OAuth tokens (required if any subscription provider is configured; needed for later API-key reveal) |

Storage. SQLite (better-sqlite3, a helm.db file under ./data) is the default. For Postgres/Supabase, set HELM_STORE_DRIVER=supabase and point HELM_STORE_URL_ENV at the env var holding your DSN. Unknown drivers fail closed at startup.
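
The indirection in HELM_STORE_URL_ENV (it holds the name of another variable, not the DSN itself) is easy to misread, so here is a Python sketch of the lookup. The function name, the returned dictionary, and the MY_DSN variable are hypothetical. Failing when the named variable is unset is an assumption consistent with the fail-closed rule, not a documented behavior.

```python
def resolve_store(env):
    """Pick the store driver from env; raise on anything unknown so startup fails closed."""
    driver = env.get("HELM_STORE_DRIVER", "sqlite")
    if driver == "sqlite":
        return {"driver": "sqlite"}
    if driver == "supabase":
        dsn_var = env.get("HELM_STORE_URL_ENV")       # e.g. "MY_DSN" (hypothetical name)
        if not dsn_var or not env.get(dsn_var):
            raise RuntimeError("supabase driver needs HELM_STORE_URL_ENV to name a set variable")
        return {"driver": "supabase", "dsn": env[dsn_var]}
    raise RuntimeError(f"unknown store driver: {driver}")
```

Called with os.environ, this resolves the DSN in two hops: HELM_STORE_URL_ENV gives the variable name, and that variable gives the connection string.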

Credentials. Provider keys are referenced by env-var name in providers.yaml — plaintext never enters the repo or the image.

OAuth subscription providers (Claude Pro/Max, ChatGPT Codex, GitHub Copilot)

A provider can authenticate with an OAuth subscription instead of a static key: log in from the dashboard (Providers → Connect). Claude Pro/Max and ChatGPT Codex use an authorization-code paste; GitHub Copilot uses a device code. Helm stores the rotating refresh token encrypted at rest and refreshes access tokens automatically.

Set HELM_OAUTH_ENC_KEY (32 bytes: base64 or 64 hex chars) — Helm refuses to start if a subscription provider is configured without it. The same key encrypts API-key recovery material used by the admin reveal/rotate flows. Then add an oauth: { provider: anthropic | github-copilot | openai-codex } block to the provider (commented examples in config/providers.yaml; for Claude use type: anthropic).
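
One way to produce a value for HELM_OAUTH_ENC_KEY in either accepted encoding is Python's standard library. The helper names below are illustrative. The looks_valid check is a local sanity test written for this example, not Helm's own validation logic.

```python
import base64
import secrets


def new_enc_key():
    """Return one fresh 32-byte key in both accepted encodings."""
    raw = secrets.token_bytes(32)
    return {"base64": base64.b64encode(raw).decode("ascii"), "hex": raw.hex()}


def looks_valid(value):
    """True if value is 64 hex chars or base64 that decodes to exactly 32 bytes."""
    if len(value) == 64:
        try:
            bytes.fromhex(value)
            return True
        except ValueError:
            pass
    try:
        return len(base64.b64decode(value, validate=True)) == 32
    except ValueError:
        return False


if __name__ == "__main__":
    print(new_enc_key()["base64"])
```

Keep the generated value out of the repository. Since the same key also protects API-key recovery material, losing it means stored tokens and recoverable keys can no longer be decrypted.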

Pool several accounts per provider. Each account (Providers → Manage) gets its own:

  • Models — a live allow-list, not a display filter: a removed model stops routing immediately; an uncurated model is refused (fail-closed).
  • Proxy — HTTP/HTTPS/SOCKS5 egress per account, used across the entire subscription flow, so co-hosted accounts exit from distinct IPs.
  • Schedule: priority (lower serves first) + a schedulable toggle; round-robin (LRU) within equal priority. Park an account to keep it connected but out of rotation.
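
The scheduling rule in the last bullet can be sketched as a single selection function in Python. The account fields (priority, schedulable, last_used) and the tick counter are hypothetical. The sketch assumes a lower priority number always wins while such an account is schedulable.

```python
def pick_account(accounts):
    """Choose the schedulable account with the lowest priority number, least recently used first."""
    eligible = [a for a in accounts if a["schedulable"]]
    if not eligible:
        return None
    return min(eligible, key=lambda a: (a["priority"], a["last_used"]))


accounts = [
    {"id": "a", "priority": 1, "schedulable": True, "last_used": 0},
    {"id": "b", "priority": 1, "schedulable": True, "last_used": 0},
    {"id": "c", "priority": 0, "schedulable": False, "last_used": 0},  # parked
    {"id": "d", "priority": 2, "schedulable": True, "last_used": 0},
]
for tick in range(1, 5):
    chosen = pick_account(accounts)
    chosen["last_used"] = tick      # mark as most recently used
    print(chosen["id"])
```

In this sketch accounts a and b alternate, the parked account c is never chosen despite its lower priority number, and d would serve only if a and b were both parked or removed.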

Everything hot-reloads — connect, disconnect, curation, proxy, scheduling — next request, no restart. Helm also mirrors each official client's identity headers and sends a stable per-account device identity (never rotated mid-stream) to reduce ban-correlation risk.

⚠️ Terms of service. Routing a Claude/ChatGPT/Copilot subscription through a third-party gateway may violate the provider's ToS and can get accounts suspended. This is an opt-in feature for self-hosted personal use — you are responsible for compliance with your provider agreements. When in doubt, use a normal API key (api_key_env).

Development

Requires Node ≥ 22 and pnpm 10.

pnpm install
pnpm dev          # admin dashboard dev server (Vite) — see note below
pnpm test         # Vitest unit tests
pnpm exec vitest run --coverage # unit coverage with source-only include/exclude + thresholds
pnpm test:e2e     # Playwright end-to-end tests
pnpm typecheck    # tsc --noEmit across the workspace
pnpm lint         # Biome
pnpm build        # build the gateway + dashboard
pnpm sync:catalog # refresh the generated model catalog (capabilities + pricing)

pnpm dev starts only the admin SPA. The gateway has no watch script — run it built (pnpm build then node apps/gateway/dist/index.js) or via Docker.

Tests come first: Vitest for the core, Playwright for full flows. Design decisions live in implementation-notes.md. Before a PR:

pnpm typecheck && pnpm lint && pnpm test && pnpm test:e2e

Documentation

Start at docs/README.md. For a visual tour of the pipeline, read Architecture & Data Flow. The numbered specification, in order:

01 Overview · 02 Architecture · 03 Classification · 04 Routing & Lanes · 05 Protocol Translation · 06 Auth & Rate Limits · 07 Observability · 08 Memory Middleware · 09 Roadmap · 10 Deployment · 11 Admin UI · 12 Memory Forgetting & Tiering · 13 Memory Admin & MCP · 14 Memory Deep Recall · Protocol Compatibility

Status

Helm API is a real, end-to-end implementation, not a scaffold. The full pipeline (config → auth → classify → route → execute with circuit-breaking and fallback → protocol translation → telemetry → memory) is wired and covered by an extensive Vitest suite plus Playwright e2e specs. The version badge above tracks the current release.

License

MIT © 2026 EasyMeta AU