
adwibha/llm-router


LLM Router

A small middleware layer that sits between your app and an LLM provider. It routes each request to a Perplexity model based on prompt complexity (simple, medium, complex) and records every call in PostgreSQL: tokens, cost, latency, model, prompt, and response. That gives you one place to see what you spend and where.

This repo uses Perplexity as the reference provider. The design is provider-agnostic so you can plug in another API by implementing one interface and updating config.

Problem statement

Without a central layer you rarely know how much you spend per model or per day. Sending every request to the most capable model is wasteful. Here, simple queries go to a lighter model and complex ones to a research model. All usage is stored so you get a single view of cost and volume.

Provider and models

Perplexity models used: sonar (simple), sonar-pro (medium), sonar-deep-research (complex). Pricing is configurable in the backend. The provider interface lives in backend/services/providers/, so you can add another API without changing tracking or analytics.

Architecture

flowchart TB
  subgraph Client
    app[App Dashboard]
  end
  subgraph Router[LLM Router]
    classify[Classify prompt]
    choose[Choose model]
    store[Store request]
  end
  app --> classify
  classify --> choose
  choose --> store
  store --> app
  choose --> pplx[Perplexity API]
  store --> pg[(PostgreSQL)]
  store --> prom[Prometheus]
  prom --> grafana[Grafana]

Routing logic

Each prompt is classified into simple, medium, or complex:

  1. Token count (estimated as a word count: len(prompt.split())):

    • < 50 tokens → lean simple
    • 50–200 tokens → lean medium
    • > 200 tokens → lean complex
  2. Keyword signals (case-insensitive):

    • Complex: "research", "analyze", "comprehensive", "detailed analysis", "in-depth", "investigate", "evaluate", "architecture", "explain in detail"
    • Simple: "what is", "define", "hello", "hi", "thanks", "yes", "no", "list"
    • Medium: "explain", "describe", "steps", "how would", "how do", "pros and cons", "compare", "structure", "deploy", and similar
  3. Final tier: Keyword match overrides token count when a strong signal is present.

  4. Route:

    • simple → sonar
    • medium → sonar-pro
    • complex → sonar-deep-research
  5. Override: If the request includes {"model_override": "sonar-pro"}, that model is used regardless of routing.

| Tier | Model | Use case |
|---|---|---|
| simple | sonar | Simple queries |
| medium | sonar-pro | Reasoning tasks |
| complex | sonar-deep-research | Research/analysis |
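The routing steps can be sketched in Python as follows. This is a minimal illustration, not the code in backend/services/router_service.py: the names `classify` and `route`, the whole-word keyword matching, the precedence of complex over medium over simple keywords, and the `tokens:N` reason format are all assumptions. Only the `keyword:simple` reason string appears in this README (in the sample response). The medium keyword list is abridged because the original says "and similar".

```python
import re
from typing import Optional, Tuple

# Keyword lists from the Routing logic section (medium list abridged).
COMPLEX_KEYWORDS = ["research", "analyze", "comprehensive", "detailed analysis",
                    "in-depth", "investigate", "evaluate", "architecture",
                    "explain in detail"]
MEDIUM_KEYWORDS = ["explain", "describe", "steps", "how would", "how do",
                   "pros and cons", "compare", "structure", "deploy"]
SIMPLE_KEYWORDS = ["what is", "define", "hello", "hi", "thanks", "yes", "no", "list"]

TIER_TO_MODEL = {"simple": "sonar", "medium": "sonar-pro",
                 "complex": "sonar-deep-research"}


def _has_keyword(text: str, keywords: list) -> bool:
    # Whole-word match (assumption) so "hi" does not fire on "this" or "which".
    return any(re.search(r"\b" + re.escape(kw) + r"\b", text) for kw in keywords)


def classify(prompt: str) -> Tuple[str, str]:
    """Return (tier, reason) for a prompt."""
    text = prompt.lower()
    # Assumed precedence: any keyword signal beats the length heuristic,
    # and complex keywords are checked before medium and simple ones.
    for tier, keywords in (("complex", COMPLEX_KEYWORDS),
                           ("medium", MEDIUM_KEYWORDS),
                           ("simple", SIMPLE_KEYWORDS)):
        if _has_keyword(text, keywords):
            return tier, f"keyword:{tier}"
    # Fall back to the word-count estimate len(prompt.split()).
    n = len(prompt.split())
    if n < 50:
        return "simple", f"tokens:{n}"
    if n <= 200:
        return "medium", f"tokens:{n}"
    return "complex", f"tokens:{n}"


def route(prompt: str, model_override: Optional[str] = None) -> Tuple[str, str, str]:
    """Return (model, tier, reason); model_override bypasses the tier table."""
    tier, reason = classify(prompt)
    if model_override:
        return model_override, tier, "model_override"
    return TIER_TO_MODEL[tier], tier, reason
```

Under these assumptions, route("What is Python?") returns ("sonar", "simple", "keyword:simple"), matching the sample response. Whole-word matching matters for short keywords: a plain substring test would let "hi" and "no" fire inside words such as "this" or "nothing".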

Cost tracking

Every request records:

| Field | Description |
|---|---|
| request_id | UUID |
| timestamp | UTC |
| prompt / response | Full text |
| model_requested / model_used | Routed model / actual model |
| prompt_tokens, completion_tokens, total_tokens | From API usage |
| prompt_cost_usd, completion_cost_usd, total_cost_usd | Calculated from config pricing |
| citation_tokens, search_queries_count | Deep-research only (else 0) |
| citation_cost_usd, search_cost_usd | Deep-research only |
| reasoning_tokens, reasoning_cost_usd | From API when present (e.g. reasoning-pro), $3 per 1M tokens |
| latency_ms | Wall-clock time from request to response |
| routing_tier, routing_reason | simple/medium/complex + short reason |
| project_tag | Optional multi-project tag |
| cached, provider, status, error_message | Placeholder / provider / success or error / error text |
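The three cost fields follow from per-million-token prices. A minimal sketch, assuming the pricing config maps each model to hypothetical `input` and `output` rates in USD per 1M tokens. The sonar rate of $1 per 1M tokens is inferred from the sample response in the API reference; the sonar-pro figures are placeholders to replace with your config. Citation, search, and reasoning costs are left out.

```python
from typing import Dict

# Illustrative USD prices per 1M tokens. The real values come from the backend
# pricing config (GET /admin/pricing); sonar-pro here is an assumption.
PRICING: Dict[str, Dict[str, float]] = {
    "sonar": {"input": 1.0, "output": 1.0},
    "sonar-pro": {"input": 3.0, "output": 15.0},
}


def token_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD for `tokens` at a price quoted per 1M tokens."""
    return tokens * usd_per_million / 1_000_000


def compute_cost(model: str, prompt_tokens: int, completion_tokens: int,
                 pricing: Dict[str, Dict[str, float]] = PRICING) -> Dict[str, float]:
    """Fill prompt/completion/total cost fields for one request.

    Deep-research extras (citation, search) and reasoning costs are omitted.
    """
    rates = pricing[model]
    prompt_cost = token_cost(prompt_tokens, rates["input"])
    completion_cost = token_cost(completion_tokens, rates["output"])
    return {
        "prompt_cost_usd": prompt_cost,
        "completion_cost_usd": completion_cost,
        "total_cost_usd": prompt_cost + completion_cost,
    }
```

compute_cost("sonar", 45, 120) gives a total of about $0.000165, the figure in the sample response.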

Cost savings

  • Actual cost: Sum of total_cost_usd for all requests (using routed model pricing).
  • If all sonar-pro: For the same token counts, cost if every request had used sonar-pro pricing.
  • Savings = (if all sonar-pro) − (actual). Exposed at GET /analytics/cost-savings.
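The savings formula can be sketched over stored request rows. The row keys match field names in the cost tracking table; the sonar-pro rates are illustrative placeholders (USD per 1M tokens) that should come from your pricing config, and `cost_savings` is a hypothetical name.

```python
from typing import Dict, Iterable

# Assumed sonar-pro prices in USD per 1M tokens; read real ones from the config.
SONAR_PRO_RATES = {"input": 3.0, "output": 15.0}


def cost_savings(requests: Iterable[Dict],
                 rates: Dict[str, float] = SONAR_PRO_RATES) -> Dict[str, float]:
    """Compare actual spend with the cost of sending the same tokens to sonar-pro."""
    actual = 0.0
    baseline = 0.0
    for r in requests:
        actual += r["total_cost_usd"]  # what the routed model actually cost
        baseline += (r["prompt_tokens"] * rates["input"]
                     + r["completion_tokens"] * rates["output"]) / 1_000_000
    return {
        "actual_cost_usd": actual,
        "all_sonar_pro_cost_usd": baseline,
        "savings_usd": baseline - actual,
    }
```

For the sample request (45 prompt and 120 completion tokens, $0.000165 actual), the all-sonar-pro baseline under these rates is $0.001935, so savings come to $0.00177. A row whose actual cost exceeds its sonar-pro baseline (for example, a deep-research call with heavy citation and search charges) can contribute negative savings.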

Cost estimate vs Perplexity billing

Dashboard and demo totals are estimates from our configurable pricing (input, output, citation, search). They do not include reasoning tokens. Perplexity may bill deep-research under a different product name and charge for reasoning separately, so your real invoice can be higher. Use the router for relative comparison (savings, trends) and check Perplexity’s billing for actual spend.

Adding a new provider

  1. Subclass the abstract Provider in backend/services/providers/base.py and implement call(model, messages), returning content, usage, and optional citation_tokens / search_queries_count.
  2. Put the implementation in a new file under backend/services/providers/ (e.g. openai.py).
  3. Update config (e.g. model list and pricing) and wire the new provider in the chat router (e.g. via env or config key).
  4. No changes to tracking or analytics are needed; they remain provider-agnostic.
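A minimal sketch of what such a provider contract could look like. The class names `ProviderResult` and `EchoProvider` are hypothetical, and making `call` async is an assumption based on the async stack (httpx, SQLAlchemy async); check base.py for the real signature and return type. `EchoProvider` makes no network calls, which also makes it a convenient stand-in in tests.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProviderResult:
    """Hypothetical return type; the real base.py may use a dict or another class."""
    content: str
    usage: Dict[str, int] = field(default_factory=dict)  # prompt/completion/total tokens
    citation_tokens: int = 0       # deep-research style extras; 0 when unsupported
    search_queries_count: int = 0


class Provider(ABC):
    """Sketch of the provider contract: one call per chat completion."""

    @abstractmethod
    async def call(self, model: str, messages: List[Dict[str, str]]) -> ProviderResult:
        raise NotImplementedError


class EchoProvider(Provider):
    """Hypothetical offline provider that echoes the last message back."""

    async def call(self, model: str, messages: List[Dict[str, str]]) -> ProviderResult:
        reply = messages[-1]["content"]
        # Word counts stand in for real token counts in this sketch.
        prompt_tokens = sum(len(m["content"].split()) for m in messages)
        completion_tokens = len(reply.split())
        return ProviderResult(
            content=reply,
            usage={
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
        )


if __name__ == "__main__":
    result = asyncio.run(
        EchoProvider().call("echo-1", [{"role": "user", "content": "hello there"}])
    )
    print(result.usage)
```

Because tracking only consumes the returned content and usage numbers, a new provider stays a single class plus a config entry.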

Tech stack

| Layer | Technology |
|---|---|
| Backend | FastAPI, uvicorn |
| DB | PostgreSQL 16, SQLAlchemy 2 (async), asyncpg |
| HTTP | httpx |
| Metrics | Prometheus, prometheus-client |
| Dashboards | Grafana (pre-provisioned) |
| Frontend | React 18, Vite, Tailwind, recharts |
| Tests | pytest, pytest-asyncio |
| Orchestration | Docker Compose |

API reference

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| POST | /v1/chat/completions | OpenAI-compatible chat (routed) |
| GET | /analytics/summary | Total cost, tokens, requests today, running totals |
| GET | /analytics/cost-by-model | Cost and count per model |
| GET | /analytics/cost-by-day?days=30 | Cost per day |
| GET | /analytics/cost-by-project | Cost per project_tag |
| GET | /analytics/cost-savings | Savings vs. all sonar-pro |
| GET | /analytics/requests?limit=&offset= | Paginated request list |
| GET | /analytics/requests/:request_id | Full request detail |
| GET | /admin/models | List valid models |
| GET | /admin/pricing | Current pricing config |
| POST | /admin/pricing | Update pricing (body: pricing dict) |
| GET | /metrics | Prometheus metrics |

Sample request/response

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is Python?"}],"stream":false}'

{
  "id": "uuid",
  "model": "sonar",
  "choices": [{"message": {"role": "assistant", "content": "..."}}],
  "usage": {"prompt_tokens": 45, "completion_tokens": 120, "total_tokens": 165},
  "cost": {
    "prompt_cost_usd": 0.000045,
    "completion_cost_usd": 0.00012,
    "total_cost_usd": 0.000165,
    "routing_tier": "simple",
    "routing_reason": "keyword:simple"
  }
}
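The same request from Python, as a sketch using only the standard library. The helper names `build_payload`, `chat`, and `summarize` are hypothetical; the body mirrors the curl call plus the optional `model_override` field described under Routing logic, and `summarize` reads only keys present in the sample response.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

ROUTER_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(prompt: str, model_override: Optional[str] = None) -> Dict[str, Any]:
    """Request body matching the curl example; model_override is optional."""
    payload: Dict[str, Any] = {
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    if model_override:
        payload["model_override"] = model_override  # skips tier-based routing
    return payload


def chat(prompt: str, model_override: Optional[str] = None,
         url: str = ROUTER_URL, timeout: float = 120.0) -> Dict[str, Any]:
    """POST the prompt to the router and return the decoded JSON response."""
    data = json.dumps(build_payload(prompt, model_override)).encode("utf-8")
    req = urllib.request.Request(url, data=data, method="POST",
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def summarize(response: Dict[str, Any]) -> str:
    """One-line summary of the routing decision and cost."""
    cost = response["cost"]
    return (f'{response["model"]} ({cost["routing_tier"]}, '
            f'{cost["routing_reason"]}): ${cost["total_cost_usd"]:.6f}')
```

With the stack from make up running, print(summarize(chat("What is Python?"))) should print the routed model, tier, reason, and cost on one line.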

Dashboard

React app on port 3000: Dashboard (summary cards, cost by day, cost by model), Requests (table and detail), Models, Projects, Cost Savings. Grafana on port 3001 with a pre-provisioned Prometheus datasource and an LLM Cost Overview dashboard (total cost, cost per day, requests/min, tokens by model, latency, cost by tier).

How to run

You need Docker and Docker Compose. Copy .env.example to .env and set PPLX_API_KEY for real LLM calls.

# Start all services (backend, frontend, postgres, prometheus, grafana)
make up

# Optional: seed 100 synthetic rows for the dashboard
make seed

# Optional: send 5 real prompts and print routing and cost (requires PPLX_API_KEY)
make demo

# Run tests (23 tests)
make test

URLs: React dashboard at http://localhost:3000, Grafana at http://localhost:3001 (admin / admin). Health check: curl -s http://localhost:8000/health.

Other commands: make status (list containers), make down (stop and remove volumes), make summary (analytics summary), make savings (cost savings). Use make clean-data to truncate stored requests so only new data appears.

React dashboard

Requests table

Grafana LLM Cost Overview

Demo output

Tests passing

Models tab

Cost Savings tab

Performance

The router adds a small overhead (prompt classification, the database write, and metrics updates). Most of the latency comes from the LLM provider; the middleware usually adds under 50 ms.
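The latency_ms field is wall-clock time from request to response. One way to check the overhead figure is to time the provider call and the whole request separately and compare. A minimal sketch of such a timer; `timed` is a hypothetical helper, and `time.perf_counter` is used because it is monotonic and high-resolution, unlike `time.time`.

```python
import asyncio
import time
from typing import Any, Awaitable, Tuple


async def timed(awaitable: Awaitable[Any]) -> Tuple[Any, float]:
    """Await `awaitable` and return (result, wall-clock latency in ms)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = await awaitable
    latency_ms = (time.perf_counter() - start) * 1000.0
    return result, latency_ms


async def _demo() -> None:
    # asyncio.sleep stands in for the provider call in this sketch.
    result, latency_ms = await timed(asyncio.sleep(0.05, result="ok"))
    print(result, round(latency_ms))


if __name__ == "__main__":
    asyncio.run(_demo())
```

Subtracting the provider-call latency from the end-to-end latency of the same request gives the middleware overhead.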

Production considerations

  • Auth: Add API keys or JWT for /v1/chat/completions, /analytics/* and /admin/*.
  • Multi-tenant: Use project_tag and filter analytics by tenant.
  • Caching: The cached field is a placeholder. Add response caching to cut cost and latency.
  • Budget alerts: Use Prometheus/Grafana alerts or a cron that checks GET /analytics/summary and notifies when thresholds are exceeded.
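A sketch of the cron-style budget check. The /analytics/summary endpoint comes from the API reference, but the name of its cost field is not documented here, so `total_cost_usd` is an assumed default you may need to change; `fetch_summary` and `budget_alert` are hypothetical helpers.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

SUMMARY_URL = "http://localhost:8000/analytics/summary"


def fetch_summary(url: str = SUMMARY_URL, timeout: float = 10.0) -> Dict[str, Any]:
    """GET the analytics summary as a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def budget_alert(summary: Dict[str, Any], limit_usd: float,
                 field: str = "total_cost_usd") -> Optional[str]:
    """Return an alert message when summary[field] exceeds limit_usd, else None.

    `field` defaults to an assumed key name; adjust it to the real summary schema.
    """
    spent = float(summary.get(field, 0.0))
    if spent > limit_usd:
        return f"LLM spend ${spent:.2f} exceeds budget ${limit_usd:.2f} ({field})"
    return None
```

A cron job could run budget_alert(fetch_summary(), limit_usd=25.0) and forward any returned message to email or chat; the 25.0 limit is only an example.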

Project structure

.
├── LICENSE
├── Makefile
├── README.md
├── SECURITY.md
├── backend
│   ├── Dockerfile
│   ├── core.py
│   ├── main.py
│   ├── metrics
│   │   └── prometheus.py
│   ├── models
│   │   ├── database.py
│   │   └── schemas.py
│   ├── requirements.txt
│   ├── routers
│   │   ├── admin.py
│   │   ├── analytics.py
│   │   ├── chat.py
│   │   └── health.py
│   ├── scripts
│   │   ├── clear_data.py
│   │   ├── demo.py
│   │   ├── migrate_add_reasoning.py
│   │   └── seed_data.py
│   └── services
│       ├── providers
│       │   ├── base.py
│       │   └── perplexity.py
│       └── router_service.py
├── docker-compose.yml
├── docs
│   └── screenshots
│       ├── grafana-dashboard.png
│       ├── make-demo.png
│       ├── make-test.png
│       ├── react-cost-savings.png
│       ├── react-dashboard.png
│       ├── react-models.png
│       └── react-requests.png
├── frontend
│   ├── Dockerfile
│   ├── index.html
│   ├── package.json
│   ├── src
│   │   ├── App.jsx
│   │   ├── api.js
│   │   ├── components
│   │   │   ├── CostSavings.jsx
│   │   │   ├── Dashboard.jsx
│   │   │   ├── ModelStats.jsx
│   │   │   ├── ProjectStats.jsx
│   │   │   ├── RequestDetail.jsx
│   │   │   └── RequestsTable.jsx
│   │   ├── index.css
│   │   └── main.jsx
│   ├── tailwind.config.js
│   └── vite.config.js
├── grafana
│   ├── dashboards
│   │   └── llm-cost-overview.json
│   └── provisioning
│       ├── dashboards
│       │   └── dashboard.yml
│       └── datasources
│           └── datasources.yml
├── prometheus
│   └── prometheus.yml
├── scripts
│   ├── demo.py
│   └── seed_data.py
└── tests
    ├── conftest.py
    ├── test_analytics.py
    ├── test_chat.py
    ├── test_health.py
    └── test_router.py

Environment variables

| Variable | Description |
|---|---|
| DATABASE_URL | PostgreSQL URL with the asyncpg driver (default in compose: postgresql+asyncpg://user:pass@postgres:5432/llmrouter) |
| PPLX_API_KEY | Perplexity API key (required for real calls) |
| PPLX_BASE_URL | Perplexity API base URL (default: https://api.perplexity.ai) |
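A minimal sketch of loading these variables into a typed settings object. The `Settings` and `load_settings` names are hypothetical; the defaults come from the table, and falling back to the compose DATABASE_URL outside Docker is an assumption (the postgres host name only resolves inside the compose network). Accepting a mapping instead of reading os.environ directly keeps the loader easy to test.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Defaults from the environment table; using the compose URL outside Docker is an assumption.
DEFAULT_DATABASE_URL = "postgresql+asyncpg://user:pass@postgres:5432/llmrouter"
DEFAULT_PPLX_BASE_URL = "https://api.perplexity.ai"


@dataclass(frozen=True)
class Settings:
    database_url: str
    pplx_api_key: Optional[str]
    pplx_base_url: str

    @property
    def can_call_provider(self) -> bool:
        # Real LLM calls (e.g. make demo) need a key.
        return bool(self.pplx_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, applying the documented defaults."""
    return Settings(
        database_url=env.get("DATABASE_URL", DEFAULT_DATABASE_URL),
        pplx_api_key=env.get("PPLX_API_KEY") or None,  # treat empty string as unset
        pplx_base_url=env.get("PPLX_BASE_URL", DEFAULT_PPLX_BASE_URL).rstrip("/"),
    )
```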

This project is for learning and portfolio use. It is not production-ready. Use at your own risk. Pricing and model names follow the Perplexity reference (March 2026). Update config for your provider. See SECURITY.md.

License: MIT.

About

Provider agnostic LLM cost router and tracker. Routes prompts to cost-effective Perplexity models, tracks every token and dollar, exposes analytics API and React dashboard.
