Skip to content

Nikhiljai03/Multi-LLM-Query-Router

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸš€ AI Query Router - Multi-Model LLM Gateway

AI Query Router is a production-ready backend service that routes natural language queries to the optimal large language model (LLM) based on query complexity. It combines fast local caching, event streaming, and reliable fallback to maximize performance and minimize cost.


✨ Why this project matters

  • Smart query classification reduces inference cost by routing simpler requests to smaller models.
  • Multi-provider architecture ensures availability even if one LLM endpoint is hit or down.
  • Real-time metrics and event streaming make it easy to monitor and scale.
  • Lightweight architecture is easy to deploy with Docker and compose.

🧩 Key Features

  • Intelligent 3-tier routing: Simple (8B), Medium (70B), Complex (120B)
  • Provider fallback: Groq primary, Together AI fallback
  • Redis caching: 40-60% cache hit rate, cached repeat queries in <20ms
  • Kafka event pipeline: query analytics + usage logging
  • Publish-ready: Docker, environment configuration, health checks
  • Clean API: REST endpoints with JSON schema and OpenAPI docs

πŸ—οΈ Architecture

Client Request
    ↓
API Gateway (FastAPI)
    ↓
Cache Check (Redis) β†’ Hit? Return instantly (<20ms)
    ↓ Miss
Query Classifier
    β”œβ”€ Simple (≀10 words) β†’ 8B model
    β”œβ”€ Medium (11-50 words) β†’ 70B model
    └─ Complex (>50 words) β†’ 120B model
    ↓
Query Router
    β”œβ”€ Try Groq (1-2s) βœ…
    └─ Fallback: Together AI (2-3s) βœ…
    ↓
Cache Response + Log to Kafka
    ↓
Return to Client

πŸ› οΈ Tech Stack

  • Backend: Python 3.11, FastAPI, Async/Await
  • Caching: Redis
  • Event Streaming: Apache Kafka
  • Database: PostgreSQL (ready for analytics)
  • AI Providers: Groq, Together AI
  • DevOps: Docker, Docker Compose

πŸ“¦ Quick Start

Prerequisites

Installation

# 1. Clone the repository
git clone <your-repo-url>
cd ai-query-router

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment
cp .env.example .env
# Edit .env and add your API keys:
# GROQ_API_KEY=your_groq_key
# TOGETHER_API_KEY=your_together_key

# 5. Start infrastructure services (optional)
docker-compose up -d

# 6. Run the application
python main.py

Visit: http://localhost:8000/docs for interactive API documentation

πŸ”‘ API Keys Setup

Groq (Primary - FREE)

  1. Go to https://console.groq.com/
  2. Sign up (no credit card needed)
  3. Create API key
  4. Add to .env: GROQ_API_KEY=gsk_your_key

Together AI (Fallback - FREE)

  1. Go to https://api.together.xyz/
  2. Sign up (get $25 free credits)
  3. Create API key
  4. Add to .env: TOGETHER_API_KEY=your_key

πŸ“‘ API Endpoints

POST /api/v1/query

Process a query and get AI response.

Request:

{
  "query": "What is machine learning?"
}

Response:

{
  "query": "What is machine learning?",
  "response": "Machine learning is...",
  "model_used": "llama-3.1-8b-instant",
  "complexity": "simple",
  "latency_ms": 1234.5,
  "cached": false,
  "cost_estimate": 0.001
}

GET /health

Health check endpoint

GET /api/v1/metrics

System metrics and statistics

🎯 3-Tier Model System

The system uses intelligent scoring to classify queries, not just hardcoded keywords.

Classification Method

Each query gets a complexity score based on:

  • High complexity indicators (+2): "comprehensive", "analyze", "in depth"
  • Medium complexity indicators (+1): "explain", "describe", "how does"
  • Multiple questions (+1)
  • Multiple sentences (+1)
  • Lists/enumerations (+1)
  • Complex conjunctions (+1)

Score 0: SIMPLE β†’ 8B model Score 1-2: MEDIUM β†’ 70B model
Score 3+: COMPLEX β†’ 120B model

Model Selection

Complexity Criteria Model Size Speed Cost
Simple ≀10 words, score=0 llama-3.1-8b-instant 8B 560 T/s $0.001
Medium 11-50 words or scoreβ‰₯1 llama-3.3-70b-versatile 70B 280 T/s $0.005
Complex >50 words or scoreβ‰₯3 openai/gpt-oss-120b 120B 500 T/s $0.02

Query Examples

# Simple (score=0, 8B model)
"What is AI?"
"Define Python"

# Medium (score=1-2, 70B model)
"Explain binary search"
"How does caching work?"
"Help me understand recursion"

# Complex (score=3+, 120B model)
"Provide a comprehensive analysis of microservices"
"Compare and contrast different sorting algorithms"
"Investigate the implications of quantum computing"  # Dynamic detection!

Note: The system uses intelligent scoring, so it can detect complexity even without hardcoded keywords!

πŸ”„ Fallback System

Primary: Groq (1-2 seconds)
    ↓ (if fails)
Fallback: Together AI (2-3 seconds)
    ↓ (if fails)
Error Response

πŸ“Š Performance

  • Cached Responses: <20ms
  • Groq (Primary): 1-2 seconds
  • Together AI (Fallback): 2-3 seconds
  • Cache Hit Rate: 40-60%
  • Cost Savings: 60% through intelligent routing

🐳 Docker Deployment

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop all services
docker-compose down

πŸ“ Project Structure

ai-query-router/
β”œβ”€β”€ main.py                 # Application entry point
β”œβ”€β”€ requirements.txt        # Python dependencies
β”œβ”€β”€ docker-compose.yml      # Service orchestration
β”œβ”€β”€ .env                    # Configuration
β”‚
β”œβ”€β”€ api/                    # API layer
β”‚   β”œβ”€β”€ routes.py          # Endpoints
β”‚   └── models.py          # Request/response schemas
β”‚
β”œβ”€β”€ router/                 # Query routing
β”‚   β”œβ”€β”€ classifier.py      # Complexity classifier
β”‚   └── query_router.py    # Model routing logic
β”‚
β”œβ”€β”€ models/                 # AI model adapters
β”‚   β”œβ”€β”€ adapter.py         # Abstract interface
β”‚   β”œβ”€β”€ groq_adapter.py    # Groq integration
β”‚   └── together_adapter.py # Together AI integration
β”‚
β”œβ”€β”€ cache/                  # Caching layer
β”‚   └── redis_client.py    # Redis client
β”‚
β”œβ”€β”€ kafka_client/           # Event streaming
β”‚   └── producer.py        # Kafka producer
β”‚
β”œβ”€β”€ config/                 # Configuration
β”‚   └── settings.py        # Settings management
β”‚
└── utils/                  # Utilities
    β”œβ”€β”€ logger.py          # Logging
    └── metrics.py         # Metrics collection

πŸ§ͺ Testing

# Run complete flow test
python test_complete_flow.py

# Or use Makefile
make test

# Manual testing
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is AI?"}'

πŸ”§ Configuration

Key environment variables in .env:

# Groq Models (Primary)
GROQ_API_KEY=your_groq_key
GROQ_SIMPLE_MODEL=llama-3.1-8b-instant
GROQ_MEDIUM_MODEL=llama-3.3-70b-versatile
GROQ_COMPLEX_MODEL=openai/gpt-oss-120b

# Together AI Models (Fallback)
TOGETHER_API_KEY=your_together_key
TOGETHER_SIMPLE_MODEL=meta-llama/Llama-3.2-3B-Instruct-Turbo
TOGETHER_MEDIUM_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
TOGETHER_COMPLEX_MODEL=meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo

# Infrastructure (Optional)
REDIS_HOST=localhost
KAFKA_BOOTSTRAP_SERVERS=localhost:9092

# Query Classification
SIMPLE_QUERY_MAX_WORDS=10
MEDIUM_QUERY_MAX_WORDS=50

πŸ“ˆ Monitoring

  • Structured Logging: JSON format for easy parsing
  • Health Checks: /health endpoint
  • Kafka Events: Real-time query analytics
  • Metrics: /api/v1/metrics endpoint

πŸ†˜ Troubleshooting

Port Already in Use

# Change port in .env
PORT=8001

Redis/Kafka Not Running

# System works without them (just no caching/events)
# To start them:
docker-compose up -d

Groq API Rate Limit

  • Wait 1 minute, or
  • Fallback will automatically use Together AI

Wrong Model Being Used

  • Check .env has correct model names
  • Restart application: python main.py
  • Check logs for "Calling Groq with model: ..."

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

🌟 Star this repo if you find it useful!


About

FastAPI gateway that routes queries to the right LLM based on complexity, with caching, fallback, and real-time analytics.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors