GatewayZ Universal Inference API

Production-Ready AI Model Gateway | v2.0.3

🚀 Overview

GatewayZ is an enterprise-grade FastAPI application providing a unified API gateway to access 100+ AI models from 30+ providers. It acts as a drop-in replacement for OpenAI's API while supporting models from:

OpenAI (GPT-4, GPT-3.5, etc.)
Anthropic (Claude-3 family)
Open Source (Llama, Mistral, etc.)
30+ Additional Providers (see Supported Providers)

Key Capabilities

✅ OpenAI-Compatible API - Drop-in replacement for OpenAI endpoints ✅ Anthropic Messages API - Full Claude model support ✅ Multi-Provider Routing - Automatic failover and load balancing ✅ Real-Time Monitoring - Prometheus/Grafana integration ✅ Credit-Based Billing - Usage tracking and cost analysis ✅ Enterprise Security - Encrypted API keys, IP allowlists, audit logging ✅ Distributed Tracing - OpenTelemetry integration with Tempo ✅ Advanced Features - Chat history, image generation, trials, subscriptions

📊 Complete Infrastructure Stack

Core Application

✅ FastAPI 0.104.1 - ASGI web framework
✅ Uvicorn 0.24.0 - ASGI server
✅ Python 3.10+ - Programming language
✅ 85,080 LOC - Production code across 200+ modules

Data Layer

✅ Supabase PostgreSQL - Primary database
- 20+ tables (users, api_keys, payments, metrics, etc.)
- 36 SQL migrations applied
- Row-level security (RLS) policies
- Real-time capabilities via PostgREST API
✅ Redis 5.0.1 - In-memory cache & rate limiting
- Request caching (5-minute TTL)
- Rate limit tracking (per user, per key, system-wide)
- Real-time metrics cache
- Session storage
- Fallback support (graceful degradation if unavailable)

Provider Integrations (30+ APIs)

Each provider has a dedicated client module:

OpenRouter - Model aggregator (100+ models)
Portkey - LLM API gateway
Featherless - Open-source models
Together AI - Model serving platform
Fireworks - Model inference
DeepInfra - Model hosting
HuggingFace - Model hub (1,241+ models)
Google Vertex AI - Google cloud models
Groq - Fast inference processor
Cerebras - Sparse inference engine
X.AI (Grok) - Latest models
Anthropic Claude - Direct API integration
20+ Additional Providers - Full list in Supported Providers

Authentication & Security

✅ Encrypted API Keys - Fernet (AES-128) encryption
✅ HMAC-SHA256 - Key validation and hashing
✅ Role-Based Access Control (RBAC) - User permissions
✅ IP Allowlisting - Per-API-key IP restrictions
✅ Domain Restrictions - Limit usage by domain
✅ JWT Tokens - Token-based authentication
✅ Audit Logging - All operations tracked to database

Observability & Monitoring Stack

✅ Prometheus - Metrics collection and exposure
- 20+ metrics types (requests, latency, errors, tokens, costs)
- /metrics endpoint (Prometheus format)
- 15-minute scrape interval recommended
- Real metrics from actual request processing
✅ Grafana - Dashboard visualization
- 6 recommended dashboard designs
- JSON model datasource support
- Alert configuration ready
✅ OpenTelemetry - Distributed tracing
- opentelemetry-api + opentelemetry-sdk
- Auto-instrumentation for FastAPI, HTTPX, Requests
- Span context propagation
- Trace export to Tempo
✅ Tempo - Distributed trace storage
- OpenTelemetry OTLP endpoint
- Configurable retention policies
- Trace visualization integration
✅ Sentry - Error tracking
- FastAPI integration
- Automatic exception capture
- Release tracking
- User context tracking
✅ Loki - Log aggregation
- Python JSON logger integration
- Structured logging (JSON format)
- Log label extraction
- Query interface via Grafana
✅ Arize - AI model monitoring
- Model performance tracking
- Drift detection
- Production model observability
- Integration via OTEL

Caching & Performance

✅ Multi-Layer Caching
- Model catalog cache (memory + Redis)
- User lookup cache (Redis)
- Response caching (Redis, 5-min browser TTL)
- Provider data caching (1-hour TTL)
- Health metrics caching (real-time)
✅ Connection Pooling
- Database connection pool management
- Monitored via /api/optimization-monitor endpoint
- Auto-scaling based on load
✅ Rate Limiting
- Redis-backed rate limiting (primary)
- Fallback rate limiting (in-memory, if Redis down)
- Per-user limits
- Per-API-key limits
- System-wide limits

Advanced Features

✅ Chat History - Persistent conversation storage
✅ Image Generation - Multi-provider image APIs
✅ Billing System - Credit-based, usage tracking
✅ Subscriptions - Recurring billing via Stripe
✅ Free Trials - Trial period management
✅ Referral System - User referral tracking
✅ Coupons - Discount code support
✅ Request Prioritization - Queue-based priority handling
✅ Provider Failover - Automatic fallback to healthy providers
✅ Health Monitoring - 3 health check systems:
- Autonomous monitor (active health checks)
- Passive monitor (from request results)
- Circuit breaker pattern

External Services

✅ Stripe - Payment processing & subscriptions
✅ Resend - Transactional email delivery
✅ Statsig - Feature flags & A/B testing
✅ PostHog - Product analytics
✅ Braintrust - ML evaluation & tracing
✅ OpenAI - Direct ChatGPT API calls

API Endpoints (86+ endpoints)

Chat & Inference:

POST /chat/completions - OpenAI-compatible chat
POST /v1/messages - Anthropic Messages API
POST /v1/images/generations - Image generation

Model Discovery:

GET /v1/models - List all available models
GET /v1/models/trending - Trending models (real usage)
GET /v1/models/low-latency - Fast models
GET /v1/models/search - Advanced search
GET /v1/provider - Provider information
GET /v1/gateways/summary - Gateway statistics

Monitoring (Real Data):

GET /api/monitoring/health - Provider health status
GET /api/monitoring/stats/realtime - Real-time metrics
GET /api/monitoring/error-rates - Error tracking
GET /api/monitoring/cost-analysis - Cost breakdown
GET /api/monitoring/chat-requests/counts - Request counts per model
GET /api/monitoring/chat-requests/models - Model statistics
GET /api/monitoring/chat-requests - Full request logs
GET /api/monitoring/anomalies - Anomaly detection

Health & Uptime Timeline:

GET /health/providers/uptime - Provider uptime timeline with time-bucketed samples
GET /health/models/uptime - Model uptime timeline with incident tracking
GET /health/gateways/uptime - Gateway uptime timeline and provider health

Prometheus Metrics:

GET /metrics - Prometheus format metrics
GET /prometheus/metrics/all - All metrics filtered
GET /prometheus/metrics/system - System metrics
GET /prometheus/metrics/models - Model metrics
GET /prometheus/metrics/providers - Provider metrics

User Management:

POST /auth/login - User authentication
GET /user/profile - User information
GET /user/balance - Credit balance
POST /user/api-keys - API key management
GET /user/chat-history - Chat history

Admin:

GET /admin/users - User listing (admin only)
GET /admin/analytics - Analytics dashboard (admin only)
POST /admin/refresh-providers - Provider cache refresh (admin only)

See CLAUDE.md for complete endpoint list

🏗️ Architecture

Client Requests (Web, Mobile, CLI)
         ↓
┌─────────────────────────────────────┐
│  FastAPI + Middleware Layer         │
│  • Authentication & Rate Limiting   │
│  • Request logging & compression    │
│  • Distributed tracing              │
└─────────────────────────────────────┘
         ↓
┌─────────────────────────────────────┐
│  Routes Layer (43 route files)      │
│  • /chat, /messages, /images        │
│  • /v1/models, /v1/provider         │
│  • /api/monitoring/* endpoints      │
└─────────────────────────────────────┘
         ↓
┌─────────────────────────────────────┐
│  Services Layer (95 service files)  │
│  • Provider clients (30+ integrated)│
│  • Model catalog management         │
│  • Pricing calculations             │
│  • Health monitoring                │
│  • Request prioritization           │
└─────────────────────────────────────┘
         ↓
┌──────────────────┬──────────────────┐
│  Supabase        │  Redis Cache     │
│  PostgreSQL      │  Rate Limiting   │
│  • users         │  Real-time Stats │
│  • api_keys      │                  │
│  • requests      │                  │
│  • metrics       │                  │
└──────────────────┴──────────────────┘
         ↓
┌──────────────────────────────────────┐
│  30+ AI Model Providers              │
│  • OpenRouter      • Portkey         │
│  • Featherless     • Together        │
│  • Google Vertex   • HuggingFace     │
│  • Groq            • And 23 more...  │
└──────────────────────────────────────┘

🔌 Supported Providers

Tier 1 (Fully Integrated, Tested)

OpenRouter - 100+ models aggregator
Portkey - Model provider API
Featherless - Open source models
Together AI - Model serving
Fireworks - Model inference
DeepInfra - Model hosting
HuggingFace - Model hub integration
Google Vertex AI - Google cloud models
Groq - Fast inference
Cerebras - Sparse inference

Tier 2 (Additional Providers)

X.AI (Grok) • 12. AIMO • 13. Near • 14. Fal.ai
Anannas • 16. Modelz • 17. AiHubMix • 18. Vercel AI Gateway
Akash • 20. Alibaba Cloud • 21. Alpaca Network
Clarifai • 23. Cloudflare Workers AI • 24. Helicone
Morpheus • 26. Nebius • 27. Novita • 28. OneRouter
Anthropic (Claude via API) • 30. OpenAI

Total: 100+ Models across all providers

🗂️ Project Structure

gatewayz-backend/
├── src/                           # Main application (85,080 LOC)
│   ├── main.py                    # FastAPI app factory
│   ├── config/                    # Configuration (8 modules)
│   ├── routes/                    # Endpoints (43 modules)
│   ├── services/                  # Business logic (95 modules)
│   │   ├── *_client.py           # Provider integrations
│   │   ├── models.py             # Model management
│   │   ├── providers.py          # Provider registry
│   │   ├── pricing.py            # Cost calculations
│   │   └── prometheus_metrics.py # Metrics collection
│   ├── db/                        # Database layer (24 modules)
│   ├── middleware/                # Middleware (6 modules)
│   ├── schemas/                   # Pydantic models (15 modules)
│   ├── security/                  # Auth & encryption
│   └── utils/                     # Utilities (15 modules)
│
├── tests/                         # Test suite (228 test files)
│   ├── routes/                    # Route tests
│   ├── services/                  # Service tests
│   ├── integration/               # Integration tests
│   ├── e2e/                       # End-to-end tests
│   └── smoke/                     # Smoke tests
│
├── docs/                          # Documentation (15+ files)
│   ├── CLAUDE.md                 # Codebase context
│   ├── CHAT_REQUESTS_ENDPOINTS_TEST_REPORT.md
│   ├── QA_COMPREHENSIVE_AUDIT_REPORT.md
│   ├── GRAFANA_DASHBOARD_DESIGN_GUIDE.md
│   ├── GRAFANA_ENDPOINTS_MAPPING.md
│   └── ... (more guides)
│
├── supabase/                      # Database
│   ├── config.toml               # Configuration
│   └── migrations/               # SQL migrations (36 files)
│
├── scripts/                       # Utility scripts
│   └── test-chat-requests-endpoints.sh
│
└── pyproject.toml                # Project metadata

🚀 Getting Started

Prerequisites

Python 3.10+
PostgreSQL (via Supabase)
Redis
API keys for at least one provider

Installation

# Clone repository
git clone https://github.com/your-org/gatewayz-backend.git
cd gatewayz-backend

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# Edit .env with your configuration

Configuration

Required environment variables:

# Database
SUPABASE_URL=your_supabase_url
SUPABASE_KEY=your_supabase_key

# Redis
REDIS_URL=redis://localhost:6379

# At least one provider API key
OPENROUTER_KEY=your_key
# or
PORTKEY_KEY=your_key
# or multiple providers

# Optional monitoring
SENTRY_DSN=your_sentry_url
PROMETHEUS_PUSHGATEWAY=your_pushgateway_url

Running the Server

# Development
python src/main.py
# Server starts on http://localhost:8000

# Production
uvicorn src.main:app --host 0.0.0.0 --port 8000 --workers 4

Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src

# Run specific endpoint tests
pytest tests/routes/test_chat_requests_endpoints.py -v

# Run integration tests
pytest tests/integration/ -v

📈 Monitoring & Metrics

Prometheus Metrics

All metrics are real data collected from actual requests:

# View metrics
curl http://localhost:8000/metrics

# Example metrics exposed:
- http_requests_total (by endpoint, method, status)
- http_request_duration_seconds (latency percentiles)
- model_inference_requests_total (by model, provider)
- gateway_cost_per_provider (actual costs)
- provider_health_score (0-100)
- error_rate_by_provider (percentage)

Grafana Dashboards

6 recommended dashboards for visualization:

Executive Overview - System health, request rates, costs
Model Performance - Top models, latency, errors
Gateway Comparison - Provider statistics and costs
Business Metrics - Revenue, costs, profitability
Incident Response - Real-time alerts, error logs
Tokens & Throughput - Token usage and efficiency

See GRAFANA_ENDPOINTS_MAPPING.md for complete dashboard specs

Health Checks

# Basic health
curl http://localhost:8000/health

# Provider-specific health
curl http://localhost:8000/api/monitoring/health/openrouter

# Real-time statistics
curl http://localhost:8000/api/monitoring/stats/realtime

🔐 Security Features

Authentication

✅ API key-based authentication
✅ JWT token support
✅ Encrypted key storage (Fernet AES-128)
✅ HMAC validation

Authorization

✅ Role-based access control (RBAC)
✅ IP allowlisting per API key
✅ Domain restrictions
✅ Rate limiting (per user, per key, system-wide)

Audit & Compliance

✅ Complete audit logging
✅ User activity tracking
✅ Request/response logging
✅ Encrypted sensitive data

🧪 Testing Infrastructure

Test Framework & Tools

✅ Pytest 7.4.3 - Test runner and framework
✅ Pytest-asyncio - Async test support
✅ Pytest-cov - Code coverage measurement
✅ Pytest-xdist - Parallel test execution
✅ Pytest-timeout - Test timeout handling
✅ Pytest-mock - Mocking utilities
✅ Playwright 1.40.0 - Browser automation for E2E tests
✅ Factory Boy - Test data generation
✅ Faker - Realistic test data creation

Test Coverage

228 test files across 13 directories
13 test categories:
- Unit tests (fast, isolated logic)
- Integration tests (database interactions)
- E2E tests (full request flows)
- Smoke tests (quick verification)
- Security tests (auth, encryption)
- Route tests (endpoint validation)
- Service tests (business logic)
- Middleware tests (request handling)
- Config tests (configuration loading)
- Utility tests (helper functions)
- Health tests (health check endpoints)
- Database tests (data layer)
- Schema tests (validation)

Custom Test Suites Created

✅ Chat Requests Endpoint Tests (25 pytest tests + 24 bash tests)
- Real database data validation
- Mock data detection
- Pagination and filtering
- Data consistency checks

Recent QA Audit (2025-12-28)

✅ Verification Results:

0 critical security issues
100% of endpoints use real database data
All 30+ providers verified as real connections
Proper error handling and fallback mechanisms
49 comprehensive test cases written

⚠️ Medium-Risk Issues Identified:

TESTING environment variable - Can activate test mode
- Affects: Image generation, chat, messages endpoints
- Condition: TESTING=true OR APP_ENV=testing
- Mitigation: Pre-deployment validation script
Logic bug in fallback conditions (2 locations)
- File: src/routes/chat.py line 2350
- File: src/routes/messages.py line 260
- Issue: Inverted conditions (should be and not and not)
- Status: Identified in QA audit, planned for fix in v2.1.0
Synthetic metrics injection
- When: Supabase database unavailable
- Effect: Fake metrics sent to Prometheus
- Impact: Grafana may show false health
- Mitigation: Monitor DB connectivity
Hardcoded xAI models
- By design: xAI doesn't provide public API
- Impact: Low (catalog data only)
- Status: Documented as acceptable

Detailed findings: See QA_COMPREHENSIVE_AUDIT_REPORT.md

📚 Documentation

Document	Purpose	Audience
CLAUDE.md	Complete codebase context	Developers
QA_COMPREHENSIVE_AUDIT_REPORT.md	Audit findings and recommendations	QA, Leadership
QA_ACTION_PLAN.md	3 actionable tasks (~9 hours)	Development Team
GRAFANA_DASHBOARD_DESIGN_GUIDE.md	6 dashboard designs	Ops, Analytics
GRAFANA_ENDPOINTS_MAPPING.md	Endpoint-to-dashboard mapping	Ops Engineers
CHAT_REQUESTS_ENDPOINTS_TEST_REPORT.md	Comprehensive endpoint testing	QA Engineers
MONITORING_ENDPOINTS_VERIFICATION.md	Monitoring endpoint verification	Ops, QA
MONITORING_API_REFERENCE.md	API reference documentation	All Developers

🔄 Deployment

Local Development

python src/main.py
# Available on http://localhost:8000

Docker

docker build -t gatewayz-api .
docker run -p 8000:8000 --env-file .env gatewayz-api

Vercel (Serverless)

# Configured in vercel.json
vercel deploy

Railway

# Configured in railway.json
railway up

Kubernetes

# Docker image deployment
kubectl apply -f k8s/

🐛 Known Issues & Limitations

Environment Variable Risk

⚠️ TESTING Environment Variable

If any of these are set in production, test/fallback data flows to users:

TESTING=true
TESTING=1
TESTING=yes
APP_ENV=testing
APP_ENV=test

Mitigation: Pre-deployment validation required (see QA_ACTION_PLAN.md)

Prometheus Summary Endpoint

⚠️ /prometheus/metrics/summary returns placeholder values ("N/A")

Status: Incomplete feature, not in critical path Workaround: Use direct Prometheus queries for aggregations

Synthetic Metrics

⚠️ When Supabase is unavailable, fake metrics are auto-injected

Impact: Grafana may show false positive health Status: Documented in metrics service Mitigation: Monitor database connectivity

📊 Performance Benchmarks

Operation	Latency	Throughput
Chat completion (GPT-4)	2-4s	10 req/s
Model list endpoint	<100ms	1000+ req/s
Health check	<50ms	10000+ req/s
Monitoring stats	<200ms	500+ req/s
Metrics export	<300ms	200+ req/s

🤝 Contributing

Development Workflow

Create feature branch: git checkout -b feature/your-feature
Make changes and write tests
Run linter: ruff check src/
Format code: black src/
Run tests: pytest
Commit with conventional message: git commit -m "feat: your feature"
Push and create PR to staging

Code Quality Standards

Linting: Ruff (100 char line limit)
Formatting: Black (100 char line limit)
Type Checking: MyPy (Python 3.12 target)
Import Organization: isort (black profile)
Test Coverage: >80% required

📞 Support & Issues

Reporting Issues

Check QA_COMPREHENSIVE_AUDIT_REPORT.md for known issues
Review existing issues on GitHub
Create new issue with reproduction steps

Getting Help

📖 See CLAUDE.md for codebase overview
🧪 See CHAT_REQUESTS_ENDPOINTS_TEST_REPORT.md for endpoint details
📊 See GRAFANA_ENDPOINTS_MAPPING.md for monitoring setup

📄 License

📈 Roadmap

Current Version (v2.0.3)

✅ 30+ provider integrations
✅ Real-time monitoring with Prometheus/Grafana
✅ OpenTelemetry distributed tracing
✅ Credit-based billing system
✅ Enterprise security features

Planned (v2.1.0)

Fix inverted logic bugs in chat/messages endpoints
Complete Prometheus summary endpoint
Add integration tests for all code paths
Improve synthetic metrics handling
Add provider-specific optimizations

Planned (v2.2.0)

Vision model support (image understanding)
Streaming optimization
Advanced caching strategies
Cost prediction and optimization
Custom model deployment support

🔄 Model Routing Hotfixes & Rollback Guide

Documents four model routing fixes in commit c09165c4. Each section explains what changed and how to revert it individually. Silent redirects (aliases to newer model IDs) are intentional — deprecated upstream models are mapped to their current equivalents so existing integrations keep working without client changes.

Quick Rollback (revert all four at once)

git revert c09165c4 --no-edit

Fix 1 — Cerebras qwen-3: disable reasoning tokens by default

Problem: cerebras-cloud-sdk >=1.64.x enables hybrid thinking for qwen-3 models by default. The gateway doesn't handle reasoning tokens in the stream, so Cerebras returned a 400 — which the error handler translated to the misleading "Invalid value for parameter 'request'" message.

Change: src/services/cerebras_client.py — added _apply_cerebras_reasoning_defaults() which injects disable_reasoning=True for any model whose name contains qwen-3 or qwen3, unless the caller already set disable_reasoning or reasoning_effort.

Manual rollback:

Remove the constant and helper (lines ~111–126):

_CEREBRAS_REASONING_MODELS = ("qwen-3", "qwen3")

def _apply_cerebras_reasoning_defaults(model: str, kwargs: dict) -> dict: ...

Remove the two call sites in make_cerebras_request_openai() and make_cerebras_request_openai_stream():
```
kwargs = _apply_cerebras_reasoning_defaults(model, kwargs)  # remove this line
```

Fix 2 — DeepSeek: pin to stable versioned model ID

Problem: The generic deepseek/deepseek-chat alias on OpenRouter pointed to overloaded capacity, causing Bad Gateway (502) after 3 retries.

Change: src/services/model_transformations.py — deepseek-chat, deepseek-chat-v3, and deepseek-chat-v3.1 entries in the OpenRouter model mapping table now resolve to deepseek/deepseek-chat-v3-0324 (stable versioned endpoint).

Manual rollback: In model_transformations.py, find the OpenRouter model mapping dict and revert the deepseek-chat entries:

# Revert to generic alias:
"deepseek/deepseek-chat": "deepseek/deepseek-chat",
"deepseek-chat": "deepseek/deepseek-chat",
"deepseek/deepseek-chat-v3": "deepseek/deepseek-chat",
"deepseek/deepseek-chat-v3.1": "deepseek/deepseek-chat",

Fix 3 — Mistral: explicit OpenRouter routing for mistralai org prefix

Problem: detect_provider_from_model_id() had no case for the mistralai org prefix. It fell through to the default (onerouter / Infron AI), which accepted the request but returned an empty stream.

Change: src/services/model_transformations.py — added an explicit if org == "mistralai": return "openrouter" check inside detect_provider_from_model_id().

Manual rollback: Remove the block added to detect_provider_from_model_id():

# Remove this block:
if org == "mistralai":
    logger.info(f"Routing '{model_id}' to openrouter (mistralai org prefix)")
    return "openrouter"

Fix 4 — xAI grok-2 / grok-2-1212: redirect deprecated models to grok-3

Problem: xAI deprecated grok-2-1212. The 404 response body wasn't parseable by the error handler's model extractor, so it surfaced as "Model 'unknown' not found" with no actionable detail.

Change: src/services/model_transformations.py — added grok-2, grok-2-1212, and their prefixed variants to both MODEL_ID_ALIASES (→ x-ai/grok-3) and the xai provider mapping table (→ grok-3).