Python MVP project for simulating high-throughput LLM serving behavior.
This project provides the initial structure for an LLM high-throughput serving simulator. The current goal is to establish a clean backend layout, configuration foundation, placeholder policy and queue components, and minimal API endpoints.
- backend/: FastAPI application, API routes, config loading, request and response models.
- backend/policies/: placeholders for rate limiting, quota, cost, and policy logic.
- backend/queue/: placeholders for admission control, priority scheduling, and queue management.
- backend/llm_backends/: placeholders for simulated and Ollama-backed LLM providers.
- backend/metrics/: placeholder metrics collection.
- stress_tester/: placeholders for future load generation, scenarios, and reporting.
- config/: YAML configuration for users, projects, models, limits, and degradation levels.
- tests/: minimal tests for the configuration loader.
pip install -e ".[dev]"
uvicorn backend.main:app --reload

The health endpoint is available at:
GET /health
GET /queue/status
GET /requests/{request_id}
GET /metrics
POST /metrics/reset
GET /usage/summary
GET /usage/project/{project_id}
GET /usage/user/{user_id}
GET /usage/recent?limit=50
See docs/development-guide.md for setup, mini-check commands, endpoint smoke tests, current limitations, and suggested next implementation steps.
See docs/testing_strategies_guide.md for practical experiments covering load, burst traffic, VIP protection, rate limiting, quota exhaustion, degradation, Ollama comparison, persistence, metrics, and reports.
See docs/requirements.md for the requirements specification and docs/architecture.md for the logical architecture.
Step 13 is implemented: request lifecycle and usage accounting are now persisted to local SQLite alongside the existing in-memory metrics.
Implemented foundations:
- simulated LLM backend
- user/project/model validation
- simple request rate limiting
- in-memory project token quota tracking
- estimated token cost calculation
- admission control
- in-memory priority queue foundation
- background queue workers
- queue status endpoint
- request status endpoint
- metrics summary endpoint
- metrics reset endpoint
- local async stress tester
- CSV and JSON stress test reports
- queued request polling in stress tester
- end-to-end latency reporting
- optional Ollama backend adapter
- Ollama stress scenarios
- backend and model comparison report
- configuration-driven degradation strategy
- SQLite persistent usage accounting
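The admission control and in-memory priority queue items above can be pictured with a minimal sketch. It assumes a bounded heap where a lower number means higher priority and a full queue rejects new work. The class names, priority values, and capacity are hypothetical and are not taken from the backend code.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int                      # lower value is served first (assumption)
    sequence: int                      # tie-breaker that keeps FIFO order within a priority
    request_id: str = field(compare=False)


class BoundedPriorityQueue:
    """Hypothetical in-memory queue: admits until full, pops highest priority first."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()

    def admit(self, request_id: str, priority: int) -> bool:
        """Return True if the request was queued, False if rejected for capacity."""
        if len(self._heap) >= self.max_size:
            return False
        heapq.heappush(self._heap, QueuedRequest(priority, next(self._counter), request_id))
        return True

    def pop(self) -> str:
        """Remove and return the request_id with the highest priority."""
        return heapq.heappop(self._heap).request_id

    def __len__(self) -> int:
        return len(self._heap)


queue = BoundedPriorityQueue(max_size=2)
admitted_standard = queue.admit("req-standard", priority=5)
admitted_vip = queue.admit("req-vip", priority=1)
admitted_overflow = queue.admit("req-overflow", priority=5)  # queue is already full
first_served = queue.pop()  # the VIP request, although it arrived second
```

Because everything lives in a Python list, the contents disappear when the process exits, which matches the restart limitation listed below.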
Current limitations:
- project token quota tracking is in-memory only
- rate limiting uses a simple fixed window
- queue is in-memory only
- queue and results are lost on restart
- metrics are in-memory only
- metrics are lost on restart
- SQLite is local only
- no retention policy yet
- no authentication on usage endpoints yet
- in-memory metrics and SQLite persistence may differ after restart
- no distributed queue
- no distributed workers
- no Prometheus or Grafana integration yet
- no persistent reporting yet
- local async stress tester only
- no distributed load generation
- Ollama model is disabled by default
- no Ollama streaming yet
- no advanced Ollama concurrency tuning yet
- tests do not require Ollama
- Ollama stress results depend on local CPU/GPU/RAM and model configuration
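For the fixed-window limitation listed above, a minimal sketch of what a fixed-window rate limiter generally looks like may help. This shows the generic technique, not the project's implementation. The class name, limit, window length, and injected clock are assumptions for illustration.

```python
import time
from typing import Callable


class FixedWindowRateLimiter:
    """Generic fixed-window limiter: at most `limit` requests per window per key."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._windows: dict[str, tuple[int, int]] = {}  # key -> (window index, count)

    def allow(self, key: str) -> bool:
        window_index = int(self._clock() // self.window_seconds)
        stored_index, count = self._windows.get(key, (window_index, 0))
        if stored_index != window_index:
            count = 0  # a new window started, so the counter resets
        if count >= self.limit:
            return False
        self._windows[key] = (window_index, count + 1)
        return True


# Deterministic demo with a fake clock instead of real time.
now = [0.0]
limiter = FixedWindowRateLimiter(limit=2, window_seconds=60, clock=lambda: now[0])
decisions = [limiter.allow("user_standard_01") for _ in range(3)]
now[0] = 61.0  # move into the next window
after_reset = limiter.allow("user_standard_01")
```

A known property of fixed windows is that a client can send up to twice the limit in a short span that straddles a window boundary, which is one reason the approach is described as simple.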
curl -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user_standard_01",
    "project_id": "standard_project",
    "model": "simulated-small",
    "prompt": "Hello from the simulator",
    "max_tokens": 64
  }'

If a request is queued, use the returned request_id to check its status:
curl http://127.0.0.1:8000/requests/<request_id>

Start the backend:
uvicorn backend.main:app --reload

Run a scenario:
python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario burst_load --poll-queued true --poll-timeout 30

Available scenarios:
- normal_load
- burst_load
- vip_protection
- abusive_user
- mixed_load
- ollama_normal_load
- ollama_burst_load
- ollama_vip_protection
- ollama_long_prompt
Reports are written to:
- reports/latest_results.csv
- reports/latest_summary.json
- reports/latest_comparison.md
Polling is enabled by default. Initial latency is the time to receive the first /generate response. End-to-end latency is the time until a queued request reaches a final status such as completed, failed, rejected, or timed_out.
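The difference between the two latencies can be sketched as follows. The sketch assumes that the /generate response carries a request_id and a status field, that the status endpoint returns a status field, and that a waiting request reports a status such as "queued". The two transport callables are hypothetical stand-ins for the HTTP calls, so the example runs without a server.

```python
import time
from typing import Callable

FINAL_STATUSES = {"completed", "failed", "rejected", "timed_out"}


def measure_latencies(
    send_generate: Callable[[], dict],
    get_status: Callable[[str], dict],
    poll_timeout: float = 30.0,
    poll_interval: float = 0.1,
) -> dict:
    """Return initial and end-to-end latency for one request.

    send_generate and get_status stand in for POST /generate and
    GET /requests/{request_id}. If the poll timeout expires first,
    the last observed status is returned.
    """
    start = time.monotonic()
    response = send_generate()
    initial_latency = time.monotonic() - start

    status = response.get("status")
    deadline = start + poll_timeout
    while status not in FINAL_STATUSES and time.monotonic() < deadline:
        time.sleep(poll_interval)
        status = get_status(response["request_id"]).get("status")

    return {
        "status": status,
        "initial_latency": initial_latency,
        "end_to_end_latency": time.monotonic() - start,
    }


# Fake transport: the request is queued first and completes on the second poll.
polls = iter([{"status": "queued"}, {"status": "completed"}])
result = measure_latencies(
    send_generate=lambda: {"request_id": "req-1", "status": "queued"},
    get_status=lambda request_id: next(polls),
    poll_interval=0.01,
)
```

For a request that is answered immediately, the loop never runs and the two latencies are nearly identical. They diverge only when the request waits in the queue.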
Ollama scenario examples:
python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario ollama_normal_load --poll-queued true --poll-timeout 120
python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario ollama_burst_load --poll-queued true --poll-timeout 180

For Ollama scenarios, ollama-llama must be manually enabled in config/models.yaml, and Ollama must be running locally. Results depend heavily on local CPU/GPU/RAM, model size, and Ollama configuration.
Ollama support is available through the ollama-llama model config, but it is disabled by default.
To enable it manually:
- Install and run Ollama.
- Pull the model:
  ollama pull llama3.1
- Edit config/models.yaml and set:
  ollama-llama:
    enabled: true
- Start the backend:
  uvicorn backend.main:app --reload
- Call /generate with:
{
  "user_id": "user_vip_01",
  "project_id": "vip_project",
  "model": "ollama-llama",
  "prompt": "Hello from Ollama",
  "max_tokens": 64
}

Current Ollama limitations:
- no streaming yet
- no advanced Ollama concurrency tuning yet
- tests use mocks and do not require Ollama installed or running
Degradation is based on queue usage ratio:
queue_usage_ratio = queue_size / max_queue_size
Configured levels:
- normal: no degradation actions.
- soft_pressure: reduces requested max_tokens.
- high_pressure: reduces max_tokens and rejects batch requests.
- critical_pressure: reduces max_tokens, rejects batch requests, rejects standard traffic, and preserves high-priority traffic.
The degradation rules are configured in config/degradation.yaml.
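A minimal sketch of how the queue usage ratio could map to one of the four levels. The threshold values below are hypothetical placeholders. The real values live in config/degradation.yaml and are not shown in this README.

```python
# Hypothetical thresholds: (minimum queue usage ratio, level name), highest first.
LEVEL_THRESHOLDS = [
    (0.90, "critical_pressure"),
    (0.75, "high_pressure"),
    (0.50, "soft_pressure"),
    (0.00, "normal"),
]


def degradation_level(queue_size: int, max_queue_size: int) -> str:
    """Pick the first level whose threshold the queue usage ratio reaches."""
    if max_queue_size <= 0:
        raise ValueError("max_queue_size must be positive")
    queue_usage_ratio = queue_size / max_queue_size
    for minimum_ratio, level in LEVEL_THRESHOLDS:
        if queue_usage_ratio >= minimum_ratio:
            return level
    return "normal"


# With max_queue_size=100 the ratios are 0.10, 0.60, 0.80, and 0.95.
levels = [degradation_level(size, 100) for size in (10, 60, 80, 95)]
```

Ordering the thresholds from highest to lowest means the most severe matching level wins.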
Current behavior:
- degradation runs after policy validation and before admission control
- the degraded max_tokens value is applied to a copied request object
- token and cost estimates are recalculated when max_tokens changes
- queued payloads contain the degraded request, not the original request
- /queue/status exposes the current degradation level and active actions
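The copy-then-recalculate behavior can be illustrated with a small sketch. The request fields mirror the /generate payload, but the GenerateRequest class, the token cap, and the per-token price are assumptions made for this example rather than the backend's actual models.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GenerateRequest:
    user_id: str
    project_id: str
    model: str
    prompt: str
    max_tokens: int


def degrade(request: GenerateRequest, max_tokens_cap: int) -> GenerateRequest:
    """Return a copy with max_tokens capped; the original request is left untouched."""
    if request.max_tokens <= max_tokens_cap:
        return request
    return replace(request, max_tokens=max_tokens_cap)


def estimate_cost(request: GenerateRequest, price_per_token: float) -> float:
    """Hypothetical estimate based only on the output token budget."""
    return request.max_tokens * price_per_token


original = GenerateRequest("user_standard_01", "standard_project",
                           "simulated-small", "Hello from the simulator", 64)
degraded = degrade(original, max_tokens_cap=32)

# The estimate is recomputed from the degraded copy, so it reflects the lower budget.
cost_before = estimate_cost(original, price_per_token=0.001)
cost_after = estimate_cost(degraded, price_per_token=0.001)
```

Working on a copy keeps the caller's original request available, for example for logging what was asked versus what was served.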
SQLite persistence complements the in-memory metrics collector. The default database is:
data/usage.db
Usage endpoints:
GET /usage/summary
GET /usage/project/{project_id}
GET /usage/user/{user_id}
GET /usage/recent?limit=50
Persisted records include request metadata, backend, status, admission decision, degradation level, token estimates, estimated cost, latency, queue wait, and error messages.
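As an illustration of how such records can be aggregated, here is a minimal sqlite3 sketch. The table name, column names, and sample rows are hypothetical and only resemble the fields listed above. The actual schema in data/usage.db may differ.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE usage_records (
        request_id TEXT PRIMARY KEY,
        project_id TEXT NOT NULL,
        status TEXT NOT NULL,
        estimated_tokens INTEGER NOT NULL,
        estimated_cost REAL NOT NULL
    )
    """
)
connection.executemany(
    "INSERT INTO usage_records VALUES (?, ?, ?, ?, ?)",
    [
        ("req-1", "vip_project", "completed", 120, 0.5),
        ("req-2", "vip_project", "rejected", 0, 0.0),
        ("req-3", "standard_project", "completed", 80, 0.25),
    ],
)
connection.commit()

# Per-project totals, similar in spirit to GET /usage/project/{project_id}.
summary = connection.execute(
    """
    SELECT project_id, COUNT(*), SUM(estimated_tokens), SUM(estimated_cost)
    FROM usage_records
    GROUP BY project_id
    ORDER BY project_id
    """
).fetchall()
connection.close()
```

Pointing sqlite3.connect at a file path instead of ":memory:" is what makes the records survive a restart.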
Current persistence limitations:
- SQLite is local to one process or machine
- not designed for distributed production deployment
- no retention policy yet
- no authentication on usage endpoints yet
- in-memory metrics reset on restart, while SQLite records remain
The quality_eval module evaluates LLM code generation quality using the
HumanEval benchmark (164 Python problems).
It runs problems through the simulator /generate endpoint and measures pass@1
alongside latency and cost metrics already tracked by the simulator.
pip install -e ".[dev]"

datasets is included in the dev extras. Reading the downloaded Parquet files requires pyarrow, which is installed automatically with datasets.
hf download openai/openai_humaneval --repo-type dataset --local-dir ./data/openai_humaneval

This places the dataset at data/openai_humaneval/openai_humaneval/test-00000-of-00001.parquet.
The runner detects the format automatically, so no extra configuration is needed.
Start the backend in one terminal:
uvicorn backend.main:app --reload

Run 3 problems with the simulated backend to confirm the pipeline works end to end:
python -m quality_eval.humaneval_runner \
--model simulated-small \
--user-id user_standard_01 \
--project-id standard_project \
--num-problems 3 \
--concurrency 1 \
--local-dataset data/openai_humaneval

Expected output (pass@1 will be 0% because the simulated backend returns fake text):
Loaded 3 HumanEval problems.
Model: simulated-small | Concurrency: 1 | max_tokens: 1024
[ 1/3] HumanEval/0 — FAIL | latency=0.5s | error=execution_error
[ 2/3] HumanEval/1 — FAIL | latency=0.5s | error=execution_error
[ 3/3] HumanEval/2 — FAIL | latency=0.5s | error=execution_error
--- Results for simulated-small ---
pass@1 : 0.0% (0/3)
Enable the model in config/models.yaml (enabled: true) and pull it with Ollama:
ollama pull qwen2.5:14b-instruct

Run 20 problems (minimum recommended for a meaningful pass@1):
python -m quality_eval.humaneval_runner \
--model ollama-llama \
--user-id user_vip_01 \
--project-id vip_project \
--num-problems 20 \
--concurrency 1 \
--max-tokens 1024 \
--poll-timeout 120 \
--local-dataset data/openai_humaneval

Example output:
Loaded 20 HumanEval problems.
Model: ollama-llama | Concurrency: 1 | max_tokens: 1024
[ 1/20] HumanEval/0 — FAIL | latency=1.98s | error=execution_error
[ 2/20] HumanEval/1 — PASS | latency=3.63s | error=None
...
[ 20/20] HumanEval/19 — PASS | latency=5.44s | error=None
--- Results for ollama-llama ---
pass@1 : 40.0% (8/20)
P95 latency : 5.44s
Total cost : €0.000000
Reports are written to:
reports/quality_results.csv — one row per problem with pass, latency, tokens, error
reports/quality_summary.json — aggregate: pass@1, latency percentiles, cost, error breakdown
reports/quality_report.md — human-readable summary for sharing with the group
Quick inspection in PowerShell:
# Per-problem results
Import-Csv reports/quality_results.csv |
Select-Object task_id, passed, error_type, latency_seconds |
Format-Table
# Summary
Get-Content reports/quality_summary.json
# Markdown report
Get-Content reports/quality_report.md

| Metric | What it tells you |
|---|---|
| pass@1 | Fraction of problems solved correctly on the first attempt |
| execution_error (AssertionError) | Model generated valid code but wrong logic |
| execution_error (IndentationError) | Likely a prompt or extraction artifact, not a model failure |
| execution_timeout | Generated code ran too slowly or entered an infinite loop |
| simulator_rejected | Simulator refused the request (rate limit, quota, capacity) |
Key insight: IndentationError failures are often caused by how the model formats
its response rather than by a lack of capability. If you see many of these, the pass@1
is an underestimate of the model's true quality.
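To make the metric concrete, the sketch below computes pass@1 and an error breakdown from rows shaped like reports/quality_results.csv. The column names task_id, passed, and error_type come from the PowerShell example above. The encoding of passed as True/False text and the sample rows are assumptions.

```python
import csv
import io
from collections import Counter

# Sample data standing in for reports/quality_results.csv (values are made up).
SAMPLE_CSV = """task_id,passed,error_type
HumanEval/0,False,execution_error
HumanEval/1,True,
HumanEval/2,True,
HumanEval/3,False,execution_timeout
"""


def summarize(rows: list[dict]) -> dict:
    """Compute pass@1 with one sample per problem, plus error counts."""
    total = len(rows)
    passed = sum(1 for row in rows if row["passed"].strip().lower() == "true")
    errors = Counter(row["error_type"] for row in rows if row["error_type"])
    return {
        "pass_at_1": passed / total if total else 0.0,
        "passed": passed,
        "total": total,
        "errors": dict(errors),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = summarize(rows)
```

With one generation per problem, pass@1 is simply the share of problems that passed, so a handful of formatting-related failures can move it noticeably on a 20-problem run.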
Run the same command changing only --model for each model under evaluation:
python -m quality_eval.humaneval_runner --model ollama-llama --num-problems 20 ...
python -m quality_eval.humaneval_runner --model ollama-gemma4 --num-problems 20 ...
python -m quality_eval.humaneval_runner --model ollama-qwen --num-problems 20 ...

Then compare reports/quality_summary.json across runs. A reference comparison:
| Model | Problems | pass@1 | P95 latency | Cost/passed |
|---|---|---|---|---|
| ollama-llama | 20 | 40.0% | 5.44s | €0.000000 |
| ollama-gemma4 | 20 | - | - | - |
| ollama-qwen | 20 | - | - | - |
--base-url Simulator URL (default: http://127.0.0.1:8000)
--model Model name as defined in config/models.yaml
--user-id User ID for requests (default: user_vip_01)
--project-id Project ID for requests (default: vip_project)
--num-problems Number of HumanEval problems to evaluate, max 164 (default: 20)
--concurrency Concurrent requests to the simulator (default: 2)
--max-tokens Max tokens for code generation (default: 1024)
--poll-timeout Seconds to wait for a queued request (default: 120)
--code-timeout Seconds allowed for generated code to execute (default: 10)
--local-dataset Path to local dataset directory or .jsonl/.parquet file
- implement config validation
- implement persistent reporting dashboards
- add Ollama streaming support
- add multi-model quality comparison report