llm-throughput-simulator

Python MVP project for simulating high-throughput LLM serving behavior.

Project Goal

This project provides the initial structure for an LLM high-throughput serving simulator. The current goal is to establish a clean backend layout, configuration foundation, placeholder policy and queue components, and minimal API endpoints.

Architecture Summary

backend/: FastAPI application, API routes, config loading, request and response models.
backend/policies/: placeholders for rate limiting, quota, cost, and policy logic.
backend/queue/: placeholders for admission control, priority scheduling, and queue management.
backend/llm_backends/: placeholders for simulated and Ollama-backed LLM providers.
backend/metrics/: placeholder metrics collection.
stress_tester/: placeholders for future load generation, scenarios, and reporting.
config/: YAML configuration for users, projects, models, limits, and degradation levels.
tests/: minimal tests for the configuration loader.

Install Dependencies

pip install -e ".[dev]"

Run the FastAPI Backend

uvicorn backend.main:app --reload

The following endpoints are available:

GET /health
GET /queue/status
GET /requests/{request_id}
GET /metrics
POST /metrics/reset
GET /usage/summary
GET /usage/project/{project_id}
GET /usage/user/{user_id}
GET /usage/recent?limit=50
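
As a quick check that the backend is up, the parameterless GET endpoints above can be probed from Python. This is a minimal sketch using only the standard library. The helper names (http_status, smoke_check) are illustrative and not part of the project, and the expected status codes are not asserted here because the source does not document them.

```python
from typing import Callable, Dict, Iterable
from urllib.error import HTTPError
from urllib.request import urlopen

# GET endpoints without path parameters, copied from the list above.
SMOKE_PATHS = (
    "/health",
    "/queue/status",
    "/metrics",
    "/usage/summary",
    "/usage/recent?limit=50",
)


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as exc:
        # 4xx/5xx responses still carry a status code.
        return exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def smoke_check(
    base_url: str,
    paths: Iterable[str] = SMOKE_PATHS,
    get_status: Callable[[str], int] = http_status,
) -> Dict[str, int]:
    """Map each path to the status code returned by the backend."""
    base = base_url.rstrip("/")
    return {path: get_status(base + path) for path in paths}
```

With the backend running, smoke_check("http://127.0.0.1:8000") returns one status code per path. A value of 0 means the server could not be reached.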

Development Notes

See docs/development-guide.md for setup, mini-check commands, endpoint smoke tests, current limitations, and suggested next implementation steps.

See docs/testing_strategies_guide.md for practical experiments covering load, burst traffic, VIP protection, rate limiting, quota exhaustion, degradation, Ollama comparison, persistence, metrics, and reports.

See docs/requirements.md for the requirements specification and docs/architecture.md for the logical architecture.

Current Status

Step 13 is implemented: request lifecycle and usage accounting are now persisted to local SQLite alongside the existing in-memory metrics.

Implemented foundations:

simulated LLM backend
user/project/model validation
simple request rate limiting
in-memory project token quota tracking
estimated token cost calculation
admission control
in-memory priority queue foundation
background queue workers
queue status endpoint
request status endpoint
metrics summary endpoint
metrics reset endpoint
local async stress tester
CSV and JSON stress test reports
queued request polling in stress tester
end-to-end latency reporting
optional Ollama backend adapter
Ollama stress scenarios
backend and model comparison report
configuration-driven degradation strategy
SQLite persistent usage accounting

Current limitations:

project token quota tracking is in memory only
rate limiting uses a simple fixed window
queue is in-memory only
queue and results are lost on restart
metrics are in-memory only
metrics are lost on restart
SQLite is local only
no retention policy yet
no authentication on usage endpoints yet
in-memory metrics and SQLite persistence may differ after restart
no distributed queue
no distributed workers
no Prometheus or Grafana integration yet
no persistent reporting yet
local async stress tester only
no distributed load generation
Ollama model is disabled by default
no Ollama streaming yet
no advanced Ollama concurrency tuning yet
tests do not require Ollama
Ollama stress results depend on local CPU/GPU/RAM and model configuration
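
To illustrate the "simple fixed window" rate limiting mentioned in the limitations above, here is a generic sketch of the technique. It is not the project's implementation: the class name, constructor parameters, and per-user keying are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Allow at most max_requests per user in each fixed time window."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        # user_id -> (window index, requests counted in that window)
        self._windows: Dict[str, Tuple[int, int]] = {}

    def allow(self, user_id: str) -> bool:
        window = int(self._clock() // self.window_seconds)
        current, count = self._windows.get(user_id, (window, 0))
        if current != window:
            # A new window started, so the counter resets.
            count = 0
        if count >= self.max_requests:
            self._windows[user_id] = (window, count)
            return False
        self._windows[user_id] = (window, count + 1)
        return True
```

The known weakness of a fixed window is the boundary burst: a client can send max_requests at the end of one window and max_requests again at the start of the next, briefly doubling the effective rate.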

Generate Example

curl -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user_standard_01",
    "project_id": "standard_project",
    "model": "simulated-small",
    "prompt": "Hello from the simulator",
    "max_tokens": 64
  }'

If a request is queued, use the returned request_id to check its status:

curl http://127.0.0.1:8000/requests/<request_id>
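
The same status check can be wrapped in a polling loop. The sketch below uses only the standard library and stops on the final statuses this README names for the stress tester (completed, failed, rejected, timed_out). It assumes the response body is JSON with a "status" field, which the source does not confirm. The function names are hypothetical.

```python
import json
import time
from typing import Any, Callable, Dict
from urllib.request import urlopen

# Final statuses as described in the Stress Tester section.
FINAL_STATUSES = {"completed", "failed", "rejected", "timed_out"}


def fetch_json(url: str) -> Dict[str, Any]:
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def wait_for_result(
    base_url: str,
    request_id: str,
    timeout_seconds: float = 30.0,
    interval_seconds: float = 0.5,
    fetch: Callable[[str], Dict[str, Any]] = fetch_json,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Poll /requests/{request_id} until a final status or the deadline."""
    url = f"{base_url.rstrip('/')}/requests/{request_id}"
    deadline = clock() + timeout_seconds
    while True:
        body = fetch(url)
        # Assumption: the response carries a top-level "status" field.
        if body.get("status") in FINAL_STATUSES:
            return body
        if clock() >= deadline:
            raise TimeoutError(f"request {request_id} did not finish in time")
        sleep(interval_seconds)
```

The fetch, sleep, and clock parameters exist only so the loop can be exercised without a running backend.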

Stress Tester

Start the backend:

uvicorn backend.main:app --reload

Run a scenario:

python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario burst_load --poll-queued true --poll-timeout 30

Available scenarios:

normal_load
burst_load
vip_protection
abusive_user
mixed_load
ollama_normal_load
ollama_burst_load
ollama_vip_protection
ollama_long_prompt

Reports are written to:

reports/latest_results.csv
reports/latest_summary.json
reports/latest_comparison.md

Polling is enabled by default. Initial latency is the time to receive the first /generate response. End-to-end latency is the time until a queued request reaches a final status such as completed, failed, rejected, or timed_out.

Ollama scenario examples:

python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario ollama_normal_load --poll-queued true --poll-timeout 120

python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario ollama_burst_load --poll-queued true --poll-timeout 180

For Ollama scenarios, ollama-llama must be manually enabled in config/models.yaml, and Ollama must be running locally. Results depend heavily on local CPU/GPU/RAM, model size, and Ollama configuration.

Optional Ollama Backend

Ollama support is available through the ollama-llama model config, but it is disabled by default.

To enable it manually:

Install and run Ollama.
Pull the model:

ollama pull llama3.1

Edit config/models.yaml and set:

ollama-llama:
  enabled: true

Start the backend:

uvicorn backend.main:app --reload

Call /generate with:

{
  "user_id": "user_vip_01",
  "project_id": "vip_project",
  "model": "ollama-llama",
  "prompt": "Hello from Ollama",
  "max_tokens": 64
}

Current Ollama limitations:

no streaming yet
no advanced Ollama concurrency tuning yet
tests use mocks and do not require Ollama installed or running

Degradation Strategy

Degradation is based on queue usage ratio:

queue_usage_ratio = queue_size / max_queue_size

Configured levels:

normal: no degradation actions.
soft_pressure: reduces requested max_tokens.
high_pressure: reduces max_tokens and rejects batch requests.
critical_pressure: reduces max_tokens, rejects batch requests, rejects standard traffic, and preserves high-priority traffic.

The degradation rules are configured in config/degradation.yaml.
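
As a rough illustration of how a queue usage ratio could map to the levels above, consider the following sketch. The threshold values (0.50, 0.75, 0.90) are invented for the example; the real values live in config/degradation.yaml and may differ. The function name is hypothetical.

```python
from typing import Sequence, Tuple

# (minimum queue usage ratio, level name), ordered from most to least severe.
# These thresholds are illustrative assumptions, not the project's config.
EXAMPLE_THRESHOLDS: Sequence[Tuple[float, str]] = (
    (0.90, "critical_pressure"),
    (0.75, "high_pressure"),
    (0.50, "soft_pressure"),
)


def degradation_level(
    queue_size: int,
    max_queue_size: int,
    thresholds: Sequence[Tuple[float, str]] = EXAMPLE_THRESHOLDS,
) -> str:
    """Return the most severe level whose threshold the ratio reaches."""
    if max_queue_size <= 0:
        raise ValueError("max_queue_size must be positive")
    queue_usage_ratio = queue_size / max_queue_size
    for minimum_ratio, level in thresholds:
        if queue_usage_ratio >= minimum_ratio:
            return level
    return "normal"
```

Checking the most severe level first means each request resolves to exactly one level, even when the ratio exceeds several thresholds.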

Current behavior:

degradation runs after policy validation and before admission control
degraded max_tokens are applied to a copied request object
token and cost estimates are recalculated when max_tokens changes
queued payloads contain the degraded request, not the original request
/queue/status exposes the current degradation level and active actions
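
The copy-then-recalculate behavior described above can be sketched as follows. The request fields mirror the /generate payload shown earlier; the reduction factor, the four-characters-per-token heuristic, and the pricing parameter are assumptions for illustration only, not the project's actual estimator.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GenerateRequest:
    """Fields taken from the /generate example payload."""

    user_id: str
    project_id: str
    model: str
    prompt: str
    max_tokens: int


def apply_token_reduction(request: GenerateRequest, factor: float) -> GenerateRequest:
    """Return a copy with max_tokens scaled down; the original is untouched."""
    reduced = max(1, int(request.max_tokens * factor))
    return replace(request, max_tokens=reduced)


def estimate_cost(request: GenerateRequest, price_per_1k_tokens: float) -> float:
    """Crude estimate: about 4 characters per prompt token plus max_tokens."""
    estimated_tokens = len(request.prompt) // 4 + request.max_tokens
    return estimated_tokens * price_per_1k_tokens / 1000
```

Because the dataclass is frozen and replace() builds a new object, the original request stays available for logging, while the degraded copy is what would be queued and priced.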

Persistent Usage Accounting

SQLite persistence complements the in-memory metrics collector. The default database is:

data/usage.db

Usage endpoints:

GET /usage/summary
GET /usage/project/{project_id}
GET /usage/user/{user_id}
GET /usage/recent?limit=50

Persisted records include request metadata, backend, status, admission decision, degradation level, token estimates, estimated cost, latency, queue wait, and error messages.
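
Persisted records can also be inspected directly with Python's sqlite3 module. The table name usage_records and the column names below are hypothetical, since the README does not document the schema; check the actual schema in data/usage.db before adapting this sketch.

```python
import sqlite3
from typing import Dict, Tuple


def usage_by_project(conn: sqlite3.Connection) -> Dict[str, Tuple[int, float]]:
    """Return {project_id: (request_count, total_estimated_cost)}.

    Assumes a table named usage_records with project_id and
    estimated_cost columns (hypothetical names).
    """
    rows = conn.execute(
        """
        SELECT project_id, COUNT(*), COALESCE(SUM(estimated_cost), 0.0)
        FROM usage_records
        GROUP BY project_id
        ORDER BY project_id
        """
    ).fetchall()
    return {project_id: (count, total) for project_id, count, total in rows}
```

To run it against the default database, open a connection with sqlite3.connect("data/usage.db") and pass it to usage_by_project.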

Current persistence limitations:

SQLite is local to one process or machine
not designed for distributed production deployment
no retention policy yet
no authentication on usage endpoints yet
in-memory metrics reset on restart, while SQLite records remain

HumanEval Quality Evaluation

The quality_eval module evaluates LLM code generation quality using the HumanEval benchmark (164 Python problems). It runs problems through the simulator /generate endpoint and measures pass@1 alongside latency and cost metrics already tracked by the simulator.

Install extra dependency

pip install -e ".[dev]"

The datasets package is included in the dev extras. Reading the downloaded Parquet files requires pyarrow, which is installed automatically with datasets.

Download the HumanEval dataset

hf download openai/openai_humaneval --repo-type dataset --local-dir ./data/openai_humaneval

This places the dataset at data/openai_humaneval/openai_humaneval/test-00000-of-00001.parquet. The runner detects the format automatically — no extra configuration needed.

Verify the pipeline (simulated backend, no Ollama required)

Start the backend in one terminal:

uvicorn backend.main:app --reload

Run 3 problems with the simulated backend to confirm the pipeline works end to end:

python -m quality_eval.humaneval_runner \
  --model simulated-small \
  --user-id user_standard_01 \
  --project-id standard_project \
  --num-problems 3 \
  --concurrency 1 \
  --local-dataset data/openai_humaneval

Expected output (pass@1 will be 0% — the simulated backend returns fake text):

Loaded 3 HumanEval problems.
Model: simulated-small | Concurrency: 1 | max_tokens: 1024
  [  1/3] HumanEval/0 — FAIL | latency=0.5s | error=execution_error
  [  2/3] HumanEval/1 — FAIL | latency=0.5s | error=execution_error
  [  3/3] HumanEval/2 — FAIL | latency=0.5s | error=execution_error

--- Results for simulated-small ---
  pass@1 : 0.0%  (0/3)

Run quality evaluation with a real model

Enable the model in config/models.yaml (enabled: true) and pull it with Ollama:

ollama pull qwen2.5:14b-instruct

Run 20 problems (minimum recommended for a meaningful pass@1):

python -m quality_eval.humaneval_runner \
  --model ollama-llama \
  --user-id user_vip_01 \
  --project-id vip_project \
  --num-problems 20 \
  --concurrency 1 \
  --max-tokens 1024 \
  --poll-timeout 120 \
  --local-dataset data/openai_humaneval

Example output:

Loaded 20 HumanEval problems.
Model: ollama-llama | Concurrency: 1 | max_tokens: 1024
  [  1/20] HumanEval/0  — FAIL | latency=1.98s | error=execution_error
  [  2/20] HumanEval/1  — PASS | latency=3.63s | error=None
  ...
  [ 20/20] HumanEval/19 — PASS | latency=5.44s | error=None

--- Results for ollama-llama ---
  pass@1 : 40.0%  (8/20)
  P95 latency : 5.44s
  Total cost  : €0.000000

Analyze the results

Reports are written to:

reports/quality_results.csv     — one row per problem with pass, latency, tokens, error
reports/quality_summary.json    — aggregate: pass@1, latency percentiles, cost, error breakdown
reports/quality_report.md       — human-readable summary for sharing with the group

Quick inspection in PowerShell:

# Per-problem results
Import-Csv reports/quality_results.csv |
  Select-Object task_id, passed, error_type, latency_seconds |
  Format-Table

# Summary
Get-Content reports/quality_summary.json

# Markdown report
Get-Content reports/quality_report.md
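
On platforms without PowerShell, a few lines of Python give a similar overview. The column names (passed, error_type) come from the Select-Object call above. How the passed column is serialized (True/False, 1/0) is an assumption, so the sketch accepts several spellings. The function names are illustrative.

```python
import csv
from collections import Counter
from typing import Any, Dict, Iterable, List, Mapping

# Assumption: a passing row is written as one of these strings.
TRUTHY = {"true", "1", "yes"}


def load_rows(path: str) -> List[Dict[str, str]]:
    """Read the per-problem CSV report into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize_results(rows: Iterable[Mapping[str, str]]) -> Dict[str, Any]:
    """Compute pass@1 and a count of error types among failed problems."""
    total = 0
    passed = 0
    errors: Counter = Counter()
    for row in rows:
        total += 1
        if row["passed"].strip().lower() in TRUTHY:
            passed += 1
        elif row.get("error_type"):
            errors[row["error_type"]] += 1
    return {
        "total": total,
        "passed": passed,
        "pass_at_1": passed / total if total else 0.0,
        "errors": dict(errors),
    }
```

For example, summarize_results(load_rows("reports/quality_results.csv")) returns the totals, the pass@1 fraction, and the error breakdown in one dictionary.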

Interpreting results

Metric                                   What it tells you
`pass@1`                                 Fraction of problems solved correctly on the first attempt
`execution_error` (`AssertionError`)     Model generated valid code, but the logic is wrong
`execution_error` (`IndentationError`)   Likely a prompt or extraction artifact, not a model failure
`execution_timeout`                      Generated code ran too long, for example because of an infinite loop
`simulator_rejected`                     Simulator refused the request (rate limit, quota, capacity)

Key insight: IndentationError failures are often caused by how the model formats its response rather than by a lack of capability. If you see many of them, the reported pass@1 likely underestimates the model's true quality.

Compare multiple models

Run the same command for each model under evaluation, changing only --model:

python -m quality_eval.humaneval_runner --model ollama-llama   --num-problems 20 ...
python -m quality_eval.humaneval_runner --model ollama-gemma4  --num-problems 20 ...
python -m quality_eval.humaneval_runner --model ollama-qwen    --num-problems 20 ...

Then compare reports/quality_summary.json across runs. A reference comparison:

Model           Problems  pass@1   P95 latency   Cost/passed
ollama-llama    20        40.0%    5.44s         €0.000000
ollama-gemma4   20        —        —             —
ollama-qwen     20        —        —             —

Available CLI options

--base-url        Simulator URL (default: http://127.0.0.1:8000)
--model           Model name as defined in config/models.yaml
--user-id         User ID for requests (default: user_vip_01)
--project-id      Project ID for requests (default: vip_project)
--num-problems    Number of HumanEval problems to evaluate, max 164 (default: 20)
--concurrency     Concurrent requests to the simulator (default: 2)
--max-tokens      Max tokens for code generation (default: 1024)
--poll-timeout    Seconds to wait for a queued request (default: 120)
--code-timeout    Seconds allowed for generated code to execute (default: 10)
--local-dataset   Path to local dataset directory or .jsonl/.parquet file
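
For readers who want to script against these options, the table above could be declared with argparse roughly as follows. This is a sketch, not the runner's actual parser: the option names and defaults are copied from the table, while the argument types and the choice to make --model required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI options listed above (types are assumptions)."""
    parser = argparse.ArgumentParser(prog="quality_eval.humaneval_runner")
    parser.add_argument("--base-url", default="http://127.0.0.1:8000",
                        help="Simulator URL")
    # Assumption: the table gives no default for --model, so it is required here.
    parser.add_argument("--model", required=True,
                        help="Model name as defined in config/models.yaml")
    parser.add_argument("--user-id", default="user_vip_01",
                        help="User ID for requests")
    parser.add_argument("--project-id", default="vip_project",
                        help="Project ID for requests")
    parser.add_argument("--num-problems", type=int, default=20,
                        help="Number of HumanEval problems to evaluate, max 164")
    parser.add_argument("--concurrency", type=int, default=2,
                        help="Concurrent requests to the simulator")
    parser.add_argument("--max-tokens", type=int, default=1024,
                        help="Max tokens for code generation")
    parser.add_argument("--poll-timeout", type=float, default=120.0,
                        help="Seconds to wait for a queued request")
    parser.add_argument("--code-timeout", type=float, default=10.0,
                        help="Seconds allowed for generated code to execute")
    parser.add_argument("--local-dataset", default=None,
                        help="Path to local dataset directory or .jsonl/.parquet file")
    return parser
```

Note that argparse converts dashes to underscores, so --num-problems is read back as args.num_problems.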

Future Steps

implement config validation
implement persistent reporting dashboards
add Ollama streaming support
add multi-model quality comparison report
