seseWho/llm-throughput-simulator

llm-throughput-simulator

Python MVP project for simulating high-throughput LLM serving behavior.

Project Goal

This project provides the initial structure for an LLM high-throughput serving simulator. The current goal is to establish a clean backend layout, configuration foundation, placeholder policy and queue components, and minimal API endpoints.

Architecture Summary

  • backend/: FastAPI application, API routes, config loading, request and response models.
  • backend/policies/: placeholders for rate limiting, quota, cost, and policy logic.
  • backend/queue/: placeholders for admission control, priority scheduling, and queue management.
  • backend/llm_backends/: placeholders for simulated and Ollama-backed LLM providers.
  • backend/metrics/: placeholder metrics collection.
  • stress_tester/: placeholders for future load generation, scenarios, and reporting.
  • config/: YAML configuration for users, projects, models, limits, and degradation levels.
  • tests/: minimal tests for the configuration loader.

Install Dependencies

pip install -e ".[dev]"

Run the FastAPI Backend

uvicorn backend.main:app --reload

The following endpoints are available, including the health check:

GET /health
GET /queue/status
GET /requests/{request_id}
GET /metrics
POST /metrics/reset
GET /usage/summary
GET /usage/project/{project_id}
GET /usage/user/{user_id}
GET /usage/recent?limit=50

Development Notes

See docs/development-guide.md for setup, mini-check commands, endpoint smoke tests, current limitations, and suggested next implementation steps.

See docs/testing_strategies_guide.md for practical experiments covering load, burst traffic, VIP protection, rate limiting, quota exhaustion, degradation, Ollama comparison, persistence, metrics, and reports.

See docs/requirements.md for the requirements specification and docs/architecture.md for the logical architecture.

Current Status

Step 13 is implemented: request lifecycle and usage accounting are now persisted to local SQLite alongside the existing in-memory metrics.

Implemented foundations:

  • simulated LLM backend
  • user/project/model validation
  • simple request rate limiting
  • in-memory project token quota tracking
  • estimated token cost calculation
  • admission control
  • in-memory priority queue foundation
  • background queue workers
  • queue status endpoint
  • request status endpoint
  • metrics summary endpoint
  • metrics reset endpoint
  • local async stress tester
  • CSV and JSON stress test reports
  • queued request polling in stress tester
  • end-to-end latency reporting
  • optional Ollama backend adapter
  • Ollama stress scenarios
  • backend and model comparison report
  • configuration-driven degradation strategy
  • SQLite persistent usage accounting

Current limitations:

  • project token quota tracking is in memory only
  • rate limiting uses a simple fixed window
  • queue is in-memory only
  • queue and results are lost on restart
  • metrics are in-memory only
  • metrics are lost on restart
  • SQLite is local only
  • no retention policy yet
  • no authentication on usage endpoints yet
  • in-memory metrics and SQLite persistence may differ after restart
  • no distributed queue
  • no distributed workers
  • no Prometheus or Grafana integration yet
  • no persistent reporting yet
  • local async stress tester only
  • no distributed load generation
  • Ollama model is disabled by default
  • no Ollama streaming yet
  • no advanced Ollama concurrency tuning yet
  • tests do not require Ollama
  • Ollama stress results depend on local CPU/GPU/RAM and model configuration
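
The fixed-window rate limiting named in the limitations above can be illustrated with a minimal sketch. The class name, its parameters, and the per-key counting are assumptions made for illustration; the repository's actual limiter may be structured differently.

```python
import time
from collections import defaultdict


class FixedWindowRateLimiter:
    """Hypothetical limiter: at most `limit` requests per key in each window."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so behavior can be checked deterministically
        # key -> [window index, request count in that window]
        self._windows = defaultdict(lambda: [0, 0])

    def allow(self, key: str) -> bool:
        window_index = int(self.clock() // self.window_seconds)
        state = self._windows[key]
        if state[0] != window_index:
            # A new window has started: reset the counter for this key.
            state[0], state[1] = window_index, 0
        if state[1] >= self.limit:
            return False
        state[1] += 1
        return True
```

A fixed window is simple but permits bursts around a window boundary: a caller can spend the full limit at the end of one window and again at the start of the next.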

Generate Example

curl -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "user_standard_01",
    "project_id": "standard_project",
    "model": "simulated-small",
    "prompt": "Hello from the simulator",
    "max_tokens": 64
  }'

If a request is queued, use the returned request_id to check its status:

curl http://127.0.0.1:8000/requests/<request_id>
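
The same two calls can be made from Python using only the standard library. This is a sketch that mirrors the curl examples above; the helper names are hypothetical, and the only response field relied on is request_id, which the text above says is returned for queued requests.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_generate_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build the POST /generate request shown in the curl example."""
    payload = {
        "user_id": "user_standard_01",
        "project_id": "standard_project",
        "model": "simulated-small",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_status_url(request_id: str) -> str:
    """URL of the status endpoint for a queued request."""
    return f"{BASE_URL}/requests/{request_id}"


def submit(prompt: str) -> dict:
    """Send the request; requires the backend to be running locally."""
    with urllib.request.urlopen(build_generate_request(prompt)) as response:
        return json.load(response)
```

With the backend running, `submit("Hello from the simulator")` returns the parsed JSON body; if it contains a request_id, pass it to `request_status_url` to find where to check the status.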

Stress Tester

Start the backend:

uvicorn backend.main:app --reload

Run a scenario:

python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario burst_load --poll-queued true --poll-timeout 30

Available scenarios:

  • normal_load
  • burst_load
  • vip_protection
  • abusive_user
  • mixed_load
  • ollama_normal_load
  • ollama_burst_load
  • ollama_vip_protection
  • ollama_long_prompt

Reports are written to:

  • reports/latest_results.csv
  • reports/latest_summary.json
  • reports/latest_comparison.md

Polling is enabled by default. Initial latency is the time to receive the first /generate response. End-to-end latency is the time until a queued request reaches a final status such as completed, failed, rejected, or timed_out.
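
The end-to-end measurement described above can be sketched as a polling loop. The function name, the injectable `fetch_status`, `clock`, and `sleep` parameters, and the choice to report a client-side timeout as timed_out are assumptions for illustration, not the stress tester's actual code.

```python
import time
from typing import Callable

# Final statuses as listed in the paragraph above.
FINAL_STATUSES = {"completed", "failed", "rejected", "timed_out"}


def poll_until_final(
    fetch_status: Callable[[], str],
    started_at: float,
    poll_timeout: float,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, float]:
    """Poll until a final status is seen; return (status, end_to_end_latency)."""
    deadline = clock() + poll_timeout
    while True:
        status = fetch_status()
        if status in FINAL_STATUSES:
            return status, clock() - started_at
        if clock() >= deadline:
            # Assumption: giving up on the client side is reported as timed_out.
            return "timed_out", clock() - started_at
        sleep(poll_interval)
```

Here `started_at` is the clock value taken just before the initial /generate call, so the returned latency covers both the first response and the time spent waiting in the queue.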

Ollama scenario examples:

python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario ollama_normal_load --poll-queued true --poll-timeout 120
python -m stress_tester.load_generator --base-url http://localhost:8000 --scenario ollama_burst_load --poll-queued true --poll-timeout 180

For Ollama scenarios, ollama-llama must be manually enabled in config/models.yaml, and Ollama must be running locally. Results depend heavily on local CPU/GPU/RAM, model size, and Ollama configuration.

Optional Ollama Backend

Ollama support is available through the ollama-llama model config, but it is disabled by default.

To enable it manually:

  1. Install and run Ollama.
  2. Pull the model:
ollama pull llama3.1
  3. Edit config/models.yaml and set:
ollama-llama:
  enabled: true
  4. Start the backend:
uvicorn backend.main:app --reload
  5. Call /generate with:
{
  "user_id": "user_vip_01",
  "project_id": "vip_project",
  "model": "ollama-llama",
  "prompt": "Hello from Ollama",
  "max_tokens": 64
}

Current Ollama limitations:

  • no streaming yet
  • no advanced Ollama concurrency tuning yet
  • tests use mocks and do not require Ollama installed or running

Degradation Strategy

Degradation is based on queue usage ratio:

queue_usage_ratio = queue_size / max_queue_size

Configured levels:

  • normal: no degradation actions.
  • soft_pressure: reduces requested max_tokens.
  • high_pressure: reduces max_tokens and rejects batch requests.
  • critical_pressure: reduces max_tokens, rejects batch requests, rejects standard traffic, and preserves high-priority traffic.

The degradation rules are configured in config/degradation.yaml.
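
As a rough illustration of how a usage ratio could map to one of the four levels, consider the sketch below. The threshold values, the `DegradationLevel` type, and the function names are assumptions; the real thresholds and actions are whatever config/degradation.yaml defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DegradationLevel:
    name: str
    min_ratio: float  # the level applies once queue_usage_ratio >= min_ratio


# Illustrative thresholds only; the real values live in config/degradation.yaml.
LEVELS = [
    DegradationLevel("normal", 0.0),
    DegradationLevel("soft_pressure", 0.5),
    DegradationLevel("high_pressure", 0.75),
    DegradationLevel("critical_pressure", 0.9),
]


def queue_usage_ratio(queue_size: int, max_queue_size: int) -> float:
    if max_queue_size <= 0:
        raise ValueError("max_queue_size must be positive")
    return queue_size / max_queue_size


def select_level(ratio: float) -> DegradationLevel:
    """Pick the most severe level whose threshold has been reached."""
    selected = LEVELS[0]
    for level in LEVELS:  # LEVELS is ordered from least to most severe
        if ratio >= level.min_ratio:
            selected = level
    return selected
```

With these assumed thresholds, a queue holding 80 of 100 slots gives a ratio of 0.8 and selects high_pressure.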

Current behavior:

  • degradation runs after policy validation and before admission control
  • degraded max_tokens are applied to a copied request object
  • token and cost estimates are recalculated when max_tokens changes
  • queued payloads contain the degraded request, not the original request
  • /queue/status exposes the current degradation level and active actions
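
The copy-then-recalculate behavior listed above can be sketched as follows. The `GenerateRequest` type, the cap-style reduction, and the crude token and cost estimators are hypothetical stand-ins; the project's own request model and estimators may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GenerateRequest:
    prompt: str
    max_tokens: int


def estimate_tokens(request: GenerateRequest) -> int:
    # Crude assumption: one token per whitespace-separated word, plus max_tokens.
    return len(request.prompt.split()) + request.max_tokens


def estimate_cost(tokens: int, price_per_1k_tokens: float) -> float:
    return tokens / 1000 * price_per_1k_tokens


def degrade(request: GenerateRequest, max_tokens_cap: int) -> GenerateRequest:
    """Return a copy with max_tokens capped; the original is never mutated."""
    if request.max_tokens <= max_tokens_cap:
        return request
    return replace(request, max_tokens=max_tokens_cap)


original = GenerateRequest(prompt="Hello from the simulator", max_tokens=64)
degraded = degrade(original, max_tokens_cap=32)
# Estimates are recomputed from the degraded copy, which is what would be queued.
degraded_tokens = estimate_tokens(degraded)
```

Working on a copy keeps the caller's original request available for logging or comparison while the queue only ever sees the degraded version.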

Persistent Usage Accounting

SQLite persistence complements the in-memory metrics collector. The default database is:

data/usage.db

Usage endpoints:

GET /usage/summary
GET /usage/project/{project_id}
GET /usage/user/{user_id}
GET /usage/recent?limit=50

Persisted records include request metadata, backend, status, admission decision, degradation level, token estimates, estimated cost, latency, queue wait, and error messages.
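
To make the shape of such a record concrete, here is a minimal sqlite3 sketch. The table name, column names, and helper functions are assumptions derived from the field list above, not the project's actual schema.

```python
import sqlite3

# Hypothetical columns, one per field named in the paragraph above.
COLUMNS = (
    "request_id", "user_id", "project_id", "model", "backend", "status",
    "admission_decision", "degradation_level", "estimated_tokens",
    "estimated_cost", "latency_seconds", "queue_wait_seconds", "error_message",
)

SCHEMA = """
CREATE TABLE IF NOT EXISTS usage_records (
    request_id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    project_id TEXT NOT NULL,
    model TEXT NOT NULL,
    backend TEXT,
    status TEXT NOT NULL,
    admission_decision TEXT,
    degradation_level TEXT,
    estimated_tokens INTEGER,
    estimated_cost REAL,
    latency_seconds REAL,
    queue_wait_seconds REAL,
    error_message TEXT
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_usage(conn: sqlite3.Connection, record: dict) -> None:
    """Insert or update one record; missing optional fields are stored as NULL."""
    values = [record.get(column) for column in COLUMNS]
    placeholders = ", ".join("?" for _ in COLUMNS)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            f"INSERT OR REPLACE INTO usage_records ({', '.join(COLUMNS)}) "
            f"VALUES ({placeholders})",
            values,
        )


def project_summary(conn: sqlite3.Connection, project_id: str) -> dict:
    row = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(estimated_tokens), 0), "
        "COALESCE(SUM(estimated_cost), 0.0) "
        "FROM usage_records WHERE project_id = ?",
        (project_id,),
    ).fetchone()
    return {"requests": row[0], "estimated_tokens": row[1], "estimated_cost": row[2]}
```

Keying on request_id with INSERT OR REPLACE lets a record be written once at admission and updated again when the request reaches a final status.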

Current persistence limitations:

  • SQLite is local to one process or machine
  • not designed for distributed production deployment
  • no retention policy yet
  • no authentication on usage endpoints yet
  • in-memory metrics reset on restart, while SQLite records remain

HumanEval Quality Evaluation

The quality_eval module evaluates LLM code generation quality using the HumanEval benchmark (164 Python problems). It runs problems through the simulator /generate endpoint and measures pass@1 alongside latency and cost metrics already tracked by the simulator.

Install extra dependency

pip install -e ".[dev]"

datasets is included in the dev extras. pyarrow, which is required to read the downloaded Parquet files, is installed automatically with datasets.

Download the HumanEval dataset

hf download openai/openai_humaneval --repo-type dataset --local-dir ./data/openai_humaneval

This places the dataset at data/openai_humaneval/openai_humaneval/test-00000-of-00001.parquet. The runner detects the format automatically, so no extra configuration is needed.

Verify the pipeline (simulated backend, no Ollama required)

Start the backend in one terminal:

uvicorn backend.main:app --reload

Run 3 problems with the simulated backend to confirm the pipeline works end to end:

python -m quality_eval.humaneval_runner \
  --model simulated-small \
  --user-id user_standard_01 \
  --project-id standard_project \
  --num-problems 3 \
  --concurrency 1 \
  --local-dataset data/openai_humaneval

Expected output (pass@1 will be 0% because the simulated backend returns fake text):

Loaded 3 HumanEval problems.
Model: simulated-small | Concurrency: 1 | max_tokens: 1024
  [  1/3] HumanEval/0 — FAIL | latency=0.5s | error=execution_error
  [  2/3] HumanEval/1 — FAIL | latency=0.5s | error=execution_error
  [  3/3] HumanEval/2 — FAIL | latency=0.5s | error=execution_error

--- Results for simulated-small ---
  pass@1 : 0.0%  (0/3)

Run quality evaluation with a real model

Enable the model in config/models.yaml (enabled: true) and pull it with Ollama:

ollama pull qwen2.5:14b-instruct

Run 20 problems (minimum recommended for a meaningful pass@1):

python -m quality_eval.humaneval_runner \
  --model ollama-llama \
  --user-id user_vip_01 \
  --project-id vip_project \
  --num-problems 20 \
  --concurrency 1 \
  --max-tokens 1024 \
  --poll-timeout 120 \
  --local-dataset data/openai_humaneval

Example output:

Loaded 20 HumanEval problems.
Model: ollama-llama | Concurrency: 1 | max_tokens: 1024
  [  1/20] HumanEval/0  — FAIL | latency=1.98s | error=execution_error
  [  2/20] HumanEval/1  — PASS | latency=3.63s | error=None
  ...
  [ 20/20] HumanEval/19 — PASS | latency=5.44s | error=None

--- Results for ollama-llama ---
  pass@1 : 40.0%  (8/20)
  P95 latency : 5.44s
  Total cost  : €0.000000

Analyze the results

Reports are written to:

reports/quality_results.csv     — one row per problem with pass, latency, tokens, error
reports/quality_summary.json    — aggregate: pass@1, latency percentiles, cost, error breakdown
reports/quality_report.md       — human-readable summary for sharing with the group

Quick inspection in PowerShell:

# Per-problem results
Import-Csv reports/quality_results.csv |
  Select-Object task_id, passed, error_type, latency_seconds |
  Format-Table

# Summary
Get-Content reports/quality_summary.json

# Markdown report
Get-Content reports/quality_report.md
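
For a platform-independent alternative, the per-problem CSV can also be summarized in Python. This sketch assumes the column names used in the PowerShell example (task_id, passed, error_type, latency_seconds), that passed is serialized as True/False, and that P95 is taken by the nearest-rank method; the runner's own percentile method may differ.

```python
import csv
import math
from collections import Counter


def load_rows(path: str = "reports/quality_results.csv") -> list[dict]:
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarize(rows: list[dict]) -> dict:
    total = len(rows)
    # Assumption: the passed column holds "True" or "False".
    passed = sum(1 for row in rows if str(row["passed"]).strip().lower() == "true")
    errors = Counter(
        row["error_type"]
        for row in rows
        if row.get("error_type") not in (None, "", "None")
    )
    latencies = sorted(float(row["latency_seconds"]) for row in rows)
    # Nearest-rank percentile: the value at position ceil(0.95 * n), 1-indexed.
    p95 = latencies[math.ceil(0.95 * total) - 1] if latencies else None
    return {
        "total": total,
        "passed": passed,
        "pass_at_1": passed / total if total else 0.0,
        "p95_latency_seconds": p95,
        "errors": dict(errors),
    }
```

After a run, `summarize(load_rows())` gives a quick cross-check against reports/quality_summary.json.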

Interpreting results

Metric                               What it tells you
pass@1                               Fraction of problems solved correctly on the first attempt
execution_error (AssertionError)     Model generated valid code but wrong logic
execution_error (IndentationError)   Likely a prompt or extraction artifact, not a model failure
execution_timeout                    Model output is too slow or generated an infinite loop
simulator_rejected                   Simulator refused the request (rate limit, quota, capacity)

Key insight: IndentationError failures are often caused by how the model formats its response rather than by a lack of capability. If you see many of these, pass@1 likely underestimates the model's true quality.
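
One way such categories can arise is by running each candidate program in a child process with a time limit. The sketch below is an assumption about the general technique, not the runner's actual code: the function name and return shape are hypothetical, and a plain subprocess is not a security sandbox for untrusted model output.

```python
import subprocess
import sys


def run_candidate(code: str, timeout_seconds: float = 10.0):
    """Run code in a child interpreter; return (passed, error_type, exception_name)."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # The child was killed after exceeding the time limit.
        return False, "execution_timeout", None
    if completed.returncode == 0:
        return True, None, None
    # The last stderr line of an uncaught exception starts with its class name.
    lines = completed.stderr.strip().splitlines()
    last_line = lines[-1] if lines else ""
    exception_name = last_line.split(":", 1)[0].strip() or None
    return False, "execution_error", exception_name
```

In a HumanEval-style check, `code` would be the generated function followed by the benchmark's test code, so a failing assertion surfaces as AssertionError and badly extracted code as IndentationError or SyntaxError.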

Compare multiple models

Run the same command for each model under evaluation, changing only --model:

python -m quality_eval.humaneval_runner --model ollama-llama   --num-problems 20 ...
python -m quality_eval.humaneval_runner --model ollama-gemma4  --num-problems 20 ...
python -m quality_eval.humaneval_runner --model ollama-qwen    --num-problems 20 ...

Then compare reports/quality_summary.json across runs. A reference comparison:

Model           Problems  pass@1   P95 latency   Cost/passed
ollama-llama    20        40.0%    5.44s         €0.000000
ollama-gemma4   20        —        —             —
ollama-qwen     20        —        —             —

Available CLI options

--base-url        Simulator URL (default: http://127.0.0.1:8000)
--model           Model name as defined in config/models.yaml
--user-id         User ID for requests (default: user_vip_01)
--project-id      Project ID for requests (default: vip_project)
--num-problems    Number of HumanEval problems to evaluate, max 164 (default: 20)
--concurrency     Concurrent requests to the simulator (default: 2)
--max-tokens      Max tokens for code generation (default: 1024)
--poll-timeout    Seconds to wait for a queued request (default: 120)
--code-timeout    Seconds allowed for generated code to execute (default: 10)
--local-dataset   Path to local dataset directory or .jsonl/.parquet file

Future Steps

  1. implement config validation
  2. implement persistent reporting dashboards
  3. add Ollama streaming support
  4. add multi-model quality comparison report
