llm-bench

Registry-driven LLM benchmark harness for Apple Silicon local runtimes and OpenAI-compatible endpoints.

This repo keeps model and runtime declarations in models/registry.yaml, runs repeatable speed/memory benchmarks, runs multi-dimensional evals, and publishes the resulting coverage through Streamlit plus a static TanStack/Cloudflare site.

What this is for

Compare local runtime formats such as MLX, GGUF/llama.cpp, DS4, and MTPLX on the same Apple Silicon machine.
Track prompt-processing speed, generation speed, peak memory, wall time, and benchmark version for each scenario.
Run accuracy and capability evals through lm-eval-harness plus external runners for EvalPlus, LiveCodeBench, BigCodeBench, BFCL, SourceQA, LiveBench, KMMLU-Pro, ProgramBench import/eval, and Terminal-Bench.
Include hosted or remote OpenAI-compatible /v1 endpoints in the same registry and reporting pipeline when local artifacts are not required.
Keep coverage explicit: measured, directional, diagnostic, missing, optional, speed-only, and unsupported rows are surfaced separately.

Current scope

Area	Supported today
Speed runners	MLX, GGUF/llama.cpp, DS4, MTPLX, OpenAI-compatible endpoints
Eval runners	`lm-eval-harness`, EvalPlus, LiveCodeBench, BigCodeBench-Hard, BFCL, SourceQA, LiveBench subset, KMMLU-Pro, ProgramBench import/eval, Terminal-Bench
Reporting	Streamlit dashboard, Quarto report, TanStack Start public site on Cloudflare Workers
Primary machine	Apple M5 Max, 128GB unified memory, macOS
Registry	Local HF repos, local GGUF files, split GGUF files, MTPLX speed-only variants, hosted endpoints

Quickstart: local Apple Silicon

git clone https://github.com/haxlys/llm-bench.git ~/llm-bench
cd ~/llm-bench

# System tools (one-time)
brew install llama.cpp
brew install --cask quarto      # optional, only for the static report

# Python env
uv sync

# Inspect what the registry knows and what is already present locally.
uv run python scripts/sync_models.py --check

# Download one small/local target first. MLX variants land in the HF cache;
# GGUF variants land under the registry's {gguf_dir}, default ~/models/gguf/.
uv run python scripts/sync_models.py --variant gemma-4-E4B-gguf-q8

# Smoke test (single scenario, ~1 min) — verifies wiring before the full matrix
uv run python scripts/run_bench.py --variant gemma-4-E4B-gguf-q8 --smoke

# Full speed matrix for locally present variants
uv run python scripts/run_bench.py --all-pending

# Visualize
uv run streamlit run dashboard/app.py
# Opens on http://127.0.0.1:8502 by default.

Use the full registry download only when you intentionally want the whole local matrix; it can require tens or hundreds of GB depending on the variants present in models/registry.yaml.

uv run python scripts/sync_models.py --all-missing

Quickstart: OpenAI-compatible endpoint

For an existing /v1 server or hosted provider, add an endpoint variant to models/registry.yaml:

models:
  - id: my-endpoint-model
    family: hosted
    architecture: dense
    variants:
      - key: my-endpoint-api
        fmt: api
        backend: openai-compatible
        artifact_type: endpoint
        path: https://provider.example/v1
        api_model: provider/model-id
        api_key_env: PROVIDER_API_KEY
        quant: hosted
        tier: hosted
        capabilities: [chat, completions, code_eval_chat, tool_use_eval]

Then run a speed smoke and eval smoke without downloading local model weights:

uv sync
export PROVIDER_API_KEY=...

uv run python scripts/run_bench.py --variant my-endpoint-api --smoke

uv sync --extra evals
uv run python scripts/run_evals.py --variant my-endpoint-api --suite smoke --limit 3

Endpoint speed rows use wall-clock effective token rates because hosted APIs usually do not expose separate prefill/generation timing or local peak memory.
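As a rough illustration of what a wall-clock effective token rate means, the sketch below divides token counts by client-side request time. The `EndpointTiming` record and the key names `effective_tg_tps` / `effective_total_tps` are hypothetical and are not taken from `openai_runner.py`; the actual adapter may compute or name these differently.

```python
from dataclasses import dataclass


@dataclass
class EndpointTiming:
    """Timing for one endpoint request (hypothetical record shape)."""
    prompt_tokens: int
    completion_tokens: int
    wall_s: float  # client-side time from request start to final token


def effective_rates(t: EndpointTiming) -> dict[str, float]:
    """Return wall-clock effective token rates for a single request."""
    if t.wall_s <= 0:
        raise ValueError("wall_s must be positive")
    return {
        # Generated tokens over the whole request time: this folds in
        # network latency and prefill, unlike a local tg_tps.
        "effective_tg_tps": t.completion_tokens / t.wall_s,
        "effective_total_tps": (t.prompt_tokens + t.completion_tokens) / t.wall_s,
    }


rates = effective_rates(
    EndpointTiming(prompt_tokens=1024, completion_tokens=128, wall_s=4.0)
)
print(rates)
```

Because prefill and network time are included in the denominator, these rates are only loosely comparable to local `pp_tps` / `tg_tps` rows.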

Optional reports and quality checks

# Output divergence (quality)
uv sync --extra quality          # pulls sentence-transformers
uv run python scripts/compare_quality.py \
  --gguf-model ~/models/gguf/gemma-4-26B-A4B-it-Q8_0.gguf

# Static report (requires Quarto)
quarto render report/
open report/_site/index.html

Public benchmark website

The public website lives in site/. It is a TanStack Start app built with the Cloudflare Vite plugin, deployed to Cloudflare Workers with Static Assets, and prerendered by TanStack Start during vite build.

Regenerate the typed data export before reviewing or publishing site changes:

uv run python scripts/export_site_public_data.py

Install frontend dependencies once:

cd site
npm install

Run the site locally during development:

cd site
npm run dev

Build and preview the production output:

cd site
npm run build
npm run preview

Deploy to Cloudflare Workers:

cd site
npm run deploy

The Cloudflare Workers configuration sets main to @tanstack/react-start/server-entry with nodejs_compat. The Cloudflare Vite plugin emits the Workers Static Assets configuration into the generated output, so wrangler.jsonc intentionally does not hard-code an assets.directory.

TTFT and ITL columns are present in the speed report, but they display "not measured" until the benchmark runner records those latency fields. Until then, use TG tok/s, wall time, and peak memory for speed comparisons.

Important: stop other GPU/Metal workloads first

Inference benchmarks are extremely sensitive to Metal contention. Before running:

# Confirm nothing else is holding the GPU
lsof -i :8080 -i :8081 -i :8082    # mlx servers in ~/llm-stack
ps aux | grep -iE "mlx|llama|ds4" | grep -v grep

If you run other MLX/DS4 servers (e.g. ~/llm-stack), pause them during the run. Otherwise expect 2–5× slower numbers, single-instance DS4 lock failures, and possible OOM at the 31B+ class.

What gets measured

Per (model, format, scenario):

Metric	Source	Notes
`pp_tps` (prompt processing tok/s)	mlx-lm verbose / `llama-bench` JSON	Synthetic prefill
`tg_tps` (generation tok/s)	mlx-lm verbose / `llama-bench` JSON	Greedy, temp=0
`peak_mem_gb`	`/usr/bin/time -l` max RSS, max'd with `mx.metal.get_peak_memory()` for MLX	Process-level
`wall_s`	`/usr/bin/time -l` real	End-to-end including model load
`cos_sim`	`paraphrase-multilingual-mpnet-base-v2` embedding	Quality script only

Scenarios = prefill ∈ {256, 1024, 4096, 8192} × gen ∈ {128, 512}. 3 measured runs + 1 warmup per scenario.
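The scenario matrix above can be enumerated as a simple cross product. This is a minimal sketch; the constant names are illustrative and may not match those in `src/llm_bench/scenarios.py`.

```python
from itertools import product

# Values taken from the scenario definition above; names are illustrative.
PREFILL_LENGTHS = (256, 1024, 4096, 8192)
GEN_LENGTHS = (128, 512)
MEASURED_RUNS = 3
WARMUP_RUNS = 1

# Every (prefill, gen) pair is one scenario.
scenarios = list(product(PREFILL_LENGTHS, GEN_LENGTHS))

# Each scenario is executed once per warmup and once per measured run.
invocations_per_variant = len(scenarios) * (MEASURED_RUNS + WARMUP_RUNS)

for prefill, gen in scenarios:
    print(f"prefill={prefill:>5} gen={gen:>4}")
print(f"{len(scenarios)} scenarios, {invocations_per_variant} invocations per variant")
```

Under these assumptions a full run costs 8 scenarios and 32 invocations per variant, which is why the smoke path runs a single scenario first.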

Model registry (`models/registry.yaml`)

The single source of truth for what gets benchmarked. Adding a new model:

models:
  - id: qwen-3.6-27b
    family: qwen
    architecture: dense
    params_total_b: 27
    variants:
      - key: qwen-27b-mlx-8bit
        fmt: mlx
        path: mlx-community/Qwen3.6-27B-8bit
        quant: MLX-8bit
        tier: 8bit
        approx_size_gb: 27
      - key: qwen-27b-gguf-q8
        fmt: gguf
        path: "{gguf_dir}/qwen-3.6-27b-Q8_0.gguf"
        quant: Q8_0
        tier: 8bit
        approx_size_gb: 28
        download:
          repo: bartowski/qwen-3.6-27b-GGUF
          pattern: "*Q8_0*.gguf"

Then:

uv run python scripts/sync_models.py --model qwen-3.6-27b
uv run python scripts/run_bench.py --variant qwen-27b-mlx-8bit --variant qwen-27b-gguf-q8
uv run python scripts/run_evals.py --variant qwen-27b-mlx-8bit --variant qwen-27b-gguf-q8 --suite full

Use --task to run an ordered catch-up bucket without re-running the whole suite:

uv run python scripts/run_evals.py --all-variants --suite full \
  --task kmmlu_pro --resilient-ifeval --strict-coverage

MTPLX-ready MLX checkpoints can be benchmarked through the normal MLX runner for apples-to-apples autoregressive speed/eval numbers:

uv run python scripts/run_bench.py \
  --variant qwen-3.6-27b-mtplx-speed-mlx-4bit \
  --variant qwen-3.6-27b-mtplx-optimized-mlx-mixed4

To measure MTPLX speculative decoding itself inside the same speed pipeline, use the paired mtplx-mtp and mtplx-ar variants. The mtplx-mtp rows run native MTP speculative decoding; the mtplx-ar rows run the same MTPLX runtime with MTP disabled as the target-only baseline. These MTPLX MTP/AR variants are speed-only and are skipped by the eval runner; use their paired flat MLX variants for quality coverage.

uv sync --extra mtplx
uv run mtplx pull Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed
uv run python scripts/run_bench.py --smoke --runs 1 --no-warmup \
  --variant qwen-3.6-27b-mtplx-speed-mtplx-mtp \
  --variant qwen-3.6-27b-mtplx-speed-mtplx-ar
uv run python scripts/compare_mtplx.py

Set MTPLX_MAX=1 to request MTPLX's fan-max path when the local ThermalForge setup is available. Without it, results represent the normal no-fan runtime.
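The paired `mtplx-mtp` / `mtplx-ar` rows are meant to be read as a ratio. The sketch below shows one plausible way to compute it, using the median generation rate of each variant. The function name and the choice of median are assumptions; `scripts/compare_mtplx.py` may aggregate differently, and the numbers are made up for illustration, not measured results.

```python
from statistics import median


def paired_speedup(mtp_tg_tps: list[float], ar_tg_tps: list[float]) -> float:
    """Ratio of MTP-on to MTP-off generation speed for one scenario."""
    if not mtp_tg_tps or not ar_tg_tps:
        raise ValueError("both variants need at least one measured run")
    baseline = median(ar_tg_tps)
    if baseline <= 0:
        raise ValueError("baseline tg_tps must be positive")
    return median(mtp_tg_tps) / baseline


# Illustrative tg_tps values for three runs each (not real measurements).
speedup = paired_speedup([60.0, 62.0, 58.0], [30.0, 31.0, 29.0])
print(f"MTP speedup over AR baseline: {speedup:.2f}x")
```

A ratio above 1.0 means speculative decoding helped in that scenario; both rows should come from the same machine state for the ratio to be meaningful.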

Currently shipped:

Key	Model	Format	Quant	Tier
`26B-MoE-mlx-8bit`	gemma-4-26B-A4B-it (MoE)	MLX	8-bit	8bit
`26B-MoE-gguf-q8`	gemma-4-26B-A4B-it (MoE)	GGUF	Q8_0	8bit
`26B-MoE-mlx-4bit`	gemma-4-26B-A4B-it (MoE)	MLX	4-bit	4bit
`26B-MoE-gguf-q4`	gemma-4-26B-A4B-it (MoE)	GGUF	Q4_K_M	4bit
`31B-Dense-mlx-8bit`	gemma-4-31B-it (Dense)	MLX	8-bit	8bit
`31B-Dense-gguf-q8`	gemma-4-31B-it (Dense)	GGUF	Q8_0	8bit
`gemma-4-E4B-gguf-q8`	gemma-4-E4B-it (Dense)	GGUF	Q8_0	8bit
`qwen-3.5-4b-gguf-q8`	qwen-3.5-4B (Dense)	GGUF	Q8_0	8bit
`qwen-3.5-9b-gguf-q8`	qwen-3.5-9B (Dense)	GGUF	Q8_0	8bit
`qwen-3.6-35b-a3b-gguf-q4`	qwen-3.6-35B-A3B (MoE)	GGUF	Q4_K_M	4bit
`qwen-3-next-80b-a3b-instruct-gguf-q4`	Qwen3-Next-80B-A3B-Instruct (MoE)	GGUF	Q4_K_M	4bit
`qwen-3-coder-30b-a3b-instruct-gguf-q4`	Qwen3-Coder-30B-A3B-Instruct (MoE)	GGUF	Q4_K_M	4bit
`qwen-3.6-27b-mtplx-speed-mlx-4bit`	qwen-3.6-27B-MTPLX	MLX	4-bit	4bit
`qwen-3.6-27b-mtplx-speed-mtplx-mtp`	qwen-3.6-27B-MTPLX	MTPLX	4-bit MTP-on	4bit
`qwen-3.6-27b-mtplx-speed-mtplx-ar`	qwen-3.6-27B-MTPLX	MTPLX	4-bit MTP-off	4bit
`qwen-3.6-27b-mtplx-optimized-mlx-mixed4`	qwen-3.6-27B-MTPLX	MLX	mixed 4/8-bit	4bit
`qwen-3.6-27b-mtplx-optimized-mtplx-mtp`	qwen-3.6-27B-MTPLX	MTPLX	mixed 4/8-bit MTP-on	4bit
`qwen-3.6-27b-mtplx-optimized-mtplx-ar`	qwen-3.6-27B-MTPLX	MTPLX	mixed 4/8-bit MTP-off	4bit
`qwen-3-coder-next-gguf-q4`	Qwen3-Coder-Next (MoE)	GGUF	Q4_K_M	4bit
`deepseek-v4-flash-gguf-iq2xxs`	DeepSeek-V4-Flash (MoE)	GGUF DS4 imatrix	IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8	2bit
`gpt-oss-20b-gguf-q4`	gpt-oss-20b (MoE)	GGUF	Q4_K_M	4bit
`gpt-oss-120b-gguf-q4`	gpt-oss-120b (MoE, split GGUF)	GGUF	Q4_K_M	4bit
`nemotron-3-nano-omni-30b-a3b-reasoning-gguf-q4`	Nemotron-3-Nano-Omni-30B-A3B-Reasoning (MoE)	GGUF	UD-Q4_K_M	4bit

tier pairs each MLX N-bit variant with the comparable GGUF quant (for example MLX 8-bit ↔ Q8_0 and MLX 4-bit ↔ Q4_K_M) for fair runtime comparisons. The dashboard Catalog page shows registry × measurement status at a glance.
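To make the tier pairing concrete, the sketch below groups a few rows from the table by (model, tier) and keeps only groups that have both an MLX and a GGUF variant. The dict shape is hypothetical and is not the registry's `Variant` dataclass.

```python
from collections import defaultdict

# A few rows from the shipped table, in a simplified hypothetical shape.
VARIANTS = [
    {"key": "26B-MoE-mlx-8bit", "model": "gemma-4-26B-A4B-it", "fmt": "mlx", "tier": "8bit"},
    {"key": "26B-MoE-gguf-q8", "model": "gemma-4-26B-A4B-it", "fmt": "gguf", "tier": "8bit"},
    {"key": "26B-MoE-mlx-4bit", "model": "gemma-4-26B-A4B-it", "fmt": "mlx", "tier": "4bit"},
    {"key": "26B-MoE-gguf-q4", "model": "gemma-4-26B-A4B-it", "fmt": "gguf", "tier": "4bit"},
    {"key": "gemma-4-E4B-gguf-q8", "model": "gemma-4-E4B-it", "fmt": "gguf", "tier": "8bit"},
]


def comparable_pairs(variants):
    """Map (model, tier) -> {fmt: key} for groups with both MLX and GGUF."""
    groups = defaultdict(dict)
    for v in variants:
        groups[(v["model"], v["tier"])][v["fmt"]] = v["key"]
    return {
        group: fmts
        for group, fmts in groups.items()
        if {"mlx", "gguf"} <= set(fmts)
    }


pairs = comparable_pairs(VARIANTS)
for (model, tier), fmts in pairs.items():
    print(f"{model} [{tier}]: {fmts['mlx']} vs {fmts['gguf']}")
```

Variants without a counterpart in the other format, such as `gemma-4-E4B-gguf-q8` here, simply drop out of the runtime comparison.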

For generic benchmark use, each variant may also declare:

backend: openai-compatible     # runtime adapter; defaults to fmt
artifact_type: endpoint        # hf_repo, gguf_file, endpoint, ...
capabilities: [chat, completions, logprobs]
api_model: provider/model-id   # optional model= label for endpoint APIs
api_key_env: PROVIDER_API_KEY  # optional env var copied to Authorization/OPENAI_API_KEY

Existing mlx and gguf variants infer these fields automatically. Speed benchmark adapters cover MLX, GGUF, DS4, MTPLX, and OpenAI-compatible endpoints. Unsupported backends are rejected explicitly, so new adapters can be added without silently misrouting results. Endpoint speed uses wall-clock effective token rates because hosted APIs generally do not expose separate prefill/generation timings. Eval runs can use openai-compatible endpoint variants directly; the endpoint is treated as an existing /v1 server and no local subprocess is spawned.

Idempotency

Every measurement records the current bench_version (currently 0.3). run_bench.py --skip-existing (default ON) skips combos that already have N runs at that version; --all-pending runs only what's missing across the registry. Bumping BENCH_VERSION in src/llm_bench/__init__.py triggers a full re-measurement when methodology changes.
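The skip logic described above can be sketched as a count of stored runs per (variant, scenario) at the current version. This is a simplified illustration: the record fields, the scenario labels, the string type of the version, and the threshold of 3 runs are assumptions, and the real logic in `src/llm_bench/manifest.py` may differ.

```python
from collections import Counter

BENCH_VERSION = "0.3"  # assumed to be a string; the real constant may differ
REQUIRED_RUNS = 3      # assumed to match the 3 measured runs per scenario


def pending_combos(wanted, records, version=BENCH_VERSION, required=REQUIRED_RUNS):
    """Return the (variant, scenario) combos that still need measuring."""
    done = Counter(
        (r["variant"], r["scenario"])
        for r in records
        if r["bench_version"] == version  # older versions do not count
    )
    return [combo for combo in wanted if done[combo] < required]


# Hypothetical stored runs: one combo is complete, one is from an old version.
records = (
    [{"variant": "a", "scenario": "pp256_tg128", "bench_version": "0.3"}] * 3
    + [{"variant": "a", "scenario": "pp1024_tg128", "bench_version": "0.2"}] * 3
)
wanted = [("a", "pp256_tg128"), ("a", "pp1024_tg128")]
pending = pending_combos(wanted, records)
print(pending)
```

Filtering on the version is what makes a `BENCH_VERSION` bump invalidate every earlier row without deleting any files.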

Repository layout

models/
  registry.yaml             # single source of truth — add a model here
src/llm_bench/
  __init__.py               # BENCH_VERSION constant
  registry.py               # YAML loader + Variant/Model dataclasses
  manifest.py               # idempotency: which (variant, scenario) is measured
  index.py                  # build results/index.json (registry × status)
  eval_plan.py              # ordered catch-up plan from coverage gaps
  site_data.py              # typed public-site data export
  reporting.py              # shared ordering/report helpers
  runners/                  # speed/memory benchmark
    base.py                 # BenchResult + /usr/bin/time -l wrapper
    mlx_runner.py           # mlx_lm.generate subprocess
    gguf_runner.py          # llama-bench subprocess
    ds4_runner.py           # DeepSeek V4 Flash-specific ds4-bench adapter
    mtplx_runner.py         # MTPLX MTP-on / target-only AR speed adapter
    openai_runner.py        # OpenAI-compatible endpoint speed adapter
  evals/                    # multi-dim accuracy (lm-eval-harness)
    server.py               # ModelServer (mlx_lm.server | llama-server)
    lmeval.py               # lm_eval CLI wrapper
    suites.py               # SMOKE/FULL task lists, capability gating
    aggregate.py            # eval JSON → tidy DataFrame
    evalplus_runner.py      # HumanEval / MBPP via EvalPlus
    livecodebench_runner.py # contamination-fresh coding eval adapter
    bigcodebench_runner.py  # BigCodeBench-Hard adapter
    bfcl.py                 # BFCL v4 tool-use adapter
    sourceqa.py             # pinned-repo evidence QA diagnostic
    livebench_runner.py     # LiveBench subset adapter
    kmmlu_pro_runner.py     # KMMLU-Pro direct runner
    programbench_runner.py  # ProgramBench import/eval helpers
    terminal_bench_runner.py # Terminal-Bench run + import helpers
    trace.py                # per-task execution ledger
  prompts.py                # 20 quality-comparison prompts (KO/EN)
  scenarios.py              # speed scenario matrices
  aggregate.py              # speed raw JSON → summary CSV
scripts/
  sync_models.py            # registry-driven hf download
  run_bench.py              # speed CLI: --variant / --all-pending
  run_evals.py              # eval CLI: --variant / --all-variants
  run_evals_overnight.sh    # launchd stop → run → aggregate → restore
  preflight.py              # limit=2 wiring check for one local variant
  compare_quality.py        # cos-sim divergence (20 prompts)
  compare_mtplx.py          # paired MTPLX MTP vs AR summary
  aggregate_evals.py        # eval JSON → CSVs + index.json
  build_index.py            # build only the index
  plan_eval_catchup.py      # write results/eval_catchup_plan.{json,md}
  run_programbench.py       # run upstream ProgramBench eval + import
  run_terminal_bench.py     # run Terminal-Bench + import result
  import_programbench.py    # import existing ProgramBench eval JSON
  export_site_public_data.py # regenerate site/public/data and TS fixture
results/
  raw/                      # per-run speed JSON (gitignored)
  summary.csv               # speed aggregated (committed)
  mtplx_speedups.csv        # paired MTPLX MTP/AR summary (committed)
  quality_*.json            # gitignored
  eval_scores/              # lm-eval outputs (gitignored)
  eval_traces/              # per-task execution ledger
  eval_summary_*.csv        # eval aggregated (committed)
  index.json                # registry × measurement status (committed)
  eval_catchup_plan.*       # generated catch-up queue (committed)
  server_logs/              # gitignored
  overnight_logs/           # gitignored
dashboard/
  app.py                    # Streamlit (13 pages, Catalog first)
site/
  app/                      # TanStack Start public benchmark site
  public/data/              # exported benchmark JSON/CSV for the site
  wrangler.jsonc            # Cloudflare Workers deployment config
report/
  _quarto.yml
  index.qmd                 # static HTML report (Quarto)
docs/
  methodology.md            # measurement protocol
  model_policy.md           # local model selection policy and caveats
  overnight_plan.md         # family-batched eval catch-up plan

Multi-dimensional evals (added v0.2)

Runs lm-eval-harness against an OpenAI-compatible server (mlx_lm.server for MLX, llama-server for GGUF) booted ad-hoc per model variant.

Dimension	Tasks (chat-compatible)	Loglikelihood-only (gguf only)
Reasoning	`mmlu_generative`, `gsm8k_cot_zeroshot`	`hellaswag`, `leaderboard_mmlu_pro`, `leaderboard_gpqa_diamond`
Korean	`kmmlu_direct`, `hrm8k`	`haerae`, `kobest`
Code	primary: `humaneval` / `mbpp` (EvalPlus), `bigcodebench_hard`; optional legacy: `livecodebench`	—
Instruction	`leaderboard_ifeval`	—
Long context	`longbench` (21 sub-tasks, EN+ZH)	—
Safety	`truthfulqa-multi_gen_en`	`toxigen`
Tool use	`bfcl` (BFCL v4, opt-in via `--include-bfcl`)	—
Diagnostic source grounding	`sourceqa` (pinned-repo evidence QA, deterministic checker)	—
Fresh eval	`livebench_subset` (LiveBench non-agentic subset)	—
Agentic code	primary: `terminal_bench` (Docker-backed terminal tasks); optional: `programbench` eval + result import	—
Korean professional	`kmmlu_pro` (KMMLU-Pro weighted MCQ)	—

The reasoning + instruction additions mirror HF Open LLM Leaderboard v2 (MMLU-Pro / GPQA-Diamond / IFEval). BigCodeBench-Hard is the primary practical code-generation task; LiveCodeBench remains available as a legacy contest-code baseline. Terminal-Bench is the primary agentic code task, and BFCL fills the tool-use dimension.

mlx_lm.server does not return token logprobs in /v1/completions, so loglikelihood-based MCQ tasks (hellaswag, kobest, haerae, toxigen, leaderboard_mmlu_pro, leaderboard_gpqa_diamond) only run on the GGUF path. Generative variants are used for the rest so both runtimes get apples-to-apples coverage.
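The chat-vs-loglikelihood split amounts to filtering tasks by whether the variant's server can return logprobs. The sketch below is illustrative: the task names come from the paragraph above and `logprobs` is the capability name used in the registry, but the function is hypothetical and is not the gating code in `src/llm_bench/evals/suites.py`.

```python
# Tasks that need token logprobs, as listed above.
LOGLIKELIHOOD_TASKS = {
    "hellaswag",
    "kobest",
    "haerae",
    "toxigen",
    "leaderboard_mmlu_pro",
    "leaderboard_gpqa_diamond",
}


def runnable_tasks(tasks, capabilities):
    """Keep generative tasks always; keep loglikelihood tasks only with logprobs."""
    has_logprobs = "logprobs" in capabilities
    return [t for t in tasks if has_logprobs or t not in LOGLIKELIHOOD_TASKS]


suite = ["gsm8k_cot_zeroshot", "hellaswag", "leaderboard_ifeval", "toxigen"]
mlx_tasks = runnable_tasks(suite, {"chat", "completions"})
gguf_tasks = runnable_tasks(suite, {"chat", "completions", "logprobs"})
print("mlx: ", mlx_tasks)
print("gguf:", gguf_tasks)
```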

⚠️ Code-eval safety. Code-family tasks (humaneval, mbpp, bigcodebench_hard, livecodebench) run through dedicated external runners outside lm-eval. They still execute model-generated code or benchmark harness tooling, so use trusted checkpoints and keep sandboxing in mind. --suite full skips these for unsupported backends and also supports --skip-existing gating for repeatability.

Setup:

uv sync --extra evals

# Optional, for the frontier external runners:
uv pip install bfcl-eval==2025.12.17                                          # BFCL v4 (tool use)
uv pip install git+https://github.com/LiveCodeBench/LiveCodeBench.git          # LiveCodeBench (contamination-free code)
uv pip install bigcodebench --upgrade                                          # BigCodeBench-Hard (practical code)
uv pip install "terminal-bench>=0.2.18"                                        # Terminal-Bench (agentic terminal tasks)
git clone https://github.com/LiveBench/LiveBench.git /path/to/LiveBench        # LiveBench subset
export LIVEBENCH_REPO=/path/to/LiveBench

kmmlu_pro uses the datasets and openai packages already installed by uv sync --extra evals, but the dataset is gated: request access to LGAI-EXAONE/KMMLU-Pro on Hugging Face and authenticate with hf auth login or HF_TOKEN before running it.

Smoke (verify wiring, ~10 min, limit=2 per task):

uv run python scripts/run_evals.py --variant 26B-MoE-mlx-8bit --suite smoke --limit 2

Frontier external smoke (LiveBench / BigCodeBench-Hard / KMMLU-Pro with a small per-task cap):

uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full --limit 2

# Optional: resilient instruction eval and strict coverage check
uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full \
  --resilient-ifeval --strict-coverage

--suite full --limit N is the smoke path for these external runners. EvalPlus is skipped under a limit because its upstream CLI does not provide a compatible partial matrix, while LiveBench, BigCodeBench-Hard, and KMMLU-Pro do run with the cap.

For stricter governance on a full matrix run, add --strict-coverage so the run exits non-zero when any required supported primary task is missing a completed result (for example, external runner unavailable, limit-incompatible skip, or hard task error). Optional lanes (livecodebench, BFCL, LiveBench subset, ProgramBench) are reported in coverage but do not block the primary matrix. bigcodebench_hard and terminal_bench are primary rows. MTPLX MTP/AR rows are reported as speed_only under the mtplx_speedup lane and also do not block coverage.
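A strict coverage check of this kind reduces to "does any primary task lack a completed result?". The sketch below assumes a hypothetical task-to-status mapping and treats `measured` and `unsupported` as non-blocking. Only two of the primary tasks are listed for brevity, and the real `--strict-coverage` implementation may classify statuses differently.

```python
# Illustrative subset of primary tasks; the real primary matrix is larger.
PRIMARY_TASKS = ("bigcodebench_hard", "terminal_bench")
# Assumption: unsupported tasks are not required, so they do not block.
NON_BLOCKING = {"measured", "unsupported"}


def blocking_tasks(status_by_task, primary=PRIMARY_TASKS):
    """Primary tasks without a completed result; absent tasks count as missing."""
    return [
        task
        for task in primary
        if status_by_task.get(task, "missing") not in NON_BLOCKING
    ]


def strict_exit_code(status_by_task):
    """Non-zero when any primary task blocks, mirroring a failing CLI run."""
    return 1 if blocking_tasks(status_by_task) else 0


status = {
    "bigcodebench_hard": "measured",
    "terminal_bench": "missing",
    "livecodebench": "missing",  # optional lane: reported, never blocking
}
blocked = blocking_tasks(status)
code = strict_exit_code(status)
print(blocked, code)
```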

Terminal-Bench is the maintained agentic terminal benchmark path. It runs tasks inside Docker and talks to the same OpenAI-compatible model server as the rest of llm-bench. The wrapper defaults to one task to keep smoke tests cheap:

uv sync --extra terminalbench

# If Docker uses Colima, expose the active socket to the Python Docker SDK.
export DOCKER_HOST=unix://$HOME/.colima/default/docker.sock

uv run python scripts/run_terminal_bench.py \
  --variant 26B-MoE-gguf-q8 \
  --task-id hello-world
uv run python scripts/aggregate_evals.py

The same runner is available from the full eval CLI. In --suite full, Terminal-Bench is part of the primary matrix and defaults to one sampled task unless TERMINAL_BENCH_TASK_IDS, TERMINAL_BENCH_N_TASKS, or TERMINAL_BENCH_FULL=1 is set:

TERMINAL_BENCH_TASK_IDS=hello-world \
uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full --task terminal_bench

ProgramBench is agentic: the model/agent must first produce a complete <instance_id>/submission.tar.gz codebase, then ProgramBench evaluates it in Docker. llm-bench wraps the evaluation and import step:

uv sync --extra programbench
uv run python scripts/run_programbench.py \
  --variant 26B-MoE-gguf-q8 \
  --source-dir /path/to/programbench/submission-run \
  --tasks-dir /path/to/ProgramBench/src/programbench/data/tasks \
  --workers 4 \
  --branch-workers 2 \
  --docker-cpus 8 \
  --limit 5
uv run python scripts/aggregate_evals.py

If the ProgramBench eval JSON files were produced elsewhere, import them directly:

uv run python scripts/import_programbench.py \
  --variant 26B-MoE-gguf-q8 \
  --source-dir /path/to/programbench/evaluated-run \
  --tasks-dir /path/to/ProgramBench/src/programbench/data/tasks

The headline ProgramBench metric is resolved_rate,none (fully solved program-rebuild tasks). almost_resolved_rate,none and avg_test_pass_rate,none are supporting diagnostics, not the primary ranker. Pass --tasks-dir when available so ignored ProgramBench branches/tests are excluded the same way as programbench info.

Full overnight matrix (all locally present variants × full suite): the wrapper script handles the optional launchd bootout, the run, and the launchd bootstrap automatically. It always restores agents on EXIT, even if the eval fails.

# Foreground (watch progress):
bash scripts/run_evals_overnight.sh

# Detached overnight (recommended):
nohup bash scripts/run_evals_overnight.sh > /tmp/llm-evals-overnight.log 2>&1 &
disown
tail -f /tmp/llm-evals-overnight.log

See docs/overnight_plan.md for the family-batched catch-up plan and optional lane schedule.

To generate the concrete catch-up queue from the current coverage index:

uv run python scripts/plan_eval_catchup.py

Env overrides:

SUITE=smoke|full (default full)
LIMIT=N (per-task sample cap)
VARIANTS="26B-MoE-mlx-8bit 26B-MoE-gguf-q8" (subset, default = all)
TASKS="kmmlu_pro" (task-filtered catch-up bucket)
LLM_BENCH_STRICT_COVERAGE=1 — pass --strict-coverage to run_evals.py
LLM_BENCH_RESILIENT_IFEVAL=1 — pass --resilient-ifeval to run_evals.py
LLM_BENCH_INCLUDE_BFCL=1 — pass --include-bfcl for the BFCL optional lane
TERMINAL_BENCH_TASK_IDS="hello-world" — comma/space-separated task filter
TERMINAL_BENCH_N_TASKS=N or TERMINAL_BENCH_FULL=1 — Terminal-Bench task count
TERMINAL_BENCH_MODEL=openai/<model> — LiteLLM model label override
TERMINAL_BENCH_DOCKER_HOST=unix://... — Docker socket override
LIVE_CODE_BENCH_REPO=/path/to/LiveCodeBench — run source checkout version
LIVE_CODE_BENCH_START_DATE=YYYY-MM-DD, LIVE_CODE_BENCH_END_DATE=YYYY-MM-DD, LIVE_CODE_BENCH_MAX_TOKENS=N — run a reproducible release window
LIVE_CODE_BENCH_RELEASE=release_vX — override run_livecodebench dataset release (default from src/llm_bench/evals/livecodebench_runner.py)
LIVE_CODE_BENCH_NOT_FAST=1 — use the original non-lite LiveCodeBench code generation benchmark instead of the upstream default fast/lite setting
LIVEBENCH_REPO=/path/to/LiveBench, LIVEBENCH_RELEASE=YYYY-MM-DD, LIVEBENCH_MAX_TOKENS=N — LiveBench checkout and release selection
BIGCODEBENCH_EXECUTION=gradio|local|e2b, BIGCODEBENCH_GRADIO_ENDPOINT=https://... — BigCodeBench execution backend
KMMLU_PRO_MAX_TOKENS=N — maximum chat tokens for each KMMLU-Pro response
LAUNCH_AGENTS="com.you.foo com.you.bar" — launchd agent labels to stop before the run and restart at the end. Default empty = no launchd management; stop GPU-using processes manually instead.

Each variant boots its own server on port 9090; tasks run sequentially per variant. Expect ~2–3 hours per variant for the full suite.

sourceqa is a lightweight diagnostic runner inspired by repo-search benchmarks: it clones pinned source repositories, injects curated evidence files into a chat prompt, and writes deterministic acc,none / recall metrics to the same results_*.json shape as lm-eval. Because the current task set is small and saturated, SourceQA is kept for smoke/regression checks and excluded from headline ranking and primary coverage debt. Optional judge metadata can be recorded with --sourceqa-judge-model, but it does not affect the diagnostic score.

Results:

results/eval_scores/<run_id>/<task>/.../results_*.json — raw lm-eval output
results/eval_scores/summary_*.json — flat list of {variant, task, results}
results/eval_traces/<run_id>.jsonl — per-task execution ledger with status, wall time, artifacts, and errors
results/eval_summary_full.csv — every metric × subtask × variant (244+ rows)
results/eval_summary_primary.csv — one row per (variant, task), canonical metric
results/index.json — registry × speed/eval coverage with measured, directional, missing, optional, speed_only, and unsupported statuses
results/eval_catchup_plan.json / .md — ordered commands generated from the current coverage gaps
results/server_logs/<run_id>.log — model server stderr for debugging

After the eval run, scripts/aggregate_evals.py rebuilds the CSVs and the coverage index. The overnight wrapper calls this for you.
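For a quick look at a run without opening the dashboard, the per-task ledger can be summarized directly. This sketch assumes each JSONL line is an object with `status` and `wall_s` fields; those field names and the status values are guesses based on the description of the ledger above, so check an actual trace file before relying on them.

```python
import json
from collections import Counter


def summarize_trace(lines):
    """Count entries per status and total wall time across a JSONL ledger."""
    statuses = Counter()
    total_wall_s = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        statuses[entry["status"]] += 1           # assumed field name
        total_wall_s += float(entry.get("wall_s", 0.0))  # assumed field name
    return dict(statuses), total_wall_s


# Hypothetical ledger lines; real traces may carry more fields.
sample = [
    '{"task": "kmmlu_pro", "status": "ok", "wall_s": 120.5}',
    '{"task": "bfcl", "status": "error", "wall_s": 3.5, "error": "runner unavailable"}',
    "",
]
counts, total = summarize_trace(sample)
print(counts, total)
```

To run it against a real file, pass an open file handle for `results/eval_traces/<run_id>.jsonl` in place of `sample`.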

Dashboard (13 pages)

uv run streamlit run dashboard/app.py
# Opens on http://127.0.0.1:8502 by default.

Group	Page	What it shows
Status	Model Status	Model-first benchmark progress, task matrix, and weak/missing task list
Status	Model Compare	Side-by-side model comparison for speed, eval scores, and coverage debt
Status	Catalog	Registry × measurement progress bars (entry point)
Speed	Speed Overview	TG/PP bar charts, peak memory, Pareto scatter
Speed	Speed Scaling	Context-length sweep by runtime
Speed	Output Quality (cos sim)	Optional paired response similarity
Speed	Speed Raw	Per-run JSON table + CSV download
Eval	Evals Heatmap	Variant × task primary-score grid
Eval	Evals · Runtime Compare	Score delta within same model+tier across backend/fmt/artifact groups
Eval	Evals · Quantization	8bit vs 4bit accuracy hit per model/runtime
Eval	Evals · Dimension	Per-dim bar charts with stderr
Eval	Evals · LongBench Detail	21 sub-task breakdown
Eval	Evals Raw	Full metrics filterable table + CSV

Methodology

See docs/methodology.md for measurement protocol, sanity checks, scenario matrix rationale, and the chat-vs-loglikelihood split. See docs/model_policy.md for the current local model selection policy and headline eval scores.

Tests

uv run pytest tests/ -v

The pytest suite covers registry validation, manifest/idempotency behavior, speed runners, eval runners, aggregation, ProgramBench import/eval helpers, site data export, and public-site data contracts.

License

MIT
