Registry-driven LLM benchmark harness for Apple Silicon local runtimes and OpenAI-compatible endpoints.
This repo keeps model and runtime declarations in models/registry.yaml, runs
repeatable speed/memory benchmarks, runs multi-dimensional evals, and publishes
the resulting coverage through Streamlit plus a static TanStack/Cloudflare site.
- Compare local runtime formats such as MLX, GGUF/llama.cpp, DS4, and MTPLX on the same Apple Silicon machine.
- Track prompt-processing speed, generation speed, peak memory, wall time, and benchmark version for each scenario.
- Run accuracy and capability evals through lm-eval-harness plus external runners for EvalPlus, LiveCodeBench, BigCodeBench, BFCL, SourceQA, LiveBench, KMMLU-Pro, ProgramBench import/eval, and Terminal-Bench.
- Include hosted or remote OpenAI-compatible /v1 endpoints in the same registry and reporting pipeline when local artifacts are not required.
- Keep coverage explicit: measured, directional, diagnostic, missing, optional, speed-only, and unsupported rows are surfaced separately.
| Area | Supported today |
|---|---|
| Speed runners | MLX, GGUF/llama.cpp, DS4, MTPLX, OpenAI-compatible endpoints |
| Eval runners | lm-eval-harness, EvalPlus, LiveCodeBench, BigCodeBench-Hard, BFCL, SourceQA, LiveBench subset, KMMLU-Pro, ProgramBench import/eval, Terminal-Bench |
| Reporting | Streamlit dashboard, Quarto report, TanStack Start public site on Cloudflare Workers |
| Primary machine | Apple M5 Max, 128GB unified memory, macOS |
| Registry | Local HF repos, local GGUF files, split GGUF files, MTPLX speed-only variants, hosted endpoints |
git clone https://github.com/haxlys/llm-bench.git ~/llm-bench
cd ~/llm-bench
# System tools (one-time)
brew install llama.cpp
brew install --cask quarto # optional, only for the static report
# Python env
uv sync
# Inspect what the registry knows and what is already present locally.
uv run python scripts/sync_models.py --check
# Download one small/local target first. MLX variants land in the HF cache;
# GGUF variants land under the registry's {gguf_dir}, default ~/models/gguf/.
uv run python scripts/sync_models.py --variant gemma-4-E4B-gguf-q8
# Smoke test (single scenario, ~1 min) — verifies wiring before the full matrix
uv run python scripts/run_bench.py --variant gemma-4-E4B-gguf-q8 --smoke
# Full speed matrix for locally present variants
uv run python scripts/run_bench.py --all-pending
# Visualize
uv run streamlit run dashboard/app.py
# Opens on http://127.0.0.1:8502 by default.

Use the full registry download only when you intentionally want the whole local
matrix; it can require tens or hundreds of GB depending on the variants present
in models/registry.yaml.
uv run python scripts/sync_models.py --all-missing

For an existing /v1 server or hosted provider, add an endpoint variant to
models/registry.yaml:
models:
  - id: my-endpoint-model
    family: hosted
    architecture: dense
    variants:
      - key: my-endpoint-api
        fmt: api
        backend: openai-compatible
        artifact_type: endpoint
        path: https://provider.example/v1
        api_model: provider/model-id
        api_key_env: PROVIDER_API_KEY
        quant: hosted
        tier: hosted
        capabilities: [chat, completions, code_eval_chat, tool_use_eval]

Then run a speed smoke and eval smoke without downloading local model weights:
uv sync
export PROVIDER_API_KEY=...
uv run python scripts/run_bench.py --variant my-endpoint-api --smoke
uv sync --extra evals
uv run python scripts/run_evals.py --variant my-endpoint-api --suite smoke --limit 3

Endpoint speed rows use wall-clock effective token rates because hosted APIs usually do not expose separate prefill/generation timing or local peak memory.
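As a rough illustration of what a wall-clock effective rate means, the sketch below divides a token count by the total elapsed time of a request. The function names and the choice of token count are assumptions made for illustration; the harness may define its endpoint rates differently.

```python
import time
from typing import Callable


def effective_tps(tokens: int, wall_seconds: float) -> float:
    """Tokens per second over the whole request (prefill and generation combined)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return tokens / wall_seconds


def timed_call(request: Callable[[], int]) -> float:
    """Time a request and return its effective tok/s.

    `request` is a hypothetical callable that performs one API call and
    returns the number of completion tokens reported by the endpoint.
    """
    start = time.perf_counter()
    tokens = request()
    elapsed = time.perf_counter() - start
    return effective_tps(tokens, elapsed)
```

Because the clock covers network latency, queueing, and prefill as well as generation, a rate computed this way is not directly comparable to the local pp_tps and tg_tps columns.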
# Output divergence (quality)
uv sync --extra quality # pulls sentence-transformers
uv run python scripts/compare_quality.py \
--gguf-model ~/models/gguf/gemma-4-26B-A4B-it-Q8_0.gguf
# Static report (requires Quarto)
quarto render report/
open report/_site/index.html

The public website lives in site/. It is a TanStack Start app built with the
Cloudflare Vite plugin, deployed to Cloudflare Workers with Static Assets, and
prerendered by TanStack Start during vite build.
Regenerate the typed data export before reviewing or publishing site changes:
uv run python scripts/export_site_public_data.py

Install frontend dependencies once:

cd site
npm install

Run the site locally during development:

cd site
npm run dev

Build and preview the production output:

cd site
npm run build
npm run preview

Deploy to Cloudflare Workers:

cd site
npm run deploy

The Cloudflare Workers configuration sets main to @tanstack/react-start/server-entry with
nodejs_compat. The Cloudflare Vite plugin emits the Workers Static Assets configuration into the
generated output, so wrangler.jsonc intentionally does not hard-code an assets.directory.
TTFT and ITL columns are present in the speed report, but they display "not measured" until the benchmark runner records those latency fields. Until then,
use TG tok/s, wall time, and peak memory for speed comparisons.
Inference benchmarks are extremely sensitive to Metal contention. Before running:
# Confirm nothing else is holding the GPU
lsof -i :8080 -i :8081 -i :8082 # mlx servers in ~/llm-stack
ps aux | grep -iE "mlx|llama|ds4" | grep -v grep

If you run other MLX/DS4 servers (e.g. ~/llm-stack), pause them during the run. Otherwise expect 2–5× slower numbers, single-instance DS4 lock failures, and possible OOM at the 31B+ class.
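The port check can also be scripted before a run. The following is a minimal sketch, not part of the repo: it only probes the IPv4 loopback address, so it is narrower than the lsof command, and the helper names are hypothetical.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=(8080, 8081, 8082)) -> list[int]:
    """Subset of the given ports that currently have a listener."""
    return [p for p in ports if port_in_use(p)]


if __name__ == "__main__":
    busy = busy_ports()
    if busy:
        print(f"Servers still listening on {busy}; pause them before benchmarking.")
```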
Per (model, format, scenario):
| Metric | Source | Notes |
|---|---|---|
| pp_tps (prompt processing tok/s) | mlx-lm verbose / llama-bench JSON | Synthetic prefill |
| tg_tps (generation tok/s) | mlx-lm verbose / llama-bench JSON | Greedy, temp=0 |
| peak_mem_gb | /usr/bin/time -l max RSS, max'd with mx.metal.get_peak_memory() for MLX | Process-level |
| wall_s | /usr/bin/time -l real | End-to-end including model load |
| cos_sim | paraphrase-multilingual-mpnet-base-v2 embedding | Quality script only |
Scenarios = prefill ∈ {256, 1024, 4096, 8192} × gen ∈ {128, 512}. 3 measured runs + 1 warmup per scenario.
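The scenario matrix above works out to 8 scenarios and 32 runs per variant (3 measured plus 1 warmup each). A short sketch of that arithmetic, with constant names chosen for illustration rather than taken from scenarios.py:

```python
from itertools import product

# Values copied from the scenario definition above.
PREFILL_TOKENS = (256, 1024, 4096, 8192)
GEN_TOKENS = (128, 512)
WARMUP_RUNS = 1
MEASURED_RUNS = 3

# Cartesian product: every prefill length paired with every generation length.
scenarios = list(product(PREFILL_TOKENS, GEN_TOKENS))

# Each scenario is executed once for warmup and three times for measurement.
runs_per_variant = len(scenarios) * (WARMUP_RUNS + MEASURED_RUNS)

print(len(scenarios), runs_per_variant)
```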
The single source of truth for what gets benchmarked. Adding a new model:
models:
  - id: qwen-3.6-27b
    family: qwen
    architecture: dense
    params_total_b: 27
    variants:
      - key: qwen-27b-mlx-8bit
        fmt: mlx
        path: mlx-community/Qwen3.6-27B-8bit
        quant: MLX-8bit
        tier: 8bit
        approx_size_gb: 27
      - key: qwen-27b-gguf-q8
        fmt: gguf
        path: "{gguf_dir}/qwen-3.6-27b-Q8_0.gguf"
        quant: Q8_0
        tier: 8bit
        approx_size_gb: 28
        download:
          repo: bartowski/qwen-3.6-27b-GGUF
          pattern: "*Q8_0*.gguf"

Then:
uv run python scripts/sync_models.py --model qwen-3.6-27b
uv run python scripts/run_bench.py --variant qwen-27b-mlx-8bit --variant qwen-27b-gguf-q8
uv run python scripts/run_evals.py --variant qwen-27b-mlx-8bit --variant qwen-27b-gguf-q8 --suite full

Use --task to run an ordered catch-up bucket without re-running the whole
suite:
uv run python scripts/run_evals.py --all-variants --suite full \
  --task kmmlu_pro --resilient-ifeval --strict-coverage

MTPLX-ready MLX checkpoints can be benchmarked through the normal MLX runner for apples-to-apples autoregressive speed/eval numbers:
uv run python scripts/run_bench.py \
--variant qwen-3.6-27b-mtplx-speed-mlx-4bit \
  --variant qwen-3.6-27b-mtplx-optimized-mlx-mixed4

To measure MTPLX speculative decoding itself inside the same speed pipeline,
use the paired mtplx-mtp and mtplx-ar variants. The mtplx-mtp rows run
native MTP speculative decoding; the mtplx-ar rows run the same MTPLX runtime
with MTP disabled as the target-only baseline.
These MTPLX MTP/AR variants are speed-only and are skipped by the eval runner;
use their paired flat MLX variants for quality coverage.
uv sync --extra mtplx
uv run mtplx pull Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed
uv run python scripts/run_bench.py --smoke --runs 1 --no-warmup \
--variant qwen-3.6-27b-mtplx-speed-mtplx-mtp \
--variant qwen-3.6-27b-mtplx-speed-mtplx-ar
uv run python scripts/compare_mtplx.py

Set MTPLX_MAX=1 to request MTPLX's fan-max path when the local ThermalForge
setup is available. Without it, results represent the normal no-fan runtime.
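Conceptually, the paired comparison reduces to a ratio of generation speeds between an mtplx-mtp row and its mtplx-ar baseline. The sketch below shows one way to pair keys and compute that ratio. It relies only on the key suffixes used in the commands above; the function names are hypothetical, and compare_mtplx.py may define its summary differently.

```python
from typing import Optional

MTP_SUFFIX = "-mtplx-mtp"
AR_SUFFIX = "-mtplx-ar"


def baseline_key(variant_key: str) -> Optional[str]:
    """Map an MTP-on variant key to its target-only AR baseline key."""
    if variant_key.endswith(MTP_SUFFIX):
        return variant_key[: -len(MTP_SUFFIX)] + AR_SUFFIX
    return None  # not an MTP-on variant


def mtp_speedup(mtp_tg_tps: float, ar_tg_tps: float) -> float:
    """Generation-speed ratio; values above 1.0 mean MTP decoding was faster."""
    if ar_tg_tps <= 0:
        raise ValueError("baseline tg_tps must be positive")
    return mtp_tg_tps / ar_tg_tps
```

Both rows should come from the same scenario and the same thermal setting (with or without MTPLX_MAX=1) for the ratio to be meaningful.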
Currently shipped:
| Key | Model | Format | Quant | Tier |
|---|---|---|---|---|
| 26B-MoE-mlx-8bit | gemma-4-26B-A4B-it (MoE) | MLX | 8-bit | 8bit |
| 26B-MoE-gguf-q8 | gemma-4-26B-A4B-it (MoE) | GGUF | Q8_0 | 8bit |
| 26B-MoE-mlx-4bit | gemma-4-26B-A4B-it (MoE) | MLX | 4-bit | 4bit |
| 26B-MoE-gguf-q4 | gemma-4-26B-A4B-it (MoE) | GGUF | Q4_K_M | 4bit |
| 31B-Dense-mlx-8bit | gemma-4-31B-it (Dense) | MLX | 8-bit | 8bit |
| 31B-Dense-gguf-q8 | gemma-4-31B-it (Dense) | GGUF | Q8_0 | 8bit |
| gemma-4-E4B-gguf-q8 | gemma-4-E4B-it (Dense) | GGUF | Q8_0 | 8bit |
| qwen-3.5-4b-gguf-q8 | qwen-3.5-4B (Dense) | GGUF | Q8_0 | 8bit |
| qwen-3.5-9b-gguf-q8 | qwen-3.5-9B (Dense) | GGUF | Q8_0 | 8bit |
| qwen-3.6-35b-a3b-gguf-q4 | qwen-3.6-35B-A3B (MoE) | GGUF | Q4_K_M | 4bit |
| qwen-3-next-80b-a3b-instruct-gguf-q4 | Qwen3-Next-80B-A3B-Instruct (MoE) | GGUF | Q4_K_M | 4bit |
| qwen-3-coder-30b-a3b-instruct-gguf-q4 | Qwen3-Coder-30B-A3B-Instruct (MoE) | GGUF | Q4_K_M | 4bit |
| qwen-3.6-27b-mtplx-speed-mlx-4bit | qwen-3.6-27B-MTPLX | MLX | 4-bit | 4bit |
| qwen-3.6-27b-mtplx-speed-mtplx-mtp | qwen-3.6-27B-MTPLX | MTPLX | 4-bit MTP-on | 4bit |
| qwen-3.6-27b-mtplx-speed-mtplx-ar | qwen-3.6-27B-MTPLX | MTPLX | 4-bit MTP-off | 4bit |
| qwen-3.6-27b-mtplx-optimized-mlx-mixed4 | qwen-3.6-27B-MTPLX | MLX | mixed 4/8-bit | 4bit |
| qwen-3.6-27b-mtplx-optimized-mtplx-mtp | qwen-3.6-27B-MTPLX | MTPLX | mixed 4/8-bit MTP-on | 4bit |
| qwen-3.6-27b-mtplx-optimized-mtplx-ar | qwen-3.6-27B-MTPLX | MTPLX | mixed 4/8-bit MTP-off | 4bit |
| qwen-3-coder-next-gguf-q4 | Qwen3-Coder-Next (MoE) | GGUF | Q4_K_M | 4bit |
| deepseek-v4-flash-gguf-iq2xxs | DeepSeek-V4-Flash (MoE) | GGUF DS4 imatrix | IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8 | 2bit |
| gpt-oss-20b-gguf-q4 | gpt-oss-20b (MoE) | GGUF | Q4_K_M | 4bit |
| gpt-oss-120b-gguf-q4 | gpt-oss-120b (MoE, split GGUF) | GGUF | Q4_K_M | 4bit |
| nemotron-3-nano-omni-30b-a3b-reasoning-gguf-q4 | Nemotron-3-Nano-Omni-30B-A3B-Reasoning (MoE) | GGUF | UD-Q4_K_M | 4bit |
tier pairs MLX-Nbit ↔ Q*_K_M for fair runtime comparisons. The dashboard
Catalog page shows registry × measurement status at a glance.
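To see how tier supports cross-runtime comparison, the sketch below groups a few rows from the shipped-variant table by (model, tier) and keeps only groups that span more than one format. The tuple layout and function name are illustrative assumptions, not the registry's actual data model.

```python
from collections import defaultdict

# A few rows copied from the variant table: (key, model, format, tier).
VARIANTS = [
    ("26B-MoE-mlx-8bit", "gemma-4-26B-A4B-it", "MLX", "8bit"),
    ("26B-MoE-gguf-q8", "gemma-4-26B-A4B-it", "GGUF", "8bit"),
    ("26B-MoE-mlx-4bit", "gemma-4-26B-A4B-it", "MLX", "4bit"),
    ("26B-MoE-gguf-q4", "gemma-4-26B-A4B-it", "GGUF", "4bit"),
    ("gemma-4-E4B-gguf-q8", "gemma-4-E4B-it", "GGUF", "8bit"),
]


def runtime_pairs(variants):
    """Group variant keys by (model, tier); keep groups with more than one format."""
    groups = defaultdict(dict)
    for key, model, fmt, tier in variants:
        groups[(model, tier)][fmt] = key
    return {group: fmts for group, fmts in groups.items() if len(fmts) > 1}


for (model, tier), fmts in runtime_pairs(VARIANTS).items():
    print(model, tier, fmts)
```

Variants without a counterpart in another format, such as gemma-4-E4B-gguf-q8 here, drop out of the pairing and can only be compared across models.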
For generic benchmark use, each variant may also declare:
backend: openai-compatible # runtime adapter; defaults to fmt
artifact_type: endpoint # hf_repo, gguf_file, endpoint, ...
capabilities: [chat, completions, logprobs]
api_model: provider/model-id # optional model= label for endpoint APIs
api_key_env: PROVIDER_API_KEY # optional env var copied to Authorization/OpenAI_API_KEY

Existing mlx and gguf variants infer these fields automatically. Speed
benchmark adapters cover MLX, GGUF, DS4, and OpenAI-compatible endpoints.
Unsupported backends are rejected explicitly so new adapters can be added
without silently misrouting results. Endpoint speed uses wall-clock effective
token rates because hosted APIs generally do not expose separate
prefill/generation timings.
Eval runs can use openai-compatible endpoint variants directly; the endpoint
is treated as an existing /v1 server and no local subprocess is spawned.
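The "defaults to fmt" and "rejected explicitly" behavior can be pictured as a small lookup. This is a sketch under assumptions: the set of backend strings and the function name are hypothetical, and only openai-compatible, mlx, and gguf appear as literal values in the registry examples.

```python
# Hypothetical backend identifiers; the real adapter names may differ.
SUPPORTED_SPEED_BACKENDS = {"mlx", "gguf", "ds4", "openai-compatible"}


def resolve_backend(variant: dict) -> str:
    """Pick the runtime adapter for a variant, falling back to its fmt."""
    backend = variant.get("backend") or variant["fmt"]
    if backend not in SUPPORTED_SPEED_BACKENDS:
        # Fail loudly instead of routing the variant to some default runner.
        raise ValueError(
            f"unsupported backend {backend!r} for variant {variant.get('key')!r}"
        )
    return backend
```

Raising on unknown backends means a typo in the registry surfaces as an error at load time rather than as results attributed to the wrong runtime.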
Every measurement records the current bench_version (currently 0.3).
run_bench.py --skip-existing (default ON) skips combos that already have N
runs at that version; --all-pending runs only what's missing across the
registry. Bumping BENCH_VERSION in src/llm_bench/__init__.py triggers a
full re-measurement when methodology changes.
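The skip logic described above amounts to counting existing runs per scenario at the current version. A minimal sketch follows, assuming a hypothetical record shape with variant, scenario, and bench_version fields; the actual manifest format in manifest.py is not shown here.

```python
from collections import Counter

BENCH_VERSION = "0.3"  # value quoted in the text above


def pending_scenarios(records, scenarios, variant, runs_required=3,
                      version=BENCH_VERSION):
    """Return the scenarios that still lack `runs_required` runs at `version`."""
    done = Counter(
        r["scenario"]
        for r in records
        if r["variant"] == variant and r["bench_version"] == version
    )
    # Counter returns 0 for scenarios that have no matching records.
    return [s for s in scenarios if done[s] < runs_required]
```

Runs recorded under an older version do not count, which is why bumping the version makes every scenario pending again.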
models/
  registry.yaml                 # single source of truth; add a model here
src/llm_bench/
  __init__.py                   # BENCH_VERSION constant
  registry.py                   # YAML loader + Variant/Model dataclasses
  manifest.py                   # idempotency: which (variant, scenario) is measured
  index.py                      # build results/index.json (registry × status)
  eval_plan.py                  # ordered catch-up plan from coverage gaps
  site_data.py                  # typed public-site data export
  reporting.py                  # shared ordering/report helpers
  runners/                      # speed/memory benchmark
    base.py                     # BenchResult + /usr/bin/time -l wrapper
    mlx_runner.py               # mlx_lm.generate subprocess
    gguf_runner.py              # llama-bench subprocess
    ds4_runner.py               # DeepSeek V4 Flash-specific ds4-bench adapter
    mtplx_runner.py             # MTPLX MTP-on / target-only AR speed adapter
    openai_runner.py            # OpenAI-compatible endpoint speed adapter
  evals/                        # multi-dim accuracy (lm-eval-harness)
    server.py                   # ModelServer (mlx_lm.server | llama-server)
    lmeval.py                   # lm_eval CLI wrapper
    suites.py                   # SMOKE/FULL task lists, capability gating
    aggregate.py                # eval JSON → tidy DataFrame
    evalplus_runner.py          # HumanEval / MBPP via EvalPlus
    livecodebench_runner.py     # contamination-fresh coding eval adapter
    bigcodebench_runner.py      # BigCodeBench-Hard adapter
    bfcl.py                     # BFCL v4 tool-use adapter
    sourceqa.py                 # pinned-repo evidence QA diagnostic
    livebench_runner.py         # LiveBench subset adapter
    kmmlu_pro_runner.py         # KMMLU-Pro direct runner
    programbench_runner.py      # ProgramBench import/eval helpers
    terminal_bench_runner.py    # Terminal-Bench run + import helpers
    trace.py                    # per-task execution ledger
  prompts.py                    # 20 quality-comparison prompts (KO/EN)
  scenarios.py                  # speed scenario matrices
  aggregate.py                  # speed raw JSON → summary CSV
scripts/
  sync_models.py                # registry-driven hf download
  run_bench.py                  # speed CLI: --variant / --all-pending
  run_evals.py                  # eval CLI: --variant / --all-variants
  run_evals_overnight.sh        # launchd stop → run → aggregate → restore
  preflight.py                  # limit=2 wiring check for one local variant
  compare_quality.py            # cos-sim divergence (20 prompts)
  compare_mtplx.py              # paired MTPLX MTP vs AR summary
  aggregate_evals.py            # eval JSON → CSVs + index.json
  build_index.py                # build only the index
  plan_eval_catchup.py          # write results/eval_catchup_plan.{json,md}
  run_programbench.py           # run upstream ProgramBench eval + import
  run_terminal_bench.py         # run Terminal-Bench + import result
  import_programbench.py        # import existing ProgramBench eval JSON
  export_site_public_data.py    # regenerate site/public/data and TS fixture
results/
  raw/                          # per-run speed JSON (gitignored)
  summary.csv                   # speed aggregated (committed)
  mtplx_speedups.csv            # paired MTPLX MTP/AR summary (committed)
  quality_*.json                # gitignored
  eval_scores/                  # lm-eval outputs (gitignored)
  eval_traces/                  # per-task execution ledger
  eval_summary_*.csv            # eval aggregated (committed)
  index.json                    # registry × measurement status (committed)
  eval_catchup_plan.*           # generated catch-up queue (committed)
  server_logs/                  # gitignored
  overnight_logs/               # gitignored
dashboard/
  app.py                        # Streamlit (11 pages, Catalog first)
site/
  app/                          # TanStack Start public benchmark site
  public/data/                  # exported benchmark JSON/CSV for the site
  wrangler.jsonc                # Cloudflare Workers deployment config
report/
  _quarto.yml
  index.qmd                     # static HTML report (Quarto)
docs/
  methodology.md                # measurement protocol
  model_policy.md               # local model selection policy and caveats
  overnight_plan.md             # family-batched eval catch-up plan
Runs lm-eval-harness against an OpenAI-compatible server (mlx_lm.server for
MLX, llama-server for GGUF) booted ad-hoc per model variant.
| Dimension | Tasks (chat-compatible) | Loglikelihood-only (gguf only) |
|---|---|---|
| Reasoning | mmlu_generative, gsm8k_cot_zeroshot | hellaswag, leaderboard_mmlu_pro, leaderboard_gpqa_diamond |
| Korean | kmmlu_direct, hrm8k | haerae, kobest |
| Code | primary: humaneval / mbpp (EvalPlus), bigcodebench_hard; optional legacy: livecodebench | none |
| Instruction | leaderboard_ifeval | none |
| Long context | longbench (21 sub-tasks, EN+ZH) | none |
| Safety | truthfulqa-multi_gen_en | toxigen |
| Tool use | bfcl (BFCL v4, opt-in via --include-bfcl) | none |
| Diagnostic source grounding | sourceqa (pinned-repo evidence QA, deterministic checker) | none |
| Fresh eval | livebench_subset (LiveBench non-agentic subset) | none |
| Agentic code | primary: terminal_bench (Docker-backed terminal tasks); optional: programbench eval + result import | none |
| Korean professional | kmmlu_pro (KMMLU-Pro weighted MCQ) | none |
The reasoning and instruction additions mirror HF Open LLM Leaderboard v2 (MMLU-Pro / GPQA-Diamond / IFEval). BigCodeBench-Hard is the primary practical code-generation task; LiveCodeBench remains available as a legacy contest-code baseline. Terminal-Bench is the primary agentic code task, and BFCL fills the tool-use dimension.
mlx_lm.server does not return token logprobs in /v1/completions, so
loglikelihood-based MCQ tasks (hellaswag, kobest, haerae, toxigen,
leaderboard_mmlu_pro, leaderboard_gpqa_diamond) only run on the GGUF
path. Generative variants are used for the rest so both runtimes get
apples-to-apples coverage.
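The gating rule above can be sketched as a filter on a variant's declared capabilities. The task set is copied from the paragraph; the function name and the use of a logprobs capability as the gate are assumptions for illustration, and the actual logic in suites.py may differ.

```python
# Loglikelihood-based MCQ tasks listed above; they need token logprobs.
LOGLIKELIHOOD_TASKS = {
    "hellaswag",
    "kobest",
    "haerae",
    "toxigen",
    "leaderboard_mmlu_pro",
    "leaderboard_gpqa_diamond",
}


def runnable_tasks(tasks, capabilities):
    """Drop loglikelihood tasks when the server cannot return logprobs."""
    has_logprobs = "logprobs" in capabilities
    return [t for t in tasks if has_logprobs or t not in LOGLIKELIHOOD_TASKS]
```

For example, a variant declaring only chat and completions keeps gsm8k_cot_zeroshot but loses hellaswag, while one that also declares logprobs keeps both.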
⚠️ Code-eval safety. Code-family tasks (humaneval, mbpp, bigcodebench_hard, livecodebench) run through dedicated external runners outside lm-eval. They still execute model-generated code or benchmark harness tooling, so use trusted checkpoints and keep sandboxing in mind. --suite full skips these for unsupported backends and also supports --skip-existing gating for repeatability.
Setup:
uv sync --extra evals
# Optional, for the frontier external runners:
uv pip install bfcl-eval==2025.12.17 # BFCL v4 (tool use)
uv pip install git+https://github.com/LiveCodeBench/LiveCodeBench.git # LiveCodeBench (contamination-free code)
uv pip install bigcodebench --upgrade # BigCodeBench-Hard (practical code)
uv pip install "terminal-bench>=0.2.18" # Terminal-Bench (agentic terminal tasks)
git clone https://github.com/LiveBench/LiveBench.git /path/to/LiveBench # LiveBench subset
export LIVEBENCH_REPO=/path/to/LiveBench

kmmlu_pro uses the datasets and openai packages already installed by
uv sync --extra evals, but the dataset is gated: request access to
LGAI-EXAONE/KMMLU-Pro on Hugging Face and authenticate with hf auth login
or HF_TOKEN before running it.
Smoke (verify wiring, ~10 min, limit=2 per task):
uv run python scripts/run_evals.py --variant 26B-MoE-mlx-8bit --suite smoke --limit 2

Frontier external smoke (LiveBench / BigCodeBench-Hard / KMMLU-Pro with a small per-task cap):

uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full --limit 2

# Optional: resilient instruction eval and strict coverage check
uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full \
  --resilient-ifeval --strict-coverage

--suite full --limit N is the smoke path for these external runners. EvalPlus
is skipped under a limit because its upstream CLI does not provide a compatible
partial matrix, while LiveBench, BigCodeBench-Hard, and KMMLU-Pro do run with
the cap.
For stricter governance on a full matrix run, add --strict-coverage so the
run exits non-zero when any required supported primary task is missing a
completed result (for example, external runner unavailable,
limit-incompatible skip, or hard task error). Optional lanes
(livecodebench, BFCL, LiveBench subset, ProgramBench) are reported in
coverage but do not block the primary matrix. bigcodebench_hard and
terminal_bench are primary rows. MTPLX MTP/AR rows are reported as
speed_only under the mtplx_speedup lane and also do not block coverage.
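The strict-coverage rule can be summarized as: only primary rows without a completed result affect the exit code. The sketch below assumes a hypothetical row shape (task, lane, status) and treats measured as the only completed status; the real coverage index may use different field names or accept additional statuses.

```python
def missing_primary(rows):
    """Primary-lane tasks that do not yet have a completed result."""
    return [
        r["task"]
        for r in rows
        if r["lane"] == "primary" and r["status"] != "measured"
    ]


def strict_coverage_exit_code(rows) -> int:
    """Return 1 when any primary task is incomplete, else 0.

    Optional and speed_only lanes are reported but never block.
    """
    return 1 if missing_primary(rows) else 0
```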
Terminal-Bench is the maintained agentic terminal benchmark path. It runs tasks inside Docker and talks to the same OpenAI-compatible model server as the rest of llm-bench. The wrapper defaults to one task to keep smoke tests cheap:
uv sync --extra terminalbench
# If Docker uses Colima, expose the active socket to the Python Docker SDK.
export DOCKER_HOST=unix://$HOME/.colima/default/docker.sock
uv run python scripts/run_terminal_bench.py \
--variant 26B-MoE-gguf-q8 \
--task-id hello-world
uv run python scripts/aggregate_evals.py

The same runner is available from the full eval CLI. In --suite full,
Terminal-Bench is part of the primary matrix and defaults to one sampled task
unless TERMINAL_BENCH_TASK_IDS, TERMINAL_BENCH_N_TASKS, or
TERMINAL_BENCH_FULL=1 is set:
TERMINAL_BENCH_TASK_IDS=hello-world \
uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full --task terminal_bench

ProgramBench is agentic: the model/agent must first produce a complete
<instance_id>/submission.tar.gz codebase, then ProgramBench evaluates it in
Docker. llm-bench wraps the evaluation and import step:
uv sync --extra programbench
uv run python scripts/run_programbench.py \
--variant 26B-MoE-gguf-q8 \
--source-dir /path/to/programbench/submission-run \
--tasks-dir /path/to/ProgramBench/src/programbench/data/tasks \
--workers 4 \
--branch-workers 2 \
--docker-cpus 8 \
--limit 5
uv run python scripts/aggregate_evals.py

If the ProgramBench eval JSON files were produced elsewhere, import them directly:
uv run python scripts/import_programbench.py \
--variant 26B-MoE-gguf-q8 \
--source-dir /path/to/programbench/evaluated-run \
  --tasks-dir /path/to/ProgramBench/src/programbench/data/tasks

The headline ProgramBench metric is resolved_rate,none (fully solved
program-rebuild tasks). almost_resolved_rate,none and
avg_test_pass_rate,none are supporting diagnostics, not the primary ranker.
Pass --tasks-dir when available so ignored ProgramBench branches/tests are
excluded the same way as programbench info.
Full overnight matrix (all locally present variants × full suite). The wrapper script handles the optional launchd bootout, the run, and the launchd bootstrap automatically, and always restores agents on EXIT, even if the eval fails:
# Foreground (watch progress):
bash scripts/run_evals_overnight.sh
# Detached overnight (recommended):
nohup bash scripts/run_evals_overnight.sh > /tmp/llm-evals-overnight.log 2>&1 &
disown
tail -f /tmp/llm-evals-overnight.log

See docs/overnight_plan.md for the family-batched catch-up plan and optional
lane schedule.
To generate the concrete catch-up queue from the current coverage index:
uv run python scripts/plan_eval_catchup.py

Env overrides:

- SUITE=smoke|full (default full)
- LIMIT=N (per-task sample cap)
- VARIANTS="26B-MoE-mlx-8bit 26B-MoE-gguf-q8" (subset, default = all)
- TASKS="kmmlu_pro" (task-filtered catch-up bucket)
- LLM_BENCH_STRICT_COVERAGE=1: pass --strict-coverage to run_evals.py
- LLM_BENCH_RESILIENT_IFEVAL=1: pass --resilient-ifeval to run_evals.py
- LLM_BENCH_INCLUDE_BFCL=1: pass --include-bfcl for the BFCL optional lane
- TERMINAL_BENCH_TASK_IDS="hello-world": comma/space-separated task filter
- TERMINAL_BENCH_N_TASKS=N or TERMINAL_BENCH_FULL=1: Terminal-Bench task count
- TERMINAL_BENCH_MODEL=openai/<model>: LiteLLM model label override
- TERMINAL_BENCH_DOCKER_HOST=unix://...: Docker socket override
- LIVE_CODE_BENCH_REPO=/path/to/LiveCodeBench: run the source checkout version
- LIVE_CODE_BENCH_START_DATE=YYYY-MM-DD, LIVE_CODE_BENCH_END_DATE=YYYY-MM-DD, LIVE_CODE_BENCH_MAX_TOKENS=N: run a reproducible release window
- LIVE_CODE_BENCH_RELEASE=release_vX: override the run_livecodebench dataset release (default from scripts/livecodebench_runner.py)
- LIVE_CODE_BENCH_NOT_FAST=1: use the original non-lite LiveCodeBench code generation benchmark instead of the upstream default fast/lite setting
- LIVEBENCH_REPO=/path/to/LiveBench, LIVEBENCH_RELEASE=YYYY-MM-DD, LIVEBENCH_MAX_TOKENS=N: LiveBench checkout and release selection
- BIGCODEBENCH_EXECUTION=gradio|local|e2b, BIGCODEBENCH_GRADIO_ENDPOINT=https://...: BigCodeBench execution backend
- KMMLU_PRO_MAX_TOKENS=N: maximum chat tokens for each KMMLU-Pro response
- LAUNCH_AGENTS="com.you.foo com.you.bar": launchd agent labels to stop before the run and restart at the end. Default empty = no launchd management; stop GPU-using processes manually instead.
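Several of these overrides take comma- or space-separated lists. A minimal sketch of parsing such a value is shown below; the helper name is hypothetical, and the wrapper script may split its variables differently.

```python
import os
import re


def parse_list(raw: str) -> list[str]:
    """Split a comma/space-separated value into non-empty tokens."""
    return [tok for tok in re.split(r"[,\s]+", raw.strip()) if tok]


# Unset variables fall back to an empty list, meaning "no filter".
task_ids = parse_list(os.environ.get("TERMINAL_BENCH_TASK_IDS", ""))
variants = parse_list(os.environ.get("VARIANTS", ""))
```

Accepting both separators keeps values such as "hello-world,foo" and "hello-world foo" equivalent.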
Each variant boots its own server on port 9090; tasks run sequentially per variant. Expect ~2–3 hours per variant for the full suite.
sourceqa is a lightweight diagnostic runner inspired by repo-search
benchmarks: it clones pinned source repositories, injects curated evidence files
into a chat prompt, and writes deterministic acc,none / recall metrics to the
same results_*.json shape as lm-eval. Because the current task set is small
and saturated, SourceQA is kept for smoke/regression checks and excluded from
headline ranking and primary coverage debt. Optional judge metadata can be
recorded with --sourceqa-judge-model, but it does not affect the diagnostic
score.
Results:
- results/eval_scores/<run_id>/<task>/.../results_*.json: raw lm-eval output
- results/eval_scores/summary_*.json: flat list of {variant, task, results}
- results/eval_traces/<run_id>.jsonl: per-task execution ledger with status, wall time, artifacts, and errors
- results/eval_summary_full.csv: every metric × subtask × variant (244+ rows)
- results/eval_summary_primary.csv: one row per (variant, task), canonical metric
- results/index.json: registry × speed/eval coverage with measured, directional, missing, optional, speed_only, and unsupported statuses
- results/eval_catchup_plan.json / .md: ordered commands generated from the current coverage gaps
- results/server_logs/<run_id>.log: model server stderr for debugging
After the eval run, scripts/aggregate_evals.py rebuilds the CSVs and the
coverage index. The overnight wrapper calls this for you.
uv run streamlit run dashboard/app.py
# Opens on http://127.0.0.1:8502 by default.

| Group | Page | What it shows |
|---|---|---|
| Status | Model Status | Model-first benchmark progress, task matrix, and weak/missing task list |
| Status | Model Compare | Side-by-side model comparison for speed, eval scores, and coverage debt |
| Status | Catalog | Registry × measurement progress bars (entry point) |
| Speed | Speed Overview | TG/PP bar charts, peak memory, Pareto scatter |
| Speed | Speed Scaling | Context-length sweep by runtime |
| Speed | Output Quality (cos sim) | Optional paired response similarity |
| Speed | Speed Raw | Per-run JSON table + CSV download |
| Eval | Evals Heatmap | Variant × task primary-score grid |
| Eval | Evals · Runtime Compare | Score delta within same model+tier across backend/fmt/artifact groups |
| Eval | Evals · Quantization | 8bit vs 4bit accuracy hit per model/runtime |
| Eval | Evals · Dimension | Per-dim bar charts with stderr |
| Eval | Evals · LongBench Detail | 21 sub-task breakdown |
| Eval | Evals Raw | Full metrics filterable table + CSV |
See docs/methodology.md for measurement protocol, sanity checks, scenario matrix rationale, and the chat-vs-loglikelihood split. See docs/model_policy.md for the current local model selection policy and headline eval scores.
uv run pytest tests/ -v

The pytest suite covers registry validation, manifest/idempotency behavior, speed runners, eval runners, aggregation, ProgramBench import/eval helpers, site data export, and public-site data contracts.
MIT