llm-bench

License: MIT · Python 3.13+

Registry-driven LLM benchmark harness for Apple Silicon local runtimes and OpenAI-compatible endpoints.

This repo keeps model and runtime declarations in models/registry.yaml, runs repeatable speed/memory benchmarks, runs multi-dimensional evals, and publishes the resulting coverage through Streamlit plus a static TanStack/Cloudflare site.

What this is for

  • Compare local runtime formats such as MLX, GGUF/llama.cpp, DS4, and MTPLX on the same Apple Silicon machine.
  • Track prompt-processing speed, generation speed, peak memory, wall time, and benchmark version for each scenario.
  • Run accuracy and capability evals through lm-eval-harness plus external runners for EvalPlus, LiveCodeBench, BigCodeBench, BFCL, SourceQA, LiveBench, KMMLU-Pro, ProgramBench import/eval, and Terminal-Bench.
  • Include hosted or remote OpenAI-compatible /v1 endpoints in the same registry and reporting pipeline when local artifacts are not required.
  • Keep coverage explicit: measured, directional, diagnostic, missing, optional, speed-only, and unsupported rows are surfaced separately.

Current scope

| Area | Supported today |
|---|---|
| Speed runners | MLX, GGUF/llama.cpp, DS4, MTPLX, OpenAI-compatible endpoints |
| Eval runners | lm-eval-harness, EvalPlus, LiveCodeBench, BigCodeBench-Hard, BFCL, SourceQA, LiveBench subset, KMMLU-Pro, ProgramBench import/eval, Terminal-Bench |
| Reporting | Streamlit dashboard, Quarto report, TanStack Start public site on Cloudflare Workers |
| Primary machine | Apple M5 Max, 128GB unified memory, macOS |
| Registry | Local HF repos, local GGUF files, split GGUF files, MTPLX speed-only variants, hosted endpoints |

Quickstart: local Apple Silicon

git clone https://github.com/haxlys/llm-bench.git ~/llm-bench
cd ~/llm-bench

# System tools (one-time)
brew install llama.cpp
brew install --cask quarto      # optional, only for the static report

# Python env
uv sync

# Inspect what the registry knows and what is already present locally.
uv run python scripts/sync_models.py --check

# Download one small/local target first. MLX variants land in the HF cache;
# GGUF variants land under the registry's {gguf_dir}, default ~/models/gguf/.
uv run python scripts/sync_models.py --variant gemma-4-E4B-gguf-q8

# Smoke test (single scenario, ~1 min) — verifies wiring before the full matrix
uv run python scripts/run_bench.py --variant gemma-4-E4B-gguf-q8 --smoke

# Full speed matrix for locally present variants
uv run python scripts/run_bench.py --all-pending

# Visualize
uv run streamlit run dashboard/app.py
# Opens on http://127.0.0.1:8502 by default.

Use the full registry download only when you intentionally want the whole local matrix; it can require tens or hundreds of GB depending on the variants present in models/registry.yaml.

uv run python scripts/sync_models.py --all-missing

Quickstart: OpenAI-compatible endpoint

For an existing /v1 server or hosted provider, add an endpoint variant to models/registry.yaml:

models:
  - id: my-endpoint-model
    family: hosted
    architecture: dense
    variants:
      - key: my-endpoint-api
        fmt: api
        backend: openai-compatible
        artifact_type: endpoint
        path: https://provider.example/v1
        api_model: provider/model-id
        api_key_env: PROVIDER_API_KEY
        quant: hosted
        tier: hosted
        capabilities: [chat, completions, code_eval_chat, tool_use_eval]

Then run a speed smoke and eval smoke without downloading local model weights:

uv sync
export PROVIDER_API_KEY=...

uv run python scripts/run_bench.py --variant my-endpoint-api --smoke

uv sync --extra evals
uv run python scripts/run_evals.py --variant my-endpoint-api --suite smoke --limit 3

Endpoint speed rows use wall-clock effective token rates because hosted APIs usually do not expose separate prefill/generation timing or local peak memory.
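To make "wall-clock effective token rate" concrete, here is a minimal sketch. The function name, the `EffectiveRates` shape, and the choice to divide both token counts by the same elapsed time are assumptions for illustration; the harness's endpoint adapter may compute its rates differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffectiveRates:
    """Wall-clock token rates for one endpoint request (hypothetical shape)."""
    prompt_tps: float
    completion_tps: float


def effective_rates(prompt_tokens: int, completion_tokens: int, wall_s: float) -> EffectiveRates:
    """Divide token counts by the total elapsed request time.

    Both rates share one denominator because the endpoint does not report
    separate prefill and generation timings.
    """
    if wall_s <= 0:
        raise ValueError("wall_s must be positive")
    return EffectiveRates(
        prompt_tps=prompt_tokens / wall_s,
        completion_tps=completion_tokens / wall_s,
    )


# Illustrative numbers only, not a measurement.
rates = effective_rates(prompt_tokens=1024, completion_tokens=128, wall_s=4.0)
print(rates)
```

Under this assumption the completion rate includes prefill and network time in its denominator, so it will read lower than a local tg_tps figure for the same model and should not be compared against one directly.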

Optional reports and quality checks

# Output divergence (quality)
uv sync --extra quality          # pulls sentence-transformers
uv run python scripts/compare_quality.py \
  --gguf-model ~/models/gguf/gemma-4-26B-A4B-it-Q8_0.gguf

# Static report (requires Quarto)
quarto render report/
open report/_site/index.html

Public benchmark website

The public website lives in site/. It is a TanStack Start app built with the Cloudflare Vite plugin, deployed to Cloudflare Workers with Static Assets, and prerendered by TanStack Start during vite build.

Regenerate the typed data export before reviewing or publishing site changes:

uv run python scripts/export_site_public_data.py

Install frontend dependencies once:

cd site
npm install

Run the site locally during development:

cd site
npm run dev

Build and preview the production output:

cd site
npm run build
npm run preview

Deploy to Cloudflare Workers:

cd site
npm run deploy

The Cloudflare Workers configuration sets main to @tanstack/react-start/server-entry with nodejs_compat. The Cloudflare Vite plugin emits the Workers Static Assets configuration into the generated output, so wrangler.jsonc intentionally does not hard-code an assets.directory.

TTFT and ITL columns are present in the speed report, but they display "not measured" until the benchmark runner records those latency fields. Until then, use TG tok/s, wall time, and peak memory for speed comparisons.

Important: stop other GPU/Metal workloads first

Inference benchmarks are extremely sensitive to Metal contention. Before running:

# Confirm nothing else is holding the GPU
lsof -i :8080 -i :8081 -i :8082    # mlx servers in ~/llm-stack
ps aux | grep -iE "mlx|llama|ds4" | grep -v grep

If you run other MLX/DS4 servers (e.g. ~/llm-stack), pause them during the run. Otherwise expect 2–5× slower numbers, single-instance DS4 lock failures, and possible OOM at the 31B+ class.

What gets measured

Per (model, format, scenario):

| Metric | Source | Notes |
|---|---|---|
| pp_tps (prompt processing tok/s) | mlx-lm verbose / llama-bench JSON | Synthetic prefill |
| tg_tps (generation tok/s) | mlx-lm verbose / llama-bench JSON | Greedy, temp=0 |
| peak_mem_gb | /usr/bin/time -l max RSS, max'd with mx.metal.get_peak_memory() for MLX | Process-level |
| wall_s | /usr/bin/time -l real | End-to-end including model load |
| cos_sim | paraphrase-multilingual-mpnet-base-v2 embedding | Quality script only |
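The peak_mem_gb and wall_s rows can be pictured with the sketch below. It parses a hand-written sample in the layout macOS `/usr/bin/time -l` prints to stderr (max RSS in bytes) and takes the larger of RSS and a Metal peak value. The helper names, the sample text, and the bytes-to-GiB conversion are assumptions; the real wrapper in runners/base.py may differ.

```python
import re

# Hand-written sample in the layout of macOS `/usr/bin/time -l` stderr.
# The numbers are placeholders, not a measurement.
SAMPLE_TIME_OUTPUT = """\
       12.50 real         3.10 user         1.20 sys
         34359738368  maximum resident set size
"""


def parse_time_l(stderr_text: str) -> tuple[float, int]:
    """Return (wall seconds, max RSS in bytes) from `time -l` output."""
    real = re.search(r"^\s*([\d.]+)\s+real\b", stderr_text, re.MULTILINE)
    rss = re.search(r"^\s*(\d+)\s+maximum resident set size", stderr_text, re.MULTILINE)
    if real is None or rss is None:
        raise ValueError("unrecognized /usr/bin/time -l output")
    return float(real.group(1)), int(rss.group(1))


def peak_mem_gb(rss_bytes: int, metal_peak_bytes: int = 0) -> float:
    """Take the larger of process RSS and the Metal peak, in GiB (assumed unit)."""
    return max(rss_bytes, metal_peak_bytes) / 1024**3


wall_s, rss_bytes = parse_time_l(SAMPLE_TIME_OUTPUT)
print(wall_s, peak_mem_gb(rss_bytes))
```

For an MLX run, the second argument would carry the value reported by mx.metal.get_peak_memory(); for GGUF runs it stays at zero and RSS alone decides the result.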

Scenarios = prefill ∈ {256, 1024, 4096, 8192} × gen ∈ {128, 512}. 3 measured runs + 1 warmup per scenario.
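The scenario matrix above expands as follows. This is a sketch of the arithmetic only; the constant and function names are hypothetical and the real definitions live in src/llm_bench/scenarios.py.

```python
from itertools import product

PREFILL_LENGTHS = (256, 1024, 4096, 8192)
GEN_LENGTHS = (128, 512)
WARMUP_RUNS = 1
MEASURED_RUNS = 3

# Every (prefill, gen) combination is one scenario.
scenarios = list(product(PREFILL_LENGTHS, GEN_LENGTHS))


def run_plan(scenario_list):
    """Yield (prefill, gen, run_index, is_warmup) for every invocation."""
    for prefill, gen in scenario_list:
        for run_index in range(WARMUP_RUNS + MEASURED_RUNS):
            yield prefill, gen, run_index, run_index < WARMUP_RUNS


plan = list(run_plan(scenarios))
measured = [step for step in plan if not step[3]]
print(len(scenarios), len(plan), len(measured))
```

So one variant needs 8 scenarios and 32 invocations, of which 24 contribute to the reported numbers.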

Model registry (models/registry.yaml)

models/registry.yaml is the single source of truth for what gets benchmarked. To add a new model, declare it with its variants:

models:
  - id: qwen-3.6-27b
    family: qwen
    architecture: dense
    params_total_b: 27
    variants:
      - key: qwen-27b-mlx-8bit
        fmt: mlx
        path: mlx-community/Qwen3.6-27B-8bit
        quant: MLX-8bit
        tier: 8bit
        approx_size_gb: 27
      - key: qwen-27b-gguf-q8
        fmt: gguf
        path: "{gguf_dir}/qwen-3.6-27b-Q8_0.gguf"
        quant: Q8_0
        tier: 8bit
        approx_size_gb: 28
        download:
          repo: bartowski/qwen-3.6-27b-GGUF
          pattern: "*Q8_0*.gguf"

Then:

uv run python scripts/sync_models.py --model qwen-3.6-27b
uv run python scripts/run_bench.py --variant qwen-27b-mlx-8bit --variant qwen-27b-gguf-q8
uv run python scripts/run_evals.py --variant qwen-27b-mlx-8bit --variant qwen-27b-gguf-q8 --suite full

Use --task to run an ordered catch-up bucket without re-running the whole suite:

uv run python scripts/run_evals.py --all-variants --suite full \
  --task kmmlu_pro --resilient-ifeval --strict-coverage

MTPLX-ready MLX checkpoints can be benchmarked through the normal MLX runner for apples-to-apples autoregressive speed/eval numbers:

uv run python scripts/run_bench.py \
  --variant qwen-3.6-27b-mtplx-speed-mlx-4bit \
  --variant qwen-3.6-27b-mtplx-optimized-mlx-mixed4

To measure MTPLX speculative decoding itself inside the same speed pipeline, use the paired mtplx-mtp and mtplx-ar variants. The mtplx-mtp rows run native MTP speculative decoding; the mtplx-ar rows run the same MTPLX runtime with MTP disabled as the target-only baseline. These MTPLX MTP/AR variants are speed-only and are skipped by the eval runner; use their paired flat MLX variants for quality coverage.

uv sync --extra mtplx
uv run mtplx pull Youssofal/Qwen3.6-27B-MTPLX-Optimized-Speed
uv run python scripts/run_bench.py --smoke --runs 1 --no-warmup \
  --variant qwen-3.6-27b-mtplx-speed-mtplx-mtp \
  --variant qwen-3.6-27b-mtplx-speed-mtplx-ar
uv run python scripts/compare_mtplx.py

Set MTPLX_MAX=1 to request MTPLX's fan-max path when the local ThermalForge setup is available. Without it, results represent the normal no-fan runtime.
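The MTP-versus-AR pairing described above amounts to a ratio per checkpoint. The sketch below pairs variants by their key suffix and divides generation speeds. The function name, the suffix-based pairing rule, and the input numbers are assumptions for illustration; scripts/compare_mtplx.py is the actual implementation and may pair or summarize differently.

```python
def pair_speedups(tg_tps_by_variant: dict[str, float]) -> dict[str, float]:
    """Map each checkpoint prefix to tg_tps(MTP on) / tg_tps(MTP off)."""
    speedups = {}
    for key, mtp_tps in tg_tps_by_variant.items():
        if not key.endswith("-mtplx-mtp"):
            continue
        base = key.removesuffix("-mtplx-mtp")
        ar_tps = tg_tps_by_variant.get(f"{base}-mtplx-ar")
        if ar_tps:  # skip missing or zero baselines
            speedups[base] = mtp_tps / ar_tps
    return speedups


# Placeholder throughput values, not measurements.
placeholder_tps = {
    "qwen-3.6-27b-mtplx-speed-mtplx-mtp": 30.0,
    "qwen-3.6-27b-mtplx-speed-mtplx-ar": 20.0,
    "qwen-3.6-27b-mtplx-optimized-mtplx-mtp": 25.0,  # no AR row, so it is skipped
}
speedups = pair_speedups(placeholder_tps)
print(speedups)
```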

Currently shipped:

| Key | Model | Format | Quant | Tier |
|---|---|---|---|---|
| 26B-MoE-mlx-8bit | gemma-4-26B-A4B-it (MoE) | MLX | 8-bit | 8bit |
| 26B-MoE-gguf-q8 | gemma-4-26B-A4B-it (MoE) | GGUF | Q8_0 | 8bit |
| 26B-MoE-mlx-4bit | gemma-4-26B-A4B-it (MoE) | MLX | 4-bit | 4bit |
| 26B-MoE-gguf-q4 | gemma-4-26B-A4B-it (MoE) | GGUF | Q4_K_M | 4bit |
| 31B-Dense-mlx-8bit | gemma-4-31B-it (Dense) | MLX | 8-bit | 8bit |
| 31B-Dense-gguf-q8 | gemma-4-31B-it (Dense) | GGUF | Q8_0 | 8bit |
| gemma-4-E4B-gguf-q8 | gemma-4-E4B-it (Dense) | GGUF | Q8_0 | 8bit |
| qwen-3.5-4b-gguf-q8 | qwen-3.5-4B (Dense) | GGUF | Q8_0 | 8bit |
| qwen-3.5-9b-gguf-q8 | qwen-3.5-9B (Dense) | GGUF | Q8_0 | 8bit |
| qwen-3.6-35b-a3b-gguf-q4 | qwen-3.6-35B-A3B (MoE) | GGUF | Q4_K_M | 4bit |
| qwen-3-next-80b-a3b-instruct-gguf-q4 | Qwen3-Next-80B-A3B-Instruct (MoE) | GGUF | Q4_K_M | 4bit |
| qwen-3-coder-30b-a3b-instruct-gguf-q4 | Qwen3-Coder-30B-A3B-Instruct (MoE) | GGUF | Q4_K_M | 4bit |
| qwen-3.6-27b-mtplx-speed-mlx-4bit | qwen-3.6-27B-MTPLX | MLX | 4-bit | 4bit |
| qwen-3.6-27b-mtplx-speed-mtplx-mtp | qwen-3.6-27B-MTPLX | MTPLX | 4-bit MTP-on | 4bit |
| qwen-3.6-27b-mtplx-speed-mtplx-ar | qwen-3.6-27B-MTPLX | MTPLX | 4-bit MTP-off | 4bit |
| qwen-3.6-27b-mtplx-optimized-mlx-mixed4 | qwen-3.6-27B-MTPLX | MLX | mixed 4/8-bit | 4bit |
| qwen-3.6-27b-mtplx-optimized-mtplx-mtp | qwen-3.6-27B-MTPLX | MTPLX | mixed 4/8-bit MTP-on | 4bit |
| qwen-3.6-27b-mtplx-optimized-mtplx-ar | qwen-3.6-27B-MTPLX | MTPLX | mixed 4/8-bit MTP-off | 4bit |
| qwen-3-coder-next-gguf-q4 | Qwen3-Coder-Next (MoE) | GGUF | Q4_K_M | 4bit |
| deepseek-v4-flash-gguf-iq2xxs | DeepSeek-V4-Flash (MoE) | GGUF | DS4 imatrix IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8 | 2bit |
| gpt-oss-20b-gguf-q4 | gpt-oss-20b (MoE) | GGUF | Q4_K_M | 4bit |
| gpt-oss-120b-gguf-q4 | gpt-oss-120b (MoE, split GGUF) | GGUF | Q4_K_M | 4bit |
| nemotron-3-nano-omni-30b-a3b-reasoning-gguf-q4 | Nemotron-3-Nano-Omni-30B-A3B-Reasoning (MoE) | GGUF | UD-Q4_K_M | 4bit |

tier pairs each MLX N-bit variant with the GGUF quant of the same nominal width (MLX 8-bit ↔ Q8_0, MLX 4-bit ↔ Q4_K_M) for fair runtime comparisons. The dashboard Catalog page shows registry × measurement status at a glance.
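One way to picture the tier pairing is to group variants by (model, tier) and keep the groups that span more than one format. The rows below are copied from the table above; the function name and tuple layout are hypothetical and not taken from the harness.

```python
from collections import defaultdict

# (key, model, fmt, tier), a subset of the shipped variants.
VARIANTS = [
    ("26B-MoE-mlx-8bit", "gemma-4-26B-A4B-it", "MLX", "8bit"),
    ("26B-MoE-gguf-q8", "gemma-4-26B-A4B-it", "GGUF", "8bit"),
    ("26B-MoE-mlx-4bit", "gemma-4-26B-A4B-it", "MLX", "4bit"),
    ("26B-MoE-gguf-q4", "gemma-4-26B-A4B-it", "GGUF", "4bit"),
    ("gemma-4-E4B-gguf-q8", "gemma-4-E4B-it", "GGUF", "8bit"),
]


def runtime_pairs(variants):
    """Group variant keys by (model, tier); keep groups with 2+ formats."""
    groups = defaultdict(dict)
    for key, model, fmt, tier in variants:
        groups[(model, tier)][fmt] = key
    return {group: fmts for group, fmts in groups.items() if len(fmts) > 1}


pairs = runtime_pairs(VARIANTS)
for (model, tier), fmts in pairs.items():
    print(model, tier, fmts)
```

The GGUF-only gemma-4-E4B row drops out because it has no MLX counterpart at the same tier.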

For generic benchmark use, each variant may also declare:

backend: openai-compatible     # runtime adapter; defaults to fmt
artifact_type: endpoint        # hf_repo, gguf_file, endpoint, ...
capabilities: [chat, completions, logprobs]
api_model: provider/model-id   # optional model= label for endpoint APIs
api_key_env: PROVIDER_API_KEY  # optional env var copied to Authorization/OPENAI_API_KEY

Existing mlx and gguf variants infer these fields automatically. Speed benchmark adapters cover MLX, GGUF, DS4, MTPLX, and OpenAI-compatible endpoints. Unsupported backends are rejected explicitly so new adapters can be added without silently misrouting results. Endpoint speed uses wall-clock effective token rates because hosted APIs generally do not expose separate prefill/generation timings. Eval runs can use openai-compatible endpoint variants directly; the endpoint is treated as an existing /v1 server and no local subprocess is spawned.
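The "backend defaults to fmt" rule and the explicit rejection of unknown backends could look like the sketch below. The backend identifiers in `SUPPORTED_BACKENDS`, the function names, and the Bearer header format are assumptions for illustration; the registry loader in src/llm_bench/registry.py is authoritative.

```python
import os

# Assumed identifiers; the real adapter names may differ.
SUPPORTED_BACKENDS = {"mlx", "gguf", "ds4", "mtplx", "openai-compatible"}


def resolve_backend(variant: dict) -> str:
    """Use the declared backend, else fall back to fmt; reject unknown values."""
    backend = variant.get("backend") or variant["fmt"]
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"unsupported backend {backend!r} for variant {variant.get('key')!r}"
        )
    return backend


def auth_headers(variant: dict, env=os.environ) -> dict:
    """Build an Authorization header from the env var named by api_key_env."""
    name = variant.get("api_key_env")
    token = env.get(name) if name else None
    return {"Authorization": f"Bearer {token}"} if token else {}


endpoint = {
    "key": "my-endpoint-api",
    "fmt": "api",
    "backend": "openai-compatible",
    "api_key_env": "PROVIDER_API_KEY",
}
print(resolve_backend(endpoint))
print(resolve_backend({"key": "qwen-27b-gguf-q8", "fmt": "gguf"}))
```

Failing loudly on an unknown backend, rather than falling through to a default runner, is what keeps results from being attributed to the wrong runtime.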

Idempotency

Every measurement records the current bench_version (currently 0.3). run_bench.py --skip-existing (default ON) skips (variant, scenario) combinations that already have the required number of runs at that version; --all-pending runs only what is missing across the registry. Bumping BENCH_VERSION in src/llm_bench/__init__.py triggers a full re-measurement when the methodology changes.
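The skip logic can be pictured as a count of existing runs per scenario at the current version. In the sketch below the record fields (`variant`, `prefill`, `gen`, `bench_version`) and the function name are hypothetical; src/llm_bench/manifest.py holds the real bookkeeping.

```python
from collections import Counter


def pending_scenarios(records, variant, scenarios, bench_version, runs_required=3):
    """Return scenarios that still lack enough runs at this bench_version."""
    done = Counter(
        (r["prefill"], r["gen"])
        for r in records
        if r["variant"] == variant and r["bench_version"] == bench_version
    )
    return [s for s in scenarios if done[s] < runs_required]


# Placeholder run records, not real results.
records = (
    [{"variant": "v", "prefill": 256, "gen": 128, "bench_version": "0.3"}] * 3
    + [{"variant": "v", "prefill": 1024, "gen": 128, "bench_version": "0.2"}] * 3
    + [{"variant": "v", "prefill": 256, "gen": 512, "bench_version": "0.3"}]
)
scenarios = [(256, 128), (1024, 128), (256, 512)]
pending = pending_scenarios(records, "v", scenarios, bench_version="0.3")
print(pending)
```

Runs recorded under an older version do not count, which is why bumping the version re-queues every scenario.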

Repository layout

models/
  registry.yaml             # single source of truth — add a model here
src/llm_bench/
  __init__.py               # BENCH_VERSION constant
  registry.py               # YAML loader + Variant/Model dataclasses
  manifest.py               # idempotency: which (variant, scenario) is measured
  index.py                  # build results/index.json (registry × status)
  eval_plan.py              # ordered catch-up plan from coverage gaps
  site_data.py              # typed public-site data export
  reporting.py              # shared ordering/report helpers
  runners/                  # speed/memory benchmark
    base.py                 # BenchResult + /usr/bin/time -l wrapper
    mlx_runner.py           # mlx_lm.generate subprocess
    gguf_runner.py          # llama-bench subprocess
    ds4_runner.py           # DeepSeek V4 Flash-specific ds4-bench adapter
    mtplx_runner.py         # MTPLX MTP-on / target-only AR speed adapter
    openai_runner.py        # OpenAI-compatible endpoint speed adapter
  evals/                    # multi-dim accuracy (lm-eval-harness)
    server.py               # ModelServer (mlx_lm.server | llama-server)
    lmeval.py               # lm_eval CLI wrapper
    suites.py               # SMOKE/FULL task lists, capability gating
    aggregate.py            # eval JSON → tidy DataFrame
    evalplus_runner.py      # HumanEval / MBPP via EvalPlus
    livecodebench_runner.py # contamination-fresh coding eval adapter
    bigcodebench_runner.py  # BigCodeBench-Hard adapter
    bfcl.py                 # BFCL v4 tool-use adapter
    sourceqa.py             # pinned-repo evidence QA diagnostic
    livebench_runner.py     # LiveBench subset adapter
    kmmlu_pro_runner.py     # KMMLU-Pro direct runner
    programbench_runner.py  # ProgramBench import/eval helpers
    terminal_bench_runner.py # Terminal-Bench run + import helpers
    trace.py                # per-task execution ledger
  prompts.py                # 20 quality-comparison prompts (KO/EN)
  scenarios.py              # speed scenario matrices
  aggregate.py              # speed raw JSON → summary CSV
scripts/
  sync_models.py            # registry-driven hf download
  run_bench.py              # speed CLI: --variant / --all-pending
  run_evals.py              # eval CLI: --variant / --all-variants
  run_evals_overnight.sh    # launchd stop → run → aggregate → restore
  preflight.py              # limit=2 wiring check for one local variant
  compare_quality.py        # cos-sim divergence (20 prompts)
  compare_mtplx.py          # paired MTPLX MTP vs AR summary
  aggregate_evals.py        # eval JSON → CSVs + index.json
  build_index.py            # build only the index
  plan_eval_catchup.py      # write results/eval_catchup_plan.{json,md}
  run_programbench.py       # run upstream ProgramBench eval + import
  run_terminal_bench.py     # run Terminal-Bench + import result
  import_programbench.py    # import existing ProgramBench eval JSON
  export_site_public_data.py # regenerate site/public/data and TS fixture
results/
  raw/                      # per-run speed JSON (gitignored)
  summary.csv               # speed aggregated (committed)
  mtplx_speedups.csv        # paired MTPLX MTP/AR summary (committed)
  quality_*.json            # gitignored
  eval_scores/              # lm-eval outputs (gitignored)
  eval_traces/              # per-task execution ledger
  eval_summary_*.csv        # eval aggregated (committed)
  index.json                # registry × measurement status (committed)
  eval_catchup_plan.*       # generated catch-up queue (committed)
  server_logs/              # gitignored
  overnight_logs/           # gitignored
dashboard/
  app.py                    # Streamlit (13 pages, Catalog as entry point)
site/
  app/                      # TanStack Start public benchmark site
  public/data/              # exported benchmark JSON/CSV for the site
  wrangler.jsonc            # Cloudflare Workers deployment config
report/
  _quarto.yml
  index.qmd                 # static HTML report (Quarto)
docs/
  methodology.md            # measurement protocol
  model_policy.md           # local model selection policy and caveats
  overnight_plan.md         # family-batched eval catch-up plan

Multi-dimensional evals (added v0.2)

Runs lm-eval-harness against an OpenAI-compatible server (mlx_lm.server for MLX, llama-server for GGUF) booted ad-hoc per model variant.

| Dimension | Tasks (chat-compatible) | Loglikelihood-only (gguf only) |
|---|---|---|
| Reasoning | mmlu_generative, gsm8k_cot_zeroshot | hellaswag, leaderboard_mmlu_pro, leaderboard_gpqa_diamond |
| Korean | kmmlu_direct, hrm8k | haerae, kobest |
| Code | primary: humaneval / mbpp (EvalPlus), bigcodebench_hard; optional legacy: livecodebench | |
| Instruction | leaderboard_ifeval | |
| Long context | longbench (21 sub-tasks, EN+ZH) | |
| Safety | truthfulqa-multi_gen_en | toxigen |
| Tool use | bfcl (BFCL v4, opt-in via --include-bfcl) | |
| Diagnostic source grounding | sourceqa (pinned-repo evidence QA, deterministic checker) | |
| Fresh eval | livebench_subset (LiveBench non-agentic subset) | |
| Agentic code | primary: terminal_bench (Docker-backed terminal tasks); optional: programbench eval + result import | |
| Korean professional | kmmlu_pro (KMMLU-Pro weighted MCQ) | |

The reasoning + instruction additions mirror HF Open LLM Leaderboard v2 (MMLU-Pro / GPQA-Diamond / IFEval). BigCodeBench-Hard is the primary practical code-generation task; LiveCodeBench remains available as a legacy contest-code baseline. Terminal-Bench is the primary agentic code task, and BFCL fills the tool-use dim.

mlx_lm.server does not return token logprobs in /v1/completions, so loglikelihood-based MCQ tasks (hellaswag, kobest, haerae, toxigen, leaderboard_mmlu_pro, leaderboard_gpqa_diamond) only run on the GGUF path. Generative variants are used for the rest so both runtimes get apples-to-apples coverage.
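The chat-versus-loglikelihood split can be sketched as capability gating. The task lists below are taken from the text above (the chat list is a subset), while the function name and the mapping from the `chat` and `logprobs` capabilities to task groups are assumptions; evals/suites.py defines the actual gating.

```python
# Subset of the generative tasks, for illustration.
CHAT_TASKS = ["mmlu_generative", "gsm8k_cot_zeroshot", "kmmlu_direct", "hrm8k"]
LOGLIKELIHOOD_TASKS = [
    "hellaswag",
    "kobest",
    "haerae",
    "toxigen",
    "leaderboard_mmlu_pro",
    "leaderboard_gpqa_diamond",
]


def tasks_for(capabilities: set[str]) -> list[str]:
    """Select tasks a variant can run, given its declared capabilities."""
    tasks = []
    if "chat" in capabilities:
        tasks += CHAT_TASKS
    if "logprobs" in capabilities:  # needed for loglikelihood scoring
        tasks += LOGLIKELIHOOD_TASKS
    return tasks


mlx_tasks = tasks_for({"chat", "completions"})
gguf_tasks = tasks_for({"chat", "completions", "logprobs"})
print(len(mlx_tasks), len(gguf_tasks))
```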

⚠️ Code-eval safety. Code-family tasks (humaneval, mbpp, bigcodebench_hard, livecodebench) run through dedicated external runners outside lm-eval. They still execute model-generated code or benchmark harness tooling, so use trusted checkpoints and keep sandboxing in mind. --suite full skips these for unsupported backends and also supports --skip-existing gating for repeatability.

Setup:

uv sync --extra evals

# Optional, for the frontier external runners:
uv pip install bfcl-eval==2025.12.17                                          # BFCL v4 (tool use)
uv pip install git+https://github.com/LiveCodeBench/LiveCodeBench.git          # LiveCodeBench (contamination-free code)
uv pip install bigcodebench --upgrade                                          # BigCodeBench-Hard (practical code)
uv pip install "terminal-bench>=0.2.18"                                        # Terminal-Bench (agentic terminal tasks)
git clone https://github.com/LiveBench/LiveBench.git /path/to/LiveBench        # LiveBench subset
export LIVEBENCH_REPO=/path/to/LiveBench

kmmlu_pro uses the datasets and openai packages already installed by uv sync --extra evals, but the dataset is gated: request access to LGAI-EXAONE/KMMLU-Pro on Hugging Face and authenticate with hf auth login or HF_TOKEN before running it.

Smoke (verify wiring, ~10 min, limit=2 per task):

uv run python scripts/run_evals.py --variant 26B-MoE-mlx-8bit --suite smoke --limit 2

Frontier external smoke (LiveBench / BigCodeBench-Hard / KMMLU-Pro with a small per-task cap):

uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full --limit 2
# Optional: resilient instruction eval and strict coverage check
uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full \
  --resilient-ifeval --strict-coverage

--suite full --limit N is the smoke path for these external runners. EvalPlus is skipped under a limit because its upstream CLI does not provide a compatible partial matrix, while LiveBench, BigCodeBench-Hard, and KMMLU-Pro do run with the cap.

For stricter governance on a full matrix run, add --strict-coverage so the run exits non-zero when any required supported primary task is missing a completed result (for example, external runner unavailable, limit-incompatible skip, or hard task error). Optional lanes (livecodebench, BFCL, LiveBench subset, ProgramBench) are reported in coverage but do not block the primary matrix. bigcodebench_hard and terminal_bench are primary rows. MTPLX MTP/AR rows are reported as speed_only under the mtplx_speedup lane and also do not block coverage.
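A minimal sketch of the strict-coverage decision follows. The optional lane names come from the paragraph above and the status labels from the coverage index; the function names and the rule that only a `missing` status blocks are assumptions, so treat this as an approximation of what --strict-coverage checks.

```python
OPTIONAL_LANES = {"livecodebench", "bfcl", "livebench_subset", "programbench"}


def blocking_tasks(status_by_task: dict[str, str]) -> list[str]:
    """Primary tasks with no completed result; optional lanes never block."""
    return sorted(
        task
        for task, status in status_by_task.items()
        if task not in OPTIONAL_LANES and status == "missing"
    )


def exit_code(status_by_task: dict[str, str]) -> int:
    """Non-zero when any required primary task is missing."""
    return 1 if blocking_tasks(status_by_task) else 0


# Placeholder statuses for one variant.
example = {
    "bigcodebench_hard": "measured",
    "terminal_bench": "missing",
    "bfcl": "missing",          # optional lane, reported but not blocking
    "hellaswag": "unsupported",  # backend cannot run it, so not required
}
print(blocking_tasks(example), exit_code(example))
```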

Terminal-Bench is the maintained agentic terminal benchmark path. It runs tasks inside Docker and talks to the same OpenAI-compatible model server as the rest of llm-bench. The wrapper defaults to one task to keep smoke tests cheap:

uv sync --extra terminalbench

# If Docker uses Colima, expose the active socket to the Python Docker SDK.
export DOCKER_HOST=unix://$HOME/.colima/default/docker.sock

uv run python scripts/run_terminal_bench.py \
  --variant 26B-MoE-gguf-q8 \
  --task-id hello-world
uv run python scripts/aggregate_evals.py

The same runner is available from the full eval CLI. In --suite full, Terminal-Bench is part of the primary matrix and defaults to one sampled task unless TERMINAL_BENCH_TASK_IDS, TERMINAL_BENCH_N_TASKS, or TERMINAL_BENCH_FULL=1 is set:

TERMINAL_BENCH_TASK_IDS=hello-world \
uv run python scripts/run_evals.py --variant 26B-MoE-gguf-q8 --suite full --task terminal_bench

ProgramBench is agentic: the model/agent must first produce a complete <instance_id>/submission.tar.gz codebase, then ProgramBench evaluates it in Docker. llm-bench wraps the evaluation and import step:

uv sync --extra programbench
uv run python scripts/run_programbench.py \
  --variant 26B-MoE-gguf-q8 \
  --source-dir /path/to/programbench/submission-run \
  --tasks-dir /path/to/ProgramBench/src/programbench/data/tasks \
  --workers 4 \
  --branch-workers 2 \
  --docker-cpus 8 \
  --limit 5
uv run python scripts/aggregate_evals.py

If the ProgramBench eval JSON files were produced elsewhere, import them directly:

uv run python scripts/import_programbench.py \
  --variant 26B-MoE-gguf-q8 \
  --source-dir /path/to/programbench/evaluated-run \
  --tasks-dir /path/to/ProgramBench/src/programbench/data/tasks

The headline ProgramBench metric is resolved_rate,none (fully solved program-rebuild tasks). almost_resolved_rate,none and avg_test_pass_rate,none are supporting diagnostics, not the primary ranker. Pass --tasks-dir when available so ignored ProgramBench branches/tests are excluded the same way as programbench info.

Full overnight matrix (all locally present variants × full suite): the wrapper script handles the optional launchd bootout, the run, and the launchd bootstrap automatically, and always restores agents on EXIT, even if the eval fails:

# Foreground (watch progress):
bash scripts/run_evals_overnight.sh

# Detached overnight (recommended):
nohup bash scripts/run_evals_overnight.sh > /tmp/llm-evals-overnight.log 2>&1 &
disown
tail -f /tmp/llm-evals-overnight.log

See docs/overnight_plan.md for the family-batched catch-up plan and optional lane schedule.

To generate the concrete catch-up queue from the current coverage index:

uv run python scripts/plan_eval_catchup.py

Env overrides:

  • SUITE=smoke|full (default full)
  • LIMIT=N (per-task sample cap)
  • VARIANTS="26B-MoE-mlx-8bit 26B-MoE-gguf-q8" (subset, default = all)
  • TASKS="kmmlu_pro" (task-filtered catch-up bucket)
  • LLM_BENCH_STRICT_COVERAGE=1 — pass --strict-coverage to run_evals.py
  • LLM_BENCH_RESILIENT_IFEVAL=1 — pass --resilient-ifeval to run_evals.py
  • LLM_BENCH_INCLUDE_BFCL=1 — pass --include-bfcl for the BFCL optional lane
  • TERMINAL_BENCH_TASK_IDS="hello-world" — comma/space-separated task filter
  • TERMINAL_BENCH_N_TASKS=N or TERMINAL_BENCH_FULL=1 — Terminal-Bench task count
  • TERMINAL_BENCH_MODEL=openai/<model> — LiteLLM model label override
  • TERMINAL_BENCH_DOCKER_HOST=unix://... — Docker socket override
  • LIVE_CODE_BENCH_REPO=/path/to/LiveCodeBench — run source checkout version
  • LIVE_CODE_BENCH_START_DATE=YYYY-MM-DD, LIVE_CODE_BENCH_END_DATE=YYYY-MM-DD, LIVE_CODE_BENCH_MAX_TOKENS=N — run a reproducible release window
  • LIVE_CODE_BENCH_RELEASE=release_vX (overrides the run_livecodebench dataset release; default from src/llm_bench/evals/livecodebench_runner.py)
  • LIVE_CODE_BENCH_NOT_FAST=1 — use the original non-lite LiveCodeBench code generation benchmark instead of the upstream default fast/lite setting
  • LIVEBENCH_REPO=/path/to/LiveBench, LIVEBENCH_RELEASE=YYYY-MM-DD, LIVEBENCH_MAX_TOKENS=N — LiveBench checkout and release selection
  • BIGCODEBENCH_EXECUTION=gradio|local|e2b, BIGCODEBENCH_GRADIO_ENDPOINT=https://... — BigCodeBench execution backend
  • KMMLU_PRO_MAX_TOKENS=N — maximum chat tokens for each KMMLU-Pro response
  • LAUNCH_AGENTS="com.you.foo com.you.bar" — launchd agent labels to stop before the run and restart at the end. Default empty = no launchd management; stop GPU-using processes manually instead.

Each variant boots its own server on port 9090; tasks run sequentially per variant. Expect ~2–3 hours per variant for the full suite.

sourceqa is a lightweight diagnostic runner inspired by repo-search benchmarks: it clones pinned source repositories, injects curated evidence files into a chat prompt, and writes deterministic acc,none / recall metrics to the same results_*.json shape as lm-eval. Because the current task set is small and saturated, SourceQA is kept for smoke/regression checks and excluded from headline ranking and primary coverage debt. Optional judge metadata can be recorded with --sourceqa-judge-model, but it does not affect the diagnostic score.

Results:

  • results/eval_scores/<run_id>/<task>/.../results_*.json — raw lm-eval output
  • results/eval_scores/summary_*.json — flat list of {variant, task, results}
  • results/eval_traces/<run_id>.jsonl — per-task execution ledger with status, wall time, artifacts, and errors
  • results/eval_summary_full.csv — every metric × subtask × variant (244+ rows)
  • results/eval_summary_primary.csv — one row per (variant, task), canonical metric
  • results/index.json — registry × speed/eval coverage with measured, directional, missing, optional, speed_only, and unsupported statuses
  • results/eval_catchup_plan.json / .md — ordered commands generated from the current coverage gaps
  • results/server_logs/<run_id>.log — model server stderr for debugging

After the eval run, scripts/aggregate_evals.py rebuilds the CSVs and the coverage index. The overnight wrapper calls this for you.
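For a quick look at a run before aggregating, the per-task ledger can be summarized by hand. The sketch below reads JSONL text and counts statuses; the field names (`task`, `status`, `wall_s`) and the sample entries are assumptions, since the ledger schema is not spelled out here.

```python
import json
from collections import Counter

# Placeholder ledger lines with assumed field names.
SAMPLE_LEDGER = """\
{"task": "gsm8k_cot_zeroshot", "status": "ok", "wall_s": 10.0}
{"task": "kmmlu_pro", "status": "error", "wall_s": 2.0, "error": "placeholder"}
{"task": "leaderboard_ifeval", "status": "ok", "wall_s": 5.0}
"""


def summarize_ledger(text: str) -> tuple[dict[str, int], float]:
    """Count entries per status and sum wall time across all tasks."""
    status_counts = Counter()
    total_wall_s = 0.0
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        status_counts[entry.get("status", "unknown")] += 1
        total_wall_s += float(entry.get("wall_s", 0.0))
    return dict(status_counts), total_wall_s


counts, total = summarize_ledger(SAMPLE_LEDGER)
print(counts, total)
```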

Dashboard (13 pages)

uv run streamlit run dashboard/app.py
# Opens on http://127.0.0.1:8502 by default.

| Group | Page | What it shows |
|---|---|---|
| Status | Model Status | Model-first benchmark progress, task matrix, and weak/missing task list |
| Status | Model Compare | Side-by-side model comparison for speed, eval scores, and coverage debt |
| Status | Catalog | Registry × measurement progress bars (entry point) |
| Speed | Speed Overview | TG/PP bar charts, peak memory, Pareto scatter |
| Speed | Speed Scaling | Context-length sweep by runtime |
| Speed | Output Quality (cos sim) | Optional paired response similarity |
| Speed | Speed Raw | Per-run JSON table + CSV download |
| Eval | Evals Heatmap | Variant × task primary-score grid |
| Eval | Evals · Runtime Compare | Score delta within same model+tier across backend/fmt/artifact groups |
| Eval | Evals · Quantization | 8bit vs 4bit accuracy hit per model/runtime |
| Eval | Evals · Dimension | Per-dim bar charts with stderr |
| Eval | Evals · LongBench Detail | 21 sub-task breakdown |
| Eval | Evals Raw | Full metrics filterable table + CSV |

Methodology

See docs/methodology.md for measurement protocol, sanity checks, scenario matrix rationale, and the chat-vs-loglikelihood split. See docs/model_policy.md for the current local model selection policy and headline eval scores.

Tests

uv run pytest tests/ -v

The pytest suite covers registry validation, manifest/idempotency behavior, speed runners, eval runners, aggregation, ProgramBench import/eval helpers, site data export, and public-site data contracts.

License

MIT
