The LLM benchmark toolkit for the pi coding agent.
Find the fastest, cheapest LLM models among all registered providers.
Probes every available model with a real stream() call using a representative prompt, then ranks by latency, cost, and output quality. Designed to feed smart model selection into pi-recap and other pi extensions.
- Universal provider loading: discovers and loads all pi extensions (Alibaba, Kimi, etc.) the same way pi does
- Real probes: fires actual streaming API calls and measures time to first byte and time to completion
- Quality scoring: classifies responses as ok / multi-sentence / refusal / question / empty
- Cost aware: calculates per-call cost in USD using model pricing
- 30s hard timeout: if the full run doesn't finish in time, the incremental CSV already contains every completed probe
- Per-provider concurrency: 8 parallel probes per provider to saturate throughput
- Standalone or extension: runs as a CLI script or as a pi slash command (/bench)
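To make the quality labels in the feature list concrete, here is a minimal sketch of how a response could be sorted into those five buckets. The heuristics, the `REFUSAL_HINTS` list, and the `classifyQuality` name are assumptions for illustration; the rules pi-bench actually applies may differ.

```typescript
type Quality = "ok" | "multi-sentence" | "refusal" | "question" | "empty";

// Hypothetical refusal markers; not taken from pi-bench.
const REFUSAL_HINTS = ["i can't", "i cannot", "i'm unable", "i am unable", "as an ai"];

function classifyQuality(response: string): Quality {
  const text = response.trim();
  if (text.length === 0) return "empty";

  const lower = text.toLowerCase();
  if (REFUSAL_HINTS.some((hint) => lower.includes(hint))) return "refusal";

  // A reply that ends by asking something back is treated as a question.
  if (text.endsWith("?")) return "question";

  // A sentence terminator followed by whitespace and more text means 2+ sentences.
  const sentenceBreaks = text.match(/[.!?]\s+\S/g) ?? [];
  if (sentenceBreaks.length > 0) return "multi-sentence";

  return "ok";
}

console.log(classifyQuality("Refactored the parser.")); // prints: ok
```

Checking the cheap conditions first (empty, refusal) keeps the later sentence-counting step from mislabeling a refusal that happens to span two sentences.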
Install into pi's extensions directory:
git clone https://github.com/fornace/pi-bench.git ~/.pi/agent/extensions/pi-bench

Then run inside pi:
/bench
Results are saved to bench-results-v6.csv in the extension directory.
Or run it standalone from the command line:
cd ~/.pi/agent/extensions/pi-bench
npx -y -p tsx tsx bench.mts

With custom output directory:
npx -y -p tsx tsx bench.mts --output-dir /tmp/bench-output

Or call it programmatically:
import { runBench, printTable } from "./bench.mts";
const { results, csvPath, stats } = await runBench({
  outputDir: "/tmp/bench",
  timeoutMs: 30000,
  concurrency: 8,
});
console.log(printTable(results));
console.log(`Probed ${stats.final} models → ${csvPath}`);

| Column | Description |
|---|---|
| rank | Position in latency ranking (ok models only) |
| id | Model ID |
| provider | Provider name (alibaba-cloud, google-vertex, etc.) |
| api | API type (anthropic-messages, google-vertex, etc.) |
| family | Model family tag (flash, turbo, plus, max, pro, etc.) |
| t_first_byte_ms | Time to first token in ms |
| t_complete_ms | Time to completion in ms |
| output_tokens | Tokens generated |
| cost_usd | Estimated cost in USD |
| status | ok / timeout / error:... / empty |
| quality | ok / multi-sentence / refusal / question / empty |
| sample | First 60 chars of response |
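As a rough illustration of how a consumer might use these columns, the sketch below picks the cheapest usable model within a latency budget. It assumes the rows have already been parsed into objects with numeric fields. The `BenchRow` type, the `pickModel` function, and the sample model IDs are hypothetical and not part of pi-bench.

```typescript
// Subset of the columns above, already parsed from the CSV (assumed shape).
interface BenchRow {
  id: string;
  provider: string;
  t_first_byte_ms: number;
  t_complete_ms: number;
  cost_usd: number;
  status: string; // ok / timeout / error:... / empty
  quality: string; // ok / multi-sentence / refusal / question / empty
}

// Cheapest model that answered cleanly within the budget; ties go to the faster one.
function pickModel(rows: BenchRow[], maxCompleteMs: number): BenchRow | undefined {
  return rows
    .filter((r) => r.status === "ok" && r.quality === "ok" && r.t_complete_ms <= maxCompleteMs)
    .sort((a, b) => a.cost_usd - b.cost_usd || a.t_complete_ms - b.t_complete_ms)[0];
}

// Made-up rows, for illustration only.
const sampleRows: BenchRow[] = [
  { id: "model-a", provider: "p1", t_first_byte_ms: 300, t_complete_ms: 600, cost_usd: 0.0002, status: "ok", quality: "ok" },
  { id: "model-b", provider: "p1", t_first_byte_ms: 500, t_complete_ms: 900, cost_usd: 0.0001, status: "ok", quality: "ok" },
  { id: "model-c", provider: "p2", t_first_byte_ms: 200, t_complete_ms: 400, cost_usd: 0, status: "ok", quality: "refusal" },
];

console.log(pickModel(sampleRows, 700)?.id); // prints: model-a
```

Filtering on both `status` and `quality` matters here: a model can return a fast, well-formed response that is still a refusal.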
Lists all models that passed the filter, plus dropped models with reasons.
| Constant | Default | Description |
|---|---|---|
| PER_CALL_TIMEOUT_MS | 4000 | Max time per individual probe |
| TOTAL_RUN_TIMEOUT_MS | 30000 | Hard cap for the entire bench run |
| CONCURRENCY_PER_PROVIDER | 8 | Parallel probes per provider |
| BATCH_GAP_MS | 200 | Delay between probe batches |
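The sketch below shows one way the per-call timeout, per-provider concurrency, and batch gap could interact. It is an assumption about the general technique (fixed-size batches, each probe raced against a timer), not a copy of the pi-bench scheduler; `probeInBatches`, `probeOne`, `raceWithTimeout`, and `ProbeOutcome` are hypothetical names.

```typescript
const PER_CALL_TIMEOUT_MS = 4000;
const CONCURRENCY_PER_PROVIDER = 8;
const BATCH_GAP_MS = 200;

type ProbeOutcome<R> =
  | { status: "ok"; value: R }
  | { status: "timeout" }
  | { status: "error"; message: string };

const TIMED_OUT = Symbol("timed out");
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolves to TIMED_OUT if `work` has not settled after `ms`.
// Note: this stops waiting, it does not cancel the underlying request.
async function raceWithTimeout<T>(work: Promise<T>, ms: number): Promise<T | typeof TIMED_OUT> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<typeof TIMED_OUT>((resolve) => {
    timer = setTimeout(() => resolve(TIMED_OUT), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function probeOne<M, R>(
  model: M,
  probe: (m: M) => Promise<R>,
  timeoutMs: number,
): Promise<ProbeOutcome<R>> {
  try {
    const result = await raceWithTimeout(probe(model), timeoutMs);
    if (result === TIMED_OUT) return { status: "timeout" };
    return { status: "ok", value: result as R };
  } catch (err) {
    return { status: "error", message: err instanceof Error ? err.message : String(err) };
  }
}

// Probes one provider's models in batches, pausing between batches.
async function probeInBatches<M, R>(
  models: M[],
  probe: (m: M) => Promise<R>,
  concurrency = CONCURRENCY_PER_PROVIDER,
  timeoutMs = PER_CALL_TIMEOUT_MS,
  gapMs = BATCH_GAP_MS,
): Promise<ProbeOutcome<R>[]> {
  const outcomes: ProbeOutcome<R>[] = [];
  for (let i = 0; i < models.length; i += concurrency) {
    const batch = models.slice(i, i + concurrency);
    outcomes.push(...(await Promise.all(batch.map((m) => probeOne(m, probe, timeoutMs)))));
    if (i + concurrency < models.length) await sleep(gapMs);
  }
  return outcomes;
}
```

Because every probe resolves to an outcome object instead of rejecting, one failing or slow model never aborts the rest of its batch.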
Models are filtered to text-capable candidates only. Blocklisted fragments: embed, audio, tts, whisper, transcribe, dall-e, dalle, imagen, stable-diffusion, midjourney, moderation, guard.
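A filter of this kind can be sketched as a substring check against the fragments listed above. The sketch assumes the match is a case-insensitive substring test on the model ID, which the text does not state explicitly; `isTextCandidate` is a hypothetical name.

```typescript
// Fragments copied from the blocklist above.
const BLOCKLIST_FRAGMENTS = [
  "embed", "audio", "tts", "whisper", "transcribe", "dall-e", "dalle",
  "imagen", "stable-diffusion", "midjourney", "moderation", "guard",
];

// True when the model ID contains none of the blocklisted fragments.
function isTextCandidate(modelId: string): boolean {
  const id = modelId.toLowerCase();
  return !BLOCKLIST_FRAGMENTS.some((fragment) => id.includes(fragment));
}

console.log(isTextCandidate("qwen-turbo")); // prints: true
console.log(isTextCandidate("whisper-1")); // prints: false
```

Substring matching is deliberately broad, so a text model whose ID happens to contain one of these fragments would be dropped as well.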
RANK  FB     TOTAL  COST  FAMILY  PROVIDER       ID
1     349ms  589ms  ~$0   plus    alibaba-cloud  qwen-vl-plus
2     436ms  620ms  ~$0   plus    alibaba-cloud  qwen-plus-2025-09-11
3     421ms  679ms  ~$0   flash   alibaba-cloud  qwen-flash
4     427ms  717ms  ~$0   turbo   alibaba-cloud  qwen-turbo
5     488ms  719ms  ~$0   plus    alibaba-cloud  qwen-vl-plus-2025-05-07
Top models are typically Alibaba Cloud Qwen variants, completing in roughly 700 ms or less at ~$0 cost.
pi-bench is designed to be consumed by other pi extensions. There are three integration patterns:
Import curated data directly from the package — no benchmark run needed:
import { CURATED_CHAIN, BLACKLIST_SEED } from "pi-bench";
// CURATED_CHAIN: ordered list of fast/cheap model IDs, ranked by latest bench
// BLACKLIST_SEED: known-bad models (404s, refusals, empty responses)

pi-recap uses this for its model picker chain. When you run a new benchmark, pi-bench updates CURATED_CHAIN and pi-recap picks up the new winners automatically, with no config changes needed.
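A consumer could walk the chain along these lines. This is a sketch that assumes both exports are plain arrays of model ID strings; the arrays below are local stand-ins with made-up IDs, and `pickFromChain` is a hypothetical helper, not a pi-bench export.

```typescript
// Hypothetical stand-ins for the real exports from "pi-bench".
const CURATED_CHAIN: string[] = ["example-fast", "example-cheap", "example-fallback"];
const BLACKLIST_SEED: string[] = ["example-fast"];

// First model in chain order that is not blacklisted and is currently available.
function pickFromChain(
  chain: string[],
  blacklist: string[],
  available: Set<string>,
): string | undefined {
  const blocked = new Set(blacklist);
  return chain.find((id) => !blocked.has(id) && available.has(id));
}

const availableModels = new Set(["example-fast", "example-fallback"]);
console.log(pickFromChain(CURATED_CHAIN, BLACKLIST_SEED, availableModels)); // prints: example-fallback
```

Since the chain is already ordered by the latest benchmark, taking the first surviving entry preserves the ranking without re-sorting.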
Reuse the interactive model selector from your own extension:
import { showBenchmarkUI } from "pi-bench/ui.js";
// csvPath points to bench-results-v6.csv
const picked = await showBenchmarkUI(ctx, csvPath, "Pick a model");

This renders a scrollable, filterable SelectList with all benched models ranked by latency and returns the selected model ID. It is used by pi-recap's /recap → model: ... menu.
The CSV lives in the pi-bench extension directory. Resolve it at runtime:
import { fileURLToPath } from "node:url";
import * as path from "node:path";
const benchDir = path.dirname(fileURLToPath(import.meta.resolve("pi-bench/package.json")));
const csvPath = path.join(benchDir, "bench-results-v6.csv");

When pi-bench runs as a slash command (/bench), it detects whether a TUI is available via ctx.hasUI. Without a TUI (headless mode), results are printed to the console. With a TUI, the interactive selector is shown. The same benchmark subprocess runs in both cases; only the output display changes.
MIT
By Francesco Frapporti at Fornace.
- pi-recap — Always-visible session recap panel for pi. Uses pi-bench data to pick the fastest summarization model.
- pi-banana — Generate and edit images inside pi using Google Nano Banana. Banner images for all these packages were created with pi-banana.
- pi-alibaba-models — Complete Alibaba provider for pi: Qwen, DeepSeek, Kimi, GLM, MiniMax with native thinking levels.
- pi-notte-theme — Notte: a true-dark pi theme where darkness has color and text glows like terminal phosphor.
