Skip to content

dotkrnl/ipbr-rank

Repository files navigation

ipbr

Models drift. Agents battle. Math decides. Live LLM coding scoreboard — site at https://ipbr.pages.dev.

A Rust workspace that fetches public LLM benchmarks from verified sources, normalizes them, computes four building-role scores (Idea, Planning, Building, Reviewing), and emits a canonical TOML scoreboard plus a static website. (CLI binary and crate names stay ipbr-rank.)

Fully vibe-coded. No human wrote the scoring weights — Claude, Gemini, GPT, and Kimi argued over every coefficient, every group composition, and every penalty curve in this repo. Round after round of cross-model code review settled the numbers; the human just refereed and pressed merge. The repo's copyright reflects that: the four debating models are the credited authors. See docs/methodology.md for what they landed on.

Quick Start

# One-shot live refresh (sources .env, builds release, runs fetch→score→render)
scripts/refresh.sh                 # writes out/
scripts/refresh.sh --open          # also opens out/site/index.html
scripts/refresh.sh --offline       # use cached responses only
scripts/refresh.sh --only artificial_analysis,lmarena
scripts/refresh.sh --publish       # also deploy out/site to Cloudflare Pages

.env is sourced for credentials:

variable needed for
AA_API_KEY Artificial Analysis fetcher
OPENROUTER_API_KEY OpenRouter pricing/context
HF_TOKEN LMArena via HuggingFace
CLOUDFLARE_ACCOUNT_ID only required for --publish
CLOUDFLARE_PAGES_PROJECT optional, default ipbr

Manual invocation works too:

cargo build --release -p ipbr-rank-cli
./target/release/ipbr-rank --cache cache --out out all

Deployment

The rendered site lives at out/site/ and is fully static (no external network deps; the validator rejects http://, https://, and data: URLs).

scripts/refresh.sh --publish deploys it to Cloudflare Pages via wrangler:

# one-time auth (interactive, opens a browser)
npx wrangler login

# fetch + render + deploy
scripts/refresh.sh --publish

The current production deployment is at https://ipbr.pages.dev. CI re-runs refresh.sh --publish every 10 minutes on main (see .github/workflows/refresh.yml), so the site stays current without manual intervention.

Four Building Roles

  • I (Idea): Creativity, general intelligence, open-ended generation
  • P (Planning): Structured reasoning, function calling, multi-step task decomposition
  • B (Building): Implementation, SWE-style work, terminal tasks, and code-quality benchmarks
  • R (Reviewing): Judging code quality, correctness, preference evaluation

Each role has a single raw score (0–100, based on public benchmarks).

Sources

All data comes from public, verifiable sources. See docs/sources.md for the full list.

Verified sources (always run):

  • OpenRouter API — model discovery, pricing, context windows
  • LM Arena — preference ratings across text, code, and hard prompts
  • Artificial Analysis — intelligence/coding/reasoning/math indices, plus tau2-bench, scicode, ifbench, terminalbench-hard, livecodebench, mmlu-pro, lcr, and gpqa+hle reasoning blend
  • SWE-bench JSON — Verified + Multilingual leaderboards (single fetch, both fed into the SWE composite)
  • SWE-bench Pro (Scale) — harder, multi-file SWE-bench (1.8k tasks across 41 repos), also fed into the SWE composite
  • SWE Atlas (Scale) — codebase Q&A, test writing, and refactoring leaderboards, collapsed into a SWE Atlas composite
  • SWE-rebench — continuously-refreshed agentic SWE leaderboard, rolling-window resolved rate
  • LiveCodeBench — competitive-programming pass@1 (ingested for back-compat; retired from BUILD weighting after the upstream JSON froze at mid-2025 frontier — see docs/sources.md)
  • GSO — "Generalized Software Optimization" track from the LiveCodeBench operators; replaces LiveCodeBench in BUILD using the contamination-resistant score_hack_control field
  • Terminal-Bench 2.0 — agentic terminal task leaderboard
  • Terminal-Bench 2.1 — newer narrow Terminal-Bench track covering current frontier agent/model combinations
  • BFCL V4 — Berkeley function/tool-calling leaderboard
  • Sonar Code Quality — functional pass rate plus issue, bug, and vulnerability density (the only public benchmark that measures generated-code quality directly)
  • MCP-Atlas (Scale) — real Model Context Protocol tool-orchestration over 36 servers / 220 tools / 1k tasks
  • HiL-Bench (Scale) — human-in-the-loop escalation accuracy for ambiguous or blocked agent tasks
  • ARC-AGI v2 — novel pattern-induction benchmark from ARC Prize (semi-private track)
  • Manual overrides (data/score_overrides.toml) — hand-curated vendor-published metric values (SWE-bench, Terminal-Bench, GDPval, MCP-Atlas, and reported-only launch-card metrics) for models the public leaderboards have not yet rated

Math Summary

Normalization

Each benchmark metric is percentile-normalized within the active model population (5th/95th boundaries, log-scaled for cost/speed/latency). Operational metrics (speed/cost/TTFT/context window) use a tail-penalty curve instead — top 80 % of the population maps into 70-100 (mild differentiation) and only the bottom 20 % drops sharply, because users perceive operational speed in tiers, not linearly.

Synthesis Penalty

Values that came in via conservative sibling synthesis are blended toward 50 by 15 % so they read as a softer signal than direct measurements: final = score × 0.85 + 50 × 0.15. Same-vendor, same-series version-advance fills carry no synthesis penalty. Synthesized metrics still contribute; the category in data/synthesis_aliases.toml controls whether they are discounted.

Manual overrides from data/score_overrides.toml are also softened after normalization, but less aggressively: final = score × 0.90 + 50 × 0.10. They are public, cited values, yet still hand-curated rather than directly ingested leaderboard rows.

Group Aggregation

Metrics are grouped into CRE, GEN, PLAN, BUILD, LM_ARENA_REVIEW_PROXY, OPS_long, OPS_precision, and OPS_review. Each group is a weighted average of its metrics. When a model is missing metrics, the aggregator blends smoothly from shrink-to-50 to trusting the present-weight mean across 60-80 % group coverage; at ≥80 % coverage, peripheral missing metrics no longer penalize otherwise well-covered models.

Final Scores

Each role score is a weighted average of groups. AISL was removed from scoring after local reproduction showed its benchmark surface was not representative enough and was too noise-prone. Its former weight is redistributed into the remaining non-operational public benchmark groups. Operational metrics (speed, cost, context window) carry 0.08 — paired with the tail-penalty curve, this means "fast enough" models cluster within a 1-2 point spread but genuinely slow models lose 4-6 points:

  • I_raw = 0.62×CRE + 0.30×GEN + 0.08×OPS_long
  • P_raw = 0.55×PLAN + 0.37×GEN + 0.08×OPS_precision
  • B_raw = 0.84×BUILD + 0.08×PLAN + 0.08×OPS_precision
  • R = 0.18×LM_ARENA_REVIEW_PROXY + 0.36×BUILD + 0.38×PLAN + 0.08×OPS_review

See docs/methodology.md for the complete mathematical derivation and all coefficients.

Sample Output (TOML)

schema_version = "1.0.0"
generated_at = "2026-04-26T11:35:46Z"
generator = "ipbr-rank 0.1.0"
methodology = "v1"

[[models]]
canonical_id = "anthropic/claude-opus-4.7"
display_name = "Claude Opus 4.7"
vendor = "anthropic"
thinking_effort = "default"
aliases = ["opus 4.7", "claude-opus-4-7"]
sources = ["openrouter", "lmarena", "artificial_analysis"]

[models.scores]
i_raw = 78.4
p_raw = 81.1
b_raw = 79.6
r = 84.0
# i_adj/p_adj/b_adj retained as raw aliases for API back-compat.
i_adj = 78.4
p_adj = 81.1
b_adj = 79.6

[models.groups]
CRE = 80.2
GEN = 79.1
# ...

[models.metrics]
LMArenaText = 82.5
SWEBenchVerified = 76.0
# ...

[models.missing]
metrics = []
groups_shrunk = []

See docs/output-schema.md for the complete TOML schema reference.

CLI Reference

ipbr-rank [OPTIONS] <COMMAND>

Commands:
  fetch            Download all enabled sources into --cache
  score            Read --cache, write scoreboard.toml + missing.toml + coefficients.toml
  render           Read scoreboard.toml, write static site to out/site/
  all              fetch -> score -> render (default)
  verify-sources   Run contract tests against live endpoints
  list-models      Emit canonical IDs + vendor from required_aliases.toml
  triage           List unmatched leaderboard rows from the cache (--min-count N)

Options:
  --out DIR                     Output directory [default: out]
  --coefficients PATH           Override embedded coefficients.toml
  --aliases PATH                Override embedded required_aliases.toml
  --cache DIR                   Cache directory for fetched responses
  --offline                     Fail if any source is not in --cache
  --only SOURCE,SOURCE          Fetch only specific sources
  --aa-api-key-file PATH        File containing AA_API_KEY
  --openrouter-api-key-file PATH
  --hf-token-file PATH
  --now ISO8601                 Override generated_at timestamp (for tests)

Cache & TTL

--cache DIR activates a persistent on-disk cache. Each source declares its own freshness window (Source::cache_ttl); when the cache file's mtime is within that window, the live fetch is skipped. --offline always reads from cache regardless of mtime.

source TTL rationale
openrouter, lmarena, artificial_analysis 24h daily refresh
livecodebench, gso 2d weekly-ish leaderboard refreshes
swebench, swebench_pro, swerebench, terminal_bench, mcp_atlas, arc_agi, sonar 7d infrequent updates

To force a refresh of one source, delete its cache file (or touch -t it to the past) and rerun.

The HTTP layer also retries on 429 Too Many Requests and 5xx with exponential backoff (500 ms → 60 s, up to 6 attempts), honoring Retry-After when present — the HuggingFace datasets-server in particular rate-limits aggressively while paginating LMArena, so set HF_TOKEN when available.

Offline Mode (for CI/tests)

# Deterministic golden test against fixtures
ipbr-rank all \
  --offline \
  --cache data/fixtures \
  --out tests/golden/out \
  --now 2026-01-01T00:00:00Z

Overriding Coefficients

# Copy embedded coefficients
cp data/coefficients.toml my_coefficients.toml

# Edit weights, then:
ipbr-rank all --coefficients my_coefficients.toml

# The effective coefficients are echoed to out/coefficients.toml

Adding a Source

See docs/adding-a-source.md for the verification protocol and implementation checklist.

Architecture

ipbr-rank/
├── crates/
│   ├── core/          # Pure math: data model, normalization, scoring
│   ├── sources/       # Per-source fetchers behind trait Source
│   ├── render/        # TOML + static HTML emission
│   └── cli/           # Binary orchestration
├── data/
│   ├── coefficients.toml       # All weights and metric definitions
│   ├── required_aliases.toml   # Canonical ID → vendor + alias list
│   └── fixtures/               # Snapshotted responses for offline tests
├── docs/
│   ├── methodology.md          # Full math explanation
│   ├── sources.md              # One section per source
│   ├── adding-a-source.md      # Verification protocol
│   └── output-schema.md        # TOML schema reference
└── scripts/
    └── refresh.sh              # One-shot fetch → score → render (+ optional --publish)

The static site (theme, scripts, HTML) is generated entirely from Rust in crates/render/src/site/ — there are no external template files.

Pre-commit

A .pre-commit-config.yaml runs cargo fmt --check, cargo clippy -D warnings, scripts/check-docs.sh, and basic repo-hygiene hooks.

pip install pre-commit  # one-time
pre-commit install      # installs the git hook
pre-commit run --all-files  # one-shot full sweep

Testing

# All unit + contract + golden tests
cargo test --workspace

# Live smoke (best-effort, network-dependent)
cargo test --workspace --features live

# Update golden files (review diff before committing)
UPDATE_GOLDEN=1 cargo test

License

Released under the MIT License — see LICENSE.

About

live llm coding scores: models drift. agents battle. math decides.

Topics

Resources

License

Stars

Watchers

Forks

Contributors

Languages