The public scoreboard for loop engineering.
Fixed tasks. Fixed seeds. Observed LES. Submissions anyone can audit.
No hand-waved demos — bring an LSS spec, get a number, climb the leaderboard.

    pip install loopbench loopgym
    loopbench list

Run your first score · Leaderboard · Suite overview

You submit a loop specification (LSS YAML). LoopBench:
- Runs it through LoopGym on fixed task instances
- Computes Success@k and LES_obs across eight categories
- Validates your `results.json` against a published schema
- Ranks you on the public leaderboard

    loopbench run --task LB-CR-1 --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
    loopbench validate results.json
    loopbench rank leaderboard/entries.json

How the pieces connect (Mermaid flowchart):

    flowchart LR
        YOU["Your LSS spec"]
        LB["LoopBench<br/>tasks · scoring · conformance"]
        LG["LoopGym<br/>SimEnv execution"]
        OUT["results.json → leaderboard"]
        YOU --> LB
        LB --> LG
        LG --> LB
        LB --> OUT

| Layer | Owns | Repo |
|---|---|---|
| Spec | LSS schema, LES formulas | Loop Core Engineering |
| Data | Trajectories (holdout v0.2) | LoopNet |
| Runtime | `env.run_episode()` | LoopGym |
| Observability | LTF traces, iteration metrics | loop-observability |
| Measurement | Tasks, LES_obs, anti-gaming | LoopBench |
LoopBench defines and scores. LoopGym runs. Never the other way around.
New to the stack? Start with the LoopNet end-to-end tutorial.
| ID | Name | What it exposes |
|---|---|---|
| LB-CR-1 | Code repair | Can your loop fix broken code under verify pressure? |
| LB-RS-1 | Research synthesis | Quality vs. cost on structured briefs |
| LB-MA-1 | Multi-agent debate | Autonomy + coordination under evaluator scrutiny |
Five seeds per task. Details in tasks/.

    pip install loopbench loopgym
    loopbench list
    loopbench run \
      --task LB-CR-1 \
      --spec submissions/examples/spec-fast-loop.yaml \
      --seeds 0,1,2,3,4 \
      -o results.json
    loopbench validate results.json

Submit to the leaderboard: open a PR adding your entry to `leaderboard/entries.json`.

v0.1 accepts SimEnv submissions only (fully reproducible, no API keys). LiveEnv tier: v0.2.
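
To score one spec on all three tasks, the `run` and `validate` commands above can be driven from a small script. The following is a minimal sketch, not part of LoopBench: it uses only the CLI flags shown in this README, while the helper names (`build_run_command`, `run_suite`), the `results/` directory, and the one-file-per-task naming are hypothetical. It also assumes the CLI exits non-zero on failure, which this README does not state.

```python
import subprocess
from pathlib import Path

# Task IDs and seeds as listed in this README.
TASKS = ["LB-CR-1", "LB-RS-1", "LB-MA-1"]
SEEDS = [0, 1, 2, 3, 4]


def build_run_command(task, spec, out, seeds=SEEDS):
    """Build the argv list for `loopbench run` using only documented flags."""
    return [
        "loopbench", "run",
        "--task", task,
        "--spec", str(spec),
        "--seeds", ",".join(str(s) for s in seeds),
        "-o", str(out),
    ]


def build_validate_command(out):
    """Build the argv list for `loopbench validate`."""
    return ["loopbench", "validate", str(out)]


def run_suite(spec, out_dir="results", dry_run=False):
    """Run and validate every task; return the commands in execution order.

    The per-task output path is a hypothetical layout chosen for this sketch.
    """
    out_dir = Path(out_dir)
    commands = []
    for task in TASKS:
        out = out_dir / f"{task}.json"
        commands.append(build_run_command(task, spec, out))
        commands.append(build_validate_command(out))
    if not dry_run:
        out_dir.mkdir(parents=True, exist_ok=True)
        for cmd in commands:
            # Assumption: a failed run or validation returns a non-zero exit code.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `run_suite("your-loop.yaml", dry_run=True)` returns the six commands without executing them, which is a convenient way to check the arguments before a real run.
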
| Metric | Meaning |
|---|---|
| Success@k | Fraction of instances reaching goal threshold |
| LES_obs | Observed composite ∈ [0, 1] — eight categories |
| Cost | Estimated USD from LSS cost limits |
| Robustness | Quality retention across seeds |
Display scale 0–100 is optional (les × 100).
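
As a rough illustration of two rows above, the sketch below computes a success fraction and the optional 0–100 display value. It is not LoopBench's scoring code: `InstanceResult`, its `quality` field, and the "quality at or above the threshold" test are assumptions made for the example, and the role of k in Success@k is not modelled because this README does not define it.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    seed: int
    quality: float  # hypothetical per-instance score in [0, 1]


def success_fraction(results, goal_threshold):
    """Fraction of instances whose quality reaches the goal threshold."""
    if not results:
        raise ValueError("need at least one instance result")
    hits = sum(1 for r in results if r.quality >= goal_threshold)
    return hits / len(results)


def to_display_scale(les):
    """Convert a composite score in [0, 1] to the optional 0-100 display scale."""
    if not 0.0 <= les <= 1.0:
        raise ValueError("LES must lie in [0, 1]")
    return les * 100


# Five seeds, as used per task; the quality values are made up.
example = [
    InstanceResult(seed=0, quality=0.90),
    InstanceResult(seed=1, quality=0.40),
    InstanceResult(seed=2, quality=0.80),
    InstanceResult(seed=3, quality=0.95),
    InstanceResult(seed=4, quality=0.60),
]
print(success_fraction(example, goal_threshold=0.8))  # 3 of 5 instances pass
print(to_display_scale(0.5))
```
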
| You are… | LoopBench gives you… |
|---|---|
| Loop designer | A number you can improve release-over-release |
| Framework author | A neutral arena — not your own benchmark |
| Researcher | Reproducible tasks + published submission schema |
| Team lead | Comparable scores across designs and vendors |

    @software{loopbench2026,
      title={LoopBench: Benchmark Suite for Loop Engineering},
      author={Malpani, Kanak},
      year={2026},
      url={https://pypi.org/project/loopbench/}
    }

MIT · v0.1 · Contributing · Security · Status