KanakMalpani/LoopBench

LoopBench

The public scoreboard for loop engineering.

Fixed tasks. Fixed seeds. Observed LES. Submissions anyone can audit.

No hand-waved demos — bring an LSS spec, get a number, climb the leaderboard.



pip install loopbench loopgym
loopbench list

Run your first score · Leaderboard · Suite overview


LoopBench: install, list tasks, run, validate, rank

What LoopBench measures

You submit a loop specification (LSS YAML). LoopBench:

  1. Runs it through LoopGym on fixed task instances
  2. Computes Success@k and LES_obs across eight categories
  3. Validates your results.json against a published schema
  4. Ranks you on the public leaderboard

The same flow from the command line:

loopbench run --task LB-CR-1 --spec your-loop.yaml --seeds 0,1,2,3,4 -o results.json
loopbench validate results.json
loopbench rank leaderboard/entries.json
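
The run and validate steps above can also be driven from a script. The following is a minimal sketch, assuming the `loopbench` CLI is on `PATH`, accepts exactly the flags shown above, and exits non-zero on failure. The helper names `build_run_command` and `run_pipeline` are hypothetical and not part of LoopBench.

```python
import subprocess
from typing import Iterable, List


def build_run_command(task: str, spec: str, seeds: Iterable[int], out: str) -> List[str]:
    """Build the argv for `loopbench run` using only the flags shown in this README."""
    seed_arg = ",".join(str(s) for s in seeds)
    return ["loopbench", "run", "--task", task, "--spec", spec,
            "--seeds", seed_arg, "-o", out]


def run_pipeline(task: str, spec: str, seeds: Iterable[int], out: str = "results.json") -> None:
    """Run, then validate. check=True raises CalledProcessError on a non-zero exit
    (assumption: the CLI signals failure through its exit code)."""
    subprocess.run(build_run_command(task, spec, seeds, out), check=True)
    subprocess.run(["loopbench", "validate", out], check=True)


# Show the command that would be executed, without running it.
print(" ".join(build_run_command("LB-CR-1", "your-loop.yaml", range(5), "results.json")))
```

Calling `run_pipeline("LB-CR-1", "your-loop.yaml", range(5))` would execute both steps. Passing the arguments as a list avoids shell quoting issues with spec paths.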

The measurement stack

flowchart LR
  YOU["Your LSS spec"]
  LB["LoopBench<br/>tasks · scoring · conformance"]
  LG["LoopGym<br/>SimEnv execution"]
  OUT["results.json → leaderboard"]

  YOU --> LB
  LB --> LG
  LG --> LB
  LB --> OUT

| Layer | Owns | Repo |
| --- | --- | --- |
| Spec | LSS schema, LES formulas | Loop Core Engineering |
| Data | Trajectories (holdout v0.2) | LoopNet |
| Runtime | env.run_episode() | LoopGym |
| Observability | LTF traces, iteration metrics | loop-observability |
| Measurement | Tasks, LES_obs, anti-gaming | LoopBench |

LoopBench defines and scores. LoopGym runs. Never the other way around.
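
One way to picture that one-way dependency is the sketch below. It is an illustration only: `EpisodeRunner`, `StubEnv`, `score_task`, and the returned dictionary shape are hypothetical. Only the method name `run_episode` comes from this README, and its real signature in LoopGym may differ. The plain average stands in for scoring and is not the LES formula.

```python
from typing import Dict, List, Protocol


class EpisodeRunner(Protocol):
    """What the measurement layer needs from a runtime (hypothetical interface)."""

    def run_episode(self, seed: int) -> Dict[str, float]: ...


class StubEnv:
    """Stand-in for a runtime environment; returns a fixed, made-up quality value."""

    def run_episode(self, seed: int) -> Dict[str, float]:
        return {"seed": float(seed), "quality": 0.5}


def score_task(env: EpisodeRunner, seeds: List[int]) -> float:
    """Measurement layer: calls the runtime once per seed and averages the quality.
    The runtime never calls back into the scorer."""
    episodes = [env.run_episode(seed) for seed in seeds]
    return sum(e["quality"] for e in episodes) / len(episodes)


print(score_task(StubEnv(), [0, 1, 2, 3, 4]))  # 0.5
```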

New to the stack? Start with the LoopNet end-to-end tutorial.


Tasks (v0.1)

| ID | Name | What it exposes |
| --- | --- | --- |
| LB-CR-1 | Code repair | Can your loop fix broken code under verify pressure? |
| LB-RS-1 | Research synthesis | Quality vs. cost on structured briefs |
| LB-MA-1 | Multi-agent debate | Autonomy + coordination under evaluator scrutiny |

Five seeds per task. Details in tasks/.


Score in 2 minutes

pip install loopbench loopgym

loopbench list

loopbench run \
  --task LB-CR-1 \
  --spec submissions/examples/spec-fast-loop.yaml \
  --seeds 0,1,2,3,4 \
  -o results.json

loopbench validate results.json

Submit to the leaderboard: open a PR adding your entry to leaderboard/entries.json.

v0.1 accepts SimEnv submissions only (fully reproducible, no API keys). A LiveEnv tier is slated for v0.2.


Metrics explained

| Metric | Meaning |
| --- | --- |
| Success@k | Fraction of instances reaching goal threshold |
| LES_obs | Observed composite ∈ [0, 1] across eight categories |
| Cost | Estimated USD from LSS cost limits |
| Robustness | Quality retention across seeds |
Robustness Quality retention across seeds

Displaying LES on a 0–100 scale is optional (les × 100).
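
As a rough illustration of two of these definitions, the sketch below computes a success fraction and the optional display scale. It is not LoopBench's scoring code: the function names, the use of `>=` for "reaching" the threshold, and the example scores are all assumptions, and it does not model the `k` in Success@k, which this README does not define.

```python
from typing import Sequence


def success_fraction(scores: Sequence[float], goal_threshold: float) -> float:
    """Fraction of instances whose score reaches the goal threshold."""
    if not scores:
        raise ValueError("need at least one instance score")
    hits = sum(1 for s in scores if s >= goal_threshold)
    return hits / len(scores)


def display_score(les: float) -> float:
    """Map an LES value in [0, 1] to the optional 0-100 display scale."""
    if not 0.0 <= les <= 1.0:
        raise ValueError("LES is expected to lie in [0, 1]")
    return les * 100


# Illustrative, made-up scores for five seeds (not real results).
example_scores = [0.9, 0.4, 0.75, 0.8, 0.6]
print(success_fraction(example_scores, goal_threshold=0.7))  # 0.6
print(display_score(0.5))  # 50.0
```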


Who this is for

| You are… | LoopBench gives you… |
| --- | --- |
| Loop designer | A number you can improve release-over-release |
| Framework author | A neutral arena, not your own benchmark |
| Researcher | Reproducible tasks + published submission schema |
| Team lead | Comparable scores across designs and vendors |

Citation

@software{loopbench2026,
  title={LoopBench: Benchmark Suite for Loop Engineering},
  author={Malpani, Kanak},
  year={2026},
  url={https://pypi.org/project/loopbench/}
}

MIT · v0.1 · Contributing · Security · Status

About

MLPerf-style benchmark suite for Loop Engineering
