WorldBench catches when a robot world model looks right but is actually wrong: it checks whether generated futures follow robot actions, contact physics, temporal consistency, and object permanence.
git clone https://github.com/tigee1311/worldbench.git
cd worldbench
python3 --version
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,video]"
worldbench --help
worldbench demo
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/bad_model
worldbench compare examples/demo_dataset --models good_model bad_model
worldbench benchmark --demo
worldbench dashboard .worldbench/runs/latest/result.json

WorldBench Report
Overall Score: 42/100
Action Consistency: 31/100
Contact Realism: 20/100
Object Permanence: 55/100
Main failure:
The model generates plausible frames but ignores the robot action sequence.
Not another world model. The test suite for world models.
git clone https://github.com/tigee1311/worldbench.git
cd worldbench
python3 --version
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,video]"
worldbench --help
worldbench demo
worldbench validate examples/demo_dataset
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/good_model
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/bad_model
worldbench compare examples/demo_dataset --models good_model bad_model
worldbench benchmark --demo
worldbench report .worldbench/runs/latest/result.json
worldbench dashboard .worldbench/runs/latest/result.json

worldbench eval writes timestamped runs under .worldbench/runs/ and also updates .worldbench/runs/latest/result.json for quick iteration.
WorldBench requires Python 3.10+. If your system python3 is Python 3.9 or lower, install Python 3.11 and create the virtual environment with python3.11 -m venv .venv.
Use python -m pip instead of pip so the package installs into the active virtual environment. This avoids pip: command not found and prevents installing into the wrong Python.
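The two environment requirements above (Python 3.10+ and an active virtual environment) can be checked from Python itself. The following is a minimal sketch using only the standard library; the helper names `python_is_supported` and `in_virtualenv` are illustrative and are not part of WorldBench.

```python
import sys

MIN_PYTHON = (3, 10)  # WorldBench requires Python 3.10+


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum version."""
    return tuple(version_info[:2]) >= minimum


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True inside a venv, where sys.prefix differs from sys.base_prefix."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


if __name__ == "__main__":
    # sys.executable shows which interpreter `python -m pip` would install into.
    print("python executable:", sys.executable)
    print("version supported:", python_is_supported())
    print("virtualenv active:", in_virtualenv())
```

If either check prints False, recreate the virtual environment with a newer Python before installing.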
WorldBench is a Python SDK, CLI, and local dashboard for robotics AI teams building or evaluating world models. It takes a robot rollout dataset plus predicted future frames and produces:
- Control-aware metric scores
- Per-episode failure evidence
- Benchmark-style model comparisons
- Synthetic benchmark scenario results
- Markdown reports
- A zero-dependency local HTML dashboard
- An experimental LeRobot-style local folder import
- A synthetic demo that works without robots, GPUs, or model training
Input:
- robot rollout frames
- action logs
- state data
- predicted future frames
Output:
- control-aware scores
- failure evidence
- Markdown reports
- local dashboard
- model comparison results
Robotics world models can make futures that look realistic while still being wrong for control. A prediction is not useful if it moves opposite the commanded action, teleports a cube before contact, drops a task object, or flickers across the rollout.
WorldBench focuses on the failure modes that matter when a robot planner consumes generated futures. It is for world-model builders, robotics ML researchers, and evaluation engineers who need more than a pretty-video metric before trusting predictions in planning loops.
Traditional video metrics can say a prediction is good even when it is useless for robotics.
A world model can score high visually while:
- moving the robot opposite the commanded action
- teleporting objects before contact
- dropping task-relevant objects
- flickering across frames
- breaking state/action alignment
WorldBench adds control-aware metrics for robotics world models.
WorldBench is currently installed from a source checkout:
git clone https://github.com/tigee1311/worldbench.git
cd worldbench
python3 --version
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,video]"
worldbench --help

Future PyPI releases may support:
python -m pip install worldbench

WorldBench may not be published on PyPI yet. If the worldbench package name is unavailable on PyPI, the package may ship as worldbench-ai.
For tests and local development:
python -m pip install -e ".[dev]"
python -m pytest

scikit-image is optional for SSIM:
python -m pip install -e ".[vision]"

If scikit-image is not installed, WorldBench uses a lightweight NumPy fallback.
If python3 --version shows Python 3.9 or older, install Python 3.11:
brew install python@3.11

Then recreate the virtual environment:
rm -rf .venv
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,video]"
worldbench --help

If Homebrew is unavailable, install Python 3.11 from python.org, then create the virtual environment with that Python.
Run these from the repository root after activating .venv:
python --version
which python
python -m pip --version
worldbench --help

If Python is below 3.10, recreate the virtual environment with Python 3.11.
zsh: command not found: pip
Use:
python3 -m pip --version
python3 -m pip install --upgrade pip

Inside the virtual environment, use:
python -m pip install -e ".[dev,video]"

zsh: command not found: worldbench
This means WorldBench was not installed into your active environment. Run:
source .venv/bin/activate
python -m pip install -e ".[dev,video]"
worldbench --help

requires a different Python: 3.9.6 not in >=3.10
Install Python 3.11, then recreate the virtual environment:
brew install python@3.11
rm -rf .venv
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,video]"

does not appear to be a Python project: neither setup.py nor pyproject.toml found
You are in the wrong folder. Go to the repo root, where pyproject.toml exists:
cd ~/worldbench
ls

You should see:
README.md
pyproject.toml
worldbench/
examples/
scripts/
Then install again:
python -m pip install -e ".[dev,video]"

Use these commands after installation:
worldbench demo
worldbench validate examples/demo_dataset
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/bad_model
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/good_model
worldbench compare examples/demo_dataset --models good_model bad_model
worldbench report .worldbench/runs/latest/result.json
worldbench dashboard .worldbench/runs/latest/result.json

What each command does:
- worldbench demo: creates a synthetic rollout with good and bad predictions.
- worldbench validate examples/demo_dataset: checks that frames, actions, states, and metadata exist.
- worldbench eval ... bad_model: scores the bad prediction.
- worldbench eval ... good_model: scores the good prediction.
- worldbench compare ...: shows why the good model is more reliable.
- worldbench report ...: writes a Markdown report for the latest run.
- worldbench dashboard ...: opens a local debugging view.
worldbench init <path>
worldbench demo
worldbench validate <dataset_path>
worldbench eval <dataset_path> --predictions <predictions_path>
worldbench compare <dataset_path> --models good_model bad_model
worldbench compare <run_a/result.json> <run_b/result.json>
worldbench benchmark --demo
worldbench benchmark benchmarks/
worldbench import-lerobot <input_path> --out <output_path>
worldbench import-lerobot --demo --out examples/lerobot_push_cube
worldbench report <result_json>
worldbench dashboard <result_json_or_dataset_path>

Example:
worldbench demo
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/bad_model
worldbench compare examples/demo_dataset --models good_model bad_model
worldbench benchmark --demo
worldbench report .worldbench/runs/latest/result.json
worldbench dashboard .worldbench/runs/latest/result.json

worldbench compare examples/demo_dataset --models good_model bad_model evaluates both model folders, prints the largest metric gaps, and writes .worldbench/comparisons/latest/comparison.json plus .worldbench/comparisons/latest/comparison.md.
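To make "largest metric gaps" concrete, here is a minimal sketch that ranks per-metric differences between two models. It uses the demo scores from the comparison table in this README; the function name `largest_metric_gaps` and the dictionary keys are illustrative assumptions, not the schema of comparison.json.

```python
def largest_metric_gaps(scores_a, scores_b):
    """Return (metric, gap) pairs for shared metrics, largest absolute gap first."""
    shared = scores_a.keys() & scores_b.keys()
    gaps = [(name, scores_a[name] - scores_b[name]) for name in shared]
    # Sort by absolute gap (descending), then by name for a stable order.
    return sorted(gaps, key=lambda item: (-abs(item[1]), item[0]))


# Demo scores copied from the good_model / bad_model table in this README.
good_model = {"overall": 88, "action_consistency": 91,
              "contact_realism": 84, "object_permanence": 95}
bad_model = {"overall": 42, "action_consistency": 31,
             "contact_realism": 20, "object_permanence": 55}

for metric, gap in largest_metric_gaps(good_model, bad_model):
    print(f"{metric}: {gap:+d}")
```

With the demo numbers, contact realism (+64) and action consistency (+60) come out on top, which matches the failure described in the sample report.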
from worldbench import WorldBench, WorldModelRun
bench = WorldBench(dataset="examples/demo_dataset")
result = bench.evaluate(predictions="examples/demo_dataset/good_model")
result.print_summary()
result.save_report("report.md")

Convenience API:
from worldbench import evaluate, load_dataset
dataset = load_dataset("examples/demo_dataset")
result = evaluate(dataset)
print(result.score)

Composable metrics:
from worldbench import Metrics, WorldBench
bench = WorldBench("examples/demo_dataset")
result = bench.run(
    metrics=[
        Metrics.visual_similarity(),
        Metrics.action_consistency(),
        Metrics.temporal_stability(),
    ],
    predictions="examples/demo_dataset/good_model",
)

dataset/
  episode_001/
    frames/
      000.png
      001.png
      002.png
    predictions/
      000.png
      001.png
      002.png
    actions.json
    states.json
    metadata.json
actions.json:
[
  {"t": 0, "action": "move_right", "dx": 1.0, "dy": 0.0, "gripper": "open"},
  {"t": 1, "action": "move_right", "dx": 1.0, "dy": 0.0, "gripper": "open"},
  {"t": 2, "action": "close_gripper", "dx": 0.0, "dy": 0.0, "gripper": "closed"}
]

states.json:
[
  {"t": 0, "robot_x": 20, "robot_y": 50, "object_x": 80, "object_y": 50},
  {"t": 1, "robot_x": 30, "robot_y": 50, "object_x": 80, "object_y": 50},
  {"t": 2, "robot_x": 40, "robot_y": 50, "object_x": 80, "object_y": 50}
]

metadata.json:
{
  "name": "push_cube_demo",
  "robot": "synthetic_2d_arm",
  "task": "push cube",
  "fps": 5,
  "description": "Synthetic robot rollout for world-model evaluation"
}

Prediction folders can be dataset-native:
episode_001/predictions/000.png
or model-run style:
predictions/episode_001/000.png
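The layout above can be checked with a few lines of standard-library Python. This is a rough sketch of the kind of checks a validator might run, not the implementation behind worldbench validate. The helper names are hypothetical, and the rule that frame count equals action count is an assumption drawn from the three-frame example.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("actions.json", "states.json", "metadata.json")


def check_episode(episode_dir):
    """Return a list of human-readable problems found in one episode folder."""
    episode_dir = Path(episode_dir)
    problems = []
    frames = sorted((episode_dir / "frames").glob("*.png"))
    if not frames:
        problems.append("no PNG frames found in frames/")
    for name in REQUIRED_FILES:
        if not (episode_dir / name).is_file():
            problems.append(f"missing {name}")
    if problems:
        return problems  # skip content checks until the files exist
    actions = json.loads((episode_dir / "actions.json").read_text())
    states = json.loads((episode_dir / "states.json").read_text())
    if [a["t"] for a in actions] != [s["t"] for s in states]:
        problems.append("actions.json and states.json timesteps differ")
    if len(actions) != len(frames):  # assumption: one action per frame
        problems.append("frame count does not match action count")
    return problems


def find_predictions(dataset_dir, episode_name, predictions_dir=None):
    """Locate prediction frames in the model-run or dataset-native layout."""
    candidates = []
    if predictions_dir is not None:
        # Model-run style: <predictions_dir>/<episode>/000.png
        candidates.append(Path(predictions_dir) / episode_name)
    # Dataset-native style: <dataset>/<episode>/predictions/000.png
    candidates.append(Path(dataset_dir) / episode_name / "predictions")
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

For example, `check_episode("examples/demo_dataset/episode_001")` would return an empty list for a well-formed episode, assuming the demo dataset uses that episode name.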
WorldBench includes an experimental LeRobot-style local folder converter. This is not official LeRobot support; it is a simple bridge for folders shaped like images/, actions.json, states.json, and metadata.json.
worldbench import-lerobot --demo --out examples/lerobot_push_cube
worldbench validate examples/lerobot_push_cube

Input:
input_path/
  images/
    000.png
    001.png
    002.png
  actions.json
  states.json
  metadata.json
Output:
output_path/
  episode_001/
    frames/
      000.png
      001.png
      002.png
    actions.json
    states.json
    metadata.json
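The input-to-output mapping is essentially a folder copy. The sketch below shows one way to do it with pathlib and shutil. It is an illustration of the mapping only, not the code behind worldbench import-lerobot; `convert_folder` and its `episode_name` parameter are hypothetical.

```python
import shutil
from pathlib import Path

SIDECAR_FILES = ("actions.json", "states.json", "metadata.json")


def convert_folder(input_path, output_path, episode_name="episode_001"):
    """Copy an images/ + JSON folder into <output>/<episode>/frames/ + JSON."""
    src = Path(input_path)
    episode = Path(output_path) / episode_name
    frames = episode / "frames"
    frames.mkdir(parents=True, exist_ok=True)
    # images/NNN.png becomes frames/NNN.png, keeping file names and order.
    for image in sorted((src / "images").glob("*.png")):
        shutil.copy2(image, frames / image.name)
    # The three JSON files move to the episode root unchanged.
    for name in SIDECAR_FILES:
        shutil.copy2(src / name, episode / name)
    return episode
```

Running worldbench validate on the output folder afterwards is a reasonable way to confirm the result has the expected shape.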
WorldBench includes a lightweight synthetic benchmark suite for common robotics world-model failure modes:
- action mismatch
- pre-contact object motion
- object disappearance
- temporal flicker
- push-cube interaction dynamics
worldbench benchmark --demo

worldbench benchmark --demo writes .worldbench/benchmarks/latest/benchmark.json and .worldbench/benchmarks/latest/benchmark.md.
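As an illustration of the pre-contact object motion failure mode, the sketch below flags timesteps where the object moves while the robot is still out of reach. It works on state records shaped like states.json rather than on predicted frames, so it is a simplified stand-in for whatever WorldBench computes internally. The function names and both thresholds are assumptions.

```python
import math


def robot_object_distance(state):
    """Euclidean distance between robot and object in one state record."""
    return math.hypot(state["object_x"] - state["robot_x"],
                      state["object_y"] - state["robot_y"])


def pre_contact_motion_steps(states, contact_distance=10.0, motion_eps=1e-6):
    """Return timesteps where the object moved although the robot was out of reach.

    contact_distance and motion_eps are illustrative thresholds,
    not WorldBench defaults.
    """
    flagged = []
    for prev, curr in zip(states, states[1:]):
        object_shift = math.hypot(curr["object_x"] - prev["object_x"],
                                  curr["object_y"] - prev["object_y"])
        out_of_reach = (robot_object_distance(prev) > contact_distance
                        and robot_object_distance(curr) > contact_distance)
        if object_shift > motion_eps and out_of_reach:
            flagged.append(curr["t"])
    return flagged


# States from the dataset format example: the cube never moves, so nothing is flagged.
rollout = [
    {"t": 0, "robot_x": 20, "robot_y": 50, "object_x": 80, "object_y": 50},
    {"t": 1, "robot_x": 30, "robot_y": 50, "object_x": 80, "object_y": 50},
    {"t": 2, "robot_x": 40, "robot_y": 50, "object_x": 80, "object_y": 50},
]
# Hypothetical bad prediction: the cube jumps at t=2 while the robot is far away.
teleport = rollout[:2] + [dict(rollout[2], object_x=95)]

print(pre_contact_motion_steps(rollout))   # []
print(pre_contact_motion_steps(teleport))  # [2]
```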
| Metric | Weight | What it checks |
|---|---|---|
| Visual similarity | 25% | MSE, PSNR, and SSIM-style structure against ground-truth frames. |
| Action consistency | 30% | Whether visual robot motion follows action logs such as move_right or move_left. |
| Temporal stability | 20% | Flicker, sudden jumps, and unstable frame-to-frame deltas. |
| Object permanence | 15% | Whether the main task object remains visible and stable. |
| Contact realism | 10% | Whether object motion starts before plausible robot/object contact. |
The default overall score is a weighted average across these metrics.
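The weighted average can be written out directly from the weights in the table. The snippet below is a minimal sketch: the dictionary keys and the `overall_score` helper are illustrative names, the example scores are made up, and renormalizing when only a subset of metrics is supplied is an assumption rather than documented WorldBench behavior.

```python
# Weights copied from the metrics table above (they sum to 1.0).
DEFAULT_WEIGHTS = {
    "visual_similarity": 0.25,
    "action_consistency": 0.30,
    "temporal_stability": 0.20,
    "object_permanence": 0.15,
    "contact_realism": 0.10,
}


def overall_score(metric_scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of 0-100 metric scores over the metrics provided."""
    total_weight = sum(weights[name] for name in metric_scores)
    if total_weight == 0:
        raise ValueError("no weighted metrics supplied")
    weighted_sum = sum(score * weights[name]
                       for name, score in metric_scores.items())
    # Assumption: renormalize so a subset of metrics still yields a 0-100 score.
    return weighted_sum / total_weight


# Hypothetical scores for a model that looks good but ignores actions.
example = {
    "visual_similarity": 90,
    "action_consistency": 30,
    "temporal_stability": 80,
    "object_permanence": 55,
    "contact_realism": 20,
}
print(round(overall_score(example), 2))
```

Because action consistency carries the largest weight, a model that looks sharp but ignores commands is pulled down even when its visual similarity is high.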
| Model | Overall | Action consistency | Contact realism | Object permanence |
|---|---|---|---|---|
| good_model | 88 | 91 | 84 | 95 |
| bad_model | 42 | 31 | 20 | 55 |
This toy benchmark is generated by worldbench demo, but it shows the type of failure WorldBench is designed to catch: realistic-looking predictions that do not follow robot actions or contact physics.
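To show what "does not follow robot actions" can mean in practice, here is a small sketch that compares commanded motion with observed robot motion. It reads positions from state records shaped like states.json, whereas the real metric is described as working from visual motion in frames, so treat it as a simplified analogue. The function name, the scoring rule, and the assumption that the action at timestep t drives the transition from t to t+1 are all illustrative.

```python
def action_consistency(actions, states):
    """Fraction of motion commands whose observed robot motion points the same way."""
    checked = 0
    agreed = 0
    # Assumption: the action logged at timestep t drives the move from t to t+1.
    for action, prev, curr in zip(actions, states, states[1:]):
        commanded = (action["dx"], action["dy"])
        if commanded == (0.0, 0.0):
            continue  # no commanded motion, e.g. close_gripper
        observed = (curr["robot_x"] - prev["robot_x"],
                    curr["robot_y"] - prev["robot_y"])
        # A positive dot product means the robot moved along the commanded direction.
        dot = commanded[0] * observed[0] + commanded[1] * observed[1]
        checked += 1
        agreed += dot > 0
    return agreed / checked if checked else 1.0


# Actions and states from the dataset format example in this README.
actions = [
    {"t": 0, "action": "move_right", "dx": 1.0, "dy": 0.0, "gripper": "open"},
    {"t": 1, "action": "move_right", "dx": 1.0, "dy": 0.0, "gripper": "open"},
    {"t": 2, "action": "close_gripper", "dx": 0.0, "dy": 0.0, "gripper": "closed"},
]
states = [
    {"t": 0, "robot_x": 20, "robot_y": 50, "object_x": 80, "object_y": 50},
    {"t": 1, "robot_x": 30, "robot_y": 50, "object_x": 80, "object_y": 50},
    {"t": 2, "robot_x": 40, "robot_y": 50, "object_x": 80, "object_y": 50},
]
# Hypothetical bad prediction: the robot drifts left while commanded right.
reversed_states = [dict(s, robot_x=60 - s["robot_x"]) for s in states]

print(action_consistency(actions, states))           # 1.0
print(action_consistency(actions, reversed_states))  # 0.0
```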
Sample reports:
| Feature | Status |
|---|---|
| Synthetic demo dataset | Supported |
| Good vs bad model comparison | Supported |
| CLI evaluation | Supported |
| Markdown reports | Supported |
| Local dashboard | Supported |
| Action consistency scoring | Supported |
| Object permanence scoring | Supported |
| Contact realism scoring | Supported |
| Model comparison command | Supported |
| Experimental LeRobot-style import | Experimental |
| ROS bag import | Planned |
| ManiSkill/RLBench adapters | Planned |
| Real robot rollout support | Planned |
| Cloud run sharing | Planned |
| Benchmark leaderboard | Planned |
WorldBench starts with synthetic rollouts so failure modes are easy to see. The next steps are experimental LeRobot-style import improvements, ROS bag import, ManiSkill/RLBench adapters, real robot rollout examples, and benchmark leaderboards.
Release materials live in:
The publishing notes include TestPyPI and PyPI commands for maintainers. WorldBench does not require cloud services to run locally.
WorldBench is currently an open-source robotics world-model evaluation toolkit in this repository. The name may overlap with research benchmarks using the same name, and the project may be renamed later if needed.
WorldBench is intentionally small and easy to inspect. Useful contributions include:
- New control-aware metrics
- Dataset import adapters
- Better synthetic rollout scenarios
- Dashboard/report polish
- Tests for metric edge cases
Before opening a PR:
python -m pip install -e ".[dev]"
python -m pytest

Apache-2.0. See LICENSE.


