WorldBench

Evaluate robotics world models with one command.

WorldBench catches when a robot world model looks right but is actually wrong: it checks whether generated futures follow robot actions, contact physics, temporal consistency, and object permanence.

WorldBench demo showing robot world-model evaluation

git clone https://github.com/tigee1311/worldbench.git
cd worldbench
python3 --version
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,video]"
worldbench --help
worldbench demo
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/bad_model
worldbench compare examples/demo_dataset --models good_model bad_model
worldbench benchmark --demo
worldbench dashboard .worldbench/runs/latest/result.json

Example report for bad_model:

WorldBench Report
Overall Score: 42/100
Action Consistency: 31/100
Contact Realism: 20/100
Object Permanence: 55/100

Main failure:
The model generates plausible frames but ignores the robot action sequence.

Not another world model. The test suite for world models.

Features • Quickstart • CLI • Python SDK • Metrics • Roadmap

Quickstart

git clone https://github.com/tigee1311/worldbench.git
cd worldbench

python3 --version
python3 -m venv .venv
source .venv/bin/activate

python -m pip install --upgrade pip
python -m pip install -e ".[dev,video]"

worldbench --help

worldbench demo
worldbench validate examples/demo_dataset
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/good_model
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/bad_model
worldbench compare examples/demo_dataset --models good_model bad_model
worldbench benchmark --demo
worldbench report .worldbench/runs/latest/result.json
worldbench dashboard .worldbench/runs/latest/result.json

worldbench eval writes timestamped runs under .worldbench/runs/ and also updates .worldbench/runs/latest/result.json for quick iteration.

WorldBench requires Python 3.10+. If your system python3 is Python 3.9 or lower, install Python 3.11 and create the virtual environment with python3.11 -m venv .venv.

Use python -m pip instead of pip so the package installs into the active virtual environment. This avoids pip: command not found and prevents installing into the wrong Python.

What It Does

WorldBench is a Python SDK, CLI, and local dashboard for robotics AI teams building or evaluating world models. It takes a robot rollout dataset plus predicted future frames and produces:

  • Control-aware metric scores
  • Per-episode failure evidence
  • Benchmark-style model comparisons
  • Synthetic benchmark scenario results
  • Markdown reports
  • A zero-dependency local HTML dashboard
  • An experimental LeRobot-style local folder import
  • A synthetic demo that works without robots, GPUs, or model training

Input and Output

Input:

  • robot rollout frames
  • action logs
  • state data
  • predicted future frames

Output:

  • control-aware scores
  • failure evidence
  • Markdown reports
  • local dashboard
  • model comparison results

Why WorldBench?

Robotics world models can generate futures that look realistic yet are wrong for control. A prediction is not useful if it moves opposite the commanded action, teleports a cube before contact, drops a task object, or flickers across the rollout.

WorldBench focuses on the failure modes that matter when a robot planner consumes generated futures. It is for world-model builders, robotics ML researchers, and evaluation engineers who need more than a pretty-video metric before trusting predictions in planning loops.

Why Not Just SSIM/PSNR?

Traditional video metrics can say a prediction is good even when it is useless for robotics.

A world model can score high visually while:

  • moving the robot opposite the commanded action
  • teleporting objects before contact
  • dropping task-relevant objects
  • flickering across frames
  • breaking state/action alignment

WorldBench adds control-aware metrics for robotics world models.
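To make the gap concrete, here is a minimal pure-Python sketch, not WorldBench's own implementation. It builds two toy predictions that get exactly the same PSNR against the ground truth, although one of them moves the robot opposite the commanded action. The frame size, the `make_frame` helper, and the centroid-based `follows_action` check are all assumptions made for illustration.

```python
import math

SIZE = 64  # hypothetical square frame size in pixels


def make_frame(robot_x, robot_y=30, side=2):
    """Return a SIZE x SIZE grayscale frame with a bright square 'robot'."""
    frame = [[0] * SIZE for _ in range(SIZE)]
    for y in range(robot_y, robot_y + side):
        for x in range(robot_x, robot_x + side):
            frame[y][x] = 255
    return frame


def psnr(a, b):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    sq_err = sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    mse = sq_err / (SIZE * SIZE)
    return math.inf if mse == 0 else 10 * math.log10(255 ** 2 / mse)


def centroid_x(frame):
    """Mean column index of all bright pixels."""
    cols = [x for row in frame for x, p in enumerate(row) if p > 0]
    return sum(cols) / len(cols)


def follows_action(start, end, dx):
    """True if the robot's horizontal displacement has the same sign as dx."""
    return (centroid_x(end) - centroid_x(start)) * dx > 0


start = make_frame(robot_x=10)
truth = make_frame(robot_x=12)      # robot moved right, as commanded
pred_wrong = make_frame(robot_x=8)  # moved left: opposite the command
pred_over = make_frame(robot_x=14)  # moved right, but overshot

# Both predictions differ from the truth in the same number of pixels,
# so the pixel metric cannot tell them apart.
print(round(psnr(truth, pred_wrong), 2), round(psnr(truth, pred_over), 2))  # 27.09 27.09

# A direction check against the commanded dx separates them.
print(follows_action(start, pred_wrong, dx=1.0))  # False
print(follows_action(start, pred_over, dx=1.0))   # True
```

Only the direction check flags the first prediction, which is the kind of signal a control-aware metric adds on top of pixel similarity.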

Installation

WorldBench is currently installed from a source checkout:

git clone https://github.com/tigee1311/worldbench.git
cd worldbench
python3 --version
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,video]"
worldbench --help

Future PyPI releases may support:

python -m pip install worldbench

Do not assume WorldBench is published on PyPI yet. If the worldbench package name is unavailable on PyPI, the package may ship as worldbench-ai.

For tests and local development:

python -m pip install -e ".[dev]"
python -m pytest

scikit-image is optional for SSIM:

python -m pip install -e ".[vision]"

If scikit-image is not installed, WorldBench uses a lightweight NumPy fallback.

macOS Setup

If python3 --version shows Python 3.9 or older, install Python 3.11:

brew install python@3.11

Then recreate the virtual environment:

rm -rf .venv
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev,video]"
worldbench --help

If Homebrew is unavailable, install Python 3.11 from python.org, then create the virtual environment with that Python.

Environment Check

Run these from the repository root after activating .venv:

python --version
which python
python -m pip --version
worldbench --help

If Python is below 3.10, recreate the virtual environment with Python 3.11.
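The same checks can be scripted. The sketch below is a standalone helper, not part of WorldBench, and the name `environment_problems` is made up for illustration. It relies on the standard-library convention that `sys.prefix` differs from `sys.base_prefix` inside a virtual environment created with `venv`.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this README


def environment_problems(version=sys.version_info,
                         prefix=sys.prefix,
                         base_prefix=sys.base_prefix):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    major_minor = tuple(version[:2])
    if major_minor < MIN_PYTHON:
        problems.append("Python %d.%d is older than 3.10" % major_minor)
    # In a venv, sys.prefix points at the environment and
    # sys.base_prefix at the interpreter it was created from.
    if prefix == base_prefix:
        problems.append("no virtual environment is active")
    return problems


if __name__ == "__main__":
    for problem in environment_problems():
        print("problem:", problem)
```

Run it with the interpreter you intend to use, for example `python check_env.py` after activating .venv (the file name is arbitrary).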

Troubleshooting

zsh: command not found: pip

Use:

python3 -m pip --version
python3 -m pip install --upgrade pip

Inside the virtual environment, use:

python -m pip install -e ".[dev,video]"

zsh: command not found: worldbench

This means WorldBench was not installed into your active environment. Run:

source .venv/bin/activate
python -m pip install -e ".[dev,video]"
worldbench --help

requires a different Python: 3.9.6 not in >=3.10

Install Python 3.11, then recreate the virtual environment:

brew install python@3.11
rm -rf .venv
python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install -e ".[dev,video]"

does not appear to be a Python project: neither setup.py nor pyproject.toml found

You are in the wrong folder. Go to the repo root, where pyproject.toml exists:

cd ~/worldbench
ls

You should see:

README.md
pyproject.toml
worldbench/
examples/
scripts/

Then install again:

python -m pip install -e ".[dev,video]"

Live Demo Flow

Use these commands after installation:

worldbench demo
worldbench validate examples/demo_dataset
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/bad_model
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/good_model
worldbench compare examples/demo_dataset --models good_model bad_model
worldbench report .worldbench/runs/latest/result.json
worldbench dashboard .worldbench/runs/latest/result.json

What each command does:

  • worldbench demo creates a synthetic rollout with good and bad predictions.
  • worldbench validate examples/demo_dataset checks that frames, actions, states, and metadata exist.
  • worldbench eval ... bad_model scores the bad prediction.
  • worldbench eval ... good_model scores the good prediction.
  • worldbench compare ... shows why the good model is more reliable.
  • worldbench report ... writes a Markdown report for the latest run.
  • worldbench dashboard ... opens a local debugging view.

CLI Usage

worldbench init <path>
worldbench demo
worldbench validate <dataset_path>
worldbench eval <dataset_path> --predictions <predictions_path>
worldbench compare <dataset_path> --models good_model bad_model
worldbench compare <run_a/result.json> <run_b/result.json>
worldbench benchmark --demo
worldbench benchmark benchmarks/
worldbench import-lerobot <input_path> --out <output_path>
worldbench import-lerobot --demo --out examples/lerobot_push_cube
worldbench report <result_json>
worldbench dashboard <result_json_or_dataset_path>

Example:

worldbench demo
worldbench eval examples/demo_dataset --predictions examples/demo_dataset/bad_model
worldbench compare examples/demo_dataset --models good_model bad_model
worldbench benchmark --demo
worldbench report .worldbench/runs/latest/result.json
worldbench dashboard .worldbench/runs/latest/result.json

worldbench compare examples/demo_dataset --models good_model bad_model evaluates both model folders, prints the largest metric gaps, and writes .worldbench/comparisons/latest/comparison.json plus .worldbench/comparisons/latest/comparison.md.

Python SDK Usage

from worldbench import WorldBench, WorldModelRun

bench = WorldBench(dataset="examples/demo_dataset")
result = bench.evaluate(predictions="examples/demo_dataset/good_model")
result.print_summary()
result.save_report("report.md")

Convenience API:

from worldbench import evaluate, load_dataset

dataset = load_dataset("examples/demo_dataset")
result = evaluate(dataset)
print(result.score)

Composable metrics:

from worldbench import Metrics, WorldBench

bench = WorldBench("examples/demo_dataset")
result = bench.run(
    metrics=[
        Metrics.visual_similarity(),
        Metrics.action_consistency(),
        Metrics.temporal_stability(),
    ],
    predictions="examples/demo_dataset/good_model",
)

Dataset Format

dataset/
  episode_001/
    frames/
      000.png
      001.png
      002.png
    predictions/
      000.png
      001.png
      002.png
    actions.json
    states.json
    metadata.json

actions.json:

[
  {"t": 0, "action": "move_right", "dx": 1.0, "dy": 0.0, "gripper": "open"},
  {"t": 1, "action": "move_right", "dx": 1.0, "dy": 0.0, "gripper": "open"},
  {"t": 2, "action": "close_gripper", "dx": 0.0, "dy": 0.0, "gripper": "closed"}
]

states.json:

[
  {"t": 0, "robot_x": 20, "robot_y": 50, "object_x": 80, "object_y": 50},
  {"t": 1, "robot_x": 30, "robot_y": 50, "object_x": 80, "object_y": 50},
  {"t": 2, "robot_x": 40, "robot_y": 50, "object_x": 80, "object_y": 50}
]
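As a rough illustration of how the two files relate, the sketch below joins actions and states on `t` and reports timesteps where `robot_x` moved against the commanded `dx`. It assumes the action logged at `t` explains the change from state `t` to state `t + 1`; that convention and the helper name `misaligned_steps` are assumptions, not something this README specifies, and only the x axis is checked.

```python
import json

# The example payloads from this section, embedded so the sketch runs as-is.
ACTIONS_JSON = """[
  {"t": 0, "action": "move_right", "dx": 1.0, "dy": 0.0, "gripper": "open"},
  {"t": 1, "action": "move_right", "dx": 1.0, "dy": 0.0, "gripper": "open"},
  {"t": 2, "action": "close_gripper", "dx": 0.0, "dy": 0.0, "gripper": "closed"}
]"""
STATES_JSON = """[
  {"t": 0, "robot_x": 20, "robot_y": 50, "object_x": 80, "object_y": 50},
  {"t": 1, "robot_x": 30, "robot_y": 50, "object_x": 80, "object_y": 50},
  {"t": 2, "robot_x": 40, "robot_y": 50, "object_x": 80, "object_y": 50}
]"""


def misaligned_steps(actions, states):
    """Return timesteps where robot_x moved against the commanded dx."""
    state_at = {s["t"]: s for s in states}
    bad = []
    for action in actions:
        t = action["t"]
        if t not in state_at or t + 1 not in state_at:
            continue  # no following state to compare against
        moved = state_at[t + 1]["robot_x"] - state_at[t]["robot_x"]
        if moved * action["dx"] < 0:
            bad.append(t)
    return bad


actions = json.loads(ACTIONS_JSON)
states = json.loads(STATES_JSON)
print(misaligned_steps(actions, states))  # []

# Hypothetical broken rollout: the robot drifts left under move_right.
drifting = [dict(s, robot_x=20 - 10 * s["t"]) for s in states]
print(misaligned_steps(actions, drifting))  # [0, 1]
```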

metadata.json:

{
  "name": "push_cube_demo",
  "robot": "synthetic_2d_arm",
  "task": "push cube",
  "fps": 5,
  "description": "Synthetic robot rollout for world-model evaluation"
}

Prediction folders can be dataset-native:

episode_001/predictions/000.png

or model-run style:

predictions/episode_001/000.png
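One way to support both layouts is a small lookup like the sketch below. It is a hypothetical helper, not WorldBench's loader: the name `find_prediction_dir` and the choice to prefer an explicit model-run folder over the dataset-native one are assumptions.

```python
from pathlib import Path


def find_prediction_dir(dataset_root, episode, predictions_root=None):
    """Return the folder holding predicted frames for one episode, or None.

    An explicit predictions_root (model-run style) is tried first, then the
    dataset-native <episode>/predictions folder.
    """
    candidates = []
    if predictions_root is not None:
        candidates.append(Path(predictions_root) / episode)
    candidates.append(Path(dataset_root) / episode / "predictions")
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

For example, `find_prediction_dir("dataset", "episode_001")` would return `dataset/episode_001/predictions` if that folder exists.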

Experimental Adapters

LeRobot-Style Import

WorldBench includes an experimental LeRobot-style local folder converter. This is not official LeRobot support; it is a simple bridge for folders shaped like images/, actions.json, states.json, and metadata.json.

worldbench import-lerobot --demo --out examples/lerobot_push_cube
worldbench validate examples/lerobot_push_cube

Input:

input_path/
  images/
    000.png
    001.png
    002.png
  actions.json
  states.json
  metadata.json

Output:

output_path/
  episode_001/
    frames/
      000.png
      001.png
      002.png
    actions.json
    states.json
    metadata.json
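The mapping between the two layouts is simple enough to sketch with the standard library. The function below is an illustrative re-implementation under the assumption of a single-episode input folder; it is not the code behind worldbench import-lerobot, and `convert_folder` is a made-up name.

```python
import shutil
from pathlib import Path

SIDE_FILES = ("actions.json", "states.json", "metadata.json")


def convert_folder(input_path, output_path, episode="episode_001"):
    """Copy images/ to <episode>/frames/ and carry the JSON files across."""
    src = Path(input_path)
    dst = Path(output_path) / episode
    frames = dst / "frames"
    frames.mkdir(parents=True, exist_ok=True)
    # Sorted so frames are copied in timestep order (000.png, 001.png, ...).
    for image in sorted((src / "images").glob("*.png")):
        shutil.copy2(image, frames / image.name)
    for name in SIDE_FILES:
        shutil.copy2(src / name, dst / name)
    return dst
```

A missing actions.json, states.json, or metadata.json raises `FileNotFoundError` from `shutil.copy2`, which is usually what you want for a strict import.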

Benchmarks

WorldBench includes a lightweight synthetic benchmark suite for common robotics world-model failure modes:

  • action mismatch
  • pre-contact object motion
  • object disappearance
  • temporal flicker
  • push-cube interaction dynamics

worldbench benchmark --demo

worldbench benchmark --demo writes .worldbench/benchmarks/latest/benchmark.json and .worldbench/benchmarks/latest/benchmark.md.
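To show what one of these failure modes looks like in data, the sketch below flags pre-contact object motion using the field names from states.json. It is not the benchmark's implementation: the helper name and the `contact_distance` threshold are hypothetical, and it works on state values rather than on predicted frames.

```python
import math


def first_precontact_motion(states, contact_distance=10.0):
    """Return the first timestep at which the object moved while the robot
    was farther away than contact_distance, or None if that never happens."""
    for prev, cur in zip(states, states[1:]):
        object_moved = (cur["object_x"], cur["object_y"]) != (
            prev["object_x"], prev["object_y"])
        gap = math.hypot(prev["robot_x"] - prev["object_x"],
                         prev["robot_y"] - prev["object_y"])
        if object_moved and gap > contact_distance:
            return cur["t"]
    return None


# Plausible push: the cube only moves once the robot is within reach.
push = [
    {"t": 0, "robot_x": 60, "robot_y": 50, "object_x": 80, "object_y": 50},
    {"t": 1, "robot_x": 70, "robot_y": 50, "object_x": 80, "object_y": 50},
    {"t": 2, "robot_x": 80, "robot_y": 50, "object_x": 90, "object_y": 50},
]
# Implausible: the cube jumps while the robot is still 60 units away.
teleport = [
    {"t": 0, "robot_x": 20, "robot_y": 50, "object_x": 80, "object_y": 50},
    {"t": 1, "robot_x": 30, "robot_y": 50, "object_x": 90, "object_y": 50},
]

print(first_precontact_motion(push))      # None
print(first_precontact_motion(teleport))  # 1
```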

Metrics

| Metric | Weight | What it checks |
| --- | --- | --- |
| Visual similarity | 25% | MSE, PSNR, and SSIM-style structure against ground-truth frames. |
| Action consistency | 30% | Whether visual robot motion follows action logs such as move_right or move_left. |
| Temporal stability | 20% | Flicker, sudden jumps, and unstable frame-to-frame deltas. |
| Object permanence | 15% | Whether the main task object remains visible and stable. |
| Contact realism | 10% | Whether object motion starts before plausible robot/object contact. |

The default overall score is a weighted average across these metrics.
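As a worked example of that weighted average, the sketch below applies the weights from the Metrics table to a set of per-metric scores. The dictionary keys and the input scores are invented for illustration, and any rounding WorldBench applies to the final number is not shown.

```python
# Weights in percent, copied from the Metrics table.
WEIGHTS = {
    "visual_similarity": 25,
    "action_consistency": 30,
    "temporal_stability": 20,
    "object_permanence": 15,
    "contact_realism": 10,
}


def overall_score(scores, weights=WEIGHTS):
    """Weighted average of per-metric scores on a 0-100 scale."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same metrics")
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


# Hypothetical per-metric scores, not taken from a real run.
example = {
    "visual_similarity": 80,
    "action_consistency": 30,
    "temporal_stability": 70,
    "object_permanence": 55,
    "contact_realism": 20,
}
print(overall_score(example))  # 53.25
```

Because action consistency carries the largest weight, a model that looks good visually can still land near the middle of the scale when it ignores the action log.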

Example Outputs

Example Benchmark

| Model | Overall | Action consistency | Contact realism | Object permanence |
| --- | --- | --- | --- | --- |
| good_model | 88 | 91 | 84 | 95 |
| bad_model | 42 | 31 | 20 | 55 |

This toy benchmark is generated by worldbench demo, but it shows the type of failure WorldBench is designed to catch: realistic-looking predictions that do not follow robot actions or contact physics.
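The per-metric gaps in the Example Benchmark table can be ranked with a few lines of Python. This is only a sketch of the idea of finding the largest metric gaps; the `largest_gaps` helper is hypothetical and does not reflect how worldbench compare is implemented or how comparison.json is structured.

```python
# Scores copied from the Example Benchmark table.
SCORES = {
    "good_model": {"Overall": 88, "Action consistency": 91,
                   "Contact realism": 84, "Object permanence": 95},
    "bad_model": {"Overall": 42, "Action consistency": 31,
                  "Contact realism": 20, "Object permanence": 55},
}


def largest_gaps(a, b, skip=("Overall",)):
    """Return (metric, a - b) pairs sorted by absolute gap, largest first."""
    gaps = {name: a[name] - b[name]
            for name in a if name in b and name not in skip}
    return sorted(gaps.items(), key=lambda item: abs(item[1]), reverse=True)


for name, gap in largest_gaps(SCORES["good_model"], SCORES["bad_model"]):
    print(f"{name}: {gap:+d}")
# Contact realism: +64
# Action consistency: +60
# Object permanence: +40
```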

Sample reports:

Screenshots

WorldBench dashboard screenshot

WorldBench report screenshot

Supported Now Vs Roadmap

| Feature | Status |
| --- | --- |
| Synthetic demo dataset | Supported |
| Good vs bad model comparison | Supported |
| CLI evaluation | Supported |
| Markdown reports | Supported |
| Local dashboard | Supported |
| Action consistency scoring | Supported |
| Object permanence scoring | Supported |
| Contact realism scoring | Supported |
| Model comparison command | Supported |
| Experimental LeRobot-style import | Experimental |
| ROS bag import | Planned |
| ManiSkill/RLBench adapters | Planned |
| Real robot rollout support | Planned |
| Cloud run sharing | Planned |
| Benchmark leaderboard | Planned |

Scaling Path

WorldBench starts with synthetic rollouts so failure modes are easy to see. The next steps are experimental LeRobot-style import improvements, ROS bag import, ManiSkill/RLBench adapters, real robot rollout examples, and benchmark leaderboards.

Release and Publishing

Release materials live in:

The publishing notes include TestPyPI and PyPI commands for maintainers. WorldBench does not require cloud services to run locally.

Name Note

In this repository, WorldBench refers to an open-source toolkit for evaluating robotics world models. The name may overlap with research benchmarks of the same name, so the project may be renamed later if needed.

Contributing

WorldBench is intentionally small and easy to inspect. Useful contributions include:

  • New control-aware metrics
  • Dataset import adapters
  • Better synthetic rollout scenarios
  • Dashboard/report polish
  • Tests for metric edge cases

Before opening a PR:

python -m pip install -e ".[dev]"
python -m pytest

License

Apache-2.0. See LICENSE.
