DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Yang Zhou* , Hao Shao* , Letian Wang , Zhuofan Zong , Hongsheng Li , Steven L. Waslander ("*" denotes equal contribution)

Updates

[03/2026] Evaluation code released.
[01/2026] DrivingGen is aceepted to ICLR 2026.
[01/2026] We release our paper on arXiv and our dataset on Hugging Face.

Overview

DrivingGen is the first comprehensive benchmark for generative driving world models. It combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources — spanning varied weather, time of day, geographic regions, and complex maneuvers. DrivingGen evaluates models from both a visual perspective (the realism and overall quality of generated videos) and a robotics perspective (the physical plausibility, consistency, and accuracy of generated trajectories).

Code Flow

The DrivingGen evaluation pipeline proceeds in six sequential stages. The diagram below shows the data flow through the codebase.

Dataset Download
      │
      ▼
Video Generation (your model)
      │
      ├──► Ego Trajectory Extraction ──────────────────────────────┐
      │                                                             │
      └──► Agent Trajectory Extraction ─────────────────────────── ▼
                                                           Trajectory Metrics
                                                           (alignment, consistency,
                                                            quality, distribution)
      │
      ▼
Video Metric Evaluation
(FVD, P2020, CLIP-IQA+, DINOv3, Cosmos-Reason1)

Stage 1 — Dataset Download

Script: drivinggen/down_dataset.py

Downloads the DrivingGen dataset from Hugging Face (yangzhou99/DrivingGen) into ./data/. The dataset contains:

ego_condition.json / open_domain.json — lists of scene IDs
data/{scene_id}/imgs/{scene_id}.jpg — first-frame conditioning images
data/{scene_id}/caption/{scene_id}.txt — text prompts for each scene
data/videos-fvd/ — ground-truth 100-frame clips for FVD computation
Pre-computed ego and agent ground-truth trajectories in .pkl files

Stage 2 — Video Generation

Script: drivinggen/infer_example_wan.py (example using Wan2.2-I2V-A14B)

JSON metadata (scene IDs)
        │
        ▼
Load conditioning image (first frame)  →  Resize to fit 576×1024 area
        │
        ▼
Load text caption from .txt file
        │
        ▼
Wan2.2 I2V Pipeline (HuggingFace Diffusers)
  - 101 frames, 576×1024 resolution
  - 40 DDIM inference steps, CFG scale 3.5
  - Fixed seed (generator=0) for reproducibility
        │
        ▼
Output: video.mp4  +  images/00000.png … 00100.png
Saved to: {out_dir}/{split}/{scene_id}/{model}/{exp_id}/

The pipeline patches torch.linalg.solve with a CPU fallback to handle cuSolver instability that can occur during diffusion sampling with bfloat16.

Stage 3 — Ego Trajectory Extraction

Script: drivinggen/func/extract_traj_ego_unidepth.py

Estimates the ego-vehicle (camera) trajectory from the generated video frames using monocular depth + visual SLAM.

Generated video frames (PNG sequence)
        │
        ▼
YOLOv10x detection  ──►  Movable object mask (H×W)
        │                 (masks out cars/people so SLAM ignores dynamic regions)
        ▼
UniDepthV2-ViT-L14 inference per frame
  - Input:  RGB frame  (3, H, W)
  - Output: depth map  (1, H, W)   [metres]
            intrinsics (3, 3)
            point cloud (3, H, W)
        │
        ▼
Visual SLAM  [drivinggen/func/visual_slam/]
  dataset.py:
    DatasetHandler — wraps lists of (RGB, depth, intrinsics)
                     converts RGB→grayscale for feature detection
  vo.py:
    1. SIFT feature extraction per frame  (nfeatures=2500)
    2. FLANN kNN matching between consecutive frames (k=2 neighbours)
    3. Lowe's ratio test filtering  (threshold=0.7)
    4. PnP-RANSAC pose estimation:
         - Backproject 2D keypoints using depth: p_c = K⁻¹ · [u,v,1] · Z
         - solvePnPRansac(objectpoints, imagepoints, K) → rvec, tvec
         - Refine with solvePnP on inliers
    5. Accumulate: T_world = T_world · T_local⁻¹  (4×4 homogeneous)
    6. Fallback for failed frames: random yaw perturbation, same step length
        │
        ▼
Output saved to:  {outdir}/{scene}/{model}/unidepth-estimate_ego_traj.pkl
  - locs:       (N, 2)  XZ ground-plane coordinates [metres]
  - poses_3x3:  list of N × (R: 3×3, t: 3,) world-from-camera poses

Task parallelism: set_task_list() reads RANK / WORLD_SIZE env vars to shard scenes across multiple GPUs/nodes.

Stage 4 — Agent Trajectory Extraction

Script: drivinggen/func/extract_traj_agent_unidepth.py

Extracts trajectories of other traffic agents (vehicles, cyclists, pedestrians) visible in the generated video.

Generated video frames (PNG sequence)
        │
        ▼
Frame 0 only:
  YOLOv10x detection  →  agent bounding boxes (x1,y1,x2,y2) + class labels
  Confidence threshold: ≥ 0.3 for vehicles/persons
        │
        ▼
UniDepthV2-ViT-L14 on every frame
  (same as Stage 3 — depth + intrinsics per frame)
        │
        ▼
SAMURAI tracker  [third_parties/samurai/]
  Input:  video frames + initial bounding box from frame 0
  Output: tracked (frame_id, bbox) sequence across all 101 frames
  (One SAMURAI run per detected agent)
        │
        ▼
Per-frame depth estimation for each tracked agent:
  estimate_depth_from_mask(depth_map, segmentation_mask)
    - Uses top-50% of the bounding box region (closer to camera = more reliable)
    - Returns median depth [metres]
        │
        ▼
3D reconstruction using ego camera poses:
  cam_coord  = K⁻¹ · [u, v, 1] · Z
  world_coord = R · cam_coord + t
  → Keep only (X, Z) for top-down 2D trajectory
        │
        ▼
Output saved to:
  *-estimate_agents_traj.pkl   — list of (frame_ids, [(x,z)…], class_label)
  *-estimate_agents_bbox.pkl   — tracked bounding boxes per agent
  *-estimate_agents_bbox_label.pkl — class labels per initial detection

Stage 5 — Video Metric Evaluation

Top-level script: drivinggen/z-sample_fvd.py

This orchestration script loads the generated video frames and dispatches to individual metric modules, selected via --metric {fvd|obj_q|sub_q|v_consist|a_consist|a_missing|all}.

5.1 Distribution — FVD

Module: drivinggen/videos/video_distribution.py

Generated videos (100 frames each, symlinked)     GT videos (100 frames each)
        │                                                   │
        └───────────────────────────────────────────────────┘
                              │
                              ▼
              StyleGAN-V  calc_metrics_('fvd2048_100f')
              (I3D network: (B,T,C,H,W) → feature vectors)
                              │
                              ▼
              Fréchet Video Distance (FVD) — scalar ↓ is better

5.2 Objective Quality — P2020

Module: drivinggen/videos/video_obj_q.py → drivinggen/videos/p2020_v2.py

Video frames (list of image paths)
        │
        ├─ Per-frame metrics (p2020_v2.single_frame_metrics):
        │    • MTF50/MTF10 — spatial frequency response (sharpness)
        │    • Contrast Transfer Accuracy
        │    • Edge Rise Time
        │    • Total Distortion — radial distortion proxy
        │    • Flare Attenuation — center/periphery brightness ratio
        │    • Gradient Entropy — texture richness
        │    • Blur Extent — high/low frequency energy ratio
        │    • Chroma Aberration — R-B channel edge deviation
        │
        └─ Per-video metrics (p2020_v2.video_metrics):
             • MMP (Modulation Mitigation Probability) — sliding-window FFT
               measures temporal flicker/alias energy ratio ↑ is better

Parallel processing via ProcessPoolExecutor.
Final score: mean of mmp_alias across all videos.

5.3 Subjective Quality — CLIP-IQA+

Module: drivinggen/videos/video_sub_q.py

Video frames (image paths)
        │
        ▼
Resize each frame to 512×512 → ToTensor → stack batch
        │
        ▼
CLIP-IQA+ model (pyiqa)
  — no-reference quality metric trained on human opinion scores (MOS)
  — CLIP visual encoder + lightweight IQA regression head
        │
        ▼
Mean score across all frames and videos  ↑ is better (0–1 range)

5.4 Scene (Visual) Consistency — DINOv3

Module: drivinggen/videos/video_v_consist.py

Video frames (image paths)
        │
        ▼
SEA-RAFT optical flow  (RAFT-based, pretrained on TartanAir+KITTI)
  Input:  consecutive frame pair  (1, 3, H, W) each
  Output: flow  (1, 2, H, W)  → median magnitude per pair → (T-1,) array
        │
        ▼
Arc-length keyframe selection:
  m_bar = mean optical flow magnitude
  K = lerp(min_k=3, max_k=20, clamp((m_bar - v_low) / (v_high - v_low)))
  Select K keyframes equidistant in cumulative arc-length space
        │
        ▼
DINOv3-ViT-H16+ (pooler_output  →  1024-d embedding per frame)
        │
        ▼
Adjacent cosine similarity:  S = mean(cos(F_t, F_{t+1}))  ↑ is better

5.5 Agent Appearance Consistency — DINOv3

Module: drivinggen/videos/video_a_consist.py

Per-agent tracked bounding boxes  [(frame_id, (x1,y1,x2,y2)), ...]
        │
        ▼
Crop agent region from each frame (min_size=32 px guard)
        │
        ▼
DINOv3-ViT-H16+  →  pooler_output  (1024-d per crop)
Cached to .pkl to avoid redundant forward passes
        │
        ├─ Reference consistency R = mean cosine sim to frame-0 embedding
        └─ Adjacent consistency  A = mean cosine sim between consecutive frames
        │
        ▼
Score = (R + A) / 2  per agent → mean over all agents and scenes  ↑ is better

5.6 Agent Missing Detection — Cosmos-Reason1

Module: drivinggen/videos/video_a_missing.py

Per-agent tracked bounding boxes
        │
        ▼
For each agent that disappears before frame 100:
  Extract 6 annotated frames:
    - 2 frames from agent's first appearance (green-boxed)
    - 2 frames from agent's last appearance (green-boxed)
    - 2 frames immediately after disappearance
        │
        ▼
Cosmos-Reason1-7B  (NVIDIA, vllm inference)
  Prompt: classify disappearance as Natural (occlusion, exit frame)
          vs Unnatural (abrupt/non-physical)
  Output: "Natural" → not missing penalty
          "Unnatural" → counted as unnatural disappearance
        │
        ▼
Missing rate = fraction of agents with unnatural disappearance
Score = 1 - missing_rate  ↑ is better

Stage 6 — Trajectory Metric Evaluation

Uses pre-extracted ego/agent trajectories from Stages 3–4.

6.1 Alignment Metrics — `drivinggen/trajs/traj_alignment.py`

All metrics operate on (N, T, 2) arrays (N trajectories, T timesteps, XZ coords).

Metric	Formula	Meaning
ADE	mean‖pred_t − gt_t‖ over t	Average displacement at every step
FDE	‖pred_T − gt_T‖	Displacement only at final step
Success Rate	FDE < 3.0 m	Fraction of near-perfect predictions
Hausdorff	max(min_j d(p_i,g_j), min_i d(p_i,g_j))	Worst-case coverage
DTW	dynamic programming alignment cost	Shape similarity ignoring temporal shift

6.2 Trajectory Consistency — `drivinggen/trajs/traj_consistency.py`

Trajectory (N, T, 2)
        │
        ▼
velocity v_t = ‖Δxy_t‖ / dt          (N, T-1)
acceleration a_t = Δv_t / dt          (N, T-2)
        │
        ▼
S = 0.5 · [exp(−σ_v / (μ_v + ε)) + exp(−σ_a / (μ_a + ε))]
                                       ↑ is better (0–1)

6.3 Trajectory Quality — `drivinggen/trajs/traj_quality.py`

Trajectory (N, T, 2)
        │
        ├─ Comfort Score:
        │    jerk       = d³x/dt³   (3rd derivative of position)
        │    yaw-rate   = dθ/dt     (heading rate from velocity direction)
        │    S_j = 1 / (1 + jerk_per_meter)    (higher = smoother)
        │    S_comf = geometric_mean(S_j, S_a, S_y)
        │
        ├─ Curvature RMS:
        │    κ(t) = |ẋÿ − ẏẍ| / (ẋ² + ẏ²)^1.5  (Frenet curvature)
        │    S_curv = 1 / (1 + RMS(κ))
        │
        └─ Speed Score:
             S_speed = log(1 + v_mean) / log(1 + v_max)  ∈ (0, 1)
             (higher = closer to typical driving speed)

6.4 Trajectory Distribution — FTD — `drivinggen/trajs/traj_distribution.py`

Predicted/GT trajectories
        │
        ▼
Sliding 11-frame windows (stride=10):
  Each window → MTR coordinate normalization (ego-centric frame)
              → Feature tensor:  (1, 1, 11, 29+T)
                (pos 6d + type 5d + time embed T+1 d + heading 2d + vel 2d + acc 2d)
        │
        ▼
MTR context encoder (agent_polyline_encoder):
  PointNet-style MLP on each 11-frame window
  Output: 256-d feature vector per window
  Mean pooled across windows → 1 feature per trajectory
        │
        ▼
Stack all features:  Fp (N_pred, 256)   Fg (N_gt, 256)
        │
        ▼
Fréchet Distance  =  ‖μ_p − μ_g‖² + Tr(Σ_p + Σ_g − 2√(Σ_p Σ_g))
FTD  ↓ is better

Setup Instructions

1. Clone the Repository

git clone https://github.com/youngzhou1999/DrivingGen.git
cd DrivingGen

2. Environment Setup

conda create -n drivinggen python=3.10
conda activate drivinggen

3. Install Dependencies

We recommend using the provided environment.yml for a full environment setup:

conda env create -f environment.yml
conda activate drivinggen

Alternatively, if you prefer installing into an existing environment:

pip install -r requirements.txt

Then install third-party packages:

cd third_parties/UniDepth && pip install -e . && cd ../..
cd third_parties/yolov10 && pip install -e . && cd ../..
cd third_parties/samurai && pip install -e . && cd ../..

4. Download Dataset

Download the DrivingGen dataset from Hugging Face. First, update your Hugging Face token in drivinggen/down_dataset.py, then run:

bash scripts/0-down_data.sh

The dataset will be downloaded to ./data/.

Video Generation

This section guides you through generating videos for evaluation using your own world generation model. We provide an example using Wan2.2-14B (Image-to-Video).

1. Run Inference

Configure and run video generation:

bash scripts/1-example_infer_model.sh

Key parameters in the script:

video_path=data/ego_condition.json   # Input metadata
out_dir=cache/infer_results          # Output directory
split=ego_condition                  # Data split (ego_condition / open_domain)
model=wan2.2-14b                     # Model name
exp_id=default_prompt                # Experiment ID

The generated videos (101 frames at 10 fps, 576x1024 resolution) will be saved as both MP4 videos and individual PNG frames.

2. Extract Ego Trajectory

Extract ego vehicle trajectory from generated videos using UniDepthV2 and Visual SLAM:

bash scripts/2-get_ego_traj.sh

3. Extract Agent Trajectories

Extract agent trajectories using YOLOv10 detection and depth estimation:

bash scripts/3-get_agent_traj.sh

Evaluation

DrivingGen evaluates generated videos using comprehensive video metrics and trajectory metrics.

Video Metrics

Evaluates visual quality and temporal coherence of generated videos:

Category	Metrics
Distribution	FVD (Frechet Video Distance)
Objective Quality	IEEE P2020 automotive imaging metrics (sharpness, exposure, contrast, color, noise, artifacts, texture, temporal)
Subjective Quality	CLIP-IQA+ based assessment
Scene Consistency	DINOv3 feature-based consistency
Agent Consistency	Agent appearance consistency and missing detection
Perceptual	LPIPS, SSIM

Run video evaluation:

bash scripts/4-get_video_metrics.sh

Trajectory Metrics

Evaluates the physical plausibility and accuracy of generated trajectories:

Category	Metrics
Distribution	FTD (Frechet Trajectory Distance) via Motion Transformer encoder
Alignment	ADE, FDE, Success Rate, Hausdorff Distance, DTW
Quality	Comfort Score (jerk, acceleration, yaw rate), Curvature RMS, Speed Score
Consistency	Velocity Consistency, Acceleration Consistency

Run trajectory evaluation:

bash scripts/5-get_traj_metrics.sh

Results will be saved to cache/eval_logs/.

Benchmarked Models

DrivingGen benchmarks 14 state-of-the-art models across three categories:

Category	Models
General Video World Models	Gen-3, Kling, CogVideoX, Wan, HunyuanVideo, LTX-Video, SkyReels
Physical World Models	Cosmos-Predict1, Cosmos-Predict2
Driving-Specific World Models	Vista, DrivingDojo, GEM, VaViM, UniFuture

Citation

If you find our research useful, please cite us as:

@misc{zhou2026drivinggencomprehensivebenchmarkgenerative,
      title={DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving},
      author={Yang Zhou and Hao Shao and Letian Wang and Zhuofan Zong and Hongsheng Li and Steven L. Waslander},
      year={2026},
      eprint={2601.01528},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.01528},
}

License

All code within this repository is under Apache License 2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
assets		assets
drivinggen		drivinggen
external_scripts		external_scripts
scripts		scripts
third_parties		third_parties
.gitignore		.gitignore
DrivingGen.pdf		DrivingGen.pdf
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml

Folders and files

Latest commit

History

Repository files navigation

DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Yang Zhou* , Hao Shao* , Letian Wang , Zhuofan Zong , Hongsheng Li , Steven L. Waslander ("*" denotes equal contribution)

Table of Contents

Updates

Overview

Code Flow

Stage 1 — Dataset Download

Stage 2 — Video Generation

Stage 3 — Ego Trajectory Extraction

Stage 4 — Agent Trajectory Extraction

Stage 5 — Video Metric Evaluation

5.1 Distribution — FVD

5.2 Objective Quality — P2020

5.3 Subjective Quality — CLIP-IQA+

5.4 Scene (Visual) Consistency — DINOv3

5.5 Agent Appearance Consistency — DINOv3

5.6 Agent Missing Detection — Cosmos-Reason1

Stage 6 — Trajectory Metric Evaluation

6.1 Alignment Metrics — drivinggen/trajs/traj_alignment.py

6.2 Trajectory Consistency — drivinggen/trajs/traj_consistency.py

6.3 Trajectory Quality — drivinggen/trajs/traj_quality.py

6.4 Trajectory Distribution — FTD — drivinggen/trajs/traj_distribution.py

Setup Instructions

1. Clone the Repository

2. Environment Setup

3. Install Dependencies

4. Download Dataset

Video Generation

1. Run Inference

2. Extract Ego Trajectory

3. Extract Agent Trajectories

Evaluation

Video Metrics

Trajectory Metrics

Benchmarked Models

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

6.1 Alignment Metrics — `drivinggen/trajs/traj_alignment.py`

6.2 Trajectory Consistency — `drivinggen/trajs/traj_consistency.py`

6.3 Trajectory Quality — `drivinggen/trajs/traj_quality.py`

6.4 Trajectory Distribution — FTD — `drivinggen/trajs/traj_distribution.py`

Packages