Autonomous driving planners are typically evaluated using aggregate metrics such as collision rate, route completion, and comfort, which do not explicitly measure compliance with traffic rules. As a result, planners can achieve high benchmark scores while still exhibiting unsafe or illegal behaviors, limiting their applicability to real-world deployment. To address this gap, we introduce TrafficRuleBench, a large-scale, rule-centric benchmark for systematic and interpretable evaluation of traffic-rule compliance in autonomous driving. Our framework combines real-map-based simulation for realistic road layouts with rule-targeted procedural scenario generation for scalable and balanced coverage of underrepresented rules. We implement traffic rules corresponding to 45 traffic signs, each equipped with an automatic rule checker for detecting violations during closed-loop execution. This design yields 15,200 diverse road scenes and 18 distinct testing scenario types, enabling controlled evaluation of rule-specific planner behavior. We construct 5,400 testing scenes and demonstrate that current autonomous driving planners can exhibit poor traffic-rule compliance despite strong performance on standard evaluation metrics. To address this limitation, we transform existing planners into rule-compliant trajectory experts via explicit traffic-sign constraints, enabling scalable generation of high-quality oracle trajectories for fine-tuning. Code and data are publicly available at github.com/emb-ai/traffic-rule-bench and huggingface.co/datasets/emb-ai/traffic-rule-bench.
| Family | Baseline | `--policy` | Notes |
|---|---|---|---|
| Base | IDM (5 ego variants) | `idm` | + `--ego-variant default/s1..s4` |
| Base | PPO | `ppo_expert` | |
| Base | CaRL | `carl` | `--model-path` required |
| Base | PlanT2 | `plant2` | `--model-path` required |
| Fine-tuned | PlanT2 fine-tuned on TrafficRuleBench | `plant2` | `--model-path` to fine-tuned `.pt` |
| Rule-augmented | IDM + rule overlay | `comprehensive_rule_expert` | + `--ego-variant` |
| Rule-augmented | PPO + rule overlay | `rule_compliant` | |
| Rule-augmented | CaRL + rule overlay | `carl_rule` | `--model-path` required |
| Rule-augmented | PlanT2 + rule overlay | `plant2_rule` | `--model-path` required |
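The flag requirements in the table can be checked before launching a run. The following is a minimal sketch, not part of the repository: `build_command` and the two lookup sets are hypothetical names, while the policy names and flags are the ones listed in this README for `run_benchmark_mini.py`.

```python
import shlex

# Policy names and flag requirements are copied from the baseline table.
NEEDS_MODEL_PATH = {"carl", "plant2", "carl_rule", "plant2_rule"}
ACCEPTS_EGO_VARIANT = {"idm", "comprehensive_rule_expert"}
ALL_POLICIES = NEEDS_MODEL_PATH | ACCEPTS_EGO_VARIANT | {"ppo_expert", "rule_compliant"}


def build_command(policy, run_name, manifest, model_path=None, ego_variant=None):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    if policy not in ALL_POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    if policy in NEEDS_MODEL_PATH and model_path is None:
        raise ValueError(f"--model-path is required for policy {policy!r}")
    if ego_variant is not None and policy not in ACCEPTS_EGO_VARIANT:
        raise ValueError(f"--ego-variant does not apply to policy {policy!r}")

    argv = ["python", "run_benchmark_mini.py",
            "--policy", policy,
            "--run-name", run_name,
            "--manifest", manifest,
            "--emit-replay-sidecar"]
    if model_path is not None:
        argv += ["--model-path", model_path]
    if ego_variant is not None:
        argv += ["--ego-variant", ego_variant]
    return argv


if __name__ == "__main__":
    cmd = build_command("idm", "idm_2_5", "../../test/2_5/real_manifest.jsonl",
                        ego_variant="s1")
    print(shlex.join(cmd))
```

Returning an argv list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting.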
Submodules:
- MetaDrive — simulation backend with sign-aware extensions
- PlanT2 — PlanT2 policy and training pipeline
- CaRL — CaRL policy
1. Clone the repository with submodules:

       git clone --recurse-submodules https://github.com/emb-ai/traffic-rule-bench
       cd traffic-rule-bench
       git submodule update --init --recursive

2. Create the main conda env:

       conda create --name metadrive_signs python=3.10
       conda activate metadrive_signs
       pip install -e metadrive
       pip install eclipse-sumo sumolib pyproj stable_baselines3
       pip install pandas "geopandas<1.0" gym timm
       pip install -e pdd-bench

3. (Optional) PlanT2 env for the `plant2`/`plant2_rule` baselines:

       cd plant2
       conda env update -f environment.yml --prune
       conda activate plant2
       pip install gymnasium panda3d panda3d-gltf progressbar pygame sumolib einops
       pip install -e ../metadrive
       cd ..
TrafficRuleBench uses three checkpoint sets. Base CaRL and PlanT2 weights come from the original authors' releases; the fine-tuned PlanT2 weights come from our HuggingFace model hub.
| Model | Source | Default location |
|---|---|---|
| CaRL (base) | autonomousvision/CaRL — see their release/checkpoints | pdd-bench/checkpoints/CaRL/model_best.pth |
| PlanT2 (base, pretrain) | emb-ai/plant2 / autonomousvision/plant | pdd-bench/checkpoints/plant2/epoch%3D029_final_3.ckpt |
| PlanT2 (fine-tuned) | 🤗 emb-ai/traffic-rule-bench-models | pdd-bench/checkpoints/plant2/plant2_supervised_2nd_final.pt |
Download the fine-tuned PlanT2 checkpoint:

    pip install huggingface_hub
    huggingface-cli download emb-ai/traffic-rule-bench-models --local-dir pdd-bench/checkpoints

For CaRL and the PlanT2 pretrain weights, follow the download instructions in each upstream repository and place the resulting files under `pdd-bench/checkpoints/`.
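Because the three checkpoint sets come from different sources, it can help to confirm that each file landed at its default location before starting a run. This is a small sketch under that assumption: `missing_checkpoints` is a hypothetical helper, and the relative paths are the defaults from the checkpoint table.

```python
from pathlib import Path

# Default locations copied from the checkpoint table,
# relative to pdd-bench/checkpoints.
DEFAULT_CHECKPOINTS = {
    "CaRL (base)": "CaRL/model_best.pth",
    "PlanT2 (base, pretrain)": "plant2/epoch%3D029_final_3.ckpt",
    "PlanT2 (fine-tuned)": "plant2/plant2_supervised_2nd_final.pt",
}


def missing_checkpoints(checkpoint_root="pdd-bench/checkpoints"):
    """Return the names of models whose default checkpoint file is absent."""
    root = Path(checkpoint_root)
    return [name for name, rel in DEFAULT_CHECKPOINTS.items()
            if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All default checkpoints found.")
```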
SUMO road layouts (`.net.xml`) and per-sign test manifests (`.jsonl`) live in the HuggingFace dataset `emb-ai/traffic-rule-bench`. Download both into `pdd-bench/`:

    huggingface-cli download emb-ai/traffic-rule-bench \
      --repo-type dataset \
      --local-dir pdd-bench

This produces:

    pdd-bench/
    ├── scenes/{sign_code}/sign_NNNNNN/*.net.xml
    └── test/{sign_code}/*.jsonl

Pass a manifest to the runner via `--manifest pdd-bench/test/<sign>/<file>.jsonl`. Scripts default to `pdd-bench/scenes` for `--scenes-root`.
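To see which manifests are available after the download, the `test/{sign_code}/*.jsonl` layout can be walked directly. The sketch below assumes only that layout and the JSONL convention of one record per line; it does not assume any field names inside the records, and both function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def discover_manifests(test_root="pdd-bench/test"):
    """Map each sign code (directory name) to its sorted manifest paths."""
    manifests = defaultdict(list)
    for path in sorted(Path(test_root).glob("*/*.jsonl")):
        manifests[path.parent.name].append(path)
    return dict(manifests)


def count_entries(manifest_path):
    """Count non-empty lines; JSONL stores one JSON record per line."""
    with open(manifest_path, encoding="utf-8") as fh:
        return sum(1 for line in fh if line.strip())


if __name__ == "__main__":
    for sign, paths in discover_manifests().items():
        for path in paths:
            print(f"{sign}\t{path.name}\t{count_entries(path)} entries")
```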
Each manifest in `pdd-bench/test/<sign>/<file>.jsonl` is a self-contained set of scenes for one sign. Run a baseline against any manifest with `run_benchmark_mini.py`. Each run produces `episodes_<policy>.jsonl` plus per-episode `replay.json` sidecars (needed for the metrics pipeline).

    cd pdd-bench/scripts/per_sign_bench
    python run_benchmark_mini.py \
      --policy idm \
      --run-name idm_2_5 \
      --manifest ../../test/2_5/real_manifest.jsonl \
      --emit-replay-sidecar

Models that require checkpoints (`carl`, `plant2`, `*_rule`) need `--model-path`:

    python run_benchmark_mini.py \
      --policy plant2 \
      --run-name plant2_2_5 \
      --manifest ../../test/2_5/real_manifest.jsonl \
      --model-path ../../checkpoints/plant2/epoch%3D029_final_3.ckpt \
      --emit-replay-sidecar

To run one baseline over every manifest:

    cd pdd-bench/scripts/per_sign_bench
    for f in ../../test/*/*.jsonl; do
      sign=$(basename "$(dirname "$f")")
      src=$(basename "$f" .jsonl)
      python run_benchmark_mini.py \
        --policy idm \
        --run-name "idm_${sign}_${src}" \
        --manifest "$f" \
        --emit-replay-sidecar
    done

Repeat the loop for each baseline you want to evaluate (`comprehensive_rule_expert`, `carl`, `plant2`, etc.).
The Yield sign (2.4) uses a dedicated runner that adds yield-specific termination conditions and an optional top-down GIF recorder:

    cd pdd-bench/scripts/per_sign_bench
    python yield_run_benchmark_mini_plant2.py \
      --policy idm \
      --run-name idm_yield \
      --manifest /path/to/2_4_manifest.jsonl \
      --sign-type 2.4 \
      --emit-replay-sidecar \
      --save-gifs  # optional: record top-down GIFs

Compute metrics for a single run:

    bash pdd-bench/scripts/per_sign_bench/run_metrics_single_run.sh \
      --run-dir eval_out/runs/idm_2_5 \
      --out-dir eval_out/metrics_idm_2_5 \
      --policy idm

Outputs:

- `metrics_per_episode.csv`: episode-level table
- `aggregations/agg_per_baseline.csv`: per-baseline summary
- `reports/report_cumulative.md`: markdown report table
- `reports/report_cumulative_categories.md`: per-category breakdown
Run the full metrics pipeline:

    ROOT=eval_out bash pdd-bench/scripts/per_sign_bench/run_full_metrics_pipeline.sh

    # Skip consolidation if replay jsonl files already exist:
    SKIP_CONSOLIDATE=1 ROOT=eval_out bash pdd-bench/scripts/per_sign_bench/run_full_metrics_pipeline.sh

The pipeline runs:

1. `consolidate_replays.py`: merges `replay.json` sidecars into `<baseline>_replays.jsonl`
2. `build_episode_metrics_csv.py`: builds `metrics_per_episode.csv`
3. `build_oracle_baseline.py`: adds the `oracle_rule` synthetic baseline
4. `aggregate_episode_metrics.py`: aggregates to per-baseline and per-sign CSVs
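The consolidation step can be pictured as follows. This is an illustrative sketch, not the repository's `consolidate_replays.py`: it assumes the sidecars are files named `replay.json` somewhere beneath a run directory, and the function shown here is hypothetical.

```python
import json
from pathlib import Path


def consolidate_replays(run_dir, out_path):
    """Write every replay.json found under run_dir as one line of out_path.

    Returns the number of replays written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Sorting keeps the output order stable across reruns.
        for sidecar in sorted(Path(run_dir).rglob("replay.json")):
            with open(sidecar, encoding="utf-8") as fh:
                record = json.load(fh)
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

For example, `consolidate_replays("eval_out/runs/idm_2_5", "eval_out/idm_replays.jsonl")` would produce one JSONL file for the `idm` baseline; the actual output location used by the pipeline may differ.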
To add the `oracle_rule` baseline to an existing metrics CSV, run the script directly:

    python3 pdd-bench/scripts/per_sign_bench/build_oracle_baseline.py \
      --csv eval_out/metrics_per_episode.csv

| Metric | Description |
|---|---|
| `target_compliant_event` | Ego obeyed the target sign within its zone (primary rule metric) |
| `arrived_dest` | Reached the destination |
| `route_completion` | Percent of route covered (0–100) |
| `total_violations` | Total traffic rule violations (all signs) |
| `comfort` | Standard nuPlan-based kinematic smoothness ratio |
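For a quick look at these metrics outside the pipeline, the episode-level CSV can be averaged per column. The sketch below assumes that `metrics_per_episode.csv` has one column per metric, named as in the table, and that boolean metrics are stored either as 0/1 or as `True`/`False`; neither detail is confirmed by this README, and `summarize` is a hypothetical helper.

```python
import csv
from statistics import mean

# Metric names copied from the table; the CSV column names are assumed to match.
METRICS = ["target_compliant_event", "arrived_dest", "route_completion",
           "total_violations", "comfort"]


def _to_float(text):
    """Parse a CSV cell, treating True/False as 1.0/0.0."""
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return 1.0 if lowered == "true" else 0.0
    return float(text)


def summarize(csv_path, metrics=METRICS):
    """Return the mean of each metric column that is present in the CSV."""
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    summary = {}
    for name in metrics:
        values = [_to_float(r[name]) for r in rows if r.get(name) not in (None, "")]
        if values:
            summary[name] = mean(values)
    return summary
```

Columns that are missing or empty are skipped rather than reported as zero, so an incomplete CSV does not silently distort the averages.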
TODO