Code for the SafetyALFRED paper, published in the 2026 Findings of the Association for Computational Linguistics.
The repository contains:
- `src/pipeline_bundle/`: self-contained PDDL → plan → THOR rendering pipeline used to generate SafetyALFRED trajectories.
- `src/inference/`: vLLM-based inference scripts that run vision-language models on the QA and embodied-action evaluations.
- `scripts/`: SLURM batch wrappers that launch the inference scripts for each model family / mode.
- `src/evaluation/`: analysis scripts that consume the inference outputs and produce safety/accuracy metrics, alignment heatmaps, and failure-mode tables.
- `pddl_trajs/`, `safety_trajs/`, `dataset/`: original ALFRED PDDL trajectories, safety trajectories generated by the SafetyALFRED authors, and the released dataset assets.
git clone https://github.com/sled-group/SafetyALFRED.git
cd SafetyALFRED

We recommend conda / Miniconda:

conda create --name SafetyALFRED python==3.7.16
conda activate SafetyALFRED
pip install -r requirements.txt

The trajectory pipeline targets Python 3.7 and `ai2thor==5.0.0`. The inference scripts in `src/inference/` run inside a separate vLLM environment (Python 3.11 + vLLM); the SLURM wrappers in `scripts/` activate it via `conda activate vllm`.
The full rendering pipeline lives at src/pipeline_bundle/. All absolute paths inside that bundle have been rewritten to resolve relative to its own directory, so the bundle can be moved or extracted anywhere.
src/pipeline_bundle/
├── alfred/gen/ # scripts you call directly
│ ├── pipeline_pddl_to_video_thor5.py
│ ├── test_pipeline_safety_trajs.py
│ └── convert_plan_to_traj.py
├── E.T./alfred/ # framework imported as `alfred.*`
│ ├── env/ # ThorEnv (Thor 5.0 wrapper)
│ └── gen/ # constants, utils, graph, game_states,
│ # agents, planner, layouts, ff_planner,
│ # generate_problem_pddl_full_thor5.py,
│ # safety_initialization.py,
│ # render_plan_with_navigation.py
└── alfred_git/alfred/data/DANLI/pddl/
├── domain.pddl # PDDL domain
├── planner.py # planner wrapper
└── fast-downward-24.06.1/ # Fast Downward (you compile this — see step 1)
The bundle ships code only — not the Python environment or the planner binary.
- Fast Downward planner. Download the Fast Downward 24.06.1 release from the official GitHub repo, unpack it into `src/pipeline_bundle/alfred_git/alfred/data/DANLI/pddl/`, rename the resulting folder to `fast-downward-24.06.1`, and follow the planner's BUILD.md instructions to compile it.
- Python venv with `ai2thor==5.0.0`, `numpy`, `Pillow`, `termcolor`, `requests`, `scikit-video` (used by `alfred.gen.utils.video_util`), and `opencv-python`. The `requirements.txt` at the repo root covers these.
- `ffmpeg` on `$PATH`: used by `video_util.VideoSaver` to stitch frames into mp4.
- An X server / Xvfb for THOR rendering. THOR will fail without one. The pipeline scripts default to `--x_display 7`; pass whatever display your X server uses.
- Trajectory JSONs. No script ships data; each one expects you to point it at trajectories you already have. The originals are in `pddl_trajs/`; the SafetyALFRED-authored variants are in `safety_trajs/`.
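Before a first render it can help to check the prerequisites above in one go. The following is a minimal sketch, not part of the repository: the function name is hypothetical, and the X check assumes a local X server or Xvfb that exposes a Unix socket under `/tmp/.X11-unix/`, which may not hold on every system. It also only confirms that the Fast Downward folder exists, not that the planner has been compiled.

```python
import shutil
from pathlib import Path

# Planner location taken from the bundle layout shown earlier.
FAST_DOWNWARD_REL = Path("alfred_git/alfred/data/DANLI/pddl/fast-downward-24.06.1")


def find_missing_prerequisites(bundle_root, x_display="7", which=shutil.which):
    """Return human-readable problems; an empty list means every check passed.

    Hypothetical helper: `which` is injectable so the check can be tested.
    """
    problems = []
    if not (Path(bundle_root) / FAST_DOWNWARD_REL).is_dir():
        problems.append(f"Fast Downward folder not found: {FAST_DOWNWARD_REL}")
    if which("ffmpeg") is None:
        problems.append("ffmpeg is not on $PATH")
    # Assumption: a local X server / Xvfb creates /tmp/.X11-unix/X<display>.
    if not Path(f"/tmp/.X11-unix/X{x_display}").exists():
        problems.append(f"no X socket found for display :{x_display}")
    return problems


if __name__ == "__main__":
    for problem in find_missing_prerequisites("src/pipeline_bundle"):
        print("MISSING:", problem)
```

Run it from the repository root; any line it prints points at a prerequisite that still needs attention.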
export ET_ROOT="src/pipeline_bundle/E.T."
export ET_LOGS=$ET_ROOT/logs
export ET_DATA=$ET_ROOT/data
export PYTHONPATH=$PYTHONPATH:$ET_ROOT

cd src/pipeline_bundle/alfred/gen
python pipeline_pddl_to_video_thor5.py \
--traj_json /path/to/traj_data.json \
--output_dir /tmp/test \
--use_teleport \
--no_time_delays \
--no_smooth_nav \
--clear_microwave_objects \
--clear_sink_objects

`--traj_json` accepts either an original ALFRED trajectory from `pddl_trajs/` or one of the safety trajectories generated by the SafetyALFRED authors in `safety_trajs/`.
Note: Not every trajectory in safety_trajs/ will render successfully. The full safety_trajs/ set was rendered in batch with test_pipeline_safety_trajs.py and the authors then watched the resulting videos and manually filtered them. The exact JSONs that produced the released SafetyALFRED dataset are inside the converted_trajectory/ subdirectories of the SafetyALFRED Trajectories release on Hugging Face (e.g. SafetyALFREDTrajectories/appliance_misuse/pick_cool_then_place_in_recep/trial_T20190906_192817_654400/converted_trajectory/traj_data.json).
Outputs land in --output_dir:
- `problem.pddl`: generated PDDL problem
- `plan.txt`, `sas_plan`: planner output
- `plan_execution/plan_execution.mp4`: initial render of the plan
- `converted_trajectory/traj_data.json`: ALFRED-format trajectory
- `final_render/video.mp4`: final smooth-nav render
- `execution_log.json`, `debug.txt`: audit trail
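Because not every trajectory renders successfully, a quick completeness check on an output directory can be useful. This is a sketch under the assumption that a successful run produces every artifact listed above; the helper name is hypothetical and the pipeline itself may apply different success criteria.

```python
from pathlib import Path

# Artifact names as listed above, relative to --output_dir.
EXPECTED_ARTIFACTS = [
    "problem.pddl",
    "plan.txt",
    "sas_plan",
    "plan_execution/plan_execution.mp4",
    "converted_trajectory/traj_data.json",
    "execution_log.json",
    "debug.txt",
]
FINAL_RENDER = "final_render/video.mp4"


def missing_artifacts(output_dir, expect_final_render=True):
    """Return the expected artifacts that are absent from output_dir."""
    expected = list(EXPECTED_ARTIFACTS)
    if expect_final_render:  # pass False for runs made with --no_render_final
        expected.append(FINAL_RENDER)
    root = Path(output_dir)
    return [name for name in expected if not (root / name).is_file()]
```

For the example command above, `missing_artifacts("/tmp/test")` should return an empty list if all artifacts were written.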
| Flag | Type / default | Description |
|---|---|---|
| `--traj_json` | path, required | Path to an ALFRED `traj_data.json` file. |
| `--output_dir` | path, required | Directory where every output artifact (PDDL, plan, mp4s, logs) is written. |
| `--domain` | path, default = bundled `domain.pddl` | PDDL domain file. |
| `--x_display` | str, default `7` | X server display number used by THOR. |
| `--no_render_final` | flag | Skip the final smooth-nav render (much faster). |
| `--no_smooth_nav` | flag | Disable smooth navigation in the final render. |
| `--no_time_delays` | flag | Disable inter-step time delays in the final render. |
| `--no-dynamic-reachable` | flag | Use precomputed static reachable layouts instead of `GetReachablePositions` (may include blocked positions). |
| `--use_teleport` | flag | Use `TeleportFull` for navigation; agent can face objects at exact angles instead of 90° increments. |
| `--add_sink_item` | flag | For property-damage scenarios with a sink, add an extra sink-appropriate item during init. |
| `--alternative_cabinet N` | int | For fall/trip-hazard scenarios, use the N-th alternative cabinet (0-indexed) from cabinets with y < 1.00. |
| `--alternative_object_location N` | int | For appliance-misuse / property-damage scenarios, use the N-th alternative location (0-indexed, ≥ 1m away) for the target object. |
| `--clear_sink_objects` | flag | Remove all objects from the sink except `safety_object` and `target_object`. |
| `--clear_microwave_objects` | flag | Remove all objects from microwaves except `target_object` and `safety_object`. |
cd src/pipeline_bundle/alfred/gen
python test_pipeline_safety_trajs.py --help

The batch driver walks a directory of safety trajectories and shells out to `pipeline_pddl_to_video_thor5.py` once per trajectory (via `subprocess.run`, using its own working directory as the CWD; that is why both scripts must live side by side, which they do here).
| Flag | Type / default | Description |
|---|---|---|
| `--data_base` | path, default `/mnt/external-ssd-2/safety_trajs` | Base directory containing the safety trajectories to render. |
| `--output_base` | path, default `/tmp/pipeline_safety_test` | Base directory for the per-trajectory output folders. |
| `--x_display` | str, default `7` | X display number forwarded to each rendering child process. |
| `--max_trajs` | int, default `None` | Cap on the number of trajectories to render (useful for smoke tests). |
| `--hazard_type` | str, default `None` | Restrict to one hazard type (e.g. `appliance_misuse`). |
| `--split` | `{train, valid_seen, valid_unseen}` | Restrict to one ALFRED split. |
| `--seed` | int, default `42` | Random seed used to shuffle the trajectory list. |
| `--num_processes` | int, default `4` | Parallel rendering worker count. |
| `--python` | str, default `python` | Python executable used to launch each child rendering job (e.g. `/path/to/venv/bin/python`). |
| `--retry_from_log` | path | Path to a previous batch log; only PARTIAL/FAILED entries are retried. |
| `--retry_partial` / `--no_retry_partial` | flag (default on) | Retry trajectories that completed partially. |
| `--retry_failed` / `--no_retry_failed` | flag (default on) | Retry trajectories that failed outright. |
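The walk-and-shell-out pattern the batch driver follows can be sketched as below. This is an illustration only, not the code in `test_pipeline_safety_trajs.py`: the function names and the `traj_NNNN` output-folder naming are hypothetical, it assumes each trajectory is stored as a `traj_data.json` file somewhere under the base directory, and it runs jobs sequentially rather than with `--num_processes` workers.

```python
import random
import subprocess
import sys
from pathlib import Path

PIPELINE_SCRIPT = "pipeline_pddl_to_video_thor5.py"


def discover_trajectories(data_base, seed=42, max_trajs=None):
    """Collect traj_data.json files, shuffle them reproducibly, then cap the list."""
    trajs = sorted(p.resolve() for p in Path(data_base).rglob("traj_data.json"))
    random.Random(seed).shuffle(trajs)
    return trajs if max_trajs is None else trajs[:max_trajs]


def build_command(traj_json, output_dir, x_display="7", python_exe=sys.executable):
    """Build the child command using flags documented for the pipeline script."""
    return [
        python_exe, PIPELINE_SCRIPT,
        "--traj_json", str(traj_json),
        "--output_dir", str(output_dir),
        "--x_display", str(x_display),
    ]


def render_all(data_base, output_base, script_dir, seed=42, max_trajs=None):
    """Render each trajectory in turn; returns {trajectory path: exit code}."""
    results = {}
    output_base = Path(output_base).resolve()
    for i, traj in enumerate(discover_trajectories(data_base, seed, max_trajs)):
        # Hypothetical folder naming; the real driver may name outputs differently.
        cmd = build_command(traj, output_base / f"traj_{i:04d}")
        # cwd must be the directory that holds the pipeline script.
        results[str(traj)] = subprocess.run(cmd, cwd=script_dir).returncode
    return results
```

Paths are resolved to absolute form before launching because the child process runs with a different working directory.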
The SLURM batch scripts in scripts/ launch the three inference drivers in src/inference/. Each model family has a *_QA/ directory (the QA-style "is there a hazard?" evaluation) and a *_embodied/ directory (next-action prediction with few-shot in-context examples).
| Driver | Purpose | Script |
|---|---|---|
| QA evaluation | Asks the model whether the current frame contains a safety hazard, optionally in a few-shot "complex" prompt. | src/inference/qwen_vl_safety_eval_vllm.py |
| Embodied evaluation | Few-shot in-context next-action prediction over the SafetyALFRED trajectories. | src/inference/qwen_vl_fewshot_icl_eval_vllm_512.py |
| QA-conditioned embodied | Same as above, but conditioned on previously generated QA answers (`--qa-file`). | src/inference/qwen_vl_fewshot_icl_eval_vllm_with_qa.py |
All three use vLLM for inference and accept the same shared flags: --model, --output, --tensor-parallel-size, --max-num-seqs, --max-model-len, --quantization bitsandbytes, --load-in-4bit, --no-metadata (vision-only mode — drop the textual scene description), --super-batch-per-category, --categories, and --generated-only / --generated-mode {only,include,exclude} for ALFRED-generated trajectories.
| Flag | Type / default | Description |
|---|---|---|
| `--model` | str, default `Qwen/Qwen2.5-VL-32B-Instruct` | HF model id or local checkpoint path. Supported families: Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma 3, Llama-4 Scout/Maverick, MiniCPM-V. |
| `--data-file` | path, default `SafetyALFREDGold.json` | Source dataset of trajectories/turns to evaluate. |
| `--output` | path | Output JSONL file (one line per turn). Default differs per driver. |
| `--seed` | int, default `42` | Random seed (controls few-shot example sampling, etc.). |
| `--no-metadata` | flag | Drop textual scene metadata from the prompt (vision-only mode). |
| `--resume` | flag | Resume from an existing output file, skipping turns/trajectories already written. |
| `--tensor-parallel-size` | int, default `2` | Number of GPUs for tensor parallelism. |
| `--max-model-len` | int, default `8192` | Maximum model context length. |
| `--max-num-seqs` | int | vLLM batch size. Defaults: 16 (QA), None (embodied; vLLM picks). |
| `--quantization` | `{None, fp8, bitsandbytes}`, default `None` | Quantization mode (None ⇒ bfloat16). |
| `--load-in-4bit` | flag | 4-bit weights (requires `--quantization bitsandbytes`). |
| `--load-in-8bit` | flag | 8-bit weights (requires `--quantization bitsandbytes`). |
| `--super-batch-per-category` | flag | Process every turn in a category in one vLLM call instead of chunks of `max-num-seqs * 10`. Higher memory, fewer launches. |
| `--categories CAT [CAT …]` | choices: `appliance_misuse` `unsanitary` `property_damage` `fire_hazard` `spoilage` `fall_trip_hazard` `all` | Restrict evaluation to a subset of safety categories. Default: all six. |
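The difference between the default chunking and `--super-batch-per-category` can be pictured with a small sketch. It is not the drivers' implementation: the function name is hypothetical, and it assumes `max_num_seqs` is an integer (the table above notes the embodied driver defaults to `None`, a case this sketch does not cover).

```python
def iter_chunks(turns, max_num_seqs, super_batch=False):
    """Yield the groups of turns that would each go into one generation call.

    Default: chunks of max_num_seqs * 10 turns, as described in the flag table.
    super_batch=True: the whole category in a single chunk.
    """
    turns = list(turns)
    if not turns:
        return
    if super_batch:
        yield turns
        return
    size = max_num_seqs * 10
    for start in range(0, len(turns), size):
        yield turns[start:start + size]


# 350 turns with max_num_seqs=16 gives chunks of 160, 160, and 30 turns.
sizes = [len(chunk) for chunk in iter_chunks(range(350), max_num_seqs=16)]
```

With `super_batch=True` the same 350 turns form one chunk, which is why the flag trades memory for fewer launches.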
| Flag | Type / default | Description |
|---|---|---|
| `--num-trajectories` | int, default `None` | Cap on number of trajectories processed. |
| `--use-safety-history` | flag | Track hazard history across turns within each trajectory; disables cross-trajectory batching. |
| `--complex` | flag | Use the complex prompt with few-shot examples and a specialized system prompt. |
| `--num-examples` | int, default `1` | Few-shot examples per category in `--complex` mode. |
| `--no-examples` | flag | Run `--complex` mode in zero-shot (forces `--num-examples 0`). |
| `--generated-only` | flag | Process only ALFRED-generated trajectories (trajectory index ≥ 1001). |
| Flag | Type / default | Description |
|---|---|---|
| `--num-examples` | int, default `4` | Number of few-shot in-context examples (the shipped scripts use 1). |
| `--no-examples` | flag | Zero-shot mode (no in-context examples). |
| `--log-examples` | flag | Append the chosen few-shot examples to `examples_log.txt`. |
| `--generated-mode` | `{include, exclude, only}`, default `include` | How to handle ALFRED-generated trajectories: include alongside SafetyALFRED, skip entirely, or run only them. |
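The three `--generated-mode` values amount to a filter over trajectory indices. The sketch below assumes the embodied driver uses the same convention the QA driver documents for `--generated-only` (ALFRED-generated trajectories have index ≥ 1001); the function name is hypothetical.

```python
GENERATED_START_INDEX = 1001  # per the --generated-only description above


def filter_by_generated_mode(indices, mode="include"):
    """Select trajectory indices according to include / exclude / only."""
    if mode == "include":
        return list(indices)
    if mode == "exclude":
        return [i for i in indices if i < GENERATED_START_INDEX]
    if mode == "only":
        return [i for i in indices if i >= GENERATED_START_INDEX]
    raise ValueError(f"unknown generated mode: {mode!r}")
```

For example, `filter_by_generated_mode([1, 1000, 1001, 1500], "only")` keeps `1001` and `1500`.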
Inherits everything from qwen_vl_fewshot_icl_eval_vllm_512.py, plus:
| Flag | Type / default | Description |
|---|---|---|
| `--qa-file` | path, required for QA-conditioning | Previously generated QA JSONL (e.g. `gemma3_4b_qa_results_vllm_4bit_complete.jsonl`); the safety-judge answers from it are spliced into the embodied prompt. |
scripts/
├── gemma3_QA/ gemma3_embodied/
├── qwen2_5_QA/ qwen2_5_embodied/
└── qwen3_QA/ qwen3_embodied/
Each *_QA/ directory contains:
- `QA_<family>.sh`: base QA pass on SafetyALFRED trajectories (with and without metadata).
- `QA_<family>_complex.sh`: same models with `--complex --max-model-len 50000 --super-batch-per-category` (few-shot complex prompt).
- `QA_<family>_generated_all.sh`: runs Normal, Complex, and Complex Zeroshot over the ALFRED trajectories (`--generated-only`).
Each *_embodied/ directory contains:
- `Embodied_<family>_512.sh`: few-shot ICL embodied evaluation, 1 example per category.
- `Embodied_<family>_qa_conditioned_full.sh`: same, but conditioned on a precomputed QA result file via `--qa-file`.
- `Embodied_<family>_generated.sh`: embodied evaluation on ALFRED trajectories (`--generated-mode only`).
- `Embodied_<family>_qa_conditioned_full_generated_mode.sh`: QA-conditioned embodied over the ALFRED trajectories.
The shell scripts are SLURM batch files. From the directory that contains the inference Python scripts:
sbatch scripts/gemma3_QA/QA_gemma3.sh
sbatch scripts/gemma3_embodied/Embodied_gemma3_512.sh

If you are not on a SLURM cluster, copy the python … lines out of any of the shell scripts and run them inside the `vllm` conda environment. For example, the Gemma3-4B QA pass becomes:
cd src/inference
python qwen_vl_safety_eval_vllm.py \
--model /path/to/gemma-3-4b-it \
--output gemma3_4b_qa_results_vllm_4bit_interleaved.jsonl \
--tensor-parallel-size 2 --max-num-seqs 64 \
--quantization bitsandbytes --load-in-4bit

A few-shot embodied pass:
python qwen_vl_fewshot_icl_eval_vllm_512.py \
--model /path/to/gemma-3-4b-it \
--output gemma3_4b_fewshot_icl_results_4bit_interleaved_vllm_512.jsonl \
--tensor-parallel-size 2 --max-num-seqs 32 --max-model-len 50000 \
--quantization bitsandbytes --load-in-4bit \
--super-batch-per-category --num-examples 1 --log-examples

A QA-conditioned embodied pass, which takes the QA output file from the previous QA run:
python qwen_vl_fewshot_icl_eval_vllm_with_qa.py \
--model /path/to/gemma-3-4b-it \
--output gemma3_4b_fewshot_icl_results_4bit_qa_conditioned.jsonl \
--tensor-parallel-size 2 --max-num-seqs 32 --max-model-len 50000 \
--quantization bitsandbytes --load-in-4bit \
--super-batch-per-category --num-examples 1 --log-examples \
--qa-file gemma3_4b_qa_results_vllm_4bit_complete.jsonl

Each invocation writes its results to a single JSONL file (`--output …`). The evaluation scripts in §4 expect those JSONL files as input.
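To inspect a result file before handing it to the evaluation scripts, a small JSONL reader is enough. This sketch makes no assumption about the fields inside each record. It does assume that an interrupted job may leave a partial final line, which it drops; that behavior is a guess about failure modes, not something the drivers document.

```python
import json


def load_jsonl(path):
    """Parse one JSON object per line, skipping blank lines.

    Assumption: only the last line can be truncated (for example after a
    killed job), so a parse error there is ignored and any other is raised.
    """
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    records = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            if number == len(lines):
                break
            raise
    return records
```

`len(load_jsonl("results.jsonl"))` then gives the number of complete records, where `results.jsonl` stands in for whatever you passed to `--output`.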
| Family | Sizes | Source |
|---|---|---|
| Gemma 3 | 4B, 12B, 27B (instruction-tuned) | local mirror under models/gemma-3-*-it-local |
| Qwen2.5-VL | 7B, 32B, 72B (instruction-tuned) | local mirror under models/qwen-2_5_vl-*-instruct-local |
| Qwen3-VL | 4B, 8B, 32B (instruction-tuned) | Qwen/Qwen3-VL-{4B,8B,32B}-Instruct (Hugging Face) |
Replace the --model paths in the shell scripts with whatever local copies you have.
Once inference has produced QA (*_qa_*.jsonl) and embodied (*_fewshot_*.jsonl) result files, the scripts in src/evaluation/ turn them into the metrics reported in the paper.
SafetyALFREDAnalysis_script_batched_fully_optimized.py is the core analyzer. It loads a single (embodied, QA) pair, runs BART-large-MNLI in batched mode to score whether each model-generated hazard description entails the ground-truth hazard, and reports per-category results.
python src/evaluation/SafetyALFREDAnalysis_script_batched_fully_optimized.py \
--embodied path/to/<model>_fewshot_icl_results.jsonl \
--qa path/to/<model>_qa_results.jsonl \
--nli-batch-size 32

Outputs (printed to stdout, also consumed by the orchestrator below):
- Per-safety-category accuracy on the embodied next-action prediction (one of: appliance misuse, property damage, spoilage, unsanitary, fall/trip hazard, fire hazard).
- Per-category QA hazard-detection accuracy, with NLI-based credit for correctly identifying the hazard type.
- QA ↔ embodied alignment: how often the model's QA answer agrees with what its embodied policy actually does.
- ROC-AUC for hazard detection at the QA threshold sweep (uses `df_qa_threshold_configuration.pkl`).
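One simple way to read the QA ↔ embodied alignment metric is as a per-category agreement rate. The sketch below is an interpretation, not the analyzer's code: the tuple layout and the reduction of each turn to two booleans (did the QA answer flag the hazard, did the embodied action address it) are assumptions made for illustration.

```python
from collections import defaultdict


def alignment_rates(turns):
    """Per-category fraction of turns where QA and embodied behavior agree.

    turns: iterable of (category, qa_flags_hazard, embodied_handles_hazard)
    tuples; this layout is hypothetical.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for category, qa_flag, embodied_flag in turns:
        total[category] += 1
        agree[category] += int(bool(qa_flag) == bool(embodied_flag))
    return {category: agree[category] / total[category] for category in total}


example = [
    ("fire_hazard", True, True),    # detects the hazard and acts on it
    ("fire_hazard", True, False),   # detects the hazard but does not act
    ("spoilage", False, False),     # misses it in both settings
]
rates = alignment_rates(example)
```

Note that under this reading a model that misses a hazard in both settings still counts as aligned, so alignment should be read alongside the accuracy numbers.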
Arguments:
| Flag | Type / default | Description |
|---|---|---|
| `--embodied` | path, required | Embodied results JSONL produced by an embodied driver. |
| `--qa` | path, required | QA results JSONL produced by `qwen_vl_safety_eval_vllm.py`. |
| `--nli-batch-size` | int, default `32` | Batch size for the BART-large-MNLI entailment passes. |
run_all_evaluations_clean_with_generated.py reads a pairs.txt file with embodied,qa_simple,qa_complex columns, loads the NLI model once, runs the analyzer above for every pair, and aggregates everything into CSVs:
python src/evaluation/run_all_evaluations_clean_with_generated.py \
--use-batched \
--nli-batch-size 32

Arguments:
| Flag | Type / default | Description |
|---|---|---|
| `--nli-batch-size` | int, default `32` | Batch size forwarded to the per-pair analyzer's NLI passes. |
| `--use-batched` | flag | Use the batched NLI pipeline (groups by category for higher throughput). |
| `--include-generated` | flag | Add the ALFRED-generated trajectories as a 7th category alongside the 6 SafetyALFRED categories. |
| `--generated-only` | flag | Evaluate only generated trajectories. Mutually exclusive with `--include-generated`. |
| `--all-turns-accuracy` | flag | Compute embodied accuracy on every turn (known + unknown) per category, not just the safety-critical ones. |
| `--accuracy-only` | flag | Skip NLI / alignment computation entirely and produce only the accuracy table (much faster, no GPU needed). |
| `--gemini-only` | flag | Restrict the run to Gemini model files in `pairs.txt`. |
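When preparing your own `pairs.txt` for the orchestrator, it may help to validate it first. The exact file format is not spelled out beyond the three columns `embodied,qa_simple,qa_complex`, so this sketch assumes comma-separated lines, tolerates an optional header row, and skips blank lines and `#` comments; all of those conventions are guesses.

```python
import csv
import io
from typing import NamedTuple


class ResultPair(NamedTuple):
    embodied: str
    qa_simple: str
    qa_complex: str


def parse_pairs(text):
    """Parse pairs.txt content into ResultPair rows (assumed CSV layout)."""
    pairs = []
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if not any(cells) or cells[0].startswith("#"):
            continue  # blank line or comment (assumed convention)
        if cells == ["embodied", "qa_simple", "qa_complex"]:
            continue  # optional header row
        if len(cells) != 3:
            raise ValueError(f"expected 3 columns, got {len(cells)}: {row}")
        pairs.append(ResultPair(*cells))
    return pairs
```

Calling `parse_pairs(open("pairs.txt").read())` raises `ValueError` on the first malformed line, which is cheaper to discover here than after the NLI model has loaded.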
The orchestrator writes:
- `safety_evaluation_results_clean.csv`: full per-pair table.
- `alignment_heatmaps/qa_embodied_alignment_heatmap_{simple,complex}.csv`: alignment rates for the QA-vs-embodied heatmap figure.
- `alignment_breakdowns/qa_embodied_alignment_breakdown_{simple,complex}.csv`: full statistics broken out per category.
calculate_false_positive_rates.py measures how often each model over-detects hazards: i.e., answers "Yes, there is a hazard" on QA turns whose ground-truth subgoal does not include Remove Hazard. It splits results between ALFRED-generated and SafetyALFRED trajectories, and between vision-only (V) and description-aided (D) prompting.
python src/evaluation/calculate_false_positive_rates.py

Arguments: none. The script reads the QA-result paths from the hard-coded `pairs.txt` location (see §5).
Outputs to evaluation_results/non_hazardous_turns/:
- `false_positive_rates.csv`: per-(model, metadata) totals, false positives, and rates.
- `false_positive_summary.csv`: wide table with V/D side by side and an averaged column, plus a printed summary table.
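The false-positive rate described above reduces to a count over non-hazardous turns. The sketch below illustrates that arithmetic only: the `(predicted_hazard, subgoal)` pair layout is hypothetical, and the real script derives both values from the QA result files.

```python
def false_positive_rate(turns):
    """Return (false_positives, total_non_hazard_turns, rate).

    turns: iterable of (predicted_hazard, subgoal) pairs, where
    predicted_hazard is True when the model answered that a hazard is
    present and subgoal is the ground-truth subgoal text.
    """
    non_hazard = [
        bool(predicted)
        for predicted, subgoal in turns
        if "Remove Hazard" not in subgoal
    ]
    false_positives = sum(non_hazard)
    total = len(non_hazard)
    rate = false_positives / total if total else float("nan")
    return false_positives, total, rate


example = [
    (True, "GoTo"),           # over-detection
    (False, "PickupObject"),  # correct rejection
    (True, "Remove Hazard"),  # hazardous turn, excluded from the rate
    (True, "PutObject"),      # over-detection
]
fp, total, rate = false_positive_rate(example)
```

Here two of the three non-hazardous turns are flagged, so the rate is 2/3.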
analyze_incorrect_actions_by_category.py reproduces the "Comprehensive Analysis of Incorrect Actions by Category" table from the paper. For each safety category it counts the most common incorrect next actions predicted by the embodied models (e.g. "GoTo" instead of "Remove Hazard" in fall/trip turns).
python src/evaluation/analyze_incorrect_actions_by_category.py

Arguments: none. The script discovers fewshot embodied result files from the hard-coded `pairs.txt` location (see §5).
Outputs:
- A printed summary with the top-10 incorrect actions per category and their share of failures.
- A LaTeX `tabular` for the paper.
- `incorrect_actions_analysis.json`: the underlying counts and 10 example failures per category.
The script targets fewshot (non-QA-conditioned) embodied results; expected per-category dominant failures (Fall/Trip → 74.18% GoTo, Appliance Misuse → 73.03% CloseObject/ToggleObjectOn Microwave, Property Damage → 47.80% ToggleObjectOn Faucet, Fire Hazard → 44.19% PickupObject wrong object, Spoilage → 69.35% PutObject in goal receptacle) are documented in the script header.
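The per-category "top incorrect actions and their share of failures" summary is essentially a grouped frequency count. A minimal sketch, assuming failures have already been reduced to `(category, predicted_action)` pairs (a hypothetical layout; the script works from the raw result files):

```python
from collections import Counter, defaultdict


def top_incorrect_actions(failures, k=10):
    """Map each category to its k most common wrong actions.

    Each entry is (action, count, percent_of_that_category's_failures).
    """
    per_category = defaultdict(Counter)
    for category, predicted in failures:
        per_category[category][predicted] += 1
    summary = {}
    for category, counts in per_category.items():
        total = sum(counts.values())
        summary[category] = [
            (action, n, 100.0 * n / total)
            for action, n in counts.most_common(k)
        ]
    return summary


# Toy data, not results from the paper.
toy = [("fall_trip_hazard", "GoTo")] * 3 + [("fall_trip_hazard", "PickupObject")]
summary = top_incorrect_actions(toy)
```

In the toy data, `GoTo` accounts for 3 of 4 failures, so it is reported first with a 75% share.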
analyze_non_safety_actions.py measures accuracy on the non-safety turns — every turn that the per-category check_embodied functions ignore — and splits the result between ALFRED-generated trajectories ("generated") and SafetyALFRED trajectories ("accepted"), with and without metadata.
# default: exclude GoTo navigation actions
python src/evaluation/analyze_non_safety_actions.py
# include goto navigation actions
python src/evaluation/analyze_non_safety_actions.py --include-goto

Arguments:
| Flag | Type / default | Description |
|---|---|---|
| `--include-goto` | flag | Include GoTo navigation actions in the non-safety accuracy. They are excluded by default because the trajectory rendering can produce many GoTos that dominate the metric. |
Output is a per-model accuracy table (printed and saved as CSV) showing whether models that handle the safety turns well also keep up on routine task progress.
analyze_success_safety_trajectories.py classifies each trajectory into one of four buckets — Successful & Safe, Successful & Unsafe, Unsuccessful & Safe, Unsuccessful & Unsafe — where:
- Safe = every safety turn (those checked by the per-category `check_embodied` functions) is correct.
- Successful = every non-safety action is correct (with a separate variant that excludes `GoTo` actions).
python src/evaluation/analyze_success_safety_trajectories.py

Arguments: none. The script enumerates the four configurations internally and reads its file list from the hard-coded `pairs.txt` location (see §5).
The script runs four versions in one go: {with GoTo, without GoTo} × {strict non-safety = excludes Remove Hazard subgoal actions, loose non-safety = all actions not checked by check_embodied}, and writes the per-version summary CSVs and LaTeX tables under evaluation_results/non_hazardous_turns/.
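The four-bucket classification can be sketched as follows. This covers only the with/without `GoTo` dimension, not the strict versus loose definition of non-safety turns, and the per-turn dictionary keys (`action`, `is_safety_turn`, `correct`) are hypothetical stand-ins for whatever the script computes internally.

```python
def classify_trajectory(turns, include_goto=True):
    """Return one of the four bucket labels for a single trajectory.

    turns: list of dicts with hypothetical keys 'action' (str),
    'is_safety_turn' (bool), and 'correct' (bool).
    """
    safe = all(t["correct"] for t in turns if t["is_safety_turn"])
    routine = [
        t for t in turns
        if not t["is_safety_turn"] and (include_goto or t["action"] != "GoTo")
    ]
    successful = all(t["correct"] for t in routine)
    return (
        ("Successful" if successful else "Unsuccessful")
        + " & "
        + ("Safe" if safe else "Unsafe")
    )


trajectory = [
    {"action": "GoTo", "is_safety_turn": False, "correct": False},
    {"action": "PickupObject", "is_safety_turn": False, "correct": True},
    {"action": "ToggleObjectOff", "is_safety_turn": True, "correct": True},
]
with_goto = classify_trajectory(trajectory, include_goto=True)
without_goto = classify_trajectory(trajectory, include_goto=False)
```

The example trajectory is "Unsuccessful & Safe" when `GoTo` turns count and "Successful & Safe" when they are excluded, which shows how much navigation errors can move trajectories between buckets.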
The evaluation scripts contain hard-coded paths under /nfs/turbo/coe-chaijy-unreplicated/josuetf/… (the cluster they were originally run on) for pairs.txt, model checkpoints, and output directories. Edit these to match your environment before running. The inference scripts read paths from CLI flags, so they are portable as-is.