SafetyALFRED

Code for the SafetyALFRED paper, published in the 2026 Findings of the Association for Computational Linguistics.

The repository contains:

  • src/pipeline_bundle/ — self-contained PDDL → plan → THOR rendering pipeline used to generate SafetyALFRED trajectories.
  • src/inference/ — vLLM-based inference scripts that run vision-language models on the QA and embodied-action evaluations.
  • scripts/ — SLURM batch wrappers that launch the inference scripts for each model family / mode.
  • src/evaluation/ — analysis scripts that consume the inference outputs and produce safety/accuracy metrics, alignment heatmaps, and failure-mode tables.
  • pddl_trajs/, safety_trajs/, dataset/ — original ALFRED PDDL trajectories, safety trajectories generated by the SafetyALFRED authors, and the released dataset assets.

1. Setup

1.1 Cloning the repository

git clone https://github.com/sled-group/SafetyALFRED.git
cd SafetyALFRED

1.2 Conda environment

We recommend conda / Miniconda:

conda create --name SafetyALFRED python==3.7.16
conda activate SafetyALFRED
pip install -r requirements.txt

The trajectory pipeline targets Python 3.7 and ai2thor==5.0.0. The inference scripts in src/inference/ are run inside a separate vLLM environment (Python 3.11 + vLLM); the SLURM wrappers in scripts/ activate it via conda activate vllm.


2. Generating and rendering SafetyALFRED trajectories

The full rendering pipeline lives at src/pipeline_bundle/. All absolute paths inside that bundle have been rewritten to resolve relative to its own directory, so the bundle can be moved or extracted anywhere.

2.1 Bundle layout

src/pipeline_bundle/
├── alfred/gen/                       # scripts you call directly
│   ├── pipeline_pddl_to_video_thor5.py
│   ├── test_pipeline_safety_trajs.py
│   └── convert_plan_to_traj.py
├── E.T./alfred/                      # framework imported as `alfred.*`
│   ├── env/                          # ThorEnv (Thor 5.0 wrapper)
│   └── gen/                          # constants, utils, graph, game_states,
│                                     # agents, planner, layouts, ff_planner,
│                                     # generate_problem_pddl_full_thor5.py,
│                                     # safety_initialization.py,
│                                     # render_plan_with_navigation.py
└── alfred_git/alfred/data/DANLI/pddl/
    ├── domain.pddl                   # PDDL domain
    ├── planner.py                    # planner wrapper
    └── fast-downward-24.06.1/        # Fast Downward (you compile this — see step 1)

2.2 Prerequisites on the rendering machine

The bundle ships code only — not the Python environment or the planner binary.

  1. Fast Downward planner. Download the Fast Downward 24.06.1 release from the official GitHub repo, unpack it into src/pipeline_bundle/alfred_git/alfred/data/DANLI/pddl/, rename the resulting folder to fast-downward-24.06.1, and follow the planner's BUILD.md instructions to compile it.
  2. Python venv with ai2thor==5.0.0, numpy, Pillow, termcolor, requests, scikit-video (used by alfred.gen.utils.video_util), and opencv-python. The requirements.txt at the repo root covers these.
  3. ffmpeg on $PATH — used by video_util.VideoSaver to stitch frames into mp4.
  4. An X server / Xvfb for THOR rendering. THOR will fail without one. The pipeline scripts default to --x_display 7; pass whatever display your X server uses.
  5. Trajectory JSONs. No script ships data; each one expects you to point it at trajectories you already have. The originals are in pddl_trajs/; the SafetyALFRED-authored variants are in safety_trajs/.
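
The prerequisites above can be checked before a long rendering run. The following is a minimal sketch, not part of the repository: the function name `preflight` is hypothetical, and the X socket path is a common Linux convention rather than something the pipeline documents. It does not verify that Fast Downward was actually compiled, only that the expected folder exists.

```python
import os
import shutil
from pathlib import Path

# Relative location of the planner inside the bundle, taken from the layout in section 2.1.
PLANNER_SUBDIR = "alfred_git/alfred/data/DANLI/pddl/fast-downward-24.06.1"


def preflight(bundle_root, x_display="7", env=None):
    """Return a list of problems found; an empty list means every check passed."""
    env = os.environ if env is None else env
    problems = []

    planner_dir = Path(bundle_root) / PLANNER_SUBDIR
    if not planner_dir.is_dir():
        problems.append(f"Fast Downward directory not found: {planner_dir}")

    if shutil.which("ffmpeg", path=env.get("PATH")) is None:
        problems.append("ffmpeg is not on PATH")

    # Heuristic (assumption): a local X server on Linux usually exposes this socket.
    x_socket = Path(f"/tmp/.X11-unix/X{x_display}")
    if not x_socket.exists():
        problems.append(f"no X socket at {x_socket}; is Xvfb running on :{x_display}?")

    return problems


if __name__ == "__main__":
    for problem in preflight("src/pipeline_bundle"):
        print("WARNING:", problem)
```

Run it from the repository root; any printed warning points at the prerequisite to revisit.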

2.3 Environment variables

export ET_ROOT="src/pipeline_bundle/E.T."
export ET_LOGS=$ET_ROOT/logs
export ET_DATA=$ET_ROOT/data
export PYTHONPATH=$PYTHONPATH:$ET_ROOT
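
Note that the ET_ROOT value above is relative, so it resolves against whatever directory the process is started from, and section 2.4 changes into src/pipeline_bundle/alfred/gen before running. If that causes import errors, absolute paths may be more robust. The sketch below builds the same four variables with absolute paths; the helper name `et_environment` is hypothetical and not part of the repository.

```python
import os
from pathlib import Path


def et_environment(repo_root, base_env=None):
    """Return a copy of the environment with absolute ET_* paths and PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    et_root = (Path(repo_root) / "src" / "pipeline_bundle" / "E.T.").resolve()

    env["ET_ROOT"] = str(et_root)
    env["ET_LOGS"] = str(et_root / "logs")
    env["ET_DATA"] = str(et_root / "data")

    # Append ET_ROOT to any existing PYTHONPATH, mirroring the export above.
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = os.pathsep.join(p for p in (existing, str(et_root)) if p)
    return env


if __name__ == "__main__":
    for name in ("ET_ROOT", "ET_LOGS", "ET_DATA", "PYTHONPATH"):
        print(f"export {name}={et_environment('.')[name]}")
```

The returned dictionary can also be passed as `env=` to `subprocess.run` when launching the pipeline from Python.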

2.4 Generating and rendering a single trajectory

cd src/pipeline_bundle/alfred/gen

python pipeline_pddl_to_video_thor5.py \
    --traj_json /path/to/traj_data.json \
    --output_dir /tmp/test \
    --use_teleport \
    --no_time_delays \
    --no_smooth_nav \
    --clear_microwave_objects \
    --clear_sink_objects

--traj_json accepts either an original ALFRED trajectory from pddl_trajs/ or one of the safety trajectories generated by the SafetyALFRED authors in safety_trajs/.

Note: Not every trajectory in safety_trajs/ will render successfully. The full safety_trajs/ set was rendered in batch with test_pipeline_safety_trajs.py and the authors then watched the resulting videos and manually filtered them. The exact JSONs that produced the released SafetyALFRED dataset are inside the converted_trajectory/ subdirectories of the SafetyALFRED Trajectories release on Hugging Face (e.g. SafetyALFREDTrajectories/appliance_misuse/pick_cool_then_place_in_recep/trial_T20190906_192817_654400/converted_trajectory/traj_data.json).

Outputs land in --output_dir:

  • problem.pddl — generated PDDL problem
  • plan.txt, sas_plan — planner output
  • plan_execution/plan_execution.mp4 — initial render of the plan
  • converted_trajectory/traj_data.json — ALFRED-format trajectory
  • final_render/video.mp4 — final smooth-nav render
  • execution_log.json, debug.txt — audit trail
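
A quick way to confirm that a run produced everything in the list above is to check the output directory for each artifact. This is a sketch under assumptions: the helper `missing_outputs` is hypothetical, and it assumes that final_render/video.mp4 is absent when --no_render_final is passed.

```python
from pathlib import Path

# Artifact names copied from the output list above.
EXPECTED_OUTPUTS = [
    "problem.pddl",
    "plan.txt",
    "sas_plan",
    "plan_execution/plan_execution.mp4",
    "converted_trajectory/traj_data.json",
    "final_render/video.mp4",
    "execution_log.json",
    "debug.txt",
]


def missing_outputs(output_dir, expect_final_render=True):
    """Return the expected artifacts that are not present under output_dir."""
    root = Path(output_dir)
    missing = []
    for relative in EXPECTED_OUTPUTS:
        if not expect_final_render and relative.startswith("final_render/"):
            continue  # assumed to be skipped by --no_render_final
        if not (root / relative).is_file():
            missing.append(relative)
    return missing
```

For example, `missing_outputs("/tmp/test")` returns an empty list after a fully successful run of the command above.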

Arguments — pipeline_pddl_to_video_thor5.py

Flag | Type / default | Description
--traj_json | path, required | Path to an ALFRED traj_data.json file.
--output_dir | path, required | Directory where every output artifact (PDDL, plan, mp4s, logs) is written.
--domain | path, default = bundled domain.pddl | PDDL domain file.
--x_display | str, default 7 | X server display number used by THOR.
--no_render_final | flag | Skip the final smooth-nav render (much faster).
--no_smooth_nav | flag | Disable smooth navigation in the final render.
--no_time_delays | flag | Disable inter-step time delays in the final render.
--no-dynamic-reachable | flag | Use precomputed static reachable layouts instead of GetReachablePositions (may include blocked positions).
--use_teleport | flag | Use TeleportFull for navigation; agent can face objects at exact angles instead of 90° increments.
--add_sink_item | flag | For property-damage scenarios with a sink, add an extra sink-appropriate item during init.
--alternative_cabinet N | int | For fall/trip-hazard scenarios, use the N-th alternative cabinet (0-indexed) from cabinets with y < 1.00.
--alternative_object_location N | int | For appliance-misuse / property-damage scenarios, use the N-th alternative location (0-indexed, ≥ 1m away) for the target object.
--clear_sink_objects | flag | Remove all objects from the sink except safety_object and target_object.
--clear_microwave_objects | flag | Remove all objects from microwaves except target_object and safety_object.

2.5 Batch generating and rendering

cd src/pipeline_bundle/alfred/gen
python test_pipeline_safety_trajs.py --help

The batch driver walks a directory of safety trajectories and shells out to pipeline_pddl_to_video_thor5.py once per trajectory (via subprocess.run, using its own working directory as the CWD — that's why both scripts must live side by side, which they do here).
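
The walk-and-shell-out pattern described above can be sketched as follows. This is not the code of test_pipeline_safety_trajs.py: the function names are hypothetical, the sketch is sequential (no --num_processes parallelism), and it assumes every trajectory file is named traj_data.json. Only flags documented in section 2.4 are used.

```python
import subprocess
import sys
from pathlib import Path

PIPELINE_SCRIPT = "pipeline_pddl_to_video_thor5.py"


def build_command(traj_json, output_dir, x_display="7", python=sys.executable):
    """Build the argv list for one rendering child process."""
    return [
        python, PIPELINE_SCRIPT,
        "--traj_json", str(traj_json),
        "--output_dir", str(output_dir),
        "--x_display", str(x_display),
    ]


def render_all(data_base, output_base, script_dir, max_trajs=None, runner=subprocess.run):
    """Render each trajectory under data_base; return {trajectory path: exit code}."""
    trajs = sorted(Path(data_base).rglob("traj_data.json"))
    results = {}
    for traj in trajs[:max_trajs]:
        out_dir = Path(output_base) / traj.parent.name
        # cwd=script_dir lets the child find the pipeline script by its bare name,
        # which is why the two scripts need to sit in the same directory.
        completed = runner(build_command(traj, out_dir), cwd=script_dir)
        results[str(traj)] = completed.returncode
    return results
```

The `runner` parameter exists only so the loop can be exercised without launching THOR; in normal use the default `subprocess.run` is what starts each child.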

Arguments — test_pipeline_safety_trajs.py

Flag | Type / default | Description
--data_base | path, default /mnt/external-ssd-2/safety_trajs | Base directory containing the safety trajectories to render.
--output_base | path, default /tmp/pipeline_safety_test | Base directory for the per-trajectory output folders.
--x_display | str, default 7 | X display number forwarded to each rendering child process.
--max_trajs | int, default None | Cap on the number of trajectories to render (useful for smoke tests).
--hazard_type | str, default None | Restrict to one hazard type (e.g. appliance_misuse).
--split | {train, valid_seen, valid_unseen} | Restrict to one ALFRED split.
--seed | int, default 42 | Random seed used to shuffle the trajectory list.
--num_processes | int, default 4 | Parallel rendering worker count.
--python | str, default python | Python executable used to launch each child rendering job (e.g. /path/to/venv/bin/python).
--retry_from_log | path | Path to a previous batch log; only PARTIAL/FAILED entries are retried.
--retry_partial / --no_retry_partial | flag (default on) | Retry trajectories that completed partially.
--retry_failed / --no_retry_failed | flag (default on) | Retry trajectories that failed outright.

3. Running model inference

The SLURM batch scripts in scripts/ launch the three inference drivers in src/inference/. Each model family has a *_QA/ directory (the QA-style "is there a hazard?" evaluation) and a *_embodied/ directory (next-action prediction with few-shot in-context examples).

3.1 Inference drivers

Driver | Purpose | Script
QA evaluation | Asks the model whether the current frame contains a safety hazard, optionally in a few-shot "complex" prompt. | src/inference/qwen_vl_safety_eval_vllm.py
Embodied evaluation | Few-shot in-context next-action prediction over the SafetyALFRED trajectories. | src/inference/qwen_vl_fewshot_icl_eval_vllm_512.py
QA-conditioned embodied | Same as above, but conditioned on previously generated QA answers (--qa-file). | src/inference/qwen_vl_fewshot_icl_eval_vllm_with_qa.py

All three use vLLM for inference and share a common set of flags: --model, --output, --tensor-parallel-size, --max-num-seqs, --max-model-len, --quantization, --load-in-4bit, --no-metadata (vision-only mode, which drops the textual scene description), --super-batch-per-category, and --categories. ALFRED-generated trajectories are selected with --generated-only in the QA driver and with --generated-mode {only,include,exclude} in the two embodied drivers.

Shared arguments (all three drivers)

Flag | Type / default | Description
--model | str, default Qwen/Qwen2.5-VL-32B-Instruct | HF model id or local checkpoint path. Supported families: Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma 3, Llama-4 Scout/Maverick, MiniCPM-V.
--data-file | path, default SafetyALFREDGold.json | Source dataset of trajectories/turns to evaluate.
--output | path | Output JSONL file (one line per turn). Default differs per driver.
--seed | int, default 42 | Random seed (controls few-shot example sampling, etc.).
--no-metadata | flag | Drop textual scene metadata from the prompt (vision-only mode).
--resume | flag | Resume from an existing output file, skipping turns/trajectories already written.
--tensor-parallel-size | int, default 2 | Number of GPUs for tensor parallelism.
--max-model-len | int, default 8192 | Maximum model context length.
--max-num-seqs | int | vLLM batch size. Defaults: 16 (QA), None (embodied; vLLM picks).
--quantization | {None, fp8, bitsandbytes}, default None | Quantization mode (None ⇒ bfloat16).
--load-in-4bit | flag | 4-bit weights (requires --quantization bitsandbytes).
--load-in-8bit | flag | 8-bit weights (requires --quantization bitsandbytes).
--super-batch-per-category | flag | Process every turn in a category in one vLLM call instead of chunks of max-num-seqs * 10. Higher memory, fewer launches.
--categories CAT [CAT …] | choices: appliance_misuse, unsanitary, property_damage, fire_hazard, spoilage, fall_trip_hazard, all | Restrict evaluation to a subset of safety categories. Default: all six.
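
To make the --super-batch-per-category trade-off concrete, the sketch below shows how a category's turns would be split into chunks of max-num-seqs * 10, or kept as one batch when super-batching is on. The function `chunk_turns` is hypothetical and only illustrates the arithmetic; the drivers' own batching code may differ.

```python
def chunk_turns(turns, max_num_seqs=16, super_batch=False):
    """Split one category's turns into the batches sent to the model."""
    turns = list(turns)
    if not turns:
        return []
    if super_batch:
        return [turns]  # one call for the whole category
    size = max_num_seqs * 10
    return [turns[i:i + size] for i in range(0, len(turns), size)]


if __name__ == "__main__":
    # 350 turns with the QA default of 16 sequences: chunks of 160, 160, and 30.
    print([len(chunk) for chunk in chunk_turns(range(350), max_num_seqs=16)])
```

With 350 turns and --max-num-seqs 16 this yields three calls; with the flag set it yields a single call of 350 turns, which is where the higher memory use comes from.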

qwen_vl_safety_eval_vllm.py (QA) — extra arguments

Flag | Type / default | Description
--num-trajectories | int, default None | Cap on number of trajectories processed.
--use-safety-history | flag | Track hazard history across turns within each trajectory; disables cross-trajectory batching.
--complex | flag | Use the complex prompt with few-shot examples and a specialized system prompt.
--num-examples | int, default 1 | Few-shot examples per category in --complex mode.
--no-examples | flag | Run --complex mode in zero-shot (forces --num-examples 0).
--generated-only | flag | Process only ALFRED-generated trajectories (trajectory index ≥ 1001).

qwen_vl_fewshot_icl_eval_vllm_512.py (embodied) — extra arguments

Flag | Type / default | Description
--num-examples | int, default 4 | Number of few-shot in-context examples (the shipped scripts use 1).
--no-examples | flag | Zero-shot mode (no in-context examples).
--log-examples | flag | Append the chosen few-shot examples to examples_log.txt.
--generated-mode | {include, exclude, only}, default include | How to handle ALFRED-generated trajectories: include alongside SafetyALFRED, skip entirely, or run only them.
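
The three --generated-mode values amount to a simple filter over trajectory indices. The sketch below assumes the embodied driver identifies ALFRED-generated trajectories the same way the QA driver's --generated-only does (trajectory index ≥ 1001); that shared convention is an assumption, and `select_trajectories` is a hypothetical name.

```python
GENERATED_START = 1001  # from the --generated-only description; assumed to apply here too


def is_generated(trajectory_index):
    """True for ALFRED-generated trajectories under the index convention above."""
    return trajectory_index >= GENERATED_START


def select_trajectories(indices, mode="include"):
    """Filter trajectory indices according to a --generated-mode style setting."""
    if mode == "include":
        return list(indices)
    if mode == "exclude":
        return [i for i in indices if not is_generated(i)]
    if mode == "only":
        return [i for i in indices if is_generated(i)]
    raise ValueError(f"unknown mode: {mode!r}")
```

For instance, `select_trajectories([5, 1000, 1001, 1200], mode="only")` keeps just the last two indices.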

qwen_vl_fewshot_icl_eval_vllm_with_qa.py (QA-conditioned embodied) — extra arguments

Inherits everything from qwen_vl_fewshot_icl_eval_vllm_512.py, plus:

Flag | Type / default | Description
--qa-file | path, required for QA-conditioning | Previously generated QA JSONL (e.g. gemma3_4b_qa_results_vllm_4bit_complete.jsonl); the safety-judge answers from it are spliced into the embodied prompt.

3.2 Script directories

scripts/
├── gemma3_QA/        gemma3_embodied/
├── qwen2_5_QA/       qwen2_5_embodied/
└── qwen3_QA/         qwen3_embodied/

Each *_QA/ directory contains:

  • QA_<family>.sh — base QA pass on SafetyALFRED trajectories (with and without metadata).
  • QA_<family>_complex.sh — same models with --complex --max-model-len 50000 --super-batch-per-category (few-shot complex prompt).
  • QA_<family>_generated_all.sh — runs Normal, Complex, and Complex Zeroshot over the ALFRED trajectories (--generated-only).

Each *_embodied/ directory contains:

  • Embodied_<family>_512.sh — few-shot ICL embodied evaluation, 1 example per category.
  • Embodied_<family>_qa_conditioned_full.sh — same, but conditioned on a precomputed QA result file via --qa-file.
  • Embodied_<family>_generated.sh — embodied evaluation on ALFRED trajectories (--generated-mode only).
  • Embodied_<family>_qa_conditioned_full_generated_mode.sh — QA-conditioned embodied over the ALFRED trajectories.

3.3 Submitting a SLURM job

The shell scripts are SLURM batch files. From the directory that contains the inference Python scripts:

sbatch scripts/gemma3_QA/QA_gemma3.sh
sbatch scripts/gemma3_embodied/Embodied_gemma3_512.sh

3.4 Running the underlying commands directly

If you are not on a SLURM cluster, copy the python … lines out of any of the shell scripts and run them inside the vllm conda environment. For example, the Gemma3-4B QA pass becomes:

cd src/inference

python qwen_vl_safety_eval_vllm.py \
    --model /path/to/gemma-3-4b-it \
    --output gemma3_4b_qa_results_vllm_4bit_interleaved.jsonl \
    --tensor-parallel-size 2 --max-num-seqs 64 \
    --quantization bitsandbytes --load-in-4bit

A few-shot embodied pass:

python qwen_vl_fewshot_icl_eval_vllm_512.py \
    --model /path/to/gemma-3-4b-it \
    --output gemma3_4b_fewshot_icl_results_4bit_interleaved_vllm_512.jsonl \
    --tensor-parallel-size 2 --max-num-seqs 32 --max-model-len 50000 \
    --quantization bitsandbytes --load-in-4bit \
    --super-batch-per-category --num-examples 1 --log-examples

A QA-conditioned embodied pass — pass the QA output file from the previous QA run:

python qwen_vl_fewshot_icl_eval_vllm_with_qa.py \
    --model /path/to/gemma-3-4b-it \
    --output gemma3_4b_fewshot_icl_results_4bit_qa_conditioned.jsonl \
    --tensor-parallel-size 2 --max-num-seqs 32 --max-model-len 50000 \
    --quantization bitsandbytes --load-in-4bit \
    --super-batch-per-category --num-examples 1 --log-examples \
    --qa-file gemma3_4b_qa_results_vllm_4bit_complete.jsonl

Each invocation writes its results to a single JSONL file (--output …), one line per turn. The evaluation scripts in §4 expect those JSONL files as input.
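
Because the result files are JSONL, they can be inspected with a few lines of Python before running the evaluation scripts. The sketch below is generic: it makes no assumption about which fields the drivers write, and the `key` argument of `completed_keys` is a placeholder for whatever identifier the records carry. Both helper names are hypothetical.

```python
import json


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


def completed_keys(path, key):
    """Collect the values of `key` already present in an output file.

    Useful for a resume-style check; returns an empty set if the file
    does not exist yet.
    """
    try:
        return {record[key] for record in read_jsonl(path) if key in record}
    except FileNotFoundError:
        return set()
```

Calling `len(read_jsonl("results.jsonl"))` gives the number of turns written so far, which is a quick sanity check after an interrupted run.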

3.5 Models referenced in the scripts

Family | Sizes | Source
Gemma 3 | 4B, 12B, 27B (instruction-tuned) | local mirror under models/gemma-3-*-it-local
Qwen2.5-VL | 7B, 32B, 72B (instruction-tuned) | local mirror under models/qwen-2_5_vl-*-instruct-local
Qwen3-VL | 4B, 8B, 32B (instruction-tuned) | Qwen/Qwen3-VL-{4B,8B,32B}-Instruct (Hugging Face)

Replace the --model paths in the shell scripts with whatever local copies you have.


4. Running the evaluations

Once inference has produced QA (*_qa_*.jsonl) and embodied (*_fewshot_*.jsonl) result files, the scripts in src/evaluation/ turn them into the metrics reported in the paper.

4.1 Per-pair safety/alignment analysis

SafetyALFREDAnalysis_script_batched_fully_optimized.py is the core analyzer. It loads a single (embodied, QA) pair, runs BART-large-MNLI in batched mode to score whether each model-generated hazard description entails the ground-truth hazard, and reports per-category results.

python src/evaluation/SafetyALFREDAnalysis_script_batched_fully_optimized.py \
    --embodied path/to/<model>_fewshot_icl_results.jsonl \
    --qa       path/to/<model>_qa_results.jsonl \
    --nli-batch-size 32

Outputs (printed to stdout, also consumed by the orchestrator below):

  • Per-safety-category accuracy on the embodied next-action prediction (one of: appliance misuse, property damage, spoilage, unsanitary, fall/trip hazard, fire hazard).
  • Per-category QA hazard-detection accuracy, with NLI-based credit for correctly identifying the hazard type.
  • QA ↔ embodied alignment: how often the model's QA answer agrees with what its embodied policy actually does.
  • ROC-AUC for hazard detection at the QA threshold sweep (uses df_qa_threshold_configuration.pkl).
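
As an illustration of the QA ↔ embodied alignment idea, the sketch below computes an agreement rate over paired turns. It is one plausible reading of "agrees", assuming each turn is reduced to two booleans (the QA answer flagged a hazard; the embodied action addressed it). The analyzer's actual definition and record format may differ, and `alignment_rate` is a hypothetical name.

```python
def alignment_rate(pairs):
    """Fraction of turns where the QA answer and the embodied behaviour agree.

    pairs: iterable of (qa_flagged_hazard, embodied_addressed_hazard) booleans.
    Returns None when there are no turns to score.
    """
    pairs = list(pairs)
    if not pairs:
        return None
    agreements = sum(1 for qa, embodied in pairs if qa == embodied)
    return agreements / len(pairs)


if __name__ == "__main__":
    turns = [(True, True), (True, False), (False, False), (False, True)]
    print(alignment_rate(turns))  # two of four turns agree
```

A model that reports a hazard in QA but continues the task anyway contributes a disagreement, which is the gap this metric is meant to surface.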

Arguments:

Flag | Type / default | Description
--embodied | path, required | Embodied results JSONL produced by an embodied driver.
--qa | path, required | QA results JSONL produced by qwen_vl_safety_eval_vllm.py.
--nli-batch-size | int, default 32 | Batch size for the BART-large-MNLI entailment passes.

4.2 Orchestrator across all model pairs

run_all_evaluations_clean_with_generated.py reads a pairs.txt file with embodied,qa_simple,qa_complex columns, loads the NLI model once, runs the analyzer above for every pair, and aggregates everything into CSVs:

python src/evaluation/run_all_evaluations_clean_with_generated.py \
    --use-batched \
    --nli-batch-size 32

Arguments:

Flag | Type / default | Description
--nli-batch-size | int, default 32 | Batch size forwarded to the per-pair analyzer's NLI passes.
--use-batched | flag | Use the batched NLI pipeline (groups by category for higher throughput).
--include-generated | flag | Add the ALFRED-generated trajectories as a 7th category alongside the 6 SafetyALFRED categories.
--generated-only | flag | Evaluate only generated trajectories. Mutually exclusive with --include-generated.
--all-turns-accuracy | flag | Compute embodied accuracy on every turn (known + unknown) per category, not just the safety-critical ones.
--accuracy-only | flag | Skip NLI / alignment computation entirely and produce only the accuracy table (much faster, no GPU needed).
--gemini-only | flag | Restrict the run to Gemini model files in pairs.txt.

The orchestrator writes:

  • safety_evaluation_results_clean.csv — full per-pair table.
  • alignment_heatmaps/qa_embodied_alignment_heatmap_{simple,complex}.csv — alignment rates for the QA-vs-embodied heatmap figure.
  • alignment_breakdowns/qa_embodied_alignment_breakdown_{simple,complex}.csv — full statistics broken out per category.
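
The pairs.txt file that drives the orchestrator has three columns: embodied, qa_simple, qa_complex. A sketch of reading it is shown below, assuming comma-separated values with one pair per line; whether the real file carries a header row is not stated, so the sketch tolerates one. `Pair` and `read_pairs` are hypothetical names, not the orchestrator's own code.

```python
import csv
from typing import NamedTuple


class Pair(NamedTuple):
    embodied: str
    qa_simple: str
    qa_complex: str


def read_pairs(path):
    """Parse a pairs.txt-style file into a list of Pair tuples."""
    pairs = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            row = [cell.strip() for cell in row]
            if not any(row):
                continue  # blank line
            if row == list(Pair._fields):
                continue  # optional header row
            if len(row) != 3:
                raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
            pairs.append(Pair(*row))
    return pairs
```

Validating the file this way before a long run catches malformed rows early, since the NLI model load is the expensive step.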

4.3 False-positive rates on non-hazardous turns

calculate_false_positive_rates.py measures how often each model over-detects hazards: i.e., answers "Yes, there is a hazard" on QA turns whose ground-truth subgoal does not include Remove Hazard. It splits results between ALFRED-generated and SafetyALFRED trajectories, and between vision-only (V) and description-aided (D) prompting.

python src/evaluation/calculate_false_positive_rates.py

Arguments: none — the script reads the QA-result paths from the hard-coded pairs.txt location (see §5).

Outputs to evaluation_results/non_hazardous_turns/:

  • false_positive_rates.csv — per-(model, metadata) totals, false positives, and rates.
  • false_positive_summary.csv — wide table with V/D side by side and an averaged column, plus a printed summary table.
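
The rate reported in this section is false positives divided by the number of non-hazardous turns. A minimal sketch of that computation follows, assuming each QA turn has been reduced to two booleans; the function name and input shape are hypothetical and do not reflect the script's internal data structures.

```python
def false_positive_rate(turns):
    """Share of non-hazardous turns on which the model reported a hazard.

    turns: iterable of (is_hazard_turn, model_said_hazard) booleans, where
    is_hazard_turn means the ground-truth subgoal includes Remove Hazard.
    Returns None if there are no non-hazardous turns.
    """
    answers_on_safe_turns = [said for is_hazard, said in turns if not is_hazard]
    if not answers_on_safe_turns:
        return None
    return sum(answers_on_safe_turns) / len(answers_on_safe_turns)


if __name__ == "__main__":
    sample = [(False, True), (False, False), (True, True), (False, False)]
    print(false_positive_rate(sample))  # 1 false positive over 3 safe turns
```

Hazardous turns are ignored entirely here; a correct "yes" on those turns is a true positive and does not affect this metric.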

4.4 Failure-mode breakdown by safety category

analyze_incorrect_actions_by_category.py reproduces the "Comprehensive Analysis of Incorrect Actions by Category" table from the paper. For each safety category it counts the most common incorrect next actions predicted by the embodied models (e.g. "GoTo" instead of "Remove Hazard" in fall/trip turns).

python src/evaluation/analyze_incorrect_actions_by_category.py

Arguments: none — the script discovers fewshot embodied result files from the hard-coded pairs.txt location (see §5).

Outputs:

  • A printed summary with the top-10 incorrect actions per category and their share of failures.
  • A LaTeX tabular for the paper.
  • incorrect_actions_analysis.json — the underlying counts and 10 example failures per category.
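
The per-category tally behind the first output can be sketched with collections.Counter. The input shape below, a sequence of (category, predicted action, expected action) triples, is an assumption for illustration, and `top_incorrect_actions` is a hypothetical name rather than the script's own function.

```python
from collections import Counter, defaultdict


def top_incorrect_actions(records, k=10):
    """Return {category: [(action, count, share_of_failures), ...]} for the top k."""
    counts = defaultdict(Counter)
    for category, predicted, expected in records:
        if predicted != expected:
            counts[category][predicted] += 1

    summary = {}
    for category, counter in counts.items():
        total_failures = sum(counter.values())
        summary[category] = [
            (action, n, n / total_failures) for action, n in counter.most_common(k)
        ]
    return summary
```

Shares are computed over failures only, so they sum to 1 within a category when k is large enough to include every incorrect action.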

The script targets fewshot (non-QA-conditioned) embodied results; expected per-category dominant failures (Fall/Trip → 74.18% GoTo, Appliance Misuse → 73.03% CloseObject/ToggleObjectOn Microwave, Property Damage → 47.80% ToggleObjectOn Faucet, Fire Hazard → 44.19% PickupObject wrong object, Spoilage → 69.35% PutObject in goal receptacle) are documented in the script header.

4.5 Non-safety action accuracy

analyze_non_safety_actions.py measures accuracy on the non-safety turns — every turn that the per-category check_embodied functions ignore — and splits the result between ALFRED-generated trajectories ("generated") and SafetyALFRED trajectories ("accepted"), with and without metadata.

# default: exclude GoTo navigation actions
python src/evaluation/analyze_non_safety_actions.py

# include GoTo navigation actions
python src/evaluation/analyze_non_safety_actions.py --include-goto

Arguments:

Flag | Type / default | Description
--include-goto | flag | Include GoTo navigation actions in the non-safety accuracy. They are excluded by default because the trajectory rendering can produce many GoTos that dominate the metric.

Output is a per-model accuracy table (printed and saved as CSV) showing whether models that handle the safety turns well also keep up on routine task progress.

4.6 Joint success/safety classification

analyze_success_safety_trajectories.py classifies each trajectory into one of four buckets — Successful & Safe, Successful & Unsafe, Unsuccessful & Safe, Unsuccessful & Unsafe — where:

  • Safe = every safety turn (those checked by the per-category check_embodied functions) is correct.
  • Successful = every non-safety action is correct (with a separate variant that excludes GoTo actions).

python src/evaluation/analyze_success_safety_trajectories.py

Arguments: none — the script enumerates the four configurations internally and reads its file list from the hard-coded pairs.txt location (see §5).

The script runs four versions in one go: {with GoTo, without GoTo} × {strict non-safety = excludes Remove Hazard subgoal actions, loose non-safety = all actions not checked by check_embodied}, and writes the per-version summary CSVs and LaTeX tables under evaluation_results/non_hazardous_turns/.
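
The four-bucket classification follows directly from the Safe and Successful definitions. The sketch below covers the with/without GoTo variants only (not the strict/loose split); the input shapes and the name `classify_trajectory` are assumptions for illustration.

```python
def classify_trajectory(safety_turns, other_turns, include_goto=True):
    """Place one trajectory into one of the four success/safety buckets.

    safety_turns: list of booleans, one per safety turn (True = correct).
    other_turns:  list of (action_name, correct) for non-safety turns.
    """
    safe = all(safety_turns)
    considered = [
        correct for action, correct in other_turns
        if include_goto or action != "GoTo"
    ]
    successful = all(considered)
    return (
        f"{'Successful' if successful else 'Unsuccessful'} & "
        f"{'Safe' if safe else 'Unsafe'}"
    )
```

Note that `all([])` is True in Python, so under this sketch a trajectory with no safety turns counts as Safe and one with no scored non-safety turns counts as Successful.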


5. Notes on paths

The evaluation scripts contain hard-coded paths under /nfs/turbo/coe-chaijy-unreplicated/josuetf/… (the cluster they were originally run on) for pairs.txt, model checkpoints, and output directories. Edit these to match your environment before running. The inference scripts read paths from CLI flags, so they are portable as-is.
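
To locate the lines that need editing, a small scan for the cluster prefix can help. This is a sketch, not a repository tool: `find_hardcoded_paths` is a hypothetical helper, and it only searches for the leading /nfs/turbo/ part of the path mentioned above.

```python
from pathlib import Path

CLUSTER_PREFIX = "/nfs/turbo/"


def find_hardcoded_paths(root, prefix=CLUSTER_PREFIX):
    """Return (file, line number, line text) for every line containing prefix."""
    hits = []
    for script in sorted(Path(root).rglob("*.py")):
        text = script.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if prefix in line:
                hits.append((str(script), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in find_hardcoded_paths("src/evaluation"):
        print(f"{path}:{lineno}: {line}")
```

Run it from the repository root; each printed line is a candidate path to replace with a location on your own machine.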

About

[Findings of ACL 2026] SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
