SafetyALFRED

Code for the SafetyALFRED paper, published in the 2026 Findings of the Association for Computational Linguistics.

The repository contains:

src/pipeline_bundle/ — self-contained PDDL → plan → THOR rendering pipeline used to generate SafetyALFRED trajectories.
src/inference/ — vLLM-based inference scripts that run vision-language models on the QA and embodied-action evaluations.
scripts/ — SLURM batch wrappers that launch the inference scripts for each model family / mode.
src/evaluation/ — analysis scripts that consume the inference outputs and produce safety/accuracy metrics, alignment heatmaps, and failure-mode tables.
pddl_trajs/, safety_trajs/, dataset/ — original ALFRED PDDL trajectories, safety trajectories generated by the SafetyALFRED authors, and the released dataset assets.

1. Setup

1.1 Cloning the repository

git clone https://github.com/sled-group/SafetyALFRED.git
cd SafetyALFRED

1.2 Conda environment

We recommend conda / Miniconda:

conda create --name SafetyALFRED python==3.7.16
conda activate SafetyALFRED
pip install -r requirements.txt

The trajectory pipeline targets Python 3.7 and ai2thor==5.0.0. The inference scripts in src/inference/ are run inside a separate vLLM environment (Python 3.11 + vLLM); the SLURM wrappers in scripts/ activate it via conda activate vllm.

2. Generating and rendering SafetyALFRED trajectories

The full rendering pipeline lives at src/pipeline_bundle/. All absolute paths inside that bundle have been rewritten to resolve relative to its own directory, so the bundle can be moved or extracted anywhere.

2.1 Bundle layout

src/pipeline_bundle/
├── alfred/gen/                       # scripts you call directly
│   ├── pipeline_pddl_to_video_thor5.py
│   ├── test_pipeline_safety_trajs.py
│   └── convert_plan_to_traj.py
├── E.T./alfred/                      # framework imported as `alfred.*`
│   ├── env/                          # ThorEnv (Thor 5.0 wrapper)
│   └── gen/                          # constants, utils, graph, game_states,
│                                     # agents, planner, layouts, ff_planner,
│                                     # generate_problem_pddl_full_thor5.py,
│                                     # safety_initialization.py,
│                                     # render_plan_with_navigation.py
└── alfred_git/alfred/data/DANLI/pddl/
    ├── domain.pddl                   # PDDL domain
    ├── planner.py                    # planner wrapper
    └── fast-downward-24.06.1/        # Fast Downward (you compile this — see step 1)

2.2 Prerequisites on the rendering machine

The bundle ships code only — not the Python environment or the planner binary.

Fast Downward planner. Download the Fast Downward 24.06.1 release from the official GitHub repo, unpack it into src/pipeline_bundle/alfred_git/alfred/data/DANLI/pddl/, rename the resulting folder to fast-downward-24.06.1, and follow the planner's BUILD.md instructions to compile it.
Python venv with ai2thor==5.0.0, numpy, Pillow, termcolor, requests, scikit-video (used by alfred.gen.utils.video_util), and opencv-python. The requirements.txt at the repo root covers these.
ffmpeg on $PATH — used by video_util.VideoSaver to stitch frames into mp4.
An X server / Xvfb for THOR rendering. THOR will fail without one. The pipeline scripts default to --x_display 7; pass whatever display your X server uses.
Trajectory JSONs. No script ships data; each one expects you to point it at trajectories you already have. The originals are in pddl_trajs/; the SafetyALFRED-authored variants are in safety_trajs/.

2.3 Environment variables

export ET_ROOT="src/pipeline_bundle/E.T."
export ET_LOGS=$ET_ROOT/logs
export ET_DATA=$ET_ROOT/data
export PYTHONPATH=$PYTHONPATH:$ET_ROOT

2.4 Genearating and Rendering a single trajectory

cd src/pipeline_bundle/alfred/gen

python pipeline_pddl_to_video_thor5.py \
    --traj_json /path/to/traj_data.json \
    --output_dir /tmp/test \
    --use_teleport \
    --no_time_delays \
    --no_smooth_nav \
    --clear_microwave_objects \
    --clear_sink_objects

--traj_json accepts either an original ALFRED trajectory from pddl_trajs/ or one of the safety trajectories generated by the SafetyALFRED authors in safety_trajs/.

Note: Not every trajectory in safety_trajs/ will render successfully. The full safety_trajs/ set was rendered in batch with test_pipeline_safety_trajs.py and the authors then watched the resulting videos and manually filtered them. The exact JSONs that produced the released SafetyALFRED dataset are inside the converted_trajectory/ subdirectories of the SafetyALFRED Trajectories release on Hugging Face (e.g. SafetyALFREDTrajectories/appliance_misuse/pick_cool_then_place_in_recep/trial_T20190906_192817_654400/converted_trajectory/traj_data.json).

Outputs land in --output_dir:

problem.pddl — generated PDDL problem
plan.txt, sas_plan — planner output
plan_execution/plan_execution.mp4 — initial render of the plan
converted_trajectory/traj_data.json — ALFRED-format trajectory
final_render/video.mp4 — final smooth-nav render
execution_log.json, debug.txt — audit trail

Arguments — `pipeline_pddl_to_video_thor5.py`

Flag	Type / default	Description
`--traj_json`	path, required	Path to an ALFRED `traj_data.json` file.
`--output_dir`	path, required	Directory where every output artifact (PDDL, plan, mp4s, logs) is written.
`--domain`	path, default = bundled `domain.pddl`	PDDL domain file.
`--x_display`	str, default `7`	X server display number used by THOR.
`--no_render_final`	flag	Skip the final smooth-nav render (much faster).
`--no_smooth_nav`	flag	Disable smooth navigation in the final render.
`--no_time_delays`	flag	Disable inter-step time delays in the final render.
`--no-dynamic-reachable`	flag	Use precomputed static reachable layouts instead of `GetReachablePositions` (may include blocked positions).
`--use_teleport`	flag	Use `TeleportFull` for navigation; agent can face objects at exact angles instead of 90° increments.
`--add_sink_item`	flag	For property-damage scenarios with a sink, add an extra sink-appropriate item during init.
`--alternative_cabinet N`	int	For fall/trip-hazard scenarios, use the N-th alternative cabinet (0-indexed) from cabinets with `y < 1.00`.
`--alternative_object_location N`	int	For appliance-misuse / property-damage scenarios, use the N-th alternative location (0-indexed, ≥ 1m away) for the target object.
`--clear_sink_objects`	flag	Remove all objects from the sink except `safety_object` and `target_object`.
`--clear_microwave_objects`	flag	Remove all objects from microwaves except `target_object` and `safety_object`.

2.5 Batch generating and rendering

cd src/pipeline_bundle/alfred/gen
python test_pipeline_safety_trajs.py --help

The batch driver walks a directory of safety trajectories and shells out to pipeline_pddl_to_video_thor5.py once per trajectory (via subprocess.run, using its own working directory as the CWD — that's why both scripts must live side by side, which they do here).

Arguments — `test_pipeline_safety_trajs.py`

Flag	Type / default	Description
`--data_base`	path, default `/mnt/external-ssd-2/safety_trajs`	Base directory containing the safety trajectories to render.
`--output_base`	path, default `/tmp/pipeline_safety_test`	Base directory for the per-trajectory output folders.
`--x_display`	str, default `7`	X display number forwarded to each rendering child process.
`--max_trajs`	int, default `None`	Cap on the number of trajectories to render (useful for smoke tests).
`--hazard_type`	str, default `None`	Restrict to one hazard type (e.g. `appliance_misuse`).
`--split`	`{train, valid_seen, valid_unseen}`	Restrict to one ALFRED split.
`--seed`	int, default `42`	Random seed used to shuffle the trajectory list.
`--num_processes`	int, default `4`	Parallel rendering worker count.
`--python`	str, default `python`	Python executable used to launch each child rendering job (e.g. `/path/to/venv/bin/python`).
`--retry_from_log`	path	Path to a previous batch log; only `PARTIAL`/`FAILED` entries are retried.
`--retry_partial` / `--no_retry_partial`	flag (default on)	Retry trajectories that completed partially.
`--retry_failed` / `--no_retry_failed`	flag (default on)	Retry trajectories that failed outright.

3. Running model inference

The SLURM batch scripts in scripts/ launch the three inference drivers in src/inference/. Each model family has a *_QA/ directory (the QA-style "is there a hazard?" evaluation) and a *_embodied/ directory (next-action prediction with few-shot in-context examples).

3.1 Inference drivers

Driver	Purpose	Script
QA evaluation	Asks the model whether the current frame contains a safety hazard, optionally in a few-shot "complex" prompt.	`src/inference/qwen_vl_safety_eval_vllm.py`
Embodied evaluation	Few-shot in-context next-action prediction over the SafetyALFRED trajectories.	`src/inference/qwen_vl_fewshot_icl_eval_vllm_512.py`
QA-conditioned embodied	Same as above, but conditioned on previously generated QA answers (`--qa-file`).	`src/inference/qwen_vl_fewshot_icl_eval_vllm_with_qa.py`

All three use vLLM for inference and accept the same shared flags: --model, --output, --tensor-parallel-size, --max-num-seqs, --max-model-len, --quantization bitsandbytes, --load-in-4bit, --no-metadata (vision-only mode — drop the textual scene description), --super-batch-per-category, --categories, and --generated-only / --generated-mode {only,include,exclude} for ALFRED-generated trajectories.

Shared arguments (all three drivers)

Flag	Type / default	Description
`--model`	str, default `Qwen/Qwen2.5-VL-32B-Instruct`	HF model id or local checkpoint path. Supported families: Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma 3, Llama-4 Scout/Maverick, MiniCPM-V.
`--data-file`	path, default `SafetyALFREDGold.json`	Source dataset of trajectories/turns to evaluate.
`--output`	path	Output JSONL file (one line per turn). Default differs per driver.
`--seed`	int, default `42`	Random seed (controls few-shot example sampling, etc.).
`--no-metadata`	flag	Drop textual scene metadata from the prompt (vision-only mode).
`--resume`	flag	Resume from an existing output file, skipping turns/trajectories already written.
`--tensor-parallel-size`	int, default `2`	Number of GPUs for tensor parallelism.
`--max-model-len`	int, default `8192`	Maximum model context length.
`--max-num-seqs`	int	vLLM batch size. Defaults: `16` (QA), `None` (embodied — vLLM picks).
`--quantization`	`{None, fp8, bitsandbytes}`, default `None`	Quantization mode (`None` ⇒ bfloat16).
`--load-in-4bit`	flag	4-bit weights (requires `--quantization bitsandbytes`).
`--load-in-8bit`	flag	8-bit weights (requires `--quantization bitsandbytes`).
`--super-batch-per-category`	flag	Process every turn in a category in one vLLM call instead of chunks of `max-num-seqs * 10`. Higher memory, fewer launches.
`--categories CAT [CAT …]`	choices: `appliance_misuse unsanitary property_damage fire_hazard spoilage fall_trip_hazard all`	Restrict evaluation to a subset of safety categories. Default: all six.

`qwen_vl_safety_eval_vllm.py` (QA) — extra arguments

Flag	Type / default	Description
`--num-trajectories`	int, default `None`	Cap on number of trajectories processed.
`--use-safety-history`	flag	Track hazard history across turns within each trajectory; disables cross-trajectory batching.
`--complex`	flag	Use the complex prompt with few-shot examples and a specialized system prompt.
`--num-examples`	int, default `1`	Few-shot examples per category in `--complex` mode.
`--no-examples`	flag	Run `--complex` mode in zero-shot (forces `--num-examples 0`).
`--generated-only`	flag	Process only ALFRED-generated trajectories (trajectory index ≥ 1001).

`qwen_vl_fewshot_icl_eval_vllm_512.py` (embodied) — extra arguments

Flag	Type / default	Description
`--num-examples`	int, default `4`	Number of few-shot in-context examples (the shipped scripts use `1`).
`--no-examples`	flag	Zero-shot mode (no in-context examples).
`--log-examples`	flag	Append the chosen few-shot examples to `examples_log.txt`.
`--generated-mode`	`{include, exclude, only}`, default `include`	How to handle ALFRED-generated trajectories: include alongside SafetyALFRED, skip entirely, or run only them.

`qwen_vl_fewshot_icl_eval_vllm_with_qa.py` (QA-conditioned embodied) — extra arguments

Inherits everything from qwen_vl_fewshot_icl_eval_vllm_512.py, plus:

Flag	Type / default	Description
`--qa-file`	path, required for QA-conditioning	Previously generated QA JSONL (e.g. `gemma3_4b_qa_results_vllm_4bit_complete.jsonl`); the safety-judge answers from it are spliced into the embodied prompt.

3.2 Script directories

scripts/
├── gemma3_QA/        gemma3_embodied/
├── qwen2_5_QA/       qwen2_5_embodied/
└── qwen3_QA/         qwen3_embodied/

Each *_QA/ directory contains:

QA_<family>.sh — base QA pass on SafetyALFRED trajectories (with and without metadata).
QA_<family>_complex.sh — same models with --complex --max-model-len 50000 --super-batch-per-category (few-shot complex prompt).
QA_<family>_generated_all.sh — runs Normal, Complex, and Complex Zeroshot over the ALFRED trajectories (--generated-only).

Each *_embodied/ directory contains:

Embodied_<family>_512.sh — few-shot ICL embodied evaluation, 1 example per category.
Embodied_<family>_qa_conditioned_full.sh — same, but conditioned on a precomputed QA result file via --qa-file.
Embodied_<family>_generated.sh — embodied evaluation on ALFRED trajectories (--generated-mode only).
Embodied_<family>_qa_conditioned_full_generated_mode.sh — QA-conditioned embodied over the ALFRED trajectories.

3.3 Submitting a SLURM job

The shell scripts are SLURM batch files. From the directory that contains the inference Python scripts:

sbatch scripts/gemma3_QA/QA_gemma3.sh
sbatch scripts/gemma3_embodied/Embodied_gemma3_512.sh

3.4 Running the underlying commands directly

If you are not on a SLURM cluster, copy the python … lines out of any of the shell scripts and run them inside the vllm conda environment. For example, the Gemma3-4B QA pass becomes:

cd src/inference

python qwen_vl_safety_eval_vllm.py \
    --model /path/to/gemma-3-4b-it \
    --output gemma3_4b_qa_results_vllm_4bit_interleaved.jsonl \
    --tensor-parallel-size 2 --max-num-seqs 64 \
    --quantization bitsandbytes --load-in-4bit

A few-shot embodied pass:

python qwen_vl_fewshot_icl_eval_vllm_512.py \
    --model /path/to/gemma-3-4b-it \
    --output gemma3_4b_fewshot_icl_results_4bit_interleaved_vllm_512.jsonl \
    --tensor-parallel-size 2 --max-num-seqs 32 --max-model-len 50000 \
    --quantization bitsandbytes --load-in-4bit \
    --super-batch-per-category --num-examples 1 --log-examples

A QA-conditioned embodied pass — pass the QA output file from the previous QA run:

python qwen_vl_fewshot_icl_eval_vllm_with_qa.py \
    --model /path/to/gemma-3-4b-it \
    --output gemma3_4b_fewshot_icl_results_4bit_qa_conditioned.jsonl \
    --tensor-parallel-size 2 --max-num-seqs 32 --max-model-len 50000 \
    --quantization bitsandbytes --load-in-4bit \
    --super-batch-per-category --num-examples 1 --log-examples \
    --qa-file gemma3_4b_qa_results_vllm_4bit_complete.jsonl

Each invocation appends a single JSONL file (--output …). The evaluation scripts in §4 expect those JSONL files as input.

3.5 Models referenced in the scripts

Family	Sizes	Source
Gemma 3	4B, 12B, 27B (instruction-tuned)	local mirror under `models/gemma-3-*-it-local`
Qwen2.5-VL	7B, 32B, 72B (instruction-tuned)	local mirror under `models/qwen-2_5_vl-*-instruct-local`
Qwen3-VL	4B, 8B, 32B (instruction-tuned)	`Qwen/Qwen3-VL-{4B,8B,32B}-Instruct` (Hugging Face)

Replace the --model paths in the shell scripts with whatever local copies you have.

4. Running the evaluations

Once inference has produced QA (*_qa_*.jsonl) and embodied (*_fewshot_*.jsonl) result files, the scripts in src/evaluation/ turn them into the metrics reported in the paper.

4.1 Per-pair safety/alignment analysis

SafetyALFREDAnalysis_script_batched_fully_optimized.py is the core analyzer. It loads a single (embodied, QA) pair, runs BART-large-MNLI in batched mode to score whether each model-generated hazard description entails the ground-truth hazard, and reports per-category results.

python src/evaluation/SafetyALFREDAnalysis_script_batched_fully_optimized.py \
    --embodied path/to/<model>_fewshot_icl_results.jsonl \
    --qa       path/to/<model>_qa_results.jsonl \
    --nli-batch-size 32

Outputs (printed to stdout, also consumed by the orchestrator below):

Per-safety-category accuracy on the embodied next-action prediction (one of: appliance misuse, property damage, spoilage, unsanitary, fall/trip hazard, fire hazard).
Per-category QA hazard-detection accuracy, with NLI-based credit for correctly identifying the hazard type.
QA ↔ embodied alignment: how often the model's QA answer agrees with what its embodied policy actually does.
ROC-AUC for hazard detection at the QA threshold sweep (uses df_qa_threshold_configuration.pkl).

Arguments:

Flag	Type / default	Description
`--embodied`	path, required	Embodied results JSONL produced by an embodied driver.
`--qa`	path, required	QA results JSONL produced by `qwen_vl_safety_eval_vllm.py`.
`--nli-batch-size`	int, default `32`	Batch size for the BART-large-MNLI entailment passes.

4.2 Orchestrator across all model pairs

run_all_evaluations_clean_with_generated.py reads a pairs.txt file with embodied,qa_simple,qa_complex columns, loads the NLI model once, runs the analyzer above for every pair, and aggregates everything into CSVs:

python src/evaluation/run_all_evaluations_clean_with_generated.py \
    --use-batched \
    --nli-batch-size 32

Arguments:

Flag	Type / default	Description
`--nli-batch-size`	int, default `32`	Batch size forwarded to the per-pair analyzer's NLI passes.
`--use-batched`	flag	Use the batched NLI pipeline (groups by category for higher throughput).
`--include-generated`	flag	Add the ALFRED-generated trajectories as a 7th category alongside the 6 SafetyALFRED categories.
`--generated-only`	flag	Evaluate only generated trajectories. Mutually exclusive with `--include-generated`.
`--all-turns-accuracy`	flag	Compute embodied accuracy on every turn (known + unknown) per category, not just the safety-critical ones.
`--accuracy-only`	flag	Skip NLI / alignment computation entirely — produce only the accuracy table (much faster, no GPU needed).
`--gemini-only`	flag	Restrict the run to Gemini model files in `pairs.txt`.

The orchestrator writes:

safety_evaluation_results_clean.csv — full per-pair table.
alignment_heatmaps/qa_embodied_alignment_heatmap_{simple,complex}.csv — alignment rates for the QA-vs-embodied heatmap figure.
alignment_breakdowns/qa_embodied_alignment_breakdown_{simple,complex}.csv — full statistics broken out per category.

4.3 False-positive rates on non-hazardous turns

calculate_false_positive_rates.py measures how often each model over-detects hazards: i.e., answers "Yes, there is a hazard" on QA turns whose ground-truth subgoal does not include Remove Hazard. It splits results between ALFRED-generated and SafetyALFRED trajectories, and between vision-only (V) and description-aided (D) prompting.

python src/evaluation/calculate_false_positive_rates.py

Arguments: none — the script reads the QA-result paths from the hard-coded pairs.txt location (see §5).

Outputs to evaluation_results/non_hazardous_turns/:

false_positive_rates.csv — per-(model, metadata) totals, false positives, and rates.
false_positive_summary.csv — wide table with V/D side by side and an averaged column, plus a printed summary table.

4.4 Failure-mode breakdown by safety category

analyze_incorrect_actions_by_category.py reproduces the "Comprehensive Analysis of Incorrect Actions by Category" table from the paper. For each safety category it counts the most common incorrect next actions predicted by the embodied models (e.g. "GoTo" instead of "Remove Hazard" in fall/trip turns).

python src/evaluation/analyze_incorrect_actions_by_category.py

Arguments: none — the script discovers fewshot embodied result files from the hard-coded pairs.txt location (see §5).

Outputs:

A printed summary with the top-10 incorrect actions per category and their share of failures.
A LaTeX tabular for the paper.
incorrect_actions_analysis.json — the underlying counts and 10 example failures per category.

The script targets fewshot (non-QA-conditioned) embodied results; expected per-category dominant failures (Fall/Trip → 74.18% GoTo, Appliance Misuse → 73.03% CloseObject/ToggleObjectOn Microwave, Property Damage → 47.80% ToggleObjectOn Faucet, Fire Hazard → 44.19% PickupObject wrong object, Spoilage → 69.35% PutObject in goal receptacle) are documented in the script header.

4.5 Non-safety action accuracy

analyze_non_safety_actions.py measures accuracy on the non-safety turns — every turn that the per-category check_embodied functions ignore — and splits the result between ALFRED-generated trajectories ("generated") and SafetyALFRED trajectories ("accepted"), with and without metadata.

# default: exclude GoTo navigation actions
python src/evaluation/analyze_non_safety_actions.py

# include goto navigation actions
python src/evaluation/analyze_non_safety_actions.py --include-goto

Arguments:

Flag	Type / default	Description
`--include-goto`	flag	Include `GoTo` navigation actions in the non-safety accuracy. They are excluded by default because the trajectory rendering can produce many `GoTo`s that dominate the metric.

Output is a per-model accuracy table (printed and saved as CSV) showing whether models that handle the safety turns well also keep up on routine task progress.

4.6 Joint success/safety classification

analyze_success_safety_trajectories.py classifies each trajectory into one of four buckets — Successful & Safe, Successful & Unsafe, Unsuccessful & Safe, Unsuccessful & Unsafe — where:

Safe = every safety turn (those checked by the per-category check_embodied functions) is correct.
Successful = every non-safety action is correct (with a separate variant that excludes GoTo actions).

python src/evaluation/analyze_success_safety_trajectories.py

Arguments: none — the script enumerates the four configurations internally and reads its file list from the hard-coded pairs.txt location (see §5).

The script runs four versions in one go: {with GoTo, without GoTo} × {strict non-safety = excludes Remove Hazard subgoal actions, loose non-safety = all actions not checked by check_embodied}, and writes the per-version summary CSVs and LaTeX tables under evaluation_results/non_hazardous_turns/.

5. Notes on paths

The evaluation scripts contain hard-coded paths under /nfs/turbo/coe-chaijy-unreplicated/josuetf/… (the cluster they were originally run on) for pairs.txt, model checkpoints, and output directories. Edit these to match your environment before running. The inference scripts read paths from CLI flags, so they are portable as-is.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
dataset		dataset
pddl_trajs		pddl_trajs
safety_trajs		safety_trajs
scripts		scripts
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

SafetyALFRED

1. Setup

1.1 Cloning the repository

1.2 Conda environment

2. Generating and rendering SafetyALFRED trajectories

2.1 Bundle layout

2.2 Prerequisites on the rendering machine

2.3 Environment variables

2.4 Genearating and Rendering a single trajectory

Arguments — pipeline_pddl_to_video_thor5.py

2.5 Batch generating and rendering

Arguments — test_pipeline_safety_trajs.py

3. Running model inference

3.1 Inference drivers

Shared arguments (all three drivers)

qwen_vl_safety_eval_vllm.py (QA) — extra arguments

qwen_vl_fewshot_icl_eval_vllm_512.py (embodied) — extra arguments

qwen_vl_fewshot_icl_eval_vllm_with_qa.py (QA-conditioned embodied) — extra arguments

3.2 Script directories

3.3 Submitting a SLURM job

3.4 Running the underlying commands directly

3.5 Models referenced in the scripts

4. Running the evaluations

4.1 Per-pair safety/alignment analysis

4.2 Orchestrator across all model pairs

4.3 False-positive rates on non-hazardous turns

4.4 Failure-mode breakdown by safety category

4.5 Non-safety action accuracy

4.6 Joint success/safety classification

5. Notes on paths

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Arguments — `pipeline_pddl_to_video_thor5.py`

Arguments — `test_pipeline_safety_trajs.py`

`qwen_vl_safety_eval_vllm.py` (QA) — extra arguments

`qwen_vl_fewshot_icl_eval_vllm_512.py` (embodied) — extra arguments

`qwen_vl_fewshot_icl_eval_vllm_with_qa.py` (QA-conditioned embodied) — extra arguments

Packages