NaVILA is a research codebase that extends VILA/LLaVA-style multimodal models for vision-language-action navigation. The repository contains two main workflows: model training and pretraining (in `llava/`) and environment-based evaluation (in `evaluation/`, built on VLN-CE and Habitat). This file gives targeted, actionable hints so that an AI coding agent can be productive immediately.
- `README.md`: project overview, datasets, and high-level commands. Start here to understand goals and dataset expectations.
- `pyproject.toml`: package name `vila` and pinned dependency versions (Torch 2.3, Transformers 4.37.2, a specific webdataset, etc.). Use this to infer exact runtime versions.
- `llava/`: core multimodal model and training code. Key files:
  - `llava/entry.py`: top-level programmatic model loader (calls `llava.model.builder.load_pretrained_model`). Use this to learn how checkpoints are laid out (many checkpoints have a `model/` subfolder).
  - `llava/cli/run.py`: SLURM wrapper used by team runs; expects the env vars `VILA_SLURM_ACCOUNT` and `VILA_SLURM_PARTITION` and shows the job-run conventions (output in `runs/<mode>/<job-name>`).
  - `llava/train/`: training entrypoints and DeepSpeed/Transformers compatibility patches (see `transformers_replace/` and `deepspeed_replace/`).
- `evaluation/`: VLN-CE based evaluation and Habitat integration. Look at `evaluation/scripts/` and `evaluation/habitat_extensions/` for Habitat-specific adapters.
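
The SLURM wrapper noted above expects two environment variables. A pre-flight check could be sketched as follows; `missing_env_vars` is a hypothetical helper, not part of the repository, and only the two variable names come from this guide.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Variable names taken from the description of llava/cli/run.py in this guide.
REQUIRED_SLURM_VARS = ("VILA_SLURM_ACCOUNT", "VILA_SLURM_PARTITION")


def missing_env_vars(
    names: Sequence[str] = REQUIRED_SLURM_VARS,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars()` before launching a job returns an empty list when both variables are set, which makes it easy to fail early with a clear message.
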
- Environment: the canonical setup script is `./environment_setup.sh navila`, which (a) creates a conda env, (b) installs the package in editable mode, (c) installs the extras `.[train]` and `.[eval]`, (d) installs a FlashAttention2 wheel, and (e) patches Transformers/DeepSpeed by copying files from `llava/train/transformers_replace` and `llava/train/deepspeed_replace` into site-packages. Prefer following this script; it is the single source of truth for env setup.
- Training: look for training scripts in `llava/train/train.py` (and `scripts/train/sft_8frames.sh` for an example launch script). The repo provides a SLURM wrapper, `llava/cli/run.py`, for cluster runs.
- Evaluation: evaluation depends on VLN-CE and Habitat-Sim v0.1.7. The README lists the necessary manual steps (build Habitat-Sim from source, run `evaluation/scripts/habitat_sim_autofix.py` to patch NumPy compatibility). Evaluation entry: `evaluation/scripts/eval/r2r.sh` (see the README for how to pass the checkpoint path and GPU list). Visual outputs go to `eval_out/<CKPT_NAME>/...`.
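
The patching step in the setup script amounts to overlaying files from a patch directory onto an installed package. A minimal Python sketch of that idea is shown below; `overlay_patches` is a hypothetical helper, and the real copy logic lives in `environment_setup.sh` and may differ in detail.

```python
import shutil
from pathlib import Path
from typing import List


def overlay_patches(patch_dir: Path, package_dir: Path) -> List[Path]:
    """Copy every file under patch_dir onto the same relative path in package_dir.

    Hypothetical helper, not the repository's implementation.
    Returns the destination paths that were written, in sorted source order.
    """
    written = []
    for src in sorted(patch_dir.rglob("*")):
        if src.is_file():
            dst = package_dir / src.relative_to(patch_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # overwrites an existing file of the same name
            written.append(dst)
    return written
```

Files in the package that have no counterpart in the patch directory are left untouched, which is what keeps this approach lighter than forking the library.
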
- Checkpoint layout: loaders (e.g., `llava/entry.py`) expect a `model` subfolder inside checkpoints. When writing code that saves or loads models, follow the same layout.
- Monkey-patching approach: instead of forking library code, the project copies compatibility patches into site-packages at install time (`transformers_replace/`, `deepspeed_replace/`). For local experiments, prefer replicating the same patching approach (copy files) rather than changing upstream libs directly.
- SLURM & run naming: run names often contain a `%t` placeholder (replaced by a timestamp in `llava/cli/run.py`). Output directories use `runs/<mode>/<job-name>`, and the `RUN_NAME` and `OUTPUT_DIR` env vars are expected for downstream scripts.
- Pinned versions: `pyproject.toml` pins exact versions (Torch, Transformers, and older or modified webdataset versions). Match these versions to reproduce reported results.
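
The run-naming convention described above can be illustrated with a short sketch. The timestamp format here is an assumption (the real one is defined in `llava/cli/run.py`), and both function names are hypothetical.

```python
from datetime import datetime
from pathlib import Path

# Assumed timestamp format; check llava/cli/run.py for the one actually used.
TIMESTAMP_FORMAT = "%Y%m%d_%H%M%S"


def expand_run_name(job_name: str, now: datetime) -> str:
    """Replace every "%t" in job_name with a formatted timestamp."""
    return job_name.replace("%t", now.strftime(TIMESTAMP_FORMAT))


def run_output_dir(mode: str, job_name: str, root: Path = Path("runs")) -> Path:
    """Build the runs/<mode>/<job-name> path used for job output."""
    return root / mode / job_name
```

Passing `now` explicitly rather than reading the clock inside the function keeps the expansion deterministic and easy to test.
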
- HuggingFace model/weights: the README references checkpoints hosted on HF (e.g., `a8cheng/navila-siglip-llama3-8b-v1.5-pretrain`). Code often uses HF-style repo layouts; when testing loaders, look for `model/` under the repo path.
- FlashAttention2: installed via a specific wheel in `environment_setup.sh`; missing this wheel will break the performance-oriented training code paths.
- VLN-CE / Habitat: evaluation integrates VLN-CE; building Habitat from source is required, and the repo provides `evaluation/scripts/habitat_sim_autofix.py` to patch compatibility.
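
Because several of these integrations are installed outside the normal dependency resolution, it can help to verify that they are importable before starting a long job. The following is a minimal sketch; `missing_modules` is a hypothetical helper, and the module names in the usage note are assumptions about the import names of those packages.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located for import.

    Only top-level names are supported; dotted names whose parent package
    is absent would raise instead of being reported.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For example, `missing_modules(["flash_attn", "habitat_sim"])` would list whichever of the two is not installed in the active environment, assuming those are the import names of FlashAttention2 and Habitat-Sim.
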
- Install & setup (canonical): run `./environment_setup.sh navila`, then `conda activate navila` (the script performs the editable install, installs the wheel, and copies the patches).
- Run a SLURM-backed training job: the maintainers use `llava/cli/run.py` to compose `srun` commands. It requires the `VILA_SLURM_ACCOUNT` and `VILA_SLURM_PARTITION` env vars and places logs in `runs/<mode>/<job-name>/slurm/`.
- Load a model programmatically: use `llava.entry.load(model_path)`; it auto-detects the `model/` subfolder and calls `llava.model.builder.load_pretrained_model`.
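
The `model/` subfolder detection mentioned above can be pictured with the sketch below. It is an illustration under the assumption that detection is a simple directory check; `resolve_model_dir` is a hypothetical name, and the actual logic in `llava/entry.py` may differ.

```python
from pathlib import Path


def resolve_model_dir(checkpoint_path: str) -> Path:
    """Return <checkpoint>/model when that subfolder exists, else the path itself."""
    root = Path(checkpoint_path)
    nested = root / "model"
    return nested if nested.is_dir() else root
```

Code that saves checkpoints can follow the same layout by writing weights under a `model/` directory inside the checkpoint root.
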
- Do not alter the pinned versions in `pyproject.toml` when reproducing experiments; changes require retesting both the training and evaluation pipelines.
- Avoid in-place edits of system site-packages in CI; the repo's current pattern is to copy small patches into site-packages at install time, so mirror that workflow locally instead of editing packages directly.
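
To confirm that an environment still matches the pins, installed versions can be compared against the expected ones. The sketch below hard-codes only the two version numbers quoted in this guide; `pyproject.toml` remains the authoritative list, and `version_mismatches` is a hypothetical helper.

```python
from importlib import metadata
from typing import Callable, Dict, Mapping

# Version prefixes quoted in this guide; pyproject.toml is authoritative.
PINNED_PREFIXES = {"torch": "2.3", "transformers": "4.37.2"}


def _matches(installed: str, prefix: str) -> bool:
    """True for an exact match or a match followed by '.' or '+' (e.g. 2.3.0+cu121)."""
    return installed == prefix or installed.startswith((prefix + ".", prefix + "+"))


def version_mismatches(
    pins: Mapping[str, str] = PINNED_PREFIXES,
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, str]:
    """Map each package that does not match its pin to the version found."""
    found = {}
    for package, prefix in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = "not installed"
        if not _matches(installed, prefix):
            found[package] = installed
    return found
```

An empty result from `version_mismatches()` means both packages match the quoted pins; the `get_version` parameter exists only so the check can be exercised without the real packages installed.
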
- `llava/train/`: for trainer behavior and parameter parsing.
- `llava/model/`: for model architecture construction and where new modalities would be wired in.
- `evaluation/habitat_extensions/`: for Habitat-specific observation/action adapters used by VLN-CE.
- `scripts/train/sft_8frames.sh`: concrete example of a training launch.