Paper | Project Page | Video
GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation
Hang Yin, Haoyu Wei, Xiuwei Xu$\dagger$, Wenxuan Guo, Jie Zhou, Jiwen Lu$\ddagger$
* Equal contribution
We propose a unified 3D graph representation for zero-shot vision-and-language navigation. By modeling the instruction graph as a set of constraints, we can solve for the optimal navigation path. Incorrect exploration is handled by graph-based backtracking.
- [2026/06/08]: Release code.
- [2025/09/16]: arXiv paper and project page available.
- [2025/08/01]: GC-VLN is accepted to CoRL 2025!
git clone --recursive https://github.com/bagh2178/GC-VLN.git
cd GC-VLN
conda create -n GC-VLN python=3.9
conda activate GC-VLN
conda install habitat-sim==0.2.4 -c conda-forge -c aihabitat
pip install -e third_party/habitat-lab
pip install -r requirements.txt
python scripts/fix_torch_tensorboard.py
conda install faiss-gpu=1.8.0 -c pytorch -y
pip install --no-build-isolation -e third_party/GLIP
mkdir -p third_party/GLIP/MODEL
wget -O third_party/GLIP/MODEL/glip_large_model.pth "https://huggingface.co/GLIPModel/GLIP/resolve/main/glip_large_model.pth?download=true"
pip install -e third_party/ModelServer
pip install git+https://github.com/facebookresearch/pytorch3d.git --no-build-isolation
conda create -n GC-VLN-Server python=3.10
conda activate GC-VLN-Server
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install opencv-python==4.11.0.86 supervision==0.25.1 transformers==4.51.3 addict==2.4.0 yapf==0.43.0 pycocotools==2.0.8 timm==1.0.15 numpy==2.2.4 # supervision addict yapf pycocotools
pip install -e third_party/ModelServer
pip install -e third_party/Grounded-SAM-2
conda install intel-openmp=2021.4.0 -c defaults -y
pip install --no-build-isolation -e third_party/Grounded-SAM-2/grounding_dino
bash third_party/Grounded-SAM-2/checkpoints/download_ckpts.sh
bash third_party/Grounded-SAM-2/gdino_checkpoints/download_ckpts.sh
We use the R2R-CE and RxR-CE datasets together with Matterport3D (MP3D) scene data. Organize the data as follows:
GC-VLN/
└── data/
    ├── datasets/
    │   ├── R2R_VLNCE_v1-2_preprocessed/
    │   │   └── val_unseen/
    │   │       └── val_unseen.json.gz
    │   └── RxR_VLNCE_v0/
    │       └── val_unseen/
    │           └── val_unseen_guide.json.gz
    └── scene_datasets/
        └── mp3d/
            ├── 1LXtFkjw3qL/
            │   ├── 1LXtFkjw3qL.glb
            │   ├── 1LXtFkjw3qL.house
            │   ├── 1LXtFkjw3qL.navmesh
            │   └── 1LXtFkjw3qL_semantic.ply
            ├── 1pXnuDYAj8r/
            ├── ...
            └── zsNo4HB9uLZ/
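Before running the evaluation, it can help to confirm that the data sits where the layout above expects it. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_paths` is hypothetical, and it assumes every scene folder contains a `.glb` file named after the folder, as shown for `1LXtFkjw3qL`.

```python
from pathlib import Path

# Paths copied from the layout above, relative to the repository root.
EXPECTED_FILES = [
    "data/datasets/R2R_VLNCE_v1-2_preprocessed/val_unseen/val_unseen.json.gz",
    "data/datasets/RxR_VLNCE_v0/val_unseen/val_unseen_guide.json.gz",
]
SCENE_DIR = "data/scene_datasets/mp3d"


def missing_dataset_paths(root="."):
    """Return the expected paths that are absent under `root` (hypothetical helper)."""
    root = Path(root)
    missing = [p for p in EXPECTED_FILES if not (root / p).is_file()]

    scene_root = root / SCENE_DIR
    if not scene_root.is_dir():
        missing.append(SCENE_DIR)
        return missing

    # Assumption: each scene folder holds a .glb named after the folder.
    for scene in sorted(d for d in scene_root.iterdir() if d.is_dir()):
        glb = scene / f"{scene.name}.glb"
        if not glb.is_file():
            missing.append(str(glb.relative_to(root)))
    return missing
```

Calling `missing_dataset_paths()` from the repository root returns an empty list when all checked files are present; otherwise it lists the paths that still need to be downloaded or moved.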
Activate the GC-VLN-Server environment and start the GSAM2 server:
conda activate GC-VLN-Server
python third_party/ModelServer/scripts/quickstart_server/GSAM2.py --port 7000
Wait for the server to fully start before proceeding to the next step.
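One way to automate the wait is to poll the server port until it accepts connections. This is a minimal sketch under stated assumptions: the helper `wait_for_port` is hypothetical, and it assumes the server listens on TCP at `127.0.0.1:7000` (matching `--port 7000`). An open port only shows that the process is listening; it may still be loading model weights, so check the server log as well.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=7000, timeout=300.0, interval=2.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once a connection is accepted, or False if `timeout`
    seconds pass first. Hypothetical helper, not part of the repository.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port(port=7000)` can be called at the top of a launcher script, and the evaluation started only if it returns `True`.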
In a new terminal, activate the GC-VLN environment and run the evaluation:
conda activate GC-VLN
# For R2R dataset
bash run_eval.sh r2r
# For RxR dataset
bash run_eval.sh rxr
GC-VLN/
├── src/
│   ├── solver/              # Navigation planning and constraint solving
│   ├── scenegraph/          # Scene graph construction and mapping
│   ├── agent/               # Agent and environment wrappers
│   └── habitat_extensions/  # Custom Habitat components
├── third_party/
│   ├── ModelServer/         # GSAM2 model server for segmentation
│   ├── Grounded-SAM-2/      # Grounded-SAM-2 implementation
│   ├── habitat-lab/         # Habitat simulation platform
│   └── GLIP/                # Grounded Language-Image Pretraining model
├── config/                  # Configuration files
├── data/                    # Data directory (datasets, scene_datasets)
├── outputs/                 # Output directory for logs and results
└── main.py                  # Evaluation entry point
Check out our scene graph-based zero-shot navigation series:
@article{yin2025gcvln,
title={GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation},
author={Hang Yin and Haoyu Wei and Xiuwei Xu and Wenxuan Guo and Jie Zhou and Jiwen Lu},
journal={arXiv preprint arXiv:2509.10454},
year={2025}
}
