Zero-shot object navigation (ZSON) allows robots to find target objects in unfamiliar environments using natural language instructions, without relying on pre-built maps or task-specific training. Recent general-purpose models, such as large language models (LLMs) and vision-language models (VLMs), equip agents with semantic reasoning abilities to estimate target object locations in a zero-shot manner. However, these models often greedily select the next goal without maintaining a global understanding of the environment and are fundamentally limited in the spatial reasoning necessary for effective navigation. To overcome these limitations, we propose a novel 3D voxel-based belief map that estimates the target’s prior presence distribution within a voxelized 3D space. This approach enables agents to integrate semantic priors from LLMs and visual embeddings with hierarchical spatial structure, alongside real-time observations, to build a comprehensive 3Dglobal posterior belief of the target’s location. Building on this 3D voxel map, we introduce BeliefMapNav, an efficient navigation system with two key advantages: i) grounding LLM semantic reasoning within the 3D hierarchical semantics voxel space for precise target position estimation, and ii) integrating sequential path planning to enable efficient global navigation decisions. Experiments on HM3D and HSSD benchmarks show that BeliefMapNav achieves state-of-the-art (SOTA) Success Rate (SR) and Success weighted by Path Length (SPL), with a notable 9.7 SPL improvement over the previous best SR method, validating its effectiveness and efficiency. We will release the code of BeliefMapNav.
Create the conda environment:
conda_env_name=bmn
conda create -n $conda_env_name python=3.9 -y
conda activate $conda_env_name
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install git+https://github.com/IDEA-Research/GroundingDINO.git@eeba084341aaa454ce13cb32fa7fd9282fc73a67 salesforce-lavis==1.0.2
pip install -r environment.txtIf you are using habitat and are doing simulation experiments, install this repo into your env with the following:
pip install git+https://github.com/facebookresearch/habitat-sim.git@v0.2.5
git clone https://github.com/facebookresearch/habitat-lab.git
git checkout v0.2.5You need to replace the source code of nv_utils.py in habitat-lab with BeliefmapNav/habitat_lab/env_utils.py for resume the eval.
cd habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines
cd ..Clone the following repo within this one (simply cloning will suffice):
git clone git@github.com:WongKinYiu/yolov7.gitFirst, set the following variables during installation (don't need to put in .bashrc):
MATTERPORT_TOKEN_ID=<FILL IN FROM YOUR ACCOUNT INFO IN MATTERPORT>
MATTERPORT_TOKEN_SECRET=<FILL IN FROM YOUR ACCOUNT INFO IN MATTERPORT>
DATA_DIR=</path/to/vlfm/data>
HM3D_OBJECTNAV=https://dl.fbaipublicfiles.com/habitat/data/datasets/objectnav/hm3d/v1/objectnav_hm3d_v1.zipEnsure that the correct conda environment is activated!!
# Download HM3D 3D scans (scenes_dataset)
python -m habitat_sim.utils.datasets_download \
--username $MATTERPORT_TOKEN_ID --password $MATTERPORT_TOKEN_SECRET \
--uids hm3d_train_v0.2 \
--data-path $DATA_DIR &&
python -m habitat_sim.utils.datasets_download \
--username $MATTERPORT_TOKEN_ID --password $MATTERPORT_TOKEN_SECRET \
--uids hm3d_val_v0.2 \
--data-path $DATA_DIR &&
# Download HM3D ObjectNav dataset episodes
wget $HM3D_OBJECTNAV &&
unzip objectnav_hm3d_v1.zip &&
mkdir -p $DATA_DIR/datasets/objectnav/hm3d &&
mv objectnav_hm3d_v1 $DATA_DIR/datasets/objectnav/hm3d/v1 &&
rm objectnav_hm3d_v1.zipThe weights for MobileSAM, GroundingDINO, and PointNav must be saved to the data/ directory. The weights can be downloaded from the following links:
mobile_sam.pt: https://github.com/ChaoningZhang/MobileSAMgroundingdino_swint_ogc.pth: https://github.com/IDEA-Research/GroundingDINOyolov7-e6e.pt: https://github.com/WongKinYiu/yolov7pointnav_weights.pth: included inside the data subdirectory
We use finetuned version of semantic segmentation model RedNet. Therefore, you need to download the segmentation model in data path.
If you want to run the evaluation, you only need to run the following command. The number and IDs of the GPUs, as well as the number of episodes to evaluate on each GPU, can be modified within the file:
.scripts/launch_multi_eval.sh(You may need to run chmod +x on this file first.)
This repo is heavily based on VLFM, and referenced the code in Stairway to Success in for the robot's stair-climbing/ascending and descending capabilities. We thank the authors for their great work.
If you use VLFM in your research, please use the following BibTeX entry.
@article{zhou2025beliefmapnav,
title={BeliefMapNav: 3D Voxel-Based Belief Map for Zero-Shot Object Navigation},
author={Zhou, Zibo and Hu, Yue and Zhang, Lingkai and Li, Zonglin and Chen, Siheng},
journal={arXiv preprint arXiv:2506.06487},
year={2025}
}
