This is the official implementation of Scene Grounding In the Wild.
Scene Grounding In the Wild
Tamir Cohen¹, Leo Segre¹, Shay Shomer Chai¹, Shai Avidan¹, Hadar Averbuch-Elor²
¹Tel Aviv University, ²Cornell University
Abstract
Reconstructing accurate 3D models of large-scale real-world scenes from unstructured, in-the-wild imagery remains a core challenge in computer vision, especially when the input views have little or no overlap. In such cases, existing reconstruction pipelines often produce multiple disconnected partial reconstructions or erroneously merge non-overlapping regions into overlapping geometry. In this work, we propose a framework that grounds each partial reconstruction to a complete reference model of the scene, enabling globally consistent alignment even in the absence of visual overlap. We obtain reference models from dense, geospatially accurate pseudo-synthetic renderings derived from Google Earth Studio. These renderings provide full scene coverage but differ substantially in appearance from real-world photographs. Our key insight is that, despite this significant domain gap, both domains share the same underlying scene semantics. We represent the reference model using 3D Gaussian Splatting, augmenting each Gaussian with semantic features, and formulate alignment as an inverse feature-based optimization scheme that estimates a global 6DoF pose and scale while keeping the reference model fixed. Furthermore, we introduce the WikiEarth dataset, which registers existing partial 3D reconstructions with pseudo-synthetic reference models. We demonstrate that our approach consistently improves global alignment when initialized with various classical and learning-based pipelines, while mitigating failure modes of state-of-the-art end-to-end models.
This implementation is built on top of the Nerfstudio framework.
Create a conda environment and install the required dependencies:
# Create environment
conda create -n scene_grounding python=3.10
conda activate scene_grounding
./scripts/setup_env.sh
Install COLMAP 3.10 (https://colmap.github.io/install.html).
Run ./scripts/prepare_dataset.sh to download the WikiEarth and WikiScenes datasets.
To create the 3D Gaussian Splatting base model, prepare the meta image, run the COLMAP baseline initialization, and run registration:
./scripts/pipelines/full_pipeline.sh 39 4 dino
Parameters:
cathedral_number: The first argument. Cathedral ID from the WikiScenes dataset.
colmap_number: The second argument. Meta image number from the WikiScenes 3D reconstruction.
feature_type: The third argument. Feature extraction method (use dino to reproduce the paper results).
The 7-DoF transform (6-DoF pose plus scale) will be written to ./transforms.
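As a rough illustration of what a 7-DoF transform (scale, rotation, translation) does to a 3D point, the sketch below applies x' = s R x + t in plain Python. The function names and numeric values are hypothetical and chosen only for illustration. The layout of the files written to ./transforms is not described here, so the sketch does not read them.

```python
import math


def rotation_z(angle_rad):
    """Return a 3x3 rotation matrix for a rotation of angle_rad about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def apply_similarity(point, scale, rotation, translation):
    """Map a 3D point with x' = scale * (rotation @ x) + translation."""
    rotated = [sum(rotation[i][j] * point[j] for j in range(3)) for i in range(3)]
    return [scale * rotated[i] + translation[i] for i in range(3)]


# Hypothetical values, not taken from any real registration result.
scale = 2.0
rotation = rotation_z(math.pi / 2)  # 90 degrees about z
translation = [1.0, 0.0, 0.0]

aligned = apply_similarity([1.0, 0.0, 0.0], scale, rotation, translation)
print([round(v, 6) for v in aligned])
```

Applying the same function to every point (or camera center) of a partial reconstruction would move it into the reference model's coordinate frame, assuming the stored transform follows this scale, rotation, translation convention.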
To run only the registration component:
./scripts/pipelines/features_pipeline.sh <cathedral_number> <feature_type> <colmap_number>

This work is built upon the Nerfstudio framework. We thank the Nerfstudio team for their excellent codebase and tools that made this research possible.
If you find our work useful in your research, please consider citing:
@misc{cohen2026scenegroundingwild,
title={Scene Grounding In the Wild},
author={Tamir Cohen and Leo Segre and Shay Shomer-Chai and Shai Avidan and Hadar Averbuch-Elor},
year={2026},
eprint={2603.26584},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.26584},
}