Planning with the Views

Kangrui Wang¹, Linjie Li², Zhengyuan Yang³, Shiqi Chen⁴, Zihan Wang¹, Li Fei-Fei⁵, Jiajun Wu⁵, Leonidas Guibas⁵, Lijuan Wang³, Manling Li¹

¹Northwestern University ²University of Washington ³Microsoft ⁴University of Oxford ⁵Stanford University

📢 Updates

[2026-05-20] We release the ViewSuite codebase, benchmark, and the iterative self-exploration training framework, along with the dataset and trained checkpoints on HuggingFace.
[Coming soon] Paper on arXiv.

🌟 Overview

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, and decompose it into two coupled abilities: (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view.
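To make the two abilities concrete, a camera pose can be written as a 4×4 rigid transform, a single action as a small relative transform, and a multi-step plan as their product. The sketch below is illustrative only: the step sizes, action names, and axis convention (x right, y up, z forward) are assumptions and are not taken from ViewSuite's actual action space.

```python
import math

# Hypothetical step sizes; ViewSuite's real values may differ.
STEP_M = 0.25                   # translation per move, in metres
TURN_RAD = math.radians(15.0)   # rotation per turn


def identity():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translate(dx, dy, dz):
    m = identity()
    m[0][3], m[1][3], m[2][3] = dx, dy, dz
    return m


def yaw(theta):
    """Rotation about the y axis; positive theta swings +z toward +x."""
    c, s = math.cos(theta), math.sin(theta)
    m = identity()
    m[0][0], m[0][2] = c, s
    m[2][0], m[2][2] = -s, c
    return m


# Assumed convention: x right, y up, z forward.
ACTIONS = {
    "forward": translate(0.0, 0.0, STEP_M),
    "backward": translate(0.0, 0.0, -STEP_M),
    "left": translate(-STEP_M, 0.0, 0.0),
    "right": translate(STEP_M, 0.0, 0.0),
    "turn_left": yaw(-TURN_RAD),
    "turn_right": yaw(TURN_RAD),
}


def apply_plan(pose, plan):
    """Actions are expressed in the camera frame, so right-multiply each one."""
    for name in plan:
        pose = matmul(pose, ACTIONS[name])
    return pose


# Order matters: turning before moving ends somewhere else than moving first.
turn_then_move = apply_plan(identity(), ["turn_right"] * 6 + ["forward"])
move_then_turn = apply_plan(identity(), ["forward"] + ["turn_right"] * 6)
```

Under these assumptions `turn_then_move` ends displaced along +x while `move_then_turn` ends displaced along +z, which is the kind of compounding effect a multi-turn plan has to track.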

ViewSuite is a 3D point-cloud environment and benchmark suite for view planning, built on ~300 real ScanNet indoor scenes (~55K view pairs, ~165K task instances). It probes view planning through three diagnostic tasks:

Path-to-View (P2V): predict the resulting view from an action sequence (tests understanding).
View-to-Path (V2P): infer the action sequence between two views (tests understanding).
Interactive View Planning (IVP): plan view changes over multiple turns and submit a 6-DoF estimate of the target (tests multi-turn composition).

Across 13 frontier VLMs, a critical planning gap emerges: models possess basic view-action knowledge (~50–70% on short-horizon P2V/V2P) but fail to compose it across multi-turn plans (below 21% on IVP). To close this gap, we propose an iterative training framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of outcome, collectively form a view graph; distilling it into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% → 47.8% on Interactive View Planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).
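The view-graph idea can be illustrated with a toy example. This is not the released GraphRL pipeline: the view IDs, action names, and the way a supervised sample is derived here are assumptions made purely for illustration. The sketch merges the steps of several trajectories, successful or not, into one directed graph and then reads off a V2P-style target (the action path between two views) with breadth-first search.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge (view, action, next_view) steps from all trajectories into one graph."""
    graph = defaultdict(dict)
    for traj in trajectories:
        for view, action, next_view in traj:
            graph[view][action] = next_view
    return graph


def shortest_action_path(graph, start, goal):
    """Breadth-first search; returns the shortest action list, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, path = queue.popleft()
        if view == goal:
            return path
        for action, nxt in graph.get(view, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None


# Two hypothetical exploration episodes; the first never reaches "v4".
trajectories = [
    [("v0", "forward", "v1"), ("v1", "turn_left", "v2")],
    [("v0", "turn_left", "v3"), ("v3", "forward", "v2"), ("v2", "forward", "v4")],
]
graph = build_view_graph(trajectories)
path = shortest_action_path(graph, "v0", "v4")
print(path)  # ['forward', 'turn_left', 'forward']
```

Note that the recovered path stitches the start of the unsuccessful episode onto the end of the successful one, which is why trajectories are useful regardless of outcome.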

For more details, see our paper and project homepage.

📦 Repository Structure

ViewSuite/
├── view_suite/      # Core ViewSuite environment & Python package
├── GraphRL/         # Iterative RL–SFT training framework
├── examples/        # Evaluation configs (API models, sglang, baselines)
├── scripts/         # Install, render-service, and data-download scripts
├── visualizer/      # Trajectory & view visualization tools
└── setup.py

All commands below assume you are at the ViewSuite repo root (cd /path/to/ViewSuite). Where applicable, ${VIEWSUITE_ROOT} is the absolute path to that root and is auto-exported by the install scripts.

⚙️ 1. Installation

Full install (training + evaluation + service)

Used on the machine that runs RL/SFT training and the eval harness.

# Clone the repository
git clone https://github.com/mll-lab-nu/ViewSuite.git
cd ViewSuite

# Create env (Python 3.12)
conda create -n viewsuite python=3.12 -y
conda activate viewsuite

# Install ViewSuite + GraphRL + VAGEN + verl + LLaMA-Factory + sglang
bash scripts/install.sh

Service-only install (render service host)

For the machine that hosts the ScanNet HTTP render service. Lighter — no RL stack.

conda create -n viewsuite python=3.12 -y
conda activate viewsuite

bash scripts/install_service.sh

📥 2. Download Data

Service side — ScanNet scans + meshes (large)

Lives on the render-service machine. Downloads from the public dataset repo MLL-Lab/viewsuite.

bash scripts/download_scannet.sh
# downloads scannet.tar.gz into data/

Local — ViewSuite tasks (jsonl + small assets)

Lives on the training/eval machine (the one talking to the render service).

bash scripts/download_viewsuite_all.sh
# downloads viewsuite_15k.tar.gz + mindcube.tar.gz into data/

After both, you should have:

data/
├── scannet/scans/...
└── viewsuite_15k/
    ├── interactive_view_planning_test.jsonl   # Interactive View Planning (IVP)
    ├── path_to_view_test.jsonl                # Path-to-View (P2V)
    ├── view_to_path_test.jsonl                # View-to-Path (V2P)
    └── ...
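A quick way to confirm the layout above is a small check script. This is a sketch, not part of the repository: the function name is hypothetical, and it only tests for the entries listed in the tree. On a split setup, where the render-service host and the training host are different machines, only the entries relevant to each machine will be present.

```python
import os
from pathlib import Path

# Entries taken from the directory tree above, relative to data/.
EXPECTED = [
    "scannet/scans",
    "viewsuite_15k/interactive_view_planning_test.jsonl",
    "viewsuite_15k/path_to_view_test.jsonl",
    "viewsuite_15k/view_to_path_test.jsonl",
]


def missing_data(data_root):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    data_dir = Path(os.environ.get("VIEWSUITE_ROOT", ".")) / "data"
    missing = missing_data(data_dir)
    if missing:
        print(f"Missing under {data_dir}:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print(f"All expected entries found under {data_dir}")
```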

Trained checkpoints (optional)

Download the released Qwen2.5-VL-7B checkpoints — used as starting points or eval targets.

bash scripts/download_model.sh
# downloads into model/qwen25-ivp/{viewsuite-all-qwen25vl7b,viewsuite-ivp-qwen25vl7b}/

🖥️ 3. Start the ScanNet Render Service

The service exposes an HTTP render endpoint that gym environments call to render camera views from ScanNet scenes. Run it on a GPU box, after the ScanNet data is downloaded (Step 2).

Who needs this? Only Interactive View Planning (IVP) — and Gaussian-Splat–rendered evaluation — renders views on the fly and needs this service. Path-to-View (P2V) and View-to-Path (V2P) do not: they are single-turn tasks that read pre-rendered images straight from the jsonl, so they run and evaluate without ever starting the service.

Mesh backend (open3d, recommended, full splits available) — renders directly from the ScanNet meshes (Step 2); no extra download needed.

# Required: tells the service where to find data/scannet/...
export VIEWSUITE_ROOT="$(pwd)"

#   args: MAX_WORKERS=32 GPU_IDS=0 OMP_CAP=1 PORT=8767 T=10800 BACKEND=open3d
bash scripts/scannet_http_service_loop.sh 32 0 1 8767 10800 open3d

3D-Gaussian-Splatting backend (gsplat, only test split available) — renders from pretrained per-scene 3DGS reconstructions of the ScanNet scenes (GaussianWorld/scannet_mcmc_1.5M_3dgs, from the SceneSplat-7K project). Download those first into data/scannet_3dgs_mcmc/:

export VIEWSUITE_ROOT="$(pwd)"
export HF_TOKEN=hf_xxx               # huggingface_hub token
bash scripts/download_scannet_3dgs.sh

Then start the service with the gsplat backend (same args; BACKEND defaults to gsplat):

export VIEWSUITE_ROOT="$(pwd)"
bash scripts/scannet_http_service_loop_gs.sh 32 0 1 8767

The supervisor restarts the worker every T seconds (default 3h). Logs land under ./scannet_http_service_<TS>/.

To run it in the background and persist its URL:

export VIEWSUITE_ROOT="$(pwd)"
nohup bash scripts/scannet_http_service_loop.sh 32 0 1 8767 \
  > scannet_http_service_loop.log 2>&1 &
echo "$!" > scannet_http_service_loop.pid
echo "http://0.0.0.0:8767" > client_url.txt   # consumed by env configs

Choosing MAX_WORKERS (the first arg). Each worker keeps a ScanNet scene resident in GPU memory, so the worker count is bounded by both GPU VRAM and CPU core count (see scripts/scannet_http_service_loop.sh). A value of 32 is a safe default for a 24–48 GB GPU on a ~32-core host. On a large card with many cores, for example an RTX PRO 6000 Blackwell on a 64-core box, try 64. If you hit GPU OOM or CPU thrashing, lower it.
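As a rough starting point, the two bounds can be combined by taking the smaller of the two. The helper below is a hypothetical sketch and not something the service scripts provide; in particular, `per_worker_gb` is a number you would have to measure on your own scenes (for example with nvidia-smi), and no official per-worker footprint is given here.

```python
import math
import os


def suggest_max_workers(vram_gb, per_worker_gb, cpu_cores=None):
    """Smaller of the CPU-core bound and the VRAM bound, never below 1.

    per_worker_gb is an assumed, user-measured footprint per worker.
    """
    cores = cpu_cores or os.cpu_count() or 1
    by_vram = math.floor(vram_gb / per_worker_gb)
    return max(1, min(cores, by_vram))


# Example with a made-up footprint of 0.75 GB per worker.
print(suggest_max_workers(vram_gb=24, per_worker_gb=0.75, cpu_cores=32))  # 32
print(suggest_max_workers(vram_gb=8, per_worker_gb=0.75, cpu_cores=32))   # 10
```

Treat the result as a first guess to pass as the first argument of the service script, then lower it if you see OOM errors or CPU thrashing.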

🎮 4. Try the Environments

Each task is a self-contained gym environment you can play interactively from the keyboard — a quick way to get a feel for the tasks and to confirm your data, install, and (for IVP) render service are wired up. Run from the repo root, or with VIEWSUITE_ROOT exported. Every observed/rendered image is saved to a folder so you can look at it.

P2V and V2P — no render service needed. Single-turn multiple-choice tasks that read pre-rendered images from the jsonl, so they work as soon as the data (Step 2) is in place. The demo prints a question, saves its images, and you answer with A / B / C / D:

export VIEWSUITE_ROOT="$(pwd)"
python view_suite/envs/scannet_proxy_task/path_to_view.py   # images -> tests/p2v_play/
python view_suite/envs/scannet_proxy_task/view_to_path.py   # images -> tests/v2p_play/

IVP — requires a running render service (Step 3). You fly the camera around a ScanNet scene with the keyboard; each newly rendered view is saved to tests/ivp_play/. The demo runs in action-only + no-submit mode — you start on the initial view and just navigate; the episode auto-succeeds once you reach the target.

w/s  move forward / backward    q/e  turn left / right
a/d  move left / right          r/f  look up / down
y/h  move up / down             t/g  rotate ccw / cw
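The key bindings above can be read as a lookup table from characters to moves. The following sketch shows one way a typed string could be expanded into a sequence of moves; the action names and the `parse_keys` function are hypothetical and do not reflect the demo's internal identifiers.

```python
# Key bindings copied from the table above; action names are illustrative.
KEY_TO_ACTION = {
    "w": "move_forward", "s": "move_backward",
    "a": "move_left", "d": "move_right",
    "y": "move_up", "h": "move_down",
    "q": "turn_left", "e": "turn_right",
    "r": "look_up", "f": "look_down",
    "t": "rotate_ccw", "g": "rotate_cw",
}


def parse_keys(text):
    """Expand a typed string into a list of moves; 'quit' returns None."""
    text = text.strip().lower()
    if text == "quit":  # checked first, since 'q' and 't' are also movement keys
        return None
    unknown = sorted(set(text) - set(KEY_TO_ACTION))
    if unknown:
        raise ValueError(f"unknown key(s): {''.join(unknown)}")
    return [KEY_TO_ACTION[ch] for ch in text]


print(parse_keys("wwd"))  # ['move_forward', 'move_forward', 'move_right']
```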

export VIEWSUITE_ROOT="$(pwd)"
# Service URL: --client_url, else client_url.txt, else http://0.0.0.0:8767
python view_suite/envs/scannet_proxy_task/interactive_view_planning.py
python view_suite/envs/scannet_proxy_task/interactive_view_planning.py \
  --client_url=http://0.0.0.0:8767

Type movement keys in any combination (e.g. wwd), or quit to exit. If the IVP demo cannot connect, the render service is unreachable — recheck Step 3 and client_url.txt.
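When debugging connection problems, it helps to know which URL the demo will end up using. The sketch below mirrors the documented lookup order (--client_url, then client_url.txt, then http://0.0.0.0:8767); the function name and the exact file-parsing rules (blank lines skipped, whitespace trimmed) are assumptions rather than the demo's actual code.

```python
from pathlib import Path

DEFAULT_URL = "http://0.0.0.0:8767"


def resolve_client_urls(cli_url=None, url_file="client_url.txt"):
    """Return service URLs: CLI value first, then the file, then the default."""
    if cli_url:
        return [cli_url]
    path = Path(url_file)
    if path.is_file():
        urls = [line.strip() for line in path.read_text().splitlines()
                if line.strip()]
        if urls:
            return urls
    return [DEFAULT_URL]


if __name__ == "__main__":
    print(resolve_client_urls())
```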

📊 5. Evaluation

Both eval suites read data/viewsuite_15k/*.jsonl. IVP evaluation additionally needs the render service (Step 3) and client_url.txt; P2V/V2P do not.

5a. Closed-source / API models — `examples/evaluation/eval_scannet_proxy_task`

Configs already exist for each model (claude_opus_4_6.yaml, gpt_5_4.yaml, gemini_3_pro.yaml, ...).

export VIEWSUITE_ROOT="$(pwd)"
export fileroot="$(pwd)"

# Run all models (set the API keys for the models you intend to run):
export OPENROUTER_API_KEY=...        # Claude / GPT-5 family via OpenRouter
bash examples/evaluation/eval_scannet_proxy_task/eval_all.sh

# Or run a single model:
export GEMINI_API_KEY=...
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...

python -m vagen.evaluate.run_eval \
  --config examples/evaluation/eval_scannet_proxy_task/claude_opus_4_6.yaml \
  fileroot="$(pwd)"

Rollouts are dumped to ${fileroot}/rollouts/<model_name>/tag_<task>/....
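To see which models and tasks already have rollouts on disk, you can walk that directory layout. This sketch assumes only the `rollouts/<model_name>/tag_<task>/` pattern stated above; the function name is hypothetical, and nothing is assumed about the files inside each task directory.

```python
from pathlib import Path


def list_rollouts(fileroot):
    """Map each model name to the task tags that have a rollout directory."""
    root = Path(fileroot) / "rollouts"
    found = {}
    for task_dir in sorted(root.glob("*/tag_*")):
        if task_dir.is_dir():
            task = task_dir.name.removeprefix("tag_")
            found.setdefault(task_dir.parent.name, []).append(task)
    return found


if __name__ == "__main__":
    for model, tasks in list_rollouts(".").items():
        print(f"{model}: {', '.join(tasks)}")
```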

5b. Open-source / custom models via sglang — `examples/evaluation/eval_sglang`

export VIEWSUITE_ROOT="$(pwd)"

# Minimal: any sglang-supported VLM, default 3-task YAML
MODEL_PATH=Qwen/Qwen3-VL-8B-Instruct \
  bash examples/evaluation/eval_sglang/eval_model.sh

# RTX 6000 / Blackwell + local checkpoint + IVP-only YAML
MODEL_PATH=/path/to/local/checkpoint \
MODEL_NAME=my_ckpt \
CONFIG=examples/evaluation/eval_sglang/interactive_view_planning_only.yaml \
DUMP_DIR="$(pwd)/rollouts/my_ckpt" \
CUDA_VISIBLE_DEVICES=1 \
SGLANG_EXTRA_ARGS="--attention-backend=flashinfer --mm-attention-backend=triton_attn" \
  bash examples/evaluation/eval_sglang/eval_model.sh

More examples are in examples/evaluation/eval_sglang/README.md.

🏋️ 6. Iterative RL–SFT Training

Trains the Qwen-VL agent on Interactive View Planning, alternating self-exploration (RL) with view graph distillation (SFT).

export VIEWSUITE_ROOT="$(pwd)"
export WANDB_API_KEY=your_wandb_key            # or `export WANDB_MODE=offline`
export HF_TOKEN=hf_xxx                         # for checkpoint upload, optional
cd GraphRL
# The render service must be reachable; the default expects http://0.0.0.0:8767
# (see client_url.txt produced in Step 3).

# Default: 8 GPUs per node for both RL and SFT.
bash examples/viewsuite/viewsuite_interactive_view_planning/run.sh

# Override GPU count or any pipeline knob:
N_GPUS_PER_NODE=8 SFT_N_GPUS=8 \
  bash examples/viewsuite/viewsuite_interactive_view_planning/run.sh \
  iterations=5

Outputs land under exps/viewsuite/viewsuite_interactive_view_planning/.

Scaling the render service for training

IVP rollouts hit the render service constantly, and switching scenes is expensive — each switch reloads a ScanNet point cloud into GPU memory. To keep the trainer fed, run multiple render services in parallel and list all of them in client_url.txt, one URL per line:

http://10.0.0.1:8767
http://10.0.0.2:8767
http://10.0.0.3:8767
http://10.0.0.4:8767

interactive_view_planning.py talks to the service over HTTP and distributes its environments across every URL listed in client_url.txt. Running more services keeps more scenes resident at once, so workers reload point clouds far less often. For our training runs we used four RTX 4090 machines, one service per machine with 32 workers each, which comfortably serves ~128 parallel environments. Scale the number of services and MAX_WORKERS to your hardware (see the MAX_WORKERS note in Step 3).
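For capacity planning, it can help to estimate how many environments each service would receive. The sketch below assumes a simple round-robin assignment, which is an illustration only; the actual distribution policy in interactive_view_planning.py is not described here and may differ.

```python
from collections import Counter


def assign_envs(urls, n_envs):
    """Assign environment indices to service URLs in round-robin order (assumed policy)."""
    if not urls:
        raise ValueError("at least one service URL is required")
    return {i: urls[i % len(urls)] for i in range(n_envs)}


urls = [
    "http://10.0.0.1:8767",
    "http://10.0.0.2:8767",
    "http://10.0.0.3:8767",
    "http://10.0.0.4:8767",
]
load = Counter(assign_envs(urls, 128).values())
print(load["http://10.0.0.1:8767"])  # 32
```

With four services and 128 environments, each service would see 32 environments under this assumed policy, which matches a MAX_WORKERS of 32 per service.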

📝 Citation

If you find ViewSuite useful in your research, please consider citing our paper:

@article{wang2026viewsuite,
  title   = {Planning with the Views},
  author  = {Wang, Kangrui and Li, Linjie and Yang, Zhengyuan and Chen, Shiqi and
             Wang, Zihan and Fei-Fei, Li and Wu, Jiajun and Guibas, Leonidas and
             Wang, Lijuan and Li, Manling},
  year    = {2026}
}

🙏 Acknowledgements

ViewSuite is built on ScanNet for real 3D indoor scenes, and our training and evaluation framework draws on VAGEN, verl, LLaMA-Factory, and sglang. The higher-fidelity Gaussian-Splatting renders use pretrained per-scene ScanNet 3DGS reconstructions from SceneSplat-7K (SceneSplat, ICCV 2025). We thank the authors of these projects for open-sourcing their work.

📄 License

This project is released under the MIT License. Note that ScanNet data and any third-party models are subject to their own licenses and terms of use.
