Kangrui Wang1, Linjie Li2, Zhengyuan Yang3, Shiqi Chen4, Zihan Wang1, Li Fei-Fei5, Jiajun Wu5, Leonidas Guibas5, Lijuan Wang3, Manling Li1
1Northwestern University 2University of Washington 3Microsoft 4University of Oxford 5Stanford University
- [2026-05-20] We release the ViewSuite codebase, benchmark, and the iterative self-exploration training framework, along with the dataset and trained checkpoints on HuggingFace.
- [Coming soon] Paper on arXiv.
Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, and decompose it into two coupled abilities: (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view.
ViewSuite is a 3D point-cloud environment and benchmark suite for view planning, built on ~300 real ScanNet indoor scenes (~55K view pairs, ~165K task instances). It probes view planning through three diagnostic tasks:
- Path-to-View (P2V) — predict the resulting view from an action sequence (tests understanding).
- View-to-Path (V2P) — infer the action sequence between two views (tests understanding).
- Interactive View Planning (IVP) — plan view changes over multiple turns and submit a 6-DoF estimate of the target (tests multi-turn leveraging).
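The two abilities above (one action changes the view; many actions compose into a plan) can be illustrated with a deliberately simplified sketch. It tracks only a planar pose `(x, y, yaw)` rather than the full 6-DoF pose, and the step sizes `STEP_M` and `TURN_DEG` and the action names are hypothetical placeholders, not ViewSuite's actual values.

```python
import math

# Hypothetical step sizes; the real ViewSuite action magnitudes may differ.
STEP_M = 0.5
TURN_DEG = 30.0


def apply_action(pose, action):
    """Return the new (x, y, yaw_deg) pose after one camera-frame action."""
    x, y, yaw = pose
    rad = math.radians(yaw)
    if action == "forward":
        return (x + STEP_M * math.cos(rad), y + STEP_M * math.sin(rad), yaw)
    if action == "backward":
        return (x - STEP_M * math.cos(rad), y - STEP_M * math.sin(rad), yaw)
    if action == "turn_left":
        return (x, y, (yaw + TURN_DEG) % 360.0)
    if action == "turn_right":
        return (x, y, (yaw - TURN_DEG) % 360.0)
    raise ValueError(f"unknown action: {action}")


def apply_path(pose, actions):
    """Compose a sequence of actions by applying them one after another."""
    for action in actions:
        pose = apply_action(pose, action)
    return pose


if __name__ == "__main__":
    start = (0.0, 0.0, 0.0)
    # The same two actions in a different order end at different positions.
    print(apply_path(start, ["forward", "turn_left"]))
    print(apply_path(start, ["turn_left", "forward"]))
```

Because translations are expressed in the camera frame, the order of actions matters, which is one reason multi-step composition is harder than predicting a single step.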
Across 13 frontier VLMs, a critical planning gap emerges: models possess basic view-action knowledge (~50–70% on short-horizon P2V/V2P) but fail to compose it across multi-turn plans (below 21% on IVP). To close this gap, we propose an iterative training framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of outcome, collectively form a view graph; distilling it into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% → 47.8% on Interactive View Planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).
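To make the view-graph idea concrete, here is a minimal sketch of how trajectories could be merged into a graph and then queried. It is an illustration under stated assumptions, not the GraphRL implementation: views are represented by hypothetical string IDs, each trajectory is a list of `(view, action, next_view)` steps, and breadth-first search stands in for whatever task-generation procedure the framework actually uses.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge all trajectories, successful or not, into one view graph."""
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for src, action, dst in trajectory:
            graph[src][action] = dst
    return graph


def shortest_action_path(graph, start, goal):
    """Breadth-first search for the shortest action sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, path = queue.popleft()
        if view == goal:
            return path
        for action, nxt in graph.get(view, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None  # goal is not reachable from start in the explored graph


if __name__ == "__main__":
    # Two separate explorations; neither one alone connects A to D.
    t1 = [("A", "forward", "B"), ("B", "turn_left", "C")]
    t2 = [("C", "forward", "D")]
    graph = build_view_graph([t1, t2])
    print(shortest_action_path(graph, "A", "D"))
```

The merged graph yields a supervised target (an action path between two views) that neither trajectory contained on its own, which is the sense in which exploration of any outcome still contributes training signal.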
For more details, see our paper and project homepage.
ViewSuite/
├── view_suite/ # Core ViewSuite environment & Python package
├── GraphRL/ # Iterative RL–SFT training framework
├── examples/ # Evaluation configs (API models, sglang, baselines)
├── scripts/ # Install, render-service, and data-download scripts
├── visualizer/ # Trajectory & view visualization tools
└── setup.py
All commands below assume you are at the ViewSuite repo root (cd /path/to/ViewSuite). Where applicable, ${VIEWSUITE_ROOT} is the absolute path to that root and is auto-exported by the install scripts.
Used on the machine that runs RL/SFT training and the eval harness.
# Clone the repository
git clone https://github.com/mll-lab-nu/ViewSuite.git
cd ViewSuite
# Create env (Python 3.12)
conda create -n viewsuite python=3.12 -y
conda activate viewsuite
# Install ViewSuite + GraphRL + VAGEN + verl + LLaMA-Factory + sglang
bash scripts/install.sh
For the machine that hosts the ScanNet HTTP render service. Lighter: no RL stack.
conda create -n viewsuite python=3.12 -y
conda activate viewsuite
bash scripts/install_service.sh
The ScanNet data lives on the render-service machine and is downloaded from the public dataset repo MLL-Lab/viewsuite.
bash scripts/download_scannet.sh
# downloads scannet.tar.gz into data/
The ViewSuite task data lives on the training/eval machine (the one talking to the render service).
bash scripts/download_viewsuite_all.sh
# downloads viewsuite_15k.tar.gz + mindcube.tar.gz into data/
After both, you should have:
data/
├── scannet/scans/...
└── viewsuite_15k/
    ├── interactive_view_planning_test.jsonl   # Interactive View Planning (IVP)
    ├── path_to_view_test.jsonl                # Path-to-View (P2V)
    ├── view_to_path_test.jsonl                # View-to-Path (V2P)
    └── ...
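A quick way to confirm the layout above before running anything heavier is a small check script. This is a sketch, not part of the repo: the helper names `missing_task_files` and `count_records` are hypothetical, and only the three file names shown in the tree are checked.

```python
import os
from pathlib import Path

# The three test files named in the data layout above.
EXPECTED_FILES = [
    "interactive_view_planning_test.jsonl",
    "path_to_view_test.jsonl",
    "view_to_path_test.jsonl",
]


def missing_task_files(root):
    """Return the expected jsonl files that are absent under <root>/data/viewsuite_15k."""
    data_dir = Path(root) / "data" / "viewsuite_15k"
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]


def count_records(path):
    """Count non-empty lines (one JSON record per line) in a jsonl file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


if __name__ == "__main__":
    root = os.environ.get("VIEWSUITE_ROOT", ".")
    missing = missing_task_files(root)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        for name in EXPECTED_FILES:
            path = Path(root) / "data" / "viewsuite_15k" / name
            print(name, count_records(path))
```

Run it from the repo root, or with VIEWSUITE_ROOT exported, after the download scripts finish.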
Download the released Qwen2.5-VL-7B checkpoints — used as starting points or eval targets.
bash scripts/download_model.sh
# downloads into model/qwen25-ivp/{viewsuite-all-qwen25vl7b,viewsuite-ivp-qwen25vl7b}/
The service exposes an HTTP render endpoint that gym environments call to render camera views from ScanNet scenes. Run it on a GPU box, after the ScanNet data is downloaded (Step 2).
Who needs this? Only Interactive View Planning (IVP) — and Gaussian-Splat–rendered evaluation — renders views on the fly and needs this service. Path-to-View (P2V) and View-to-Path (V2P) do not: they are single-turn tasks that read pre-rendered images straight from the jsonl, so they run and evaluate without ever starting the service.
Mesh backend (open3d, recommended, full splits available) — renders directly from the ScanNet meshes (Step 2); no extra download needed.
# Required: tells the service where to find data/scannet/...
export VIEWSUITE_ROOT="$(pwd)"
# args: MAX_WORKERS=32 GPU_IDS=0 OMP_CAP=1 PORT=8767 T=10800 BACKEND=open3d
bash scripts/scannet_http_service_loop.sh 32 0 1 8767 10800 open3d
3D-Gaussian-Splatting backend (gsplat, only test split available): renders from pretrained per-scene 3DGS reconstructions of the ScanNet scenes (GaussianWorld/scannet_mcmc_1.5M_3dgs, from the SceneSplat-7K project). Download those first into data/scannet_3dgs_mcmc/:
export VIEWSUITE_ROOT="$(pwd)"
export HF_TOKEN=hf_xxx # huggingface_hub token
bash scripts/download_scannet_3dgs.sh
Then start the service with the gsplat backend (same args; BACKEND defaults to gsplat):
export VIEWSUITE_ROOT="$(pwd)"
bash scripts/scannet_http_service_loop_gs.sh 32 0 1 8767
The supervisor restarts the worker every T seconds (default 10800, i.e. 3 hours). Logs land under ./scannet_http_service_<TS>/.
To run it in the background and persist its URL:
export VIEWSUITE_ROOT="$(pwd)"
nohup bash scripts/scannet_http_service_loop.sh 32 0 1 8767 \
> scannet_http_service_loop.log 2>&1 &
echo "$!" > scannet_http_service_loop.pid
echo "http://0.0.0.0:8767" > client_url.txt # consumed by env configs
Choosing MAX_WORKERS (the first arg). Each worker keeps a ScanNet scene resident in GPU memory, so the worker count is bounded by both GPU VRAM and CPU core count (see scripts/scannet_http_service_loop.sh). 32 is a safe default for a 24–48 GB GPU on a ~32-core host. On a large card with many cores (e.g. an RTX 6000 Pro (Blackwell) on a 64-core box), try 64. If you hit GPU OOM or CPU thrashing, lower it.
Each task is a self-contained gym environment you can play interactively from the keyboard — a quick way to get a feel for the tasks and to confirm your data, install, and (for IVP) render service are wired up. Run from the repo root, or with VIEWSUITE_ROOT exported. Every observed/rendered image is saved to a folder so you can look at it.
P2V and V2P — no render service needed. Single-turn multiple-choice tasks that read pre-rendered images from the jsonl, so they work as soon as the data (Step 2) is in place. The demo prints a question, saves its images, and you answer with A / B / C / D:
export VIEWSUITE_ROOT="$(pwd)"
python view_suite/envs/scannet_proxy_task/path_to_view.py # images -> tests/p2v_play/
python view_suite/envs/scannet_proxy_task/view_to_path.py # images -> tests/v2p_play/
IVP: requires a running render service (Step 3). You fly the camera around a ScanNet scene with the keyboard; each newly rendered view is saved to tests/ivp_play/. The demo runs in action-only + no-submit mode: you start on the initial view and just navigate, and the episode auto-succeeds once you reach the target.
w/s move forward / backward q/e turn left / right
a/d move left / right r/f look up / down
y/h move up / down t/g rotate ccw / cw
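The key table above maps each character to one camera action, and the demo accepts several keys typed together. A minimal sketch of that decoding is shown below; the `parse_keys` helper is hypothetical, and the action strings are just the labels from the table, not the environment's internal action identifiers.

```python
# Key bindings copied from the table above; values are descriptive labels only.
KEY_ACTIONS = {
    "w": "move forward", "s": "move backward",
    "a": "move left", "d": "move right",
    "y": "move up", "h": "move down",
    "q": "turn left", "e": "turn right",
    "r": "look up", "f": "look down",
    "t": "rotate ccw", "g": "rotate cw",
}


def parse_keys(line):
    """Turn a typed line such as 'wwd' into a list of action labels.

    Returns None for the literal command 'quit'. That check comes first
    because 'q' and 't' are also movement keys.
    """
    text = line.strip()
    if text == "quit":
        return None
    unknown = sorted({ch for ch in text if ch not in KEY_ACTIONS})
    if unknown:
        raise ValueError(f"unknown keys: {''.join(unknown)}")
    return [KEY_ACTIONS[ch] for ch in text]


if __name__ == "__main__":
    print(parse_keys("wwd"))
```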
export VIEWSUITE_ROOT="$(pwd)"
# Service URL: --client_url, else client_url.txt, else http://0.0.0.0:8767
python view_suite/envs/scannet_proxy_task/interactive_view_planning.py
python view_suite/envs/scannet_proxy_task/interactive_view_planning.py \
--client_url=http://0.0.0.0:8767
Type movement keys in any combination (e.g. wwd), or quit to exit. If the IVP demo cannot connect, the render service is unreachable; recheck Step 3 and client_url.txt.
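The service URL is resolved in a fixed order: the --client_url flag, then client_url.txt, then http://0.0.0.0:8767. When a connection fails it helps to know which of these was actually used. The sketch below mirrors that documented order; the function name `resolve_client_urls` and its handling of blank lines are assumptions, not the demo's actual code.

```python
from pathlib import Path

DEFAULT_URL = "http://0.0.0.0:8767"


def resolve_client_urls(cli_url=None, url_file="client_url.txt"):
    """Resolve render-service URLs: CLI flag, then url_file, then the default."""
    if cli_url:
        return [cli_url]
    path = Path(url_file)
    if path.is_file():
        lines = path.read_text(encoding="utf-8").splitlines()
        urls = [line.strip() for line in lines if line.strip()]
        if urls:
            return urls
    return [DEFAULT_URL]


if __name__ == "__main__":
    print(resolve_client_urls())
```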
Both eval suites read data/viewsuite_15k/*.jsonl. IVP evaluation additionally needs the render service (Step 3) and client_url.txt; P2V/V2P do not.
Configs already exist for each model (claude_opus_4_6.yaml, gpt_5_4.yaml, gemini_3_pro.yaml, ...).
export VIEWSUITE_ROOT="$(pwd)"
export fileroot="$(pwd)"
# Run all models (set the API keys for the models you intend to run):
export OPENROUTER_API_KEY=... # Claude / GPT-5 family via OpenRouter
bash examples/evaluation/eval_scannet_proxy_task/eval_all.sh
# Or run a single model:
export GEMINI_API_KEY=...
export ANTHROPIC_API_KEY=...
export OPENAI_API_KEY=...
python -m vagen.evaluate.run_eval \
--config examples/evaluation/eval_scannet_proxy_task/claude_opus_4_6.yaml \
fileroot="$(pwd)"
Rollouts are dumped to ${fileroot}/rollouts/<model_name>/tag_<task>/...
export VIEWSUITE_ROOT="$(pwd)"
# Minimal: any sglang-supported VLM, default 3-task YAML
MODEL_PATH=Qwen/Qwen3-VL-8B-Instruct \
bash examples/evaluation/eval_sglang/eval_model.sh
# RTX 6000 / Blackwell + local checkpoint + IVP-only YAML
MODEL_PATH=/path/to/local/checkpoint \
MODEL_NAME=my_ckpt \
CONFIG=examples/evaluation/eval_sglang/interactive_view_planning_only.yaml \
DUMP_DIR="$(pwd)/rollouts/my_ckpt" \
CUDA_VISIBLE_DEVICES=1 \
SGLANG_EXTRA_ARGS="--attention-backend=flashinfer --mm-attention-backend=triton_attn" \
bash examples/evaluation/eval_sglang/eval_model.sh
More examples are in examples/evaluation/eval_sglang/README.md.
Trains the Qwen-VL agent on Interactive View Planning, alternating self-exploration (RL) with view graph distillation (SFT).
export VIEWSUITE_ROOT="$(pwd)"
export WANDB_API_KEY=your_wandb_key # or `export WANDB_MODE=offline`
export HF_TOKEN=hf_xxx # for checkpoint upload, optional
cd GraphRL
# The render service must be reachable; the default expects http://0.0.0.0:8767
# (see client_url.txt produced in Step 3).
# Default: 8 GPUs per node for both RL and SFT.
bash examples/viewsuite/viewsuite_interactive_view_planning/run.sh
# Override GPU count or any pipeline knob:
N_GPUS_PER_NODE=8 SFT_N_GPUS=8 \
bash examples/viewsuite/viewsuite_interactive_view_planning/run.sh \
iterations=5
Outputs land under exps/viewsuite/viewsuite_interactive_view_planning/.
IVP rollouts hit the render service constantly, and switching scenes is expensive — each switch reloads a ScanNet point cloud into GPU memory. To keep the trainer fed, run multiple render services in parallel and list all of them in client_url.txt, one URL per line:
http://10.0.0.1:8767
http://10.0.0.2:8767
http://10.0.0.3:8767
http://10.0.0.4:8767
interactive_view_planning.py talks to the service over HTTP and distributes its environments across every URL listed in client_url.txt. More services means more scenes stay resident at once, so workers reload point clouds far less often. For our training runs we ran one service per machine on 4× RTX 4090 boxes with 32 workers each, which comfortably serves ~128 parallel environments. Scale the number of services and MAX_WORKERS to your hardware (see the MAX_WORKERS note in Step 3).
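As a rough illustration of spreading environments over several services, the sketch below assigns environments to URLs round-robin. The round-robin policy and the `assign_envs` helper are assumptions made for illustration; the actual assignment logic in interactive_view_planning.py may differ.

```python
from collections import Counter


def assign_envs(num_envs, urls):
    """Assign each environment index to a service URL in round-robin order."""
    if not urls:
        raise ValueError("at least one service URL is required")
    return [urls[i % len(urls)] for i in range(num_envs)]


if __name__ == "__main__":
    urls = [
        "http://10.0.0.1:8767",
        "http://10.0.0.2:8767",
        "http://10.0.0.3:8767",
        "http://10.0.0.4:8767",
    ]
    # 128 environments over 4 services gives 32 per service.
    print(Counter(assign_envs(128, urls)))
```

With four services and 128 environments, each service receives 32, which lines up with the 32-workers-per-machine setup described above.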
If you find ViewSuite useful in your research, please consider citing our paper:
@article{wang2026viewsuite,
title = {Planning with the Views},
author = {Wang, Kangrui and Li, Linjie and Yang, Zhengyuan and Chen, Shiqi and
Wang, Zihan and Fei-Fei, Li and Wu, Jiajun and Guibas, Leonidas and
Wang, Lijuan and Li, Manling},
year = {2026}
}
ViewSuite is built on ScanNet for real 3D indoor scenes, and our training and evaluation framework draws on VAGEN, verl, LLaMA-Factory, and sglang. The higher-fidelity Gaussian-Splatting renders use pretrained per-scene ScanNet 3DGS reconstructions from SceneSplat-7K (SceneSplat, ICCV 2025). We thank the authors of these projects for open-sourcing their work.
This project is released under the MIT License. Note that ScanNet data and any third-party models are subject to their own licenses and terms of use.

