⚡ Flash-WAM: Modality-Aware Distillation for World Action Models

Flash-WAM is a modality-aware step-distillation framework for joint video–action world models. It distills each modality with a consistency function matched to its noise regime — a linear-gradient-scaling choice for the low-noise action stream and a variance-preserving choice for the high-noise video stream — compressing LingBot-VA inference to a single step per modality. On RoboTwin 2.0 this yields up to a 23× speedup (8.1 s → 348 ms per chunk) while preserving teacher-level task success.
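As a quick sanity check, the headline speedup follows directly from the two per-chunk latencies quoted above. The snippet below is only that arithmetic; the variable names are illustrative.

```python
# Per-chunk latencies quoted above (RoboTwin 2.0).
teacher_latency_s = 8.1    # LingBot-VA teacher
student_latency_s = 0.348  # Flash-WAM student, 348 ms

speedup = teacher_latency_s / student_latency_s
print(f"speedup: {speedup:.1f}x")  # a little over 23x, consistent with "up to 23x"
```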

This repository provides the Flash-WAM distillation code (the modality-aware method plus its LCM ablations), the model code, and the distilled RoboTwin checkpoint.

📰 News

[2026-06] Flash-WAM RoboTwin checkpoint released on 🤗 HuggingFace.
[2026-06] Flash-WAM paper released on arXiv.

✅ Checklist

Flash-WAM RoboTwin checkpoint
Distillation code (Flash-WAM + LCM ablations)
Real-world deployment setup on Unitree G1 humanoid

📦 Model Checkpoints

| Model | Repository | Description |
| --- | --- | --- |
| LingBot-VA teacher (posttrain-robotwin) | 🤗 robbyant/lingbot-va-posttrain-robotwin | Teacher checkpoint to distill from |
| Flash-WAM distilled (RoboTwin) | 🤗 NU-World-Model-Embodied-AI/FlashWAM-RoboTwin | Single-step distilled student (complete, with encoders) |

Post-training dataset: 🤗 robbyant/robotwin-clean-and-aug-lerobot.
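One way to fetch the three repositories listed above is `snapshot_download` from the `huggingface_hub` package. The sketch below assumes that package is installed; the `downloads/` directory layout and the helper names `download_plan` and `download_all` are hypothetical and not something this repository prescribes. Note that the post-training dataset needs `repo_type="dataset"`.

```python
from pathlib import Path

# Repository IDs come from the section above; the local layout is an assumption.
REPOS = [
    ("robbyant/lingbot-va-posttrain-robotwin", "model"),
    ("NU-World-Model-Embodied-AI/FlashWAM-RoboTwin", "model"),
    ("robbyant/robotwin-clean-and-aug-lerobot", "dataset"),
]


def download_plan(root="downloads"):
    """Return the keyword arguments for one snapshot_download call per repo."""
    plan = []
    for repo_id, repo_type in REPOS:
        local_dir = Path(root) / repo_type / repo_id.split("/")[-1]
        plan.append(
            {"repo_id": repo_id, "repo_type": repo_type, "local_dir": str(local_dir)}
        )
    return plan


def download_all(root="downloads"):
    """Download every repo in the plan (needs network access and huggingface_hub)."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    return [snapshot_download(**kwargs) for kwargs in download_plan(root)]


if __name__ == "__main__":
    # Print the plan only; call download_all() to actually fetch the files.
    for kwargs in download_plan():
        print(kwargs["repo_type"], kwargs["repo_id"], "->", kwargs["local_dir"])
```

The directories returned by `download_all()` for the teacher and the dataset can then be used as `TEACHER_PATH` and `DATASET_PATH` in the distillation commands.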

🚀 Quick Start

Flash-WAM builds on LingBot-VA. For environment installation and evaluation, follow the LingBot-VA repository — Flash-WAM uses the same environment and the same RoboTwin server/client evaluation pipeline. Once the LingBot-VA environment is set up, this repository runs in it directly.

🔬 Distillation

Point the distiller at the teacher checkpoint and dataset, then select the method via DISTILL_MODE:

export TEACHER_PATH=/path/to/lingbot-va-posttrain-robotwin
export DATASET_PATH=/path/to/robotwin-clean-and-aug-lerobot

# Flash-WAM (the paper's modality-aware joint method)
DISTILL_MODE=flashwam bash distillation/run.sh

# LCM ablations from the paper
DISTILL_MODE=joint              bash distillation/run.sh   # naive joint LCM
DISTILL_MODE=video              bash distillation/run.sh   # video-only LCM
DISTILL_MODE=video_action_aware bash distillation/run.sh   # video-only LCM + reg

Key knobs (see distillation/config.py): NGPU, OUTPUT_DIR, num_ddim_timesteps (student video steps), num_ddim_timesteps_action (student action steps), cfg_min/cfg_max (teacher CFG range).
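To run Flash-WAM and the three ablations back to back, a small driver can set the same environment variables the commands above use. This is a minimal sketch, assuming it is launched from the repository root and that `distillation/run.sh` picks up `TEACHER_PATH`, `DATASET_PATH`, and `DISTILL_MODE` from the environment as shown above; the helper names and the `dry_run` flag are hypothetical.

```python
import os
import subprocess

# Modes listed in the commands above.
DISTILL_MODES = ["flashwam", "joint", "video", "video_action_aware"]


def build_env(mode, teacher_path, dataset_path, base_env=None):
    """Copy the environment and add the three variables used by run.sh."""
    if mode not in DISTILL_MODES:
        raise ValueError(f"unknown DISTILL_MODE: {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        TEACHER_PATH=teacher_path,
        DATASET_PATH=dataset_path,
        DISTILL_MODE=mode,
    )
    return env


def run_all(teacher_path, dataset_path, dry_run=True):
    """Launch each mode in sequence; with dry_run=True only print the commands."""
    cmd = ["bash", "distillation/run.sh"]
    for mode in DISTILL_MODES:
        env = build_env(mode, teacher_path, dataset_path)
        print(f"DISTILL_MODE={mode} {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the sweep as soon as one run fails.
            subprocess.run(cmd, env=env, check=True)


if __name__ == "__main__":
    run_all(
        "/path/to/lingbot-va-posttrain-robotwin",
        "/path/to/robotwin-clean-and-aug-lerobot",
    )
```

Since `OUTPUT_DIR` is listed among the knobs, giving each mode its own output directory is probably worthwhile in a sweep like this; how that knob is set depends on `distillation/config.py`.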

📊 Results

Ablation — RoboTwin 2.0 (success rate %, 50 tasks, by task horizon)

Flash-WAM has the highest average success rate among the LCM-based strategies at both NFE budgets and stays closest to the teacher. Naive joint LCM collapses (0% on H3 in both budgets), and the video-only variants trail on average, most sharply at Nₐ = 1.

| Method | Nᵥ | Nₐ | H1 Clean | H1 Rand | H2 Clean | H2 Rand | H3 Clean | H3 Rand | Avg Clean | Avg Rand |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LingBot-VA (teacher) | 25 | 50 | 94.18 | 93.56 | 90.35 | 86.95 | 93.22 | 93.28 | 92.93 | 91.55 |
| Video-only LCM | 1 | 2 | 87.10 | 82.73 | 73.13 | 68.19 | 62.50 | 68.25 | 80.66 | 76.92 |
| Video-only LCM + reg. | 1 | 2 | 91.53 | 88.50 | 83.00 | 74.69 | 68.00 | 62.75 | 86.92 | 82.02 |
| Naive Joint LCM | 1 | 2 | 41.00 | 35.13 | 4.00 | 3.13 | 0.00 | 0.00 | 25.88 | 20.08 |
| Flash-WAM | 1 | 2 | 92.30 | 88.47 | 84.88 | 76.63 | 73.50 | 63.25 | 88.42 | 82.66 |
| Video-only LCM | 1 | 1 | 85.57 | 78.17 | 72.06 | 61.81 | 43.75 | 34.75 | 77.90 | 69.46 |
| Video-only LCM + reg. | 1 | 1 | 66.87 | 61.07 | 39.19 | 35.56 | 10.25 | 4.75 | 53.48 | 48.40 |
| Naive Joint LCM | 1 | 1 | 54.63 | 46.00 | 21.56 | 15.63 | 0.00 | 0.00 | 39.68 | 32.96 |
| Flash-WAM | 1 | 1 | 87.30 | 86.93 | 78.44 | 72.63 | 63.50 | 60.75 | 82.56 | 80.26 |
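To read the averages relative to the teacher, the snippet below computes what fraction of the teacher's success rate each Flash-WAM row retains. The numbers are copied from the table above; "retention" is an illustrative metric introduced here, not one reported by the authors.

```python
# (Avg Clean, Avg Rand) success rates in percent, copied from the table above.
TEACHER = (92.93, 91.55)
FLASH_WAM = {
    (1, 2): (88.42, 82.66),  # key: (N_v, N_a)
    (1, 1): (82.56, 80.26),
}


def retention(student, teacher=TEACHER):
    """Student success as a percentage of teacher success, per column."""
    return tuple(100.0 * s / t for s, t in zip(student, teacher))


for (n_v, n_a), scores in FLASH_WAM.items():
    clean, rand = retention(scores)
    print(f"N_v={n_v}, N_a={n_a}: {clean:.1f}% clean, {rand:.1f}% rand")
```

Under this measure Flash-WAM keeps roughly 95% (clean) and 90% (randomized) of the teacher's average success with two action steps, and roughly 89% and 88% with a single action step.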

Real-World — Unitree G1 (success rate %, 3 tasks × 10 rollouts)

T1: open pot lid and place a potato inside · T2: pick the red bottle (with a yellow distractor) · T3: pick the pink object and place it on the target. See demo/ for rollout videos.

| Method | Nᵥ / Nₐ | T1 | T2 | T3 | Average |
| --- | --- | --- | --- | --- | --- |
| LingBot-VA | 3 / 10 | 50 | 70 | 80 | 66.7 |
| LingBot-VA (reduced NFE) | 1 / 2 | 30 | 30 | 60 | 40.0 |
| LingBot-VA + Video-only LCM | 1 / 2 | 30 | 50 | 50 | 43.3 |
| Flash-WAM | 1 / 2 | 50 | 60 | 70 | 60.0 |
| LingBot-VA (reduced NFE) | 1 / 1 | 10 | 30 | 30 | 23.3 |
| LingBot-VA + Video-only LCM | 1 / 1 | 20 | 40 | 40 | 33.3 |
| Flash-WAM | 1 / 1 | 40 | 50 | 60 | 50.0 |
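The Average column is the plain mean over the three tasks, and with 10 rollouts per task each successful rollout moves a task's rate by 10 points. The sketch below recomputes both quantities from the per-task rates in the table; the helper name `summarize` is illustrative.

```python
# Per-task success rates (%) for T1, T2, T3, copied from the table above.
ROWS = {
    ("LingBot-VA", "3/10"): (50, 70, 80),
    ("LingBot-VA (reduced NFE)", "1/2"): (30, 30, 60),
    ("LingBot-VA + Video-only LCM", "1/2"): (30, 50, 50),
    ("Flash-WAM", "1/2"): (50, 60, 70),
    ("LingBot-VA (reduced NFE)", "1/1"): (10, 30, 30),
    ("LingBot-VA + Video-only LCM", "1/1"): (20, 40, 40),
    ("Flash-WAM", "1/1"): (40, 50, 60),
}
ROLLOUTS_PER_TASK = 10


def summarize(rates):
    """Return (mean success rate in %, total successful rollouts)."""
    mean_rate = sum(rates) / len(rates)
    successes = sum(r * ROLLOUTS_PER_TASK // 100 for r in rates)
    return round(mean_rate, 1), successes


for (method, budget), rates in ROWS.items():
    avg, wins = summarize(rates)
    total = len(rates) * ROLLOUTS_PER_TASK
    print(f"{method:30s} {budget:>5s}  avg={avg:5.1f}%  successes={wins}/{total}")
```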

📝 Citation

@misc{akbari2026flashwammodalityawaredistillationworld,
      title={Flash-WAM: Modality-Aware Distillation for World Action Models}, 
      author={Arman Akbari and Ci Zhang and Arash Akbari and Lin Zhao and Yixiao Chen and Weiwei Chen and Xuan Zhang and Geng Yuan and Yanzhi Wang},
      year={2026},
      eprint={2606.05254},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.05254}, 
}

Acknowledgements

Built on LingBot-VA and evaluated on RoboTwin 2.0. Licensed under Apache-2.0.
