QuantVideoGen is a lightweight KV-cache quantization toolkit for autoregressive video generation. It compresses long-horizon attention cache during inference, with experiment integrations for LongCat-Video, Self-Forcing, and HY-WorldPlay.
- Quantizes KV cache with Triton k-means / staged product quantization kernels.
- Keeps the original model weights unchanged; quantization is applied to inference cache.
- Includes bf16 and quantized launch scripts for three long-video / streaming generation repos.
- Targets memory-heavy long-context settings where KV cache dominates peak usage.
conda create -n qvg python=3.12.9 -y
conda activate qvg
pip install uv
# Everything, recommended for reproducing all experiments.
uv pip install -e ".[all]"
# Or install only one experiment extra.
uv pip install -e ".[longcat]"
uv pip install -e ".[selfforcing]"
uv pip install -e ".[hyworldplay]"
# Flash Attention, CUDA 12 / torch 2.8 wheel.
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whlDownload model checkpoints before running experiments.
# LongCat-Video
hf download meituan-longcat/LongCat-Video --local-dir ckpts/LongCat-Video
# Self-Forcing
bash scripts/Self-Forcing/download_models.sh
# HY-WorldPlay
bash scripts/HY-WorldPlay/download_models.shEach integration provides a bf16 baseline script and a quantized script. The quantized scripts currently use triton-nstages-kmeans-int2 with block size 64 and 256 K/V centroids by default.
# LongCat-Video
bash scripts/LongCat/run_bf16.sh
bash scripts/LongCat/run_qvg.sh
# Self-Forcing
bash scripts/Self-Forcing/run_bf16.sh
bash scripts/Self-Forcing/run_qvg.sh
# HY-WorldPlay
bash scripts/HY-WorldPlay/run_bf16.sh
bash scripts/HY-WorldPlay/run_qvg.shOutputs are written under results/. Quantization options can be changed directly in the corresponding run_qvg.sh script.
The table below reports KV-cache memory for the provided scripts. Numbers are in MB.
| Model | Precision | QVG | Per Layer KV | Total KV Cache | Compression Rate |
|---|---|---|---|---|---|
| LongCat-Video | BF16 | ✗ | 464.00 | 22272.00 | 1.00x |
| LongCat-Video | INT2 | ✓ | 67.32 | 3231.28 | 6.89x |
| Self-Forcing | BF16 | ✗ | 1535.76 | 46072.88 | 1.00x |
| Self-Forcing | INT2 | ✓ | 220.45 | 6613.59 | 6.97x |
| HY-WorldPlay | BF16 | ✗ | 990.00 | 29700.00 | 1.00x |
| HY-WorldPlay | INT2 | ✓ | 141.18 | 4235.45 | 7.01x |
Across these runs, QuantVideoGen reduces total KV-cache memory by about 85%.
@article{xi2026quant,
title={Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization},
author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Li, Muyang and Cai, Han and Li, Xingyang and Lin, Yujun and Zhang, Zhuoyang and Zhang, Jintao and Li, Xiuyu and others},
journal={arXiv preprint arXiv:2602.02958},
year={2026}
}