svg-project/Quant-VideoGen

QuantVideoGen

Quantized KV-cache compression for video generation models

QuantVideoGen is a lightweight KV-cache quantization toolkit for autoregressive video generation. It compresses the long-horizon attention (KV) cache during inference and includes experiment integrations for LongCat-Video, Self-Forcing, and HY-WorldPlay.

✨ Highlights

  • Quantizes the KV cache with Triton kernels for k-means and staged product quantization.
  • Leaves the original model weights unchanged: quantization is applied only to the KV cache at inference time.
  • Includes bf16 baseline and quantized launch scripts for three long-video / streaming generation codebases.
  • Targets memory-heavy, long-context settings in which the KV cache dominates peak memory usage.
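To make the first highlight concrete, here is a minimal NumPy sketch of k-means vector quantization applied in stages. Each stage clusters whatever error the previous stages left behind. This is plain residual quantization used as a stand-in. The repository's Triton "staged product quantization" kernels may partition and stage vectors differently. The names `kmeans`, `residual_quantize`, and `residual_dequantize` are hypothetical.

```python
import numpy as np


def kmeans(x: np.ndarray, k: int, iters: int = 10, seed: int = 0):
    """Lloyd's k-means on the rows of x (n, d). Returns (centroids (k, d), codes (n,))."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of x (requires k <= n); fancy indexing copies.
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        codes = ((x[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = x[codes == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(0)
    codes = ((x[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
    return centroids, codes


def residual_quantize(x: np.ndarray, k: int = 16, n_stages: int = 2):
    """Hypothetical multi-stage (residual) k-means: each stage encodes what the
    previous stages left over. Returns one (codebook, codes) pair per stage."""
    assert k <= 256, "codes are stored as uint8"
    stages, residual = [], x.astype(np.float32)
    for s in range(n_stages):
        centroids, codes = kmeans(residual, k, seed=s)
        stages.append((centroids, codes.astype(np.uint8)))
        residual = residual - centroids[codes]
    return stages


def residual_dequantize(stages):
    # The reconstruction is the sum of the selected centroid from every stage.
    return sum(centroids[codes] for centroids, codes in stages)


rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64)).astype(np.float32)  # toy (tokens, head_dim)
for n in (1, 2, 3):
    approx = residual_dequantize(residual_quantize(keys, k=16, n_stages=n))
    print(n, "stage(s): MSE", float(((keys - approx) ** 2).mean()))
```

Adding a stage should never increase the reconstruction error, because each k-means stage can do no worse than encoding its residual as zero. The cost is one more code per vector. A production kernel would also tile the distance computation rather than materialize the full (n, k, d) array as this sketch does.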

📦 Installation

conda create -n qvg python=3.12.9 -y
conda activate qvg

pip install uv

# Install all extras (recommended for reproducing every experiment).
uv pip install -e ".[all]"

# Or install only one experiment extra.
uv pip install -e ".[longcat]"
uv pip install -e ".[selfforcing]"
uv pip install -e ".[hyworldplay]"

# Flash Attention prebuilt wheel (CUDA 12, torch 2.8, Python 3.12, Linux x86_64).
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
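The wheel filename encodes what it was built for: flash-attn 2.8.3, CUDA 12, torch 2.8, the pre-C++11 ABI (cxx11abiFALSE), CPython 3.12, and Linux x86_64. That is why it pairs with the python=3.12.9 environment created above. If the install or import fails, a quick check like the one below can help. The helpers `parse_wheel_tags` and `check_environment` are hypothetical, and they compare only the Python, torch, and CUDA tags, not the C++ ABI or platform.

```python
import sys

# Wheel name copied from the install command above.
WHEEL = ("flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-"
         "cp312-cp312-linux_x86_64.whl")


def parse_wheel_tags(name: str) -> dict:
    """Split a flash-attn wheel name into the build tags it was compiled for."""
    dist, version, py_tag, abi_tag, platform = name.removesuffix(".whl").split("-")
    release, local = version.split("+")  # local: "cu12torch2.8cxx11abiFALSE"
    cuda = local[len("cu"):local.index("torch")]
    torch_ver = local[local.index("torch") + len("torch"):local.index("cxx11abi")]
    return {
        "package": dist, "version": release, "cuda": cuda, "torch": torch_ver,
        "cxx11_abi": local.endswith("TRUE"), "python": py_tag, "platform": platform,
    }


def check_environment(tags: dict) -> list:
    """Return human-readable mismatches between this interpreter and the wheel."""
    problems = []
    here = f"cp{sys.version_info.major}{sys.version_info.minor}"
    if here != tags["python"]:
        problems.append(f"Python tag {here} != wheel tag {tags['python']}")
    try:
        import torch  # only inspected if it is installed
    except ImportError:
        problems.append("torch is not installed")
    else:
        if not torch.__version__.startswith(tags["torch"] + "."):
            problems.append(f"torch {torch.__version__} != {tags['torch']}.x")
        cuda = torch.version.cuda  # None for CPU-only builds
        if cuda is None or cuda.split(".")[0] != tags["cuda"]:
            problems.append(f"torch CUDA {cuda} != CUDA {tags['cuda']}")
    return problems


tags = parse_wheel_tags(WHEEL)
print(tags)
print(check_environment(tags) or "environment matches the wheel tags")
```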

⬇️ Download Models

Download model checkpoints before running experiments.

# LongCat-Video
hf download meituan-longcat/LongCat-Video --local-dir ckpts/LongCat-Video

# Self-Forcing
bash scripts/Self-Forcing/download_models.sh

# HY-WorldPlay
bash scripts/HY-WorldPlay/download_models.sh

💥 Quick Start

Each integration provides a bf16 baseline script and a quantized script. By default, the quantized scripts use triton-nstages-kmeans-int2 with a block size of 64 and 256 centroids for K and V.
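The sketch below shows one plausible way to organize a block-wise quantized KV cache, loosely mirroring the default block size of 64. It makes several assumptions:

- The block size counts tokens.
- The newest, still-incomplete block stays in full precision.
- Finished blocks are dequantized on read.

The class name `BlockQuantizedKVCache` and these semantics are hypothetical, not the repository's API. The per-channel min-max INT2 quantizer is also a deliberately simpler substitute for the k-means codebooks the scripts actually use.

```python
import numpy as np


class BlockQuantizedKVCache:
    """Hypothetical cache: the newest tokens stay in full precision until a block
    of `block_size` tokens is complete; finished blocks are stored as 2-bit codes.
    Model weights are never touched; only cached K/V are compressed."""

    def __init__(self, block_size: int = 64, bits: int = 2):
        self.block_size = block_size
        self.levels = 2 ** bits - 1        # INT2 -> codes 0..3
        self.blocks = []                   # [(quantized K, quantized V), ...]
        self.tail_k, self.tail_v = [], []  # full precision, not yet a full block

    def _quantize(self, x: np.ndarray):
        # Per-channel asymmetric min-max quantization over the block's tokens
        # (a simple stand-in for k-means codebooks).
        lo = x.min(axis=0, keepdims=True)
        scale = np.maximum(x.max(axis=0, keepdims=True) - lo, 1e-8) / self.levels
        codes = np.clip(np.rint((x - lo) / scale), 0, self.levels).astype(np.uint8)
        return codes, scale, lo            # codes left unpacked for clarity

    @staticmethod
    def _dequantize(codes, scale, lo):
        return codes * scale + lo

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        """Add one token's key and value vectors, each of shape (head_dim,)."""
        self.tail_k.append(k)
        self.tail_v.append(v)
        if len(self.tail_k) == self.block_size:
            self.blocks.append((self._quantize(np.stack(self.tail_k)),
                                self._quantize(np.stack(self.tail_v))))
            self.tail_k, self.tail_v = [], []

    def get(self):
        """Return dequantized (K, V) of shape (num_tokens, head_dim); needs >= 1 token."""
        ks = [self._dequantize(*qk) for qk, _ in self.blocks]
        vs = [self._dequantize(*qv) for _, qv in self.blocks]
        if self.tail_k:
            ks.append(np.stack(self.tail_k))
            vs.append(np.stack(self.tail_v))
        return np.concatenate(ks), np.concatenate(vs)


rng = np.random.default_rng(0)
cache = BlockQuantizedKVCache(block_size=4)  # tiny block for the demo
keys = rng.standard_normal((10, 8))
vals = rng.standard_normal((10, 8))
for k, v in zip(keys, vals):
    cache.append(k, v)
K, V = cache.get()
```

With 10 tokens and a block size of 4, two blocks are quantized and the last two tokens are returned exactly. A real implementation would also pack four 2-bit codes per byte instead of storing one per uint8.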

# LongCat-Video
bash scripts/LongCat/run_bf16.sh
bash scripts/LongCat/run_qvg.sh

# Self-Forcing
bash scripts/Self-Forcing/run_bf16.sh
bash scripts/Self-Forcing/run_qvg.sh

# HY-WorldPlay
bash scripts/HY-WorldPlay/run_bf16.sh
bash scripts/HY-WorldPlay/run_qvg.sh

Outputs are written under results/. Quantization options can be changed directly in the corresponding run_qvg.sh script.

📊 Memory Results

The table below reports KV-cache memory for the provided scripts. Numbers are in MB.

| Model | Precision | Per-Layer KV Cache (MB) | Total KV Cache (MB) | Compression Rate |
| --- | --- | ---: | ---: | ---: |
| LongCat-Video | BF16 | 464.00 | 22272.00 | 1.00x |
| LongCat-Video | INT2 | 67.32 | 3231.28 | 6.89x |
| Self-Forcing | BF16 | 1535.76 | 46072.88 | 1.00x |
| Self-Forcing | INT2 | 220.45 | 6613.59 | 6.97x |
| HY-WorldPlay | BF16 | 990.00 | 29700.00 | 1.00x |
| HY-WorldPlay | INT2 | 141.18 | 4235.45 | 7.01x |

Across these runs, QuantVideoGen reduces total KV-cache memory by 6.89x to 7.01x, i.e. by about 85%.
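The table's columns are related by simple arithmetic:

- Total KV cache = per-layer KV cache × number of layers.
- Compression rate = BF16 total / INT2 total.

The short script below recomputes these from the table's numbers. Its only added assumption is 16 bits per BF16 element. It also derives the effective bits per cached element, which comes out at about 2.3 rather than exactly 2. That means some storage overhead exists beyond a pure 2-bit payload, which the table does not break down. `TABLE` and `summarize` are illustrative names.

```python
# KV-cache sizes in MB, copied from the Memory Results table.
TABLE = {
    # model: (BF16 per layer, BF16 total, INT2 per layer, INT2 total)
    "LongCat-Video": (464.00, 22272.00, 67.32, 3231.28),
    "Self-Forcing": (1535.76, 46072.88, 220.45, 6613.59),
    "HY-WorldPlay": (990.00, 29700.00, 141.18, 4235.45),
}


def summarize(model: str) -> dict:
    bf16_layer, bf16_total, int2_layer, int2_total = TABLE[model]
    rate = bf16_total / int2_total
    return {
        "layers_bf16": round(bf16_total / bf16_layer),  # implied layer count
        "layers_int2": round(int2_total / int2_layer),  # should agree with BF16
        "rate": round(rate, 2),                         # compression rate column
        "saved_pct": round(100 * (1 - 1 / rate), 1),    # memory saved, in percent
        "bits_per_elem": round(16 / rate, 2),           # BF16 = 16 bits per element
    }


for name in TABLE:
    print(name, summarize(name))
```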

✏️ Citation

@article{xi2026quant,
  title={Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization},
  author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Li, Muyang and Cai, Han and Li, Xingyang and Lin, Yujun and Zhang, Zhuoyang and Zhang, Jintao and Li, Xiuyu and others},
  journal={arXiv preprint arXiv:2602.02958},
  year={2026}
}

About

[ICML2026] Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
