QuantVideoGen

Quantized KV-cache compression for video generation models

QuantVideoGen is a lightweight KV-cache quantization toolkit for autoregressive video generation. It compresses the long-horizon attention (KV) cache during inference and ships experiment integrations for LongCat-Video, Self-Forcing, and HY-WorldPlay.

✨ Highlights

Quantizes KV cache with Triton k-means / staged product quantization kernels.
Keeps the original model weights unchanged; quantization is applied to the inference-time KV cache.
Includes bf16 and quantized launch scripts for three long-video / streaming generation repos.
Targets memory-heavy long-context settings where KV cache dominates peak usage.

📦 Installation

conda create -n qvg python=3.12.9 -y
conda activate qvg

pip install uv

# Everything, recommended for reproducing all experiments.
uv pip install -e ".[all]"

# Or install only one experiment extra.
uv pip install -e ".[longcat]"
uv pip install -e ".[selfforcing]"
uv pip install -e ".[hyworldplay]"

# Flash Attention, CUDA 12 / torch 2.8 wheel.
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
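
The prebuilt Flash Attention wheel encodes its build targets in the filename (CUDA 12, torch 2.8, CPython 3.12, cxx11abi FALSE), so it only installs cleanly into a matching environment. As a minimal sketch, the snippet below parses those tags and compares them with the active interpreter and, if installed, the local torch build. The helper names `wheel_requirements` and `environment_mismatches` are hypothetical and not part of this repository.

```python
import re
import sys

# Wheel filename taken from the install command above.
WHEEL = "flash_attn-2.8.3+cu12torch2.8cxx11abiFALSE-cp312-cp312-linux_x86_64.whl"


def wheel_requirements(filename):
    """Extract the build tags encoded in a flash-attn release wheel name."""
    m = re.search(
        r"\+cu(\d+)torch(\d+\.\d+)cxx11abi(TRUE|FALSE)-cp(\d)(\d+)-", filename
    )
    if m is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return {
        "cuda_major": m.group(1),
        "torch": m.group(2),
        "cxx11abi": m.group(3) == "TRUE",
        "python": (int(m.group(4)), int(m.group(5))),
    }


def environment_mismatches(req):
    """Return human-readable mismatches between the wheel and this environment."""
    problems = []
    if sys.version_info[:2] != req["python"]:
        problems.append(f"Python {sys.version_info[:2]} != {req['python']}")
    try:
        import torch
    except ImportError:
        problems.append("torch is not installed")
        return problems
    if not torch.__version__.startswith(req["torch"] + "."):
        problems.append(f"torch {torch.__version__} != {req['torch']}.x")
    cuda = torch.version.cuda  # None on CPU-only builds
    if cuda is None or cuda.split(".")[0] != req["cuda_major"]:
        problems.append(f"torch CUDA {cuda} != {req['cuda_major']}.x")
    if torch.compiled_with_cxx11_abi() != req["cxx11abi"]:
        problems.append("cxx11 ABI flag differs from the wheel")
    return problems


requirements = wheel_requirements(WHEEL)
if __name__ == "__main__":
    for problem in environment_mismatches(requirements):
        print("mismatch:", problem)
```

An empty result suggests the wheel should be compatible; any reported mismatch means a different wheel from the Flash Attention releases page is probably needed.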

⬇️ Download Models

Download model checkpoints before running experiments.

# LongCat-Video
hf download meituan-longcat/LongCat-Video --local-dir ckpts/LongCat-Video

# Self-Forcing
bash scripts/Self-Forcing/download_models.sh

# HY-WorldPlay
bash scripts/HY-WorldPlay/download_models.sh

💥 Quick Start

Each integration provides a bf16 baseline script and a quantized script. The quantized scripts currently use triton-nstages-kmeans-int2 with block size 64 and 256 K/V centroids by default.

# LongCat-Video
bash scripts/LongCat/run_bf16.sh
bash scripts/LongCat/run_qvg.sh

# Self-Forcing
bash scripts/Self-Forcing/run_bf16.sh
bash scripts/Self-Forcing/run_qvg.sh

# HY-WorldPlay
bash scripts/HY-WorldPlay/run_bf16.sh
bash scripts/HY-WorldPlay/run_qvg.sh

Outputs are written under results/. Quantization options can be changed directly in the corresponding run_qvg.sh script.
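
The default mode name, triton-nstages-kmeans-int2, points to multi-stage k-means codebooks. As a rough mental model only, the NumPy sketch below shows staged (residual) k-means quantization: each stage fits a codebook to whatever the previous stages left unexplained and stores one small integer code per vector. This is not the repository's Triton kernel. The function names, the 16-centroid codebooks, the two stages, and the 8-dimensional vectors are all illustrative assumptions, not the toolkit's actual layout.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns (centroids, uint8 assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every vector to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(0)
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1).astype(np.uint8)


def staged_quantize(x, k=16, stages=2):
    """Quantize x stage by stage, each stage coding the previous residual."""
    residual = x.astype(np.float32)
    codebooks, codes = [], []
    for stage in range(stages):
        centroids, assign = kmeans(residual, k, seed=stage)
        codebooks.append(centroids)
        codes.append(assign)
        residual = residual - centroids[assign]
    return codebooks, codes


def dequantize(codebooks, codes):
    """Reconstruct vectors by summing the selected centroid of every stage."""
    return sum(c[a] for c, a in zip(codebooks, codes))


# Toy stand-in for one block of cached K or V vectors.
rng = np.random.default_rng(0)
x = rng.standard_normal((512, 8)).astype(np.float32)

books1, codes1 = staged_quantize(x, k=16, stages=1)
books2, codes2 = staged_quantize(x, k=16, stages=2)
err1 = float(((x - dequantize(books1, codes1)) ** 2).mean())
err2 = float(((x - dequantize(books2, codes2)) ** 2).mean())
print(f"MSE with 1 stage: {err1:.4f}, with 2 stages: {err2:.4f}")
```

Adding a stage lowers reconstruction error at the cost of one more code per vector, which is the trade-off the quantization options in run_qvg.sh would be tuning.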

📊 Memory Results

The table below reports KV-cache memory for the provided scripts. Numbers are in MB.

Model	Precision	QVG	Per-Layer KV (MB)	Total KV Cache (MB)	Compression Ratio
LongCat-Video	BF16	✗	464.00	22272.00	1.00x
LongCat-Video	INT2	✓	67.32	3231.28	6.89x
Self-Forcing	BF16	✗	1535.76	46072.88	1.00x
Self-Forcing	INT2	✓	220.45	6613.59	6.97x
HY-WorldPlay	BF16	✗	990.00	29700.00	1.00x
HY-WorldPlay	INT2	✓	141.18	4235.45	7.01x

Across these runs, QuantVideoGen reduces total KV-cache memory by about 85 to 86% (6.89x to 7.01x compression).
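
The ratios and the percentage reduction follow directly from the table. The short script below recomputes them from the reported totals, along with two derived figures: the effective bits per cached element (assuming the BF16 baseline stores 16 bits per element) and the total-to-per-layer ratio, which presumably corresponds to the number of cached layers. The name `summarize` is a hypothetical helper, not part of the toolkit.

```python
# KV-cache sizes in MB, copied from the table above: (per_layer, total).
RESULTS = {
    "LongCat-Video": {"bf16": (464.00, 22272.00), "int2": (67.32, 3231.28)},
    "Self-Forcing": {"bf16": (1535.76, 46072.88), "int2": (220.45, 6613.59)},
    "HY-WorldPlay": {"bf16": (990.00, 29700.00), "int2": (141.18, 4235.45)},
}


def summarize(entry):
    """Derive compression figures from one model's BF16 and INT2 rows."""
    bf16_per_layer, bf16_total = entry["bf16"]
    _, int2_total = entry["int2"]
    ratio = bf16_total / int2_total
    return {
        "ratio": ratio,
        "reduction_pct": 100.0 * (1.0 - 1.0 / ratio),
        # Assumes the baseline stores 16 bits per cached element.
        "effective_bits": 16.0 / ratio,
        # Total divided by per-layer size; presumably the layer count.
        "layers": round(bf16_total / bf16_per_layer),
    }


summary = {name: summarize(entry) for name, entry in RESULTS.items()}
for name, s in summary.items():
    print(
        f"{name}: {s['ratio']:.2f}x, {s['reduction_pct']:.1f}% smaller, "
        f"~{s['effective_bits']:.2f} bits/element, {s['layers']} layers"
    )
```

The effective cost comes out somewhat above 2 bits per element, so the measured ratios sit below the ideal 8x for pure 2-bit codes. The remainder is presumably codebook and metadata storage, although the table does not break this down.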

✏️ Citation

@article{xi2026quant,
  title={Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization},
  author={Xi, Haocheng and Yang, Shuo and Zhao, Yilong and Li, Muyang and Cai, Han and Li, Xingyang and Lin, Yujun and Zhang, Zhuoyang and Zhang, Jintao and Li, Xiuyu and others},
  journal={arXiv preprint arXiv:2602.02958},
  year={2026}
}
