Smart-Diffusion is a high-performance inference framework for diffusion models, built on Chitu. It targets fast inference and flexible scheduling for AI-generated content (AIGC) workloads.
Smart-Diffusion is the streamlined standalone edition of Chitu-Diffusion, developed by the PACMAN team at Tsinghua University and QingCheng.ai. To support the rapidly growing diffusion ecosystem, it restructures DiT models under Chitu's API and scheduling philosophy, keeping scheduling flexible without giving up performance.
- 🚀 High Performance: Optimized diffusion inference with advanced parallelism strategies
- 🔧 Flexible Architecture: Support for multiple attention backends (FlashAttention, SageAttention, SpargeAttention)
- 💾 Memory Efficient: Low memory mode with model offloading and VAE tiling
- 📊 Feature Cache: Unified FlexCache API for TeaCache, PAB, and DiTango
- 🎯 Easy to Use: Simple API with per-request parameter configuration
- 🌐 Multi-Model: Currently supports Wan-T2V series (1.3B, 14B, A14B) with more coming soon
Smart-Diffusion is built on three core pillars:
- Parallelism: Context parallelism (CP), CFG parallelism, and data parallelism
- Kernels: Optimized attention implementations with quantization support
- Algorithms: Feature reuse and caching strategies for acceleration
See Why Smart-Diffusion? for detailed design philosophy.
- Python 3.12+
- CUDA 12.4+ (recommended: 12.8)
- NVIDIA GPU with compute capability 8.0 or higher (Ampere, Hopper, or Blackwell)
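The requirements above can be turned into a quick pre-install check. The following is only a minimal sketch: `check_requirements` and `parse_version` are hypothetical helpers, not part of Smart-Diffusion, and the CUDA version and compute capability are passed in by hand rather than detected.

```python
import sys

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 12)
MIN_CUDA = (12, 4)
MIN_COMPUTE_CAPABILITY = (8, 0)


def parse_version(text: str) -> tuple[int, int]:
    """Turn a 'major.minor[.patch]' string into a (major, minor) tuple."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def check_requirements(python: tuple[int, int], cuda: str,
                       capability: tuple[int, int]) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(python) < MIN_PYTHON:
        problems.append(f"Python {python[0]}.{python[1]} is older than 3.12")
    if parse_version(cuda) < MIN_CUDA:
        problems.append(f"CUDA {cuda} is older than 12.4")
    if tuple(capability) < MIN_COMPUTE_CAPABILITY:
        problems.append(f"compute capability {capability} is below 8.0")
    return problems


if __name__ == "__main__":
    # The CUDA version and capability are typed in by hand here (assumed values).
    print(check_requirements(sys.version_info[:2], "12.8", (9, 0)))
```

On a real machine the CUDA version can be read from `nvcc --version`, as described in the installation steps.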
We recommend using uv for a smoother installation experience.
git clone git@github.com:chen-yy20/SmartDiffusion.git
cd SmartDiffusion
Option 1: Clone all submodules (sage-attn, sparge-attn and vbench):
git submodule update --init --recursive
Option 2: Clone only specific submodules. For example, to clone only the sage_attn and sparge_attn submodules:
git submodule update --init third_party/sage_attn
git submodule update --init third_party/sparge_attn
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
See the uv documentation for more details.
Check your CUDA version:
nvcc --version
Edit pyproject.toml to match your CUDA version. For CUDA 12.8:
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
explicit = true
[tool.uv.sources]
torch = { index = "pytorch-cu128" }
torchvision = { index = "pytorch-cu128" }
Configure the GPU architecture in pyproject.toml:
[tool.uv.extra-build-variables]
# Set TORCH_CUDA_ARCH_LIST according to your GPU
# Ampere: 8.0, Hopper: 9.0, Blackwell: 10.0 (data center) or 12.0 (consumer)
# Inline tables are kept on one line so that they are valid TOML 1.0
sageattention = { EXT_PARALLEL = "4", NVCC_APPEND_FLAGS = "--threads 8", MAX_JOBS = "32", TORCH_CUDA_ARCH_LIST = "8.0;9.0" }
spas_sage_attn = { EXT_PARALLEL = "4", NVCC_APPEND_FLAGS = "--threads 8", MAX_JOBS = "32", TORCH_CUDA_ARCH_LIST = "8.0;9.0" }
Install the dependencies:
# Required installation (base dependencies in [project.dependencies]), takes about 30 minutes
uv sync -v 2>&1 | tee uv_sync.log
# Optional extras from [project.optional-dependencies]
# SageAttention
uv sync -v --extra sage 2>&1 | tee build_sage.log
# SpargeAttention
uv sync -v --extra sparge 2>&1 | tee build_sparge.log
# VBench evaluation toolkit
uv sync -v --extra vbench 2>&1 | tee build_vbench.log
# Evaluation metrics (FID/FVD/PSNR/SSIM/LPIPS)
uv sync -v --extra eval 2>&1 | tee build_eval.log
# One-command extension install (sage + sparge + vbench + eval)
uv sync -v --all-extras 2>&1 | tee build_full.log
Alternatively, install with pip:
# Install from requirements.txt
pip install -r requirements.txt
# Install in editable mode
pip install -e .
Note: FlashAttention can be installed from a wheel published in its GitHub releases.
Smart-Diffusion currently supports the Wan-T2V series:
| Model ID | Parameters | Description |
|---|---|---|
| Wan-AI/Wan2.1-T2V-1.3B | 1.3B | Lightweight text-to-video model |
| Wan-AI/Wan2.1-T2V-14B | 14B | High-quality text-to-video model |
| Wan-AI/Wan2.2-T2V-A14B | 14B | Advanced two-stage text-to-video model |
More models are being added continuously. Stay tuned!
Create a test script test_generate.py:
from chitu_diffusion import chitu_init, chitu_generate, chitu_start
from chitu_diffusion.task import DiffusionUserParams, DiffusionTask, DiffusionTaskPool
from hydra import compose, initialize
# Initialize with configuration
initialize(config_path="config", version_base=None)
args = compose(config_name="wan")
# Set model checkpoint path
args.models.ckpt_dir = "/path/to/your/model/checkpoint"
# Initialize backend
chitu_init(args)
chitu_start()
# Create generation task
user_params = DiffusionUserParams(
    role="user1",
    prompt="A cat walking on grass.",
    num_inference_steps=50,
    height=480,
    width=848,
    num_frames=81,
    guidance_scale=7.0,
)
task = DiffusionTask.from_user_request(user_params)
DiffusionTaskPool.add(task)
# Generate: keep stepping the backend until every task in the pool is finished
while not DiffusionTaskPool.all_finished():
    chitu_generate()
print(f"Video saved to: {task.buffer.save_path}")
Only launching via srun is supported.
- Edit system_config.yaml to configure the model path, system params, and cfp.
- Run the unified launcher:
bash run.sh system_config.yaml
Optional runtime overrides:
bash run.sh system_config.yaml --num-nodes 2 --gpus-per-node 8 --cfp 2
Runtime notes:
- parallel.cfp (or --cfp) must be 1 or 2; the launcher maps it to infer.diffusion.cfg_size.
- infer.diffusion.cp_size is auto-derived as (num_nodes * gpus_per_node) / cfp.
- launch.tag is exported as CHITU_RUN_TAG and prefixes output run directory names.
- launch.enable_launch_log=true writes launcher logs to output.root_dir/launch_<timestamp>.log.
- CHITU_PYTHON_BIN can force the runtime Python; the default order is .venv/bin/python -> python -> python3.
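The cp_size rule in the runtime notes can be checked by hand before submitting a job. The sketch below mirrors the formula stated above; `derive_parallel_sizes` is a hypothetical helper rather than the launcher's own code, and the divisibility check is an added assumption.

```python
def derive_parallel_sizes(num_nodes: int, gpus_per_node: int, cfp: int) -> dict[str, int]:
    """Compute cfg_size and cp_size from the launcher arguments."""
    if cfp not in (1, 2):
        raise ValueError("cfp must be 1 or 2")
    world_size = num_nodes * gpus_per_node
    # Assumption: the GPU count must split evenly across the CFG branches.
    if world_size % cfp != 0:
        raise ValueError(f"{world_size} GPUs cannot be split evenly by cfp={cfp}")
    return {"cfg_size": cfp, "cp_size": world_size // cfp}


if __name__ == "__main__":
    # Matches the override example: --num-nodes 2 --gpus-per-node 8 --cfp 2
    print(derive_parallel_sizes(num_nodes=2, gpus_per_node=8, cfp=2))
```

With 2 nodes of 8 GPUs and cfp=2, the 16 GPUs form two CFG groups, each running context parallelism over 8 GPUs.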
Recommended system_config.yaml output section:
output:
  root_dir: outputs
  enable_run_log: true
  enable_timer_dump: true
  hydra_dump_mode: off # one of: default, video_dir, off
hydra_dump_mode=video_dir relocates Hydra .hydra metadata to the video output directory.
When enable_timer_dump=true, timer statistics are dumped as time_stats.csv in each run directory.
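To compare timings across runs, the dumped files can be collected with the standard library. This is a minimal sketch: `load_time_stats` is a hypothetical helper, and it assumes only that each time_stats.csv starts with a header row; the actual column names are not documented here.

```python
import csv
from pathlib import Path


def load_time_stats(root_dir: str) -> dict[str, list[dict[str, str]]]:
    """Map each run directory name to the rows of its time_stats.csv."""
    stats = {}
    # Search every run directory below the output root.
    for csv_path in sorted(Path(root_dir).rglob("time_stats.csv")):
        with csv_path.open(newline="") as handle:
            # Assumption: the first row of the file is a header.
            stats[csv_path.parent.name] = list(csv.DictReader(handle))
    return stats


if __name__ == "__main__":
    # "outputs" is the root_dir from the recommended output section.
    for run_name, rows in load_time_stats("outputs").items():
        print(run_name, len(rows), "rows")
```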
Configuration is split into three levels:
- Model Parameters (Static): Defined in chitu_core/config/models/<model>.yaml
- User Parameters (Dynamic): Set per-request via DiffusionUserParams
- System Parameters (Semi-static): Set in system_config.yaml
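Model and system parameters live in nested YAML, and command-line overrides address them with dotted keys such as infer.attn_type. The sketch below illustrates that addressing scheme in plain Python; `apply_overrides` is a hypothetical helper, not Hydra's implementation, and it keeps every value as a string instead of inferring types.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of config with 'a.b.c=value' overrides applied."""
    result = copy.deepcopy(config)  # leave the caller's config untouched
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(name, {})
        node[leaf] = value  # simplification: values stay strings
    return result


if __name__ == "__main__":
    base = {"infer": {"attn_type": "flash_attn"}}
    print(apply_overrides(base, ["infer.attn_type=sage",
                                 "infer.diffusion.low_mem_level=2"]))
```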
Example: Using a different attention backend
python test_generate.py \
    models.ckpt_dir=/path/to/checkpoint \
    infer.attn_type=sage \
    infer.diffusion.low_mem_level=2
Control your attention implementation with infer.attn_type:
| Type | Description | Performance |
|---|---|---|
| flash_attn | Default FlashAttention. High-performance full attention with no accuracy loss | Baseline |
| sage | SageAttention (NeurIPS 2025 spotlight). Training-free quantized attention | ~2x speedup |
| sparge | SpargeAttention (ICML 2025). Training-free sparse attention | ~3x speedup |
| auto | Automatically chooses the best backend | - |
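Because sage and sparge are optional extras, it can help to check which backends are importable before setting infer.attn_type. The sketch below is not the selection logic behind auto, which is not described here. `available_backends` is a hypothetical helper, and the module names are assumptions: sageattention and spas_sage_attn are taken from the pyproject.toml keys, flash_attn from the backend name.

```python
import importlib.util

# attn_type value -> Python module assumed to provide that backend.
BACKEND_MODULES = {
    "flash_attn": "flash_attn",
    "sage": "sageattention",
    "sparge": "spas_sage_attn",
}


def available_backends(find_spec=importlib.util.find_spec) -> list[str]:
    """List the attn_type values whose module can be found in this environment."""
    found = []
    for attn_type, module_name in BACKEND_MODULES.items():
        # find_spec returns None when a top-level module is not installed.
        if find_spec(module_name) is not None:
            found.append(attn_type)
    return found


if __name__ == "__main__":
    print("Installed attention backends:", available_backends())
```

The `find_spec` parameter exists only so the lookup can be replaced in tests.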
Example:
python test_generate.py infer.attn_type=sage
Control GPU memory usage with infer.diffusion.low_mem_level:
| Level | Behavior |
|---|---|
| 0 | All models loaded to GPU |
| 1 | VAE enables tiling |
| 2 | T5 encoder offloaded to CPU |
| ≥3 | DiT model offloaded to CPU |
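One way to read the table is as a set of switches that accumulate as the level rises. The sketch below makes that reading explicit; `memory_plan` is a hypothetical helper, and the cumulative behavior (level 2 also keeps VAE tiling on, and so on) is an assumption that the table does not state outright.

```python
def memory_plan(low_mem_level: int) -> dict[str, bool]:
    """Translate a low_mem_level into the memory-saving measures it enables."""
    if low_mem_level < 0:
        raise ValueError("low_mem_level must be non-negative")
    # Assumption: each level keeps the measures of all lower levels.
    return {
        "vae_tiling": low_mem_level >= 1,
        "offload_t5": low_mem_level >= 2,
        "offload_dit": low_mem_level >= 3,
    }


if __name__ == "__main__":
    for level in range(4):
        print(level, memory_plan(level))
```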
Example:
python test_generate.py infer.diffusion.low_mem_level=2
Enable feature reuse acceleration with infer.diffusion.enable_flexcache=true:
| cache_type | Method | Description |
|---|---|---|
| teacache | TeaCache | CVPR 2025. Timestep embeddings tell when to reuse the cache |
| pab | Pyramid Attention Broadcast | ICLR 2025. Pyramid attention broadcasting |
| ditango | DiTango | ASE + anchor-gated grouped reuse |
DiTango behavior notes (current implementation):
- The local partition is always computed each step and merged separately for stability.
- The anchor decision is step-level and synchronized across CFG positive/negative branches.
- cache_ratio controls both anchor trigger aggressiveness and the global ASE-threshold quantile update.
- The strategy implementation is in chitu_diffusion/flex_cache/strategy/ditango/ditango.py.
- A merged decision visualization is emitted to <output_dir>/ditango_policy_step_layer_group.ppm.
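PPM is a plain Netpbm image format, so the emitted visualization can be sanity-checked without an image library. The sketch below reads only the header; `read_ppm_header` is a hypothetical helper, and because it is not stated whether the file is ASCII (P3) or binary (P6), both magic numbers are accepted.

```python
def read_ppm_header(data: bytes) -> tuple[str, int, int, int]:
    """Return (magic, width, height, max_value) from the start of a PPM file."""
    tokens = []
    pos = 0
    while len(tokens) < 4:
        if pos >= len(data):
            raise ValueError("truncated PPM header")
        byte = data[pos:pos + 1]
        if byte == b"#":
            # Header comments run to the end of the line.
            newline = data.find(b"\n", pos)
            pos = len(data) if newline == -1 else newline + 1
        elif byte.isspace():
            pos += 1
        else:
            end = pos
            while end < len(data) and not data[end:end + 1].isspace():
                end += 1
            tokens.append(data[pos:end].decode("ascii"))
            pos = end
    magic, width, height, max_value = tokens
    if magic not in ("P3", "P6"):
        raise ValueError(f"not a PPM file: magic {magic!r}")
    return magic, int(width), int(height), int(max_value)


if __name__ == "__main__":
    sample = b"P6\n# tiny example\n4 2\n255\n" + bytes(4 * 2 * 3)
    print(read_ppm_header(sample))
```

For a real run, pass the bytes read from the .ppm file in the output directory.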
Unified per-request API:
from chitu_diffusion.task import DiffusionUserParams, FlexCacheParams
user_params = DiffusionUserParams(
    prompt="A cat walking on grass.",
    flexcache_params=FlexCacheParams(
        strategy="teacache",  # teacache / pab / ditango
        cache_ratio=0.4,      # 0 quality-first, 1 speed-first
        warmup=5,
        cooldown=5,
    ),
)
Legacy style is still supported:
user_params = DiffusionUserParams(
    prompt="A cat walking on grass.",
    flexcache='teacache',
    # ... other params
)
Enable automatic evaluation with eval.eval_type (multi-select):
python test_generate.py 'eval.eval_type=[vbench,fid,psnr]' eval.reference_path=/path/to/reference_videos
Supported evaluation methods:
- vbench: VBench custom-mode evaluation
- fid: Frechet Inception Distance (requires reference_path)
- fvd: Frechet Video Distance (requires reference_path)
- psnr: Peak Signal-to-Noise Ratio (requires reference_path)
- ssim: Structural Similarity Index (requires reference_path)
- lpips: Learned Perceptual Image Patch Similarity (requires reference_path)
Behavior notes:
- eval.eval_type=[] or null disables evaluation.
- Metrics requiring references are skipped with a warning if eval.reference_path is missing or invalid.
- Results are saved under ./vbench_out/ (vbench) and ./eval_out/ (other metrics).
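The skip rule in the behavior notes can be previewed before launching a long generation job. The sketch below follows the list of supported methods above; `plan_evaluation` is a hypothetical helper, and treating "invalid" as "path does not exist" is an assumption.

```python
import os

# Metrics that compare against reference videos, per the list above.
NEEDS_REFERENCE = {"fid", "fvd", "psnr", "ssim", "lpips"}
SUPPORTED = {"vbench"} | NEEDS_REFERENCE


def plan_evaluation(eval_types, reference_path):
    """Split the requested metrics into (runnable, skipped) lists."""
    if not eval_types:  # [] or None disables evaluation
        return [], []
    unknown = [name for name in eval_types if name not in SUPPORTED]
    if unknown:
        raise ValueError(f"unsupported eval types: {unknown}")
    # Assumption: a reference path is valid when it exists on disk.
    has_reference = bool(reference_path) and os.path.exists(reference_path)
    runnable, skipped = [], []
    for name in eval_types:
        if name in NEEDS_REFERENCE and not has_reference:
            skipped.append(name)
        else:
            runnable.append(name)
    return runnable, skipped


if __name__ == "__main__":
    print(plan_evaluation(["vbench", "fid", "psnr"], None))
```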
- Why Smart-Diffusion? - Design philosophy and architecture
- API Reference - Detailed API documentation
- Configuration Guide - Complete configuration options
We welcome contributions! Smart-Diffusion is in active development.
To contribute:
- Fork the repository
- Create a feature branch
- Make your changes with proper documentation
- Submit a pull request
Please see our Developer Guide for parameter taxonomy and best practices.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- More diffusion model support (Flux2, Longcat-Video, FireRed etc.)
- More acceleration algorithms
- More parallelism strategies
- Better operator implementations
- Production-ready serving framework
- Comprehensive benchmarks
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use Smart-Diffusion in your research, please cite:
@software{smart_diffusion2025,
  title={Smart-Diffusion: High-Performance Diffusion Model Inference Framework},
  author={{PACMAN Team, Tsinghua University} and {QingCheng.ai}},
  year={2025},
  url={https://github.com/chen-yy20/SmartDiffusion}
}
- Chitu - Base inference framework
- xDiT - Scalable Inference Engine for Diffusion Transformers
- SGLang-Diffusion - Image/Video Generation Framework
- SageAttention - Quantized attention implementation
- SpargeAttention - Sparse+Sage attention implementation
- FlashAttention - Efficient attention implementation
- TeaCache - Feature cache strategy
- PyramidAttentionBroadcast - PAB algorithm
Note: Smart-Diffusion is currently in a testing and development phase. We're working hard to make it better! Join us in building the future of AIGC acceleration. 🚀