CuFlash-Attn

High-performance CUDA C++ FlashAttention implementation from scratch

English · 简体中文 · Documentation · API Reference

🎯 Overview

CuFlash-Attn is a from-scratch implementation of the FlashAttention algorithm, optimized for educational purposes, research experimentation, and production integration.

Project Status

Status: Stable v0.3.0 codebase under maintenance cleanup
Source of Truth: public headers, implementation, tests, and user-facing docs
Positioning: Lean reference implementation for learning, auditing, and lightweight integration
Current Focus: Removing stale workflow layers, tightening docs, and fixing long-tail defects rather than adding new features

Why CuFlash-Attn?

Challenge	Solution
📚 Learn FlashAttention	Clean, well-documented CUDA kernels with step-by-step algorithm implementation
🔬 Research & Experiment	Modify attention mechanisms without complex framework dependencies
🚀 Production Ready	C++ API with C ABI bindings for seamless Python integration via ctypes
⚡ GPU Optimized	Multi-architecture support from V100 (sm_70) to H100 (sm_90)

✨ Key Features

Feature	Description
⚡ O(N) Memory	Linear memory complexity vs O(N²) in standard attention — handle 16K+ sequences
🔢 Dual Precision	FP32 & FP16 support for both forward and backward passes
🔁 Full Training	Complete forward/backward with gradient computation
🎭 Causal Masking	Built-in support for autoregressive models (GPT-style)
🔧 Easy Integration	Clean C++ API + C ABI for Python ctypes integration
🏎️ Multi-Arch	Optimized CUDA kernels for sm_70 → sm_90 (V100 → H100)
🧪 Comprehensive Tests	Unit tests, integration tests, stress tests, PyTorch comparison
📊 Benchmarks	Google Benchmark integration for performance tracking
📚 Bilingual Docs	Complete English & Chinese documentation

Comparison with Other Libraries

Feature	CuFlash-Attn	PyTorch SDPA	xFormers	FlashAttention-2
Educational Code	✅ Clean	❌ Complex	❌ Complex	⚠️ Moderate
Custom Modifications	✅ Easy	⚠️ Hard	⚠️ Hard	⚠️ Hard
No Framework Dependency	✅ Yes	❌ PyTorch	❌ PyTorch	❌ PyTorch/Cutlass
Python Binding	✅ ctypes	✅ Native	✅ Native	✅ PyTorch
Training Support	✅ Full	✅ Full	✅ Full	✅ Full
BF16 Support	⚠️ Coming	✅ Yes	✅ Yes	✅ Yes

Choose CuFlash-Attn when: You want to understand, modify, or embed FlashAttention without heavy dependencies.

🚀 Quick Start

Prerequisites

GPU: NVIDIA GPU with Compute Capability 7.0+ (V100, RTX 20/30/40, A100, H100)
CUDA Toolkit: 11.0 or later
CMake: 3.18 or later
Compiler: GCC 7+, Clang 5+, or MSVC 2017+ (C++17 support required)

Option 1: Build from Source

# Clone repository
git clone https://github.com/AICL-Lab/cuflash-attn.git
cd cuflash-attn

# Build with preset (Release mode)
cmake --preset release
cmake --build --preset release

# Run tests
ctest --preset release --output-on-failure

Option 2: Docker (Recommended for Quick Testing)

# Build Docker image
docker build -t cuflash-attn .

# Run with GPU support
docker run --gpus all -it cuflash-attn

# Inside container: run benchmarks
./build/release/benchmarks/cuflash_attn_bench

Your First C++ Program

#include <cuda_runtime.h>
#include "cuflash/flash_attention.h"
#include <iostream>
#include <vector>
#include <cmath>
#include <random>

// Helper: Initialize device memory with random values
void init_random(float* d_ptr, size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> h_data(n);
    for (auto& v : h_data) v = dist(gen);
    cudaMemcpy(d_ptr, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
}

int main() {
    const int B = 2, H = 8, N = 1024, D = 64;
    const float scale = 1.0f / std::sqrt(static_cast<float>(D));
    const size_t qkv_size = B * H * N * D;
    const size_t l_size = B * H * N;
    
    // Allocate device memory
    float *d_Q, *d_K, *d_V, *d_O, *d_L;
    cudaMalloc(&d_Q, qkv_size * sizeof(float));
    cudaMalloc(&d_K, qkv_size * sizeof(float));
    cudaMalloc(&d_V, qkv_size * sizeof(float));
    cudaMalloc(&d_O, qkv_size * sizeof(float));
    cudaMalloc(&d_L, l_size * sizeof(float));
    
    // Initialize with random data
    init_random(d_Q, qkv_size, 1);
    init_random(d_K, qkv_size, 2);
    init_random(d_V, qkv_size, 3);
    
    // Compute FlashAttention with causal masking
    auto err = cuflash::flash_attention_forward(
        d_Q, d_K, d_V, d_O, d_L,
        B, H, N, D, scale,
        true  // causal masking
    );
    
    if (err != cuflash::FlashAttentionError::SUCCESS) {
        std::cerr << "Error: " << cuflash::get_error_string(err) << std::endl;
        return 1;
    }
    
    std::cout << "FlashAttention forward pass completed successfully!" << std::endl;
    
    // Cleanup
    cudaFree(d_Q); cudaFree(d_K); cudaFree(d_V);
    cudaFree(d_O); cudaFree(d_L);
    return 0;
}

📖 More examples: See examples/ directory for complete programs including backward pass and Python integration.

Python Integration

import ctypes
import numpy as np
import torch

# Load the shared library
lib = ctypes.CDLL("./build/release/libcuflash_attn.so")

# Define API
lib.cuflash_attention_forward_f32.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,
    ctypes.c_void_p, ctypes.c_void_p,
    ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_int,
    ctypes.c_float, ctypes.c_bool, ctypes.c_void_p
]
lib.cuflash_attention_forward_f32.restype = ctypes.c_int

# Prepare data with PyTorch
B, H, N, D = 2, 8, 1024, 64
Q = torch.randn(B, H, N, D, dtype=torch.float32, device='cuda')
K = torch.randn(B, H, N, D, dtype=torch.float32, device='cuda')
V = torch.randn(B, H, N, D, dtype=torch.float32, device='cuda')
O = torch.empty_like(Q)
L = torch.empty(B, H, N, dtype=torch.float32, device='cuda')

# Call CuFlash-Attn
scale = 1.0 / np.sqrt(D)
result = lib.cuflash_attention_forward_f32(
    ctypes.c_void_p(Q.data_ptr()),
    ctypes.c_void_p(K.data_ptr()),
    ctypes.c_void_p(V.data_ptr()),
    ctypes.c_void_p(O.data_ptr()),
    ctypes.c_void_p(L.data_ptr()),
    B, H, N, D, scale, True, None
)

assert result == 0, f"FlashAttention failed with error code {result}"
print(f"Output shape: {O.shape}, mean: {O.mean().item():.4f}")

📖 Full Python example: See examples/python_binding.py

📊 Performance

Memory Efficiency

Sequence Length	Standard Attention	FlashAttention	Savings
1,024	4 MB	8 KB	99.8%
4,096	64 MB	32 KB	99.95%
16,384	1 GB	128 KB	99.99%

Time Performance (A100-40GB, FP16, Batch=8, Heads=16)

Seq Length	Standard (ms)	CuFlash-Attn (ms)	Speed-up
1,024	2.45	0.38	6.4x
4,096	12.8	1.52	8.4x
8,192	52.3	5.86	8.9x
16,384	Out of Memory	23.4	∞

Benchmarks run with causal masking enabled. See benchmarks/ for methodology.

Run Benchmarks

cmake --preset release
cmake --build --preset release
./build/release/benchmarks/cuflash_attn_bench

📖 Documentation

Quick Links

Resource	Link
📘 Full Documentation	https://aicl-lab.github.io/cuflash-attn/
🔌 API Reference	English API Docs
🧠 Algorithm Deep Dive	FlashAttention Explained
🔧 Build Guide	Building from Source
❓ Troubleshooting	Common Issues & Solutions

Documentation Languages

🇬🇧 English Documentation
🇨🇳 中文文档

⚙️ Configuration

Supported Parameters

Parameter	Values	Notes
`head_dim`	32, 64, 128	Required for kernel optimization
Data Types	FP32 (`float`), FP16 (`half`)	Both forward & backward
Causal Mask	Optional	Enabled/disabled at runtime
Batch Size	≥ 1	Any positive integer
Sequence Length	≥ 1	Optimized for 1K-16K+
Number of Heads	≥ 1	Any positive integer

GPU Architecture Support

Architecture	Compute	Example GPUs
Volta	sm_70	V100
Turing	sm_75	RTX 2080 Ti
Ampere	sm_80, sm_86	A100, RTX 3090
Ada Lovelace	sm_89	RTX 4090
Hopper	sm_90	H100

Default build targets: sm_80, sm_86 (A100 + RTX 30xx/40xx)

Customize with: cmake --preset release -DCMAKE_CUDA_ARCHITECTURES="90"

🏗️ Project Structure

cuflash-attn/
├── benchmarks/                 # Performance benchmarks (Google Benchmark)
├── cmake/                      # CMake modules & packaging
├── docs/                       # VitePress documentation site
│   ├── en/                     # English documentation
│   ├── zh/                     # Chinese documentation
│   └── public/                 # Static assets (logos, favicons)
├── examples/                   # Complete usage examples
│   ├── basic_usage.cu          # C++ basic example
│   └── python_binding.py       # Python ctypes example
├── include/cuflash/            # Public API headers
│   ├── flash_attention.h       # Main API with C++ and C ABI
│   ├── export.h                # Visibility macros
│   └── version.h.in            # Version header template
├── src/                        # Implementation
│   ├── api/                    # API dispatch layer
│   ├── forward/                # Forward kernel implementations
│   ├── backward/               # Backward kernel implementations
│   └── kernels/                # Internal kernel utilities (.cuh)
├── tests/                      # Test suite
│   ├── unit/                   # Unit tests (8 files)
│   ├── integration/            # Integration tests + PyTorch comparison
│   └── package_smoke/          # Package smoke tests
├── CMakeLists.txt              # Main build configuration
├── CMakePresets.json           # Build presets (release, debug, asan)
├── Dockerfile                  # Container build
└── .github/workflows/          # CI/CD workflows
    ├── ci.yml                  # Matrix builds, tests, benchmarks
    ├── codeql.yml              # Security scanning
    ├── pages.yml               # Docs deployment
    └── release.yml             # Release automation

🧪 Testing & Quality

Test Categories

# Run all tests
ctest --preset release --output-on-failure

# Run specific test categories
ctest --preset release -R ForwardTest    # Forward pass tests
ctest --preset release -R BackwardTest   # Backward pass tests
ctest --preset release -R StressTest     # Stress & edge cases
ctest --preset release -R PyTorch        # PyTorch comparison (requires GPU + PyTorch)

Code Quality Tools

✅ clang-format: Automated code formatting (enforced in CI)
✅ clang-tidy: Static analysis with 50+ checks
✅ CodeQL: Weekly security scanning
✅ Sanitizers: AddressSanitizer & UBSan support (debug builds)

# Build with AddressSanitizer
cmake --preset debug-asan
cmake --build --preset debug-asan
ctest --preset debug-asan

🤝 Contributing

Contributions are welcome. Keep changes small, explicit, and aligned with the current CUDA library boundary.

Getting Started

Clone the repository and confirm the current scope
Build with presets and validate the environment
Write or update tests that cover your change
Update docs when behavior, API usage, or workflows change
Open a pull request with a concise summary and validation notes

Development Workflow

# Format code before committing
find . -name "*.cu" -o -name "*.cuh" -o -name "*.cpp" -o -name "*.h" | xargs clang-format -i

# Run tests locally
cmake --preset release && cmake --build --preset release
ctest --preset release --output-on-failure

# Optional: Run clang-tidy
clang-tidy src/api/flash_attention_api.cu -- -Iinclude

📋 Detailed guidelines: See CONTRIBUTING.md

📄 License

This project is licensed under the MIT License.

📚 References

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness — Dao et al., NeurIPS 2022
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning — Dao, ICLR 2024

📈 Version History

See CHANGELOG.md for detailed version history and updates.

_{Built with ❤️ for efficient attention computation}
_{Lean Reference Implementation · CUDA C++ · Open Source}

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
.github		.github
.vscode		.vscode
benchmarks		benchmarks
cmake		cmake
docs		docs
examples		examples
include/cuflash		include/cuflash
scripts		scripts
src		src
tests		tests
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.clangd		.clangd
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
CONTEXT.md		CONTEXT.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
README.zh-CN.md		README.zh-CN.md

Folders and files

Latest commit

History

Repository files navigation

CuFlash-Attn

🎯 Overview

Project Status

Why CuFlash-Attn?

✨ Key Features

Comparison with Other Libraries

🚀 Quick Start

Prerequisites

Option 1: Build from Source

Option 2: Docker (Recommended for Quick Testing)

Your First C++ Program

Python Integration

📊 Performance

Memory Efficiency

Time Performance (A100-40GB, FP16, Batch=8, Heads=16)

Run Benchmarks

📖 Documentation

Quick Links

Documentation Languages

⚙️ Configuration

Supported Parameters

GPU Architecture Support

🏗️ Project Structure

🧪 Testing & Quality

Test Categories

Code Quality Tools

🤝 Contributing

Getting Started

Development Workflow

📄 License

📚 References

📈 Version History

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages