PagedAttention kv-cache quantization (F8E4M3) #1400

Merged
EricLBuehler merged 42 commits into master from paged_attention_kvquant
Jun 23, 2025
EricLBuehler commented Jun 2, 2025

Owner

Summary by CodeRabbit

  • New Features

    • Added support for FP8 (8-bit floating point) quantized key-value caches in PagedAttention for CUDA and Metal backends with optional per-tensor scaling factors.
    • Introduced a configurable cache data type for PagedAttention via a new cache_type setting (auto and f8e4m3).
    • Enabled specifying the PagedAttention cache type through CLI arguments, Python, and Rust builder APIs.
    • Added support for new model architectures: Qwen3 and GLM4.
  • Enhancements

    • Improved Metal device memory allocation to respect system memory limits.
    • Extended kernels and backends to transparently handle new data types and scaling factors.
    • Added thread-safe scale tracking in PagedAttention for improved attention computations.
  • Documentation

    • Updated APIs and Python type hints to include new cache type options and model support.
    • Added documentation for KV cache quantization feature with usage examples.
  • Bug Fixes

    • Prevented underflow in sequence choice count decrement with saturating subtraction.
    • Improved error handling and argument validation for cache type and scaling factors.
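
To make the "FP8 with optional per-tensor scaling factors" item concrete, here is a minimal sketch of how one scale per tensor can map values into the F8E4M3 range and back. It is not the PR's CUDA or Metal kernel code: every function name is hypothetical, quantized values are kept as `f32` on the F8E4M3 grid instead of being packed into bytes, and ties round away from zero where hardware typically rounds to even. The only format facts assumed are that F8E4M3 has 3 mantissa bits, a smallest normal exponent of -6, and a largest finite value of 448.

```rust
/// Largest finite magnitude representable in F8E4M3.
const F8E4M3_MAX: f32 = 448.0;

/// Round an f32 to the nearest value on the F8E4M3 grid (result stays an f32).
/// Hypothetical helper; simplification: ties round away from zero.
fn round_to_f8e4m3(x: f32) -> f32 {
    if x == 0.0 || x.is_nan() {
        return x;
    }
    // Saturate instead of overflowing.
    let clamped = x.clamp(-F8E4M3_MAX, F8E4M3_MAX);
    // Unbiased exponent read straight from the f32 bit pattern.
    let exp = ((clamped.to_bits() >> 23) & 0xff) as i32 - 127;
    // 3 mantissa bits give a step of 2^(exp - 3); below 2^-6 the format is
    // subnormal, so the step stays fixed at 2^-9.
    let step = 2f32.powi(exp.max(-6) - 3);
    (clamped / step).round() * step
}

/// One scale for the whole tensor, chosen so the largest magnitude maps to 448.
fn per_tensor_scale(values: &[f32]) -> f32 {
    let amax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if amax == 0.0 { 1.0 } else { amax / F8E4M3_MAX }
}

/// Divide by the scale, then snap to the F8E4M3 grid.
fn quantize(values: &[f32], scale: f32) -> Vec<f32> {
    values.iter().map(|v| round_to_f8e4m3(v / scale)).collect()
}

/// Multiply by the same scale to recover approximate original values.
fn dequantize(quantized: &[f32], scale: f32) -> Vec<f32> {
    quantized.iter().map(|v| v * scale).collect()
}

/// Quantize and immediately dequantize, to inspect the rounding error.
fn roundtrip(values: &[f32]) -> Vec<f32> {
    let scale = per_tensor_scale(values);
    dequantize(&quantize(values, scale), scale)
}
```

Calling `roundtrip(&[0.5, -2.0, 3.25, 0.01])` returns values that differ from the inputs by at most about 6.25% each, which is the half-step error of a 3-bit mantissa. The trade is half the cache memory of a 16-bit type for that loss of precision.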

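The saturating-subtraction fix listed under Bug Fixes can be illustrated with a small sketch. The `Sequence` struct and its field are hypothetical stand-ins, not the types touched by this PR; only `usize::saturating_sub` is the real standard-library call.

```rust
/// Hypothetical stand-in for a sequence that tracks how many choices remain.
struct Sequence {
    remaining_choices: usize,
}

impl Sequence {
    /// Decrement the counter, stopping at zero.
    /// A plain `self.remaining_choices -= 1` would panic in debug builds
    /// and wrap around to usize::MAX in release builds once it hits zero.
    fn consume_choice(&mut self) {
        self.remaining_choices = self.remaining_choices.saturating_sub(1);
    }
}
```

With `remaining_choices` starting at 1, calling `consume_choice` twice leaves the counter at 0 rather than underflowing.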

5 participants