PagedAttention kv-cache quantization (F8E4M3) by EricLBuehler · Pull Request #1400 · EricLBuehler/mistral.rs

EricLBuehler · 2025-06-02T02:07:48Z

Summary by CodeRabbit

New Features
- Added support for FP8 (8-bit floating point) quantized key-value caches in PagedAttention for CUDA and Metal backends with optional per-tensor scaling factors.
- Introduced a configurable cache data type for PagedAttention via a new cache_type setting, which accepts auto or f8e4m3.
- Enabled specifying the PagedAttention cache type through CLI arguments, Python, and Rust builder APIs.
- Added support for new model architectures: Qwen3 and GLM4.
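
To illustrate what per-tensor FP8 quantization of a KV cache involves, here is a minimal CPU-side sketch in Rust. It is not the PR's CUDA or Metal kernel code. All function names are hypothetical, the scale is assumed to be amax / 448 (448 being the largest finite F8E4M3 value), and the encoder uses a brute-force nearest-value search rather than hardware rounding.

```rust
/// Largest finite magnitude representable in F8E4M3 (bit pattern 0x7E).
const F8E4M3_MAX: f32 = 448.0;

/// Decode one F8E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
/// The format has no infinities; exponent 0b1111 with mantissa 0b111 is NaN.
fn f8e4m3_to_f32(bits: u8) -> f32 {
    let sign: f32 = if bits & 0x80 != 0 { -1.0 } else { 1.0 };
    let exp = ((bits >> 3) & 0x0F) as i32;
    let man = (bits & 0x07) as f32;
    if exp == 0x0F && (bits & 0x07) == 0x07 {
        return f32::NAN;
    }
    if exp == 0 {
        // Subnormal: no implicit leading one, fixed exponent of -6.
        sign * (man / 8.0) * 2f32.powi(-6)
    } else {
        sign * (1.0 + man / 8.0) * 2f32.powi(exp - 7)
    }
}

/// Encode by clamping to the finite range and picking the closest of the
/// 256 codes. Simple and slow; ties are not resolved as round-to-nearest-even.
fn f32_to_f8e4m3(x: f32) -> u8 {
    let x = x.clamp(-F8E4M3_MAX, F8E4M3_MAX);
    let mut best = 0u8;
    let mut best_err = f32::INFINITY;
    for code in 0..=u8::MAX {
        let v = f8e4m3_to_f32(code);
        if v.is_nan() {
            continue;
        }
        let err = (v - x).abs();
        if err < best_err {
            best = code;
            best_err = err;
        }
    }
    best
}

/// Quantize a tensor with one scale (assumed here to be amax / 448).
fn quantize_per_tensor(values: &[f32]) -> (Vec<u8>, f32) {
    let amax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if amax > 0.0 { amax / F8E4M3_MAX } else { 1.0 };
    let codes = values.iter().map(|v| f32_to_f8e4m3(v / scale)).collect();
    (codes, scale)
}

/// Recover approximate values by decoding each byte and multiplying by the scale.
fn dequantize_per_tensor(codes: &[u8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| f8e4m3_to_f32(c) * scale).collect()
}
```

Storing one byte per element plus a single f32 scale is what lets the cache shrink relative to a 16-bit cache; the cost is the rounding error introduced by the 3-bit mantissa.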
Enhancements
- Improved Metal device memory allocation to respect system memory limits.
- Extended kernels and backends to transparently handle new data types and scaling factors.
- Added thread-safe tracking of scaling factors in PagedAttention, for use in attention computations over the quantized cache.
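
One way thread-safe scale tracking could work is sketched below. This is an assumption for illustration only: the PR summary does not say how scales are tracked, and the AmaxTracker type and its methods are hypothetical. The sketch keeps a running maximum of absolute values in an atomic integer, relying on the fact that non-negative f32 bit patterns sort in the same order as the floats they encode.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical tracker for the largest absolute value observed so far.
struct AmaxTracker {
    // Bit pattern of a non-negative f32; 0 encodes 0.0.
    bits: AtomicU32,
}

impl AmaxTracker {
    fn new() -> Self {
        Self { bits: AtomicU32::new(0) }
    }

    /// Fold a batch of values into the running maximum without a lock.
    fn observe(&self, values: &[f32]) {
        let local = values
            .iter()
            .filter(|v| v.is_finite())
            .fold(0.0f32, |m, v| m.max(v.abs()));
        // Integer max on the bit patterns matches float max for values >= 0.
        self.bits.fetch_max(local.to_bits(), Ordering::Relaxed);
    }

    /// Derive a per-tensor scale from the tracked maximum (assumed amax / fp8_max).
    fn scale(&self, fp8_max: f32) -> f32 {
        let amax = f32::from_bits(self.bits.load(Ordering::Relaxed));
        if amax > 0.0 { amax / fp8_max } else { 1.0 }
    }
}
```

A Mutex around an f32 would work equally well; the atomic version simply avoids blocking when several threads report values at once.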
Documentation
- Updated APIs and Python type hints to include new cache type options and model support.
- Added documentation for KV cache quantization feature with usage examples.
Bug Fixes
- Prevented integer underflow when decrementing a sequence's choice count by using saturating subtraction.
- Improved error handling and argument validation for cache type and scaling factors.
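
The saturating-subtraction fix can be illustrated with a short Rust sketch. The SequenceState struct and its field are hypothetical stand-ins, not the actual types in the codebase; only the standard-library method saturating_sub is real.

```rust
/// Hypothetical stand-in for per-sequence bookkeeping.
struct SequenceState {
    remaining_choices: usize,
}

impl SequenceState {
    /// Decrement the counter, stopping at zero.
    /// A plain `self.remaining_choices -= 1` on a zero count would panic in
    /// debug builds and wrap around to usize::MAX in release builds.
    fn finish_choice(&mut self) {
        self.remaining_choices = self.remaining_choices.saturating_sub(1);
    }
}
```

Calling finish_choice more times than there are choices leaves the count at zero instead of failing.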