Optimize FP8 KV Cache with dedicated scale_update kernel by guoqingbao · Pull Request #1651 · EricLBuehler/mistral.rs

guoqingbao · 2025-10-11T11:22:39Z

Hi Eric,

This PR adds a dedicated kernel for FP8 KV scale updates, which improves overall performance by roughly 20%.
It depends on a separate Candle PR, EricLBuehler/candle#98, which must be merged first: Candle currently does not support the alloc and zero operations for the FP8 data type, and both are used when creating the FP8 KV cache.

Tested case:

cargo run --features cuda -- -i --pa-cache-type f8e4m3 gguf -m /data/shared/ -f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf
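
For readers unfamiliar with FP8 KV scaling, the idea behind a scale update can be sketched on the CPU. This is only an illustrative reference under stated assumptions, not the kernel from this PR: the function names `update_scale` and `scale_for_fp8` are hypothetical, the per-tensor running-maximum policy is assumed, and the only hard fact used is that the largest finite E4M3 magnitude is 448.

```rust
/// Largest finite magnitude representable in FP8 E4M3.
const F8E4M3_MAX: f32 = 448.0;

/// Hypothetical per-tensor scale update (assumed policy, not the PR's kernel):
/// keep the larger of the current scale and absmax / 448.
fn update_scale(current_scale: f32, values: &[f32]) -> f32 {
    // Reduce to the largest absolute value in the incoming block.
    let absmax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    current_scale.max(absmax / F8E4M3_MAX)
}

/// Divide by the scale and clamp to the E4M3 range; a real cache write
/// would cast the result to FP8 afterwards. Assumes `scale` is non-zero.
fn scale_for_fp8(value: f32, scale: f32) -> f32 {
    (value / scale).clamp(-F8E4M3_MAX, F8E4M3_MAX)
}

fn main() {
    let keys = [0.5f32, -896.0, 12.0];
    let scale = update_scale(1.0, &keys);
    println!("scale = {scale}");
    println!("scaled extreme = {}", scale_for_fp8(keys[1], scale));
}
```

On a GPU the absmax reduction and the scale write would typically be fused into one launch rather than composed from several tensor operations, which is presumably where a dedicated kernel saves time.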

github-actions · 2025-10-11T11:26:53Z

Code Metrics Report

===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           63           54            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           39           22            8            9
 HTML                    1           78           64            5            9
 JavaScript              7         1397         1068          180          149
 JSON                   22          410          407            0            3
 Makefile                1            6            5            0            1
 Python                102         5660         4631          298          731
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   23          877          809           11           57
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               74         6981            0         5227         1754
 |- BASH                19          299          260           24           15
 |- JSON                11          523          523            0            0
 |- Python              14          521          434           35           52
 |- Rust                32         1320         1108           36          176
 |- TOML                 2           75           63            0           12
 (Total)                           9719         2388         5322         2009
-------------------------------------------------------------------------------
 Rust                  422       156830       138527         3993        14310
 |- Markdown           200         4348          285         3498          565
 (Total)                         161178       138812         7491        14875
===============================================================================
 Total                 666       176621       146040        12169        18412
===============================================================================

sempervictus · 2025-11-23T20:45:14Z

@guoqingbao @EricLBuehler - with #1722 looking rather viable, what are the technical challenges to upstreaming the work you gentlemen have been doing on your forks of the Candle ecosystem such as the dependency for this PR?

sempervictus · 2025-11-23T21:52:07Z

With #1722, we don't need the referenced forked-candle PR, as the relevant logic is already upstream.

EricLBuehler · 2025-11-25T19:44:18Z

Thank you @guoqingbao!

guoqingbao added 2 commits October 11, 2025 11:14

Optimize FP8 KV Cache with dedicated scale_update kernel

1da00b9

Typo fix

71fa6f1

EricLBuehler merged commit 992c85e into EricLBuehler:master Nov 25, 2025
3 of 13 checks passed

EricLBuehler merged 2 commits into EricLBuehler:master from guoqingbao:master