Optimize FP8 KV Cache with dedicated scale_update kernel #1651

Merged
EricLBuehler merged 2 commits into EricLBuehler:master from guoqingbao:master on Nov 25, 2025

@guoqingbao

Contributor

Hi Eric,

This PR adds a dedicated kernel for FP8 KV cache scale updates, which improves overall performance by roughly 20%.
It depends on a separate Candle PR, EricLBuehler/candle#98, which must be merged first: the underlying Candle framework does not currently support alloc and zero operations for the FP8 data type, and both are needed when the FP8 KV cache is created.
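
For readers unfamiliar with FP8 scaling, here is a minimal CPU-side sketch of what a per-tensor scale update generally involves. It is not the kernel from this PR: the function names `update_scale` and `scale_and_clamp` are hypothetical, and the "keep the running maximum" policy is an assumption made for illustration. The only fixed fact it relies on is that the largest finite FP8 E4M3 value is 448.

```rust
/// Largest finite value representable in FP8 E4M3.
const F8E4M3_MAX: f32 = 448.0;

/// Hypothetical helper (not the PR's kernel): returns the scale to use after
/// seeing `values`. Assumes `current_scale` is positive and that the scale
/// only ever grows, so earlier cache entries stay representable.
fn update_scale(current_scale: f32, values: &[f32]) -> f32 {
    // Largest magnitude in the new block of keys or values.
    let absmax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    // Smallest scale that maps `absmax` onto the FP8 E4M3 range.
    let needed = absmax / F8E4M3_MAX;
    current_scale.max(needed)
}

/// Hypothetical helper: divides by the scale and clamps to the FP8 E4M3
/// range, which is the step that precedes the actual cast to FP8.
fn scale_and_clamp(x: f32, scale: f32) -> f32 {
    (x / scale).clamp(-F8E4M3_MAX, F8E4M3_MAX)
}
```

On a GPU the maximum would be computed as a parallel reduction rather than a sequential fold; the sketch only illustrates the arithmetic.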

Tested case:

cargo run --features cuda -- -i --pa-cache-type f8e4m3 gguf -m /data/shared/ -f Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf

@github-actions

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C Header                3           63           54            0            9
 CSS                     1          473          408           14           51
 Dockerfile              1           39           22            8            9
 HTML                    1           78           64            5            9
 JavaScript              7         1397         1068          180          149
 JSON                   22          410          407            0            3
 Makefile                1            6            5            0            1
 Python                102         5660         4631          298          731
 Shell                   1           63           26           18           19
 Plain Text              3         3723            0         2413         1310
 TOML                   23          877          809           11           57
 YAML                    2           21           19            2            0
-------------------------------------------------------------------------------
 Jupyter Notebooks       3            0            0            0            0
 |- Markdown             2           77           32           31           14
 |- Python               2          205          178            1           26
 (Total)                            282          210           32           40
-------------------------------------------------------------------------------
 Markdown               74         6981            0         5227         1754
 |- BASH                19          299          260           24           15
 |- JSON                11          523          523            0            0
 |- Python              14          521          434           35           52
 |- Rust                32         1320         1108           36          176
 |- TOML                 2           75           63            0           12
 (Total)                           9719         2388         5322         2009
-------------------------------------------------------------------------------
 Rust                  422       156830       138527         3993        14310
 |- Markdown           200         4348          285         3498          565
 (Total)                         161178       138812         7491        14875
===============================================================================
 Total                 666       176621       146040        12169        18412
===============================================================================

@sempervictus

Contributor

@guoqingbao @EricLBuehler - with #1722 looking rather viable, what are the technical challenges to upstreaming the work you gentlemen have been doing on your forks of the Candle ecosystem such as the dependency for this PR?

@sempervictus

Contributor

With #1722, we don't need the referenced forked-candle PR, as the relevant logic is already upstream.

@EricLBuehler merged commit 992c85e into EricLBuehler:master on Nov 25, 2025
3 of 13 checks passed
@EricLBuehler

Owner

Thank you @guoqingbao!

3 participants