[BugFix][Quant] Accumulate E5M6 BF16 amax in FP32 by JayceSu98 · Pull Request #16 · deepseek-ai/TileKernels

JayceSu98 · 2026-05-28T01:51:06Z

E5M6 per-token casts use the same scale-selection contract as the FP8/FP4 per-token path: compute a per-token/channel absmax, round the dequant scale when requested, then quantize through the inverse scale. The BF16 input path stored the amax reduction result in a BF16 fragment before computing the E5M6 scale.
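
The contract above can be sketched in plain Python. This is only an illustration, not the TileKernels kernel: `select_scale`, `ceil_pow2`, the `fmt_max` parameter, the `1e-4` clamp, and the choice to round the scale up to the next power of two are all assumptions made for the example, and the final E5M6 bit encoding is left out.

```python
import math

def ceil_pow2(x: float) -> float:
    """Smallest power of two >= x, for x > 0 (assumed rounding rule)."""
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= mantissa < 1
    if mantissa == 0.5:                 # x is already an exact power of two
        return math.ldexp(1.0, exponent - 1)
    return math.ldexp(1.0, exponent)

def select_scale(row, fmt_max, round_sf):
    """Return (dequant scale, inverse scale) for one token row.

    fmt_max is the largest finite magnitude of the target format; it is
    passed in rather than hard-coded because the exact E5M6 maximum is not
    stated in this description.
    """
    amax = max(abs(v) for v in row)   # step 1: per-token absmax
    amax = max(amax, 1e-4)            # hypothetical clamp to avoid a zero scale
    scale = amax / fmt_max
    if round_sf:                      # step 2: round the dequant scale
        scale = ceil_pow2(scale)
    return scale, 1.0 / scale         # step 3: quantize through the inverse scale

def scale_row(row, fmt_max, round_sf):
    """Apply the inverse scale; a real kernel would then encode to E5M6."""
    scale, inv_scale = select_scale(row, fmt_max, round_sf)
    return [v * inv_scale for v in row], scale
```

With an illustrative `fmt_max=4.0`, `select_scale([1.0, -3.0, 2.0], 4.0, True)` gives a scale of `1.0`, while `round_sf=False` keeps the unrounded `0.75`.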

On H100 with CUDA 13.2, this can underestimate the BF16 row amax for large hidden blocks. When round_sf=True, the resulting scale is one power of two too small, which changes the packed E5M6 bytes. In the cast-back test, the bad forward scale can also overflow BF16 dequantization to inf, so the cosine-style diff reports nan.
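
A small numeric sketch shows why a slightly low amax is enough to shift the rounded scale by a full power of two. The values here are made up for illustration: `FMT_MAX = 4.0` is a stand-in rather than the real E5M6 maximum, and rounding up to a power of two is an assumed reading of round_sf=True. The sketch does not model how the BF16 reduction loses the true maximum, only what happens once it has.

```python
import math

def ceil_pow2(x: float) -> float:
    """Smallest power of two >= x, for x > 0 (assumed rounding rule)."""
    mantissa, exponent = math.frexp(x)
    return math.ldexp(1.0, exponent - 1 if mantissa == 0.5 else exponent)

FMT_MAX = 4.0     # hypothetical format maximum, for illustration only
true_amax = 4.5   # the row's real absolute maximum
low_amax = 4.0    # an underestimated amax from the reduction

good_scale = ceil_pow2(true_amax / FMT_MAX)  # 1.125 rounds up to 2.0
bad_scale = ceil_pow2(low_amax / FMT_MAX)    # 1.0 stays at 1.0

# With the correct scale the largest element fits in range;
# with the low scale it lands above the format maximum.
fits_with_good = true_amax / good_scale <= FMT_MAX
fits_with_bad = true_amax / bad_scale <= FMT_MAX

print(good_scale, bad_scale, fits_with_good, fits_with_bad)
```

Because the rounded scale only moves in powers of two, any underestimate that crosses a power-of-two boundary halves the scale, and every element of the row is then encoded differently.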

Store the E5M6 amax reduction result in FP32 so scale selection matches the PyTorch reference and is not tied to packed BF16 reduction behavior. The cast-back kernel itself does not need a workaround; its nan failures were downstream of the incorrect forward scale.
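
To see why the nan can be a symptom rather than a cast-back bug, consider a minimal sketch of a cosine-style diff. The formula below is an assumption (the description only says "cosine-style"), and `cosine_style_diff` is a hypothetical name, not the test suite's helper. A single inf in the dequantized output is enough to turn the whole metric into nan.

```python
import math

def cosine_style_diff(x, y):
    """1 - 2*sum(x*y) / sum(x^2 + y^2); 0.0 when x and y are identical.

    Assumed form of the metric, used here only for illustration.
    """
    numerator = 2.0 * sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a + b * b for a, b in zip(x, y))
    return 1.0 - numerator / denominator

reference = [1.0, 2.0, 3.0]

# A faithful round trip gives a diff of exactly zero.
clean = cosine_style_diff(reference, [1.0, 2.0, 3.0])

# One element overflowed to inf during dequantization:
# numerator and denominator are both inf, and inf / inf is nan.
overflowed = cosine_style_diff(reference, [1.0, 2.0, math.inf])

print(clean, math.isnan(overflowed))
```

Under this reading, the cast-back arithmetic can be correct and still report nan whenever the forward scale it receives is wrong.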

Verified on H100 (NVIDIA H100 PCIe, sm_90) with CUDA 13.2, PyTorch 2.12.0+cu132, and TileLang 0.1.10+cuda.git23d91c58: the 24 per_token_cast_to_e5m6 byte-mismatch cases and 24 cast_back_e5m6 nan-diff cases passed with pytest -n 2.

JayceSu98 <jayce.su@enflame-tech.com> authored and validated this patch.

Co-author GitHub: https://github.com/dingsg

Co-authored-by: dingsg <shengge.ding@enflame-tech.com>

JayceSu98 wants to merge 1 commit into deepseek-ai:main from JayceSu98:jayce/bugfix-quant-e5m6-bf16-amax-fp32
