[BugFix][Quant] Accumulate per-token BF16 amax in FP32 by JayceSu98 · Pull Request #15 · deepseek-ai/TileKernels

JayceSu98 · 2026-05-28T01:50:31Z

Per-token FP8/FP4 casts derive each output scale factor from an absmax reduction over a token/channel group. For BF16 inputs, the reduction result was stored in a BF16 fragment, so TileLang lowered the local reduction through the BF16 packed path before the final cross-thread max.

On H100 with CUDA 13.2, this can underestimate the row amax for large hidden blocks. With round_sf=True, the underestimated amax rounds the dequantization scale to a value one power of two too small, so the quantized E4M3/E2M1 bytes and the stored scaling factors diverge from the PyTorch reference.
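To make the failure mode concrete, here is a minimal sketch of how a slightly low amax can flip a power-of-two scale. The rounding rule in `pow2_scale` (smallest power of two at or above amax / 448) is an assumption chosen for illustration and may not match the exact rule behind round_sf=True in this repository; the function name and the sample values are hypothetical. The constant 448 is the largest finite E4M3 value.

```python
import math

E4M3_MAX = 448.0  # largest finite value of FP8 E4M3


def pow2_scale(amax: float, fmt_max: float = E4M3_MAX) -> float:
    """Assumed rounding rule: smallest power of two >= amax / fmt_max."""
    if amax <= 0.0:
        raise ValueError("amax must be positive")
    return 2.0 ** math.ceil(math.log2(amax / fmt_max))


true_amax = 450.0   # exact group absmax (hypothetical, representable in BF16)
under_amax = 448.0  # the same group with a slightly underestimated absmax

ref_scale = pow2_scale(true_amax)   # scale chosen from the correct amax
bad_scale = pow2_scale(under_amax)  # scale chosen from the low amax

# With the smaller scale, the largest element no longer fits in E4M3.
overflows = true_amax / bad_scale > E4M3_MAX
```

Under this assumed rule, a 2-unit error in amax is enough to halve the scale, which changes both the stored scaling factor and every quantized byte in the group.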

Store the per-group amax reduction result in FP32. The input data remains BF16, but scale selection is an FP32 numeric contract and should not depend on low-precision packed BF16 reduction behavior.
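One reason an FP32 result is a safe contract: a BF16 value is the upper 16 bits of an FP32 bit pattern, so widening is exact and an FP32 absmax over BF16 inputs equals the true group maximum. The sketch below emulates this in plain Python. It is not TileLang code, and the helper names and sample bit patterns are hypothetical.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Widen a raw 16-bit BF16 pattern to FP32 by placing it in the high half."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def group_amax_fp32(bf16_bits: list[int]) -> float:
    """Absmax over one group, computed on the exactly widened values."""
    return max(abs(bf16_bits_to_float(b)) for b in bf16_bits)


# Raw BF16 patterns for 1.0, -450.0, 448.0, and 0.0.
group = [0x3F80, 0xC3E1, 0x43E0, 0x0000]
amax = group_amax_fp32(group)
```

Here the element with the largest magnitude is -450.0, so the group amax is 450.0 with no rounding introduced by the widening step.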

Verified on H100 (NVIDIA H100 PCIe, sm_90) with CUDA 13.2, PyTorch 2.12.0+cu132, and TileLang 0.1.10+cuda.git23d91c58: the 43 previously failing BF16 per_token_cast E4M3/E2M1 cases passed with pytest -n 2.

JayceSu98 <jayce.su@enflame-tech.com> authored and validated this patch.

Co-author GitHub: https://github.com/dingsg

Co-authored-by: dingsg <shengge.ding@enflame-tech.com>

JayceSu98 wants to merge 1 commit into deepseek-ai:main from JayceSu98:jayce/bugfix-quant-per-token-bf16-amax-fp32
