[ROCm] Allow mixed (F8E4M3FNUZ, F8E5M2FNUZ) in Triton dot gate by Ruturaj4 · Pull Request #897 · ROCm/xla

Ruturaj4 · 2026-05-29T16:00:21Z

The mixed-FP8 carve-out in IsTritonSupportedDot only listed the OCP pair (F8E5M2, F8E4M3FN). The ROCm-native FNUZ pair was rejected even though the rest of the file already accepts FNUZ FP8 inputs on ROCm.

This blocks TransformerEngine FP8 GEMM on MI300 (gfx94X), which lowers dgrad to dot_general(F8E4M3FNUZ, F8E5M2FNUZ) and gets routed to a __triton_nested_gemm_fusion. The gate then refuses it at codegen time with "INTERNAL: ... Dot operation only supports same types for lhs and rhs."

Mirror the existing OCP allowance under gpu_version.IsRocm() so the FNUZ pair passes the same check.
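The gate described above can be sketched as a small self-contained model. This is not the XLA source: the `Type` enum, `TypesAre`, `DotOperandTypesAccepted`, and the `is_rocm` flag are hypothetical stand-ins, and treating the pair check as order-insensitive is an assumption.

```cpp
#include <cassert>

// Hypothetical stand-in for the element-type enum; only the FP8 types named in
// this PR plus one non-FP8 type are modeled.
enum class Type { F16, F8E5M2, F8E4M3FN, F8E5M2FNUZ, F8E4M3FNUZ };

// True when {lhs, rhs} is the unordered pair {a, b}. Order-insensitivity is an
// assumption of this sketch.
constexpr bool TypesAre(Type lhs, Type rhs, Type a, Type b) {
  return (lhs == a && rhs == b) || (lhs == b && rhs == a);
}

// Models the operand-type part of the gate as first proposed in this PR:
// identical types pass, the OCP mixed pair passes everywhere, and the FNUZ
// mixed pair passes only on ROCm.
constexpr bool DotOperandTypesAccepted(Type lhs, Type rhs, bool is_rocm) {
  return lhs == rhs ||
         TypesAre(lhs, rhs, Type::F8E5M2, Type::F8E4M3FN) ||
         (is_rocm && TypesAre(lhs, rhs, Type::F8E5M2FNUZ, Type::F8E4M3FNUZ));
}

// Example checks for the cases discussed in the description.
inline void CheckGateExamples() {
  // The dgrad case: mixed FNUZ operands on ROCm are now accepted.
  assert(DotOperandTypesAccepted(Type::F8E4M3FNUZ, Type::F8E5M2FNUZ, true));
  // The OCP pair was already accepted.
  assert(DotOperandTypesAccepted(Type::F8E5M2, Type::F8E4M3FN, false));
  // Mixing an OCP type with an FNUZ type is still rejected.
  assert(!DotOperandTypesAccepted(Type::F8E5M2, Type::F8E4M3FNUZ, true));
}
```

In this model, any mixed pair other than the two listed ones still hits the "only supports same types" rejection.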

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.


draganmladjenovic · 2026-05-29T16:09:15Z

-  if (lhs_type != rhs_type && !types_are(F8E5M2, F8E4M3FN)) {
+  const bool mixed_fp8_ok =
+      types_are(F8E5M2, F8E4M3FN) ||
+      (gpu_version.IsRocm() && types_are(F8E5M2FNUZ, F8E4M3FNUZ));


Could you support it unconditionally here and let it fail at AreDotAlgorithmInputAndOutputConversionsSupported instead?

Extends AllDevicesToTest() to include gfx942 (MI300) and gfx950 (MI355x) so the FNUZ pair gets exercised against hardware that actually supports FNUZ FP8. Adds the FNUZ pair to MixedF8DotTest's parameterization. Also fixes the existing F8-requires-Hopper skip to use cc.IsCuda() instead of the pointer-nonnull anti-pattern. A default-constructed GpuComputeCapability holds a default CudaComputeCapability whose IsAtLeastHopper() is false, which the old check would silently treat as "CUDA pre-Hopper" and skip the test even on what callers think is ROCm. Addresses TODO(b/393299275).

IsTritonSupportedDot's only caller (IsTritonSupportedInstructionImpl) already rejects FNUZ inputs on CUDA via IsTritonSupportedDataType, so the additional gpu_version.IsRocm() check inside mixed_fp8_ok was dead code. Drop it for consistency with the OCP pair (F8E5M2, F8E4M3FN) above, which similarly relies on the upstream data-type filter.
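The redundancy argument can be checked exhaustively in a toy model. The sketch below is an assumption-laden illustration, not the XLA code: `DataTypeAccepted` models only the one upstream behavior stated above (FNUZ inputs are rejected on CUDA), and `DotGate`, `Pipeline`, and the `guarded` flag are hypothetical names.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for the element-type enum.
enum class Type { F16, F8E5M2, F8E4M3FN, F8E5M2FNUZ, F8E4M3FNUZ };

constexpr bool IsFnuz(Type t) {
  return t == Type::F8E5M2FNUZ || t == Type::F8E4M3FNUZ;
}

// Models the upstream data-type filter. The only behavior taken from the PR
// text is that FNUZ inputs are rejected on CUDA; everything else is assumed
// to pass.
constexpr bool DataTypeAccepted(Type t, bool is_rocm) {
  return is_rocm || !IsFnuz(t);
}

constexpr bool TypesAre(Type lhs, Type rhs, Type a, Type b) {
  return (lhs == a && rhs == b) || (lhs == b && rhs == a);
}

// Dot gate with (guarded = true) or without (guarded = false) the extra
// ROCm check on the FNUZ pair.
constexpr bool DotGate(Type lhs, Type rhs, bool is_rocm, bool guarded) {
  return lhs == rhs ||
         TypesAre(lhs, rhs, Type::F8E5M2, Type::F8E4M3FN) ||
         ((!guarded || is_rocm) &&
          TypesAre(lhs, rhs, Type::F8E5M2FNUZ, Type::F8E4M3FNUZ));
}

// The caller applies the data-type filter to both operands before the gate.
constexpr bool Pipeline(Type lhs, Type rhs, bool is_rocm, bool guarded) {
  return DataTypeAccepted(lhs, is_rocm) && DataTypeAccepted(rhs, is_rocm) &&
         DotGate(lhs, rhs, is_rocm, guarded);
}

// Returns true when both gate variants give the same end-to-end answer for
// every operand pair on both platforms.
inline bool GuardIsRedundant() {
  const std::array<Type, 5> kTypes = {Type::F16, Type::F8E5M2, Type::F8E4M3FN,
                                      Type::F8E5M2FNUZ, Type::F8E4M3FNUZ};
  for (bool is_rocm : {false, true}) {
    for (Type lhs : kTypes) {
      for (Type rhs : kTypes) {
        if (Pipeline(lhs, rhs, is_rocm, true) !=
            Pipeline(lhs, rhs, is_rocm, false)) {
          return false;
        }
      }
    }
  }
  return true;
}

inline void CheckGuardIsRedundant() {
  // In isolation, the two gate variants disagree on CUDA for the FNUZ pair...
  assert(DotGate(Type::F8E5M2FNUZ, Type::F8E4M3FNUZ, false, true) !=
         DotGate(Type::F8E5M2FNUZ, Type::F8E4M3FNUZ, false, false));
  // ...but behind the data-type filter they are indistinguishable.
  assert(GuardIsRedundant());
}
```

Under these assumptions the guard changes the gate's answer only for inputs the filter has already rejected, which is why dropping it does not alter end-to-end behavior in the model.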

CI clang-format check rejected the manual line-wrap on this two-call OR. Apply the canonical formatting CI provided.

draganmladjenovic reviewed May 29, 2026


Ruturaj4 added 3 commits May 30, 2026 00:55

[ROCm] Reformat mixed_fp8_ok per clang-format

87a4b32


Ruturaj4 wants to merge 4 commits into main from ruvaidya/fnuz-mixed-fp8-triton-dot
