Add L9/L10/L11 block-distributed transforms (HIP/MFMA/rocWMMA) by ahurta92 · Pull Request #7 · devreal/transformbench-cublasdx

ahurta92 · 2026-05-04T14:21:01Z

Ports the blocked-transform contributions from the blocked-transform branch onto upstream/main without disturbing existing L1-L8 dispatch. New levels share the per-wave corner-turn algorithm but differ in the inner GEMM:

- L9 (blocked): manual fragments with v_mfma_f64_16x16x4f64, K=16
- L10 (blocked-rocwmma): rocWMMA mma_sync, K=16
- L11 (blocked-k20): hybrid 16x16x4 + 4x4x4 MFMA, K=20

All three are gated behind __HIP__ in transformbench.cu and validate_levels.hip so the CUDA executable still compiles. The header bodies are likewise wrapped. Validated bit-exact against L1 on MI210 for K=16 (L9, L10) and K=20 (L11).
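For readers unfamiliar with the gating pattern, here is a minimal host-side sketch of how HIP-only levels can be hidden from a CUDA build. The `Level` enum, `hip_build`, `level_available`, and `require_level` are hypothetical names chosen for illustration; they are not taken from transformbench.cu, whose actual dispatch code may look quite different. The only fact relied on is that hip-clang defines `__HIP__` when compiling HIP sources.

```cpp
#include <cassert>

// Hypothetical level identifiers (illustration only, not the real dispatch table).
enum class Level { L1 = 1, L9 = 9, L10 = 10, L11 = 11 };

// True only when this translation unit is compiled as HIP.
constexpr bool hip_build() {
#if defined(__HIP__)
    return true;
#else
    return false;
#endif
}

// Reports whether a level is compiled into this executable.
// The MFMA/rocWMMA levels exist only inside the __HIP__ guard,
// so a CUDA (or plain host) build never sees them.
bool level_available(Level lvl) {
    switch (lvl) {
        case Level::L1:
            return true;
#if defined(__HIP__)
        case Level::L9:
        case Level::L10:
        case Level::L11:
            return true;
#endif
        default:
            return false;
    }
}

// Fails loudly if a caller requests a level this build does not contain.
void require_level(Level lvl) {
    assert(level_available(lvl) && "level not compiled into this build");
}
```

With this shape, a CUDA build reports L9 through L11 as unavailable instead of failing to compile on MFMA intrinsics or rocWMMA headers.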

Also adds:
- validate.hip: GPU L1 vs CPU mTxm reference (mirrors transform3d.cc)
- BLOCKED_TRANSFORM.md: design notes, MFMA fragment layouts, perf tables
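As a rough illustration of what a CPU mTxm reference and a bit-exact check can look like, the sketch below applies three transposed matrix products to a K×K×K tensor and compares results byte for byte. It assumes the usual mTxm convention c(i,j) += Σ_k a(k,i)·b(k,j) with row-major storage; `transform3d` and `bit_exact` are illustrative names, and the code is not copied from validate.hip or transform3d.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// c(i,j) += sum_k a(k,i) * b(k,j)
// a is dimk x dimi, b is dimk x dimj, c is dimi x dimj, all row-major.
void mTxm(std::size_t dimi, std::size_t dimj, std::size_t dimk,
          double* c, const double* a, const double* b) {
    for (std::size_t k = 0; k < dimk; ++k) {
        for (std::size_t i = 0; i < dimi; ++i) {
            const double aki = a[k * dimi + i];
            for (std::size_t j = 0; j < dimj; ++j) {
                c[i * dimj + j] += aki * b[k * dimj + j];
            }
        }
    }
}

// r(i',j',k') = sum_{i,j,k} t(i,j,k) c(i,i') c(j,j') c(k,k').
// Each mTxm call contracts the leading index and rotates it to the back,
// so three calls transform all three dimensions and restore the index order.
std::vector<double> transform3d(std::size_t K,
                                const std::vector<double>& t,
                                const std::vector<double>& c) {
    assert(t.size() == K * K * K && c.size() == K * K);
    std::vector<double> w1(K * K * K, 0.0), w2(K * K * K, 0.0), r(K * K * K, 0.0);
    mTxm(K * K, K, K, w1.data(), t.data(), c.data());
    mTxm(K * K, K, K, w2.data(), w1.data(), c.data());
    mTxm(K * K, K, K, r.data(), w2.data(), c.data());
    return r;
}

// Bit-exact comparison: compares raw bytes, so +0.0 and -0.0 differ
// and two NaNs match only if their bit patterns are identical.
bool bit_exact(const std::vector<double>& x, const std::vector<double>& y) {
    return x.size() == y.size() &&
           std::memcmp(x.data(), y.data(), x.size() * sizeof(double)) == 0;
}
```

A bit-exact check is stricter than a tolerance check: it only passes when two implementations produce identical rounding, which typically requires the same order of floating-point accumulation.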


ahurta92 · 2026-05-04T14:21:26Z

This replaces the previous PR. I'll close the other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ahurta92 mentioned this pull request May 4, 2026

Blocked transform #5

Closed

Ignore rocprof virtualenvs and profiling output dirs

eeab26a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ahurta92 wants to merge 2 commits into devreal:main from ahurta92:merge-blocked-into-main
