Add L9/L10/L11 block-distributed transforms (HIP/MFMA/rocWMMA) #7

Open
ahurta92 wants to merge 2 commits into devreal:main from ahurta92:merge-blocked-into-main

ahurta92 (Contributor) commented May 4, 2026

Ports the blocked-transform contributions from the blocked-transform branch
onto upstream/main without disturbing existing L1-L8 dispatch. New levels
share the per-wave corner-turn algorithm but differ in the inner GEMM:

  L9  (blocked)         v_mfma_f64_16x16x4f64 manual fragments, K=16
  L10 (blocked-rocwmma) rocWMMA mma_sync,                       K=16
  L11 (blocked-k20)     hybrid 16x16x4 + 4x4x4 MFMA,            K=20

All three are gated behind __HIP__ in transformbench.cu and validate_levels.hip
so the CUDA executable still compiles. The header bodies are likewise wrapped
in the same __HIP__ guard. Validated bit-exact against L1 on MI210 for K=16
(L9, L10) and K=20 (L11).
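
For reviewers unfamiliar with the gating pattern, here is a minimal sketch of how a level table can be guarded with __HIP__ (which clang defines when compiling in HIP mode). The names BlockedLevel and blocked_levels() are hypothetical and only illustrate the idea; they are not identifiers from transformbench.cu.

```cpp
// Sketch only: BlockedLevel and blocked_levels() are hypothetical names,
// not taken from transformbench.cu or the new headers.
#include <vector>

struct BlockedLevel {
    int level;         // dispatch level number
    const char* name;  // label as listed in the level table above
    int K;             // K the level was validated with
};

// Returns the blocked levels available in this build.
inline std::vector<BlockedLevel> blocked_levels() {
    std::vector<BlockedLevel> levels;
#if defined(__HIP__)
    // Compiled only in HIP mode, so MFMA/rocWMMA-specific code
    // never reaches the CUDA compiler.
    levels.push_back({9,  "blocked",         16});
    levels.push_back({10, "blocked-rocwmma", 16});
    levels.push_back({11, "blocked-k20",     20});
#endif
    return levels;
}
```

In a non-HIP build the function returns an empty list, so a dispatcher that iterates over it would simply never offer L9 through L11.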

Also adds:
  validate.hip          GPU L1 vs CPU mTxm reference (mirrors transform3d.cc)
  BLOCKED_TRANSFORM.md  design notes, MFMA fragment layouts, perf tables
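
As a rough illustration of what "bit-exact" means here, the sketch below compares two result buffers by their 64-bit patterns rather than with a tolerance. It is an assumption about how such a check could be written, not the comparison used in validate_levels.hip; first_bit_mismatch is a hypothetical helper name.

```cpp
// Sketch only: first_bit_mismatch is a hypothetical helper, not part of this PR.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns the index of the first element whose bit pattern differs,
// or -1 if both buffers are identical in length and contents.
inline std::ptrdiff_t first_bit_mismatch(const std::vector<double>& ref,
                                         const std::vector<double>& out) {
    const std::size_t n = ref.size() < out.size() ? ref.size() : out.size();
    for (std::size_t i = 0; i < n; ++i) {
        std::uint64_t a, b;
        // memcpy reads the raw bits without violating aliasing rules.
        std::memcpy(&a, &ref[i], sizeof a);
        std::memcpy(&b, &out[i], sizeof b);
        if (a != b) return static_cast<std::ptrdiff_t>(i);
    }
    // A length mismatch counts as a difference at the first missing element.
    return ref.size() == out.size() ? -1 : static_cast<std::ptrdiff_t>(n);
}
```

Unlike an `==` comparison on doubles, this treats 0.0 and -0.0 as different, so a pass means the two levels produced identical bits, not merely values that compare equal.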

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ahurta92 (Contributor, Author) commented May 4, 2026

This replaces the previous PR. I'll close the other.
