linalg/arm64/sme: experimental f16 SME kernels for Apple M5/A19#2273
Draft
czoli1976 wants to merge 1 commit into
…_SME_F16F16)

Adds half-precision SME kernels for the mmm_f16 / mmv_f16 slots, the f16 companions to the merged f32 SME backend (sonos#2230). They use the non-widening half-precision SME path (`fmopa za.h`, FEAT_SME_F16F16): f16 inputs, f16 accumulate in ZA.H, consuming tract's native K-major f16 packing directly.

- sme_mmm_f16_32x32 (GEMM): one 32x32 ZA.H tile, `fmopa za.h` per K-step (at SVL=512, one f16 FMOPA covers the whole 32x32 tile). where(SME_F16F16).
- sme_mmv_f16_64x1 (GEMV, N==1): vgx2 ZA.H group, SME2 multi-vec `fmla za.h[w8,0,vgx2]`. where(SME2 && SME_F16F16).

EXPERIMENTAL / effectively Apple M5 + A19 only, and unvalidated on real hardware. FEAT_SME_F16F16 is an optional SME2 feature that, among shipping silicon, only the Apple M5 / A19 implement. The "smoking gun" is upstream LLVM's CPU definition (llvm/llvm-project commit f85494f6afeb, "Define apple-m5/a19"):

    def TuneAppleM5 : SubtargetFeature<"apple-m5", ...,
        FeatureSME, FeatureSME2, FeatureSMEF64F64, FeatureSMEI16I64,
        FeatureSME2p1, FeatureSMEB16B16, FeatureSMEF16F16, ...>

In LLVM's whole AArch64 CPU table FeatureSMEF16F16 appears for apple-m5/a19 and nothing else: the Apple M4/A18 report it as 0 (verified by sysctl on an M4), and the newest non-Apple SME2 cores (Arm C1/Lumex in Exynos 2600, Cortex-X925, Qualcomm Oryon) have SME/SME2 but not FEAT_SME_F16F16. So this needs community testing on an actual M5 / A19 (iPhone 17); the maintainers' M4 cannot exercise it (it falls back to AMX f16).

Build gating (so the f32 SME backend is never regressed): the f16 unit uses the `+sme-f16f16` assembler extension. A new dummy_sme_f16f16.S probe + assembler_supports_sme_f16f16() compiles the f16 kernels as a separate object gated on the `tract_sme_f16f16` cfg; the f32 SME kernels keep building on toolchains that have base SME but not f16f16. The Rust f16 registrations, detection (HWCAP2_SME_F16F16 bit 42 on Linux / sysctl on macOS, plus the existing 512-bit SVL check), and plug() wiring are all behind that cfg.

Validated under QEMU only (no SME_F16F16 hardware available): with `qemu-aarch64 -cpu max,sme512=on` the full SME auto-test surface passes 220/220 (the two new f16 kernels: matmul proptest, every fuse op, store layouts, frame; plus the existing f32 GEMM/GEMV, no regression). Builds clean on macOS (Apple clang) and debian:sid (gcc 15).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Addendum to the merged f32 SME backend (#2230): experimental half-precision SME kernels for the `mmm_f16` / `mmv_f16` slots, using the non-widening `fmopa za.h` path (`FEAT_SME_F16F16`).

Warning
This is effectively Apple M5 / A19-only and has NOT been validated on real hardware. It is QEMU-correctness-validated only and needs community testing on an actual M5 / A19 (iPhone 17). It cannot regress anything else (see Build safety), so it's offered as a low-risk, opt-in-by-hardware addition.
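For readers less familiar with the outer-product formulation behind `fmopa`, here is a minimal scalar sketch of a GEMM that accumulates one rank-1 update per K-step into a single accumulator tile. It is an illustration only, not the kernel in this PR: `outer_product_gemm` and `tile_dim` are hypothetical names, the K-major layout shown (M values of A then N values of B per K-step) is an assumption about the packing, and `f32` stands in for f16 because stable Rust has no native f16 type, so rounding will differ from a true f16 accumulate.

```rust
/// Rows/columns of a ZA tile (equivalently, lanes per vector) for a given
/// streaming vector length and element width, both in bits.
fn tile_dim(svl_bits: usize, elem_bits: usize) -> usize {
    svl_bits / elem_bits
}

/// Scalar model of an outer-product-accumulate GEMM (hypothetical helper).
/// `a` is assumed K-major: for each k, the M values of column k of A.
/// `b` is assumed K-major: for each k, the N values of row k of B.
/// Returns C (row-major, M x N) as the sum over k of the outer products.
fn outer_product_gemm(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), n * k);
    // Stands in for the accumulator tile (ZA.H in the real kernel).
    let mut acc = vec![0.0f32; m * n];
    for kk in 0..k {
        let a_col = &a[kk * m..(kk + 1) * m];
        let b_row = &b[kk * n..(kk + 1) * n];
        // One FMOPA-like step: a rank-1 update of the whole tile.
        for (i, &av) in a_col.iter().enumerate() {
            for (j, &bv) in b_row.iter().enumerate() {
                acc[i * n + j] += av * bv;
            }
        }
    }
    acc
}

fn main() {
    // A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]], both packed K-major.
    let a: [f32; 4] = [1.0, 3.0, 2.0, 4.0];
    let b: [f32; 4] = [5.0, 6.0, 7.0, 8.0];
    let c = outer_product_gemm(&a, &b, 2, 2, 2);
    println!("{:?}", c); // prints [19.0, 22.0, 43.0, 50.0]

    // At SVL=512 an f16 tile is 32x32 and an f32 tile is 16x16,
    // so one f16 tile holds as many accumulators as four f32 tiles.
    let (h, s) = (tile_dim(512, 16), tile_dim(512, 32));
    println!("f16 tile {h}x{h}, f32 tile {s}x{s}");
}
```

The inner double loop is what a single outer-product instruction performs in hardware; the tile arithmetic at the end is why one f16 FMOPA at SVL=512 matches the multiply-accumulate count of four f32 tiles.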
Why "M5 / A19-only" — the LLVM smoking gun
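`FEAT_SME_F16F16` (non-widening f16→f16 in ZA.H) is an optional SME2 sub-feature. Among shipping silicon, only the Apple M5 / A19 implement it. The proof is upstream LLVM's own CPU definition, `llvm/llvm-project` commit `f85494f6afeb` "Define apple-m5/a19 CPUs", whose `TuneAppleM5` definition lists `FeatureSME`, `FeatureSME2`, `FeatureSMEF64F64`, `FeatureSMEI16I64`, `FeatureSME2p1`, `FeatureSMEB16B16` and `FeatureSMEF16F16`.

In LLVM's entire AArch64 CPU table, `FeatureSMEF16F16` appears for apple-m5/a19 and nothing else:

- Apple M5 / A19: implemented.
- Apple M4 / A18: not implemented (the `FEAT_SME_F16F16` sysctl reports `0` on an M4).
- Newest non-Apple SME2 cores (Arm C1/Lumex in Exynos 2600, Cortex-X925, Qualcomm Oryon): SME/SME2 but not `FEAT_SME_F16F16`.

So this can only be exercised on an M5/A19, which I don't have; hence experimental, community testing requested.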
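To check what a given machine reports, something along these lines could be used. It is a rough sketch, not the detection code in this PR: the helper names are hypothetical, the macOS path shells out to the `sysctl` command-line tool rather than calling the C API, and the Linux path only shows the bit test on an `AT_HWCAP2` word that the caller is assumed to have obtained already. The sysctl key and the bit number 42 are the ones quoted in this PR.

```rust
use std::process::Command;

/// Bit position of HWCAP2_SME_F16F16 in AT_HWCAP2, as quoted in this PR.
const HWCAP2_SME_F16F16_BIT: u32 = 42;

/// Pure bit test on a raw AT_HWCAP2 value (Linux). Hypothetical helper.
fn hwcap2_has_sme_f16f16(hwcap2: u64) -> bool {
    (hwcap2 & (1u64 << HWCAP2_SME_F16F16_BIT)) != 0
}

/// Interpret the output of `sysctl -n <key>`: "1" is true, "0" is false,
/// anything else is treated as unknown.
fn parse_sysctl_flag(out: &str) -> Option<bool> {
    match out.trim() {
        "1" => Some(true),
        "0" => Some(false),
        _ => None,
    }
}

/// Query the macOS sysctl via the command-line tool. Returns None when the
/// tool is missing, the key is unknown, or the output is not 0/1.
fn macos_sme_f16f16() -> Option<bool> {
    let out = Command::new("sysctl")
        .args(["-n", "hw.optional.arm.FEAT_SME_F16F16"])
        .output()
        .ok()?;
    if !out.status.success() {
        return None;
    }
    parse_sysctl_flag(&String::from_utf8_lossy(&out.stdout))
}

fn main() {
    // On machines without the key (or without sysctl) this prints None.
    println!("sysctl reports: {:?}", macos_sme_f16f16());
    // Example word with only bit 42 set.
    println!("bit 42 set: {}", hwcap2_has_sme_f16f16(1u64 << 42));
}
```

A `Some(false)` result corresponds to the M4 case described above; a `None` result only means the query could not be answered on that machine.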
What it adds
- `sme_mmm_f16_32x32` (GEMM): one 32×32 `ZA.H` tile, `fmopa za.h` per K-step (at SVL=512 a single f16 FMOPA covers the whole 32×32 tile = the f32 kernel's 4-tile MAC count). `where(SME_F16F16)`.
- `sme_mmv_f16_64x1` (GEMV, N==1): vgx2 `ZA.H` group, SME2 multi-vec `fmla za.h[w8,0,vgx2]`. `where(SME2 && SME_F16F16)`.
- `CAN_FUSE` matches the f32 kernels.

Build safety (cannot regress the f32 SME backend)
The f16 unit needs the `+sme-f16f16` assembler extension. A separate `dummy_sme_f16f16.S` probe + `assembler_supports_sme_f16f16()` compiles the f16 kernels as their own object behind a new `tract_sme_f16f16` cfg. A toolchain with base SME but not f16f16 still builds the f32 SME kernels exactly as before; only the f16 unit (and all its Rust registration / detection / `plug()` wiring) is gated off. Detection is `HWCAP2_SME_F16F16` (bit 42) on Linux / the `hw.optional.arm.FEAT_SME_F16F16` sysctl on macOS, plus the existing 512-bit SVL check.

Validation
- `qemu-aarch64 -cpu max,sme512=on` → full SME auto-test surface 220/220 (the two new f16 kernels: matmul proptest, every fuse op, store layouts, frame; plus the existing f32 GEMM/GEMV, no regression).
- `fmopa za.h` and the multi-vec `fmla za.h` vgx2 forms are bit-correct.
- Not validated on real `FEAT_SME_F16F16` silicon: that's the ask.

Request
If you (or anyone) have an M5 Mac or an iPhone 17 (A19), a run of `cargo test -p tract-linalg arm64::sme::test_sme_mmm_f16` / `test_sme_mmv_f16` (or an f16 model A/B vs the AMX path) would confirm real-hardware correctness and let this graduate from experimental.

🤖 Generated with Claude Code