On A19 (iPhone 17 Pro Max, iOS 26.5.1) MLX produces deterministically wrong numerical results on the GPU in a real Hybrid-Transformer-Demucs source-separation pipeline. The separated audio stems contain +12 to +21 dB of spurious high-frequency energy (14–22 kHz) that is not present when the identical code, weights and parameters run on Apple Silicon Macs (M2). The corruption is:
deterministic (not NaN, reproducible run to run),
present in the output tensors themselves (we exported the raw WAVs from the device — it is not a playback/codec artifact),
present in every output stem (drums/bass/other/vocals), growing toward Nyquist and broadband on vocals.
This matches the A19 GPU dot-product issue independently documented by Taras Zakharko's microbenchmark "Investigating the GPU Neural Accelerators on Apple A19/M5", which reports that "choosing certain matrix dimensions produces invalid results on A19", attributed to "a bug with masking out unused lanes in the dot product hardware."
We suspect the same root cause is surfacing through MLX's Neural-Accelerator matmul path, which #3083 enables for gen >= 18 phone architectures (i.e. A19).
Environment
Device: iPhone 17 Pro Max (A19 Pro, 5 GPU cores), iOS 26.5.1, 12 GB RAM
mlx-swift 0.30.6 (also reproduced on 0.31.4)
Model: Hybrid Transformer Demucs (htdemucs and htdemucs_ft), FP32 weights
Reference (clean): identical code/weights on M2 (both the app and the demucs-mlx-swift CLI)
Symptom (measured)
Per-band energy (dB relative to that stem's total) of each separated stem, A19 device vs. M2 CLI, identical model/params (htdemucs_ft, shifts=1, overlap=0.75, segment=5):
| stem | band (kHz) | A19 device (dB) | M2 reference (dB) | delta (dB) |
| --- | --- | --- | --- | --- |
| drums | 18–22 | −59.4 | −71.8 | +12.4 |
| bass | 18–22 | −63.5 | −76.1 | +12.5 |
| other | 14–18 | −51.8 | −60.8 | +8.9 |
| other | 18–22 | −57.5 | −72.4 | +14.9 |
| vocals | 10–14 | −12.2 | −25.2 | +13.0 |
| vocals | 14–18 | −14.7 | −29.3 | +14.6 |
| vocals | 18–22 | −12.4 | −33.0 | +20.6 |
Output peaks are not clipped (≈0.90 on both), ruling out output quantization/saturation. A third-party reference (a cloud separation of the same track) agrees closely with the M2 output, confirming the M2 result is the correct one.
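The per-band levels above are ratios of band energy to the stem's total energy, and each delta is the device level minus the reference level. As a rough illustration of that arithmetic (not the measurement script behind the numbers above), here is a NumPy sketch. The function name `band_energy_db`, the synthetic test tones, and the 44.1 kHz sample rate are all assumptions made for the example.

```python
import numpy as np

def band_energy_db(signal, sample_rate, bands_hz):
    """Energy of each band in dB relative to the signal's total energy."""
    power = np.abs(np.fft.rfft(signal)) ** 2           # power per rFFT bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = power.sum()
    levels = {}
    for lo, hi in bands_hz:
        in_band = (freqs >= lo) & (freqs < hi)
        # Small floor avoids log10(0) for an empty or silent band.
        ratio = max(power[in_band].sum() / total, 1e-30)
        levels[(lo, hi)] = 10.0 * np.log10(ratio)
    return levels

# Synthetic stand-ins for a separated stem (assumption: NOT the real stems).
sample_rate = 44_100
t = np.arange(sample_rate) / sample_rate               # one second of audio
tone = np.sin(2 * np.pi * 1_000 * t)                   # main content at 1 kHz
hf = np.sin(2 * np.pi * 20_000 * t)                    # high-frequency component
reference = tone + 1e-4 * hf                           # "clean" output
device = tone + 1e-3 * hf                              # output with extra HF energy

band = (18_000, 22_000)
ref_db = band_energy_db(reference, sample_rate, [band])[band]
dev_db = band_energy_db(device, sample_rate, [band])[band]
delta_db = dev_db - ref_db
print(f"device {dev_db:.1f} dB, reference {ref_db:.1f} dB, delta {delta_db:+.1f} dB")
```

In this synthetic case the high-frequency amplitude differs by a factor of 10, so the delta comes out at about +20 dB by construction. On real stems the same function would be applied to the exported WAV samples from each platform.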
What we tried (did not resolve)
MLX_ENABLE_TF32=0: no change. We verified the setting was actually applied by logging getenv("MLX_ENABLE_TF32") at separation time, which reads "0". The spurious HF energy persists and the output is essentially identical to the run with TF32 enabled. This variable is the documented control for the M5/A19 Neural-Accelerator reduced-precision path (cf. [BUG] M5 float32 precision issue since 0.30.0 #3534, closed as "expected behavior", with the explanation that "Neural Accelerators trade precision for performance"), so what we are seeing is not the expected TF32 precision tradeoff but a separate correctness issue.
Forcing the high-level NAX flag off in the Metal backend (can_use_nax = false): the output changed but remained corrupted.
Forcing rfft/irfft to the CPU stream (stream: .cpu): no change.
Upgrading mlx-swift 0.30.6 → 0.31.4: no change.
So the corruption is not the TF32 precision mode and not fully gated by the high-level NAX flag alone. We have not yet isolated the exact op (matmul / SDPA / conv / a custom Metal kernel).
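One possible way to narrow down the failing op, sketched here purely as an illustration, is to dump intermediate tensors at the same points on both platforms and look for the first one that diverges beyond normal float32 noise. The helper name `first_divergence`, the stage names, the tolerance, and the synthetic data below are all hypothetical. Nothing here reflects an existing MLX or demucs-mlx-swift API.

```python
import numpy as np

def first_divergence(device_dumps, reference_dumps, rel_tol=1e-3):
    """Return (name, relative_error) for the first tensor that exceeds rel_tol.

    Both arguments map a stage name to an array, in execution order.
    Returns None when every stage agrees within the tolerance.
    """
    for name, ref in reference_dumps.items():
        ref = np.asarray(ref, dtype=np.float64)
        dev = np.asarray(device_dumps[name], dtype=np.float64)
        rel_err = np.linalg.norm(dev - ref) / max(np.linalg.norm(ref), 1e-12)
        if rel_err > rel_tol:
            return name, float(rel_err)
    return None

# Synthetic dumps (assumption: stand-ins for tensors exported from each platform).
rng = np.random.default_rng(0)
reference_dumps = {
    name: rng.standard_normal((4, 16)) for name in ("stage_a", "stage_b", "stage_c")
}
device_dumps = {name: x.copy() for name, x in reference_dumps.items()}
device_dumps["stage_b"] += 0.05    # error introduced at stage_b
device_dumps["stage_c"] += 0.05    # and carried into the following stage

result = first_divergence(device_dumps, reference_dumps)
print(result)
```

The tolerance needs to sit above the ordinary GPU-vs-CPU float32 rounding differences, so the 1e-3 default is only a starting guess. Once a stage is flagged, the ops between it and the previous clean stage are the candidates to test in isolation.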
Questions
Is A19 (gen >= 18 phone) intended to use the Neural-Accelerator matmul path after Fix nax condition for iphone #3083? If so, is the dot-product correctness issue Zakharko documents for certain matrix shapes on A19 known to the team?
Is there a recommended workaround today to get numerically correct GPU results on A19 (e.g. a supported way to disable the NAX/tensor path on A19, or to avoid the affected matmul shapes)?
Would a minimal matmul/conv GPU-vs-CPU reproducer on A19 be useful? We can build and share one, plus the full audio measurements and raw stems.
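To make the proposed reproducer concrete, here is a sketch of the comparison logic only: sweep a set of matrix shapes, run the same inputs through a candidate matmul and a reference matmul, and report the shapes whose relative error exceeds a tolerance. It is written in Python with NumPy for illustration. On device, the two callables would wrap the MLX GPU and CPU matmul calls instead. The function names, the shape list, the tolerance, and in particular `simulated_faulty_matmul` are assumptions for the demo. The simulated fault is a toy stand-in for a padding error and is not a model of the actual A19 hardware behavior.

```python
import numpy as np

def sweep_matmul_shapes(candidate_matmul, reference_matmul, shapes,
                        rel_tol=1e-4, seed=0):
    """Return [((m, k, n), rel_err), ...] for shapes where the outputs disagree."""
    rng = np.random.default_rng(seed)
    failures = []
    for m, k, n in shapes:
        a = rng.standard_normal((m, k)).astype(np.float32)
        b = rng.standard_normal((k, n)).astype(np.float32)
        ref = np.asarray(reference_matmul(a, b), dtype=np.float64)
        out = np.asarray(candidate_matmul(a, b), dtype=np.float64)
        rel_err = np.linalg.norm(out - ref) / max(np.linalg.norm(ref), 1e-12)
        if rel_err > rel_tol:
            failures.append(((m, k, n), float(rel_err)))
    return failures

def reference_matmul(a, b):
    # Float64 product used as the trusted result.
    return a.astype(np.float64) @ b.astype(np.float64)

def simulated_faulty_matmul(a, b, lane=8):
    # Toy fault (assumption): pad K up to a multiple of `lane` with ones
    # instead of zeros, so unused lanes leak into the dot product.
    pad = (-a.shape[1]) % lane
    a_padded = np.pad(a, ((0, 0), (0, pad)), constant_values=1.0)
    b_padded = np.pad(b, ((0, pad), (0, 0)), constant_values=1.0)
    return a_padded @ b_padded

# Mix of lane-aligned and unaligned inner dimensions.
shapes = [(8, 8, 8), (16, 32, 16), (5, 7, 3), (16, 30, 16)]
failures = sweep_matmul_shapes(simulated_faulty_matmul, reference_matmul, shapes)
for shape, err in failures:
    print(f"shape {shape}: relative error {err:.3g}")
```

Including inner dimensions that are not multiples of common tile sizes seems worthwhile because the cited microbenchmark reports that only certain matrix dimensions produce invalid results. The same harness shape would also work for conv or SDPA by swapping the two callables.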
References
Fix nax condition for iphone #3083 (gen >= 18)
[BUG] M5 float32 precision issue since 0.30.0 #3534 (MLX_ENABLE_TF32; related, "expected behavior")