A19 (iPhone 17 Pro) GPU returns numerically wrong results in float32 workloads — deterministic HF corruption in a Demucs separation pipeline; M2/M3 clean

### Summary

On A19 Pro (iPhone 17 Pro Max, iOS 26.5.1), MLX produces **deterministically wrong numerical results on the GPU** in a real Hybrid Transformer Demucs source-separation pipeline. The separated audio stems contain **+9 to +21 dB of spurious high-frequency energy (mostly 14–22 kHz, reaching down to 10 kHz on vocals)** that is **not present** when the identical code, weights and parameters run on Apple Silicon Macs (M2). The corruption is:

- **deterministic** (not NaN, reproducible run to run),
- present **in the output tensors themselves** (we exported the raw WAVs from the device — it is not a playback/codec artifact),
- present in **every** output stem (drums/bass/other/vocals): the excess grows toward Nyquist and is broadest on vocals, where it spans 10–22 kHz.

This matches the A19 GPU dot-product issue independently documented by Taras Zakharko's microbenchmark *"Investigating the GPU Neural Accelerators on Apple A19/M5"*, which reports that **"choosing certain matrix dimensions produces invalid results on A19"**, attributed to **"a bug with masking out unused lanes in the dot product hardware."**

We suspect the same root cause is surfacing through MLX's Neural-Accelerator matmul path, which #3083 enables for `gen >= 18` phone architectures (i.e. A19).

### Environment

- Device: **iPhone 17 Pro Max (A19 Pro, 5 GPU cores), iOS 26.5.1, 12 GB**
- mlx-swift **0.30.6** (also reproduced on **0.31.4**)
- Model: **Hybrid Transformer Demucs** (`htdemucs` and `htdemucs_ft`), FP32 weights
- Reference (clean): identical code/weights on **M2** (both the app and the `demucs-mlx-swift` CLI)

### Symptom (measured)

Per-band energy (dB relative to that stem's total) of each separated stem, **A19 device vs. M2 CLI**, identical model/params (`htdemucs_ft`, shifts=1, overlap=0.75, segment=5):

| stem | band | A19 device (dB) | M2 reference (dB) | delta (A19 − M2) |
|---|---|---|---|---|
| drums | 18–22 kHz | −59.4 | −71.8 | **+12.4 dB** |
| bass | 18–22 kHz | −63.5 | −76.1 | **+12.5 dB** |
| other | 14–18 / 18–22 kHz | −51.8 / −57.5 | −60.8 / −72.4 | **+8.9 / +14.9 dB** |
| vocals | 10–14 / 14–18 / 18–22 kHz | −12.2 / −14.7 / −12.4 | −25.2 / −29.3 / −33.0 | **+13.0 / +14.6 / +20.6 dB** |

Output peaks are **not clipped** (≈0.90 on both devices), which rules out clipping or saturation at the output. A third-party reference (a cloud separation of the same track) agrees closely with the M2 output, supporting the M2 result as the correct one.
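A minimal sketch of how a per-band comparison like the table above could be computed. It is stdlib-only Python and not necessarily the script behind the reported numbers: the names `band_energy_db` and `two_tone`, the naive DFT, and the synthetic two-tone signal standing in for a stem are all illustrative assumptions. On real stems an FFT library would replace the O(n²) loop.

```python
import cmath
import math

def band_energy_db(samples, sample_rate, bands):
    """Energy in each (lo_hz, hi_hz) band, in dB relative to the signal's total energy."""
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        # Naive DFT bin k; O(n^2) overall, acceptable only for short excerpts.
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        # Interior bins stand for both +f and -f, so they count twice.
        weight = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        power.append(weight * abs(acc) ** 2)
    total = sum(power)
    out = []
    for lo, hi in bands:
        e = sum(p for k, p in enumerate(power) if lo <= k * sample_rate / n < hi)
        out.append(10.0 * math.log10(e / total) if e > 0 else float("-inf"))
    return out

def two_tone(n, sample_rate, hf_amplitude):
    """Synthetic stand-in for a stem: a 1.5 kHz tone plus a weak 19.5 kHz tone."""
    return [math.sin(2 * math.pi * 1500 * t / sample_rate)
            + hf_amplitude * math.sin(2 * math.pi * 19500 * t / sample_rate)
            for t in range(n)]

SAMPLE_RATE = 48_000
BANDS = [(18_000, 22_000)]

# "reference" and "device" differ only in the amplitude of the 19.5 kHz tone (4x).
reference_db = band_energy_db(two_tone(256, SAMPLE_RATE, 0.01), SAMPLE_RATE, BANDS)
device_db = band_energy_db(two_tone(256, SAMPLE_RATE, 0.04), SAMPLE_RATE, BANDS)
delta_db = [d - r for d, r in zip(device_db, reference_db)]

print(f"reference {reference_db[0]:.1f} dB, device {device_db[0]:.1f} dB, "
      f"delta {delta_db[0]:+.1f} dB")
```

Normalizing each band by the stem's own total energy makes the figures independent of overall gain, so a positive delta reflects a change in spectral shape rather than a level difference between the two runs.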

### What we tried (did **not** resolve)

- **`MLX_ENABLE_TF32=0`**: *verified as actually applied* (we logged `getenv("MLX_ENABLE_TF32")` at separation time and it reads `"0"`). **No change**: the spurious HF energy persists and the output is essentially identical to the output with TF32 enabled. This is the documented control for the M5/A19 Neural-Accelerator reduced-precision path (cf. #3534, closed as "expected behavior — Neural Accelerators trade precision for performance"), so what we are seeing is **not** the expected TF32 precision tradeoff but a separate **correctness** issue.
- Forcing the high-level NAX flag off in the Metal backend (`can_use_nax = false`): the output **changed** but **remained corrupted**.
- Forcing `rfft`/`irfft` to the CPU stream (`stream: .cpu`): **no change**.
- Upgrading mlx-swift 0.30.6 → 0.31.4: **no change**.

So the corruption is **not** caused by the TF32 precision mode, and turning off the high-level NAX flag alone does **not** remove it. We have not yet isolated the responsible op (matmul / SDPA / conv / a custom Metal kernel).

### Questions

1. Is A19 (`gen >= 18` phone) intended to use the Neural-Accelerator matmul path after #3083? If so, is the dot-product correctness issue Zakharko documents for certain matrix shapes on A19 known to the team?
2. Is there a **recommended workaround today** to get numerically correct GPU results on A19 (e.g. a supported way to disable the NAX/tensor path on A19, or to avoid the affected matmul shapes)?
3. Would a **minimal matmul/conv GPU-vs-CPU reproducer on A19** be useful? We can build and share one, plus the full audio measurements and raw stems.
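One possible shape for the comparison logic of the reproducer mentioned in question 3, sketched in stdlib-only Python so that it runs anywhere. Everything here is an assumption for illustration: the names `sweep` and `fake_unmasked_lanes_matmul`, the tolerance of `1e-4`, and the lane width of 4. The deliberately broken matmul is a synthetic fault that only exercises the harness; it is not a claim about how the A19 hardware behaves. In a real reproducer, `candidate` would wrap the GPU matmul and the reference would be the same op on the CPU stream.

```python
import random

def matmul_reference(a, b):
    """Plain matmul on Python floats (float64), used as ground truth."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def max_abs_error(c, c_ref):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for row, row_ref in zip(c, c_ref) for x, y in zip(row, row_ref))

def sweep(candidate, sizes, tol=1e-4, seed=0):
    """Return [((m, k, n), error)] for every shape where candidate exceeds tol."""
    rng = random.Random(seed)
    failures = []
    for m in sizes:
        for k in sizes:
            for n in sizes:
                a = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(m)]
                b = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(k)]
                err = max_abs_error(candidate(a, b), matmul_reference(a, b))
                if err > tol:
                    failures.append(((m, k, n), err))
    return failures

def fake_unmasked_lanes_matmul(a, b, lane_width=4):
    """Synthetic fault: pads K up to a multiple of lane_width with 1.0 instead of 0.0."""
    pad = (-len(b)) % lane_width
    a_padded = [row + [1.0] * pad for row in a]
    b_padded = b + [[1.0] * len(b[0]) for _ in range(pad)]
    return matmul_reference(a_padded, b_padded)

failures = sweep(fake_unmasked_lanes_matmul, sizes=[3, 4, 5])
for shape, err in failures:
    print(f"shape {shape}: max abs error {err:.3f}")
```

Sweeping all three dimensions independently, rather than only square shapes, is what would expose a fault that depends on a single dimension (here the synthetic one fires only when K is not a multiple of 4).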

### References

- A19 dot-product bug (microbenchmark): https://tzakharko.github.io/apple-neural-accelerators-benchmark/
- NAX gate for iPhone (`gen >= 18`): https://github.com/ml-explore/mlx/pull/3083
- M5 float32 precision via Neural Accelerators / `MLX_ENABLE_TF32` (related, "expected behavior"): https://github.com/ml-explore/mlx/issues/3534

