Cuda metal/translator cpu fallback by kali · Pull Request #2299 · sonos/tract

kali · 2026-05-27T13:07:52Z

No description provided.

Some GPU op translators pass try_make and then bail inside wire_node's output_facts call when an upstream translation has produced a different shape, e.g. pulsification rewriting an axis size. The AxisOp path is the notable case: GpuAxisOp can carry a Reshape(from, to) whose dims were synthesised from the source shape. Previously this aborted the whole CUDA/Metal transform. Now the translator pre-checks the gpu_op against the already-translated target-side input facts before wiring, and falls back to the original CPU op when the check fails. The CPU fallback can still surface a separate inconsistency downstream (the encoder pulse + cuda chain hits one), but that is a different bug that the previous early abort was hiding.
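The pre-check-then-fall-back decision can be sketched as follows. This is a minimal model, not tract's code: `GpuReshape`, `Placement` and `place` are hypothetical stand-ins, and shapes are plain `Vec<usize>` rather than typed facts.

```rust
// Hypothetical stand-ins for illustration; these are not tract's real types.
#[derive(Clone, Debug, PartialEq)]
pub struct GpuReshape {
    /// Input dims captured from the source-side shape when the op was built.
    pub from: Vec<usize>,
    /// Output dims the reshape produces.
    pub to: Vec<usize>,
}

impl GpuReshape {
    /// Fails when the actual input shape no longer matches `from`.
    pub fn output_facts(&self, inputs: &[Vec<usize>]) -> Result<Vec<Vec<usize>>, String> {
        match inputs.first() {
            Some(shape) if *shape == self.from => Ok(vec![self.to.clone()]),
            Some(shape) => Err(format!(
                "Inconsistent facts: expected {:?}, got {:?}",
                self.from, shape
            )),
            None => Err("missing input".to_string()),
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum Placement {
    Gpu,
    CpuFallback,
}

/// Pre-check the candidate GPU op against the target-side input shapes.
/// Any failure (or no GPU candidate at all) keeps the node on the CPU
/// instead of aborting the whole transform.
pub fn place(gpu_op: Option<&GpuReshape>, target_inputs: &[Vec<usize>]) -> Placement {
    match gpu_op {
        Some(op) if op.output_facts(target_inputs).is_ok() => Placement::Gpu,
        _ => Placement::CpuFallback,
    }
}
```

With a `from` of `[1, 14, 80]`, a target-side input of `[1, 13, 80]` yields `Placement::CpuFallback`, while a matching input stays on the GPU. The shapes are illustrative only.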

Regression-lock the translator fallback fix. The pulsified preprocessor used to crash translation on --cuda (and presumably --metal) because the AxisOp translator emitted a GpuReshape with stale dims; now it falls back to CPU. The allowlist locks in the current CPU residue: STFT/Pad/MoveAxis/PulsedSameAxisConcat/OptMulByScalar/OptSubUnicast. Any future op spilling to CPU in this configuration fails CI. Encoder pulse + GPU is not yet covered: my translator fix makes it fall back from the GpuReshape, but a separate upstream CUDA op then produces a shape that breaks the CPU Reshape too. Follow-up.
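An allowlist check of this kind might look like the sketch below. The function name `unexpected_cpu_ops` is hypothetical, and the real CI harness may collect and compare op names differently; only the op names in the usage note come from the commit message.

```rust
use std::collections::BTreeSet;

/// Return, sorted and de-duplicated, every op that stayed on the CPU
/// but is not on the allowlist. An empty result means the check passes.
pub fn unexpected_cpu_ops<'a>(cpu_ops: &[&'a str], allowlist: &[&str]) -> Vec<&'a str> {
    let allowed: BTreeSet<&str> = allowlist.iter().copied().collect();
    let extra: BTreeSet<&'a str> = cpu_ops
        .iter()
        .copied()
        .filter(|op| !allowed.contains(*op))
        .collect();
    extra.into_iter().collect()
}
```

Called with the allowlist `["STFT", "Pad", "MoveAxis", "PulsedSameAxisConcat", "OptMulByScalar", "OptSubUnicast"]`, any other op name found on the CPU side is returned, and a CI job would fail on a non-empty result.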

Both ops had typed output_facts that diverged from eval / pulsed_output_facts when the model is fed through pulsification and then re-typed (which the CUDA/Metal translator does to compute target-side facts).

- PulsedSameAxisConcat returned inputs[0] (the small constant 'pre' prefix) instead of inputs[1] (the streaming data). In the pulsified preprocessor, every downstream node then saw the prefix's shape instead of the pulse-axis size and collapsed its outputs to 1.
- AffineChunkTrim unconditionally subtracted typed_trim from the input dim, but the eval/pulsed logic only trims when the input exceeds target_per_pulse (handling the case where the upstream pulsifier absorbed 'c' into Delay state and emits target_per_pulse directly). Visible as src 14 -> tgt 13 in the encoder pulse chain.

Both are visible only on the pulsified-then-re-typed path; pure CPU pulse runs use the PulsedFact pipeline and bypass these typed output_facts.
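The two corrected rules can be modelled on plain integers and shape vectors. This is a sketch under assumptions: the function names are hypothetical, the real ops work on symbolic dims and typed facts, and reading "src 14 -> tgt 13" as an input of 14 with `target_per_pulse` 14 and `typed_trim` 1 is an interpretation, not something the commit states.

```rust
/// Streaming-axis length after AffineChunkTrim (hypothetical scalar model).
/// Trim only when the input is longer than `target_per_pulse`; otherwise the
/// upstream op already emits the target length and nothing is removed.
pub fn affine_chunk_trim_len(input_len: usize, typed_trim: usize, target_per_pulse: usize) -> usize {
    if input_len > target_per_pulse {
        input_len.saturating_sub(typed_trim)
    } else {
        input_len
    }
}

/// The pre-fix behaviour described above: subtract unconditionally.
pub fn affine_chunk_trim_len_buggy(input_len: usize, typed_trim: usize) -> usize {
    input_len.saturating_sub(typed_trim)
}

/// Output shape for PulsedSameAxisConcat: follow the streaming input
/// (inputs[1]), not the constant prefix (inputs[0]).
pub fn pulsed_same_axis_concat_shape(inputs: &[Vec<usize>]) -> Option<Vec<usize>> {
    inputs.get(1).cloned()
}
```

Under that reading, `affine_chunk_trim_len_buggy(14, 1)` gives 13 while `affine_chunk_trim_len(14, 1, 14)` keeps 14.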

Now that the PulsedSameAxisConcat + AffineChunkTrim output_facts fixes unblock the encoder pulse + GPU translation, lock in both:

- preprocessor allowlist: STFT, Pad, PulsedSameAxisConcat, OptMulByScalar, OptSubUnicast
- encoder allowlist: AffineChunkTrim, PulsedRange

Runtime numerical correctness on these chains is still off (preprocessor 1.9% outliers vs CPU, encoder 26%; separate runtime/state-init bug likely in GpuDelay or PulsePad). The CI here only asserts translation doesn't crash and the CPU-spill set doesn't grow.

…eset

Without this, comparing a CUDA-translated model against an npz reference panics on the first non-plain output, and cumulative-off bisection keeps the drifted device value instead of reseeding from the reference.

Matches the GPU Delay op and ASR convention (silent context before first pulse). Uninit memory leaked into the first `delay` output frames, which were normally masked by downstream pulse trimming but made any per-node CPU↔GPU comparison flaky on the warmup region.
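The effect of zero-initialising delay state can be illustrated with a small host-side delay line. This is only a sketch of the convention (silent context before the first pulse); the `Delay` struct here is hypothetical and is not the op touched by the commit.

```rust
/// Hypothetical one-dimensional delay line with zero-initialised state.
pub struct Delay {
    buffer: Vec<f32>,
}

impl Delay {
    /// Start with `delay` frames of silence rather than uninitialised memory.
    pub fn new(delay: usize) -> Self {
        Delay { buffer: vec![0.0; delay] }
    }

    /// Emit `input.len()` frames of the delayed stream and keep the
    /// last `delay` frames as state for the next pulse.
    pub fn pulse(&mut self, input: &[f32]) -> Vec<f32> {
        let delay = self.buffer.len();
        let mut joined = std::mem::take(&mut self.buffer);
        joined.extend_from_slice(input);
        let out = joined[..input.len()].to_vec();
        self.buffer = joined[joined.len() - delay..].to_vec();
        out
    }
}
```

With a delay of 2, the first pulse over `[1.0, 2.0, 3.0]` produces `[0.0, 0.0, 1.0]`: the warmup frames are deterministic zeros, so a per-node comparison over that region is reproducible.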

cli/compare: strip the .fused_axis_op suffix the CUDA translator adds when an op absorbs adjacent axis ops, so per-node compare lines up GPU outputs against the CPU reference (covers ~17% more nodes on a typical pulsified GPU model).

cuda/transform: add a TRACT_CUDA_FORCE_CPU=substr[,substr,...] env var that forces matching nodes onto the CPU fallback path. It pinpointed CudaGgmlGemm on selfAttn_xMatmul.blockified as the source of the encoder pulse + CUDA drift, and is worth keeping around for the next investigation.
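Both helpers reduce to simple string handling, sketched below. The function names are hypothetical, and details such as trimming whitespace around substrings or ignoring empty entries are assumptions, not documented behaviour. A caller would typically obtain `spec` from `std::env::var("TRACT_CUDA_FORCE_CPU")`.

```rust
/// Map a GPU node name back to the CPU reference name by dropping a
/// trailing ".fused_axis_op" suffix, if present.
pub fn reference_name(node_name: &str) -> &str {
    node_name.strip_suffix(".fused_axis_op").unwrap_or(node_name)
}

/// Decide whether a node should be forced onto the CPU, given a
/// comma-separated list of substrings (the value of the env var).
pub fn force_cpu(spec: &str, node_name: &str) -> bool {
    spec.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .any(|s| node_name.contains(s))
}
```

For example, a spec of `selfAttn_xMatmul` matches any node whose name contains that substring, while an empty spec matches nothing.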

PulsePad is on `can_fuse_move`'s allowlist of ops that accept non-contiguous (Move-permuted) inputs. Its partial fills already use `copy_with_origins`, but the initial 'copy the whole input to output' used `flat_copy` — a flat memcpy that reads the buffer in pre-Move byte order while the output is laid out in post-Move natural strides. Visible symptom on the pulsified Nemotron encoder under --cuda: the attention-output matmul fed a GpuPulsePad with a fused GpuMoveAxis(0→1) on its input; the bad initial copy garbled the matmul output before downstream layers consumed it, accumulating ~26% outliers end-to-end.
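The difference between the two copies shows up already on a 2x3 buffer viewed through a moved axis. The host-side sketch below is illustrative only: `strided_copy_2d` is a hypothetical stand-in for a stride-aware copy, not the `copy_with_origins` kernel, and `flat_copy` here just mirrors a plain memcpy.

```rust
/// Flat copy: ignores the view's strides and reproduces the source
/// buffer in its stored element order.
pub fn flat_copy(src: &[f32]) -> Vec<f32> {
    src.to_vec()
}

/// Stride-aware copy of a 2-D view into a contiguous row-major output.
/// `shape` is the logical (post-Move) shape and `strides` are element
/// strides into `src` for each logical axis.
pub fn strided_copy_2d(src: &[f32], shape: [usize; 2], strides: [usize; 2]) -> Vec<f32> {
    let mut out = Vec::with_capacity(shape[0] * shape[1]);
    for i in 0..shape[0] {
        for j in 0..shape[1] {
            out.push(src[i * strides[0] + j * strides[1]]);
        }
    }
    out
}
```

A row-major 2x3 buffer `[0, 1, 2, 3, 4, 5]` seen through a Move(0→1) has logical shape `[3, 2]` and strides `[1, 3]`. The stride-aware copy yields `[0, 3, 1, 4, 2, 5]`, whereas the flat copy returns the buffer unchanged, which is the wrong layout for a `[3, 2]` output.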

…nputs

The fallback pre-check called `gpu_op.output_facts` on the raw target-side input facts, but those can be a mix of host facts (e.g. a kv-cache past tensor) and device facts (current-turn output). GPU op `output_facts` impls bail with 'Inconsistent facts' on mixed inputs, which then wrongly trips the CPU fallback. Symptom on Llama 3.2 1B f32f32 --cuda: kv-cache Concat and residual Add ops all fell back to CPU, blowing the LLM CI op-only allowlist.

Mirror what wire-time `sync_inputs_if_required(ToDevice)` does: wrap each non-device input as a DeviceFact-from-host before calling `output_facts` for the pre-check.

Also adds an opt-in `TRACT_CUDA_TRANSLATE_DEBUG` env var that prints each rejected node and the underlying error chain, which is handy the next time a pre-check decision needs investigation.
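The normalisation step can be modelled with a two-variant fact type. This is a hedged sketch: `Fact`, `to_device`, `gpu_output_facts` and `pre_check` are hypothetical names, and the real facts carry datum types and device metadata that are omitted here.

```rust
// Hypothetical model: a fact is either host-resident or device-resident.
#[derive(Clone, Debug, PartialEq)]
pub enum Fact {
    Host(Vec<usize>),
    Device(Vec<usize>),
}

/// Wrap a host fact as a device fact; device facts pass through unchanged.
pub fn to_device(fact: &Fact) -> Fact {
    match fact {
        Fact::Host(shape) => Fact::Device(shape.clone()),
        Fact::Device(_) => fact.clone(),
    }
}

/// Stand-in for a GPU op's output_facts: rejects a host/device mix.
pub fn gpu_output_facts(inputs: &[Fact]) -> Result<Fact, String> {
    let has_host = inputs.iter().any(|f| matches!(f, Fact::Host(_)));
    let has_device = inputs.iter().any(|f| matches!(f, Fact::Device(_)));
    if has_host && has_device {
        return Err("Inconsistent facts".to_string());
    }
    inputs.first().cloned().ok_or_else(|| "missing input".to_string())
}

/// Pre-check on the facts as they will look after the wire-time sync.
pub fn pre_check(inputs: &[Fact]) -> bool {
    let synced: Vec<Fact> = inputs.iter().map(to_device).collect();
    gpu_output_facts(&synced).is_ok()
}
```

A host kv-cache fact next to a device current-turn fact fails `gpu_output_facts` directly but passes `pre_check`, so the node is no longer pushed to the CPU for that reason alone.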

kali added 10 commits May 27, 2026 13:02

metal/transform: mirror cuda pre-check fix on post-sync device facts

f0e668d

kali wants to merge 10 commits into main from cuda-metal/translator-cpu-fallback

kali commented May 27, 2026
