Cuda metal/translator cpu fallback #2299

Open
kali wants to merge 10 commits into main from cuda-metal/translator-cpu-fallback
Conversation

kali (Collaborator) commented May 27, 2026

No description provided.

kali added 10 commits May 27, 2026 13:02
Some GPU op translators (notably the AxisOp path: GpuAxisOp can carry a
Reshape(from, to) whose dims were synthesised from the source shape)
pass try_make and then bail inside wire_node's output_facts call when
an upstream translation has produced a different shape — e.g.
pulsification rewriting an axis size.  Previously this aborted the
whole CUDA/Metal transform.

Pre-check the gpu_op against the already-translated target-side input
facts before wiring; fall back to the original CPU op when the check
fails.  The CPU fallback can still surface a separate inconsistency
downstream (the encoder pulse + cuda chain hits one), but that's a
different bug that the previous early abort was hiding.
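
The pre-check described above can be pictured with a minimal sketch. The types `Fact`, `GpuReshape` and `Placement` and the function `place` are hypothetical stand-ins, not tract's real API; the sketch only assumes that a GPU op's `output_facts` returns an error when the target-side input shape differs from the one the op was built for.

```rust
// Hypothetical stand-ins for illustration only; not tract's real types.
#[derive(Clone, Debug, PartialEq)]
struct Fact {
    shape: Vec<usize>,
}

#[derive(Debug, PartialEq)]
enum Placement {
    Gpu,
    CpuFallback,
}

/// A GPU op carrying a Reshape(from, to) synthesised from the source shape.
struct GpuReshape {
    from: Vec<usize>,
    to: Vec<usize>,
}

impl GpuReshape {
    /// Fails when the input no longer has the shape the op was built for.
    fn output_facts(&self, inputs: &[Fact]) -> Result<Vec<Fact>, String> {
        let input = inputs.first().ok_or_else(|| "missing input".to_string())?;
        if input.shape != self.from {
            return Err(format!("expected {:?}, got {:?}", self.from, input.shape));
        }
        Ok(vec![Fact { shape: self.to.clone() }])
    }
}

/// Check the GPU op against the already-translated inputs before wiring it.
/// An error here selects the CPU fallback instead of aborting the transform.
fn place(op: &GpuReshape, target_inputs: &[Fact]) -> Placement {
    match op.output_facts(target_inputs) {
        Ok(_) => Placement::Gpu,
        Err(_) => Placement::CpuFallback,
    }
}
```

In this sketch a stale `from` shape only costs one node its GPU placement; the rest of the graph keeps translating.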

Regression-lock the translator fallback fix.  The pulsified preprocessor
used to crash translation on --cuda (and presumably --metal) because the
AxisOp translator emitted a GpuReshape with stale dims; now it falls
back to CPU.  The allowlist locks in the current CPU residue:
STFT/Pad/MoveAxis/PulsedSameAxisConcat/OptMulByScalar/OptSubUnicast.
Any future op spilling to CPU in this configuration fails CI.

Encoder pulse + GPU is not yet covered: my translator fix makes it
fall back from the GpuReshape, but a separate upstream CUDA op then
produces a shape that breaks the CPU Reshape too.  Follow-up.

PulsedSameAxisConcat and AffineChunkTrim both had typed output_facts that
diverged from eval / pulsed_output_facts when the model is fed through
pulsification and then re-typed (which the CUDA/Metal translator does to
compute target-side facts).

PulsedSameAxisConcat returned inputs[0] (the small constant 'pre' prefix)
instead of inputs[1] (the streaming data).  In the pulsified preprocessor,
every downstream node then saw the prefix's shape instead of the pulse-axis
size and collapsed its outputs to 1.
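
A toy version of the PulsedSameAxisConcat fix, assuming a hypothetical `TypedFact` that holds only a shape. The two functions are illustrative and are not the actual tract implementation.

```rust
// Hypothetical fact type for illustration; tract's TypedFact carries more.
#[derive(Clone, Debug, PartialEq)]
struct TypedFact {
    shape: Vec<usize>,
}

/// Old behaviour: the output took the shape of the constant 'pre' prefix.
fn concat_output_facts_old(inputs: &[TypedFact]) -> Vec<TypedFact> {
    vec![inputs[0].clone()]
}

/// Fixed behaviour: the output follows the streaming data input.
fn concat_output_facts_fixed(inputs: &[TypedFact]) -> Vec<TypedFact> {
    vec![inputs[1].clone()]
}
```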

AffineChunkTrim unconditionally subtracted typed_trim from the input dim,
but the eval/pulsed logic only trims when the input exceeds target_per_pulse
(handling the case where the upstream pulsifier absorbed 'c' into Delay state
and emits target_per_pulse directly).  Visible as src 14 -> tgt 13 in the
encoder pulse chain.
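
The conditional trim can be sketched on plain integers. The function names are hypothetical and the sizes in the usage note are illustrative; only the condition (trim when the input exceeds `target_per_pulse`) comes from the description above.

```rust
/// Old typed logic: always subtract the trim from the input dimension.
fn trimmed_dim_old(input: usize, typed_trim: usize) -> usize {
    input.saturating_sub(typed_trim)
}

/// Fixed typed logic, mirroring eval/pulsed: trim only when the input is
/// larger than the per-pulse target. Otherwise the upstream pulsifier has
/// already emitted target_per_pulse and the dimension passes through.
fn trimmed_dim_fixed(input: usize, typed_trim: usize, target_per_pulse: usize) -> usize {
    if input > target_per_pulse {
        input.saturating_sub(typed_trim)
    } else {
        input
    }
}
```

With an input of 14, a trim of 1 and an assumed per-pulse target of 14, the old function yields 13 while the fixed one keeps 14.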

Both are visible only on the pulsified-then-re-typed path; pure CPU pulse
runs use the PulsedFact pipeline and bypass these typed output_facts.

Now that the PulsedSameAxisConcat + AffineChunkTrim output_facts fixes
unblock the encoder pulse + GPU translation, lock in both:
 - preprocessor allowlist: STFT, Pad, PulsedSameAxisConcat,
   OptMulByScalar, OptSubUnicast
 - encoder allowlist:      AffineChunkTrim, PulsedRange

Runtime numerical correctness on these chains is still off (preprocessor
1.9% outliers vs CPU, encoder 26% — separate runtime/state-init bug
likely in GpuDelay or PulsePad).  The CI here only asserts translation
doesn't crash and the CPU-spill set doesn't grow.
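
One way to express the "CPU-spill set doesn't grow" assertion is a set difference against the allowlist. This is a sketch under assumptions: `unexpected_cpu_ops` is a hypothetical helper, and how the CI actually collects the op names left on CPU is not shown in this PR.

```rust
use std::collections::BTreeSet;

/// Return the op names that ran on CPU but are not on the allowlist,
/// sorted and deduplicated. An empty result means the check passes.
fn unexpected_cpu_ops<'a>(cpu_ops: &[&'a str], allowlist: &[&str]) -> Vec<&'a str> {
    let allowed: BTreeSet<&str> = allowlist.iter().copied().collect();
    let mut extra: Vec<&'a str> = cpu_ops
        .iter()
        .copied()
        .filter(|op| !allowed.contains(*op))
        .collect();
    extra.sort();
    extra.dedup();
    extra
}
```

A CI script could fail the job whenever the returned list is non-empty and print it as the set of newly spilled ops.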
…eset

Without this, comparing a CUDA-translated model against an npz reference
panics on the first non-plain output, and cumulative-off bisection keeps
the drifted device value instead of reseeding from the reference.

Matches the GPU Delay op and ASR convention (silent context before
first pulse).  Uninit memory leaked into the first `delay` output
frames, which were normally masked by downstream pulse trimming but
made any per-node CPU↔GPU comparison flaky on the warmup region.
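
The effect of zero-initialising the delay state can be shown with a one-dimensional toy. `Delay` here is a hypothetical struct, not the tract op; it assumes the state buffer holds exactly `delay` samples.

```rust
/// Toy 1-D delay line. The state starts as zeros ("silent context"),
/// so the first `delay` output samples are deterministic.
struct Delay {
    buffer: Vec<f32>,
}

impl Delay {
    fn new(delay: usize) -> Self {
        Delay { buffer: vec![0.0; delay] }
    }

    /// Push one pulse and return the same number of delayed samples.
    fn push(&mut self, pulse: &[f32]) -> Vec<f32> {
        let delay = self.buffer.len();
        // Concatenate the saved state with the new pulse.
        let mut joined = std::mem::take(&mut self.buffer);
        joined.extend_from_slice(pulse);
        // The output is the head; the tail becomes the new state.
        let out = joined[..pulse.len()].to_vec();
        self.buffer = joined[joined.len() - delay..].to_vec();
        out
    }
}
```

With a delay of 2, the first pulse `[1, 2, 3]` comes out as `[0, 0, 1]`, and the warmup region no longer depends on whatever was in memory.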

cli/compare: strip the .fused_axis_op suffix the CUDA translator adds
when an op absorbs adjacent axis ops, so per-node compare lines up GPU
outputs against the CPU reference (covers ~17% more nodes on a typical
pulsified GPU model).
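
The name matching amounts to a suffix strip. A minimal sketch, where `reference_name` is a hypothetical helper and the node names in the usage note are made up:

```rust
/// Map a translated node name back to the name used by the CPU reference
/// by dropping the `.fused_axis_op` suffix when it is present.
fn reference_name(node_name: &str) -> &str {
    node_name.strip_suffix(".fused_axis_op").unwrap_or(node_name)
}
```

A node named `conv1.fused_axis_op` would then be compared against the reference entry `conv1`, while names without the suffix are left untouched.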

cuda/transform: TRACT_CUDA_FORCE_CPU=substr[,substr,...] env var that
forces matching nodes to the CPU fallback path.  Pinpointed
CudaGgmlGemm on selfAttn_xMatmul.blockified as the source of the
encoder pulse + CUDA drift; useful keep-around for the next time.
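
A sketch of how such a variable could be parsed and applied. The helper names are hypothetical, and trimming whitespace and skipping empty entries are assumptions; the commit only states the `substr[,substr,...]` format and that matching nodes go to the CPU fallback path.

```rust
/// Split a `substr[,substr,...]` value into patterns.
/// Assumption: surrounding whitespace and empty entries are ignored.
fn parse_force_cpu(value: &str) -> Vec<String> {
    value
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Read the patterns from the environment; unset means no forced nodes.
fn force_cpu_patterns_from_env() -> Vec<String> {
    std::env::var("TRACT_CUDA_FORCE_CPU")
        .map(|v| parse_force_cpu(&v))
        .unwrap_or_default()
}

/// A node is forced to CPU when its name contains any pattern.
fn is_forced_to_cpu(node_name: &str, patterns: &[String]) -> bool {
    patterns.iter().any(|p| node_name.contains(p.as_str()))
}
```

Bisecting a drift then consists of widening or narrowing the pattern list until the GPU and CPU outputs agree.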

PulsePad is on `can_fuse_move`'s allowlist of ops that accept
non-contiguous (Move-permuted) inputs.  Its partial fills already use
`copy_with_origins`, but the initial 'copy the whole input to output'
used `flat_copy` — a flat memcpy that reads the buffer in pre-Move
byte order while the output is laid out in post-Move natural strides.

Visible symptom on the pulsified Nemotron encoder under --cuda: the
attention-output matmul fed a GpuPulsePad with a fused
GpuMoveAxis(0→1) on its input; the bad initial copy garbled the
matmul output before downstream layers consumed it, accumulating
~26% outliers end-to-end.
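
The difference between the two copies shows up on a small 2-D example. Both functions below are hypothetical sketches working on element strides, not the real `flat_copy` or `copy_with_origins`.

```rust
/// Flat copy: reproduces the source buffer in its stored order,
/// ignoring any permutation carried by the strides.
fn flat_copy_sketch(src: &[f32]) -> Vec<f32> {
    src.to_vec()
}

/// Stride-aware copy: walks the logical (post-Move) shape and reads each
/// element through the source strides, producing a contiguous output.
fn strided_copy_sketch(src: &[f32], shape: [usize; 2], strides: [usize; 2]) -> Vec<f32> {
    let mut out = Vec::with_capacity(shape[0] * shape[1]);
    for i in 0..shape[0] {
        for j in 0..shape[1] {
            out.push(src[i * strides[0] + j * strides[1]]);
        }
    }
    out
}
```

For a 2x3 row-major buffer `[0, 1, 2, 3, 4, 5]` viewed through a fused axis move as 3x2 with strides `[1, 3]`, the stride-aware copy yields `[0, 3, 1, 4, 2, 5]`, while the flat copy keeps the pre-Move order.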
…nputs

The fallback pre-check called `gpu_op.output_facts` on the raw
target-side input facts, but those can be a mix of host facts (e.g. a
kv-cache past tensor) and device facts (current-turn output).  GPU op
`output_facts` impls bail with 'Inconsistent facts' on mixed inputs,
which then wrongly trips the CPU fallback.  Symptom on Llama 3.2 1B
f32f32 --cuda: kv-cache Concat and residual Add ops all fell back to
CPU, blowing the LLM CI op-only allowlist.

Mirror what wire-time `sync_inputs_if_required(ToDevice)` does:
wrap each non-device input as a DeviceFact-from-host before calling
`output_facts` for the pre-check.
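
A compact model of the mixed-input case. The `Fact` enum and the three functions are hypothetical; the sketch only assumes that a GPU op rejects any input that is not a device fact.

```rust
// Hypothetical fact model: a shape that lives either on host or on device.
#[derive(Clone, Debug, PartialEq)]
enum Fact {
    Host(Vec<usize>),
    Device(Vec<usize>),
}

/// Wrap a host fact as a device fact; device facts pass through unchanged.
fn as_device(fact: &Fact) -> Fact {
    match fact {
        Fact::Host(shape) => Fact::Device(shape.clone()),
        device => device.clone(),
    }
}

/// Stand-in for a GPU op's output_facts: it bails on mixed inputs.
fn gpu_output_facts(inputs: &[Fact]) -> Result<Vec<Fact>, String> {
    if inputs.iter().all(|f| matches!(f, Fact::Device(_))) {
        Ok(inputs.iter().take(1).cloned().collect())
    } else {
        Err("Inconsistent facts".to_string())
    }
}

/// Pre-check on inputs normalised the same way wiring would sync them.
fn precheck_passes(inputs: &[Fact]) -> bool {
    let on_device: Vec<Fact> = inputs.iter().map(as_device).collect();
    gpu_output_facts(&on_device).is_ok()
}
```

A host-side kv-cache tensor paired with a device-side current-turn tensor fails the raw call but passes once both are presented as device facts.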

Also adds an opt-in `TRACT_CUDA_TRANSLATE_DEBUG` env var that prints
each rejected node and the underlying error chain — handy for the
next time a pre-check decision needs investigation.