Finding
attention.py's docstring claims that training, rollout prefill, and rollout decode all route through one CUDA forward kernel for consistency. In reality, rollout prefill and decode use the new varlen and paged-decode kernels, while training uses PyTorch SDPA math. Consequences:
- areno_causal_attention (dense 4D forward + custom backward) is never called by any runtime path or test.
- Three custom backwards (causal_attention_backward, varlen_..._backward, and the _ArenoPagedCausalAttentionDecode.backward reference recompute) are unreachable: rollout runs in inference_mode and training goes through SDPA. This is a large body of hand-written CUDA that is built unconditionally, with no callers or numeric tests.
Recommendation
Either (a) wire the dense kernel into a truly unified training path, or (b) delete the dead dense kernel + unreachable backwards and rewrite the docstring to describe reality (rollout = accel kernel, train = SDPA math).
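
Before deleting anything under option (b), the "no callers" claim can be re-checked mechanically. The following is a minimal sketch, not the project's tooling: it uses Python's standard ast module to list which candidate names are never referenced in a set of sources. The file contents and the name areno_varlen_attention are hypothetical stand-ins for illustration; only areno_causal_attention comes from the finding above. The scan sees Python references only, so calls made through getattr, string-based dispatch, or C++ bindings would need a separate check.

```python
import ast


def referenced_names(source):
    """Return every identifier that appears as a name, attribute, or from-import.

    Function definitions themselves are not counted, so a function that is
    defined but never used does not show up in the result.
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.Attribute):
            names.add(node.attr)
        elif isinstance(node, ast.ImportFrom):
            names.update(alias.name for alias in node.names)
    return names


def unreferenced(candidates, sources):
    """Return the candidates that no source file references, in input order."""
    used = set()
    for text in sources.values():
        used |= referenced_names(text)
    return [name for name in candidates if name not in used]


# Hypothetical miniature of the situation described in the finding.
sources = {
    "attention.py": '''
def areno_causal_attention(q, k, v):
    return q

def areno_varlen_attention(q, k, v):  # hypothetical name
    return q
''',
    "rollout.py": '''
from attention import areno_varlen_attention

def prefill(q, k, v):
    return areno_varlen_attention(q, k, v)
''',
}

dead = unreferenced(["areno_causal_attention", "areno_varlen_attention"], sources)
print(dead)
```

In a real checkout, sources would be built by reading every .py file under the package and test directories. A name reported here is only a candidate for deletion; it still needs a manual look at the extension bindings.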