
Fix native attention design-claim mismatch: dense kernel and custom backwards unused #80

Description

@adohe

Finding

The docstring in attention.py claims that training, rollout prefill, and rollout decode all route through one CUDA forward kernel for consistency. In practice, rollout prefill and decode use the new varlen and paged-decode kernels, while training uses PyTorch SDPA math. Consequences:

  • areno_causal_attention (dense 4D forward + custom backward) is never called by any runtime path or test.
  • Three custom backwards (causal_attention_backward, varlen_..._backward, and the reference recompute in _ArenoPagedCausalAttentionDecode.backward) are unreachable, because rollout runs under inference_mode and training goes through SDPA. This amounts to a large body of hand-written CUDA that is built unconditionally but has no callers and no numeric tests.
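A "never called" claim like the ones above can be spot-checked with a static call-site scan. The following is a minimal sketch using Python's standard `ast` module. The `SOURCES` contents and the `varlen_prefill` name are hypothetical stand-ins, not the repository's real code.

```python
import ast


def find_call_sites(sources, target):
    """Return sorted (filename, lineno) pairs for every call to `target`.

    `sources` maps a filename to its Python source text. A call counts if the
    callee is the bare name `target` or an attribute access ending in it
    (for example `attention.target(...)`).
    """
    hits = []
    for filename, text in sources.items():
        for node in ast.walk(ast.parse(text, filename=filename)):
            if not isinstance(node, ast.Call):
                continue
            callee = node.func
            if isinstance(callee, ast.Name):
                name = callee.id
            else:
                # ast.Attribute carries `.attr`; other callee kinds yield None.
                name = getattr(callee, "attr", None)
            if name == target:
                hits.append((filename, node.lineno))
    return sorted(hits)


# Hypothetical stand-ins for repository files (assumption, not real contents).
SOURCES = {
    "attention.py": (
        "def areno_causal_attention(q, k, v):\n"
        "    return q\n"
        "\n"
        "def varlen_prefill(q, k, v):\n"
        "    return q\n"
    ),
    "rollout.py": (
        "import attention\n"
        "\n"
        "def prefill(q, k, v):\n"
        "    return attention.varlen_prefill(q, k, v)\n"
    ),
}

print(find_call_sites(SOURCES, "areno_causal_attention"))  # []
print(find_call_sites(SOURCES, "varlen_prefill"))  # [('rollout.py', 4)]
```

A scan like this only sees direct Python calls. It would miss dynamic dispatch, aliases, and anything invoked from C++ bindings, so an empty result is a hint rather than proof.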

Recommendation

Either (a) wire the dense kernel into a truly unified training path, or (b) delete the dead dense kernel and the unreachable backwards, and rewrite the docstring to describe reality (rollout = accel kernel, train = SDPA math).
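If option (a) is chosen, the kernel would presumably need the numeric tests it currently lacks. The sketch below shows the shape such a parity check could take, in pure Python so that it runs anywhere. `causal_attention_reference` plays the role of the reference math, and `causal_attention_streaming` is a hypothetical stand-in for a kernel under test (online-softmax accumulation). Neither is the project's actual code, and a real test would compare tensors on the GPU.

```python
import math


def causal_attention_reference(q, k, v):
    """Naive single-head causal attention over lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Position i may attend only to positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights)) / total
            for d in range(len(v[0]))
        ])
    return out


def causal_attention_streaming(q, k, v):
    """Hypothetical kernel stand-in: one pass over keys with a running max."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        running_max, denom = -math.inf, 0.0
        acc = [0.0] * len(v[0])
        for j in range(i + 1):
            s = scale * sum(a * b for a, b in zip(qi, k[j]))
            new_max = max(running_max, s)
            rescale = math.exp(running_max - new_max)  # 0.0 on the first key
            w = math.exp(s - new_max)
            denom = denom * rescale + w
            acc = [a * rescale + w * x for a, x in zip(acc, v[j])]
            running_max = new_max
        out.append([a / denom for a in acc])
    return out


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two row-vector lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def check_parity(candidate, q, k, v, atol=1e-6):
    """True if `candidate` matches the reference within `atol` everywhere."""
    return max_abs_diff(candidate(q, k, v), causal_attention_reference(q, k, v)) <= atol


Q = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]
K = [[0.7, -0.1], [0.2, 0.9], [-0.3, 0.4]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

print(check_parity(causal_attention_streaming, Q, K, V))  # True
```

The same harness would apply to option (b) in reverse: once the dead paths are deleted, the remaining rollout kernels can be checked against the reference to confirm the docstring's new description.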


Labels

  • area/accel: Issues or PRs related to CUDA kernels and fused operators
  • kind/cleanup: Categorizes issue or PR as related to cleaning up code, process, or technical debt
