Fix native attention design-claim mismatch: dense kernel and custom backwards unused #80

## Finding
The docstring in `attention.py` claims that training, rollout prefill, and rollout decode all route through one CUDA forward kernel for consistency. In practice, rollout prefill and decode use the new varlen/paged-decode kernels, while training uses PyTorch SDPA math. Consequences:
- `areno_causal_attention` (dense 4D forward + custom backward) is never called by any runtime path or test.
- Three custom backwards (`causal_attention_backward`, `varlen_..._backward`, and the reference-recompute `_ArenoPagedCausalAttentionDecode.backward`) are unreachable: rollout runs in `inference_mode`, and training goes through SDPA. The result is a large body of hand-written CUDA that is built unconditionally yet has no callers and no numeric tests.
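
The "no callers" claim can be re-checked mechanically on the Python side. The following is a minimal sketch, assuming the entry points are referenced by name from Python modules. The helper `find_references` and the `SUSPECT_NAMES` set are hypothetical and not part of the repository, and the elided `varlen_..._backward` name is left out because this issue does not spell it out.

```python
import ast

# Names taken from the finding above. The elided `varlen_..._backward`
# is omitted because its full name is not given in this issue.
SUSPECT_NAMES = {"areno_causal_attention", "causal_attention_backward"}


def find_references(source: str, names: set[str]) -> dict[str, list[int]]:
    """Return {name: [line numbers]} for every use of `names` in `source`.

    Function definitions and import statements are not counted, because
    they appear in the AST as FunctionDef / alias nodes, not Name nodes.
    """
    hits = {name: [] for name in names}
    for node in ast.walk(ast.parse(source)):
        # Plain references such as `areno_causal_attention(q, k, v)`.
        if isinstance(node, ast.Name) and node.id in names:
            hits[node.id].append(node.lineno)
        # Attribute references such as `attention.areno_causal_attention`.
        elif isinstance(node, ast.Attribute) and node.attr in names:
            hits[node.attr].append(node.lineno)
    return {name: sorted(lines) for name, lines in hits.items()}


if __name__ == "__main__":
    sample = (
        "def areno_causal_attention(q, k, v):\n"
        "    return q\n"
        "\n"
        "def train_step(q, k, v):\n"
        "    return sdpa(q, k, v)\n"
    )
    # Neither suspect name is referenced in the sample, so both lists are empty.
    print(find_references(sample, SUSPECT_NAMES))
```

Running this over each `.py` file in the package and its tests gives a quick list of remaining call sites. It does not see string-based lookups such as `getattr`, or calls made from the C++/CUDA side, so an empty result is supporting evidence and not proof.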

## Recommendation
Either (a) wire the dense kernel into a truly unified training path, or (b) delete the dead dense kernel and the unreachable backwards, then rewrite the docstring to describe reality (rollout = accel kernel, train = SDPA math).
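
If option (b) is taken, the rewritten docstring could look roughly like the sketch below. It restates only what this issue reports. The constant name `PROPOSED_ATTENTION_DOCSTRING` is hypothetical, and the wording should be checked against the actual dispatch code before it is committed.

```python
# Hypothetical replacement text for the module docstring of attention.py.
# It only restates the routing described in this issue.
PROPOSED_ATTENTION_DOCSTRING = """Native attention entry points.

Routing:
- Rollout prefill and decode: varlen / paged-decode accelerated kernels,
  forward only, run under inference_mode.
- Training: PyTorch SDPA math path, with gradients from autograd and
  no custom backward.

Training and rollout do not share a forward kernel.
"""
```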
