Preserve DSA output across fused inverse RoPE#5526
Conversation
|
/ok to test ddf5e29 |
|
/claude strict-review |
There was a problem hiding this comment.
Review Summary
CRITICAL: 0 | IMPORTANT: 0 | SUGGESTION: 0
This is a clean, well-targeted correctness bug fix. The problem and solution are both clearly understood:
Bug: _FusedMLARoPEInplace.forward saves cos/sin for backward but the caller (DSv4HybridAttention.forward) passes the attention output O as a view sharing storage with what the upstream fused DSA autograd function saved for its backward. The in-place inverse RoPE kernel then corrupts that saved tensor, causing incorrect gradients during backward.
Fix: Clone the tensor before applying the in-place kernel, giving inverse RoPE its own storage and leaving the DSA-saved output intact.
Verification:
fused_mla_rope_out_of_placeis a minimal wrapper —t.clone()then delegate to the existing in-place path. Gradient flow is correct:_FusedMLARoPEInplace.backwardonly needscos/sin(not the forward input), so the clone introduces no autograd complications.- The only other call sites of
fused_mla_rope_inplaceare forward RoPE on freshly-projected Q/K tensors (lines 657, 669 in the same file; line 365 incsa.py) — these operate on independent storage with no saved-for-backward aliasing, so they remain correct as in-place. - The test is thorough: it first demonstrates the bug with the in-place version (proving the aliasing corruption), then verifies the out-of-place version preserves the original tensor bitwise (
rtol=0, atol=0), produces correct output, and propagates gradients correctly through backward. - All new identifiers (
fused_mla_rope_out_of_place,_SaveOutputForBackward) have meaningful use paths.
LGTM — no issues found.
|
/ok to test ddf5e29 |
Signed-off-by: kunlunl <kunlunl@nvidia.com> Co-authored-by: Kaixiang Lei <5780122+shyoshyo@users.noreply.github.com>
ddf5e29 to
526e198
Compare
What does this PR do ?
Fix a bug:
Fused DSA forward saves the raw attention output
Ofor use during backward, while the value returned to the caller is only a reshape/view backed by the same storage. without creating independent storage.Post-attention inverse RoPE then operates in-place on this returned view. The Triton kernel writes directly back through the input pointer (fused_mla_yarn_rope_apply.py). As a result, the original O saved for DSA backward is overwritten by the inverse-RoPE output before backward executes, causing the backward kernel to consume an incorrect attention output.
Bug reported by 5780122+shyoshyo@users.noreply.github.com
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.