[deepseek-v4] fix tilelang sparse MLA backward NaNs #4

Open
tang-t21 wants to merge 1 commit into yueming-yuan:deepseek-v4 from tang-t21:codex-deepseek-v4-tilelang-bwd-nan

@tang-t21

Summary

Fixes DeepSeek V4 tilelang sparse MLA backward NaNs seen during Bridge TP8/EP8 e2e training by disabling the aggressive shared-memory merge pass for the backward kernel.

Also tightens the sparse MLA backward test so NaN/Inf tilelang gradients fail explicitly instead of being hidden by diff calculations.
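As an illustration of why an explicit finiteness check matters, here is a minimal pure-Python sketch. It is not the code from the test file: `max_abs_diff`, `assert_all_finite`, `check_grad`, and the tolerance are hypothetical names and values chosen for this example. In a PyTorch test the equivalent check would presumably be built on `torch.isfinite(grad).all()`.

```python
import math


def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two float sequences."""
    return max(abs(a - e) for a, e in zip(actual, expected))


def assert_all_finite(name, values):
    """Fail loudly if any value is NaN or +/-Inf."""
    bad = [i for i, v in enumerate(values) if not math.isfinite(v)]
    assert not bad, f"{name} has non-finite values at indices {bad}"


def check_grad(name, actual, expected, tol=1e-3):
    """Hypothetical gradient check: finiteness first, then closeness."""
    assert_all_finite(name, actual)
    diff = max_abs_diff(actual, expected)
    assert diff <= tol, f"{name} max abs diff {diff} exceeds {tol}"


# A NaN can vanish inside a diff reduction: every comparison against NaN is
# False, so Python's max() never selects a NaN that is not the first element.
hidden = max_abs_diff([1.0, float("nan")], [1.0, 0.0])
print(hidden)  # the NaN in the second position is not reflected in the result
```

Running the finiteness check before any diff is computed means a NaN or Inf gradient fails with a message naming the tensor, regardless of how the later comparison is written.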

Debug Notes

  • Bridge e2e with torch sparse backend was stable and aligned with AutoModel.
  • The default tilelang backend produced a finite forward loss but NaN local grad norms in backward, before any DP communication, so the NaNs originate locally rather than in DP communication.
  • tests/deepseekv4/test_v4_tilelang_sparse_mla.py::test_sparse_mla_backward printed NaN gradients before this fix, but did not assert finiteness.
  • Removing TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE from tilelang_sparse_mla_bwd.py makes standalone backward gradients finite and allows the Bridge e2e training run to complete.
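The shape of the change described in the last note can be sketched as dropping one entry from the backward kernel's pass configuration. This is only an illustration, not the actual diff: the configs are modeled as a plain dict with string keys, and `without_aggressive_merge` and `hypothetical_other_option` are made-up names. How tilelang_sparse_mla_bwd.py really passes its configs is not shown in this PR description.

```python
# Stand-in for the real pass-config key; a plain string is an assumption here.
AGGRESSIVE_MERGE = "TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE"


def without_aggressive_merge(pass_configs):
    """Return a copy of pass_configs with the aggressive merge option removed."""
    return {k: v for k, v in pass_configs.items() if k != AGGRESSIVE_MERGE}


# Hypothetical config shared by forward and backward kernels.
shared_configs = {
    AGGRESSIVE_MERGE: True,
    "hypothetical_other_option": True,
}

# Only the backward kernel drops the option; the shared dict is left untouched.
bwd_configs = without_aggressive_merge(shared_configs)
print(bwd_configs)
```

Returning a copy keeps the change scoped to the backward kernel, which matches the PR's stated intent of disabling the pass for backward only.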

Validation

  • uv run --no-project --python /usr/bin/python python -m pytest -q tests/deepseekv4/test_v4_tilelang_sparse_mla.py::test_sparse_mla_backward -x -s
    • result: 5 passed
  • Bridge TP8/EP8 e2e, tilelang backend, 1 iter:
    • log: /tmp/dsv4_tilelang_debug_no_aggressive_merge_20260521_021632.log
    • loss: 0.903955, grad norm: 2.502, NaN iterations: 0
  • Bridge TP8/EP8 e2e, tilelang backend, 3 iter:
    • log: /tmp/dsv4_tilelang_debug_no_aggressive_merge_3iter_20260521_094640.log
    • losses: 0.9047399, 1.2337190, 0.5607164
    • grad norms: 3.037, 2.951, 3.640
    • NaN iterations: 0

Stacked on radixark#1045 (yueming-yuan:deepseek-v4).

Signed-off-by: tang-t21 <tiantang673@gmail.com>