[GLM-5] disable TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE in sparse_mla_bwd #2
Open
kakisong wants to merge 1 commit into
…e_mla_bwd

Mirrors the V4 fix landed in PR #1. The aggressive shared-memory merge tilelang pass aliases acc_dkv_shared (fp32) with Q_shared / KV_shared / dQ_shared (bf16) without inserting the syncs needed when producer and consumer dtypes differ. The split_store atomic_addx4 path then writes fp32 bytes that the next iteration reads back as bf16. On V4 (D=512, single buffer) this produced 100% NaN gradients on production shapes.

GLM-5's D=512 + D_tail=64 layout has not been observed to trip the alias collision in the wild, but the flag is unsafe by construction: the pass has no way to distinguish dtype-mixed buffers. Better to disable it preemptively than to wait for a layout change to trigger silent NaN.

No GLM-5 sparse_mla unit test exists in this repo to verify the change numerically; it is a single-line config flip with an explanatory comment.
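The byte-level hazard described above can be illustrated on the host without a GPU. The sketch below is not taken from the kernel or from tilelang; it only assumes a little-endian layout and the standard definition of bf16 as the upper 16 bits of an fp32 word. The helper names `bf16_bits_to_float` and `read_fp32_bytes_as_bf16` are hypothetical. It shows that a perfectly finite fp32 accumulator value can decode as NaN once its bytes are read back as bf16 elements.

```python
import math
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Decode a 16-bit bf16 pattern by placing it in the upper half of an fp32 word."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def read_fp32_bytes_as_bf16(values):
    """Pack values as little-endian fp32, then reinterpret the same bytes as bf16."""
    raw = struct.pack(f"<{len(values)}f", *values)
    halves = struct.unpack(f"<{len(raw) // 2}H", raw)
    return [bf16_bits_to_float(h) for h in halves]


# A finite fp32 value (slightly above 1.0) whose low 16 bits happen to be 0x7FC0.
# The bit pattern is chosen by hand for illustration; it is not from a real run.
acc = struct.unpack("<f", struct.pack("<I", 0x3F807FC0))[0]

# Each fp32 element yields two bf16 elements: low half first, then high half.
aliased = read_fp32_bytes_as_bf16([acc])

print(math.isfinite(acc))      # the accumulator itself is finite
print(math.isnan(aliased[0]))  # its low half decodes as a bf16 NaN
```

The low half of an fp32 word holds only mantissa bits, so its value as a bf16 number is effectively arbitrary. Whenever those bits set all eight exponent bits, the bf16 reader sees Inf or NaN, which would be consistent with the NaN gradients reported for V4.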
Summary
A one-line config flip on the GLM-5 sparse-MLA backward kernel: disable TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE.

Background
PR #1 traced V4's 100% NaN backward gradients to this exact tilelang pass aliasing acc_dkv_shared (fp32) with Q_shared / KV_shared / dQ_shared (bf16) without inserting the syncs needed for dtype-mixed buffers. The split_store atomic_addx4 path then writes fp32 bytes into shared memory that the next loop iteration reads back as bf16, propagating NaN through the GEMM chain.

GLM-5's miles_plugins/models/glm5/ops/tilelang_sparse_mla_bwd.py has the same:

- TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True
- acc_dkv_shared / acc_dkv_tail_shared (fp32) coexisting with Q_shared / Q_tail_shared / KV_shared / KV_tail_shared / dQ_shared / dQ_tail_shared (bf16)

GLM-5's D=512 + D_tail=64 layout has not been observed to trip the alias collision in the wild, but the merge pass has no way to distinguish dtype-mixed buffers, so it is unsafe by construction. Better to disable it preemptively than to wait for a layout or kernel change to silently regress.

What's in this PR
- miles_plugins/models/glm5/ops/tilelang_sparse_mla_bwd.py: comment out the flag, with a comment cross-referencing the V4 fix and explaining why the structural risk applies even though GLM-5 hasn't seen the bug.

Test plan

- […] (tests/deepseekv4/test_v4_tilelang_sparse_mla.py) covers the V4 kernel only. If a GLM-5 test gets added later it should exercise this path.
- […] (tests/e2e/megatron/test_glm5_744b_a40b_4layer.py) to confirm no regression; heavy, deferred.

Follow-ups

- […] tests/deepseekv4/test_v4_tilelang_sparse_mla.py so this kernel has its own numerical regression guard.
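A numerical regression guard for this kernel would plausibly check two things: that no gradient is NaN or Inf, and that each gradient stays within tolerance of a reference implementation. The following is a minimal, framework-free sketch of that check. The helper name `check_gradients` and the tolerance values are assumptions for illustration; a real test would run the GLM-5 backward kernel and compare tensors against a reference backward pass.

```python
import math


def check_gradients(actual, reference, rtol=1e-2, atol=1e-3):
    """Return a list of failure messages; an empty list means the check passed.

    Hypothetical helper: tolerances are placeholders, not values from this repo.
    """
    if len(actual) != len(reference):
        return [f"length mismatch: {len(actual)} vs {len(reference)}"]

    failures = []
    for i, (a, r) in enumerate(zip(actual, reference)):
        if not math.isfinite(a):
            # Catches the silent-NaN failure mode before any tolerance comparison.
            failures.append(f"index {i}: non-finite gradient {a}")
        elif abs(a - r) > atol + rtol * abs(r):
            failures.append(f"index {i}: {a} differs from reference {r}")
    return failures


# Flattened gradient values made up for the example.
ok = check_gradients([1.0, 2.0], [1.0, 2.001])
bad = check_gradients([float("nan"), 2.0], [1.0, 2.0])
```

Checking finiteness before the tolerance comparison matters because comparisons against NaN are always false, so a NaN gradient would otherwise slip past a pure "difference exceeds tolerance" test.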