[GLM-5] disable TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE in sparse_mla_bwd#2

Open
kakisong wants to merge 1 commit into v4-rl-base from glm5-tilelang-bwd-flag-fix

Summary

One-line config flip on the GLM-5 sparse-MLA backward kernel: disable TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE.

Background

PR #1 traced V4's 100% NaN backward gradients to this exact tilelang pass aliasing acc_dkv_shared (fp32) with Q_shared / KV_shared / dQ_shared (bf16) without inserting the syncs needed for dtype-mixed buffers. The split_store atomic_addx4 path then writes fp32 bytes into shared memory that the next loop iteration reads back as bf16 — propagating NaN through the GEMM chain.
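The dtype hazard described above can be reproduced on the host without any GPU code. The following is a minimal NumPy sketch, not the kernel itself: the helper names and the hand-picked bit pattern are illustrative assumptions. It writes one fp32 value into a buffer and then decodes the same four bytes as two bf16 values, which is what a bf16 consumer of an aliased shared-memory region would see.

```python
import numpy as np


def bf16_bits_to_fp32(bits_u16):
    """Decode raw bf16 bit patterns (uint16) into float32 values.

    bf16 is the upper 16 bits of an fp32, so shifting left by 16 is exact.
    """
    widened = bits_u16.astype(np.uint32) << np.uint32(16)
    return widened.view(np.float32)


def read_fp32_bytes_as_bf16(values_fp32):
    """Reinterpret fp32 storage as bf16, two bf16 lanes per fp32 element.

    Little-endian layout is forced so the low half of each fp32 comes first.
    """
    raw = np.asarray(values_fp32).astype("<f4")
    halves = raw.view("<u2")
    return bf16_bits_to_fp32(halves)


# Hand-picked fp32 bit pattern (an assumption for illustration): the value is
# an ordinary finite number close to 1.0, but its low 16 bits are 0x7FC0,
# which is a NaN encoding when read as bf16.
crafted = np.array([0x3F807FC0], dtype="<u4").view("<f4")
seen_as_bf16 = read_fp32_bytes_as_bf16(crafted)

print("fp32 value written:", crafted[0])
print("bf16 lanes read back:", seen_as_bf16)
```

The fp32 value is finite, yet the first bf16 lane decodes as NaN and the second as 1.0. Real accumulator contents will not always land on a NaN pattern, but any fp32 whose low half has all exponent bits set will, and a single NaN is enough to poison a GEMM.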

GLM-5's miles_plugins/models/glm5/ops/tilelang_sparse_mla_bwd.py has the same three ingredients:

  • pass_configs: TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True
  • buffer mix: acc_dkv_shared / acc_dkv_tail_shared (fp32) coexisting with Q_shared / Q_tail_shared / KV_shared / KV_tail_shared / dQ_shared / dQ_tail_shared (bf16)
  • split_store=2 atomic_addx4 write pattern

GLM-5's D=512 + D_tail=64 layout has not been observed to trip the alias collision in the wild, but the merge pass has no way to distinguish dtype-mixed buffers — it's unsafe by construction. Better to disable preemptively than wait for a layout / kernel change to silently regress.

What's in this PR

  • miles_plugins/models/glm5/ops/tilelang_sparse_mla_bwd.py: comment out the flag, with a comment cross-referencing the V4 fix and explaining why the structural risk applies even though GLM-5 hasn't seen the bug.

Test plan

  • AST parse check on the modified file
  • Numerical verification: no GLM-5 sparse-MLA unit test exists in this repo. The V4 unit test (tests/deepseekv4/test_v4_tilelang_sparse_mla.py) covers the V4 kernel only. If a GLM-5 test gets added later it should exercise this path.
  • (optional) Run a GLM-5 e2e smoke (tests/e2e/megatron/test_glm5_744b_a40b_4layer.py) to confirm no regression — heavy, deferred.
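The AST parse check in the first item can be made slightly stronger by also asserting that the flag is no longer set to True in any dict literal. The sketch below uses only the standard library. The `BEFORE` and `AFTER` strings are hypothetical stand-ins for the relevant part of the kernel file, and `flag_enabled` is an illustrative helper name, not something that exists in this repo.

```python
import ast

FLAG = "TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE"


def _key_name(node):
    """Return the trailing identifier of a dict key, or None if unknown."""
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    return None


def flag_enabled(source, flag=FLAG):
    """Parse `source` and report whether any dict literal sets `flag` to True.

    ast.parse raises SyntaxError if the edit broke the file, which covers
    the plain parse check as a side effect.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Dict):
            continue
        for key, value in zip(node.keys, node.values):
            if key is None:  # `**other` unpacking has no key node
                continue
            is_true = isinstance(value, ast.Constant) and value.value is True
            if _key_name(key) == flag and is_true:
                return True
    return False


# Hypothetical excerpts standing in for the file before and after this PR.
BEFORE = """
pass_configs = {
    PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True,
}
"""
AFTER = """
pass_configs = {
    # PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True,
}
"""

print("enabled before:", flag_enabled(BEFORE))
print("enabled after:", flag_enabled(AFTER))
```

To run it against the real file, read miles_plugins/models/glm5/ops/tilelang_sparse_mla_bwd.py as text and pass the contents to `flag_enabled`. This only guards against the flag being re-enabled by a literal in the source; it is not a substitute for the numerical test listed under Follow-ups.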

Follow-ups

  • Add a GLM-5 sparse-MLA unit test mirroring tests/deepseekv4/test_v4_tilelang_sparse_mla.py so this kernel has its own numerical regression guard.

…e_mla_bwd

Mirrors the V4 fix landed in PR #1. The aggressive
shared-memory merge tilelang pass aliases acc_dkv_shared (fp32) with
Q_shared / KV_shared / dQ_shared (bf16) without inserting the syncs
needed when producer/consumer dtypes differ. The split_store atomic_addx4
path then writes fp32 bytes that the next iteration reads back as bf16.

On V4 (D=512, single buffer) this produced 100% NaN gradients on
production shapes. GLM-5's D=512 + D_tail=64 layout has not been
observed to trip the alias collision in the wild, but the flag is unsafe
by construction — the pass has no way to distinguish dtype-mixed buffers.
Better to disable it preemptively than to wait for a layout change to
trigger silent NaN.

No GLM-5 sparse_mla unit test exists in this repo to verify numerically;
the change is a single-line config flip with an explanatory comment.