[GLM-5] disable TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE in sparse_mla_bwd by kakisong · Pull Request #2 · kakisong/miles

kakisong · 2026-05-10T08:55:11Z

Summary

One-line config flip on the GLM-5 sparse-MLA backward kernel: disable TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE.

Background

PR #1 traced V4's 100% NaN backward gradients to this same tilelang pass: it aliases acc_dkv_shared (fp32) with Q_shared / KV_shared / dQ_shared (bf16) without inserting the synchronization that dtype-mixed buffers need. The split_store atomic_addx4 path then writes fp32 bytes into shared memory that the next loop iteration reads back as bf16, propagating NaN through the GEMM chain.
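To make the byte-level failure mode concrete, here is a minimal pure-Python sketch. It is not the kernel and does not use tilelang; it only assumes little-endian storage and the standard fact that a bf16 value is the upper 16 bits of an fp32 word. The helper names are hypothetical. It shows that an ordinary fp32 value can decode as NaN when its four bytes are read back as two bf16 values.

```python
import math
import struct


def bf16_bits_to_float(bits16: int) -> float:
    """Decode a 16-bit bf16 pattern by placing it in the upper half of an fp32 word."""
    return struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))[0]


def read_fp32_as_bf16_pair(value: float) -> tuple[float, float]:
    """Return the two bf16 values a reader sees in the 4 bytes of one fp32 (little-endian)."""
    (word,) = struct.unpack("<I", struct.pack("<f", value))
    low_half, high_half = word & 0xFFFF, word >> 16
    return bf16_bits_to_float(low_half), bf16_bits_to_float(high_half)


# An fp32 just above 1.0 whose low 16 mantissa bits happen to equal 0x7FC0,
# which is a NaN bit pattern when interpreted as bf16.
(benign_looking,) = struct.unpack("<f", struct.pack("<I", 0x3F807FC0))
low, high = read_fp32_as_bf16_pair(benign_looking)

print(f"fp32 value written:      {benign_looking!r}")
print(f"bf16 read (low 2 bytes):  {low!r}")
print(f"bf16 read (high 2 bytes): {high!r}")
print("low half is NaN:", math.isnan(low))
```

The written value is finite and close to 1.0, yet one of the two bf16 values read from the same bytes is NaN. Once a single NaN enters a GEMM operand, every output element that depends on it is NaN as well, which is consistent with the all-NaN gradients reported for V4.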

GLM-5's miles_plugins/models/glm5/ops/tilelang_sparse_mla_bwd.py has the same three ingredients:

pass_configs: TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True
buffer mix: acc_dkv_shared / acc_dkv_tail_shared (fp32) coexisting with Q_shared / Q_tail_shared / KV_shared / KV_tail_shared / dQ_shared / dQ_tail_shared (bf16)
split_store=2 atomic_addx4 write pattern

GLM-5's D=512 + D_tail=64 layout has not been observed to trip the alias collision in the wild, but the merge pass has no way to distinguish dtype-mixed buffers — it's unsafe by construction. Better to disable preemptively than wait for a layout / kernel change to silently regress.

What's in this PR

miles_plugins/models/glm5/ops/tilelang_sparse_mla_bwd.py: comment out the flag, with a comment cross-referencing the V4 fix and explaining why the structural risk applies even though GLM-5 hasn't seen the bug.

Test plan

AST parse check on the modified file
Numerical verification: no GLM-5 sparse-MLA unit test exists in this repo. The V4 unit test (tests/deepseekv4/test_v4_tilelang_sparse_mla.py) covers the V4 kernel only. If a GLM-5 test gets added later it should exercise this path.
(optional) Run a GLM-5 e2e smoke (tests/e2e/megatron/test_glm5_744b_a40b_4layer.py) to confirm no regression — heavy, deferred.
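The AST parse check could be extended into a small static guard that also confirms the flag is no longer enabled. The sketch below is one possible way to do that with the standard library only; the function name and the BEFORE/AFTER snippets are hypothetical stand-ins, not the actual contents of tilelang_sparse_mla_bwd.py.

```python
import ast

FLAG_NAME = "TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE"


def flag_enabled(source: str, flag_name: str = FLAG_NAME) -> bool:
    """Parse `source` and report whether any dict literal maps the flag to True."""
    tree = ast.parse(source)  # raises SyntaxError if the file no longer parses
    for node in ast.walk(tree):
        if not isinstance(node, ast.Dict):
            continue
        for key, value in zip(node.keys, node.values):
            if key is None:  # `**other` unpacking inside a dict literal
                continue
            is_true = isinstance(value, ast.Constant) and value.value is True
            if flag_name in ast.unparse(key) and is_true:
                return True
    return False


# Hypothetical snippets standing in for the kernel file before and after the change.
BEFORE = """
pass_configs = {
    PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True,
}
"""

AFTER = """
pass_configs = {
    # Disabled: same structural risk as the V4 kernel fixed in PR #1.
    # PassConfigKey.TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE: True,
}
"""

print("enabled before:", flag_enabled(BEFORE))
print("enabled after: ", flag_enabled(AFTER))
```

Because `ast.parse` never executes the file, this check runs without tilelang or a GPU. It only catches the flag being re-enabled in a dict literal; it says nothing about numerical correctness.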

Follow-ups

Add a GLM-5 sparse-MLA unit test mirroring tests/deepseekv4/test_v4_tilelang_sparse_mla.py so this kernel has its own numerical regression guard.
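As a rough idea of what such a regression guard would assert, the sketch below checks a gradient against a reference for non-finite entries and for maximum error relative to the reference scale. It is written in plain Python over flat sequences so it runs anywhere; a real test would presumably operate on tensors produced by the GLM-5 kernel and a reference implementation. The helper names and the 2e-2 tolerance are assumptions, not values taken from the V4 test.

```python
import math
from typing import Sequence


def nan_fraction(values: Sequence[float]) -> float:
    """Fraction of entries that are NaN or infinite."""
    if not values:
        return 0.0
    bad = sum(1 for v in values if not math.isfinite(v))
    return bad / len(values)


def check_gradient(
    kernel_grad: Sequence[float],
    reference_grad: Sequence[float],
    rel_tol: float = 2e-2,  # assumed tolerance for bf16 inputs, not from the source
) -> None:
    """Raise AssertionError if the gradient has non-finite entries or drifts from the reference."""
    assert len(kernel_grad) == len(reference_grad), "shape mismatch"
    bad = nan_fraction(kernel_grad)
    assert bad == 0.0, f"{bad:.0%} of gradient entries are non-finite"
    scale = max(abs(r) for r in reference_grad) or 1.0
    worst = max(abs(k - r) for k, r in zip(kernel_grad, reference_grad))
    assert worst / scale <= rel_tol, f"max relative error {worst / scale:.3g}"


# A gradient that matches the reference within tolerance passes silently.
check_gradient([1.0, -2.0, 0.5], [1.0, -2.01, 0.5])
```

Checking the non-finite fraction first gives a clearer failure message for the V4-style symptom than a generic closeness assertion would.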

…e_mla_bwd

Mirrors the V4 fix landed in PR #1. The aggressive shared-memory merge tilelang pass aliases acc_dkv_shared (fp32) with Q_shared / KV_shared / dQ_shared (bf16) without inserting the syncs needed when producer/consumer dtypes differ. The split_store atomic_addx4 path then writes fp32 bytes that the next iteration reads back as bf16. On V4 (D=512, single buffer) this produced 100% NaN gradients on production shapes.

GLM-5's D=512 + D_tail=64 layout has not been observed to trip the alias collision in the wild, but the flag is unsafe by construction: the pass has no way to distinguish dtype-mixed buffers. Better to disable it preemptively than to wait for a layout change to trigger silent NaN.

No GLM-5 sparse_mla unit test exists in this repo to verify numerically; the change is a single-line config flip with an explanatory comment.

kakisong mentioned this pull request May 10, 2026

[V4-SFT] OpenHermes 64-GPU pipeline + tilelang_sparse_mla_bwd NaN fix #1

Closed

5 tasks

kakisong wants to merge 1 commit into v4-rl-base from glm5-tilelang-bwd-flag-fix
