[lite] GLM5.2 (DeepSeek-V3.2) IndexShare DSA support by ISEEKYAN · Pull Request #66 · ISEEKYAN/Megatron-LM

ISEEKYAN · 2026-06-27T03:05:09Z

GLM5.2 (DeepSeek-V3.2) IndexShare DSA support

Adds GLM5.2 support to Megatron-Lite as a configuration variant of the existing GLM5 (deepseek_v3_2) implementation. The model-specific delta is IndexShare plus the GLM5.2 RoPE/context configuration; the DSA implementation remains in the shared attention primitive and reuses the existing fused kernels.

Implementation

IndexShare schedule: full layers compute top-k, and the shared layers that follow reuse the top-k from that source layer. Shared layers have no indexer parameters and contribute no indexer loss. The feature is configuration-gated; index_topk_freq=1 keeps the existing GLM5 all-full path.
Packed/CP positions: the model protocol derives and CP-splits per-sequence positions, and the model/layer/DSA wrapper now forwards those positions unchanged. Packed DSA requires explicit positions; the dense CP fallback uses the shared zigzag position primitive.
Pipeline safety: the split guard covers both trunk and MTP logical layer indices, so a shared MTP layer cannot be placed on a different stage from its full source.
Bounded cache lifecycle: the normal forward pass keeps only the current source group on GPU and drops it once its last consumer has run. Recompute/offload retains old groups on CPU, pages at most one source group back to GPU during backward, and then clears the resident group.
Checkpoint mapping: shared layers omit indexer weights on import/export, while full layers retain the existing mapping.
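The pipeline-safety rule above can be illustrated with a minimal sketch. This is not the guard from this PR: `stage_of`, `check_index_share_split`, the `source_of` mapping, and the `stage_starts` layout encoding are hypothetical names chosen for illustration. The sketch only shows the invariant being enforced, namely that every shared layer, whether trunk or MTP, lands on the same pipeline stage as its full source.

```python
from typing import Dict, List, Optional


def stage_of(layer: int, stage_starts: List[int]) -> int:
    """Return the pipeline stage holding a zero-based logical layer index.

    stage_starts[i] is the first logical layer of stage i (hypothetical
    encoding: sorted ascending, beginning with 0).
    """
    stage = 0
    for i, start in enumerate(stage_starts):
        if layer >= start:
            stage = i
    return stage


def check_index_share_split(
    source_of: Dict[int, Optional[int]], stage_starts: List[int]
) -> None:
    """Raise ValueError if a shared layer is split from its full source.

    source_of maps each logical layer (trunk and MTP alike) to the layer
    whose top-k it reuses, or to None when the layer is full.
    """
    for layer, source in sorted(source_of.items()):
        if source is None:
            continue  # full layers compute their own top-k
        if stage_of(layer, stage_starts) != stage_of(source, stage_starts):
            raise ValueError(
                f"shared layer {layer} and its source {source} "
                "are on different pipeline stages"
            )


# Illustrative 8-layer layout: layers 0 and 4 are full, the rest share.
source_of = {0: None, 1: 0, 2: 0, 3: 0, 4: None, 5: 4, 6: 4, 7: 4}
check_index_share_split(source_of, [0, 4])  # split on a group boundary: accepted
```

Splitting the same layout at `[0, 6]` would place layers 6 and 7 on a later stage than their source layer 4, so the check would raise.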

MTP schedule fact

The official GLM5.2 layout has 78 decoder layers followed by one MTP layer. The MTP transformer is zero-based global layer 78 (one-based layer 79), which is a full indexer layer under the freq=4, offset=3 schedule. It therefore computes its own top-k and does not reuse the last trunk layer's top-k. index_share_for_mtp_iteration is retained for source-config compatibility and is not evidence of cross-layer MTP sharing.
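The arithmetic behind this claim can be checked with a short sketch. The exact schedule rule is not spelled out in this description, so the predicate below is an assumption: it treats a layer as full when its one-based index is congruent to offset modulo freq. That rule is consistent with the two facts stated here (layer 79 is full for freq=4, offset=3, and freq=1 makes every layer full), but the real implementation may define it differently. `is_full_indexer_layer` is a hypothetical name.

```python
def is_full_indexer_layer(zero_based_layer: int, freq: int, offset: int) -> bool:
    """Assumed rule: full when the one-based index matches offset modulo freq."""
    one_based = zero_based_layer + 1
    return one_based % freq == offset % freq


# MTP transformer: zero-based layer 78, one-based layer 79, and 79 % 4 == 3.
print(is_full_indexer_layer(78, freq=4, offset=3))  # True
# Last trunk layer: zero-based layer 77, one-based layer 78, and 78 % 4 == 2.
print(is_full_indexer_layer(77, freq=4, offset=3))  # False
# freq=1 makes every layer full under this rule (the GLM5 all-full path).
print(all(is_full_indexer_layer(i, freq=1, offset=3) for i in range(79)))  # True
```

Under this assumed rule the last trunk layer is a shared layer, which is why the text explicitly notes that the MTP layer does not reuse its top-k.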

Validation (Slurm GPU, non-skip)

Actual causal models vs HF, job 13182986 (COMPLETED, 0:0): reduced-size four-layer Glm5Model fused IndexShare forward, exact MLite-to-HF weight export, and independent transformers.GlmMoeDsaForCausalLM.forward over sequence length 1024. Full logits cosine: global 0.9998986721, token mean 0.9999044538, token minimum 0.9996886849, p01 0.9998022318; all values are finite. Combined instrumentation for the model forward and following recompute probe recorded fused_indexer_sparse_attn_with_topk=2, dsa_sparse_attn=10, indexer_topk=5.
Real protocol packed THD + CP2 + IndexShare + MTP, job 13182986 (COMPLETED, 0:0): two-GPU _forward_step with variable lengths [16, 20, 24], forward and backward, shared trunk layer and full MTP indexer; loss 5.269021e+00, grad norm 6.683617e+01, 1 passed with no skip.
Recompute cache lifecycle, job 13182986 (COMPLETED, 0:0): two source groups through actual checkpoint recomputation; CPU history 4,194,304 bytes, GPU-resident cache entries after forward/backward 0/0, finite grad norm 0.0194911640.
Fused DSA primitive vs Megatron reference, job 13128510 (COMPLETED, 0:0): source/shared/MTP attention max_abs = 0.001953125 / 0.001953125 / 0.0009765625; Megatron top-k support matched. This is primitive-level parity evidence, not a full-model-logits claim.
Weight IO: save-HF BF16 round trips with IndexShare off/on, job 13128494 (COMPLETED, 0:0), both max_abs=0.0.
Parallel/runtime regression: uneven 78-layer pipeline job 13135803, VERL runtime job 13135801, and fused run-to-run acceptance job 13138759 all completed successfully; the non-deterministic fused backward is assessed against deterministic unfused and golden-variance references.
Pipeline layout unit matrix: job 13183070 (COMPLETED, 0:0), 56 passed, 1 xfailed; the expected xfail is the existing unsupported standalone-MTP placement.
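For readers interpreting the logits-cosine figures above, the following is a minimal pure-Python sketch of how such metrics could be computed. It is not the validation script used in these jobs. `logits_cosine_report` is a hypothetical name, and the definitions are assumptions: "global" is the cosine over all logits flattened into one vector, the token statistics are per-position cosines over the vocabulary axis, and "p01" is read as a nearest-rank 1st percentile of the per-token values.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def logits_cosine_report(
    ref: List[List[float]], test: List[List[float]]
) -> Dict[str, float]:
    """Compare two [tokens][vocab] logit tables (assumed metric definitions)."""
    per_token = [cosine(r, t) for r, t in zip(ref, test)]
    flat_ref = [x for row in ref for x in row]
    flat_test = [x for row in test for x in row]
    ordered = sorted(per_token)
    # Assumed p01 definition: nearest-rank 1st percentile over tokens.
    p01_rank = max(1, math.ceil(0.01 * len(ordered)))
    return {
        "global": cosine(flat_ref, flat_test),
        "token_mean": sum(per_token) / len(per_token),
        "token_min": ordered[0],
        "p01": ordered[p01_rank - 1],
    }


# Tiny example: the second token differs between the two models.
report = logits_cosine_report([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])
print(report)
```

Reporting the token minimum and a low percentile alongside the global value helps because a single badly mismatched position can be hidden by a global average over a 1024-token sequence.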

Focused CPU checks

Position forwarding, MTP pipeline guard, and cache drop/offload unit checks: 4 passed.
Runtime layering and packed bridge checks: 12 passed.
compileall, targeted Ruff, and git diff --check pass.

ISEEKYAN added 6 commits June 25, 2026 22:31

Add GLM5.2 DSA index sharing

dfcf14c

Handle GLM5.2 shared indexer checkpoint weights

2b9aa20

Complete GLM5.2 pipeline and DSA validation

9ccce73

Add deterministic unfused DSA acceptance reference

49e5434

Fix GLM5.2 packed positions and index sharing lifecycle

325b36a

Reuse the existing pipeline layout constructor

f637760

ISEEKYAN force-pushed the mlite-glm52-checkpoint-indexshare branch from d9f5316 to f637760 June 27, 2026 10:53

ISEEKYAN added 2 commits June 27, 2026 08:14

Narrow DSA attention exports

b269f3d

Fix GLM5 context-parallel rotary reconstruction

da86fe5

Meirtz mentioned this pull request Jun 27, 2026

[lite] Harden GLM5.2 readiness and checkpoint recovery (#66 follow-up) #67

Draft

ISEEKYAN wants to merge 8 commits into main from mlite-glm52-checkpoint-indexshare
