[lite] GLM5.2 (DeepSeek-V3.2) IndexShare DSA support #66

Open
ISEEKYAN wants to merge 8 commits into main from mlite-glm52-checkpoint-indexshare

Conversation

@ISEEKYAN ISEEKYAN commented Jun 27, 2026

Owner

GLM5.2 (DeepSeek-V3.2) IndexShare DSA support

Adds GLM5.2 support to Megatron-Lite as a configuration variant of the existing GLM5 (deepseek_v3_2) implementation. The model-specific delta is IndexShare plus the GLM5.2 RoPE/context configuration; the DSA implementation remains in the shared attention primitive and reuses the existing fused kernels.

Implementation

  • IndexShare schedule: full layers compute top-k, and the shared layers that follow reuse the top-k from that source layer. Shared layers have no indexer parameters or indexer loss. The feature is configuration-gated; index_topk_freq=1 keeps the existing GLM5 all-full path.
  • Packed/CP positions: the model protocol derives and CP-splits per-sequence positions, and the model/layer/DSA wrapper now forwards those positions unchanged. Packed DSA requires explicit positions; the dense CP fallback uses the shared zigzag position primitive.
  • Pipeline safety: the split guard covers both trunk and MTP logical layer indices, so a shared MTP layer cannot be placed on a different stage from its full source.
  • Bounded cache lifecycle: a normal forward pass keeps only the current source group on GPU and drops it once its consuming layers have run. Recompute/offload retains old groups on CPU and pages at most one source group back to GPU during backward, then clears the resident group.
  • Checkpoint mapping: shared layers omit indexer weights on import/export, while full layers retain the existing mapping.
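
The schedule and pipeline-safety points above can be illustrated with a minimal sketch. It is not the Megatron-Lite implementation: the function names `source_layers` and `check_pipeline_split`, the `is_full` flag list, and the `stage_starts` representation are all hypothetical, chosen only to show the idea that a shared layer must sit on the same pipeline stage as the full layer whose top-k it reuses.

```python
from typing import List, Optional


def source_layers(is_full: List[bool]) -> List[Optional[int]]:
    """For each logical layer, return the full layer whose top-k it uses.

    A full layer is its own source; a shared layer reuses the nearest
    preceding full layer. None marks a shared layer with no preceding source.
    """
    sources: List[Optional[int]] = []
    current: Optional[int] = None
    for idx, full in enumerate(is_full):
        if full:
            current = idx
        sources.append(current)
    return sources


def check_pipeline_split(is_full: List[bool], stage_starts: List[int]) -> None:
    """Raise ValueError if a shared layer is on a different stage than its source.

    stage_starts (hypothetical representation) holds the first logical layer
    index of each stage, ascending and starting at 0. Logical indices cover
    the trunk layers followed by the MTP layers.
    """
    def stage_of(layer: int) -> int:
        # Index of the last stage whose first layer is <= this layer.
        return max(s for s, start in enumerate(stage_starts) if start <= layer)

    for layer, src in enumerate(source_layers(is_full)):
        if src is None:
            raise ValueError(f"shared layer {layer} has no full source")
        if stage_of(layer) != stage_of(src):
            raise ValueError(
                f"layer {layer} (stage {stage_of(layer)}) is split from "
                f"its source {src} (stage {stage_of(src)})"
            )


# Toy 8-layer layout (not the GLM5.2 schedule): full layers at 0 and 4.
flags = [True, False, False, False, True, False, False, False]
check_pipeline_split(flags, stage_starts=[0, 4])      # cut lands on a full layer
try:
    check_pipeline_split(flags, stage_starts=[0, 6])  # cut separates 6, 7 from 4
except ValueError as err:
    print(err)
```

Because the guard works on logical indices, an MTP layer appended after the trunk is checked by the same loop as any trunk layer.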

MTP schedule fact

The official GLM5.2 layout has 78 decoder layers followed by one MTP layer. The MTP transformer is zero-based global layer 78 (one-based layer 79), which is a full indexer layer under the freq=4, offset=3 schedule. It computes its own top-k and does not share the last trunk layer's top-k. index_share_for_mtp_iteration is retained for source-config compatibility and is not evidence of cross-layer MTP sharing.
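
The description does not spell out how freq and offset select full layers. As a worked check only, the sketch below assumes one rule that is consistent with the two facts stated in this PR (freq=1 gives the all-full path, and zero-based layer 78 is full under freq=4, offset=3): a one-based layer n is full when n % freq == offset % freq. The function `is_full_layer` and this rule are assumptions for illustration; the actual schedule code may differ.

```python
def is_full_layer(zero_based_idx: int, freq: int, offset: int) -> bool:
    """Assumed rule (not confirmed by the PR): one-based layer n is full
    when n % freq == offset % freq."""
    one_based = zero_based_idx + 1
    return one_based % freq == offset % freq


NUM_TRUNK_LAYERS = 78         # from the description: 78 decoder layers
MTP_LAYER = NUM_TRUNK_LAYERS  # zero-based global index 78, one-based 79

# One-based 79 % 4 == 3, so the MTP layer is full under the assumed rule.
print(is_full_layer(MTP_LAYER, freq=4, offset=3))      # True
# One-based 78 % 4 == 2, so the last trunk layer is shared under the assumed rule.
print(is_full_layer(MTP_LAYER - 1, freq=4, offset=3))  # False
# freq=1 marks every layer full, matching the existing all-full path.
print(all(is_full_layer(i, freq=1, offset=3) for i in range(MTP_LAYER + 1)))  # True
```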

Validation (Slurm GPU, non-skip)

  • Actual causal models vs HF, job 13182986 (COMPLETED, 0:0): reduced-size four-layer Glm5Model fused IndexShare forward, exact MLite-to-HF weight export, and independent transformers.GlmMoeDsaForCausalLM.forward over sequence length 1024. Full logits cosine: global 0.9998986721, token mean 0.9999044538, token minimum 0.9996886849, p01 0.9998022318; all values are finite. Combined instrumentation for the model forward and following recompute probe recorded fused_indexer_sparse_attn_with_topk=2, dsa_sparse_attn=10, indexer_topk=5.
  • Real protocol packed THD + CP2 + IndexShare + MTP, job 13182986 (COMPLETED, 0:0): two-GPU _forward_step with variable lengths [16, 20, 24], forward and backward, shared trunk layer and full MTP indexer; loss 5.269021e+00, grad norm 6.683617e+01, 1 passed with no skip.
  • Recompute cache lifecycle, job 13182986 (COMPLETED, 0:0): two source groups through actual checkpoint recomputation; CPU history 4,194,304 bytes, GPU-resident cache entries after forward/backward 0/0, finite grad norm 0.0194911640.
  • Fused DSA primitive vs Megatron reference, job 13128510 (COMPLETED, 0:0): source/shared/MTP attention max_abs = 0.001953125 / 0.001953125 / 0.0009765625; Megatron top-k support matched. This is primitive-level parity evidence, not a full-model-logits claim.
  • Weight IO: save-HF BF16 round trips with IndexShare off/on, job 13128494 (COMPLETED, 0:0), both max_abs=0.0.
  • Parallel/runtime regression: uneven 78-layer pipeline job 13135803, VERL runtime job 13135801, and fused run-to-run acceptance job 13138759 all completed successfully; the non-deterministic fused backward is assessed against deterministic unfused and golden-variance references.
  • Pipeline layout unit matrix: job 13183070 (COMPLETED, 0:0), 56 passed, 1 xfailed; the expected xfail is the existing unsupported standalone-MTP placement.
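
For readers who want to reproduce the shape of the logits comparison, here is a minimal sketch of how global and per-token cosine statistics could be computed. It is pure Python for portability; the real check presumably runs on tensors. The function names and the exact p01 definition (here, the value at the 1% position of the sorted per-token cosines, rounded down) are assumptions, not taken from the PR's test code.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (assumes non-zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def logits_cosine_report(
    ours: List[List[float]], ref: List[List[float]]
) -> Dict[str, float]:
    """Compare two [num_tokens][vocab] logit matrices.

    global: cosine over all logits flattened into one vector.
    token_*: statistics over the per-token (per-row) cosines.
    """
    flat_ours = [v for row in ours for v in row]
    flat_ref = [v for row in ref for v in row]
    per_token = sorted(cosine(o, r) for o, r in zip(ours, ref))
    # Assumed percentile convention: index at 1% of the sorted range, floored.
    p01_index = int(0.01 * (len(per_token) - 1))
    return {
        "global": cosine(flat_ours, flat_ref),
        "token_mean": sum(per_token) / len(per_token),
        "token_min": per_token[0],
        "p01": per_token[p01_index],
    }


# Tiny illustrative input, not data from the PR.
report = logits_cosine_report(
    ours=[[1.0, 0.0], [0.0, 2.0]],
    ref=[[1.0, 0.0], [0.0, 2.0]],
)
print(report)
```

Reporting the minimum and a low percentile alongside the mean guards against a few badly mismatched tokens being hidden by an otherwise high average.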

Focused CPU checks

  • Position forwarding, MTP pipeline guard, and cache drop/offload unit checks: 4 passed.
  • Runtime layering and packed bridge checks: 12 passed.
  • compileall, targeted Ruff, and git diff --check pass.
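
To make the position-forwarding check concrete, the sketch below shows per-sequence positions for a packed batch and a zigzag context-parallel split. It assumes the common zigzag layout in which each sequence is cut into 2 × cp_size chunks and rank r takes chunks r and 2 × cp_size − 1 − r; the helper names are hypothetical and this is not the shared primitive from the codebase.

```python
from typing import List


def packed_positions(seq_lens: List[int]) -> List[int]:
    """Positions restart at 0 for every sequence in the packed (THD) batch."""
    return [pos for n in seq_lens for pos in range(n)]


def zigzag_cp_positions(seq_lens: List[int], cp_size: int, cp_rank: int) -> List[int]:
    """Per-sequence positions held by one CP rank under the assumed zigzag layout."""
    out: List[int] = []
    for n in seq_lens:
        if n % (2 * cp_size) != 0:
            raise ValueError(f"length {n} is not divisible by 2 * cp_size")
        chunk = n // (2 * cp_size)
        # Each rank takes one chunk from the front and its mirror from the back,
        # which balances causal-attention work across ranks.
        for c in (cp_rank, 2 * cp_size - 1 - cp_rank):
            out.extend(range(c * chunk, (c + 1) * chunk))
    return out


# Variable lengths matching the CP2 validation case above.
lens = [16, 20, 24]
print(len(packed_positions(lens)))                      # 60 tokens in total
print(zigzag_cp_positions([16], cp_size=2, cp_rank=0))  # [0, 1, 2, 3, 12, 13, 14, 15]
print(zigzag_cp_positions([16], cp_size=2, cp_rank=1))  # [4, 5, 6, 7, 8, 9, 10, 11]
```

Once positions are split this way, downstream layers only need to pass them through unchanged, which is the behavior the forwarding check exercises.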

@ISEEKYAN force-pushed the mlite-glm52-checkpoint-indexshare branch from d9f5316 to f637760 on June 27, 2026 10:53