[lite] GLM5.2 (DeepSeek-V3.2) IndexShare DSA support#66
Open
ISEEKYAN wants to merge 8 commits into
Adds GLM5.2 support to Megatron-Lite as a configuration variant of the existing GLM5 (`deepseek_v3_2`) implementation. The model-specific delta is IndexShare plus the GLM5.2 RoPE/context configuration; the DSA implementation remains in the shared attention primitive and reuses the existing fused kernels.

Implementation
`full` layers compute top-k, and the following `shared` layers reuse that source. Shared layers have no indexer parameters or indexer loss. The feature is configuration-gated; `index_topk_freq=1` keeps the existing GLM5 all-full path.

MTP schedule fact
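To make the full/shared schedule concrete, here is a minimal sketch. The exact mapping from `index_topk_freq` and the offset to layer kinds is not spelled out in this description, so the modular rule below is a hypothetical one, chosen only because it agrees with the two facts stated here: `index_topk_freq=1` yields an all-full path, and zero-based layer 78 is a full layer under `freq=4, offset=3`. The function names `is_full_layer` and `source_layer` are illustrative and are not taken from the PR.

```python
def is_full_layer(layer_idx: int, freq: int, offset: int) -> bool:
    """Return True if zero-based `layer_idx` computes its own indexer top-k.

    ASSUMPTION: this rule is illustrative, not the PR's implementation.
    It is only required to match two documented facts: freq=1 makes every
    layer full, and layer 78 is full when freq=4 and offset=3.
    """
    if freq < 1:
        raise ValueError("freq must be >= 1")
    return max(layer_idx - offset + 1, 0) % freq == 0


def source_layer(layer_idx: int, freq: int, offset: int) -> int:
    """Return the zero-based layer whose top-k `layer_idx` would use.

    A full layer is its own source; a shared layer walks back to the
    nearest preceding full layer.
    """
    src = layer_idx
    while src >= 0 and not is_full_layer(src, freq, offset):
        src -= 1
    if src < 0:
        raise ValueError("no full layer at or before this index")
    return src


# 78 trunk layers plus one MTP layer at zero-based global index 78.
kinds = ["full" if is_full_layer(i, 4, 3) else "shared" for i in range(79)]
mtp_kind = kinds[78]               # "full": the MTP layer computes its own top-k
mtp_source = source_layer(78, 4, 3)  # 78: it does not reuse a trunk layer's top-k
```

Under this assumed rule, a shared layer never needs indexer weights of its own, because `source_layer` always resolves to an earlier full layer.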
The official GLM5.2 layout has 78 decoder layers followed by one MTP layer. The MTP transformer is zero-based global layer 78 (one-based layer 79), which is a full indexer layer under the `freq=4, offset=3` schedule. It computes its own top-k and does not share the last trunk layer's top-k. `index_share_for_mtp_iteration` is retained for source-config compatibility and is not evidence of cross-layer MTP sharing.

Validation (Slurm GPU, non-skip)
- [...] (`COMPLETED`, `0:0`): reduced-size four-layer `Glm5Model` fused IndexShare forward, exact MLite-to-HF weight export, and independent `transformers.GlmMoeDsaForCausalLM.forward` over sequence length 1024. Full logits cosine: global `0.9998986721`, token mean `0.9999044538`, token minimum `0.9996886849`, p01 `0.9998022318`; all values are finite. Combined instrumentation for the model forward and following recompute probe recorded `fused_indexer_sparse_attn_with_topk=2`, `dsa_sparse_attn=10`, `indexer_topk=5`.
- [...] (`COMPLETED`, `0:0`): two-GPU `_forward_step` with variable lengths `[16, 20, 24]`, forward and backward, shared trunk layer and full MTP indexer; loss `5.269021e+00`, grad norm `6.683617e+01`, `1 passed` with no skip.
- [...] (`COMPLETED`, `0:0`): two source groups through actual checkpoint recomputation; CPU history `4,194,304` bytes, GPU-resident cache entries after forward/backward `0/0`, finite grad norm `0.0194911640`.
- [...] (`COMPLETED`, `0:0`): source/shared/MTP attention `max_abs = 0.001953125 / 0.001953125 / 0.0009765625`; Megatron top-k support matched. This is primitive-level parity evidence, not a full-model-logits claim.
- [...] (`COMPLETED`, `0:0`), both `max_abs=0.0`.
- [...] (`COMPLETED`, `0:0`), `56 passed, 1 xfailed`; the expected xfail is the existing unsupported standalone-MTP placement.

Focused CPU checks
- [...] `4 passed`.
- [...] `12 passed`.
- `compileall`, targeted Ruff, and `git diff --check` pass.
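For readers interpreting the logits-parity figures in the validation list, the following sketch shows one plausible way such statistics could be computed. The definitions are assumptions: "global" is read as the cosine over all logits flattened into one vector, "token mean" and "token minimum" as statistics over per-token cosines, "p01" as the 1st percentile of those per-token cosines, and `max_abs` as the largest elementwise absolute difference. The helper names are hypothetical, and the sketch uses plain Python lists rather than the tensors the actual tests operate on.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumes non-zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def percentile(values, q):
    """Linearly interpolated percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def logits_parity(ref, test):
    """Compare two [tokens][vocab] logit tables row by row and as a whole."""
    flat_ref = [v for row in ref for v in row]
    flat_test = [v for row in test for v in row]
    per_token = [cosine(r, t) for r, t in zip(ref, test)]
    return {
        "global": cosine(flat_ref, flat_test),
        "token_mean": sum(per_token) / len(per_token),
        "token_min": min(per_token),
        "p01": percentile(per_token, 1),
        "max_abs": max(abs(x - y) for x, y in zip(flat_ref, flat_test)),
        "finite": all(math.isfinite(v) for v in flat_ref + flat_test),
    }


# Toy usage: two tokens, two vocabulary entries each.
report = logits_parity([[1.0, 0.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 2.0]])
```

Reporting the minimum and a low percentile alongside the mean is useful because a single badly mismatched token can hide behind a high global cosine.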