[codex] DSv4: derive config from HF + legacy-load + exporter DTensor materialize by Meirtz · Pull Request #65 · ISEEKYAN/Megatron-LM

Meirtz · 2026-06-23T06:42:38Z

DRAFT — source-only split; needs a torch/GPU unit run before it is marked ready (I could only py_compile it offline).

Summary

The load/save-correctness half of the DSv4 load-train work, split out so it can land
without the parts that change forward numerics. Three files, all additive/correctness:

- config.py: derive compress_ratios from layer_types/compress_rates and
  num_hash_layers from mlp_layer_types; parse nested rope_parameters (main/compress);
  add hc_eps/rms_norm_eps fields and an untied-embeddings guard. (Current code leaves
  compress_ratios=[] / num_hash_layers=3 defaulted, which is wrong for real
  DeepSeek-V4-Flash configs.)
- checkpoint.py: accept legacy model-rooted HF key aliases (expert/attention/compressor)
  on load; fix the expert-name double-dot (mlp.experts..N → mlp.experts.N).
- ckpt/hf_weights.py: materialize DTensor params to local cpu-contiguous before
  safetensors save (avoids silent FSDP2-shard truncation on export).
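To make the config.py item concrete, here is a minimal sketch of that kind of derivation. It is not the code in this PR. The value shapes are assumptions: layer_types as a per-layer list of strings, compress_rates as a mapping from layer type to ratio, a fallback ratio of 1 for unmapped types, and a "hash" label in mlp_layer_types. The type names in toy_cfg and all helper names are hypothetical; only the field names come from the description above.

```python
from typing import Any, Dict, List, Mapping, Tuple

# Hypothetical label for a hash-routed MLP layer (assumption, not from the PR).
HASH_MLP_TYPE = "hash"


def derive_compress_ratios(hf_cfg: Mapping[str, Any]) -> List[int]:
    """One ratio per layer, looked up by layer type (assumed shapes)."""
    layer_types = hf_cfg.get("layer_types") or []
    rates = hf_cfg.get("compress_rates") or {}
    # Assumption: a type with no entry in compress_rates is uncompressed (1).
    return [int(rates.get(kind, 1)) for kind in layer_types]


def derive_num_hash_layers(hf_cfg: Mapping[str, Any]) -> int:
    """Count the layers whose MLP type carries the assumed hash label."""
    mlp_types = hf_cfg.get("mlp_layer_types") or []
    return sum(1 for kind in mlp_types if kind == HASH_MLP_TYPE)


def split_rope_parameters(
    hf_cfg: Mapping[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (main, compress) rope settings; a flat dict is used for both."""
    rope = dict(hf_cfg.get("rope_parameters") or {})
    if "main" in rope or "compress" in rope:
        return dict(rope.get("main") or {}), dict(rope.get("compress") or {})
    return rope, dict(rope)


def check_untied_embeddings(hf_cfg: Mapping[str, Any]) -> None:
    """Refuse configs that tie input and output embeddings."""
    if hf_cfg.get("tie_word_embeddings", False):
        raise ValueError("tied embeddings are not supported by this loader")


# Toy config with made-up layer-type names and values.
toy_cfg = {
    "layer_types": ["full", "light", "heavy", "light"],
    "compress_rates": {"light": 4, "heavy": 128},
    "mlp_layer_types": ["hash", "hash", "moe", "moe"],
    "rope_parameters": {
        "main": {"rope_theta": 10000.0},
        "compress": {"rope_theta": 160000.0},
    },
    "tie_word_embeddings": False,
}

check_untied_embeddings(toy_cfg)
print(derive_compress_ratios(toy_cfg))   # [1, 4, 128, 4]
print(derive_num_hash_layers(toy_cfg))   # 2
```

Deriving both values from the per-layer lists means a config with no layer_types yields an empty list rather than a silently wrong hard-coded default, which is the failure mode the description calls out.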

Intentionally excluded (ship separately)

- mHC numeric change (hca.py/mhc.py + model call-sites): changes DSv4 forward numerics
  vs the current golden; needs an HF-parity gate first (per the basic.constitution no-reference
  rule, same bar as #56 "fix(deepseek_v4): scale MTP aux-loss gradient via pre_forward_hook"
  and #59 "ds4 (deepseek_v4): DSA indexer aux-loss scale hook + lift forced CP=1
  (MTP/dense-CSA support CP>1)"). Separate PR.
- dispatcher scatter_add_ hash-routing parity: correctness toward HF but changes hash-layer
  golden; separate PR + golden refresh.
- The combined static test file: to be carved to cover only config/checkpoint/exporter.

Validation status

py_compile clean; verified no references to the excluded mHC/dispatcher code.
Pending before un-draft: carve the config/checkpoint/exporter unit tests + one cluster
unit run (toy DSv4 load/export round-trip).
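As a rough idea of what a carved checkpoint test might assert, the sketch below covers the two checkpoint.py behaviours named in the summary: expert names built with a single dot, and legacy model-rooted aliases accepted on load. It assumes "model-rooted" means a leading "model." prefix and that canonical keys look like layers.N.mlp.experts.M.<suffix>; both are guesses. The helper names and the w1.weight suffix are hypothetical and do not come from the PR.

```python
from typing import Iterable, Iterator

LEGACY_ROOT = "model."  # assumed legacy prefix; not confirmed by the PR


def expert_key(layer: int, expert: int, suffix: str) -> str:
    """Build an expert parameter name by joining parts with single dots."""
    parts = ["layers", str(layer), "mlp", "experts", str(expert), suffix]
    # Stripping stray dots from each part keeps "experts..N" from appearing.
    return ".".join(part.strip(".") for part in parts)


def candidate_keys(canonical: str) -> Iterator[str]:
    """Yield the canonical key first, then its assumed legacy alias."""
    yield canonical
    yield LEGACY_ROOT + canonical


def resolve_key(canonical: str, available: Iterable[str]) -> str:
    """Return whichever spelling of the key the checkpoint actually holds."""
    available = set(available)
    for key in candidate_keys(canonical):
        if key in available:
            return key
    raise KeyError(f"no checkpoint entry for {canonical!r}")


# Toy "checkpoint" that only contains the legacy, model-rooted spelling.
legacy_ckpt_keys = ["model.layers.3.mlp.experts.7.w1.weight"]

wanted = expert_key(3, 7, "w1.weight")
print(wanted)                                  # layers.3.mlp.experts.7.w1.weight
print(resolve_key(wanted, legacy_ckpt_keys))   # model.layers.3.mlp.experts.7.w1.weight
```

Trying the canonical key before the alias keeps current checkpoints on the fast path and limits the legacy handling to one lookup function.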

…aterialize

Safe load/save-correctness split from the DSv4 load/train work. Excludes the mHC
numeric change (parity-gated) and the dispatcher hash-routing change (golden-changing),
which ship separately.

- config.py: derive compress_ratios from layer_types/compress_rates and
  num_hash_layers from mlp_layer_types; parse nested rope_parameters (main/compress);
  add hc_eps/rms_norm_eps fields and an untied-embeddings guard.
- checkpoint.py: accept legacy model-rooted HF expert/attention/compressor key
  aliases on load; fix the expert-name double-dot (mlp.experts..N -> mlp.experts.N).
- ckpt/hf_weights.py: materialize DTensor params to local cpu contiguous before
  safetensors save (avoids silent FSDP2-shard truncation on export).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>

ISEEKYAN · 2026-06-25T04:55:20Z

Why do we need so much legacy** stuff?

Meirtz wants to merge 1 commit into ISEEKYAN:main from Meirtz:codex/mlite-dsv4-config-load