[codex] DSv4: derive config from HF + legacy-load + exporter DTensor materialize #65

Draft
Meirtz wants to merge 1 commit into
ISEEKYAN:main from
Meirtz:codex/mlite-dsv4-config-load

@Meirtz commented Jun 23, 2026

DRAFT — source-only split; needs a torch/GPU unit run before it is marked ready (I could only py_compile it offline).

Summary

The load/save-correctness half of the DSv4 load-train work, split out so it can land
without the parts that change forward numerics. Three files, all additive/correctness:

  • config.py: derive compress_ratios from layer_types/compress_rates and
    num_hash_layers from mlp_layer_types; parse nested rope_parameters (main/compress);
    add hc_eps/rms_norm_eps fields and an untied-embeddings guard. (Current code leaves
    compress_ratios=[] / num_hash_layers=3 defaulted, which is wrong for real
    DeepSeek-V4-Flash configs.)
  • checkpoint.py: accept legacy model-rooted HF key aliases (expert/attention/compressor)
    on load; fix the expert-name double-dot (mlp.experts..N -> mlp.experts.N).
  • ckpt/hf_weights.py: materialize DTensor params to local cpu-contiguous before
    safetensors save — avoids silent FSDP2-shard truncation on export.
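
The checkpoint.py change above can be illustrated with a minimal sketch of key normalization on load. It is not the PR's code: the alias table, the function names, and the assumption that "model-rooted" means a leading `model.` prefix are all hypothetical, chosen only to show the double-dot fix and a legacy-prefix rewrite together.

```python
import re

# Hypothetical alias table: legacy prefix -> current prefix.
# Assumes "model-rooted" keys start with "model."; the real mapping in
# checkpoint.py is not shown in this PR text.
LEGACY_PREFIXES = {
    "model.": "",
}

# Matches runs of two or more dots, e.g. "mlp.experts..3".
_REPEATED_DOTS = re.compile(r"\.{2,}")


def normalize_key(key: str) -> str:
    """Collapse repeated dots, then rewrite a legacy prefix if one matches."""
    key = _REPEATED_DOTS.sub(".", key)
    for old, new in LEGACY_PREFIXES.items():
        if key.startswith(old):
            return new + key[len(old):]
    return key


def normalize_state_dict(state: dict) -> dict:
    """Normalize every key, refusing to silently merge two source keys."""
    out = {}
    for key, value in state.items():
        new_key = normalize_key(key)
        if new_key in out:
            raise KeyError(f"duplicate key after normalization: {new_key}")
        out[new_key] = value
    return out


# "model.layers.0.mlp.experts..3.w1.weight" -> "layers.0.mlp.experts.3.w1.weight"
print(normalize_key("model.layers.0.mlp.experts..3.w1.weight"))
```

Raising on a collision, rather than letting the later key win, keeps a checkpoint that contains both the legacy and the current spelling from loading with an arbitrary choice between them.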

Intentionally excluded (ship separately)

  • the mHC numeric change (parity-gated)
  • the dispatcher hash-routing change (golden-changing)

Validation status

  • py_compile clean; verified no references to the excluded mHC/dispatcher code.
  • Pending before un-draft: carve the config/checkpoint/exporter unit tests + one cluster
    unit run (toy DSv4 load/export round-trip).
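
For the pending export round-trip test, the materialization step described for ckpt/hf_weights.py could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's implementation: the function names are hypothetical, and the PR text does not say which gather call is used. The sketch duck-types on `full_tensor()`, the method that `torch.distributed.tensor.DTensor` provides for gathering all shards, so that it can be read and exercised without importing torch.

```python
def materialize_for_save(param):
    """Return a full, CPU-resident, contiguous tensor suitable for serialization.

    Hypothetical helper. A DTensor holds only this rank's shard locally, so
    saving it without gathering would write a truncated tensor.
    """
    # DTensor exposes full_tensor(); plain torch.Tensor does not.
    # full_tensor() is a collective, so every rank must reach this call.
    if hasattr(param, "full_tensor"):
        param = param.full_tensor()
    # detach() drops autograd history, cpu() moves off the device, and
    # contiguous() gives the dense layout that serializers expect.
    return param.detach().cpu().contiguous()


def materialize_state_dict(state: dict) -> dict:
    """Apply materialize_for_save to every entry, preserving key order."""
    return {name: materialize_for_save(p) for name, p in state.items()}
```

A round-trip test would then save the result of `materialize_state_dict`, reload it, and compare shapes against the unsharded model, since a shape mismatch is the visible symptom of shard truncation.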

…aterialize

Safe load/save-correctness split from the DSv4 load/train work. Excludes the mHC
numeric change (parity-gated) and the dispatcher hash-routing change (golden-changing),
which ship separately.

- config.py: derive compress_ratios from layer_types/compress_rates and num_hash_layers
  from mlp_layer_types; parse nested rope_parameters (main/compress); add hc_eps/
  rms_norm_eps fields and an untied-embeddings guard.
- checkpoint.py: accept legacy model-rooted HF expert/attention/compressor key aliases
  on load; fix the expert-name double-dot (mlp.experts..N -> mlp.experts.N).
- ckpt/hf_weights.py: materialize DTensor params to local cpu contiguous before
  safetensors save (avoids silent FSDP2-shard truncation on export).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
@ISEEKYAN (Owner)

why do we need so many legacy** stuff
