[codex] Fix DSv4 DCP checkpoint placements for DTensor-like params #61
Open
Meirtz wants to merge 2 commits into
Merge DTensor/FSDP2 parameter shard placement into the matching checkpoint mesh axis when model-parallel checkpoint placements also shard expert tensors. Keep all-replicate checkpoint placements on the original parameter mesh so dense and mHC tensors are not treated as model-parallel shards. Validation: OCI-HSG 2 nodes x 4 GB200 DSv4 tiny DCP continuity attempt oci-hsg-2node-8gpu-dsv4-dcp-mergefix-6, job 3488765, analyzer status=pass, all 8 ranks max_delta=0.0.
Summary
This PR fixes the MLite DCP fallback path for DTensor-like parameters whose
checkpoint placement differs from the wrapped parameter placement.
For DSv4 expert weights under EP, the checkpoint protocol supplies the expert
mesh and placements, but the previous helper rebuilt the DCP tensor from the
FSDP2 parameter mesh/placements. In the reproduced failure this let rank 0 load
rank 1's expert shard for
layers.3.mlp.experts.fc1.weight0.
Changes
real sharded checkpoint placements.
so empty local DTensor shards do not force full materialization.
placements when needed.
hc_head.* as replicated before the generic DSv4 head placement rule.
placements, multi-axis sharding, unsharded checkpoint behavior, and empty
local save behavior.
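The rank 0 / rank 1 mix-up described in the Summary can be pictured with a small pure-Python sketch of how a mesh and a list of placements decide which slice of the global tensor a rank owns. This is only an illustration, not the helper changed in this PR: `Placement` and `local_slice` are hypothetical names, the mesh and tensor shapes are made up, and the ceil-divide chunking applied left to right over mesh axes is an assumption about DTensor-style sharding.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Placement:
    """Hypothetical stand-in for one DTensor placement on one mesh axis."""
    shard_dim: Optional[int] = None  # None means replicated on this axis


def local_slice(
    global_shape: Sequence[int],
    mesh_shape: Sequence[int],
    coords: Sequence[int],
    placements: Sequence[Placement],
) -> Tuple[Tuple[int, ...], Tuple[int, ...]]:
    """Return (offsets, sizes) of this rank's shard inside the global tensor."""
    offsets = [0] * len(global_shape)
    sizes = list(global_shape)
    # Walk mesh axes left to right; each sharding axis splits the current shard.
    for axis_size, coord, placement in zip(mesh_shape, coords, placements):
        dim = placement.shard_dim
        if dim is None:
            continue  # replicated on this axis: the slice is unchanged
        # Ceil-divide chunking (assumed); trailing ranks may get an empty shard.
        chunk = -(-sizes[dim] // axis_size)
        start = min(coord * chunk, sizes[dim])
        stop = min(start + chunk, sizes[dim])
        offsets[dim] += start
        sizes[dim] = stop - start
    return tuple(offsets), tuple(sizes)


if __name__ == "__main__":
    ckpt_shape = (4, 8)  # made-up expert weight: 4 experts, 8 columns
    # Checkpoint view: the expert axis shards dim 0, so ranks own disjoint rows.
    for rank in (0, 1):
        print("checkpoint view", rank, local_slice(ckpt_shape, (2,), (rank,), [Placement(0)]))
    # A view that omits the expert axis: both ranks claim the same region.
    for rank in (0, 1):
        print("parameter view", rank, local_slice((2, 8), (2,), (rank,), [Placement(None)]))
```

In the checkpoint view the two ranks map to disjoint row ranges, while in the second view both ranks describe the same region, which is the kind of ambiguity that could let one rank read another rank's shard. The same function also shows why empty local shards occur: with 3 rows sharded over 4 ranks, the last rank owns a slice of size 0.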
Validation
Local checks:
cua Python/Torch environment for checkpoint-placement DTensor wrapping, multi-axis checkpoint
shard shape expansion, and unsharded checkpoint behavior that preserves the
parameter mesh.
cua environment was attempted, but the file was skipped before collection because local
megatron.core.dist_checkpointing imports require triton, which is not installed.
The no-pytest helper harness was rerun and passed for checkpoint
placements, multi-axis shape expansion, unsharded empty-local save behavior,
and empty local copy no-op.
GPU evidence from existing run artifacts:
ep2 SAVE_LOAD=1: analysis-save-load-ep2-b100-fix5.json reports overall=smoke_pass with all rank max deltas at 0.0.
ep4 SAVE_LOAD=1: analysis-save-load-ep4-b100-fix5.json reports overall=smoke_pass with all rank max deltas at 0.0.
ep2 SAVE_LOAD=1: job 1346071, analysis-save-load-ep2-h100-fix6.json reports overall=smoke_pass, and rank 0/rank 1 save_load.comparison.max_delta=0.0.
pp2 SAVE_LOAD=1: job 1346142 passed with finite train metrics and rank 0/rank 1 save_load.comparison.max_delta=0.0. Local evidence is recorded from the launcher terminal stream at runs/20260620-mlite-next-gates-pr-packaging/remote-results-dsv4_dcp_pp2_save_load-terminal/results/metrics.json.
ep2, MTP_ENABLE=0: job 1346239 passed save-load-continue versus uninterrupted training. loaded_step=1; rank 0 compared 112 local tensors and rank 1 compared 111 local tensors; both ranks reported comparison.max_delta=0.0, no missing keys, and no shape mismatches. Evidence: runs/20260621-mlite-dsv4-dcp-continuity/remote-results-attempt2/results/metrics.json.
ep2, MTP_ENABLE=0: job 1346254 passed the same save-load-continue gate. loaded_step=1; rank 0 compared 112 local tensors and rank 1 compared 111 local tensors; both ranks reported comparison.max_delta=0.0, no missing keys, and no shape mismatches. Evidence: runs/20260621-mlite-dsv4-dcp-continuity/remote-results-h100-attempt3/results/metrics.json.
Boundaries
This PR does not claim:
isolate DCP from a known MTP/mHC DTensor/Tensor blocker outside this PR,
The full pytest suite was not run locally. Targeted pytest collection now
starts in the cua environment, but the local environment is still missing the
triton dependency needed by megatron.core.dist_checkpointing.
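For readers interpreting the Validation numbers, the save-load-continue comparison can be sketched as follows. This is a minimal sketch under stated assumptions, not the analyzer used for the runs above: `compare_state` is a hypothetical name, tensors are modeled as flat lists of floats, and max_delta is assumed to mean the largest absolute elementwise difference between the loaded state and the reference state.

```python
from typing import Dict, List


def compare_state(
    loaded: Dict[str, List[float]],
    reference: Dict[str, List[float]],
) -> dict:
    """Compare two name -> flat-tensor mappings (hypothetical helper)."""
    missing_keys = sorted(set(reference) - set(loaded))
    shape_mismatches = []
    compared = 0
    max_delta = 0.0
    for name in sorted(set(reference) & set(loaded)):
        a, b = loaded[name], reference[name]
        if len(a) != len(b):
            # Different sizes cannot be compared elementwise; record and skip.
            shape_mismatches.append(name)
            continue
        compared += 1
        for x, y in zip(a, b):
            max_delta = max(max_delta, abs(x - y))
    return {
        "compared": compared,
        "max_delta": max_delta,
        "missing_keys": missing_keys,
        "shape_mismatches": shape_mismatches,
    }


if __name__ == "__main__":
    reference = {"w0": [1.0, 2.0], "w1": [3.0]}
    print(compare_state(dict(reference), reference))
```

Under this reading, a passing gate is one where max_delta is exactly 0.0 and both the missing-key and shape-mismatch lists are empty.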