Skip to content

Fix DeepSeek-V3 checkpoint export compatibility#1202

Open
lihongyang1990 wants to merge 2 commits into
flagos-ai:mainfrom
lihongyang1990:checkpoint-deepseek-v3-hf-export
Open

Fix DeepSeek-V3 checkpoint export compatibility#1202
lihongyang1990 wants to merge 2 commits into
flagos-ai:mainfrom
lihongyang1990:checkpoint-deepseek-v3-hf-export

Conversation

@lihongyang1990
Copy link
Copy Markdown
Contributor

Summary

  • Add an explicit --skip-mtp conversion option so DeepSeek-V3/Moonlight checkpoints can be exported to HF implementations that only contain the main LM layers.
  • Keep loader and saver MTP handling consistent when MTP is skipped, avoiding unconsumed queue messages and missing HF MTP layer lookups.
  • Align DeepSeek-V3 HF config export with current Megatron argument names and true vocabulary size handling.
  • Guard Engram-only checkpoint fields for non-Engram checkpoints and remove unsupported legacy Megatron conversion arguments.

Validation

  • PYTHONPYCACHEPREFIX=/private/tmp/flagscale_pycache python3 -m py_compile tools/checkpoint/convert.py tools/checkpoint/utils.py tools/checkpoint/loader_mcore.py tools/checkpoint/loader_transformers.py tools/checkpoint/saver_mcore.py tools/checkpoint/saver_transformers.py tools/checkpoint/deepseek_v3/args.py
  • git diff --check

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 20, 2026

CLA assistant check
All committers have signed the CLA.

lxd-cumt
lxd-cumt previously approved these changes May 20, 2026
Copy link
Copy Markdown
Collaborator

@lxd-cumt lxd-cumt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@LiJunscs
Copy link
Copy Markdown
Collaborator

The checkpoint convert method is legacy. It does not handle the tensor_model_parallel_size and expert_tensor_parallel_size. Current codes load tp * ep parts of checkpoint. But mcore only max(tp, ep * etp) parts of checkpoint when ckpt_format is torch. See NVIDIA/Megatron-LM#4200.

@lihongyang1990 lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch 2 times, most recently from 8419380 to 48fc57a Compare May 21, 2026 03:15
@lihongyang1990
Copy link
Copy Markdown
Contributor Author

The checkpoint convert method is legacy. It does not handle the tensor_model_parallel_size and expert_tensor_parallel_size. Current codes load tp * ep parts of checkpoint. But mcore only max(tp, ep * etp) parts of checkpoint when ckpt_format is torch. See NVIDIA/Megatron-LM#4200.

Thanks for pointing this out.

I have updated the converter to handle the MCore legacy torch checkpoint shard layout. The mcore loader/saver no longer assumes tp * ep
checkpoint parts; it now uses max(tp, ep * etp) for ckpt_format=torch, and the rank mapping distinguishes tensor parallel rank, expert
parallel rank, and expert tensor parallel rank.

The DeepSeek-V3/Mixtral checkpoint plugins were also updated accordingly:

  • dense/attention/shared-expert tensors are gathered/split by tensor_model_parallel_size
  • routed expert tensors are gathered/split by expert_tensor_parallel_size
  • expert_model_parallel_size is used only for expert ownership

The reason I did not catch this initially is that my test config had:

tp = 1
ep = 4
etp = 1

So both formulas produced the same number of checkpoint parts:

tp * ep = 4
max(tp, ep * etp) = max(1, 4 * 1) = 4

Because of that, the old logic worked for my local case but would fail for layouts such as tp=4, ep=2, etp=1, where:

tp * ep = 8
max(tp, ep * etp) = 4

The PR has been updated with this fix.

@lihongyang1990 lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch from 48fc57a to 2606aa6 Compare May 21, 2026 03:28
@LiJunscs
Copy link
Copy Markdown
Collaborator

The checkpoint convert method is legacy. It does not handle the tensor_model_parallel_size and expert_tensor_parallel_size. Current codes load tp * ep parts of checkpoint. But mcore only max(tp, ep * etp) parts of checkpoint when ckpt_format is torch. See NVIDIA/Megatron-LM#4200.

Thanks for pointing this out.

I have updated the converter to handle the MCore legacy torch checkpoint shard layout. The mcore loader/saver no longer assumes tp * ep checkpoint parts; it now uses max(tp, ep * etp) for ckpt_format=torch, and the rank mapping distinguishes tensor parallel rank, expert parallel rank, and expert tensor parallel rank.

The DeepSeek-V3/Mixtral checkpoint plugins were also updated accordingly:

  • dense/attention/shared-expert tensors are gathered/split by tensor_model_parallel_size
  • routed expert tensors are gathered/split by expert_tensor_parallel_size
  • expert_model_parallel_size is used only for expert ownership

The reason I did not catch this initially is that my test config had:

tp = 1
ep = 4
etp = 1

So both formulas produced the same number of checkpoint parts:

tp * ep = 4
max(tp, ep * etp) = max(1, 4 * 1) = 4

Because of that, the old logic worked for my local case but would fail for layouts such as tp=4, ep=2, etp=1, where:

tp * ep = 8
max(tp, ep * etp) = 4

The PR has been updated with this fix.

@LiJunscs LiJunscs closed this May 21, 2026
@LiJunscs LiJunscs reopened this May 21, 2026
@LiJunscs
Copy link
Copy Markdown
Collaborator

Sorry, I mistakenly closed this issue.
And I create a new issue about saving the ckpt when tp > ep * etp.
#1204

@lihongyang1990 lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch from 2606aa6 to 4f6b19f Compare May 21, 2026 06:04
@lihongyang1990
Copy link
Copy Markdown
Contributor Author

Sorry, I mistakenly closed this issue. And I create a new issue about saving the ckpt when tp > ep * etp. #1204

Thanks for the clarification.

I have removed the broader TP/EP/ETP checkpoint converter changes from this PR and kept this PR focused on the DeepSeek-V3/Moonlight HF export
compatibility fixes.

This PR now only keeps the conversion-side fixes needed for the DeepSeek-V3/Moonlight HF export path:

  • add --skip-mtp so HF export can skip MTP modules when the target HF implementation only contains the main LM layers
  • guard Engram-specific checkpoint fields so non-Engram checkpoints do not process Engram paths
  • align DeepSeek-V3 HF config export with the current Megatron argument name for RMSNorm epsilon
  • keep HF config vocab size consistent with --true-vocab-size
  • remove unsupported legacy Megatron conversion args / fused kernel loading for the current Megatron version
  • use the current vocab_size_with_padding helper name

The generic TP/EP/ETP checkpoint saving issue is now left to #1204, since that is a training checkpoint save-path fix and should be handled
separately from this HF export compatibility PR.

@lihongyang1990 lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch from 4f6b19f to 59c2381 Compare May 21, 2026 06:14
@LiJunscs
Copy link
Copy Markdown
Collaborator

LiJunscs commented May 21, 2026

Sorry, I mistakenly closed this issue. And I create a new issue about saving the ckpt when tp > ep * etp. #1204

Thanks for the clarification.

I have removed the broader TP/EP/ETP checkpoint converter changes from this PR and kept this PR focused on the DeepSeek-V3/Moonlight HF export compatibility fixes.

This PR now only keeps the conversion-side fixes needed for the DeepSeek-V3/Moonlight HF export path:

  • add --skip-mtp so HF export can skip MTP modules when the target HF implementation only contains the main LM layers
  • guard Engram-specific checkpoint fields so non-Engram checkpoints do not process Engram paths
  • align DeepSeek-V3 HF config export with the current Megatron argument name for RMSNorm epsilon
  • keep HF config vocab size consistent with --true-vocab-size
  • remove unsupported legacy Megatron conversion args / fused kernel loading for the current Megatron version
  • use the current vocab_size_with_padding helper name

The generic TP/EP/ETP checkpoint saving issue is now left to #1204, since that is a training checkpoint save-path fix and should be handled separately from this HF export compatibility PR.

I think you misunderstand my meaning. The #1204 is only fix the bug of saving checkpoint, not the converting chechkpoint between mcore and hf. You previous commits are necessary. The main branch of now saves checkpoint only when edp_rank = 0, it does not save ckpt when dp = 0 and edp != 0(tp > ep * etp), some ckpts are missing.
My pr makes ckpts saved correctly when tp > ep * etp, your pr makes ckpt convert correctly.
Our prs work together to fix the whole bugs. I am sorry for the trouble I've caused you.

@lihongyang1990 lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch from 04b5466 to 64f0726 Compare May 22, 2026 03:34
@lihongyang1990
Copy link
Copy Markdown
Contributor Author

Sorry, I mistakenly closed this issue. And I create a new issue about saving the ckpt when tp > ep * etp. #1204

Thanks for the clarification.
I have removed the broader TP/EP/ETP checkpoint converter changes from this PR and kept this PR focused on the DeepSeek-V3/Moonlight HF export compatibility fixes.
This PR now only keeps the conversion-side fixes needed for the DeepSeek-V3/Moonlight HF export path:

  • add --skip-mtp so HF export can skip MTP modules when the target HF implementation only contains the main LM layers
  • guard Engram-specific checkpoint fields so non-Engram checkpoints do not process Engram paths
  • align DeepSeek-V3 HF config export with the current Megatron argument name for RMSNorm epsilon
  • keep HF config vocab size consistent with --true-vocab-size
  • remove unsupported legacy Megatron conversion args / fused kernel loading for the current Megatron version
  • use the current vocab_size_with_padding helper name

The generic TP/EP/ETP checkpoint saving issue is now left to #1204, since that is a training checkpoint save-path fix and should be handled separately from this HF export compatibility PR.

I think you misunderstand my meaning. The #1204 is only fix the bug of saving checkpoint, not the converting chechkpoint between mcore and hf. You previous commits are necessary. The main branch of now saves checkpoint only when edp_rank = 0, it does not save ckpt when dp = 0 and edp != 0(tp > ep * etp), some ckpts are missing. My pr makes ckpts saved correctly when tp > ep * etp, your pr makes ckpt convert correctly. Our prs work together to fix the whole bugs. I am sorry for the trouble I've caused you.

Thanks for the clarification, no worries.

I understand now: #1204 fixes checkpoint saving, while this PR still needs the MCore <-> HF TP/EP/ETP conversion fixes. I have restored those
converter changes and updated the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants