Fix DeepSeek-V3 checkpoint export compatibility by lihongyang1990 · Pull Request #1202 · flagos-ai/FlagScale

lihongyang1990 · 2026-05-20T02:15:57Z

Summary

Add an explicit --skip-mtp conversion option so DeepSeek-V3/Moonlight checkpoints can be exported to HF implementations that only contain the main LM layers.
Keep loader and saver MTP handling consistent when MTP is skipped, avoiding unconsumed queue messages and missing HF MTP layer lookups.
Align DeepSeek-V3 HF config export with current Megatron argument names and true vocabulary size handling.
Guard Engram-only checkpoint fields for non-Engram checkpoints and remove unsupported legacy Megatron conversion arguments.

Validation

PYTHONPYCACHEPREFIX=/private/tmp/flagscale_pycache python3 -m py_compile tools/checkpoint/convert.py tools/checkpoint/utils.py tools/checkpoint/loader_mcore.py tools/checkpoint/loader_transformers.py tools/checkpoint/saver_mcore.py tools/checkpoint/saver_transformers.py tools/checkpoint/deepseek_v3/args.py
git diff --check

CLAassistant · 2026-05-20T02:16:03Z

All committers have signed the CLA.

lxd-cumt

LGTM

LiJunscs · 2026-05-20T12:15:12Z

The checkpoint convert method is legacy. It does not handle the tensor_model_parallel_size and expert_tensor_parallel_size. Current codes load tp * ep parts of checkpoint. But mcore only max(tp, ep * etp) parts of checkpoint when ckpt_format is torch. See NVIDIA/Megatron-LM#4200.

lihongyang1990 · 2026-05-21T03:22:19Z

The checkpoint convert method is legacy. It does not handle the tensor_model_parallel_size and expert_tensor_parallel_size. Current codes load tp * ep parts of checkpoint. But mcore only max(tp, ep * etp) parts of checkpoint when ckpt_format is torch. See NVIDIA/Megatron-LM#4200.

Thanks for pointing this out.

I have updated the converter to handle the MCore legacy torch checkpoint shard layout. The mcore loader/saver no longer assumes tp * ep
checkpoint parts; it now uses max(tp, ep * etp) for ckpt_format=torch, and the rank mapping distinguishes tensor parallel rank, expert
parallel rank, and expert tensor parallel rank.

The DeepSeek-V3/Mixtral checkpoint plugins were also updated accordingly:

dense/attention/shared-expert tensors are gathered/split by tensor_model_parallel_size
routed expert tensors are gathered/split by expert_tensor_parallel_size
expert_model_parallel_size is used only for expert ownership

The reason I did not catch this initially is that my test config had:

tp = 1
ep = 4
etp = 1

So both formulas produced the same number of checkpoint parts:

tp * ep = 4
max(tp, ep * etp) = max(1, 4 * 1) = 4

Because of that, the old logic worked for my local case but would fail for layouts such as tp=4, ep=2, etp=1, where:

tp * ep = 8
max(tp, ep * etp) = 4

The PR has been updated with this fix.

LiJunscs · 2026-05-21T03:39:34Z

The checkpoint convert method is legacy. It does not handle the tensor_model_parallel_size and expert_tensor_parallel_size. Current codes load tp * ep parts of checkpoint. But mcore only max(tp, ep * etp) parts of checkpoint when ckpt_format is torch. See NVIDIA/Megatron-LM#4200.

Thanks for pointing this out.

I have updated the converter to handle the MCore legacy torch checkpoint shard layout. The mcore loader/saver no longer assumes tp * ep checkpoint parts; it now uses max(tp, ep * etp) for ckpt_format=torch, and the rank mapping distinguishes tensor parallel rank, expert parallel rank, and expert tensor parallel rank.

The DeepSeek-V3/Mixtral checkpoint plugins were also updated accordingly:

dense/attention/shared-expert tensors are gathered/split by tensor_model_parallel_size

routed expert tensors are gathered/split by expert_tensor_parallel_size

expert_model_parallel_size is used only for expert ownership

The reason I did not catch this initially is that my test config had:
tp = 1
ep = 4
etp = 1

So both formulas produced the same number of checkpoint parts:

tp * ep = 4
max(tp, ep * etp) = max(1, 4 * 1) = 4

Because of that, the old logic worked for my local case but would fail for layouts such as tp=4, ep=2, etp=1, where:

tp * ep = 8
max(tp, ep * etp) = 4

The PR has been updated with this fix.

LiJunscs · 2026-05-21T03:54:49Z

Sorry, I mistakenly closed this issue.
And I create a new issue about saving the ckpt when tp > ep * etp.
#1204

lihongyang1990 · 2026-05-21T06:06:45Z

Sorry, I mistakenly closed this issue. And I create a new issue about saving the ckpt when tp > ep * etp. #1204

Thanks for the clarification.

I have removed the broader TP/EP/ETP checkpoint converter changes from this PR and kept this PR focused on the DeepSeek-V3/Moonlight HF export
compatibility fixes.

This PR now only keeps the conversion-side fixes needed for the DeepSeek-V3/Moonlight HF export path:

add --skip-mtp so HF export can skip MTP modules when the target HF implementation only contains the main LM layers
guard Engram-specific checkpoint fields so non-Engram checkpoints do not process Engram paths
align DeepSeek-V3 HF config export with the current Megatron argument name for RMSNorm epsilon
keep HF config vocab size consistent with --true-vocab-size
remove unsupported legacy Megatron conversion args / fused kernel loading for the current Megatron version
use the current vocab_size_with_padding helper name

The generic TP/EP/ETP checkpoint saving issue is now left to #1204, since that is a training checkpoint save-path fix and should be handled
separately from this HF export compatibility PR.

LiJunscs · 2026-05-21T06:26:03Z

Sorry, I mistakenly closed this issue. And I create a new issue about saving the ckpt when tp > ep * etp. #1204

Thanks for the clarification.

I have removed the broader TP/EP/ETP checkpoint converter changes from this PR and kept this PR focused on the DeepSeek-V3/Moonlight HF export compatibility fixes.

This PR now only keeps the conversion-side fixes needed for the DeepSeek-V3/Moonlight HF export path:

add --skip-mtp so HF export can skip MTP modules when the target HF implementation only contains the main LM layers

guard Engram-specific checkpoint fields so non-Engram checkpoints do not process Engram paths

align DeepSeek-V3 HF config export with the current Megatron argument name for RMSNorm epsilon

keep HF config vocab size consistent with --true-vocab-size

remove unsupported legacy Megatron conversion args / fused kernel loading for the current Megatron version

use the current vocab_size_with_padding helper name

The generic TP/EP/ETP checkpoint saving issue is now left to #1204, since that is a training checkpoint save-path fix and should be handled separately from this HF export compatibility PR.

I think you misunderstand my meaning. The #1204 is only fix the bug of saving checkpoint, not the converting chechkpoint between mcore and hf. You previous commits are necessary. The main branch of now saves checkpoint only when edp_rank = 0, it does not save ckpt when dp = 0 and edp != 0(tp > ep * etp), some ckpts are missing.
My pr makes ckpts saved correctly when tp > ep * etp, your pr makes ckpt convert correctly.
Our prs work together to fix the whole bugs. I am sorry for the trouble I've caused you.

lihongyang1990 · 2026-05-22T06:21:11Z

Sorry, I mistakenly closed this issue. And I create a new issue about saving the ckpt when tp > ep * etp. #1204

Thanks for the clarification.
I have removed the broader TP/EP/ETP checkpoint converter changes from this PR and kept this PR focused on the DeepSeek-V3/Moonlight HF export compatibility fixes.
This PR now only keeps the conversion-side fixes needed for the DeepSeek-V3/Moonlight HF export path:

add --skip-mtp so HF export can skip MTP modules when the target HF implementation only contains the main LM layers

guard Engram-specific checkpoint fields so non-Engram checkpoints do not process Engram paths

align DeepSeek-V3 HF config export with the current Megatron argument name for RMSNorm epsilon

keep HF config vocab size consistent with --true-vocab-size

remove unsupported legacy Megatron conversion args / fused kernel loading for the current Megatron version

use the current vocab_size_with_padding helper name

The generic TP/EP/ETP checkpoint saving issue is now left to #1204, since that is a training checkpoint save-path fix and should be handled separately from this HF export compatibility PR.

I think you misunderstand my meaning. The #1204 is only fix the bug of saving checkpoint, not the converting chechkpoint between mcore and hf. You previous commits are necessary. The main branch of now saves checkpoint only when edp_rank = 0, it does not save ckpt when dp = 0 and edp != 0(tp > ep * etp), some ckpts are missing. My pr makes ckpts saved correctly when tp > ep * etp, your pr makes ckpt convert correctly. Our prs work together to fix the whole bugs. I am sorry for the trouble I've caused you.

Thanks for the clarification, no worries.

I understand now: #1204 fixes checkpoint saving, while this PR still needs the MCore <-> HF TP/EP/ETP conversion fixes. I have restored those
converter changes and updated the PR.

lxd-cumt previously approved these changes May 20, 2026

View reviewed changes

lihongyang1990 dismissed lxd-cumt’s stale review via 8419380 May 21, 2026 02:22

lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch 2 times, most recently from 8419380 to 48fc57a Compare May 21, 2026 03:15

lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch from 48fc57a to 2606aa6 Compare May 21, 2026 03:28

LiJunscs closed this May 21, 2026

LiJunscs reopened this May 21, 2026

lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch from 2606aa6 to 4f6b19f Compare May 21, 2026 06:04

Fix DeepSeek-V3 checkpoint export compatibility

59c2381

lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch from 4f6b19f to 59c2381 Compare May 21, 2026 06:14

Fix TP EP ETP checkpoint conversion

64f0726

lihongyang1990 force-pushed the checkpoint-deepseek-v3-hf-export branch from 04b5466 to 64f0726 Compare May 22, 2026 03:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix DeepSeek-V3 checkpoint export compatibility#1202

Fix DeepSeek-V3 checkpoint export compatibility#1202
lihongyang1990 wants to merge 2 commits into
flagos-ai:mainfrom
lihongyang1990:checkpoint-deepseek-v3-hf-export

lihongyang1990 commented May 20, 2026

Uh oh!

CLAassistant commented May 20, 2026 •

edited

Loading

Uh oh!

lxd-cumt left a comment

Uh oh!

LiJunscs commented May 20, 2026

Uh oh!

lihongyang1990 commented May 21, 2026

Uh oh!

LiJunscs commented May 21, 2026

Uh oh!

LiJunscs commented May 21, 2026

Uh oh!

lihongyang1990 commented May 21, 2026

Uh oh!

LiJunscs commented May 21, 2026 •

edited

Loading

Uh oh!

lihongyang1990 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lihongyang1990 commented May 20, 2026

Summary

Validation

Uh oh!

CLAassistant commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lxd-cumt left a comment

Choose a reason for hiding this comment

Uh oh!

LiJunscs commented May 20, 2026

Uh oh!

lihongyang1990 commented May 21, 2026

Uh oh!

LiJunscs commented May 21, 2026

Uh oh!

LiJunscs commented May 21, 2026

Uh oh!

lihongyang1990 commented May 21, 2026

Uh oh!

LiJunscs commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lihongyang1990 commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

CLAassistant commented May 20, 2026 •

edited

Loading

LiJunscs commented May 21, 2026 •

edited

Loading