Fix DeepSeek-V3 checkpoint export compatibility #1202
The checkpoint conversion method is legacy: it does not handle tensor_model_parallel_size and expert_tensor_parallel_size. The current code loads tp * ep checkpoint parts, but mcore only saves max(tp, ep * etp) parts when ckpt_format is torch. See NVIDIA/Megatron-LM#4200.
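The mismatch described above can be illustrated with a small sketch. The function names are hypothetical and are not taken from the converter or from Megatron-LM; the two formulas are the ones quoted in the comment.

```python
# Hypothetical helpers (names are assumptions, not FlagScale or Megatron-LM APIs).

def parts_expected_by_converter(tp: int, ep: int) -> int:
    """Number of checkpoint parts the legacy converter tries to load."""
    return tp * ep


def parts_saved_by_mcore(tp: int, ep: int, etp: int) -> int:
    """Number of parts mcore saves for ckpt_format=torch, per the comment above."""
    return max(tp, ep * etp)


if __name__ == "__main__":
    # Compare both counts for a few parallel layouts.
    for tp, ep, etp in [(1, 4, 1), (2, 4, 1), (4, 2, 1)]:
        expected = parts_expected_by_converter(tp, ep)
        saved = parts_saved_by_mcore(tp, ep, etp)
        status = "ok" if expected == saved else "MISMATCH"
        print(f"tp={tp} ep={ep} etp={etp}: expects {expected}, saved {saved} ({status})")
```

With tp = 1 and etp = 1 the two counts coincide, so a test run with that layout would not expose the problem; any layout with tp > 1 and ep > 1 under etp = 1 does.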
8419380 to 48fc57a
Thanks for pointing this out. I have updated the converter to handle the MCore legacy torch checkpoint shard layout. The mcore loader/saver no longer assumes
The DeepSeek-V3/Mixtral checkpoint plugins were also updated accordingly:
The reason I did not catch this initially is that my test config had:
48fc57a to 2606aa6
Sorry, I mistakenly closed this issue.
2606aa6 to 4f6b19f
Thanks for the clarification. I have removed the broader TP/EP/ETP checkpoint converter changes from this PR and kept it focused on the DeepSeek-V3/Moonlight HF export. This PR now only keeps the conversion-side fixes needed for the DeepSeek-V3/Moonlight HF export path:
The generic TP/EP/ETP checkpoint saving issue is now left to #1204, since that is a training checkpoint save-path fix and should be handled
4f6b19f to 59c2381
I think you misunderstood me. #1204 only fixes the checkpoint-saving bug, not the checkpoint conversion between mcore and HF. Your previous commits are necessary. The main branch currently saves a checkpoint only when edp_rank = 0; it does not save one when dp = 0 and edp != 0 (which happens when tp > ep * etp), so some checkpoints are missing.
04b5466 to 64f0726
Thanks for the clarification, no worries. I understand now: #1204 fixes checkpoint saving, while this PR still needs the MCore <-> HF TP/EP/ETP conversion fixes. I have restored those
Summary
--skip-mtp conversion option so DeepSeek-V3/Moonlight checkpoints can be exported to HF implementations that only contain the main LM layers.
Validation
PYTHONPYCACHEPREFIX=/private/tmp/flagscale_pycache python3 -m py_compile tools/checkpoint/convert.py tools/checkpoint/utils.py tools/checkpoint/loader_mcore.py tools/checkpoint/loader_transformers.py tools/checkpoint/saver_mcore.py tools/checkpoint/saver_transformers.py tools/checkpoint/deepseek_v3/args.py
git diff --check