[fsdp] fix: Fix Qwen3 MoE FSDP weight sync for vLLM rollout in Transformers 5 #6863

Open
lxb007981 wants to merge 1 commit into verl-project:main from lxb007981:fix_moe_fsdp_weight_sync

lxb007981 commented Jun 27, 2026

What does this PR do?

Transformers 5 stores Qwen-style MoE expert weights as packed 3D mlp.experts.gate_up_proj and mlp.experts.down_proj tensors. During live FSDP-to-vLLM rollout weight sync, those packed keys were sent directly, but vLLM's Qwen3 MoE reload path expects the original per-expert checkpoint keys.

Expand packed MoE expert tensors during FSDP parameter streaming so vLLM receives per-expert gate_proj, up_proj, and down_proj weights. Dense models and non-packed tensors continue to pass through unchanged.
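
To make the expansion concrete, here is a minimal sketch that runs it on toy tensors. The function body is reproduced from this PR's diff; the `expand_stream` wrapper, the parameter names, and the toy sizes are assumptions for illustration only, and the way verl wires the generator into FSDP parameter streaming is not shown.

```python
from collections.abc import Iterable, Iterator

import torch


def iter_vllm_compatible_moe_params(name: str, tensor: torch.Tensor) -> Iterator[tuple[str, torch.Tensor]]:
    """Expansion logic reproduced from this PR's diff (docstring shortened)."""
    if name.endswith(".mlp.experts.gate_up_proj") and tensor.dim() == 3:
        gate, up = tensor.chunk(2, dim=1)  # split the fused 2 * intermediate_size axis
        base = name.removesuffix(".gate_up_proj")
        for expert_id in range(tensor.size(0)):
            yield f"{base}.{expert_id}.gate_proj.weight", gate[expert_id].contiguous()
            yield f"{base}.{expert_id}.up_proj.weight", up[expert_id].contiguous()
        return

    if name.endswith(".mlp.experts.down_proj") and tensor.dim() == 3:
        base = name.removesuffix(".down_proj")
        for expert_id in range(tensor.size(0)):
            yield f"{base}.{expert_id}.down_proj.weight", tensor[expert_id].contiguous()
        return

    yield name, tensor  # dense and non-packed tensors pass through unchanged


def expand_stream(params: Iterable[tuple[str, torch.Tensor]]) -> Iterator[tuple[str, torch.Tensor]]:
    """Hypothetical wrapper: expand a (name, tensor) stream lazily, one tensor at a time."""
    for name, tensor in params:
        yield from iter_vllm_compatible_moe_params(name, tensor)


# Toy sizes, chosen only to keep the example small.
num_experts, intermediate_size, hidden_size = 2, 3, 4
prefix = "model.layers.0.mlp.experts"  # illustrative parameter prefix
params = [
    (f"{prefix}.gate_up_proj", torch.randn(num_experts, 2 * intermediate_size, hidden_size)),
    (f"{prefix}.down_proj", torch.randn(num_experts, hidden_size, intermediate_size)),
    ("model.layers.0.self_attn.q_proj.weight", torch.randn(hidden_size, hidden_size)),
]

expanded = dict(expand_stream(params))
for key, value in expanded.items():
    print(key, tuple(value.shape))

# Re-packing expert 0 should reproduce the original fused slice.
repacked_0 = torch.cat(
    [expanded[f"{prefix}.0.gate_proj.weight"], expanded[f"{prefix}.0.up_proj.weight"]], dim=0
)
```

With two experts the three input tensors become seven entries: a `gate_proj`, `up_proj`, and `down_proj` weight per expert, plus the untouched attention weight.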

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: https://github.com/verl-project/verl/pulls?q=is%3Apr+is%3Aopen+moe+transformers
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


CLAassistant commented Jun 27, 2026

CLA assistant check
All committers have signed the CLA.

gemini-code-assist Bot left a comment

Code Review

This pull request introduces iter_vllm_compatible_moe_params to expand Transformers 5 packed MoE expert tensors into vLLM-compatible checkpoint keys during live weight sync, and integrates it into the FSDP transformer implementation. The reviewer suggested stripping the .weight suffix from parameter names before matching to improve robustness, and removing redundant .contiguous() calls on sliced tensors to avoid unnecessary overhead.

Comment thread verl/utils/model.py
Comment on lines +266 to +291
def iter_vllm_compatible_moe_params(name: str, tensor: torch.Tensor) -> Iterable[tuple[str, torch.Tensor]]:
    """Expand Transformers 5 packed MoE expert tensors to vLLM checkpoint keys.

    Transformers 5 stores Qwen-style MoE experts as packed 3D parameters:
    ``mlp.experts.gate_up_proj`` with shape
    ``[num_experts, 2 * intermediate_size, hidden_size]`` and
    ``mlp.experts.down_proj`` with shape
    ``[num_experts, hidden_size, intermediate_size]``. vLLM's Qwen MoE reload
    path still accepts the original per-expert checkpoint keys during live
    weight sync, so stream those keys without materializing a full dict.
    """
    if name.endswith(".mlp.experts.gate_up_proj") and tensor.dim() == 3:
        gate, up = tensor.chunk(2, dim=1)
        base = name.removesuffix(".gate_up_proj")
        for expert_id in range(tensor.size(0)):
            yield f"{base}.{expert_id}.gate_proj.weight", gate[expert_id].contiguous()
            yield f"{base}.{expert_id}.up_proj.weight", up[expert_id].contiguous()
        return

    if name.endswith(".mlp.experts.down_proj") and tensor.dim() == 3:
        base = name.removesuffix(".down_proj")
        for expert_id in range(tensor.size(0)):
            yield f"{base}.{expert_id}.down_proj.weight", tensor[expert_id].contiguous()
        return

    yield name, tensor

high

Robustness and Performance Improvements

  1. Robustness of Parameter Name Matching: The current implementation checks name.endswith(".mlp.experts.gate_up_proj"). Depending on how the model is loaded or wrapped, the parameter name in the state dict might have a .weight suffix (e.g., ...mlp.experts.gate_up_proj.weight). Stripping .weight first makes the matching much more robust.
  2. Redundant .contiguous() Calls: Since tensor is gathered from FSDP or is a model parameter, it is contiguous. Slicing it along the first dimension (e.g., gate[expert_id]) produces a slice that is also contiguous because the remaining dimensions have contiguous strides. Therefore, calling .contiguous() is redundant and can be omitted to avoid unnecessary overhead.

def iter_vllm_compatible_moe_params(name: str, tensor: torch.Tensor) -> Iterable[tuple[str, torch.Tensor]]:
    """Expand Transformers 5 packed MoE expert tensors to vLLM checkpoint keys.

    Transformers 5 stores Qwen-style MoE experts as packed 3D parameters:
    mlp.experts.gate_up_proj with shape
    [num_experts, 2 * intermediate_size, hidden_size] and
    mlp.experts.down_proj with shape
    [num_experts, hidden_size, intermediate_size]. vLLM's Qwen MoE reload
    path still accepts the original per-expert checkpoint keys during live
    weight sync, so stream those keys without materializing a full dict.
    """
    name_stripped = name.removesuffix(".weight")

    if name_stripped.endswith(".mlp.experts.gate_up_proj") and tensor.dim() == 3:
        gate, up = tensor.chunk(2, dim=1)
        base = name_stripped.removesuffix(".gate_up_proj")
        for expert_id in range(tensor.size(0)):
            yield f"{base}.{expert_id}.gate_proj.weight", gate[expert_id]
            yield f"{base}.{expert_id}.up_proj.weight", up[expert_id]
        return

    if name_stripped.endswith(".mlp.experts.down_proj") and tensor.dim() == 3:
        base = name_stripped.removesuffix(".down_proj")
        for expert_id in range(tensor.size(0)):
            yield f"{base}.{expert_id}.down_proj.weight", tensor[expert_id]
        return

    yield name, tensor
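
The contiguity argument in point 2 can be checked on toy tensors. The sketch below uses assumed sizes and a plain `torch.randn` tensor standing in for the gathered parameter; it does not verify that the tensor produced by FSDP is always contiguous, which is the premise the suggestion relies on.

```python
import torch

# Toy sizes, chosen only for illustration.
num_experts, intermediate_size, hidden_size = 4, 8, 12

# Stand-in for a contiguous packed gate_up_proj parameter.
packed = torch.randn(num_experts, 2 * intermediate_size, hidden_size)
gate, up = packed.chunk(2, dim=1)

# The chunked 3D views skip over the other half of each expert, so they are
# not contiguous as a whole...
chunk_is_contiguous = gate.is_contiguous()

# ...but each per-expert 2D slice covers one dense block of memory.
slices_are_contiguous = all(
    gate[e].is_contiguous() and up[e].is_contiguous() for e in range(num_experts)
)

# On an already-contiguous tensor, .contiguous() returns the same memory
# without copying, so the extra call costs little either way.
same_memory = gate[0].contiguous().data_ptr() == gate[0].data_ptr()

# Counter-case: if the packed tensor is itself a non-contiguous view
# (here a transposed tensor), the per-expert slices are not contiguous,
# and dropping .contiguous() would hand out strided views.
strided_packed = torch.randn(num_experts, hidden_size, 2 * intermediate_size).transpose(1, 2)
strided_gate, _ = strided_packed.chunk(2, dim=1)
strided_slice_is_contiguous = strided_gate[0].is_contiguous()

print(chunk_is_contiguous, slices_are_contiguous, same_memory, strided_slice_is_contiguous)
```

So removing `.contiguous()` is safe exactly when the incoming packed tensor is contiguous; keeping the call acts as a cheap guard for the case where it is not.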

lxb007981 changed the title from "[model, fsdp] fix: Fix Qwen3 MoE FSDP weight sync for vLLM rollout" to "[fsdp] fix: Fix Qwen3 MoE FSDP weight sync for vLLM rollout in Transformers 5" on Jun 27, 2026
lxb007981 marked this pull request as ready for review on June 27, 2026, 09:55