Skip to content

[training] fix Qwen3-VL packed vlm_step MRoPE#4532

Open
cuichenx wants to merge 5 commits into
mainfrom
nora/vlm-mrope-20260625
Open

[training] fix Qwen3-VL packed vlm_step MRoPE#4532
cuichenx wants to merge 5 commits into
mainfrom
nora/vlm-mrope-20260625

Conversation

@cuichenx

@cuichenx cuichenx commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR makes the generic vlm_step path work for Qwen3.5-VL in-batch packed sequence training, so the specialized qwen3_vl_step path can eventually be deprecated without losing Qwen3-VL MRoPE behavior.

The original bug was in the packed generic VLM path: packed batches carried Qwen3-VL 3D MRoPE position metadata, but vlm_step/Qwen3VLModel did not preserve and consume that metadata the same way as qwen3_vl_step. That made packed vlm_step produce a different loss curve from the Qwen3-specific step.

This update also fixes the CP=4 variant. In packed CP, the model was mixing padded CP metadata with a differently compacted hidden-state/supervision stream. In non-packed CP, hidden states could be CP-local while MTP inputs, positions, labels, and loss masks remained full-sequence. Both cases now use consistent CP-local tensors.

Changes:

  • Preserve explicit 3D Qwen3-VL MRoPE position_ids through generic vlm_step packing.
  • Let Qwen3VLModel use explicit MRoPE metadata in the packed path instead of recomputing incompatible positions.
  • For packed CP, derive one TE THD partition index from padded cu_seqlens and apply it consistently to LM inputs, embeddings, vision/deepstack masks, MRoPE position IDs, labels, and loss masks.
  • For non-packed CP, slice full-sequence LM/MTP inputs, MRoPE position IDs, labels, and loss masks only when they are still full-length, avoiding double-slicing already-local tensors.
  • Avoid sending full visual payloads to non-first PP ranks; only the metadata needed by later ranks is retained.
  • Fix Qwen3-VL PP + MTP tied-embedding checkpoint state-dict generation by dropping the redundant output-layer checkpoint entry on non-preprocess MTP ranks.
  • Add focused unit coverage for the packed MRoPE metadata path, PP metadata-only batches, CP padded cu-seqlens handling, dense CP no-double-slice behavior, and tied MTP state dict handling.

Experiment Labels

  • A: pre-fix generic vlm_step with in-batch packing enabled. This is the broken path; it runs but has the wrong loss trajectory.
  • B: fixed generic vlm_step from this PR with the same in-batch packing setup as A.
  • C: existing specialized qwen3_vl_step baseline with packing deferred to the step function.

The expected result is that B matches C where C has a valid baseline, while A differs from C.

Validation

Local checks:

  • python -m py_compile on changed Python files
  • uv run --no-sync --with ruff ruff check src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_utils.py
  • uv run --no-sync --with pre-commit pre-commit run --files src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_utils.py
  • git diff --check
  • Attempted uv run pre-commit run --all-files, but the local host cannot resolve nvidia-resiliency-ext==0.6.0 because that package has no wheel for this host platform.
  • Focused pytest was not rerun locally because the local environment is not synced for this checkout.

H100 26.06 container validation:

  • Main packed comparison:
    • A, pre-fix vlm_step + packing: completed, but produced the wrong loss trajectory compared with C.
    • B, fixed vlm_step + packing: matches C.
    • C, qwen3_vl_step + packing: baseline.
  • TP=4, 20 iterations:
    • B final LM/MTP: 0.610717 / 0.865842
    • C final LM/MTP: 0.606260 / 0.864398
  • PP=4, 20 iterations:
    • B final LM/MTP: 0.605223 / 0.863726
    • C final LM/MTP: 0.604834 / 0.861366
  • CP=4 packed vlm_step:
    • Before the CP fix, B completed but had the wrong loss scale, final LM/MTP: 3.329275 / 4.187633.
    • After the CP fix, B completed 20 iterations with final LM/MTP: 0.606448 / 0.641229, with 0 skipped and 0 NaN iterations.
  • CP=4 non-packed vlm_step:
    • Before the CP fix, this path failed in MTP embedding concatenation because hidden states were CP-local while MTP inputs were still full-sequence.
    • After the CP fix, B completed 10 iterations with final LM/MTP: 0.834404 / 0.906320, with 0 skipped and 0 NaN iterations.
  • CP=4 qwen3_vl_step:
    • No valid C loss baseline from this launcher: the existing specialized step fails before the first logged iteration in CP batch splitting.

Signed-off-by: Chen Cui <chcui@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cuichenx added 2 commits June 26, 2026 14:46
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as ready for review June 26, 2026 22:32
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

LGTM — clean, well-tested fix for the packed vlm_step MRoPE path.

Observations (non-blocking):

  • The polymorphic return type from Qwen3VLModel.forward (output vs (output, loss_mask)) is correctly handled by the isinstance(model_output, tuple) check in vlm_step.py:379. The use_dist_train dict-return path at line 928-930 takes precedence, but this is safe because labels/loss_mask are None on non-last PP stages, so return_compacted_loss_mask stays False.
  • _prepare_packed_bshd_tensor types attention_mask as torch.Tensor (required), but the CP-index branch at line 202 never uses it. Not a runtime issue — just a minor type annotation looseness.
  • Good use of copy.copy() for _project_visual_inputs_for_pp_stage — the shallow copy correctly avoids mutating the original GenericVisualInputs while zeroing out payload tensors on non-first PP stages.
  • The _build_llm_rope_positions extraction from the inline loop in get_rope_index is a faithful refactor — the logic matches the original exactly.

Suggested test cases:

No perf tests impacted (no files under scripts/performance/configs/ were changed).

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx

Copy link
Copy Markdown
Contributor Author

/claude review

Comment on lines +946 to 948
if return_compacted_loss_mask:
return output, loss_mask
return output

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Medium] The new tuple return (output, loss_mask) is handled correctly by vlm_step.py (line 379), but the deprecated qwen3_vl_step.py (line 328) does output_tensor = model(**forward_args) without unpacking the tuple. When return_compacted_loss_mask is True (packed + sequence_parallel, or packed CP), qwen3_vl_step would assign the tuple to output_tensor and use a stale loss_mask for the loss function.

This only affects the qwen3_vl_step path which the PR description says is being deprecated in favor of vlm_step. If that path is truly no longer exercised, this is fine — but if any recipe still routes through qwen3_vl_step with packing + SP, it would silently compute the wrong loss. Consider adding the same isinstance(model_output, tuple) guard in qwen3_vl_step.py, or documenting the limitation.

@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Light Code Review - The PR fixes Qwen3-VL packed vlm_step MRoPE by preserving explicit 3D position IDs through the generic packing path, fixing CP consistency for both packed and non-packed modes, avoiding full visual payloads on non-first PP ranks, and fixing MTP tied-embedding state-dict generation. Finding: [Medium] Tuple return not handled in deprecated qwen3_vl_step -- see inline comment on model.py:946-948. The new (output, loss_mask) return path from Qwen3VLModel.forward() is handled in vlm_step.py but not in qwen3_vl_step.py. Test coverage: The PR adds 15 focused unit tests. Coverage is thorough. Suggested test cases: No perf tests impacted.

@yaoyu-33 yaoyu-33 added area:model Model implementations and HF bridge logic bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer labels Jun 26, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:model Model implementations and HF bridge logic bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants