[training] fix Qwen3-VL packed vlm_step MRoPE by cuichenx · Pull Request #4532 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-06-26T18:12:17Z

Summary

This PR makes the generic vlm_step path work for Qwen3.5-VL in-batch packed sequence training, so the specialized qwen3_vl_step path can eventually be deprecated without losing Qwen3-VL MRoPE behavior.

The original bug was in the packed generic VLM path: packed batches carried Qwen3-VL 3D MRoPE position metadata, but vlm_step/Qwen3VLModel did not preserve and consume that metadata the same way as qwen3_vl_step. That made packed vlm_step produce a different loss curve from the Qwen3-specific step.

This update also fixes the CP=4 variant. In packed CP, the model was mixing padded CP metadata with a differently compacted hidden-state/supervision stream. In non-packed CP, hidden states could be CP-local while MTP inputs, positions, labels, and loss masks remained full-sequence. Both cases now use consistent CP-local tensors.

Changes:

Preserve explicit 3D Qwen3-VL MRoPE position_ids through generic vlm_step packing.
Let Qwen3VLModel use explicit MRoPE metadata in the packed path instead of recomputing incompatible positions.
For packed CP, derive one TE THD partition index from padded cu_seqlens and apply it consistently to LM inputs, embeddings, vision/deepstack masks, MRoPE position IDs, labels, and loss masks.
For non-packed CP, slice full-sequence LM/MTP inputs, MRoPE position IDs, labels, and loss masks only when they are still full-length, avoiding double-slicing already-local tensors.
Avoid sending full visual payloads to non-first PP ranks; only the metadata needed by later ranks is retained.
Fix Qwen3-VL PP + MTP tied-embedding checkpoint state-dict generation by dropping the redundant output-layer checkpoint entry on non-preprocess MTP ranks.
Add focused unit coverage for the packed MRoPE metadata path, PP metadata-only batches, CP padded cu-seqlens handling, dense CP no-double-slice behavior, and tied MTP state dict handling.

Experiment Labels

A: pre-fix generic vlm_step with in-batch packing enabled. This is the broken path; it runs but has the wrong loss trajectory.
B: fixed generic vlm_step from this PR with the same in-batch packing setup as A.
C: existing specialized qwen3_vl_step baseline with packing deferred to the step function.

The expected result is that B matches C where C has a valid baseline, while A differs from C.

Validation

Local checks:

python -m py_compile on changed Python files
uv run --no-sync --with ruff ruff check src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_utils.py
uv run --no-sync --with pre-commit pre-commit run --files src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_utils.py
git diff --check
Attempted uv run pre-commit run --all-files, but the local host cannot resolve nvidia-resiliency-ext==0.6.0 because that package has no wheel for this host platform.
Focused pytest was not rerun locally because the local environment is not synced for this checkout.

H100 26.06 container validation:

Main packed comparison:
- A, pre-fix vlm_step + packing: completed, but produced the wrong loss trajectory compared with C.
- B, fixed vlm_step + packing: matches C.
- C, qwen3_vl_step + packing: baseline.
TP=4, 20 iterations:
- B final LM/MTP: 0.610717 / 0.865842
- C final LM/MTP: 0.606260 / 0.864398
PP=4, 20 iterations:
- B final LM/MTP: 0.605223 / 0.863726
- C final LM/MTP: 0.604834 / 0.861366
CP=4 packed vlm_step:
- Before the CP fix, B completed but had the wrong loss scale, final LM/MTP: 3.329275 / 4.187633.
- After the CP fix, B completed 20 iterations with final LM/MTP: 0.606448 / 0.641229, with 0 skipped and 0 NaN iterations.
CP=4 non-packed vlm_step:
- Before the CP fix, this path failed in MTP embedding concatenation because hidden states were CP-local while MTP inputs were still full-sequence.
- After the CP fix, B completed 10 iterations with final LM/MTP: 0.834404 / 0.906320, with 0 skipped and 0 NaN iterations.
CP=4 qwen3_vl_step:
- No valid C loss baseline from this launcher: the existing specialized step fails before the first logged iteration in CP batch splitting.

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot · 2026-06-26T18:12:21Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Chen Cui <chcui@nvidia.com>

claude · 2026-06-26T22:38:57Z

LGTM — clean, well-tested fix for the packed vlm_step MRoPE path.

Observations (non-blocking):

The polymorphic return type from Qwen3VLModel.forward (output vs (output, loss_mask)) is correctly handled by the isinstance(model_output, tuple) check in vlm_step.py:379. The use_dist_train dict-return path at line 928-930 takes precedence, but this is safe because labels/loss_mask are None on non-last PP stages, so return_compacted_loss_mask stays False.
_prepare_packed_bshd_tensor types attention_mask as torch.Tensor (required), but the CP-index branch at line 202 never uses it. Not a runtime issue — just a minor type annotation looseness.
Good use of copy.copy() for _project_visual_inputs_for_pp_stage — the shallow copy correctly avoids mutating the original GenericVisualInputs while zeroing out payload tensors on non-first PP stages.
The _build_llm_rope_positions extraction from the inline loop in get_rope_index is a faithful refactor — the logic matches the original exactly.

Suggested test cases:

No perf tests impacted (no files under scripts/performance/configs/ were changed).

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx · 2026-06-26T22:44:22Z

/claude review

claude · 2026-06-26T22:50:14Z

+        if return_compacted_loss_mask:
+            return output, loss_mask
        return output


[Medium] The new tuple return (output, loss_mask) is handled correctly by vlm_step.py (line 379), but the deprecated qwen3_vl_step.py (line 328) does output_tensor = model(**forward_args) without unpacking the tuple. When return_compacted_loss_mask is True (packed + sequence_parallel, or packed CP), qwen3_vl_step would assign the tuple to output_tensor and use a stale loss_mask for the loss function.

This only affects the qwen3_vl_step path which the PR description says is being deprecated in favor of vlm_step. If that path is truly no longer exercised, this is fine — but if any recipe still routes through qwen3_vl_step with packing + SP, it would silently compute the wrong loss. Consider adding the same isinstance(model_output, tuple) guard in qwen3_vl_step.py, or documenting the limitation.

claude · 2026-06-26T22:53:39Z

Light Code Review - The PR fixes Qwen3-VL packed vlm_step MRoPE by preserving explicit 3D position IDs through the generic packing path, fixing CP consistency for both packed and non-packed modes, avoiding full visual payloads on non-first PP ranks, and fixing MTP tied-embedding state-dict generation. Finding: [Medium] Tuple return not handled in deprecated qwen3_vl_step -- see inline comment on model.py:946-948. The new (output, loss_mask) return path from Qwen3VLModel.forward() is handled in vlm_step.py but not in qwen3_vl_step.py. Test coverage: The PR adds 15 focused unit tests. Coverage is thorough. Suggested test cases: No perf tests impacted.

Signed-off-by: Chen Cui <chcui@nvidia.com>

[training] fix: support packed Qwen3-VL vlm step

55f4bd9

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx added 2 commits June 26, 2026 14:46

fix(training): align Qwen3-VL CP inputs

9c8ac65

Signed-off-by: Chen Cui <chcui@nvidia.com>

refactor(training): simplify Qwen3-VL CP handling

4d3d611

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx marked this pull request as ready for review June 26, 2026 22:32

copy-pr-bot Bot temporarily deployed to public June 26, 2026 22:33 Inactive

copy-pr-bot Bot had a problem deploying to test June 26, 2026 22:33 Error

refactor(models): share Qwen packed sequence helpers

0d606b6

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 26, 2026 22:46 Inactive

copy-pr-bot Bot temporarily deployed to test June 26, 2026 22:46 Inactive

claude Bot reviewed Jun 26, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to public June 26, 2026 23:00 Inactive

yaoyu-33 added area:model Model implementations and HF bridge logic bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer labels Jun 26, 2026

copy-pr-bot Bot temporarily deployed to public June 26, 2026 23:05 Inactive

copy-pr-bot Bot temporarily deployed to public June 26, 2026 23:07 Inactive

copy-pr-bot Bot temporarily deployed to public June 26, 2026 23:28 Inactive

test(training): cover Qwen packed sequence helpers

f03e953

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot temporarily deployed to public June 27, 2026 18:05 Inactive

copy-pr-bot Bot temporarily deployed to test June 27, 2026 18:06 Inactive

copy-pr-bot Bot temporarily deployed to public June 27, 2026 18:15 Inactive

copy-pr-bot Bot temporarily deployed to public June 27, 2026 18:36 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[training] fix Qwen3-VL packed vlm_step MRoPE#4532

[training] fix Qwen3-VL packed vlm_step MRoPE#4532
cuichenx wants to merge 5 commits into
mainfrom
nora/vlm-mrope-20260625

cuichenx commented Jun 26, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 26, 2026

Uh oh!

claude Bot commented Jun 26, 2026

Uh oh!

cuichenx commented Jun 26, 2026

Uh oh!

claude Bot Jun 26, 2026

Uh oh!

claude Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

cuichenx commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Experiment Labels

Validation

Uh oh!

copy-pr-bot Bot commented Jun 26, 2026

Uh oh!

claude Bot commented Jun 26, 2026

Uh oh!

cuichenx commented Jun 26, 2026

Uh oh!

claude Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cuichenx commented Jun 26, 2026 •

edited

Loading