[training] fix Qwen3-VL packed vlm_step MRoPE #4532
Signed-off-by: Chen Cui <chcui@nvidia.com>
LGTM — clean, well-tested fix for the packed Observations (non-blocking):
Suggested test cases: No perf tests impacted (no files under
/claude review
    if return_compacted_loss_mask:
        return output, loss_mask
    return output
[Medium] The new tuple return (output, loss_mask) is handled correctly by vlm_step.py (line 379), but the deprecated qwen3_vl_step.py (line 328) does output_tensor = model(**forward_args) without unpacking the tuple. When return_compacted_loss_mask is True (packed + sequence_parallel, or packed CP), qwen3_vl_step would assign the tuple to output_tensor and use a stale loss_mask for the loss function.
This only affects the qwen3_vl_step path which the PR description says is being deprecated in favor of vlm_step. If that path is truly no longer exercised, this is fine — but if any recipe still routes through qwen3_vl_step with packing + SP, it would silently compute the wrong loss. Consider adding the same isinstance(model_output, tuple) guard in qwen3_vl_step.py, or documenting the limitation.
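The guard suggested above could look roughly like the following. This is a minimal sketch, not the code in qwen3_vl_step.py: the helper name unpack_model_output is hypothetical, and it assumes the only tuple the model returns is (output, loss_mask).

```python
def unpack_model_output(model_output, loss_mask):
    """Return (output_tensor, loss_mask) for the loss function.

    Hypothetical helper: if the model returned (output, compacted_loss_mask),
    prefer the returned mask over the one built from the batch.
    """
    if isinstance(model_output, tuple):
        output_tensor, compacted_loss_mask = model_output
        return output_tensor, compacted_loss_mask
    # Plain tensor return: the caller's loss_mask is still the right one.
    return model_output, loss_mask
```

In a step function this would replace a bare output_tensor = model(**forward_args) with output_tensor, loss_mask = unpack_model_output(model(**forward_args), loss_mask), so the loss never sees a tuple or a stale mask.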
Light Code Review

The PR fixes Qwen3-VL packed vlm_step MRoPE by preserving explicit 3D position IDs through the generic packing path, fixing CP consistency for both packed and non-packed modes, avoiding full visual payloads on non-first PP ranks, and fixing MTP tied-embedding state-dict generation.

Finding: [Medium] Tuple return not handled in deprecated qwen3_vl_step; see the inline comment on model.py:946-948. The new (output, loss_mask) return path from Qwen3VLModel.forward() is handled in vlm_step.py but not in qwen3_vl_step.py.

Test coverage: The PR adds 15 focused unit tests. Coverage is thorough.

Suggested test cases: No perf tests impacted.
Summary
This PR makes the generic vlm_step path work for Qwen3.5-VL in-batch packed sequence training, so the specialized qwen3_vl_step path can eventually be deprecated without losing Qwen3-VL MRoPE behavior.

The original bug was in the packed generic VLM path: packed batches carried Qwen3-VL 3D MRoPE position metadata, but vlm_step/Qwen3VLModel did not preserve and consume that metadata the same way as qwen3_vl_step. That made packed vlm_step produce a different loss curve from the Qwen3-specific step.

This update also fixes the CP=4 variant. In packed CP, the model was mixing padded CP metadata with a differently compacted hidden-state/supervision stream. In non-packed CP, hidden states could be CP-local while MTP inputs, positions, labels, and loss masks remained full-sequence. Both cases now use consistent CP-local tensors.
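The CP consistency requirement can be illustrated with a small sketch: one index is computed per rank and applied to every token-aligned stream, so hidden states, labels, and loss masks can never disagree about which tokens are local. The two-chunk load-balanced split used here is an assumption for illustration, and the function names are hypothetical; the PR's actual partitioning code is not shown in this conversation.

```python
def cp_local_indices(seq_len, cp_size, cp_rank):
    """Token indices owned by one CP rank.

    Assumption: the sequence is cut into 2 * cp_size equal chunks and
    rank r keeps chunk r plus the mirrored chunk (2 * cp_size - 1 - r).
    """
    num_chunks = 2 * cp_size
    if seq_len % num_chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * cp_size")
    chunk = seq_len // num_chunks
    mirror = num_chunks - 1 - cp_rank
    first = range(cp_rank * chunk, (cp_rank + 1) * chunk)
    second = range(mirror * chunk, (mirror + 1) * chunk)
    return list(first) + list(second)


def shard_token_streams(streams, cp_size, cp_rank):
    """Apply one shared index to every token-aligned stream."""
    lengths = {len(values) for values in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all streams must have the same token length")
    index = cp_local_indices(lengths.pop(), cp_size, cp_rank)
    return {name: [values[i] for i in index] for name, values in streams.items()}


# Toy full-sequence streams standing in for real tensors.
streams = {
    "hidden_states": [f"h{i}" for i in range(8)],
    "labels": list(range(100, 108)),
    "loss_mask": [1, 1, 0, 1, 1, 0, 1, 1],
}
local = shard_token_streams(streams, cp_size=2, cp_rank=0)
```

With real tensors the same index would be applied along the token axis of each stream, including the token axis of the 3D MRoPE position IDs. The failure modes described above correspond to sharding only some of the streams, or sharding them with different indices.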
Changes:
- Preserve explicit 3D position_ids through generic vlm_step packing.
- Make Qwen3VLModel use explicit MRoPE metadata in the packed path instead of recomputing incompatible positions.
- [...] cu_seqlens and apply it consistently to LM inputs, embeddings, vision/deepstack masks, MRoPE position IDs, labels, and loss masks.

Experiment Labels
- A: vlm_step with in-batch packing enabled. This is the broken path; it runs but has the wrong loss trajectory.
- B: vlm_step from this PR with the same in-batch packing setup as A.
- C: qwen3_vl_step baseline with packing deferred to the step function.

The expected result is that B matches C where C has a valid baseline, while A differs from C.
Validation
Local checks:
- python -m py_compile on changed Python files
- uv run --no-sync --with ruff ruff check src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_utils.py
- uv run --no-sync --with pre-commit pre-commit run --files src/megatron/bridge/models/qwen_vl/modelling_qwen3_vl/model.py tests/unit_tests/models/qwen_vl/modelling_qwen3_vl/test_utils.py
- git diff --check
- [...] uv run pre-commit run --all-files, but the local host cannot resolve nvidia-resiliency-ext==0.6.0 because that package has no wheel for this host platform.

H100 26.06 container validation:
- [...] vlm_step + packing: completed, but produced the wrong loss trajectory compared with C.
- [...] vlm_step + packing: matches C.
- [...] qwen3_vl_step + packing: baseline.
- 0.610717/0.865842
- 0.606260/0.864398
- 0.605223/0.863726
- 0.604834/0.861366
- vlm_step: 3.329275/4.187633.
- 0.606448/0.641229, with 0 skipped and 0 NaN iterations.
- vlm_step: 0.834404/0.906320, with 0 skipped and 0 NaN iterations.
- qwen3_vl_step: