
[Question] Need help to support Qwen3.5 dense(/moe) VLM megatron.bridge plugin together #2073

Description

@demouo

Your Question

The Qwen3.5 VLM megatron.bridge plugin is currently missing.
I tried to develop this part without changing Megatron Bridge,
but I couldn't finish it (about 70% done),
so I'm asking for help here.

What I've Tried

Summary

I have added or modified four files in my commit:

  • slime_plugins/megatron_bridge/qwen3_5.py (megatron bridge)
  • slime_plugins/megatron_bridge/__init__.py
  • slime_plugins/models/qwen3_5.py (grad sync hook)
  • slime/backends/megatron_utils/megatron_to_hf/qwen3_5.py (VLM weight prefix + vision transform)
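
One way to see how complete the weight-prefix mapping is would be to compare the parameter names produced by the Megatron-to-HF conversion with the names in the HF checkpoint. The following is a minimal sketch, assuming both sides are available as plain lists of names. The helper `diff_param_names` and the example names are hypothetical placeholders, not the real Qwen3.5 parameter names or any slime API.

```python
def diff_param_names(converted, expected):
    """Return (missing, unexpected) parameter names as sorted lists.

    converted: names produced by the Megatron-to-HF conversion
    expected:  names present in the HF checkpoint
    """
    converted_set, expected_set = set(converted), set(expected)
    missing = sorted(expected_set - converted_set)      # in the checkpoint, never produced
    unexpected = sorted(converted_set - expected_set)   # produced, but unknown to the checkpoint
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    expected = [
        "model.visual.blocks.0.attn.qkv.weight",
        "model.language_model.embed_tokens.weight",
    ]
    converted = [
        "visual.blocks.0.attn.qkv.weight",  # prefix was not rewritten
        "model.language_model.embed_tokens.weight",
    ]
    missing, unexpected = diff_param_names(converted, expected)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

A name that shows up in both lists with only its prefix changed usually means a prefix rule did not fire for that module.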

Progress

  • rollout -> ok
  • reward -> ok
  • training -> partial

Problem

  • In the first rollout-then-train iteration, grad_norm was far too large (about 15M):
step 0: {'train/loss': -1.0281801223754883e-06, 'train/pg_loss': -1.0281801223754883e-06, 'train/entropy_loss': 5.39764404296875, 'train/pg_clipfrac': 0.0, 'train/ppo_kl': 0.0, 'train/train_rollout_logprob_abs_diff': 9.247343063354492, 'train/grad_norm': 15310134.6350668, 'train/lr-pg_0': 1e-06, 'train/lr-pg_1': 1e-06, 'train/global_batch_size': 4, 'train/step': 0}
  • In the second iteration, the gradients became NaN and training crashed:
actor train:   0%|          | 0/2 [00:00<?, ?microbatch/s]
(MegatronTrainRayActor pid=757548) 
actor train:  50%|█████     | 1/2 [00:02<00:02,  2.89s/microbatch]
(MegatronTrainRayActor pid=757548) 
actor train: 100%|██████████| 2/2 [00:05<00:00,  2.44s/microbatch]
(MegatronTrainRayActor pid=758060) [2026-06-13 22:11:59] reloadable_process_group.py:181 - Reloading 18 process groups in pid 758060 [repeated 7x across cluster]
(MegatronTrainRayActor pid=758060) [2026-06-13 22:11:59] memory_utils.py:47 - [Rank 7] Memory-Usage after wake_up model: {'gpu': '7', 'total_GB': 95.0, 'free_GB': 46.83, 'used_GB': 48.18, 'allocated_GB': 40.17, 'reserved_GB': 44.38, 'host_total_GB': 2265.25, 'host_available_GB': 1399.05, 'host_used_GB': 866.21, 'host_free_GB': 1698.2} [repeated 7x across cluster]
Traceback (most recent call last):
  File "/root/slime/train.py", line 103, in <module>
    train(args)
  File "/root/slime/train.py", line 81, in train
    ray.get(actor_model.async_train(rollout_id, rollout_data_ref))
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2980, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1023, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::MegatronTrainRayActor.train() (pid=758055, ip=29.213.105.187, actor_id=6f7be1aa1d16dd88d256f55302000000, repr=<slime.backends.megatron_utils.actor.MegatronTrainRayActor object at 0x7f980fbc44a0>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 416, in train
    self.train_actor(rollout_id, rollout_data, external_data=external_data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/actor.py", line 547, in train_actor
    train(
  File "/root/slime/slime/backends/megatron_utils/model.py", line 733, in train
    loss_dict, grad_norm = train_one_step(
                           ^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/megatron_utils/model.py", line 551, in train_one_step
    losses_reduced = forward_backward_func(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 683, in forward_backward_no_pipelining
    config.finalize_model_grads_func(
  File "/root/Megatron-LM/megatron/core/distributed/finalize_model_grads.py", line 443, in finalize_model_grads
    model_chunk.finish_grad_sync(force_all_reduce=force_all_reduce)
  File "/root/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 539, in finish_grad_sync
    bucket_group.finish_grad_sync(force_all_reduce=force_all_reduce)
  File "/root/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 522, in finish_grad_sync
    self.start_grad_sync(force_all_reduce=force_all_reduce)
  File "/root/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 386, in start_grad_sync
    self.check_grads(
  File "/root/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 225, in check_grads
    rerun_state_machine.validate_result(
  File "/root/Megatron-LM/megatron/core/rerun_state_machine.py", line 532, in validate_result
    raise RuntimeError(full_message)
RuntimeError: Rank 2, node <xxx>, device 2, iteration -1: Unexpected result nan (message='found NaN in local grad norm for bucket #0 in backward pass before data-parallel communication collective')
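
The step-0 line above can be checked mechanically before the NaN appears. The sketch below parses a `step N: {...}` log line and flags values that look off. The thresholds are illustrative assumptions, not values defined by slime or Megatron. It also assumes that `train/train_rollout_logprob_abs_diff` compares log-probabilities from the training backend with those from the rollout engine, in which case a value near 9 at step 0 (with `ppo_kl` at 0) would hint that the two sides are not running the same weights, for example because of an incomplete weight mapping. That is a hypothesis to verify, not a confirmed diagnosis.

```python
import ast
import math

# Illustrative thresholds (assumptions, tune for your own runs).
MAX_LOGPROB_ABS_DIFF = 0.1
MAX_GRAD_NORM = 1e3


def parse_step_metrics(line):
    """Parse a 'step N: {...}' log line into (step, metrics dict).

    Note: ast.literal_eval cannot parse a bare `nan`, so this only works
    for lines whose values are all finite literals.
    """
    head, _, payload = line.partition(": ")
    step = int(head.split()[1])
    return step, ast.literal_eval(payload)


def suspicious_metrics(metrics):
    """Return descriptions of metrics that exceed the assumed thresholds."""
    findings = []
    checks = [
        ("train/grad_norm", MAX_GRAD_NORM),
        ("train/train_rollout_logprob_abs_diff", MAX_LOGPROB_ABS_DIFF),
    ]
    for key, limit in checks:
        value = metrics.get(key)
        if value is None:
            continue
        if not math.isfinite(value) or value > limit:
            findings.append(f"{key}={value:.3g} (assumed limit {limit:g})")
    return findings


if __name__ == "__main__":
    line = (
        "step 0: {'train/loss': -1.0281801223754883e-06, "
        "'train/train_rollout_logprob_abs_diff': 9.247343063354492, "
        "'train/grad_norm': 15310134.6350668, 'train/step': 0}"
    )
    step, metrics = parse_step_metrics(line)
    for finding in suspicious_metrics(metrics):
        print(f"step {step}: {finding}")
```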

Environment (if relevant)

Megatron Bridge is kept at the version from the original environment. It is NOT replaced with the modified branch, as examples/geo3k_vlm/run_geo3k_qwen35.sh does:

# Qwen3.5-35B-A3B VL RL training on geo3k dataset

pip install -U transformers

# IMPORTANT: This branch is specially modified for slime's current Megatron
# version and Qwen3.5 from the main Megatron Bridge. Other models are not verified!
# To restore the original Megatron Bridge, run:
#   pip install git+https://github.com/fzyzcjy/Megatron-Bridge.git@dev_rl --no-build-isolation
# TODO: Remove this once Megatron & Megatron Bridge are upgraded upstream.
pip install git+https://github.com/coding-famer/Megatron-Bridge-slime.git@qwen35 --no-build-isolation
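
Since two different Megatron Bridge forks are involved, it can help to confirm which one is actually installed in the training environment. The sketch below reads the installed version and, for pip installs from a git URL, the `direct_url.json` record that pip writes (PEP 610). The helper `describe_install` is hypothetical, and the distribution name `megatron-bridge` is an assumption; check `pip list` for the exact name.

```python
import json
from importlib import metadata


def describe_install(dist_name):
    """Return version and install-source info for a distribution, or None if not installed."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    info = {"version": dist.version, "url": None, "requested_revision": None, "commit": None}
    # pip writes direct_url.json only for URL/VCS installs; read_text returns None otherwise.
    raw = dist.read_text("direct_url.json")
    if raw:
        direct = json.loads(raw)
        info["url"] = direct.get("url")
        vcs = direct.get("vcs_info", {})
        info["requested_revision"] = vcs.get("requested_revision")  # e.g. a branch name
        info["commit"] = vcs.get("commit_id")
    return info


if __name__ == "__main__":
    # "megatron-bridge" is an assumed distribution name.
    print(describe_install("megatron-bridge"))
```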

Additional Context

No
