Your Question
The megatron.bridge plugin for the Qwen3.5 VLM is currently missing.
I am trying to implement it without modifying Megatron Bridge itself,
but I have not been able to finish it (about 70% done),
so I am asking for help here.
What I've Tried
Summary
I have added or modified four files in my commit:
- slime_plugins/megatron_bridge/qwen3_5.py (Megatron Bridge)
- slime_plugins/megatron_bridge/__init__.py
- slime_plugins/models/qwen3_5.py (grad sync hook)
- slime/backends/megatron_utils/megatron_to_hf/qwen3_5.py (VLM weight prefix + vision transform)
Progress
Problem
- In the first rollout-then-train turn, grad_norm was far too large (about 15M):
step 0: {'train/loss': -1.0281801223754883e-06, 'train/pg_loss': -1.0281801223754883e-06, 'train/entropy_loss': 5.39764404296875, 'train/pg_clipfrac': 0.0, 'train/ppo_kl': 0.0, 'train/train_rollout_logprob_abs_diff': 9.247343063354492, 'train/grad_norm': 15310134.6350668, 'train/lr-pg_0': 1e-06, 'train/lr-pg_1': 1e-06, 'train/global_batch_size': 4, 'train/step': 0}
- The second turn failed with NaN gradients:
actor train: 0%| | 0/2 [00:00<?, ?microbatch/s]
(MegatronTrainRayActor pid=757548)
actor train: 50%|█████ | 1/2 [00:02<00:02, 2.89s/microbatch]
(MegatronTrainRayActor pid=757548)
actor train: 100%|██████████| 2/2 [00:05<00:00, 2.44s/microbatch]
(MegatronTrainRayActor pid=758060) [2026-06-13 22:11:59] reloadable_process_group.py:181 - Reloading 18 process groups in pid 758060 [repeated 7x across cluster]
(MegatronTrainRayActor pid=758060) [2026-06-13 22:11:59] memory_utils.py:47 - [Rank 7] Memory-Usage after wake_up model: {'gpu': '7', 'total_GB': 95.0, 'free_GB': 46.83, 'used_GB': 48.18, 'allocated_GB': 40.17, 'reserved_GB': 44.38, 'host_total_GB': 2265.25, 'host_available_GB': 1399.05, 'host_used_GB': 866.21, 'host_free_GB': 1698.2} [repeated 7x across cluster]
Traceback (most recent call last):
File "/root/slime/train.py", line 103, in <module>
train(args)
File "/root/slime/train.py", line 81, in train
ray.get(actor_model.async_train(rollout_id, rollout_data_ref))
File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2980, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1023, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::MegatronTrainRayActor.train() (pid=758055, ip=29.213.105.187, actor_id=6f7be1aa1d16dd88d256f55302000000, repr=<slime.backends.megatron_utils.actor.MegatronTrainRayActor object at 0x7f980fbc44a0>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/slime/slime/backends/megatron_utils/actor.py", line 416, in train
self.train_actor(rollout_id, rollout_data, external_data=external_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/slime/slime/backends/megatron_utils/actor.py", line 547, in train_actor
train(
File "/root/slime/slime/backends/megatron_utils/model.py", line 733, in train
loss_dict, grad_norm = train_one_step(
^^^^^^^^^^^^^^^
File "/root/slime/slime/backends/megatron_utils/model.py", line 551, in train_one_step
losses_reduced = forward_backward_func(
^^^^^^^^^^^^^^^^^^^^^^
File "/root/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 683, in forward_backward_no_pipelining
config.finalize_model_grads_func(
File "/root/Megatron-LM/megatron/core/distributed/finalize_model_grads.py", line 443, in finalize_model_grads
model_chunk.finish_grad_sync(force_all_reduce=force_all_reduce)
File "/root/Megatron-LM/megatron/core/distributed/distributed_data_parallel.py", line 539, in finish_grad_sync
bucket_group.finish_grad_sync(force_all_reduce=force_all_reduce)
File "/root/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 522, in finish_grad_sync
self.start_grad_sync(force_all_reduce=force_all_reduce)
File "/root/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 386, in start_grad_sync
self.check_grads(
File "/root/Megatron-LM/megatron/core/distributed/param_and_grad_buffer.py", line 225, in check_grads
rerun_state_machine.validate_result(
File "/root/Megatron-LM/megatron/core/rerun_state_machine.py", line 532, in validate_result
raise RuntimeError(full_message)
RuntimeError: Rank 2, node <xxx>, device 2, iteration -1: Unexpected result nan (message='found NaN in local grad norm for bucket #0 in backward pass before data-parallel communication collective')
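The step 0 metrics above show two warning signs before the NaN appears: a very large grad_norm and a large train_rollout_logprob_abs_diff. A minimal sketch of a check that flags such values from the logged line follows. The helper names and both limits are illustrative assumptions, not part of slime or Megatron. A large log-prob difference may indicate that the training-side and rollout-side models disagree (for example through the weight conversion), but that is only a hypothesis to verify.

```python
import ast
import math


def parse_step_metrics(line: str) -> dict:
    """Turn a 'step N: {...}' log line into a dict of metric values."""
    _, _, payload = line.partition(": ")
    return ast.literal_eval(payload)


def flag_suspicious(metrics, max_grad_norm=100.0, max_logprob_diff=0.1):
    """Return a message for each metric that is non-finite or above its limit.

    Both limits are illustrative assumptions, not slime defaults.
    """
    limits = {
        "train/grad_norm": max_grad_norm,
        "train/train_rollout_logprob_abs_diff": max_logprob_diff,
    }
    found = []
    for key, limit in limits.items():
        value = metrics.get(key)
        if value is None:
            continue  # metric not logged in this line
        if not math.isfinite(value) or value > limit:
            found.append(f"{key}={value:g} is outside the expected range (limit {limit:g})")
    return found


# Shortened copy of the step 0 line from this report.
line = (
    "step 0: {'train/pg_clipfrac': 0.0, 'train/ppo_kl': 0.0, "
    "'train/train_rollout_logprob_abs_diff': 9.247343063354492, "
    "'train/grad_norm': 15310134.6350668}"
)
found = flag_suspicious(parse_step_metrics(line))
for message in found:
    print(message)
```

Running a check like this after the first step would stop the job before the second turn, which makes it easier to inspect the weights while they are still finite.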
Environment (if relevant)
Megatron Bridge is kept as in the original environment; it is NOT replaced with the modified branch that examples/geo3k_vlm/run_geo3k_qwen35.sh installs:
# Qwen3.5-35B-A3B VL RL training on geo3k dataset
pip install -U transformers
# IMPORTANT: This branch is specially modified for slime's current Megatron
# version and Qwen3.5 from the main Megatron Bridge. Other models are not verified!
# To restore the original Megatron Bridge, run:
# pip install git+https://github.com/fzyzcjy/Megatron-Bridge.git@dev_rl --no-build-isolation
# TODO: Remove this once Megatron & Megatron Bridge are upgraded upstream.
pip install git+https://github.com/coding-famer/Megatron-Bridge-slime.git@qwen35 --no-build-isolation
Additional Context
No
Pre-submission Checklist