Bug Description
slime's per-step MTP-loss logging hard-codes a single-MTP-layer assumption. When the model has more than one MTP head and MTP training is enabled, training crashes with the traceback shown under Logs.
Trigger conditions (both required):
- --mtp-num-layers > 1: a multi-head MTP model (e.g. MiMo-7B with 3 MTP heads).
- --enable-mtp-training: this turns on the MTP-loss logging branch unconditionally (the crash is in the logging code, not the forward/backward pass).
With --mtp-num-layers 1 (single head) the bug does not reproduce.
Steps to Reproduce
Trigger: --mtp-num-layers > 1 together with --enable-mtp-training.
Minimal command (only the relevant args are shown; the rest is standard GRPO config):
# --mtp-num-layers > 1 triggers the crash
# --enable-mtp-training enables the crashing log path
python train.py \
  --mtp-num-layers 3 \
  --enable-mtp-training \
  --mtp-loss-scaling-factor 0.35 \
  ... (standard rollout/optimizer/perf args)
Reproduced with: MiMo-7B-Base converted to Megatron torch_dist with 3 MTP heads (MTP3), GRPO + MTP training, TP=4 / PP=1 / CP=1, single node.
Switching to --mtp-num-layers 1 makes the crash disappear.
Expected Behavior
Training runs and the MTP loss is logged normally when the model has more than one MTP head.
Actual Behavior
Training crashes with RuntimeError: a Tensor with 3 elements cannot be converted to Scalar (full traceback under Logs).
Environment
- slime version: 0.2.4 (pip), source commit 4bd75ad1
- Python version: 3.12.0
- PyTorch version: 2.9.1+cu128
- CUDA/ROCm version: CUDA 12.8 (driver 565.57.01)
- GPU type and count: NVIDIA RTX A6000 (single node, TP=4)
- OS: Linux (kernel 5.15.0-60-generic)
- SGLang version: 0.5.10.post1
- Megatron-LM version: local clone on PYTHONPATH (core_v0.15.0rc7-548-g3714d81d4)
Logs
Traceback (most recent call last):
File "train.py", line 103, in <module>
train(args)
File "train.py", line 81, in train
ray.get(actor_model.async_train(rollout_id, rollout_data_ref))
File ".../site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File ".../site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
File ".../site-packages/ray/_private/worker.py", line 2981, in get
values, debugger_breakpoint = worker.get_objects(
File ".../site-packages/ray/_private/worker.py", line 1012, in get_objects
raise value.as_instance_of_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::MegatronTrainRayActor.train() (pid=..., ip=..., actor_id=..., repr=<MegatronTrainRayActor object at 0x...>)
File "slime/backends/megatron_utils/actor.py", line 416, in train
self.train_actor(rollout_id, rollout_data, external_data=external_data)
File "slime/backends/megatron_utils/actor.py", line 547, in train_actor
train(
File "slime/backends/megatron_utils/model.py", line 783, in train
mtp_losses = (tracker["values"] * mtp_loss_scale).item()
RuntimeError: a Tensor with 3 elements cannot be converted to Scalar
Additional Context
I plan to open a PR to fix this and am still testing it.
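For illustration only, here is a minimal sketch of one possible direction, not the tested fix. It assumes tracker["values"] holds one loss value per MTP layer (which the "3 elements" in the error suggests), so the values would be converted with .tolist() instead of .item(). The helper name mtp_loss_log_entries and the metric key names are hypothetical and not part of slime; the function takes a plain list so the sketch runs without torch.

```python
from typing import Dict, Sequence


def mtp_loss_log_entries(values: Sequence[float], scale: float) -> Dict[str, float]:
    """Build scalar log entries from per-layer MTP loss values.

    `values` stands in for `tracker["values"].tolist()` (assumed to be one
    value per MTP layer). The key names below are hypothetical.
    """
    scaled = [v * scale for v in values]
    if not scaled:
        return {}
    # One scalar per MTP layer, so nothing needs to collapse a multi-element tensor.
    entries = {f"mtp_loss/layer_{i}": s for i, s in enumerate(scaled)}
    # A single aggregate next to the per-layer values; the mean is an arbitrary choice here.
    entries["mtp_loss/mean"] = sum(scaled) / len(scaled)
    return entries


# Three heads, as in the MTP3 reproduction; the numbers are made up.
print(mtp_loss_log_entries([0.5, 0.25, 0.75], scale=2.0))
```

With a single head this reduces to one per-layer entry plus an identical mean, so the single-head logging path would keep working.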