[Bug] Multi-head MTP (--mtp-num-layers > 1) crashes at training-step logging #2131

@ZiyiTsang

Bug Description

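slime's per-step MTP-loss logging hard-codes a single-MTP-layer assumption. When the model has more than one MTP head and MTP training is enabled, training crashes with the traceback shown under Logs: the logging code calls .item() on tracker["values"] * mtp_loss_scale, which holds 3 elements for a 3-head model and therefore cannot be converted to a scalar.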

Trigger conditions (both required):

  1. --mtp-num-layers > 1 — a multi-head MTP model (e.g. MiMo-7B with 3 MTP heads).
  2. --enable-mtp-training — this turns on the MTP-loss logging branch unconditionally (the crash is in the logging code, not the forward/backward pass).

With --mtp-num-layers 1 (single head) the bug does not reproduce.
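
The failure mode can be reproduced outside slime with a few lines of PyTorch. This is a minimal sketch, assuming (as the traceback suggests) that tracker["values"] holds one loss per MTP layer; the tensor contents and the scale value below are made up for illustration and are not taken from slime.

```python
import torch

# Hypothetical stand-in for tracker["values"]: one accumulated loss per MTP layer.
values = torch.tensor([0.9, 1.1, 1.3])
mtp_loss_scale = 0.5  # illustrative value only

try:
    # .item() is only defined for tensors with exactly one element.
    (values * mtp_loss_scale).item()
    raised = False
except (RuntimeError, ValueError) as err:  # exception type depends on the PyTorch version
    raised = True
    print(f"multi-layer case failed: {err}")

# With a single MTP layer the same expression succeeds.
single = (values[:1] * mtp_loss_scale).item()
print(f"single-layer case: {single:.4f}")
```

This matches the observation that --mtp-num-layers 1 does not reproduce the crash.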

Steps to Reproduce

Trigger: --mtp-num-layers > 1 together with --enable-mtp-training.

Minimal command (only the relevant args are shown; the rest is standard GRPO config):

# --mtp-num-layers > 1 triggers the crash;
# --enable-mtp-training enables the crashing log path.
python train.py \
  --mtp-num-layers 3 \
  --enable-mtp-training \
  --mtp-loss-scaling-factor 0.35 \
  ... (standard rollout/optimizer/perf args)

Reproduced with: MiMo-7B-Base converted to Megatron torch_dist with 3 MTP heads (MTP3), GRPO + MTP training, TP=4 / PP=1 / CP=1, single node.

Switching to --mtp-num-layers 1 makes the crash disappear.

Expected Behavior

Training runs and logs the MTP loss normally, regardless of the number of MTP layers.

Actual Behavior

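Training crashes in the per-step MTP-loss logging with "RuntimeError: a Tensor with 3 elements cannot be converted to Scalar". The full traceback is under Logs.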

Environment

  • slime version: 0.2.4 (pip) — source commit 4bd75ad1
  • Python version: 3.12.0
  • PyTorch version: 2.9.1+cu128
  • CUDA/ROCm version: CUDA 12.8 (driver 565.57.01)
  • GPU type and count: NVIDIA RTX A6000 (single node, TP=4)
  • OS: Linux (kernel 5.15.0-60-generic)
  • SGLang version: 0.5.10.post1
  • Megatron-LM version: local clone on PYTHONPATH (core_v0.15.0rc7-548-g3714d81d4)

Logs


Traceback (most recent call last):
  File "train.py", line 103, in <module>
    train(args)
  File "train.py", line 81, in train
    ray.get(actor_model.async_train(rollout_id, rollout_data_ref))
  File ".../site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File ".../site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File ".../site-packages/ray/_private/worker.py", line 2981, in get
    values, debugger_breakpoint = worker.get_objects(
  File ".../site-packages/ray/_private/worker.py", line 1012, in get_objects
    raise value.as_instance_of_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::MegatronTrainRayActor.train() (pid=..., ip=..., actor_id=..., repr=<MegatronTrainRayActor object at 0x...>)
  File "slime/backends/megatron_utils/actor.py", line 416, in train
    self.train_actor(rollout_id, rollout_data, external_data=external_data)
  File "slime/backends/megatron_utils/actor.py", line 547, in train_actor
    train(
  File "slime/backends/megatron_utils/model.py", line 783, in train
    mtp_losses = (tracker["values"] * mtp_loss_scale).item()
RuntimeError: a Tensor with 3 elements cannot be converted to Scalar

Additional Context

I plan to open a PR to fix this; I am still testing the fix.
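
One possible direction for a fix, sketched here only as an illustration and not as the planned PR: convert the per-layer tensor to a list (for example with .tolist() instead of .item()) and log one entry per layer plus an aggregate. The function name, the metric key prefix, and the choice of the mean as the aggregate are all assumptions made for this sketch.

```python
from collections.abc import Sequence


def mtp_loss_log_entries(
    per_layer_losses: Sequence[float],
    mtp_loss_scale: float,
    prefix: str = "train/mtp_loss",  # hypothetical metric key
) -> dict[str, float]:
    """Build log entries that work for any number of MTP layers."""
    scaled = [float(v) * mtp_loss_scale for v in per_layer_losses]
    if not scaled:
        return {}
    # Aggregate entry: mean over layers (an assumption; a sum may be preferred).
    entries = {prefix: sum(scaled) / len(scaled)}
    # Per-layer entries are only added when there is more than one layer,
    # so single-layer runs keep a single key.
    if len(scaled) > 1:
        for i, value in enumerate(scaled):
            entries[f"{prefix}_layer_{i}"] = value
    return entries


# In the training loop, the input could come from tracker["values"].tolist().
print(mtp_loss_log_entries([0.9, 1.1, 1.3], 0.5))
```

With a single layer the function returns one key, so existing dashboards for --mtp-num-layers 1 would be unaffected under this scheme.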

Pre-submission Checklist

  • I have read the CONTRIBUTING.md and understand the collaboration scope.
  • I have read the documentation and my issue is not addressed there.
  • I have searched for existing issues and this is not a duplicate.
  • I have provided a minimal, reproducible example.

Labels: bug
