Fix MLite microbatch loss and forward-only output contracts by ISEEKYAN · Pull Request #68 · ISEEKYAN/Megatron-LM

ISEEKYAN · 2026-06-28T03:16:16Z

Summary

Mirror VERL’s Megatron loss-reduction hook: pass logical_loss * num_microbatches to the schedule while retaining MLite’s standard schedule-side microbatch averaging.
Keep backward loss separate from per-microbatch reporting, preserving every original loss, model output, Metric accumulator, and plain metric through the reduction store.
Preserve loss-context propagation across PP/VPP, and preserve PP1 forward-only token log probabilities with optional entropy.

Why

VERL PPO/SFT losses are already contributions normalized against the logical global batch. Megatron does not change its runtime API for this case: its VERL postprocess hook compensates for the schedule’s fixed microbatch averaging and reports the unscaled reduction payload separately. MLite now follows the same contract, keeping connector-specific normalization out of the public runtime interface.
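The compensation described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not MLite's or Megatron's actual code: `ReductionStore`, `loss_hook`, and `run_schedule` are hypothetical names, and plain floats stand in for loss tensors. It assumes the schedule divides each microbatch loss by `num_microbatches` before accumulating gradients, as the PR description states.

```python
from dataclasses import dataclass, field


@dataclass
class ReductionStore:
    """Hypothetical stand-in for the per-microbatch reporting store."""
    losses: list = field(default_factory=list)


def loss_hook(logical_loss, num_microbatches, store):
    # Report the original, unscaled loss for this microbatch.
    store.losses.append(logical_loss)
    # Pre-scale the backward loss so the schedule's averaging cancels out.
    return logical_loss * num_microbatches


def run_schedule(logical_losses):
    """Assumed schedule behaviour: divide each microbatch loss by the count."""
    num_microbatches = len(logical_losses)
    store = ReductionStore()
    backward_total = 0.0
    for logical_loss in logical_losses:
        scaled = loss_hook(logical_loss, num_microbatches, store)
        backward_total += scaled / num_microbatches  # fixed microbatch averaging
    return backward_total, store


# Each value is already a contribution normalized against the logical global batch.
contributions = [0.125, 0.25, 0.5, 0.125]
backward_total, store = run_schedule(contributions)
```

Under these assumptions the effective backward loss equals the plain sum of the logical contributions, while the store still holds every original per-microbatch value for reporting.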

Scope

This PR contains only MLite runtime/connector code and focused tests. It does not include launch scripts, training configurations, or changes to the external VERL repository.

Validation

Focused pytest: 59 passed (test_loss_microbatch_contract, test_ops_data_trainstep_unit, test_runtime_backend_unit, test_bridge_backend, and test_mlite_engine_forward_only).
Slurm validation job 13202007: COMPLETED, exit code 0:0.
Local focused pytest: 17 passed.
Ruff checks, Python compile checks, and git diff --check passed.
Commit history, filenames, and full diff passed mechanical internal-identifier scans; the branch contains one commit on current main.

Mirror VERL’s Megatron loss-reduction hook so schedules retain standard microbatch averaging while logical-batch PPO gradients and per-microbatch reporting remain correct. Preserve loss-context propagation, metric aggregation across all microbatches, and PP1 forward-only outputs.

ISEEKYAN mentioned this pull request Jun 28, 2026

Fix MLite DAPO loss and forward-only microbatch contracts #69

Closed

ISEEKYAN force-pushed the mlite-dapo-loss-micro-forward-only branch from e9ca87c to 8a5d864 on June 28, 2026 06:09

ISEEKYAN force-pushed the mlite-dapo-loss-micro-forward-only branch from 8a5d864 to 5d2f0c9 on June 28, 2026 06:15

ISEEKYAN added 3 commits June 28, 2026 05:18

Fix FSDP2 master-to-model sync on state load

9b0ff5c

Use memory-efficient dist optimizer checkpoint gather

cb46b73

Select memory-efficient checkpoint gather defensively

6241aeb

ISEEKYAN wants to merge 4 commits into main from mlite-dapo-loss-micro-forward-only
