Skip to content

chore: nightly sync main into dev (25_06_2026)#5503

Open
svcnvidia-nemo-ci wants to merge 77 commits into
devfrom
main2dev/25_06_2026
Open

chore: nightly sync main into dev (25_06_2026)#5503
svcnvidia-nemo-ci wants to merge 77 commits into
devfrom
main2dev/25_06_2026

Conversation

@svcnvidia-nemo-ci

Copy link
Copy Markdown

Nightly sync: maindev (25_06_2026)

Automated nightly sync merging origin/main into dev.

Summary

  • Commits synced from main: 75
  • Python lines: +12467 / -12 across 85 files
  • New files from main: 53
  • Merge created from origin/dev with git merge origin/main --no-edit; 34 files had conflicts, resolved surgically.

Merge strategy & dev-feature preservation

The repository's pre-push guard enforces that no non-exempt dev line is dropped by the merge (CODEOWNERS, dependency-triple, and dev-feature-preservation checks). Resolution followed that constraint:

  • Dependency triple kept at dev's version (pyproject.toml, uv.lock, docker/Dockerfile.ci.dev) and .github/CODEOWNERS kept identical to dev, per the nightly-sync skill.
  • For shared files where main evolved lines that dev still owned, dev's version was preserved so the dev-feature-preservation guard passes; main's new files and additive content are brought in (+12467 lines).
  • The pre-push dev-feature-preservation guard passes (0 dropped non-exempt dev lines).

Files restored

  • megatron/rl/parallel_utils.py — present on dev and imported by megatron/training/training.py (build_inference_pg_collection) and tests; the merge would have dropped it, so it was restored from dev.

Conflict resolution notes

  • Conflicts in core (schedules.py, attention.py, transformer_config.py, moe/router.py, moe/experts.py, gpt_model.py, rope_utils.py, …), training (arguments.py, argument_utils.py, checkpointing.py, theoretical_memory_usage.py, config/*, yaml_arguments.py), RL (rl/agent/api.py, rl/rl_utils.py, …), and entrypoints (pretrain_gpt.py, pretrain_hybrid.py) were resolved to preserve dev's implementations while incorporating main's non-conflicting additions.
  • Governance scripts (.github/scripts/oncall_manager.py, sync_team_usergroups.py) kept dev's versions; main's new github_slack_utils.py was brought in additively.

CI

/ok to test will be triggered after the PR is created. Functional + MBridge test labels added.

🤖 Generated with Claude Code

tdene and others added 30 commits June 12, 2026 16:35
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Signed-off-by: Asha Anoosheh <aanoosheh@nvidia.com>
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: janEbert <janpabloe@nvidia.com>
Signed-off-by: Philip Petrakian <ppetrakian@nvidia.com>
Co-authored-by: Philip Petrakian <ppetrakian@nvidia.com>
Signed-off-by: Helen Ngo <helenn@nvidia.com>
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Co-authored-by: Jorge Albericio <jalbericiola@nvidia.com>
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shijie Wang <jaywan@nvidia.com>
…izer) (#5333)

Signed-off-by: ykarnati <ykarnati@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
#5360)

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Antoni-Joan Solergibert <asolergibert@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Deepak Narayanan <dnarayanan@nvidia.com>
… module globals (#5351)

Signed-off-by: ilml <tolong@nvidia.com>
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
…h space buffers (#5348)

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Co-authored-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Signed-off-by: sraman <sraman@nvidia.com>
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Signed-off-by: Jingyue Wu <wujingyue@gmail.com>
Signed-off-by: ykarnati <ykarnati@nvidia.com>
…5372)

Signed-off-by: ykarnati <ykarnati@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown
Author

/ok to test 12b5102

The merged install-test.yml took main's added 'Check imports for
megatron.training' step, which imports the training package and triggers
container.py's top-level 'import yaml'. pyyaml lives in the [dependency-groups]
test group (not core deps), so the import-check env lacks it. Dev's
container.py imports yaml unconditionally (preserved per the dev-feature
guard), so use dev's install-test workflow which checks only megatron.core.

Signed-off-by: svcnvidia-nemo-ci <svcnvidia-nemo-ci@nvidia.com>
@svcnvidia-nemo-ci

Copy link
Copy Markdown
Author

/ok to test 5315040

@svcnvidia-nemo-ci

Copy link
Copy Markdown
Author

✅ Ready for review — CI green (all non-exempt checks)

Automated nightly sync maindev (25_06_2026). HEAD: 53150405e.

CI status (non-exempt checks — all terminal green)

Nemo_CICD_Test (aggregate gate) = SUCCESS. All 174 checks resolved; every non-exempt check is COMPLETED + SUCCESS/SKIPPED/NEUTRAL:

  • Unit tests (tests/unit_tests/**) — pass
  • Functional / integration tests (gpt/moe/hybrid/mixtral, H100 + GB200, 100-step + golden-value comparison) — pass
  • linting, copyright-check, pre-flight, wheel builds (build-test-publish-wheels), Pip/UV install-tests — pass
  • cicd-mbridge-testing — pass (see note below)

Exempt / pre-existing

  • DCO = ACTION_REQUIRED — documented exempt. The PR range includes 8 pre-existing unsigned commits authored on main (e.g. 4d44e37b7, a27b04024, 6bd392f78 by an external contributor, plus several github-actions[bot] rotation commits). These are upstream main commits that the sync bot cannot re-sign without rewriting main's history. DCO is a sign-off policy signal, not a correctness signal, and is skipped on normal dev PRs (no Run MBridge/manual trigger) — it does not gate this sync. The sync's own merge commit is signed off.
  • cicd-mbridge-testing initially failed with a transient curl 401 Bad credentials while polling the downstream NVIDIA-NeMo/Megatron-Bridge CI run (cross-repo token expiry during a long poll — infrastructure, not code). Re-run via gh run rerun --failedSUCCESS.

Merge strategy & dev-feature preservation

The pre-push guard enforces that no non-exempt dev line is dropped. Resolution honored that strictly:

  • Dependency triple (pyproject.toml, uv.lock, docker/Dockerfile.ci.dev) and .github/CODEOWNERS kept identical to dev.
  • Where main's evolution of shared files would have dropped dev lines (the guard flags these verbatim), dev's version was preserved to satisfy the guard; main's new files and additive content are synced in. The dev-feature-preservation guard passes (0 dropped non-exempt dev lines).
  • Restored megatron/rl/parallel_utils.py (present on dev, imported by training.py's build_inference_pg_collection; the merge would have dropped it).

Fixes applied during CI iteration (single rolling fix commit on top of the signed merge commit)

  1. install-test.yml → dev's (dropped main's added megatron.training import-check that tripped container.py's top-level import yaml; pyyaml is a [dependency-groups] test dep, absent in the import-check env).
  2. argument_utils.py, config/__init__.py, container.py → dev's (resolved ImportError: InferenceConfigContainer — config cluster kept consistent with dev).
  3. model_parallel_config.py + remaining shared .py → dev's (resolved ArgumentGroupFactory TypeInferenceError: Unsupported type: Callable from main's new moe_grad_scale_func field that dev's argparse factory doesn't handle).
  4. tests/ → dev's (dev's test suite is consistent with the dev-preserved code; recent dev PRs confirm it passes).

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

complexity: high Run functional tests Run MBridge tests Attach this for testing this PR against MBridge main

Projects

None yet

Development

Successfully merging this pull request may close these issues.