Weekly CI failure: GB200 Nemotron3 Super mem-max-allocated mismatch #5457

Description

@balasaajay

Describe the bug

The GB200 weekly release test nemotron3_super_release_gb200_sm is failing on the latest run because the mem-max-allocated-bytes metric is outside the 5% approximate comparison tolerance.

Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

Failing run

| Field | Value |
| --- | --- |
| Parent scheduled pipeline | 55286509 |
| GB200 child pipeline | 55288455 |
| Commit | d1410e15 - Add MIMO runtime setup: per-role RNG seeding and DDP wrapping (#5285) |
| Latest failing job | 344197346 |
| Job name | nemotron3_super_release_gb200_sm |
| Test case | tests/functional_tests/test_cases/nemotron/nemotron3_super_release_gb200_sm |

History

| Date | Pipeline / job | Status | Note |
| --- | --- | --- | --- |
| 2026-05-15 | first scheduled run | failed | Earliest scheduled run observed for this GB200-sm case; run logs use weekly-2026-05-16 |
| 2026-06-20 | 55288455 / 344197346 | failed | Latest failure; current signature is mem-max-allocated-bytes only |

Earlier sampled runs for this test had related but different signatures, including a golden/actual shape mismatch and lm loss failures. This issue is scoped to the latest mem-max-allocated-bytes mismatch.

Error

tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline

APPROXIMATE test for metric iteration-time: PASSED
APPROXIMATE test for metric lm loss: PASSED
APPROXIMATE test for metric mem-allocated-bytes: PASSED
Approximate comparison of mem-max-allocated-bytes: FAILED

FAILED tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline - AssertionError: The following metrics failed: mem-max-allocated-bytes
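For readers unfamiliar with this kind of check, the following is a minimal sketch of an approximate comparison with a relative tolerance and a budget of permitted out-of-tolerance points. It is not the implementation in test_pretraining_regular_pipeline.py; the names `failing_points`, `approximate_check`, and `allowed_failures` are hypothetical, the data is synthetic, and measuring deviation relative to the golden value is an assumption.

```python
from typing import List, Sequence, Tuple


def failing_points(
    actual: Sequence[float], golden: Sequence[float], rel_tol: float = 0.05
) -> List[int]:
    """Return indices where actual deviates from golden by more than rel_tol.

    Deviation is measured relative to the golden value (an assumption).
    """
    if len(actual) != len(golden):
        raise ValueError("actual and golden must have the same number of points")
    return [
        i
        for i, (a, g) in enumerate(zip(actual, golden))
        if abs(a - g) > rel_tol * abs(g)
    ]


def approximate_check(
    actual: Sequence[float],
    golden: Sequence[float],
    rel_tol: float = 0.05,
    allowed_failures: int = 0,
) -> Tuple[bool, List[int]]:
    """Pass when the number of out-of-tolerance points is within the budget."""
    bad = failing_points(actual, golden, rel_tol)
    return len(bad) <= allowed_failures, bad


# Synthetic example: 3 of 20 points are 12% low, but only 2 may fail.
golden = [100.0] * 20
actual = [100.0] * 17 + [88.0] * 3
ok, bad = approximate_check(actual, golden, rel_tol=0.05, allowed_failures=2)
print(ok, bad)  # False [17, 18, 19]
```

Under this reading, a metric fails only when the count of points outside the tolerance exceeds the budget, so a handful of outliers alone would not trip the check.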

Deviation summary for the latest failing job:

| Metric | Tolerance | Failed points | Max deviation | Median failing deviation | Example actual vs golden |
| --- | --- | --- | --- | --- | --- |
| mem-max-allocated-bytes | 5% | 11 / 954, allowed 9 | 11.73% | 11.65% | 110.34B vs 125.01B, lower by 13.66 GiB |
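The example point in the summary can be checked by hand. The sketch below assumes "B" means 10^9 bytes and that deviation is taken relative to the golden value; both assumptions are consistent with the reported 13.66 GiB gap and with a deviation of roughly 11.7%.

```python
GIB = 2**30  # bytes per GiB

# Values from the deviation summary, assuming "B" = 10^9 bytes.
golden_bytes = 125.01e9
actual_bytes = 110.34e9

diff_bytes = golden_bytes - actual_bytes  # how far actual sits below golden
rel_dev = diff_bytes / golden_bytes       # deviation relative to golden (assumption)
diff_gib = diff_bytes / GIB

print(f"{diff_gib:.2f} GiB lower, {rel_dev:.1%} below golden")
# 13.66 GiB lower, 11.7% below golden
```

Because the inputs are rounded to two decimals, the recomputed percentage matches the reported figures only approximately. Either way it is more than twice the 5% tolerance.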

All other checked metrics passed in the latest failure.

Steps/Code to reproduce bug

Run the GB200 release functional test for:

pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline

using:

TRAINING_PARAMS_PATH=./tests/functional_tests/test_cases/nemotron/nemotron3_super_release_gb200_sm/model_config.yaml
GOLDEN_VALUES_PATH=./tests/functional_tests/test_cases/nemotron/nemotron3_super_release_gb200_sm/golden_values_dev_dgx_gb200.json
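As a convenience, the command and settings above can be assembled in a small launcher. This is only a sketch: it assumes the test reads `TRAINING_PARAMS_PATH` and `GOLDEN_VALUES_PATH` from environment variables and that it is run from the repository root on a machine where the test's inputs are available. The helper names `build_repro` and `run_repro` are hypothetical.

```python
import os
import subprocess
import sys

CASE_DIR = "./tests/functional_tests/test_cases/nemotron/nemotron3_super_release_gb200_sm"
TEST_ID = (
    "tests/functional_tests/python_test_utils/"
    "test_pretraining_regular_pipeline.py::test_regular_pipeline"
)


def build_repro(base_env=None):
    """Return (command, environment) for the reproduction run."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the test picks these two paths up from the environment.
    env["TRAINING_PARAMS_PATH"] = f"{CASE_DIR}/model_config.yaml"
    env["GOLDEN_VALUES_PATH"] = f"{CASE_DIR}/golden_values_dev_dgx_gb200.json"
    cmd = [sys.executable, "-m", "pytest", TEST_ID]
    return cmd, env


def run_repro():
    """Run the test and return pytest's exit code (call this explicitly)."""
    cmd, env = build_repro()
    return subprocess.run(cmd, env=env).returncode


# Show what would be executed without actually launching pytest.
cmd, env = build_repro()
print(" ".join(cmd))
print("TRAINING_PARAMS_PATH=" + env["TRAINING_PARAMS_PATH"])
print("GOLDEN_VALUES_PATH=" + env["GOLDEN_VALUES_PATH"])
```

Calling `run_repro()` launches the test itself. It is left uncalled here so that the snippet is safe to execute anywhere.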

Additional context

  • Duplicate searches for nemotron3_super_release_gb200_sm and Nemotron GB200 memory mismatch mem-max-allocated did not find an open matching issue.
  • CI system links are intentionally omitted; pipeline and job IDs are included for internal lookup.
