Describe the bug
The GB200 weekly release test nemotron3_super_release_gb200_sm is failing on the latest run because the mem-max-allocated-bytes metric is outside the 5% approximate comparison tolerance.
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.
Failing run
| Field | Value |
| --- | --- |
| Parent scheduled pipeline | 55286509 |
| GB200 child pipeline | 55288455 |
| Commit | d1410e15 - Add MIMO runtime setup: per-role RNG seeding and DDP wrapping (#5285) |
| Latest failing job | 344197346 |
| Job name | nemotron3_super_release_gb200_sm |
| Test case | tests/functional_tests/test_cases/nemotron/nemotron3_super_release_gb200_sm |
History
| Date | Pipeline / job | Status | Note |
| --- | --- | --- | --- |
| 2026-05-15 | first scheduled run | failed | Earliest scheduled run observed for this GB200-sm case; run logs use weekly-2026-05-16 |
| 2026-06-20 | 55288455 / 344197346 | failed | Latest failure, current signature is mem-max-allocated-bytes only |
Earlier sampled runs for this test had related but different signatures, including a golden/actual shape mismatch and lm loss failures. This issue is scoped to the latest mem-max-allocated-bytes mismatch.
Error
tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
APPROXIMATE test for metric iteration-time: PASSED
APPROXIMATE test for metric lm loss: PASSED
APPROXIMATE test for metric mem-allocated-bytes: PASSED
Approximate comparison of mem-max-allocated-bytes: FAILED
FAILED tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline - AssertionError: The following metrics failed: mem-max-allocated-bytes
Deviation summary for the latest failing job:
| Metric | Tolerance | Failed points | Max deviation | Median failing deviation | Example actual vs golden |
| --- | --- | --- | --- | --- | --- |
| mem-max-allocated-bytes | 5% | 11 / 954, allowed 9 | 11.73% | 11.65% | 110.34B vs 125.01B, lower by 13.66 GiB |
All other checked metrics passed in the latest failure.
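
The figures in the table above can be sanity-checked with a short sketch. This is not the test harness's comparison code; it assumes that deviation is measured relative to the golden value, that "B" in the example column means billions of bytes, and that a metric fails once the number of out-of-tolerance points exceeds the allowed count. The names `relative_deviation` and `summarize` are hypothetical.

```python
def relative_deviation(actual: float, golden: float) -> float:
    """Absolute difference between actual and golden, as a fraction of golden."""
    return abs(actual - golden) / abs(golden)


def summarize(actual, golden, tolerance=0.05, allowed_failures=9):
    """Count points outside the tolerance and compare against the allowed budget.

    Assumption: the metric passes when at most `allowed_failures` points
    deviate by more than `tolerance`. The real harness may differ.
    """
    deviations = [relative_deviation(a, g) for a, g in zip(actual, golden)]
    failing = [d for d in deviations if d > tolerance]
    return {
        "failed_points": len(failing),
        "max_deviation": max(failing, default=0.0),
        "passed": len(failing) <= allowed_failures,
    }


# Example point from the deviation table (rounded values, taken as bytes).
actual_bytes = 110.34e9
golden_bytes = 125.01e9

deviation = relative_deviation(actual_bytes, golden_bytes)
gap_gib = (golden_bytes - actual_bytes) / 2**30  # 1 GiB = 2**30 bytes

print(f"deviation: {deviation:.2%}")
print(f"gap: {gap_gib:.2f} GiB")
```

Under these assumptions the example point sits well outside the 5% tolerance, and 11 failing points out of 954 exceeds the allowed 9, which matches the reported failure.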
Steps/Code to reproduce bug
Run the GB200 release functional test for:
pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
using:
TRAINING_PARAMS_PATH=./tests/functional_tests/test_cases/nemotron/nemotron3_super_release_gb200_sm/model_config.yaml
GOLDEN_VALUES_PATH=./tests/functional_tests/test_cases/nemotron/nemotron3_super_release_gb200_sm/golden_values_dev_dgx_gb200.json
Additional context
- Duplicate searches for nemotron3_super_release_gb200_sm and Nemotron GB200 memory mismatch mem-max-allocated did not find an open matching issue.
- CI system links are intentionally omitted; pipeline and job IDs are included for internal lookup.