Describe the bug
GPT3 weekly functional tests are failing on both GB200 and H100 in the ci-weekly scheduled run. Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.
Affected tests:
gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap
- GB200 and H100 both fail with lm loss and num-zeros comparison mismatches.
gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap
- GB200 fails with iteration-time and lm loss comparison mismatches.
- H100 fails with lm loss comparison mismatch.
Failing run
| Field | Value |
| --- | --- |
| Scheduled pipeline | 55286509 |
| Commit | d1410e15 - Add MIMO runtime setup: per-role RNG seeding and DDP wrapping (#5285) |
| GB200 child pipeline | 55288455 |
| H100 child pipeline | 55287420 |
Failing jobs
| Platform | Job ID | Test case | Current error signature | Started failing |
| --- | --- | --- | --- | --- |
| GB200 | 344197344 | gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap | lm loss + num-zeros exact/approx mismatches | 2026-04-17 after a 2026-04-10 pass |
| GB200 | 344197343 | gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap | iteration-time + lm loss mismatches | 2026-04-17 after a 2026-04-10 pass |
| H100 | 344191891 | gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap | lm loss + num-zeros exact/approx mismatches | 2026-03-27 after a 2026-03-20 pass |
| H100 | 344191890 | gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap | lm loss mismatch | 2026-04-17 after a 2026-04-10 pass |
Error
TP4/CP2 on GB200 and H100:
Exact comparison of lm loss: FAILED
Approximate comparison of lm loss: FAILED
Exact comparison of num-zeros: FAILED
Approximate comparison of num-zeros: FAILED
AssertionError: The following metrics failed: lm loss, lm loss, num-zeros, num-zeros
ERROR:__main__:Non-determinism, let's try another node.
TP2/PP2 on GB200:
Approximate comparison of iteration-time: FAILED
Approximate comparison of lm loss: FAILED
AssertionError: The following metrics failed: iteration-time, lm loss
ERROR:__main__:Non-determinism, let's try another node.
TP2/PP2 on H100:
Approximate comparison of lm loss: FAILED
AssertionError: The following metrics failed: lm loss
ERROR:__main__:Non-determinism, let's try another node.
The retries did not recover the failures, so these look like persistent metric/golden mismatches rather than a single-node transient.
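For triage across several job logs, the comparison lines quoted above can be collapsed into a per-metric summary. The following is a minimal sketch, not part of the test harness: the helper name `summarize_failures` is hypothetical, and it assumes the log lines keep exactly the `<Exact|Approximate> comparison of <metric>: <STATUS>` shape shown in this issue.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed line shape, taken from the excerpts in this issue, e.g.
# "Approximate comparison of lm loss: FAILED".
COMPARISON_RE = re.compile(r"^(Exact|Approximate) comparison of (.+?): (\w+)$")


def summarize_failures(log_text: str) -> Dict[str, List[str]]:
    """Map each failed metric to the comparison kinds that failed for it.

    Hypothetical helper for reading logs; lines that do not match the
    assumed shape (tracebacks, retry messages) are ignored.
    """
    failures: Dict[str, List[str]] = defaultdict(list)
    for line in log_text.splitlines():
        match = COMPARISON_RE.match(line.strip())
        if match and match.group(3) == "FAILED":
            kind, metric, _status = match.groups()
            failures[metric].append(kind.lower())
    return dict(failures)


# Log excerpt copied from the TP4/CP2 failure in this issue.
TP4_CP2_LOG = """\
Exact comparison of lm loss: FAILED
Approximate comparison of lm loss: FAILED
Exact comparison of num-zeros: FAILED
Approximate comparison of num-zeros: FAILED
AssertionError: The following metrics failed: lm loss, lm loss, num-zeros, num-zeros
ERROR:__main__:Non-determinism, let's try another node.
"""

if __name__ == "__main__":
    for metric, kinds in summarize_failures(TP4_CP2_LOG).items():
        print(f"{metric}: {', '.join(kinds)}")
```

Grouping by metric makes it easier to see that TP4/CP2 fails both comparison kinds for lm loss and num-zeros, while the TP2/PP2 logs only report approximate-comparison failures.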
Steps/Code to reproduce bug
Run the corresponding weekly JET recipe entries:
- tests/test_utils/recipes/gb200/gpt.yaml
  - gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap
  - gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap
- tests/test_utils/recipes/h100/gpt.yaml
  - gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap
  - gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap
The pytest entry point reported by the workload logs is:
pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
Additional context
- Only the GPT3 failures from the scheduled run are included here.
- GB200 also had non-GPT failures, and H100 also had a Mixtral failure, but those are intentionally excluded from this issue.
- Duplicate searches were run for the GPT3 weekly TP4/TP2 metric mismatch signatures and did not find an open matching issue.
- CI system links are intentionally omitted per request; pipeline and job IDs are included for internal lookup.