Weekly CI failure: GPT3 weekly GB200/H100 metric mismatches #5455

Description

@balasaajay

Describe the bug

GPT3 weekly functional tests are failing on both GB200 and H100 in the ci-weekly scheduled run. Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

Affected tests:

  • gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap
    • GB200 and H100 both fail with lm loss and num-zeros comparison mismatches.
  • gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap
    • GB200 fails with iteration-time and lm loss comparison mismatches.
    • H100 fails with lm loss comparison mismatch.

Failing run

| Field | Value |
| --- | --- |
| Scheduled pipeline | 55286509 |
| Commit | d1410e15 - Add MIMO runtime setup: per-role RNG seeding and DDP wrapping (#5285) |
| GB200 child pipeline | 55288455 |
| H100 child pipeline | 55287420 |

Failing jobs

| Platform | Job ID | Test case | Current error signature | Started failing |
| --- | --- | --- | --- | --- |
| GB200 | 344197344 | gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap | lm loss + num-zeros exact/approx mismatches | 2026-04-17 after a 2026-04-10 pass |
| GB200 | 344197343 | gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap | iteration-time + lm loss mismatches | 2026-04-17 after a 2026-04-10 pass |
| H100 | 344191891 | gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap | lm loss + num-zeros exact/approx mismatches | 2026-03-27 after a 2026-03-20 pass |
| H100 | 344191890 | gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap | lm loss mismatch | 2026-04-17 after a 2026-04-10 pass |

Error

TP4/CP2 on GB200 and H100:

Exact comparison of lm loss: FAILED
Approximate comparison of lm loss: FAILED
Exact comparison of num-zeros: FAILED
Approximate comparison of num-zeros: FAILED
AssertionError: The following metrics failed: lm loss, lm loss, num-zeros, num-zeros
ERROR:__main__:Non-determinism, let's try another node.

TP2/PP2 on GB200:

Approximate comparison of iteration-time: FAILED
Approximate comparison of lm loss: FAILED
AssertionError: The following metrics failed: iteration-time, lm loss
ERROR:__main__:Non-determinism, let's try another node.

TP2/PP2 on H100:

Approximate comparison of lm loss: FAILED
AssertionError: The following metrics failed: lm loss
ERROR:__main__:Non-determinism, let's try another node.

The retries on other nodes did not clear the failures, so these look like persistent mismatches against the golden values rather than a single-node transient.
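To make the difference between the "Exact" and "Approximate" lines in the logs concrete, here is a minimal sketch of what such a comparison against golden values could look like. It is not the code used by the functional test harness: the function name `compare_metric`, the 5% relative tolerance, and the sample values are all assumptions chosen for illustration.

```python
import math


def compare_metric(name, actual, golden, rel_tol=0.05):
    """Return the list of failed checks for one metric.

    `actual` and `golden` are equal-length lists of per-step values.
    `rel_tol` is a hypothetical tolerance, not the harness's real setting.
    """
    if len(actual) != len(golden):
        return [f"{name}: length mismatch"]
    failures = []
    # Exact check: every step must equal the golden value.
    if any(a != g for a, g in zip(actual, golden)):
        failures.append(f"Exact comparison of {name}: FAILED")
    # Approximate check: every step must be within a relative tolerance.
    if not all(math.isclose(a, g, rel_tol=rel_tol) for a, g in zip(actual, golden)):
        failures.append(f"Approximate comparison of {name}: FAILED")
    return failures


# Made-up values, purely illustrative (not taken from the failing jobs).
golden_loss = [10.90, 10.85, 10.70]
drifted_loss = [10.90, 10.851, 10.70]  # small drift: only the exact check fails
diverged_loss = [10.90, 12.40, 13.10]  # large drift: both checks fail

print(compare_metric("lm loss", drifted_loss, golden_loss))
print(compare_metric("lm loss", diverged_loss, golden_loss))
```

Under this reading, a job that fails both checks for the same metric, as TP4/CP2 does for lm loss and num-zeros, has moved outside the tolerance band rather than differing only in the last few digits.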

Steps/Code to reproduce bug

Run the corresponding weekly JET recipe entries:

tests/test_utils/recipes/gb200/gpt.yaml
  gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap
  gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap

tests/test_utils/recipes/h100/gpt.yaml
  gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap
  gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap

The pytest entry point reported by the workload logs is:

pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
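Before attempting a reproduction, it can help to confirm that the local checkout still contains the four recipe entries listed above. The sketch below does a plain substring search of each recipe file for the test case names. It makes no assumption about the YAML schema, and the helper name `find_missing_cases` is hypothetical, not part of the repository.

```python
from pathlib import Path

# Recipe files and test case names as listed in this issue.
CASES = [
    "gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap",
    "gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap",
]
RECIPES = {
    "tests/test_utils/recipes/gb200/gpt.yaml": CASES,
    "tests/test_utils/recipes/h100/gpt.yaml": CASES,
}


def find_missing_cases(repo_root, recipes=RECIPES):
    """Return (recipe_path, case) pairs whose name is absent from the file."""
    missing = []
    for rel_path, cases in recipes.items():
        path = Path(repo_root) / rel_path
        # A missing file is treated as empty, so all its cases are reported.
        text = path.read_text() if path.is_file() else ""
        for case in cases:
            if case not in text:
                missing.append((rel_path, case))
    return missing


if __name__ == "__main__":
    # Run from the repository root.
    for rel_path, case in find_missing_cases("."):
        print(f"not found in {rel_path}: {case}")
```

A substring match only shows that the name appears somewhere in the file. It does not show that the entry is enabled for the weekly scope.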

Additional context

  • Only the GPT3 failures from the scheduled run are included here.
  • GB200 also had non-GPT failures, and H100 also had a Mixtral failure, but those are intentionally excluded from this issue.
  • A search for duplicates of the GPT3 weekly TP4/TP2 metric mismatch signatures found no matching open issue.
  • CI system links are intentionally omitted per request; pipeline and job IDs are included for internal lookup.
