Weekly CI failure: GPT3 weekly GB200/H100 metric mismatches

**Describe the bug**

GPT3 weekly functional tests are failing on both GB200 and H100 in the `ci-weekly` scheduled run. cc @NVIDIA/mcore-oncall

Affected tests:

- `gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap`
  - GB200 and H100 both fail with `lm loss` and `num-zeros` comparison mismatches.
- `gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap`
  - GB200 fails with `iteration-time` and `lm loss` comparison mismatches.
  - H100 fails with an `lm loss` comparison mismatch.

**Failing run**

| Field | Value |
|-------|-------|
| Scheduled pipeline | `55286509` |
| Commit | `d1410e15` - Add MIMO runtime setup: per-role RNG seeding and DDP wrapping (#5285) |
| GB200 child pipeline | `55288455` |
| H100 child pipeline | `55287420` |

**Failing jobs**

| Platform | Job ID | Test case | Current error signature | Started failing |
|----------|--------|-----------|-------------------------|----------------|
| GB200 | `344197344` | `gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap` | `lm loss` + `num-zeros` exact/approx mismatches | 2026-04-17 after a 2026-04-10 pass |
| GB200 | `344197343` | `gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap` | `iteration-time` + `lm loss` mismatches | 2026-04-17 after a 2026-04-10 pass |
| H100 | `344191891` | `gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap` | `lm loss` + `num-zeros` exact/approx mismatches | 2026-03-27 after a 2026-03-20 pass |
| H100 | `344191890` | `gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap` | `lm loss` mismatch | 2026-04-17 after a 2026-04-10 pass |

**Error**

TP4/CP2 on GB200 and H100:

```text
Exact comparison of lm loss: FAILED
Approximate comparison of lm loss: FAILED
Exact comparison of num-zeros: FAILED
Approximate comparison of num-zeros: FAILED
AssertionError: The following metrics failed: lm loss, lm loss, num-zeros, num-zeros
ERROR:__main__:Non-determinism, let's try another node.
```

TP2/PP2 on GB200:

```text
Approximate comparison of iteration-time: FAILED
Approximate comparison of lm loss: FAILED
AssertionError: The following metrics failed: iteration-time, lm loss
ERROR:__main__:Non-determinism, let's try another node.
```

TP2/PP2 on H100:

```text
Approximate comparison of lm loss: FAILED
AssertionError: The following metrics failed: lm loss
ERROR:__main__:Non-determinism, let's try another node.
```

Retrying on another node did not recover any of the failures, so these look like persistent mismatches against the golden values rather than a single-node transient.
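For anyone triaging, here is a rough sketch of how an exact check and an approximate check can disagree on the same metric series. It is not the actual test harness code: `compare_series`, the 5% relative tolerance, and all sample values are assumptions made up for illustration, not numbers taken from the failing jobs.

```python
import math


def compare_series(golden, actual, rel_tol=0.05):
    """Return (exact_ok, approx_ok) for two per-step metric series.

    Hypothetical helper: the real harness may use a different tolerance
    and a different comparison rule.
    """
    if len(golden) != len(actual):
        return False, False
    exact_ok = all(g == a for g, a in zip(golden, actual))
    approx_ok = all(
        math.isclose(g, a, rel_tol=rel_tol) for g, a in zip(golden, actual)
    )
    return exact_ok, approx_ok


# Placeholder values only, chosen to show the three possible outcomes.
golden_loss = [10.5, 9.8, 9.1]
small_drift = [10.5, 9.9, 9.1]   # within 5% of golden, but not identical
large_drift = [10.5, 12.0, 9.1]  # well outside 5% at the second step

print(compare_series(golden_loss, golden_loss))  # exact and approx pass
print(compare_series(golden_loss, small_drift))  # only approx passes
print(compare_series(golden_loss, large_drift))  # both fail
```

Under these assumptions, a job that reports both the exact and the approximate comparison as FAILED (as TP4/CP2 does for `lm loss` and `num-zeros`) has drifted beyond the tolerance, not merely lost bitwise reproducibility.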

**Steps/Code to reproduce bug**

Run the corresponding weekly JET recipe entries:

```text
tests/test_utils/recipes/gb200/gpt.yaml
  gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap
  gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap

tests/test_utils/recipes/h100/gpt.yaml
  gpt3_weekly_mcore_tp4_cp2_current_scaling_native_fp8_tp_sp_cp_tp_overlap
  gpt3_weekly_mcore_tp2_pp2_current_scaling_native_fp8_tp_pp_sp_tp_overlap
```

The pytest entry point reported by the workload logs is:

```bash
pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
```

**Additional context**

- Only the GPT3 failures from the scheduled run are included here.
- GB200 also had non-GPT failures, and H100 also had a Mixtral failure, but those are intentionally excluded from this issue.
- A search for existing issues matching the GPT3 weekly TP4/TP2 metric mismatch signatures found no open duplicate.
- CI system links are intentionally omitted per request; pipeline and job IDs are included for internal lookup.


Weekly CI failure: GPT3 weekly GB200/H100 metric mismatches #5455
