Weekly CI failure: H100 Mixtral 8x7B iteration-time mismatch #5456

Description

@balasaajay

Describe the bug

The H100 weekly functional test mixtral_8x7b_tp1pp4ep8vpp8_release is intermittently failing in ci-weekly because the iteration-time metric is outside the approximate comparison tolerance.

Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

Failing run

| Field | Value |
| --- | --- |
| Scheduled pipeline | 55286509 |
| Commit | d1410e15 - Add MIMO runtime setup: per-role RNG seeding and DDP wrapping (#5285) |
| H100 child pipeline | 55287420 |
| Latest failing job | 344191894 |
| Job name | mixtral_8x7b_tp1pp4ep8vpp8_release |
| Test case | tests/functional_tests/test_cases/mixtral/mixtral_8x7b_tp1pp4ep8vpp8_release |

History

| Date | H100 child pipeline | Job ID | Status | Note |
| --- | --- | --- | --- | --- |
| 2026-05-26 | 52639682 | 326451132 | success | Last known pass before the intermittent failures |
| 2026-05-29 | 53136964 | 329741403 | failed | First recent failure |
| 2026-06-05 | 53873633 | 334879220 | failed | Same iteration-time mismatch |
| 2026-06-12 | 54641227 | 340436897 | success | Recovered temporarily |
| 2026-06-19 | 55287420 | 344191894 | failed | Failed again with the same signature |

Error

tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline

Actual values: 4.2905750000000005
Golden values: 3.8215149999999998
Approximate comparison of iteration-time: FAILED
APPROXIMATE test for metric lm loss: PASSED
APPROXIMATE test for metric mem-allocated-bytes: PASSED
APPROXIMATE test for metric mem-max-allocated-bytes: PASSED

AssertionError: The following metrics failed: iteration-time

The latest failing job also contains a second extracted artifact with actual 4.0858 versus golden 3.821515. The two measurements are roughly 12.3% and 6.9% above golden, so the symptom is variable H100 timing drift beyond the 5% iteration-time tolerance rather than a deterministic loss or memory mismatch.
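To make the size of the drift concrete, the sketch below recomputes the relative deviation for both reported measurements. It assumes the approximate comparison is a symmetric relative tolerance of 5% around the golden value, as stated above; the helper names `relative_deviation` and `within_tolerance` are hypothetical and are not taken from the test harness, whose actual comparison logic is not shown in this issue.

```python
# Hypothetical helpers, not the harness's own comparison code.
# Assumption: the check is |actual - golden| / golden <= 0.05.

GOLDEN = 3.821515  # golden iteration-time reported in the failing job


def relative_deviation(actual: float, golden: float) -> float:
    """Signed fractional deviation of `actual` from `golden`."""
    return (actual - golden) / golden


def within_tolerance(actual: float, golden: float, rel_tol: float = 0.05) -> bool:
    """True if `actual` lies within +/- rel_tol of `golden`."""
    return abs(relative_deviation(actual, golden)) <= rel_tol


# The two actual values quoted in this issue.
for actual in (4.290575, 4.0858):
    dev = relative_deviation(actual, GOLDEN)
    verdict = "PASS" if within_tolerance(actual, GOLDEN) else "FAIL"
    print(f"actual={actual:.4f} deviation={dev:+.1%} -> {verdict}")
```

Under that assumption both measurements fail, and they differ from each other by several percentage points, which is consistent with run-to-run timing variance rather than a fixed regression.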

Steps/Code to reproduce bug

Run the H100 release functional test for:

pytest tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline

using:

TRAINING_PARAMS_PATH=./tests/functional_tests/test_cases/mixtral/mixtral_8x7b_tp1pp4ep8vpp8_release/model_config.yaml
GOLDEN_VALUES_PATH=./tests/functional_tests/test_cases/mixtral/mixtral_8x7b_tp1pp4ep8vpp8_release/golden_values_dev_dgx_h100.json
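The command and the two settings above can be combined into one invocation. The sketch below is a minimal example that assumes `TRAINING_PARAMS_PATH` and `GOLDEN_VALUES_PATH` are read by the test as environment variables; the issue does not say so explicitly, and any other setup the harness needs (for example, the output of a prior training run) is not covered. `build_repro` and `run_repro` are hypothetical helper names.

```python
import os
import subprocess

# Paths and test ID copied from this issue.
CASE_DIR = "./tests/functional_tests/test_cases/mixtral/mixtral_8x7b_tp1pp4ep8vpp8_release"
TEST_ID = (
    "tests/functional_tests/python_test_utils/"
    "test_pretraining_regular_pipeline.py::test_regular_pipeline"
)


def build_repro(base_env=None):
    """Return (command, environment) for the repro without running anything.

    Assumption: the test reads both paths from environment variables.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TRAINING_PARAMS_PATH"] = f"{CASE_DIR}/model_config.yaml"
    env["GOLDEN_VALUES_PATH"] = f"{CASE_DIR}/golden_values_dev_dgx_h100.json"
    return ["pytest", TEST_ID], env


def run_repro() -> int:
    """Run pytest from the repository root and return its exit code."""
    cmd, env = build_repro()
    return subprocess.run(cmd, env=env, check=False).returncode
```

Calling `run_repro()` from the repository root on an H100 node would launch the test. `build_repro()` alone is useful for checking the resolved paths first.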

Additional context

  • This is separate from #5455 (Weekly CI failure: GPT3 weekly GB200/H100 metric mismatches), which tracks the GPT3 weekly metric mismatches.
  • Duplicate searches for "mixtral_8x7b_tp1pp4ep8vpp8 iteration-time" and "Mixtral H100 iteration-time approximate mismatch" did not find an open Mixtral-specific issue.
  • CI system links are intentionally omitted; pipeline and job IDs are included for internal lookup.
