The H100 weekly functional test mixtral_8x7b_tp1pp4ep8vpp8_release is intermittently failing in ci-weekly because the iteration-time metric is outside the approximate comparison tolerance.
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.
Failing run
Scheduled pipeline: 55286509
Commit: d1410e15 - Add MIMO runtime setup: per-role RNG seeding and DDP wrapping (#5285)
tests/functional_tests/python_test_utils/test_pretraining_regular_pipeline.py::test_regular_pipeline
Actual values: 4.2905750000000005
Golden values: 3.8215149999999998
Approximate comparison of iteration-time: FAILED
APPROXIMATE test for metric lm loss: PASSED
APPROXIMATE test for metric mem-allocated-bytes: PASSED
APPROXIMATE test for metric mem-max-allocated-bytes: PASSED
AssertionError: The following metrics failed: iteration-time
The latest failing job also contains another extracted artifact with actual 4.0858 versus golden 3.821515, so the symptom is variable H100 timing drift above the 5% iteration-time tolerance rather than a deterministic loss or memory mismatch.
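To make the size of the drift concrete, here is a minimal sketch of a relative-tolerance check applied to the two actual values quoted above. It assumes the 5% tolerance is applied as a relative deviation from the golden value; the helper names `relative_deviation` and `within_tolerance` are hypothetical and are not taken from the test utilities, which may compute the comparison differently.

```python
# Hypothetical helpers illustrating a 5% relative-tolerance check.
# The real comparison in the functional test utilities may differ.

GOLDEN_ITERATION_TIME = 3.8215149999999998  # golden value from the failing run


def relative_deviation(actual: float, golden: float) -> float:
    """Return the absolute deviation of `actual` from `golden`, relative to `golden`."""
    return abs(actual - golden) / abs(golden)


def within_tolerance(actual: float, golden: float, rel_tol: float = 0.05) -> bool:
    """Return True when `actual` is within `rel_tol` (assumed 5%) of `golden`."""
    return relative_deviation(actual, golden) <= rel_tol


if __name__ == "__main__":
    # The two actual iteration-time values reported in this issue.
    for actual in (4.2905750000000005, 4.0858):
        deviation = relative_deviation(actual, GOLDEN_ITERATION_TIME)
        status = "PASSED" if within_tolerance(actual, GOLDEN_ITERATION_TIME) else "FAILED"
        print(f"actual={actual:.6f} deviation={deviation:.2%} -> {status}")
```

Under that assumption, the two reported values sit roughly 12.3% and 6.9% above golden, so both exceed the 5% tolerance, and the gap between them is consistent with run-to-run timing variability rather than a fixed offset.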
Duplicate searches for mixtral_8x7b_tp1pp4ep8vpp8 iteration-time and Mixtral H100 iteration-time approximate mismatch did not find an open Mixtral-specific issue.
CI system links are intentionally omitted; pipeline and job IDs are included for internal lookup.
Describe the bug
The H100 weekly functional test mixtral_8x7b_tp1pp4ep8vpp8_release is intermittently failing in ci-weekly because the iteration-time metric is outside the approximate comparison tolerance.
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.
Failing run
Scheduled pipeline: 55286509
Commit: d1410e15 - Add MIMO runtime setup: per-role RNG seeding and DDP wrapping (#5285)
55287420
344191894
mixtral_8x7b_tp1pp4ep8vpp8_release
tests/functional_tests/test_cases/mixtral/mixtral_8x7b_tp1pp4ep8vpp8_release
History
52639682 326451132
53136964 329741403
53873633 334879220 iteration-time mismatch
54641227 340436897
55287420 344191894
Error
The latest failing job also contains another extracted artifact with actual 4.0858 versus golden 3.821515, so the symptom is variable H100 timing drift above the 5% iteration-time tolerance rather than a deterministic loss or memory mismatch.
Steps/Code to reproduce bug
Run the H100 release functional test for:
using:
Additional context
Duplicate searches for mixtral_8x7b_tp1pp4ep8vpp8 iteration-time and Mixtral H100 iteration-time approximate mismatch did not find an open Mixtral-specific issue.