Guard against non-positive num_tokens_per_epoch in GPTDataset by Yifei-Zuo · Pull Request #5519 · NVIDIA/Megatron-LM

Yifei-Zuo · 2026-06-26T23:36:21Z

What

Fix a possible infinite loop in GPTDataset._get_num_epochs when a data split
is allocated zero tokens.

Why

_get_num_epochs accumulated num_tokens one epoch at a time in a while
loop until it reached the requested token count:

num_epochs = 1
num_tokens = num_tokens_per_epoch
...
while num_tokens < num_tokens_requested:
    num_epochs += 1
    num_tokens += num_tokens_per_epoch

When num_tokens_per_epoch == 0 — which happens when the train/valid/test
split assigns too small a fraction to a split for the dataset size (e.g.
--split 9998,0.002,0.002 on a small corpus) — num_tokens never grows and
the loop spins forever with no diagnostic. This was reported in #957 while
training Mixtral-MoE 8x7B.
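The stall can be reproduced in isolation. The following is a minimal sketch of the old loop as a standalone function, not the actual GPTDataset method; the `max_iterations` cap and the `None` return value are hypothetical additions that exist only so the demonstration terminates.

```python
def epochs_by_loop(num_tokens_per_epoch, num_tokens_requested, max_iterations=1_000):
    """Old accumulation loop with a hypothetical iteration cap for demonstration."""
    num_epochs = 1
    num_tokens = num_tokens_per_epoch
    iterations = 0
    while num_tokens < num_tokens_requested:
        if iterations >= max_iterations:
            # Cap reached: num_tokens is not making progress.
            return None
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
        iterations += 1
    return num_epochs


print(epochs_by_loop(100, 250))  # 3
print(epochs_by_loop(0, 250))    # None (the uncapped loop would never exit)
```

Without the cap, the second call corresponds to the hang described above: adding zero on every iteration leaves the loop condition permanently true.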

What changed

- Raise a RuntimeError when num_tokens_per_epoch <= 0. The message names the
  affected split and explains how to fix the configuration, instead of hanging
  silently.
- Replace the accumulation loop with an equivalent integer ceiling division
  (max(1, -(-num_tokens_requested // num_tokens_per_epoch))). Integer math is
  used rather than math.ceil(a / b) to avoid floating-point drift at exact
  boundaries for large token counts. Verified to return identical values to the
  old loop across 200k randomized cases.
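The two changes can be sketched together as follows. This is an illustration under stated assumptions, not the code in the PR: the standalone function names, the `split_name` parameter, and the error message wording are hypothetical. Only the guard condition and the ceiling-division expression come from the description above.

```python
import math
import random


def epochs_by_loop(num_tokens_per_epoch: int, num_tokens_requested: int) -> int:
    """The original accumulation loop, kept only as a reference for comparison."""
    num_epochs = 1
    num_tokens = num_tokens_per_epoch
    while num_tokens < num_tokens_requested:
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
    return num_epochs


def epochs_by_ceil_div(
    num_tokens_per_epoch: int, num_tokens_requested: int, split_name: str = "valid"
) -> int:
    """Guarded replacement; split_name and the message text are hypothetical."""
    if num_tokens_per_epoch <= 0:
        raise RuntimeError(
            f"The {split_name} split has {num_tokens_per_epoch} tokens per epoch; "
            "give it a larger share of --split or use a larger dataset."
        )
    # -(-a // b) computes ceil(a / b) using integer arithmetic only.
    return max(1, -(-num_tokens_requested // num_tokens_per_epoch))


# Randomized agreement check, restricted to positive per-epoch counts
# so that the reference loop always terminates.
rng = random.Random(0)
for _ in range(2_000):
    per_epoch = rng.randint(1, 1_000)
    requested = rng.randint(0, 100_000)
    assert epochs_by_loop(per_epoch, requested) == epochs_by_ceil_div(per_epoch, requested)

# Float division cannot represent the quotient once the numerator exceeds 2**53,
# so math.ceil can come out one too low; the integer form stays exact.
a, b = 2**53 + 1, 2
print(math.ceil(a / b), -(-a // b))  # 4503599627370496 4503599627370497
```

The final comparison uses a numerator far larger than any realistic token count. It is there only to show the kind of rounding the integer form rules out.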

Fixes #957

🤖 Generated with Claude Code

GPTDataset._get_num_epochs grew num_epochs in a loop until the accumulated token count reached the requested amount. When num_tokens_per_epoch is zero -- which happens when a data split is allocated too small a fraction for the dataset size -- num_tokens never grows and the loop spins forever with no diagnostic.

Raise a RuntimeError that names the affected split and explains how to adjust the configuration, and replace the accumulation loop with an equivalent integer ceiling division.

Fixes NVIDIA#957

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifei-Zuo <yifeizuo2029@u.northwestern.edu>

copy-pr-bot · 2026-06-26T23:36:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot added the community-request label Jun 26, 2026

Yifei-Zuo wants to merge 1 commit into NVIDIA:main from Yifei-Zuo:fix/gpt-dataset-num-tokens-per-epoch-guard
