Guard against non-positive num_tokens_per_epoch in GPTDataset #5519

Draft
Yifei-Zuo wants to merge 1 commit into NVIDIA:main from Yifei-Zuo:fix/gpt-dataset-num-tokens-per-epoch-guard

@Yifei-Zuo

What

Fix a possible infinite loop in GPTDataset._get_num_epochs when a data split
is allocated zero tokens.

Why

_get_num_epochs accumulated num_tokens one epoch at a time in a while
loop until it reached the requested token count:

num_epochs = 1
num_tokens = num_tokens_per_epoch
...
while num_tokens < num_tokens_requested:
    num_epochs += 1
    num_tokens += num_tokens_per_epoch

When num_tokens_per_epoch == 0, num_tokens never grows and the loop spins
forever with no diagnostic. This happens when the train/valid/test split
assigns a split too small a fraction of the dataset (e.g.
--split 9998,0.002,0.002 on a small corpus). This was reported in #957 while
training Mixtral-MoE 8x7B.
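
For illustration only, the failure mode can be reproduced outside Megatron with a standalone copy of the loop. This is a sketch, not the library code: the function name loop_epochs_capped and the max_iters safety cap are hypothetical and exist only so the demonstration terminates.

```python
from typing import Optional


def loop_epochs_capped(
    num_tokens_per_epoch: int,
    num_tokens_requested: int,
    max_iters: int = 1000,  # hypothetical cap, not part of the original loop
) -> Optional[int]:
    """Mirror the accumulation loop, returning None if the cap is hit."""
    num_epochs = 1
    num_tokens = num_tokens_per_epoch
    iters = 0
    while num_tokens < num_tokens_requested:
        if iters >= max_iters:
            # Without the cap, this is where the original loop would hang.
            return None
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
        iters += 1
    return num_epochs


# A non-empty split converges: 4 + 4 + 4 = 12 >= 10 after three epochs.
print(loop_epochs_capped(4, 10))
# An empty split makes no progress, so only the cap stops the loop.
print(loop_epochs_capped(0, 10))
```

With zero tokens per epoch, raising the cap does not help: num_tokens stays at 0 on every iteration, so the loop condition never changes.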

What changed

  • Raise a RuntimeError when num_tokens_per_epoch <= 0. The message names the
    affected split and explains how to fix the configuration, instead of hanging
    silently.
  • Replace the accumulation loop with an equivalent integer ceiling division
    (max(1, -(-num_tokens_requested // num_tokens_per_epoch))). Integer math is
    used rather than math.ceil(a / b) to avoid floating-point drift at exact
    boundaries for large token counts. Verified to return identical values to the
    old loop across 200k randomized cases.
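
The two changes above can be sketched as a standalone function. This is an approximation under stated assumptions, not the code in this PR: the names ceil_epochs, loop_epochs and split_name are hypothetical, and the error message wording is invented for the example. The sketch also repeats a small version of the equivalence check and shows one case where float division loses precision.

```python
import math
import random


def loop_epochs(num_tokens_per_epoch: int, num_tokens_requested: int) -> int:
    """Reference: the old accumulation loop (assumes a positive epoch size)."""
    num_epochs = 1
    num_tokens = num_tokens_per_epoch
    while num_tokens < num_tokens_requested:
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
    return num_epochs


def ceil_epochs(
    num_tokens_per_epoch: int,
    num_tokens_requested: int,
    split_name: str = "valid",  # hypothetical parameter for the message
) -> int:
    """Guarded replacement using integer ceiling division."""
    if num_tokens_per_epoch <= 0:
        # Message text is an assumption; the PR only states that it names
        # the split and explains how to fix the configuration.
        raise RuntimeError(
            f"Split '{split_name}' has {num_tokens_per_epoch} tokens per "
            "epoch; give it a larger fraction in --split or use more data."
        )
    # -(-a // b) is ceil(a / b) in exact integer arithmetic; max(1, ...)
    # keeps the old behavior of always returning at least one epoch.
    return max(1, -(-num_tokens_requested // num_tokens_per_epoch))


# Small randomized equivalence check against the reference loop.
rng = random.Random(0)
for _ in range(2000):
    per_epoch = rng.randint(1, 50)
    requested = rng.randint(-5, 500)
    assert ceil_epochs(per_epoch, requested) == loop_epochs(per_epoch, requested)

# Float division rounds once the operands exceed 2**53, so math.ceil(a / b)
# can be off by one where integer arithmetic stays exact.
a = 2**53 + 1
assert -(-a // 1) == a
assert math.ceil(a / 1) == a - 1
```

The value 2**53 + 1 is deliberately contrived to make the rounding visible; realistic token counts are far smaller, but integer arithmetic removes the concern at any scale.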

Fixes #957

🤖 Generated with Claude Code

GPTDataset._get_num_epochs grew num_epochs in a loop until the accumulated
token count reached the requested amount. When num_tokens_per_epoch is zero --
which happens when a data split is allocated too small a fraction for the
dataset size -- num_tokens never grows and the loop spins forever with no
diagnostic.

Raise a RuntimeError that names the affected split and explains how to adjust
the configuration, and replace the accumulation loop with an equivalent integer
ceiling division.

Fixes NVIDIA#957

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifei-Zuo <yifeizuo2029@u.northwestern.edu>
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Development

Successfully merging this pull request may close these issues.

[BUG] Infinite Loop in _get_num_epochs Function of GPTDataset Class When num_tokens_per_epoch is Zero