Guard against non-positive num_tokens_per_epoch in GPTDataset #5519
Draft
Yifei-Zuo wants to merge 1 commit into
GPTDataset._get_num_epochs grew num_epochs in a loop until the accumulated token count reached the requested amount. When num_tokens_per_epoch is zero (which happens when a data split is allocated too small a fraction for the dataset size), num_tokens never grows and the loop spins forever with no diagnostic.
Raise a RuntimeError that names the affected split and explains how to adjust the configuration, and replace the accumulation loop with an equivalent integer ceiling division.
Fixes NVIDIA#957
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifei-Zuo <yifeizuo2029@u.northwestern.edu>
What
Fix a possible infinite loop in `GPTDataset._get_num_epochs` when a data split is allocated zero tokens.
Why
`_get_num_epochs` accumulated `num_tokens` one epoch at a time in a `while` loop until it reached the requested token count.
When `num_tokens_per_epoch == 0` (which happens when the train/valid/test split assigns too small a fraction to a split for the dataset size, e.g. `--split 9998,0.002,0.002` on a small corpus), `num_tokens` never grows and the loop spins forever with no diagnostic. This was reported in #957 while training Mixtral-MoE 8x7B.
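The failure mode can be reproduced in isolation. The following is a minimal sketch of an accumulation loop of the kind described above, not the actual Megatron code: the function name `epochs_by_accumulation` and the `max_iters` cap are hypothetical, and the cap exists only so the demonstration terminates instead of hanging.

```python
from typing import Optional


def epochs_by_accumulation(
    num_tokens_requested: int,
    num_tokens_per_epoch: int,
    max_iters: int = 1_000_000,  # hypothetical safety cap, not part of the original loop
) -> Optional[int]:
    """Count epochs by adding one epoch of tokens at a time.

    Returns None if the cap is hit, which stands in for "would spin forever".
    """
    num_epochs = 1
    num_tokens = num_tokens_per_epoch
    iters = 0
    while num_tokens < num_tokens_requested:
        if iters >= max_iters:
            return None  # no progress is being made
        num_epochs += 1
        num_tokens += num_tokens_per_epoch  # adds 0 when the split is empty
        iters += 1
    return num_epochs


normal = epochs_by_accumulation(10, 4)                 # 4 -> 8 -> 12 tokens
empty = epochs_by_accumulation(10, 0, max_iters=1000)  # never reaches 10
```

With a positive `num_tokens_per_epoch` the loop ends after a few additions; with zero, every iteration adds nothing and only the cap stops it.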
What changed
- Raise a `RuntimeError` when `num_tokens_per_epoch <= 0`. The message names the affected split and explains how to fix the configuration, instead of hanging silently.
- Replace the accumulation loop with an equivalent integer ceiling division (`max(1, -(-num_tokens_requested // num_tokens_per_epoch))`). Integer math is used rather than `math.ceil(a / b)` to avoid floating-point drift at exact boundaries for large token counts. Verified to return identical values to the old loop across 200k randomized cases.
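The two changes can be sketched together as a standalone function. This is an illustration under assumptions rather than the code in the diff: the name `get_num_epochs`, the `split_name` parameter, and the error wording are hypothetical; only the guard condition and the ceiling-division expression come from the description above. The last lines show why integer arithmetic is preferred, using a value too large for a float to represent exactly.

```python
import math


def get_num_epochs(
    num_tokens_requested: int,
    num_tokens_per_epoch: int,
    split_name: str = "train",  # hypothetical parameter, used only for the message
) -> int:
    """Return the number of epochs needed to cover the requested tokens."""
    if num_tokens_per_epoch <= 0:
        # Fail fast instead of looping forever (message wording is illustrative).
        raise RuntimeError(
            f"The {split_name} split has {num_tokens_per_epoch} tokens per epoch. "
            "Give it a larger share in --split or use a larger dataset."
        )
    # -(-a // b) is ceil(a / b) computed entirely in integers.
    return max(1, -(-num_tokens_requested // num_tokens_per_epoch))


# Float division loses the low bits of large integers; integer division does not.
big = 10**17 + 1
exact = -(-big // 1)           # stays 10**17 + 1
via_float = math.ceil(big / 1) # big / 1 rounds to 1e17 before ceil is applied
```

The `max(1, ...)` keeps the result at one epoch when zero tokens are requested, matching a loop that starts counting from one epoch.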
Fixes #957
🤖 Generated with Claude Code