Update pretraining dataloader and torchtitan runner by T4ras123 · Pull Request #33 · YerevaNN/3DMolGen

T4ras123 · 2026-05-20T10:21:15Z

This pull request introduces several performance and robustness improvements to the data loading and checkpoint handling logic for pretraining in molgen3D. The most significant changes are the refactoring of the data loader to use NumPy arrays for pair management (improving efficiency and scalability), a robust mechanism to patch HuggingFace checkpoints safely in distributed environments, and the initialization of a multi-backend PyTorch distributed group to avoid InfiniBand resource exhaustion. Additionally, some unused arguments were removed from tokenizer initialization.

DataLoader performance and memory improvements:

Changed storage of training file/line pairs from a Python list to a NumPy array in dataloader.py, and updated all related logic to use NumPy operations for shuffling and worker assignment. This should reduce memory usage and speed up shuffling and sharding on large datasets.
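A minimal sketch of the idea, not the code from this PR: the helper names (`build_pairs`, `shuffle_pairs`, `shard_for_worker`), the `(file_idx, line_idx)` layout, and the strided worker split are assumptions made for illustration.

```python
import numpy as np


def build_pairs(line_counts):
    """Return an (N, 2) int64 array of (file_idx, line_idx) pairs.

    `line_counts[i]` is the number of lines in file i (hypothetical input).
    """
    counts = np.asarray(line_counts, dtype=np.int64)
    # File index repeated once per line in that file.
    file_idx = np.repeat(np.arange(len(counts), dtype=np.int64), counts)
    # Line index within a file = global position minus the file's start offset.
    starts = np.cumsum(counts) - counts
    line_idx = np.arange(counts.sum(), dtype=np.int64) - np.repeat(starts, counts)
    return np.stack([file_idx, line_idx], axis=1)


def shuffle_pairs(pairs, seed, epoch):
    """Return a shuffled copy; the same (seed, epoch) always gives the same order."""
    rng = np.random.default_rng([seed, epoch])
    return pairs[rng.permutation(len(pairs))]


def shard_for_worker(pairs, worker_id, num_workers):
    """Strided split: worker k takes rows k, k + num_workers, ..."""
    return pairs[worker_id::num_workers]
```

Compared with a list of Python tuples, the array stores each pair as two 8-byte integers, and the shuffle and the shard are single indexing operations rather than Python-level loops.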

Distributed checkpoint patching and safety:

Updated _ensure_hf_checkpoint_has_lm_head to ensure only global rank 0 performs checkpoint patching, using a .ready sentinel file to signal completion. Other ranks wait for the sentinel, preventing race conditions and file corruption. This is especially important for multi-node training.
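The sentinel pattern described above could look roughly like the sketch below. It is not the body of `_ensure_hf_checkpoint_has_lm_head`: the function name `patch_once`, the `patch_fn` callback, the timeout, and the polling interval are all hypothetical, and the global rank is assumed to come from the `RANK` environment variable that torchrun sets.

```python
import os
import time
from pathlib import Path


def patch_once(ckpt_dir, patch_fn, rank=None, timeout_s=600.0, poll_s=0.5):
    """Run `patch_fn(ckpt_dir)` on global rank 0 only; other ranks wait.

    Returns True on the rank that owns patching, False on waiting ranks.
    """
    ckpt_dir = Path(ckpt_dir)
    sentinel = ckpt_dir / ".ready"
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))  # assumption: torchrun-style env

    if rank == 0:
        if not sentinel.exists():
            patch_fn(ckpt_dir)
            # Write the sentinel last, via an atomic rename, so that its
            # presence implies the patch has fully completed.
            tmp = sentinel.with_name(".ready.tmp")
            tmp.write_text("ok")
            os.replace(tmp, sentinel)
        return True

    # Non-zero ranks poll until rank 0 signals completion or the timeout hits.
    deadline = time.monotonic() + timeout_s
    while not sentinel.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"timed out waiting for {sentinel}")
        time.sleep(poll_s)
    return False
```

File polling works before any process group exists, which matters if the patch has to run ahead of distributed initialization. It does assume that all nodes see the same checkpoint directory, for example through a shared filesystem.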

Distributed training initialization robustness:

Added pre-initialization of torch.distributed with both CUDA and CPU backends to ensure CPU-based collectives use GLOO, avoiding InfiniBand completion queue exhaustion and related NCCL errors during planning phases.
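As a rough illustration, PyTorch accepts a per-device backend string of the form `"cpu:gloo,cuda:nccl"` in `init_process_group`. Whether the PR uses exactly this string is an assumption, and the helper names below are hypothetical.

```python
def select_backend(cuda_available):
    """Pick a backend string: Gloo for CPU tensors, NCCL for CUDA tensors."""
    return "cpu:gloo,cuda:nccl" if cuda_available else "gloo"


def init_mixed_backend_group():
    """Initialize the default process group before the trainer does.

    Assumes the usual torchrun environment variables (RANK, WORLD_SIZE,
    MASTER_ADDR, MASTER_PORT) are already set.
    """
    # Imported lazily so that select_backend stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    if dist.is_initialized():
        return
    dist.init_process_group(backend=select_backend(torch.cuda.is_available()))
```

With a group set up this way, collectives on CPU tensors are routed through Gloo, while GPU collectives still use NCCL.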

Tokenizer loading simplification:

Removed the unused fix_mistral_regex argument from AutoTokenizer.from_pretrained calls in both the data loader and runner, simplifying tokenizer initialization.

Configuration and path improvements:

Changed the checkpoint root path key from "qwen_yerevann_root" to "ckpts_root" for clarity and consistency in torchtitan_runner.py.

Update pretraining dataloader and torchtitan runner

e666137

T4ras123 wants to merge 1 commit into main from pr/training-runner
