Update pretraining dataloader and torchtitan runner #33

Open
T4ras123 wants to merge 1 commit into main from pr/training-runner

@T4ras123

Contributor

This pull request makes the pretraining data loading and checkpoint handling in molgen3D faster and more robust. The main changes are: the data loader now stores file/line pairs in a NumPy array instead of a Python list, which lowers memory overhead and scales better to large datasets; HuggingFace checkpoints are patched safely in distributed environments; and the PyTorch distributed process group is initialized with multiple backends to avoid InfiniBand resource exhaustion. In addition, an unused argument was removed from tokenizer initialization and the checkpoint root path key was renamed.

DataLoader performance and memory improvements:

  • Changed storage of training file/line pairs from a Python list to a NumPy array in dataloader.py, and updated all related logic (shuffling and worker assignment) to use NumPy operations. This should reduce memory usage and speed up shuffling and sharding on large datasets.
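
To illustrate the idea, here is a minimal sketch of storing file/line pairs as an `(N, 2)` integer array and splitting them across workers. It is not the code in dataloader.py: the function names `build_pairs` and `shard_for_worker`, the seeding scheme, and the strided worker split are all assumptions made for this example.

```python
import numpy as np


def build_pairs(line_counts):
    """Return an (N, 2) int64 array of (file_index, line_index) rows.

    `line_counts[i]` is the number of lines in file i (hypothetical input).
    """
    counts = np.asarray(line_counts, dtype=np.int64)
    # File index repeated once per line in that file.
    file_idx = np.repeat(np.arange(counts.size, dtype=np.int64), counts)
    # Offset of each file's first row, so line indices restart at 0 per file.
    starts = np.cumsum(counts) - counts
    line_idx = np.arange(file_idx.size, dtype=np.int64) - np.repeat(starts, counts)
    return np.stack([file_idx, line_idx], axis=1)


def shard_for_worker(pairs, worker_id, num_workers, seed, epoch):
    """Shuffle deterministically per (seed, epoch), then take a strided slice.

    Every worker computes the same permutation, so the slices are disjoint
    and together cover all pairs.
    """
    rng = np.random.default_rng([seed, epoch])
    order = rng.permutation(len(pairs))
    return pairs[order[worker_id::num_workers]]
```

With `int64` entries each pair costs 16 bytes in the array, whereas a Python list of tuples carries separate object overhead for the list slot, the tuple, and each integer.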

Distributed checkpoint patching and safety:

  • Updated _ensure_hf_checkpoint_has_lm_head so that only global rank 0 patches the checkpoint and then writes a .ready sentinel file to signal completion. All other ranks wait for the sentinel before proceeding, which prevents race conditions and file corruption. This is especially important for multi-node training.
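
The sentinel-file pattern described above can be sketched as follows. This is a simplified illustration, not the body of `_ensure_hf_checkpoint_has_lm_head`: the helper name `patch_once`, the `patch_fn` callback, the timeout and polling parameters, and the atomic rename are assumptions, and the sketch presumes all ranks see the same filesystem.

```python
import os
import time
from pathlib import Path


def patch_once(ckpt_dir, rank, patch_fn, timeout_s=600.0, poll_s=0.5):
    """Run `patch_fn(ckpt_dir)` on rank 0 only; other ranks wait for `.ready`."""
    ckpt_dir = Path(ckpt_dir)
    sentinel = ckpt_dir / ".ready"
    if sentinel.exists():
        return  # already patched, e.g. by an earlier run

    if rank == 0:
        patch_fn(ckpt_dir)
        # Write to a temporary name and rename, so waiting ranks never
        # observe a sentinel before patching has fully finished.
        tmp = sentinel.with_name(sentinel.name + ".tmp")
        tmp.write_text("ok")
        os.replace(tmp, sentinel)
        return

    # Non-zero ranks poll until the sentinel appears or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while not sentinel.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"timed out waiting for {sentinel}")
        time.sleep(poll_s)
```

Polling a file rather than calling a collective barrier lets the check run before the process group exists, at the cost of relying on shared-filesystem visibility across nodes.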

Distributed training initialization robustness:

  • Added pre-initialization of torch.distributed with a backend for each device type (CUDA and CPU), so that CPU-based collectives run on Gloo rather than NCCL. This avoids InfiniBand completion queue exhaustion and the related NCCL errors during planning phases.
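
One way to set up such a multi-backend group is PyTorch's per-device backend string. The sketch below assumes that form; the exact string and call site used in this PR are not shown in the description, and the helper names `backend_spec` and `init_distributed` are hypothetical. It also assumes the process was launched by a tool such as torchrun, which sets `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`.

```python
def backend_spec(has_cuda: bool) -> str:
    """Map CPU tensors to Gloo and CUDA tensors to NCCL when GPUs are present."""
    return "cpu:gloo,cuda:nccl" if has_cuda else "gloo"


def init_distributed() -> None:
    """Initialize the default process group once, before any other setup.

    Imports are local so that `backend_spec` stays usable without torch.
    """
    import torch
    import torch.distributed as dist

    if dist.is_available() and not dist.is_initialized():
        # Uses the default env:// rendezvous, so the launcher must have
        # exported RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
        dist.init_process_group(backend=backend_spec(torch.cuda.is_available()))
```

Calling `init_distributed()` early means later code that checks `dist.is_initialized()` reuses this group instead of creating an NCCL-only one.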

Tokenizer loading simplification:

  • Removed the unused fix_mistral_regex argument from the AutoTokenizer.from_pretrained calls in both the data loader and the runner, simplifying tokenizer initialization.

Configuration and path improvements:

  • Changed the checkpoint root path key from "qwen_yerevann_root" to "ckpts_root" for clarity and consistency in torchtitan_runner.py.
