
Add helper scripts for big-data processing, coord stats, and training #34

Open
T4ras123 wants to merge 1 commit into main from pr/helper-scripts

@T4ras123

Contributor

This pull request adds new helper scripts and improves the distributed training infrastructure and data-processing workflow. The main changes are new utility scripts for data conversion and coordinate-statistics estimation, a reworked SLURM launch script for distributed training (with multi-node support and environment handling), and an update to the Qwen3 model conversion helper so that it stays compatible with Torchtitan's configuration.

New utility scripts:

  • Added scripts/convert_big_data_format.sh, a SLURM-ready shell script to efficiently convert large CSV datasets into sharded pickle files, with extensive environment and resource configuration options.
  • Added scripts/estimate_bigdata_coord_stats.py, a Python script that samples conformer coordinates from sharded pickle files to quickly estimate coordinate statistics and fit quantile bins.
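
The two scripts above form a convert-then-sample pipeline. The following is a minimal, standard-library sketch of that idea, not the code in this PR: the shard naming, the per-row dict layout, the `x`/`y`/`z` column names, and both function names are assumptions made for illustration.

```python
import csv
import pickle
import random
import statistics
from pathlib import Path


def csv_to_pickle_shards(csv_path, out_dir, rows_per_shard=100_000):
    """Split a CSV into numbered pickle shards, each holding a list of row dicts."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []

    def flush(rows):
        # Hypothetical naming scheme: shard_00000.pkl, shard_00001.pkl, ...
        path = out_dir / f"shard_{len(shard_paths):05d}.pkl"
        with open(path, "wb") as fh:
            pickle.dump(rows, fh)
        shard_paths.append(path)

    buffer = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            buffer.append(row)
            if len(buffer) >= rows_per_shard:
                flush(buffer)
                buffer = []
    if buffer:  # write the final, possibly short, shard
        flush(buffer)
    return shard_paths


def estimate_coord_bins(shard_paths, max_shards=8, sample_size=100_000,
                        n_bins=16, seed=0):
    """Sample coordinates from a few shards and fit equal-mass (quantile) bins."""
    rng = random.Random(seed)
    chosen = rng.sample(list(shard_paths), min(max_shards, len(shard_paths)))
    coords = []
    for path in chosen:
        with open(path, "rb") as fh:
            for row in pickle.load(fh):
                # Assumed columns; the real data layout may differ.
                coords.extend(float(row[k]) for k in ("x", "y", "z"))
    sample = rng.sample(coords, min(sample_size, len(coords)))
    return {
        "mean": statistics.fmean(sample),
        "std": statistics.pstdev(sample),
        # n_bins - 1 interior cut points between equal-probability bins
        "edges": statistics.quantiles(sample, n=n_bins),
    }
```

Reading only a subset of shards keeps the estimate cheap on large datasets, at the cost of some sampling error in the bin edges.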

Distributed training infrastructure improvements (scripts/launch_torchtitan_qwen3.sh):

  • Enhanced SLURM resource configuration and environment variable handling, including support for NCCL/IB tuning, workspace setup, and WANDB configuration. [1] [2]
  • Implemented robust multi-node and multi-GPU orchestration using SSH-based fan-out to worker nodes, automatic network interface detection, and shared workspace for configuration files, improving reliability on heterogeneous clusters. [1] [2]
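
The SSH-based fan-out described above can be pictured as building one remote command per node. The launch script itself is a shell script; this Python sketch only illustrates the shape of the commands under stated assumptions. The host names, the `train.py` entry point, the default port, and the function name are hypothetical, while the `torchrun` flags and the `NCCL_SOCKET_IFNAME` variable are standard.

```python
import shlex


def build_fanout_commands(hosts, gpus_per_node, master_port=29500,
                          train_cmd=("train.py",), env=None):
    """Return one ssh argv per host; the first host acts as the master address."""
    env = dict(env or {})
    master_addr = hosts[0]
    commands = []
    for node_rank, host in enumerate(hosts):
        torchrun = [
            "torchrun",
            f"--nnodes={len(hosts)}",
            f"--nproc_per_node={gpus_per_node}",
            f"--node_rank={node_rank}",
            f"--master_addr={master_addr}",
            f"--master_port={master_port}",
            *train_cmd,  # hypothetical entry point and its arguments
        ]
        # Forward environment explicitly, since ssh does not inherit it.
        exports = [f"{k}={shlex.quote(str(v))}" for k, v in sorted(env.items())]
        remote = " ".join(["env", *exports, shlex.join(torchrun)])
        commands.append(["ssh", host, remote])
    return commands
```

Each returned list could be passed to `subprocess.Popen`; the detected network interface would be supplied through `env`, for example `{"NCCL_SOCKET_IFNAME": "ib0"}`.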

Qwen3 model conversion compatibility:

  • Updated convert_qwen3_dcp_to_hf.py to construct a Qwen3ModelArgs object from the HuggingFace config, ensuring compatibility with Torchtitan's Qwen3StateDictAdapter for model weight conversion. [1] [2]
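
Building model arguments from a HuggingFace config amounts to a key-by-key mapping. The sketch below shows that pattern with a stand-in dataclass; it is not Torchtitan's `Qwen3ModelArgs`, and the field names on the right-hand side of the mapping are assumptions. The left-hand keys are the usual names found in a HuggingFace `config.json`.

```python
from dataclasses import dataclass


@dataclass
class ModelArgsSketch:
    """Stand-in for illustration only; NOT Torchtitan's Qwen3ModelArgs."""
    dim: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    vocab_size: int
    hidden_dim: int
    norm_eps: float
    rope_theta: float


# HuggingFace config key -> stand-in field name (right-hand side is assumed).
HF_TO_ARGS = {
    "hidden_size": "dim",
    "num_hidden_layers": "n_layers",
    "num_attention_heads": "n_heads",
    "num_key_value_heads": "n_kv_heads",
    "vocab_size": "vocab_size",
    "intermediate_size": "hidden_dim",
    "rms_norm_eps": "norm_eps",
    "rope_theta": "rope_theta",
}


def args_from_hf_config(hf_config: dict) -> ModelArgsSketch:
    """Translate a HuggingFace-style config dict into the stand-in args object."""
    missing = [key for key in HF_TO_ARGS if key not in hf_config]
    if missing:
        # Fail loudly rather than silently falling back to defaults.
        raise KeyError(f"config is missing keys: {missing}")
    return ModelArgsSketch(
        **{field: hf_config[key] for key, field in HF_TO_ARGS.items()}
    )
```

Deriving the arguments from the checkpoint's own config, rather than from a hard-coded model flavor, keeps the converted weights consistent with the shapes the state-dict adapter expects.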
