This repository provides code for hyperparameter optimization, feature extraction, training, and evaluation of the GMC-MPNN model, specifically designed for predicting blood-brain barrier permeability (BBBP).
Install Chemprop following the official guide: 🔗 https://chemprop.readthedocs.io/en/latest/installation.html
📚 Additional Dependencies
pip install pandas numpy scipy scikit-learn biopandas rdkit

Depending on your system, rdkit may require Conda:
conda install -c rdkit rdkit

We recommend setting up a clean environment using conda or virtualenv and making sure that all dependencies listed in the Chemprop documentation are satisfied.
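Once the environment is set up, a quick import check can confirm that the packages listed above are visible to the interpreter. The sketch below is illustrative and not part of this repository: the helper name `check_dependencies` is hypothetical, and it assumes the usual import names (`sklearn` for scikit-learn, `chemprop` for Chemprop).

```python
import importlib.util

# Import names for the packages installed above.
# Assumption: scikit-learn is imported as "sklearn" and Chemprop as "chemprop".
REQUIRED_MODULES = ["chemprop", "pandas", "numpy", "scipy", "sklearn", "biopandas", "rdkit"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Any package reported as missing should be installed before moving on to the steps that follow.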
To search for optimal hyperparameters using Chemprop's CLI:
chemprop hpopt \
--data-path <path_to_dataset.csv> \
--task-type <classification|regression> \
--search-parameter-keywords all \
--split-type SCAFFOLD_BALANCED \
--hpopt-save-dir <path_to_output_dir> \
--raytune-num-gpus 1

To compute GGL-based ligand features:
python <script_path> -k <kernel_index> -c <cutoff> -f <csv_file> -dd <data_folder> -fd <feature_folder>

Example:
python get_ggl_ligand_features.py -k 1551 -c 12.0 -f dataset.csv -dd ./mol2_files -fd ./features

Use the provided SLURM job script:
sbatch extract_ggl_features.sh

To train models with multiple seeds (0-4) and automatically average test results:
python train.py \
--training_script <training_script> \
--input_path <path_to_dataset.csv> \
--features_folder <path_to_features> \
--results_path <path_to_results> \
--seeds 0 1 2 3 4 \
--target_columns <target_column_name>

Example for B3DB_cls:
python train.py \
--training_script train_b3db_cls.py \
--input_path /path/to/B3DB_cls.csv \
--features_folder /path/to/features/B3DB_cls \
--results_path /path/to/results/B3DB_cls/multi_seed \
--seeds 0 1 2 3 4 \
--target_columns labels \
--batch_size 32 \
--max_epochs 100 \
--split_type SCAFFOLD_BALANCED

The script trains one model per seed and automatically averages the test results across seeds.
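The averaging step can be pictured with the following minimal sketch. It is not the code used by `train.py`; the function name `average_over_seeds`, the metric names, and the input layout (a dict of per-seed metric dicts) are assumptions made for illustration, and the numbers are placeholders rather than reported results.

```python
from statistics import mean, stdev


def average_over_seeds(results_by_seed):
    """Aggregate per-seed test metrics into mean and sample standard deviation.

    results_by_seed maps a seed (int) to a dict of metric name -> value.
    All seeds are assumed to report the same set of metrics.
    """
    metric_names = sorted(next(iter(results_by_seed.values())))
    summary = {}
    for metric in metric_names:
        values = [results_by_seed[seed][metric] for seed in sorted(results_by_seed)]
        summary[metric] = {
            "mean": mean(values),
            # The sample standard deviation needs at least two values.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Placeholder numbers for illustration only; these are not results from this work.
example = {
    0: {"auc": 0.90, "accuracy": 0.80},
    1: {"auc": 0.92, "accuracy": 0.82},
    2: {"auc": 0.94, "accuracy": 0.84},
}
summary = average_over_seeds(example)
```

Reporting the spread alongside the mean makes it easier to judge how sensitive a configuration is to the random seed.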
To submit parallel training jobs:
sbatch train.sh

Update train.sh with your dataset-specific paths and configuration before submitting.
To reproduce results by training only the best kernel for each seed:
python test.py \
--dataset <dataset_name> \
--input_path <path_to_dataset.csv> \
--features_folder <path_to_features> \
--results_path <path_to_results>

Example for B3DB_cls:
python test.py \
--dataset B3DB_cls \
--input_path /path/to/B3DB_cls.csv \
--features_folder /path/to/features/B3DB_cls \
--results_path /path/to/results/B3DB_cls/test

The script automatically uses the best kernel for each seed.
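How `test.py` determines the best kernel is not described here. As a rough illustration of what a per-seed selection could look like, the sketch below picks the highest-scoring kernel index for each seed. The function name `best_kernel_per_seed`, the score table, and the "higher is better" criterion are all assumptions, not the repository's actual logic.

```python
def best_kernel_per_seed(scores):
    """Pick, for each seed, the kernel index with the highest score.

    scores maps (seed, kernel_index) -> score. Higher is assumed to be better.
    """
    best = {}
    for (seed, kernel), score in scores.items():
        if seed not in best or score > best[seed][1]:
            best[seed] = (kernel, score)
    return {seed: kernel for seed, (kernel, _) in best.items()}


# Hypothetical scores for illustration only. Kernel index 1552 is invented;
# 1551 is simply the index used in the feature extraction example.
scores = {
    (0, 1551): 0.91, (0, 1552): 0.89,
    (1, 1551): 0.88, (1, 1552): 0.90,
}
selection = best_kernel_per_seed(scores)
```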
We provide the following for reproducibility and testing:
- ✅ All datasets
- ✅ GGL feature files (.npz)
📥 Access via OneDrive
🔗 http://bit.ly/4558Ovg
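After downloading, the `.npz` feature files can be inspected with NumPy before they are passed to training. The sketch below lists the arrays stored in an archive together with their shapes. The helper name `summarize_npz` is hypothetical, and the demo archive uses a placeholder key and shape, since the actual array names inside the provided files are not documented here.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path):
    """Return {array_name: shape} for every array stored in an .npz archive."""
    with np.load(path) as archive:
        return {name: archive[name].shape for name in archive.files}


# Self-contained demo: write a small archive, then inspect it.
# The key "features" and the shape (3, 4) are placeholders, not the real layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.npz")
    np.savez(demo_path, features=np.zeros((3, 4)))
    demo_summary = summarize_npz(demo_path)
```

Calling `summarize_npz` on one of the downloaded files shows which keys it contains, which helps confirm that the download completed and that the file matches the dataset being trained on.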
For questions or support, please contact: 📧 ducnguyen@utk.edu