This code was tested with PyTorch 2.0.1, CUDA 11.8 and torch_geometric 2.3.1.
Download Anaconda/Miniconda if needed.
Create a conda environment that includes RDKit:
conda create -c conda-forge -n dmol rdkit=2023.03.2 python=3.9
conda activate dmol
Check that this line does not return an error:
python3 -c 'from rdkit import Chem'
Install graph-tool (https://graph-tool.skewed.de/):
conda install -c conda-forge graph-tool=2.45
Check that this line does not return an error:
python3 -c 'import graph_tool as gt'
Install the CUDA toolkit (which provides nvcc) for your CUDA version. For example:
conda install -c "nvidia/label/cuda-11.8.0" cuda
Install a corresponding version of PyTorch, for example:
pip3 install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118
Install the other packages using the requirements file:
pip install -r requirements.txt
Navigate to the ./src/analysis/orca directory and compile orca.cpp:
g++ -O2 -std=c++11 -o orca orca.cpp
Note: graph_tool and torch_geometric currently seem to conflict on macOS; I have not solved this issue yet.
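The individual import checks above can be combined into a single script. The following is a minimal sketch and is not part of the repository: the helper name check_imports is hypothetical, and the module list simply repeats the packages named in the installation steps.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and report the outcome.

    Returns a dict mapping each module name to None on success,
    or to an error description string on failure.
    """
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # ImportError, or errors raised while loading native libraries
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Module names taken from the installation steps in this README.
    required = ["rdkit", "graph_tool", "torch", "torch_geometric"]
    for name, error in check_imports(required).items():
        status = "OK" if error is None else f"FAILED ({error})"
        print(f"{name}: {status}")
```

Running it inside the dmol environment should print one line per package; any FAILED line points at the installation step to revisit.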
All code is currently launched through python3 main.py. Check the Hydra documentation (https://hydra.cc/) for how to override default parameters.
To run the discrete model with the default QM9 configuration:
python3 main.py
To select a different dataset, override the dataset.name parameter with Hydra. For example, to use Guacamol:
python3 main.py dataset=guacamol
See the YAML files under configs/dataset for the list of supported datasets and their options.
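To make the override syntax concrete, here is a small sketch that builds the launch command for a given dataset. The helper build_command is hypothetical; only python3 main.py and the dataset=... override come from this README, and the dataset names are assumed to match the YAML file names under configs/dataset.

```python
import shlex


def build_command(dataset=None, extra_overrides=()):
    """Build the argument list for launching main.py with Hydra-style overrides."""
    cmd = ["python3", "main.py"]
    if dataset is not None:
        # Assumption: the value matches a YAML file name under configs/dataset.
        cmd.append(f"dataset={dataset}")
    # Any further key=value overrides are passed through unchanged.
    cmd.extend(extra_overrides)
    return cmd


if __name__ == "__main__":
    for name in (None, "guacamol", "moses"):
        print(shlex.join(build_command(name)))
```

The returned list can be passed to subprocess.run(..., check=True) to launch a run from a script, for instance when looping over several datasets.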
Before running any experiment, you must:
1. Download raw datasets
- QM9
  - The code will automatically download from the PyG/DeepChem URLs if files are missing (see QM9Dataset.raw_url and QM9Dataset.raw_url2 in src/datasets/qm9_dataset.py):
    - https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/molnet_publish/qm9.zip
    - https://ndownloader.figshare.com/files/3195404
  - If you prefer manual download, place the extracted files under data/qm9/raw/ (e.g. gdb9.sdf, gdb9.sdf.csv, uncharacterized.txt, train.csv, val.csv, test.csv).
- MOSES
  - Automatic download URLs are defined in MOSESDataset (src/datasets/moses_dataset.py):
    - Train: https://media.githubusercontent.com/media/molecularsets/moses/master/data/train.csv
    - Test: https://media.githubusercontent.com/media/molecularsets/moses/master/data/test.csv
    - Scaffolds: https://media.githubusercontent.com/media/molecularsets/moses/master/data/test_scaffolds.csv
  - To download manually, save them as train_moses.csv, val_moses.csv, test_moses.csv under data/moses/raw/.
- Guacamol
  - Download URLs are defined in GuacamolDataset (src/datasets/guacamol_dataset.py):
    - Train: https://figshare.com/ndownloader/files/13612760
    - Test: https://figshare.com/ndownloader/files/13612757
    - Valid: https://figshare.com/ndownloader/files/13612766
    - All: https://figshare.com/ndownloader/files/13612745 (used only if you need the full set)
  - Save them as guacamol_v1_train.smiles, guacamol_v1_valid.smiles, guacamol_v1_test.smiles under data/guacamol/raw/.
- Make sure the folder structure matches what is expected by the corresponding dataset classes under src/datasets/.
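Before launching a run, the raw folder layout described above can be checked with a short script. This is a sketch only: the helper missing_raw_files and the table EXPECTED_RAW_FILES are hypothetical, and the file names are copied from this README (the QM9 list is given there as an example and may not be exhaustive). The dataset classes under src/datasets/ remain the authoritative reference.

```python
from pathlib import Path

# File names as listed in this README; treat them as assumptions and
# compare against the dataset classes under src/datasets/ if in doubt.
EXPECTED_RAW_FILES = {
    "qm9": ["gdb9.sdf", "gdb9.sdf.csv", "uncharacterized.txt",
            "train.csv", "val.csv", "test.csv"],
    "moses": ["train_moses.csv", "val_moses.csv", "test_moses.csv"],
    "guacamol": ["guacamol_v1_train.smiles", "guacamol_v1_valid.smiles",
                 "guacamol_v1_test.smiles"],
}


def missing_raw_files(dataset, data_root="data"):
    """Return the expected raw files absent from <data_root>/<dataset>/raw/."""
    raw_dir = Path(data_root) / dataset / "raw"
    return [name for name in EXPECTED_RAW_FILES[dataset]
            if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    for dataset in EXPECTED_RAW_FILES:
        missing = missing_raw_files(dataset)
        status = "complete" if not missing else "missing " + ", ".join(missing)
        print(f"{dataset}: {status}")
```

A dataset reported as missing files is not necessarily a problem, since the code downloads them automatically when the URLs are reachable.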
2. Collect ring statistics (optional but recommended)
- Use ring_statistics.py at the project root to analyze which ring structures are most frequent in your dataset:
  # Example: compute ring statistics for a SMILES file
  python3 ring_statistics.py data/qm9/raw/qm9_train.smiles
- The script will output a text file (e.g. *_ring_stats.txt) containing all observed rings and their frequencies.
- Inspect this file and decide which ring SMILES you want to compress into supernodes (e.g. C1CCC1, C1CC1, N1CCC1).
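The selection step can also be done programmatically once the frequencies are available as a mapping from ring SMILES to counts. The sketch below assumes such a mapping already exists; the exact layout of the *_ring_stats.txt file is not described here, so parsing is deliberately left out. The helper select_rings and its parameters top_k and min_fraction are hypothetical, and the counts in the example are made up for illustration.

```python
def select_rings(ring_counts, top_k=3, min_fraction=0.01):
    """Pick up to top_k ring SMILES, most frequent first.

    A ring is kept only if its share of all ring occurrences
    is at least min_fraction.
    """
    total = sum(ring_counts.values())
    if total == 0:
        return []
    # Sort by descending count; break ties alphabetically for a stable result.
    ranked = sorted(ring_counts.items(), key=lambda item: (-item[1], item[0]))
    return [smiles for smiles, count in ranked[:top_k]
            if count / total >= min_fraction]


if __name__ == "__main__":
    # Made-up counts for illustration only; real values come from ring_statistics.py.
    example_counts = {"C1CC1": 500, "C1CCC1": 300, "N1CCC1": 150, "C1CCCCC1": 5}
    print(select_rings(example_counts, top_k=3))
```

Keeping the selection small and frequency-based is one reasonable heuristic; chemical relevance may justify a different choice.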
3. Configure ring compression in the dataset config
- Open the corresponding YAML file under configs/dataset/, for example:
  - QM9: configs/dataset/qm9.yaml
  - MOSES: configs/dataset/moses.yaml
  - Guacamol: configs/dataset/guacamol.yaml
- In each of these files there is a ring compression section with the keys:
  - ring_types: list of ring SMILES that will be compressed into supernodes.
  - ring_weights: list of molecular weights (one per entry in ring_types), used to compute total molecular mass after compression.
  - statistics_after_4 (or statistics_after_compression): path to the JSON file where compressed-dataset statistics are stored.
- Replace the default ring_types and ring_weights with the ring structures you selected from the statistics file, making sure the lengths of both lists match.
- After changing these parameters, re-run the dataset preprocessing (by running python3 main.py once) so that the compressed graphs and updated statistics are recomputed.
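Because mismatched list lengths are easy to introduce when editing the config by hand, a quick consistency check can help. The following sketch works on a plain dict holding the ring compression keys; the helper validate_ring_config is hypothetical, and how the dict is obtained from the YAML file (and whether the keys sit at the top level) is left as an assumption. The weights in the example are placeholders, not real molecular weights.

```python
def validate_ring_config(cfg):
    """Check the ring compression keys of a dataset config given as a dict.

    Returns a list of problem descriptions; an empty list means all checks passed.
    """
    problems = []
    ring_types = cfg.get("ring_types")
    ring_weights = cfg.get("ring_weights")
    if ring_types is None or ring_weights is None:
        problems.append("both ring_types and ring_weights must be set")
        return problems
    if len(ring_types) != len(ring_weights):
        problems.append(
            f"ring_types has {len(ring_types)} entries but "
            f"ring_weights has {len(ring_weights)}"
        )
    if len(set(ring_types)) != len(ring_types):
        problems.append("ring_types contains duplicate SMILES")
    if any(not isinstance(w, (int, float)) or w <= 0 for w in ring_weights):
        problems.append("ring_weights must be positive numbers")
    return problems


if __name__ == "__main__":
    # Placeholder weights for illustration; one weight is intentionally missing.
    cfg = {"ring_types": ["C1CCC1", "C1CC1", "N1CCC1"], "ring_weights": [1.0, 2.0]}
    for problem in validate_ring_config(cfg):
        print("config problem:", problem)
```

Running such a check before python3 main.py avoids spending preprocessing time on a config that would later fail or yield wrong masses.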
In summary, the full workflow is:
- Download the raw dataset files into data/<dataset_name>/raw/.
- (Optional but recommended) Run ring_statistics.py on a SMILES file from that dataset to obtain ring frequency statistics.
- Choose a small set of ring SMILES to compress and put them into the ring_types and ring_weights fields of the corresponding YAML config under configs/dataset/.
- Launch experiments with python3 main.py (and override dataset=... if you want a dataset other than the default QM9).