🧪 SynthGen: Synthesizability-Constrained Molecular Generator

Python 3.9+ PyTorch PyTorch Geometric RDKit License: MIT Code style: black

A graph-based Variational Autoencoder (VAE) for de novo drug-like molecule generation with a synthesizability penalty (SAScore) embedded directly into the latent-space loss, steering generated candidates toward molecules that are not just novel but practical to make in the lab.


🎯 Motivation

Generative models for molecular design have achieved impressive novelty and diversity metrics, yet a persistent gap exists between in silico candidates and molecules that can be synthesized in practice. Inspired by Gao et al. (PNAS 2024) from the Coley group, who systematically benchmarked the synthesizability of AI-generated molecules and found that many state-of-the-art generators produce compounds with prohibitively complex synthetic routes, SynthGen addresses this problem by:

  1. Encoding molecular graphs directly (no SMILES tokenization loss).
  2. Penalizing high Synthetic Accessibility Scores (SAScore) during training via an augmented ELBO.
  3. Providing a plug-and-play benchmark suite to track synthesizability alongside standard generative metrics.

✨ Features

  • Graph-VAE backbone — message-passing encoder + graph-level readout; MLP decoder to SMILES/graph.
  • SAScore latent penalty — differentiable SA penalty added to the KL term; weight tunable via --sa_weight.
  • Multi-objective generation — Pareto front sampling over QED, logP, and SAScore.
  • Constrained generation — scaffold-conditioned sampling for lead optimisation.
  • Benchmark suite — validity, uniqueness, novelty, FCD, SAScore distribution, SYBA, and ASKCOS route-length metrics.
  • Reproducible splits — canonical ZINC250k / GuacaMol / ChEMBL split scripts included.
  • Notebook tutorials — end-to-end colab-ready walkthroughs in notebooks/.
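
The multi-objective bullet above mentions Pareto front sampling over QED, logP, and SAScore. As a rough illustration of the underlying idea (not the repository's implementation), the sketch below filters a list of pre-scored candidates down to its non-dominated set. The `Scored` record, the placeholder molecule names, and the choice to treat logP as distance from a hypothetical target of 2.5 are all assumptions made for the example.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Scored(NamedTuple):
    """A candidate with pre-computed property scores (hypothetical record)."""
    name: str
    qed: float   # drug-likeness, higher is better
    logp: float  # lipophilicity, scored by distance from a target (assumption)
    sa: float    # synthetic accessibility score, lower is better


def to_maximize(m: Scored, logp_target: float = 2.5) -> Tuple[float, float, float]:
    """Map raw scores to a vector where larger is better on every axis."""
    return (m.qed, -abs(m.logp - logp_target), -m.sa)


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(mols: List[Scored]) -> List[Scored]:
    """Keep only candidates that no other candidate dominates."""
    vecs = [to_maximize(m) for m in mols]
    return [m for m, v in zip(mols, vecs)
            if not any(dominates(w, v) for w in vecs)]


candidates = [
    Scored("mol_a", qed=0.80, logp=2.5, sa=2.0),
    Scored("mol_b", qed=0.60, logp=2.5, sa=3.5),  # worse than mol_a on QED and SA
    Scored("mol_c", qed=0.90, logp=4.0, sa=2.8),  # best QED, off-target logP
    Scored("mol_d", qed=0.70, logp=2.4, sa=1.5),  # best SA
]
front = pareto_front(candidates)
print([m.name for m in front])
```

The pairwise filter is quadratic in the number of candidates, which is usually acceptable for a few thousand samples; larger pools would call for a sort-based method.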

🗂 Repository Structure

SynthGen/
├── data/
│   ├── raw/                  # Original SMILES datasets (ZINC250k, GuacaMol, …)
│   └── processed/            # Pre-processed PyG graph tensors (.pt)
├── models/
│   ├── __init__.py
│   ├── encoder.py            # GNN message-passing encoder
│   ├── decoder.py            # Graph / SMILES decoder
│   └── vae.py                # Full VAE + SAScore-augmented ELBO
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_training_demo.ipynb
│   └── 03_generation_and_benchmark.ipynb
├── tests/
│   ├── test_model.py
│   ├── test_metrics.py
│   └── test_data.py
├── train.py                  # Training entry-point
├── evaluate.py               # Generation & benchmark entry-point
├── setup.py
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md

🛠 Stack

| Component | Library |
|---|---|
| Deep learning | PyTorch 2.0+ |
| Graph neural networks | PyTorch Geometric 2.3+ |
| Cheminformatics | RDKit 2023.09+ |
| Synthesizability | SAScore (RDKit contrib), SYBA |
| Experiment tracking | Weights & Biases |
| CLI | argparse / rich |
| Data | pandas, numpy |
| Visualization | matplotlib, seaborn, py3Dmol |

📦 Installation

# 1. Clone
git clone https://github.com/Islamomar-1/SynthGen.git
cd SynthGen

# 2. Create environment
conda create -n synthgen python=3.10 -y
conda activate synthgen

# 3. Install PyTorch (CUDA 11.8 example)
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# 4. Install PyTorch Geometric
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv \
    -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

# 5. Install remaining dependencies
pip install -e .

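CPU-only? Use the CPU wheel indexes instead: https://download.pytorch.org/whl/cpu for PyTorch and https://data.pyg.org/whl/torch-2.1.0+cpu.html for the PyG extension wheels.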
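
After installing, a quick sanity check can confirm that the core packages are visible to the interpreter. The snippet below is a small sketch using only the standard library; the list of import names (`torch`, `torch_geometric`, `rdkit`) is inferred from the stack above and may need adjusting for your environment.

```python
import importlib.util

# Import names assumed from the packages installed in the steps above.
REQUIRED = ["torch", "torch_geometric", "rdkit"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all dependencies found")
```

This only checks that the modules can be found, not that the CUDA build matches your driver.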


🚀 Quick Start

Prepare data

# Download ZINC250k and build PyG graphs
python -m synthgen.data.prepare --dataset zinc250k --out data/processed/zinc250k.pt

Train

python train.py \
    --data data/processed/zinc250k.pt \
    --epochs 100 \
    --latent_dim 256 \
    --sa_weight 0.5 \
    --batch_size 128 \
    --lr 1e-3 \
    --checkpoint_dir checkpoints/
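
For readers who want to see how the flags above might be wired up, here is a hedged argparse sketch. The flag names come from the command above; the types, defaults, and help strings are assumptions for illustration and may not match what train.py actually defines.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags in the training command (defaults are assumed)."""
    p = argparse.ArgumentParser(description="SynthGen training flags (illustrative)")
    p.add_argument("--data", required=True, help="path to processed PyG graphs (.pt)")
    p.add_argument("--epochs", type=int, default=100)
    p.add_argument("--latent_dim", type=int, default=256)
    p.add_argument("--sa_weight", type=float, default=0.5,
                   help="weight on the SAScore penalty term")
    p.add_argument("--batch_size", type=int, default=128)
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--checkpoint_dir", default="checkpoints/")
    return p


# Parse an explicit argument list so the example runs without a real CLI call.
args = build_parser().parse_args(
    ["--data", "data/processed/zinc250k.pt", "--sa_weight", "0.0"]
)
print(args.sa_weight, args.latent_dim)
```

If --sa_weight maps onto the penalty weight in the loss, as the Features list suggests, a value of 0 would presumably give an unpenalized baseline for comparison.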

Generate & Evaluate

python evaluate.py \
    --checkpoint checkpoints/best.pt \
    --n_samples 10000 \
    --output results/generated.csv

📊 Benchmark Results

Results on ZINC250k held-out set (10 000 generated molecules):

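| Model | Validity (%) ↑ | Uniqueness (%) ↑ | Novelty (%) ↑ | FCD ↓ | Mean SAScore ↓ | % SA ≤ 3 ↑ |
|---|---|---|---|---|---|---|
| JT-VAE | 100.0 | 99.9 | 99.9 | 4.51 | 3.24 | 61.2 |
| GCPN | 100.0 | 99.6 | 99.8 | 4.20 | 3.41 | 58.7 |
| MolGAN | 98.1 | 10.4 | 99.9 | 16.5 | 3.87 | 49.3 |
| GraphAF | 100.0 | 99.1 | 99.9 | 5.96 | 3.18 | 62.8 |
| SynthGen (ours) | 99.7 | 99.8 | 99.6 | 3.81 | 2.61 | 78.4 |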

SAScore ≤ 3 indicates readily synthesizable compounds (Ertl & Schuffenhauer, 2009).
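
To make the table's columns concrete, the sketch below computes validity, uniqueness, novelty, mean SAScore, and the share of molecules with SAScore ≤ 3 using commonly used definitions (fractions returned in [0, 1], uniqueness over valid molecules, novelty over unique valid molecules). Whether evaluate.py uses exactly these denominators is an assumption. The `toy_canonicalize` function is a stand-in so the example runs without RDKit; in practice one would typically canonicalize with RDKit's `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
from typing import Callable, Dict, Iterable, Optional, Sequence


def generation_metrics(
    generated: Sequence[str],
    training: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Validity, uniqueness, and novelty as fractions in [0, 1]."""
    canon = [canonicalize(s) for s in generated]
    valid = [c for c in canon if c is not None]
    unique = set(valid)
    train_set = {c for c in map(canonicalize, training) if c is not None}
    novel = unique - train_set
    return {
        "validity": len(valid) / len(canon) if canon else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def sa_summary(scores: Sequence[float], threshold: float = 3.0) -> Dict[str, float]:
    """Mean SAScore and percentage of scores at or below the threshold."""
    n = len(scores)
    return {
        "mean_sa": sum(scores) / n,
        "pct_below": 100.0 * sum(s <= threshold for s in scores) / n,
    }


def toy_canonicalize(smiles: str) -> Optional[str]:
    """Stand-in validity check: reject unbalanced parentheses (not real chemistry)."""
    s = smiles.strip()
    return s if s and s.count("(") == s.count(")") else None


generated = ["CCO", "CCO", "c1ccccc1", "C(C", "CCN"]
training = ["CCO"]
metrics = generation_metrics(generated, training, toy_canonicalize)
summary = sa_summary([2.0, 2.5, 3.0, 4.5])  # made-up scores for illustration
print(metrics, summary)
```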


🧬 Model Architecture

Molecule Graph
      │
      ▼
 GNN Encoder  (5 × GINEConv layers, hidden=512)
      │
  μ , log σ²  (latent_dim=256)
      │
  z ~ N(μ, σ²)
      │
   Decoder  (MLP → atom/bond logits → SMILES)
      │
   Output SMILES

Loss = Reconstruction + β·KL + λ·SAScore_penalty
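
The loss above can be read as a standard β-VAE objective with one extra term. The following framework-free sketch shows how the three pieces might combine for a single example, using the closed-form KL divergence between a diagonal Gaussian and a standard normal. How SynthGen actually computes the reconstruction term and the SAScore penalty is not specified here, so both are passed in as plain numbers, and the function names are hypothetical.

```python
import math
import random
from typing import Sequence, List


def reparameterize(mu: Sequence[float], logvar: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) and sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian, summed over dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


def augmented_loss(recon: float, mu: Sequence[float], logvar: Sequence[float],
                   sa_penalty: float, beta: float = 1.0,
                   sa_weight: float = 0.5) -> float:
    """Reconstruction + beta * KL + sa_weight * SAScore penalty."""
    return recon + beta * kl_to_standard_normal(mu, logvar) + sa_weight * sa_penalty


mu, logvar = [1.0], [0.0]
z = reparameterize(mu, logvar, random.Random(0))  # one latent sample
loss = augmented_loss(recon=2.0, mu=mu, logvar=logvar, sa_penalty=3.0)
print(len(z), loss)
```

In a real training loop these quantities would be tensors averaged over a batch, and the penalty would need to be differentiable with respect to the model parameters for gradients to flow through it.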

🤝 Contributing

Contributions are warmly welcome! Please follow these steps:

  1. Fork the repository and create your branch: git checkout -b feature/my-feature
  2. Install dev dependencies: pip install -e ".[dev]"
  3. Write tests for any new functionality in tests/
  4. Run the test suite: pytest tests/ -v
  5. Format your code: black . && isort .
  6. Open a Pull Request with a clear description of your changes

Please read CONTRIBUTING.md for our code of conduct and detailed guidelines.


📄 Citation

If you use SynthGen in your research, please cite both this repository and the foundational synthesizability benchmarking work that motivated it:

@article{gao2024synthesizability,
  title   = {Synthesizability of molecules proposed by generative models},
  author  = {Gao, Wenhao and Mercado, Roc\'io and Coley, Connor W.},
  journal = {Proceedings of the National Academy of Sciences},
  year    = {2024},
  volume  = {121},
  number  = {3},
  pages   = {e2319709121},
  doi     = {10.1073/pnas.2319709121}
}

@software{omar2026synthgen,
  title  = {SynthGen: Synthesizability-Constrained Molecular Generator},
  author = {Omar, Islam},
  year   = {2026},
  url    = {https://github.com/Islamomar-1/SynthGen}
}

📜 License

This project is licensed under the MIT License — see LICENSE for details.
