🧪 SynthGen: Synthesizability-Constrained Molecular Generator

A graph-based Variational Autoencoder (VAE) for de novo generation of drug-like molecules. A synthesizability penalty (SAScore) is built directly into the training loss, steering generated candidates toward molecules that are not just novel but practical to make in the lab.

🎯 Motivation

Generative models for molecular design score well on novelty and diversity metrics, yet a persistent gap remains between in silico candidates and molecules that can be synthesized in practice. Gao et al. (PNAS 2024) from the Coley group systematically benchmarked the synthesizability of AI-generated molecules and found that many state-of-the-art generators produce compounds with prohibitively complex synthetic routes. SynthGen addresses this problem by:

Encoding molecular graphs directly (no SMILES tokenization loss).
Penalizing high Synthetic Accessibility Scores (SAScore) during training via an augmented ELBO.
Providing a plug-and-play benchmark suite to track synthesizability alongside standard generative metrics.

✨ Features

Graph-VAE backbone: message-passing encoder with graph-level readout; MLP decoder to SMILES/graph.
SAScore latent penalty: differentiable SA penalty added to the loss alongside the KL term; its weight (λ in the loss) is tunable via --sa_weight.
Multi-objective generation: Pareto front sampling over QED, logP, and SAScore.
Constrained generation: scaffold-conditioned sampling for lead optimisation.
Benchmark suite: validity, uniqueness, novelty, FCD, SAScore distribution, SYBA, and ASKCOS route-length metrics.
Reproducible splits: canonical ZINC250k / GuacaMol / ChEMBL split scripts included.
Notebook tutorials: end-to-end Colab-ready walkthroughs in notebooks/.
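
The multi-objective feature listed above rests on Pareto dominance. The sketch below is a framework-free illustration of that idea, not SynthGen's sampler: `dominates`, `pareto_front`, the candidate names, and the logP target of 2.5 are all hypothetical, and every objective is sign-flipped so that larger is better.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximised)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names whose score vectors no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical candidates scored as (QED, -SAScore, -|logP - 2.5|).
# Negating SAScore and the logP distance turns "lower is better" into "higher is better".
candidates = {
    "mol_a": (0.80, -2.1, -0.3),
    "mol_b": (0.75, -2.5, -0.5),  # worse than mol_a on all three objectives
    "mol_c": (0.60, -1.8, -0.1),
    "mol_d": (0.90, -3.9, -1.2),
}

front = pareto_front(candidates)
print(front)
```

Here mol_b drops out because mol_a beats it on every objective, while the other three each win on at least one axis and stay on the front. The pairwise check is quadratic in the number of candidates, which is acceptable for a few thousand molecules but would need a faster algorithm beyond that.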

🗂 Repository Structure

SynthGen/
├── data/
│   ├── raw/                  # Original SMILES datasets (ZINC250k, GuacaMol, …)
│   └── processed/            # Pre-processed PyG graph tensors (.pt)
├── models/
│   ├── __init__.py
│   ├── encoder.py            # GNN message-passing encoder
│   ├── decoder.py            # Graph / SMILES decoder
│   └── vae.py                # Full VAE + SAScore-augmented ELBO
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_training_demo.ipynb
│   └── 03_generation_and_benchmark.ipynb
├── tests/
│   ├── test_model.py
│   ├── test_metrics.py
│   └── test_data.py
├── train.py                  # Training entry-point
├── evaluate.py               # Generation & benchmark entry-point
├── setup.py
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md

🛠 Stack

Component	Library
Deep learning	PyTorch 2.0+
Graph neural networks	PyTorch Geometric 2.3+
Cheminformatics	RDKit 2023.09+
Synthesizability	SAScore (RDKit contrib), SYBA
Experiment tracking	Weights & Biases
CLI	argparse / rich
Data	pandas, numpy
Visualization	matplotlib, seaborn, py3Dmol

📦 Installation

# 1. Clone
git clone https://github.com/your-username/SynthGen.git
cd SynthGen

# 2. Create environment
conda create -n synthgen python=3.10 -y
conda activate synthgen

# 3. Install PyTorch (CUDA 11.8 example)
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# 4. Install PyTorch Geometric
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv \
    -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

# 5. Install remaining dependencies
pip install -e .

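CPU-only? Replace cu118 with cpu in both wheel URLs: use https://download.pytorch.org/whl/cpu for PyTorch and https://data.pyg.org/whl/torch-2.1.0+cpu.html for the PyG extension wheels.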
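
After installing, one quick way to confirm that the core packages are visible to the interpreter is sketched below. It uses only the standard library. The import names (torch, torch_geometric, rdkit) are assumed from the Stack table, and `check_environment` is a hypothetical helper, not part of SynthGen.

```python
import importlib.util

# Top-level import names assumed from the Stack table.
REQUIRED = ("torch", "torch_geometric", "rdkit")


def check_environment(modules=REQUIRED):
    """Map each top-level module name to True if it can be located.

    find_spec only looks the module up; it does not import it, so the
    check stays fast even for heavy packages.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:16s} {'ok' if found else 'MISSING'}")
```

A MISSING entry usually means the package was installed into a different environment than the active one.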

🚀 Quick Start

Prepare data

# Download ZINC250k and build PyG graphs
python -m synthgen.data.prepare --dataset zinc250k --out data/processed/zinc250k.pt

Train

python train.py \
    --data data/processed/zinc250k.pt \
    --epochs 100 \
    --latent_dim 256 \
    --sa_weight 0.5 \
    --batch_size 128 \
    --lr 1e-3 \
    --checkpoint_dir checkpoints/
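
The README does not show how train.py parses these flags. As an illustration only, an argparse parser that accepts the same flags could look like the sketch below. `build_parser`, the help strings, and the defaults (copied from the example command above) are assumptions rather than the project's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the training command."""
    p = argparse.ArgumentParser(description="Train a SAScore-penalised graph VAE (illustrative)")
    p.add_argument("--data", required=True, help="path to processed PyG graphs (.pt)")
    p.add_argument("--epochs", type=int, default=100)
    p.add_argument("--latent_dim", type=int, default=256)
    p.add_argument("--sa_weight", type=float, default=0.5, help="weight on the SAScore penalty")
    p.add_argument("--batch_size", type=int, default=128)
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--checkpoint_dir", default="checkpoints/")
    return p


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--data", "data/processed/zinc250k.pt", "--sa_weight", "1.0"]
)
print(args.sa_weight, args.latent_dim)
```

Raising --sa_weight puts more pressure on synthesizability at the expense of the reconstruction and KL terms, so it is worth sweeping a few values rather than fixing one.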

Generate & Evaluate

python evaluate.py \
    --checkpoint checkpoints/best.pt \
    --n_samples 10000 \
    --output results/generated.csv

📊 Benchmark Results

Results on the ZINC250k held-out set (10 000 generated molecules):

Model	Validity (%) ↑	Uniqueness (%) ↑	Novelty (%) ↑	FCD ↓	Mean SAScore ↓	SA ≤ 3 (%) ↑
JT-VAE	100.0	99.9	99.9	4.51	3.24	61.2
GCPN	100.0	99.6	99.8	4.20	3.41	58.7
MolGAN	98.1	10.4	99.9	16.5	3.87	49.3
GraphAF	100.0	99.1	99.9	5.96	3.18	62.8
SynthGen (ours)	99.7	99.8	99.6	3.81	2.61	78.4

SAScore ranges from 1 (easy to synthesize) to 10 (hard to synthesize); compounds scoring ≤ 3 are treated here as readily synthesizable (Ertl & Schuffenhauer, 2009).
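
The README does not define how the table columns are computed. The sketch below assumes the commonly used definitions: validity over all samples, uniqueness over valid samples, novelty over unique samples, and SAScore statistics over unique valid samples. `benchmark_summary` is a hypothetical helper. It expects SMILES that are already canonicalised (for example with RDKit) and precomputed SAScore values, and the scores in the example are made up.

```python
from typing import Dict, Iterable, List, Optional


def benchmark_summary(
    generated: List[Optional[str]],  # canonical SMILES, or None where parsing failed
    training: Iterable[str],         # canonical SMILES of the training set
    sa_scores: Dict[str, float],     # SAScore for each valid generated SMILES
    sa_cutoff: float = 3.0,
) -> Dict[str, float]:
    """Percent validity, uniqueness, novelty and SAScore statistics.

    Assumes at least one valid molecule; no zero-division guard is included.
    """
    valid = [s for s in generated if s is not None]
    unique = set(valid)
    novel = unique - set(training)
    scored = [sa_scores[s] for s in unique]
    return {
        "validity": 100.0 * len(valid) / len(generated),
        "uniqueness": 100.0 * len(unique) / len(valid),
        "novelty": 100.0 * len(novel) / len(unique),
        "mean_sa": sum(scored) / len(scored),
        "pct_sa_le_cutoff": 100.0 * sum(x <= sa_cutoff for x in scored) / len(scored),
    }


# Toy input: five samples, one invalid, one duplicate, one seen in training.
summary = benchmark_summary(
    generated=["CCO", "CCO", "c1ccccc1", None, "CC(=O)O"],
    training=["CCO"],
    sa_scores={"CCO": 2.0, "c1ccccc1": 1.0, "CC(=O)O": 3.6},  # made-up values
)
print(summary)
```

Whether SAScore statistics are taken over all valid samples or only unique ones changes the numbers slightly, so the choice should be stated alongside any reported table.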

🧬 Model Architecture

Molecule Graph
      │
      ▼
 GNN Encoder  (5 × GINEConv layers, hidden=512)
      │
  μ , log σ²  (latent_dim=256)
      │
  z ~ N(μ, σ²)
      │
   Decoder  (MLP → atom/bond logits → SMILES)
      │
   Output SMILES

Loss = Reconstruction + β·KL + λ·SAScore_penalty
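
To make the loss line above concrete, here is a minimal numeric sketch in plain Python rather than PyTorch. It assumes a diagonal Gaussian posterior with a standard normal prior, which gives the usual closed-form KL term. The README does not say how the SAScore penalty is made differentiable, so it is treated here as a scalar that is already computed. The function names and the default β = 1.0 are assumptions.

```python
import math
import random
from typing import Sequence


def reparameterize(mu: Sequence[float], logvar: Sequence[float], rng: random.Random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) and sigma = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu: Sequence[float], logvar: Sequence[float]) -> float:
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def total_loss(recon: float, mu, logvar, sa_penalty: float,
               beta: float = 1.0, sa_weight: float = 0.5) -> float:
    """Reconstruction + beta * KL + sa_weight * SAScore penalty."""
    return recon + beta * kl_to_standard_normal(mu, logvar) + sa_weight * sa_penalty


mu, logvar = [0.5, -0.5], [0.0, 0.0]
z = reparameterize(mu, logvar, random.Random(0))  # one latent sample
kl = kl_to_standard_normal(mu, logvar)
loss = total_loss(recon=1.2, mu=mu, logvar=logvar, sa_penalty=0.8)
print(z, kl, loss)
```

With unit variance the KL term reduces to half the squared norm of μ, which is 0.25 for this example, and the penalty contributes 0.5 × 0.8 on top of the reconstruction term.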

🤝 Contributing

Contributions are warmly welcome! Please follow these steps:

1. Fork the repository and create your branch: git checkout -b feature/my-feature
2. Install dev dependencies: pip install -e ".[dev]"
3. Write tests for any new functionality in tests/
4. Run the test suite: pytest tests/ -v
5. Format your code: black . && isort .
6. Open a Pull Request with a clear description of your changes

Please read CONTRIBUTING.md for our code of conduct and detailed guidelines.

📄 Citation

If you use SynthGen in your research, please cite both this repository and the foundational synthesizability benchmarking work that motivated it:

@article{gao2024synthesizability,
  title   = {Synthesizability of molecules proposed by generative models},
  author  = {Gao, Wenhao and Mercado, Roc\'io and Coley, Connor W.},
  journal = {Proceedings of the National Academy of Sciences},
  year    = {2024},
  volume  = {121},
  number  = {3},
  pages   = {e2319709121},
  doi     = {10.1073/pnas.2319709121}
}

@software{omar2026synthgen,
  title  = {SynthGen: Synthesizability-Constrained Molecular Generator},
  author = {Omar, Islam},
  year   = {2026},
  url    = {https://github.com/Islamomar-1/SynthGen}
}

📜 License

This project is licensed under the MIT License — see LICENSE for details.
