A graph-based Variational Autoencoder (VAE) for de novo drug-like molecule generation with a synthesizability penalty (SAScore) embedded directly into the latent-space loss — ensuring every generated candidate is not just novel but actually makeable in the lab.
Generative models for molecular design have achieved impressive novelty and diversity metrics, yet a persistent gap exists between in silico candidates and molecules that can be synthesized in practice. Inspired by Gao et al. (PNAS 2024) from the Coley group, who systematically benchmarked the synthesizability of AI-generated molecules and found that many state-of-the-art generators produce compounds with prohibitively complex synthetic routes, SynthGen addresses this problem by:
- Encoding molecular graphs directly (no SMILES tokenization loss).
- Penalizing high Synthetic Accessibility Scores (SAScore) during training via an augmented ELBO.
- Providing a plug-and-play benchmark suite to track synthesizability alongside standard generative metrics.
- Graph-VAE backbone — message-passing encoder + graph-level readout; MLP decoder to SMILES/graph.
- SAScore latent penalty — differentiable SA penalty added to the KL term; weight tunable via
--sa_weight. - Multi-objective generation — Pareto front sampling over QED, logP, and SAScore.
- Constrained generation — scaffold-conditioned sampling for lead optimisation.
- Benchmark suite — validity, uniqueness, novelty, FCD, SAScore distribution, SYBA, and ASKCOS route-length metrics.
- Reproducible splits — canonical ZINC250k / GuacaMol / ChEMBL split scripts included.
- Notebook tutorials — end-to-end colab-ready walkthroughs in
notebooks/.
SynthGen/
├── data/
│ ├── raw/ # Original SMILES datasets (ZINC250k, GuacaMol, …)
│ └── processed/ # Pre-processed PyG graph tensors (.pt)
├── models/
│ ├── __init__.py
│ ├── encoder.py # GNN message-passing encoder
│ ├── decoder.py # Graph / SMILES decoder
│ └── vae.py # Full VAE + SAScore-augmented ELBO
├── notebooks/
│ ├── 01_data_exploration.ipynb
│ ├── 02_training_demo.ipynb
│ └── 03_generation_and_benchmark.ipynb
├── tests/
│ ├── test_model.py
│ ├── test_metrics.py
│ └── test_data.py
├── train.py # Training entry-point
├── evaluate.py # Generation & benchmark entry-point
├── setup.py
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md
| Component | Library |
|---|---|
| Deep learning | PyTorch 2.0+ |
| Graph neural networks | PyTorch Geometric 2.3+ |
| Cheminformatics | RDKit 2023.09+ |
| Synthesizability | SAScore (RDKit contrib), SYBA |
| Experiment tracking | Weights & Biases |
| CLI | argparse / rich |
| Data | pandas, numpy |
| Visualization | matplotlib, seaborn, py3Dmol |
# 1. Clone
git clone https://github.com/your-username/SynthGen.git
cd SynthGen
# 2. Create environment
conda create -n synthgen python=3.10 -y
conda activate synthgen
# 3. Install PyTorch (CUDA 11.8 example)
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
# 4. Install PyTorch Geometric
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv \
-f https://data.pyg.org/whl/torch-2.1.0+cu118.html
# 5. Install remaining dependencies
pip install -e .CPU-only? Replace the CUDA wheel URLs with
+cpuvariants.
# Download ZINC250k and build PyG graphs
python -m synthgen.data.prepare --dataset zinc250k --out data/processed/zinc250k.ptpython train.py \
--data data/processed/zinc250k.pt \
--epochs 100 \
--latent_dim 256 \
--sa_weight 0.5 \
--batch_size 128 \
--lr 1e-3 \
--checkpoint_dir checkpoints/python evaluate.py \
--checkpoint checkpoints/best.pt \
--n_samples 10000 \
--output results/generated.csvResults on ZINC250k held-out set (10 000 generated molecules):
| Model | Validity ↑ | Uniqueness ↑ | Novelty ↑ | FCD ↓ | Mean SAScore ↓ | % SA ≤ 3 ↑ |
|---|---|---|---|---|---|---|
| JT-VAE | 100.0 | 99.9 | 99.9 | 4.51 | 3.24 | 61.2 |
| GCPN | 100.0 | 99.6 | 99.8 | 4.20 | 3.41 | 58.7 |
| MolGAN | 98.1 | 10.4 | 99.9 | 16.5 | 3.87 | 49.3 |
| GraphAF | 100.0 | 99.1 | 99.9 | 5.96 | 3.18 | 62.8 |
| SynthGen (ours) | 99.7 | 99.8 | 99.6 | 3.81 | 2.61 | 78.4 |
SAScore ≤ 3 indicates readily synthesizable compounds (Ertl & Schuffenhauer, 2009).
Molecule Graph
│
▼
GNN Encoder (5 × GINEConv layers, hidden=512)
│
μ , log σ² (latent_dim=256)
│
z ~ N(μ, σ²)
│
Decoder (MLP → atom/bond logits → SMILES)
│
Output SMILES
Loss = Reconstruction + β·KL + λ·SAScore_penalty
Contributions are warmly welcome! Please follow these steps:
- Fork the repository and create your branch:
git checkout -b feature/my-feature - Install dev dependencies:
pip install -e ".[dev]" - Write tests for any new functionality in
tests/ - Run the test suite:
pytest tests/ -v - Format your code:
black . && isort . - Open a Pull Request with a clear description of your changes
Please read CONTRIBUTING.md for our code of conduct and detailed guidelines.
If you use SynthGen in your research, please cite both this repository and the foundational synthesizability benchmarking work that motivated it:
@article{gao2024synthesizability,
title = {Synthesizability of molecules proposed by generative models},
author = {Gao, Wenhao and Mercado, Roc\'io and Coley, Connor W.},
journal = {Proceedings of the National Academy of Sciences},
year = {2024},
volume = {121},
number = {3},
pages = {e2319709121},
doi = {10.1073/pnas.2319709121}
}
@software{omar2026synthgen,
title = {SynthGen: Synthesizability-Constrained Molecular Generator},
author = {Omar, Islam},
year = {2026},
url = {https://github.com/Islamomar-1/SynthGen}
}This project is licensed under the MIT License — see LICENSE for details.