Pareto-optimal molecule discovery via qNEHVI Bayesian optimization over large molecular libraries.
Drug discovery demands simultaneous optimization of conflicting properties: high binding affinity must be balanced against metabolic stability, low toxicity, aqueous solubility, and synthetic accessibility. Classical single-objective optimization fails here, because optimizing one property almost always degrades another.
MolBO addresses this by framing molecular design as a multi-objective Bayesian optimization (MOBO) problem. Building on the framework of Fromer & Coley (2024), we implement the q-Noisy Expected Hypervolume Improvement (qNEHVI) acquisition function from BoTorch over molecular fingerprint spaces, enabling sample-efficient discovery of the full Pareto front across any set of molecular objectives.
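
To make the acquisition target concrete: qNEHVI ranks candidate batches by how much they are expected to enlarge the hypervolume dominated by the current Pareto front. The sketch below is a deterministic, two-objective toy in plain Python, not MolBO's or BoTorch's implementation. The helper names (`dominates`, `pareto_front`, `hypervolume_2d`, `hypervolume_improvement`) and the example points are illustrative assumptions, and both objectives are assumed to be maximized.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: Sequence[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded below by the reference point `ref`."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective down; the second objective then increases.
    front = sorted(set(front), key=lambda p: p[0], reverse=True)
    area, prev_y = 0.0, ref[1]
    for x, y in front:
        area += (x - ref[0]) * (y - prev_y)  # horizontal strip added by this point
        prev_y = y
    return area


def hypervolume_improvement(candidate: Point, observed: Sequence[Point], ref: Point) -> float:
    """Gain in dominated area if `candidate` were added to `observed`."""
    return hypervolume_2d(list(observed) + [candidate], ref) - hypervolume_2d(observed, ref)


# Toy data (assumed values, two maximized objectives scaled to [0, 1])
observed = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2), (0.3, 0.3)]
ref = (0.0, 0.0)

print(pareto_front(observed))
print(hypervolume_improvement((0.6, 0.6), observed, ref))  # extends the front
print(hypervolume_improvement((0.4, 0.4), observed, ref))  # dominated candidate
```

A dominated candidate yields zero improvement, which is why an acquisition function built on this quantity steers sampling toward molecules that extend the front. qNEHVI itself takes the expectation of this improvement under the GP posterior rather than evaluating it on known outcomes.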
- qNEHVI acquisition: batch-parallel, noisy multi-objective BO with state-of-the-art sample efficiency
- RDKit integration: Morgan fingerprints, property oracles (logP, QED, SA score, Tanimoto similarity)
- Pareto front tracking: per-iteration hypervolume computation and front visualization
- Modular oracle API: plug in any scoring function (docking scores, ADMET predictors, ML surrogates)
- Jupyter walkthrough: end-to-end notebook with visualizations and ablations
- Full test suite: pytest coverage for all core modules
- GPU-ready: automatic CUDA detection via PyTorch
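
As a quick illustration of the Tanimoto similarity mentioned in the feature list, the sketch below computes it on fingerprints represented as sets of on-bit indices. It is plain Python and does not call RDKit. The function name `tanimoto`, the example bit indices, and the choice to return 0.0 for two empty fingerprints are assumptions made for this example.

```python
from typing import Iterable


def tanimoto(bits_a: Iterable[int], bits_b: Iterable[int]) -> float:
    """Tanimoto (Jaccard) similarity: shared on-bits divided by total distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    return len(a & b) / union


# Hypothetical on-bit indices for two molecules in a 2048-bit fingerprint
fp_query = {1, 5, 9, 42}
fp_candidate = {5, 9, 77}

print(tanimoto(fp_query, fp_candidate))  # 2 shared bits out of 5 distinct bits
```

Used as an objective, a score like this rewards candidates that stay close to a reference molecule, so it fits the "higher = better" convention without further transformation.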

```
MolBO/
├── data/
│   ├── raw/                  # Input molecular libraries (SMILES files)
│   └── processed/            # Featurized fingerprint tensors
├── models/
│   ├── __init__.py
│   ├── gp_model.py           # Batched GP surrogate (BoTorch)
│   └── fingerprints.py       # Morgan fingerprint featurizer (RDKit)
├── notebooks/
│   └── 01_quickstart.ipynb   # End-to-end walkthrough
├── tests/
│   ├── test_fingerprints.py
│   ├── test_gp_model.py
│   ├── test_optimize.py
│   └── test_evaluate.py
├── optimize.py               # Main Pareto optimization loop (qNEHVI)
├── evaluate.py               # Oracle definitions & Pareto metrics
├── requirements.txt
├── setup.py
├── .gitignore
└── README.md
```

```bash
# Clone the repository
git clone https://github.com/Islamomar-1/MolBO.git
cd MolBO

# Create and activate a conda environment
conda create -n molbo python=3.10
conda activate molbo

# Install PyTorch (CUDA 11.8 wheels), RDKit, and MolBO itself
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
conda install -c conda-forge rdkit
pip install -e .
```

```bash
python optimize.py \
    --library data/raw/zinc_10k.smi \
    --objectives qed logp sa \
    --n_init 50 \
    --n_iter 20 \
    --batch_size 5 \
    --output results/pareto_run1.json
```

```python
from models.fingerprints import MorganFeaturizer
from evaluate import ObjectiveOracle
from optimize import ParetoOptimizer

# Load a SMILES library (one SMILES string per line)
with open("data/raw/zinc_10k.smi") as f:
    smiles_list = f.read().splitlines()

# Define objectives (any callable returning a float, higher = better)
oracle = ObjectiveOracle(objectives=["qed", "logp", "sa_score"])

# Featurize
featurizer = MorganFeaturizer(radius=2, n_bits=2048)
X = featurizer.transform(smiles_list)   # (N, 2048) tensor
Y = oracle.evaluate_batch(smiles_list)  # (N, n_obj) tensor

# Run Pareto optimization
optimizer = ParetoOptimizer(
    X=X,
    Y=Y,
    smiles=smiles_list,
    n_iter=20,
    batch_size=5,
)
pareto_smiles, pareto_scores = optimizer.run()
print(f"Pareto front size: {len(pareto_smiles)}")
```

Results on the ZINC 10k subset, 20 BO iterations × batch size 5 (100 oracle calls total). Hypervolume (HV) is normalized by the analytic maximum.

| Method | HV (↑) | Pareto Set Size | Oracle Calls |
|---|---|---|---|
| Random | 0.41 | 12 | 100 |
| NSGA-II | 0.59 | 18 | 100 |
| Single-obj BO (QED) | 0.52 | 9 | 100 |
| MolBO (qNEHVI) | 0.78 | 31 | 100 |
| MolBO (qEHVI) | 0.74 | 27 | 100 |
| MolBO (qParEGO) | 0.69 | 22 | 100 |

Objectives: QED, logP ∈ [2, 5], SA Score (inverted). Mean ± std over 5 seeds.

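
The objective list above names a logP window and an inverted SA score without spelling out how they are turned into "higher = better" values. The sketch below shows one plausible encoding, offered only as an assumption: a score of 1.0 inside the logP window that falls off linearly outside it, and a linear rescaling of the SA score (which runs from 1, easy to synthesize, to 10, hard to synthesize) onto [0, 1]. The function names and the linear fall-off are hypothetical and may differ from what `evaluate.py` does.

```python
def logp_window_score(logp: float, low: float = 2.0, high: float = 5.0) -> float:
    """1.0 inside [low, high]; decreases linearly with distance outside, floored at 0."""
    if low <= logp <= high:
        return 1.0
    distance = (low - logp) if logp < low else (logp - high)
    return max(0.0, 1.0 - distance)


def inverted_sa(sa_score: float) -> float:
    """Rescale SA score (1 = easy, 10 = hard) to [0, 1] so that higher = easier to make."""
    clipped = min(max(sa_score, 1.0), 10.0)
    return (10.0 - clipped) / 9.0


# Hypothetical raw property values for three molecules: (logP, SA score)
raw = [(3.1, 2.4), (5.5, 5.5), (0.0, 9.0)]
transformed = [(logp_window_score(lp), inverted_sa(sa)) for lp, sa in raw]
print(transformed)
```

Putting every objective on a maximization scale like this lets a single Pareto routine and a single reference point serve all of them.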
| Component | Library / Tool |
|---|---|
| Bayesian Optimization | BoTorch |
| GP & Tensors | PyTorch |
| Molecular Features | RDKit |
| Acquisition Fn | qNEHVI (BoTorch built-in) |
| Property Oracles | RDKit, SA Score (Ertl) |
| Visualization | Matplotlib, Plotly |
| Testing | pytest |
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Write tests for any new functionality in `tests/`
- Run the test suite: `pytest tests/ -v`
- Submit a pull request with a clear description

To add a custom oracle, implement the `BaseOracle` interface in `evaluate.py`:

```python
from rdkit import Chem  # needed for MolFromSmiles


class MyOracle(BaseOracle):
    name = "my_property"

    def __call__(self, smiles: str) -> float:
        mol = Chem.MolFromSmiles(smiles)
        # ... compute and return float (higher = better)
        return score
```

Then pass `"my_property"` to `ObjectiveOracle(objectives=[..., "my_property"])`.
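
For a version that runs without RDKit, the sketch below fills in the template with a toy property. The `BaseOracle` class here is a stand-in that only mimics the interface shown above (a `name` attribute plus `__call__`); the real base class lives in `evaluate.py` and may carry more behavior. `SizePenaltyOracle`, its `max_len` parameter, and the use of SMILES string length as a crude size proxy are all hypothetical.

```python
from abc import ABC, abstractmethod


class BaseOracle(ABC):
    """Stand-in for the BaseOracle in evaluate.py (assumed interface only)."""

    name: str = "base"

    @abstractmethod
    def __call__(self, smiles: str) -> float:
        """Return a score for one SMILES string (higher = better)."""


class SizePenaltyOracle(BaseOracle):
    """Hypothetical oracle that penalizes long SMILES strings."""

    name = "size_penalty"

    def __init__(self, max_len: int = 60) -> None:
        self.max_len = max_len

    def __call__(self, smiles: str) -> float:
        # Length is lower-is-better, so negate it to match the higher-is-better convention.
        return -min(len(smiles), self.max_len) / self.max_len


oracle = SizePenaltyOracle()
for smi in ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]:
    print(smi, oracle(smi))
```

The point to carry over is the sign convention: any property where smaller is better should be negated or inverted inside `__call__` so that it can sit alongside the other objectives.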
If you use MolBO in your research, please cite:

```bibtex
@article{fromer2024computer,
  title   = {Computer-aided multi-objective optimization in small molecule discovery},
  author  = {Fromer, Jenna C and Coley, Connor W},
  journal = {Nature Computational Science},
  volume  = {4},
  pages   = {22--33},
  year    = {2024},
  doi     = {10.1038/s43588-023-00601-0}
}
```

And the BoTorch qNEHVI implementation:

```bibtex
@inproceedings{daulton2021parallel,
  title     = {Parallel Bayesian Optimization of Multiple Noisy Objectives with Expected Hypervolume Improvement},
  author    = {Daulton, Samuel and Balandat, Maximilian and Bakshy, Eytan},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2021}
}
```

MIT © Islam Omar