This repository contains the implementation of DumplingGNN, a hybrid Graph Neural Network designed to predict ADC (Antibody-Drug Conjugate) payload activity based on chemical structure.
DumplingGNN is a novel hybrid Graph Neural Network architecture that combines multiple GNN components:
- Message Passing Neural Networks (MPNN) for capturing local chemical interactions
- Graph Attention Networks (GAT) for identifying important substructures
- GraphSAGE for aggregating information across different scales
The model was developed to effectively capture multi-scale molecular features using both 2D topological and 3D structural information, providing accurate predictions and interpretable results.
model.py: Implementation of the DumplingGNN model and baseline models (MPNN, GCN, GAT, GraphSAGE)explainmodel.py: Extended implementation with explainability featurestrain_explainable.py: Training script for the explainable version of DumplingGNNbaselines.ipynb: Jupyter notebook for running baseline model comparisons
The repository contains two versions of DumplingGNN:
-
Standard Version (model.py):
- Basic implementation focused on prediction performance
- More efficient for training and inference
- Includes baseline model implementations
- Suitable for production deployment
-
Explainable Version (explainmodel.py):
- Enhanced with interpretability features
- Provides attention visualization capabilities
- Enables substructure identification
- Includes tools for analyzing attention patterns
- Better for research and analysis purposes
The repository includes several baseline models for comparison:
-
FIVEMPNN:
- Five-layer Message Passing Neural Network
- Focuses on local chemical interactions
- Serves as baseline for pure message-passing approach
-
GCN (Graph Convolutional Network):
- Five-layer implementation
- Standard graph convolution operations
- Baseline for basic graph learning
-
FiveLayerGAT:
- Five-layer Graph Attention Network
- Tests pure attention-based approach
- Baseline for attention mechanism effectiveness
-
FiveLayerSAGE:
- Five-layer GraphSAGE implementation
- Tests neighborhood aggregation strategy
- Baseline for sampling-based approaches
These baselines help evaluate the advantages of DumplingGNN's hybrid architecture.
- Clone this repository:
git clone https://github.com/Iayce/DumplingGnn.git
cd DumplingGnn- Install dependencies:
pip install -r requirements.txtNote: This requires Python 3.10 or higher.
The model expects molecular data in SDF format, with each molecule having a "label" property indicating its activity. The training script processes this data into graph representations suitable for the model.
The train_chembl.sdf dataset contains a curated collection of molecules with DNA Topoisomerase I inhibitory activity:
- Source: Molecules collected from ChEMBL database (CHEMBL1781)
- Size: 190 unique molecular structures after processing
- Activity Labels: Binary classification
- Positive: IC50 < 100 µM
- Negative: IC50 ≥ 100 µM
- Structure Information:
- 2D topological information (SMILES)
- 3D conformations generated using OpenBabel and MMFF94 force field
- Top 3 docking poses per molecule from NLDock
- Quality Control:
- Removed redundant conformers (RMSD cutoff 1.5 Å)
- Verified structural integrity
- Standardized activity measurements
To train the explainable version of DumplingGNN:
python train_explainable.pyBy default, this will:
- Load and process molecules from
train_chembl.sdf - Split them into training, validation, and test sets
- Train the model with early stopping
- Save the best model to
saved_models/best_model.pth - Display training metrics and performance on the test set
import torch
from explainmodel import ExplainableDumplingGNN
# Initialize the model
model = ExplainableDumplingGNN(hidden_channels=32)
# Load trained weights
model.load_state_dict(torch.load('saved_models/best_model.pth'))
# Set to evaluation mode
model.eval()
# Pass your data through the model
output = model(data)The explainable version of DumplingGNN allows you to analyze the attention weights to understand which parts of a molecule contribute most to its predicted activity:
from explainmodel import ExplainableDumplingGNNExplainer
from rdkit import Chem
# Initialize the explainer
explainer = ExplainableDumplingGNNExplainer(model)
# Load a molecule
mol = Chem.MolFromSmiles('SMILES_STRING')
# Analyze the molecule
results = explainer.analyze_case(data, mol)
# The results contain attention weights, important substructures, etc.To compare DumplingGNN with baseline models:
from model import FIVEMPNN, GCN, FiveLayerGAT, FiveLayerSAGE
# Initialize models
models = {
'MPNN': FIVEMPNN(in_channels=8, hidden_channels=32),
'GCN': GCN(hidden_channels=32, num_classes=2),
'GAT': FiveLayerGAT(in_channels=8, hidden_channels=32, out_channels=2),
'SAGE': FiveLayerSAGE(in_channels=8, hidden_channels=32, out_channels=2)
}
# Train and evaluate each model
for name, model in models.items():
train_model(model, train_loader)
results = evaluate_model(model, test_loader)
print(f"{name} Results:", results)This project is licensed under the MIT License - see the LICENSE file for details.
This work was supported by the National Key Research and Development Program of China under Grant 2023YFC3304501.