DeepGreenGO

A multilabel protein function prediction model for Viridiplantae (green plants) using Graph Neural Networks with ProtBERT embeddings.

Environment Setup

Option A — Conda (recommended)

conda env create -f environment.yml
conda activate deepgreengo

Note on PyTorch Geometric extras: After activating the env, install the C++ extension wheels matching your exact PyTorch + CUDA version from https://data.pyg.org/whl/:
# Example for torch 2.1.0 + CUDA 12.1:
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv \
    -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
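
Because the wheel index URL encodes both the PyTorch and CUDA versions, it can be derived instead of typed by hand. The following is a minimal sketch, assuming the index pages keep the torch-<version>+<tag>.html naming seen in the example above. pyg_wheel_url is a hypothetical helper and is not part of this repository.

```python
from typing import Optional


def pyg_wheel_url(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the data.pyg.org wheel index URL for a torch/CUDA pair.

    torch_version may carry a local suffix such as "2.1.0+cu121"; only the
    part before "+" is used. cuda_version is e.g. "12.1", or None for a
    CPU-only build.
    """
    base = torch_version.split("+")[0]
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{base}+{tag}.html"


# Literal values matching the example above.
print(pyg_wheel_url("2.1.0", "12.1"))
```

In an activated environment the two arguments can be read from torch.__version__ and torch.version.cuda.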

Option B — pip

# 1. Install PyTorch first (choose CUDA version at https://pytorch.org):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

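# 2. Install PyTorch Geometric. The -f URL must match the installed torch + CUDA
#    version; the one below is for torch 2.1.0 + CUDA 12.1: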
pip install torch-geometric
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv \
    -f https://data.pyg.org/whl/torch-2.1.0+cu121.html

# 3. Install remaining dependencies:
pip install -r requirements.txt

External tools (via conda)

conda install -c conda-forge -c bioconda mmseqs2  # Homology clustering
conda install -c bioconda blast                   # BLAST baseline (optional)
conda install -c bioconda diamond                 # DIAMOND baseline (optional)

Data Preparation

Place your downloaded Viridiplantae PDB structures (.cif.gz) in:

preprocessing/data/structure_files/

You also need the SIFTS annotation file and GO OBO file in preprocessing/data/.
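
Before launching a long run it can help to confirm that the structure directory is populated. This is a minimal sketch, assuming the .cif.gz files sit directly in structure_files/ rather than in subdirectories. count_structures is a hypothetical helper, and it does not check the SIFTS or OBO files because their file names are not given here.

```python
from pathlib import Path


def count_structures(root: str = "preprocessing/data") -> int:
    """Count .cif.gz files directly under <root>/structure_files."""
    struct_dir = Path(root) / "structure_files"
    if not struct_dir.is_dir():
        raise FileNotFoundError(f"missing directory: {struct_dir}")
    # Non-recursive on purpose: nested folders are an untested assumption.
    return sum(1 for p in struct_dir.iterdir() if p.name.endswith(".cif.gz"))
```

Calling count_structures() from the repository root returns the number of structures found, or raises FileNotFoundError if the directory is absent.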

Run the Full Pipeline

Before running, set your Hugging Face token as an environment variable to avoid rate limits and unauthenticated-download errors when fetching ProtBERT:

export HF_TOKEN="your_hf_token_here"

bash run_all.sh

The script will:

1. Extract sequences and build GO annotations from CIF files
2. Cluster sequences at 30% identity (MMseqs2) and split into Train/Valid/Test
3. Compute pLDDT-filtered contact maps and build PyG graph datasets
4. Run BLAST / DIAMOND / Naive baselines
5. Train all model ablations (MLP / GCN / GAT / Hybrid × BCE / Focal, 3 seeds, 3 ontologies)
6. Run per-cluster generalisation evaluation
7. Aggregate results and generate figures
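
The clustering step matters because a cluster-aware split keeps every member of a homology cluster in the same partition, so test proteins have no close homologues in the training set. The sketch below illustrates that idea and is not the code in cluster_and_split.py. The 80/10/10 fractions, the seed, and the function name are assumptions.

```python
import random
from collections import defaultdict


def cluster_aware_split(seq_to_cluster, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/valid/test so no cluster spans two splits.

    seq_to_cluster maps a sequence ID to its cluster representative ID
    (member -> representative, as can be read from an MMseqs2 cluster TSV).
    """
    clusters = defaultdict(list)
    for seq_id, cluster_id in seq_to_cluster.items():
        clusters[cluster_id].append(seq_id)

    cluster_ids = sorted(clusters)              # sort first for reproducibility
    random.Random(seed).shuffle(cluster_ids)

    total = len(seq_to_cluster)
    targets = [f * total for f in fractions]    # target sequence count per split
    splits = {"train": [], "valid": [], "test": []}
    names = list(splits)
    idx = 0
    for cid in cluster_ids:
        # Advance to the next split once the current one has reached its target.
        while idx < len(names) - 1 and len(splits[names[idx]]) >= targets[idx]:
            idx += 1
        splits[names[idx]].extend(clusters[cid])
    return splits
```

Because clusters are never divided, the resulting split sizes only approximate the requested fractions when cluster sizes are uneven.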

Skip flags

bash run_all.sh --skip-preprocess   # Preprocessing already done
bash run_all.sh --skip-ablations    # Only run preprocessing + baselines
bash run_all.sh --skip-plots        # Skip figure generation

Environment overrides

EPOCHS=50 BATCH_SIZE=16 MAIN_MODEL=GAT MAIN_LOSS=BCE bash run_all.sh

Hyperparameter Ablations

To run the automated hyperparameter sensitivity sweeps for the Hybrid model:

bash run_hyperparam_ablations.sh

Train a Single Model

python3 train.py \
    --model Hybrid \
    --loss  Focal  \
    --seed  42     \
    --ontology biological_process \
    --epochs 200
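
The Focal option down-weights labels the model already predicts well, which is useful when most GO terms are rare. The repository's focal_loss.py is not reproduced here. What follows is a framework-free sketch of the standard binary focal loss applied per label. The gamma and alpha values are common defaults from the literature, not necessarily this project's settings.

```python
import math


def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean focal loss over independent binary labels.

    probs: predicted probabilities in [0, 1]; targets: 0/1 labels.
    With gamma=0 and alpha=0.5 this equals half the binary cross-entropy.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p           # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma shrinks the contribution of easy examples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

Raising gamma focuses training further on hard, misclassified labels, while alpha balances positive against negative labels.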

Run Inference

python3 predictions.py \
    -struc_dir  examples/structure_files \
    -model_path runs/bp_Hybrid_Focal_s42/best_model.pth \
    -output     examples/my_predictions.csv
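
The model path above suggests run directories named <ontology>_<model>_<loss>_s<seed>. Assuming that pattern holds for every ablation, the full grid from the pipeline (4 models × 2 losses × 3 seeds × 3 ontologies) can be enumerated as sketched below. Only "bp" and seed 42 appear in this README, so the other abbreviations and seeds are placeholders.

```python
from itertools import product

MODELS = ["MLP", "GCN", "GAT", "Hybrid"]
LOSSES = ["BCE", "Focal"]
# "bp" comes from the path above; "mf" and "cc" are assumed abbreviations.
ONTOLOGIES = ["bp", "mf", "cc"]
# Only seed 42 is documented; 43 and 44 are placeholders for illustration.
SEEDS = [42, 43, 44]


def run_names(models=MODELS, losses=LOSSES, ontologies=ONTOLOGIES, seeds=SEEDS):
    """Return one run-directory name per (ontology, model, loss, seed) combination."""
    return [
        f"{o}_{m}_{l}_s{s}"
        for o, m, l, s in product(ontologies, models, losses, seeds)
    ]


names = run_names()
print(len(names))  # 4 models x 2 losses x 3 ontologies x 3 seeds = 72
```

A list like this is handy for checking which runs/ subdirectories already contain a best_model.pth before choosing one for inference.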

Project Structure

deep-green-GO/
├── preprocessing/
│   ├── extract_seqs_from_cif.py  # Sequence extraction + GO annotation
│   ├── cluster_and_split.py      # MMseqs2 clustering + cluster-aware split
│   ├── create_cmaps.py           # pLDDT-filtered contact maps
│   └── create_batch_dataset.py   # PyG graph dataset builder (ProtBERT)
├── baselines/
│   ├── blast/                    # BLASTp nearest-neighbour baseline
│   ├── diamond/                  # DIAMOND nearest-neighbour baseline
│   ├── naive_frequency/          # GO term frequency prior baseline
│   └── deepfri_comparison/       # Comparison notes vs DeepFRI
├── model.py                      # GCN / GAT / Hybrid / MLP architectures
├── train.py                      # Training script with early stopping
├── evals.py                      # Micro/Macro Fmax, Smin, AUROC, AUPRC
├── focal_loss.py                 # Focal loss implementation
├── per_cluster_eval.py           # Per homology-cluster generalisation eval
├── aggregate_results.py          # Aggregate runs into mean±std tables
├── plot_results.py               # Publication-quality figure generation
├── predictions.py                # Inference on new structures
├── run_all.sh                    # ONE-CLICK full pipeline
├── run_ablations.sh              # Ablation sweep helper
├── run_hyperparam_ablations.sh   # Hyperparameter sensitivity helper
├── generate_supp_tables.py       # LaTeX config table generator
├── environment.yml               # Conda environment
└── requirements.txt              # pip requirements
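
Of the metrics listed for evals.py, Fmax is the one least likely to be familiar outside the protein function prediction community. The sketch below follows the protein-centric definition used in CAFA-style evaluations: the best F1 over a sweep of score thresholds. It is an illustration under that assumption and may differ from the micro and macro variants implemented in evals.py.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax.

    y_true: list of sets of true GO terms, one set per protein.
    y_score: list of dicts {term: score in [0, 1]}, one dict per protein.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            if not truth:
                continue                      # skip proteins without annotations
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            recalls.append(hits / len(truth))
            if predicted:                     # precision only where something is predicted
                precisions.append(hits / len(predicted))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

Recall is averaged over all annotated proteins, whereas precision is averaged only over proteins with at least one prediction at the given threshold.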
