PredictPhagePPI

Predicting phage–host interactions from downsampled bacterial and bacteriophage genomes using machine learning.

Given a phage genome and a bacterial genome, the model predicts whether the phage can infect that bacterium. Each genome is downsampled into a fixed-size MinHash sketch (a subset of its hashed k-mers), either via Sourmash or a custom MurmurHash3 / xxHash implementation. The sketches are encoded as a binary presence matrix and concatenated to form the input vector of a Feed-Forward Neural Network (FFNN), which is trained and evaluated against a lab-tested host-range / EOP (efficiency of plating) interaction set.

Project overview

PredictPhagePPI/
├── raw_data/               # Raw FASTA genomes + interaction tables
├── data_prod/              # Produced data: sketches, presence matrices, results
├── nn_runs/                # Per-run output from FFNN training
├── tmp/                    # Temporary Slurm task files
├── notebooks/              # Exploratory Jupyter notebooks
├── scripts/                # Core pipeline scripts
│   └── slurm_iec/          # Iterative cluster-exclusion Slurm pipeline
├── Makefile                # Environment setup helpers
└── README.md

Data

Raw data (`raw_data/`)

Two datasets are supported, referred to throughout the code as data1 (default) and data2.

Path	Description
`raw_data/phagehost_KU/bacteriaKU_cleaned.fasta`	Bacterial genomes (data1)
`raw_data/phagehost_KU/phage_cleaned.fasta`	Phage genomes (data1)
`raw_data/phagehost_KU/Hostrange_data_all_crisp_iso.xlsx`	Lab-tested phage–host interactions (EOP / binary) for data1
`raw_data/phagehost_KU/data2_bacts.fasta`	Bacterial genomes (data2)
`raw_data/phagehost_KU/data2_phages.fasta`	Phage genomes (data2)
`raw_data/phagehost_KU/data2_EOP.xlsx`	EOP interaction table for data2
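
The interaction tables above hold either EOP values or binary outcomes. As a rough illustration of how an EOP value could be turned into a binary infection label, the sketch below applies a simple cut-off. The function name `binarise_eop`, the default threshold of 0, and the treatment of untested pairs as `None` are assumptions made for this example and may differ from what `manipulations.py` does.

```python
from typing import Optional


def binarise_eop(eop: Optional[float], threshold: float = 0.0) -> Optional[int]:
    """Map an EOP value to 1 (infection) or 0 (no infection).

    Assumption: any EOP strictly above `threshold` counts as infection,
    and untested pairs (None) are passed through unchanged.
    """
    if eop is None:
        return None
    return 1 if eop > threshold else 0


eop_values = [1.0, 0.0, None, 0.002]
labels = [binarise_eop(v) for v in eop_values]
print(labels)  # [1, 0, None, 1]
```

A stricter threshold (for example 0.1) would count only productive infections as positives; which cut-off is appropriate depends on the assay.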

Produced data (`data_prod/`)

Path	Description
`SM_sketches/`	Sourmash MinHash sketches (`BactMinhash_n<N>_k<K>/`, `PhageMinhash_n<N>_k<K>/`). MurmurHash3 / xxHash sketches use the same structure, with the "SM" prefix replaced by "encoded"
`SM_sketches/sim_matrices/`	Pairwise Jaccard similarity matrices + cluster assignment CSVs
`PresMat_*/`	Precomputed binary presence matrices used as NN input
`IterExclClus_*/`	Collected results from iterative cluster-exclusion runs
`NN_files/`	Train/test split definitions and saved model files

Pipeline

Step 1 — Genome downsampling

Converts raw FASTA files into MinHash sketches. Each genome is decomposed into k-mers of length k, every k-mer is hashed, and the n smallest hash values are retained.
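
The sketching idea described above can be illustrated with a minimal bottom-n MinHash in plain Python. This is a sketch under stated assumptions, not the code in `downsampling.py`: BLAKE2b from the standard library stands in for MurmurHash3 / xxHash, reverse-complement canonicalisation (which Sourmash applies to DNA) is left out, and the function names are hypothetical.

```python
import hashlib


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of length k (sequence is upper-cased first)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer: str) -> int:
    """Stand-in 64-bit hash (BLAKE2b); an assumption, not the pipeline's hash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_n_sketch(seq: str, n: int, k: int) -> list[int]:
    """Return the n smallest distinct k-mer hashes, sorted ascending."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:n]


sketch = bottom_n_sketch("ACGTACGTTAGCCGATAGCTAGGCTA", n=5, k=4)
print(len(sketch))  # 5
```

Because the same hash function and the same n are used for every genome, two sketches can be compared directly, which is what makes the fixed-size representation usable as model input.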

# Sourmash-backed sketches (default)
python scripts/downsampling.py --nk 500 12 --method sourmash

# Custom MurmurHash3 sketches
python scripts/downsampling.py --nk 500 12 --method minhash --hash mmh3

# data2 dataset
python scripts/downsampling.py --nk 500 12 --method sourmash --data2

# Batch downsampling across many n/k combinations (via Slurm)
bash scripts/queue_batch_downsampling.sh

For bacterial genomes with multi-contig assemblies, use bact_downsampling.py (wraps downsampling.py with per-contig merging).

Key parameters:

Flag	Description
`--nk N K`	Sketch size N and k-mer length K (applied to both bacteria and phages)
`--split_nk BN BK PN PK`	Independent sketch parameters for bacteria and phages
`--method`	`sourmash` (default), `minhash`, or `ohe` (one-hot encoding)
`--hash`	Hash function: `mmh3` (default), `xxhash`, or `ohe_custom`
`--data2`	Use the data2 dataset

Output lands in data_prod/SM_sketches/ (or SM_sketches_data2/).

Step 2 — Cluster analysis & data composition

Computes pairwise Jaccard similarity between sketches and clusters genomes for use in the iterative-exclusion pipeline.
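
To make the similarity-then-cluster step concrete, the example below computes Jaccard similarity between hash sets and groups genomes whose similarity reaches a threshold. It uses single-linkage grouping via union-find purely for illustration; the clustering method, threshold, and names here are assumptions and are not taken from `compare_sigs.py`.

```python
from itertools import combinations


def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_clusters(sketches: dict[str, set[int]], threshold: float) -> dict[str, int]:
    """Single-linkage grouping: genomes linked by similarity >= threshold share a cluster ID."""
    parent = {name: name for name in sketches}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(sketches, 2):
        if jaccard(sketches[a], sketches[b]) >= threshold:
            parent[find(a)] = find(b)

    roots: dict[str, int] = {}
    return {name: roots.setdefault(find(name), len(roots)) for name in sketches}


sketches = {
    "bactA": {1, 2, 3, 4},
    "bactB": {2, 3, 4, 5},
    "bactC": {10, 11, 12, 13},
}
clusters = threshold_clusters(sketches, threshold=0.5)
print(clusters)  # {'bactA': 0, 'bactB': 0, 'bactC': 1}
```

Here bactA and bactB share 3 of 5 distinct hashes (similarity 0.6) and end up in the same cluster, while bactC shares none and forms its own.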

# Run full Sourmash compare + cluster pipeline
python scripts/compare_sigs.py --config scripts/config_compare_sigs.yaml

# Dry-run to preview commands
python scripts/compare_sigs.py --config scripts/config_compare_sigs.yaml --dry-run

Outputs similarity matrices, heatmap PNGs, and cluster assignment CSVs to data_prod/SM_sketches/sim_matrices/.

Step 3 — FFNN training (`FFNN_inner.py`)

The core model script. Builds a presence matrix from the MinHash sketches, trains an FFNN, and optionally runs feature importance analysis.
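
One way to picture the presence matrix is as a binary vector per genome over a fixed hash vocabulary, with the bacterial and phage vectors concatenated for each pair. The sketch below assumes the vocabulary is the sorted union of all hashes per organism type; the actual column layout used by `FFNN_inner.py` and `io_operations.py` may differ, and all names are hypothetical.

```python
def build_vocab(sketches: dict[str, set[int]]) -> list[int]:
    """Sorted list of every hash seen in any sketch; fixes the column order."""
    return sorted(set().union(*sketches.values()))


def presence_row(sketch: set[int], vocab: list[int]) -> list[int]:
    """Binary vector: 1 where the vocabulary hash occurs in this sketch, else 0."""
    return [1 if h in sketch else 0 for h in vocab]


def pair_features(bact_sketch: set[int], phage_sketch: set[int],
                  bact_vocab: list[int], phage_vocab: list[int]) -> list[int]:
    """Concatenate bacterial and phage presence vectors into one input vector."""
    return presence_row(bact_sketch, bact_vocab) + presence_row(phage_sketch, phage_vocab)


bact_sketches = {"B1": {3, 7}, "B2": {7, 9}}
phage_sketches = {"P1": {2, 5}, "P2": {5}}
bact_vocab = build_vocab(bact_sketches)    # [3, 7, 9]
phage_vocab = build_vocab(phage_sketches)  # [2, 5]

x = pair_features(bact_sketches["B2"], phage_sketches["P1"], bact_vocab, phage_vocab)
print(x)  # [0, 1, 1, 1, 1]
```

Each such vector, paired with its host-range label, would form one training example for the network.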

# Basic run (n=500, k=12, Sourmash sketches)
python scripts/FFNN_inner.py --nk 500 12 --logging

# Cross-validation with SMOTE, saving the model
python scripts/FFNN_inner.py --nk 500 12 --cv --kf_n_splits 5 --smote --save_model --logging

# Exclude a bacterial cluster and test on unseen data
python scripts/FFNN_inner.py --nk 500 12 \
    --exclude_clusters \
    --exclude_bact_clusters StrainA StrainB \
    --test_on_excluded --logging

# Pairwise feature importance + gene annotation
python scripts/FFNN_inner.py --nk 500 12 --perform_pfi --perform_ga --logging

# data2 with EOP values
python scripts/FFNN_inner.py --nk 500 12 --data2 --logging

Selected flags:

Flag	Description
`--nk N K`	Sketch parameters (n hashes, k-mer size)
`--data2`	Use the data2 EOP dataset
`--use_encoded`	Use custom-encoded sketches instead of Sourmash sketches
`--cv` / `--kf_n_splits`	K-Fold cross-validation
`--exclude_clusters`	Hold out entire clusters for out-of-distribution testing
`--test_on_excluded`	Evaluate the held-out cluster as the test set
`--test_on_unseen`	Evaluate on completely unseen (zero overlap) data
`--perform_pfi`	Pairwise feature importance analysis
`--save_model`	Persist trained model to disk
`--n_epochs` / `--learning_rate` / `--batch_size`	Hyperparameters

Output goes to nn_runs/<run_dir>/.

Step 4 — Iterative cluster-exclusion pipeline (`slurm_iec/`)

Runs the full leave-one-cluster-out experiment (IterExcl): every (bacteria-cluster, phage-cluster) pair is excluded in turn, the model is trained on the remainder, and performance on the held-out pair is recorded. Results are aggregated by collect_iterres.py.
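
The exclusion loop can be sketched as follows: enumerate every (bacterial cluster, phage cluster) combination, then split the interaction records for each combination. The handling of pairs in which only one member belongs to a held-out cluster is an assumption here (they are dropped from both sets); the real task map produced by `submit_iter_excl.sh` may treat them differently, and the function names are hypothetical.

```python
from itertools import product


def cluster_pair_tasks(bact_clusters: dict[str, int], phage_clusters: dict[str, int]):
    """One task per (bacterial cluster, phage cluster) combination."""
    return list(product(sorted(set(bact_clusters.values())),
                        sorted(set(phage_clusters.values()))))


def split_for_task(interactions, bact_clusters, phage_clusters, task):
    """Split (bacterium, phage, label) records into training and held-out test sets."""
    bact_out, phage_out = task
    train, test = [], []
    for bact, phage, label in interactions:
        b_held = bact_clusters[bact] == bact_out
        p_held = phage_clusters[phage] == phage_out
        if b_held and p_held:
            test.append((bact, phage, label))
        elif not b_held and not p_held:
            train.append((bact, phage, label))
        # Assumption: pairs with exactly one held-out member are dropped.
    return train, test


bact_clusters = {"B1": 0, "B2": 0, "B3": 1}
phage_clusters = {"P1": 0, "P2": 1}
interactions = [("B1", "P1", 1), ("B2", "P2", 0), ("B3", "P1", 0), ("B3", "P2", 1)]

tasks = cluster_pair_tasks(bact_clusters, phage_clusters)
train, test = split_for_task(interactions, bact_clusters, phage_clusters, tasks[0])
print(tasks)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(train)  # [('B3', 'P2', 1)]
print(test)   # [('B1', 'P1', 1)]
```

Dropping the mixed pairs keeps the test set free of any genome related to the training data, at the cost of a smaller training set.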

# Submit the full array + postprocess job to Slurm
bash scripts/slurm_iec/submit_iter_excl.sh 500 12 SM_sketches

# Local (non-Slurm) run for testing
bash scripts/slurm_iec/submit_iter_local.sh 500 12 SM_sketches

submit_iter_excl.sh generates a task map, submits ffnn_train_task.sh as a Slurm array, then chains ffnn_postprocess.sh with --dependency=afterok to run collect_iterres.py once all tasks finish.
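
The chaining pattern described above (array job first, post-processing job gated on its success) can be expressed as two `sbatch` command lines. The helper below only builds the commands and does not submit anything. The 0-based array indexing and the helper names are assumptions, and any extra arguments that the real shell scripts pass along are omitted.

```python
def array_submit_cmd(n_tasks: int, task_script: str) -> list[str]:
    """sbatch command for an array job with indices 0..n_tasks-1.

    --parsable makes sbatch print only the job ID (optionally followed by ';cluster').
    """
    return ["sbatch", "--parsable", f"--array=0-{n_tasks - 1}", task_script]


def parse_job_id(sbatch_output: str) -> str:
    """Extract the job ID from `sbatch --parsable` output."""
    return sbatch_output.strip().split(";")[0]


def postprocess_submit_cmd(array_job_id: str, post_script: str) -> list[str]:
    """sbatch command that becomes eligible only after the array job finished successfully."""
    return ["sbatch", f"--dependency=afterok:{array_job_id}", post_script]


cmd1 = array_submit_cmd(4, "scripts/slurm_iec/ffnn_train_task.sh")
job_id = parse_job_id("12345\n")  # what sbatch --parsable might print
cmd2 = postprocess_submit_cmd(job_id, "scripts/slurm_iec/ffnn_postprocess.sh")
print(" ".join(cmd1))
print(" ".join(cmd2))
```

On a cluster, each command list could be passed to `subprocess.run(..., capture_output=True, text=True)`, feeding the captured output of the first call into `parse_job_id`.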

Step 5 — Result collection & analysis (`collect_iterres.py`)

Aggregates per-run results from an iterative-exclusion experiment into summary CSVs and figures.

python scripts/collect_iterres.py \
    --run_dir nn_runs/IterExcl_SM_sketches_n500_k12 \
    --out data_prod/IterExclClus_SM_sketches_n500_k12

Produces: all_runs_summary.csv, accuracy/F1/balanced-accuracy plots by n/k, confusion matrices, k-mer distribution plots, top gene annotations, and a full run log.
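
For reference, the three headline metrics can be derived from each run's confusion-matrix counts as shown below. This is a minimal sketch: the run names, counts, and CSV column names are made up for the example and do not reflect the actual schema of all_runs_summary.csv.

```python
import csv
import io


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, F1 and balanced accuracy from confusion-matrix counts."""
    total = tp + fp + tn + fn
    tpr = tp / (tp + fn) if tp + fn else 0.0  # sensitivity / recall
    tnr = tn / (tn + fp) if tn + fp else 0.0  # specificity
    f1_den = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "f1": 2 * tp / f1_den if f1_den else 0.0,
        "balanced_accuracy": (tpr + tnr) / 2,
    }


# Hypothetical per-run counts as (tp, fp, tn, fn).
runs = {"bact0_phage0": (8, 2, 6, 4), "bact0_phage1": (5, 0, 5, 0)}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["run", "accuracy", "f1", "balanced_accuracy"])
writer.writeheader()
for run, counts in runs.items():
    writer.writerow({"run": run, **binary_metrics(*counts)})
print(buffer.getvalue())
```

Balanced accuracy averages sensitivity and specificity, so it is less flattering than plain accuracy when held-out cluster pairs contain mostly non-infecting combinations.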

Environment setup

# Create conda environment and install dependencies
make setup

# Or step by step:
make env          # create conda env 'PredPPI' (Python 3.11)
make requirements # scan and freeze requirements.txt

conda activate PredPPI
pip install -e .  # install package in editable mode (run from repo root)

Notebooks (`notebooks/`)

Exploratory Jupyter notebooks for each stage of the project:

Notebook	Description
`downsampling.ipynb`	MinHash sketch exploration
`bact_eda.ipynb` / `phage_eda.ipynb`	Genome-level EDA for bacteria and phages
`phylogeny_and_minhash.ipynb`	Phylogenetic structure vs. sketch similarity
`preprocces_data.ipynb`	Interaction matrix preprocessing
`random_forest.ipynb`	Random Forest baseline
`k_nearest_neighbor.ipynb`	k-NN baseline
`nn_torch.ipynb`	Early FFNN prototyping
`FFNN_outer_dev.ipynb`	Outer-loop development (model selection)
`gene_investigation.ipynb`	Gene-level feature analysis
`cnn.ipynb` / `rnn.ipynb`	CNN and RNN experiments
`playground.ipynb`	Scratch space

Key scripts reference

Script	Role
`downsampling.py`	Genome → MinHash sketch (Sourmash / mmh3 / xxhash)
`bact_downsampling.py`	Bacterial-specific downsampling (multi-contig support)
`batch_downsampling.py`	Batch over multiple configurations
`compare_sigs.py`	Pairwise Jaccard similarity + (hierarchical) clustering
`FFNN_inner.py`	Core FFNN training & evaluation
`collect_iterres.py`	Aggregate iterative-exclusion results
`decompositions.py`	Custom k-mer decomposition (alternative to Sourmash): `KmerCodec` (4-bit encoding), `Decompose` class
`io_operations.py`	Presence matrix I/O, host-range loading
`manipulations.py`	Feature construction, host-range binarisation
`analysis.py`	Feature importance, gene analysis, plotting
`data_generators.py`	PyTorch dataset utilities
`nn_torch.py` / `torchmlp.py` / `simpleffnn.py`	Network architecture definitions
`paths.py`	Centralised path constants
`utils.py`	Taxonomy / strain ID helpers
`slurm_iec/submit_iter_excl.sh`	Launch full iterative-exclusion array
`slurm_iec/ffnn_train_task.sh`	Single Slurm array task (one cluster pair)
`slurm_iec/ffnn_postprocess.sh`	Post-processing job (runs after array completes)
`slurm_iec/extract_accuracy.sh`	Extract per-run accuracy files

Dependencies

Core libraries: torch, scikit-learn, imbalanced-learn, sourmash, mmh3, xxhash, biopython, pandas, numpy, matplotlib, seaborn, networkx, tqdm, pyyaml, joblib.

See requirements.txt (auto-generated by make requirements) for the full pinned list.

Notes

KU library preparation uses the Hackflex protocol: https://github.com/GaioTransposon/Hackflex
Key protein classes investigated in gene annotation: depolymerases (ref), anti-defence systems (CRISPR-related), modifying enzymes, integrases.
