Predicting phage–host interactions from downsampled bacterial and bacteriophage genomes using machine learning.
Given a phage genome and a bacterial genome, the model predicts whether the phage can infect that bacterium. Each genome is downsampled into a fixed-size MinHash sketch (a subset of its k-mer hashes), either via Sourmash or a custom MurmurHash3 / xxHash implementation. The sketches are converted to presence-matrix rows and concatenated to form the input vector of a Feed-Forward Neural Network (FFNN), which is evaluated against a host-range / efficiency-of-plating (EOP) interaction set.
PredictPhagePPI/
├── raw_data/ # Raw FASTA genomes + interaction tables
├── data_prod/ # Produced data: sketches, presence matrices, results
├── nn_runs/ # Per-run output from FFNN training
├── tmp/ # Temporary Slurm task files
├── notebooks/ # Exploratory Jupyter notebooks
├── scripts/ # Core pipeline scripts
│ └── slurm_iec/ # Iterative cluster-exclusion Slurm pipeline
├── Makefile # Environment setup helpers
└── README.md
Two datasets are supported, referred to throughout the code as data1 (default) and data2.
| Path | Description |
|---|---|
| `raw_data/phagehost_KU/bacteriaKU_cleaned.fasta` | Bacterial genomes (data1) |
| `raw_data/phagehost_KU/phage_cleaned.fasta` | Phage genomes (data1) |
| `raw_data/phagehost_KU/Hostrange_data_all_crisp_iso.xlsx` | Lab-tested phage–host interactions (EOP / binary) for data1 |
| `raw_data/phagehost_KU/data2_bacts.fasta` | Bacterial genomes (data2) |
| `raw_data/phagehost_KU/data2_phages.fasta` | Phage genomes (data2) |
| `raw_data/phagehost_KU/data2_EOP.xlsx` | EOP interaction table for data2 |
| Path | Description |
|---|---|
| `SM_sketches/` | Sourmash MinHash sketches (`BactMinhash_n<N>_k<K>/`, `PhageMinhash_n<N>_k<K>/`). MurmurHash3 / xxHash sketches use the same structure, with the "SM" prefix replaced by "encoded" |
| `SM_sketches/sim_matrices/` | Pairwise Jaccard similarity matrices + cluster assignment CSVs |
| `PresMat_*/` | Precomputed binary presence matrices used as NN input |
| `IterExclClus_*/` | Collected results from iterative cluster-exclusion runs |
| `NN_files/` | Train/test split definitions and saved model files |
Converts raw FASTA files into MinHash sketches. Each genome is decomposed into k-mers of length k, and the n smallest hashes are retained.
# Sourmash-backed sketches (default)
python scripts/downsampling.py --nk 500 12 --method sourmash
# Custom MurmurHash3 sketches
python scripts/downsampling.py --nk 500 12 --method minhash --hash mmh3
# data2 dataset
python scripts/downsampling.py --nk 500 12 --method sourmash --data2
# Batch downsampling across many n/k combinations (via Slurm)
bash scripts/queue_batch_downsampling.sh
For bacterial genomes with multi-contig assemblies, use bact_downsampling.py (wraps downsampling.py with per-contig merging).
Key parameters:
| Flag | Description |
|---|---|
| `--nk N K` | Sketch size N and k-mer length K (applied to both bacteria and phages) |
| `--split_nk BN BK PN PK` | Independent sketch parameters for bacteria and phages |
| `--method` | `sourmash` (default), `minhash`, or `ohe` (one-hot encoding) |
| `--hash` | Hash function: `mmh3` (default), `xxhash`, or `ohe_custom` |
| `--data2` | Use the data2 dataset |
Output lands in data_prod/SM_sketches/ (or SM_sketches_data2/).
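The bottom-n MinHash idea described above (decompose into k-mers, hash each one, keep the n smallest hashes) can be sketched in a few lines. This is an illustration only, not the code in downsampling.py: the function names are hypothetical, `hashlib.blake2b` stands in for MurmurHash3 / xxHash so the example needs only the standard library, and reverse-complement (canonical) k-mers are not handled.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of length k from seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Stand-in 64-bit hash (BLAKE2b); an assumption, not the pipeline's hash."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_n_sketch(seq, n, k):
    """Return the n smallest distinct k-mer hashes, sorted ascending."""
    hashes = {hash64(km) for km in kmers(seq, k)}
    return sorted(hashes)[:n]


genome = "ATGCGTACGTTAGCATGCGTACGTTAGC"  # toy sequence, not real data
sketch = bottom_n_sketch(genome, n=5, k=12)
print(len(sketch), sketch == sorted(sketch))
```

A sequence shorter than k yields no k-mers, so its sketch is empty; a genome with fewer than n distinct k-mers yields a sketch shorter than n.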
Computes pairwise Jaccard similarity between sketches and clusters genomes for use in the iterative-exclusion pipeline.
# Run full Sourmash compare + cluster pipeline
python scripts/compare_sigs.py --config scripts/config_compare_sigs.yaml
# Dry-run to preview commands
python scripts/compare_sigs.py --config scripts/config_compare_sigs.yaml --dry-run
Outputs similarity matrices, heatmap PNGs, and cluster assignment CSVs to data_prod/SM_sketches/sim_matrices/.
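As a rough illustration of this step, the snippet below computes Jaccard similarity between two sketches treated as plain sets of hashes, then groups genomes whose similarity reaches a threshold. Both simplifications are assumptions: compare_sigs.py delegates the comparison to Sourmash, and the grouping shown here is a simple single-linkage rule with a hypothetical threshold, swapped in for the clustering the script actually performs.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two hash collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def threshold_clusters(sketches, threshold):
    """Single-linkage grouping: genomes linked by Jaccard >= threshold share a cluster."""
    parent = {name: name for name in sketches}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(sketches, 2):
        if jaccard(sketches[a], sketches[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in sketches:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy sketches with made-up hash values.
toy = {
    "bactA": [1, 2, 3, 4],
    "bactB": [2, 3, 4, 5],
    "bactC": [10, 11, 12, 13],
}
print(jaccard(toy["bactA"], toy["bactB"]))     # 3 shared / 5 total = 0.6
print(threshold_clusters(toy, threshold=0.5))  # [['bactA', 'bactB'], ['bactC']]
```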
The core model script. Builds a presence matrix from the MinHash sketches, trains an FFNN, and optionally runs feature importance analysis.
# Basic run (n=500, k=12, Sourmash sketches)
python scripts/FFNN_inner.py --nk 500 12 --logging
# Cross-validation with SMOTE, saving the model
python scripts/FFNN_inner.py --nk 500 12 --cv --kf_n_splits 5 --smote --save_model --logging
# Exclude a bacterial cluster and test on unseen data
python scripts/FFNN_inner.py --nk 500 12 \
--exclude_clusters \
--exclude_bact_clusters StrainA StrainB \
--test_on_excluded --logging
# Pairwise feature importance + gene annotation
python scripts/FFNN_inner.py --nk 500 12 --perform_pfi --perform_ga --logging
# data2 with EOP values
python scripts/FFNN_inner.py --nk 500 12 --data2 --logging
Selected flags:
| Flag | Description |
|---|---|
| `--nk N K` | Sketch parameters (n hashes, k-mer size) |
| `--data2` | Use the data2 EOP dataset |
| `--use_encoded` | Use custom-encoded sketches instead of Sourmash sketches |
| `--cv` / `--kf_n_splits` | K-Fold cross-validation |
| `--exclude_clusters` | Hold out entire clusters for out-of-distribution testing |
| `--test_on_excluded` | Evaluate the held-out cluster as the test set |
| `--test_on_unseen` | Evaluate on completely unseen (zero overlap) data |
| `--perform_pfi` | Pairwise feature importance analysis |
| `--save_model` | Persist trained model to disk |
| `--n_epochs` / `--learning_rate` / `--batch_size` | Hyperparameters |
Output goes to nn_runs/<run_dir>/.
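One plausible reading of "presence matrix" is a 0/1 table with one row per genome and one column per hash observed in any sketch; the network input for a (bacterium, phage) pair is then the bacterial row followed by the phage row. The sketch below follows that reading. It is an assumption about the layout, not a description of FFNN_inner.py, and the function names and toy hash values are hypothetical.

```python
def presence_matrix(sketches):
    """Map each genome to a 0/1 row over the sorted union of all hashes."""
    vocab = sorted(set().union(*sketches.values()))
    column = {h: i for i, h in enumerate(vocab)}
    rows = {}
    for name, sketch in sketches.items():
        row = [0] * len(vocab)
        for h in sketch:
            row[column[h]] = 1  # mark this hash as present in the genome
        rows[name] = row
    return vocab, rows


def pair_vector(bact_row, phage_row):
    """Concatenate a bacterial row and a phage row into one input vector."""
    return bact_row + phage_row


# Toy sketches with made-up hash values.
bact = {"B1": [3, 7, 9], "B2": [3, 8]}
phage = {"P1": [100, 200], "P2": [200, 300]}

bact_vocab, bact_rows = presence_matrix(bact)
phage_vocab, phage_rows = presence_matrix(phage)

x = pair_vector(bact_rows["B1"], phage_rows["P2"])
print(x)  # [1, 1, 0, 1, 0, 1, 1]
```

Under this layout the input width equals the number of distinct bacterial hashes plus the number of distinct phage hashes, so it grows with both the sketch size and the diversity of the genome collection.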
Runs the full leave-one-cluster-out experiment (IterExcl): every (bacteria-cluster, phage-cluster) pair is excluded in turn, the model is trained on the remainder, and performance on the held-out pair is recorded. Results are aggregated by collect_iterres.py.
# Submit the full array + postprocess job to Slurm
bash scripts/slurm_iec/submit_iter_excl.sh 500 12 SM_sketches
# Local (non-Slurm) run for testing
bash scripts/slurm_iec/submit_iter_local.sh 500 12 SM_sketches
submit_iter_excl.sh generates a task map, submits ffnn_train_task.sh as a Slurm array, then chains ffnn_postprocess.sh with --dependency=afterok to run collect_iterres.py once all tasks finish.
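The enumeration behind the task map can be illustrated as follows: one task per (bacterial cluster, phage cluster) pair, each with its own train/test split. The split rule used here is an assumption, since the README does not spell it out: interactions where both partners fall in the excluded clusters form the test set, interactions where neither does form the training set, and mixed rows are dropped. All names and data are hypothetical.

```python
from itertools import product


def iter_exclusion_splits(interactions, bact_cluster, phage_cluster):
    """Yield ((bact_cluster, phage_cluster), train, test) for every cluster pair."""
    pairs = product(sorted(set(bact_cluster.values())),
                    sorted(set(phage_cluster.values())))
    for bc, pc in pairs:
        train, test = [], []
        for bact, phage, label in interactions:
            b_out = bact_cluster[bact] == bc
            p_out = phage_cluster[phage] == pc
            if b_out and p_out:
                test.append((bact, phage, label))
            elif not b_out and not p_out:
                train.append((bact, phage, label))
            # Rows touching only one excluded cluster are dropped (assumption).
        yield (bc, pc), train, test


# Toy cluster assignments and interactions (label 1 = infects).
bact_cluster = {"B1": "bc1", "B2": "bc2"}
phage_cluster = {"P1": "pc1", "P2": "pc2"}
interactions = [
    ("B1", "P1", 1), ("B1", "P2", 0),
    ("B2", "P1", 0), ("B2", "P2", 1),
]

splits = list(iter_exclusion_splits(interactions, bact_cluster, phage_cluster))
for pair, train, test in splits:
    print(pair, "train:", len(train), "test:", len(test))
```

Dropping the mixed rows keeps every genome in the test set entirely unseen during training; keeping them in the training set would be the looser alternative.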
Aggregates per-run results from an iterative-exclusion experiment into summary CSVs and figures.
python scripts/collect_iterres.py \
--run_dir nn_runs/IterExcl_SM_sketches_n500_k12 \
--out data_prod/IterExclClus_SM_sketches_n500_k12
Produces: all_runs_summary.csv, accuracy/F1/balanced-accuracy plots by n/k, confusion matrices, k-mer distribution plots, top gene annotations, and a full run log.
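For reference, the three summary metrics named above can be computed from binary labels and predictions as shown below. This is a standalone sketch using the standard definitions, not the code in collect_iterres.py; the function name and dictionary keys are hypothetical, since the column layout of all_runs_summary.csv is not shown here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, F1 and balanced accuracy for 0/1 labels (1 = infection)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)       # true positive rate
    specificity = ratio(tn, tn + fp)  # true negative rate
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Toy run: 2 positives and 4 negatives.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics["f1"], metrics["balanced_accuracy"])  # 0.5 0.625
```

Balanced accuracy averages recall over the two classes, which makes it more informative than plain accuracy when non-infecting pairs greatly outnumber infecting ones.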
# Create conda environment and install dependencies
make setup
# Or step by step:
make env # create conda env 'PredPPI' (Python 3.11)
make requirements # scan and freeze requirements.txt
conda activate PredPPI
pip install -e . # install package in editable mode (run from repo root)
Exploratory Jupyter notebooks for each stage of the project:
| Notebook | Description |
|---|---|
| `downsampling.ipynb` | MinHash sketch exploration |
| `bact_eda.ipynb` / `phage_eda.ipynb` | Genome-level EDA for bacteria and phages |
| `phylogeny_and_minhash.ipynb` | Phylogenetic structure vs. sketch similarity |
| `preprocces_data.ipynb` | Interaction matrix preprocessing |
| `random_forest.ipynb` | Random Forest baseline |
| `k_nearest_neighbor.ipynb` | k-NN baseline |
| `nn_torch.ipynb` | Early FFNN prototyping |
| `FFNN_outer_dev.ipynb` | Outer-loop development (model selection) |
| `gene_investigation.ipynb` | Gene-level feature analysis |
| `cnn.ipynb` / `rnn.ipynb` | CNN and RNN experiments |
| `playground.ipynb` | Scratch space |
| Script | Role |
|---|---|
| `downsampling.py` | Genome → MinHash sketch (Sourmash / mmh3 / xxhash) |
| `bact_downsampling.py` | Bacterial-specific downsampling (multi-contig support) |
| `batch_downsampling.py` | Batch over multiple configurations |
| `compare_sigs.py` | Pairwise Jaccard similarity + (hierarchical) clustering |
| `FFNN_inner.py` | Core FFNN training & evaluation |
| `collect_iterres.py` | Aggregate iterative-exclusion results |
| `decompositions.py` | k-mer decomposition as an alternative to Sourmash, KmerCodec (4-bit encoding), Decompose class |
| `io_operations.py` | Presence matrix I/O, host-range loading |
| `manipulations.py` | Feature construction, host-range binarisation |
| `analysis.py` | Feature importance, gene analysis, plotting |
| `data_generators.py` | PyTorch dataset utilities |
| `nn_torch.py` / `torchmlp.py` / `simpleffnn.py` | Network architecture definitions |
| `paths.py` | Centralised path constants |
| `utils.py` | Taxonomy / strain ID helpers |
| `slurm_iec/submit_iter_excl.sh` | Launch full iterative-exclusion array |
| `slurm_iec/ffnn_train_task.sh` | Single Slurm array task (one cluster pair) |
| `slurm_iec/ffnn_postprocess.sh` | Post-processing job (runs after array completes) |
| `slurm_iec/extract_accuracy.sh` | Extract per-run accuracy files |
Core libraries: torch, scikit-learn, imbalanced-learn, sourmash, mmh3, xxhash, biopython, pandas, numpy, matplotlib, seaborn, networkx, tqdm, pyyaml, joblib.
See requirements.txt (auto-generated by make requirements) for the full pinned list.
- KU library preparation uses the Hackflex protocol: https://github.com/GaioTransposon/Hackflex
- Key protein classes investigated in gene annotation: depolymerases (ref), anti-defence systems (CRISPR-related), modifying enzymes, integrases.
