PhyloHPC — Gene Family Phylogenomics Pipeline

A Nextflow pipeline for large-scale gene family phylogenomics: from raw proteomes to ortholog annotations, gene tree–species tree reconciliation, and clade-level ancestral gene content reconstruction.

Overview

Multi-species proteomes
        │
        ▼
  step1.nf ── PFAM domain search (HMMER) - **genefam.csv**
              MCL clustering
              → Homology Groups (HGs)
        │
        ▼
  step2.nf ── Multiple sequence alignment (MAFFT)
              Gene tree inference (IQ-TREE2 / FastTree)
              Ortholog prediction (POSSVM)
              [optional] Gene tree–species tree reconciliation (GeneRax)
        │
        ▼
  workflow/gather_annotations.py ── Per-species ortholog annotation tables
        │
        ▼
  workflow/step4.ancestry.nf ── Clade-scoped ortholog calling (POSSVM + --ignoretips)
                        Presence/absence matrix
                        Ancestral state reconstruction (PastML, MPPA + F81)
                        Interactive visualisation (D3.js HTML)

Documentation: full user manual → docs/manual.md · Step 4 ancestral reconstruction → docs/ancestry.md

Clone

git clone --recurse-submodules https://github.com/zolotarovgl/phylohpc.git

Requirements

Environment: workflow/environment.yaml is a minimal, portable spec — only the direct tools, loosely pinned, so it solves across machines. For local macOS work on Apple Silicon, use the Rosetta/x86_64 environment in workflow/environment.macos-x86_64.yaml.

mamba env create -f workflow/environment.yaml

mamba activate phylo

On Apple Silicon Macs:

conda create -n phylo-macos
conda activate phylo-macos
conda config --env --set subdir osx-64
mamba env update -n phylo-macos -f workflow/environment.macos-x86_64.yaml

If mamba is not installed, use:

conda env update -n phylo-macos -f workflow/environment.macos-x86_64.yaml

The Nextflow config sets PYTHONNOUSERSITE=1 for local and slurm runs so user-level ~/.local Python packages do not override the environment. The macOS environment uses Python 3.10 because the current clipkit build requires it for the alignment trimming step in step2.nf.

Key tools: HMMER, MAFFT, IQ-TREE2, FastTree, MCL, GeneRax, POSSVM, PastML, ete3, Python 3.10.

On the CRG HPC, load the required modules before running any pipeline:

mamba activate phylo
module load Java 
module load OpenMPI

GeneRax reconciliation (--run_generax) is optional and not in the conda env. It needs an MPI-compiled GeneRax + mpirun (the bioconda build is non-MPI and crashes). Use it on the CRG HPC via module load OpenMPI with an MPI GeneRax, or run locally without --run_generax (POSSVM runs on the raw trees). POSSVM ships with the phylogeny/ submodule cloned above.

Input data

File	Description
`data/input.fasta`	Concatenated proteomes for all species. Sequence IDs must start with a 3–6 character species prefix followed by `_` (e.g. `Mmus_ENSMUSP00000001`).
`data/species_tree.full.newick`	Full species tree with named internal nodes (required for GeneRax and ancestral reconstruction).
`data/species_tree.newick`	Strictly binary species tree (no polytomies) used by GeneRax. Generated from the full tree — see below.
`data/Mmus_gene_names.csv`	Reference species gene name mapping (`protein_id<TAB>gene_name`).
`genefam.csv`	Gene family definitions (HMM profiles, clustering parameters — see `docs/Gene_families.md`).
`config/species_list`	One species prefix per line; defines the analysis set.

Test data

Fetch proteomes from a sequence database

DB_PATH=~/ant/xgraubove/genomes/data/ # path to Xavi's database 
bash workflow/prepare_fasta.sh config/species_list data/input.fasta $DB_PATH

Prepare the binary species tree

Note: --random-resolve is for the cases where the tree contains polytomies

python workflow/check_tree.py --random-resolve data/species_tree.full.newick config/species_list data/species_tree.newick

This validates the full tree, prunes it to the species in config/species_list, and resolves any polytomies, writing a GeneRax-compatible binary tree to data/species_tree.newick.

Test genefam.csv

cat data/gene_families_searchinfo.csv | grep -E 'Forkhead|T-box' > genefam.csv

Execution profiles

Profiles are combined on the command line with -profile <executor>,<flavour>.

Executors

Profile	Description
`local`	Run interactively on the current machine / HPC interactive session.
`slurm`	Submit each process as a SLURM job on the CRG HPC (`genoa64` queue).

Flavours

Profile	Alignment	Tree method	GeneRax SPR
`fast`	MAFFT default	FastTree	2
`precise`	MAFFT L-INS-i	IQ-TREE2 (model test)	3

Use fast for testing and small families; precise for publication-quality trees.

Step 1 — PFAM search and homology group clustering

Searches the input proteome with HMMER against the defined PFAM profiles, then clusters domain-containing sequences into Homology Groups (HGs) using MCL.

WORKDIR=/no_backup/asebe/gzolotarov/nextflow/phylohpc/work_step1/
WORKDIR=work
PROFILE=local,fast

# NB: do not forget to set the path to pfam_db Pfam-A.hmm file!
# Interactive
nextflow run step1.nf \
    -profile $PROFILE \
    -resume \
    -w $WORKDIR \
	--pfam_db /home/grygoriyzolotarov/ant/xgraubove/data/pfam/Pfam-A.hmm \
    --genefam_info genefam.csv \
    --infasta data/input.fasta \
    -with-report reports/report.step1.html \
    -with-trace  reports/trace.step1.txt

# SLURM
PROFILE=slurm,fast
sbatch --time=01:00:00 -J step1 submit_nf.sh step1.nf \
    -resume -profile $PROFILE \
    -w $WORKDIR \
    --report  reports/report.step1.html \
    --trace   reports/trace.step1.txt \
    --timeline reports/timeline.step1.html

Outputs: results/search/ (domain sequences, gene lists) and results/clusters/ (per-HG FASTA files).

Select HGs for downstream analysis

# Inspect cluster sizes
python workflow/get_seqstat.py results/clusters/*.fasta | sort -k2 -n

# Select HGs that contain the reference species, have ≥20 sequences
# across ≥5 species, and write their IDs to ids.txt
python workflow/select_hgs.py \
    --out ids.txt \
    --soi Mmus \
    --min_seqs 20 \
    --min_sps 5

ids.txt controls which HGs are processed in all subsequent steps.

Step 2 — Gene tree inference and ortholog prediction

For each HG in ids.txt: align sequences (MAFFT), infer a gene tree (FastTree or IQ-TREE2), and predict ortholog groups with POSSVM.

Optionally, reconcile gene trees with the species tree using GeneRax before POSSVM.

Optional: resource prediction

Before submitting large runs, predict memory and runtime requirements from a previous trace file:

python workflow/get_seqstat.py results/clusters/*.fasta > seq_stat.tab

Rscript workflow/train.R \
    --trace     reports/trace.step2.txt \
    --seq_stats seq_stat.tab \
    --outfile   workflow/models/models.json \
    --plotfile  workflow/models/models.pdf \
    --tau 0.95

python workflow/predict_resources.py \
    --ids_fn      ids.txt \
    --cluster_dir results/clusters \
    --models_json workflow/models/models.json \
    --defaults_json workflow/models/defaults.json \
    --outfile     resources.tsv \
    --max_mem 100000 --max_time 2880 --increase 0.5

Run

# Interactive
WORKDIR=/no_backup/asebe/gzolotarov/nextflow/phylohpc/work_step2/
WORKDIR=work/
PROFILE=local,fast
nextflow run step2.nf \
    -profile $PROFILE \
    -resume -w $WORKDIR \
    --run_generax \
    --family_info genefam.csv \
    --species_tree data/species_tree.full.newick \
    -with-report reports/report.step2.html \
    -with-trace  reports/trace.step2.txt

# SLURM
PROFILE=slurm,precise
sbatch -J step2 -o reports/slurm.step2.out submit_nf.sh step2.nf \
    -profile $PROFILE \
    -resume -w $WORKDIR \
    --run_generax \
    --family_info genefam.csv \
    --species_tree data/species_tree.full.newick \
    --report   reports/report.step2.html \
    --trace    reports/trace.step2.txt \
    --timeline reports/timeline.step2.html

--run_generax enables gene tree–species tree reconciliation with GeneRax before POSSVM runs. Without this flag, POSSVM runs directly on the IQ-TREE2/FastTree trees.

Outputs:

results/align/ — trimmed alignments
results/gene_trees/ — raw gene trees (.treefile)
results/possvm/ — POSSVM ortholog groups (when --run_generax)
results/possvm_prev/ — POSSVM on raw trees (when --run_generax)
results/generax/ — reconciled gene trees (when --run_generax)
results/report_step2.html - an interactive html report

Create an interactive report

The script now takes as input results dir and figures out much of the rest

python workflow/report_step2.py --results_dir results --species_tree data/species_tree.full.newick --family_info genefam.csv --output report2.html

Monitoring resource usage

# Get SLURM stats for completed jobs from a trace file
cat reports/trace.step2.txt | grep -E "COMPLETED|CACHED" | cut -f3 | grep -v native > job_ids
python workflow/check_job.v2.py -f job_ids > job_stats.tab

R/downstream_stats.R and R/resources.R plot CPU/memory efficiency across jobs. R/generax_stats.R analyses GeneRax runtime scaling with SPR radius.

Misc

Download phylopics

comm <(cut -f 1 data/species_info.tsv | sort | uniq ) <(basename -a img/phylo/*.png | sed 's/.png//g' | sort | uniq) -32 | awk -F '\t' 'FNR==NR{d[$1]=$2;next}{print $1"\t"d[$1]}' data/species_info.tsv  - > tmp/phylopic_missing
wc -l tmp/phylopic_missing


for PREF in $(cat tmp/phylopic_missing | cut -f 1 ); do
SPECIES_NAME=$(awk -F '\t' -v ID=$PREF '$1==ID{print $2}' data/species_info.tsv )
echo $PREF
echo $SPECIES_NAME
python workflow/download_phylopic.py "${SPECIES_NAME}" -o img/phylo/${PREF}.png
done

Name		Name	Last commit message	Last commit date
Latest commit History 298 Commits
.claude/skills/regen-report		.claude/skills/regen-report
.github/workflows		.github/workflows
R		R
config		config
data		data
docs		docs
img		img
phylogeny @ 05d88cd		phylogeny @ 05d88cd
workflow		workflow
.gitignore		.gitignore
.gitmodules		.gitmodules
README.md		README.md
make_report.sh		make_report.sh
nextflow.config		nextflow.config
step1.nf		step1.nf
step1.smk		step1.smk
step2.nf		step2.nf
step2.smk		step2.smk
submit_nf.sh		submit_nf.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PhyloHPC — Gene Family Phylogenomics Pipeline

Overview

Clone

Requirements

Input data

Test data

Fetch proteomes from a sequence database

Prepare the binary species tree

Test genefam.csv

Execution profiles

Step 1 — PFAM search and homology group clustering

Select HGs for downstream analysis

Step 2 — Gene tree inference and ortholog prediction

Optional: resource prediction

Run

Create an interactive report

Monitoring resource usage

Misc

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PhyloHPC — Gene Family Phylogenomics Pipeline

Overview

Clone

Requirements

Input data

Test data

Fetch proteomes from a sequence database

Prepare the binary species tree

Test genefam.csv

Execution profiles

Step 1 — PFAM search and homology group clustering

Select HGs for downstream analysis

Step 2 — Gene tree inference and ortholog prediction

Optional: resource prediction

Run

Create an interactive report

Monitoring resource usage

Misc

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages