AmpliPhy is a scalable, fully automated Nextflow pipeline for improving phylogenetic inference of gene families through database-driven homolog enrichment.
Preprint (bioRxiv): doi.org/10.64898/2026.01.26.701724
Run with conda:
git clone https://github.com/DessimozLab/ampliphy.git
cd ampliphy
nextflow run ampliphy.nf -profile conda \
--input_dir sample/input \
--input_taxonomy sample/taxonomy \
--output_dir sample/output
- Automated workflow: MMseqs2 search → MAFFT amplification → IQ-TREE 2 inference → MAD rooting and pruning
- Curated or custom MMseqs2 databases
- Tunable homolog selection and sequence-addition limits
- Optional taxonomy and taxonomic congruence score (TCS) reports
- Portable Nextflow + Bioconda implementation for local or HPC execution
For each input gene family, AmpliPhy performs the following steps:
- Aligns the input protein sequences with MAFFT to generate an initial MSA.
- Searches for homologous sequences against an MMseqs2 database.
- Removes exact sequence matches to the original input and selects homologs according to the configured limits.
- Adds selected homologs to the original MSA with MAFFT --addfragments --keeplength.
- Infers a tree from the amplified MSA with IQ-TREE 2.
- Roots the amplified tree with MAD unless rooting is disabled.
- Prunes added homologs from the rooted tree to produce a final tree containing the original input leaves.
- Optionally generates taxonomy-based reports and TCS comparisons when input taxonomy is supplied.
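The exact-match filter in the third step above can be pictured with a short Python sketch. It is an illustration only, not AmpliPhy's code: the function name `drop_exact_matches`, the dictionary layout (sequence ID mapped to sequence), and the use of plain string equality are all assumptions.

```python
def drop_exact_matches(input_seqs, homologs):
    """Return the homologs whose sequence differs from every input sequence.

    Both arguments map sequence IDs to protein sequences (hypothetical layout).
    """
    originals = set(input_seqs.values())
    return {sid: seq for sid, seq in homologs.items() if seq not in originals}


inputs = {"S1": "MKTAYIAK", "S2": "MKLVWW"}
hits = {"H1": "MKTAYIAK", "H2": "MKTAYIAR", "H3": "MQLVWW"}

# H1 is identical to S1, so only H2 and H3 remain candidates for selection.
print(sorted(drop_exact_matches(inputs, hits)))  # ['H2', 'H3']
```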
Requirements:
- Nextflow (DSL2)
- Java
- Bash
- Either:
  - -profile conda (recommended), or
  - the following tools available in PATH: mafft, mmseqs, iqtree2, gotree
- Python 3 for reporting scripts
--input_dir must contain one or more FASTA files with protein sequences. Each FASTA file represents one gene family.
Recognized extensions, optionally gzipped:
- .fa, .fasta, .faa, .fna, .ffn, .frn
- optional .gz suffix
The family identifier is derived from the file name by stripping the recognized FASTA extension and optional .gz suffix.
Example:
input/
├── fam1.fa
├── fam2.faa
└── fam3.fa.gz
These files are processed as families fam1, fam2, and fam3.
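The naming rule can be sketched in Python as follows. This is a minimal illustration rather than the pipeline's own code: the function name `family_id` is hypothetical, and case-sensitive matching of extensions is an assumption.

```python
from pathlib import Path

# Extensions listed above; matching is assumed to be case-sensitive.
RECOGNIZED_EXTENSIONS = (".fa", ".fasta", ".faa", ".fna", ".ffn", ".frn")


def family_id(path):
    """Strip an optional .gz suffix and a recognized FASTA extension."""
    name = Path(path).name
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    for ext in RECOGNIZED_EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)]
    raise ValueError(f"{path!r} does not have a recognized FASTA extension")


for filename in ("fam1.fa", "fam2.faa", "fam3.fa.gz"):
    print(filename, "->", family_id(filename))
```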
When --input_taxonomy is supplied, the directory may contain one taxonomy file per input family:
taxonomy/
├── fam1.tax
├── fam2.tax
└── fam3.tax
Each <family_id>.tax file must contain two tab-separated columns:
sequence_id	species_taxid
For example, given fam1.fa:
>S1
AAA
>S2
WWW
the corresponding fam1.tax may be:
S1	9606
S2	10090
TaxIds must be valid NCBI Taxonomy identifiers at the species rank.
Behavior:
- Families with a valid corresponding .tax file are included in taxonomy- and TCS-based reports.
- Families without a corresponding .tax file are skipped with a warning.
- A provided .tax file that is malformed or contains invalid/non-species TaxIds causes the relevant reporting step to fail.
- If no family has usable taxonomy information, the requested taxonomy-dependent report cannot be generated.
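Before launching a run, it can help to check that a .tax file has the expected two-column shape. The sketch below is a hypothetical pre-flight check, not part of AmpliPhy: the name `parse_tax_lines` and the decision to ignore blank lines are assumptions, and it checks only the format. Whether a TaxId exists in NCBI Taxonomy at the species rank is not verified here.

```python
def parse_tax_lines(lines):
    """Parse 'sequence_id<TAB>species_taxid' rows into a dict of int TaxIds."""
    mapping = {}
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab-separated columns, got {len(fields)}"
            )
        seq_id, taxid = fields
        if not (taxid.isascii() and taxid.isdigit()):
            raise ValueError(f"line {lineno}: TaxId {taxid!r} is not a positive integer")
        mapping[seq_id] = int(taxid)
    return mapping


print(parse_tax_lines(["S1\t9606\n", "S2\t10090\n"]))  # {'S1': 9606, 'S2': 10090}
```

To check a file on disk, pass an open file handle, for example `parse_tax_lines(open("taxonomy/fam1.tax"))`.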
All published outputs are written under --output_dir:
output_dir/
├── homologs/
│   └── {family_id}.homologs.fa
├── msa/
│   ├── {family_id}.msa.fa
│   └── {family_id}.amp.fa
├── tree/
│   ├── {family_id}.amp.nwk
│   └── {family_id}.amp.unpruned.nwk     # only with --keep_unpruned_tree true
└── report/
    ├── homolog_search_report.tsv
    ├── homolog_taxonomy_report.tsv      # when homolog taxonomy is available
    ├── amplified_taxonomy_report.tsv    # when --input_taxonomy is supplied
    └── amplified_tcs_report.tsv         # when --input_taxonomy is supplied
- {family_id}.msa.fa
  Initial MAFFT alignment of the input protein sequences.
- {family_id}.amp.fa
  Amplified MSA after adding selected homologs with MAFFT.
- {family_id}.homologs.fa
  Selected homolog sequences added during MSA amplification. Exact sequence matches to any input sequence are removed before selection.
- {family_id}.amp.nwk
  Final amplified tree after optional MAD rooting and pruning of added homolog leaves. Its tips correspond to the original input sequences.
- {family_id}.amp.unpruned.nwk
  IQ-TREE 2 tree inferred directly from the amplified MSA before pruning. Published only when --keep_unpruned_tree true is given.
- homolog_search_report.tsv
  Reports the number of original sequences, retrieved homologs, selected homologs, and amplified sequences per family.
- homolog_taxonomy_report.tsv
  Summarizes the taxonomy of added homologs, including lineage diversity and least common ancestor information. Generated only when taxonomy metadata is available for the searched MMseqs2 database and taxonomy reporting is not disabled.
- amplified_taxonomy_report.tsv
  Summarizes taxonomy for each original family and its amplified counterpart, using rows such as fam1 and fam1.amp. Generated when --input_taxonomy is supplied and the required taxonomy information is available.
- amplified_tcs_report.tsv
  Reports taxonomic congruence scores for original and amplified trees. Generated when --input_taxonomy is supplied.
Nextflow intermediate task files remain in the work/ directory.
- --input_dir
  Directory containing input protein FASTA files.
  Default: sample_input
- --output_dir
  Output directory.
  Default: sample_output
- --threads
  Number of threads assigned to supported processes.
  Default: 4
- --max_memory
  Memory allocation for MMseqs2-labelled processes.
  Default: 16 GB
- --mafft_preset
  MAFFT preset used for the initial alignment and amplified MSA construction.
  Default: auto
  Available values: auto, fast, linsi, ginsi, einsi
- --mafft_options
  Additional MAFFT options appended to the selected preset.
  Default: empty
Example:
nextflow run ampliphy.nf -profile conda \
--input_dir sample_input \
--output_dir results \
--mafft_preset linsi
- --database
  Curated MMseqs2 database to use for homolog searching. Database names are accepted case-insensitively.
  Default: UniRef50
  Supported database choices include:
  - UniRef100
  - UniRef90
  - UniRef50
  - UniProtKB
  - UniProtKB/Swiss-Prot
  - NR
- --custom_database
  Path prefix to an existing local MMseqs2 database. If supplied, this overrides --database.
  Default: empty
- --database_dir
  Directory in which downloaded MMseqs2 databases are cached.
  Default: mmseqs_db
- --tmp_dir
  Temporary directory used by MMseqs2 and auxiliary resources.
  Default: ./tmp
- --mmseqs_options
  Additional options passed to mmseqs easy-search, for example E-value, sequence identity, or alignment coverage thresholds.
  Default: empty
- --max_depth
  Relative cap on the number of selected homologs. For an input family containing N sequences, at most max_depth × N homologs are selected.
  Default: 5
  Use inf to disable the depth-based cap.
- --max_seqs
  Absolute cap on the number of selected homologs. If both --max_depth and --max_seqs impose limits, the smaller limit is applied.
  Default: 0 (disabled)
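How the two caps combine can be worked through with a small Python sketch. It only restates the rule described for --max_depth and --max_seqs and is not taken from the pipeline: the function name `selection_cap` is hypothetical, and rounding a fractional product down is an assumption.

```python
import math


def selection_cap(n_input, max_depth=5, max_seqs=0):
    """Upper bound on selected homologs; math.inf means no cap applies."""
    # Relative cap: max_depth x N, unless max_depth is inf (cap disabled).
    if math.isinf(max_depth):
        depth_cap = math.inf
    else:
        depth_cap = math.floor(max_depth * n_input)  # assumption: round down
    # Absolute cap: 0 means disabled.
    abs_cap = max_seqs if max_seqs > 0 else math.inf
    # When both limits are active, the smaller one wins.
    return min(depth_cap, abs_cap)


print(selection_cap(10))                               # 50  (defaults: 5 x 10)
print(selection_cap(40, max_depth=3, max_seqs=500))    # 120 (depth cap is smaller)
print(selection_cap(200, max_depth=3, max_seqs=500))   # 500 (absolute cap is smaller)
```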
Example using a custom database and selection limits:
nextflow run ampliphy.nf -profile conda \
--input_dir sample_input \
--output_dir results \
--custom_database /path/to/mmseqs_db_prefix \
--max_depth 3 \
--max_seqs 500 \
--mmseqs_options "-e 1e-5 --min-seq-id 0.3"
- --input_taxonomy
  Directory containing optional <family_id>.tax files for the original input sequences. When supplied, enables amplified-taxonomy and TCS reporting for families with valid taxonomy files.
  Default: empty
- --ncbi_dir
  Directory in which NCBI Taxonomy dump files are stored or downloaded for taxonomy-based reporting.
  Default: ./ncbi_taxonomy
- --no_taxonomy
  Disable taxonomy reporting based on MMseqs2 database annotations.
  Default: false
Notes:
- homolog_taxonomy_report.tsv requires taxonomy metadata associated with the searched MMseqs2 database.
- amplified_taxonomy_report.tsv requires valid input taxonomy files and homolog taxonomy information.
- amplified_tcs_report.tsv uses input taxonomy to generate complete lineage files for the original tree tips.
Example:
nextflow run ampliphy.nf -profile conda \
--input_dir sample_input \
--input_taxonomy input_taxonomy \
--ncbi_dir ncbi_taxonomy \
--output_dir results
- --iqtree_options
  Options passed to IQ-TREE 2.
  Default: -m JTT+I+G4 -B 1000
Example:
nextflow run ampliphy.nf -profile conda \
--iqtree_options "-m LG+G4 -B 1000"
If an MSA contains fewer than four sequences, bootstrap options are disabled for that family because IQ-TREE cannot perform bootstrap analysis on fewer than four sequences.
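One way to picture this adjustment is the Python sketch below. It is a guess at the behavior, not AmpliPhy's implementation: the function name `effective_iqtree_options` is hypothetical, and it assumes that only the IQ-TREE 2 flags -B/--ufboot and -b/--boot, each followed by a replicate count, need to be dropped.

```python
import shlex

# Assumption: these are the only bootstrap flags to strip, each taking one value.
BOOTSTRAP_FLAGS = {"-B", "--ufboot", "-b", "--boot"}


def effective_iqtree_options(options, n_seqs):
    """Return the option tokens, without bootstrap flags when n_seqs < 4."""
    tokens = shlex.split(options)
    if n_seqs >= 4:
        return tokens
    kept = []
    skip_value = False
    for token in tokens:
        if skip_value:
            skip_value = False  # drop the replicate count that followed the flag
            continue
        if token in BOOTSTRAP_FLAGS:
            skip_value = True
            continue
        kept.append(token)
    return kept


print(effective_iqtree_options("-m JTT+I+G4 -B 1000", n_seqs=3))  # ['-m', 'JTT+I+G4']
print(effective_iqtree_options("-m JTT+I+G4 -B 1000", n_seqs=12))
```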
- --no_rooting
  Skip MAD rooting and prune directly from the inferred amplified tree.
  Default: false
- --mad_script
  Path to a custom MAD executable.
- --keep_unpruned_tree
  Publish {family_id}.amp.unpruned.nwk, the tree inferred from the amplified MSA before rooting/pruning.
  Default: false
- standard
  Uses tools available in the local environment.
- conda
  Enables the Bioconda-based environment defined in envs/ampliphy.yml.
When using AmpliPhy, please cite the AmpliPhy preprint:
- AmpliPhy preprint: doi.org/10.64898/2026.01.26.701724
AmpliPhy also relies on the following tools:
- Nextflow
- MMseqs2
- MAFFT
- IQ-TREE 2
- MAD
- gotree