EGAP (Entheome Genome Assembly Pipeline) v3.4.1 is a versatile bioinformatics pipeline for hybrid genome assembly using Oxford Nanopore (ONT), Illumina, and PacBio data. It evaluates assemblies based on BUSCO Completeness (Single + Duplicated), Assembly Contig Count, and N50, with additional metrics like L50 and GC-content available via QUAST.
- Preprocess & QC Reads
  - Merges multiple FASTQ files (`ont_combine_fastq_gz`, `illumina_extract_and_check`).
  - Trims and removes adapters (Trimmomatic, BBDuk).
  - Deduplicates reads (Clumpify).
  - Filters and corrects ONT reads (Filtlong, Ratatosk).
  - Generates read metrics (FastQC, NanoPlot, BBMap insert-size stats).
- Read Decontamination (new in v3.4.1)
  - Classifies ONT/PacBio long reads with Kraken2 against a user-supplied database.
  - Retains reads matching the target organism's domain (eukarya / bacteria / archaea); always keeps unclassified reads.
  - Preserves all reads: kept reads continue to the assembler, and removed reads are archived as a compressed `.fastq.gz` alongside the original pre-decontamination backup.
  - Non-fatal: skipped gracefully when no Kraken2 database is configured.
- Assembly
  - MaSuRCA: Illumina-only or hybrid (ONT/PacBio).
  - Flye: ONT-only or PacBio-only.
  - SPAdes: Illumina-only or hybrid (ONT).
  - hifiasm: PacBio-only.
  - Selects the best assembly from all available assemblies by BUSCO completeness, N50, and contig count.
  - Runs BUSCO/Compleasm on two lineages for completeness.
  - Runs QUAST for contiguity (N50, contig count, etc.).
- Assembly Polishing
  - Polishes with Racon (2x, if ONT/PacBio) and Pilon (if Illumina).
  - Removes haplotigs with purge_dups (if long reads).
- Assembly Curation
  - Scaffolds and patches with RagTag (if reference provided).
  - Closes gaps with TGS-GapCloser (ONT) or Abyss-Sealer (Illumina-only).
- Assembly Decontamination (new in v3.4.1)
  - Classifies every contig in the final curated assembly with Tiara (deep-learning classifier).
  - Removes non-target sequences (e.g. bacterial contamination from fungal assemblies).
  - Preserves removed sequences as a compressed `.fasta.gz` for auditability.
  - Decontaminated assembly is used for all downstream QC and reporting.
- Quality Assessments & Classification
  - Runs BUSCO/Compleasm on two lineages for completeness.
  - Runs QUAST for contiguity (N50, contig count, etc.).
  - Classifies assemblies as AMAZING, GREAT, OK, or POOR.
Optimized for fungal genomes, EGAP is adaptable to other organisms by adjusting lineages and references.
Supported Input Modes:
- Illumina-only (SRA, DIR, or RAW FASTQ)
- Illumina + Reference (GCA or FASTA)
- Illumina + ONT (SRA, DIR, or RAW FASTQ)
- Illumina + ONT + Reference
- PacBio-only (SRA, DIR, or RAW FASTQ)
- Assembly-only (for QC analysis)
Future developments: Support for ONT-only and ONT + Reference.
- Overview
- Requirements
- Installation
- Pipeline Flow
- Supported Sequencing Strategies
- Command-Line Usage
- TUI Interface
- File Management & Storage Optimization
- Per-Sample Logging
- Decontamination
- CSV Generation
- Quality Control Output Review
- Troubleshooting & FAQ
- Future Improvements
- References
- Changelog
- Contribution
- License
| Resource | Minimum | Recommended |
|---|---|---|
| CPU cores (threads) | 8 | 16+ |
| RAM | 32 GB | 64 GB+ (128 GB for large eukaryotic genomes) |
| Free disk space | ~150 GB per sample | 500 GB+ for multi-sample runs |
| Kraken2 database space | ~75 GB (uncompressed 16 GB Standard index plus hash files) | same |
A single EGAP run can peak at ~300 GB of intermediate files before cleanup; see File Management & Storage Optimization for how v3.4.1 reclaims this space as the pipeline progresses.
- OS: Linux x86_64 (primary target; tested on Debian/Ubuntu). macOS x86_64/arm64 is supported by `EGAP_setup.sh`, but some assemblers are Linux-only.
- Python: 3.8 (EGAP pins to `>=3.8,<3.9` because several dependencies, notably `tiara=1.0.3` and `numpy=1.19.5`, are incompatible with newer Python).
- Conda: Miniforge3, Miniconda3, or Anaconda (the installer uses Miniforge3).
- git, wget, tar on PATH.
- Optional: Docker ≥ 20.10 for container usage, Apptainer/Singularity ≥ 3.8 for HPC usage.
The following tools are installed:
- Trimmomatic
- BBMap
- FastQC
- NanoPlot
- Filtlong
- Ratatosk
- gfatools
- hifiasm
- MaSuRCA
- Flye
- SPAdes
- Racon
- Burrows-Wheeler Aligner
- SamTools
- BamTools
- Pilon
- purge_dups
- RagTag
- TGS-GapCloser
- ABYSS-Sealer
- QUAST
- BUSCO
- Compleasm
- Kraken2 (new in v3.4.1, read decontamination)
- Tiara (new in v3.4.1, assembly decontamination)
- pigz (new in v3.4.1, parallel FASTA/FASTQ compression)
The shell script EGAP_setup.sh at the repo root installs Miniforge3 (if absent), creates the EGAP_env conda environment, installs auxiliary tools, and optionally provisions the Kraken2 database:
git clone https://github.com/iPsychonaut/EGAP.git
cd EGAP
# Default: builds the Kraken2 standard 16 GB database from source (6-12 hrs).
bash EGAP_setup.sh
# Faster: download the pre-built 16 GB standard index (~1-2 hrs).
bash EGAP_setup.sh --kraken-prebuilt
# Skip the Kraken2 step entirely (decontamination will be disabled until you
# set KRAKEN2_DB manually).
bash EGAP_setup.sh --skip-kraken
# Customise thread count and database destination.
bash EGAP_setup.sh --kraken-prebuilt --threads 16 --kraken-db /data/kraken2_db
conda activate EGAP_env
The script appends export KRAKEN2_DB=<chosen path> to ~/.bashrc and ~/.zshrc so the variable persists across shells.
Build the image (the bundled Dockerfile produces a multi-env image with EGAP, EGEP, and Funannotate):
docker build -t entheome_ecosystem:3.4.1 .
The default ENTRYPOINT runs EGAP directly, so you can treat the image like the EGAP CLI. Bind-mount your data and Kraken2 database at runtime:
docker run --rm \
-e KRAKEN2_DB=/kraken2_db \
-v /path/to/kraken2_db:/kraken2_db:ro \
-v /path/to/data:/data \
entheome_ecosystem:3.4.1 \
--input_csv /data/samples.csv \
--output_dir /data/output \
--cpu_threads 16 --ram_gb 64
Interactive shell with all three conda envs on PATH (override the entrypoint):
docker run --rm -it --entrypoint bash \
-v /path/to/data:/data \
entheome_ecosystem:3.4.1
Open a terminal in the directory that contains entheome.sif.def and run:
sudo singularity build entheome.sif entheome.sif.def
Edit the parameters in nextflow.config, make sure entheome.sif is in the same directory as draft_assembly.nf and nextflow.config, and run:
nextflow run draft_assembly.nf -with-singularity entheome.sif
OR
Open a shell in the Singularity image, then load the pre-generated EGAP environment from inside that shell:
singularity shell -B /path/to/data/mnt:/path/to/data/mnt entheome.sif
source /opt/conda/etc/profile.d/conda.sh && \
conda activate EGAP_env
Alternatively, install EGAP in a dedicated environment through the Bioconda channel with the following command:
conda create -y -n EGAP_env python=3.8 && conda activate EGAP_env && conda install -y -c bioconda egap
EGAP uses Kraken2 to screen long reads for contamination. A Kraken2 database must be downloaded separately and the path provided via the KRAKEN2_DB environment variable. The pre-built Standard 16 GB database is recommended:
# Download the pre-built Standard 16 GB database
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_16gb_20240904.tar.gz
mkdir -p ~/kraken2_db
tar -xzf k2_standard_16gb_20240904.tar.gz -C ~/kraken2_db
# Set the environment variable for the current session and make it permanent
export KRAKEN2_DB=~/kraken2_db
echo 'export KRAKEN2_DB=~/kraken2_db' >> ~/.bashrc
Docker users: Pass the variable and mount the database directory at runtime (the entrypoint is overridden here to get a shell):
docker run --rm -it --entrypoint bash \
-v /path/to/kraken2_db:/kraken2_db \
-e KRAKEN2_DB=/kraken2_db \
-v /path/to/data:/data \
entheome_ecosystem:3.4.1
Singularity users: Bind-mount the database directory and export the variable before running:
export KRAKEN2_DB=/path/to/kraken2_db
singularity shell \
-B /path/to/kraken2_db:/path/to/kraken2_db \
-B /path/to/data:/data \
entheome.sif
If KRAKEN2_DB is not set or points to an invalid path, the read-decontamination step is skipped automatically and a warning is printed to the log.
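The skip behaviour described above can be sketched as a small pre-flight check. This is an illustrative sketch, not EGAP's actual code: the function name `usable_kraken2_db` and the warning text are hypothetical, and treating a directory as valid only when it holds the three standard Kraken2 index files (`hash.k2d`, `opts.k2d`, `taxo.k2d`) is an assumption about what "invalid path" means.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Index files that a built Kraken2 database directory contains.
K2_REQUIRED = ("hash.k2d", "opts.k2d", "taxo.k2d")


def usable_kraken2_db(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the database path if KRAKEN2_DB looks usable, else None (step skipped)."""
    env = os.environ if env is None else env
    raw = env.get("KRAKEN2_DB", "").strip()
    if not raw:
        # Illustrative message; the real log wording may differ.
        print("WARN: KRAKEN2_DB is not set; skipping read decontamination.")
        return None
    db = Path(raw).expanduser()
    missing = [name for name in K2_REQUIRED if not (db / name).is_file()]
    if missing:
        print(f"WARN: {db} is missing {', '.join(missing)}; skipping read decontamination.")
        return None
    return db
```

Returning `None` rather than raising keeps the check non-fatal, which matches the documented behaviour of continuing the run without decontamination.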
At a glance, EGAP moves each sample through eight stages:
Preprocess → Decontaminate reads → Assemble → Compare → Polish → Curate → Decontaminate assembly → Assess & Report
| Stage | What it does |
|---|---|
| Preprocess | Merges raw FASTQs; trims adapters (Trimmomatic, BBDuk); deduplicates (Clumpify); filters/corrects ONT reads (Filtlong, Ratatosk); runs FastQC/NanoPlot metrics. |
| Decontaminate reads (new in v3.4.1) | Classifies long reads with Kraken2 against a user-supplied database; keeps target-domain + unclassified reads; archives removed reads. Non-fatal; skipped if KRAKEN2_DB is unset. |
| Assemble | Runs the relevant assembler(s) for the sample's read types (see Supported Sequencing Strategies). |
| Compare | Evaluates every candidate assembly with BUSCO/Compleasm + QUAST and picks the best by completeness, N50, and contig count. |
| Polish | Racon (×2 with long reads) and Pilon (with Illumina); removes haplotigs with purge_dups if long reads are present. |
| Curate | Scaffolds with RagTag (if reference provided) and gap-fills with TGS-GapCloser (ONT) or Abyss-Sealer (Illumina). |
| Decontaminate assembly (new in v3.4.1) | Classifies every contig with Tiara (deep-learning) and removes non-target sequence; archives removed contigs. |
| Assess & Report | Final BUSCO/Compleasm + QUAST; classifies assembly as AMAZING / GREAT / OK / POOR; emits HTML report and per-sample log. |
Which assemblers EGAP invokes depends entirely on which read types you supply in the input CSV. After every available assembler has produced a draft, EGAP picks the best by BUSCO + N50 + contig count.
| Input | Assemblers run | Best pick scored on |
|---|---|---|
| Illumina only | MaSuRCA, SPAdes | BUSCO, N50, contig count |
| Illumina + Reference | MaSuRCA, SPAdes | BUSCO, N50, contig count (scaffolded against reference) |
| ONT only | Flye | BUSCO, N50, contig count |
| ONT + Reference | Flye | BUSCO, N50, contig count (scaffolded against reference) |
| PacBio only | Flye, hifiasm | BUSCO, N50, contig count |
| ONT + Illumina (hybrid) | MaSuRCA, SPAdes | BUSCO, N50, contig count |
| ONT + Illumina + Reference | MaSuRCA, SPAdes | BUSCO, N50, contig count (scaffolded against reference) |
| Assembly only (REF_SEQ / REF_SEQ_GCA, no reads) | (QC-only mode; no assembly is run) | BUSCO, QUAST metrics |
Note: ONT-only and ONT + Reference modes currently run Flye only. Additional ONT-focused assemblers (NextDenovo, Canu) are on the roadmap under Future Improvements.
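The table above amounts to a lookup from the set of supplied read types to the assemblers that run. A minimal sketch of that lookup follows; the name `assemblers_for` and the lowercase read-type labels are hypothetical, and combinations the table does not list are rejected rather than guessed.

```python
# Read-type combinations and assemblers, transcribed from the table above.
ASSEMBLERS = {
    frozenset({"illumina"}): ["MaSuRCA", "SPAdes"],
    frozenset({"ont"}): ["Flye"],
    frozenset({"pacbio"}): ["Flye", "hifiasm"],
    frozenset({"ont", "illumina"}): ["MaSuRCA", "SPAdes"],
    frozenset(): [],  # assembly-only (QC) mode: no assembler is run
}


def assemblers_for(read_types):
    """Return the assemblers for a collection of read types such as {"ont", "illumina"}."""
    key = frozenset(t.lower() for t in read_types)
    if key not in ASSEMBLERS:
        raise ValueError(f"Unsupported read-type combination: {sorted(key)}")
    return ASSEMBLERS[key]
```

A reference sequence does not change which assemblers run; per the table it only affects scaffolding later in the pipeline, so it is not part of the key.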
| Flag | Short | Type | Default | Description |
|---|---|---|---|---|
| `--input_csv` | `-csv` | str | (required) | Path to CSV with sample data |
| `--output_dir` | `-o` | str | (required) | Path to the desired output directory |
| `--cpu_threads` | `-t` | int | `1` | Number of CPU threads to use |
| `--ram_gb` | `-r` | int | `8` | RAM in GB to allocate |
| `--dry_run` | | flag | `False` | Log all file-management actions (removals, compressions) without executing them. Equivalent to setting `EGAP_DRY_RUN=1` in the environment. |
| `--tui` | | flag | `False` | Launch the interactive TUI instead of plain terminal output. All other flags pass through to the TUI. |
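For readers wiring EGAP into their own scripts, the flag table can be mirrored with `argparse`. This is a sketch built only from the table above, not a copy of EGAP's parser; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented EGAP flags, types, and defaults."""
    parser = argparse.ArgumentParser(prog="EGAP")
    parser.add_argument("--input_csv", "-csv", type=str, required=True,
                        help="Path to CSV with sample data")
    parser.add_argument("--output_dir", "-o", type=str, required=True,
                        help="Path to the desired output directory")
    parser.add_argument("--cpu_threads", "-t", type=int, default=1,
                        help="Number of CPU threads to use")
    parser.add_argument("--ram_gb", "-r", type=int, default=8,
                        help="RAM in GB to allocate")
    parser.add_argument("--dry_run", action="store_true",
                        help="Log file-management actions without executing them")
    parser.add_argument("--tui", action="store_true",
                        help="Launch the interactive TUI")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```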
Standard run:
EGAP -csv /path/to/input.csv -o /path/to/output_dir -t 16 -r 64
Run with the interactive TUI:
EGAP -csv /path/to/input.csv -o /path/to/output_dir -t 16 -r 64 --tui
Audit what would be cleaned up without deleting anything:
EGAP -csv /path/to/input.csv -o /path/to/output_dir -t 16 -r 64 --dry_run
Audit in TUI mode:
EGAP -csv /path/to/input.csv -o /path/to/output_dir -t 16 -r 64 --tui --dry_run
| Variable | Description |
|---|---|
| `EGAP_DRY_RUN=1` | Enable dry-run mode (equivalent to `--dry_run`). Checked at each file-management call so it can be set after import. |
| `KRAKEN2_DB=/path/to/db` | Path to a Kraken2 database for read decontamination. Alternatively, supply a `KRAKEN2_DB` column in the CSV. If neither is set, the decontamination step is skipped with a warning. |
EGAP v3.4.1 includes a full terminal user interface built with Textual. It provides a live view of pipeline progress without leaving the terminal.
The TUI can be started in two ways:
Via the --tui flag on EGAP.py (recommended). All arguments pass through automatically:
EGAP -csv /path/to/input.csv -o /path/to/output_dir -t 16 -r 64 --tui
Standalone (from the bin/ directory):
python EGAP_TUI.py -csv /path/to/input.csv -o /path/to/output_dir -t 16 -r 64
Both launch modes support --dry_run.
┌─────────────────────────────────────┬────────────────────────────────────────┐
│ EGAP Banner (ANSI art) │ Pipeline Settings (auto-scrolling) │
├─────────────────┬───────────────────┴─────────────┬──────────────────────────┤
│ Live Log │ Step Progress Table │ CPU / RAM Monitor │
│ (streaming) │ (per-sample, per-step) │ (per-core history) │
└─────────────────┴─────────────────────────────────┴──────────────────────────┘
- Live Log: real-time streaming of all subprocess output, with numpy API noise filtered out.
- Step Progress Table: one row per sample × step; cells update from `PENDING → RUNNING → PASS/FAIL` as each step completes.
- CPU / RAM Monitor: per-core utilization history bars and live RAM/swap usage, refreshed every 0.5 s.
- Settings Panel: all pipeline settings pulled from `EGAP.py`'s single source of truth, auto-scrolling through the full list.
| Key | Action |
|---|---|
| `q` or `Ctrl+Q` | Gracefully shut down the pipeline (SIGTERM → SIGKILL) and quit |
| `F2` | Copy the full log buffer to the clipboard |
A single EGAP run can grow from ~60 GB to 300+ GB of intermediate files that are not needed after each step completes. v3.4.1 introduces automatic cleanup via the centralized bin/file_manager.py module.
| Step | Files / Directories Removed After Confirmation |
|---|---|
| Pilon prep | The SAM file (can be 10+ GB) is deleted as soon as the sorted BAM is confirmed. The intermediate sorted BAM is deleted once the final indexed BAM is ready. |
| Polish assembly | Racon PAF alignment files (*.paf), per-round Racon FASTAs (_racon_polish_1.fasta, _racon_polish_2.fasta), and all five BWA-mem2 index sidecars (.0123, .bwt.2bit.64, .pac, .amb, .ann). |
| MaSuRCA assembly | The CA/ CABOG intermediate tree (~11 GB), work1/, and large working files (pe.cor.fa, pe.linking.fa, the Guillaume K-unitig FASTAs, bbmerge_interleaved.fq, bbmap_data.fq). |
| SPAdes assembly | Per-kmer directories (K21/ through K99/), contigs.fasta, before_rr.fasta, misc/, tmp/. |
| Compleasm (QC) | Lineage .tar.gz archives in mb_downloads/ once the extracted directory and .done marker are both confirmed present. |
| Read decontamination | The pre-decontamination backup and the removed-reads file are compressed to .fastq.gz with pigz. The active decontaminated file used by assemblers is left uncompressed. |
| Assembly decontamination | The Tiara working FASTAs (_tiara_kept.fasta, _tiara_removed.fasta) are compressed to .fasta.gz. The final output (_decontaminated.fasta) is left uncompressed for downstream tools. |
To audit exactly what would be deleted/compressed without touching any files:
# Via CLI flag
EGAP -csv input.csv -o output/ -t 16 -r 64 --dry_run
# Via environment variable (also works for subprocesses)
export EGAP_DRY_RUN=1
EGAP -csv input.csv -o output/ -t 16 -r 64
Every planned action is logged with the size that would be freed:
DRY_RUN: Would remove file (11.2 GB): /path/to/sample/pilon_polish/racon.sam
DRY_RUN: Would remove directory (10.8 GB): /path/to/sample/masurca_assembly/CA
- Never removes on error: files are only cleaned up after the downstream output is confirmed present and non-empty.
- Preserves step-skip guards: files required by existing `if os.path.exists(...)` re-run checks are always kept.
- Logs freed space: every real deletion logs the size freed, e.g. `NOTE: Removed intermediate file (11.2 GB freed): ...`.
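The safety rules above can be sketched as a single guarded removal helper. This is not the code in `bin/file_manager.py`; the function names `remove_if_confirmed` and `size_gb` are hypothetical, and the "directory" variant of the removal message is an assumption modelled on the file message quoted above.

```python
import os
import shutil
from pathlib import Path


def size_gb(path: Path) -> float:
    """Size of a file, or the total size of all files under a directory, in GB."""
    if path.is_dir():
        total = sum(p.stat().st_size for p in path.rglob("*") if p.is_file())
    else:
        total = path.stat().st_size
    return total / 1024 ** 3


def remove_if_confirmed(target, downstream) -> bool:
    """Remove target only if the downstream output exists and is non-empty."""
    target, downstream = Path(target), Path(downstream)
    if not target.exists():
        return False
    if not (downstream.is_file() and downstream.stat().st_size > 0):
        return False  # never remove when the downstream output is unconfirmed
    kind = "directory" if target.is_dir() else "file"
    freed = size_gb(target)
    # Read the variable at call time so it can be set after import.
    if os.environ.get("EGAP_DRY_RUN") == "1":
        print(f"DRY_RUN: Would remove {kind} ({freed:.1f} GB): {target}")
        return False
    if target.is_dir():
        shutil.rmtree(target)
    else:
        target.unlink()
    print(f"NOTE: Removed intermediate {kind} ({freed:.1f} GB freed): {target}")
    return True
```

For example, `remove_if_confirmed("sample.sam", "sample.sorted.bam")` would delete the SAM file only once the sorted BAM is present and non-empty.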
Each row in the input CSV gets its own log file written in real time throughout the run:
{output_dir}/{sample_id}_log.txt
- Opened in append mode so re-runs accumulate into the same file.
- Contains all standard output for that sample's steps (ANSI escape codes stripped for clean reading).
- In plain terminal mode, stdout is simultaneously mirrored to the terminal and the log file via an internal `_Tee` class.
- In TUI mode, the TUI's `log_line()` method writes to both the on-screen log widget and the per-sample file.
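The mirroring described above follows the familiar "tee" pattern. The sketch below is a generic illustration, not EGAP's internal `_Tee`; the class name `Tee` and the regular expression used to strip ANSI CSI sequences are assumptions.

```python
import re
import sys

# Matches ANSI CSI escape sequences such as colour codes (e.g. "\x1b[32m").
ANSI_RE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")


class Tee:
    """File-like object that writes to the terminal and a log file at once."""

    def __init__(self, terminal, log_file):
        self.terminal = terminal
        self.log_file = log_file

    def write(self, text: str) -> int:
        self.terminal.write(text)                   # terminal keeps its colours
        self.log_file.write(ANSI_RE.sub("", text))  # log gets plain text
        return len(text)

    def flush(self) -> None:
        self.terminal.flush()
        self.log_file.flush()


if __name__ == "__main__":
    # Append mode so re-runs accumulate into the same per-sample log.
    with open("sample_log.txt", "a", encoding="utf-8") as log:
        sys.stdout = Tee(sys.__stdout__, log)
        print("\x1b[32mPASS\x1b[0m preprocessing")
        sys.stdout = sys.__stdout__
```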
Runs after ONT/PacBio preprocessing and before assembly. Classifies every read against a Kraken2 database and partitions them by domain.
Domain keep profiles:
| Organism Kingdom | Domains Kept | Domains Removed |
|---|---|---|
| Bacteria | bacteria, unclassified, other | archaea, eukarya, viruses |
| Archaea | archaea, unclassified, other | bacteria, eukarya, viruses |
| Flora / Funga / Fauna | eukarya, unclassified, other | bacteria, archaea, viruses |
unclassified and other reads are always kept because they may represent genuine target sequence absent from the database or with ambiguous taxonomy.
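The keep profiles above reduce to a per-kingdom set of domains. A minimal sketch, assuming each read has already been assigned one of the domain labels in the table (how EGAP derives that label from Kraken2 output is not shown here; `keep_read` is a hypothetical name):

```python
# Domains kept per ORGANISM_KINGDOM, transcribed from the table above.
_EUKARYOTE_KEEP = {"eukarya", "unclassified", "other"}
KEEP_DOMAINS = {
    "bacteria": {"bacteria", "unclassified", "other"},
    "archaea": {"archaea", "unclassified", "other"},
    "flora": _EUKARYOTE_KEEP,
    "funga": _EUKARYOTE_KEEP,
    "fauna": _EUKARYOTE_KEEP,
}


def keep_read(kingdom: str, domain: str) -> bool:
    """Return True if a read with this domain label is kept for this kingdom."""
    return domain.lower() in KEEP_DOMAINS[kingdom.lower()]
```

With this profile a fungal sample keeps `eukarya`, `unclassified`, and `other` reads, and drops `bacteria`, `archaea`, and `viruses`.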
Configuring the Kraken2 database (in priority order):
1. `KRAKEN2_DB` environment variable
2. `KRAKEN2_DB` column in the input CSV
If neither is set the step is skipped with a WARN and the pipeline continues; it is non-fatal.
Output files (all in {species_dir}/kraken2_reads/):
| File | Description |
|---|---|
| `{label}_kraken2.out` | Raw Kraken2 per-read output |
| `{label}_kraken2_report.txt` | Kraken2 summary report |
| `{label}_removed_reads.fastq.gz` | Contaminant reads (compressed, kept for audit) |
| `decontaminate_reads_done.txt` | Completion marker (prevents re-running on resume) |
The original pre-decontamination reads are renamed to _pre_decontam.fastq.gz (compressed). The decontaminated reads overwrite the _highest_mean_qual_long_reads.fastq path that all assemblers already expect, so no assembler changes are needed.
Runs after curation and before final QC. Classifies every contig using Tiara's deep-learning model.
Class keep profiles:
| Organism Kingdom | Classes Kept | Classes Removed |
|---|---|---|
| Bacteria | bacteria, prokarya, unknown | eukarya, archaea, organelle |
| Archaea | archaea, prokarya, unknown | eukarya, bacteria, organelle |
| Flora / Funga / Fauna | eukarya, organelle, unknown | bacteria, archaea, prokarya |
organelle is kept for eukaryotes (mitochondria, plastids) but removed for prokaryotes (likely host contamination). unknown is always kept to avoid discarding genuine low-complexity sequence.
Output files (all in {sample_dir}/decontamination/):
| File | Description |
|---|---|
| `tiara_output.txt` | Raw Tiara classification TSV |
| `{sample_id}_tiara_kept.fasta.gz` | Retained sequences (compressed working copy) |
| `{sample_id}_tiara_removed.fasta.gz` | Removed sequences (compressed, kept for audit) |
| `decontamination_done.txt` | Completion marker |
The final decontaminated assembly is written to {sample_dir}/{sample_id}_decontaminated.fasta (uncompressed, for downstream tools).
A warning is issued if more than 50% of sequences are removed, which may indicate an incorrect kingdom assignment or an unexpected Tiara result.
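The class keep profiles and the 50% warning can be sketched together. This assumes the Tiara classifications have already been read into a mapping of contig name to class label; the function name `partition_contigs` is hypothetical, and parsing `tiara_output.txt` is left out.

```python
# Tiara classes kept per ORGANISM_KINGDOM, transcribed from the table above.
_EUKARYOTE_KEEP = {"eukarya", "organelle", "unknown"}
KEEP_CLASSES = {
    "bacteria": {"bacteria", "prokarya", "unknown"},
    "archaea": {"archaea", "prokarya", "unknown"},
    "flora": _EUKARYOTE_KEEP,
    "funga": _EUKARYOTE_KEEP,
    "fauna": _EUKARYOTE_KEEP,
}


def partition_contigs(kingdom, contig_classes):
    """Split contigs into (kept, removed) and flag removal of more than 50%."""
    keep = KEEP_CLASSES[kingdom.lower()]
    kept, removed = [], []
    for name, tiara_class in contig_classes.items():
        (kept if tiara_class.lower() in keep else removed).append(name)
    # Warn when more than half of all sequences would be removed.
    warn = bool(contig_classes) and len(removed) / len(contig_classes) > 0.5
    return kept, removed, warn
```

A `warn` value of `True` is the cue to re-check `ORGANISM_KINGDOM` before trusting the decontaminated assembly.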
EGAP requires a CSV file that describes each sample.
The CSV file should have the following header and columns:
| ONT_SRA | ONT_RAW_DIR | ONT_RAW_READS | ILLUMINA_SRA | ILLUMINA_RAW_DIR | ILLUMINA_RAW_F_READS | ILLUMINA_RAW_R_READS | PACBIO_SRA | PACBIO_RAW_DIR | PACBIO_RAW_READS | SPECIES_ID | SAMPLE_ID | ORGANISM_KINGDOM | ORGANISM_KARYOTE | BUSCO_1 | BUSCO_2 | EST_SIZE | REF_SEQ_GCA | REF_SEQ | KRAKEN2_DB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| None | None | None | SRA00000001 | None | None | None | None | None | None | Ab_sample1 | Ab_sample1 | Funga | Eukaryote | basidiomycota | agaricales | 55m | GCA00000001.1 | None | None |
| None | None | /path/to/ONT/sample.fastq | None | None | /path/to/Illumina/s_1.fastq | /path/to/Illumina/s_2.fastq | None | None | None | Ab_sample2 | Ab_sample2_ONT | Funga | Eukaryote | basidiomycota | agaricales | 60m | None | None | /path/to/kraken2_db |
- ONT_SRA: Oxford Nanopore Sequence Read Archive (SRA) accession number. Use `None` if specifying individual files.
- ONT_RAW_DIR: Path to the directory containing all raw ONT reads. Use `None` if specifying individual files.
- ONT_RAW_READS: Path to the combined raw ONT FASTQ reads (e.g., `/path/to/ONT/sample1.fastq`).
- ILLUMINA_SRA: Illumina Sequence Read Archive (SRA) accession number. Use `None` if specifying individual files.
- ILLUMINA_RAW_DIR: Path to the directory containing all raw Illumina reads. Use `None` if specifying individual files.
- ILLUMINA_RAW_F_READS: Path to the raw forward Illumina reads (e.g., `/path/to/Illumina/sample1_1.fastq`).
- ILLUMINA_RAW_R_READS: Path to the raw reverse Illumina reads (e.g., `/path/to/Illumina/sample1_2.fastq`).
- PACBIO_SRA: PacBio Sequence Read Archive (SRA) accession number. Use `None` if specifying individual files.
- PACBIO_RAW_DIR: Path to the directory containing all raw PacBio reads. Use `None` if specifying individual files.
- PACBIO_RAW_READS: Path to the combined raw PacBio FASTQ reads (e.g., `/path/to/PACBIO/sample1.fastq`).
- SPECIES_ID: Species ID formatted as `<full species name>` (e.g., `Escherichia_coli`).
- SAMPLE_ID: Sample ID formatted as `<full species name>-<other identifiers>` (e.g., `Escherichia_coli-Illu-SRR32496875`).
- ORGANISM_KINGDOM: Kingdom of the organism (`Bacteria`, `Archaea`, `Flora`, `Funga`, or `Fauna`). Used by both Kraken2 and Tiara decontamination.
- ORGANISM_KARYOTE: Karyote type of the organism (e.g., `Eukaryote`, `Prokaryote`).
- BUSCO_1: Name of the first Compleasm/BUSCO database (e.g., `basidiomycota`).
- BUSCO_2: Name of the second Compleasm/BUSCO database (e.g., `agaricales`).
- EST_SIZE: Estimated genome size (e.g., `55m` for 55 Mbp, `5g` for 5 Gbp).
- REF_SEQ_GCA: GenBank assembly (GCA) accession number (or `None`).
- REF_SEQ: Path to the reference genome for assembly scaffolding (or `None`).
- KRAKEN2_DB (optional, new in v3.4.1): Path to a Kraken2 database for read decontamination. Overrides the `KRAKEN2_DB` environment variable. Omit the column entirely or use `None` to rely on the env var or skip decontamination.
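As a worked example of the `EST_SIZE` notation, the sketch below converts values such as `55m` or `5g` into base pairs. It covers only the two suffixes documented above; the function name `parse_est_size` is hypothetical, and whether EGAP accepts other suffixes or decimal values is not stated here.

```python
import re

# Suffix multipliers for the two documented units (Mbp and Gbp).
_UNITS = {"m": 10 ** 6, "g": 10 ** 9}


def parse_est_size(value: str) -> int:
    """Convert an EST_SIZE string such as "55m" or "5g" into base pairs."""
    match = re.fullmatch(r"(\d+)([mg])", value.strip().lower())
    if match is None:
        raise ValueError(f"Unrecognised EST_SIZE value: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]
```

A quick sanity check like this before launching a run can catch a value that is off by a unit (for example `50m` entered for a 5 Gbp genome).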
- If you are providing ANY raw reads, ensure they exist in their appropriate folder (`/path/to/sample_dir/Illumina`, `/path/to/sample_dir/ONT`, etc.); you may need to create the sample_dir from the output_dir, species_id, and sample_id (`/output_dir/species_id/sample_id`).
- If you are providing `ILLUMINA_RAW_F_READS` and `ILLUMINA_RAW_R_READS`, make sure the directory path DOES NOT CONTAIN `_1` or `_2`, while the read file names DO CONTAIN `_1` or `_2`.
- If you provide a value for `ILLUMINA_RAW_DIR`, set `ILLUMINA_RAW_F_READS` and `ILLUMINA_RAW_R_READS` to `None`. EGAP will automatically detect and process all paired-end reads within that directory. The same applies for `ONT_RAW_DIR`.
- EGAP automatically renames `.fq` and `.fq.gz` files to `.fastq`/`.fastq.gz` at startup.
- Ensure that all file paths are correct and accessible.
- The CSV file should not contain extra spaces or special characters in the headers.
- To run QC only on an already built assembly: provide the assembly path in `REF_SEQ`, or a GCA accession number to download in `REF_SEQ_GCA`, along with `ORGANISM_KARYOTE` and the two BUSCO databases (`BUSCO_1`, `BUSCO_2`); DO NOT PROVIDE an estimated size (`EST_SIZE`).
EGAP_test.csv is included in this repository to run test examples. Running all four samples takes about 24 hours on a 16-thread, 64 GB system.
ONT_SRA,ONT_RAW_DIR,ONT_RAW_READS,ILLUMINA_SRA,ILLUMINA_RAW_DIR,ILLUMINA_RAW_F_READS,ILLUMINA_RAW_R_READS,PACBIO_SRA,PACBIO_RAW_DIR,PACBIO_RAW_READS,SAMPLE_ID,SPECIES_ID,ORGANISM_KINGDOM,ORGANISM_KARYOTE,BUSCO_1,BUSCO_2,EST_SIZE,REF_SEQ_GCA,REF_SEQ,KRAKEN2_DB
None,None,None,None,None,None,None,None,None,None,Escherichia_coli-RefSeq,Escherichia_coli,Bacteria,prokaryote,gammaproteobacteria,enterobacterales,None,GCA_000005845.2,None,None
None,None,None,SRR32496875,None,None,None,None,None,None,Escherichia_coli-Illu-RefSeq,Escherichia_coli,Bacteria,prokaryote,gammaproteobacteria,enterobacterales,5m,GCA_000005845.2,None,None
SRR32405433,None,None,SRR32496875,None,None,None,None,None,None,Escherichia_coli-ONT-Illu,Escherichia_coli,Bacteria,prokaryote,gammaproteobacteria,enterobacterales,5m,None,None,/path/to/kraken2_db
None,None,None,None,None,None,None,SRR31460895,None,None,Escherichia_coli-PacBio,Escherichia_coli,Bacteria,prokaryote,gammaproteobacteria,enterobacterales,5m,None,None,None
If you are providing your own data locally, be sure to have a species folder and, if needed, a sub-folder matching your Species ID:
Example: Illumina + ONT data for Psilocybe cubensis B+ with reference sequence:
/path/to/EGAP/EGAP_Processing/Ps_cubensis/Ps_cubensis_B+/Illumina/f_reads.fastq
/path/to/EGAP/EGAP_Processing/Ps_cubensis/Ps_cubensis_B+/Illumina/r_reads.fastq
/path/to/EGAP/EGAP_Processing/Ps_cubensis/Ps_cubensis_B+/ONT/reads.fastq
/path/to/EGAP/EGAP_Processing/Ps_cubensis/ref_seq.fasta
If no sub-folder for sub-species is needed, place everything in the main species folder:
/path/to/EGAP/EGAP_Processing/Ps_semilanceata/Illumina/f_reads.fastq
EGAP generates final assemblies along with:
- QUAST metrics (contig count, N50, L50, GC%, coverage)
- BUSCO/Compleasm plots showing Single, Duplicated, Fragmented & Missing scores.
- Final assembly classification: AMAZING, GREAT, OK, or POOR
- Per-sample log file (`{sample_id}_log.txt`) with the full run record.
The current thresholds for each metric classification (subject to change) are:
- first_busco_c = {"AMAZING": ≥98.5, "GREAT": ≥90.0, "OK": ≥75.0, "POOR": <75.0}
- second_busco_c = {"AMAZING": ≥98.5, "GREAT": ≥90.0, "OK": ≥75.0, "POOR": <75.0}
- contigs_thresholds = {"AMAZING": <100, "GREAT": <1000, "OK": <10000, "POOR": >10000}
- n50_thresholds = {"AMAZING": >100000, "GREAT": >10000, "OK": >1000, "POOR": <1000}
- l50_thresholds = {"AMAZING": #, "GREAT": #, "OK": #, "POOR": #} (still determining best metrics)
BUSCO outputs are evaluated based on:
- Greater than or equal to 98.5% Completion (sum of Single and Duplicated genes) for an AMAZING assembly
- Greater than or equal to 90.0% Completion for a GREAT assembly
- Greater than or equal to 75.0% Completion for an OK assembly
- Less than 75.0% Completion for a POOR assembly
Additionally, fewer contigs aligning to BUSCO genes is preferable. Contigs with only duplicated genes are excluded from the plot (noted in the x-axis label).
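The per-metric thresholds listed above can be sketched as three small classifiers. This is an illustration only: the function names are hypothetical, values that fall exactly on an unspecified boundary (for example exactly 10000 contigs) are treated as the lower class here, and how EGAP combines the per-metric ratings into one final classification is not covered.

```python
def classify_busco(completeness: float) -> str:
    """Rate BUSCO completeness (Single + Duplicated, in percent)."""
    if completeness >= 98.5:
        return "AMAZING"
    if completeness >= 90.0:
        return "GREAT"
    if completeness >= 75.0:
        return "OK"
    return "POOR"


def classify_contigs(contig_count: int) -> str:
    """Rate the contig count; fewer contigs is better."""
    if contig_count < 100:
        return "AMAZING"
    if contig_count < 1000:
        return "GREAT"
    if contig_count < 10000:
        return "OK"
    return "POOR"


def classify_n50(n50: int) -> str:
    """Rate N50 in base pairs; larger is better."""
    if n50 > 100000:
        return "AMAZING"
    if n50 > 10000:
        return "GREAT"
    if n50 > 1000:
        return "OK"
    return "POOR"
```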
Example BUSCO plots (images not reproduced here): Ps. cubensis B+, Ps. semilanceata, and Pa. papilionaceus, each against the agaricales and basidiomycota lineages.
Q: The pipeline logs WARN: No Kraken2 database found ... Skipping read decontamination. Is that a problem?
No. Read decontamination is non-fatal by design. If KRAKEN2_DB is unset or points at an invalid directory, the step is skipped with a warning and the pipeline continues. To enable it, set KRAKEN2_DB to a valid database path (see Installation, Kraken2) or add a KRAKEN2_DB column to your input CSV.
Q: MaSuRCA fails with a CABOG / unitigger error partway through assembly.
MaSuRCA's CABOG stage is sensitive to thread count and RAM. Try reducing --cpu_threads (for example from 32 to 16) and ensuring the machine has at least 2 × estimated genome size in free RAM. Also confirm that EST_SIZE in the CSV is realistic; a wildly wrong value (e.g. 50m for a 5 Gbp genome) can trigger unitigger failures.
Q: Tiara removes >50% of my contigs and the warning fires.
Tiara's classifier is kingdom-aware. Double-check that ORGANISM_KINGDOM in the CSV matches the sample (for fungi use Funga, not Flora or Fauna). If the kingdom is correct, inspect {sample_dir}/decontamination/tiara_output.txt; a genuinely contaminated assembly can legitimately lose more than half its contigs. The removed sequences are preserved as {sample_id}_tiara_removed.fasta.gz for manual review.
Q: Disk fills up during a run even though v3.4.1 is supposed to auto-clean intermediates.
Cleanup only fires after the downstream output is confirmed present (a safety guard). If a step fails, intermediates are retained so you can resume without re-running expensive upstream work. Use --dry_run to audit what would be removed on a fresh run, and check the per-sample log for Removed intermediate file (X GB freed) entries to confirm cleanup is happening. For a stalled run, inspect {output_dir}/{sample_id}/ for the largest directories; masurca_assembly/CA/ and spades_assembly/K*/ are the usual culprits if a run aborted mid-assembly.
Q: The TUI shows PENDING forever on one sample while others progress.
EGAP processes samples sequentially by default. The TUI renders all samples up front but only the current one actively runs. To parallelise across samples, run multiple EGAP invocations with separate CSVs.
Q: docker build fails pulling packages from bioconda.
Bioconda occasionally throws solver conflicts when transitive dependencies shift. If a build fails mid-env-create, just re-run; conda's solver is non-deterministic and a retry often succeeds. If it persistently fails, check the build log for the conflicting package and file an issue. The Dockerfile's version pins (numpy=1.19.5, tiara=1.0.3, kraken2=2.1.6, flye=2.9.5, etc.) are load-bearing and documented inline.
Q: Singularity build fails on an HPC with "operation not permitted".
Most HPCs disable --fakeroot and require pre-built images. Build the SIF on a machine you control (with sudo) and copy the .sif file to the HPC. Apptainer/Singularity ≥ 3.8 is required.
Q: EGAP hangs at "Downloading Compleasm lineage…" or "Downloading from SRA…".
Network-bound steps have no built-in timeout. Check outbound connectivity to NCBI (nslookup sra-download.ncbi.nlm.nih.gov) and to https://busco-data.ezlab.org. On shared HPCs a proxy may be needed; set HTTPS_PROXY before invoking EGAP.
Q: A hybrid (ONT + Illumina) run fails curation or assembly decontamination with ERROR: FASTA file contains non-nucleotide sequences.
Pilon emits IUPAC ambiguity codes (notably K for G/T and R for A/G) at positions where short- and long-read evidence disagree, so polished hybrid assemblies are not pure ATCGN. As of v3.4.1 the FASTA validator accepts the full IUPAC nucleotide alphabet (ACGTUNRYSWKMBDHV) and prints the offending characters when something genuinely invalid appears. If you still see this error on v3.4.1+, the diagnostic will list the unexpected characters, which points to true corruption rather than expected ambiguity.
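The validation rule described above can be sketched as a check that collects any character outside the IUPAC nucleotide alphabet. This is not EGAP's `validate_fasta`; the name `invalid_fasta_chars` is hypothetical, and accepting lowercase bases while rejecting gap characters such as `-` are assumptions.

```python
# Full IUPAC nucleotide alphabet accepted as of v3.4.1, per the text above.
IUPAC_NUCLEOTIDES = set("ACGTUNRYSWKMBDHV")


def invalid_fasta_chars(lines) -> set:
    """Return the set of non-IUPAC characters found in FASTA sequence lines."""
    bad = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        bad |= set(line.upper()) - IUPAC_NUCLEOTIDES
    return bad
```

An empty result means the file passes; a non-empty result lists exactly the characters to investigate, which distinguishes real corruption from Pilon's expected ambiguity codes.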
- Enhanced Support for Diverse Genomes: Optimize pipeline parameters for non-fungal genomes to improve versatility.
- Improved Error Handling: Develop robust error detection and user-friendly feedback.
- Integration with Additional Sequencing Platforms:
- Support for ONT-only and ONT + Reference input modes.
- FASTA/FASTQ Compression at Handoff Points: Automatically compress intermediate read files with pigz at appropriate handoff points between steps, with transparent decompression for tools that require uncompressed input.
- Automated HTML Report Improvements: Expand the HTML report to include decontamination statistics, file management savings, and TUI-style step timing.
This pipeline was adapted from the following two pipelines:
Bollinger IM, Singer H, Jacobs J, Tyler M, Scott K, Pauli CS, Miller DR, Barlow C, Rockefeller A, Slot JC, Angel-Mosti V. High-quality draft genomes of ecologically and geographically diverse Psilocybe species. Microbiol Resour Announc 0:e00250-24; doi: 10.1128/mra.00250-24
Muñoz-Barrera A, Rubio-Rodríguez LA, Jáspez D, Corrales A, Marcelino-Rodriguez I, Lorenzo-Salazar JM, González-Montelongo R, Flores C. Benchmarking of bioinformatics tools for the hybrid de novo assembly of human whole-genome sequencing data. bioRxiv 2024.05.28.595812; doi: 10.1101/2024.05.28.595812
The example data are published in:
Bollinger IM, Singer H, Jacobs J, Tyler M, Scott K, Pauli CS, Miller DR, Barlow C, Rockefeller A, Slot JC, Angel-Mosti V. High-quality draft genomes of ecologically and geographically diverse Psilocybe species. Microbiol Resour Announc 0:e00250-24; doi: 10.1128/mra.00250-24
McKernan K, Kane L, Helbert Y, Zhang L, Houde N, McLaughlin S. A whole genome atlas of 81 Psilocybe genomes as a resource for psilocybin production. F1000Research 2021, 10:961; doi: 10.12688/f1000research.55301.2
Ruiz‐Dueñas FJ, Barrasa JM, Sánchez‐García M, Camarero S, Miyauchi S, Serrano A, Linde D, Babiker R, Drula E, Ayuso‐Fernández I, Pacheco R, Padilla G, Ferreira P, Barriuso J, Kellner H, Castanera R, Alfaro M, Ramírez L, Pisabarro AG, Riley R, Kuo A, Andreopoulos W, LaButti K, Pangilinan J, Tritt A, Lipzen A, He G, Yan M, Ng V, Grigoriev IV, Cullen D, Martin F, Rosso M, Henrissat B, Hibbett D, Martínez AT. Genomic Analysis Enlightens Agaricales Lifestyle Evolution and Increasing Peroxidase Diversity. Molecular Biology and Evolution 38(4): 1428-1446 (2020); doi: 10.1093/molbev/msaa301
Floudas D, Bentzer J, Ahrén D, Johansson T, Persson P, Tunlid A. Uncovering the hidden diversity of litter-decomposition mechanisms in mushroom-forming fungi. ISME J 14, 2046–2059 (2020); doi: 10.1038/s41396-020-0667-6
Maintenance release on top of the v3.4.0 connectivity refactor. Focused on unblocking real end-to-end runs.
- Fixes
  - End-to-end pipeline run unblocked:
    - zipfile handling and log_print are now safe across working-directory changes.
    - ORGANISM_KARYOTE / ORGANISM_KINGDOM comparisons no longer crash when the value is pd.NA.
    - preprocess_ont and preprocess_pacbio pull sample_stats_dict out of SampleContext so per-sample stats survive the refactor.
    - Stale bin/TruSeq3-PE.fa removed; the adapter is now copied from resources/ as a fallback.
- Changes
  - validate_fasta accepts the full IUPAC nucleotide alphabet (see FAQ).
  - fs.py and sh.py renamed to descriptive two-word names to match the rest of bin/.
- Housekeeping
  - Added .gitignore; dropped a stale stats CSV that test runs were committing.
  - Untracked the committed bin/__pycache__/*.pyc files.
  - README voice pass: dropped em-dashes and tightened phrasing.
Connectivity refactor plus a new decontamination stack.
- Major additions
  - New Kraken2 read decontamination stage (decontaminate_reads.py).
  - Kingdom-aware Tiara assembly decontamination (decontaminate_assembly.py).
  - Per-sample logging and a new TUI flag.
  - Pipeline-flow SVG diagram added to the docs.
- Refactors
  - Connectivity refactor: utilities split, SampleContext adopted across stages, duplicated logic removed.
  - Logging lifted into its own bin/log.py module.
  - EGAP_TUI.py runs from bin/ without PYTHONPATH tweaks and stays in sync with EGAP.py's process list.
- Fixes / polish
  - Trimmomatic adapter path lookup, null-string CSV bug, literal 'None' in REF_SEQ/GCA, and hard-FAIL on missing tools.
  - run_subprocess_cmd catches FileNotFoundError / PermissionError.
  - Kraken2 and Tiara settings shown in the pipeline settings display.
  - Suppressed the noisy non-fatal numpy API mismatch in the TUI log.
- Infra
  - Container builds, README, and bioconda recipe updated for the new tooling.
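As an illustration of the subprocess fix listed under "Fixes / polish", the sketch below shows one way a command runner can turn a missing or non-executable tool into a reported failure instead of an unhandled exception. The function `run_cmd_sketch` and its return codes (127 and 126, borrowed from shell conventions) are assumptions for this example and do not describe the real `run_subprocess_cmd`.

```python
import subprocess
import sys


def run_cmd_sketch(cmd_list):
    """Run a command; return its exit code, or a nonzero code if it cannot be started."""
    try:
        completed = subprocess.run(cmd_list, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        print(f"ERROR: executable not found: {cmd_list[0]}")
        return 127
    except PermissionError:
        # The file exists but is not executable by the current user.
        print(f"ERROR: permission denied: {cmd_list[0]}")
        return 126
    return completed.returncode


# A command that exists returns its own exit code.
print(run_cmd_sketch([sys.executable, "-c", "pass"]))
# A tool that is not installed is reported rather than raising.
print(run_cmd_sketch(["egap-hypothetical-missing-tool"]))
```

Returning a code rather than raising lets the caller decide whether a missing tool should stop the run or only skip a stage.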
- Pipeline-wide logging system: Every run now writes a timestamped log file to <output_dir>/<output_dir_name>_log.txt. All status, command, warning, pass, skip, and error messages from every pipeline stage are captured there in addition to being printed to the terminal with color coding.
  - utilities.py: Added module-level DEFAULT_LOG_FILE and ENVIRONMENT_TYPE variables; run_subprocess_cmd() now routes all output through log_print().
  - EGAP.py: Calls initialize_logging_environment(output_dir) on startup; all operational progress messages routed through log_print().
  - All sub-pipeline scripts (preprocess_*, assemble_*, compare_assemblies, polish_assembly, curate_assembly, qc_assessment, html_reporter, process_metadata, final_compress): each initializes the logging environment independently on startup (required since they run as separate subprocesses) and routes all operational messages through log_print().
- Added process_metadata.py for SRA and assembly metadata TSV generation.
- html_reporter.py: Improved template handling and robustness to missing QC artifacts.
- compare_assemblies.py and qc_assessment.py: Stability and path-handling fixes.
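The logging behavior described in the pipeline-wide logging entry above can be sketched as follows. This is a simplified illustration, not the code in EGAP: `log_path_for` and `log_print_sketch` are hypothetical stand-ins for `initialize_logging_environment()` and `log_print()`, the per-line timestamp format is an assumption, and terminal color coding is omitted. Only the log path pattern `<output_dir>/<output_dir_name>_log.txt` comes from the entry itself.

```python
import os
import tempfile
from datetime import datetime


def log_path_for(output_dir):
    """Create output_dir if needed and return <output_dir>/<output_dir_name>_log.txt."""
    os.makedirs(output_dir, exist_ok=True)
    dir_name = os.path.basename(os.path.normpath(output_dir))
    return os.path.join(output_dir, f"{dir_name}_log.txt")


def log_print_sketch(message, log_file):
    """Print a timestamped message and append the same line to the log file."""
    stamped = f"[{datetime.now():%Y-%m-%d %H:%M:%S}] {message}"
    print(stamped)
    with open(log_file, "a") as handle:
        handle.write(stamped + "\n")


# Each stage can derive the same log path from output_dir and append to it,
# which is why separately launched sub-pipeline scripts end up in one file.
with tempfile.TemporaryDirectory() as tmp:
    output_dir = os.path.join(tmp, "example_run")
    log_file = log_path_for(output_dir)
    log_print_sketch("PASS: preprocessing finished", log_file)
    log_print_sketch("SKIP: no Kraken2 database configured", log_file)
    print(os.path.basename(log_file))  # example_run_log.txt
```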
If you would like to contribute to the EGAP Pipeline, please submit a pull request or open an issue on GitHub. For major changes, please discuss via an issue first.
This project is licensed under the BSD 3-Clause License.







