HostBuster

Human DNA Decontamination Pipeline for Illumina Metagenomic Data

HostBuster is a fast, efficient Python pipeline for removing human DNA contamination from Illumina paired-end metagenomic sequencing data. It produces three specialized output formats optimized for different downstream analyses.


🚀 Features

  • Conda-Based Installation - One environment.yml installs all dependencies
  • Automatic Reference Management - Auto-downloads T2T-CHM13v2.0 on first run
  • Custom Index Support - Build and use custom reference genomes
  • Three Specialized Outputs
    • Assembly-ready paired-end reads (conservative filtering)
    • Profiling-optimized single-end reads (aggressive filtering)
    • GDPR-compliant publication-ready reads (maximum decontamination)
  • Dual-Pass Filtering
    • Primary: minimap2 (conservative, maintains pairs)
    • Secondary: Bowtie2 (aggressive, removes borderline sequences)
  • Comprehensive Quality Control
    • fastp for adapter trimming and quality filtering
    • BBDuk for complexity and entropy filtering
    • MultiQC for aggregated reporting
  • Production Ready
    • Pure Python implementation
    • Self-contained conda package
    • Complete logging and statistics
    • Fast processing (~10 minutes for 1.8M read pairs on 8 threads)

📋 Requirements

System Requirements

  • OS: Linux (Ubuntu 20.04+) or macOS
  • RAM: 12GB minimum for index building; 8GB+ for pipeline runs, though the test run reported under Performance peaked at ~11GB during minimap2 alignment
  • Storage: 50-100GB free space
  • CPU: Minimum 4 cores, Recommended 8+ cores
  • Python: 3.9+
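
The requirements above can be checked before a first run. The following is a minimal sketch, not part of HostBuster: the function names are hypothetical, the thresholds are copied from the list above, and the RAM detection relies on `os.sysconf`, which is available on Linux and usually on macOS.

```python
import os
import shutil

# Thresholds taken from the requirements list (assumed lower bounds).
MIN_RAM_GB = 12    # index building
MIN_DISK_GB = 50   # lower end of the 50-100GB range
MIN_CORES = 4


def check_requirements(ram_gb, free_disk_gb, cores):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM {ram_gb:.1f}GB is below {MIN_RAM_GB}GB")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"free disk {free_disk_gb:.1f}GB is below {MIN_DISK_GB}GB")
    if cores < MIN_CORES:
        problems.append(f"{cores} CPU cores is below {MIN_CORES}")
    return problems


def detect_system(path="."):
    """Measure total RAM, free disk space at `path`, and CPU count."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    free_bytes = shutil.disk_usage(path).free
    return ram_bytes / 1e9, free_bytes / 1e9, os.cpu_count() or 1


if __name__ == "__main__":
    for problem in check_requirements(*detect_system()):
        print("WARNING:", problem)
```

Keeping the comparison in `check_requirements` separate from the measurement makes the thresholds easy to adjust for a custom, smaller reference.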

🔧 Installation

Method 1: From Source (Recommended)

# Clone repository
git clone https://github.com/iowa69/HostBuster.git
cd HostBuster

# Create and activate conda environment
conda env create -f environment.yml
conda activate hostbuster

# Install package
pip install -e .

# Verify installation
hostbuster --help

Method 2: Conda (Coming Soon)

# Will be available after bioconda submission
conda create -n hostbuster
conda activate hostbuster
conda install -c bioconda hostbuster

📖 Quick Start

1. First Run (Auto-Downloads Reference)

On first run, HostBuster automatically downloads the T2T-CHM13v2.0 reference and builds the minimap2 and Bowtie2 indices from it:

hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -t 8

The first run will:

  • Auto-download the T2T-CHM13v2.0 reference (~930MB compressed, 3.0GB uncompressed)
  • Build the minimap2 index (~2.2 minutes, 8.4GB output, 11GB RAM peak)
  • Build the Bowtie2 index (~54 minutes; about 16GB on disk in total once all indices are built)
  • Store the indices inside the conda environment ($CONDA_PREFIX/share/hostbuster/databases/standard/)
  • Run the complete pipeline

Note: Index building requires ~12GB RAM. If WSL2 is limited to 8GB, increase its memory allocation first. See the Troubleshooting section.

2. Subsequent Runs

After the first run, the indices are already in place and the pipeline starts immediately:

hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -t 8
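
Once the indices exist, several samples can be processed in a loop. The sketch below is an illustration only: the helper names are hypothetical, the `_R1.fastq.gz` / `_R2.fastq.gz` naming is an assumption about your files, and the command uses only the flags documented in this README. Each sample gets its own output directory, which is a conservative choice rather than a documented requirement.

```python
import subprocess
from pathlib import Path

R1_SUFFIX = "_R1.fastq.gz"  # assumed naming convention
R2_SUFFIX = "_R2.fastq.gz"


def find_pairs(fastq_dir):
    """Map sample name -> (R1 path, R2 path) for every complete pair."""
    pairs = {}
    for r1 in sorted(Path(fastq_dir).glob("*" + R1_SUFFIX)):
        sample = r1.name[: -len(R1_SUFFIX)]
        r2 = r1.with_name(sample + R2_SUFFIX)
        if r2.exists():
            pairs[sample] = (r1, r2)
    return pairs


def build_command(r1, r2, sample, out_dir, threads=8):
    """Assemble one hostbuster call as an argument list."""
    return ["hostbuster", "-1", str(r1), "-2", str(r2),
            "-n", sample, "-o", str(out_dir), "-t", str(threads)]


def run_all(fastq_dir, out_root, threads=8, dry_run=True):
    """Print (dry run) or execute one hostbuster call per sample."""
    commands = [
        build_command(r1, r2, sample, Path(out_root) / sample, threads)
        for sample, (r1, r2) in find_pairs(fastq_dir).items()
    ]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands
```

Calling `run_all("fastq/", "results/")` prints the commands without running them; pass `dry_run=False` to execute.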

3. Using Custom Reference

Build a custom index:

hostbuster --build --ix my_custom_ref --ref my_reference.fasta -t 8

Use custom index:

hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -i my_custom_ref -t 8

4. List Available Indices

hostbuster --lx

🎛️ Command Line Options

Pipeline Run Mode

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n sample -o output/ [OPTIONS]

Required:

  • -1, --input-r1 - Input R1 FASTQ file
  • -2, --input-r2 - Input R2 FASTQ file
  • -n, --sample-name - Sample name for outputs
  • -o, --output-dir - Output directory

Optional - Index Selection:

  • -i, --index - Index to use (default: standard)

Optional - Quality Control:

  • --tail - fastp tail quality (default: 20)
  • --p - fastp phred quality (default: 20)
  • --l - Minimum read length (default: 50)
  • --c - fastp complexity (default: 30)
  • --bbe - BBDuk entropy for profiling (default: 0.7)
  • --bbeg - BBDuk entropy for GDPR (default: 0.85)
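
To give an intuition for what an entropy threshold does, the sketch below computes a normalized Shannon entropy over the k-mers of a single read. This is a simplified illustration and not BBDuk's implementation (BBDuk works on sliding windows with its own normalization), so the values are not directly comparable to `--bbe` or `--bbeg`. The function name and the choice of k are assumptions.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=5):
    """Normalized Shannon entropy (0 to 1) of the k-mers in `seq`."""
    n = len(seq) - k + 1  # number of k-mers in the read
    if n <= 1:
        return 0.0
    counts = Counter(seq[i:i + k] for i in range(n))
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Upper bound: every k-mer distinct, capped by the 4**k possible k-mers.
    h_max = math.log2(min(n, 4 ** k))
    return h / h_max


def passes_entropy(seq, threshold, k=5):
    """Keep a read only if its entropy reaches the threshold."""
    return kmer_entropy(seq, k) >= threshold
```

A homopolymer such as `"A" * 60` scores 0, and a dinucleotide repeat such as `"AC" * 30` scores low because only two distinct 5-mers occur. Raising the threshold therefore removes more repetitive, low-complexity reads.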

Optional - Performance:

  • -t, --threads - CPU threads (default: all cores)
  • --keep-intermediates - Keep intermediate files, including BAM files (removed by default)
  • -v, --verbose - Verbose logging

Index Management Mode

Build custom index:

hostbuster --build --ix index_name --ref reference.fasta -t 8

List available indices:

hostbuster --lx

📊 Output Files

Main Outputs (in cleaned/)

  1. {sample}_ASSEMBLY_R1.fastq.gz / {sample}_ASSEMBLY_R2.fastq.gz

    • Use case: Meta-assembly, genome binning
    • Filtering: Conservative (preserves read pairs)
    • Method: minimap2 alignment only
  2. {sample}_PROFILING.fastq.gz

    • Use case: Taxonomic profiling (Kraken2, MetaPhlAn)
    • Filtering: Aggressive dual-pass
    • Quality: Configurable entropy (default 0.7)
  3. {sample}_GDPR.fastq.gz

    • Use case: Public data release (SRA, ENA)
    • Filtering: Maximum decontamination
    • Quality: High entropy (default 0.85)
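
A quick sanity check on the three outputs is to count their reads and compare them with the pipeline summary. The sketch below assumes standard four-line FASTQ records; the function names are hypothetical and not part of HostBuster.

```python
import gzip


def count_records(lines):
    """Count FASTQ records in an iterable of lines (four lines per record)."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return count_records(handle)
```

For example, `count_fastq_reads("results/cleaned/my_sample_GDPR.fastq.gz")` should not exceed the count for the PROFILING file, since the GDPR output is filtered further.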

Quality Reports (in qc/ and stats/)

  • {sample}_fastp.html - fastp quality report
  • {sample}_fastp.json - fastp JSON statistics
  • {sample}_multiqc.html - MultiQC aggregated report
  • {sample}_stats.json - Detailed pipeline statistics

🔍 Pipeline Steps

INPUT: Paired-end FASTQ files (R1 + R2)
   ↓
[Step 0] Input Validation
   ↓
[Step 1] Quality Control & Adapter Trimming (fastp)
   ↓
[Step 2] Primary Host Removal (minimap2)
   ↓
   ├──→ OUTPUT 1: Assembly-ready PE reads
   ↓
[Step 3] Convert to Single-End
   ↓
[Step 4] Complexity Filtering (BBDuk, entropy)
   ↓
[Step 5] Length Filtering (BBDuk)
   ↓
[Step 6] Secondary Host Removal (Bowtie2 aggressive)
   ↓
[Step 7] Post-Alignment Normalization (BBDuk)
   ↓
   ├──→ OUTPUT 2: Profiling-optimized SE reads
   ↓
[Step 8] GDPR Strict Filtering (BBDuk)
   ↓
   └──→ OUTPUT 3: GDPR-compliant reads
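
Step 3 is why read counts double between the paired-end and single-end stages. The sketch below shows one plausible way to flatten pairs into a single-end stream; it is an assumption inferred from the read counts in the summary (each pair becomes two reads), not HostBuster's actual code. Records are modeled as hypothetical `(name, sequence, quality)` tuples, and the `/1` and `/2` suffixes are an illustrative convention.

```python
def pairs_to_single_end(pairs):
    """Yield both mates of each pair as independent single-end records."""
    for r1, r2 in pairs:
        # Tag the mate of origin so read names stay unique after merging.
        yield (r1[0] + "/1", r1[1], r1[2])
        yield (r2[0] + "/2", r2[1], r2[2])


if __name__ == "__main__":
    demo = [(("read1", "ACGT", "IIII"), ("read1", "TTGA", "IIII"))]
    for record in pairs_to_single_end(demo):
        print(record)
```

After this step, filters act on individual reads, so one mate can be dropped while the other is kept. That is acceptable for profiling but not for assembly, which is why the assembly output branches off before it.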

📈 Performance

Test Dataset: SRR6062009 (1,863,630 read pairs, ~270MB compressed)

================================================================================
📊 PIPELINE SUMMARY
================================================================================

Step                           Input           Output         Filtered     Retention
-------------------------------------------------------------------------------------
0. Raw Input (pairs)          1,863,630                                    100.00%
1. Quality Control (pairs)    1,863,630      1,812,172         51,458     97.24%
2. Primary Removal (pairs)    1,812,172      1,811,850            322     99.98%
3. PE to SE Conversion        3,623,700      3,623,700              0    100.00%
4. Complexity Filter          3,623,700      3,623,657             43    100.00%
5. Length Filter              3,623,657      3,623,657              0    100.00%
6. Secondary Removal          3,623,657      3,623,586             71    100.00%
7. Normalization              3,623,586      3,623,586              0    100.00%
8. GDPR Filter                3,623,586      3,623,327            259     99.99%

Contamination Removed:
  Human read pairs:              36
  Human percentage:            0.00%

Final Outputs:
  OUTPUT 1 (Assembly):    1,811,850 pairs
  OUTPUT 2 (Profiling):   3,623,586 reads
  OUTPUT 3 (GDPR):        3,623,327 reads

Runtime: 9.7 minutes
================================================================================

Performance Metrics:

  • Total runtime: 9.7 minutes (582 seconds)
  • Throughput: ~192,000 read pairs/minute (~3,200 read pairs/sec), based on the raw input pair count
  • Memory usage: ~11GB peak during minimap2 alignment, ~8GB during BBDuk steps
  • Disk space: ~1.8GB of intermediate files (removed by default; pass --keep-intermediates to retain them)
  • CPU efficiency: ~99% parallelization during alignment steps (with 8 threads)
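
The retention and throughput figures can be reproduced from the summary table. The short script below copies the counts and runtime from the summary above and recomputes them; the variable and function names are illustrative.

```python
def retention(reads_in, reads_out):
    """Percentage of reads (or pairs) that survive a step."""
    return 100.0 * reads_out / reads_in


# (step, input, output) copied from the pipeline summary
STEPS = [
    ("Quality Control", 1_863_630, 1_812_172),
    ("Primary Removal", 1_812_172, 1_811_850),
    ("Complexity Filter", 3_623_700, 3_623_657),
    ("Secondary Removal", 3_623_657, 3_623_586),
    ("GDPR Filter", 3_623_586, 3_623_327),
]

for name, reads_in, reads_out in STEPS:
    filtered = reads_in - reads_out
    print(f"{name:<18} {filtered:>7,} filtered  {retention(reads_in, reads_out):6.2f}%")

RAW_PAIRS = 1_863_630
RUNTIME_MIN = 9.7
pairs_per_minute = RAW_PAIRS / RUNTIME_MIN
print(f"Throughput: {pairs_per_minute:,.0f} read pairs/minute")
```

Dividing the raw pair count by the runtime gives roughly 192,000, so the throughput figure counts read pairs; counting individual reads would double it.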

Step-by-Step Timing (from actual run):

  • Step 0: Input validation - 8 seconds
  • Step 1: fastp QC - 54 seconds
  • Step 2: minimap2 alignment - 189 seconds (~3 minutes)
  • Step 3: PE to SE conversion - 10 seconds
  • Step 4: Complexity filter - 8 seconds
  • Step 5: Length filter - 6 seconds
  • Step 6: Bowtie2 alignment - 44 seconds
  • Step 7: Normalization - 7 seconds
  • Step 8: GDPR filter - 8 seconds
  • Report generation - 7 seconds
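
The step timings can be summed to see where the time goes. The script below uses only the numbers listed above; the itemized steps add up to 341 seconds, so about 241 seconds of the 582-second total are not itemized in this list.

```python
# Seconds per step, copied from the timing list above.
STEP_SECONDS = {
    "Input validation": 8,
    "fastp QC": 54,
    "minimap2 alignment": 189,
    "PE to SE conversion": 10,
    "Complexity filter": 8,
    "Length filter": 6,
    "Bowtie2 alignment": 44,
    "Normalization": 7,
    "GDPR filter": 8,
    "Report generation": 7,
}
TOTAL_RUNTIME = 582  # seconds, from the performance metrics

itemized = sum(STEP_SECONDS.values())
# Print the slowest steps first, with their share of the total runtime.
for step, secs in sorted(STEP_SECONDS.items(), key=lambda kv: -kv[1]):
    print(f"{step:<22} {secs:>4}s  {100 * secs / TOTAL_RUNTIME:5.1f}% of total")
print(f"Itemized: {itemized}s, not itemized: {TOTAL_RUNTIME - itemized}s")
```

Among the itemized steps, the minimap2 alignment dominates, so thread count and memory matter most for that step.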

🎯 Use Cases & Parameter Tuning

Viral Metagenomics

For viral samples, use lower entropy thresholds:

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n viral_sample -o results/ \
    --bbe 0.5 --bbeg 0.6 -t 8

High-Quality Requirements

For very strict quality control:

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n strict_sample -o results/ \
    --p 30 --tail 30 --l 100 --bbe 0.8 --bbeg 0.9 -t 8

Low-Quality Samples

For degraded or low-quality samples:

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n lowqual_sample -o results/ \
    --p 15 --tail 15 --l 40 --c 20 -t 8
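
If these three profiles are used regularly, they can be kept as named presets in a wrapper script. The sketch below is a hypothetical helper, not a HostBuster feature; the flag values are copied from the examples above and the preset names are made up for illustration.

```python
# Flag values copied from the three examples above.
PRESETS = {
    "viral": {"--bbe": 0.5, "--bbeg": 0.6},
    "strict": {"--p": 30, "--tail": 30, "--l": 100, "--bbe": 0.8, "--bbeg": 0.9},
    "low_quality": {"--p": 15, "--tail": 15, "--l": 40, "--c": 20},
}


def preset_args(name):
    """Return the extra command-line arguments for a named preset."""
    args = []
    for flag, value in PRESETS[name].items():
        args += [flag, str(value)]
    return args


if __name__ == "__main__":
    base = ["hostbuster", "-1", "R1.fq.gz", "-2", "R2.fq.gz",
            "-n", "viral_sample", "-o", "results/", "-t", "8"]
    print(" ".join(base + preset_args("viral")))
```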

🛠️ Troubleshooting

Issue: "Standard index not found" and auto-download fails

For systems with 8GB RAM on WSL2:

The full T2T genome requires ~12GB RAM to build indices. Increase WSL memory allocation:

  1. Create .wslconfig file in Windows:

    • Location: C:\Users\YourUsername\.wslconfig
    • Content:
      [wsl2]
      memory=12GB
      processors=8
      swap=4GB
  2. Restart WSL (in Windows PowerShell):

    wsl --shutdown
  3. Verify in Ubuntu:

    free -h
    # Should show ~12GB total memory

Alternative: Build indices manually with the exact commands used in testing:

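# Create the index directory if it does not exist yet, then enter it
mkdir -p "$CONDA_PREFIX/share/hostbuster/databases/standard"
cd "$CONDA_PREFIX/share/hostbuster/databases/standard"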

# Download reference
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
gunzip GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
mv GCF_009914755.1_T2T-CHM13v2.0_genomic.fna human_T2T.fna

# Build minimap2 index (~2.2 minutes, 11GB peak RAM)
minimap2 -x sr -I 4G -d human.mmi human_T2T.fna

# Build bowtie2 index (~54 minutes total: 27min forward + 27min reverse)
bowtie2-build --threads 4 --offrate 4 human_T2T.fna human_bt2

# Verify indices
ls -lh
# Expected output:
# human.mmi          8.4G
# human_T2T.fna      3.0G
# human_bt2.*.bt2    ~5.5G total (6 files)

Index Build Times (verified on 8-core system with 12GB RAM):

  • minimap2: 2.2 minutes (11.2GB peak RAM)
  • bowtie2 forward: 27 minutes
  • bowtie2 reverse: 27 minutes
  • Total: ~56 minutes
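
After a manual build, it helps to confirm that every expected file is present. The sketch below checks for the minimap2 index and the six Bowtie2 index files using the file names from the commands above; the helper names are hypothetical, and the `.bt2` suffix list assumes a standard (non-large) Bowtie2 index.

```python
from pathlib import Path

# The six files written by bowtie2-build for a standard index.
BT2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]


def expected_index_files(mmi="human.mmi", bt2_prefix="human_bt2"):
    """List the file names the manual build above should produce."""
    return [mmi] + [bt2_prefix + suffix for suffix in BT2_SUFFIXES]


def missing_index_files(index_dir):
    """Return the expected index files that are absent from index_dir."""
    directory = Path(index_dir)
    return [name for name in expected_index_files()
            if not (directory / name).is_file()]
```

Pointing `missing_index_files` at `$CONDA_PREFIX/share/hostbuster/databases/standard` returns an empty list when the build finished completely.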

Issue: hostbuster --lx shows "No indices found" even after building

This can happen if indices are in the wrong directory. The indices must be in:

$CONDA_PREFIX/share/hostbuster/databases/standard/

Solution:

# Check current conda environment
echo $CONDA_PREFIX
# Should show: /home/username/miniconda3/envs/hostbuster

# Find where your indices are
find ~ -name "human_bt2.1.bt2" -type f -size +100M 2>/dev/null
find ~ -name "human.mmi" -type f -size +100M 2>/dev/null

# If they're in the wrong location, move them
mkdir -p $CONDA_PREFIX/share/hostbuster/databases/standard
mv /path/to/your/indices/* $CONDA_PREFIX/share/hostbuster/databases/standard/

# Clean up any temporary files
cd $CONDA_PREFIX/share/hostbuster/databases/standard/
rm -f *.tmp wget-log

# Verify
hostbuster --lx
# Should now show: "Available index databases: - standard"

Issue: hostbuster: command not found

Solution:

# Make sure conda environment is activated
conda activate hostbuster

# Reinstall if needed (run from the cloned HostBuster directory)
pip uninstall -y hostbuster
pip install -e .

Issue: Out of memory during pipeline run

Solutions:

  1. Reduce thread count: -t 2
  2. Close other applications
  3. Increase swap space
  4. For WSL2 users, increase memory allocation in .wslconfig

Issue: Very low retention after fastp

Solution: Adjust quality thresholds for your data:

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n sample -o results/ \
    --p 15 --tail 15 --l 40 --c 20

Issue: Pipeline hangs or is very slow

Check these:

  • Ensure sufficient CPU threads are available
  • Verify input files are not corrupted
  • Check disk space (need ~50-100GB free)
  • Monitor memory usage with htop or free -h
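
To rule out corrupted input, each gzipped FASTQ file can be decompressed end to end before running the pipeline. The sketch below is a generic integrity check, equivalent in spirit to `gzip -t`; the function name is hypothetical.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Decompress the whole file; return False on truncation or corruption."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # discard the data, we only care about errors
    except (OSError, EOFError, zlib.error):
        # OSError covers missing files and bad gzip headers,
        # EOFError covers truncated downloads.
        return False
    return True
```

A truncated download is a common cause of a run that stalls or fails partway, and this check catches it in seconds rather than minutes.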

🧪 Testing

Run an end-to-end test on a public dataset (SRR6062009):

# Download test dataset (~270MB)
mkdir -p ~/test_hostbuster
cd ~/test_hostbuster
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/009/SRR6062009/SRR6062009_1.fastq.gz -O test_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/009/SRR6062009/SRR6062009_2.fastq.gz -O test_R2.fastq.gz

# Run pipeline
conda activate hostbuster
hostbuster -1 test_R1.fastq.gz -2 test_R2.fastq.gz -n test_sample -o results/ -t 8 -v

# Check results
ls -lh results/cleaned/
python3 -m json.tool results/stats/test_sample_stats.json

Expected runtime: ~10 minutes (9.7 minutes observed)
Expected outputs: 3 cleaned FASTQ files + QC reports

Human read pairs removed: 36 (0.00%)
Final outputs: 1,811,850 pairs (ASSEMBLY), 3,623,586 reads (PROFILING), 3,623,327 reads (GDPR)

🀝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

📝 Citation

If you use HostBuster in your research, please cite:

[Citation information will be added upon publication]

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • T2T-CHM13 Consortium for the reference genome
  • Developers of minimap2, Bowtie2, fastp, BBTools, and samtools
  • CAMI Consortium for test datasets

📧 Contact


Made with ❤️ for the metagenomics community
