HostBuster

Human DNA Decontamination Pipeline for Illumina Metagenomic Data

HostBuster is a fast, efficient Python pipeline for removing human DNA contamination from Illumina paired-end metagenomic sequencing data. It produces three outputs, each filtered for a different downstream analysis: assembly, taxonomic profiling, and public data release.

🚀 Features

- One-Command Installation - Install via conda with all dependencies
- Automatic Reference Management - Auto-downloads T2T-CHM13v2.0 on first run
- Custom Index Support - Build and use custom reference genomes
- Three Specialized Outputs
  - Assembly-ready paired-end reads (conservative filtering)
  - Profiling-optimized single-end reads (aggressive filtering)
  - GDPR-compliant publication-ready reads (maximum decontamination)
- Dual-Pass Filtering
  - Primary: minimap2 (conservative, maintains pairs)
  - Secondary: Bowtie2 (aggressive, removes borderline sequences)
- Comprehensive Quality Control
  - fastp for adapter trimming and quality filtering
  - BBDuk for complexity and entropy filtering
  - MultiQC for aggregated reporting
- Production Ready
  - Pure Python implementation
  - Self-contained conda package
  - Complete logging and statistics
  - Fast processing (~10 minutes for 1.8M read pairs)

📋 Requirements

System Requirements

OS: Linux (Ubuntu 20.04+) or macOS
RAM: Minimum 12GB for index building; 8GB+ for pipeline runs (the benchmark in the Performance section peaked at ~11GB during minimap2 alignment)
Storage: 50-100GB free space
CPU: Minimum 4 cores, Recommended 8+ cores
Python: 3.9+

🔧 Installation

Method 1: From Source (Recommended)

# Clone repository
git clone https://github.com/iowa69/HostBuster.git
cd HostBuster

# Create and activate conda environment
conda env create -f environment.yml
conda activate hostbuster

# Install package
pip install -e .

# Verify installation
hostbuster --help

Method 2: Conda (Coming Soon)

# Will be available after bioconda submission
conda create -n hostbuster
conda activate hostbuster
conda install -c bioconda hostbuster

📖 Quick Start

1. First Run (Auto-Downloads Reference)

On first run, HostBuster automatically downloads and builds the T2T-CHM13v2.0 reference indices:

hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -t 8

The first run will:

Download the T2T-CHM13v2.0 reference (~930MB compressed, 3.0GB uncompressed)
Build the minimap2 index (~2.2 minutes, 8.4GB output, 11GB peak RAM)
Build the Bowtie2 index (~54 minutes, ~5.5GB output; roughly 16-17GB on disk together with the reference and the minimap2 index)
Store the indices inside the conda environment ($CONDA_PREFIX/share/hostbuster/databases/standard/)
Run the complete pipeline

Note: Index building requires ~12GB RAM. If WSL2 is limited to 8GB, increase its memory allocation as described in the Troubleshooting section.

2. Subsequent Runs

After the first run, the indices are already in place and the pipeline starts immediately:

hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -t 8

3. Using Custom Reference

Build a custom index:

hostbuster --build --ix my_custom_ref --ref my_reference.fasta -t 8

Use custom index:

hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -i my_custom_ref -t 8

4. List Available Indices

hostbuster --lx

🎛️ Command Line Options

Pipeline Run Mode

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n sample -o output/ [OPTIONS]

Required:

-1, --input-r1 - Input R1 FASTQ file
-2, --input-r2 - Input R2 FASTQ file
-n, --sample-name - Sample name for outputs
-o, --output-dir - Output directory

Optional - Index Selection:

-i, --index - Index to use (default: standard)

Optional - Quality Control:

--tail - fastp tail quality (default: 20)
--p - fastp phred quality (default: 20)
--l - Minimum read length (default: 50)
--c - fastp complexity (default: 30)
--bbe - BBDuk entropy for profiling (default: 0.7)
--bbeg - BBDuk entropy for GDPR (default: 0.85)

Optional - Performance:

-t, --threads - CPU threads (default: all cores)
--keep-intermediates - Keep BAM files
-v, --verbose - Verbose logging
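
For batch processing, the run-mode flags listed above can be assembled programmatically. The following is a minimal sketch, assuming `hostbuster` is on `PATH`. The helper name `build_hostbuster_command` is hypothetical and not part of HostBuster; it only uses flags documented in this section.

```python
import shlex
from typing import List, Optional


def build_hostbuster_command(
    r1: str,
    r2: str,
    sample: str,
    output_dir: str,
    threads: Optional[int] = None,
    index: Optional[str] = None,
    keep_intermediates: bool = False,
) -> List[str]:
    """Assemble a HostBuster run-mode command line (hypothetical helper)."""
    # Required arguments, in the order used throughout this README.
    cmd = ["hostbuster", "-1", r1, "-2", r2, "-n", sample, "-o", output_dir]
    # Optional arguments are only added when explicitly requested,
    # so HostBuster's own defaults apply otherwise.
    if index is not None:
        cmd += ["-i", index]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if keep_intermediates:
        cmd.append("--keep-intermediates")
    return cmd


cmd = build_hostbuster_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz", "my_sample", "results/", threads=8
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with unusual file names.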

Index Management Mode

Build custom index:

hostbuster --build --ix index_name --ref reference.fasta -t 8

List available indices:

hostbuster --lx

📊 Output Files

Main Outputs (in `cleaned/`)

- {sample}_ASSEMBLY_R1.fastq.gz / {sample}_ASSEMBLY_R2.fastq.gz
  - Use case: Meta-assembly, genome binning
  - Filtering: Conservative (preserves read pairs)
  - Method: minimap2 alignment only
- {sample}_PROFILING.fastq.gz
  - Use case: Taxonomic profiling (Kraken2, MetaPhlAn)
  - Filtering: Aggressive dual-pass
  - Quality: Configurable entropy (default 0.7)
- {sample}_GDPR.fastq.gz
  - Use case: Public data release (SRA, ENA)
  - Filtering: Maximum decontamination
  - Quality: High entropy (default 0.85)

Quality Reports (in `qc/` and `stats/`)

{sample}_fastp.html - fastp quality report
{sample}_fastp.json - fastp JSON statistics
{sample}_multiqc.html - MultiQC aggregated report
{sample}_stats.json - Detailed pipeline statistics
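
A workflow that wraps HostBuster may want to confirm that a run produced every expected file before moving on. The sketch below derives the paths from the naming scheme above; `expected_outputs` and `missing_outputs` are hypothetical helpers, and only the `cleaned/` files and the `stats/` JSON are checked because their locations are stated explicitly in this README.

```python
from pathlib import Path
from typing import List


def expected_outputs(output_dir: str, sample: str) -> List[Path]:
    """List the main output files HostBuster is documented to write for one sample."""
    out = Path(output_dir)
    cleaned = out / "cleaned"
    return [
        cleaned / f"{sample}_ASSEMBLY_R1.fastq.gz",
        cleaned / f"{sample}_ASSEMBLY_R2.fastq.gz",
        cleaned / f"{sample}_PROFILING.fastq.gz",
        cleaned / f"{sample}_GDPR.fastq.gz",
        out / "stats" / f"{sample}_stats.json",
    ]


def missing_outputs(output_dir: str, sample: str) -> List[Path]:
    """Return the expected files that do not exist yet."""
    return [path for path in expected_outputs(output_dir, sample) if not path.is_file()]


for path in missing_outputs("results", "my_sample"):
    print(f"missing: {path}")
```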

🔍 Pipeline Steps

INPUT: Paired-end FASTQ files (R1 + R2)
   ↓
[Step 0] Input Validation
   ↓
[Step 1] Quality Control & Adapter Trimming (fastp)
   ↓
[Step 2] Primary Host Removal (minimap2)
   ↓
   ├──→ OUTPUT 1: Assembly-ready PE reads
   ↓
[Step 3] Convert to Single-End
   ↓
[Step 4] Complexity Filtering (BBDuk, entropy)
   ↓
[Step 5] Length Filtering (BBDuk)
   ↓
[Step 6] Secondary Host Removal (Bowtie2 aggressive)
   ↓
[Step 7] Post-Alignment Normalization (BBDuk)
   ↓
   ├──→ OUTPUT 2: Profiling-optimized SE reads
   ↓
[Step 8] GDPR Strict Filtering (BBDuk)
   ↓
   └──→ OUTPUT 3: GDPR-compliant reads
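
Steps 4 and 8 discard low-complexity reads by comparing an entropy score against a threshold (`--bbe` and `--bbeg`). To give an intuition for what such a score measures, here is a simplified k-mer Shannon entropy normalized to the range 0 to 1. This is a sketch for illustration only: BBDuk's actual calculation (windowing, k-mer size, normalization) may differ, and `kmer_entropy` is a hypothetical helper.

```python
import math
from collections import Counter


def kmer_entropy(seq: str, k: int = 3) -> float:
    """Normalized Shannon entropy of the k-mers in seq (0 = repetitive, 1 = diverse)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) < 2:
        return 0.0
    counts = Counter(kmers)
    total = len(kmers)
    # Shannon entropy in bits: sum of p * log2(1 / p) over observed k-mers.
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    # Upper bound: every k-mer distinct, capped by the 4**k possible DNA k-mers.
    max_entropy = math.log2(min(total, 4 ** k))
    return entropy / max_entropy


print(kmer_entropy("A" * 50))  # homopolymer: 0.0
print(kmer_entropy("ACACACACACAC") < kmer_entropy("ACGTTGCAAGCT"))  # True
```

Under this simplified score, a dinucleotide repeat lands well below the default thresholds of 0.7 and 0.85, while a read with no repeated k-mers scores close to 1.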

📈 Performance

Test Dataset: SRR6062009 (1,863,630 read pairs, ~270MB compressed)

================================================================================
📊 PIPELINE SUMMARY
================================================================================

Step                           Input           Output         Filtered     Retention
-------------------------------------------------------------------------------------
0. Raw Input (pairs)          1,863,630                                    100.00%
1. Quality Control (pairs)    1,863,630      1,812,172         51,458     97.24%
2. Primary Removal (pairs)    1,812,172      1,811,850            322     99.98%
3. PE to SE Conversion        3,623,700      3,623,700              0    100.00%
4. Complexity Filter          3,623,700      3,623,657             43    100.00%
5. Length Filter              3,623,657      3,623,657              0    100.00%
6. Secondary Removal          3,623,657      3,623,586             71    100.00%
7. Normalization              3,623,586      3,623,586              0    100.00%
8. GDPR Filter                3,623,586      3,623,327            259     99.99%

Contamination Removed:
  Human read pairs:              36
  Human percentage:            0.00%

Final Outputs:
  OUTPUT 1 (Assembly):    1,811,850 pairs
  OUTPUT 2 (Profiling):   3,623,586 reads
  OUTPUT 3 (GDPR):        3,623,327 reads

Runtime: 9.7 minutes
================================================================================
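
The Retention column in the summary above is computed per step, relative to that step's own input rather than to the raw input. The short sketch below reproduces a few rows from the counts in the table; the variable and function names are illustrative only.

```python
# (step, input count, output count) copied from the summary table above
steps = [
    ("Quality Control", 1_863_630, 1_812_172),
    ("Primary Removal", 1_812_172, 1_811_850),
    ("Complexity Filter", 3_623_700, 3_623_657),
    ("Secondary Removal", 3_623_657, 3_623_586),
    ("GDPR Filter", 3_623_586, 3_623_327),
]


def retention(n_in: int, n_out: int) -> float:
    """Percentage of a step's input that survives the step, rounded to 2 decimals."""
    return round(100 * n_out / n_in, 2)


for name, n_in, n_out in steps:
    print(f"{name:<18} filtered={n_in - n_out:>6,}  retention={retention(n_in, n_out):.2f}%")

# Step 3 splits each surviving pair into two single-end reads.
assert 2 * 1_811_850 == 3_623_700
```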

Performance Metrics:

Total runtime: 9.7 minutes (582 seconds)
Throughput: ~192,000 read pairs/minute (~3,200 read pairs/sec)
Memory usage: ~11GB peak during minimap2 alignment, ~8GB during BBDuk steps
Disk space: ~1.8GB intermediate files (removed by default; pass --keep-intermediates to retain them)
CPU efficiency: ~99% parallelization during alignment steps (with 8 threads)

Step-by-Step Timing (from actual run):

Step 0: Input validation - 8 seconds
Step 1: fastp QC - 54 seconds
Step 2: minimap2 alignment - 189 seconds (~3 minutes)
Step 3: PE to SE conversion - 10 seconds
Step 4: Complexity filter - 8 seconds
Step 5: Length filter - 6 seconds
Step 6: Bowtie2 alignment - 44 seconds
Step 7: Normalization - 7 seconds
Step 8: GDPR filter - 8 seconds
Report generation - 7 seconds

🎯 Use Cases & Parameter Tuning

Viral Metagenomics

For viral samples, use lower entropy thresholds:

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n viral_sample -o results/ \
    --bbe 0.5 --bbeg 0.6 -t 8

High-Quality Requirements

For very strict quality control:

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n strict_sample -o results/ \
    --p 30 --tail 30 --l 100 --bbe 0.8 --bbeg 0.9 -t 8

Low-Quality Samples

For degraded or low-quality samples:

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n lowqual_sample -o results/ \
    --p 15 --tail 15 --l 40 --c 20 -t 8
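
When many samples share the same tuning, the three parameter sets above can be kept in one place. A minimal sketch, assuming the preset names `viral`, `strict`, and `low_quality` (chosen here for illustration, not defined by HostBuster); the flag values are copied from the commands in this section.

```python
from typing import Dict, List, Union

Number = Union[int, float]

# Flag values taken from the three example commands above.
PRESETS: Dict[str, Dict[str, Number]] = {
    "viral": {"--bbe": 0.5, "--bbeg": 0.6},
    "strict": {"--p": 30, "--tail": 30, "--l": 100, "--bbe": 0.8, "--bbeg": 0.9},
    "low_quality": {"--p": 15, "--tail": 15, "--l": 40, "--c": 20},
}


def preset_args(name: str) -> List[str]:
    """Turn a preset into a flat list of command-line arguments."""
    args: List[str] = []
    for flag, value in PRESETS[name].items():
        args += [flag, str(value)]
    return args


print(" ".join(preset_args("viral")))  # --bbe 0.5 --bbeg 0.6
```

The returned list can be appended to the required arguments (`-1`, `-2`, `-n`, `-o`) before launching the pipeline.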

🛠️ Troubleshooting

Issue: "Standard index not found" and auto-download fails

For systems with 8GB RAM on WSL2:

The full T2T genome requires ~12GB RAM to build indices. Increase WSL memory allocation:

Create .wslconfig file in Windows:
- Location: C:\Users\YourUsername\.wslconfig
- Content:
```
[wsl2]
memory=12GB
processors=8
swap=4GB
```
Restart WSL (in Windows PowerShell):
```
wsl --shutdown
```

Verify in Ubuntu:

free -h
# Should show ~12GB total memory
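
The same check can be scripted before starting a long index build. The sketch below parses `/proc/meminfo` (available on Linux and WSL2, not on macOS) and compares total memory with the 11.2GB minimap2 peak reported in this README. The function names are hypothetical, and treating the reported GB figure as GiB is a simplification.

```python
import re
from pathlib import Path

# Peak RAM reported for the minimap2 index build in this README.
INDEX_BUILD_PEAK_GIB = 11.2


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / 1024 ** 2


def can_build_indices(meminfo_text: str, required_gib: float = INDEX_BUILD_PEAK_GIB) -> bool:
    """True if total memory is at least the required amount."""
    return mem_total_gib(meminfo_text) >= required_gib


meminfo = Path("/proc/meminfo")  # absent on macOS, so the check is skipped there
if meminfo.exists():
    text = meminfo.read_text()
    print(f"Total memory: {mem_total_gib(text):.1f} GiB, "
          f"enough for indexing: {can_build_indices(text)}")
```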

Alternative: Build indices manually with the exact commands used in testing:

cd $CONDA_PREFIX/share/hostbuster/databases/standard

# Download reference
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
gunzip GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
mv GCF_009914755.1_T2T-CHM13v2.0_genomic.fna human_T2T.fna

# Build minimap2 index (~2.2 minutes, 11GB peak RAM)
minimap2 -x sr -I 4G -d human.mmi human_T2T.fna

# Build bowtie2 index (~54 minutes total: 27min forward + 27min reverse)
bowtie2-build --threads 4 --offrate 4 human_T2T.fna human_bt2

# Verify indices
ls -lh
# Expected output:
# human.mmi          8.4G
# human_T2T.fna      3.0G
# human_bt2.*.bt2    ~5.5G total (6 files)

Index Build Times (verified on 8-core system with 12GB RAM):

minimap2: 2.2 minutes (11.2GB peak RAM)
bowtie2 forward: 27 minutes
bowtie2 reverse: 27 minutes
Total: ~56 minutes

Issue: `hostbuster --lx` shows "No indices found" even after building

This can happen if indices are in the wrong directory. The indices must be in:

$CONDA_PREFIX/share/hostbuster/databases/standard/

Solution:

# Check current conda environment
echo $CONDA_PREFIX
# Should show: /home/username/miniconda3/envs/hostbuster

# Find where your indices are
find ~ -name "human_bt2.1.bt2" -type f -size +100M 2>/dev/null
find ~ -name "human.mmi" -type f -size +100M 2>/dev/null

# If they're in the wrong location, move them
mkdir -p $CONDA_PREFIX/share/hostbuster/databases/standard
mv /path/to/your/indices/* $CONDA_PREFIX/share/hostbuster/databases/standard/

# Clean up any temporary files
cd $CONDA_PREFIX/share/hostbuster/databases/standard/
rm -f *.tmp wget-log

# Verify
hostbuster --lx
# Should now show: "Available index databases: - standard"
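
To see exactly which files are absent from the database directory, a short script can help. This sketch assumes the file names from the manual build commands in this README (`human.mmi` and the `human_bt2` prefix) together with the six-file layout that `bowtie2-build` writes for a small index. Whether HostBuster checks for exactly these names is an assumption, and `missing_index_files` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import List

# Files written by bowtie2-build for a small (.bt2) index.
BOWTIE2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2"]


def missing_index_files(
    db_dir: Path, mm2_name: str = "human.mmi", bt2_prefix: str = "human_bt2"
) -> List[str]:
    """Return the names of expected index files that are not present in db_dir."""
    expected = [mm2_name] + [bt2_prefix + suffix for suffix in BOWTIE2_SUFFIXES]
    return [name for name in expected if not (db_dir / name).is_file()]


conda_prefix = os.environ.get("CONDA_PREFIX")
if conda_prefix:  # only meaningful inside an activated conda environment
    db_dir = Path(conda_prefix) / "share" / "hostbuster" / "databases" / "standard"
    missing = missing_index_files(db_dir)
    print(f"{db_dir}: missing {missing}" if missing else f"{db_dir}: all index files present")
```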

Issue: `hostbuster: command not found`

Solution:

# Make sure conda environment is activated
conda activate hostbuster

# Reinstall if needed
pip uninstall hostbuster
pip install -e .

Issue: Out of memory during pipeline run

Solutions:

Reduce thread count: -t 2
Close other applications
Increase swap space
For WSL2 users, increase memory allocation in .wslconfig

Issue: Very low retention after fastp

Solution: Adjust quality thresholds for your data:

hostbuster -1 R1.fq.gz -2 R2.fq.gz -n sample -o results/ \
    --p 15 --tail 15 --l 40 --c 20

Issue: Pipeline hangs or is very slow

Check these:

Ensure sufficient CPU threads are available
Verify input files are not corrupted
Check disk space (need ~50-100GB free)
Monitor memory usage with htop or free -h

🧪 Testing

Run the included test with sample data:

# Download test dataset (~270MB)
mkdir -p ~/test_hostbuster
cd ~/test_hostbuster
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/009/SRR6062009/SRR6062009_1.fastq.gz -O test_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/009/SRR6062009/SRR6062009_2.fastq.gz -O test_R2.fastq.gz

# Run pipeline
conda activate hostbuster
hostbuster -1 test_R1.fastq.gz -2 test_R2.fastq.gz -n test_sample -o results/ -t 8 -v

# Check results
ls -lh results/cleaned/
python3 -m json.tool results/stats/test_sample_stats.json

Expected runtime: ~10 minutes (9.7 minutes observed)
Expected outputs: 3 cleaned FASTQ files + QC reports

Detailed Test Results:

Step                           Input           Output         Filtered     Retention
-------------------------------------------------------------------------------------
0. Raw Input (pairs)          1,863,630                                    100.00%
1. Quality Control (pairs)    1,863,630      1,812,172         51,458     97.24%
2. Primary Removal (pairs)    1,812,172      1,811,850            322     99.98%
3. PE to SE Conversion        3,623,700      3,623,700              0    100.00%
4. Complexity Filter          3,623,700      3,623,657             43    100.00%
5. Length Filter              3,623,657      3,623,657              0    100.00%
6. Secondary Removal          3,623,657      3,623,586             71    100.00%
7. Normalization              3,623,586      3,623,586              0    100.00%
8. GDPR Filter                3,623,586      3,623,327            259     99.99%

Human read pairs removed: 36 (~0.002% of input pairs, shown as 0.00% in the summary)
Final outputs: 1,811,850 pairs (ASSEMBLY), 3,623,586 reads (PROFILING), 3,623,327 reads (GDPR)

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Commit your changes
Push to the branch
Open a Pull Request

📝 Citation

If you use HostBuster in your research, please cite:

[Citation information will be added upon publication]

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

T2T-CHM13 Consortium for the reference genome
Developers of minimap2, Bowtie2, fastp, BBTools, and samtools
CAMI Consortium for test datasets

📧 Contact

Issues: GitHub Issues (https://github.com/iowa69/HostBuster/issues)

Made with ❤️ for the metagenomics community
