Human DNA Decontamination Pipeline for Illumina Metagenomic Data
HostBuster is a fast, efficient Python pipeline for removing human DNA contamination from Illumina paired-end metagenomic sequencing data. It produces three specialized output formats optimized for different downstream analyses.
- One-Command Installation - Install via conda with all dependencies
- Automatic Reference Management - Auto-downloads T2T-CHM13v2.0 on first run
- Custom Index Support - Build and use custom reference genomes
- Three Specialized Outputs
  - Assembly-ready paired-end reads (conservative filtering)
  - Profiling-optimized single-end reads (aggressive filtering)
  - GDPR-compliant publication-ready reads (maximum decontamination)
- Dual-Pass Filtering
  - Primary: minimap2 (conservative, maintains pairs)
  - Secondary: Bowtie2 (aggressive, removes borderline sequences)
- Comprehensive Quality Control
  - fastp for adapter trimming and quality filtering
  - BBDuk for complexity and entropy filtering
  - MultiQC for aggregated reporting
- Production Ready
  - Pure Python implementation
  - Self-contained conda package
  - Complete logging and statistics
  - Fast processing (~10 minutes for 1.8M read pairs)
- OS: Linux (Ubuntu 20.04+) or macOS
- RAM: Minimum 12GB for index building, 8GB+ for pipeline runs
- Storage: 50-100GB free space
- CPU: Minimum 4 cores, Recommended 8+ cores
- Python: 3.9+
# Clone repository
git clone https://github.com/iowa69/HostBuster.git
cd HostBuster
# Create and activate conda environment
conda env create -f environment.yml
conda activate hostbuster
# Install package
pip install -e .
# Verify installation
hostbuster --help
# Will be available after bioconda submission
conda create -n hostbuster
conda activate hostbuster
conda install -c bioconda hostbuster
On first run, HostBuster automatically downloads and builds the T2T-CHM13v2.0 reference indices:
hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -t 8
First run will:
- Auto-download T2T-CHM13v2.0 reference (~930MB compressed, 3.0GB uncompressed)
- Build minimap2 index (~2.2 minutes, 8.4GB output, 11GB RAM peak)
- Build bowtie2 index (~54 minutes, 16GB total indices)
- Store indices in conda environment
- Run complete pipeline
Note: Index building requires ~12GB RAM. On WSL2 systems with only 8GB available, increase the memory allocation before the first run (see the Troubleshooting section).
After the first run, the indices are already built and the pipeline starts immediately:
hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -t 8
Build a custom index:
hostbuster --build --ix my_custom_ref --ref my_reference.fasta -t 8
Use custom index:
hostbuster -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -n my_sample -o results/ -i my_custom_ref -t 8
List available indices:
hostbuster --lx
hostbuster -1 R1.fq.gz -2 R2.fq.gz -n sample -o output/ [OPTIONS]
Required:
-1, --input-r1: Input R1 FASTQ file
-2, --input-r2: Input R2 FASTQ file
-n, --sample-name: Sample name for outputs
-o, --output-dir: Output directory
Optional - Index Selection:
-i, --index: Index to use (default: standard)
Optional - Quality Control:
--tail: fastp tail quality (default: 20)
--p: fastp phred quality (default: 20)
--l: Minimum read length (default: 50)
--c: fastp complexity (default: 30)
--bbe: BBDuk entropy for profiling (default: 0.7)
--bbeg: BBDuk entropy for GDPR (default: 0.85)
Optional - Performance:
-t, --threads: CPU threads (default: all cores)
--keep-intermediates: Keep BAM files
-v, --verbose: Verbose logging
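For runs over many samples, the documented flags can be assembled programmatically. The sketch below is one possible approach and is not part of HostBuster itself: `build_command` and `pair_samples` are hypothetical helper names, and the `_R1.fastq.gz` / `_R2.fastq.gz` suffixes are an assumed file naming scheme. Only flags listed above (-1, -2, -n, -o, -t, -i) are used.

```python
import shlex
from pathlib import Path


def build_command(r1, r2, sample, out_dir, threads=8, index=None):
    """Assemble a hostbuster argument list from the documented flags."""
    cmd = ["hostbuster", "-1", str(r1), "-2", str(r2),
           "-n", sample, "-o", str(out_dir), "-t", str(threads)]
    if index is not None:
        cmd += ["-i", index]  # custom index name; omitted means the default index
    return cmd


def pair_samples(fastq_dir, r1_suffix="_R1.fastq.gz", r2_suffix="_R2.fastq.gz"):
    """Yield (sample, r1, r2) for every R1 file that has a matching R2 file.

    The suffixes are an assumption about how the input files are named.
    """
    for r1 in sorted(Path(fastq_dir).glob(f"*{r1_suffix}")):
        sample = r1.name[: -len(r1_suffix)]
        r2 = r1.with_name(sample + r2_suffix)
        if r2.exists():
            yield sample, r1, r2


# Print the command for one sample; to execute it, pass the list to
# subprocess.run(cmd, check=True) instead of printing it.
cmd = build_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "my_sample", "results/")
print(shlex.join(cmd))
```

Looping over `pair_samples("raw_reads")` and calling `build_command` for each tuple gives one command per sample; keeping the command as a list avoids shell quoting problems with unusual file names.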
Build custom index:
hostbuster --build --ix index_name --ref reference.fasta -t 8
List available indices:
hostbuster --lx
- {sample}_ASSEMBLY_R1.fastq.gz / {sample}_ASSEMBLY_R2.fastq.gz
  - Use case: Meta-assembly, genome binning
  - Filtering: Conservative (preserves read pairs)
  - Method: minimap2 alignment only
- {sample}_PROFILING.fastq.gz
  - Use case: Taxonomic profiling (Kraken2, MetaPhlAn)
  - Filtering: Aggressive dual-pass
  - Quality: Configurable entropy (default 0.7)
- {sample}_GDPR.fastq.gz
  - Use case: Public data release (SRA, ENA)
  - Filtering: Maximum decontamination
  - Quality: High entropy (default 0.85)
- {sample}_fastp.html: fastp quality report
- {sample}_fastp.json: fastp JSON statistics
- {sample}_multiqc.html: MultiQC aggregated report
- {sample}_stats.json: Detailed pipeline statistics
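After a run it can be useful to confirm that all cleaned FASTQ files listed above were written. The following is a minimal sketch under stated assumptions: `expected_outputs` and `missing_outputs` are hypothetical helpers, and the directory holding the cleaned reads (for example `results/cleaned/`, as used in the test instructions) is passed in by the caller rather than assumed by the code.

```python
from pathlib import Path


def expected_outputs(sample):
    """File names of the cleaned FASTQ outputs documented for one sample."""
    return [
        f"{sample}_ASSEMBLY_R1.fastq.gz",
        f"{sample}_ASSEMBLY_R2.fastq.gz",
        f"{sample}_PROFILING.fastq.gz",
        f"{sample}_GDPR.fastq.gz",
    ]


def missing_outputs(cleaned_dir, sample):
    """Return the expected file names that are not present in cleaned_dir."""
    directory = Path(cleaned_dir)
    return [name for name in expected_outputs(sample)
            if not (directory / name).is_file()]


print(expected_outputs("my_sample"))
```

An empty list from `missing_outputs("results/cleaned", "my_sample")` means every documented output exists; it does not check that the files are non-empty or valid.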
INPUT: Paired-end FASTQ files (R1 + R2)
  ↓
[Step 0] Input Validation
  ↓
[Step 1] Quality Control & Adapter Trimming (fastp)
  ↓
[Step 2] Primary Host Removal (minimap2)
  ↓
  ├── OUTPUT 1: Assembly-ready PE reads
  ↓
[Step 3] Convert to Single-End
  ↓
[Step 4] Complexity Filtering (BBDuk, entropy)
  ↓
[Step 5] Length Filtering (BBDuk)
  ↓
[Step 6] Secondary Host Removal (Bowtie2 aggressive)
  ↓
[Step 7] Post-Alignment Normalization (BBDuk)
  ↓
  ├── OUTPUT 2: Profiling-optimized SE reads
  ↓
[Step 8] GDPR Strict Filtering (BBDuk)
  ↓
  └── OUTPUT 3: GDPR-compliant reads
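The complexity and GDPR steps filter reads by sequence entropy, with thresholds between 0 and 1. To give an intuition for what such a threshold measures, here is a simplified illustration of normalized k-mer Shannon entropy over a whole read. It is not BBDuk's implementation (BBDuk uses its own windowing and parameters), and `kmer_entropy` with `k=5` is an assumption made for this sketch.

```python
import math
from collections import Counter


def kmer_entropy(seq, k=5):
    """Normalized Shannon entropy (0..1) of the k-mers in seq.

    0 means a single repeated k-mer (low complexity); 1 means the k-mers
    are as evenly spread as the read length allows.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(kmers)
    if total < 2:
        return 0.0
    counts = Counter(kmers)
    # Shannon entropy in bits: sum over k-mers of p * log2(1 / p)
    h = sum((c / total) * math.log2(total / c) for c in counts.values())
    # Upper bound: every k-mer distinct, capped by the 4**k possible k-mers
    h_max = math.log2(min(total, 4 ** k))
    return h / h_max


for read in ("A" * 60, "ACGT" * 15):
    print(f"{read[:12]}...  entropy={kmer_entropy(read):.2f}")
```

A homopolymer scores 0 and a short tandem repeat scores low, so both would fall below a threshold such as 0.7. Genomes with repetitive or skewed composition score lower on average, which is the reason given for relaxing the thresholds on some sample types.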
Test Dataset: SRR6062009 (1,863,630 read pairs, ~270MB compressed)
================================================================================
PIPELINE SUMMARY
================================================================================
Step                         Input        Output       Filtered   Retention
-------------------------------------------------------------------------------------
0. Raw Input (pairs)         1,863,630                            100.00%
1. Quality Control (pairs)   1,863,630    1,812,172    51,458     97.24%
2. Primary Removal (pairs)   1,812,172    1,811,850    322        99.98%
3. PE to SE Conversion       3,623,700    3,623,700    0          100.00%
4. Complexity Filter         3,623,700    3,623,657    43         100.00%
5. Length Filter             3,623,657    3,623,657    0          100.00%
6. Secondary Removal         3,623,657    3,623,586    71         100.00%
7. Normalization             3,623,586    3,623,586    0          100.00%
8. GDPR Filter               3,623,586    3,623,327    259        99.99%
Contamination Removed:
  Human read pairs: 36
  Human percentage: 0.00%
Final Outputs:
  OUTPUT 1 (Assembly): 1,811,850 pairs
  OUTPUT 2 (Profiling): 3,623,586 reads
  OUTPUT 3 (GDPR): 3,623,327 reads
Runtime: 9.7 minutes
================================================================================
Performance Metrics:
- Total runtime: 9.7 minutes (582 seconds)
- Throughput: ~192,000 read pairs/minute (~3,200 read pairs/sec)
- Memory usage: ~11GB peak during minimap2 alignment, ~8GB during BBDuk steps
- Disk space: ~1.8GB intermediate files (removed by default; pass --keep-intermediates to retain them)
- CPU efficiency: ~99% parallelization during alignment steps (with 8 threads)
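The retention and throughput figures reported for the test run follow from simple arithmetic on the read counts, which the short sketch below reproduces. The counts and the 9.7-minute runtime are copied from the summary; `retention_pct` is a hypothetical helper name used only for this illustration.

```python
def retention_pct(n_in, n_out):
    """Percentage of reads (or pairs) that survive a step."""
    return 100.0 * n_out / n_in


# (step, input, output) copied from the pipeline summary
steps = [
    ("1. Quality Control (pairs)", 1_863_630, 1_812_172),
    ("2. Primary Removal (pairs)", 1_812_172, 1_811_850),
    ("8. GDPR Filter", 3_623_586, 3_623_327),
]
for name, n_in, n_out in steps:
    filtered = n_in - n_out
    print(f"{name:<28} filtered={filtered:>7,} retention={retention_pct(n_in, n_out):.2f}%")

# Throughput: raw input pairs divided by the total runtime in minutes
pairs_per_minute = 1_863_630 / 9.7
print(f"throughput: {pairs_per_minute:,.0f} read pairs/minute")
```

The same function can be applied to the counts in `{sample}_stats.json` once its field names are known; they are not documented here, so the example uses literal values.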
Step-by-Step Timing (from actual run):
- Step 0: Input validation - 8 seconds
- Step 1: fastp QC - 54 seconds
- Step 2: minimap2 alignment - 189 seconds (~3 minutes)
- Step 3: PE to SE conversion - 10 seconds
- Step 4: Complexity filter - 8 seconds
- Step 5: Length filter - 6 seconds
- Step 6: Bowtie2 alignment - 44 seconds
- Step 7: Normalization - 7 seconds
- Step 8: GDPR filter - 8 seconds
- Report generation - 7 seconds
For viral samples, use lower entropy thresholds:
hostbuster -1 R1.fq.gz -2 R2.fq.gz -n viral_sample -o results/ \
  --bbe 0.5 --bbeg 0.6 -t 8
For very strict quality control:
hostbuster -1 R1.fq.gz -2 R2.fq.gz -n strict_sample -o results/ \
  --p 30 --tail 30 --l 100 --bbe 0.8 --bbeg 0.9 -t 8
For degraded or low-quality samples:
hostbuster -1 R1.fq.gz -2 R2.fq.gz -n lowqual_sample -o results/ \
  --p 15 --tail 15 --l 40 --c 20 -t 8
For systems with 8GB RAM on WSL2:
The full T2T genome requires ~12GB RAM to build indices. Increase WSL memory allocation:
- Create a .wslconfig file in Windows:
  - Location: C:\Users\YourUsername\.wslconfig
  - Content:
    [wsl2]
    memory=12GB
    processors=8
    swap=4GB
- Restart WSL (in Windows PowerShell):
  wsl --shutdown
- Verify in Ubuntu:
  free -h # Should show ~12GB total memory
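The memory check can also be scripted before starting an index build. This is a rough sketch, assuming a Linux or WSL2 system that exposes `/proc/meminfo`; `mem_total_gb` and `enough_for_index_build` are hypothetical helpers, and the 0.5 GB tolerance is an assumption to allow for the kernel reporting slightly less than the configured amount.

```python
from pathlib import Path


def mem_total_gb(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted from kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / 1024 ** 2
    raise ValueError("MemTotal line not found")


def enough_for_index_build(meminfo_text, required_gb=12.0, tolerance_gb=0.5):
    """True if total memory is within tolerance of the ~12GB needed for indexing."""
    return mem_total_gb(meminfo_text) >= required_gb - tolerance_gb


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():  # not available on macOS
        text = meminfo.read_text()
        print(f"MemTotal: {mem_total_gb(text):.1f} GiB, "
              f"enough for index build: {enough_for_index_build(text)}")
```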
Alternative: Build indices manually with the exact commands used in testing:
cd $CONDA_PREFIX/share/hostbuster/databases/standard
# Download reference
wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/914/755/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
gunzip GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
mv GCF_009914755.1_T2T-CHM13v2.0_genomic.fna human_T2T.fna
# Build minimap2 index (~2.2 minutes, 11GB peak RAM)
minimap2 -x sr -I 4G -d human.mmi human_T2T.fna
# Build bowtie2 index (~54 minutes total: 27min forward + 27min reverse)
bowtie2-build --threads 4 --offrate 4 human_T2T.fna human_bt2
# Verify indices
ls -lh
# Expected output:
# human.mmi 8.4G
# human_T2T.fna 3.0G
# human_bt2.*.bt2 ~5.5G total (6 files)
Index Build Times (verified on 8-core system with 12GB RAM):
- minimap2: 2.2 minutes (11.2GB peak RAM)
- bowtie2 forward: 27 minutes
- bowtie2 reverse: 27 minutes
- Total: ~56 minutes
This can happen if indices are in the wrong directory. The indices must be in:
$CONDA_PREFIX/share/hostbuster/databases/standard/
Solution:
# Check current conda environment
echo $CONDA_PREFIX
# Should show: /home/username/miniconda3/envs/hostbuster
# Find where your indices are
find ~ -name "human_bt2.1.bt2" -type f -size +100M 2>/dev/null
find ~ -name "human.mmi" -type f -size +100M 2>/dev/null
# If they're in the wrong location, move them
mkdir -p $CONDA_PREFIX/share/hostbuster/databases/standard
mv /path/to/your/indices/* $CONDA_PREFIX/share/hostbuster/databases/standard/
# Clean up any temporary files
cd $CONDA_PREFIX/share/hostbuster/databases/standard/
rm -f *.tmp wget-log
# Verify
hostbuster --lx
# Should now show: "Available index databases: - standard"
Solution:
# Make sure conda environment is activated
conda activate hostbuster
# Reinstall if needed
pip uninstall hostbuster
pip install -e .
Solutions:
- Reduce thread count: -t 2
- Close other applications
- Increase swap space
- For WSL2 users, increase memory allocation in .wslconfig
Solution: Adjust quality thresholds for your data:
hostbuster -1 R1.fq.gz -2 R2.fq.gz -n sample -o results/ \
  --p 15 --tail 15 --l 40 --c 20
Check these:
- Ensure sufficient CPU threads are available
- Verify input files are not corrupted
- Check disk space (need ~50-100GB free)
- Monitor memory usage with htop or free -h
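Two of the checks above, file integrity and free disk space, can be sketched in a few lines. This is an illustration only and does not describe how HostBuster's own input validation works: `count_fastq_records`, `check_pair_counts`, and `free_space_gb` are hypothetical helpers, and the code assumes standard four-line FASTQ records in gzip-compressed files.

```python
import gzip
import shutil


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record.

    Reading the whole file also surfaces gzip errors from truncated downloads.
    """
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_pair_counts(r1, r2):
    """Return the number of pairs, or raise if R1 and R2 record counts differ."""
    n1, n2 = count_fastq_records(r1), count_fastq_records(r2)
    if n1 != n2:
        raise ValueError(f"R1 has {n1} records but R2 has {n2}")
    return n1


def free_space_gb(path="."):
    """Free space in GB on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1e9
```

For example, `check_pair_counts("test_R1.fastq.gz", "test_R2.fastq.gz")` should return the same pair count for both files, and `free_space_gb("results")` can be compared against the 50-100GB guideline.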
Run the included test with sample data:
# Download test dataset (~270MB)
mkdir -p ~/test_hostbuster
cd ~/test_hostbuster
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/009/SRR6062009/SRR6062009_1.fastq.gz -O test_R1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/009/SRR6062009/SRR6062009_2.fastq.gz -O test_R2.fastq.gz
# Run pipeline
conda activate hostbuster
hostbuster -1 test_R1.fastq.gz -2 test_R2.fastq.gz -n test_sample -o results/ -t 8 -v
# Check results
ls -lh results/cleaned/
cat results/stats/test_sample_stats.json | python3 -m json.tool
Expected runtime: ~10 minutes (9.7 minutes observed)
Expected outputs: 3 cleaned FASTQ files + QC reports
Detailed Test Results:
Step                         Input        Output       Filtered   Retention
-------------------------------------------------------------------------------------
0. Raw Input (pairs)         1,863,630                            100.00%
1. Quality Control (pairs)   1,863,630    1,812,172    51,458     97.24%
2. Primary Removal (pairs)   1,812,172    1,811,850    322        99.98%
3. PE to SE Conversion       3,623,700    3,623,700    0          100.00%
4. Complexity Filter         3,623,700    3,623,657    43         100.00%
5. Length Filter             3,623,657    3,623,657    0          100.00%
6. Secondary Removal         3,623,657    3,623,586    71         100.00%
7. Normalization             3,623,586    3,623,586    0          100.00%
8. GDPR Filter               3,623,586    3,623,327    259        99.99%
Human read pairs removed: 36 (0.00%)
Final outputs: 1,811,850 pairs (ASSEMBLY), 3,623,586 reads (PROFILING), 3,623,327 reads (GDPR)
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
If you use HostBuster in your research, please cite:
[Citation information will be added upon publication]
This project is licensed under the MIT License - see the LICENSE file for details.
- T2T-CHM13 Consortium for the reference genome
- Developers of minimap2, Bowtie2, fastp, BBTools, and samtools
- CAMI Consortium for test datasets
- Issues: GitHub Issues
Made with ❤️ for the metagenomics community