LABhabitPred is a reproducible and scalable Snakemake pipeline for predicting environmental habitat preferences of 16S rRNA sequences, specifically targeting the Lactobacillaceae family.
The LABhabitPred pipeline is designed for researchers who need to infer the habitat preferences of Lactobacillaceae from 16S rRNA gene data. The pipeline:
- Runs BLASTn against a curated LAB-specific 16S reference database.
- Filters BLAST results by percent identity (≥97%) and alignment length (≥150 bp).
- Maps sequences to environmental metadata (isolation sources, taxonomy).
- Scores habitat preferences using category-specific weighted scoring adapted from the ProkAtlas method.
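The filtering step above can be illustrated with a minimal sketch. It assumes the BLAST results use the standard 12-column tabular layout (`-outfmt 6`), which this README does not confirm. The function name `filter_blast_hits` and the example rows are hypothetical and are not taken from the pipeline's own scripts.

```python
import csv
import io

# Assumed column layout: the standard 12 fields of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

MIN_IDENTITY = 97.0  # percent identity threshold stated in this README
MIN_LENGTH = 150     # alignment length threshold (bp) stated in this README


def filter_blast_hits(handle, min_identity=MIN_IDENTITY, min_length=MIN_LENGTH):
    """Return the hits (as dicts) that pass both thresholds."""
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    kept = []
    for row in reader:
        if float(row["pident"]) >= min_identity and int(row["length"]) >= min_length:
            kept.append(row)
    return kept


# Made-up example rows: only the first passes both thresholds.
example = io.StringIO(
    "q1\trefA\t99.2\t250\t2\t0\t1\t250\t1\t250\t1e-120\t450\n"
    "q1\trefB\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-100\t400\n"
    "q2\trefC\t98.0\t120\t2\t0\t1\t120\t1\t120\t1e-50\t210\n"
)
hits = filter_blast_hits(example)
print([h["sseqid"] for h in hits])
```

Both thresholds are inclusive, matching the "≥" in the list above; a hit with exactly 97% identity and 150 aligned bases would be kept.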
- Conda/Mamba: Miniconda is required. Mamba is recommended for faster dependency resolution.
- Git: To clone the repository.
- Clone the repository:
    git clone https://github.com/nanzhen102/LABhabitatPre.git
    cd LABhabitatPre
- Create Conda Environment:
  Install dependencies manually:
    conda create -n lab_habitat -c bioconda snakemake pandas numpy blast biopython
    conda activate lab_habitat
From the main project directory, execute the pipeline with:
    snakemake --cores 8
This command will:
- Process each sample from the data/ directory.
- Run BLAST and filtering steps.
- Map metadata and generate habitat profiles.
- Log execution in the logs/ directory.
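To make the per-sample processing concrete, here is a small sketch of how sample names could be derived from the FASTA files in data/ and mapped to the final habitat profile paths. It assumes that "sample" in the output filenames listed in this README is a placeholder for the FASTA file stem. The helper `expected_outputs` is hypothetical and is not part of the pipeline's Snakefile.

```python
import tempfile
from pathlib import Path


def expected_outputs(data_dir="data", results_dir="results"):
    """Map each sample name (FASTA file stem) to its assumed habitat profile path."""
    outputs = {}
    for fasta in sorted(Path(data_dir).glob("*.fasta")):
        sample = fasta.stem  # e.g. "sample1" for sample1.fasta
        outputs[sample] = (
            Path(results_dir) / "habitat_profiles" / f"{sample}_habitat_profile.csv"
        )
    return outputs


# Demonstration on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    for name in ("sample1.fasta", "sample2.fasta", "notes.txt"):
        (data / name).touch()
    demo = expected_outputs(data_dir=data)

for sample, path in demo.items():
    print(sample, "->", path.as_posix())
```

Files without the .fasta extension are ignored, so stray notes or index files in data/ would not be treated as samples under this assumption.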
Data Files:
- Place your query 16S rRNA FASTA sequences in the data/ directory.
- Filename format: sample1.fasta, sample2.fasta, etc.
Database Files:
- LAB-specific BLAST database files should be placed in 16S_database/.
- Ensure ssu_all_r220_Lactobacillaceae_deduplicated_1200bp_noN_matched_metadata.csv and source_class_weight.csv are available in this directory.
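The role of a per-category weight file can be illustrated with a simplified sketch. This is not the pipeline's actual formula, and not the ProkAtlas formula either: it assumes each filtered hit carries one isolation-source category, multiplies the hit count per category by that category's weight, and normalizes the result so the scores sum to 1. The function `habitat_profile`, the category names, and the weights are all hypothetical.

```python
from collections import Counter


def habitat_profile(hit_categories, category_weights, default_weight=1.0):
    """Weighted, normalized share of hits per habitat category (illustrative only)."""
    counts = Counter(hit_categories)
    # Weight each category's hit count; categories without a weight use default_weight.
    raw = {
        cat: n * category_weights.get(cat, default_weight)
        for cat, n in counts.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {cat: 0.0 for cat in raw}
    return {cat: score / total for cat, score in raw.items()}


# Made-up weights and hit categories for one query sequence.
weights = {"dairy": 0.5, "plant": 2.0}
profile = habitat_profile(["dairy", "dairy", "plant", "gut"], weights)
print(profile)
```

In this toy example the down-weighted category (dairy) ends up with a smaller share than its raw hit count would suggest, which is the general purpose of category-specific weights: compensating for categories that are over- or under-represented in the reference metadata.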
After pipeline completion, the results/ directory will contain:
- Raw BLAST results: results/blast_raw/sample_blast_results.tsv
- Filtered BLAST results: results/blast_filtered/sample_blast_results_filtered.tsv
- Metadata-mapped results: results/mapped_metadata/sample_blast_results_filtered_metadata.csv
- Habitat profiles: results/habitat_profiles/sample_habitat_profile.csv
Logs for each rule are stored in the logs/ directory.
- Conda environment: Use the --use-conda flag to activate Conda environments.
- Cores: --cores <N> specifies the number of cores.
- Snakemake
- Conda/Mamba
- BLAST+
- Python ≥3.8
- Pandas
- NumPy
- Biopython
Available in the data/ directory.