16SfastLAB is a reproducible and scalable pipeline for processing 16S rRNA gene amplicon sequencing data. It assigns taxonomic information at the genus level for the Lactobacillaceae family using Snakemake and Conda.
The 16SfastLAB Pipeline is designed for bioinformatics users who need to quickly understand the presence of lactic acid bacteria in paired-end 16S rRNA sequencing data. This pipeline:
- Merges paired-end FASTQ files using VSEARCH.
- Converts merged FASTQ files into FASTA format.
- Runs BLASTn against a custom 16S database.
- Extracts genus-level information from BLAST output.
- Filters and summarizes BLAST results to generate relative abundance frequency tables.
- Combines frequency data across samples into a single CSV, organized by the latest taxonomic order of Lactobacillaceae.
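The filtering and frequency steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code. It assumes BLAST tabular output (-outfmt 6, where the first three columns are query ID, subject ID, and percent identity), a hypothetical identity cutoff of 97%, and a hypothetical subject-ID format in which the genus is the text before the first underscore. The function name genus_frequencies is also made up for this example.

```python
from collections import Counter


def genus_frequencies(blast_lines, min_identity=97.0):
    """Return {genus: relative abundance} from BLAST outfmt 6 lines.

    Assumptions (not taken from the pipeline): the cutoff value, and that
    the genus is the part of the subject ID before the first underscore.
    """
    best_hit = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if identity < min_identity:
            continue  # drop low-identity hits
        # Keep only the first hit per read that passes the cutoff.
        best_hit.setdefault(query, subject.split("_")[0])

    counts = Counter(best_hit.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {genus: n / total for genus, n in counts.items()}


if __name__ == "__main__":
    example = [
        "read1\tLactobacillus_delbrueckii\t99.0",
        "read2\tLeuconostoc_mesenteroides\t98.5",
    ]
    for genus, freq in sorted(genus_frequencies(example).items()):
        print(genus, freq)
```

Counting one genus per read, rather than one per hit, keeps reads with many database matches from inflating a genus's share.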
- Conda/Mamba: Ensure you have Miniconda installed. (Using Mamba for faster environment resolution is recommended.)
- Git: To clone the repository.
- Clone the repository:
  git clone https://github.com/nanzhen102/16SfastLAB.git
  cd 16SfastLAB
- Create Conda Environments:
Please install the following manually before running the pipeline:
- vsearch (≥2.15)
- blastn
- Python 3 (≥3.8) with packages:
- biopython
- pandas
conda install bioconda::vsearch  # install vsearch (then blastn will be installed at the same time)
mamba install -c conda-forge -c bioconda biopython pandas  # install biopython and pandas
From the main project directory, execute the pipeline with:
snakemake --use-conda --cores 8
This command will:
- Build and activate the required Conda environments.
- Process each sample from the data/ directory.
- Generate intermediate and final outputs in the results/ directory.
- Log the execution of each rule in the logs/ directory.
Data Files
- The pipeline expects both paired-end and single-end FASTQ files to be stored in the data/ directory.
- File naming convention: ERRxxxxxx_1.fastq.gz and ERRxxxxxx_2.fastq.gz, or SRRxxxxxx_1.fastq.gz and SRRxxxxxx_2.fastq.gz, or ERRxxxxxx_1.fastq and ERRxxxxxx_2.fastq, or SRRxxxxxx_1.fastq and SRRxxxxxx_2.fastq, or ERRxxxxxx_trimmed.fastq.
- A configuration file config.yaml is used to specify directory paths, database locations, and tool parameters.
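As an illustration of the naming convention, the sketch below groups file names into paired-end and single-end samples. It is not taken from the pipeline. The function name classify_samples is hypothetical, the xxxxxx part of each name is assumed to be digits, and SRR-prefixed _trimmed.fastq files are accepted as an assumption (the convention above lists only the ERR form).

```python
import re

# Paired-end mates: <accession>_1.fastq[.gz] and <accession>_2.fastq[.gz]
PAIRED = re.compile(r"^((?:ERR|SRR)\d+)_([12])\.fastq(?:\.gz)?$")
# Single-end reads: <accession>_trimmed.fastq (SRR prefix is an assumption)
SINGLE = re.compile(r"^((?:ERR|SRR)\d+)_trimmed\.fastq$")


def classify_samples(filenames):
    """Return (paired_ids, single_ids) for the given FASTQ file names."""
    mates = {}
    single = set()
    for name in filenames:
        match = PAIRED.match(name)
        if match:
            # Record which mate (1 or 2) was seen for this accession.
            mates.setdefault(match.group(1), set()).add(match.group(2))
            continue
        match = SINGLE.match(name)
        if match:
            single.add(match.group(1))
    # A sample counts as paired-end only if both mates are present.
    paired = sorted(s for s, seen in mates.items() if seen == {"1", "2"})
    return paired, sorted(single)


if __name__ == "__main__":
    import os

    if os.path.isdir("data"):
        print(classify_samples(os.listdir("data")))
```

Requiring both mates before treating a sample as paired-end surfaces incomplete downloads early, before any merging is attempted.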
After running the pipeline, the results/ directory will contain:
- Merged FASTQ Files: e.g., xxxxxx_merged.fastq
- FASTA Files: e.g., xxxxxx_merged.fasta
- BLASTn Output Files: e.g., xxxxxx_blastn_ssu_r220_LAB.out
- Genus Match Files: e.g., xxxxxx_genus_match.csv
- Filtered Results: e.g., xxxxxx_filtered.csv
- Frequency Tables: e.g., xxxxxx_frequency.csv
- Combined Frequency Table: combined_genera_frequency.csv
Logs for each rule are stored in the logs/ directory.
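To show how per-sample frequency tables might be merged into a combined table, here is a minimal sketch using only the Python standard library. The layout (one row per genus, one column per sample, 0.0 for absent genera) and the names combine_frequencies, rows_to_csv, and genus_order are assumptions for illustration; the actual column layout of combined_genera_frequency.csv may differ.

```python
import csv
import io


def combine_frequencies(per_sample, genus_order=None):
    """Build table rows from {sample: {genus: frequency}}.

    genus_order is an optional, caller-supplied list that fixes the row
    order (for example, a taxonomic ordering); it is hypothetical here.
    """
    samples = sorted(per_sample)
    genera = {g for freqs in per_sample.values() for g in freqs}
    if genus_order is None:
        ordered = sorted(genera)
    else:
        # Listed genera first, in the given order; any others alphabetically.
        ordered = [g for g in genus_order if g in genera]
        ordered += sorted(genera - set(genus_order))

    rows = [["genus"] + samples]
    for genus in ordered:
        # A genus missing from a sample gets a frequency of 0.0.
        rows.append([genus] + [per_sample[s].get(genus, 0.0) for s in samples])
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    tables = {
        "SRR1": {"Lactobacillus": 0.75, "Leuconostoc": 0.25},
        "SRR2": {"Leuconostoc": 1.0},
    }
    print(rows_to_csv(combine_frequencies(tables)))
```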
- Conda Flag: --use-conda must be specified when running Snakemake to use the Conda environment files.
- Cores: --cores <N> specifies the number of cores to use.
- Config File: The config.yaml file controls input/output paths and parameters. Modify this file to adjust database paths, tool parameters, or directory settings.
- Snakemake
- Conda/Mamba
- BLAST+
- VSEARCH
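To download SRR files from the NCBI SRA database into the data/ directory:
fasterq-dump SRR24916211 --split-files -O data/
gzip data/SRR24916211*.fastq  # fasterq-dump writes uncompressed FASTQ; compress afterwards
Or, to download every accession listed in SraAccList.csv (skipping the header line):
tail -n +2 SraAccList.csv | while read -r srr; do
    fasterq-dump "$srr" --split-files -O data/
done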
