Notes on workflow for the metagenomics profiling

Working directory configuration

Let's first configure the working directory, for example:

├── config
├── DATA
├── SCRIPTS
├── DB
│   ├── db_mOTU
│   │   ├── db_mOTU_test
│   │   └── public_profiles
│   ├── human_CHM13
│   ├── human_GRCh38
│   └── db_MetaPhlan4
├── RESULTS
└── workflow

config: stores all configuration files needed for the pipeline to run
DATA: stores the raw input sequences (in fastq or fastq.gz)
DB: stores the pre-built reference databases needed by the pipeline
RESULTS: placeholder for the output files
SCRIPTS: stores the scripts needed for analysis
workflow: stores the Snakefile
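
The layout above can be created in one go. This is a minimal sketch, assuming you want the tree under a new folder; the root name metagenomics_workdir is a placeholder and not something the repository prescribes.

```shell
#!/bin/sh
set -eu

# Root of the working directory; "metagenomics_workdir" is a placeholder name.
root="${1:-metagenomics_workdir}"

# mkdir -p on each leaf directory also creates its parents (DB, DB/db_mOTU).
for dir in \
    config DATA SCRIPTS RESULTS workflow \
    DB/db_mOTU/db_mOTU_test DB/db_mOTU/public_profiles \
    DB/human_CHM13 DB/human_GRCh38 DB/db_MetaPhlan4
do
    mkdir -p "$root/$dir"
done

# List the directories that now exist.
find "$root" -type d | sort
```

Pass a different root as the first argument if you already have a project folder.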

Data sources

We will use the paired-end reads generated in the Lee et al., 2017 study: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0270-x#Sec2. The full set of samples is available here: https://www.ebi.ac.uk/ena/browser/view/PRJNA353655 (total size 45 GB)

For simplicity we will only use the before and after (4 weeks) FMT data from 2 participants.

Move to the desired directory and run this command to download the data:

sbatch ../SCRIPTS/download.sh

Nextflow

We will use the nf-core/taxprofiler pipeline: https://nf-co.re/taxprofiler/1.1.2

Set-up and submitting the job:

Prepare the databases needed and create the database sheet listing the tools you want to use
- for metaphlan4 on Puhti, follow the set-up here: https://docs.csc.fi/apps/metaphlan/
- for kraken2, you can download a database from https://benlangmead.github.io/aws-indexes/k2 (for example, to use the PlusPF database from 2022-12-09, run wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20221209.tar.gz and unpack the file in the directory of interest with tar -xvzf k2_pluspf_20221209.tar.gz)
- create the database sheet:
  - exam_config/database.csv
Prepare the samplesheet as requested by the workflow
- for the example data: exam_config/samplesheet.csv
Select the optional parameters (read more here: https://nf-co.re/taxprofiler/1.1.2/parameters)
- Read processing, activated with --perform_shortread_qc
  - let's use fastp with the tool's default adapters, unless your adapters are different
  - no need to merge pairs if using metaphlan4 or motus
  - save the resulting processed reads with --save_preprocessed_reads
  - also save the final reads from all processing steps (the ones sent to classification/profiling) in the results directory with --save_analysis_ready_fastqs
- Activate the complexity filtering step with bbduk: --perform_shortread_complexityfilter --shortread_complexityfilter_tool bbduk
- Activate host-read removal against the current CHM13 reference: --perform_shortread_hostremoval
  - specify the reference FASTA file downloaded from https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_009914755.4/ (providing the indexed file is optional): --hostremoval_reference /scratch/project_xxx/USER_WORKSPACES/xxx/exercise/nextflow_metagenome/DB/T2T-CHM13v2.0.zip If you downloaded it to a local directory, copy it to Puhti using scp
  - use --save_hostremoval_unmapped to be safe; it stores the reads that did not map to the host (typically these are the final processed reads)
- We can skip run merging for now
- Activate table generation: --run_profile_standardisation
- For the profile, use singularity with the option -profile singularity
- Tools to run: let's try metaphlan4 and motus, each enabled with its own --run_<tool> option
Be sure to update the pipeline regularly: nextflow pull nf-core/taxprofiler
Create the sbatch script to submit the job. For fewer than 100 samples, hyperqueue is not needed --> confirm again
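
To make the two input sheets from the steps above more concrete, here is a rough sketch that writes both files into a scratch folder. The header rows are the column names as I understand them from the taxprofiler 1.1.x documentation, so check them against the pipeline docs before use. The folder example_sheets, the sample names, the run labels, and every /path/to/... entry are placeholders, not values from this repository.

```shell
#!/bin/sh
set -eu

# Write into a scratch folder so existing files in exam_config are not overwritten.
mkdir -p example_sheets

# Database sheet: one row per profiling tool (paths are placeholders).
cat > example_sheets/database.csv <<'EOF'
tool,db_name,db_params,db_path
metaphlan,db_MetaPhlan4,,/path/to/DB/db_MetaPhlan4
motus,db_mOTU,,/path/to/DB/db_mOTU
EOF

# Samplesheet: one row per sequencing run (sample and file names are placeholders).
cat > example_sheets/samplesheet.csv <<'EOF'
sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta
participant1_before,run1,ILLUMINA,/path/to/DATA/run1_1.fastq.gz,/path/to/DATA/run1_2.fastq.gz,
participant1_after,run2,ILLUMINA,/path/to/DATA/run2_1.fastq.gz,/path/to/DATA/run2_2.fastq.gz,
EOF

# Show both sheets.
cat example_sheets/database.csv example_sheets/samplesheet.csv
```

The trailing comma on each samplesheet row leaves the fasta column empty, which is what paired-end fastq input needs.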

See example: https://yetulaxman.github.io/containers-workflows/hands-on/day4/nextflow-containers.html
In case we need hyperqueue: https://yetulaxman.github.io/containers-workflows/hands-on/day4/nf-core-hyperqueue.html

see example: SCRIPTS/tax_profiler.sh
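
For orientation, the options selected above could be combined roughly as follows. This is only a sketch and not the content of the script in SCRIPTS. The #SBATCH values are placeholders, --run_metaphlan and --run_motus are my assumption for the tool switches in release 1.1.2, and --shortread_qc_tool fastp is written out even though fastp may already be the default. The host reference path is copied from the notes above. By default the script only prints the command; set RUN=1 to launch it.

```shell
#!/bin/sh
#SBATCH --job-name=taxprofiler
#SBATCH --account=project_xxx
#SBATCH --partition=small
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
# All #SBATCH values above are placeholders; adjust them to your project.
set -eu

samplesheet="exam_config/samplesheet.csv"
databases="exam_config/database.csv"
outdir="RESULTS"
host_ref="/scratch/project_xxx/USER_WORKSPACES/xxx/exercise/nextflow_metagenome/DB/T2T-CHM13v2.0.zip"

# Assemble the command as positional parameters so each option stays one word.
set -- nextflow run nf-core/taxprofiler -r 1.1.2 \
    -profile singularity \
    --input "$samplesheet" \
    --databases "$databases" \
    --outdir "$outdir" \
    --perform_shortread_qc --shortread_qc_tool fastp \
    --save_preprocessed_reads --save_analysis_ready_fastqs \
    --perform_shortread_complexityfilter --shortread_complexityfilter_tool bbduk \
    --perform_shortread_hostremoval --hostremoval_reference "$host_ref" \
    --save_hostremoval_unmapped \
    --run_metaphlan --run_motus \
    --run_profile_standardisation

# Print the command unless RUN=1 is set in the environment.
if [ "${RUN:-0}" = "1" ]; then
    "$@"
else
    echo "$*"
fi
```

Printing first makes it easy to review the full option list before spending queue time on it.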

Submit the job using:

sbatch SCRIPTS/taxprofiler.sh

Additional concerns to address:

resource allocation on Puhti
customize the nf-core pipeline if an additional step is needed --> functional annotation using humann3

Snakemake

Due to difficulties in integrating other workflows into Nextflow, the additional custom steps will be executed via Snakemake: Greengenes2 annotation and functional annotation with HUMAnN 3

Greengenes2 for shotgun and amplicon data

adapted from @TuomasBorman

Setting and Running GG2 plugin

FOR SHOTGUN

For shotgun reads, we need the newest QIIME2 tools with the GG2 plugin and Woltka

Go to the directory where you want to save the config for installation (e.g. in project or home, not scratch)
Download and save the environment config file needed for installation: wget https://raw.githubusercontent.com/qiime2/distributions/dev/2023.9/shotgun/released/qiime2-shotgun-ubuntu-latest-conda.yml
Install QIIME2

module load tykky
mkdir qiime2-shotgun-2023.09
conda-containerize new --mamba --prefix qiime2-shotgun-2023.09/ qiime2-shotgun-ubuntu-latest-conda.yml

Create a file for plugin installation

Create a text file named post_install_plugins_shotgun.txt containing the following commands (you can use nano, vim, or another favorite text editor):

pip install q2-greengenes2
pip install woltka
conda install -c bioconda bowtie2
pip install https://github.com/knights-lab/SHOGUN/archive/master.zip
pip install https://github.com/qiime2/q2-shogun/archive/master.zip
qiime dev refresh-cache

Install plugins

conda-containerize update qiime2-shotgun-2023.09/ --post-install post_install_plugins_shotgun.txt

Add the software path so that the software is executable

export PATH="qiime2-shotgun-2023.09/bin:$PATH"

(preferably use the full path; the example above uses a path relative to the current directory)
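
A small sketch of the full-path variant, assuming the installation folder sits in the directory you are currently in; the variable name qiime_prefix is only for illustration.

```shell
# Absolute path to the Tykky installation, built from the current directory.
qiime_prefix="$(pwd)/qiime2-shotgun-2023.09"
export PATH="$qiime_prefix/bin:$PATH"

# Print the first PATH entry to confirm the new location comes first.
echo "${PATH%%:*}"
```

With an absolute path the tools stay reachable after you change directories, which matters when a batch job starts somewhere else.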

Download the WoL2 database

http://ftp.microbio.me/pub/wol2/
https://github.com/qiyunzhu/woltka/blob/master/doc/wol.md#the-wol-database

Let's put this in the DB directory. Remember to change the project number in the script to match yours.

cd ./DB
chmod +x ../SCRIPTS/download_wol2.sh
sbatch ../SCRIPTS/download_wol2.sh

Now create the Snakefile under the workflow directory. See the example Snakefile_gg2_shotgun in this repository, and adjust the path to the QIIME config, the project number, and the file locations to your needs. We will use the pre-processed reads from Nextflow stored in RESULTS/analysis_ready_fastqs. If you cannot find this folder, you may need to re-run taxprofiler with the --save_analysis_ready_fastqs option.
After modifying the Snakefile, prepare the bash script that executes Snakemake. See SCRIPTS/run_GG2shotgun_workflow.sh for an example, and modify the project number.
To execute, run:

chmod +x ./SCRIPTS/run_GG2shotgun_workflow.sh
./SCRIPTS/run_GG2shotgun_workflow.sh

IMPORTANT NOTES regarding the use of Snakemake on CSC: The QIIME2 here was installed with Tykky, and it is compatible with Snakemake version 7.17.1. Other versions seem to fail if you use a Tykky installation for the package. See more about running Snakemake on CSC: https://docs.csc.fi/support/tutorials/snakemake-puhti/
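
One way to guard against a version mismatch is to check the Snakemake version before launching. The following is a sketch under assumptions and not the content of run_GG2shotgun_workflow.sh: the --cores value is a placeholder, and the script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
set -eu

required_version="7.17.1"
snakefile="workflow/Snakefile_gg2_shotgun"

# Record the Snakemake version on PATH, or "none" if it is not installed.
if command -v snakemake >/dev/null 2>&1; then
    found_version="$(snakemake --version)"
else
    found_version="none"
fi

# Warn on stderr when the version differs from the one known to work.
if [ "$found_version" != "$required_version" ]; then
    echo "warning: expected snakemake $required_version, found $found_version" >&2
fi

# --cores 4 is an arbitrary placeholder.
set -- snakemake --snakefile "$snakefile" --cores 4

# Print the command unless RUN=1 is set in the environment.
if [ "${RUN:-0}" = "1" ]; then
    "$@"
else
    echo "$*"
fi
```

Keeping the version check in the run script also documents which version produced the results.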

HUMAnN for shotgun metagenomics

adapted from @TuomasBorman and @KatariinaParnanen

Create the Snakefile under the workflow directory. An example is given as Snakefile_humann, which you can create with a text editor (nano, vim, etc.). Remember to adjust the path, project allocation, and file locations.
Prepare the bash script that executes Snakemake. A bash script makes it easier to document the versions used and the commands executed. See SCRIPTS/run_humann_workflow.sh for an example, and modify the project number.
To execute, run:

chmod +x ./SCRIPTS/run_humann_workflow.sh
./SCRIPTS/run_humann_workflow.sh

IMPORTANT NOTES: Always remember to check the tool versions, especially when running in batch. When running more than 100 samples, it is more convenient to use the group function of Snakemake.
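
As a rough illustration of the group function, Snakemake can bundle several jobs of one rule into a single submitted job through --groups and --group-components. The rule name humann, the group label batch, and the numbers below are hypothetical and need to match your own Snakefile. Grouping only pays off together with your cluster submission options, which are left out here. The script prints the command unless RUN=1 is set.

```shell
#!/bin/sh
set -eu

# Hypothetical names: replace "humann" with the rule to be grouped.
rule_name="humann"
group_name="batch"
samples_per_job=10   # placeholder: how many per-sample jobs share one submission

set -- snakemake --snakefile workflow/Snakefile_humann \
    --jobs 20 \
    --groups "$rule_name=$group_name" \
    --group-components "$group_name=$samples_per_job"

# Print the command unless RUN=1 is set in the environment.
if [ "${RUN:-0}" = "1" ]; then
    "$@"
else
    echo "$*"
fi
```

With these placeholder values, ten per-sample jobs of the grouped rule would travel together as one submitted job, which keeps the number of queue entries manageable.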
