Let's first set up the working directory, for example:
├── config
├── DATA
├── SCRIPTS
├── DB
│   ├── db_mOTU
│   │   ├── db_mOTU_test
│   │   └── public_profiles
│   ├── human_CHM13
│   ├── human_GRCh38
│   └── db_MetaPhlan4
├── RESULTS
└── workflow
- `config`: stores all configuration files needed for the pipeline to run
- `DATA`: stores the raw input sequences (in fastq or fastq.gz)
- `DB`: stores the pre-built reference databases needed by the pipeline
- `RESULTS`: placeholder for the output files
- `SCRIPTS`: stores the scripts needed for analysis
- `workflow`: stores the Snakefile
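As a minimal sketch, the layout above could be created in one go as follows. The root directory name `nextflow_metagenome` and the `WORKDIR` variable are assumptions made for illustration; use whatever location suits your project.

```shell
# Hypothetical root for the layout; override WORKDIR to place it elsewhere.
WORKDIR="${WORKDIR:-$PWD/nextflow_metagenome}"

# Top-level directories from the tree above.
mkdir -p "$WORKDIR/config" "$WORKDIR/DATA" "$WORKDIR/SCRIPTS" \
         "$WORKDIR/RESULTS" "$WORKDIR/workflow"

# Reference database subdirectories under DB.
mkdir -p "$WORKDIR/DB/db_mOTU/db_mOTU_test" "$WORKDIR/DB/db_mOTU/public_profiles" \
         "$WORKDIR/DB/human_CHM13" "$WORKDIR/DB/human_GRCh38" \
         "$WORKDIR/DB/db_MetaPhlan4"
```

`mkdir -p` is safe to re-run: directories that already exist are left untouched.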
We will use the paired-end reads generated in the Lee et al., 2017 study: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0270-x#Sec2. The full set of samples is available here: https://www.ebi.ac.uk/ena/browser/view/PRJNA353655 (total size 45 GB)
For simplicity we will only use the before and after (4 weeks) FMT data from 2 participants.
Move to the desired directory and run this command to download the data:
sbatch ../SCRIPTS/download.sh
We will be using the taxprofiler: https://nf-co.re/taxprofiler/1.1.2
- Prepare the databases needed and create a database sheet listing every tool you want to use
  - For MetaPhlAn4 on Puhti, follow the set-up here: https://docs.csc.fi/apps/metaphlan/
  - For Kraken2, you can download a database from here: https://benlangmead.github.io/aws-indexes/k2
    For example, to use the PlusPF database from 2022-12-09, execute
    wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20221209.tar.gz
    and unpack the file in the directory of interest using
    tar -xvzf k2_pluspf_20221209.tar.gz
  - Create the database sheet:
    - exam_config/database.csv
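For orientation, a database sheet might be written as in the sketch below. The four column names and the tool identifiers (`kraken2`, `metaphlan`, `motus`) follow my understanding of the taxprofiler 1.1.x usage docs and should be checked against them; `DB_DIR` and the output file name `database_sketch.csv` are hypothetical, chosen so that the repository's `exam_config/database.csv` is not overwritten.

```shell
# Hypothetical location of the unpacked databases; adjust to your project.
DB_DIR="${DB_DIR:-/scratch/project_xxx/DB}"

# One row per database: tool, a free-form name, optional extra tool
# parameters (left empty here), and the path to the database.
cat > database_sketch.csv <<EOF
tool,db_name,db_params,db_path
kraken2,k2_pluspf_20221209,,${DB_DIR}/k2_pluspf_20221209
metaphlan,db_MetaPhlan4,,${DB_DIR}/db_MetaPhlan4
motus,db_mOTU,,${DB_DIR}/db_mOTU
EOF
```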
- Prepare the samplesheet as required by the workflow
  - for the example data: exam_config/samplesheet.csv
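A samplesheet for the four example samples (before and after FMT for two participants) could be generated roughly as below. The column header follows my understanding of the taxprofiler usage docs; the sample names, the run identifiers `RUN_A` to `RUN_D`, `DATA_DIR`, and the output name `samplesheet_sketch.csv` are all placeholders, not the real accessions of the study.

```shell
# Hypothetical directory holding the downloaded fastq.gz files.
DATA_DIR="${DATA_DIR:-/scratch/project_xxx/DATA}"
OUT="samplesheet_sketch.csv"

echo "sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta" > "$OUT"

# Each entry is sample:run; both parts are placeholders.
for entry in P1_before:RUN_A P1_after:RUN_B P2_before:RUN_C P2_after:RUN_D; do
  sample="${entry%%:*}"
  run="${entry##*:}"
  # Paired-end reads: fastq_1 and fastq_2 are filled, the fasta column stays empty.
  echo "${sample},${run},ILLUMINA,${DATA_DIR}/${run}_1.fastq.gz,${DATA_DIR}/${run}_2.fastq.gz," >> "$OUT"
done
```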
- Select the configuration of the optional parameters (read more here: https://nf-co.re/taxprofiler/1.1.2/parameters)
  - Read processing, activated with `--perform_shortread_qc`
    - Let's use fastp with the tool's default adapters, unless your adapters are different
    - No need to merge pairs if using MetaPhlAn4 or mOTUs
    - Save the resulting processed reads with `--save_preprocessed_reads`
    - Let's also save the final reads from all processing steps (the ones sent to classification/profiling) in the results directory with `--save_analysis_ready_fastqs`
  - Let's activate the complexity filtering step with bbduk using:
    --perform_shortread_complexityfilter --shortread_complexityfilter_tool bbduk
  - Activate the host-read removal, using the current CHM13 database, with `--perform_shortread_hostremoval`
    - Specify the FASTA file of the database, downloaded from here: https://www.ncbi.nlm.nih.gov/data-hub/genome/GCA_009914755.4/
      Providing the index file is optional.
      --hostremoval_reference /scratch/project_xxx/USER_WORKSPACES/xxx/exercise/nextflow_metagenome/DB/T2T-CHM13v2.0.zip
      If you downloaded it to your local machine, copy it to Puhti using `scp`
    - Let's use `--save_hostremoval_unmapped`, just to be safe, to store the reads that did not map to the host (typically these are the final processed reads)
  - We can skip the run merging for now
  - Let's activate the option for table generation: `--run_profile_standardisation`
  - For the profile, let's use Singularity with the option `-profile singularity`
  - Tools that we want to run: let's try MetaPhlAn4 and mOTUs, enabled through the corresponding `--run_<tool>` options (see the parameters page linked above)
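Putting the choices above together, the resulting call might look like the sketch below. It only assembles and prints the command so it can be reviewed before it goes into an sbatch script. The flags `--input`, `--databases`, `--outdir`, `--shortread_qc_tool`, `--run_metaphlan` and `--run_motus` are not named in this README; they reflect my reading of the taxprofiler 1.1.x parameter docs and should be confirmed there. The samplesheet, database sheet and output paths are placeholders.

```shell
# Placeholder paths; the host reference is the path used earlier in this README.
SAMPLESHEET="exam_config/samplesheet.csv"
DATABASES="exam_config/database.csv"
OUTDIR="RESULTS"
HOST_REF="/scratch/project_xxx/USER_WORKSPACES/xxx/exercise/nextflow_metagenome/DB/T2T-CHM13v2.0.zip"

# ASSUMPTION: --run_metaphlan and --run_motus are the flag names for the two
# profilers in this pipeline version; verify on the parameters page.
CMD="nextflow run nf-core/taxprofiler -r 1.1.2 -profile singularity \
--input $SAMPLESHEET --databases $DATABASES --outdir $OUTDIR \
--perform_shortread_qc --shortread_qc_tool fastp \
--save_preprocessed_reads --save_analysis_ready_fastqs \
--perform_shortread_complexityfilter --shortread_complexityfilter_tool bbduk \
--perform_shortread_hostremoval --hostremoval_reference $HOST_REF \
--save_hostremoval_unmapped \
--run_profile_standardisation \
--run_metaphlan --run_motus"

# Print the command for review instead of running it.
echo "$CMD"
```

Keeping the command in one variable makes it easy to paste into the sbatch script once the flags have been checked.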
- Be sure to update the pipeline regularly:
  nextflow pull nf-core/taxprofiler
- Let's create the sbatch script to submit the job. For fewer than 100 samples, HyperQueue should not be needed (to be confirmed).
  See example: https://yetulaxman.github.io/containers-workflows/hands-on/day4/nextflow-containers.html
  In case we need HyperQueue: https://yetulaxman.github.io/containers-workflows/hands-on/day4/nf-core-hyperqueue.html
see example: SCRIPTS/tax_profiler.sh
Submit the job using:
sbatch SCRIPTS/taxprofiler.sh
- resource allocation in Puhti
- customize the nf-core pipeline if an additional step is needed --> functional annotation using HUMAnN3
Due to difficulties in integrating other workflows into nextflow, the additional custom steps will be executed via snakemake: GreenGenes v2 annotation and functional annotation with HUMAnN3
adapted from @TuomasBorman
FOR SHOTGUN
For shotgun reads, we need the newest QIIME2 tools with the GG2 plugin and Woltka
- Go to the directory where you want to save the config for installation (e.g. in project or home, not scratch)
- Download and save the environment config file needed for installation:
  wget https://raw.githubusercontent.com/qiime2/distributions/dev/2023.9/shotgun/released/qiime2-shotgun-ubuntu-latest-conda.yml
- Install QIIME2
  module load tykky
  mkdir qiime2-shotgun-2023.09
  conda-containerize new --mamba --prefix qiime2-shotgun-2023.09/ qiime2-shotgun-ubuntu-latest-conda.yml
- Create a file for plugin installation
  Create a text file named `post_install_plugins_shotgun.txt` containing the following lines
  (you can use nano, vim, or another favorite text editor):
  pip install q2-greengenes2
  pip install woltka
  conda install -c bioconda bowtie2
  pip install https://github.com/knights-lab/SHOGUN/archive/master.zip
  pip install https://github.com/qiime2/q2-shogun/archive/master.zip
  qiime dev refresh-cache
- Install plugins
conda-containerize update qiime2-shotgun-2023.09/ --post-install post_install_plugins_shotgun.txt
- Add the software path so that the software is executable
export PATH="qiime2-shotgun-2023.09/bin:$PATH"
(preferably use the full path; the example above is relative to the current directory)
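One way to get a full path without typing it by hand is sketched below. It assumes the container directory `qiime2-shotgun-2023.09` sits in the current directory; `QIIME_DIR` is a hypothetical helper variable.

```shell
# $PWD is always absolute, so the PATH entry keeps working after a later cd.
QIIME_DIR="${QIIME_DIR:-$PWD/qiime2-shotgun-2023.09}"
export PATH="$QIIME_DIR/bin:$PATH"
```

Adding the same `export` line to the job scripts avoids depending on the directory from which the job is submitted.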
- Download the WoL2 database
http://ftp.microbio.me/pub/wol2/
https://github.com/qiyunzhu/woltka/blob/master/doc/wol.md#the-wol-database
Let's put this in the DB directory. Remember to change the project number in the script to match yours.
cd ./DB
chmod +x ../SCRIPTS/download_wol2.sh
sbatch ../SCRIPTS/download_wol2.sh
- Now create the Snakefile under the `workflow` directory. You can see the example `Snakefile_gg2_shotgun` in this repository; adjust the path to the qiime config, the project number, and the file locations according to your needs. We will use the pre-processed reads from nextflow that have been stored in `RESULTS/analysis_ready_fastqs`. If you cannot find this folder, you might need to re-run taxprofiler with the `--save_analysis_ready_fastqs` option.
- After modifying the Snakefile, let's prepare the bash script that executes snakemake. Please see `SCRIPTS/run_GG2shotgun_workflow.sh` for an example, and modify the project number.
- To execute, run:
chmod +x ./SCRIPTS/run_GG2shotgun_workflow.sh
./SCRIPTS/run_GG2shotgun_workflow.sh
IMPORTANT NOTES regarding the use of snakemake in CSC
The qiime2 used here was installed with tykky, so it is compatible with snakemake version 7.17.1. Other versions seem to fail if you use a tykky installation for the package. See more about running snakemake in CSC: https://docs.csc.fi/support/tutorials/snakemake-puhti/
adapted from @TuomasBorman and @KatariinaParnanen
- Create the Snakefile under the `workflow` directory. An example is given as `Snakefile_humann`, which you can create with a text editor (nano, vim, etc.). Remember to adjust the path, project allocation, and file locations.
- Let's prepare the bash script that executes snakemake. A bash script makes it easier to document the versions used and the commands executed. Please see `SCRIPTS/run_humann_workflow.sh` for an example, and modify the project number.
- To execute, run:
chmod +x ./SCRIPTS/run_humann_workflow.sh
./SCRIPTS/run_humann_workflow.sh
IMPORTANT NOTES
Always remember to check the tool versions, especially when running in batch. When running more than 100 samples, it is more convenient to use the group function of snakemake.
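As a rough sketch of the grouping idea, snakemake's `--groups` and `--group-components` options can bundle several jobs of one rule into a single cluster submission. The rule name `humann`, the group name `batch`, and the numbers below are hypothetical and must be matched to the actual Snakefile. Grouping only has an effect when jobs are submitted to a cluster, so the cluster-related options from the run script would be added alongside.

```shell
# ASSUMPTION: the Snakefile has a per-sample rule called "humann".
# --groups assigns that rule to the group "batch";
# --group-components lets one submitted job span up to 20 samples of that group.
SNAKEMAKE_ARGS="--snakefile workflow/Snakefile_humann --jobs 10 \
--groups humann=batch --group-components batch=20"

# Print the command for review instead of running it.
echo "snakemake $SNAKEMAKE_ARGS"
```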