Nextflow workflow: `FSP_RawReads_Processing`

Overview

This is a Nextflow workflow that quality-checks, cleans and merges Illumina reads, and creates k-mer profiles.

This workflow was developed for the Fungarium Sequencing Project (FSP) at the Royal Botanic Gardens, Kew. It may also be useful to other projects that assemble genomes from difficult samples with degraded DNA.

The workflow was designed to process hundreds of samples in parallel.

Input & Output

Inputs:

Raw sequence data from Illumina short-read sequencers

Outputs:

Clean, deduplicated reads, with reads shorter than 30 bp removed
QC reports for each sample, for both raw and clean reads
QC statistics for the whole batch, including data size, read number, read length, GC content, duplication level, etc.
K-mer distribution graphs for each sample
K-mer statistics for the whole batch, including unique k-mer number, total k-mer number, estimated genome size, peak coverage, etc.

Current outputs

After one run, outputs are organized under --OutDir as below (replace Batch_1 with your --Batch_ID):

${OutDir}/
├── 01_ReadQC_report/Batch_1/
│   ├── raw_reads_QC/
│   │   ├── <Sample_ID>/
│   │   └── fastQC_result.txt
│   └── after_fastp_QC/
│       ├── <Sample_ID>/
│       └── fastQC_result.txt
├── 02_Trimmed_reads/Batch_1/
│   ├── <Sample_ID>/
│   │   ├── <Sample_ID>_trimmed.R1.fq.gz
│   │   ├── <Sample_ID>_trimmed.R2.fq.gz
│   │   ├── <Sample_ID>_unmerged.R1.fq.gz
│   │   ├── <Sample_ID>_unmerged.R2.fq.gz
│   │   └── <Sample_ID>_merge.fq.gz
│   └── 00_statistics/
│       ├── <Sample_ID>_*.stats
│       └── z_states_for_spreadsheet/
│           ├── total_bp_merged.txt
│           ├── total_bp_trimmed.txt
│           └── Len_avg_merged.txt
└── 05_KmerAnalysis/Batch_1/
    ├── <Sample_ID>/
    │   ├── <Sample_ID>.reads.kmer_freq.hist
    │   ├── <Sample_ID>.kmer.log
    │   ├── peak_1/
    │   └── peak_2/
    └── statistics/
        ├── statistics_all.csv
        └── kmer_profile_statistics_automated.csv

fastQC_result.txt, z_states_for_spreadsheet/*.txt, and kmer_profile_statistics_automated.csv are the main batch-level export files.
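
To confirm that a run produced the batch-level export files listed above, you can test for each path. The sketch below assumes the directory layout shown in the tree; the function name `check_batch_outputs` is hypothetical and not part of the repository.

```shell
# Hypothetical helper (not shipped with this repository).
# Reports any batch-level export file that is missing or empty.
# Paths follow the output tree shown above.
check_batch_outputs() {
    local out_dir="$1" batch_id="$2"
    local rel missing=0
    for rel in \
        "01_ReadQC_report/${batch_id}/raw_reads_QC/fastQC_result.txt" \
        "01_ReadQC_report/${batch_id}/after_fastp_QC/fastQC_result.txt" \
        "02_Trimmed_reads/${batch_id}/00_statistics/z_states_for_spreadsheet/total_bp_merged.txt" \
        "02_Trimmed_reads/${batch_id}/00_statistics/z_states_for_spreadsheet/total_bp_trimmed.txt" \
        "02_Trimmed_reads/${batch_id}/00_statistics/z_states_for_spreadsheet/Len_avg_merged.txt" \
        "05_KmerAnalysis/${batch_id}/statistics/kmer_profile_statistics_automated.csv"
    do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${out_dir}/${rel}" ]; then
            echo "missing or empty: ${rel}" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every expected file was found
    [ "$missing" -eq 0 ]
}

# Example: check_batch_outputs /your/output/directory/Batch_1 Batch_1
```

The function returns a non-zero status if any file is absent, so it can gate a follow-up step such as copying the statistics into a spreadsheet.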

Usage

Clone the repo

cd /your_target_folder/
git clone https://github.com/Hazelhuangup/FSP_RawReads_Processing.git

Set up dependencies

Add /your_target_folder/FSP_RawReads_Processing/bin/ to your $PATH

echo 'export PATH="$PATH:/your_target_folder/FSP_RawReads_Processing/bin"' >> ~/.bashrc

Install Nextflow
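
After installation, it can help to check that the required tools are visible on your `$PATH` before launching a run. This is a minimal sketch; `require_tools` is a hypothetical helper name, and the list of tools to check (for example `nextflow`, `conda` and `paired_dir.sh` from `bin/`) is an assumption based on the commands used in this README.

```shell
# Hypothetical helper (not shipped with this repository).
# Prints every named tool that cannot be found on $PATH
# and returns non-zero if at least one is missing.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example: require_tools nextflow conda paired_dir.sh
```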

Prepare the required inputs

1. Your input data directory structure

The directory that contains your input samples (e.g. Batch_1) must be structured in the following way:

Batch_1/
├── Sample_001
│   ├── Sample_001_R1.fastq.gz
│   └── Sample_001_R2.fastq.gz
├── Sample_002
│   ├── Sample_002_R1.fastq.gz
│   └── Sample_002_R2.fastq.gz
├── Sample_003
│   ├── Sample_003_R1.fastq.gz
│   └── Sample_003_R2.fastq.gz
└── sample.list

2. Prepare the file sample.list

The file sample.list lists the sample IDs, one per line, matching the sample folder names.

Sample_001
Sample_002
Sample_003
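
Before starting a run, you may want to verify that every sample named in sample.list has both read files in place. The sketch below assumes the `<Sample_ID>/<Sample_ID>_R1.fastq.gz` and `_R2.fastq.gz` naming shown in the directory structure above; `check_samples` is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper (not shipped with this repository).
# Usage: check_samples BATCH_DIR [SAMPLE_LIST]
# SAMPLE_LIST defaults to BATCH_DIR/sample.list.
check_samples() {
    local batch_dir="$1"
    local list="${2:-$1/sample.list}"
    local sample mate missing=0
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -n "$sample" ] || continue   # skip blank lines
        for mate in R1 R2; do
            if [ ! -s "${batch_dir}/${sample}/${sample}_${mate}.fastq.gz" ]; then
                echo "missing or empty: ${sample}/${sample}_${mate}.fastq.gz" >&2
                missing=$((missing + 1))
            fi
        done
    done < "$list"
    # Succeed only if no file was missing
    [ "$missing" -eq 0 ]
}

# Example: check_samples /your/input/directory/Batch_1
```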

If all the files you received are in a single folder, you can use the following script to structure the folder and create the sample.list file. Adapt the batch ID in the script before running it. Note that the folder names can be adjusted at line 12 of the script.

bin/paired_dir.sh
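
For orientation, the restructuring step can be sketched as follows. This is an illustration only and not the contents of bin/paired_dir.sh: it assumes the flat folder holds files named `<Sample_ID>_R1.fastq.gz` and `<Sample_ID>_R2.fastq.gz`, and the function name `structure_batch` is hypothetical.

```shell
# Illustrative sketch, NOT the repository's bin/paired_dir.sh.
# Moves each R1/R2 pair in a flat folder into its own <Sample_ID>/
# subfolder and writes the sample IDs to sample.list.
# Note: an existing sample.list in the folder is overwritten.
structure_batch() {
    local batch_dir="$1"
    local r1 sample
    : > "${batch_dir}/sample.list"
    for r1 in "${batch_dir}"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                    # no matching files
        sample="$(basename "$r1" _R1.fastq.gz)"     # strip the suffix
        if [ ! -e "${batch_dir}/${sample}_R2.fastq.gz" ]; then
            echo "skipping ${sample}: no R2 file" >&2
            continue
        fi
        mkdir -p "${batch_dir}/${sample}"
        mv "${batch_dir}/${sample}_R1.fastq.gz" \
           "${batch_dir}/${sample}_R2.fastq.gz" \
           "${batch_dir}/${sample}/"
        echo "$sample" >> "${batch_dir}/sample.list"
    done
}

# Example: structure_batch /your/input/directory/Batch_1
```

Samples without a matching R2 file are left untouched and reported, so unpaired files are not added to sample.list.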

3. Set up nextflow.config

Set the path where you'd like to cache the conda environment. If you leave it blank, it defaults to ./work/conda.

vim nextflow.config
cacheDir = './your-desired-path/'

Running the workflow

1. Run with command line

nextflow run ReadQC.nf -profile conda -resume \
    --InDir /your/input/directory/Batch_1 \
    --OutDir /your/output/directory/Batch_1 \
    --Li /your/absolute/directory/to/sample.list \
    --Batch_ID Batch_1

2. Run with submission script

This repo contains an example script, ReadQC.sh, for running the pipeline on a SLURM-managed HPC. Feel free to copy it and replace the directories in ReadQC.sh with your own. Make sure to set up your sbatch script according to your system settings. FalcoQC, fqStat and k-mer statistics compilation are integrated into the Nextflow workflow, so no additional post-processing shell scripts are required after the pipeline completes. Submit the script with:

sbatch --export=BATCH_ID=Batch_1 ReadQC.sh

3. Monitor the progress:

# check whether your recent runs were successful
nextflow log
# monitor the current run
cd /your_current_job_running_folder/
tail -f QC_MAIN.log

Authors

Wu Huang
- Royal Botanic Gardens, Kew
- ORCID profile

The other members of the FSP bioinformatics team, Lia Obinu and Niall Garvey, as well as George Mears, also contributed to the development and testing of this part of the workflow. This is a stand-alone pipeline for read processing. The full pipeline, including read processing, genome assembly and assessment, and decontamination, can be found at [fspassemblypipeline](https://github.com/RBGKew/fspassemblypipeline/tree/main).

Citation

Please cite the URL or DOI (10.5281/zenodo.17608339) if you use this workflow in a paper.

References

Di Tommaso P, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017). https://doi.org/10.1038/nbt.3820
Chen S, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17), i884–i890 (2018). https://doi.org/10.1093/bioinformatics/bty560
de Sena Brandine G and Smith AD. Falco: high-speed FastQC emulation for quality control of sequencing data. F1000Research 8, 1874 (2021). https://doi.org/10.12688/f1000research.21142.2
