Hazelhuangup/FSP_RawReads_Processing

Nextflow workflow: FSP_RawReads_Processing

Bioinformatics team

Overview

This is a Nextflow workflow that quality-checks, cleans, and merges Illumina reads, and creates k-mer profiles.

This workflow was developed for the Fungarium Sequencing Project (FSP) at Royal Botanic Gardens, Kew. It may be useful to other projects that deal with difficult samples with degraded DNA for genome assembly purposes.

The workflow was designed to process hundreds of samples in parallel.

Input & Output

Inputs:

Raw paired-end sequence data (R1/R2 FASTQ files) from Illumina short-read sequencers

Outputs:

  • Clean, deduplicated reads, with reads shorter than 30 bp removed
  • QC reports for each sample, for both raw and clean reads
  • QC statistics for the whole batch, including data size, read number, read length, GC content, duplication level, etc.
  • K-mer distribution graphs for each sample
  • K-mer statistics for the whole batch, including unique k-mer number, total k-mer number, estimated genome size, peak coverage, etc.

Current outputs

After one run, outputs are organized under --OutDir as below (replace Batch_1 with your --Batch_ID):

${OutDir}/
├── 01_ReadQC_report/Batch_1/
│   ├── raw_reads_QC/
│   │   ├── <Sample_ID>/
│   │   └── fastQC_result.txt
│   └── after_fastp_QC/
│       ├── <Sample_ID>/
│       └── fastQC_result.txt
├── 02_Trimmed_reads/Batch_1/
│   ├── <Sample_ID>/
│   │   ├── <Sample_ID>_trimmed.R1.fq.gz
│   │   ├── <Sample_ID>_trimmed.R2.fq.gz
│   │   ├── <Sample_ID>_unmerged.R1.fq.gz
│   │   ├── <Sample_ID>_unmerged.R2.fq.gz
│   │   └── <Sample_ID>_merge.fq.gz
│   └── 00_statistics/
│       ├── <Sample_ID>_*.stats
│       └── z_states_for_spreadsheet/
│           ├── total_bp_merged.txt
│           ├── total_bp_trimmed.txt
│           └── Len_avg_merged.txt
└── 05_KmerAnalysis/Batch_1/
    ├── <Sample_ID>/
    │   ├── <Sample_ID>.reads.kmer_freq.hist
    │   ├── <Sample_ID>.kmer.log
    │   ├── peak_1/
    │   └── peak_2/
    └── statistics/
        ├── statistics_all.csv
        └── kmer_profile_statistics_automated.csv

fastQC_result.txt, z_states_for_spreadsheet/*.txt, and kmer_profile_statistics_automated.csv are the main batch-level export files.
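To confirm that a run produced the batch-level export files, a check along the following lines may help. This is a minimal sketch, not part of the repository: the function name check_batch_outputs is hypothetical, and the file paths are taken from the output tree shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the workflow).
# Reports batch-level export files that are missing or empty after a run.
# Usage: check_batch_outputs <OutDir> <Batch_ID>
check_batch_outputs() {
    local out_dir="$1"
    local batch_id="$2"
    local stats_dir="02_Trimmed_reads/${batch_id}/00_statistics/z_states_for_spreadsheet"
    local missing=0
    local f

    for f in \
        "01_ReadQC_report/${batch_id}/raw_reads_QC/fastQC_result.txt" \
        "01_ReadQC_report/${batch_id}/after_fastp_QC/fastQC_result.txt" \
        "${stats_dir}/total_bp_merged.txt" \
        "${stats_dir}/total_bp_trimmed.txt" \
        "${stats_dir}/Len_avg_merged.txt" \
        "05_KmerAnalysis/${batch_id}/statistics/kmer_profile_statistics_automated.csv"
    do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${out_dir}/${f}" ]; then
            echo "MISSING: ${f}"
            missing=1
        fi
    done

    # Exit status 0 when every file is present, 1 otherwise
    return "$missing"
}

# Example call (paths are placeholders):
# check_batch_outputs /your/output/directory/Batch_1 Batch_1
```

The function prints one line per missing or empty file and returns a non-zero status if any are absent, so it can be chained after the run in a submission script.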

Usage

Clone the repo

cd /your_target_folder/
git clone https://github.com/Hazelhuangup/FSP_RawReads_Processing.git

Set up dependencies

  • Add /your_target_folder/FSP_RawReads_Processing/bin/ to your $PATH (append it, so the existing $PATH entries are kept)
echo 'export PATH="$PATH:/your_target_folder/FSP_RawReads_Processing/bin"' >> ~/.bashrc
  • Install Nextflow

Prepare the required inputs

1. Your input data directory structure

The directory that contains your input samples (e.g. Batch_1) must be structured in the following way:

Batch_1/
├── Sample_001
│   ├── Sample_001_R1.fastq.gz
│   └── Sample_001_R2.fastq.gz
├── Sample_002
│   ├── Sample_002_R1.fastq.gz
│   └── Sample_002_R2.fastq.gz
├── Sample_003
│   ├── Sample_003_R1.fastq.gz
│   └── Sample_003_R2.fastq.gz
└── sample.list

2. Prepare the file sample.list

The file sample.list lists the sample IDs, one per line, matching the sample folder names:

Sample_001
Sample_002
Sample_003
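Before launching a run, it can be worth checking that every sample in sample.list has both read files in place. The following is a minimal sketch, not part of the repository: the function name validate_batch is hypothetical, and it assumes the <Sample_ID>_R1.fastq.gz / <Sample_ID>_R2.fastq.gz naming shown in the directory structure above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the workflow).
# Checks that each sample in sample.list has R1 and R2 files in its own folder.
# Usage: validate_batch <InDir> [path/to/sample.list]
validate_batch() {
    local in_dir="$1"
    local list="${2:-$1/sample.list}"
    local status=0
    local sample mate

    if [ ! -f "$list" ]; then
        echo "sample.list not found: $list" >&2
        return 2
    fi

    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue   # skip blank lines
        for mate in R1 R2; do
            if [ ! -f "${in_dir}/${sample}/${sample}_${mate}.fastq.gz" ]; then
                echo "MISSING: ${sample}/${sample}_${mate}.fastq.gz"
                status=1
            fi
        done
    done < "$list"

    return "$status"
}

# Example call (path is a placeholder):
# validate_batch /your/input/directory/Batch_1
```

The function prints one line per missing file and returns 0 only when every listed sample is complete.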

If all the files you received are in one folder, you can use the following script to structure the folder and create the sample.list file. Adapt the batch ID in the script before running it. Note that the folder names can be adjusted at line 12 of the script.

bin/paired_dir.sh
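For orientation, the kind of restructuring involved can be sketched as follows. This is not the content of bin/paired_dir.sh, only an illustration under stated assumptions: the function name pair_into_dirs is hypothetical, and the files are assumed to be named <Sample_ID>_R1.fastq.gz and <Sample_ID>_R2.fastq.gz in a single flat folder.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (NOT the repository's bin/paired_dir.sh).
# Moves flat <Sample_ID>_R1/_R2.fastq.gz pairs into per-sample folders
# and writes sample.list. Any existing sample.list is overwritten.
# Usage: pair_into_dirs <batch_dir>
pair_into_dirs() {
    local batch_dir="$1"
    local r1 sample

    : > "${batch_dir}/sample.list"   # start with an empty list

    for r1 in "${batch_dir}"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue     # no match: the glob stays literal
        sample=$(basename "$r1" _R1.fastq.gz)

        # Only move complete pairs; leave orphaned R1 files where they are
        if [ ! -f "${batch_dir}/${sample}_R2.fastq.gz" ]; then
            echo "WARNING: no R2 file for ${sample}, skipping" >&2
            continue
        fi

        mkdir -p "${batch_dir}/${sample}"
        mv "${batch_dir}/${sample}_R1.fastq.gz" \
           "${batch_dir}/${sample}_R2.fastq.gz" \
           "${batch_dir}/${sample}/"
        echo "$sample" >> "${batch_dir}/sample.list"
    done
}

# Example call (path is a placeholder):
# pair_into_dirs /your/input/directory/Batch_1
```

If your files follow a different naming pattern, the suffixes in the glob and in the basename call would need adapting.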

3. Set up nextflow.config

  • Set the directory where you'd like to cache the conda environment. If you leave it blank, it defaults to ./work/conda.
vim nextflow.config
cacheDir = './your-desired-path/'
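As an alternative to editing the file, Nextflow also reads the conda cache location from the NXF_CONDA_CACHEDIR environment variable. The sketch below assumes cacheDir is left unset in nextflow.config; the directory name is a hypothetical placeholder, and on a cluster the path should be on storage visible to all compute nodes.

```shell
# Hypothetical cache location; pick a path with enough free space.
export NXF_CONDA_CACHEDIR="$HOME/nextflow_conda_cache"

# Create the directory up front so a typo in the path is caught early
mkdir -p "$NXF_CONDA_CACHEDIR"
```

Adding the export line to ~/.bashrc would make the setting persist across sessions.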

Running the workflow

1. Run with command line

nextflow run ReadQC.nf -profile conda -resume \
    --InDir /your/input/directory/Batch_1 \
    --OutDir /your/output/directory/Batch_1 \
    --Li /your/absolute/directory/to/sample.list \
    --Batch_ID Batch_1

2. Run with submission script

This repo contains an example script, ReadQC.sh, for running the pipeline on a Slurm-managed HPC. Feel free to copy it and replace the directories in ReadQC.sh with your own. Make sure to set up the sbatch directives according to your system settings. The Falco QC, fqStat, and k-mer statistics compilation steps are integrated into the Nextflow workflow, so no additional post-processing shell scripts are required after the pipeline completes. Then submit the script with:

sbatch --export=BATCH_ID=Batch_1 ReadQC.sh

3. Monitor the progress:

# general check if your recent submissions are successful
nextflow log
# monitor the current run
cd /your_current_job_running_folder/
tail -f QC_MAIN.log

Authors

The other members of the FSP bioinformatics team, Lia Obinu and Niall Garvey, and George Mears also contributed to the development and testing of this part of the workflow. This is a standalone pipeline for read processing. The full pipeline, including read processing, genome assembly and assessment, and decontamination, can be found at fspassemblypipeline (https://github.com/RBGKew/fspassemblypipeline/tree/main).

Citation

Please cite the URL or DOI (10.5281/zenodo.17608339) if you use this workflow in a paper.

References

  1. P. Di Tommaso, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017), https://doi.org/10.1038/nbt.3820
  2. S. Chen, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17), i884–i890 (2018), https://doi.org/10.1093/bioinformatics/bty560
  3. G. de Sena Brandine and A. D. Smith. Falco: high-speed FastQC emulation for quality control of sequencing data. F1000Research 8, 1874 (2021), https://doi.org/10.12688/f1000research.21142.2
