This is a Nextflow workflow that quality-checks, cleans, and merges Illumina reads and creates k-mer profiles.
This workflow was developed for the Fungarium Sequencing Project (FSP) at Royal Botanic Gardens, Kew. It may be useful to other projects that assemble genomes from difficult samples with degraded DNA.
The workflow was designed to process hundreds of samples in parallel.
Inputs:
Raw sequence data from Illumina short-read sequencers
Outputs:
- Clean, deduplicated reads, with reads shorter than 30 bp removed
- QC reports for each sample of both raw and clean reads
- QC statistics of the whole batch, including data size, read number, read length, GC content, duplication level etc.
- K-mer distribution graphs for each sample
- K-mer statistics of the whole batch, including unique k-mer number, total k-mer number, estimated genome size, peak coverage, etc.
After one run, outputs are organized under --OutDir as below (replace Batch_1 with your --Batch_ID):
${OutDir}/
├── 01_ReadQC_report/Batch_1/
│   ├── raw_reads_QC/
│   │   ├── <Sample_ID>/
│   │   └── fastQC_result.txt
│   └── after_fastp_QC/
│       ├── <Sample_ID>/
│       └── fastQC_result.txt
├── 02_Trimmed_reads/Batch_1/
│   ├── <Sample_ID>/
│   │   ├── <Sample_ID>_trimmed.R1.fq.gz
│   │   ├── <Sample_ID>_trimmed.R2.fq.gz
│   │   ├── <Sample_ID>_unmerged.R1.fq.gz
│   │   ├── <Sample_ID>_unmerged.R2.fq.gz
│   │   └── <Sample_ID>_merge.fq.gz
│   └── 00_statistics/
│       ├── <Sample_ID>_*.stats
│       └── z_states_for_spreadsheet/
│           ├── total_bp_merged.txt
│           ├── total_bp_trimmed.txt
│           └── Len_avg_merged.txt
└── 05_KmerAnalysis/Batch_1/
    ├── <Sample_ID>/
    │   ├── <Sample_ID>.reads.kmer_freq.hist
    │   ├── <Sample_ID>.kmer.log
    │   ├── peak_1/
    │   └── peak_2/
    └── statistics/
        ├── statistics_all.csv
        └── kmer_profile_statistics_automated.csv
fastQC_result.txt, z_states_for_spreadsheet/*.txt, and kmer_profile_statistics_automated.csv are the main batch-level export files.
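To gather the batch-level files after a run, a small helper along these lines may be convenient. It is only a sketch that assumes the directory layout shown above; the function name `list_batch_exports` is hypothetical and not part of the workflow.

```shell
# Print the batch-level export files that exist under an output directory.
# Assumes the documented layout; list_batch_exports is a hypothetical helper,
# not a script shipped with the workflow.
list_batch_exports() {
    local out_dir="$1" batch_id="$2"
    local f
    for f in \
        "$out_dir/01_ReadQC_report/$batch_id/raw_reads_QC/fastQC_result.txt" \
        "$out_dir/01_ReadQC_report/$batch_id/after_fastp_QC/fastQC_result.txt" \
        "$out_dir/02_Trimmed_reads/$batch_id/00_statistics/z_states_for_spreadsheet/"*.txt \
        "$out_dir/05_KmerAnalysis/$batch_id/statistics/kmer_profile_statistics_automated.csv"
    do
        # An unmatched glob stays literal, so test for existence before printing.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

For example, `list_batch_exports /your/output/directory Batch_1` prints one path per line, which can be piped into `xargs` or `tar` to bundle the files.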
cd /your_target_folder/
git clone https://github.com/Hazelhuangup/FSP_RawReads_Processing.git
- Add /your_target_folder/FSP_RawReads_Processing/bin/ to your $PATH
echo 'export PATH="$PATH:/your_target_folder/FSP_RawReads_Processing/bin"' >> ~/.bashrc
- Install Nextflow
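For the PATH step above, the bin directory is best appended to the existing `$PATH` rather than assigned on its own, so that system commands stay reachable. A minimal sketch of an idempotent version follows; `add_to_path` is a hypothetical helper, not part of the repository.

```shell
# Append a directory to PATH only if it is not already listed.
# add_to_path is a hypothetical helper name used for illustration.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;        # keep the existing entries and add the new one
    esac
    export PATH
}
```

Calling `add_to_path /your_target_folder/FSP_RawReads_Processing/bin` from `~/.bashrc` avoids duplicate entries when the file is sourced more than once.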
The directory that contains your input samples (e.g. Batch_1) must be structured in the following way:
Batch_1/
├── Sample_001
│   ├── Sample_001_R1.fastq.gz
│   └── Sample_001_R2.fastq.gz
├── Sample_002
│   ├── Sample_002_R1.fastq.gz
│   └── Sample_002_R2.fastq.gz
├── Sample_003
│   ├── Sample_003_R1.fastq.gz
│   └── Sample_003_R2.fastq.gz
└── sample.list
The file sample.list lists the sample IDs, one per line:
Sample_001
Sample_002
Sample_003
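Before starting a run, it may be worth confirming that every sample in sample.list has both read files in place. The sketch below assumes the naming shown in the example layout (`<Sample_ID>/<Sample_ID>_R1.fastq.gz` and `_R2.fastq.gz`); `check_batch_layout` is a hypothetical helper, not a script from the repository.

```shell
# Check that every sample named in sample.list has its R1 and R2 files.
# check_batch_layout is a hypothetical helper; the file naming follows the
# example layout (<Sample_ID>/<Sample_ID>_R1.fastq.gz and _R2.fastq.gz).
check_batch_layout() {
    local batch_dir="$1" sample mate missing=0
    while IFS= read -r sample || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue            # skip blank lines
        for mate in R1 R2; do
            if [ ! -f "$batch_dir/$sample/${sample}_${mate}.fastq.gz" ]; then
                echo "missing: $sample/${sample}_${mate}.fastq.gz" >&2
                missing=$((missing + 1))
            fi
        done
    done < "$batch_dir/sample.list"
    [ "$missing" -eq 0 ]                        # non-zero exit if anything is missing
}
```

`check_batch_layout /your/input/directory/Batch_1 && echo "layout OK"` reports missing files on stderr and returns a non-zero status if any are absent.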
If all the files you received are in one folder, you can use the following script to structure the folder and create the sample.list file. Adapt the batch ID in the script. Note that the folder names can be adjusted at line 12 of the script.
bin/paired_dir.sh
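To illustrate what this restructuring step involves, here is a rough sketch. It is not the contents of bin/paired_dir.sh; it assumes the flat folder holds files named `<Sample_ID>_R1.fastq.gz` and `<Sample_ID>_R2.fastq.gz`, and `structure_batch` is a hypothetical function name.

```shell
# Move paired FASTQ files from a flat folder into one sub-folder per sample
# and write sample.list. Illustrative only, not the contents of
# bin/paired_dir.sh; assumes <Sample_ID>_R1.fastq.gz / _R2.fastq.gz naming.
structure_batch() {
    local batch_dir="$1" r1 sample
    : > "$batch_dir/sample.list"                 # start with an empty list
    for r1 in "$batch_dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # no matches: glob stays literal
        sample=$(basename "$r1" _R1.fastq.gz)
        if [ ! -e "$batch_dir/${sample}_R2.fastq.gz" ]; then
            echo "skipping $sample: no R2 file" >&2
            continue
        fi
        mkdir -p "$batch_dir/$sample"
        mv "$r1" "$batch_dir/${sample}_R2.fastq.gz" "$batch_dir/$sample/"
        echo "$sample" >> "$batch_dir/sample.list"
    done
}
```

Samples without a matching R2 file are left untouched and reported on stderr, so unpaired files can be inspected by hand.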
- Change where you'd like to cache the conda environment. If you leave it blank, the default is ./work/conda.
vim nextflow.config
cacheDir = './your-desired-path/'
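As an alternative to editing nextflow.config, Nextflow also documents an `NXF_CONDA_CACHEDIR` environment variable for the conda cache location. The sketch below sets it to an absolute path; `use_conda_cache` is a hypothetical helper, and how the variable interacts with a `cacheDir` already set in the config is worth checking in the Nextflow documentation.

```shell
# Create a conda cache directory and point Nextflow at it through the
# NXF_CONDA_CACHEDIR environment variable.
# use_conda_cache is a hypothetical helper name used for illustration.
use_conda_cache() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    # Resolve to an absolute path so the setting does not depend on the
    # directory the pipeline is launched from.
    NXF_CONDA_CACHEDIR="$(cd "$dir" && pwd)"
    export NXF_CONDA_CACHEDIR
}
```

On a cluster, the chosen directory should be on a file system that all compute nodes can reach.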
nextflow run ReadQC.nf -profile conda -resume \
    --InDir /your/input/directory/Batch_1 \
    --OutDir /your/output/directory/Batch_1 \
    --Li /your/absolute/directory/to/sample.list \
    --Batch_ID Batch_1
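Since the command takes several paths, a small wrapper that checks them and prints the full command for review can reduce typos. This is a sketch only: `print_readqc_cmd` is a hypothetical name, it assumes sample.list sits inside the input directory as in the layout above, and it uses only the flags shown in the command above.

```shell
# Check the inputs, then print the nextflow command for review.
# print_readqc_cmd is a hypothetical wrapper; it assumes sample.list lives
# inside the input directory and uses only the flags documented above.
print_readqc_cmd() {
    local in_dir="$1" out_dir="$2" batch_id="$3"
    local list="$in_dir/sample.list"
    [ -d "$in_dir" ] || { echo "input directory not found: $in_dir" >&2; return 1; }
    [ -s "$list" ] || { echo "sample.list missing or empty: $list" >&2; return 1; }
    # %q quotes each value so the printed line can be pasted into a bash shell.
    printf 'nextflow run ReadQC.nf -profile conda -resume --InDir %q --OutDir %q --Li %q --Batch_ID %q\n' \
        "$in_dir" "$out_dir" "$list" "$batch_id"
}
```

For instance, `print_readqc_cmd /your/input/directory/Batch_1 /your/output/directory/Batch_1 Batch_1` prints the command on one line, or an error if the input directory or sample.list is missing.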
This repo contains an example ReadQC.sh for running the pipeline on a Slurm-managed HPC. Feel free to copy it and replace the directories in ReadQC.sh with yours. Make sure to set up your sbatch script according to your system settings. FalcoQC, fqStat, and k-mer statistics compilation are integrated into the Nextflow workflow, so no additional post-processing shell scripts are required after the pipeline completes. Submit the script with:
sbatch --export=BATCH_ID=Batch_1 ReadQC.sh
# check whether your recent runs succeeded
nextflow log
# monitor the current run
cd /your_current_job_running_folder/
tail -f QC_MAIN.log
- Wu Huang
- Royal Botanic Gardens, Kew
- ORCID profile
The other members of the FSP bioinformatics team, Lia Obinu and Niall Garvey, as well as George Mears, also contributed to the development and testing of this part of the workflow. This is a standalone pipeline for read processing. The full pipeline, including read processing, genome assembly and assessment, and decontamination, can be found at [fspassemblypipeline](https://github.com/RBGKew/fspassemblypipeline/tree/main).
Please cite the URL or DOI (10.5281/zenodo.17608339) if you use this workflow in a paper.
- Di Tommaso P, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017), https://doi.org/10.1038/nbt.3820
- Chen S, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17), i884–i890 (2018), https://doi.org/10.1093/bioinformatics/bty560
- de Sena Brandine G and Smith AD. Falco: high-speed FastQC emulation for quality control of sequencing data. F1000Research 8, 1874 (2021), https://doi.org/10.12688/f1000research.21142.2
