Workflow for processing of MinION Nanopore long read RNA sequencing data.
Explore the docs »
Report Bug
·
Request Feature
Table of Contents
File structure:
- 'scripts' contains scripts used to run the workflow. scripts are ordered by order of usage.
- 'data' contains data files used by the programs in the workflow. Data such as annotations, reference genome and index are here.
- 'data_input' is the folder for raw data files to be analyzed. Loose fastq.gz files are placed here and can be archived once finished.
- 'results' contains folders with all manipulated data. Programs output their temp files here as well.
Chopper is used for removing adapter sequences. minimap2 is used to align input data to reference genome. aligned .sam data is then filtered by FLAG value and converted to .bam files for Bambu.
-
Download from releases or clone the repo
Releasesgit clone https://github.com/ProjecticumDataScience/Project_Nanopore.git
-
Install
bash install_script.sh
This will do the following:
- Create conda environments based on the supplied exported environments located in /conda_environments
- Download reference files from ensembl
- Create an index file for minimap2 using reference files
- Download example data from sg-nex-data.s3.amazonaws.com
When running the quickscript.sh it automatically uses scripts such as 011_seqtk.sh and 012_chopper.sh. 011_seqtk.sh has an extra line to work in tandem with the two example SGNex data files:
seqtk sample -s123 "$file" 250000 | gzip > "s_${file}"
seqtk sample -s451 "$file" 250000 | gzip > "s2_${file}"The purpose is to provide DESeq with more than 2 files as it doesn't function without them. It also makes the files smaller for easier and faster running of the example workflow.
This extra line should be removed when running a full workflow with real data, or should be altered to suit your needs if subsetting is desired.
011_seqtk.sh alters the raw data_input, and 012_chopper.sh uses this data to create the output in /filter_fastqc/. 012_chopper.sh also comes with automatic file cleanup. Make sure the /data_input/ files are stored somewhere safe to prevent them from having to be redownloaded manually later. Such as /data_input/downloaded, then copy the example files to /data_input/ when running the workflow.
- Make sure reference genome and annotations are downloaded in
~/Project_Nanopore/data- Change to scripts directory
cd ~/Project_Nanopore/scripts- Activate scripts (See below for more information)
bash quickscript.sh- Run Bambu through 050_bambu.R
se <- bambu(reads = samples.bam, annotations = annotations, genome = fa.file)- Run Deseq analysis with summarizedExperiment from Bambu
dds <- DESeqDataSet(se,
design = ~ treatment)This script is essentially just a wrapper for all the scripts used in this workflow. Because the process is time-consuming this script automates a large majority of the workflow after tuning parameters in the seperate scripts and choosing the scripts you'd like to run in quickscript.sh This is an example of what that looks like:
bash 009_bam_to_fastq.sh
bash 011_seqtk.sh
bash 012_chopper.sh
bash 020_minimap2.sh
bash 030_samflagtobam.sh
bash 041_bam_filter.sh
echo doneThis workflow generally takes .fastq files as input in /data_input.
This script converts .bam files in /data_input to .fastq files using samtools bam2fq
samtools bam2fq file.bam > file.fastqThis script performs the fastqc quality control analysis on all .fastq files currently in the /data_input folder.
fastqc --outdir $fastqc_output_dir $fileAlternatively, NanoPlot can be used. This script performs the NanoPlot quality control analysis on all .fastq files currently in the /data_input folder.
NanoPlot -t 8 -o "$nanoplot_output_dir" --fastq "$file"This script is optional and is used for subsetting the amount of reads in the input files. The script should be altered depending on your needs.
seqtk sample -s123 "$file" 2500000 > "s_${file}"
# `-s123` is the seed used when generating the random subset
# `2500000` is the amount of desired reads seqtk can take either fastq or fastq.gz files and the resulting file is in .gz format which is required for 020_minimap.sh
Chopper is a tool meant for filtering and trimming long read fastq files. It can perform --headcrop or --tailcrop to cut off parts at the start or end of a read, such as adapters. It can also be used to filter reads based on quality, by default this script uses (Q>10)
chopper --quality 10chopper can take either fastq or fastq.gz files and the resulting file is in .gz format which is required for 020_minimap.sh
Minimap2 aligns the input files against a reference genome. The provided settings are recommended for nanopore data, however some of these options are overwritten by the pre-built index file.
minimap2 -ax splice -k14 -uf -t8 --junc-bed "$JUNC" "$REF" --split-prefix tmp input.fastq > output.sam
# default settings are recommended settings for longread sequences.
# a junction file is used for additional accuracy around junctions
# --split-prefix is used by default due to RAM constraints and a split-index is used as a referenceMinimap2 can make use of a junction.bed file for splicing junctions. The junc file can be generated as such as per the minimap2 github page:
paftools.js gff2bed anno.gff > anno.bedThis script filters the files based on sam FLAGS. It filters FLAG 0 and FLAG 16 by default:
awk 'BEGIN {OFS="\t"} $2 == 0 || $2 == 16 {print $0}' "$file" >> "${OUTPUT_DIR}$file"This script converts then .sam files to .bam files using samtools. The script takes input from /results/flag_filter_sam/ to output /results/bam/
samtools view -o "${OUTPUT_DIR}${base_name}.bam" "${INPUT_DIR}${file}"This script filters bam files based on chromosomes. By default it outputs chr1-22. In order to do so it first checks for a .bed file and if it doesn't exist generates it based on the reference genome in /data/ It then uses samtools bamfilt to generate the filtered file.
samtools view -L GRCh38.bed -o bamfilt_outputfile input_fileThis is the R file in which the bambu is performed. It utilizes the following packages:
here
ggbio
bambu
tidyverseBambu requires an fa.file and gtf.file. It then creates an annotation object using bambu::prepareAnnotations(gtf.file). This annotation file can optionally be saved as an .RDS file and using a list of all .bam samples in /results/bam the summarizedExperiment object is created using bambu.
fa.file <- here("data/index/Homo_sapiens.GRCh38.dna.primary_assembly.fa")
gtf.file <- here("data/Homo_sapiens.GRCh38.111.gtf")
annotations <- prepareAnnotations(gtf.file) # This function creates a reference annotation object which is used for transcript discovery and quantification in Bambu.
annotations %>% saveRDS(here("data/bambu_annotations_Homo_sapiens.GRCh38.111.RDS"))
samples.bam <- list.files(here("results/bam"), pattern = ".bam$", full.names = TRUE)
# running Bambu
se <- bambu(reads = samples.bam, annotations = annotations, genome = fa.file, ncore = 12)
se %>% saveRDS(here("results/bambu/output.rds"))This summarisedExperiment(se) file is saved for later and can be used for the deseq analysis. It has built-in plotting features too:
plotBambu(se, type = "pca")This script converts processed .BAM files to BigWig (.bw) files and filters based on exons for export and viewing in IGV browser.
See IGV Browser
The markdown file where the differential expression analysis is performed. It uses the summarisedExperiment output from bambu as input. The DESeq object is created as follows:
dds <- DESeqDataSet(se,
design = ~ treatment)This allows for results such as volcano plots

and up- and down regulation of genes with ensembl identifiers

- Run the main workflow up to the usage of at least 041_bam_filter.sh. This will provide .bam files located in /results/bam
- Run the bigwig script:
bash 050_bigwig.sh
The bigwig script will now execute samtools to sort and index the bam files from the /results/bam folder into the /results/bw folder. These files are then converted to small .bw files and the temp files generated by the sorting and indexing are removed. 3. Download the .bw files residing in the /results/bw folder and open in the IGV browser locally or upload to the web app
050_bigwig.sh converts files from .bam to .bw as such:
awk '$3 == "exon" {print $1, $4-1, $5}' OFS='\t' annotation.gtf > exons.bed
samtools sort input.bam -o output.bam
bedtools intersect -a output.bam -b exons.bed > exons_output.bam
samtools index exons_output.bam
bamCoverage -b exons_output.bam -o output.bw --normalizeUsing RPKMThe project is currently still being built and worked on, new future features might be added here once 1.0 is released.
- 1.0 Release
- New feature 2
- New feature 3
- Nested Feature
See the open issues for a full list of proposed features (and known issues).
Distributed under the MIT License. See LICENSE for more information.
Alex Groot - alex.groot@student.hu.nl
Project Link: https://github.com/ProjecticumDataScience/Project_Nanopore



