Skip to content
This repository was archived by the owner on Jul 3, 2024. It is now read-only.

AlexPGroot/Project_Nanopore

Repository files navigation


Project Nanopore

Workflow for processing of MinION Nanopore long read RNA sequencing data.
Explore the docs »

Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

Product Name Screen Shot

Product Name Screen Shot

Built With

made-with-bash
RStudio Community: RStudio Cloud

(back to top)

Getting Started

File structure:

  • 'scripts' contains scripts used to run the workflow. scripts are ordered by order of usage.
  • 'data' contains data files used by the programs in the workflow. Data such as annotations, reference genome and index are here.
  • 'data_input' is the folder for raw data files to be analyzed. Loose fastq.gz files are placed here and can be archived once finished.
  • 'results' contains folders with all manipulated data. Programs output their temp files here as well.

Chopper is used for removing adapter sequences. minimap2 is used to align input data to reference genome. aligned .sam data is then filtered by FLAG value and converted to .bam files for Bambu.

Installation

  1. Download from releases or clone the repo
    Releases

    git clone https://github.com/ProjecticumDataScience/Project_Nanopore.git
  2. Install

    bash install_script.sh

    This will do the following:

    1. Create conda environments based on the supplied exported environments located in /conda_environments
    2. Download reference files from ensembl
    3. Create an index file for minimap2 using reference files
    4. Download example data from sg-nex-data.s3.amazonaws.com

(back to top)

Usage

Example workflow [important]

When running the quickscript.sh it automatically uses scripts such as 011_seqtk.sh and 012_chopper.sh. 011_seqtk.sh has an extra line to work in tandem with the two example SGNex data files:

seqtk sample -s123 "$file" 250000 | gzip > "s_${file}"
seqtk sample -s451 "$file" 250000 | gzip > "s2_${file}"

The purpose is to provide DESeq with more than 2 files as it doesn't function without them. It also makes the files smaller for easier and faster running of the example workflow.

This extra line should be removed when running a full workflow with real data, or should be altered to suit your needs if subsetting is desired.

011_seqtk.sh alters the raw data_input, and 012_chopper.sh uses this data to create the output in /filter_fastqc/. 012_chopper.sh also comes with automatic file cleanup. Make sure the /data_input/ files are stored somewhere safe to prevent them from having to be redownloaded manually later. Such as /data_input/downloaded, then copy the example files to /data_input/ when running the workflow.

Main workflow

  1. Make sure reference genome and annotations are downloaded in
~/Project_Nanopore/data
  1. Change to scripts directory
cd ~/Project_Nanopore/scripts
  1. Activate scripts (See below for more information)
bash quickscript.sh
  1. Run Bambu through 050_bambu.R
se <- bambu(reads = samples.bam, annotations = annotations, genome = fa.file)
  1. Run Deseq analysis with summarizedExperiment from Bambu
dds <- DESeqDataSet(se, 
           design = ~ treatment)

(back to top)

Scripts

quickscript.sh

This script is essentially just a wrapper for all the scripts used in this workflow. Because the process is time-consuming this script automates a large majority of the workflow after tuning parameters in the seperate scripts and choosing the scripts you'd like to run in quickscript.sh This is an example of what that looks like:

bash 009_bam_to_fastq.sh
bash 011_seqtk.sh
bash 012_chopper.sh
bash 020_minimap2.sh
bash 030_samflagtobam.sh
bash 041_bam_filter.sh
echo done

009_bam_to_fastq.sh

This workflow generally takes .fastq files as input in /data_input.
This script converts .bam files in /data_input to .fastq files using samtools bam2fq

samtools bam2fq file.bam > file.fastq

010_fastqc.sh

This script performs the fastqc quality control analysis on all .fastq files currently in the /data_input folder.

fastqc --outdir $fastqc_output_dir $file

010_nanoPlot.sh

Alternatively, NanoPlot can be used. This script performs the NanoPlot quality control analysis on all .fastq files currently in the /data_input folder.

NanoPlot -t 8 -o "$nanoplot_output_dir" --fastq "$file"

011_seqtk.sh

This script is optional and is used for subsetting the amount of reads in the input files. The script should be altered depending on your needs.

seqtk sample -s123 "$file" 2500000 > "s_${file}"
# `-s123` is the seed used when generating the random subset
# `2500000` is the amount of desired reads 

seqtk can take either fastq or fastq.gz files and the resulting file is in .gz format which is required for 020_minimap.sh

012_chopper.sh

Chopper is a tool meant for filtering and trimming long read fastq files. It can perform --headcrop or --tailcrop to cut off parts at the start or end of a read, such as adapters. It can also be used to filter reads based on quality, by default this script uses (Q>10)

chopper --quality 10

chopper can take either fastq or fastq.gz files and the resulting file is in .gz format which is required for 020_minimap.sh

020_minimap2.sh

Minimap2 aligns the input files against a reference genome. The provided settings are recommended for nanopore data, however some of these options are overwritten by the pre-built index file.

minimap2 -ax splice -k14 -uf -t8 --junc-bed "$JUNC" "$REF" --split-prefix tmp input.fastq > output.sam
# default settings are recommended settings for longread sequences.
# a junction file is used for additional accuracy around junctions
# --split-prefix is used by default due to RAM constraints and a split-index is used as a reference

Minimap2 can make use of a junction.bed file for splicing junctions. The junc file can be generated as such as per the minimap2 github page:

paftools.js gff2bed anno.gff > anno.bed

030_samflagtobam.sh

This script filters the files based on sam FLAGS. It filters FLAG 0 and FLAG 16 by default:

awk 'BEGIN {OFS="\t"} $2 == 0 || $2 == 16 {print $0}' "$file" >> "${OUTPUT_DIR}$file"

This script converts then .sam files to .bam files using samtools. The script takes input from /results/flag_filter_sam/ to output /results/bam/

samtools view -o "${OUTPUT_DIR}${base_name}.bam" "${INPUT_DIR}${file}"

041_bam_filter.sh

This script filters bam files based on chromosomes. By default it outputs chr1-22. In order to do so it first checks for a .bed file and if it doesn't exist generates it based on the reference genome in /data/ It then uses samtools bamfilt to generate the filtered file.

samtools view -L GRCh38.bed -o bamfilt_outputfile input_file

050_bambu.R

This is the R file in which the bambu is performed. It utilizes the following packages:

here
ggbio
bambu
tidyverse

Bambu requires an fa.file and gtf.file. It then creates an annotation object using bambu::prepareAnnotations(gtf.file). This annotation file can optionally be saved as an .RDS file and using a list of all .bam samples in /results/bam the summarizedExperiment object is created using bambu.

fa.file <- here("data/index/Homo_sapiens.GRCh38.dna.primary_assembly.fa")
gtf.file <- here("data/Homo_sapiens.GRCh38.111.gtf")
annotations <- prepareAnnotations(gtf.file) # This function creates a reference annotation object which is used for transcript discovery and quantification in Bambu.
annotations %>% saveRDS(here("data/bambu_annotations_Homo_sapiens.GRCh38.111.RDS"))
samples.bam <- list.files(here("results/bam"), pattern = ".bam$", full.names = TRUE)

# running Bambu 
se <- bambu(reads = samples.bam, annotations = annotations, genome = fa.file, ncore = 12)
se %>% saveRDS(here("results/bambu/output.rds"))

This summarisedExperiment(se) file is saved for later and can be used for the deseq analysis. It has built-in plotting features too:

plotBambu(se, type = "pca")

050_bigwig.sh

This script converts processed .BAM files to BigWig (.bw) files and filters based on exons for export and viewing in IGV browser. See IGV Browser

050_deseq.Rmd

The markdown file where the differential expression analysis is performed. It uses the summarisedExperiment output from bambu as input. The DESeq object is created as follows:

dds <- DESeqDataSet(se, 
           design = ~ treatment)

This allows for results such as volcano plots

and up- and down regulation of genes with ensembl identifiers

(back to top)

IGV Browser

This workflow also allows for conversion of processed .BAM files to BigWig (.bw) files for export and viewing in IGV browser.
  1. Run the main workflow up to the usage of at least 041_bam_filter.sh. This will provide .bam files located in /results/bam
  2. Run the bigwig script:
    bash 050_bigwig.sh

The bigwig script will now execute samtools to sort and index the bam files from the /results/bam folder into the /results/bw folder. These files are then converted to small .bw files and the temp files generated by the sorting and indexing are removed. 3. Download the .bw files residing in the /results/bw folder and open in the IGV browser locally or upload to the web app

050_bigwig.sh converts files from .bam to .bw as such:

awk '$3 == "exon" {print $1, $4-1, $5}' OFS='\t' annotation.gtf > exons.bed

samtools sort input.bam -o output.bam
bedtools intersect -a output.bam -b exons.bed > exons_output.bam
samtools index exons_output.bam
bamCoverage -b exons_output.bam -o output.bw --normalizeUsing RPKM

(back to top)

Roadmap

The project is currently still being built and worked on, new future features might be added here once 1.0 is released.

  • 1.0 Release
  • New feature 2
  • New feature 3
    • Nested Feature

See the open issues for a full list of proposed features (and known issues).

(back to top)

Release history

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Alex Groot - alex.groot@student.hu.nl

Project Link: https://github.com/ProjecticumDataScience/Project_Nanopore

Acknowledgments

(back to top)

About

Unmaintained copy of school project

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors