Skip to content

Juke34/RAIN

Repository files navigation

GitHub CI License: MIT

RAIN - RNA Alterations Investigation using Nextflow

RAIN is a Nextflow workflow designed for epitranscriptomic analyses, enabling the detection of RNA modifications in comparison to a reference genome. Its primary goal is to distinguish true RNA editing events from genomic variants such as SNPs, with a particular emphasis on identifying A-to-I (Adenosine-to-Inosine) editing.

Table of Contents

Foreword

RNA editing and other epitranscriptomic alterations are increasingly recognized as important sources of biological regulation and potential biomarkers in health and disease. However, their study remains technically challenging because true RNA modifications must be distinguished from sequencing artefacts, mapping issues, and genomic variants. This makes reproducible and accessible analysis workflows essential for turning raw RNA sequencing data into interpretable biological signals.

RAIN was developed to address this need by providing an end-to-end Nextflow workflow for the detection, aggregation, and downstream exploration of RNA editing events. Starting from FASTQ, BAM, or sample sheet inputs, RAIN combines alignment, quality control, editing-site detection, feature-level aggregation, and downstream filtering into a single reproducible framework. Its goal is to make epitranscriptomic analyses easier to run, easier to reproduce, and more useful for identifying candidate biomarkers from properly designed cohorts.

Flowchart

The diagram below summarizes the main steps of the RAIN workflow, from input preparation to feature- and aggregate-level editing matrices.

%%{init: {"flowchart": {"htmlLabels": true, "useMaxWidth": true, "wrappingWidth": 260}} }%%
flowchart TD
  A["Input reads<br/>FASTQ, BAM or CSV sample sheet"] --> B["Input validation<br/>and metadata parsing"]
  R["Reference genome<br/>and optional annotation"] --> C["Reference preparation<br/>FASTA unzip and annotation normalization"]
  B --> D["AliNe alignment workflow<br/>and library type handling"]
  C --> D
  D --> E["BAM post-processing<br/>sorting, indexing, optional duplicate removal and clip overlap"]
  D --> F["Optional QC<br/>FastQC and MultiQC"]
  E --> G["Editing site detection<br/>Reditools3 by default, or Reditools2, Jacusa2"]
  E --> H["Optional hyper-editing detection<br/>on unmapped reads"]
  G --> I["Pluviometer aggregation<br/>feature, transcript, gene, chromosome and genome levels"]
  H --> I
  I --> J["Drip filtering and metric computation<br/>ESPF and ESPR tables"]
  J --> L["Barometer downstream analysis<br/>candidate biomarker detection and reporting"]
Loading

Installation

The prerequisites to run the pipeline are:

  • Nextflow >= 22.04.0
  • Docker or Singularity
  • Build the required container images locally by running ./build_containers.sh after cloning the repository

Nextflow

  • Via conda

    See here
    conda create -n nextflow
    conda activate nextflow
    conda install bioconda::nextflow
  • Manually

    See here Nextflow runs on most POSIX systems (Linux, macOS, etc) and can typically be installed by running these commands:
    # Make sure 11 or later is installed on your computer by using the command:
    java -version
    
    # Install Nextflow by entering this command in your terminal(it creates a file nextflow in the current dir):
    curl -s https://get.nextflow.io | bash 
    
    # Add Nextflow binary to your user's PATH:
    mv nextflow ~/bin/
    # OR system-wide installation:
    # sudo mv nextflow /usr/local/bin

Container platform

To run the workflow you will need a container platform: docker or singularity. See below for installation instructions.

Docker

Please follow the instructions at the Docker website

Singularity

Please follow the instructions at the Singularity website

Softwares

Softwares are distributed as container images, ensuring reproducibility and enabling automatic retrieval by the pipeline. Version information for each software package is defined in config/softwares.config. To update a software version, simply modify the corresponding container image reference in that file.

Some images are pulled automatically from public registries, but RAIN also relies on project-specific images such as jacusa2, reditools2, reditools3, and pluviometer. These project-specific images must be built locally before running the workflow, otherwise the docker and singularity profiles will not be able to start the corresponding processes.

After cloning the repository and installing your container platform, run:

./build_containers.sh

The script auto-detects the available container engine and builds the corresponding images:

  • Docker images for the docker profile
  • Singularity/Apptainer .sif images in sif_images/ for the singularity profile

You can also target a specific engine explicitly:

./build_containers.sh --docker
./build_containers.sh --singularity

Usage

Help

You can first check the available options and parameters by running:

nextflow run Juke34/RAIN -r v0.0.3 --help

Profile

By default, RAIN uses a local executor for scheduling tasks. However, to provide the software environment, you should run with a container profile.

Select a profile according to the container platform you want to use:

  • singularity, a profile using Singularity to run the containers
  • docker, a profile using Docker to run the containers

The command will look like that:

nextflow run Juke34/RAIN -r v0.0.3 -profile docker <rest of paramaters>

Another profile is also available:

  • slurm, to add if your system has a slurm executor (local by default)

The use of the slurm profile gives a command like this one:

nextflow run Juke34/RAIN -r v0.0.3 -profile singularity,slurm <rest of paramaters>

Recommended usage is to always include a container profile (docker or singularity).

Parallelization and resource usage

RAIN parallelizes execution at two levels:

  1. Task-level parallelism: multiple pipeline tasks can run concurrently (controlled by maxForks).
  2. Per-task resources: each process can request specific CPU, memory, and time values.

Local execution (default)

With the local profile, Nextflow runs tasks on the local machine. Resource settings are defined in config/resources/local.config.

  • Default per task: cpus = 1, time = 1h
  • Maximum concurrent tasks: maxForks = 8
  • Examples of process overrides:
    • hisat2: cpus = 6
    • aline: cpus = 4, memory = 16GB, time = 2d
    • fastqc: cpus = 2
    • drip: cpus = 4
    • reditools3: cpus = 2, memory = 4G

Local throughput depends on the resources of your workstation.

SLURM execution (-profile slurm)

With the SLURM profile, tasks are submitted to the scheduler and run as cluster jobs. Main settings come from nextflow.config and config/resources/hpc.config.

  • Scheduler settings:
    • executor = slurm
    • queue = normal
    • maxForks = 12
    • errorStrategy = retry, maxRetries = 3
    • clusterOptions = "--constraint=infiniband"
  • Default per task in HPC config: cpus = 1, time = 4h
  • Examples of process overrides:
    • fastp, jacusa2, reditools3, samtools: up to cpus = 16
    • fastqc, pigz, pluviometer: cpus = 8
    • drip: cpus = 16
    • aline: launcher process in RAIN (cpus = 1, memory = 4GB), while AliNe manages its own submitted job

In SLURM mode, effective parallelism depends on maxForks, cluster availability/quotas, and per-process resource requests.

For details and tuning, see config/resources/local.config and config/resources/hpc.config.

Test

Several built-in test profiles are available:

  • test_short_single_list: test using an explicit list of input files
  • test_short_paired_folder: test using a folder containing paired-end FASTQ files
  • test_short_single_csv: test using a CSV sample sheet (single-end data)
  • test_short_paired_csv: test using a CSV sample sheet (paired-end data)
  • test_mix_csv: test using a mixed CSV input (single-end data, paired-end data, and bam files)

With Nextflow and Docker available, you can run one of these tests as follows:

nextflow run -profile docker,test_short_single_list Juke34/RAIN -r v0.0.3

Replace test_short_single_list with any of the other test profiles above depending on the input mode you want to validate.

Or via a clone of the repository:

git clone https://github.com/Juke34/rain.git
cd rain
nextflow run -profile docker,test_short_single_list rain.nf

Parameters

    RAIN - RNA Alterations Investigation using Nextflow - v0.0.3

        Usage example:
    nextflow run rain.nf -profile docker --genome /path/to/genome.fa --annotation /path/to/annotation.gff3 --reads /path/to/reads_folder --output /path/to/output --aligner hisat2

        Parameters:
    --help                      Prints the help section

        Input sequences:
    --annotation                Path to the annotation file (GFF or GTF)
    --genome                    Path to the reference genome in FASTA format.
    --read_type                 Type of reads among this list [short_paired, short_single, pacbio, ont] [no default]
    --reads                     path to the reads file, folder or csv. If a folder is provided, all the files with proper extension in the folder will be used. You can provide remote files (commma separated list).
                                    file extension expected : <.fastq.gz>, <.fq.gz>, <.fastq>, <.fq> or <.bam>. 
                                                              for paired reads extra <_R1_001> or <_R2_001> is expected where <R> and <_001> are optional. e.g. <sample_1.fastq.gz>, <sample_R1.fastq.gz>, <sample_R1_001.fastq.gz>)
                                    CSV Input Format:
                                        4 required columns: group, input_1, strandedness and read_type.
                                        3 optional columns: - input_2   : in case of paired-end data
                                                            - sample    : by default, the sample name is extracted from the filename and would be uniq. However, you must provide this column if you have technical replicates, so that replicates from the same sample are correctly grouped. 
                                                            - replicate : Specifies the replicate number. Otherwise, each sample will be treated as independent and assigned as rep1. Required the sample column.
                                    Strandedness, read_type expects same values as corresponding AliNe parameter; If a value is provided via AliNe parameter, it will override the value in the csv file.
                                    
                                    Example of csv file with biological replicates (with paired end data and bam files):
                                        group,input_1,input_2,strandedness,read_type
                                        control,path/to/data1.bam,,auto,short_single
                                        control,path/to/data2.bam,,auto,short_single
                                        desease,path/to/data3_R1.fastq.gz,path/to/data3_R2.fastq.gz,auto,short_paired
                                        desease,path/to/data4_R1.fastq.gz,path/to/data4_R2.fastq.gz,auto,short_paired
                                    
                                    Example of csv file with techinical replicates (with paired end data and bam files):
                                        group,input_1,input_2,strandedness,read_type,sample,replicate
                                        control,path/to/data1.bam,,auto,short_single,control1,replicate1
                                        control,path/to/data2.bam,,auto,short_single,control1,replicate2
                                        control,path/to/data3.bam,,auto,short_single,control2,replicate1
                                        control,path/to/data4.bam,,auto,short_single,control2,replicate2
                                        desease,path/to/data5_R1.fastq.gz,path/to/data5_R2.fastq.gz,auto,short_paired,desease1,replicate1
                                        desease,path/to/data6_R1.fastq.gz,path/to/data6_R2.fastq.gz,auto,short_paired,desease1,replicate2

        Output:
    --output                    Path to the output directory [default: rain_result]

       Optional input:
    --aggregation_mode          Mode for aggregating edition counts mapped on genomic features. See documentation for details. Options are: "all" (default) or "cds_longest"
    --aligner                   Aligner to use [default: hisat2]
    --clean_duplicate           Remove PCR duplicates from BAM files using GATK MarkDuplicates. [default: true]
    --clip_overlap              Clip overlapping sequences in read pairs to avoid double counting. [default: null]
    --cov_threshold             Minimal coverage to consider a site for editing detection [default: 10]
    --debug                     Enable debug output for troubleshooting. [default: false]
    --edit_site_tool            Tool used for detecting edited sites. [default: reditools3]
    --edit_threshold            Minimal number of edited reads to count a site as edited [default: 1]
    --fastqc                    run fastqc on main steps [default: false]
    --skip_hyper_editing        Skip hyper-editing detection step for unmapped reads. [default: false]
    --strandedness              Set the strandedness for all your input reads [default: null]. In auto mode salmon will guess the library type for each fastq sample. [ 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' ]

        Nextflow options:
    -profile                    Change the profile of nextflow both the engine and executor more details on github README [debug, slurm, itrop, singularity, local, docker]

Output

Here the description of typical ouput you will get from RAIN:

└── rain_results                                         # Output folder set using --outdir. Default: <alignment_results>
    │
    ├── AliNe                                            # Folder containing AliNe alignment pipeline result (see https://github.com/Juke34/AliNe)
    │   ├── alignment                                    # bam alignment used by RAIN
    │   ├── salmon_strandedness                          # strandedness collected by AliNe in case auto mode was in used for fastq files
    │   └── ...      
    │
    ├── bam_indicies                                     # bam and indices bam.bai
    │
    ├── FastQC                                           # bam and indices bam.bai
    │
    ├── gatk_markduplicates                              # metrics and bam after markduplicates
    │
    └── Reditools2/Reditools3/Jacusa/sapin/              # Editing output from corresponding tool
    │
    └── feature_edits                                    # Editing computed at different level (genomic features, chromosome, global)
More details about the output format Pluviometer produces an output file for features (features.tsv) and another for aggregates (aggregates.tsv), plus a log file called pluviometer.log. The `--output` option can be used to add prefixes to the file names.

The two output formats are tables of comma-separated values with a header.

Feature file

Column Name Values or type Description
SeqID String Identifier for contigs or chromosomes, as given in the GFF file
ParentIDs Comma-separated feature IDs and . Path of parent features (. represents the "root" of the path, corresponding to the contig or chromosome itself)
FeatureID String Feature ID from the GFF file
Type String Feature type from the GFF file
Start Positive integer Starting position of the feature (inclusive)
End Positive integer Ending position of the feature (inclusive)
Strand 1 or -1 Whether the features is located on the positive (5'->3') or negative (3'->5') strand
TotalSites Positive integer Number of sites in the feature
ObservedBases Comma-separated positive integers Number and type of the bases in the feature in the reference genome (order: A, C, G, T) observed. The total of the 4 values corresponds to the total observed sites (reported by the editing tools e.g. Reditools3)
QualifiedBases Comma-separated positive integers Number and type of of the bases in the feature in the reference genome (order: A, C, G, T) that satisfy the minimum level of coverage and editing. The total of the 4 values corresponds to the total qualified sites (> cov)
SiteBasePairingsQualified Comma-separated positive integers Number of sites in which each genome-variant base pairings is found at reference level in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing
ReadBasePairingsQualified Comma-separated positive integers Number of sites in which each genome-variant base pairings is found at reads level in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing

[!note] The number of QualifiedBases can differ from sum of AA,CC,GG,TT from SiteBasePairingsQualified because we can have site 100% edited that will not fall into one of these categories.

An example of the feature output format is shown below, with some alterations to make the text line up in columns.

SeqID  ParentIDs                                         FeatureID                   Type                    Start   End     Strand  CoveredSites  GenomeBases              SiteBasePairings                                                         ReadBasePairings
21  .                                                  gene:ENSG00000237735        ncRNA_gene              692123  815688  -1            123448  38320,21652,22461,41015  38198,220,216,238,141,21557,136,126,134,151,22379,157,258,831,230,40857  275468,867,820,942,564,156616,577,481,543,606,162232,564,1056,2074,948,296797
21  .,gene:ENSG00000237735                             transcript:ENST00000753406  lnc_RNA                 692123  755798  -1             63595  20528,11342,11596,20129  20458,118,123,138,70,11298,67,71,72,77,11550,75,134,409,121,20044        146979,496,459,541,263,82312,258,262,297,321,83658,264,579,1059,519,145598
21  .,gene:ENSG00000237735,transcript:ENST00000753406  ENSE00004100538             exon                    692123  692253  -1               131  38,20,25,48              38,0,0,0,0,20,2,0,0,0,25,0,0,1,0,48                                      280,0,0,0,0,134,7,0,0,0,165,0,0,4,0,360
21  .,gene:ENSG00000237735,transcript:ENST00000753406  ENSE00001745431             exon                    755680  755798  -1               119  20,37,29,33              20,1,0,0,0,37,0,0,0,1,29,0,0,0,0,33                                      132,4,0,0,0,269,0,0,0,4,216,0,0,0,0,252
21  .,gene:ENSG00000237735                             transcript:ENST00000444868  lnc_RNA                 754519  815688  -1             61133  18166,10572,11104,21291  18114,103,94,101,72,10519,72,56,62,75,11068,84,126,430,109,21217         131936,375,362,403,310,76638,335,225,246,289,80760,308,483,1043,429,154705
21  .,gene:ENSG00000237735,transcript:ENST00000444868  ENSE00001776707             exon                    754519  754770  -1               252  96,41,45,70              96,0,0,0,0,41,1,0,0,0,45,0,0,1,0,70                                      830,0,0,0,0,357,5,0,0,0,410,0,0,1,0,573
21  .,gene:ENSG00000237735,transcript:ENST00000444868  agat-exon-1199              exon                    755680  755798  -1               119  20,37,29,33              20,1,0,0,0,37,0,0,0,1,29,0,0,0,0,33                                      132,4,0,0,0,269,0,0,0,4,216,0,0,0,0,252
21  .,gene:ENSG00000237735,transcript:ENST00000444868  ENSE00001602968             exon                    815622  815688  -1                67  31,10,17,9               31,0,0,0,0,10,0,0,0,1,17,0,0,0,0,9                                       357,0,0,0,0,122,0,0,0,8,199,0,0,0,0,108

[!warning] The numbers in the output format examples shown here are just for demonstration. They do not come from empirical data.

The hierarchical relationships between features can be recovered from the fields in the ParentIDs column. The output snippet above matches the following hierarchy (recall that . represents the "root", i.e. the chromosome or contig).

%%{init: {'theme': 'forest', "flowchart" : { "curve" : "monotoneY" } } }%%
graph TD
  . --> gene:ENSG00000237735
  gene:ENSG00000237735 --> transcript:ENST00000753406
  transcript:ENST00000753406 --> ENSE00004100538
  transcript:ENST00000753406 --> ENSE00001745431
  gene:ENSG00000237735 --> transcript:ENST00000444868
  transcript:ENST00000444868 --> ENSE00001776707
  transcript:ENST00000444868 --> agat-exon-1199
  transcript:ENST00000444868 --> ENSE00001602968

Loading

This hierarchical information is provided in the same manner in the aggregate file format.

Aggregate file

Column Name Values or type Description
SeqID String Identifier for contigs or chromosomes, as given in the GFF file
ParentIDs Comma-separated feature IDs Path of parent features
AggregateID String ID assigned after the feature under which the aggregation was done
ParentType String Type of the parent of the feature under which the aggregation was done
AggregateType String Type of the features that are aggregated
AggregationMode all_isoforms, longest_isoform, chimaera, feature or all-sites Way in which the aggregation was performed
TotalSites Positive integer Number of sites in the aggregated features
ObservedBases Comma-separated positive integers Number and type of the bases in the aggregated features in the reference genome (order: A, C, G, T) observed. The total of the 4 values corresponds to the total observed sites (reported by the editing tools e.g. Reditools3)
QualifiedBases Comma-separated positive integers Number and type of of the bases in the aggregated features in the reference genome (order: A, C, G, T) that satisfy the minimum level of coverage and editing. The total of the 4 values corresponds to the total qualified sites (> cov)
SiteBasePairingsQualifed Comma-separated positive integers Number of sites in which each genome-variant base pairings is found at reference level in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) observed
ReadBasePairingsQualifed Comma-separated positive integers Number of sites in which each genome-variant base pairings is found at reads level in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing

In the output of Pluviometer, aggregation is the sum of counts from several features of the same type at some feature level. For instance, exons can be aggregated at transcript level, gene level, chromosome level, and genome level.

An example of the aggregate output format is shown below.

SeqID  ParentIDs               AggregateID                                 ParentType  AggregateType  AggregationMode  CoveredSites  GenomeBases                  SiteBasePairings                                                                         ReadBasePairings
21     .,gene:ENSG00000237735  transcript:ENST00000444868-longest_isoform  ncRNA_gene  exon           longest_isoform           438  147,88,91,112                147,1,0,0,0,88,1,0,0,2,91,0,0,1,0,112                                                    1319,4,0,0,0,748,5,0,0,12,825,0,0,1,0,933
21     .,gene:ENSG00000237735  gene:ENSG00000237735-all_isoforms           ncRNA_gene  exon           all_isoforms              688  205,145,145,193              205,2,0,0,0,145,3,0,0,3,145,0,0,2,0,193                                                  1731,8,0,0,0,1151,12,0,0,16,1206,0,0,5,0,1545
21     .,gene:ENSG00000237735  transcript:ENST00000753406                  ncRNA_gene  exon           feature                   250  58,57,54,81                  58,1,0,0,0,57,2,0,0,1,54,0,0,1,0,81                                                      412,4,0,0,0,403,7,0,0,4,381,0,0,4,0,612
21     .,gene:ENSG00000237735  transcript:ENST00000444868                  ncRNA_gene  exon           feature                   438  147,88,91,112                147,1,0,0,0,88,1,0,0,2,91,0,0,1,0,112                                                    1319,4,0,0,0,748,5,0,0,12,825,0,0,1,0,933
21     .,gene:ENSG00000237735  gene:ENSG00000237735-exon-chimaera          ncRNA_gene  exon-chimaera  chimaera                  569  185,108,116,160              185,1,0,0,0,108,3,0,0,2,116,0,0,2,0,160                                                  1599,4,0,0,0,882,12,0,0,12,990,0,0,5,0,1293

Aggregation modes

The existence of alternative transcripts of a same gene causes some complication in determining the most meaningful way of aggregating features, because exons of different transcripts can have overlapping ranges. To handle this difficulty, Pluviometer reports three types of aggregation, shown with an example in the figure below.

  1. Longest isoform (Transcript 1 in the figure): Report the counts only from the transcript with the longest CDS or greatest total exon length (CDS are always preferred over exons). Its ID is composed of the feature ID of the longest isoform plus "-longest_isoform"

  2. All isoforms (Transcripts 1 to 3 in the figure): Report the sum of the counts from all the isoforms, regardless of counting several times the same site. Its ID is composed of the ID of the gene plus "-all_isoforms".

  3. Chimaera (Chimaera in the figure): Report the counts from the union of feature ranges over all the isoforms. Its ID is composed of the ID of the gene plus "-chimaera". The aggregation types of chimaeras are postfixed with "-chimaera" as well.

  4. Feature Standard mode for regular features. Aggregates data from sub-features (children) of a given feature. For example, for an exon or CDS, it aggregates the counts of all its constituent elements.

In the example below, a gene has three transcripts. For the longest isoform aggregation, Transcript 1 would be selected, because it has the greatest sum of exon lengths (numbers under the exon boxes). For the all isoforms aggregation, all the transcripts (1, 2, and 3) would be used. For chimaera aggregation, the aggregation ranges are the union of the ranges of the exons of all the transcripts. Therefore, the total length of the chimaeric features is always equal ot greater than the longest transcript.

alt text

Chromosomes and genomes also have an all sites aggregation mode that simply counts variants over all the sites ignoring all the other features.

Genome and chromosome/contig aggregates

Aggregatation follows the same logic for chromosomes/contigs and the entire genome. The main difference is that the aggregates of a chromsosome/contig have a SeqID, but no parent IDs (represented with just a .) and no feature IDs (represented with just a .). See the example below with aggregates of chromosome 21:

SeqID	 ParentIDs             	 FeatureID                                 	 ParentType	 AggregateType	 AggregationMode	 CoveredSites	 GenomeBases                	 SiteBasePairings                                                                       	 ReadBasePairings

21   	 .                     	 .                                         	 .         	 exon         	 longest_isoform	         8884	 2874,1621,1613,2776        	 2868,11,20,18,9,1619,8,10,9,10,1608,11,17,52,18,2763                                   	 19234,43,72,58,22,11322,31,40,30,56,11248,49,60,113,108,18070
21   	 .                     	 .                                         	 .         	 exon         	 all_isoforms   	        19064	 5995,3466,3520,6083        	 5980,23,45,39,12,3462,23,16,18,20,3507,28,40,110,37,6055                               	 41820,74,210,146,28,24433,70,50,81,108,25400,147,145,238,207,41714
21   	 .                     	 .                                         	 .         	 exon-chimaera	 chimaera       	        11853	 3742,2228,2188,3695        	 3733,15,26,24,10,2225,14,12,11,11,2181,15,24,65,23,3680                                	 25416,54,108,76,24,15487,48,42,41,62,15398,73,85,136,133,24713
21   	 .                     	 .                                         	 .         	 .            	 all_sites      	       999437	 330327,177650,176540,314920	 329214,1950,1986,1989,1170,177038,1036,1054,1006,1030,175968,1142,1868,6373,1848,313853	 2409166,7937,7754,7884,4656,1297210,4261,4149,4087,4215,1287279,4302,7457,15760,7686,2293932

The genome-level aggregates are marked the same way, but because they represent the total of all chromosomes/contigs, the SeqID field is just a .. Like so:

SeqID	 ParentIDs             	 FeatureID                                 	 ParentType	 AggregateType	 AggregationMode	 CoveredSites	 GenomeBases                	 SiteBasePairings                                                                       	 ReadBasePairings
.    	 .                     	 .                                         	 .         	 exon         	 longest_isoform	         8884	 2874,1621,1613,2776        	 2868,11,20,18,9,1619,8,10,9,10,1608,11,17,52,18,2763                                   	 19234,43,72,58,22,11322,31,40,30,56,11248,49,60,113,108,18070
.    	 .                     	 .                                         	 .         	 exon         	 all_isoforms   	        19064	 5995,3466,3520,6083        	 5980,23,45,39,12,3462,23,16,18,20,3507,28,40,110,37,6055                               	 41820,74,210,146,28,24433,70,50,81,108,25400,147,145,238,207,41714
.    	 .                     	 .                                         	 .         	 exon-chimaera	 chimaera       	        11853	 3742,2228,2188,3695        	 3733,15,26,24,10,2225,14,12,11,11,2181,15,24,65,23,3680                                	 25416,54,108,76,24,15487,48,42,41,62,15398,73,85,136,133,24713
.    	 .                     	 .                                         	 .         	 .            	 all_sites      	       999437	 330327,177650,176540,314920	 329214,1950,1986,1989,1170,177038,1036,1054,1006,1030,175968,1142,1868,6373,1848,313853	 2409166,7937,7754,7884,4656,1297210,4261,4149,4087,4215,1287279,4302,7457,15760,7686,2293932

Computing editing levels from the base pairing lists

The editing level (Bazak et al. 2014) of a feature or feature aggregate can be easily computed from the ReadBasePairings values, with the formula:

$$ AG\ editing\ level = \sum_{i=0}^{n} \dfrac{AG_i}{AA_i + AC_i + AG_i + AT_i} $$

Drip

espf (edited sites proportion in feature):

denom_espf = df[f'{genome_base}_count'] # X_QualifiedBases (e.g. C_count) df[espf_col] = df[f'{bp}_sites'] / denom_espf # XY_SiteBasePairingsQualified / X_QualifiedBases

espr (edited sites proportion in reads):

df[total_reads_col] = XA_reads + XC_reads + XG_reads + XT_reads # all reads at X positions df[espr_col] = df[f'{bp}_reads'] / df[total_reads_col] # XY_reads / sum(X*_reads)

Drip retains a line only if at least one metric value is neither NA nor zero (i.e., at least one edit has been detected somewhere). Lines containing only NA values, only 0.0 values, or a mix of both are removed by default.

Author and contributors

Eduardo Ascarrunz (@eascarrunz)
Jacques Dainat (@Juke34)

Contributing

Contributions from the community are welcome ! See the Contributing guidelines

TODO

update pluviometer to set NA for start end and strand instead of . to be able to use column as int64 in drip and barometer e.g. dtype={"SeqID": str, "Start": "Int64", "End": "Int64", "Strand": str}

About

RAIN - RNA Alterations Investigation using Nextflow with particular focus on A-to-I editing events

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors