RAIN is a Nextflow workflow designed for epitranscriptomic analyses, enabling the detection of RNA modifications in comparison to a reference genome. Its primary goal is to distinguish true RNA editing events from genomic variants such as SNPs, with a particular emphasis on identifying A-to-I (Adenosine-to-Inosine) editing.
- RAIN - RNA Alterations Investigation using Nextflow
RNA editing and other epitranscriptomic alterations are increasingly recognized as important sources of biological regulation and potential biomarkers in health and disease. However, their study remains technically challenging because true RNA modifications must be distinguished from sequencing artefacts, mapping issues, and genomic variants. This makes reproducible and accessible analysis workflows essential for turning raw RNA sequencing data into interpretable biological signals.
RAIN was developed to address this need by providing an end-to-end Nextflow workflow for the detection, aggregation, and downstream exploration of RNA editing events. Starting from FASTQ, BAM, or sample sheet inputs, RAIN combines alignment, quality control, editing-site detection, feature-level aggregation, and downstream filtering into a single reproducible framework. Its goal is to make epitranscriptomic analyses easier to run, easier to reproduce, and more useful for identifying candidate biomarkers from properly designed cohorts.
The diagram below summarizes the main steps of the RAIN workflow, from input preparation to feature- and aggregate-level editing matrices.
%%{init: {"flowchart": {"htmlLabels": true, "useMaxWidth": true, "wrappingWidth": 260}} }%%
flowchart TD
A["Input reads<br/>FASTQ, BAM or CSV sample sheet"] --> B["Input validation<br/>and metadata parsing"]
R["Reference genome<br/>and optional annotation"] --> C["Reference preparation<br/>FASTA unzip and annotation normalization"]
B --> D["AliNe alignment workflow<br/>and library type handling"]
C --> D
D --> E["BAM post-processing<br/>sorting, indexing, optional duplicate removal and clip overlap"]
D --> F["Optional QC<br/>FastQC and MultiQC"]
E --> G["Editing site detection<br/>Reditools3 by default, or Reditools2, Jacusa2"]
E --> H["Optional hyper-editing detection<br/>on unmapped reads"]
G --> I["Pluviometer aggregation<br/>feature, transcript, gene, chromosome and genome levels"]
H --> I
I --> J["Drip filtering and metric computation<br/>ESPF and ESPR tables"]
J --> L["Barometer downstream analysis<br/>candidate biomarker detection and reporting"]
The prerequisites to run the pipeline are:
- Nextflow >= 22.04.0
- Docker or Singularity
- Build the required container images locally by running
./build_containers.shafter cloning the repository
-
Via conda
See here
conda create -n nextflow conda activate nextflow conda install bioconda::nextflow
-
Manually
See here
Nextflow runs on most POSIX systems (Linux, macOS, etc) and can typically be installed by running these commands:# Make sure 11 or later is installed on your computer by using the command: java -version # Install Nextflow by entering this command in your terminal(it creates a file nextflow in the current dir): curl -s https://get.nextflow.io | bash # Add Nextflow binary to your user's PATH: mv nextflow ~/bin/ # OR system-wide installation: # sudo mv nextflow /usr/local/bin
To run the workflow you will need a container platform: docker or singularity. See below for installation instructions.
Please follow the instructions at the Docker website
Please follow the instructions at the Singularity website
Softwares are distributed as container images, ensuring reproducibility and enabling automatic retrieval by the pipeline.
Version information for each software package is defined in config/softwares.config.
To update a software version, simply modify the corresponding container image reference in that file.
Some images are pulled automatically from public registries, but RAIN also relies on project-specific images such as jacusa2, reditools2, reditools3, and pluviometer.
These project-specific images must be built locally before running the workflow, otherwise the docker and singularity profiles will not be able to start the corresponding processes.
After cloning the repository and installing your container platform, run:
./build_containers.shThe script auto-detects the available container engine and builds the corresponding images:
- Docker images for the
dockerprofile - Singularity/Apptainer
.sifimages insif_images/for thesingularityprofile
You can also target a specific engine explicitly:
./build_containers.sh --docker
./build_containers.sh --singularityYou can first check the available options and parameters by running:
nextflow run Juke34/RAIN -r v0.0.3 --helpBy default, RAIN uses a local executor for scheduling tasks. However, to provide the software environment, you should run with a container profile.
Select a profile according to the container platform you want to use:
singularity, a profile using Singularity to run the containersdocker, a profile using Docker to run the containers
The command will look like that:
nextflow run Juke34/RAIN -r v0.0.3 -profile docker <rest of paramaters>Another profile is also available:
slurm, to add if your system has a slurm executor (local by default)
The use of the slurm profile gives a command like this one:
nextflow run Juke34/RAIN -r v0.0.3 -profile singularity,slurm <rest of paramaters>Recommended usage is to always include a container profile (docker or singularity).
RAIN parallelizes execution at two levels:
- Task-level parallelism: multiple pipeline tasks can run concurrently (controlled by
maxForks). - Per-task resources: each process can request specific CPU, memory, and time values.
With the local profile, Nextflow runs tasks on the local machine.
Resource settings are defined in config/resources/local.config.
- Default per task:
cpus = 1,time = 1h - Maximum concurrent tasks:
maxForks = 8 - Examples of process overrides:
hisat2:cpus = 6aline:cpus = 4,memory = 16GB,time = 2dfastqc:cpus = 2drip:cpus = 4reditools3:cpus = 2,memory = 4G
Local throughput depends on the resources of your workstation.
With the SLURM profile, tasks are submitted to the scheduler and run as cluster jobs.
Main settings come from nextflow.config and config/resources/hpc.config.
- Scheduler settings:
executor = slurmqueue = normalmaxForks = 12errorStrategy = retry,maxRetries = 3clusterOptions = "--constraint=infiniband"
- Default per task in HPC config:
cpus = 1,time = 4h - Examples of process overrides:
fastp,jacusa2,reditools3,samtools: up tocpus = 16fastqc,pigz,pluviometer:cpus = 8drip:cpus = 16aline: launcher process in RAIN (cpus = 1,memory = 4GB), while AliNe manages its own submitted job
In SLURM mode, effective parallelism depends on maxForks, cluster availability/quotas, and per-process resource requests.
For details and tuning, see config/resources/local.config and config/resources/hpc.config.
Several built-in test profiles are available:
test_short_single_list: test using an explicit list of input filestest_short_paired_folder: test using a folder containing paired-end FASTQ filestest_short_single_csv: test using a CSV sample sheet (single-end data)test_short_paired_csv: test using a CSV sample sheet (paired-end data)test_mix_csv: test using a mixed CSV input (single-end data, paired-end data, and bam files)
With Nextflow and Docker available, you can run one of these tests as follows:
nextflow run -profile docker,test_short_single_list Juke34/RAIN -r v0.0.3Replace test_short_single_list with any of the other test profiles above depending on the input mode you want to validate.
Or via a clone of the repository:
git clone https://github.com/Juke34/rain.git
cd rain
nextflow run -profile docker,test_short_single_list rain.nf
RAIN - RNA Alterations Investigation using Nextflow - v0.0.3
Usage example:
nextflow run rain.nf -profile docker --genome /path/to/genome.fa --annotation /path/to/annotation.gff3 --reads /path/to/reads_folder --output /path/to/output --aligner hisat2
Parameters:
--help Prints the help section
Input sequences:
--annotation Path to the annotation file (GFF or GTF)
--genome Path to the reference genome in FASTA format.
--read_type Type of reads among this list [short_paired, short_single, pacbio, ont] [no default]
--reads path to the reads file, folder or csv. If a folder is provided, all the files with proper extension in the folder will be used. You can provide remote files (commma separated list).
file extension expected : <.fastq.gz>, <.fq.gz>, <.fastq>, <.fq> or <.bam>.
for paired reads extra <_R1_001> or <_R2_001> is expected where <R> and <_001> are optional. e.g. <sample_1.fastq.gz>, <sample_R1.fastq.gz>, <sample_R1_001.fastq.gz>)
CSV Input Format:
4 required columns: group, input_1, strandedness and read_type.
3 optional columns: - input_2 : in case of paired-end data
- sample : by default, the sample name is extracted from the filename and would be uniq. However, you must provide this column if you have technical replicates, so that replicates from the same sample are correctly grouped.
- replicate : Specifies the replicate number. Otherwise, each sample will be treated as independent and assigned as rep1. Required the sample column.
Strandedness, read_type expects same values as corresponding AliNe parameter; If a value is provided via AliNe parameter, it will override the value in the csv file.
Example of csv file with biological replicates (with paired end data and bam files):
group,input_1,input_2,strandedness,read_type
control,path/to/data1.bam,,auto,short_single
control,path/to/data2.bam,,auto,short_single
desease,path/to/data3_R1.fastq.gz,path/to/data3_R2.fastq.gz,auto,short_paired
desease,path/to/data4_R1.fastq.gz,path/to/data4_R2.fastq.gz,auto,short_paired
Example of csv file with techinical replicates (with paired end data and bam files):
group,input_1,input_2,strandedness,read_type,sample,replicate
control,path/to/data1.bam,,auto,short_single,control1,replicate1
control,path/to/data2.bam,,auto,short_single,control1,replicate2
control,path/to/data3.bam,,auto,short_single,control2,replicate1
control,path/to/data4.bam,,auto,short_single,control2,replicate2
desease,path/to/data5_R1.fastq.gz,path/to/data5_R2.fastq.gz,auto,short_paired,desease1,replicate1
desease,path/to/data6_R1.fastq.gz,path/to/data6_R2.fastq.gz,auto,short_paired,desease1,replicate2
Output:
--output Path to the output directory [default: rain_result]
Optional input:
--aggregation_mode Mode for aggregating edition counts mapped on genomic features. See documentation for details. Options are: "all" (default) or "cds_longest"
--aligner Aligner to use [default: hisat2]
--clean_duplicate Remove PCR duplicates from BAM files using GATK MarkDuplicates. [default: true]
--clip_overlap Clip overlapping sequences in read pairs to avoid double counting. [default: null]
--cov_threshold Minimal coverage to consider a site for editing detection [default: 10]
--debug Enable debug output for troubleshooting. [default: false]
--edit_site_tool Tool used for detecting edited sites. [default: reditools3]
--edit_threshold Minimal number of edited reads to count a site as edited [default: 1]
--fastqc run fastqc on main steps [default: false]
--skip_hyper_editing Skip hyper-editing detection step for unmapped reads. [default: false]
--strandedness Set the strandedness for all your input reads [default: null]. In auto mode salmon will guess the library type for each fastq sample. [ 'U', 'IU', 'MU', 'OU', 'ISF', 'ISR', 'MSF', 'MSR', 'OSF', 'OSR', 'auto' ]
Nextflow options:
-profile Change the profile of nextflow both the engine and executor more details on github README [debug, slurm, itrop, singularity, local, docker]
Here the description of typical ouput you will get from RAIN:
└── rain_results # Output folder set using --outdir. Default: <alignment_results>
│
├── AliNe # Folder containing AliNe alignment pipeline result (see https://github.com/Juke34/AliNe)
│ ├── alignment # bam alignment used by RAIN
│ ├── salmon_strandedness # strandedness collected by AliNe in case auto mode was in used for fastq files
│ └── ...
│
├── bam_indicies # bam and indices bam.bai
│
├── FastQC # bam and indices bam.bai
│
├── gatk_markduplicates # metrics and bam after markduplicates
│
└── Reditools2/Reditools3/Jacusa/sapin/ # Editing output from corresponding tool
│
└── feature_edits # Editing computed at different level (genomic features, chromosome, global)
More details about the output format
Pluviometer produces an output file for features (features.tsv) and another for aggregates (aggregates.tsv), plus a log file called pluviometer.log. The `--output` option can be used to add prefixes to the file names.The two output formats are tables of comma-separated values with a header.
| Column Name | Values or type | Description |
|---|---|---|
| SeqID | String | Identifier for contigs or chromosomes, as given in the GFF file |
| ParentIDs | Comma-separated feature IDs and . |
Path of parent features (. represents the "root" of the path, corresponding to the contig or chromosome itself) |
| FeatureID | String | Feature ID from the GFF file |
| Type | String | Feature type from the GFF file |
| Start | Positive integer | Starting position of the feature (inclusive) |
| End | Positive integer | Ending position of the feature (inclusive) |
| Strand | 1 or -1 |
Whether the features is located on the positive (5'->3') or negative (3'->5') strand |
| TotalSites | Positive integer | Number of sites in the feature |
| ObservedBases | Comma-separated positive integers | Number and type of the bases in the feature in the reference genome (order: A, C, G, T) observed. The total of the 4 values corresponds to the total observed sites (reported by the editing tools e.g. Reditools3) |
| QualifiedBases | Comma-separated positive integers | Number and type of of the bases in the feature in the reference genome (order: A, C, G, T) that satisfy the minimum level of coverage and editing. The total of the 4 values corresponds to the total qualified sites (> cov) |
| SiteBasePairingsQualified | Comma-separated positive integers | Number of sites in which each genome-variant base pairings is found at reference level in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing |
| ReadBasePairingsQualified | Comma-separated positive integers | Number of sites in which each genome-variant base pairings is found at reads level in the feature (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing |
[!note] The number of QualifiedBases can differ from sum of AA,CC,GG,TT from SiteBasePairingsQualified because we can have site 100% edited that will not fall into one of these categories.
An example of the feature output format is shown below, with some alterations to make the text line up in columns.
SeqID ParentIDs FeatureID Type Start End Strand CoveredSites GenomeBases SiteBasePairings ReadBasePairings
21 . gene:ENSG00000237735 ncRNA_gene 692123 815688 -1 123448 38320,21652,22461,41015 38198,220,216,238,141,21557,136,126,134,151,22379,157,258,831,230,40857 275468,867,820,942,564,156616,577,481,543,606,162232,564,1056,2074,948,296797
21 .,gene:ENSG00000237735 transcript:ENST00000753406 lnc_RNA 692123 755798 -1 63595 20528,11342,11596,20129 20458,118,123,138,70,11298,67,71,72,77,11550,75,134,409,121,20044 146979,496,459,541,263,82312,258,262,297,321,83658,264,579,1059,519,145598
21 .,gene:ENSG00000237735,transcript:ENST00000753406 ENSE00004100538 exon 692123 692253 -1 131 38,20,25,48 38,0,0,0,0,20,2,0,0,0,25,0,0,1,0,48 280,0,0,0,0,134,7,0,0,0,165,0,0,4,0,360
21 .,gene:ENSG00000237735,transcript:ENST00000753406 ENSE00001745431 exon 755680 755798 -1 119 20,37,29,33 20,1,0,0,0,37,0,0,0,1,29,0,0,0,0,33 132,4,0,0,0,269,0,0,0,4,216,0,0,0,0,252
21 .,gene:ENSG00000237735 transcript:ENST00000444868 lnc_RNA 754519 815688 -1 61133 18166,10572,11104,21291 18114,103,94,101,72,10519,72,56,62,75,11068,84,126,430,109,21217 131936,375,362,403,310,76638,335,225,246,289,80760,308,483,1043,429,154705
21 .,gene:ENSG00000237735,transcript:ENST00000444868 ENSE00001776707 exon 754519 754770 -1 252 96,41,45,70 96,0,0,0,0,41,1,0,0,0,45,0,0,1,0,70 830,0,0,0,0,357,5,0,0,0,410,0,0,1,0,573
21 .,gene:ENSG00000237735,transcript:ENST00000444868 agat-exon-1199 exon 755680 755798 -1 119 20,37,29,33 20,1,0,0,0,37,0,0,0,1,29,0,0,0,0,33 132,4,0,0,0,269,0,0,0,4,216,0,0,0,0,252
21 .,gene:ENSG00000237735,transcript:ENST00000444868 ENSE00001602968 exon 815622 815688 -1 67 31,10,17,9 31,0,0,0,0,10,0,0,0,1,17,0,0,0,0,9 357,0,0,0,0,122,0,0,0,8,199,0,0,0,0,108
[!warning] The numbers in the output format examples shown here are just for demonstration. They do not come from empirical data.
The hierarchical relationships between features can be recovered from the fields in the ParentIDs column. The output snippet above matches the following hierarchy (recall that . represents the "root", i.e. the chromosome or contig).
%%{init: {'theme': 'forest', "flowchart" : { "curve" : "monotoneY" } } }%%
graph TD
. --> gene:ENSG00000237735
gene:ENSG00000237735 --> transcript:ENST00000753406
transcript:ENST00000753406 --> ENSE00004100538
transcript:ENST00000753406 --> ENSE00001745431
gene:ENSG00000237735 --> transcript:ENST00000444868
transcript:ENST00000444868 --> ENSE00001776707
transcript:ENST00000444868 --> agat-exon-1199
transcript:ENST00000444868 --> ENSE00001602968
This hierarchical information is provided in the same manner in the aggregate file format.
| Column Name | Values or type | Description |
|---|---|---|
| SeqID | String | Identifier for contigs or chromosomes, as given in the GFF file |
| ParentIDs | Comma-separated feature IDs | Path of parent features |
| AggregateID | String | ID assigned after the feature under which the aggregation was done |
| ParentType | String | Type of the parent of the feature under which the aggregation was done |
| AggregateType | String | Type of the features that are aggregated |
| AggregationMode | all_isoforms, longest_isoform, chimaera, feature or all-sites |
Way in which the aggregation was performed |
| TotalSites | Positive integer | Number of sites in the aggregated features |
| ObservedBases | Comma-separated positive integers | Number and type of the bases in the aggregated features in the reference genome (order: A, C, G, T) observed. The total of the 4 values corresponds to the total observed sites (reported by the editing tools e.g. Reditools3) |
| QualifiedBases | Comma-separated positive integers | Number and type of of the bases in the aggregated features in the reference genome (order: A, C, G, T) that satisfy the minimum level of coverage and editing. The total of the 4 values corresponds to the total qualified sites (> cov) |
| SiteBasePairingsQualifed | Comma-separated positive integers | Number of sites in which each genome-variant base pairings is found at reference level in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) observed |
| ReadBasePairingsQualifed | Comma-separated positive integers | Number of sites in which each genome-variant base pairings is found at reads level in the aggregated features (order: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) that satisfy the minimum level of coverage and editing |
In the output of Pluviometer, aggregation is the sum of counts from several features of the same type at some feature level. For instance, exons can be aggregated at transcript level, gene level, chromosome level, and genome level.
An example of the aggregate output format is shown below.
SeqID ParentIDs AggregateID ParentType AggregateType AggregationMode CoveredSites GenomeBases SiteBasePairings ReadBasePairings
21 .,gene:ENSG00000237735 transcript:ENST00000444868-longest_isoform ncRNA_gene exon longest_isoform 438 147,88,91,112 147,1,0,0,0,88,1,0,0,2,91,0,0,1,0,112 1319,4,0,0,0,748,5,0,0,12,825,0,0,1,0,933
21 .,gene:ENSG00000237735 gene:ENSG00000237735-all_isoforms ncRNA_gene exon all_isoforms 688 205,145,145,193 205,2,0,0,0,145,3,0,0,3,145,0,0,2,0,193 1731,8,0,0,0,1151,12,0,0,16,1206,0,0,5,0,1545
21 .,gene:ENSG00000237735 transcript:ENST00000753406 ncRNA_gene exon feature 250 58,57,54,81 58,1,0,0,0,57,2,0,0,1,54,0,0,1,0,81 412,4,0,0,0,403,7,0,0,4,381,0,0,4,0,612
21 .,gene:ENSG00000237735 transcript:ENST00000444868 ncRNA_gene exon feature 438 147,88,91,112 147,1,0,0,0,88,1,0,0,2,91,0,0,1,0,112 1319,4,0,0,0,748,5,0,0,12,825,0,0,1,0,933
21 .,gene:ENSG00000237735 gene:ENSG00000237735-exon-chimaera ncRNA_gene exon-chimaera chimaera 569 185,108,116,160 185,1,0,0,0,108,3,0,0,2,116,0,0,2,0,160 1599,4,0,0,0,882,12,0,0,12,990,0,0,5,0,1293
The existence of alternative transcripts of a same gene causes some complication in determining the most meaningful way of aggregating features, because exons of different transcripts can have overlapping ranges. To handle this difficulty, Pluviometer reports three types of aggregation, shown with an example in the figure below.
-
Longest isoform (Transcript 1 in the figure): Report the counts only from the transcript with the longest CDS or greatest total exon length (CDS are always preferred over exons). Its ID is composed of the feature ID of the longest isoform plus "-longest_isoform"
-
All isoforms (Transcripts 1 to 3 in the figure): Report the sum of the counts from all the isoforms, regardless of counting several times the same site. Its ID is composed of the ID of the gene plus "-all_isoforms".
-
Chimaera (Chimaera in the figure): Report the counts from the union of feature ranges over all the isoforms. Its ID is composed of the ID of the gene plus "-chimaera". The aggregation types of chimaeras are postfixed with "-chimaera" as well.
-
Feature Standard mode for regular features. Aggregates data from sub-features (children) of a given feature. For example, for an exon or CDS, it aggregates the counts of all its constituent elements.
In the example below, a gene has three transcripts. For the longest isoform aggregation, Transcript 1 would be selected, because it has the greatest sum of exon lengths (numbers under the exon boxes). For the all isoforms aggregation, all the transcripts (1, 2, and 3) would be used. For chimaera aggregation, the aggregation ranges are the union of the ranges of the exons of all the transcripts. Therefore, the total length of the chimaeric features is always equal ot greater than the longest transcript.
Chromosomes and genomes also have an all sites aggregation mode that simply counts variants over all the sites ignoring all the other features.
Aggregatation follows the same logic for chromosomes/contigs and the entire genome. The main difference is that the aggregates of a chromsosome/contig have a SeqID, but no parent IDs (represented with just a .) and no feature IDs (represented with just a .). See the example below with aggregates of chromosome 21:
SeqID ParentIDs FeatureID ParentType AggregateType AggregationMode CoveredSites GenomeBases SiteBasePairings ReadBasePairings
21 . . . exon longest_isoform 8884 2874,1621,1613,2776 2868,11,20,18,9,1619,8,10,9,10,1608,11,17,52,18,2763 19234,43,72,58,22,11322,31,40,30,56,11248,49,60,113,108,18070
21 . . . exon all_isoforms 19064 5995,3466,3520,6083 5980,23,45,39,12,3462,23,16,18,20,3507,28,40,110,37,6055 41820,74,210,146,28,24433,70,50,81,108,25400,147,145,238,207,41714
21 . . . exon-chimaera chimaera 11853 3742,2228,2188,3695 3733,15,26,24,10,2225,14,12,11,11,2181,15,24,65,23,3680 25416,54,108,76,24,15487,48,42,41,62,15398,73,85,136,133,24713
21 . . . . all_sites 999437 330327,177650,176540,314920 329214,1950,1986,1989,1170,177038,1036,1054,1006,1030,175968,1142,1868,6373,1848,313853 2409166,7937,7754,7884,4656,1297210,4261,4149,4087,4215,1287279,4302,7457,15760,7686,2293932The genome-level aggregates are marked the same way, but because they represent the total of all chromosomes/contigs, the SeqID field is just a .. Like so:
SeqID ParentIDs FeatureID ParentType AggregateType AggregationMode CoveredSites GenomeBases SiteBasePairings ReadBasePairings
. . . . exon longest_isoform 8884 2874,1621,1613,2776 2868,11,20,18,9,1619,8,10,9,10,1608,11,17,52,18,2763 19234,43,72,58,22,11322,31,40,30,56,11248,49,60,113,108,18070
. . . . exon all_isoforms 19064 5995,3466,3520,6083 5980,23,45,39,12,3462,23,16,18,20,3507,28,40,110,37,6055 41820,74,210,146,28,24433,70,50,81,108,25400,147,145,238,207,41714
. . . . exon-chimaera chimaera 11853 3742,2228,2188,3695 3733,15,26,24,10,2225,14,12,11,11,2181,15,24,65,23,3680 25416,54,108,76,24,15487,48,42,41,62,15398,73,85,136,133,24713
. . . . . all_sites 999437 330327,177650,176540,314920 329214,1950,1986,1989,1170,177038,1036,1054,1006,1030,175968,1142,1868,6373,1848,313853 2409166,7937,7754,7884,4656,1297210,4261,4149,4087,4215,1287279,4302,7457,15760,7686,2293932The editing level (Bazak et al. 2014) of a feature or feature aggregate can be easily computed from the ReadBasePairings values, with the formula:
denom_espf = df[f'{genome_base}_count'] # X_QualifiedBases (e.g. C_count) df[espf_col] = df[f'{bp}_sites'] / denom_espf # XY_SiteBasePairingsQualified / X_QualifiedBases
df[total_reads_col] = XA_reads + XC_reads + XG_reads + XT_reads # all reads at X positions df[espr_col] = df[f'{bp}_reads'] / df[total_reads_col] # XY_reads / sum(X*_reads)
Drip retains a line only if at least one metric value is neither NA nor zero (i.e., at least one edit has been detected somewhere). Lines containing only NA values, only 0.0 values, or a mix of both are removed by default.
Eduardo Ascarrunz (@eascarrunz)
Jacques Dainat (@Juke34)
Contributions from the community are welcome ! See the Contributing guidelines
update pluviometer to set NA for start end and strand instead of . to be able to use column as int64 in drip and barometer e.g. dtype={"SeqID": str, "Start": "Int64", "End": "Int64", "Strand": str}


