~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Genome-inclusive fusion finder for long reads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Currently tested on ONT RNAseq data, cDNA and direct read sequencing (DRS), against human genome versions hg38 and T2T.
GIFFL is an ultra-sensitive fusion caller that reports all evidence of fusions of a gene to any other type of sequence within the genome (ie, "genome inclusive"). This differs to other gene fusion detection tools, which primarily report only gene:gene fusions.
Maximum sensitivity is attained by reporting any evidence of any fusion event, supported by at least one read. This approach was chosen to minimise false negatives at the cost of increased false positives, an approach best suited to investigations targeting rare and/or novel events.
The advantages of GIFFL over existing long read fusion callers:
- GIFFL uses a highly sensitive gene/feature-centric approach. GIFFL can run over a single gene, a list of genes, all features within a genomic region, a bespoke BED file of features, or all coding genes within the genome. This customisable input modality is more flexible than typical gene fusion detection tools which focus only on gene--gene fusions within an input BAM file.
- Since GIFFL is gene/feature-centric, a rapid result on precise inputs can be obtained within minutes, much faster than typical gene fusion detection tools designed to operate genome-wide.
- Conversely, studies with no a priori knowledge can use GIFFL to query all-coding or indeed any list of features, knowing that the inclusive nature of the GIFFL methodology will report ALL findings, coding or otherwise.
- Multiple methods are used to assess putative fusions, with sites supported from just a single read taken forward. Putative chimeras are marked, and numerous metrics to enable highly customisable filtering are included in the output file.
- GIFFL uses two sources of information to report detailed metrics of putative fusions: read mappings from the input BAM file, as well as a de novo assembly of all reads mapping to the query gene/feature. By providing both sources of information, with detailed metrics and cross-support values included in the raw output, sensitivity and specificity can be tailored to suit the experiment.
- Transcript assemblies for all target genes/features passing the assembly stage are included in the output.
- a coordinate sorted, indexed BAM file of sample reads mapped to a reference genome
- a GTF annotation file
- a bed12 annotation file, bgzipped and tabix-indexed
- a minimap2 index (.mmi) for the reference genome (BAM, gtf, bed12 and mmi must match)
- a list of target inputs (genes or chromosomal region)
GIFFL can take one of 5 input target types:
- a single gene, eg "GAPDH"
- a list of comma-separated genes, eg "GAPDH,SOX,HES7"
- a chromosomal region, eg "chr1:100000000-150000000"
- 'allcoding' for all coding genes within the supplied GTF
- a pre-prepared BED file matching the required format
To prepare target gene inputs for giffl, first run:
giffl-pi --input <input> --gtf <gtf> --label <label>For gene, gene list, 'allcoding', or region, a new BED file is prepared. For a pre-prepared BED file, giffl-pi will run a minimal check that the correct number of columns are provided. Users should check their BED file input before running giffl to ensure it matches the format of the example below:
chr1 154269392 154275882 HAX1 protein_coding
chr6 32821832 32838784 TAP2 protein_coding
chr8 142727190 142736938 THEM6 protein_coding
chr13 41216144 41263775 MTRF1 protein_coding
chr19 14408797 14419383 DDX39A protein_codingGIFFL is run per-sample per input BED, with the following required parameters:
sampleInput sample IDseq_type'drs' or 'cdna'bam_inInput sample BAM filegenes_bedOutput ofgiffl-pi: Input BED file of 5 columns: chr, start - 1, end, gene_name, gene_biotypemmiminimap2 index of the reference genomegenome_bed12bgzipped bed12 annotation file. Tabix index must exist alongside .bed12.gzthreadsNumber of cores to use for muli-threaded componentsjava_memJava memory for de novo transcript assemblylabelDescriptive name for output file, eg 'candidate-genes'
Optional parameters:
debugDebug mode. 0 for off (default) or 1 for on. Debug mode prevents deletion of temp files. Caution - there are a lot, so ensure cleanup after debug to avoid overloading your filesystem.outdirOutput directory for GIFFL results. Defaults to the present working directory.
Example command:
giffl \
--sample Sample01 \
--seq_type drs \
--bam_in /path/to/Sample01.sort.bam \
--genes_bed mygenes.bed \
--mmi /path/to/ref/genome.fna.mmi \
--genome_bed12 /path/to/ref/genome.annotation.bed12.gz \
--threads 12 \
--java_mem 42 \
--label my-candidate-genesGIFFL can be easily run with a singularity container including all required dependencies (recommended), or called directly via the scripts provided in the bin folder of this repository. If running outside of the provided singularity container, please ensure the following dependencies are available in your path:
samtools(tested with version 2.30)minimap2(tested with version 1.22.1)rnabloom(tested with version 2.0.1)htslib(tested with version 1.23)
Your Linux distribution must also have GNU awk and bc.
To call from the singularity container, first pull the image, then run with eg:
singularity run giffl_1.0.1.sif giffl <args>To run directly, first clone the repository, then provide path to script, eg:
git clone git@github.com:Sydney-Informatics-Hub/GIFFL.git
GIFFL/bin/giffl_prep_input.sh <args>The recommended resources for GIFFL are 12 cores and 48 GB RAM. Runs may be achieved with less resources depending on your hardware, the size and depth of your samples, and the exact target genes.
On average, GIFFL can process a list of 100 target genes in < 1 hour with 12 CPU. A single gene of interest can run in seconds while a candidate region of 1 Mb containing 10-20 genes may complete within a few minutes. This makes GIFFL ideal for rapid querying of candidate gene lists or candidate regions, with a much faster time to targeted results compared to tools that only search the whole genome per run.
Some genes can be vastly slower than others to process, for example those with a high exon count, high read depth, or a high number of isoforms.
To run GIFFL in 'allcoding' search mode, HPC is recommended to scatter chunks of 100-200 input genes over numerous 12-core parallel tasks/array sub-jobs.
The main GIFFL output file is a TSV titled <sample>.<label>.GIFFL.raw.tsv, with the columns described below.
Other outputs include a verbose log and assembled transcripts for all genes passing the de novo assembly stage of the analysis.
Fusion_namename of fusion, ie geneA--regionB, where region B can be a gene name, a non-coding feature name, or intergenic. If >1 feature found in gtf at this site, a '' will be added. If potential missed chimera (fusion site = tagret gene end), '' will be added. These are cumulative, so '' is possibleFusion_descriptiondescribes the biotypes of the target gene and the fusion gene/feature/regionFusion_breakpoint_pairthe genomic locations of the left and right breakpointsTarget_gene_nametarget/input geneTarget_gene_regiongene coding sequence span (from GTF)DN_contigs_numnumber of de novo contigs produced from the reads covering the target gene coding sequence spanDN_IDde-novo assembled contig ID and range (affected sequence within contig). For hits with 1 read, this will be read ID (no assembly performed)DN_contig_lengthlength of the de novo contig giving rise to this hitDN_contig_to_ref_mapQmapping quality of the denovo assembled cotnig back to the refernece (60 = best, 0 = worst, multi-mapper)DN_contig_to_ref_mapSiteshow many places in the reference genome does this contig map toDN_read_supporthow many of the samples input reads that were used to assemble the contig actually mapped back to the denovo assembled contig (a measure of contig quality, higher = better). Contigs with a zero value here are discardedDN_read_covwhat is the average coverage of reads mapping back to the denovo assembled contig a measure of contig quality, higher = better)DN_fusion_sitede novo fusion site, ie where do the split reads from target gene map to in the reference genome, as predicted by the de novo contig mapped back to reference genomeDN_fusion_site_namename of the feature that the target gene fuses to, eg gene name, intergenicDN_fusion_site_cov_SAof the reads covering the de novo fusion site, how many are supplementary (not primary) mappingsDN_fusion_site_cov_totof the reads covering the de novo fusion site, how many are there in total (primary or supplementary). For a coding site, may expect a broader discrepenacy between this value and the above. For a non-coding site, expect the numbers to more closely match, since this is RNAseq dataDN_fusion_site_cov_supportratio of SA reads to all covering reads at the site. Higher values mean higher support for a fusion to this location.DN_breakpoint_sitebreakpoint site within the target gene, as predicted by the de novo contig mapped back to reference genomeALN_breakpoint_covthe number of reads in the input BAM that cover the gene_breakend siteALN_readthrough_covthe number of reads that "read through" the gene_breakend site, ie do not support a break. For samples homozygous for the fusion, expect the readthrough_coverage to be at/near zero. For heterozygote samples, expect readthrough_coverage to be roughly 25-75% of gene_breakend coverageALN_breakpoint_cov_supportratio of covering reads that support the break out of all covering reads at the siteALN_SA_tothow many reads that mapped to the target gene have supplementary mappings, ie support a break in the coding seqeunce compared to the referenceALN_SA_uniqof these supplementary mappings, how many unique chromosomal locations do they representALN_SA_ord_listlist of the supplementary mappings found within reads mapping to the target gene, listed in order of number of reads supporting that supplementary mapping (higher = gretaer support for a fusion to that site)ALN-DN_match_supportif there is a match between one of the supplementary sites listed above (from input BAM) and the denovo assembly mapped back to reference genome method, there will be a non-zero value, and this value represents the number of sample reads that support the fusion siteALN-DN_concordanceif the fusion site predicted by the de novo method matches one of the locations from the alignment method (ie the SA tags) then this wil be TRUE, otherwise, FALSE. Fusion sites supported by both methods have higher confidence
In its present verson, GIFFL outputs only raw unfiltered hits, and all downstream filtering is left to the individual user. The TSV results file contains comprehensive information for each putative hit, enabling users to rapidly flter the results based on what criteria is most important to their goals.
In testing, we found samples to have between 4 and 5 thousand raw hits each, which can be easily reduced to sub 500 through the application of multiple filters.
Filters can be easily applied within Excel, R, or similar.
Future versions will include a command-line filter utility.