Skip to content

calizilla/GIFFL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 

Repository files navigation

GIFFL

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Genome-inclusive fusion finder for long reads

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently tested on ONT RNAseq data, cDNA and direct read sequencing (DRS), against human genome versions hg38 and T2T.

GIFFL is an ultra-sensitive fusion caller that reports all evidence of fusions of a gene to any other type of sequence within the genome (ie, "genome inclusive"). This differs to other gene fusion detection tools, which primarily report only gene:gene fusions.

Maximum sensitivity is attained by reporting any evidence of any fusion event, supported by at least one read. This approach was chosen to minimise false negatives at the cost of increased false positives, an approach best suited to investigations targeting rare and/or novel events.

The advantages of GIFFL over existing long read fusion callers:

  • GIFFL uses a highly sensitive gene/feature-centric approach. GIFFL can run over a single gene, a list of genes, all features within a genomic region, a bespoke BED file of features, or all coding genes within the genome. This customisable input modality is more flexible than typical gene fusion detection tools which focus only on gene--gene fusions within an input BAM file.
  • Since GIFFL is gene/feature-centric, a rapid result on precise inputs can be obtained within minutes, much faster than typical gene fusion detection tools designed to operate genome-wide.
  • Conversely, studies with no a priori knowledge can use GIFFL to query all-coding or indeed any list of features, knowing that the inclusive nature of the GIFFL methodology will report ALL findings, coding or otherwise.
  • Multiple methods are used to assess putative fusions, with sites supported from just a single read taken forward. Putative chimeras are marked, and numerous metrics to enable highly customisable filtering are included in the output file.
  • GIFFL uses two sources of information to report detailed metrics of putative fusions: read mappings from the input BAM file, as well as a de novo assembly of all reads mapping to the query gene/feature. By providing both sources of information, with detailed metrics and cross-support values included in the raw output, sensitivity and specificity can be tailored to suit the experiment.
  • Transcript assemblies for all target genes/features passing the assembly stage are included in the output.

Input requirements

  • a coordinate sorted, indexed BAM file of sample reads mapped to a reference genome
  • a GTF annotation file
  • a bed12 annotation file, bgzipped and tabix-indexed
  • a minimap2 index (.mmi) for the reference genome (BAM, gtf, bed12 and mmi must match)
  • a list of target inputs (genes or chromosomal region)

Usage

Step 1: Prepare target gene inputs

GIFFL can take one of 5 input target types:

  • a single gene, eg "GAPDH"
  • a list of comma-separated genes, eg "GAPDH,SOX,HES7"
  • a chromosomal region, eg "chr1:100000000-150000000"
  • 'allcoding' for all coding genes within the supplied GTF
  • a pre-prepared BED file matching the required format

To prepare target gene inputs for giffl, first run:

giffl-pi --input <input> --gtf <gtf> --label <label>

For gene, gene list, 'allcoding', or region, a new BED file is prepared. For a pre-prepared BED file, giffl-pi will run a minimal check that the correct number of columns are provided. Users should check their BED file input before running giffl to ensure it matches the format of the example below:

chr1    154269392       154275882       HAX1    protein_coding
chr6    32821832        32838784        TAP2    protein_coding
chr8    142727190       142736938       THEM6   protein_coding
chr13   41216144        41263775        MTRF1   protein_coding
chr19   14408797        14419383        DDX39A  protein_coding

Step 2: Run GIFFL

GIFFL is run per-sample per input BED, with the following required parameters:

  • sample Input sample ID
  • seq_type 'drs' or 'cdna'
  • bam_in Input sample BAM file
  • genes_bed Output of giffl-pi: Input BED file of 5 columns: chr, start - 1, end, gene_name, gene_biotype
  • mmi minimap2 index of the reference genome
  • genome_bed12 bgzipped bed12 annotation file. Tabix index must exist alongside .bed12.gz
  • threads Number of cores to use for muli-threaded components
  • java_mem Java memory for de novo transcript assembly
  • label Descriptive name for output file, eg 'candidate-genes'

Optional parameters:

  • debug Debug mode. 0 for off (default) or 1 for on. Debug mode prevents deletion of temp files. Caution - there are a lot, so ensure cleanup after debug to avoid overloading your filesystem.
  • outdir Output directory for GIFFL results. Defaults to the present working directory.

Example command:

giffl \
	--sample Sample01 \
	--seq_type drs \
	--bam_in /path/to/Sample01.sort.bam \
	--genes_bed mygenes.bed \
	--mmi /path/to/ref/genome.fna.mmi \
	--genome_bed12 /path/to/ref/genome.annotation.bed12.gz \
	--threads 12 \
	--java_mem 42 \
	--label my-candidate-genes

Installation and run options

GIFFL can be easily run with a singularity container including all required dependencies (recommended), or called directly via the scripts provided in the bin folder of this repository. If running outside of the provided singularity container, please ensure the following dependencies are available in your path:

  • samtools (tested with version 2.30)
  • minimap2 (tested with version 1.22.1)
  • rnabloom (tested with version 2.0.1)
  • htslib (tested with version 1.23)

Your Linux distribution must also have GNU awk and bc.

To call from the singularity container, first pull the image, then run with eg:

singularity run giffl_1.0.1.sif giffl <args>

To run directly, first clone the repository, then provide path to script, eg:

git clone git@github.com:Sydney-Informatics-Hub/GIFFL.git
GIFFL/bin/giffl_prep_input.sh <args>

Compute requirements

The recommended resources for GIFFL are 12 cores and 48 GB RAM. Runs may be achieved with less resources depending on your hardware, the size and depth of your samples, and the exact target genes.

On average, GIFFL can process a list of 100 target genes in < 1 hour with 12 CPU. A single gene of interest can run in seconds while a candidate region of 1 Mb containing 10-20 genes may complete within a few minutes. This makes GIFFL ideal for rapid querying of candidate gene lists or candidate regions, with a much faster time to targeted results compared to tools that only search the whole genome per run.

Some genes can be vastly slower than others to process, for example those with a high exon count, high read depth, or a high number of isoforms.

To run GIFFL in 'allcoding' search mode, HPC is recommended to scatter chunks of 100-200 input genes over numerous 12-core parallel tasks/array sub-jobs.

Output files

The main GIFFL output file is a TSV titled <sample>.<label>.GIFFL.raw.tsv, with the columns described below.

Other outputs include a verbose log and assembled transcripts for all genes passing the de novo assembly stage of the analysis.

Raw output file fields

  • Fusion_name name of fusion, ie geneA--regionB, where region B can be a gene name, a non-coding feature name, or intergenic. If >1 feature found in gtf at this site, a '' will be added. If potential missed chimera (fusion site = tagret gene end), '' will be added. These are cumulative, so '' is possible
  • Fusion_description describes the biotypes of the target gene and the fusion gene/feature/region
  • Fusion_breakpoint_pair the genomic locations of the left and right breakpoints
  • Target_gene_name target/input gene
  • Target_gene_region gene coding sequence span (from GTF)
  • DN_contigs_num number of de novo contigs produced from the reads covering the target gene coding sequence span
  • DN_ID de-novo assembled contig ID and range (affected sequence within contig). For hits with 1 read, this will be read ID (no assembly performed)
  • DN_contig_length length of the de novo contig giving rise to this hit
  • DN_contig_to_ref_mapQ mapping quality of the denovo assembled cotnig back to the refernece (60 = best, 0 = worst, multi-mapper)
  • DN_contig_to_ref_mapSites how many places in the reference genome does this contig map to
  • DN_read_support how many of the samples input reads that were used to assemble the contig actually mapped back to the denovo assembled contig (a measure of contig quality, higher = better). Contigs with a zero value here are discarded
  • DN_read_cov what is the average coverage of reads mapping back to the denovo assembled contig a measure of contig quality, higher = better)
  • DN_fusion_site de novo fusion site, ie where do the split reads from target gene map to in the reference genome, as predicted by the de novo contig mapped back to reference genome
  • DN_fusion_site_name name of the feature that the target gene fuses to, eg gene name, intergenic
  • DN_fusion_site_cov_SA of the reads covering the de novo fusion site, how many are supplementary (not primary) mappings
  • DN_fusion_site_cov_tot of the reads covering the de novo fusion site, how many are there in total (primary or supplementary). For a coding site, may expect a broader discrepenacy between this value and the above. For a non-coding site, expect the numbers to more closely match, since this is RNAseq data
  • DN_fusion_site_cov_support ratio of SA reads to all covering reads at the site. Higher values mean higher support for a fusion to this location.
  • DN_breakpoint_site breakpoint site within the target gene, as predicted by the de novo contig mapped back to reference genome
  • ALN_breakpoint_cov the number of reads in the input BAM that cover the gene_breakend site
  • ALN_readthrough_cov the number of reads that "read through" the gene_breakend site, ie do not support a break. For samples homozygous for the fusion, expect the readthrough_coverage to be at/near zero. For heterozygote samples, expect readthrough_coverage to be roughly 25-75% of gene_breakend coverage
  • ALN_breakpoint_cov_support ratio of covering reads that support the break out of all covering reads at the site
  • ALN_SA_tot how many reads that mapped to the target gene have supplementary mappings, ie support a break in the coding seqeunce compared to the reference
  • ALN_SA_uniq of these supplementary mappings, how many unique chromosomal locations do they represent
  • ALN_SA_ord_list list of the supplementary mappings found within reads mapping to the target gene, listed in order of number of reads supporting that supplementary mapping (higher = gretaer support for a fusion to that site)
  • ALN-DN_match_support if there is a match between one of the supplementary sites listed above (from input BAM) and the denovo assembly mapped back to reference genome method, there will be a non-zero value, and this value represents the number of sample reads that support the fusion site
  • ALN-DN_concordance if the fusion site predicted by the de novo method matches one of the locations from the alignment method (ie the SA tags) then this wil be TRUE, otherwise, FALSE. Fusion sites supported by both methods have higher confidence

Filtering raw results

In its present verson, GIFFL outputs only raw unfiltered hits, and all downstream filtering is left to the individual user. The TSV results file contains comprehensive information for each putative hit, enabling users to rapidly flter the results based on what criteria is most important to their goals.

In testing, we found samples to have between 4 and 5 thousand raw hits each, which can be easily reduced to sub 500 through the application of multiple filters.

Filters can be easily applied within Excel, R, or similar.

Future versions will include a command-line filter utility.

About

Genome-inclusive fusion finder for long reads

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages