Add persistent genome insertion with delta overlay mode #2686
justinblethrow-cloud wants to merge 8 commits into
Update: pushed f73e8a6, which updates SAindex incrementally for inserted genomes instead of rebuilding it from the expanded suffix array. Validation run locally:
Compared with the earlier full-CHM13 genomeInsert benchmark at 8:08.15, the incremental SAindex update reduces the persistent insertion path to 5:47.46. The benchmark validation reports Genome, SAindex, chromosome files, SJDB files, and exon/gene/transcript sidecars as byte-identical to a full rebuild. SA still differs from the full rebuild, but only in the ordering of equivalent entries, as expected for this path.
Updated this PR with an additional genomeInsert overlay mode commit (
New behavior:
Additional implementation details:
Validation performed locally:
One note on benchmarking: the CHM13 run was warm-cache, so the most robust comparison is the insertion phase. The lazy SA shift removed the previous full-SA pre-pass and moved the SA search checkpoint to immediately after genome load.
Add persistent genome insertion with delta overlay mode
Summary
This PR adds --runMode genomeInsert, which persistently adds named FASTA records and optional insert-only GTF annotations to an existing STAR genome index without modifying the input --genomeDir.

It supports three output modes:

- Full: write a complete updated genome index to --genomeInsertOutDir.
- Overlay: write a small manifest that references the base genome index plus inserted FASTA/GTF records.
- Delta: write a compact suffix-array insertion plan, allowing a reusable overlay directory instead of rewriting or copying the large base SA file.

The default behavior of existing run modes is unchanged.
Closes #2685.
Motivation
Some workflows maintain prebuilt STAR indexes but occasionally need to add small named references such as spike-ins, controls, plasmid records, or transgene sequences. Today this usually requires a full --runMode genomeGenerate rebuild from the original reference FASTA files plus the added records.

genomeInsert makes this incremental use case explicit. The most important practical change is Delta mode: for small added sequences, it changes the operational model from “rebuild or copy a large STAR index” to “create a tiny reusable overlay directory from a prebuilt reference index.”

Interface
Behavior:

- --genomeDir is read but not modified.
- --genomeInsertOutDir receives the selected output representation.
- … --genomeFastaFiles.
- --sjdbGTFfile can be supplied for annotations on inserted sequences only.

Limitations:

- --sjdbGTFfile is insert-only in this mode; it is not used to edit or replace base genome annotations.
- --sjdbFileChrStartEnd and --twopassMode are rejected for genomeInsert.

Delta Mode
Delta mode is the main new scalability feature in this revision.

Instead of writing a full updated SA file, STAR records the positions where inserted-sequence suffixes belong relative to the base suffix array. The resulting genomeInsertDelta.bin can be reused with the base genome index.

At alignment time, when the inserted GTF has no splice junctions, STAR uses a virtual SA overlay:

- Genome, SA, and SAindex are reused;
- … SA;

If inserted annotations define splice junctions, STAR takes the conservative materialized path because junction insertion changes the genome/SJDB sequence layout.
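The insertion-plan idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not STAR's implementation or the on-disk layout of genomeInsertDelta.bin: the names `suffix_array`, `build_delta`, and `VirtualSA` are hypothetical, suffixes are compared as plain Python strings, and the base and inserted sequences are treated as independent texts. The plan stores, for each inserted suffix, the base-SA rank it sorts in front of, so a lookup can resolve any virtual rank without rewriting the base array.

```python
from bisect import bisect_left


def suffix_array(text):
    """Naive suffix array: start offsets ordered by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_delta(base_text, base_sa, insert_text):
    """Hypothetical insertion plan: (base_rank, insert_offset) pairs.

    base_rank is the index in base_sa in front of which the inserted
    suffix sorts. Offsets are visited in suffix order, so the plan is
    already sorted by its final position.
    """
    base_suffixes = [base_text[i:] for i in base_sa]  # sorted by construction
    plan = []
    for off in suffix_array(insert_text):
        rank = bisect_left(base_suffixes, insert_text[off:])
        plan.append((rank, off))
    return plan


class VirtualSA:
    """Read-only view of base SA plus plan; the base array is never copied."""

    def __init__(self, base_sa, plan):
        self.base_sa = base_sa
        self.plan = plan
        # The j-th inserted suffix has base_rank base entries and j
        # inserted entries in front of it.
        self.virtual_pos = [rank + j for j, (rank, _) in enumerate(plan)]

    def __len__(self):
        return len(self.base_sa) + len(self.plan)

    def __getitem__(self, v):
        if not 0 <= v < len(self):
            raise IndexError(v)
        k = bisect_left(self.virtual_pos, v)
        if k < len(self.plan) and self.virtual_pos[k] == v:
            return ("insert", self.plan[k][1])
        # k inserted entries precede rank v, so shift back into the base SA.
        return ("base", self.base_sa[v - k])


# Toy data (made up for illustration).
base = "ACGTACGA"
insert = "CGTT"
base_sa = suffix_array(base)
vsa = VirtualSA(base_sa, build_delta(base, base_sa, insert))
texts = {"base": base, "insert": insert}
merged = [texts[src][off:] for src, off in (vsa[v] for v in range(len(vsa)))]
```

In this toy model the plan grows with the inserted sequence rather than with the base genome, which is the property that keeps a delta small relative to a rewritten SA.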
Validation
The regression script builds a small annotated base genome, inserts two named sequences with insert-only GTF annotations, and compares the result to a full rebuild from base plus inserted FASTA/GTF.
Validated checks include:
- genomeInsert idempotence,
- Genome, SAindex, chromosome metadata, and annotation sidecar files versus full rebuild,
- SJ.out.tab matching a full rebuild,
- Overlay alignment equivalence,
- Delta alignment equivalence,
- SA is not required to be byte-identical to the full rebuild because suffixes with equal ordering keys can be stored in a different but equivalent order.

Local validation commands:

    git diff --check origin/master..HEAD
    make -C source -j8 STAR
    extras/tests/scripts/testGenomeInsert.sh
    THREADS=1 extras/tests/scripts/testGenomeInsert.sh

All passed locally.
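The SA caveat in the checklist above can be made concrete with a small equivalence check. This is a sketch, not the regression script's logic: it assumes the ordering key is a fixed-length prefix of each suffix (`key_len` is an illustrative parameter, not a STAR setting), and the function names are hypothetical. Two arrays count as equivalent when both cover every position once and both are non-decreasing in that key, even if tied entries appear in a different order.

```python
def ordering_key(text, pos, key_len):
    """Assumed ordering key: the first key_len characters of the suffix."""
    return text[pos:pos + key_len]


def is_valid_sa(text, sa, key_len):
    """True if sa lists every position once, in non-decreasing key order."""
    if sorted(sa) != list(range(len(text))):
        return False
    keys = [ordering_key(text, p, key_len) for p in sa]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def equivalent_sas(text, sa_a, sa_b, key_len):
    """Both arrays are valid orderings of the same text under the same key."""
    return is_valid_sa(text, sa_a, key_len) and is_valid_sa(text, sa_b, key_len)


# Toy data: with a 2-character key, positions 0, 2, 4 tie on "AC"
# and positions 1, 3 tie on "CA".
text = "ACACAC"
sa_a = [0, 2, 4, 5, 1, 3]
sa_b = [4, 0, 2, 5, 3, 1]    # same keys, ties stored in a different order
sa_bad = [5, 0, 2, 4, 1, 3]  # "C" placed before "AC": not a valid ordering

same_bytes = sa_a == sa_b
same_order = equivalent_sas(text, sa_a, sa_b, key_len=2)
bad_order = equivalent_sas(text, sa_a, sa_bad, key_len=2)
```

A byte-for-byte comparison would flag `sa_a` and `sa_b` as different, while the key-based check accepts both, which is why the SA file is compared by ordering rather than by content.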
Representative Benchmarks
Full Updated Index Versus Full Rebuild
Using a large prebuilt human genome STAR index and two small added public FASTA records:
- genomeInsert full updated index from prebuilt index: 8:08.15
- genomeGenerate rebuild: 20:24.38

Observed wall-clock speedup was 2.51x, saving 736.23 s, or 60.1% of the full rebuild time.

Delta Overlay Build
Using a large GRCh38/Ensembl 114 + ERCC STAR index with inserted GFP and GST records:
- 27 s
- genomeInsertOverlay.tsv: 571 bytes
- genomeInsertDelta.bin: 54 KB

Alignment Behavior With Synthetic Transgene Reads
A representative production single-end RNA-seq FASTQ with about 13.1M 92 bp reads was spiked with 1,000 synthetic GFP reads and 1,000 synthetic GST reads. The same spiked FASTQ was aligned against the base GRCh38 index and the GRCh38+GFP+GST Delta overlay index.
92 s, 91 s, 0, 1000, 1000

The expected special-row change was observed: the 2,000 synthetic transgene reads moved from N_unmapped in the base alignment into the inserted GFP/GST gene rows in the Delta alignment. No existing reference gene-count row changed.
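A check of this kind can be sketched as a comparison of two gene-count tables. The sketch below is illustrative only: `compare_counts` is a hypothetical helper, the tables are plain dictionaries rather than parsed STAR output, and every number except the 1,000 GFP and 1,000 GST reads is made up.

```python
def compare_counts(base, delta, inserted_genes):
    """Compare per-gene counts from a base run and a Delta overlay run.

    Returns (changed, gained, moved):
      changed - reference gene rows whose count differs between runs
      gained  - counts assigned to the inserted genes in the Delta run
      moved   - drop in N_unmapped from the base run to the Delta run
    """
    changed = {
        gene: (count, delta.get(gene))
        for gene, count in base.items()
        if not gene.startswith("N_") and delta.get(gene) != count
    }
    gained = {gene: delta.get(gene, 0) for gene in inserted_genes}
    moved = base["N_unmapped"] - delta["N_unmapped"]
    return changed, gained, moved


# Toy tables: GENE_A, GENE_B, and the N_unmapped values are invented.
base_counts = {"N_unmapped": 2500, "GENE_A": 10, "GENE_B": 7}
delta_counts = {
    "N_unmapped": 500,
    "GENE_A": 10,
    "GENE_B": 7,
    "GFP": 1000,
    "GST": 1000,
}

changed, gained, moved = compare_counts(base_counts, delta_counts, ["GFP", "GST"])
```

For the expected outcome, `changed` is empty and the reads gained by the inserted genes equal the drop in N_unmapped.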