Add persistent genome insertion with delta overlay mode #2686

Open
justinblethrow-cloud wants to merge 8 commits into alexdobin:master from justinblethrow-cloud:genome-insert-upstream
@justinblethrow-cloud justinblethrow-cloud commented May 17, 2026

Add persistent genome insertion with delta overlay mode

Summary

This PR adds --runMode genomeInsert, which persistently adds named FASTA records and optional insert-only GTF annotations to an existing STAR genome index without modifying the input --genomeDir.

It supports three output modes:

  • Full: write a complete updated genome index to --genomeInsertOutDir.
  • Overlay: write a small manifest that references the base genome index plus inserted FASTA/GTF records.
  • Delta: write a compact suffix-array insertion plan, allowing a reusable overlay directory instead of rewriting or copying the large base SA file.

The default behavior of existing run modes is unchanged.

Closes #2685.

Motivation

Some workflows maintain prebuilt STAR indexes but occasionally need to add small named references such as spike-ins, controls, plasmid records, or transgene sequences. Today this usually requires a full --runMode genomeGenerate rebuild from the original reference FASTA files plus the added records.

genomeInsert makes this incremental use case explicit. The most important practical change is Delta mode: for small added sequences, it changes the operational model from “rebuild or copy a large STAR index” to “create a tiny reusable overlay directory from a prebuilt reference index.”

Interface

STAR \
  --runMode genomeInsert \
  --runThreadN 16 \
  --genomeDir /path/to/prebuilt/index \
  --genomeFastaFiles added_sequences.fa \
  --sjdbGTFfile added_sequences.gtf \
  --genomeInsertOutMode Delta \
  --genomeInsertOutDir /path/to/updated/index

Behavior:

  • --genomeDir is read but not modified.
  • --genomeInsertOutDir receives the selected output representation.
  • Multiple FASTA files can be supplied with --genomeFastaFiles.
  • --sjdbGTFfile can be supplied for annotations on inserted sequences only.
  • Existing annotation sidecar files are preserved and merged with inserted-sequence annotations when a complete updated index is written.

Limitations:

  • Only new named FASTA sequences are added.
  • Existing chromosome sequences are not edited.
  • --sjdbGTFfile is insert-only in this mode; it is not used to edit or replace base genome annotations.
  • --sjdbFileChrStartEnd and --twopassMode are rejected for genomeInsert.
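
Because --genomeDir must stay untouched, a wrapper script may want to check the two directories before invoking STAR at all. The following is a minimal sketch of such a pre-flight check, not part of this PR: the function name insert_dirs_ok and its return codes are hypothetical, and STAR's own same-directory guard remains the authoritative check.

```shell
# Hypothetical pre-flight helper for wrapper scripts; not part of STAR.
# Returns 0 when the output directory cannot alias the base index,
# 1 when both paths resolve to the same directory, 2 on a missing base index.
insert_dirs_ok() {
    base_dir=$1
    out_dir=$2
    if [ ! -d "$base_dir" ]; then
        echo "base index directory not found: $base_dir" >&2
        return 2
    fi
    # An output directory that does not exist yet cannot alias the base index.
    [ -d "$out_dir" ] || return 0
    # cd + pwd -P resolves symlinks and relative path components portably.
    base_real=$(cd "$base_dir" && pwd -P) || return 2
    out_real=$(cd "$out_dir" && pwd -P) || return 2
    if [ "$base_real" = "$out_real" ]; then
        echo "refusing to write into the base index: $base_real" >&2
        return 1
    fi
    return 0
}
```

A wrapper could call insert_dirs_ok /path/to/prebuilt/index /path/to/updated/index and only run STAR --runMode genomeInsert when it returns 0.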

Delta Mode

Delta mode is the main new scalability feature in this revision.

Instead of writing a full updated SA file, STAR records the positions where inserted-sequence suffixes belong relative to the base suffix array. The resulting genomeInsertDelta.bin can be reused with the base genome index.

At alignment time, when the inserted GTF has no splice junctions, STAR uses a virtual SA overlay:

  • base Genome, SA, and SAindex are reused;
  • inserted suffixes are resolved lazily from the delta plan;
  • existing suffix-array positions are shifted virtually rather than materializing a full expanded SA;
  • alignment output remains equivalent to a full rebuild for the covered cases.

If inserted annotations define splice junctions, STAR takes the conservative materialized path because junction insertion changes the genome/SJDB sequence layout.
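
The idea of resolving inserted suffixes lazily can be illustrated with a toy model. The sketch below is an illustration only: the rank:value plan encoding and the function name virtual_sa_at are assumptions made for this example and do not describe the on-disk genomeInsertDelta.bin format or STAR's internal code. Each plan entry records how many base SA entries sort before one inserted suffix, and a lookup at a merged index either hits an inserted entry or falls through to the untouched base SA at a shifted index. The scan is linear for brevity, and the coordinate shifting of stored positions is not modeled.

```shell
# Toy model of a virtual suffix-array overlay; not STAR's implementation.
#   $1 = 0-based index into the merged (virtual) suffix array
#   $2 = base SA values, space separated
#   $3 = delta plan: space-separated rank:value pairs sorted by rank, where
#        rank is the number of base SA entries that sort before the inserted suffix
virtual_sa_at() {
    awk -v idx="$1" -v base="$2" -v plan="$3" 'BEGIN {
        nb = split(base, sa, " ")
        np = split(plan, p, " ")
        seen = 0                       # inserted entries located before idx
        for (j = 1; j <= np; j++) {
            split(p[j], kv, ":")
            merged = kv[1] + seen      # merged index of this inserted suffix
            if (merged == idx) { print "ins:" kv[2]; exit }
            if (merged > idx) break
            seen++
        }
        b = idx - seen                 # index into the unmodified base SA
        if (b < 0 || b >= nb) { print "index out of range" > "/dev/stderr"; exit 1 }
        print "base:" sa[b + 1]
    }'
}
```

With base SA "5 2 7 0" and plan "1:100 1:101 3:102", the merged order is 5, 100, 101, 2, 7, 102, 0, and the base array is never copied or rewritten to answer a lookup.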

Validation

The regression script builds a small annotated base genome, inserts two named sequences with insert-only GTF annotations, and compares the result to a full rebuild from base plus inserted FASTA/GTF.

Validated checks include:

  • same-directory output guard,
  • inserted-only GTF guard,
  • repeated genomeInsert idempotence,
  • byte-equivalent Genome, SAindex, chromosome metadata, and annotation sidecar files versus full rebuild,
  • alignment body and SJ.out.tab matching a full rebuild,
  • GeneCounts output matching a full rebuild,
  • Overlay alignment equivalence,
  • Delta alignment equivalence,
  • virtual-SA overlay path for no-junction inserted GTF annotations,
  • expected alignments to both inserted references and original base reference.

SA is not required to be byte-identical to the full rebuild because suffixes with equal ordering keys can be stored in a different but equivalent order.
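
A byte-equivalence check of this kind can be sketched with cmp. The helper below is a hypothetical illustration, not an excerpt from testGenomeInsert.sh: it takes the file names to compare as arguments, so SA can simply be left out of the list.

```shell
# Hypothetical helper: report which named files are byte-identical between
# two index directories. Prints one "same <name>" or "DIFF <name>" line per
# file and returns non-zero if any file differs or is missing.
compare_index_files() {
    dir_a=$1
    dir_b=$2
    shift 2
    status=0
    for name in "$@"; do
        if cmp -s "$dir_a/$name" "$dir_b/$name"; then
            echo "same $name"
        else
            echo "DIFF $name"
            status=1
        fi
    done
    return "$status"
}
```

For example, compare_index_files inserted_index rebuilt_index Genome SAindex chrName.txt chrLength.txt would cover the sequence and chromosome metadata while skipping SA; the two directory names here are placeholders.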

Local validation commands:

git diff --check origin/master..HEAD
make -C source -j8 STAR
extras/tests/scripts/testGenomeInsert.sh
THREADS=1 extras/tests/scripts/testGenomeInsert.sh

All passed locally.

Representative Benchmarks

Full Updated Index Versus Full Rebuild

Using a large prebuilt human genome STAR index and two small added public FASTA records:

| Scenario | Wall time (m:ss) |
| --- | --- |
| genomeInsert full updated index from prebuilt index | 8:08.15 |
| Full genomeGenerate rebuild | 20:24.38 |

Observed wall-clock speedup was 2.51x, saving 736.23 s, or 60.1% of the full rebuild time.
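
The arithmetic behind these figures can be reproduced with a few lines of awk. This is a small convenience sketch; the function name wall_speedup is made up for this example.

```shell
# Convert "M:SS.ss" (or "H:MM:SS") wall times to seconds, then print the
# speedup factor, the seconds saved, and the percentage of the slow run saved.
wall_speedup() {
    awk -v fast="$1" -v slow="$2" '
    function secs(t,    n, parts, i, s) {
        n = split(t, parts, ":")
        s = 0
        for (i = 1; i <= n; i++) s = s * 60 + parts[i]
        return s
    }
    BEGIN {
        f = secs(fast)
        s = secs(slow)
        printf "%.2fx %.2f s %.1f%%\n", s / f, s - f, 100 * (s - f) / s
    }'
}
```

Running wall_speedup 8:08.15 20:24.38 prints 2.51x 736.23 s 60.1%, which matches the numbers quoted above.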

Delta Overlay Build

Using a large GRCh38/Ensembl 114 + ERCC STAR index with inserted GFP and GST records:

| Scenario | Result |
| --- | --- |
| Delta build wall time | 27 s |
| genomeInsertOverlay.tsv | 571 bytes |
| genomeInsertDelta.bin | 54 KB |

Alignment Behavior With Synthetic Transgene Reads

A representative production single-end RNA-seq FASTQ with about 13.1M 92 bp reads was spiked with 1,000 synthetic GFP reads and 1,000 synthetic GST reads. The same spiked FASTQ was aligned against the base GRCh38 index and the GRCh38+GFP+GST Delta overlay index.

| Check | Result |
| --- | --- |
| Base alignment wall time | 92 s |
| Delta alignment wall time | 91 s |
| Existing reference gene rows changed | 0 |
| GFP count in Delta output | 1000 |
| GST count in Delta output | 1000 |

The expected special-row change was observed: the 2,000 synthetic transgene reads moved from N_unmapped in the base alignment into the inserted GFP/GST gene rows in the Delta alignment. No existing reference gene-count row changed.
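
A comparison like this can be scripted against the two ReadsPerGene.out.tab files that --quantMode GeneCounts writes. The helper below is a hypothetical sketch, not the script used for the benchmark: it skips the N_ summary rows, counts gene rows whose three count columns differ between the two files, and lists gene rows that appear only in the second file. It assumes no real gene ID starts with N_.

```shell
# Hypothetical helper: compare two STAR ReadsPerGene.out.tab files.
# Prints "changed <n>" for gene rows present in both files whose counts differ,
# then "added <gene> <unstranded count>" for gene rows only in the second file.
# Summary rows (N_unmapped, N_multimapping, ...) are skipped.
genecounts_diff() {
    awk -F '\t' '
    $1 ~ /^N_/   { next }
    NR == FNR    { base[$1] = $2 FS $3 FS $4; next }
    ($1 in base) { if (base[$1] != $2 FS $3 FS $4) changed++; next }
                 { added[++n] = $1 " " $2 }
    END {
        print "changed " (changed + 0)
        for (i = 1; i <= n; i++) print "added " added[i]
    }' "$1" "$2"
}
```

Called as genecounts_diff base/ReadsPerGene.out.tab delta/ReadsPerGene.out.tab (placeholder paths), a result of changed 0 followed by one added line per inserted gene corresponds to the outcome described above.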

@justinblethrow-cloud
Author

Update: pushed f73e8a6, which updates SAindex incrementally for inserted genomes instead of rebuilding it from the expanded suffix array.

Validation run locally:

  • make -C source -j8 STAR: pass
  • extras/tests/scripts/testGenomeInsert.sh: pass
  • chr20+GFP GeneCounts/SAM validation against full rebuild: pass
  • full CHM13+ERCC with GFP+GST benchmark: genomeInsert 5:47.46 wall, full rebuild 21:14.55 wall

Compared with the earlier full-CHM13 genomeInsert benchmark at 8:08.15, the incremental SAindex update reduces the persistent insertion path to 5:47.46. The benchmark validation reports Genome, SAindex, chromosome files, SJDB files, and exon/gene/transcript sidecars as byte-identical to the full rebuild. SA still differs from the full rebuild, but only in the order of equivalent suffixes, as expected for this path.

@justinblethrow-cloud
Author

Updated this PR with an additional genomeInsert overlay mode commit (329d070).

New behavior:

  • Adds --genomeInsertOutMode Full|Overlay.
  • Full preserves the persistent full-index output behavior.
  • Overlay writes a small genomeInsertOverlay.tsv manifest that references an existing base genome plus inserted FASTA/GTF records, avoiding a full Genome/SA/SAindex copy.
  • Overlay directories can be used for alignReads with --genomeLoad NoSharedMemory.
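
As a rough illustration of the last point, the sketch below prints (without running) an alignment command line for an overlay directory. It assumes that the overlay directory is passed as --genomeDir, which is how the text above reads but is not spelled out; the function name overlay_align_cmd and all paths are placeholders, and paths containing spaces are not quoted in the printed output.

```shell
# Hypothetical dry-run helper: print a STAR alignment command that points
# --genomeDir at an overlay directory. Nothing is executed.
overlay_align_cmd() {
    overlay_dir=$1
    reads_fastq=$2
    out_prefix=$3
    echo "STAR --runMode alignReads" \
         "--genomeDir $overlay_dir" \
         "--genomeLoad NoSharedMemory" \
         "--readFilesIn $reads_fastq" \
         "--quantMode GeneCounts" \
         "--outFileNamePrefix $out_prefix"
}
```

Printing the command first makes it easy to confirm that --genomeLoad NoSharedMemory is present before launching a long alignment run.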

Additional implementation details:

  • Shared annotation sidecar merge helper for inserted GTF support and --quantMode GeneCounts.
  • Fast path that skips junction index insertion when the inserted GTF adds no junctions, which is the common transgene case.
  • Lazy SA coordinate shifting during genome insertion, avoiding an up-front full suffix-array rewrite while preserving full-rebuild-equivalent alignments.
  • Expanded extras/tests/scripts/testGenomeInsert.sh coverage for SJDB-backed base indexes, overlay manifests, overlay alignments, and gene counts.

Validation performed locally:

  • make -C source -j8 STAR
  • extras/tests/scripts/testGenomeInsert.sh
  • Full CHM13 + GFP/GST inserted FASTA/GTF overlay benchmark: 1:51.10 wall time after the overlay/lazy-shift optimizations.
  • Real RNA-seq spike-in check: 1000 real J89B9H reads plus 12 synthetic GFP reads. The overlay SAM body exactly matched the SAM body produced with a full-rebuild index; all 12 GFP reads mapped uniquely to M62653.1; overlay GeneCounts reported GFP 12 12 0 and GST 0 0 0.

One note on benchmarking: the CHM13 run was warm-cache, so the most robust comparison is the insertion phase. The lazy SA shift removed the previous full-SA pre-pass and moved the SA search checkpoint to immediately after genome load.

@justinblethrow-cloud justinblethrow-cloud changed the title from "Add persistent genome sequence insertion mode" to "Add persistent genome insertion with delta overlay mode" on May 17, 2026

Development

Successfully merging this pull request may close these issues.

Persistently add named FASTA sequences to an existing genome index
