Improve genomeGenerate multicore index build performance by justinblethrow-cloud · Pull Request #2687 · alexdobin/STAR

justinblethrow-cloud · 2026-05-20T15:17:44Z

Summary

Improve genomeGenerate multicore throughput for large references.
Batch suffix-array chunk filling by prefix bin, retain sorted chunks in RAM when memory allows, and split very large prefix bins into ordered sub-bins before sorting.
Parallelize SAindex traversal and junction-index work where output order can remain deterministic.
Add extras/tests/scripts/benchmarkGenomeGenerate.sh to capture wall time, CPU/I/O samples, and genomeGenerate stage timings.
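
For reviewers who want a mental model of the prefix-bin approach, here is a minimal Python sketch. It is not the STAR implementation: the function name, the parameters (prefix_len, max_bin, extra_len), and the string-based suffix comparison are all assumptions made for illustration. It only shows why visiting bins and sub-bins in sorted key order yields a correctly ordered suffix array.

```python
from collections import defaultdict

def build_suffix_array(text, prefix_len=2, max_bin=4, extra_len=2):
    """Toy suffix array: bin suffixes by prefix, split big bins, sort each range."""
    # Pass 1: scatter suffix start positions into bins keyed by their leading bases.
    bins = defaultdict(list)
    for pos in range(len(text)):
        bins[text[pos:pos + prefix_len]].append(pos)

    suffix_array = []
    # Visiting bins in sorted prefix order keeps the concatenated output ordered.
    for prefix in sorted(bins):
        positions = bins[prefix]
        if len(positions) <= max_bin:
            ranges = [positions]
        else:
            # Large bin: split on a few extra bases into ordered sub-bins
            # (hypothetical threshold and width, chosen only for this toy).
            sub_bins = defaultdict(list)
            for pos in positions:
                start = pos + prefix_len
                sub_bins[text[start:start + extra_len]].append(pos)
            ranges = [sub_bins[key] for key in sorted(sub_bins)]
        for sort_range in ranges:
            # Each range is sorted independently, so ranges could be handed
            # to separate workers without changing the final order.
            sort_range.sort(key=lambda pos: text[pos:])
            suffix_array.extend(sort_range)
    return suffix_array

if __name__ == "__main__":
    text = "ACACACACGTACACGT"
    expected = sorted(range(len(text)), key=lambda pos: text[pos:])
    assert build_suffix_array(text, max_bin=2) == expected
```

Because each sort range is independent and the ranges are concatenated in key order, splitting a large bin changes how the work is divided but not the resulting order.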

Benchmark notes

Local full CHM13v2 + ERCC + GTF benchmark, 96 threads, --genomeSAindexNbases 14, --genomeChrBinNbits 18, --limitGenomeGenerateRAM 300000000000
Baseline STAR 2.7.11b local run: 1164.89s
Best local run from this branch: 653.76s
Final pushed PR validation run: 685.29s in results/full_chm13_prfinal_20260520
In the final validation run, suffix-array chunk sorting took 282s wall; profiled internals were fill_count_seconds=1.629, fill_scatter_seconds=36.699, sort_seconds=237.783, finalize_seconds=1.788
Adaptive large-bin splitting sub-binned 116 large prefixes into 7427 sort ranges covering 194,932,600 suffixes
Outputs were byte-identical for Genome, SA, SAindex, chromosome metadata, and sjdbList.out.tab against the prior accepted optimized build
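
The ratios implied by the figures above can be checked with a few lines of arithmetic. The snippet uses only numbers quoted in these notes; the variable names are mine. It works out to roughly 1.78x for the best local run and 1.70x for the final validation run, with the four profiled stages summing to about 277.9s of the 282s chunk-sorting wall time.

```python
baseline_s = 1164.89   # STAR 2.7.11b local run
best_s = 653.76        # best local run from this branch
final_s = 685.29       # final pushed PR validation run

speedup_best = baseline_s / best_s
speedup_final = baseline_s / final_s

# Profiled suffix-array sort internals from the final validation run.
profiled = {
    "fill_count_seconds": 1.629,
    "fill_scatter_seconds": 36.699,
    "sort_seconds": 237.783,
    "finalize_seconds": 1.788,
}
profiled_total = sum(profiled.values())
sort_share = profiled["sort_seconds"] / profiled_total

# Adaptive splitting: 7427 sort ranges covering 194,932,600 suffixes.
suffixes_per_range = 194_932_600 / 7427

print(f"speedup (best run):  {speedup_best:.2f}x")
print(f"speedup (final run): {speedup_final:.2f}x")
print(f"profiled total:      {profiled_total:.3f}s")
print(f"sort share:          {sort_share:.1%}")
print(f"suffixes per range:  {suffixes_per_range:,.0f}")
```

The sort stage dominates the profiled time, which is why the splitting and comparator changes matter more than the fill and finalize stages.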

Validation

make -C source STAR
smoke runs of benchmarkGenomeGenerate.sh at 1 and 16 threads
full CHM13v2 + ERCC + GTF benchmark with STAR_PROFILE_SA_SORT=1
byte comparisons for core generated index outputs
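
A byte comparison of two index directories can be scripted along these lines. This is a sketch, not the script used for the PR: the function name and the directory arguments are hypothetical, and the caller supplies the list of file names to compare.

```python
import filecmp
from pathlib import Path

def compare_index_dirs(reference_dir, candidate_dir, names):
    """Return {name: bool}, True when both files exist and are byte-identical."""
    results = {}
    for name in names:
        ref = Path(reference_dir) / name
        cand = Path(candidate_dir) / name
        if not (ref.is_file() and cand.is_file()):
            results[name] = False  # a missing file counts as a mismatch
            continue
        # shallow=False forces a content comparison instead of
        # trusting size and modification time.
        results[name] = filecmp.cmp(ref, cand, shallow=False)
    return results
```

A call such as compare_index_dirs("baseline_index", "candidate_index", ["Genome", "SA", "SAindex", "sjdbList.out.tab"]) would report each file separately. The two directory names here are placeholders.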

Notes for review

The benchmark harness is intentionally isolated under extras/tests/scripts/.
Optional suffix-sort profiling is gated behind STAR_PROFILE_SA_SORT=1; normal runs do not emit those profiling details.
The implementation keeps deterministic ordering for equal suffixes and was checked with byte comparisons on full-reference outputs.
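
The gating pattern can be shown in miniature. The sketch below is a Python analogue and makes no claim about how STAR reads the variable internally: the helper names are invented, and only the variable name STAR_PROFILE_SA_SORT and the stage label sort_seconds come from this PR.

```python
import os
import time
from contextlib import contextmanager

PROFILE_ENV = "STAR_PROFILE_SA_SORT"

def profiling_enabled(environ=None):
    """True only when the variable is set to exactly "1"."""
    environ = os.environ if environ is None else environ
    return environ.get(PROFILE_ENV) == "1"

@contextmanager
def stage_timer(name, timings, enabled):
    """Accumulate elapsed seconds for a stage, or do nothing when disabled."""
    if not enabled:
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)

if __name__ == "__main__":
    timings = {}
    with stage_timer("sort_seconds", timings, profiling_enabled()):
        sorted(range(100_000, 0, -1))
    # Empty unless STAR_PROFILE_SA_SORT=1 was set in the environment.
    print(timings)
```

With the flag unset, the timed block runs unchanged and nothing is recorded, which matches the intent that normal runs emit no profiling details.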

Add a local benchmark script for genomeGenerate that records wall time, CPU and I/O samples, selected index-build settings, and per-stage timings from Log.out. This keeps performance validation reproducible while leaving STAR runtime behavior unchanged.

Use parallel traversal for SAindex generation and parallel bucket sorting for junction insertion indices. Keep deterministic output ordering while reducing the remaining annotation-heavy genomeGenerate stages.
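
One way to keep output deterministic while sorting buckets in parallel is sketched below. This is an illustration under assumptions, not the code in this commit: the record layout (integer key, payload), the bucket count, and the use of input order as the tie-break are all hypothetical. Python threads also do not give real CPU parallelism for this workload, so the sketch shows the structure only.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_bucket_sort(records, n_buckets=4, n_workers=4):
    """Sort (int_key, payload) records using key-range buckets sorted by workers."""
    if not records:
        return []
    keys = [key for key, _ in records]
    lo, hi = min(keys), max(keys)
    span = hi - lo + 1

    # Tag each record with its input index so equal keys always resolve
    # the same way, regardless of which worker handles the bucket.
    buckets = [[] for _ in range(n_buckets)]
    for index, (key, payload) in enumerate(records):
        bucket_id = (key - lo) * n_buckets // span
        buckets[bucket_id].append((key, index, payload))

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in bucket order, not completion order.
        sorted_buckets = list(pool.map(sorted, buckets))

    return [(key, payload)
            for bucket in sorted_buckets
            for key, _, payload in bucket]

if __name__ == "__main__":
    demo = [(5, "a"), (1, "b"), (5, "c"), (3, "d"), (1, "e")]
    assert parallel_bucket_sort(demo, n_workers=1) == parallel_bucket_sort(demo, n_workers=8)
```

Buckets cover disjoint, increasing key ranges, so concatenating them in bucket order gives the same result for any worker count.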

justinblethrow-cloud · 2026-05-20T15:21:43Z

Cleaned the branch history for review. The PR now has three logical commits: benchmark harness, SAindex/junction parallelization, and suffix-array construction improvements. The cleaned tree is identical to the previously pushed tree (verified by matching git tree hashes), so the benchmark and byte-identity results in the PR description still apply. Local validation after cleanup: bash -n extras/tests/scripts/benchmarkGenomeGenerate.sh, git diff --check, and make -C source STAR.

Reduce genomeGenerate suffix-array build time by batching prefix-bin fills, retaining sorted chunks in RAM when memory allows, splitting very large prefix bins into ordered sub-bins, and using a comparator that can skip already-known prefix words. Optional SA sort profiling remains gated behind STAR_PROFILE_SA_SORT=1.
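
The idea behind a comparator that skips known prefix data can be illustrated as follows. This sketch compares character by character, which is a simplification of the word-based comparison the commit message describes, and the function name and skip parameter are assumptions. Within one prefix bin every suffix shares its leading bases, so the comparison can start past them without changing the result.

```python
from functools import cmp_to_key

def compare_suffixes(text, a, b, skip=0):
    """Three-way compare of text[a:] and text[b:].

    `skip` is the number of leading characters the caller already knows
    to be equal for both suffixes (for example, the shared bin prefix).
    """
    i, j, n = a + skip, b + skip, len(text)
    while i < n and j < n:
        if text[i] != text[j]:
            return -1 if text[i] < text[j] else 1
        i += 1
        j += 1
    # One suffix ran out of characters: the shorter suffix sorts first.
    return (i < n) - (j < n)

if __name__ == "__main__":
    text = "ACGTACGAACGT"
    same_bin = [0, 4, 8]  # all three suffixes start with "ACG"
    full = sorted(same_bin, key=cmp_to_key(lambda a, b: compare_suffixes(text, a, b)))
    fast = sorted(same_bin, key=cmp_to_key(lambda a, b: compare_suffixes(text, a, b, skip=3)))
    assert full == fast
```

The skip is only safe when every suffix in the range really shares the skipped prefix, which is what the bin and sub-bin construction is meant to ensure.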

justinblethrow-cloud · 2026-05-20T16:03:23Z

Update after final hardening pass:

Force-pushed cleaned 3-commit history at 47e6051.
Added a narrow guard around the adaptive sub-bin prefix lookup so it only uses funSAsortPrefixAtOffset when the extra prefix bases are available.
Rebuilt with make -C source STAR.
Re-ran full CHM13v2 + ERCC + GTF genomeGenerate validation at 96 threads: 685.29s wall in results/full_chm13_prfinal_20260520.
Byte-compared Genome, SA, SAindex, chromosome metadata, and sjdbList.out.tab against the prior accepted optimized output: all identical.

One broader skip-first-word safety guard was tested but not kept because it regressed profiled suffix-sort time from about 238s to about 275s. The genomeGenerate buffer already has leading padding, and the retained guard is limited to the extra sub-bin prefix read.
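
The shape of the retained guard is roughly the following bounds check. This is a Python analogue with a hypothetical helper name, not the code around funSAsortPrefixAtOffset, and what the caller does in the fallback case is not described here.

```python
def sub_bin_key(text, pos, prefix_len, extra_len):
    """Return the extra-prefix key for sub-binning, or None if it cannot be read.

    The guard is deliberately narrow: it only checks that the extra bases
    past the bin prefix are all present before reading them.
    """
    start = pos + prefix_len
    end = start + extra_len
    if end > len(text):
        return None  # not enough bases; caller must not sub-bin this suffix
    return text[start:end]

if __name__ == "__main__":
    text = "ACGTACGT"
    assert sub_bin_key(text, 0, prefix_len=2, extra_len=2) == "GT"
    assert sub_bin_key(text, 5, prefix_len=2, extra_len=2) is None
```

Keeping the check on the sub-bin read only, rather than on every comparison, is consistent with the timing observation above.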

Codex added 2 commits May 20, 2026 15:20

Parallelize SAindex and junction indexing

2d12a0c

justinblethrow-cloud force-pushed the indexbuild-opt-20260519 branch from 692ceef to fb426f8 Compare May 20, 2026 15:21

justinblethrow-cloud force-pushed the indexbuild-opt-20260519 branch from fb426f8 to 47e6051 Compare May 20, 2026 16:02

justinblethrow-cloud marked this pull request as ready for review May 20, 2026 16:03

justinblethrow-cloud wants to merge 3 commits into alexdobin:master from justinblethrow-cloud:indexbuild-opt-20260519
