Improve genomeGenerate multicore index build performance #2687

Open
justinblethrow-cloud wants to merge 3 commits into alexdobin:master from justinblethrow-cloud:indexbuild-opt-20260519

@justinblethrow-cloud justinblethrow-cloud commented May 20, 2026

Summary

  • improve genomeGenerate multi-core throughput for large references
  • batch suffix-array chunk filling by prefix bin, retain sorted chunks in RAM when memory allows, and split very large prefix bins into ordered sub-bins before sorting
  • parallelize SAindex traversal and junction-index work where output order can remain deterministic
  • add extras/tests/scripts/benchmarkGenomeGenerate.sh to capture wall time, CPU/I/O samples, and genomeGenerate stage timings
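The fill and split steps listed above can be illustrated with a small sketch. This is not STAR's C++ code: it is a toy Python model on a plain string, and the names `prefix_at`, `build_suffix_array`, `prefix_len`, `max_bin`, and `extra_len` are hypothetical. It assumes the text contains no NUL character, which is used as padding.

```python
from collections import Counter


def prefix_at(text: str, pos: int, offset: int, length: int) -> str:
    """Return `length` characters at pos+offset, NUL-padded so short suffixes sort first."""
    start = pos + offset
    return text[start:start + length].ljust(length, "\0")


def build_suffix_array(text: str, prefix_len: int = 2,
                       max_bin: int = 4, extra_len: int = 2) -> list:
    """Toy suffix-array build: count, scatter, then sort each prefix bin (hypothetical helper)."""
    n = len(text)

    # Pass 1 (count): size every prefix bin.
    counts = Counter(prefix_at(text, i, 0, prefix_len) for i in range(n))

    # Prefix sums give each bin's start offset in the output array.
    starts, total = {}, 0
    for p in sorted(counts):
        starts[p] = total
        total += counts[p]

    # Pass 2 (scatter): drop every suffix position into its bin.
    sa = [0] * n
    cursor = dict(starts)
    for i in range(n):
        p = prefix_at(text, i, 0, prefix_len)
        sa[cursor[p]] = i
        cursor[p] += 1

    # Pass 3 (sort): bins are independent; oversized bins are first split
    # into ordered sub-bins keyed by the next `extra_len` characters.
    for p in sorted(counts):
        lo, hi = starts[p], starts[p] + counts[p]
        positions = sa[lo:hi]
        if len(positions) > max_bin:
            sub_bins = {}
            for i in positions:
                key = prefix_at(text, i, prefix_len, extra_len)
                sub_bins.setdefault(key, []).append(i)
            ordered = []
            for key in sorted(sub_bins):
                ordered.extend(sorted(sub_bins[key], key=lambda i: text[i:]))
            sa[lo:hi] = ordered
        else:
            sa[lo:hi] = sorted(positions, key=lambda i: text[i:])
    return sa
```

Because every bin and sub-bin covers a fixed, ordered slice of the output, the slices can be sorted independently (and therefore in parallel) without changing the final array.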

Benchmark notes

  • Local full CHM13v2 + ERCC + GTF benchmark, 96 threads, --genomeSAindexNbases 14, --genomeChrBinNbits 18, --limitGenomeGenerateRAM 300000000000
  • Baseline STAR 2.7.11b local run: 1164.89s
  • Best local run from this branch: 653.76s
  • Final pushed PR validation run: 685.29s in results/full_chm13_prfinal_20260520
  • In the final validation run, suffix-array chunk sorting took 282s wall; profiled internals were fill_count_seconds=1.629, fill_scatter_seconds=36.699, sort_seconds=237.783, finalize_seconds=1.788
  • Adaptive large-bin splitting sub-binned 116 large prefixes into 7427 sort ranges covering 194,932,600 suffixes
  • Outputs were byte-identical for Genome, SA, SAindex, chromosome metadata, and sjdbList.out.tab against the prior accepted optimized build
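For reference, the ratios implied by the figures above can be derived as follows. The numbers are copied from the notes; the variable names are illustrative only.

```python
# Wall times in seconds, taken from the benchmark notes.
baseline_s = 1164.89   # STAR 2.7.11b
best_s = 653.76        # best local run from this branch
final_s = 685.29       # final pushed validation run

speedup_best = baseline_s / best_s
speedup_final = baseline_s / final_s

# Profiled suffix-sort internals from the final validation run (seconds).
profiled = {
    "fill_count": 1.629,
    "fill_scatter": 36.699,
    "sort": 237.783,
    "finalize": 1.788,
}
profiled_total = sum(profiled.values())
sort_share = profiled["sort"] / profiled_total

# Adaptive large-bin splitting statistics.
ranges_per_prefix = 7427 / 116
suffixes_per_range = 194_932_600 / 7427

print(f"speedup (best):  {speedup_best:.2f}x")
print(f"speedup (final): {speedup_final:.2f}x")
print(f"profiled total:  {profiled_total:.3f}s, sort share {sort_share:.1%}")
print(f"sub-ranges per large prefix: {ranges_per_prefix:.1f}")
print(f"mean suffixes per sort range: {suffixes_per_range:.0f}")
```

This works out to roughly 1.78x for the best run and 1.70x for the final validation run, with the comparison sort itself accounting for most of the profiled time.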

Validation

  • make -C source STAR
  • smoke runs of benchmarkGenomeGenerate.sh at 1 and 16 threads
  • full CHM13v2 + ERCC + GTF benchmark with STAR_PROFILE_SA_SORT=1
  • byte comparisons for core generated index outputs
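A byte comparison of this kind can be reproduced with a short script such as the sketch below. It is not part of the PR; `sha256_file` and `compare_index_dirs` are hypothetical helpers, and the file list is an assumption based on the outputs named above (the `chr*.txt` names stand in for "chromosome metadata").

```python
import hashlib
from pathlib import Path

# Assumed file list; adjust to the outputs you actually want to compare.
INDEX_FILES = [
    "Genome", "SA", "SAindex",
    "chrName.txt", "chrLength.txt", "chrStart.txt", "chrNameLength.txt",
    "sjdbList.out.tab",
]


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte index files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def compare_index_dirs(ref_dir, new_dir, names=INDEX_FILES):
    """Return {file name: True if both files exist and have identical contents}."""
    result = {}
    for name in names:
        ref_path = Path(ref_dir) / name
        new_path = Path(new_dir) / name
        result[name] = (
            ref_path.is_file()
            and new_path.is_file()
            and sha256_file(ref_path) == sha256_file(new_path)
        )
    return result
```

Calling `compare_index_dirs("baseline_index", "branch_index")` (both directory names are placeholders) returns a per-file verdict; any `False` entry marks a file that is missing or differs.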

Notes for review

  • The benchmark harness is intentionally isolated under extras/tests/scripts/.
  • Optional suffix-sort profiling is gated behind STAR_PROFILE_SA_SORT=1; normal runs do not emit those profiling details.
  • The implementation keeps deterministic ordering for equal suffixes and was checked with byte comparisons on full-reference outputs.

Codex added 2 commits May 20, 2026 15:20
Add a local benchmark script for genomeGenerate that records wall time, CPU and I/O samples, selected index-build settings, and per-stage timings from Log.out. This keeps performance validation reproducible while leaving STAR runtime behavior unchanged.
Use parallel traversal for SAindex generation and parallel bucket sorting for junction insertion indices. Keep deterministic output ordering while reducing the remaining annotation-heavy genomeGenerate stages.
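The "parallel but deterministic" property described in this commit message can be sketched as follows. This is an illustration of the ordering argument only, not the STAR implementation: `parallel_bucket_sort` and its parameters are hypothetical, and Python threads are used for clarity rather than speed.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_bucket_sort(items, num_buckets, key, bucket_of, workers=4):
    """Sort `items` by `key` using independently sorted buckets.

    `bucket_of(item)` must return an index in [0, num_buckets) that is
    monotone in `key(item)`: an item with a larger key never lands in a
    lower-numbered bucket.
    """
    buckets = [[] for _ in range(num_buckets)]
    for index, item in enumerate(items):
        # The original index breaks ties, so equal keys keep input order
        # no matter how the buckets are scheduled across workers.
        buckets[bucket_of(item)].append((key(item), index, item))

    def sort_bucket(bucket):
        return sorted(bucket, key=lambda entry: (entry[0], entry[1]))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in bucket order, not completion order.
        sorted_buckets = list(pool.map(sort_bucket, buckets))

    return [item for bucket in sorted_buckets for (_, _, item) in bucket]
```

The output depends only on the input and the bucket function, not on the number of workers, which is the property that makes byte-for-byte output comparison meaningful.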
@justinblethrow-cloud
Author

Cleaned the branch history for review. The PR now has three logical commits: benchmark harness, SAindex/junction parallelization, and suffix-array construction improvements. The cleaned tree is identical to the previously pushed tree (verified by matching git tree hashes), so the benchmark and byte-identity results in the PR description still apply. Local validation after cleanup: bash -n extras/tests/scripts/benchmarkGenomeGenerate.sh, git diff --check, and make -C source STAR.

Reduce genomeGenerate suffix-array build time by batching prefix-bin fills, retaining sorted chunks in RAM when memory allows, splitting very large prefix bins into ordered sub-bins, and using a comparator that can skip already-known prefix words. Optional SA sort profiling remains gated behind STAR_PROFILE_SA_SORT=1.
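The idea behind a comparator that skips already-known prefix words can be shown on a toy example. The sketch below works on characters of a Python string rather than on packed machine words, and `sort_bin_skipping_prefix` is a hypothetical name, not a function from this PR. It assumes every suffix in the bin is at least `known_prefix_len` characters long and shares that prefix.

```python
def sort_bin_skipping_prefix(text, positions, known_prefix_len):
    """Sort suffix start positions that all share the same leading prefix.

    Since the first `known_prefix_len` characters are identical for every
    suffix in the bin, comparisons can start after them and still produce
    the same order as comparing the full suffixes.
    """
    return sorted(positions, key=lambda pos: text[pos + known_prefix_len:])


text = "banana$"
an_bin = [1, 3]  # both suffixes start with "an"
assert sort_bin_skipping_prefix(text, an_bin, 2) == sorted(an_bin, key=lambda p: text[p:])
```

The saving grows with the size of the bin: every comparison avoids re-reading a prefix that the binning step has already established to be equal.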
@justinblethrow-cloud
Author

Update after final hardening pass:

  • Force-pushed cleaned 3-commit history at 47e6051.
  • Added a narrow guard around the adaptive sub-bin prefix lookup so it only uses funSAsortPrefixAtOffset when the extra prefix bases are available.
  • Rebuilt with make -C source STAR.
  • Re-ran full CHM13v2 + ERCC + GTF genomeGenerate validation at 96 threads: 685.29s wall in results/full_chm13_prfinal_20260520.
  • Byte-compared Genome, SA, SAindex, chromosome metadata, and sjdbList.out.tab against the prior accepted optimized output: all identical.

One broader skip-first-word safety guard was tested but not kept because it regressed profiled suffix-sort time from ~238s back to ~275s. The genomeGenerate buffer already has leading padding, so the retained guard is limited to the extra sub-bin prefix read.
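One way a narrow guard of this kind could look is sketched below. This is an assumption about the general shape, not the code in this PR: `extra_prefix` and `split_or_keep` are hypothetical names, and the model uses a plain string in place of the packed genome buffer.

```python
from typing import List, Optional


def extra_prefix(text: str, pos: int, offset: int, extra_len: int) -> Optional[str]:
    """Return the extra sub-bin prefix, or None if it would read past the end of `text`."""
    end = pos + offset + extra_len
    if end > len(text):
        return None
    return text[pos + offset:end]


def split_or_keep(text: str, positions: List[int], offset: int, extra_len: int) -> List[List[int]]:
    """Split a prefix bin into ordered sub-bins only when every extra prefix is readable."""
    keys = [extra_prefix(text, pos, offset, extra_len) for pos in positions]
    if any(key is None for key in keys):
        # Guard tripped: leave the bin as a single sort range.
        return [list(positions)]
    groups = {}
    for pos, key in zip(positions, keys):
        groups.setdefault(key, []).append(pos)
    return [groups[key] for key in sorted(groups)]
```

The check runs once per suffix while the bin is being split, not inside the comparator, so it adds no work to the sort's inner loop.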

@justinblethrow-cloud justinblethrow-cloud marked this pull request as ready for review May 20, 2026 16:03