Auto-select CSI BAM index for long-chromosome genomes #1102
Merged
ensure_bam_index built and recognized only BAI indexes, which cannot address reference contigs longer than 2**29 bp (about 537 Mb). BAM files from long-chromosome genomes (wheat, barley chr2H ~665 Mb, salamander) therefore failed to index, with htslib reporting that the region "cannot be stored in a bai index". Read the BAM header first and, when any contig exceeds the BAI limit, build a CSI index (samtools index -c) instead; behavior is unchanged for typical human references, which still get a .bai. Index discovery now also recognizes .csi files, and a stale .bai is removed after a CSI is built so it cannot shadow the new index. Add test/test_samutil.py covering BAI/CSI selection, the 2**29 boundary, CSI usability beyond the BAI coordinate limit, index reuse, and stale-.bai removal.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1102      +/-   ##
==========================================
+ Coverage   70.35%   70.41%   +0.05%
==========================================
  Files          73       73
  Lines        7891     7907      +16
  Branches     1395     1400       +5
==========================================
+ Hits         5552     5568      +16
  Misses       1891     1891
  Partials      448      448
Fixes #817.
ensure_bam_index previously built and recognized only BAI indexes. The
BAI format cannot address reference contigs longer than 2**29 bp (about
537 Mb), so BAM files from long-chromosome genomes such as wheat, barley
(chr2H ~665 Mb), and salamander failed to index entirely, with htslib
reporting that the region "cannot be stored in a bai index".
CNVkit now reads the BAM header and, when any reference contig exceeds the
BAI limit, builds a CSI index (samtools index -c) automatically. Typical
human references (hg19/hg38, longest contig ~250 Mb) are unaffected: they
still get a .bai, byte-for-byte as before. An existing .csi is discovered
and reused just like .bai, and a stale .bai is removed after a CSI is
built so it cannot shadow the new index.
Tests
New test/test_samutil.py covers BAI/CSI selection, including the 2**29
boundary; CSI usability (fetch and idxstats work through a read positioned
beyond the BAI coordinate limit); reuse of an existing .csi without
rebuilding; and removal of a stale .bai once a CSI is built.
Existing coverage and batch tests pass unchanged, confirming the BAI
path has not regressed.
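The reuse and stale-index cases could be exercised against helpers along these lines. This is a sketch under stated assumptions, not the contents of test/test_samutil.py: find_existing_index and remove_stale_bai are hypothetical names, and both the candidate file names (sample.bam.csi, sample.bam.bai, sample.bai) and the order in which they are checked are assumptions.

```python
from pathlib import Path


def _bai_candidates(bam):
    """Possible BAI locations (assumed): sample.bam.bai and sample.bai."""
    return [Path(str(bam) + ".bai"), bam.with_suffix(".bai")]


def find_existing_index(bam_path):
    """Return the first index file found next to the BAM, or None.

    Assumption: a .csi is preferred over a .bai when both exist.
    """
    bam = Path(bam_path)
    for candidate in [Path(str(bam) + ".csi")] + _bai_candidates(bam):
        if candidate.is_file():
            return candidate
    return None


def remove_stale_bai(bam_path):
    """Delete leftover .bai files after a CSI build; return those removed."""
    removed = []
    for stale in _bai_candidates(Path(bam_path)):
        if stale.is_file():
            stale.unlink()
            removed.append(stale)
    return removed
```

A test would then create empty placeholder files in a temporary directory, check that an existing .csi is returned without any rebuild, and check that the .bai is gone after remove_stale_bai runs.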
Clinical impact
None. No change to numerical output or output file formats (.cnr, .cns,
.cnn, SEG, VCF) for existing workflows; the BAI code path is unchanged.
This only enables genomes that previously could not be indexed at all.