Auto-select CSI BAM index for long-chromosome genomes #1102

Merged
etal merged 2 commits into master from bug-cnvkit-0mg-samutil-csi-index
Jun 1, 2026

@etal (Owner) commented May 30, 2026

Fixes #817.

ensure_bam_index previously built and recognized only BAI indexes. The
BAI format cannot address reference contigs longer than 2**29 bp (about
537 Mb), so BAM files from long-chromosome genomes — wheat, barley
(chr2H ~665 Mb), salamander — failed to index entirely:

[E::hts_idx_check_range] Region X..Y cannot be stored in a bai index. Try using a csi index
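For context, the 2**29 figure follows from the fixed geometry of the BAI binning scheme: the smallest bins span 2**14 bp and there are five levels below the root, each 8 times finer than the one above. A minimal sketch of that arithmetic, where the function name and the rounded contig lengths are illustrative rather than taken from the patch:

```python
def max_indexable_length(min_shift: int, depth: int) -> int:
    """Span covered by a binning index: smallest bin is 2**min_shift bp,
    and each of the `depth` levels above it is 8x (2**3) wider."""
    return 1 << (min_shift + 3 * depth)


# BAI hard-codes min_shift=14 and depth=5; CSI stores both in the index
# header, which is why it can address longer contigs.
BAI_LIMIT = max_indexable_length(min_shift=14, depth=5)

# Approximate lengths quoted in this PR description (rounded, illustrative).
examples = {
    "human longest contig (~250 Mb)": 250_000_000,
    "barley chr2H (~665 Mb)": 665_000_000,
}
fits_in_bai = {name: length <= BAI_LIMIT for name, length in examples.items()}
```

Here `BAI_LIMIT` evaluates to 536870912, so the human contig fits and barley chr2H does not.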

CNVkit now reads the BAM header and, when any reference contig exceeds the
BAI limit, builds a CSI index (samtools index -c) automatically. Typical
human references (hg19/hg38, longest contig ~250 Mb) are unaffected: they
still get a .bai, byte-for-byte as before. An existing .csi is
discovered and reused just like .bai, and a stale .bai is removed after
a CSI is built so it cannot shadow the new index.
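The selection rule described above can be sketched roughly as follows. This is not the code in cnvlib/samutil.py: the helper names are hypothetical, the contig lengths are assumed to have already been read from the BAM header (for example via pysam's AlignmentFile.lengths), and the index is assumed to land at samtools' default location next to the BAM.

```python
from pathlib import Path

BAI_MAX_LEN = 2**29  # longest contig a BAI index can address


def needs_csi(contig_lengths):
    """True if any contig is strictly longer than the BAI limit."""
    return any(length > BAI_MAX_LEN for length in contig_lengths)


def index_command(bam_path, contig_lengths):
    """Build the samtools command line; '-c' requests a CSI index."""
    cmd = ["samtools", "index"]
    if needs_csi(contig_lengths):
        cmd.append("-c")
    cmd.append(str(bam_path))
    return cmd


def expected_index_path(bam_path, contig_lengths):
    """Default output name: '<bam>.csi' or '<bam>.bai' (assumed layout)."""
    suffix = ".csi" if needs_csi(contig_lengths) else ".bai"
    return Path(str(bam_path) + suffix)


def remove_stale_bai(bam_path):
    """Delete a leftover '<bam>.bai' so it cannot shadow a new CSI index.

    Returns True if a file was removed.
    """
    stale = Path(str(bam_path) + ".bai")
    if stale.exists():
        stale.unlink()
        return True
    return False
```

The comparison is strict, so a contig of exactly 2**29 bp still takes the BAI path, matching the threshold test listed under Tests.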

Tests

New test/test_samutil.py:

  • short contig → BAI is built and returned;
  • contig > 2**29 → a CSI index is built, returned, and usable (fetch and
    idxstats succeed for a read positioned beyond the BAI coordinate limit);
  • contig of exactly 2**29 → BAI (verifies the strict threshold);
  • a fresh .csi is reused without rebuilding;
  • a leftover .bai is removed once a CSI is built.
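
The reuse case can be illustrated without samtools. The sketch below assumes that "fresh" means the index file is at least as new as the BAM by modification time, and that .csi is checked before .bai. The PR description states neither, so both are assumptions, as are the function names and the placeholder files.

```python
import os
import tempfile
from pathlib import Path


def find_fresh_index(bam_path):
    """Return an existing index at least as new as the BAM, else None."""
    bam = Path(bam_path)
    bam_mtime = bam.stat().st_mtime
    # Candidate names and their order are assumptions for this sketch.
    for suffix in (".csi", ".bai"):
        candidate = Path(str(bam) + suffix)
        if candidate.exists() and candidate.stat().st_mtime >= bam_mtime:
            return candidate
    return None


def demo():
    """Exercise the reuse and stale cases with empty placeholder files."""
    with tempfile.TemporaryDirectory() as tmp:
        bam = Path(tmp) / "sample.bam"
        bam.write_bytes(b"")  # placeholder, not a real BAM
        os.utime(bam, (1000, 1000))

        csi = Path(str(bam) + ".csi")
        csi.write_bytes(b"")
        os.utime(csi, (2000, 2000))  # newer than the BAM: reusable
        reused = find_fresh_index(bam)

        os.utime(csi, (500, 500))  # older than the BAM: must be rebuilt
        stale = find_fresh_index(bam)
        return (reused.name if reused else None, stale)
```

A real test would additionally assert that samtools was not invoked in the reuse case, for instance by checking that the index's modification time is unchanged afterwards.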

Existing coverage and batch tests pass unchanged, confirming that the BAI
path has not regressed.

Clinical impact

None. No change to numerical output or output file formats (.cnr,
.cns, .cnn, SEG, VCF) for existing workflows; the BAI code path is
unchanged. This only enables genomes that previously could not be indexed
at all.

etal added 2 commits May 30, 2026 08:47
ensure_bam_index built and recognized only BAI indexes, which cannot
address reference contigs longer than 2**29 bp (about 537 Mb). BAM files
from long-chromosome genomes (wheat, barley chr2H ~665 Mb, salamander)
therefore failed to index, with htslib reporting that the region "cannot
be stored in a bai index".

Read the BAM header first and, when any contig exceeds the BAI limit,
build a CSI index (samtools index -c) instead; behavior is unchanged for
typical human references, which still get a .bai. Index discovery now
also recognizes .csi files, and a stale .bai is removed after a CSI is
built so it cannot shadow the new index.

Add test/test_samutil.py covering BAI/CSI selection, the 2**29 boundary,
CSI usability beyond the BAI coordinate limit, index reuse, and stale-.bai
removal.

codecov Bot commented May 30, 2026

Codecov Report

❌ Patch coverage is 88.88889% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.41%. Comparing base (42c8ce6) to head (1746322).

Files with missing lines    Patch %    Lines
cnvlib/samutil.py           88.88%     2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1102      +/-   ##
==========================================
+ Coverage   70.35%   70.41%   +0.05%     
==========================================
  Files          73       73              
  Lines        7891     7907      +16     
  Branches     1395     1400       +5     
==========================================
+ Hits         5552     5568      +16     
  Misses       1891     1891              
  Partials      448      448              
Flag         Coverage Δ
unittests    70.41% <88.88%> (+0.05%) ⬆️

@etal etal merged commit 34a697e into master Jun 1, 2026
13 checks passed
@etal etal deleted the bug-cnvkit-0mg-samutil-csi-index branch June 1, 2026 18:30

Development

Successfully merging this pull request may close these issues.

Error while indexing BAM files
