Skip to content

Release/5.3.0#20

Merged
rhshah merged 26 commits into
mainfrom
release/5.3.0
May 16, 2026
Merged

Release/5.3.0#20
rhshah merged 26 commits into
mainfrom
release/5.3.0

Conversation

@rhshah

@rhshah rhshah commented May 16, 2026

Copy link
Copy Markdown
Member

No description provided.

rhshah added 25 commits May 14, 2026 17:55
- Add has_structural_alt field (sticky flag, same pattern as has_n_base)
- Add is_structural param to observe() with debug_assert guard
- Add structural ALT priority in resolve(): CIGAR I/D evidence wins
  over non-structural REF when fragment reads disagree on INDELs
- Full doc comments including known Phase 3 limitation
- Line 1231: count_variant_from_cache (main counting path)
- Line 1737: count_single_variant (discovered during compilation)
- Line 2320: count_per_transcript (P4b per-transcript counting)
- resolve() call sites (1378, 1862, 2336) unchanged — structural
  info is already inside FragmentEvidence by resolve() time
…ority

- 9 test cases: INS/DEL conflict, agreement, singleton, SNP tie, REF agree
- All 4 counting invariants asserted (code-quality.md)
- Both count_bam and count_bam_binned parity verified via count_both()
- SNP regression test (test 5) confirms unchanged quality consensus
…tural priority

- Test 6: INS R1=I(2) → Phase 3, R2=I(1) → structural → ALT wins
- Test 7: DEL R1=D(2) → Phase 3, R2=D(1) → structural → ALT wins
- Confirms wrong-length reads don't set is_structural (Phase 3 non-structural)
- 11 tests total (up from 9), all with invariant + parity assertions
…ogic

- 8 resolve() tests: structural wins, quality wins, tie discard, edge cases
- 3 observe() tests: sticky flag, REF-structural guard, non-structural ALT
- 161 Rust tests total (up from 150), 0 Clippy warnings
- Tests isolated from BAM I/O for fast regression detection
- counting-metrics.md: Phase B flowchart now shows structural priority
  diamond before quality comparison, new table row, info admonition
  with Phase 3 scope and rationale, FragmentEvidence description updated
- architecture.md: comparison table updated for consistency
- Both docs mention the CIGAR I/D priority mechanism and its scope
- Test Categories: added test_indel_fragment_consensus.py entry
- File tree: added in alphabetical position
- Rust test table: added fragment consensus row (11 unit tests)
- Accuracy Validation: added INDEL fragment consensus row (11 cases)
- Updated test counts: 266 Python + 161 Rust tests
… priority

- code-quality.md: add test_indel_fragment_consensus.py to test table,
  update test counts to 266 Python + 161 Rust
- parameters.md: update fragment_qual_threshold description to note
  that INDEL structural evidence bypasses the quality threshold
Resolves systematic fragment-level INDEL discard caused by anchor-BQ
tie in quality-weighted consensus. Structural CIGAR I/D evidence now
wins unconditionally over base-quality comparison for INDEL conflicts.

Changes:
- ClassifyResult.is_structural flag + is_alt_structural() constructor
- 6 CIGAR I/D return sites tagged structural in variant_checks.rs
- FragmentEvidence.has_structural_alt sticky flag + resolve() priority
- 3 engine.rs observe() call sites threaded
- 11 Python integration tests (INS/DEL conflict, wrong-length, regression)
- 11 Rust unit tests (resolve/observe branch coverage)
- Documentation: counting-metrics, architecture, testing-guide, CHANGELOG,
  agent rules, Nextflow parameters

Test results: 266 Python + 161 Rust tests pass, 0 Clippy warnings
Steps 0-6 of the merge implementation plan:

- build: add polars>=1.0.0 as core dependency
- feat(io): add centralized io/batch.py (read_maf, scan_maf, read_parquet, write_maf)
- refactor(mfsd): replace pyarrow+csv with Polars batch helpers
- feat(models): add MergeConfig Pydantic model with validation
- feat(merge): add Polars-based multi-BAM MAF merge engine
- feat(cli): add gbcms merge subcommand (--input type:path)
- test: add 23 tests (7 batch_io + 15 merge + 1 CLI integration)

All 289 tests pass. Zero performance warnings. ruff clean.
GAP A: Fill meta columns (gbcms_status, etc.) with '' instead of null
       after outer join. Prevents null propagation to downstream tools.

GAP B: Add comprehensive 'different row count' tests:
       - test 16: 5 vs 3 rows with 2 overlapping (asymmetric counts)
       - test 17: combined columns with unmatched variants
       - test 18: annotation column nulls for right-only variants
       - test 19: meta column null-fill verification

GAP C: Log unmatched variant counts per type for monitoring.

GAP E: test 20: CLI invalid input format error path.

All 294 tests pass. merge.py at 96% coverage.
Step A: Expose fisher_exact_2x2 to Python via PyO3
  - Added #[pyfunction] wrapper in shared/stats.rs
  - Registered in lib.rs as gbcms._rs.fisher_exact_2x2
  - Same Rust implementation as per-BAM engine → numerically identical

Step B: Expand combined additive columns
  - Read-level: ref_count, alt_count → total_count, vaf
  - Read strand: ref/alt_count_forward/reverse (4 cols)
  - Fragment-level: ref/alt_count_fragment → total_count_fragment, vaf_fragment
  - Fragment strand: ref/alt_count_fragment_forward/reverse (4 cols)
  - Schema-aware: only computes columns that exist in input MAFs

Step C: Combined strand bias via Rust Fisher exact test
  - simplex_duplex_strand_bias_p_value + odds_ratio (read-level)
  - simplex_duplex_fragment_strand_bias_p_value + odds_ratio (fragment-level)
  - Computed per-row using the Rust fisher_exact_2x2 function

Total: 20 combined columns when all inputs have full column set.
Graceful degradation when strand columns are absent.

Tests: 4 new tests (21-24), 298 total passing.
Steps 7-8 of merge implementation plan:
- Nextflow MERGE_COUNTS process with bam_type samplesheet grouping
- Auto-derive --column-prefix from meta.bam_type in DNA module
- docs/cli/merge.md CLI reference
- mkdocs.yml nav entry
- Nextflow params/samplesheet docs (bam_type, merge options)
- release-guide.md updated (9 version locations)
- CHANGELOG.md comprehensive Unreleased section
- __init__.py exports (merge_mafs, MergeConfig)
- _rs.pyi type stub (fisher_exact_2x2)
- architecture.md + output-formats.md updated

298 tests | 0 lint errors | No breaking changes
Squash merging rewrites commit SHAs, breaking shared ancestry
between main and develop. This causes merge conflicts on every
changed file during the back-merge step. Regular merge commits
preserve history and make back-merges conflict-free.
Canonical release notes live in CHANGELOG.md. This standalone file
was a leftover from v5.1.0 and is not referenced by any CI, docs,
or release automation.
When bam_type is set but suffix is not provided, the output suffix
is now automatically derived as '-{bam_type}' (e.g., bam_type=duplex
produces suffix=-duplex → sample1-duplex.maf).

Priority: explicit suffix > auto-derived from bam_type > global params.suffix

This eliminates redundancy in samplesheet rows where suffix was
always just '-{bam_type}'.
Errors out early if multiple samplesheet rows share the same sample
ID without a distinguishing suffix or bam_type, which would cause
output MAFs to silently overwrite each other.
@rhshah rhshah self-assigned this May 16, 2026
@rhshah

rhshah commented May 16, 2026

Copy link
Copy Markdown
Member Author

#19

@rhshah rhshah merged commit e1bb985 into main May 16, 2026
9 checks passed
@rhshah rhshah deleted the release/5.3.0 branch May 16, 2026 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant