
Clonal lineage trees (BCR) + clone-size repertoires (TCR / flat-BCR)#4

Merged
MuteJester merged 59 commits into master from clonal-lineage-simulation
Jun 16, 2026

Conversation

@MuteJester

Owner

Replaces the star-topology expand_clones with two real clonal models, plus docs and validation.

What's new

clonal_lineage — BCR affinity-maturation lineage trees

  • Generation-synchronous Poisson birth–death + logistic carrying capacity
  • Per-division context-sensitive S5F somatic hypermutation
  • Optional affinity selection (BLOSUM62 sequence-distance proxy → fitness → offspring rate; neutral by default)
  • Living-population sampling + genotype-collapse; founder-survival guard so n_clones is reliably produced (allow_extinction opts out)
  • Full per-cell AIRR records (clone_id, lineage_*, consistent pool-derived mutation counts) + per-clone ground-truth trees (to_newick/to_fasta/to_node_table_tsv)
  • Post-fork library-prep/sequencing passes apply per cell; validate_records / expose_provenance supported; BCR-only (rejects TCR loci)
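
As a rough illustration of the first bullet, the sketch below runs a generation-synchronous Poisson birth–death process with logistic damping. It is not the engine's implementation: `poisson`, `simulate_generations`, `base_rate`, and `capacity` are hypothetical names, and the damping formula is an assumption chosen so that the mean offspring number falls to 1 at carrying capacity.

```python
import math
import random


def poisson(rng, lam):
    """Knuth's Poisson sampler; adequate for the small rates used here."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_generations(n_generations, base_rate=2.0, capacity=200, seed=0):
    """Return the population size after each generation, starting from one founder.

    Every living cell is replaced by Poisson(mean) offspring in the same step
    (generation-synchronous). The mean is an assumed logistic form: base_rate
    when the population is tiny, 1 at capacity, and 0 at twice capacity.
    """
    rng = random.Random(seed)
    pop = 1
    sizes = [pop]
    for _ in range(n_generations):
        mean = max(0.0, 1.0 + (base_rate - 1.0) * (1.0 - pop / capacity))
        # An extinct population (pop == 0) stays at zero.
        pop = sum(poisson(rng, mean) for _ in range(pop))
        sizes.append(pop)
    return sizes


if __name__ == "__main__":
    print(simulate_generations(30, seed=1))
```

A single founder can go extinct in the first few generations under this process, which is the situation the founder-survival guard in the PR is described as handling.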

clonal_repertoire — TCR & flat-BCR abundance repertoires

  • Heavy-tailed clone-size distributions (power-law / lognormal) + unexpanded-singleton fraction
  • N reads per clone through post-fork passes, genotype-collapsed into AIRR records with standard duplicate_count
  • TCR (no SHM) and flat BCR; no-post-fork shortcut for large clones
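
The clone-size idea can be sketched in a few lines. This is a minimal example under stated assumptions, not the `clonal_repertoire` code: `sample_clone_sizes`, `singleton_fraction`, `alpha`, and `max_size` are hypothetical names, and only the power-law branch is shown (a lognormal draw could be substituted with `random.lognormvariate`).

```python
import math
import random


def sample_clone_sizes(n_clones, singleton_fraction=0.5, alpha=2.0,
                       max_size=10_000, seed=0):
    """Draw clone sizes: a fixed share of singletons, the rest heavy-tailed.

    Expanded clones use an inverse-transform Pareto draw with density
    exponent alpha (> 1), floored to an integer and clipped to [1, max_size].
    """
    if alpha <= 1.0:
        raise ValueError("alpha must be greater than 1")
    rng = random.Random(seed)
    log_cap = math.log(max_size)
    sizes = []
    for _ in range(n_clones):
        if rng.random() < singleton_fraction:
            sizes.append(1)  # unexpanded singleton
            continue
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        log_size = -math.log(u) / (alpha - 1.0)
        # Work in log space so extreme draws cannot overflow.
        size = max_size if log_size >= log_cap else int(math.exp(log_size))
        sizes.append(max(size, 1))
    return sizes


if __name__ == "__main__":
    sizes = sample_clone_sizes(1000)
    print(max(sizes), sum(1 for s in sizes if s == 1))
```

Each sampled size would then correspond to the N reads per clone mentioned above.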

expand_clones is deprecated but still works. DSL ordering guards cover all three fork methods.

Docs

  • New guides: clonal-lineage.md, clonal-repertoire.md (auto-published via the mkdocs deploy)
  • Detection figure (Change-O recovers planted clones, ARI=1.0) — claims restricted to what was actually run
  • README "Clonal lineages & repertoires" section; homepage advertises the flagship clonal_lineage
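
For readers unfamiliar with the ARI figure quoted above, here is a small pure-Python version of the adjusted Rand index. It is an illustrative sketch and not the code used to produce the figure; `adjusted_rand_index` is a hypothetical helper name. An ARI of 1.0 means the recovered partition matches the planted one exactly, up to relabeling.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same items."""
    if len(truth) != len(predicted):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical
    # Pairs that are together in both labelings, in truth, and in predicted.
    both = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    in_truth = sum(comb(c, 2) for c in Counter(truth).values())
    in_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = in_truth * in_pred / total_pairs
    max_index = (in_truth + in_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (both - expected) / (max_index - expected)


if __name__ == "__main__":
    planted = [0, 0, 1, 1, 2]
    recovered = ["a", "a", "b", "b", "c"]  # same partition, different labels
    print(adjusted_rand_index(planted, recovered))
```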

Validation

Two independent expert reviews of the engine + DSL drove fixes for: TCR-SHM guard, living-population sampling, live-call cache refresh, mutation-count consistency, and library-prep wiring.

Test plan

  • Rust cargo test --all-features: green
  • Full pytest tests/: green (incl. clonal/lineage/repertoire suites + plan-split/ordering contracts)
  • mkdocs build --strict: clean; docs-website contract green

…mutation_count) so validation passes

Lineage node Outcomes were built with an empty event ledger but a nonzero
Simulation.mutation_count, causing validate_record to false-fail the
mutation-count sum invariant (MutationCountSumMismatch) on mutated nodes,
and build_airr_record to report zero per-segment SHM counts while
n_mutations was nonzero.

Fix: synthesize_shm_event_record scans sim.pool for positions where
base != germline and emits one SimulationEvent::BaseChanged per mutated
site into a single MUTATE_S5F EventRecord. mutation_count is then set
to the net pool-mutation count. The resulting Outcome is self-consistent:
build_airr_record reads the events to derive per-segment and V-subregion
counts, and mutation_count matches the event count so the validator
passes.
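
The scan described in this commit message can be illustrated in Python. The real change lives in the Rust engine; the sketch below only mirrors the idea, and `BaseChanged` (a stand-in for SimulationEvent::BaseChanged) and `synthesize_shm_events` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseChanged:
    """Hypothetical stand-in for one per-site mutation event."""
    position: int
    germline: str
    observed: str


def synthesize_shm_events(germline, observed):
    """Emit one event per site where the observed base differs from germline.

    Because only the final sequence is compared, a site that mutated and then
    reverted contributes no event, which is why the count is a net count.
    """
    if len(germline) != len(observed):
        raise ValueError("sequences must be the same length")
    return [
        BaseChanged(i, g, o)
        for i, (g, o) in enumerate(zip(germline, observed))
        if g != o
    ]


if __name__ == "__main__":
    events = synthesize_shm_events("ACGTAC", "ACCTAG")
    mutation_count = len(events)  # matches the event count by construction
    print(mutation_count, [e.position for e in events])
```

Deriving the count from the events, rather than storing it separately, is what makes the sum invariant hold by construction in this sketch.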

Consequence: PyFamilyOutcome::airr_records no longer needs the manual
pool_mutation_counts dict-overwrite path. It now delegates to
build_airr_record directly, removing the PoolMutCounts struct and
pool_mutation_counts helper (~90 lines net reduction).
…live sampling + survival guard); ARI still 1.0
…truth trees) instead of confusing repertoire+mutate
…ge + clonal_repertoire); flag expand_clones as legacy
…ract to resolve against the published mkdocs site (site_docs/)
…age/clonal_repertoire) at DSL time

The duplicate-fork guards in expand_clones() and clonal_lineage() only
checked a subset of the three fork step types, so stacking two different
fork methods (e.g. clonal_repertoire().clonal_lineage()) slipped past the
DSL guard and surfaced as a confusing 'unsupported pipeline step type'
TypeError at compile(). All three guards now check all three fork types
and raise the canonical 'once per pipeline' message. Adds a parametrized
regression test covering every ordered pair.
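
A minimal sketch of a guard that checks all three fork types is shown below. The `Pipeline` class, the `_add_fork` helper, the exact error text, and the choice of `ValueError` are assumptions made for illustration; only the three method names and the "once per pipeline" wording come from the commit message.

```python
FORK_STEPS = ("expand_clones", "clonal_lineage", "clonal_repertoire")


class Pipeline:
    """Toy DSL builder that records step names in order."""

    def __init__(self):
        self.steps = []

    def _add_fork(self, name):
        # Check every fork type, not just the one being added.
        existing = next((s for s in self.steps if s in FORK_STEPS), None)
        if existing is not None:
            raise ValueError(
                f"a clonal fork may appear only once per pipeline "
                f"(found {existing!r}, cannot add {name!r})"
            )
        self.steps.append(name)
        return self

    def expand_clones(self):
        return self._add_fork("expand_clones")

    def clonal_lineage(self):
        return self._add_fork("clonal_lineage")

    def clonal_repertoire(self):
        return self._add_fork("clonal_repertoire")


if __name__ == "__main__":
    try:
        Pipeline().clonal_repertoire().clonal_lineage()
    except ValueError as err:
        print(err)
```

Routing all three methods through one shared check means a new fork type only has to be added to `FORK_STEPS` to be covered.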
@MuteJester MuteJester merged commit 304c532 into master Jun 16, 2026
13 checks passed
MuteJester added a commit that referenced this pull request Jun 16, 2026
…e identity, gapped projection

Addresses critic findings on the novel-allele slice:
- #1 functional validation: synthesized V/J coding sequence is checked for an
  intact conserved anchor codon (Cys/Trp|Phe) and stop-free coding frame; a
  broken variant is rejected unless allow_nonfunctional=True (then kept + marked
  non-functional). Closes the 'nonfunctional emitted as productive' gap.
- #2 gene identity: the novel allele's gene is taken from its NAME and must equal
  the base allele's gene; dropped the gene= override that left allele.gene stale.
- #3 anchor: inherited from base (correct for same-length/substitution-only
  variants — the conserved residue does not move) and validated to remain intact;
  no reliance on the unavailable _native anchor resolver.
- #4 name uniqueness enforced across all segments + novel set (prevents truth-table
  mislabeling).
- #5 to_tsv now emits the 'novel' column.
- #6 substitutions projected onto the gapped sequence (no stale gapped_seq).
- #7 mutation positions/bases type-checked with clean ValueErrors.
Tests expanded: stop/anchor rejection + allow_nonfunctional, gene-mismatch,
cross-segment collision, tsv export, type-check.
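
The functional check in item #1 could look roughly like the following. This is a sketch under assumptions, not the project's code: `check_functional`, its parameters, and the returned problem strings are hypothetical, and `anchor` is assumed to be the 0-based index of the anchor codon's first base.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Conserved anchor residues: Cys for V segments, Trp or Phe for J segments.
ANCHOR_CODONS = {
    "V": {"TGT", "TGC"},
    "J": {"TGG", "TTT", "TTC"},
}


def check_functional(seq, anchor, segment, frame=0):
    """Return a list of problems; an empty list means the variant looks functional."""
    seq = seq.upper()
    problems = []
    # Walk complete codons in the given reading frame.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if any(codon in STOP_CODONS for codon in codons):
        problems.append("stop codon in coding frame")
    if seq[anchor:anchor + 3] not in ANCHOR_CODONS[segment]:
        problems.append("anchor codon not intact")
    return problems


if __name__ == "__main__":
    print(check_functional("GAGGTGCAGTGT", anchor=9, segment="V"))  # intact
    print(check_functional("GAGTAGCAGTGT", anchor=9, segment="V"))  # stop codon
    print(check_functional("GAGGTGCAGTGG", anchor=9, segment="V"))  # lost Cys
```

A caller implementing the allow_nonfunctional behaviour would reject a variant with a non-empty problem list by default, or keep it and mark it non-functional when the flag is set.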