Introduce a MutantTranscript abstraction that represents the result of applying one or more variants to a reference transcript — carrying the mutated cDNA and protein sequences along with provenance. This reshapes effect annotation from "compute the delta against the reference by reasoning about offsets" to "construct the mutant sequence, translate it, compare to the reference protein."
The old and new annotators coexist as pluggable EffectAnnotator implementations. SNVs (and other trivial point variants) take a shared fast path in both; the heavy machinery only fires for variants that need it. The EffectCollection records which annotator produced it, and serialized output carries that provenance in a header so it's not lost.
This is a consolidation point. Almost every open item on the roadmap asks for the same abstraction, and several existing bugs were caused by not having it. Done well, this simplifies the code rather than complicating it.
Motivation
The current effect code lives in predict_in_frame_coding_effect and cousins. It decides between Silent / PrematureStop / StopLoss / Insertion / Deletion / Substitution / ... by reasoning about offsets, shared prefixes, and boundary conditions, without ever materializing the full mutant protein and comparing it to the reference. That approach has real costs:
Integration: Exacto (Add loader for Exacto output formats #260) and Isovar (see below) produce exactly this object already. Today we'd have to tear apart their output to jam it into varcode's existing per-variant Effect model. With MutantTranscript, they import cleanly.
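Why this unlocks the roadmap
Apply germline variants first → produces a germline MutantTranscript. Somatic annotation diffs the final mutant against germline, not reference. One-line composition.
… MutantTranscript per haplotype. Cis variants go in the same one; trans variants go in different ones. Joint effects fall out of the diff.
Each splice outcome (normal, exon skip, cryptic donor, intron retention) is a different MutantTranscript with a plausibility score. The "possibility set" is just List[MutantTranscript].
RNA observations are MutantTranscript objects. Importing RNA support means attaching evidence (read counts, fragment IDs) to the right MutantTranscript.
… MutantTranscript joining segments of two reference transcripts. Doesn't fit the offset model at all; fits here naturally.
… translate-structs output is a MutantTranscript. Loading = construct directly, no effect inference needed.
… MutantTranscript from a reference span.
mutated_sequence on effects: … MutantTranscript reference.
One MutantTranscript per transcript per variant set; no "top priority" coupling.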
Pluggable EffectAnnotator implementations
The old and new annotators are not a transitional dual-implementation; they're distinct implementations that coexist behind a shared interface. "Annotator" (rather than "Predictor") matches the field's vocabulary — VEP, SnpEff, and ANNOVAR are all called annotators — and avoids ML connotations that don't apply to varcode's deterministic rule-driven computation.
Selection

    # Global default
    varcode.set_default_annotator("sequence_diff")

    # Per-call override
    variant.effects(annotator="legacy")
    variant.effects(annotator=my_custom_annotator)

    # Scoped override (useful for A/B testing)
    with varcode.use_annotator("sequence_diff"):
        effects = vc.effects()
Note that Variant.effects() keeps its name — it's the user-facing accessor and doesn't need the churn. "Annotator" appears at the module / backend layer.
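One way the three selection modes could fit together is a name-keyed registry plus a context variable for the default. This is a minimal sketch under assumptions, not varcode's implementation: `register_annotator`, `resolve_annotator`, and the module-level `_REGISTRY` are hypothetical names introduced here, and annotators are modelled as plain callables.

```python
import contextlib
import contextvars

# Hypothetical registry mapping annotator names to callables.
_REGISTRY = {}
# The default is held in a ContextVar so scoped overrides can be undone cleanly.
_DEFAULT = contextvars.ContextVar("default_annotator", default="legacy")


def register_annotator(name, annotator):
    """Make an annotator selectable by name."""
    _REGISTRY[name] = annotator


def set_default_annotator(name):
    """Change the default annotator for the current context."""
    if name not in _REGISTRY:
        raise KeyError(f"unknown annotator: {name!r}")
    _DEFAULT.set(name)


@contextlib.contextmanager
def use_annotator(name):
    """Temporarily override the default, restoring the previous one on exit."""
    if name not in _REGISTRY:
        raise KeyError(f"unknown annotator: {name!r}")
    token = _DEFAULT.set(name)
    try:
        yield
    finally:
        _DEFAULT.reset(token)


def resolve_annotator(annotator=None):
    """Accept a registered name, a callable, or None (meaning the default)."""
    if annotator is None:
        annotator = _DEFAULT.get()
    if isinstance(annotator, str):
        return _REGISTRY[annotator]
    return annotator
```

In this sketch `Variant.effects(annotator=...)` would simply pass its argument through `resolve_annotator`, so a per-call override always wins over the scoped or global default.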
Guarantees
Same Effect output types. Both annotators return the existing Substitution, Silent, PrematureStop, etc. classes. Downstream code is unaffected by annotator choice unless it explicitly asks about evidence/provenance that only one annotator produces.
Feature declaration. Each annotator declares its supports set. The legacy annotator supports {"snv", "indel", "mnv"}. The sequence_diff annotator supports those plus {"splice_set", "sv", "phased"}. Asking the legacy annotator to handle an SV is an explicit UnsupportedVariantError, not silent wrong output.
Third-party annotators. Isovar and Exacto can implement their own annotator (taking their assembled/annotated output as input) and register it with the varcode annotator registry. Users get a consistent API regardless of evidence source.
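The feature-declaration guarantee can be illustrated with a small sketch. The `supports` sets and the `UnsupportedVariantError` name come from the text above; the `SketchAnnotator` class and its `check_supported` method are hypothetical stand-ins for whatever shape the real interface takes.

```python
class UnsupportedVariantError(ValueError):
    """Raised when an annotator is asked to handle a variant kind it does not declare."""


class SketchAnnotator:
    """Hypothetical annotator shell that only models the `supports` declaration."""

    def __init__(self, name, supports):
        self.name = name
        self.supports = frozenset(supports)

    def check_supported(self, variant_kind):
        # Fail loudly instead of producing silently wrong output.
        if variant_kind not in self.supports:
            raise UnsupportedVariantError(
                f"{self.name} annotator does not support {variant_kind!r}; "
                f"supported kinds: {sorted(self.supports)}"
            )


LEGACY = SketchAnnotator("legacy", {"snv", "indel", "mnv"})
SEQUENCE_DIFF = SketchAnnotator(
    "sequence_diff", LEGACY.supports | {"splice_set", "sv", "phased"}
)
```

Calling `LEGACY.check_supported("sv")` raises, while the same call on `SEQUENCE_DIFF` passes, which is the behaviour the guarantee describes.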
Why pluggable annotators, not a flag
A flag flips a global once; annotators are a stable contract. Concretely this gets us:
A/B testing. Users can run both annotators on real data and report regressions without patching varcode.
Extension point. Downstream tools ship their own annotator; we don't absorb them all into varcode.
Graceful deprecation. The legacy annotator stays available for users who depend on exact-byte-for-byte compatibility even after we flip the default. Removing it is a separate decision.
Fast path for SNVs (and small indels)
SNVs dominate any realistic somatic VCF — typically >95% of records. The design must not impose the full MutantTranscript construction + translation cost on them. Both annotators use a shared fast path:
    def annotate_on_transcript(variant, transcript, context=None):
        # Fast path: single-codon SNV with no adjacent variants, no splice
        # boundary, no germline in the same codon, no phasing complications.
        if _is_trivial_point_variant(variant, transcript, context):
            return _fast_path_annotate(variant, transcript)
        # Otherwise: construct MutantTranscript, translate, diff.
        return _slow_path_annotate(variant, transcript, context)
Requirements for the fast path:
Stays identical to the 2.0.0 behaviour for point variants. No new bugs introduced, no behavioural drift.
No MutantTranscript construction. The fast path reads one reference codon, applies one base substitution, translates the single codon, and emits the Effect. No full-sequence materialization.
Shared between annotators. Both the legacy and sequence_diff annotators dispatch to the same fast-path code. This guarantees they agree on the common case and removes it as a source of A/B divergence.
Opt out when context would change the answer. If germline variants land in the same codon, if the variant is in a phase block with a near neighbour, or if the variant is within 3bp of an exon-intron boundary, fall through to the slow path. Getting this triage right matters: too eager on the fast path → wrong results; too conservative → no perf gain.
The slow path is where SVs, phased haplotypes, splice possibility sets, germline-aware annotation, and Isovar evidence integration live. By the time a variant reaches the slow path, it's because we already know it's non-trivial.
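To make the "one codon in, one Effect out" requirement concrete, here is a rough sketch of the single-codon step. It is an illustration under assumptions: `fast_path_snv` is a hypothetical helper, it works on a bare coding sequence with a 0-based offset, it returns effect names as strings rather than varcode Effect objects, and it ignores special cases such as start-codon changes and the triage conditions listed above.

```python
# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def fast_path_snv(reference_cds, cds_offset, alt_base):
    """Classify an SNV by translating only the codon it falls in."""
    codon_start = cds_offset - cds_offset % 3
    ref_codon = reference_cds[codon_start:codon_start + 3]
    within = cds_offset - codon_start
    # Apply the single base substitution inside the codon.
    alt_codon = ref_codon[:within] + alt_base + ref_codon[within + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "Silent"
    elif alt_aa == "*":
        kind = "PrematureStop"
    elif ref_aa == "*":
        kind = "StopLoss"
    else:
        kind = "Substitution"
    return kind, ref_aa, alt_aa
```

No full-sequence copy is made: the function slices three bases, edits one, and does two table lookups, which is the cost profile the fast path is meant to preserve.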
Expected performance profile
Typical somatic VCF (10k variants, ~95% SNVs): roughly 95% of annotations hit the fast path. Runtime should be at or below the 2.0.0 baseline, and potentially better because the fast path is more focused than the current monolithic branching.
Clinical exome with germline + somatic + phasing: lower fast-path hit rate, but the slow path only runs when context genuinely requires it.
SV-heavy long-read VCF: essentially all slow path, but volume is much lower (SVs are rarer than SNVs).
EffectCollection records its annotator; serialization preserves it
An EffectCollection is the output of running an annotator against a set of variants + transcripts. The collection should know which annotator produced it so that:
Downstream consumers can decide whether to trust results from a particular annotator version
Serialized files are self-describing (you can look at the top of a CSV and see what produced it)
A/B comparisons can assert "this collection came from sequence_diff, that one from legacy"
New fields on EffectCollection
    class EffectCollection(Collection):
        annotator: str                  # e.g. "sequence_diff"
        annotator_version: str          # e.g. "1.0.0"
        varcode_version: str            # e.g. "2.1.0"; the varcode version at annotation time
        reference: str | None           # e.g. "GRCh38 (Ensembl 81)"
        annotated_at: datetime | None   # when annotation ran
Populated automatically when an annotator runs; preserved across filter/groupby/clone_with_new_elements operations so derived collections keep provenance.
CSV header metadata
Today to_csv (inherited from sercol) emits just column names and rows. Extend it to prepend #-prefixed header lines carrying the provenance above:
This is a common convention (GFF3, VCF, MAF headers all use #-prefixed lines), and pandas' read_csv(comment='#') skips them cleanly for anyone who doesn't care about provenance.
API
    collection.to_csv("effects.csv")                        # writes header by default
    collection.to_csv("effects.csv", include_header=False)  # opt out for legacy consumers

    # New classmethod to round-trip:
    EffectCollection.from_csv("effects.csv")                # recovers variants + metadata
from_csv reads the header lines first to recover annotator, annotator_version, varcode_version, etc., then parses the CSV body to rehydrate effects. A mismatch between the serialized annotator and the current environment produces a clear warning (or opt-in strict failure) rather than silent reinterpretation.
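A minimal sketch of the header round-trip, using only the standard library so it stays self-contained. The `# key=value` line shape is an assumption about the eventual format, and `write_csv_with_header` / `read_csv_with_header` are hypothetical helpers rather than sercol or varcode functions. The reader also assumes no data row starts with `#` and no quoted field spans lines.

```python
import csv
import io


def write_csv_with_header(metadata, fieldnames, rows):
    """Return CSV text with one '# key=value' line per provenance field on top."""
    buf = io.StringIO()
    for key, value in metadata.items():
        buf.write(f"# {key}={value}\n")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_csv_with_header(text):
    """Split '#'-prefixed provenance lines from the CSV body and parse both."""
    metadata = {}
    body_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            key, sep, value = line[1:].strip().partition("=")
            if sep:  # ignore comment lines that are not key=value
                metadata[key.strip()] = value.strip()
        else:
            body_lines.append(line)
    rows = list(csv.DictReader(body_lines))
    return metadata, rows
```

Consumers that ignore provenance can still read such a file with pandas via `read_csv(comment='#')`, as noted above.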
Applies to other formats too
The same pattern extends to any format with a comment convention:
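JSON: top-level metadata object alongside effects list
VCF output (vcf_output.py): additional ## header lines (##varcode=2.1.0, ##annotator=sequence_diff, ...)
MAF output (if we add one): # comment lines at the top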
Isovar integration (first-class evidence source)
Isovar already assembles RNA reads into mutant coding sequences, incorporating proximal germline/somatic variants and splicing alterations, then translates them. An IsovarResult is essentially a MutantTranscript with read-level provenance.
With this abstraction in place, Isovar integration becomes:
Isovar produces MutantTranscript candidates — one per assembled contig, each with supporting read count, assembled cDNA, and translated protein.
… MutantTranscript candidates differ (e.g., two plausible splice outcomes from Prototype multi-effect candidates for splice-site variants #262), Isovar says which one RNA actually supports and at what coverage.
Concretely, Isovar can ship its own annotator (isovar.varcode_annotator, registered with varcode) that, given a variant and a transcript, returns effects computed from Isovar's assembled haplotype rather than inferred from DNA alone. Or, more compositionally, Isovar produces MutantTranscript candidates and the sequence_diff annotator consumes them directly — either model works with the annotator interface.
The direction is: varcode defines the abstraction, Isovar (and Exacto) populate it with evidence. Currently Isovar wraps varcode and patches around varcode's limitations. After this refactor, Isovar becomes a plugin-style evidence provider or its own registered annotator.
Performance is a hard constraint
A naive implementation would materialize full transcript sequences for every variant and blow up memory and compute. The design must not regress performance. Measures:
Fast path for SNVs (above) — the dominant case avoids the full abstraction.
Delta representation, not full copy: MutantTranscript stores edits against the reference (anchor transcript ID + list of edits). Full sequences are a computed property, materialized only when needed.
Lazy translation with memoization: mutant_protein_sequence is @memoized_property, not eager.
Benchmark before and after: add a performance test fixture (time to annotate a representative VCF of ~10k variants against GRCh38) and require no regression on that benchmark. Run it in CI.
Profile the hot path: most variants exercise the fast path. The abstraction must degenerate to the fast case when there's no complexity to track.
Performance acceptance criteria
Time to annotate a typical 10k-variant somatic VCF: ≤ current baseline + 5% with the sequence_diff annotator as the default. (The 5% budget allows for annotator dispatch overhead but not much else — the fast path should absorb most of this.)
Memory: peak RSS ≤ current baseline + 20%. Most of this goes to slow-path MutantTranscript objects plus any cached sequences (which we can LRU-bound).
Startup time: no regression (the refactor shouldn't touch pyensembl initialization).
The legacy annotator must remain as-fast-as-2.0.0 — it's the reference baseline the sequence_diff annotator is measured against.
If those budgets can't be met, the sequence_diff annotator doesn't become the default (it stays opt-in). The legacy annotator stays the default and remains fully supported.
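The time budget above could be enforced with a small comparison helper in the benchmark fixture. This is a sketch under assumptions: `best_of` and `within_budget` are hypothetical names, and a real CI job would also need a stable machine and a stored baseline, neither of which is modelled here.

```python
import time


def best_of(func, repeats=3):
    """Return the fastest wall-clock time (seconds) over several runs of func."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def within_budget(baseline, candidate, tolerance):
    """True if candidate stays within baseline * (1 + tolerance)."""
    return candidate <= baseline * (1.0 + tolerance)
```

With the numbers from the criteria, the runtime check would use `tolerance=0.05` and the peak-RSS check `tolerance=0.20`.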
API sketch
    @dataclass(frozen=True)
    class Edit:
        """A normalized variant edit in transcript cDNA coordinates."""
        cdna_start: int
        cdna_end: int       # exclusive; equal to cdna_start for insertions
        replacement: str    # empty for pure deletions


    class MutantTranscript:
        """
        A reference transcript with a set of variant edits applied,
        plus optional provenance (which variants, which haplotype, what evidence).
        """
        transcript: pyensembl.Transcript
        edits: Tuple[Edit, ...]    # ordered, non-overlapping
        provenance: Provenance     # variants, haplotype, evidence

        # Lazy / memoized views
        @memoized_property
        def cdna_sequence(self) -> str: ...

        @memoized_property
        def protein_sequence(self) -> str: ...    # translated to first stop

        @memoized_property
        def uses_three_prime_utr(self) -> bool: ...

        # Plausibility / confidence (for candidate sets)
        plausibility: float = 1.0
        evidence: Optional[Evidence] = None


    # The sequence_diff annotator's slow path becomes a diff
    def _slow_path_annotate(variants, transcript, context):
        mutant = MutantTranscript.apply(transcript, variants, context)
        if mutant.protein_sequence == transcript.protein_sequence:
            return Silent(...)
        # ... remaining branches are expressed as protein-sequence diffs,
        # not offset arithmetic.
Edit insertions, deletions, substitutions, and splice-junction edits are all expressible. SV-style edits that join two transcripts are a separate JoinEdit type that references a second transcript.
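The delta representation and lazy translation can be sketched in runnable form. This is a simplified illustration, not the shipped data model: `SketchMutantTranscript` is a hypothetical class that holds a bare coding sequence instead of a pyensembl.Transcript, ignores UTRs and provenance, and uses `functools.cached_property` as a stand-in for `@memoized_property`.

```python
from dataclasses import dataclass
from functools import cached_property
from typing import Tuple

# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


@dataclass(frozen=True)
class Edit:
    cdna_start: int
    cdna_end: int       # exclusive; equal to cdna_start for insertions
    replacement: str    # empty for pure deletions


@dataclass(frozen=True)
class SketchMutantTranscript:
    reference_cds: str
    edits: Tuple[Edit, ...] = ()   # ordered, non-overlapping

    @cached_property
    def cds_sequence(self) -> str:
        """Materialize the mutant sequence only when first requested."""
        parts, cursor = [], 0
        for edit in self.edits:
            parts.append(self.reference_cds[cursor:edit.cdna_start])
            parts.append(edit.replacement)
            cursor = edit.cdna_end
        parts.append(self.reference_cds[cursor:])
        return "".join(parts)

    @cached_property
    def protein_sequence(self) -> str:
        """Translate up to (not including) the first stop codon."""
        seq = self.cds_sequence
        residues = []
        for i in range(0, len(seq) - 2, 3):
            aa = CODON_TABLE[seq[i:i + 3]]
            if aa == "*":
                break
            residues.append(aa)
        return "".join(residues)
```

An object with no edits costs only a reference to the original string, and the silent-variant check from the slow path reduces to comparing two `protein_sequence` values.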
Migration plan
1. Land the EffectAnnotator interface and fast path: introduce EffectAnnotator, the registry, and the shared fast path. The legacy annotator wraps the existing code; the sequence_diff annotator is a stub that falls back to legacy. Default stays legacy. This PR is infrastructure only, no behaviour change.
2. EffectCollection provenance + serialized headers: add the annotator / annotator_version / varcode_version fields, update to_csv to emit header comments, add from_csv. This can land in parallel with step 1.
3. Implement MutantTranscript + sequence_diff slow path: for point variants and small indels first. Add a parity test harness that runs both annotators on the full test corpus and fails on any disagreement.
4. Add the performance benchmark: baseline both annotators on a 10k-variant VCF. Establish the regression budget in CI.
5. Extend sequence_diff to splice, SV, phasing: each is a new slow-path capability; fast path remains the same.
6. Flip the default: once sequence_diff is at parity on the test corpus and within the performance budget. Legacy stays available.
7. Remove legacy: separate decision, separate release. Only after the sequence_diff annotator has been the default for at least one release cycle without regressions.
Each step lands as a separate PR. No single change is bigger than what a reviewer can hold in their head.
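The parity test harness mentioned in the plan might look roughly like the following. It is a sketch under assumptions: annotators are modelled as plain callables from variant to effect, effects are assumed to support `==`, and `find_disagreements` is a hypothetical helper name.

```python
def find_disagreements(variants, annotate_a, annotate_b):
    """Run two annotators over the same variants and collect every mismatch."""
    disagreements = []
    for variant in variants:
        effect_a = annotate_a(variant)
        effect_b = annotate_b(variant)
        if effect_a != effect_b:
            disagreements.append((variant, effect_a, effect_b))
    return disagreements


def assert_parity(variants, annotate_a, annotate_b):
    """Fail with a readable summary if the annotators disagree anywhere."""
    mismatches = find_disagreements(variants, annotate_a, annotate_b)
    if mismatches:
        lines = [f"{v!r}: {a!r} != {b!r}" for v, a, b in mismatches]
        raise AssertionError("annotators disagree:\n" + "\n".join(lines))
```

Collecting all mismatches before failing, rather than stopping at the first one, makes a single CI run show the full extent of any drift.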
Sub-issues (to be filed once this issue is approved in principle)
EffectAnnotator interface + registry + fast path: lands first, enables everything else.
EffectCollection provenance + header-metadata serialization — can land in parallel.
Design of Edit / MutantTranscript data model — first prerequisite for the sequence_diff slow path.
Sequence-diff annotator for coding variants — replaces the offset-based code in the slow path.
Performance benchmark suite — lands before the sequence_diff default flip.
Parity test harness — runs both annotators on the full corpus, catches drift.
Isovar integration surface — what shape of MutantTranscript does Isovar import, and what evidence does it carry? Does Isovar get its own annotator or compose with sequence_diff?
… MutantTranscript loader: already planned in Add loader for Exacto output formats #260, but after this refactor it becomes much thinner.
… MutantTranscript: enabler for Prototype multi-effect candidates for splice-site variants #262 and Incorporate RNA-level evidence for variant effects #259.
Current landed state (2026-04-20)
Partial infrastructure has already shipped under this umbrella:
MutantTranscript + TranscriptEdit + ReferenceSegment data classes (varcode/mutant_transcript.py). The SV analog of JoinEdit from the API sketch above ended up as reference_segments: Tuple[ReferenceSegment, ...] — a tuple of contiguous-reference-sequence pointers, each with (source, start, end, strand, label). A fusion is two segments; an inversion is three (forward / reverse-complement / forward); an assembled-allele SV is one synthetic segment. This is strictly more general than JoinEdit (which only covered the two-transcript join case) and handles inversions, long-read assemblies, and intergenic translocations in the same shape.
apply_variant_to_transcript(variant, transcript) produces a MutantTranscript for point variants with cdna_sequence and mutant_protein_sequence populated (mitochondrial codon table selected per-transcript).
StructuralVariantAnnotator (PR Add StructuralVariantAnnotator with multi-outcome SV effects (#252) #333) classifies SVs into LargeDeletion / LargeDuplication / Inversion / GeneFusion / TranslocationToIntergenic, but does not yet populate MutantTranscript.reference_segments on the returned effects — see the follow-ups below.
Outcome (Add unified Outcome type for multi-outcome effects (#299) #330) defines the unified outcome shape (effect, probability, source, evidence) and MultiOutcomeEffect.outcomes lifts the existing candidates tuple into it.
Gap: SV effects don't yet produce sequences
The SV annotator always constructs effects with candidates=None, which defaults to (self,), so every SV effect today is a single-outcome wrapper with no mutant_transcript attached. External tools that want to read "what protein does this fusion produce" cannot do so without running their own fusion math. This is the main thing this issue enables after refactor.
Gap: splice and SV outcomes don't yet share the Outcome.effect contract
SpliceOutcomeSet.candidates contains SpliceCandidate dataclasses (not MutationEffect subclasses). The inherited MultiOutcomeEffect.outcomes wraps them as Outcome(effect=<SpliceCandidate>), violating the declared contract (Outcome.effect: MutationEffect). StructuralVariantEffect.candidates holds real MutationEffect instances, so SV outcomes are correctly typed — but downstream consumers still have to isinstance-branch because SpliceCandidate.coding_effect hides the actual effect two hops deep. This issue's refactor is where those two producers should converge.
Outcome contract (SV + splice unification)
After #271, every MultiOutcomeEffect subclass must guarantee:
    effect.outcomes  # -> Tuple[Outcome, ...]

    # For every outcome o in effect.outcomes:
    isinstance(o.effect, MutationEffect)                 # strict; no dataclass impostors
    o.effect.short_description                           # always present
    getattr(o.effect, "mutant_protein_sequence", None)   # present when computable, None when not
    getattr(o.effect, "mutant_transcript", None)         # present for SV-shape effects, None for point-variant effects
    o.probability                                        # float in [0, 1] or None (unscored)
    o.source                                             # "varcode", "spliceai", "isovar", ...
    o.evidence                                           # open-ended dict, source-specific shape
This collapses three ad-hoc multi-outcome shapes (ExonicSpliceSite.alternate_effect, SpliceOutcomeSet, StructuralVariantEffect) onto the same iteration pattern. Provenance (probability, source, evidence dict) stays on the Outcome; the effect is always a MutationEffect carrying whatever the DNA-level classification produced; mutant_protein_sequence lookup is one-hop regardless of outcome kind.
The concrete work (migrating SpliceCandidate contents onto Outcome fields + MutationEffect subclasses) is tracked in #339.
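A contract like this lends itself to a mechanical check in the test suite. The sketch below is illustrative only: `MutationEffect`, `Outcome`, and `SpliceCandidate` are minimal stand-ins defined here rather than the varcode classes, and `contract_violations` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


class MutationEffect:
    """Stand-in for varcode's effect base class."""
    short_description = "stand-in effect"


@dataclass
class SpliceCandidate:
    """Stand-in for a dataclass that is not a MutationEffect."""
    label: str


@dataclass
class Outcome:
    effect: Any
    probability: Optional[float] = None
    source: str = "varcode"
    evidence: Dict[str, Any] = field(default_factory=dict)


def contract_violations(outcomes):
    """Return a human-readable list of outcome-contract violations."""
    problems = []
    for i, o in enumerate(outcomes):
        if not isinstance(o.effect, MutationEffect):
            problems.append(
                f"outcome {i}: effect is {type(o.effect).__name__}, "
                "not a MutationEffect"
            )
        if o.probability is not None and not 0.0 <= o.probability <= 1.0:
            problems.append(f"outcome {i}: probability {o.probability} outside [0, 1]")
    return problems
```

Running such a check over every MultiOutcomeEffect subclass would flag the SpliceCandidate wrapping described above until the migration lands.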
These address the gaps identified above and the SV-cohesiveness audit. All are blocked by #271 — they assume the annotator interface, MutantTranscript materialization, and unified outcome shape are in place:
This issue is not about changing the public Effect classes. Substitution, Silent, PrematureStop, etc. remain the output types. What changes is how they are computed and which annotator does the computing.
Variant.effects() keeps its name and shape — "annotator" lives at the module/backend layer, not in the user-facing accessor.
This is not a rewrite. It's a refactor that lets us delete complexity from the offset-based code (eventually) while opening a substrate for the roadmap features.
EffectCollection already has a working deserialization path:
    EffectCollection.from_dict(ec.to_dict())   # works today
    EffectCollection.from_json(ec.to_json())   # works today
    pickle.loads(pickle.dumps(ec))             # works today
These are inherited from Serializable (via sercol). Covered by tests/test_effect_collection_serialization.py.
What's missing (gaps this issue needs to close):
from_csv: to_csv exists (inherited), but the inverse does not. This is straightforward once we add header metadata — the reader parses # key=value lines for provenance, then uses pd.read_csv(comment='#') for the body.
Annotator / version in round-trips: because annotator/annotator_version/varcode_version don't exist as fields yet, neither to_dict nor to_json carry them. Adding the fields closes this automatically for dict/JSON; CSV needs the header support above.
Cross-varcode-version compatibility: a collection serialized by 2.0.0 and read by 2.1.0 today works (same schema); after this refactor the deserializer should check the embedded varcode_version and either accept, warn, or error based on a policy (probably: accept minor-version drift, warn on major).
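The tentative version policy (accept minor-version drift, warn on major) could be expressed as a small pure function. This is a sketch under assumptions: `check_varcode_version` is a hypothetical name, versions are assumed to be dotted numeric strings, and the `strict` flag models the opt-in strict failure mentioned for from_csv.

```python
def check_varcode_version(serialized, current, strict=False):
    """Return 'accept', 'warn', or 'error' for a serialized-vs-current version pair."""
    serialized_major = int(serialized.split(".")[0])
    current_major = int(current.split(".")[0])
    if serialized_major == current_major:
        # Same major version: minor and patch drift is tolerated.
        return "accept"
    # Major version mismatch: warn by default, fail only when asked to.
    return "error" if strict else "warn"
```

The deserializer would call this once per file and translate the result into a `warnings.warn` call or a raised exception.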
Ingesting an existing EffectCollection CSV/JSON:
    ec = EffectCollection.from_csv("effects.csv")    # NEW: parses header + body
    print(ec.annotator, ec.annotator_version, ec.varcode_version)

    ec = EffectCollection.from_json("effects.json")  # EXISTS; extended to carry metadata
Migration from Serializable to dataclasses
python-serializable pre-dates dataclasses landing in the stdlib. Most of what it does — __init__ generation, __repr__, __eq__, to_dict/from_dict — is now covered more cleanly by @dataclass(frozen=True) plus a bit of serialization helper code. Moving off Serializable where we can reduces dependencies, makes the types easier to reason about (standard Python semantics, no custom metaclass), and lines up with how the rest of the Python ecosystem works in 2026.
What Serializable provides that dataclasses don't natively
Polymorphic round-tripping: MutationEffect can be any of 30+ subclasses (Substitution, PrematureStop, ExonicSpliceSite, …). Deserializing a list of effects requires knowing which subclass to rebuild. Serializable handles this via a class-name registry keyed on __class__.__name__ in the dict.
Nested serialization: recursively walks fields that are themselves Serializable instances. Dataclasses need asdict() for this, which is fine but doesn't roundtrip subclasses automatically.
Everything else that Serializable does is redundant with @dataclass(frozen=True).
Proposed migration
Start with leaf value types. Edit, Provenance, Evidence, MutantTranscript are new; make them @dataclass(frozen=True) from the start. They have no polymorphic subclass structure, so dataclasses cover everything.
Migrate existing leaf types. Variant currently extends Serializable. It has no subclass hierarchy (all variants are Variant instances, distinguished by fields); it's a pure value type. Converting to @dataclass(frozen=True) is a mechanical change that also gives us a proper __hash__ for free (useful for using variants as dict keys, which we already do).
Write a tiny polymorphic shim for the effect hierarchy. ~10 lines of code: EFFECT_CLASSES = {cls.__name__: cls for cls in all_subclasses(MutationEffect)}, where all_subclasses is a small recursive helper (type.__subclasses__() takes no arguments and returns only direct subclasses, so the walk has to be written out); from_dict looks up the right class by name and passes the remaining fields to its constructor. This lets the effect classes become dataclasses too, while preserving polymorphic round-trip.
Drop the Serializable dependency. Once the last user migrates, remove it from pyproject.toml.
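The polymorphic shim from the third step can be sketched as follows. The class hierarchy and field names here are hypothetical stand-ins chosen to show a two-level subclass chain, and the `__class__` dict key mirrors the registry key described above for Serializable; the real effect classes have different fields.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MutationEffect:
    """Stand-in base class; the real one lives in varcode."""


@dataclass(frozen=True)
class Substitution(MutationEffect):
    aa_pos: int
    aa_ref: str
    aa_alt: str


@dataclass(frozen=True)
class PrematureStop(Substitution):
    """Second-level subclass, to show why the walk must be recursive."""


def all_subclasses(cls):
    """Collect direct and indirect subclasses of cls."""
    found = set()
    stack = [cls]
    while stack:
        for sub in stack.pop().__subclasses__():
            if sub not in found:
                found.add(sub)
                stack.append(sub)
    return found


# Built after all effect classes are defined.
EFFECT_CLASSES = {cls.__name__: cls for cls in all_subclasses(MutationEffect)}


def effect_to_dict(effect):
    """Serialize fields plus the class name needed to rebuild the right subclass."""
    data = asdict(effect)
    data["__class__"] = type(effect).__name__
    return data


def effect_from_dict(data):
    """Look up the subclass by name and pass the remaining fields to it."""
    fields = dict(data)
    cls = EFFECT_CLASSES[fields.pop("__class__")]
    return cls(**fields)
```

One design caveat: keying on the bare class name assumes names are unique across the hierarchy, and classes defined after `EFFECT_CLASSES` is built (for example by third-party annotators) would need an explicit registration step.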
Scope
This migration is complementary to the main refactor in this issue, not a prerequisite. It can proceed as a separate sub-issue at its own pace — each type converted is a small, well-contained PR with straightforward before/after semantics. Test coverage (existing `to_dict`/`from_dict` round-trip tests) stays the regression gate.
Adding to sub-issues
Migrate value types from Serializable to dataclasses — starts with Variant, then extends to new types (Edit, Provenance, MutantTranscript) as they're introduced. Can begin immediately, independent of the annotator refactor.
Summary
Introduce a
MutantTranscriptabstraction that represents the result of applying one or more variants to a reference transcript — carrying the mutated cDNA and protein sequences along with provenance. This reshapes effect annotation from "compute the delta against the reference by reasoning about offsets" to "construct the mutant sequence, translate it, compare to the reference protein."The old and new annotators coexist as pluggable
EffectAnnotatorimplementations. SNVs (and other trivial point variants) take a shared fast path in both; the heavy machinery only fires for variants that need it. TheEffectCollectionrecords which annotator produced it, and serialized output carries that provenance in a header so it's not lost.This is a consolidation point. Almost every open item on the roadmap asks for the same abstraction, and several existing bugs were caused by not having it. Done well, this simplifies the code rather than complicating it.
Motivation
The current effect code lives in
predict_in_frame_coding_effectand cousins. It decides between Silent / PrematureStop / StopLoss / Insertion / Deletion / Substitution / ... by reasoning about offsets, shared prefixes, and boundary conditions — without ever materializing the full mutant protein and comparing it to the reference.That approach has real costs:
MutantTranscript, they import cleanly.Why this unlocks the roadmap
MutantTranscriptsimplifies itMutantTranscript. Somatic annotation diffs the final mutant against germline, not reference. One-line composition.MutantTranscriptper haplotype. Cis variants go in the same one; trans variants go in different ones. Joint effects fall out of the diff.MutantTranscriptwith a plausibility score. The "possibility set" is justList[MutantTranscript].MutantTranscriptobjects. Importing RNA support means attaching evidence (read counts, fragment IDs) to the rightMutantTranscript.MutantTranscriptjoining segments of two reference transcripts. Doesn't fit the offset model at all; fits here naturally.translate-structsoutput is aMutantTranscript. Loading = construct directly, no effect inference needed.MutantTranscriptfrom a reference span.mutated_sequenceon effectsMutantTranscriptreference.MutantTranscriptper transcript per variant set; no "top priority" coupling.Pluggable
EffectAnnotatorimplementationsThe old and new annotators are not a transitional dual-implementation; they're distinct implementations that coexist behind a shared interface. "Annotator" (rather than "Predictor") matches the field's vocabulary — VEP, SnpEff, and ANNOVAR are all called annotators — and avoids ML connotations that don't apply to varcode's deterministic rule-driven computation.
Selection
Note that
Variant.effects()keeps its name — it's the user-facing accessor and doesn't need the churn. "Annotator" appears at the module / backend layer.Guarantees
Substitution,Silent,PrematureStop, etc. classes. Downstream code is unaffected by annotator choice unless it explicitly asks about evidence/provenance that only one annotator produces.supportsset. The legacy annotator supports{"snv", "indel", "mnv"}. The sequence_diff annotator supports those plus{"splice_set", "sv", "phased"}. Asking the legacy annotator to handle an SV is an explicitUnsupportedVariantError, not silent wrong output.Why pluggable annotators, not a flag
A flag flips a global once; annotators are a stable contract. Concretely this gets us:
Fast path for SNVs (and small indels)
SNVs dominate any realistic somatic VCF — typically >95% of records. The design must not impose the full
MutantTranscriptconstruction + translation cost on them. Both annotators use a shared fast path:Requirements for the fast path:
MutantTranscriptconstruction. The fast path reads one reference codon, applies one base substitution, translates the single codon, and emits the Effect. No full-sequence materialization.The slow path is where SVs, phased haplotypes, splice possibility sets, germline-aware annotation, and Isovar evidence integration live. By the time a variant reaches the slow path, it's because we already know it's non-trivial.
Expected performance profile
EffectCollection records its annotator; serialization preserves it
An
EffectCollectionis the output of running an annotator against a set of variants + transcripts. The collection should know which annotator produced it so that:New fields on
EffectCollectionPopulated automatically when an annotator runs; preserved across
filter/groupby/clone_with_new_elementsoperations so derived collections keep provenance.CSV header metadata
Today
to_csv(inherited from sercol) emits just column names and rows. Extend it to prepend#-prefixed header lines carrying the provenance above:This is a common convention (GFF3, VCF, MAF headers all use
#-prefixed lines), and pandas'read_csv(comment='#')skips them cleanly for anyone who doesn't care about provenance.API
from_csvreads the header lines first to recoverannotator,annotator_version,varcode_version, etc., then parses the CSV body to rehydrate effects. A mismatch between the serialized annotator and the current environment produces a clear warning (or opt-in strict failure) rather than silent reinterpretation.Applies to other formats too
The same pattern extends to any format with a comment convention:
metadataobject alongsideeffectslistvcf_output.py): additional##header lines (##varcode=2.1.0,##annotator=sequence_diff, ...)#comment lines at the topIsovar integration (first-class evidence source)
Isovar already assembles RNA reads into mutant coding sequences, incorporating proximal germline/somatic variants and splicing alterations, then translates them. An
IsovarResultis essentially aMutantTranscriptwith read-level provenance.With this abstraction in place, Isovar integration becomes:
MutantTranscriptcandidates — one per assembled contig, each with supporting read count, assembled cDNA, and translated protein.MutantTranscriptcandidates differ (e.g., two plausible splice outcomes from Prototype multi-effect candidates for splice-site variants #262), Isovar says which one RNA actually supports and at what coverage.Concretely, Isovar can ship its own annotator (
isovar.varcode_annotator, registered with varcode) that, given a variant and a transcript, returns effects computed from Isovar's assembled haplotype rather than inferred from DNA alone. Or, more compositionally, Isovar producesMutantTranscriptcandidates and the sequence_diff annotator consumes them directly — either model works with the annotator interface.The direction is: varcode defines the abstraction, Isovar (and Exacto) populate it with evidence. Currently Isovar wraps varcode and patches around varcode's limitations. After this refactor, Isovar becomes a plugin-style evidence provider or its own registered annotator.
Performance is a hard constraint
A naive implementation would materialize full transcript sequences for every variant and blow up memory and compute. The design must not regress performance. Measures:
MutantTranscriptstores edits against the reference (anchor transcript ID + list of edits). Full sequences are a computed property, materialized only when needed.mutant_protein_sequenceis@memoized_property, not eager.Performance acceptance criteria
MutantTranscriptobjects plus any cached sequences (which we can LRU-bound).If those budgets can't be met, the sequence_diff annotator doesn't become the default (it stays opt-in). The legacy annotator stays the default and remains fully supported.
API sketch
Editinsertions, deletions, substitutions, and splice-junction edits are all expressible. SV-style edits that join two transcripts are a separateJoinEdittype that references a second transcript.Migration plan
EffectAnnotatorinterface and fast path — introduceEffectAnnotator, the registry, and the shared fast path. The legacy annotator wraps the existing code; the sequence_diff annotator is a stub that falls back to legacy. Default stayslegacy. This PR is infrastructure only, no behaviour change.annotator/annotator_version/varcode_versionfields, updateto_csvto emit header comments, addfrom_csv. This can land in parallel with step 1.MutantTranscript+ sequence_diff slow path — for point variants and small indels first. Add a parity test harness that runs both annotators on the full test corpus and fails on any disagreement.Each step lands as a separate PR. No single change is bigger than what a reviewer can hold in their head.
Sub-issues (to be filed once this issue is approved in principle)
EffectAnnotatorinterface + registry + fast path — lands first, enables everything else.EffectCollectionprovenance + header-metadata serialization — can land in parallel.Edit/MutantTranscriptdata model — first prerequisite for the sequence_diff slow path.MutantTranscriptdoes Isovar import, and what evidence does it carry? Does Isovar get its own annotator or compose with sequence_diff?MutantTranscriptloader — already planned in Add loader for Exacto output formats #260, but after this refactor it becomes much thinner.MutantTranscript— enabler for Prototype multi-effect candidates for splice-site variants #262 and Incorporate RNA-level evidence for variant effects #259.Current landed state (2026-04-20)
Partial infrastructure has already shipped under this umbrella:
- MutantTranscript + TranscriptEdit + ReferenceSegment data classes (varcode/mutant_transcript.py). The SV analog of JoinEdit from the API sketch above ended up as reference_segments: Tuple[ReferenceSegment, ...], a tuple of contiguous-reference-sequence pointers, each with (source, start, end, strand, label). A fusion is two segments; an inversion is three (forward / reverse-complement / forward); an assembled-allele SV is one synthetic segment. This is strictly more general than JoinEdit (which only covered the two-transcript join case) and handles inversions, long-read assemblies, and intergenic translocations in the same shape.
- apply_variant_to_transcript(variant, transcript) produces a MutantTranscript for point variants with cdna_sequence and mutant_protein_sequence populated (mitochondrial codon table selected per-transcript).
- StructuralVariantAnnotator (PR Add StructuralVariantAnnotator with multi-outcome SV effects (#252) #333) classifies SVs into LargeDeletion / LargeDuplication / Inversion / GeneFusion / TranslocationToIntergenic, but does not yet populate MutantTranscript.reference_segments on the returned effects; see the follow-ups below.
- Outcome (Add unified Outcome type for multi-outcome effects (#299) #330) defines the unified outcome shape (effect, probability, source, evidence) and MultiOutcomeEffect.outcomes lifts the existing candidates tuple into it.
Gap: SV effects don't yet produce sequences
The SV annotator always constructs effects with candidates=None, which defaults to (self,), so every SV effect today is a single-outcome wrapper with no mutant_transcript attached. External tools that want to read "what protein does this fusion produce" cannot do so without running their own fusion math. This is the main thing this issue enables after the refactor.
Gap: splice and SV outcomes don't yet share the Outcome.effect contract
SpliceOutcomeSet.candidates contains SpliceCandidate dataclasses (not MutationEffect subclasses). The inherited MultiOutcomeEffect.outcomes wraps them as Outcome(effect=<SpliceCandidate>), violating the declared contract (Outcome.effect: MutationEffect). StructuralVariantEffect.candidates holds real MutationEffect instances, so SV outcomes are correctly typed, but downstream consumers still have to isinstance-branch because SpliceCandidate.coding_effect hides the actual effect two hops deep. This issue's refactor is where those two producers should converge.
Outcome contract (SV + splice unification)
After #271, every MultiOutcomeEffect subclass must guarantee:
This collapses three ad-hoc multi-outcome shapes (ExonicSpliceSite.alternate_effect, SpliceOutcomeSet, StructuralVariantEffect) onto the same iteration pattern. Provenance (probability, source, evidence dict) stays on the Outcome; the effect is always a MutationEffect carrying whatever the DNA-level classification produced; mutant_protein_sequence lookup is one-hop regardless of outcome kind.
The concrete work (migrating SpliceCandidate contents onto Outcome fields + MutationEffect subclasses) is tracked in #339.
Post-#271 follow-up issues (filed 2026-04-20)
These address the gaps identified above and the SV-cohesiveness audit. All are blocked by #271; they assume the annotator interface, MutantTranscript materialization, and unified outcome shape are in place:
- StructuralVariantAnnotator should emit MutantTranscript(reference_segments=...) for DEL / DUP / INV / INS / fusion / translocation, not just classify the top-level consequence.
- GeneFusion should compute the fused cDNA + protein from partner transcripts, populating mutant_transcript on the effect.
- varcode/cryptic_exons.py) into the SV annotator so candidates attach as additional Outcome entries rather than requiring callers to invoke it manually.
- StructuralVariantAnnotator should honor StructuralVariant.alt_assembly (long-read resolution hook); currently documented but not read.
- SpliceOutcomeSet and StructuralVariantEffect so Outcome.effect is always a MutationEffect. Moves SpliceCandidate's fields onto Outcome + placeholder MutationEffect subclasses.
- StructuralVariantEffect needs priority_class entries in effect_priority so SV effects sort consistently against point-variant effects and splice sets.
Non-goals
- Effect classes. Substitution, Silent, PrematureStop, etc. remain the output types. What changes is how they are computed and which annotator does the computing.
- Variant.effects() keeps its name and shape; "annotator" lives at the module/backend layer, not in the user-facing accessor.
Related
- mutated_sequence (direct request for this)
Deserialization today (baseline)
EffectCollection already has a working deserialization path:
These are inherited from Serializable (via sercol). Covered by tests/test_effect_collection_serialization.py.
What's missing (gaps this issue needs to close):
- from_csv: to_csv exists (inherited), but the inverse does not. This is straightforward once we add header metadata: the reader parses # key=value lines for provenance, then uses pd.read_csv(comment='#') for the body.
- annotator / annotator_version / varcode_version don't exist as fields yet, so neither to_dict nor to_json carries them. Adding the fields closes this automatically for dict/JSON; CSV needs the header support above.
- varcode_version and either accept, warn, or error based on a policy (probably: accept minor-version drift, warn on major).
Ingesting an existing EffectCollection CSV/JSON:
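Since from_csv does not exist yet, the following is only a sketch of how the header-metadata round trip and version policy described above might work. It uses the stdlib csv module in place of pd.read_csv(comment='#') so that it runs without pandas; the function names, the return value of check_version, and the sample provenance values are all hypothetical.

```python
import csv
import io
import warnings
from typing import Dict, List, Tuple


def write_csv_with_header(rows: List[Dict[str, str]],
                          provenance: Dict[str, str]) -> str:
    """Emit '# key=value' provenance lines, then an ordinary CSV body."""
    out = io.StringIO()
    for key, value in provenance.items():
        out.write(f"# {key}={value}\n")
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def read_csv_with_header(text: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split '#' header lines from the body and parse each part.

    Assumes no CSV field contains an embedded newline followed by '#'.
    """
    provenance: Dict[str, str] = {}
    body: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            key, _, value = line[1:].strip().partition("=")
            provenance[key.strip()] = value.strip()
        elif line.strip():
            body.append(line)
    return provenance, list(csv.DictReader(body))


def check_version(file_version: str, current_version: str) -> bool:
    """Assumed policy: accept minor-version drift, warn on major drift."""
    if file_version.split(".")[0] != current_version.split(".")[0]:
        warnings.warn(
            f"file written by varcode {file_version}, reading with {current_version}"
        )
        return False
    return True


# Placeholder values for illustration only.
provenance = {"annotator": "legacy", "annotator_version": "1",
              "varcode_version": "2.0.0"}
rows = [{"variant": "chr1:100 C>T", "effect": "Substitution"}]

text = write_csv_with_header(rows, provenance)
parsed_provenance, parsed_rows = read_csv_with_header(text)
```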
Migration from Serializable to dataclasses
python-serializable pre-dates dataclasses landing in the stdlib. Most of what it does (__init__ generation, __repr__, __eq__, to_dict/from_dict) is now covered more cleanly by @dataclass(frozen=True) plus a bit of serialization helper code. Moving off Serializable where we can reduces dependencies, makes the types easier to reason about (standard Python semantics, no custom metaclass), and lines up with how the rest of the Python ecosystem works in 2026.
What Serializable provides that dataclasses don't natively
- MutationEffect can be any of 30+ subclasses (Substitution, PrematureStop, ExonicSpliceSite, …). Deserializing a list of effects requires knowing which subclass to rebuild. Serializable handles this via a class-name registry keyed on __class__.__name__ in the dict.
- Serializable instances. Dataclasses need asdict() for this, which is fine but doesn't roundtrip subclasses automatically.
Everything else that Serializable does is redundant with @dataclass(frozen=True).
Proposed migration
- Edit, Provenance, Evidence, MutantTranscript are new; make them @dataclass(frozen=True) from the start. They have no polymorphic subclass structure, so dataclasses cover everything.
- Variant currently extends Serializable. It has no subclass hierarchy (all variants are Variant instances, distinguished by fields); it's a pure value type. Converting to @dataclass(frozen=True) is a mechanical change that also gives us a proper __hash__ for free (useful for using variants as dict keys, which we already do).
- EFFECT_CLASSES = {cls.__name__: cls for cls in all_subclasses(MutationEffect)}, where all_subclasses is a small helper (hypothetical name) that walks __subclasses__() recursively, since the built-in takes no recursive argument and returns only direct subclasses; from_dict looks up the right class by name and passes the remaining fields to its constructor. This lets the effect classes become dataclasses too, while preserving polymorphic round-trip.
- Serializable dependency. Once the last user migrates, remove it from pyproject.toml.
Scope
This migration is complementary to the main refactor in this issue, not a prerequisite. It can proceed as a separate sub-issue at its own pace — each type converted is a small, well-contained PR with straightforward before/after semantics. Test coverage (existing `to_dict`/`from_dict` round-trip tests) stays the regression gate.
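The class-name registry proposed for polymorphic effects could be sketched as below, with a round trip of the kind the regression tests would check. The classes are simplified stand-ins rather than varcode's real effect classes, and the field names, the "__class__" dict key, and the all_subclasses / effect_from_dict helpers are assumptions made for the example.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Iterator, Type


@dataclass(frozen=True)
class MutationEffect:
    """Simplified stand-in for the real base class."""
    variant: str

    def to_dict(self) -> Dict[str, Any]:
        data = asdict(self)
        data["__class__"] = type(self).__name__  # record which subclass to rebuild
        return data


@dataclass(frozen=True)
class Substitution(MutationEffect):
    aa_ref: str
    aa_alt: str


@dataclass(frozen=True)
class PrematureStop(Substitution):
    pass


def all_subclasses(cls: Type) -> Iterator[Type]:
    # type.__subclasses__() returns only direct subclasses, so recurse.
    for sub in cls.__subclasses__():
        yield sub
        yield from all_subclasses(sub)


EFFECT_CLASSES = {cls.__name__: cls for cls in all_subclasses(MutationEffect)}


def effect_from_dict(data: Dict[str, Any]) -> MutationEffect:
    payload = dict(data)  # copy so the caller's dict is not mutated
    cls = EFFECT_CLASSES[payload.pop("__class__")]
    return cls(**payload)


original = PrematureStop(variant="chr1:100 C>T", aa_ref="Q", aa_alt="*")
restored = effect_from_dict(original.to_dict())
```

The registry must be built after all effect subclasses are defined (or rebuilt lazily), since it captures only the subclasses that exist at the time the comprehension runs.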
Adding to sub-issues
- Serializable to dataclasses: starts with Variant, then extends to new types (Edit, Provenance, MutantTranscript) as they're introduced. Can begin immediately, independent of the annotator refactor.