
Foundational refactor: MutantTranscript abstraction #271

Description

@iskandr

Summary

Introduce a MutantTranscript abstraction that represents the result of applying one or more variants to a reference transcript — carrying the mutated cDNA and protein sequences along with provenance. This reshapes effect annotation from "compute the delta against the reference by reasoning about offsets" to "construct the mutant sequence, translate it, compare to the reference protein."

The old and new annotators coexist as pluggable EffectAnnotator implementations. SNVs (and other trivial point variants) take a shared fast path in both; the heavy machinery only fires for variants that need it. The EffectCollection records which annotator produced it, and serialized output carries that provenance in a header so it's not lost.

This is a consolidation point. Almost every open item on the roadmap asks for the same abstraction, and several existing bugs were caused by not having it. Done well, this simplifies the code rather than complicating it.

Motivation

The current effect code lives in predict_in_frame_coding_effect and cousins. It decides between Silent / PrematureStop / StopLoss / Insertion / Deletion / Substitution / ... by reasoning about offsets, shared prefixes, and boundary conditions — without ever materializing the full mutant protein and comparing it to the reference.

That approach has real costs:

Why this unlocks the roadmap

Issue | How MutantTranscript simplifies it
#268 germline-aware | Apply germline variants first → produces a germline MutantTranscript. Somatic annotation diffs the final mutant against germline, not reference. One-line composition.
#269 phasing | One MutantTranscript per haplotype. Cis variants go in the same one; trans variants go in different ones. Joint effects fall out of the diff.
#262 multi-effect splice | Each splice outcome (normal, exon skip, cryptic donor, intron retention) is a different MutantTranscript with a plausibility score. The "possibility set" is just List[MutantTranscript].
#259 RNA evidence | RNA observations are MutantTranscript objects. Importing RNA support means attaching evidence (read counts, fragment IDs) to the right MutantTranscript.
#257 SV types | A translocation produces a MutantTranscript joining segments of two reference transcripts. Doesn't fit the offset model at all; fits here naturally.
#260 Exacto loader | Exacto's translate-structs output is a MutantTranscript. Loading = construct directly, no effect inference needed.
#264 symbolic alleles | Each symbolic allele type is a rule for constructing a MutantTranscript from a reference span.
#179 mutated_sequence on effects | Trivially: every effect carries a MutantTranscript reference.
#195 annotate all transcripts | One MutantTranscript per transcript per variant set; no "top priority" coupling.

Pluggable EffectAnnotator implementations

The old and new annotators are not a transitional dual-implementation; they're distinct implementations that coexist behind a shared interface. "Annotator" (rather than "Predictor") matches the field's vocabulary — VEP, SnpEff, and ANNOVAR are all called annotators — and avoids ML connotations that don't apply to varcode's deterministic rule-driven computation.

from typing import Protocol


class EffectAnnotator(Protocol):
    name: str            # e.g. "legacy", "sequence_diff", "isovar"
    version: str         # annotator's own version, not varcode's
    supports: set[str]   # e.g. {"snv", "indel", "mnv", "splice_set", "sv", "phased"}

    def annotate_on_transcript(
        self,
        variant_or_set: Variant | VariantSet,
        transcript: Transcript,
        context: AnnotationContext | None = None,
    ) -> Effect | EffectSet: ...


# Built-in annotators
varcode.annotators.legacy          # offset-based, matches 2.0.0 behaviour
varcode.annotators.sequence_diff   # MutantTranscript + diff

Selection

# Global default
varcode.set_default_annotator("sequence_diff")

# Per-call override
variant.effects(annotator="legacy")
variant.effects(annotator=my_custom_annotator)

# Scoped override (useful for A/B testing)
with varcode.use_annotator("sequence_diff"):
    effects = vc.effects()

Note that Variant.effects() keeps its name — it's the user-facing accessor and doesn't need the churn. "Annotator" appears at the module / backend layer.

Guarantees

  • Same Effect output types. Both annotators return the existing Substitution, Silent, PrematureStop, etc. classes. Downstream code is unaffected by annotator choice unless it explicitly asks about evidence/provenance that only one annotator produces.
  • Feature declaration. Each annotator declares its supports set. The legacy annotator supports {"snv", "indel", "mnv"}. The sequence_diff annotator supports those plus {"splice_set", "sv", "phased"}. Asking the legacy annotator to handle an SV is an explicit UnsupportedVariantError, not silent wrong output.
  • Third-party annotators. Isovar and Exacto can implement their own annotator (taking their assembled/annotated output as input) and register it with the varcode annotator registry. Users get a consistent API regardless of evidence source.
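As an illustration of the feature-declaration guarantee, the check below refuses an undeclared variant kind instead of producing output. It is a sketch under assumptions: only the UnsupportedVariantError name and the two supports sets come from this issue; check_supported, the class bodies, and the "0.0.0" versions are placeholders.

```python
class UnsupportedVariantError(ValueError):
    """Raised when an annotator is asked for a variant kind it does not declare."""


class LegacyAnnotator:
    name = "legacy"
    version = "0.0.0"  # placeholder, not a real release number
    supports = frozenset({"snv", "indel", "mnv"})


class SequenceDiffAnnotator:
    name = "sequence_diff"
    version = "0.0.0"  # placeholder, not a real release number
    supports = LegacyAnnotator.supports | {"splice_set", "sv", "phased"}


def check_supported(annotator, variant_kind):
    """Fail loudly rather than return a silently wrong annotation."""
    if variant_kind not in annotator.supports:
        raise UnsupportedVariantError(
            f"{annotator.name} does not support {variant_kind!r}; "
            f"declared support: {sorted(annotator.supports)}"
        )
```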

Why pluggable annotators, not a flag

A flag flips a global once; annotators are a stable contract. Concretely this gets us:

  • A/B testing. Users can run both annotators on real data and report regressions without patching varcode.
  • Extension point. Downstream tools ship their own annotator; we don't absorb them all into varcode.
  • Graceful deprecation. The legacy annotator stays available for users who depend on exact byte-for-byte compatibility even after we flip the default. Removing it is a separate decision.

Fast path for SNVs (and other trivial point variants)

SNVs dominate any realistic somatic VCF — typically >95% of records. The design must not impose the full MutantTranscript construction + translation cost on them. Both annotators use a shared fast path:

def annotate_on_transcript(variant, transcript, context=None):
    # Fast path: single-codon SNV with no adjacent variants, no splice
    # boundary, no germline in the same codon, no phasing complications.
    if _is_trivial_point_variant(variant, transcript, context):
        return _fast_path_annotate(variant, transcript)
    # Otherwise: construct MutantTranscript, translate, diff.
    return _slow_path_annotate(variant, transcript, context)

Requirements for the fast path:

  1. Stays identical to the 2.0.0 behaviour for point variants. No new bugs introduced, no behavioural drift.
  2. No MutantTranscript construction. The fast path reads one reference codon, applies one base substitution, translates the single codon, and emits the Effect. No full-sequence materialization.
  3. Shared between annotators. Both the legacy and sequence_diff annotators dispatch to the same fast-path code. This guarantees they agree on the common case and removes it as a source of A/B divergence.
  4. Opt out when context would change the answer. If germline variants land in the same codon, if the variant is in a phase block with a near neighbour, or if the variant is within 3bp of an exon-intron boundary, fall through to the slow path. Getting this triage right matters: too eager on the fast path → wrong results; too conservative → no perf gain.
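Requirement 2 can be made concrete with a single-codon classifier. This is a simplified sketch, not the existing varcode fast path: fast_path_snv and its return shape are hypothetical, it assumes the standard nuclear genetic code and a 0-based offset into the coding sequence, and it ignores start-codon loss and the opt-out triage from requirement 4.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}


def fast_path_snv(cds, cds_offset, alt_base):
    """Classify one SNV by translating only the codon it falls in.

    Returns (effect_name, aa_position, ref_aa, alt_aa); aa_position is 0-based.
    """
    codon_start = cds_offset - cds_offset % 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) != 3:
        raise ValueError("SNV falls in an incomplete trailing codon")
    pos_in_codon = cds_offset - codon_start
    # Apply the single base substitution to the reference codon.
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        name = "Silent"
    elif alt_aa == "*":
        name = "PrematureStop"
    elif ref_aa == "*":
        name = "StopLoss"
    else:
        name = "Substitution"
    return name, codon_start // 3, ref_aa, alt_aa
```

The function touches three bases of the reference and never builds a mutant sequence, which is the property the fast path is meant to preserve.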

The slow path is where SVs, phased haplotypes, splice possibility sets, germline-aware annotation, and Isovar evidence integration live. By the time a variant reaches the slow path, it's because we already know it's non-trivial.

Expected performance profile

  • Typical somatic VCF (10k variants, ~95% SNVs): roughly 95% of annotations hit the fast path. Runtime should be at or below the 2.0.0 baseline, and potentially lower because the fast path is more focused than the current monolithic branching.
  • Clinical exome with germline + somatic + phasing: lower fast-path hit rate, but the slow path only runs when context genuinely requires it.
  • SV-heavy long-read VCF: essentially all slow path, but volume is much lower (SVs are rarer than SNVs).

EffectCollection records its annotator; serialization preserves it

An EffectCollection is the output of running an annotator against a set of variants + transcripts. The collection should know which annotator produced it so that:

  • Downstream consumers can decide whether to trust results from a particular annotator version
  • Serialized files are self-describing (you can look at the top of a CSV and see what produced it)
  • A/B comparisons can assert "this collection came from sequence_diff, that one from legacy"

New fields on EffectCollection

class EffectCollection(Collection):
    annotator: str                    # e.g. "sequence_diff"
    annotator_version: str            # e.g. "1.0.0"
    varcode_version: str              # e.g. "2.1.0" — the varcode version at annotation time
    reference: str | None             # e.g. "GRCh38 (Ensembl 81)"
    annotated_at: datetime | None     # when annotation ran

Populated automatically when an annotator runs; preserved across filter/groupby/clone_with_new_elements operations so derived collections keep provenance.
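One way to keep provenance attached to derived collections is to route every derivation through a single copy step. The sketch below uses a hypothetical ProvenancedCollection stand-in (not the real EffectCollection, which inherits from sercol's Collection) and dataclasses.replace to show the idea.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProvenancedCollection:
    """Hypothetical stand-in for EffectCollection, reduced to the new fields."""

    elements: Tuple = ()
    annotator: str = "legacy"
    annotator_version: str = "0.0.0"  # placeholder values
    varcode_version: str = "0.0.0"
    reference: Optional[str] = None
    annotated_at: Optional[datetime] = None

    def clone_with_new_elements(self, new_elements):
        # replace() copies every field except the one overridden here,
        # so provenance follows the derived collection.
        return replace(self, elements=tuple(new_elements))

    def filter(self, predicate):
        return self.clone_with_new_elements(
            e for e in self.elements if predicate(e)
        )
```

Because filter delegates to clone_with_new_elements, any new derivation method that does the same inherits provenance without extra bookkeeping.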

CSV header metadata

Today to_csv (inherited from sercol) emits just column names and rows. Extend it to prepend #-prefixed header lines carrying the provenance above:

# varcode_version=2.1.0
# annotator=sequence_diff
# annotator_version=1.0.0
# reference=GRCh38 (Ensembl 81)
# annotated_at=2026-04-12T14:30:00Z
# n_variants=9842
# n_effects=38221
variant,contig,start,ref,alt,gene_id,gene_name,transcript_id,transcript_name,effect_type,effect
...

This is a common convention (GFF3, VCF, MAF headers all use #-prefixed lines), and pandas' read_csv(comment='#') skips them cleanly for anyone who doesn't care about provenance.

API

collection.to_csv("effects.csv")                    # writes header by default
collection.to_csv("effects.csv", include_header=False)  # opt out for legacy consumers

# New classmethod to round-trip:
EffectCollection.from_csv("effects.csv")            # recovers effects + metadata

from_csv reads the header lines first to recover annotator, annotator_version, varcode_version, etc., then parses the CSV body to rehydrate effects. A mismatch between the serialized annotator and the current environment produces a clear warning (or opt-in strict failure) rather than silent reinterpretation.
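The header round trip described above might look like the following. This is a sketch that uses the standard-library csv module in place of pandas so it runs without dependencies; write_csv_with_header and read_csv_with_header are hypothetical helpers, and the reader assumes no quoted field contains an embedded newline.

```python
import csv
import io


def write_csv_with_header(metadata, fieldnames, rows):
    """Return CSV text with `# key=value` provenance lines before the body."""
    out = io.StringIO()
    for key, value in metadata.items():
        out.write(f"# {key}={value}\n")
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def read_csv_with_header(text):
    """Parse leading `#` lines into metadata, then parse the CSV body."""
    lines = text.splitlines()
    metadata = {}
    n = 0
    # Only the leading run of comment lines is treated as provenance.
    while n < len(lines) and lines[n].startswith("#"):
        key, _, value = lines[n][1:].strip().partition("=")
        metadata[key] = value
        n += 1
    rows = list(csv.DictReader(lines[n:]))
    return metadata, rows
```

All recovered metadata values are strings; a real from_csv would still need to convert fields such as annotated_at and compare the recovered annotator against the current environment.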

Applies to other formats too

The same pattern extends to any format with a comment convention:

  • JSON: top-level metadata object alongside effects list
  • VCF output (vcf_output.py): additional ## header lines (##varcode=2.1.0, ##annotator=sequence_diff, ...)
  • MAF output (if we add one): # comment lines at the top

Isovar integration (first-class evidence source)

Isovar already assembles RNA reads into mutant coding sequences, incorporating proximal germline/somatic variants and splicing alterations, then translates them. An IsovarResult is essentially a MutantTranscript with read-level provenance.

With this abstraction in place, Isovar integration becomes:

  1. Isovar produces MutantTranscript candidates — one per assembled contig, each with supporting read count, assembled cDNA, and translated protein.
  2. Isovar is the evidence source: when MutantTranscript candidates differ (e.g., two plausible splice outcomes from #262, "Prototype multi-effect candidates for splice-site variants"), Isovar says which one RNA actually supports and at what coverage.
  3. Isovar guides phasing (#269, "Phasing: cis/trans-aware effect prediction for nearby variants"): reads that span multiple variants establish cis/trans directly. Isovar already does this; varcode just needs to consume the assembled haplotypes.
  4. Isovar guides splicing (#259, "Incorporate RNA-level evidence for variant effects"; #262, "Prototype multi-effect candidates for splice-site variants"): assembled RNA reveals the actual splice junctions used. Instead of enumerating plausible splice isoforms from DNA alone, Isovar narrows the set to what's observed.

Concretely, Isovar can ship its own annotator (isovar.varcode_annotator, registered with varcode) that, given a variant and a transcript, returns effects computed from Isovar's assembled haplotype rather than inferred from DNA alone. Or, more compositionally, Isovar produces MutantTranscript candidates and the sequence_diff annotator consumes them directly — either model works with the annotator interface.

The direction is: varcode defines the abstraction, Isovar (and Exacto) populate it with evidence. Currently Isovar wraps varcode and patches around varcode's limitations. After this refactor, Isovar becomes a plugin-style evidence provider or its own registered annotator.

Performance is a hard constraint

A naive implementation would materialize full transcript sequences for every variant and blow up memory and compute. The design must not regress performance. Measures:

  1. Fast path for SNVs (above) — the dominant case avoids the full abstraction.
  2. Delta representation, not full copy: MutantTranscript stores edits against the reference (anchor transcript ID + list of edits). Full sequences are a computed property, materialized only when needed.
  3. Lazy translation with memoization: mutant_protein_sequence is @memoized_property, not eager.
  4. Reference sequence sharing: pyensembl already caches reference sequences; don't duplicate.
  5. Benchmark before and after: add a performance test fixture (time to annotate a representative VCF of ~10k variants against GRCh38) and require no regression on that benchmark. Run it in CI.
  6. Profile the hot path: most variants exercise the fast path. The abstraction must degenerate to the fast case when there's no complexity to track.
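Measures 2 and 3 (delta representation, lazy memoized views) can be illustrated together. The sketch below is an assumption-laden toy: LazyMutant and its materializations counter are hypothetical, and functools.cached_property stands in for the @memoized_property decorator named above.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass(frozen=True)
class Edit:
    cdna_start: int
    cdna_end: int      # exclusive; equal to cdna_start for insertions
    replacement: str   # empty for pure deletions


class LazyMutant:
    """Hypothetical delta-based mutant: stores edits, not a sequence copy."""

    def __init__(self, reference_cdna, edits):
        self.reference_cdna = reference_cdna  # shared reference, not copied
        self.edits = tuple(edits)             # assumed ordered, non-overlapping
        self.materializations = 0             # instrumentation for this sketch

    @cached_property
    def cdna_sequence(self):
        # Runs once, on first access; later reads hit the cached value.
        self.materializations += 1
        parts, cursor = [], 0
        for edit in self.edits:
            parts.append(self.reference_cdna[cursor:edit.cdna_start])
            parts.append(edit.replacement)
            cursor = edit.cdna_end
        parts.append(self.reference_cdna[cursor:])
        return "".join(parts)
```

A variant that never has its sequence inspected pays only for the tuple of edits, which is the degenerate case the fast path relies on.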

Performance acceptance criteria

  • Time to annotate a typical 10k-variant somatic VCF: ≤ current baseline + 5% with the sequence_diff annotator as the default. (The 5% budget allows for annotator dispatch overhead but not much else — the fast path should absorb most of this.)
  • Memory: peak RSS ≤ current baseline + 20%. Most of this goes to slow-path MutantTranscript objects plus any cached sequences (which we can LRU-bound).
  • Startup time: no regression (the refactor shouldn't touch pyensembl initialization).
  • The legacy annotator must remain as-fast-as-2.0.0 — it's the reference baseline the sequence_diff annotator is measured against.

If those budgets can't be met, the sequence_diff annotator doesn't become the default (it stays opt-in). The legacy annotator stays the default and remains fully supported.

API sketch

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """A normalized variant edit in transcript cDNA coordinates."""
    cdna_start: int
    cdna_end: int        # exclusive; equal to cdna_start for insertions
    replacement: str     # empty for pure deletions


class MutantTranscript:
    """A reference transcript with a set of variant edits applied, plus
    optional provenance (which variants, which haplotype, what evidence).
    """
    transcript: pyensembl.Transcript
    edits: Tuple[Edit, ...]           # ordered, non-overlapping
    provenance: Provenance             # variants, haplotype, evidence

    # Lazy / memoized views
    @memoized_property
    def cdna_sequence(self) -> str: ...

    @memoized_property
    def protein_sequence(self) -> str: ...  # translated to first stop

    @memoized_property
    def uses_three_prime_utr(self) -> bool: ...

    # Plausibility / confidence (for candidate sets)
    plausibility: float = 1.0
    evidence: Optional[Evidence] = None


# The sequence_diff annotator's slow path becomes a diff
def _slow_path_annotate(variants, transcript, context):
    mutant = MutantTranscript.apply(transcript, variants, context)
    if mutant.protein_sequence == transcript.protein_sequence:
        return Silent(...)
    # ... remaining branches are expressed as protein-sequence diffs,
    # not offset arithmetic.

Edit insertions, deletions, substitutions, and splice-junction edits are all expressible. SV-style edits that join two transcripts are a separate JoinEdit type that references a second transcript.
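The "remaining branches" of the slow path could be expressed by trimming the shared prefix and suffix of the two proteins and looking at what is left. This is a rough sketch, not the sequence_diff annotator: diff_proteins and classify are hypothetical, the returned strings merely echo existing effect class names, and a protein-only comparison cannot tell a premature stop from an in-frame deletion of the final residues (a real annotator would consult the edits).

```python
def diff_proteins(ref, mut):
    """Return (offset, ref_segment, mut_segment) after trimming shared ends."""
    limit = min(len(ref), len(mut))
    prefix = 0
    while prefix < limit and ref[prefix] == mut[prefix]:
        prefix += 1
    suffix = 0
    # Never let the trimmed suffix overlap the already-matched prefix.
    while (suffix < limit - prefix
           and ref[len(ref) - 1 - suffix] == mut[len(mut) - 1 - suffix]):
        suffix += 1
    return prefix, ref[prefix:len(ref) - suffix], mut[prefix:len(mut) - suffix]


def classify(ref, mut):
    """Name the protein-level change between reference and mutant."""
    if ref == mut:
        return "Silent"
    if ref.startswith(mut):
        return "PrematureStop"   # mutant is a truncated reference
    if mut.startswith(ref):
        return "StopLoss"        # mutant reads through the reference end
    _, ref_seg, mut_seg = diff_proteins(ref, mut)
    if not ref_seg:
        return "Insertion"
    if not mut_seg:
        return "Deletion"
    if len(ref_seg) == len(mut_seg):
        return "Substitution"
    return "ComplexSubstitution"
```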

Migration plan

  1. Land the EffectAnnotator interface and fast path — introduce EffectAnnotator, the registry, and the shared fast path. The legacy annotator wraps the existing code; the sequence_diff annotator is a stub that falls back to legacy. Default stays legacy. This PR is infrastructure only, no behaviour change.
  2. EffectCollection provenance + serialized headers — add the annotator / annotator_version / varcode_version fields, update to_csv to emit header comments, add from_csv. This can land in parallel with step 1.
  3. Implement MutantTranscript + sequence_diff slow path — for point variants and small indels first. Add a parity test harness that runs both annotators on the full test corpus and fails on any disagreement.
  4. Add the performance benchmark — baseline both annotators on a 10k-variant VCF. Establish the regression budget in CI.
  5. Extend sequence_diff to splice, SV, phasing — each is a new slow-path capability; fast path remains the same.
  6. Flip the default — once sequence_diff is at parity on the test corpus and within the performance budget. Legacy stays available.
  7. Remove legacy — separate decision, separate release. Only after the sequence_diff annotator has been the default for at least one release cycle without regressions.

Each step lands as a separate PR. No single change is bigger than what a reviewer can hold in their head.
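The parity harness from step 3 reduces to running both annotators over the same inputs and collecting mismatches. A minimal sketch, assuming annotators are plain callables and effects compare with ==; find_disagreements is a hypothetical helper name.

```python
def find_disagreements(cases, annotator_a, annotator_b):
    """Run two annotators over the same (variant, transcript) pairs.

    Returns a list of (variant, transcript, result_a, result_b) for every
    case where the two results differ; an empty list means parity.
    """
    mismatches = []
    for variant, transcript in cases:
        result_a = annotator_a(variant, transcript)
        result_b = annotator_b(variant, transcript)
        if result_a != result_b:
            mismatches.append((variant, transcript, result_a, result_b))
    return mismatches
```

A CI test would then assert that the returned list is empty and print its contents on failure, so each disagreement is reported with both answers side by side.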

Sub-issues (to be filed once this issue is approved in principle)

  • EffectAnnotator interface + registry + fast path — lands first, enables everything else.
  • EffectCollection provenance + header-metadata serialization — can land in parallel.
  • Design of Edit / MutantTranscript data model — first prerequisite for the sequence_diff slow path.
  • Sequence-diff annotator for coding variants — replaces the offset-based code in the slow path.
  • Performance benchmark suite — lands before the sequence_diff default flip.
  • Parity test harness — runs both annotators on the full corpus, catches drift.
  • Isovar integration surface — what shape of MutantTranscript does Isovar import, and what evidence does it carry? Does Isovar get its own annotator or compose with sequence_diff?
  • Exacto → MutantTranscript loader: already planned in #260 ("Add loader for Exacto output formats"), but after this refactor it becomes much thinner.
  • Splice possibility sets on MutantTranscript: enabler for #262 ("Prototype multi-effect candidates for splice-site variants") and #259 ("Incorporate RNA-level evidence for variant effects").

Current landed state (2026-04-20)

Partial infrastructure has already shipped under this umbrella:

  • MutantTranscript + TranscriptEdit + ReferenceSegment data classes (varcode/mutant_transcript.py). The SV analog of JoinEdit from the API sketch above ended up as reference_segments: Tuple[ReferenceSegment, ...] — a tuple of contiguous-reference-sequence pointers, each with (source, start, end, strand, label). A fusion is two segments; an inversion is three (forward / reverse-complement / forward); an assembled-allele SV is one synthetic segment. This is strictly more general than JoinEdit (which only covered the two-transcript join case) and handles inversions, long-read assemblies, and intergenic translocations in the same shape.
  • apply_variant_to_transcript(variant, transcript) produces a MutantTranscript for point variants with cdna_sequence and mutant_protein_sequence populated (mitochondrial codon table selected per-transcript).
  • StructuralVariantAnnotator (PR #333, "Add StructuralVariantAnnotator with multi-outcome SV effects (#252)") classifies SVs into LargeDeletion / LargeDuplication / Inversion / GeneFusion / TranslocationToIntergenic, but does not yet populate MutantTranscript.reference_segments on the returned effects; see the follow-ups below.
  • Outcome (#330, "Add unified Outcome type for multi-outcome effects (#299)") defines the unified outcome shape (effect, probability, source, evidence) and MultiOutcomeEffect.outcomes lifts the existing candidates tuple into it.
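To show how the reference_segments shape covers both fusions and inversions, the sketch below stitches a sequence from segment pointers. It is illustrative only: the field names follow the tuple listed above, but the coordinate convention (0-based, end-exclusive), the "+"/"-" strand encoding, and the materialize helper with its sequences lookup are assumptions, not the code in varcode/mutant_transcript.py.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass(frozen=True)
class ReferenceSegment:
    source: str      # key into a sequence lookup (assumed: contig or transcript ID)
    start: int       # assumed 0-based, inclusive
    end: int         # assumed exclusive
    strand: str      # assumed "+" or "-"
    label: str = ""


def materialize(segments, sequences):
    """Concatenate the segments, reverse-complementing minus-strand pieces."""
    parts = []
    for seg in segments:
        piece = sequences[seg.source][seg.start:seg.end]
        if seg.strand == "-":
            piece = piece.translate(_COMPLEMENT)[::-1]
        parts.append(piece)
    return "".join(parts)
```

With this shape a fusion is a two-element tuple drawing on two sources, and an inversion is three segments from one source with the middle one on the minus strand.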

Gap: SV effects don't yet produce sequences

The SV annotator always constructs effects with candidates=None, which defaults to (self,), so every SV effect today is a single-outcome wrapper with no mutant_transcript attached. External tools that want to read "what protein does this fusion produce" cannot do so without running their own fusion math. Closing this gap is the main thing the refactor in this issue enables.

Gap: splice and SV outcomes don't yet share the Outcome.effect contract

SpliceOutcomeSet.candidates contains SpliceCandidate dataclasses (not MutationEffect subclasses). The inherited MultiOutcomeEffect.outcomes wraps them as Outcome(effect=<SpliceCandidate>), violating the declared contract (Outcome.effect: MutationEffect). StructuralVariantEffect.candidates holds real MutationEffect instances, so SV outcomes are correctly typed — but downstream consumers still have to isinstance-branch because SpliceCandidate.coding_effect hides the actual effect two hops deep. This issue's refactor is where those two producers should converge.

Outcome contract (SV + splice unification)

After #271, every MultiOutcomeEffect subclass must guarantee:

effect.outcomes  # -> Tuple[Outcome, ...]

# For every outcome o in effect.outcomes:
isinstance(o.effect, MutationEffect)          # strict — no dataclass impostors
o.effect.short_description                    # always present
getattr(o.effect, "mutant_protein_sequence", None)  # present when computable, None when not
getattr(o.effect, "mutant_transcript", None)  # present for SV-shape effects, None for point-variant effects
o.probability                                  # float in [0, 1] or None (unscored)
o.source                                       # "varcode", "spliceai", "isovar", ...
o.evidence                                     # open-ended dict, source-specific shape

This collapses three ad-hoc multi-outcome shapes (ExonicSpliceSite.alternate_effect, SpliceOutcomeSet, StructuralVariantEffect) onto the same iteration pattern. Provenance (probability, source, evidence dict) stays on the Outcome; the effect is always a MutationEffect carrying whatever the DNA-level classification produced; mutant_protein_sequence lookup is one-hop regardless of outcome kind.

The concrete work (migrating SpliceCandidate contents onto Outcome fields + MutationEffect subclasses) is tracked in #339.
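A contract like the one above is easiest to enforce with a checker that tests can run against every MultiOutcomeEffect subclass. The following sketch uses stand-in MutationEffect and Outcome classes (the real ones live in varcode) and a hypothetical contract_violations helper; it checks only the effect type and the probability range.

```python
from dataclasses import dataclass, field
from typing import Optional


class MutationEffect:
    """Stand-in for varcode's effect base class."""

    short_description = "effect"


@dataclass(frozen=True)
class Outcome:
    """Stand-in with the four fields named in the contract."""

    effect: object
    probability: Optional[float] = None
    source: str = "varcode"
    evidence: dict = field(default_factory=dict)


def contract_violations(outcomes):
    """Return a list of human-readable contract problems (empty if none)."""
    problems = []
    for i, o in enumerate(outcomes):
        if not isinstance(o.effect, MutationEffect):
            problems.append(
                f"outcome {i}: effect is {type(o.effect).__name__}, "
                "not a MutationEffect"
            )
        if o.probability is not None and not 0.0 <= o.probability <= 1.0:
            problems.append(
                f"outcome {i}: probability {o.probability} outside [0, 1]"
            )
    return problems
```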

Post-#271 follow-up issues (filed 2026-04-20)

These address the gaps identified above and the SV-cohesiveness audit. All are blocked by #271 — they assume the annotator interface, MutantTranscript materialization, and unified outcome shape are in place:

Non-goals

  • This issue is not about changing the public Effect classes. Substitution, Silent, PrematureStop, etc. remain the output types. What changes is how they are computed and which annotator does the computing.
  • Variant.effects() keeps its name and shape — "annotator" lives at the module/backend layer, not in the user-facing accessor.
  • This is not a rewrite. It's a refactor that lets us delete complexity from the offset-based code (eventually) while opening a substrate for the roadmap features.

Related

Deserialization today (baseline)

EffectCollection already has a working deserialization path:

EffectCollection.from_dict(ec.to_dict())   # works today
EffectCollection.from_json(ec.to_json())   # works today
pickle.loads(pickle.dumps(ec))              # works today

These are inherited from Serializable (via sercol). Covered by tests/test_effect_collection_serialization.py.

What's missing (gaps this issue needs to close):

  • from_csv: to_csv exists (inherited), but the inverse does not. This is straightforward once we add header metadata — the reader parses # key=value lines for provenance, then uses pd.read_csv(comment='#') for the body.
  • Annotator / version in round-trips: because annotator/annotator_version/varcode_version don't exist as fields yet, neither to_dict nor to_json carry them. Adding the fields closes this automatically for dict/JSON; CSV needs the header support above.
  • Cross-varcode-version compatibility: a collection serialized by 2.0.0 and read by 2.1.0 today works (same schema); after this refactor the deserializer should check the embedded varcode_version and either accept, warn, or error based on a policy (probably: accept minor-version drift, warn on major).
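The tentative policy in the last bullet (accept minor-version drift, warn on major) could be sketched as below. check_serialized_version and its strict flag are hypothetical, and the parsing assumes plain dotted numeric versions such as "2.1.0".

```python
import warnings


def _major(version):
    """Leading numeric component of a dotted version string."""
    return int(version.split(".")[0])


def check_serialized_version(serialized, current, strict=False):
    """Return True when the file is compatible under the sketched policy.

    Same major version: accepted silently. Different major version:
    warn and return False, or raise ValueError when strict is set.
    """
    if _major(serialized) == _major(current):
        return True
    message = (
        f"collection was serialized by varcode {serialized}, "
        f"but is being read by varcode {current}"
    )
    if strict:
        raise ValueError(message)
    warnings.warn(message, stacklevel=2)
    return False
```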

Ingesting an existing EffectCollection CSV/JSON:

ec = EffectCollection.from_csv("effects.csv")     # NEW — parses header + body
print(ec.annotator, ec.annotator_version, ec.varcode_version)

ec = EffectCollection.from_json("effects.json")   # EXISTS; extended to carry metadata

Migration from Serializable to dataclasses

python-serializable pre-dates dataclasses landing in the stdlib. Most of what it does — __init__ generation, __repr__, __eq__, to_dict/from_dict — is now covered more cleanly by @dataclass(frozen=True) plus a bit of serialization helper code. Moving off Serializable where we can reduces dependencies, makes the types easier to reason about (standard Python semantics, no custom metaclass), and lines up with how the rest of the Python ecosystem works in 2026.

What Serializable provides that dataclasses don't natively

  • Polymorphic round-tripping: MutationEffect can be any of 30+ subclasses (Substitution, PrematureStop, ExonicSpliceSite, …). Deserializing a list of effects requires knowing which subclass to rebuild. Serializable handles this via a class-name registry keyed on __class__.__name__ in the dict.
  • Nested serialization: recursively walks fields that are themselves Serializable instances. Dataclasses need asdict() for this, which is fine but doesn't round-trip subclasses automatically.

Everything else that Serializable does is redundant with @dataclass(frozen=True).

Proposed migration

  1. Start with leaf value types. Edit, Provenance, Evidence, MutantTranscript are new; make them @dataclass(frozen=True) from the start. They have no polymorphic subclass structure, so dataclasses cover everything.
  2. Migrate existing leaf types. Variant currently extends Serializable. It has no subclass hierarchy (all variants are Variant instances, distinguished by fields); it's a pure value type. Converting to @dataclass(frozen=True) is a mechanical change that also gives us a proper __hash__ for free (useful for using variants as dict keys, which we already do).
  3. Write a tiny polymorphic shim for the effect hierarchy. ~10 lines of code: EFFECT_CLASSES = {cls.__name__: cls for cls in all_subclasses(MutationEffect)}, where all_subclasses is a small helper that walks MutationEffect.__subclasses__() recursively (the built-in takes no arguments and returns only direct subclasses); from_dict looks up the right class by name and passes the remaining fields to its constructor. This lets the effect classes become dataclasses too, while preserving polymorphic round-trip.
  4. Drop the Serializable dependency. Once the last user migrates, remove it from pyproject.toml.
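The polymorphic shim from step 3 might look roughly like this. It is a sketch on a toy hierarchy: the field names (variant, aa_ref, aa_alt), the "__class__" dictionary key, and the helper names are illustrative assumptions, and nested dataclass fields are not handled (asdict would flatten them to plain dicts).

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MutationEffect:
    variant: str


@dataclass(frozen=True)
class Silent(MutationEffect):
    pass


@dataclass(frozen=True)
class Substitution(MutationEffect):
    aa_ref: str
    aa_alt: str


@dataclass(frozen=True)
class PrematureStop(Substitution):
    pass


def all_subclasses(cls):
    """Yield every descendant; __subclasses__() returns only direct children."""
    for sub in cls.__subclasses__():
        yield sub
        yield from all_subclasses(sub)


def effect_to_dict(effect):
    """Serialize fields plus the class name needed to rebuild the subclass."""
    return {"__class__": type(effect).__name__, **asdict(effect)}


def effect_from_dict(data):
    """Look up the subclass by name and pass remaining fields to it."""
    classes = {c.__name__: c for c in all_subclasses(MutationEffect)}
    fields = dict(data)
    cls = classes[fields.pop("__class__")]
    return cls(**fields)
```

Building the class map inside effect_from_dict keeps the sketch correct when subclasses are defined after import; a real implementation would likely cache it.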

Scope

This migration is complementary to the main refactor in this issue, not a prerequisite. It can proceed as a separate sub-issue at its own pace: each type converted is a small, well-contained PR with straightforward before/after semantics. Test coverage (existing to_dict/from_dict round-trip tests) stays the regression gate.

Adding to sub-issues

  • Migrate value types from Serializable to dataclasses — starts with Variant, then extends to new types (Edit, Provenance, MutantTranscript) as they're introduced. Can begin immediately, independent of the annotator refactor.
