Foundational refactor: MutantTranscript abstraction

## Summary

Introduce a `MutantTranscript` abstraction that represents the result of applying one or more variants to a reference transcript — carrying the mutated cDNA and protein sequences along with provenance. This reshapes effect annotation from *"compute the delta against the reference by reasoning about offsets"* to *"construct the mutant sequence, translate it, compare to the reference protein."*

The old and new annotators coexist as **pluggable `EffectAnnotator` implementations**. SNVs (and other trivial point variants) take a shared **fast path** in both; the heavy machinery only fires for variants that need it. The `EffectCollection` records which annotator produced it, and serialized output carries that provenance in a header so it's not lost.

This is a consolidation point. Almost every open item on the roadmap asks for the same abstraction, and several existing bugs were caused by not having it. Done well, this simplifies the code rather than complicating it.

## Motivation

The current effect code lives in `predict_in_frame_coding_effect` and cousins. It decides between Silent / PrematureStop / StopLoss / Insertion / Deletion / Substitution / ... by reasoning about offsets, shared prefixes, and boundary conditions — without ever materializing the full mutant protein and comparing it to the reference.

That approach has real costs:
- **Correctness**: the #250/#205/#201 cluster (all fixed in 2.0.0) was a 3' UTR boundary case the offset logic missed. If the annotator had constructed the mutant protein and diffed it, those bugs couldn't exist.
- **Extensibility**: #257 (SVs joining two transcripts), #259 (multiple spliced isoforms per DNA event), #262 (multi-effect splice candidates), #268 (germline-aware annotation), and #269 (phasing) all need the same primitive: "apply some variants to a transcript, get back a sequence." The offset-based code cannot express most of these.
- **Integration**: Exacto (#260) and Isovar (see below) produce *exactly this object* already. Today we'd have to tear apart their output to jam it into varcode's existing per-variant Effect model. With `MutantTranscript`, they import cleanly.

## Why this unlocks the roadmap

| Issue | How `MutantTranscript` simplifies it |
|-------|--------------------------------------|
| #268 germline-aware | Apply germline variants first → produces a germline `MutantTranscript`. Somatic annotation diffs the final mutant against germline, not reference. One-line composition. |
| #269 phasing | One `MutantTranscript` per haplotype. Cis variants go in the same one; trans variants go in different ones. Joint effects fall out of the diff. |
| #262 multi-effect splice | Each splice outcome (normal, exon skip, cryptic donor, intron retention) is a different `MutantTranscript` with a plausibility score. The "possibility set" is just `List[MutantTranscript]`. |
| #259 RNA evidence | RNA observations *are* `MutantTranscript` objects. Importing RNA support means attaching evidence (read counts, fragment IDs) to the right `MutantTranscript`. |
| #257 SV types | A translocation produces a `MutantTranscript` joining segments of two reference transcripts. Doesn't fit the offset model at all; fits here naturally. |
| #260 Exacto loader | Exacto's `translate-structs` output is a `MutantTranscript`. Loading = construct directly, no effect inference needed. |
| #264 symbolic alleles | Each symbolic allele type is a rule for constructing a `MutantTranscript` from a reference span. |
| #179 `mutated_sequence` on effects | Trivially: every effect carries a `MutantTranscript` reference. |
| #195 annotate all transcripts | One `MutantTranscript` per transcript per variant set; no "top priority" coupling. |

## Pluggable `EffectAnnotator` implementations

The old and new annotators are not a transitional dual-implementation; they're distinct implementations that coexist behind a shared interface. "Annotator" (rather than "Predictor") matches the field's vocabulary — VEP, SnpEff, and ANNOVAR are all called annotators — and avoids ML connotations that don't apply to varcode's deterministic rule-driven computation.

```python
from typing import Protocol


class EffectAnnotator(Protocol):
    name: str            # e.g. "legacy", "sequence_diff", "isovar"
    version: str         # annotator's own version, not varcode's
    supports: set[str]   # e.g. {"snv", "indel", "mnv", "splice_set", "sv", "phased"}

    def annotate_on_transcript(
        self,
        variant_or_set: Variant | VariantSet,
        transcript: Transcript,
        context: AnnotationContext | None = None,
    ) -> Effect | EffectSet: ...


# Built-in annotators
varcode.annotators.legacy          # offset-based, matches 2.0.0 behaviour
varcode.annotators.sequence_diff   # MutantTranscript + diff
```

### Selection

```python
# Global default
varcode.set_default_annotator("sequence_diff")

# Per-call override
variant.effects(annotator="legacy")
variant.effects(annotator=my_custom_annotator)

# Scoped override (useful for A/B testing)
with varcode.use_annotator("sequence_diff"):
    effects = vc.effects()
```

Note that `Variant.effects()` keeps its name — it's the user-facing accessor and doesn't need the churn. "Annotator" appears at the module / backend layer.
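
One way the three selection levels could resolve, as a minimal sketch: it assumes a precedence of per-call argument, then scoped override, then global default, and uses `contextvars` so the scoped override stays local to the current thread or task. `resolve_annotator` and the `_state` dict are hypothetical helpers for illustration, not existing varcode API.

```python
import contextlib
import contextvars

# Hypothetical module state. "legacy" as the initial default follows
# step 1 of the migration plan in this issue.
_state = {"default": "legacy"}
_scoped_annotator = contextvars.ContextVar("scoped_annotator", default=None)


def set_default_annotator(name):
    """Change the process-wide default annotator."""
    _state["default"] = name


@contextlib.contextmanager
def use_annotator(name):
    """Temporarily override the annotator for the enclosed block."""
    token = _scoped_annotator.set(name)
    try:
        yield
    finally:
        # Restore whatever was active before, even if the block raised.
        _scoped_annotator.reset(token)


def resolve_annotator(per_call=None):
    """Assumed precedence: per-call argument, scoped override, global default."""
    if per_call is not None:
        return per_call
    scoped = _scoped_annotator.get()
    return scoped if scoped is not None else _state["default"]
```

Nested `use_annotator` blocks unwind in order because each one resets its own token.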

### Guarantees

- **Same Effect output types.** Both annotators return the existing `Substitution`, `Silent`, `PrematureStop`, etc. classes. Downstream code is unaffected by annotator choice unless it explicitly asks about evidence/provenance that only one annotator produces.
- **Feature declaration.** Each annotator declares its `supports` set. The legacy annotator supports `{"snv", "indel", "mnv"}`. The sequence_diff annotator supports those plus `{"splice_set", "sv", "phased"}`. Asking the legacy annotator to handle an SV is an explicit `UnsupportedVariantError`, not silent wrong output.
- **Third-party annotators.** Isovar and Exacto can implement their own annotator (taking their assembled/annotated output as input) and register it with the varcode annotator registry. Users get a consistent API regardless of evidence source.
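
A rough sketch of how the registry and the feature-declaration guarantee could fit together. `register_annotator`, `get_annotator`, and `require_support` are hypothetical names chosen for illustration, and the `version` string on the stand-in legacy annotator is a placeholder; only `UnsupportedVariantError`, `name`, `version`, and `supports` come from the proposal above.

```python
class UnsupportedVariantError(ValueError):
    """Raised when an annotator is asked for a variant kind it does not declare."""


_REGISTRY = {}  # hypothetical registry: annotator name -> annotator instance


def register_annotator(annotator):
    """Add an annotator to the registry, keyed by its `name`."""
    if annotator.name in _REGISTRY:
        raise ValueError(f"annotator {annotator.name!r} is already registered")
    _REGISTRY[annotator.name] = annotator
    return annotator


def get_annotator(name):
    """Look up a registered annotator, with a readable error for unknown names."""
    try:
        return _REGISTRY[name]
    except KeyError:
        known = sorted(_REGISTRY)
        raise KeyError(f"unknown annotator {name!r}; known: {known}") from None


def require_support(annotator, variant_kind):
    """Fail loudly instead of returning silently wrong output."""
    if variant_kind not in annotator.supports:
        raise UnsupportedVariantError(
            f"annotator {annotator.name!r} does not support {variant_kind!r}"
        )


class LegacyAnnotator:
    """Stand-in for the offset-based annotator; only the declaration matters here."""
    name = "legacy"
    version = "2.0.0"  # placeholder value for the sketch
    supports = frozenset({"snv", "indel", "mnv"})


register_annotator(LegacyAnnotator())
```

A dispatcher would call `require_support` before `annotate_on_transcript`, so an SV handed to the legacy annotator raises instead of being misclassified.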

### Why pluggable annotators, not a flag

A flag flips a global once; annotators are a stable contract. Concretely this gets us:
- **A/B testing.** Users can run both annotators on real data and report regressions without patching varcode.
- **Extension point.** Downstream tools ship their own annotator; we don't absorb them all into varcode.
- **Graceful deprecation.** The legacy annotator stays available for users who depend on exact byte-for-byte compatibility, even after we flip the default. Removing it is a separate decision.

## Fast path for SNVs (and small indels)

SNVs dominate any realistic somatic VCF — typically >95% of records. The design must not impose the full `MutantTranscript` construction + translation cost on them. Both annotators use a shared **fast path**:

```python
def annotate_on_transcript(variant, transcript, context=None):
    # Fast path: single-codon SNV with no adjacent variants, no splice
    # boundary, no germline in the same codon, no phasing complications.
    if _is_trivial_point_variant(variant, transcript, context):
        return _fast_path_annotate(variant, transcript)
    # Otherwise: construct MutantTranscript, translate, diff.
    return _slow_path_annotate(variant, transcript, context)
```

Requirements for the fast path:
1. **Stays identical to the 2.0.0 behaviour** for point variants. No new bugs introduced, no behavioural drift.
2. **No `MutantTranscript` construction.** The fast path reads one reference codon, applies one base substitution, translates the single codon, and emits the Effect. No full-sequence materialization.
3. **Shared between annotators.** Both the legacy and sequence_diff annotators dispatch to the same fast-path code. This guarantees they agree on the common case and removes it as a source of A/B divergence.
4. **Opt out when context would change the answer.** If germline variants land in the same codon, if the variant is in a phase block with a near neighbour, or if the variant is within 3 bp of an exon-intron boundary, fall through to the slow path. Getting this triage right matters: too eager on the fast path → wrong results; too conservative → no perf gain.
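
To make requirement 2 concrete, here is a minimal sketch of the single-codon work the fast path does. It assumes the standard genetic code (so it ignores the per-transcript mitochondrial table mentioned later in this issue), a 0-based offset into the coding sequence, and it leaves out start-codon handling. `fast_path_codon_change` and `classify_point_change` are illustrative names, not existing varcode functions.

```python
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: _AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(_BASES)
    for j, b2 in enumerate(_BASES)
    for k, b3 in enumerate(_BASES)
}


def fast_path_codon_change(cds, cds_offset, alt_base):
    """Return (aa_position, ref_aa, alt_aa) for a single-base substitution.

    `cds` is the reference coding sequence beginning at the start codon and
    `cds_offset` is the 0-based offset of the substituted base within it.
    Only the affected codon is read and translated.
    """
    codon_start = cds_offset - (cds_offset % 3)
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) != 3:
        raise ValueError("offset falls in an incomplete trailing codon")
    pos_in_codon = cds_offset - codon_start
    alt_codon = ref_codon[:pos_in_codon] + alt_base + ref_codon[pos_in_codon + 1:]
    return codon_start // 3, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


def classify_point_change(ref_aa, alt_aa):
    """Map a single-codon change onto effect names used in this issue."""
    if ref_aa == alt_aa:
        return "Silent"
    if alt_aa == "*":
        return "PrematureStop"
    if ref_aa == "*":
        return "StopLoss"
    return "Substitution"
```

No full-length sequence is built at any point, which is the property the fast path needs to keep.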

The slow path is where SVs, phased haplotypes, splice possibility sets, germline-aware annotation, and Isovar evidence integration live. By the time a variant reaches the slow path, it's because we already know it's non-trivial.

### Expected performance profile

- Typical somatic VCF (10k variants, ~95% SNVs): roughly 95% of annotations hit the fast path. Runtime should be at or below the 2.0.0 baseline, and potentially *better* because the fast path is more focused than the current monolithic branching.
- Clinical exome with germline + somatic + phasing: lower fast-path hit rate, but the slow path only runs when context genuinely requires it.
- SV-heavy long-read VCF: essentially all slow path, but volume is much lower (SVs are rarer than SNVs).

## EffectCollection records its annotator; serialization preserves it

An `EffectCollection` is the output of running an annotator against a set of variants + transcripts. The collection should know *which annotator produced it* so that:
- Downstream consumers can decide whether to trust results from a particular annotator version
- Serialized files are self-describing (you can look at the top of a CSV and see what produced it)
- A/B comparisons can assert "this collection came from sequence_diff, that one from legacy"

### New fields on `EffectCollection`

```python
class EffectCollection(Collection):
    annotator: str                    # e.g. "sequence_diff"
    annotator_version: str            # e.g. "1.0.0"
    varcode_version: str              # e.g. "2.1.0" — the varcode version at annotation time
    reference: str | None             # e.g. "GRCh38 (Ensembl 81)"
    annotated_at: datetime | None     # when annotation ran
```

Populated automatically when an annotator runs; preserved across `filter`/`groupby`/`clone_with_new_elements` operations so derived collections keep provenance.
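
A small sketch of what "preserved across derived collections" could look like, assuming provenance lives in fields that every derivation copies forward. `ProvenancedCollection` is a toy stand-in, not the sercol `Collection`; only the field names and `clone_with_new_elements` / `filter` are taken from this issue.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProvenancedCollection:
    """Toy collection whose derived copies keep the annotator metadata."""
    elements: Tuple[str, ...]
    annotator: str
    annotator_version: str
    varcode_version: str
    reference: Optional[str] = None

    def clone_with_new_elements(self, new_elements):
        # dataclasses.replace copies every field except the ones overridden,
        # so provenance cannot be dropped by accident.
        return replace(self, elements=tuple(new_elements))

    def filter(self, predicate):
        return self.clone_with_new_elements(
            e for e in self.elements if predicate(e)
        )
```

Routing every derivation through one method keeps the "don't lose provenance" rule in a single place.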

### CSV header metadata

Today `to_csv` (inherited from sercol) emits just column names and rows. Extend it to prepend `#`-prefixed header lines carrying the provenance above:

```
# varcode_version=2.1.0
# annotator=sequence_diff
# annotator_version=1.0.0
# reference=GRCh38 (Ensembl 81)
# annotated_at=2026-04-12T14:30:00Z
# n_variants=9842
# n_effects=38221
variant,contig,start,ref,alt,gene_id,gene_name,transcript_id,transcript_name,effect_type,effect
...
```

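This is a common convention (GFF3, VCF, and MAF headers all use `#`-prefixed lines), and pandas' `read_csv(comment='#')` skips them cleanly for anyone who doesn't care about provenance. One caveat: `comment='#'` also cuts off the rest of a row at an unquoted `#`, so field values that contain `#` need to be quoted on write.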

### API

```python
collection.to_csv("effects.csv")                    # writes header by default
collection.to_csv("effects.csv", include_header=False)  # opt out for legacy consumers

# New classmethod to round-trip:
EffectCollection.from_csv("effects.csv")            # recovers effects + metadata
```

`from_csv` reads the header lines first to recover `annotator`, `annotator_version`, `varcode_version`, etc., then parses the CSV body to rehydrate effects. A mismatch between the serialized annotator and the current environment produces a clear warning (or opt-in strict failure) rather than silent reinterpretation.
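
The header round-trip can be sketched with the standard library alone. This is an illustration of the `# key=value` convention, not the proposed `to_csv` / `from_csv` implementation: `write_csv_with_header` and `read_csv_with_header` are hypothetical helpers, rows are plain dicts, and the reader assumes no quoted field contains an embedded newline.

```python
import csv
import io


def write_csv_with_header(metadata, fieldnames, rows):
    """Serialize rows to CSV text, prefixed by `# key=value` provenance lines."""
    out = io.StringIO()
    for key, value in metadata.items():
        out.write(f"# {key}={value}\n")
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def read_csv_with_header(text):
    """Return (metadata, rows); metadata comes from the leading `#` lines."""
    metadata = {}
    lines = text.splitlines()
    n_header = 0
    for line in lines:
        if not line.startswith("#"):
            break
        n_header += 1
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line[1:].strip().partition("=")
        if sep:  # skip comment lines that are not key=value pairs
            metadata[key.strip()] = value.strip()
    rows = list(csv.DictReader(lines[n_header:]))
    return metadata, rows
```

All metadata values come back as strings, so typed fields such as `annotated_at` would need explicit parsing on load.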

### Applies to other formats too

The same pattern extends to any format with a comment convention:
- **JSON**: top-level `metadata` object alongside `effects` list
- **VCF output** (`vcf_output.py`): additional `##` header lines (`##varcode=2.1.0`, `##annotator=sequence_diff`, ...)
- **MAF output** (if we add one): `#` comment lines at the top

## Isovar integration (first-class evidence source)

[Isovar](https://github.com/openvax/isovar) already assembles RNA reads into mutant coding sequences, incorporating proximal germline/somatic variants and splicing alterations, then translates them. An `IsovarResult` is essentially a `MutantTranscript` with read-level provenance.

With this abstraction in place, Isovar integration becomes:

1. **Isovar produces `MutantTranscript` candidates** — one per assembled contig, each with supporting read count, assembled cDNA, and translated protein.
2. **Isovar is the evidence source** — when `MutantTranscript` candidates differ (e.g., two plausible splice outcomes from #262), Isovar says which one RNA actually supports and at what coverage.
3. **Isovar guides phasing (#269)** — reads that span multiple variants establish cis/trans directly. Isovar already does this; varcode just needs to consume the assembled haplotypes.
4. **Isovar guides splicing (#259, #262)** — assembled RNA reveals the actual splice junctions used. Instead of enumerating plausible splice isoforms from DNA alone, Isovar narrows the set to what's observed.

Concretely, Isovar can ship its own annotator (`isovar.varcode_annotator`, registered with varcode) that, given a variant and a transcript, returns effects computed from Isovar's assembled haplotype rather than inferred from DNA alone. Or, more compositionally, Isovar produces `MutantTranscript` candidates and the sequence_diff annotator consumes them directly — either model works with the annotator interface.

The direction is: **varcode defines the abstraction, Isovar (and Exacto) populate it with evidence.** Currently Isovar wraps varcode and patches around varcode's limitations. After this refactor, Isovar becomes a plugin-style evidence provider or its own registered annotator.

## Performance is a hard constraint

A naive implementation would materialize full transcript sequences for every variant and blow up memory and compute. The design must not regress performance. Measures:

1. **Fast path for SNVs** (above) — the dominant case avoids the full abstraction.
2. **Delta representation, not full copy**: `MutantTranscript` stores *edits* against the reference (anchor transcript ID + list of edits). Full sequences are a computed property, materialized only when needed.
3. **Lazy translation with memoization**: the mutant protein sequence (`protein_sequence` in the API sketch, `mutant_protein_sequence` in the landed code) is a `@memoized_property`, not eager.
4. **Reference sequence sharing**: pyensembl already caches reference sequences; don't duplicate.
5. **Benchmark before and after**: add a performance test fixture (time to annotate a representative VCF of ~10k variants against GRCh38) and require no regression on that benchmark. Run it in CI.
6. **Profile the hot path**: most variants exercise the fast path. The abstraction must degenerate to the fast case when there's no complexity to track.
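
Measures 2 and 3 could look roughly like the following. This is a sketch under assumptions: it uses the standard library's `functools.cached_property` as a stand-in for `@memoized_property`, takes the reference cDNA as a plain string rather than a pyensembl transcript, and `LazyMutantSequence` is a hypothetical name. The `Edit` fields mirror the API sketch in this issue.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass(frozen=True)
class Edit:
    cdna_start: int
    cdna_end: int      # exclusive; equal to cdna_start for insertions
    replacement: str   # empty for pure deletions


class LazyMutantSequence:
    """Stores edits only; the full mutant sequence is built on first access."""

    def __init__(self, reference_cdna, edits):
        self.reference_cdna = reference_cdna  # shared reference, never copied
        self.edits = tuple(sorted(edits, key=lambda e: (e.cdna_start, e.cdna_end)))
        for edit in self.edits:
            if not 0 <= edit.cdna_start <= edit.cdna_end <= len(reference_cdna):
                raise ValueError(f"edit out of bounds: {edit}")
        for prev, cur in zip(self.edits, self.edits[1:]):
            if cur.cdna_start < prev.cdna_end:
                raise ValueError(f"overlapping edits: {prev} and {cur}")

    @cached_property
    def cdna_sequence(self):
        # Stitch untouched reference slices together with the replacements.
        pieces = []
        cursor = 0
        for edit in self.edits:
            pieces.append(self.reference_cdna[cursor:edit.cdna_start])
            pieces.append(edit.replacement)
            cursor = edit.cdna_end
        pieces.append(self.reference_cdna[cursor:])
        return "".join(pieces)
```

Until `cdna_sequence` is read, an instance costs a reference to the shared string plus a few small `Edit` objects.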

### Performance acceptance criteria

- Time to annotate a typical 10k-variant somatic VCF: ≤ current baseline + 5% with the sequence_diff annotator as the default. (The 5% budget allows for annotator dispatch overhead but not much else — the fast path should absorb most of this.)
- Memory: peak RSS ≤ current baseline + 20%. Most of this goes to slow-path `MutantTranscript` objects plus any cached sequences (which we can LRU-bound).
- Startup time: no regression (the refactor shouldn't touch pyensembl initialization).
- The legacy annotator must remain as fast as 2.0.0: it is the reference baseline the sequence_diff annotator is measured against.

If those budgets can't be met, the sequence_diff annotator doesn't become the default (it stays opt-in). The legacy annotator stays the default and remains fully supported.
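
The time and memory budgets reduce to a simple gate that a CI benchmark job could evaluate. A minimal sketch, assuming both baselines are measured on the same machine in the same run; `within_budget` and `default_flip_allowed` are hypothetical names, and the startup-time criterion is left out.

```python
def within_budget(baseline, measured, tolerance):
    """True if `measured` is at most `baseline` plus `tolerance` (a fraction)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return measured <= baseline * (1.0 + tolerance)


def default_flip_allowed(baseline_seconds, measured_seconds,
                         baseline_rss, measured_rss):
    """Apply the budgets stated above: +5% wall time and +20% peak RSS."""
    return (
        within_budget(baseline_seconds, measured_seconds, 0.05)
        and within_budget(baseline_rss, measured_rss, 0.20)
    )
```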

## API sketch

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """A normalized variant edit in transcript cDNA coordinates."""
    cdna_start: int
    cdna_end: int        # exclusive; equal to cdna_start for insertions
    replacement: str     # empty for pure deletions


class MutantTranscript:
    """A reference transcript with a set of variant edits applied, plus
    optional provenance (which variants, which haplotype, what evidence).
    """
    transcript: pyensembl.Transcript
    edits: Tuple[Edit, ...]           # ordered, non-overlapping
    provenance: Provenance             # variants, haplotype, evidence

    # Lazy / memoized views
    @memoized_property
    def cdna_sequence(self) -> str: ...

    @memoized_property
    def protein_sequence(self) -> str: ...  # translated to first stop

    @memoized_property
    def uses_three_prime_utr(self) -> bool: ...

    # Plausibility / confidence (for candidate sets)
    plausibility: float = 1.0
    evidence: Optional[Evidence] = None


# The sequence_diff annotator's slow path becomes a diff
def _slow_path_annotate(variants, transcript, context):
    mutant = MutantTranscript.apply(transcript, variants, context)
    if mutant.protein_sequence == transcript.protein_sequence:
        return Silent(...)
    # ... remaining branches are expressed as protein-sequence diffs,
    # not offset arithmetic.
```

Insertions, deletions, substitutions, and splice-junction edits are all expressible as `Edit` values. SV-style edits that *join* two transcripts are a separate `JoinEdit` type that references a second transcript.
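
For the "remaining branches are expressed as protein-sequence diffs" part of the sketch, one plausible core is trimming the shared prefix and suffix of the two proteins and classifying what is left. The helper names here are hypothetical, and the classification covers only what can be decided from the two strings alone: separating `PrematureStop`, `StopLoss`, or frameshifts from a plain deletion or insertion needs extra context (for example whether a stop codon was created), which this sketch does not model.

```python
def diff_proteins(ref, mut):
    """Return (offset, ref_fragment, mut_fragment) after trimming shared ends."""
    limit = min(len(ref), len(mut))
    # Length of the shared prefix.
    prefix = 0
    while prefix < limit and ref[prefix] == mut[prefix]:
        prefix += 1
    # Length of the shared suffix, never overlapping the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and ref[len(ref) - 1 - suffix] == mut[len(mut) - 1 - suffix]):
        suffix += 1
    return prefix, ref[prefix:len(ref) - suffix], mut[prefix:len(mut) - suffix]


def classify_diff(ref, mut):
    """Coarse effect class based only on the trimmed fragments."""
    _, ref_frag, mut_frag = diff_proteins(ref, mut)
    if not ref_frag and not mut_frag:
        return "Silent"
    if not ref_frag:
        return "Insertion"
    if not mut_frag:
        return "Deletion"
    return "Substitution"
```

In repeated regions the prefix is consumed first, so the reported offset is the right-most equivalent position.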

## Migration plan

1. **Land the `EffectAnnotator` interface and fast path** — introduce `EffectAnnotator`, the registry, and the shared fast path. The legacy annotator wraps the existing code; the sequence_diff annotator is a stub that falls back to legacy. Default stays `legacy`. This PR is infrastructure only, no behaviour change.
2. **EffectCollection provenance + serialized headers** — add the `annotator` / `annotator_version` / `varcode_version` fields, update `to_csv` to emit header comments, add `from_csv`. This can land in parallel with step 1.
3. **Implement `MutantTranscript` + sequence_diff slow path** — for point variants and small indels first. Add a parity test harness that runs both annotators on the full test corpus and fails on any disagreement.
4. **Add the performance benchmark** — baseline both annotators on a 10k-variant VCF. Establish the regression budget in CI.
5. **Extend sequence_diff to splice, SV, phasing** — each is a new slow-path capability; fast path remains the same.
6. **Flip the default** — once sequence_diff is at parity on the test corpus and within the performance budget. Legacy stays available.
7. **Remove legacy** — separate decision, separate release. Only after the sequence_diff annotator has been the default for at least one release cycle without regressions.

Each step lands as a separate PR. No single change is bigger than what a reviewer can hold in their head.
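
The parity harness from step 3 could start out as small as this. A sketch only: annotators are modelled as plain callables from a test case to a comparable result, and `find_disagreements` is a hypothetical helper name.

```python
def find_disagreements(annotate_a, annotate_b, cases):
    """Run two annotator callables over the same cases and collect mismatches."""
    disagreements = []
    for case in cases:
        result_a = annotate_a(case)
        result_b = annotate_b(case)
        if result_a != result_b:
            disagreements.append((case, result_a, result_b))
    return disagreements
```

A test would assert that the returned list is empty and print the triples on failure, so the first report already names the variants that drifted.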

## Sub-issues (to be filed once this issue is approved in principle)

- **`EffectAnnotator` interface + registry + fast path** — lands first, enables everything else.
- **`EffectCollection` provenance + header-metadata serialization** — can land in parallel.
- **Design of `Edit` / `MutantTranscript` data model** — first prerequisite for the sequence_diff slow path.
- **Sequence-diff annotator for coding variants** — replaces the offset-based code in the slow path.
- **Performance benchmark suite** — lands before the sequence_diff default flip.
- **Parity test harness** — runs both annotators on the full corpus, catches drift.
- **Isovar integration surface** — what shape of `MutantTranscript` does Isovar import, and what evidence does it carry? Does Isovar get its own annotator or compose with sequence_diff?
- **Exacto → `MutantTranscript` loader** — already planned in #260, but after this refactor it becomes much thinner.
- **Splice possibility sets on `MutantTranscript`** — enabler for #262 and #259.

## Current landed state (2026-04-20)

Partial infrastructure has already shipped under this umbrella:

- `MutantTranscript` + `TranscriptEdit` + `ReferenceSegment` data classes (`varcode/mutant_transcript.py`). The SV analog of `JoinEdit` from the API sketch above ended up as **`reference_segments: Tuple[ReferenceSegment, ...]`** — a tuple of contiguous-reference-sequence pointers, each with `(source, start, end, strand, label)`. A fusion is two segments; an inversion is three (forward / reverse-complement / forward); an assembled-allele SV is one synthetic segment. This is strictly more general than `JoinEdit` (which only covered the two-transcript join case) and handles inversions, long-read assemblies, and intergenic translocations in the same shape.
- `apply_variant_to_transcript(variant, transcript)` produces a `MutantTranscript` for point variants with `cdna_sequence` and `mutant_protein_sequence` populated (mitochondrial codon table selected per-transcript).
- `StructuralVariantAnnotator` (PR #333) classifies SVs into `LargeDeletion` / `LargeDuplication` / `Inversion` / `GeneFusion` / `TranslocationToIntergenic`, but **does not yet populate `MutantTranscript.reference_segments`** on the returned effects — see the follow-ups below.
- `Outcome` (#330) defines the unified outcome shape (`effect`, `probability`, `source`, `evidence`) and `MultiOutcomeEffect.outcomes` lifts the existing `candidates` tuple into it.
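
To illustrate the reference-segment shape described above, here is a sketch that materializes a sequence from a tuple of segments. The field names follow the `(source, start, end, strand, label)` tuple mentioned in this section, but the coordinate convention (0-based, half-open), the `"+"` / `"-"` strand encoding, and the `materialize` helper are assumptions for illustration and may differ from `varcode/mutant_transcript.py`.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class ReferenceSegment:
    source: str      # key into a sequence lookup, e.g. a contig or transcript ID
    start: int       # 0-based, inclusive (assumption of this sketch)
    end: int         # exclusive
    strand: str      # "+" or "-"
    label: str = ""


def materialize(segments, sequences):
    """Concatenate the segments, reverse-complementing minus-strand pieces."""
    pieces = []
    for seg in segments:
        piece = sequences[seg.source][seg.start:seg.end]
        if seg.strand == "-":
            piece = reverse_complement(piece)
        elif seg.strand != "+":
            raise ValueError(f"unknown strand {seg.strand!r}")
        pieces.append(piece)
    return "".join(pieces)
```

An inversion is then three segments over one source (forward, reverse-complement, forward), and a fusion is two segments over two sources.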

### Gap: SV effects don't yet produce sequences

The SV annotator always constructs effects with `candidates=None`, which defaults to `(self,)`, so every SV effect today is a single-outcome wrapper with no `mutant_transcript` attached. External tools that want to read "what protein does this fusion produce" cannot do so without running their own fusion math. Closing this gap is the main thing this refactor enables for SVs.

### Gap: splice and SV outcomes don't yet share the `Outcome.effect` contract

`SpliceOutcomeSet.candidates` contains `SpliceCandidate` dataclasses (not `MutationEffect` subclasses). The inherited `MultiOutcomeEffect.outcomes` wraps them as `Outcome(effect=<SpliceCandidate>)`, violating the declared contract (`Outcome.effect: MutationEffect`). `StructuralVariantEffect.candidates` holds real `MutationEffect` instances, so SV outcomes are correctly typed — but downstream consumers still have to `isinstance`-branch because `SpliceCandidate.coding_effect` hides the actual effect two hops deep. This issue's refactor is where those two producers should converge.

## Outcome contract (SV + splice unification)

After #271, every `MultiOutcomeEffect` subclass must guarantee:

```python
effect.outcomes  # -> Tuple[Outcome, ...]

# For every outcome o in effect.outcomes:
isinstance(o.effect, MutationEffect)          # strict — no dataclass impostors
o.effect.short_description                    # always present
getattr(o.effect, "mutant_protein_sequence", None)  # present when computable, None when not
getattr(o.effect, "mutant_transcript", None)  # present for SV-shape effects, None for point-variant effects
o.probability                                  # float in [0, 1] or None (unscored)
o.source                                       # "varcode", "spliceai", "isovar", ...
o.evidence                                     # open-ended dict, source-specific shape
```

This collapses three ad-hoc multi-outcome shapes (`ExonicSpliceSite.alternate_effect`, `SpliceOutcomeSet`, `StructuralVariantEffect`) onto the same iteration pattern. Provenance (probability, source, evidence dict) stays on the `Outcome`; the `effect` is always a `MutationEffect` carrying whatever the DNA-level classification produced; `mutant_protein_sequence` lookup is one-hop regardless of outcome kind.

The concrete work (migrating `SpliceCandidate` contents onto `Outcome` fields + `MutationEffect` subclasses) is tracked in #339.
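
A contract like this is easy to enforce in tests. The sketch below uses stand-in classes (`MutationEffect`, `Outcome`, and a dataclass `SpliceCandidate` playing the "impostor") rather than the real varcode types, and `contract_violations` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional


class MutationEffect:
    """Stand-in for varcode's effect base class."""
    short_description = "effect"


@dataclass
class SpliceCandidate:
    """Stand-in for a dataclass that is not a MutationEffect."""
    coding_effect: Optional[MutationEffect] = None


@dataclass
class Outcome:
    effect: object
    probability: Optional[float] = None   # None means unscored
    source: str = "varcode"
    evidence: dict = field(default_factory=dict)


def contract_violations(outcomes):
    """Return human-readable descriptions of every contract breach."""
    problems = []
    for i, o in enumerate(outcomes):
        if not isinstance(o.effect, MutationEffect):
            problems.append(
                f"outcome {i}: effect is {type(o.effect).__name__}, "
                "not a MutationEffect"
            )
        if o.probability is not None and not 0.0 <= o.probability <= 1.0:
            problems.append(f"outcome {i}: probability {o.probability} not in [0, 1]")
    return problems
```

Running this over every `MultiOutcomeEffect` subclass in the test corpus would flag the `SpliceOutcomeSet` case described in the gap section.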

## Post-#271 follow-up issues (filed 2026-04-20)

These address the gaps identified above and the SV-cohesiveness audit. All are **blocked by #271** — they assume the annotator interface, `MutantTranscript` materialization, and unified outcome shape are in place:

- **#335** — `StructuralVariantAnnotator` should emit `MutantTranscript(reference_segments=...)` for DEL / DUP / INV / INS / fusion / translocation, not just classify the top-level consequence.
- **#336** — `GeneFusion` should compute the fused cDNA + protein from partner transcripts, populating `mutant_transcript` on the effect.
- **#337** — Wire the cryptic-exon enumerator (`varcode/cryptic_exons.py`) into the SV annotator so candidates attach as additional `Outcome` entries rather than requiring callers to invoke it manually.
- **#338** — `StructuralVariantAnnotator` should honor `StructuralVariant.alt_assembly` (long-read resolution hook) — currently documented but not read.
- **#339** — Unify `SpliceOutcomeSet` and `StructuralVariantEffect` so `Outcome.effect` is always a `MutationEffect`. Moves `SpliceCandidate`'s fields onto `Outcome` + placeholder `MutationEffect` subclasses.
- **#340** — `StructuralVariantEffect` needs `priority_class` entries in `effect_priority` so SV effects sort consistently against point-variant effects and splice sets.
- **#341** — When SV breakpoints land near splice sites, the annotator should attach splice-outcome candidates alongside the SV classification. This is where SV and splice cohere at the outcome layer.


## Non-goals

- This issue is not about changing the public `Effect` classes. `Substitution`, `Silent`, `PrematureStop`, etc. remain the output types. What changes is how they are *computed* and which annotator does the computing.
- `Variant.effects()` keeps its name and shape — "annotator" lives at the module/backend layer, not in the user-facing accessor.
- This is not a rewrite. It's a refactor that lets us delete complexity from the offset-based code (eventually) while opening a substrate for the roadmap features.

## Related

- #270 — Personal genome / haplotype-aware umbrella (consumer of this)
- #261 — SV / Exacto umbrella (consumer of this)
- #262 — Multi-effect splice prototype (consumer of this)
- #179 — Variant effects should contain `mutated_sequence` (direct request for this)

## Deserialization today (baseline)

`EffectCollection` already has a working deserialization path:

```python
EffectCollection.from_dict(ec.to_dict())   # works today
EffectCollection.from_json(ec.to_json())   # works today
pickle.loads(pickle.dumps(ec))              # works today
```

These are inherited from `Serializable` (via sercol). Covered by `tests/test_effect_collection_serialization.py`.

What's missing (gaps this issue needs to close):

- **`from_csv`**: `to_csv` exists (inherited), but the inverse does not. This is straightforward once we add header metadata — the reader parses `# key=value` lines for provenance, then uses `pd.read_csv(comment='#')` for the body.
- **Annotator / version in round-trips**: because `annotator`/`annotator_version`/`varcode_version` don't exist as fields yet, neither `to_dict` nor `to_json` carry them. Adding the fields closes this automatically for dict/JSON; CSV needs the header support above.
- **Cross-varcode-version compatibility**: a collection serialized by 2.0.0 and read by 2.1.0 today works (same schema); after this refactor the deserializer should check the embedded `varcode_version` and either accept, warn, or error based on a policy (probably: accept minor-version drift, warn on major).
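
The tentative policy in the last bullet could be encoded as below. This is a sketch of one reading of it: same major version is accepted, a different major version warns, and the `"error"` outcome for unparseable version strings is an assumption added here, not something the issue has decided. `version_policy` is a hypothetical name.

```python
def version_policy(serialized_version, current_version):
    """Return "accept", "warn", or "error" for a serialized/current version pair."""

    def parse(version):
        parts = version.split(".")
        return int(parts[0]), int(parts[1])

    try:
        s_major, _ = parse(serialized_version)
        c_major, _ = parse(current_version)
    except (ValueError, IndexError):
        # Assumption: a version string we cannot parse is treated as an error.
        return "error"
    if s_major != c_major:
        return "warn"
    return "accept"  # minor and patch drift are accepted
```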

**Ingesting an existing EffectCollection CSV/JSON**:

```python
ec = EffectCollection.from_csv("effects.csv")     # NEW — parses header + body
print(ec.annotator, ec.annotator_version, ec.varcode_version)

ec = EffectCollection.from_json("effects.json")   # EXISTS; extended to carry metadata
```

## Migration from `Serializable` to dataclasses

[python-serializable](https://github.com/openvax/serializable) pre-dates `dataclasses` landing in the stdlib. Most of what it does — `__init__` generation, `__repr__`, `__eq__`, `to_dict`/`from_dict` — is now covered more cleanly by `@dataclass(frozen=True)` plus a bit of serialization helper code. Moving off `Serializable` where we can reduces dependencies, makes the types easier to reason about (standard Python semantics, no custom metaclass), and lines up with how the rest of the Python ecosystem works in 2026.

### What `Serializable` provides that dataclasses don't natively

- **Polymorphic round-tripping**: `MutationEffect` can be any of 30+ subclasses (Substitution, PrematureStop, ExonicSpliceSite, …). Deserializing a list of effects requires knowing which subclass to rebuild. `Serializable` handles this via a class-name registry keyed on `__class__.__name__` in the dict.
- **Nested serialization**: recursively walks fields that are themselves `Serializable` instances. Dataclasses need `dataclasses.asdict()` for this, which is fine but doesn't round-trip subclasses automatically.

Everything else that `Serializable` does is redundant with `@dataclass(frozen=True)`.

### Proposed migration

1. **Start with leaf value types.** `Edit`, `Provenance`, `Evidence`, `MutantTranscript` are new; make them `@dataclass(frozen=True)` from the start. They have no polymorphic subclass structure, so dataclasses cover everything.
2. **Migrate existing leaf types.** `Variant` currently extends `Serializable`. It has no subclass hierarchy (all variants are Variant instances, distinguished by fields); it's a pure value type. Converting to `@dataclass(frozen=True)` is a mechanical change that also gives us a proper `__hash__` for free (useful for using variants as dict keys, which we already do).
3. **Write a tiny polymorphic shim for the effect hierarchy.** ~10 lines of code: build `EFFECT_CLASSES = {cls.__name__: cls for cls in ...}` by walking `MutationEffect.__subclasses__()` recursively (the built-in returns only direct subclasses and takes no `recursive` argument); `from_dict` looks up the right class by name and passes the remaining fields to its constructor. This lets the effect classes become dataclasses too, while preserving polymorphic round-trip.
4. **Drop the `Serializable` dependency**. Once the last user migrates, remove it from `pyproject.toml`.
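
A sketch of the shim from step 3, using toy effect classes with made-up fields. The `"__class__"` key, `all_subclasses`, `effect_to_dict`, and `effect_from_dict` are illustrative choices, not the existing `Serializable` wire format.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MutationEffect:
    variant: str


@dataclass(frozen=True)
class Substitution(MutationEffect):
    aa_ref: str
    aa_alt: str


@dataclass(frozen=True)
class PrematureStop(Substitution):
    pass


def all_subclasses(cls):
    """Map class name -> class for every direct and indirect subclass."""
    found = {}
    for sub in cls.__subclasses__():  # direct subclasses only, so recurse
        found[sub.__name__] = sub
        found.update(all_subclasses(sub))
    return found


def effect_to_dict(effect):
    data = asdict(effect)
    data["__class__"] = type(effect).__name__
    return data


def effect_from_dict(data):
    fields = dict(data)  # copy so the caller's dict is left untouched
    cls = all_subclasses(MutationEffect)[fields.pop("__class__")]
    return cls(**fields)
```

Two limits worth noting: `asdict` flattens nested dataclasses into plain dicts, so effects that hold other effects would need their own handling, and name-keyed lookup assumes subclass names are unique.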

### Scope

This migration is **complementary** to the main refactor in this issue, not a prerequisite. It can proceed as a separate sub-issue at its own pace: each type converted is a small, well-contained PR with straightforward before/after semantics. Test coverage (existing `to_dict`/`from_dict` round-trip tests) stays the regression gate.

### Adding to sub-issues

- **Migrate value types from `Serializable` to dataclasses** — starts with `Variant`, then extends to new types (`Edit`, `Provenance`, `MutantTranscript`) as they're introduced. Can begin immediately, independent of the annotator refactor.

