diff --git a/CLAUDE.md b/CLAUDE.md
index 71537df..65dc78c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -18,9 +18,19 @@ parallel under `refactor/v1`.
 - `LICENSES.md` — dependency license map; ⚠️ items gate respective phase entry.
 - `docs/DECISIONS.md` — non-obvious branches taken (per SPEC §0.5).
 
-**Active branch:** `refactor/v1` (cut off `feature/audio-finetune-phase1`,
-not `main` — see `docs/DECISIONS.md`). `main` is 33 commits behind v0.
-Phase 0 in progress; sign-off pending on AUDIT + LICENSES.
+**Active branch (2026-05-13):** `main`. The Modal production deploy
+(`936a5cc`) and v1 CI hardening landed on `main`; `refactor/v1` is now
+**23 commits behind `main`** and should be treated as historical. Cut new
+work branches off `main`. Older design docs (and earlier paragraphs in
+this file) may reference paths that exist on `main` but not on
+`refactor/v1` — verify with `git cat-file -e origin/main:<path>` before
+relying on them. The full pipeline (`tabvision/tabvision/pipeline.py`),
+the Modal production adapter (`tabvision-server/modal_app.py`,
+`tabvision-server/app/v1_adapter.py`), and the highres audio backend all
+live on `main`. Phase 5 fusion has shipped. See
+`docs/2026-05-12-session-handoff.md` for the production state and
+`docs/plans/2026-05-12-tab-f1-to-spec-design.md` (+ companion Phase 0
+implementation plan) for current accuracy work.
 
 ## Layout
 
diff --git a/SPEC.md b/SPEC.md
index 3fe8f5f..e666752 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -121,6 +121,41 @@ The targets above are aggregate over the full eval set. Per-difficulty-tier expe
 
 If the aggregate hits 0.88 but distorted electric scores below 0.75, treat that as a partial pass and prioritize Phase 7 distortion-augmented fine-tuning before final acceptance.
 
+### 1.4.1 v1 acceptance amendment — per-tier targets (2026-05-13)
+
+Per the 2026-05-13 design plan
+(`docs/plans/2026-05-12-tab-f1-to-spec-design.md`), v1 acceptance moves
+from the aggregate 0.88 Tab F1 in §1.4 to **per-tier targets on a
+public-corpus composite eval set**:
+
+| Tier | §1.4 stretch reference | v1 acceptance |
+|---|---:|---:|
+| Clean acoustic single-line | 0.94 | **0.85** |
+| Clean acoustic strummed | 0.86 | **0.90** |
+| Clean electric | 0.90 | **0.87** |
+| Distorted electric | 0.82 | **0.80** |
+
+Rationale: 2026-05-08 GuitarSet validation showed aggregate Tab F1 = 0.61
+with comp tracks at 0.67 and solo tracks at 0.51, although the same run
+reached 0.92 Onset F1 and 0.90 Pitch F1. The aggregate hid the structural failure mode (single-line
+string/fret assignment). Per-tier targets force the conversation onto the
+right axis and let work be sequenced (strummed first, distorted electric
+last).
+
+**Test-set composition amendment:** the "user's own playing" test set in
+§1.4 paragraph 1 is replaced by a public-corpus composite (GuitarSet
+held-out + Guitar-TECHS + EGDB pending license + qualifying synthetic
+training/dev material). See the design plan §5 for composite policy
+(per-tier minimums, splits, leakage rules, bootstrap CIs).
+
+**Stretch / portfolio reference:** the original §1.4 per-tier table
+(0.94 / 0.86 / 0.90 / 0.82) remains the v1.1 / portfolio stretch bar.
+Hitting it is welcome; v1 acceptance requires only the amended table. For strummed, the amended 0.90 is already stricter than the original 0.86.
+
+**Aggregate Tab F1** is retired as an acceptance metric. **Onset F1
+(≥ 0.92), Pitch F1 (≥ 0.90), chord-instance accuracy (≥ 0.85), and
+latency (≤ 5 min)** from §1.4 are unchanged.
+
 ### 1.5 Hard constraints
 
 - All training/inference dependencies must be free or have a free tier sufficient for this project (see §6).
diff --git a/docs/plans/2026-05-12-tab-f1-to-spec-design.md b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
new file mode 100644
index 0000000..ff1569b
--- /dev/null
+++ b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
@@ -0,0 +1,293 @@
+# Tab F1 v1 acceptance — Strategy & Decision Record
+
+**Date:** 2026-05-12 (revised 2026-05-13 per PR #10 review)
+**Author:** Patrick (brainstormed with Claude)
+**Status:** v3 — strategy / decision-record only; **not** an implementation plan
+**Scope note:** This is a **SPEC §1.4 amendment proposal** plus
+              strategy. Implementation detail lives in companion docs.
+**Companions:**
+- `SPEC.md` §1.4.1 (the amendment table; committed in the same change set)
+- `docs/plans/2026-05-13-tab-f1-phase-0-implementation.md` (Phase 0 impl)
+- Later phase impl plans (write after Phase 0 evidence)
+**Replaces:** v1 + v2 (2026-05-12 single-aggregate-target drafts; both
+              had load-bearing license errors and stale path references
+              and have been superseded by this rewrite).
+
+## 0. License gate (must clear before any compute spend)
+
+Per SPEC §1.5 the **shipping default pipeline** must be portfolio-clean.
+NC-licensed material is acceptable in research/experiment configurations
+that are NOT shipped. Each resource was verified on 2026-05-13:
+
+| Resource | License | Portfolio-default usable? | Source / verification |
+|---|---|---|---|
+| GuitarSet | CC-BY-4.0 | **yes** | https://zenodo.org/records/3371780 |
+| Guitar-TECHS | CC-BY-4.0 | **yes** | arXiv:2501.03720 §4 distribution |
+| EGDB | none on repo — **author email pending** | **gated** | https://ss12f32v.github.io/Guitar-Transcription/ (LICENSES.md ⚠️) |
+| GOAT | request-only, research-only | **no — DROPPED** | arXiv:2509.22655 §4.2 *"made available by request to better control its use for research purposes only"* |
+| SynthTab dataset | **CC-BY-NC-4.0** | **no — DROPPED** | github.com/yongyizang/SynthTab README *"SynthTab is released with CC BY-NC 4.0 license"* |
+| SynthTab rendering code | CC-BY-4.0 | n/a (we're not redistributing the code) | repo `LICENSE` file |
+| DadaGP | access-by-email research-only; underlying GP tabs derive from copyrighted songs | **research/dev only** — NOT in default path | github.com/dada-bots/dadaGP README; underlying tab copyright unsettled |
+| Basic Pitch | Apache-2.0 | yes (Phase 1 pitch ensemble) | github.com/spotify/basic-pitch |
+| highres (xavriley) | MIT | yes — current production audio backend | github.com/xavriley/hf_midi_transcription |
+| MediaPipe Hands | Apache-2.0 | yes — video pipeline | per LICENSES.md |
+| YOLO-OBB (ultralytics) | AGPL-3.0 (accepted per DECISIONS.md) | yes (portfolio is AGPL-OK) | per LICENSES.md |
+| Free amp/cab IRs | varies (most free-public) | yes for default if redistribution terms allow; verify per-pack | Modern Music Solutions Declassified, Djammincabs |
+
+**Drops vs v2 plan:**
+- **SynthTab dropped** because the dataset is CC-BY-NC-4.0; pretraining
+  the shipping audio backend on it taints derived weights (SynthTab paper
+  treats trained models as derivative work). Distillation as a laundering
+  step is rejected — both legally murky and explicitly out of bounds
+  per the 2026-05-13 review.
+- **GOAT dropped** because it's request-only research-only. Cannot
+  evaluate a public portfolio against it.
+
+**Hard rule:** any phase that depends on a "gated" or "no" row must
+produce evidence that the gate cleared (e.g., a written reply from the
+EGDB author) BEFORE that phase ships. No conditional commits, no
+"we'll-figure-it-out-later" merges.
+
+## 1. Decisions
+
+These supersede the v2 D1–D10 set. Append to `docs/DECISIONS.md` per
+SPEC §0.5 once the plan is approved.
+
+| # | Decision | Rationale |
+|---|---|---|
+| D1 | Tab F1 evaluated **per tier**, not as a single aggregate. SPEC §1.4 aggregate 0.88 is retired. | Aggregate hides the real failure mode (string/fret assignment on solo lines). |
+| D2 | Per-tier v1 acceptance targets: **0.85 / 0.90 / 0.87 / 0.80** for clean acoustic single-line / strummed / clean electric / distorted electric. | User-stated floor (0.80) and strummed (≥ 0.90); middle tiers proposed and accepted. Original SPEC numbers (0.94 / 0.86 / 0.90 / 0.82) become the v1.1 / portfolio stretch reference. |
+| D3 | Eval set is a **multi-source public-corpus composite**: GuitarSet + Guitar-TECHS + EGDB (license-pending) + qualifying synthetic. Personal videos banned. GOAT dropped. SynthTab dropped from default path. | Per-tier evaluation requires per-tier sources; portfolio constraint excludes NC and request-only data from the shipping path. |
+| D4 | **No SynthTab in the default pipeline.** Audio-side lift comes from priors + cheap pitch post-processing + GuitarSet fine-tune. DadaGP-derived synthetic remains acceptable for **internal training/dev only** if it's never shipped. | SynthTab CC-BY-NC-4.0 taints derived weights; SPEC §1.5 bars NC from default. |
+| D5 | **No quantitative video-gate.** Video pipeline ships as a qualitative feature; per-tier Tab F1 measured audio-only. | No public dataset has video + per-note string/fret labels (verified 2026-05-12). |
+| D6 | **Free-tier compute first.** Order per CLAUDE.md operating rule 6 and SPEC §6.3: **Local CPU > Colab > Kaggle > Lightning Studios > Modal**. Modal is the last resort. | Project rule, plus Lightning's 22 GPU-hr/month free tier covers any fine-tune we'd plausibly run. |
+| D7 | **1-2 month cadence.** No fixed deadline. | User-stated. |
+| D8 | Stretch goals (bends / slides / hammer-ons / pull-offs) **out of scope** for v1. | SPEC §1.4 already marks them v1.1. |
+| D9 | Top-K acceptable as an editor UX feature; the D2 numbers are on **top-1 only**. | User-stated. |
+| D10 | Personal training clips off the table entirely — not as accuracy gate, not as dev set, not as label source. | User-stated. |
+| D11 | This document is a **SPEC §1.4 amendment**, not a SPEC-achievement plan. Land the SPEC.md update (§1.4.1) in the same change set. | Honest framing of relaxed targets; reviewer's approval bar. |
+
+## 2. Goal & framing
+
+**v1 acceptance:** hit the D2 per-tier Tab F1 targets on the D3
+public-corpus composite eval set within 1-2 months on free-tier
+compute, with the existing v1 pipeline (no §8 contract changes).
+
+**Stretch / portfolio reference:** the original SPEC §1.4 numbers
+(0.94 / 0.86 / 0.90 / 0.82). If we hit them, that's the portfolio
+narrative; v1 acceptance does not require them.
+
+**Out of v1 acceptance:** quantitative video-fusion Tab F1
+improvement claim (no public dataset for it; tracked as qualitative
+only).
+
+## 3. Current evidence
+
+GuitarSet validation, 60 tracks, 8715 gold notes, 2026-05-08
+production candidate (highres + `guitarset-v1` prior, audio-only):
+
+| Metric | Current | Status |
+|---|---:|---|
+| Onset F1 (50 ms) | 0.9218 | passes SPEC §1.4 ≥ 0.92 |
+| Pitch F1 (50 ms) | 0.9022 | passes SPEC §1.4 ≥ 0.90 |
+| Tab F1 aggregate (retired) | 0.6104 | — |
+| Tab F1, comp subset | 0.670 mean | — |
+| Tab F1, solo subset | 0.508 mean | — |
+
+The 27 pp gap to the **retired** 0.88 aggregate target is almost
+entirely string/fret assignment on single-line passages. Audio is at
+spec; only fusion-side assignment is short. This frames the per-tier
+work: **strummed (chord context) is closest to its target; single-line
+needs the most lift.**
+
+**Coverage gap:** GuitarSet covers only the clean acoustic tiers.
+Clean-electric and distorted-electric have **no current measurement**
+on a public corpus and must be acquired in Phase 0.
+
+## 4. Resource inventory
+
+### 4.1 Datasets (default-pipeline path only)
+
+| Source | License | Modality | Labels | Tier coverage |
+|---|---|---|---|---|
+| GuitarSet (on-disk) | CC-BY-4.0 | audio (hex + DI) | JAMS (string + fret + pitch) | clean acoustic single-line, strummed |
+| Guitar-TECHS (acquire) | CC-BY-4.0 | audio (multi-mic + DI) | 6-track per-string MIDI | clean acoustic single-line, clean electric |
+| EGDB (acquire, license pending) | none on repo — author email required | audio (DI + 5 amp sims) | GuitarPro tabs + aligned MIDI | clean electric, distorted electric |
+| Free IR-augmented GuitarSet | CC-BY-4.0 (with IR pack licenses verified) | derived audio | inherited string + fret | distorted electric (fallback if EGDB blocks) |
+
+### 4.2 Datasets (research / dev only — NEVER in the default pipeline)
+
+| Source | License | Use |
+|---|---|---|
+| DadaGP | access-by-email, research-only | possible internal-training augmentation; not shipped, not redistributed |
+| SynthTab | CC-BY-NC-4.0 | reference only; not a substrate for any shipped weight |
+
+### 4.3 Compute accounts (free-tier first, per D6 order)
+
+| Account | Free allowance | Use |
+|---|---|---|
+| Local CPU | 6 cores WSL2 | eval runs, prior training, cheap post-processing experiments |
+| Colab | ~12 hr/day with limits | quick experiments, prior sweeps |
+| Kaggle | ~30 GPU-hr/week T4 | longer sweeps, baseline checks |
+| Lightning Studios | 22 GPU-hr/month | any fine-tune work, batched in one monthly window |
+| W&B | unlimited (academic) | experiment tracking — required before any GPU job |
+| Hugging Face Hub | unlimited public | weight / checkpoint hosting |
+| Modal | pay-per-use | **production smoke retests only**; never default training |
+
+### 4.4 Code already on `main`
+
+- `tabvision.audio.*` — production pitch backends (highres, basicpitch).
+- `tabvision.fusion.{viterbi,chord,playability,position_prior,neck_prior,melodic_prior}` — Phase 5 shipped.
+- `tabvision.video.{guitar,fretboard,hand}` — Phase 4 shipped.
+- `tabvision.pipeline.run_pipeline` — production-facing orchestrator.
+- `tabvision.eval.{manifest,metrics,runner,guitarset_audio}` — eval scaffolding with `REQUIRED_TIERS = ("clean_acoustic_single_line", "clean_acoustic_strummed", "clean_electric", "distorted_electric")` already encoded ([tabvision/tabvision/eval/manifest.py](tabvision/tabvision/eval/manifest.py)).
+- `tabvision-server/{modal_app.py, app/v1_adapter.py}` — Modal production adapter.
+
+### 4.5 What's been tried (lessons carried forward)
+
+| Attempt | Outcome | Lesson |
+|---|---|---|
+| Learned-fusion LightGBM ranker (2026-04-29) | +0.3 pp LOOCV vs +5 pp gate; **-27.8 pp** regression on training-17 | Catastrophic single-fold regression with small data. **Re-try only with strict per-fold regression guard AND with video features actually populated**, which the apr-29 run lacked. |
+| Basic Pitch fine-tune (2026-04-30) | Superseded by highres backend swap | Fine-tune infra reusable; ceiling lift now lives in highres post-processing and possibly a GuitarSet-only highres fine-tune. |
+| Melodic prior | Regresses aggregate by 1.15 pp | Helps solo, hurts comp. Needs solo-density gating. |
+| Position prior `guitarset-v1` | +22 pp Tab F1 | Per-pitch tabular priors are the largest cheap intervention. Style/structure-conditional priors are the natural extension. |
+
+## 5. Composite eval policy
+
+Each tier in the composite eval set must satisfy these rules. The
+manifest schema (`tabvision/tabvision/eval/manifest.py`) already
+encodes tier names and required clip fields; the Phase 0 impl plan
+extends it for source-specific annotation paths and CI reporting.
+
+**Per-tier minimums:**
+- Each of the four required tiers: **≥ 20 clips** and **≥ 500 gold
+  notes**. Below this the bootstrap CI is too wide to claim acceptance.
+- Total composite: ≥ 80 clips, ≥ 2,000 notes.
+
+**Split policy:**
+- GuitarSet: held-out **by player** (player 05 = validation; this is
+  the existing convention from `guitarset_audio_eval.py`). No
+  train/test leak at player level.
+- Guitar-TECHS: split by **performer** if performer metadata is
+  available; else by clip with a deterministic seed.
+- EGDB: split by **source track** (the 240 clean DIs); amp-sim
+  renders of the same track go to the same split. Required to avoid
+  amp-render leakage.
+
+**Source weighting:**
+- Per-tier metrics are reported **un-weighted across sources within a
+  tier** (every clip has equal weight). The strategic question "is
+  GuitarSet over-represented in clean acoustic" gets a separate
+  per-source breakdown in the report; the headline number is the
+  un-weighted clip mean.
+
+**Leakage rules:**
+- No clip used for prior training (`guitarset-v1` etc.) appears in
+  evaluation. Currently `guitarset-v1` is trained on GuitarSet train
+  split, evaluated on player 05 — compliant.
+- Fine-tune sets must be disjoint from eval sets by player / performer.
+- DadaGP-derived synthetic, if used, is training-only and never
+  appears in the eval manifest.
+
+**Confidence intervals:**
+- Every per-tier number reported with a **95% bootstrap CI** over
+  clips (resample clips with replacement, recompute the tier-mean,
+  10 000 resamples). The acceptance test is `lower_95_CI ≥ target`,
+  not just `mean ≥ target` — this disciplines small-sample wishful
+  thinking.
+
+**Parsers:**
+- One parser per source, named for the source and its annotation
+  format. Phase 0 ships: `guitarset_jams`, `guitar_techs_midi`,
+  `egdb_gp`. Each parser converts source-native annotations into the
+  §8 `TabEvent` dataclass list. Round-trip parser tests required.
+
+## 6. Phase outline (high-level only)
+
+Each phase has a goal + acceptance bar here. **Per-phase implementation
+plans** (exact files / tests / commands / acceptance outputs) are
+written **separately**, one phase at a time, only after the prior
+phase's evidence justifies starting it.
+
+- **Phase 0 — Foundation.** Per-tier baselines + error decomposition on
+  the composite eval. Acquire Guitar-TECHS; send EGDB email; verify free
+  compute accounts. **No production code changes.** Acceptance: per-tier
+  baseline numbers exist for ≥ 3 of 4 tiers with bootstrap CIs;
+  per-tier 7-bucket error breakdown exists. [Companion:
+  `2026-05-13-tab-f1-phase-0-implementation.md`.]
+- **Phase 1 — Pitch ceiling lift (cheap moves).** Voicing/silence gate
+  + peak-picking + Basic Pitch pitch-only ensemble. Acceptance: Pitch
+  F1 ≥ 0.93 on GuitarSet validation, no Onset F1 regression > 1 pp.
+- **Phase 2 — Highres fine-tune on GuitarSet only.** Lightning
+  free-tier; ~3 GPU-hr. **No SynthTab pretrain.** Acceptance: Pitch F1
+  ≥ 0.94, no Onset regression > 1 pp; cross-dataset sanity ≥ 0.90 on
+  Guitar-TECHS held-out.
+- **Phase 3 — Style/structure-conditional priors.** Leave-one-player-out
+  CV with hard regression guard. Acceptance: solo Tab F1 +2 pp vs
+  `guitarset-v1`, no per-bucket regression > 1 pp on comp, no fold
+  regression > 3 pp.
+- **Phase 4 — UI-field audit (capo/tuning/instrument/tone/style).**
+  Unit tests confirm each field propagates into a pipeline decision.
+- **Phase 5 — Learned fusion v2.** Re-attempt with proper features
+  (chord-context, prior-values, playability-cost, video-when-on).
+  Acceptance: +3 pp mean Tab F1, no per-fold regression > 3 pp,
+  margin-fallback to structured search baked in.
+- **Phase 6 — Video pipeline qualitative integration.** Enable
+  `TABVISION_VIDEO_ENABLED=true` in dev with a runtime quality gate.
+  Acceptance: video on/off does not regress audio-only metrics by > 0.5 pp.
+- **Phase 7 — Solo-gated melodic prior.** Acceptance: solo +3 pp,
+  comp ±1 pp.
+- **Phase 8 — Tier shortfall recovery.** Only if a tier still misses
+  its D2 target. Per-tier tactics (chord-shape templates for strummed,
+  IR-augmentation for distorted, etc.).
+- **Phase 9 — Final eval + DECISIONS.md update + SPEC.md PR.**
+
+Sequencing: 0 → 1 → 2 in series; 3–7 parallelizable after 2; 8 only
+on shortfall; 9 closes. Total wall-clock estimate: **4-6 weeks
+engineering** + ~1 week EGDB-email turnaround.
+
+## 7. Risks
+
+| # | Risk | Likelihood | Mitigation |
+|---|---|---|---|
+| R1 | EGDB license never resolves | medium | Phase 8 fallback: free-IR-augmented GuitarSet for distorted-electric tier; explicitly flagged as synthesized in reports. |
+| R2 | Guitar-TECHS clips don't span all promised tiers (some clean-electric tracks may be missing) | low-medium | Phase 0 acceptance only requires ≥ 3 of 4 tiers; distorted-electric can wait on EGDB. |
+| R3 | GuitarSet-only fine-tune (Phase 2) over-fits player 05's adjacent training distribution | medium | Cross-dataset sanity on Guitar-TECHS held-out; abort if Guitar-TECHS regresses > 5 pp. |
+| R4 | Per-tier composite has too few clips for statistical significance | medium | D2 acceptance requires `lower_95_CI ≥ target`, not mean. Per-tier minimum 20 clips / 500 notes (§5). |
+| R5 | Phase 5 learned fusion reproduces apr-29 single-fold catastrophe | medium | Strict per-fold regression guard + margin fallback. Decision tree pivots to Phase 7 if it triggers. |
+| R6 | LICENSES.md updates required for Guitar-TECHS / EGDB / IR packs | certain | Update in Phase 0 alongside acquisition. |
+| R7 | Free-tier monthly compute allowance exhausted before Phase 2 + 5 retries | low | Phase 2 ≈ 3 GPU-hr; Phase 5 is CPU. Combined < 10 hr/month, well inside Lightning's 22 hr cap. |
+| R8 | Synthetic data (DadaGP) inadvertently ends up in shipped weights via training/eval pipeline cross-contamination | low | Synthetic clips never appear in `tabvision/data/eval/manifest.toml`; an explicit assert in Phase 0 manifest validator rejects any synthetic-source clip in the default eval set. |
+
+## 8. Out of scope
+
+- Personal training clips (D10).
+- SynthTab in any shipped configuration (D4).
+- GOAT (license).
+- Aggregate Tab F1 ≥ 0.88 as an acceptance gate (D1).
+- Stretch v1.1 (bends / slides / hammer-ons) per D8.
+- Quantitative video-gate (D5).
+- Top-K UI optimization — UI work is separate; D2 applies to top-1.
+- §8 contract changes — no SPEC §8 signature edits in this plan.
+- Modal as a default training surface (D6).
+
+## 9. Open questions (do not gate the plan)
+
+- EGDB author reply timing — assumed ~1 week.
+- Whether Guitar-TECHS subdivides cleanly into "clean acoustic" vs
+  "clean electric" subsets at clip-level metadata, or whether we'll
+  need to inspect waveforms.
+- Whether free IR pack licenses (Modern Music Solutions, Djammincabs)
+  permit redistribution of derived audio in evaluation reports.
+  Phase 8 fallback only.
+
+## 10. Companion docs in this PR
+
+- `SPEC.md` — §1.4.1 amendment block (per-tier targets + composite test set).
+- `CLAUDE.md` — active-branch update (`main`, not `refactor/v1`).
+- `docs/plans/2026-05-13-tab-f1-phase-0-implementation.md` — Phase 0
+  implementation: exact files, tests, commands, acceptance outputs.
+
+Later phase implementation plans (`docs/plans/2026-05-NN-tab-f1-phase-N-implementation.md`)
+will be written one phase at a time, only after the prior phase's
+evidence is in.
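The §5 confidence-interval rule in the design doc above (resample clips with replacement, recompute the tier mean, 10 000 resamples, accept only if `lower_95_CI ≥ target`) can be sketched as follows. This is a minimal illustration, not the planned `tabvision/tabvision/eval/bootstrap.py`. It loosely follows the `bootstrap_ci(values, ..., n=10_000, seed=...)` shape proposed in the Phase 0 plan, assumes a plain percentile interval (the plan does not name a bootstrap variant), and uses only the standard library. The `meets_target` helper and the per-clip scores are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, n=10_000, seed=0, level=0.95):
    """Percentile bootstrap CI for the mean of per-clip scores.

    Returns (mean, lower, upper). Assumption: percentile interval;
    other variants (BCa, basic) would give slightly different bounds.
    """
    scores = [float(v) for v in values]
    if not scores:
        raise ValueError("need at least one clip score")
    rng = random.Random(seed)  # seeded so the same seed gives the same CI
    replicate_means = sorted(
        # Resample clips with replacement, then recompute the tier mean.
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n)
    )
    alpha = (1.0 - level) / 2.0
    lower = replicate_means[int(alpha * n)]
    upper = replicate_means[min(n - 1, int((1.0 - alpha) * n))]
    return statistics.fmean(scores), lower, upper


def meets_target(values, target, **kwargs):
    """Acceptance rule from §5: the CI lower bound, not the mean, must clear the target."""
    _, lower, _ = bootstrap_ci(values, **kwargs)
    return lower >= target


if __name__ == "__main__":
    # Made-up per-clip Tab F1 scores for one tier (illustrative only).
    clip_f1 = [0.91, 0.88, 0.93, 0.79, 0.95, 0.90, 0.86, 0.92, 0.89, 0.94]
    mean, lo, hi = bootstrap_ci(clip_f1, seed=42)
    print(f"mean={mean:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
    print("clears a 0.85 target:", meets_target(clip_f1, 0.85, seed=42))
```

Gating on the lower bound means a tier with few clips or high clip-to-clip variance can miss acceptance even when its mean is above target, which is the behaviour §5 asks for.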
diff --git a/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md b/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md
new file mode 100644
index 0000000..0a9cd5f
--- /dev/null
+++ b/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md
@@ -0,0 +1,305 @@
+# Tab F1 — Phase 0 Implementation Plan
+
+**Date:** 2026-05-13
+**Author:** Patrick (brainstormed with Claude)
+**Status:** Proposed — pending sign-off
+**Strategy doc:** `docs/plans/2026-05-12-tab-f1-to-spec-design.md`
+**Implementation branch:** to be cut as `impl/tab-f1-phase-0` off `main`
+              after the strategy / SPEC amendment lands.
+
+## 0. Phase 0 goal recap
+
+Establish the per-tier baseline and error decomposition needed to
+sequence Phases 1+. **No production code changes; no shipped behavior
+changes; no compute spend on training.**
+
+Acceptance (the first two items are copied from the strategy doc §6; the last two are its Phase 0 tasks):
+
+- Per-tier baseline numbers for ≥ 3 of 4 D2 tiers with **bootstrap
+  95% CIs**, on the composite eval set.
+- Per-tier 7-bucket error decomposition on the same set.
+- Free-tier compute accounts (Local / Colab / Kaggle / Lightning / W&B)
+  verified.
+- EGDB author email sent; reply tracked in `docs/DECISIONS.md`.
+
+## 1. Files to add / modify
+
+### 1.1 New files
+
+| Path | Purpose |
+|---|---|
+| `tabvision/tabvision/eval/parsers/__init__.py` | Parser registry |
+| `tabvision/tabvision/eval/parsers/guitarset_jams.py` | JAMS → `list[TabEvent]` |
+| `tabvision/tabvision/eval/parsers/guitar_techs_midi.py` | 6-track MIDI → `list[TabEvent]` |
+| `tabvision/tabvision/eval/parsers/egdb_gp.py` | GuitarPro tab + MIDI → `list[TabEvent]` (skipped at import-time if PyGuitarPro not installed; runs only when EGDB license clears) |
+| `tabvision/tabvision/eval/composite.py` | `run_composite_eval(manifest_path) -> CompositeReport` — dispatches to per-source parsers and aggregates per-tier |
+| `tabvision/tabvision/eval/bootstrap.py` | Bootstrap CI helper: `bootstrap_ci(values, statistic=mean, n=10_000, seed=int) -> tuple[float, float, float]` returning `(mean, lower_95, upper_95)` |
+| `tabvision/tabvision/eval/error_decomposition.py` | Port of `tabvision-server/tools/error_analysis.py` (apr-28 7-bucket harness) targeting `list[TabEvent]` pairs |
+| `tabvision/scripts/eval/composite_eval.py` | CLI wrapper, invoked as in §3.4: `python tabvision/scripts/eval/composite_eval.py --manifest tabvision/data/eval/composite.toml --output docs/EVAL_REPORTS/composite_baseline_<date>.md` |
+| `tabvision/scripts/eval/decompose_tab_errors.py` | CLI wrapper for error_decomposition.py |
+| `tabvision/data/eval/composite.toml` | Composite-eval manifest (live; generated by `tabvision/scripts/eval/build_composite_manifest.py` per §3.3 and populated incrementally as datasets arrive) |
+| `tabvision/data/fixtures/eval/guitarset_05_BN1-129-Eb_comp.jams` | Single-clip JAMS fixture for parser round-trip test |
+| `tabvision/data/fixtures/eval/guitar_techs_sample.mid` | Single-clip 6-track MIDI fixture |
+| `tabvision/tests/unit/test_parser_guitarset_jams.py` | JAMS parser round-trip test |
+| `tabvision/tests/unit/test_parser_guitar_techs_midi.py` | MIDI parser round-trip test |
+| `tabvision/tests/unit/test_bootstrap_ci.py` | CI helper correctness on known distributions |
+| `tabvision/tests/unit/test_error_decomposition.py` | 7-bucket assignment correctness on synthetic predicted/gold pairs |
+| `tabvision/tests/integration/test_composite_eval_smoke.py` | End-to-end smoke: 5-clip manifest → tier numbers exist + CIs computed |
+| `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` | First baseline report (output of Phase 0E) |
+| `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` | First 7-bucket decomposition (output of Phase 0D) |
+
+### 1.2 Modified files
+
+| Path | Lines | Change |
+|---|---|---|
+| `tabvision/tabvision/eval/manifest.py` | the `REQUIRED_CLIP_FIELDS` block (currently ~lines 21-28) | Add `annotation_format` field so parser-dispatch can route by source |
+| `tabvision/tabvision/eval/manifest.py` | `validate_manifest()` | Reject any clip whose `source` indicates synthetic origin (e.g. starts with `synthtab/` or `dadagp/`) from a non-train split. This is the R8 cross-contamination guard from the strategy doc. |
+| `LICENSES.md` | datasets table | Add Guitar-TECHS (CC-BY-4.0), EGDB (pending), free IR packs as they're acquired |
+| `docs/DECISIONS.md` | append | D1–D11 from strategy doc §1 |
+| `pyproject.toml` (in `tabvision/`) | `[project.optional-dependencies]` | Add `eval` extra with `pretty_midi`, `pyguitarpro`, `jams` (already used elsewhere — verify before adding) |
+
+### 1.3 NOT modified
+
+- `tabvision/tabvision/pipeline.py` — no behavior change in Phase 0.
+- `tabvision/tabvision/fusion/**` — no fusion changes.
+- `tabvision-server/modal_app.py`, `tabvision-server/app/v1_adapter.py` — no production changes.
+- `tabvision-server/app/v1_adapter.py:91` `videoIgnoredByQualityGate` — flagged in strategy doc as a faked diagnostic, but the fix is Phase 6's job, not Phase 0's.
+
+## 2. Test plan
+
+Every test must be runnable via `pytest tabvision/tests/...` and skip
+cleanly when an optional dependency is missing (PyGuitarPro, jams).
+Fixtures go under `tabvision/data/fixtures/eval/`.
+
+### 2.1 Unit tests
+
+| Test name | Fixture | Assertion |
+|---|---|---|
+| `test_parser_guitarset_jams.py::test_jams_round_trip_pitch_string_fret` | `guitarset_05_BN1-129-Eb_comp.jams` (small, ~50 notes) | Every emitted `TabEvent` has `0 ≤ string_idx ≤ 5`, `0 ≤ fret ≤ 24`, monotonically non-decreasing `onset_s`. Total event count matches the JAMS namespace's note count. |
+| `test_parser_guitarset_jams.py::test_jams_pitch_consistency` | same | For each emitted event, MIDI pitch implied by `(string_idx, fret)` matches the JAMS-reported pitch. |
+| `test_parser_guitar_techs_midi.py::test_midi_round_trip_per_string` | `guitar_techs_sample.mid` (6 tracks, 1 per string) | Track index → `string_idx` mapping correct: track 0 → low E (`string_idx=0`), track 5 → high E (`string_idx=5`). |
+| `test_parser_guitar_techs_midi.py::test_midi_pitch_to_fret` | same | Per-string MIDI pitch → fret derivation matches expected standard-tuning offsets: E2=40 → fret 0 string 0, A2=45 → fret 5 string 0, etc. |
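+| `test_bootstrap_ci.py::test_ci_known_normal` | synthetic Gaussian N(0.85, 0.05), n=100 | Returned 95% CI brackets the true mean in roughly 95% of 1000 trials (calibration check). Assert against a tolerance (e.g. coverage ≥ 0.92) rather than a hard ≥ 0.95: sampling noise alone makes a nominally calibrated interval fall below 95% in about half of such runs. |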
+| `test_bootstrap_ci.py::test_ci_handles_small_samples` | n=5 | No exception; CI width sane (≥ standard error). |
+| `test_bootstrap_ci.py::test_ci_deterministic_with_seed` | any | Same seed → same CI. |
+| `test_error_decomposition.py::test_seven_buckets_assigned` | synthetic gold + predicted `TabEvent` lists, one per bucket | Each gold event (or, for `extra_detection`, each unmatched predicted event) lands in the expected bucket: `correct`, `wrong_position_same_pitch`, `pitch_off`, `timing_only`, `missed_onset`, `muted_undetectable`, `extra_detection`. |
+| `test_error_decomposition.py::test_share_of_loss_sums_to_one` | mixed gold + predicted | Per-bucket share-of-loss percentages sum to 100% (excluding the `correct` bucket). |
+
+### 2.2 Integration tests
+
+| Test name | Setup | Assertion |
+|---|---|---|
+| `test_composite_eval_smoke.py::test_five_clip_manifest` | A 5-clip composite manifest using checked-in fixtures (3 GuitarSet, 2 Guitar-TECHS) | `run_composite_eval(manifest)` returns a `CompositeReport` whose tiers include both `clean_acoustic_single_line` and `clean_acoustic_strummed`. Each tier has a non-null `tab_f1_mean` and `tab_f1_ci_95`. |
+| `test_composite_eval_smoke.py::test_synthetic_clip_rejected_from_eval` | A manifest with one clip whose `source = "synthtab/test"` and `split = "test"` | `validate_manifest()` raises with a message mentioning the cross-contamination guard. |
+| `test_composite_eval_smoke.py::test_egdb_skipped_when_pyguitarpro_missing` | Manifest with an EGDB clip but PyGuitarPro not installed | Run completes successfully; the EGDB clip is reported as `skipped` with reason `parser_dependency_missing`. Other clips still evaluated. |
+
+### 2.3 What's NOT tested in Phase 0
+
+- The actual D2 acceptance numbers: those are the *output* of running
+  the harness, not a unit-test assertion. The confidence-interval gate is
+  what's tested; whether the system *hits* 0.85/0.90/0.87/0.80 is a
+  question Phases 1-8 answer.
+- Bootstrap confidence on real production data: covered by the
+  smoke test on fixtures; running on production data is a one-shot
+  command, not a continuous-integration test.
+
+## 3. Commands
+
+All commands run from repo root, in the WSL Ubuntu shell, with the
+`tabvision` venv active (`source tabvision/venv/bin/activate`) and the
+package installed with its extras (`pip install -e 'tabvision[dev,eval]'`).
+
+### 3.1 One-time setup
+
+```bash
+# Install eval extras (PyGuitarPro, pretty_midi, jams)
+cd tabvision && pip install -e '.[dev,eval]' && cd -
+
+# Verify tests pass on the base
+pytest tabvision/tests/unit/test_parser_guitarset_jams.py -v
+pytest tabvision/tests/unit/test_bootstrap_ci.py -v
+```
+
+### 3.2 Acquire Guitar-TECHS
+
+```bash
+# Guitar-TECHS is CC-BY-4.0, hosted on Zenodo (see strategy doc §4.1)
+mkdir -p ~/mir_datasets/guitar_techs
+# Download the dataset archive from the URL in arXiv:2501.03720
+# (resolved at acquisition time; not committed to repo)
+# Extract into ~/mir_datasets/guitar_techs/
+ls ~/mir_datasets/guitar_techs/
+```
+
+### 3.3 Build the manifest
+
+```bash
+# Generate composite.toml from on-disk datasets
+python tabvision/scripts/eval/build_composite_manifest.py \
+  --guitarset ~/mir_datasets/guitarset \
+  --guitar-techs ~/mir_datasets/guitar_techs \
+  --output tabvision/data/eval/composite.toml
+
+# Validate it
+python -c "from tabvision.eval.manifest import validate_manifest; print(validate_manifest('tabvision/data/eval/composite.toml'))"
+```
+
+### 3.4 Run the baseline composite eval
+
+```bash
+python tabvision/scripts/eval/composite_eval.py \
+  --manifest tabvision/data/eval/composite.toml \
+  --backend highres \
+  --position-prior guitarset-v1 \
+  --bootstrap-n 10000 \
+  --bootstrap-seed 42 \
+  --output docs/EVAL_REPORTS/composite_baseline_2026-05-13.md
+```
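+
+The `--bootstrap-n` and `--bootstrap-seed` flags imply a seeded
+resampling estimate of the per-tier CI. The harness's actual scheme is
+not shown in this plan; the sketch below assumes a percentile bootstrap
+over per-clip Tab F1 scores, with the 2.5th percentile as the lower
+bound of a two-sided 95% interval. `bootstrap_ci_lower` and the sample
+scores are hypothetical, not names or data from the repo.
+
+```python
+import random
+from statistics import fmean
+
+
+def bootstrap_ci_lower(scores, n_resamples=10_000, seed=42, alpha=0.05):
+    """Percentile-bootstrap lower bound on the mean (assumed scheme)."""
+    if not scores:
+        raise ValueError("need at least one per-clip score")
+    rng = random.Random(seed)  # seeded so the report is reproducible
+    k = len(scores)
+    # Resample clips with replacement and record each resample's mean.
+    means = sorted(
+        fmean(rng.choices(scores, k=k)) for _ in range(n_resamples)
+    )
+    # Lower edge of a two-sided (1 - alpha) interval.
+    return means[int((alpha / 2) * n_resamples)]
+
+
+# Made-up per-clip scores for one tier, for illustration only.
+tier_scores = [0.60, 0.70, 0.80, 0.90, 0.65, 0.75, 0.85, 0.72]
+print(f"mean={fmean(tier_scores):.3f}")
+print(f"ci_lower={bootstrap_ci_lower(tier_scores):.3f}")
+```
+
+Fixing the seed keeps the reported lower bound stable across reruns,
+which matters when that bound is compared against a fixed target.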
+
+### 3.5 Run the error decomposition
+
+```bash
+python tabvision/scripts/eval/decompose_tab_errors.py \
+  --manifest tabvision/data/eval/composite.toml \
+  --backend highres \
+  --position-prior guitarset-v1 \
+  --output docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md
+```
+
+### 3.6 Verify free-tier compute accounts
+
+```bash
+# W&B: confirm login + a tiny no-op run
+wandb login
+python -c "import wandb; r = wandb.init(project='tabvision-phase0', mode='online'); r.log({'hello': 1}); r.finish()"
+
+# Lightning Studios: open a Studio in the browser, run `nvidia-smi`, screenshot for the DECISIONS.md log
+
+# Kaggle: open a notebook in the browser, run `!nvidia-smi`
+
+# Colab: same as Kaggle (open a notebook in the browser, run `!nvidia-smi`)
+
+# Modal: skip — used only as last resort per D6
+```
+
+### 3.7 Send the EGDB email
+
+User action — not a command. Template in strategy doc; log the
+date sent and the reply (when it arrives) in `docs/DECISIONS.md`.
+
+## 4. Acceptance outputs
+
+These are the artifacts whose existence and content gate Phase 1.
+
+### 4.1 `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md`
+
+Must contain:
+
+- A per-tier table with these columns:
+  - Tier name
+  - Clip count (≥ 20 for any tier claimed against D2)
+  - Mean Tab F1
+  - **95% bootstrap CI lower bound**
+  - Mean Onset F1
+  - Mean Pitch F1
+- Per-source breakdown within each tier (GuitarSet / Guitar-TECHS /
+  EGDB) so we can see whether a tier number is dominated by one
+  source.
+- A "Status vs D2 target" column with one of: **pass** (CI lower ≥
+  target), **gap** (mean ≥ target but CI lower below), **fail** (mean
+  below target).
+- Methodology footer: bootstrap N, seed, parser versions, backend +
+  prior versions, eval-harness commit SHA.
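+
+The three status values reduce to a small rule over the tier mean and
+the CI lower bound. A minimal sketch, assuming a hypothetical
+`tier_status` helper (not a name from the harness); the 0.85 target in
+the usage lines is only an illustrative value:
+
+```python
+def tier_status(mean_f1: float, ci_lower: float, target: float) -> str:
+    """Classify one tier against its D2 target (hypothetical helper)."""
+    if ci_lower >= target:
+        return "pass"  # CI lower bound clears the target
+    if mean_f1 >= target:
+        return "gap"   # mean clears the target, CI lower bound does not
+    return "fail"      # mean itself is below the target
+
+
+# Illustrative numbers only, not measured results.
+print(tier_status(mean_f1=0.91, ci_lower=0.86, target=0.85))
+print(tier_status(mean_f1=0.86, ci_lower=0.82, target=0.85))
+print(tier_status(mean_f1=0.80, ci_lower=0.76, target=0.85))
+```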
+
+### 4.2 `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md`
+
+Must contain:
+
+- Aggregate 7-bucket table (counts + share-of-loss).
+- Per-tier 7-bucket table.
+- A "biggest lever per tier" callout: which bucket dominates each
+  tier's loss. Phase 1+ priorities derive from this.
+
+### 4.3 `tabvision/data/eval/composite.toml`
+
+Must satisfy `validate_manifest()` and contain:
+
+- ≥ 20 clips for each of: `clean_acoustic_single_line`,
+  `clean_acoustic_strummed`. (Guitar-TECHS additions may bring
+  `clean_electric` to ≥ 20 in Phase 0E; if not, that tier waits for
+  EGDB.)
+- `clean_electric` and `distorted_electric` populated as much as
+  Guitar-TECHS + EGDB-license-resolved allow.
+- No `source = synthtab/...` or `source = dadagp/...` rows in
+  `split = validation` or `split = test`.
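+
+The clip-count and no-synthetic-data rules above can be checked
+mechanically. The sketch below is not `validate_manifest()`; it assumes
+a hypothetical manifest layout (a `[[clips]]` array with `source`,
+`tier`, and `split` keys) and made-up sample rows, and it needs Python
+3.11+ for `tomllib`.
+
+```python
+import tomllib
+from collections import Counter
+
+SYNTHETIC_PREFIXES = ("synthtab/", "dadagp/")
+HELD_OUT_SPLITS = {"validation", "test"}
+
+
+def synthetic_leaks(manifest: dict) -> list[str]:
+    """Sources of synthetic clips that sit in a held-out split."""
+    return [
+        clip["source"]
+        for clip in manifest.get("clips", [])
+        if clip.get("split") in HELD_OUT_SPLITS
+        and clip.get("source", "").startswith(SYNTHETIC_PREFIXES)
+    ]
+
+
+def tier_counts(manifest: dict) -> Counter:
+    """Clip count per tier, for the >= 20 check."""
+    return Counter(clip["tier"] for clip in manifest.get("clips", []))
+
+
+# Made-up rows in the assumed layout, for illustration only.
+SAMPLE = """
+[[clips]]
+source = "guitarset/example_clip"
+tier = "clean_acoustic_single_line"
+split = "test"
+
+[[clips]]
+source = "synthtab/example_0001"
+tier = "clean_electric"
+split = "train"
+
+[[clips]]
+source = "dadagp/example_0002"
+tier = "distorted_electric"
+split = "validation"
+"""
+
+manifest = tomllib.loads(SAMPLE)
+print("leaks:", synthetic_leaks(manifest))
+print("counts:", dict(tier_counts(manifest)))
+```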
+
+### 4.4 `docs/DECISIONS.md` entries
+
+D1–D11 from strategy doc §1, dated 2026-05-13. EGDB email send-date
+and reply (when it arrives) as a separate entry.
+
+### 4.5 CI verification
+
+`pytest tabvision/tests/unit tabvision/tests/integration -v` passes
+on `main` HEAD plus this Phase 0 branch.
+
+## 5. Decision tree
+
+What to do once the Phase 0E baseline is in:
+
+- **All four tiers' CI lower bound clears D2** — surprising; sanity
+  check the eval harness, then declare v1 acceptance and skip to
+  Phase 9. This is unlikely given the 2026-05-08 0.61 aggregate.
+- **Strummed CI lower bound clears D2, other tiers gap or fail** —
+  expected case. Proceed to Phase 1 (pitch ceiling lift). The
+  error-decomposition report tells us whether Phase 2 (fine-tune) or
+  Phase 3 (style priors) is the next priority after Phase 1.
+- **All tiers fail** — Phase 0 implementation has a bug, or the
+  highres backend regressed on the broader corpus. Inspect 3-5
+  worst-case clips by hand before any further compute spend.
+- **`distorted_electric` has < 20 clips** — EGDB license is the
+  blocker. Set the tier aside; document the gap in the report; do not
+  publish D2 acceptance until the EGDB row clears.
+
+## 6. Time + compute budget
+
+| Item | Effort | Compute |
+|---|---|---|
+| Parser implementations + tests (1.1) | 1.5 days | none |
+| Manifest extensions + validator hardening (1.2) | 0.5 day | none |
+| Composite + bootstrap + error-decomposition modules (1.1) | 1 day | none |
+| Guitar-TECHS acquisition + manifest population | 0.5 day | none |
+| Baseline + decomposition runs (3.4 + 3.5) | 4-8 wall-clock hours | local CPU |
+| Free-tier compute account verification | 0.5 day | none |
+| EGDB email + DECISIONS.md updates | 15 minutes | none |
+| Report writing | 0.5 day | none |
+| **Total** | **4-5 days engineering** | **~$0** |
+
+## 7. Out of scope for Phase 0
+
+- Any production-pipeline change. No edits to `pipeline.py`, `fusion/`,
+  `audio/`, `video/`, `tabvision-server/`.
+- Fine-tuning, training, or model weight changes.
+- Anything depending on the EGDB license reply (defer to Phase 8 or
+  later).
+- Style-conditional priors (Phase 3).
+- Video pipeline experiments (Phase 6).
+- Synthetic-data generation (research/dev only; not part of Phase 0).
+
+## 8. Done definition
+
+Phase 0 is **done** when:
+
+- All items in §1.1 and §1.2 exist on the impl branch.
+- All tests in §2.1 and §2.2 pass.
+- `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` exists and meets
+  §4.1.
+- `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` exists
+  and meets §4.2.
+- `tabvision/data/eval/composite.toml` exists and validates.
+- `docs/DECISIONS.md` includes D1–D11.
+- EGDB email send-date recorded.
+- Free-tier compute accounts verified (W&B at minimum; Lightning /
+  Kaggle / Colab logged in `docs/DECISIONS.md`).
+
+Then — and only then — the Phase 1 implementation plan gets written.