From 6f2117d1f4fe61842f775c34641936dec786490c Mon Sep 17 00:00:00 2001
From: Patrick Gilhooley <pgilhooley95@gmail.com>
Date: Wed, 13 May 2026 06:15:58 -0400
Subject: [PATCH 1/2] docs(plan): tab F1 per-tier targets and 10-phase work
 breakdown
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Plan for closing the Tab F1 gap on GuitarSet validation (currently 0.61
aggregate). Replaces the single-aggregate SPEC §1.4 target with per-tier
targets (clean acoustic single-line 0.85, strummed 0.90, clean electric
0.87, distorted electric 0.80) and locks in 10 decisions on the eval
composite, compute, and scope.

10-phase plan: foundation setup → pitch ceiling lift → SynthTab pretrain
+ highres fine-tune → style priors → UI-field audit → learned fusion v2
→ video qualitative integration → solo-gated melodic prior → tier
shortfall recovery → final eval.

Plan-only, no code changes. Phase 0 user actions enumerated in §8.
---
 .../plans/2026-05-12-tab-f1-to-spec-design.md | 536 ++++++++++++++++++
 1 file changed, 536 insertions(+)
 create mode 100644 docs/plans/2026-05-12-tab-f1-to-spec-design.md

diff --git a/docs/plans/2026-05-12-tab-f1-to-spec-design.md b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
new file mode 100644
index 0000000..41c14a9
--- /dev/null
+++ b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
@@ -0,0 +1,536 @@
+# Tab F1 → Per-Tier Targets — Design
+
+**Date:** 2026-05-12
+**Author:** Patrick (brainstormed with Claude)
+**Status:** Proposed — pending sign-off
+**Spec source:** `SPEC.md` §1.4 (per-tier table), §5 Phase 5, §8 contracts, §1.5 hard constraints, §6.3 free compute accounts.
+**Branch:** to be cut off `refactor/v1` once approved.
+**Depends on:** `docs/plans/2026-05-06-phase5-fusion-design.md`, `docs/plans/2026-05-06-video-pipeline-integration-design.md`, `docs/EVAL_REPORTS/guitarset_accuracy_boost_2026-05-08.md`.
+**Replaces:** earlier 2026-05-12 single-aggregate-target draft (never committed).
+
+## 0. Decisions taken on 2026-05-12
+
+These were locked in during the planning conversation; record them in
+`docs/DECISIONS.md` per SPEC §0.5 once the plan is approved.
+
+| # | Decision | Rationale |
+|---|---|---|
+| D1 | Tab F1 evaluated **per tier**, not as a single aggregate. SPEC §1.4 aggregate 0.88 is retired. | Aggregate hides the real failure mode (string/fret assignment on solo lines). Per-tier targets force the conversation onto the right axis. |
+| D2 | Per-tier numeric targets (table below). | Strummed raised from SPEC 0.86 → 0.90; distorted-electric floor lowered 0.82 → 0.80. Single-line (0.94 → 0.85) and clean-electric (0.90 → 0.87) targets relaxed to reflect the gap between the current 0.61 aggregate and any realistic ceiling. |
+| D3 | Eval set is a **multi-source composite**: GuitarSet + Guitar-TECHS + GOAT + EGDB (pending license) + synthetic. Personal videos banned from any role. | GuitarSet alone gives one player, one genre cluster, no electric/distorted. Per-tier evaluation requires per-tier sources. |
+| D4 | **SynthTab** pretrain → real-data fine-tune is the audio-side plan. No DIY DadaGP synthesis unless SynthTab proves insufficient. | SynthTab (CC-BY-4.0, ~6,700 h with string/fret labels) pre-empts the engineering cost of building a renderer. Literature (SynthTab paper, High-Res Domain Adaptation arXiv:2402.15258) shows pretrain+fine-tune lifts cross-dataset generalization. |
+| D5 | **No quantitative video-gate.** Video pipeline ships as a qualitative feature. Production runs audio-only; video is opt-in. | No public dataset has synchronized guitar video + per-note string/fret labels. Confirmed via 2026-05-12 research pass (see §3.1). |
+| D6 | **Free-tier compute first.** Order: local CPU > Lightning Studios free (22 GPU-hr/mo) > Kaggle (30 hr/wk T4) > Colab > Modal. | Per CLAUDE.md operating rule 6, SPEC §6.3, and the SPEC §1.5 hard constraint. The earlier $30-80 fine-tune estimate was Modal pricing; free tier fits a highres fine-tune comfortably. |
+| D7 | **1-2 month cadence.** No fixed deadline. | User-stated. |
+| D8 | Stretch goals (bends / slides / hammer-ons / pull-offs) **out of scope** for v1; SPEC §1.4 already marks them v1.1. | Confirmed in conversation. |
+| D9 | Top-K is acceptable as an editor UX feature but the **0.80 floor and per-tier targets apply to the top-1 prediction**. | User-stated. |
+| D10 | Personal training clips (the 20-video set) **off the table entirely** — not as accuracy gate, not as dev set, not as label source. | User-stated. |
+
+### Per-tier Tab F1 targets (D2)
+
+| Tier | SPEC §1.4 | This plan |
+|---|---:|---:|
+| Clean acoustic single-line | 0.94 | **0.85** |
+| Clean acoustic strummed | 0.86 | **0.90** |
+| Clean electric | 0.90 | **0.87** |
+| Distorted electric | 0.82 | **0.80** |
+
+All on the multi-source composite test set (D3). Top-1 prediction only.
+Onset F1 (≥ 0.92) and Pitch F1 (≥ 0.90) from SPEC §1.4 remain unchanged
+— audio already clears them on GuitarSet.
+
+## 1. Goal
+
+Hit the D2 per-tier Tab F1 targets on the D3 composite eval set within
+1-2 months using free-tier compute, while keeping the production system
+the SPEC §8 contract-conformant v1 pipeline.
+
+## 2. Current evidence
+
+GuitarSet validation, 60 tracks, 8715 gold notes, 2026-05-08 production
+candidate (highres + `guitarset-v1` prior, no video, no melodic prior):
+
+| Metric | Current | SPEC | Status |
+|---|---:|---:|---|
+| Onset F1 (50 ms) | 0.9218 | ≥ 0.92 | pass |
+| Pitch F1 (50 ms) | 0.9022 | ≥ 0.90 | pass |
+| Tab F1 aggregate | 0.6104 | ≥ 0.88 (deprecated) | retired metric |
+
+Per-track distribution (2026-05-12 diagnostic):
+
+- Tab F1 mean **0.589**, median 0.620, min 0.166, max 0.933
+- Comp tracks (n=30) mean **0.670**; solo tracks (n=30) mean **0.508**
+- Worst 10 tracks: 7 are solos. Best 5: 4 are comps.
+- Tab/Pitch ratio: comp 0.744, **solo 0.546** — solos lose 45% of
+  pitch-correct notes to wrong string/fret assignment.
+
+**Bottleneck is string/fret assignment on single-line passages where
+chord-cluster context is absent.** Audio is essentially at spec; only the
+Tab F1 numbers are red, and only on the solo regime.
+
+GuitarSet covers only two of the four tiers: comp tracks map to "clean
+acoustic strummed" and solo tracks to "clean acoustic single-line". The
+electric and distorted-electric tiers (D2) have no current measurement;
+their eval data must be acquired (D3).
+
+## 3. Resource inventory
+
+### 3.1 Datasets
+
+Verified by the 2026-05-12 research pass. Italics = on-disk now;
+**bold** = to acquire.
+
+| Source | License | Modality | Labels | Size | Tier coverage |
+|---|---|---|---|---|---|
+| *GuitarSet* | CC-BY-4.0 | audio (hex + DI) | JAMS (string + fret + pitch) | 3 h, 6 players | clean acoustic single-line, strummed |
+| **Guitar-TECHS** | CC-BY-4.0 | audio (multi-mic + DI) | 6-track MIDI per string | 5h12m | clean acoustic single-line, clean electric |
+| **GOAT** | CC-BY-4.0 | DI audio | tablature | 5.9 h | clean electric |
+| **EGDB** | None on repo — **email author for portfolio-use permission** | audio (DI + 5 amp sims, ~6 renders) | GuitarPro tabs + aligned MIDI (string + fret) | ~12 h synthesized | clean electric, distorted electric |
+| **SynthTab** | CC-BY-4.0 | synthesized audio | string + fret + onset | ~6,700 h | all four tiers (pretrain only) |
+| GAPS | CC-BY-NC-SA | YouTube video links + MIDI pitch | pitch-only, **not tab** | 14 h | reject — non-commercial taint |
+| DadaGP | research-access (email) | symbolic GP files | tab natively | 26,181 files | fallback synthesis source if SynthTab insufficient |
+| ~~The 20 personal clips~~ | n/a | n/a | n/a | n/a | **banned** (D10) |
+
+**Confirmed gap:** no public dataset combines guitar video with per-note
+string+fret labels. This is the load-bearing finding behind D5.
+
+### 3.2 Compute
+
+| Account | Free allowance | Status | Use |
+|---|---|---|---|
+| Lightning Studios | 22 GPU-hr/month | Phase 0 setup | SynthTab pretrain, highres fine-tune |
+| Kaggle | ~30 GPU-hr/week T4 | Phase 0 setup | overflow for long sweeps |
+| Colab | ~12 hr/day with limits | Phase 0 setup | quick experiments |
+| W&B | unlimited (academic) | Phase 0 setup | experiment tracking |
+| HuggingFace Hub | unlimited public | already used | weights / checkpoints |
+| Modal | pay-per-use | already used | production smoke retests only |
+| Local CPU | 6 cores WSL2 | available | eval, priors, light tuning |
+
+Per CLAUDE.md operating rule 6: Local > Colab > Kaggle > Lightning >
+Modal. Modal is the last resort, not the default.
+
+### 3.3 Code already in tree
+
+- `tabvision.audio.highres` — production pitch backend, 0.92 / 0.90 on GuitarSet.
+- `tabvision.fusion.position_prior` — `guitarset-v1` prior, +22pp Tab F1.
+- `tabvision.fusion.{viterbi,chord,playability,neck_prior,melodic_prior}` — Phase 5 shipped, cluster Viterbi + chord state enumeration + playability emission/transition costs.
+- `tabvision.video.{guitar,fretboard,hand}` — Phase 4 shipped (1603 LOC).
+- `tabvision.pipeline.run_pipeline` — composes all of the above; production runs through it via `tabvision-server/app/v1_adapter.py`.
+- `tabvision-server/tools/eval_basic_pitch_baseline.py` + `tabvision/scripts/eval/guitarset_audio_eval.py` — current evaluation harness; needs extension for multi-source composite.
+- `tabvision-server/tools/outputs/errors-2026-04-28_185743.md` — apr-28 error-decomposition methodology (proven on personal clips); port the same 7-bucket harness to the composite eval set.
+
+### 3.4 What has been tried (lessons)
+
+| Attempt | Date | Outcome | Lesson |
+|---|---|---|---|
+| Learned-fusion LightGBM ranker | 2026-04-29 | +0.3pp LOOCV vs +5pp gate; **catastrophic −27.8pp regression on training-17** | Small data + over-fit on one held-out group. **Critically: video features were `null` on every row** (`audio_only=True`) — so this wasn't actually a test of learned-fusion-with-video. Re-attempt with proper feature instrumentation is justified. |
+| Basic Pitch fine-tune on GuitarSet | 2026-04-29/30 | Did happen; superseded by highres backend swap before final integration | Fine-tune infrastructure is reusable for highres; SynthTab pretrain is the missing first step. |
+| Melodic prior | current | Regresses aggregate Tab F1 from 0.6104 to 0.5989 | Helps solo, hurts comp. Needs solo-density gating, not a flat enable. |
+| Position prior `guitarset-v1` | 2026-05-08 | +22pp Tab F1 vs no prior | Per-pitch tabular priors are the largest-leverage cheap intervention. Style/structure-conditional versions are the natural extension. |
+| Phase 5 cluster Viterbi + chord enumeration | 2026-05-06 | Shipped, drives current production | The audio-only structured search is already well-tuned. Further gain needs either better priors or different evidence (which video can't provide on the eval set). |
+
+## 4. Plan
+
+10 phases. Phases 0–2 sequential; 3–7 parallelizable (§5). Decision tree
+inside each phase determines whether to continue, branch, or escalate.
+
+### Phase 0 — Foundation (parallel, no compute, 1 week wall-clock)
+
+**Goal:** assemble the evidence base + accounts the rest of the plan
+depends on. No production code changes; setup only.
+
+Concurrent tracks:
+
+- **0A. Acquisition.**
+  - [user] Email EGDB author (`f08946011@ntu.edu.tw`) for written
+    portfolio-use permission. Template draft in §8.
+  - [code] Download Guitar-TECHS and GOAT (both CC-BY-4.0, no email).
+  - [code] Sample SynthTab to a 500-clip pilot subset (~50 h). Full
+    download deferred until Phase 2.
+- **0B. Compute accounts.**
+  - [user] Lightning Studios, Kaggle, Colab, W&B sign-ups per SPEC §6.3.
+  - [code] Verify each by running a hello-world (W&B init + a GPU
+    `nvidia-smi` job on each platform).
+- **0C. Eval harness extension.**
+  - [code] Build `tabvision/scripts/eval/composite_eval.py`. Reads a
+    manifest TOML (per-clip tier label + source + audio path + tab
+    annotation path) and runs the same `guitarset_audio_eval.py` logic
+    across all sources. Outputs per-tier Tab F1, per-source CSVs, and a
+    consolidated Markdown report.
+  - Manifest schema follows the placeholder in
+    `tabvision/data/eval/manifest.toml`. Tier label is one of
+    `clean_acoustic_single_line`, `clean_acoustic_strummed`,
+    `clean_electric`, `distorted_electric`.
+- **0D. Error decomposition.**
+  - [code] Port `tools/error_analysis.py` (apr-28 7-bucket harness)
+    from personal-clip input to the composite eval set. Output:
+    `docs/EVAL_REPORTS/error_decomposition_<date>.md` with per-tier
+    bucket counts.
+- **0E. Baseline measurement.**
+  - [code] Run `composite_eval.py` against the current production
+    pipeline. Get the per-tier numbers. These are the Phase 1+
+    starting points.
+
+**Phase 0 acceptance gate:**
+- Per-tier Tab F1 baseline numbers exist for at least 3 of the 4 tiers
+  (distorted electric is EGDB-dependent; deferred OK).
+- Per-tier 7-bucket error decomposition exists.
+- All free-tier compute accounts verified.
+- EGDB email sent.
+- No production code changes.
+
+**Decision tree:**
+- If the baseline already meets a tier's target (e.g., strummed at 0.92)
+  → drop that tier from later phases' work.
+- If pitch-side metrics regress vs the 2026-05-08 GuitarSet numbers on
+  the composite set → STOP and investigate before any further work.
+  The composite eval should not change audio-side numbers on GuitarSet.
+
+### Phase 1 — Pitch ceiling lift, cheap moves (local CPU, 2-3 days)
+
+**Goal:** Pitch F1 from 0.902 → ≥ 0.93 on GuitarSet validation, without
+training. Gives Tab F1 mathematical headroom regardless of fusion-side
+work.
+
+Moves, in order:
+
+1. **Voicing/silence gate** on highres pitch posteriors. Tune the
+   joint onset+pitch confidence threshold. Trade some recall for
+   precision; expect net F1 gain.
+2. **Onset peak-picking adjustment.** The 50 ms tolerance is generous;
+   misaligned within-tolerance peaks still produce pitch mis-reads.
+   Improve peak localization → tighter onset match → higher pitch TP
+   count.
+3. **Basic Pitch pitch-only ensemble.** Run Basic Pitch alongside
+   highres. Use Basic Pitch's pitch output (not onset) as a tiebreaker
+   on pitch-disagreement events; downweight (or drop) events where the
+   two backends disagree on pitch. SPEC §6.1 path; LICENSES.md
+   confirms Basic Pitch is Apache-2.0 default-pipeline-safe.
+
+**Phase 1 acceptance:**
+- Pitch F1 ≥ 0.93 on GuitarSet validation.
+- Onset F1 ≥ 0.92 (no regression).
+- Aggregate Tab F1 ≥ 0.62 (no regression beyond mathematical
+  pitch-improvement bound).
+
+**Decision tree:**
+- 0.93 met → continue.
+- 0.92–0.93 → continue; Phase 2 fine-tune still useful as ceiling lift.
+- < 0.92 → diagnose. Could be a threshold-sweep artifact rather than a
+  real regression. Inspect on 3-5 representative tracks before
+  escalating.
+
+### Phase 2 — SynthTab pretrain + highres fine-tune (Lightning, 1 week)
+
+**Goal:** Pitch F1 ≥ 0.94 on GuitarSet validation. Lift the audio
+ceiling beyond what threshold-tuning alone can do.
+
+**Approach.** Per the SynthTab paper (ICASSP 2024) and arXiv:2402.15258,
+the proven recipe is: pretrain on synthetic, fine-tune on real.
+
+- **Pretrain corpus:** SynthTab 500-clip pilot (Phase 0). Full set
+  (~6,700 h) is overkill at this stage and won't fit in the free-tier
+  monthly budget.
+- **Pretrain head:** the highres model's pitch+onset head. Backbone
+  frozen for the pretrain phase to avoid catastrophic forgetting on
+  the spectral feature extractor.
+- **Fine-tune:** GuitarSet train split (4 players, 240 tracks ≈ 2 h),
+  unfrozen, 5-10 epochs with early stopping on Pitch F1.
+- **Compute:** Lightning Studios free tier (22 GPU-hr/month). Estimate:
+  pretrain ~6 GPU-hr, fine-tune ~3 GPU-hr. Buffer for re-runs ~5 GPU-hr.
+  Stays inside the monthly allowance.
+
+**Phase 2 acceptance:**
+- GuitarSet validation Pitch F1 ≥ 0.94.
+- No Onset F1 regression > 1 pp.
+- Cross-dataset sanity: on Guitar-TECHS (held out from training),
+  Pitch F1 ≥ 0.90 (no catastrophic transfer loss).
+
+**Decision tree:**
+- Met all three → continue. New `audio_backend = "highres-synthtab"`
+  becomes the candidate for production replacement.
+- GuitarSet met, Guitar-TECHS regresses > 5 pp → over-fit on the
+  pretrain distribution. Reduce pretrain epochs, increase fine-tune
+  weight, retry once.
+- GuitarSet ≤ 0.93 → SynthTab pretrain didn't transfer; abandon
+  Phase 2 and revisit with the actual diagnostic (Pitch P/R curves)
+  before any further training spend.
+
+### Phase 3 — Style/structure-conditional priors (local CPU, 3 days)
+
+**Goal:** lift Tab F1 on solos via finer-grained per-pitch position
+priors. Expected +1 to +5 pp on solo subsets.
+
+- **Buckets:** {bn, jazz, funk, rock, ss} × {solo, comp} = 10 priors.
+  GuitarSet's `style` field gives the genre axis directly; structure
+  axis derived from cluster-singleton density (already computable in
+  fusion).
+- **Train:** GuitarSet train split (players 00, 01, 02, 03, 04).
+  Per-bucket Laplace-smoothed counts. Empty cells fall back to
+  `guitarset-v1`.
+- **Validate:** leave-one-player-out CV (not LOOCV per-clip — too
+  small). Primary metric: per-bucket Tab F1 delta vs `guitarset-v1`
+  baseline on the held-out player.
+- **Risk:** the apr-29 learned-fusion attempt failed with one
+  catastrophic regression. Same class of risk here — small data,
+  bucketing on 4 training players. **Hard regression guard:** abort
+  the bucket if any cross-validation fold regresses by > 3 pp.
+
+**Phase 3 acceptance:**
+- Mean Tab F1 over solo buckets: +2 pp vs `guitarset-v1` baseline.
+- No bucket regresses by > 1 pp on comp.
+- No cross-validation fold regresses by > 3 pp on any bucket.
+
+**Decision tree:**
+- Met → ship the prior set, expose `position_prior = "guitarset-styled-v1"`.
+- Solo gain < 2 pp → drop the structure axis, ship style-only.
+- Any bucket fails the regression guard → drop that bucket only;
+  fall back to `guitarset-v1` for it. Don't kill the whole experiment
+  on one bad bucket.
+
+### Phase 4 — Style+structure-aware capo/tuning audit (local, 1 day)
+
+**Goal:** verify the capo / instrument / tone / style fields from the
+upload UI are actually flowing into prior selection and playability
+weights as designed.
+
+- **Trace:** unit-test that with `capo_fret = 5` the position prior
+  shifts correctly (frets 0-19 become frets 5-24).
+- **Smoke:** run a known capo-3 clip from GuitarSet (if any exist)
+  and confirm the output tab is rendered against the capo.
+- **Audit playability:** confirm `instrument = electric` doesn't apply
+  the open-string bonus differently when it shouldn't, etc.
+
+Small phase; mostly a correctness-check before later phases compound
+any bugs here.
+
+**Phase 4 acceptance:**
+- All upload-form fields measurably affect at least one pipeline
+  decision per a unit test.
+- No silent fallback to defaults on any field.
+
+### Phase 5 — Learned fusion v2 (local, 3-5 days)
+
+**Goal:** the 2026-04-24 plan's learned-fusion approach, redone with
+proper feature instrumentation. **Per-pitch + chord-context ranker**,
+not the audio-only ranker that flat-lined at +0.3 pp on 2026-04-29.
+
+**Why this can work this time:** the apr-29 attempt's per-candidate
+features were limited (no fusion-prior values, no neck-anchor values
+because video was off, no chord-cluster context). Now that SPEC Phase 5
+has shipped the structured search, those values are exposed and can be
+used as features.
+
+**Per-candidate features:**
+- `pitch`, `confidence`, `duration`, `amplitude` (audio).
+- `position_prior_log_prob`, `melodic_prior_log_prob`,
+  `neck_prior_log_prob` (fusion priors at this candidate).
+- `cluster_size`, `cluster_span`, `is_singleton`, `singleton_density_2s`
+  (chord context).
+- `emission_cost`, `transition_cost_to_prev` (playability).
+- `cand_string`, `cand_fret`, `is_open`, `is_low_position` (identity).
+- `style`, `instrument`, `tone` (from session config; flow-from-UI
+  audited in Phase 4).
+
+**Training:** GuitarSet train split, leave-one-player-out CV. LightGBM
+`lambdarank` with hard regression guard at -3 pp per held-out player.
+
+**Phase 5 acceptance:**
+- Mean Tab F1 across all held-out players: +3 pp vs Phase 3-or-earlier
+  baseline.
+- No held-out player regresses by > 3 pp.
+- Margin-based fallback to structured-search pick when learned-fusion
+  margin is below a threshold (mitigates OOD behavior in production).
+
+**Decision tree:**
+- Met → ship behind a flag, default off, with the margin fallback.
+  Default-on requires a separate review pass with at least one week of
+  production smoke clean.
+- Per-player regression > 3 pp on any fold → the apr-29 failure mode
+  repeats. Stop Phase 5 and pivot to Phase 7 instead.
+
+### Phase 6 — Video pipeline qualitative integration (1-2 days)
+
+**Goal:** re-enable the video stack in production for users whose uploads
+have usable video, without claiming any quantitative Tab F1 improvement.
+**No video accuracy gate** (D5).
+
+- Flip `TABVISION_VIDEO_ENABLED=true` in `tabvision-server/modal_app.py`
+  in dev.
+- Verify pipeline runs end-to-end on at least one synthetic
+  fretboard-rendered clip (Phase 6A) and the qualitative output is
+  reasonable.
+- Add a runtime quality gate (the one the v1_adapter currently fakes):
+  reject video evidence when `handDetectionRate < 0.3` or
+  `fretboardDetectionConfidence < 0.5`. Diagnostics in result JSON.
+- Production smoke: end-to-end on the existing `test_a440.mp4` (audio
+  ceiling) and one real-world iPhone clip (qualitative inspection only,
+  not gated).
+
+**Phase 6A — Synthetic fretboard video** (optional, 2-3 days):
+- Render a procedurally-generated fretboard animation (Blender or
+  pyrender) against SynthTab audio. Synchronized by-construction.
+- Use for video-pipeline smoke + regression tests (does turning video
+  on/off change anything?), NOT for accuracy claims.
+
+**Phase 6 acceptance:**
+- Enabling video in dev does not regress GuitarSet audio-only Pitch /
+  Onset / Tab F1 metrics (delta within ±0.5 pp).
+- At least one synthetic clip produces a non-empty `fingerings` list
+  in the result.
+- Production smoke clean.
+
+**Decision tree:**
+- Audio-only metrics regress when video enabled → video is making
+  things worse on no-video-content clips. Add a fail-fast that
+  disables video output when `videoObservationCount == 0`, retry.
+- No regression but no positive signal either → ship video as opt-in,
+  default off. Revisit when a public video+tab dataset emerges.
+- Positive signal on some clips → ship default-on with the quality
+  gate.
+
+### Phase 7 — Solo-gated melodic prior (local, 2 days)
+
+**Goal:** re-enable the existing melodic prior in the regime where it
+helps (solo passages) without re-introducing the comp regression that
+caused the current ship-disable.
+
+- Gate the melodic prior on rolling-window singleton density: apply
+  only when ≥ 80% of clusters in the last 2 seconds are singletons.
+- Re-tune the 35/65 prior-blend ratio currently hard-coded in
+  `tabvision/tabvision/fusion/melodic_prior.py:64`.
+
+**Phase 7 acceptance:**
+- Solo subset Tab F1 +3 pp vs Phase 3 baseline.
+- Comp subset Tab F1 within ±1 pp.
+- No per-track regression > 3 pp.
+
+### Phase 8 — Tier shortfall recovery (as needed, 1-2 weeks)
+
+Triggered only if a tier still misses its D2 target after Phases 1-7.
+
+- **Distorted electric < 0.80:**
+  - If EGDB acquired: oversample EGDB distorted variants in Phase 2
+    fine-tune; re-run.
+  - If EGDB blocked: synthesize a distorted training subset via
+    SynthTab clean audio + free IR pack convolution (Modern Music
+    Solutions Declassified, Djammincabs).
+- **Clean acoustic single-line < 0.85:**
+  - Re-tune Phase 7 melodic-prior strength on the single-line subset.
+  - If still short: add a position-shift smoothing prior (events
+    within < 200 ms shouldn't span > 5 frets unless audio amplitude
+    suggests a deliberate slide).
+- **Clean acoustic strummed < 0.90:**
+  - Chord-shape template prior: for each detected chord cluster,
+    boost candidate fingerings that match a curated set of 30-50
+    common guitar chord shapes (port from
+    `tabvision-server/app/chord_shapes.py`, 790 LOC).
+- **Clean electric < 0.87:**
+  - Likely co-resolves with one of the above. Investigate per-tier
+    error decomposition before adding tier-specific work.
+
+### Phase 9 — Final eval + documentation
+
+- Run `composite_eval.py` with full per-tier table.
+- Write `docs/EVAL_REPORTS/per_tier_acceptance_<date>.md`.
+- Update `docs/DECISIONS.md` with each Dn entry actually taken.
+- Final SPEC §1.4 amendment proposal: tier table replaces aggregate
+  target. Land as a SPEC PR.
+
+## 5. Sequencing
+
+```
+Phase 0 (parallel setup)  [week 1]
+    ↓
+Phase 1 (pitch ceiling cheap)  [week 1]
+    ↓
+Phase 2 (SynthTab + fine-tune)  [week 2]
+    ↓
+┌────────────────────────────────────────┐
+│ Phase 3 (style priors)          [w3]   │
+│ Phase 4 (UI fields audit)       [w3]   │  parallel
+│ Phase 5 (learned fusion v2)     [w3-4] │
+│ Phase 6 (video qualitative)     [w3]   │
+│ Phase 7 (solo melodic prior)    [w3]   │
+└────────────────────────────────────────┘
+    ↓
+Phase 8 (tier recovery)          [w5-6 as needed]
+    ↓
+Phase 9 (final eval + docs)      [w6]
+```
+
+Total wall-clock: **4-6 weeks engineering**, plus 1-2 weeks waiting
+time on the EGDB email if it gates Phase 8 distorted-electric work.
+
+## 6. Risk register
+
+| # | Risk | Likelihood | Mitigation |
+|---|---|---|---|
+| R1 | SynthTab pretrain doesn't transfer to real audio (domain gap) | medium | Literature shows pretrain+fine-tune works (SynthTab paper, arXiv:2402.15258). Smoke on Guitar-TECHS held-out before committing to full pretrain spend. |
+| R2 | EGDB license never resolves | low-medium | Author replies are usually fast; if blocked, synthetic IR-based distorted electric via Phase 8 fallback. |
+| R3 | SynthTab labels are noisy (DadaGP's human transcriptions vary in quality) | medium | Use SynthTab as pretrain only, never as eval gate. Phase 0 spot-check a 50-clip random sample. |
+| R4 | Per-tier composite eval set has too few clips per tier for statistical significance | medium | Bootstrap 95% CIs in all per-tier reports. State the CI explicitly when reporting against the D2 target. |
+| R5 | Video pipeline degrades audio-only metrics when enabled | low | Quality gate in Phase 6 + audio-only fallback. Phase 6 acceptance explicitly checks this. |
+| R6 | Phase 5 learned-fusion reproduces the apr-29 single-fold catastrophe | medium | Hard regression guard per-fold + margin fallback to structured search. Phase 5 decision tree pivots to Phase 7 if it triggers. |
+| R7 | Free-tier compute monthly allowance insufficient for Phase 2 + 8 retries | low | Lightning 22 hr/mo + Kaggle 30 hr/wk + Colab is ~150 GPU-hr/mo combined; Phase 2 needs ~14 hr. Plenty of buffer. |
+| R8 | LICENSES.md needs updates for Guitar-TECHS, GOAT, SynthTab, EGDB | certain | Update in Phase 0. Each is CC-BY-4.0 (or pending in EGDB's case); attribution must appear in README and any blog. |
+
+## 7. Out of scope
+
+- Personal training clips (D10).
+- Single-aggregate Tab F1 ≥ 0.88 (D1).
+- Stretch v1.1 (bends/slides/hammer-ons) per D8.
+- Quantitative video-gate (D5). Video ships qualitative-only.
+- Top-K UX surface — UI work is separate. D2 targets apply to top-1.
+- New SPEC §8 contracts — none of these phases changes signatures.
+- Real-money compute except for production smoke retests on Modal.
+
+## 8. Phase 0 user actions (the things only you can do)
+
+1. Sign up / verify free-tier compute accounts:
+   - Lightning Studios (https://lightning.ai)
+   - Kaggle (https://kaggle.com)
+   - Colab (https://colab.research.google.com)
+   - Weights & Biases (https://wandb.ai, free academic tier)
+2. Email the EGDB author for portfolio-use written permission.
+   Template:
+
+   > Subject: TabVision portfolio project — request to use EGDB
+   >
+   > Dr. Chen,
+   >
+   > I'm a developer building TabVision, a portfolio guitar
+   > transcription project (public GitHub repo, blog post, recorded
+   > demo). I would like to use EGDB as the distorted-electric
+   > evaluation tier of my multi-source test set, and cite your
+   > ICASSP 2022 paper. The repo has no LICENSE file, so I'm asking
+   > for written permission to use EGDB in this portfolio context,
+   > including reporting evaluation metrics computed on it.
+   >
+   > Thank you,
+   > Patrick Gilhooley
+
+3. Confirm or push back on the D2 per-tier targets (table in §0).
+4. Approve the plan; I cut a branch from `refactor/v1` and start
+   Phase 0E (the baseline measurement, since 0A and 0B are blocked on
+   the above).
+
+## 9. Things still genuinely unresolved
+
+These can be answered in flight; don't gate the plan on them.
+
+- The exact size of the SynthTab pilot (500 clips is a guess; the
+  right number is "smallest subset that produces a fine-tune gain"
+  and emerges from Phase 2's first run).
+- Whether Phase 4 finds any actual capo/tuning regressions worth
+  fixing, or if it's a 30-minute box-tick.
+- Phase 6A: whether procedural fretboard rendering is 2 days or 2
+  weeks of work. Defer until we know whether Phase 6 alone is enough.
+
+## 10. Open invitation to redirect
+
+This plan favors free compute over fast iteration; SynthTab over DIY
+synthesis; per-tier targets over single-aggregate; audio-only gates
+over speculative video-gate construction. If any of those defaults are
+wrong for what you actually want, say so before Phase 0 starts —
+backtracking from Phase 3 is expensive.

From b7f139dbf461825feb2b1d95e5ca3888905e6685 Mon Sep 17 00:00:00 2001
From: Patrick Gilhooley <pgilhooley95@gmail.com>
Date: Wed, 13 May 2026 09:30:31 -0400
Subject: [PATCH 2/2] =?UTF-8?q?docs(plan):=20revise=20per=20#10=20review?=
 =?UTF-8?q?=20=E2=80=94=20license=20fixes=20+=20SPEC=20amend?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address PR #10 review (2026-05-13):

- Drop SynthTab (CC-BY-NC-4.0, taints derived weights per SPEC §1.5)
  and GOAT (request-only research-only) from the default-pipeline path.
- Frame the design doc as a SPEC §1.4 amendment proposal; commit the
  SPEC.md update (§1.4.1) in the same change set, keeping the original
  SPEC numbers as v1.1 / portfolio stretch reference.
- Split the doc: strategy / decision-record kept under the original
  filename; new Phase 0 implementation plan
  (2026-05-13-tab-f1-phase-0-implementation.md) with exact files,
  tests, commands, acceptance outputs.
- Add explicit License Gate section (§0) verifying every resource
  before any compute spend.
- Define composite eval policy: ≥ 20 clips / 500 notes per tier,
  player-split, 95% bootstrap CIs with lower-bound acceptance test,
  parser-per-source, no synthetic-source clips in eval splits.
- Update CLAUDE.md 'Active branch' to reflect main (Modal production
  deploy landed there; refactor/v1 is 23 commits behind).

Plan-only commit; no production code changes.
---
 CLAUDE.md                                     |  16 +-
 SPEC.md                                       |  35 +
 .../plans/2026-05-12-tab-f1-to-spec-design.md | 773 ++++++------------
 ...026-05-13-tab-f1-phase-0-implementation.md | 305 +++++++
 4 files changed, 618 insertions(+), 511 deletions(-)
 create mode 100644 docs/plans/2026-05-13-tab-f1-phase-0-implementation.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 71537df..65dc78c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -18,9 +18,19 @@ parallel under `refactor/v1`.
 - `LICENSES.md` — dependency license map; ⚠️ items gate respective phase entry.
 - `docs/DECISIONS.md` — non-obvious branches taken (per SPEC §0.5).
 
-**Active branch:** `refactor/v1` (cut off `feature/audio-finetune-phase1`,
-not `main` — see `docs/DECISIONS.md`). `main` is 33 commits behind v0.
-Phase 0 in progress; sign-off pending on AUDIT + LICENSES.
+**Active branch (2026-05-13):** `main`. The Modal production deploy
+(`936a5cc`) and v1 CI hardening landed on `main`; `refactor/v1` is now
+**23 commits behind `main`** and should be treated as historical. Cut new
+work branches off `main`. Older design docs (and earlier paragraphs in
+this file) may reference paths that exist on `main` but not on
+`refactor/v1` — verify with `git cat-file -e origin/main:<path>` before
+relying on them. The full pipeline (`tabvision/tabvision/pipeline.py`),
+the Modal production adapter (`tabvision-server/modal_app.py`,
+`tabvision-server/app/v1_adapter.py`), and the highres audio backend all
+live on `main`. Phase 5 fusion has shipped. See
+`docs/2026-05-12-session-handoff.md` for the production state and
+`docs/plans/2026-05-12-tab-f1-to-spec-design.md` (+ companion Phase 0
+implementation plan) for current accuracy work.
 
 ## Layout
 
diff --git a/SPEC.md b/SPEC.md
index 3fe8f5f..e666752 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -121,6 +121,41 @@ The targets above are aggregate over the full eval set. Per-difficulty-tier expe
 
 If the aggregate hits 0.88 but distorted electric scores below 0.75, treat that as a partial pass and prioritize Phase 7 distortion-augmented fine-tuning before final acceptance.
 
+### 1.4.1 v1 acceptance amendment — per-tier targets (2026-05-13)
+
+Per the 2026-05-13 design plan
+(`docs/plans/2026-05-12-tab-f1-to-spec-design.md`), v1 acceptance moves
+from the aggregate 0.88 Tab F1 in §1.4 to **per-tier targets on a
+public-corpus composite eval set**:
+
+| Tier | §1.4 stretch reference | v1 acceptance |
+|---|---:|---:|
+| Clean acoustic single-line | 0.94 | **0.85** |
+| Clean acoustic strummed | 0.86 | **0.90** |
+| Clean electric | 0.90 | **0.87** |
+| Distorted electric | 0.82 | **0.80** |
+
+Rationale: 2026-05-08 GuitarSet validation showed aggregate Tab F1 = 0.61
+with comp tracks at 0.67 and solo tracks at 0.51 despite both being near
+0.92 Pitch F1. The aggregate hid the structural failure mode (single-line
+string/fret assignment). Per-tier targets force the conversation onto the
+right axis and let work be sequenced (strummed first, distorted electric
+last).
+
+**Test-set composition amendment:** the "user's own playing" test set in
+§1.4 paragraph 1 is replaced by a public-corpus composite (GuitarSet
+held-out + Guitar-TECHS + EGDB pending license + qualifying synthetic
+training/dev material). See the design plan §5 for composite policy
+(per-tier minimums, splits, leakage rules, bootstrap CIs).
+
+**Stretch / portfolio reference:** the original §1.4 per-tier table
+(0.94 / 0.86 / 0.90 / 0.82) remains the v1.1 / portfolio stretch bar.
+Hitting it is welcome; v1 acceptance requires only the amended table.
+
+**Aggregate Tab F1** is retired as an acceptance metric. **Onset F1
+(≥ 0.92), Pitch F1 (≥ 0.90), chord-instance accuracy (≥ 0.85), and
+latency (≤ 5 min)** from §1.4 are unchanged.
+
 ### 1.5 Hard constraints
 
 - All training/inference dependencies must be free or have a free tier sufficient for this project (see §6).
diff --git a/docs/plans/2026-05-12-tab-f1-to-spec-design.md b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
index 41c14a9..ff1569b 100644
--- a/docs/plans/2026-05-12-tab-f1-to-spec-design.md
+++ b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
@@ -1,536 +1,293 @@
-# Tab F1 → Per-Tier Targets — Design
+# Tab F1 v1 acceptance — Strategy & Decision Record
 
-**Date:** 2026-05-12
+**Date:** 2026-05-12 (revised 2026-05-13 per PR #10 review)
 **Author:** Patrick (brainstormed with Claude)
-**Status:** Proposed — pending sign-off
-**Spec source:** `SPEC.md` §1.4 (per-tier table), §5 Phase 5, §8 contracts, §1.5 hard constraints, §6.3 free compute accounts.
-**Branch:** to be cut off `refactor/v1` once approved.
-**Depends on:** `docs/plans/2026-05-06-phase5-fusion-design.md`, `docs/plans/2026-05-06-video-pipeline-integration-design.md`, `docs/EVAL_REPORTS/guitarset_accuracy_boost_2026-05-08.md`.
-**Replaces:** earlier 2026-05-12 single-aggregate-target draft (never committed).
-
-## 0. Decisions taken on 2026-05-12
-
-These were locked in during the planning conversation; record them in
-`docs/DECISIONS.md` per SPEC §0.5 once the plan is approved.
+**Status:** v3 — strategy / decision-record only; **not** an implementation plan
+**Scope note:** This is a **SPEC §1.4 amendment proposal** plus
+              strategy. Implementation detail lives in companion docs.
+**Companions:**
+- `SPEC.md` §1.4.1 (the amendment table; committed in the same change set)
+- `docs/plans/2026-05-13-tab-f1-phase-0-implementation.md` (Phase 0 impl)
+- Later phase impl plans (write after Phase 0 evidence)
+**Replaces:** v1 + v2 (2026-05-12 single-aggregate-target drafts; both
+              had load-bearing license errors and stale path
+              references).
+
+## 0. License gate (must clear before any compute spend)
+
+Per SPEC §1.5 the **shipping default pipeline** must be portfolio-clean.
+NC-licensed material is acceptable in research/experiment configurations
+that are NOT shipped. Each resource below was verified on 2026-05-13:
+
+| Resource | License | Portfolio-default usable? | Source / verification |
+|---|---|---|---|
+| GuitarSet | CC-BY-4.0 | **yes** | https://zenodo.org/records/3371780 |
+| Guitar-TECHS | CC-BY-4.0 | **yes** | arXiv:2501.03720 §4 distribution |
+| EGDB | none on repo — **author email pending** | **gated** | https://ss12f32v.github.io/Guitar-Transcription/ (LICENSES.md ⚠️) |
+| GOAT | request-only, research-only | **no — DROPPED** | arXiv:2509.22655 §4.2 *"made available by request to better control its use for research purposes only"* |
+| SynthTab dataset | **CC-BY-NC-4.0** | **no — DROPPED** | github.com/yongyizang/SynthTab README *"SynthTab is released with CC BY-NC 4.0 license"* |
+| SynthTab rendering code | CC-BY-4.0 | n/a (we're not redistributing the code) | repo `LICENSE` file |
+| DadaGP | access-by-email research-only; underlying GP tabs derive from copyrighted songs | **research/dev only** — NOT in default path | github.com/dada-bots/dadaGP README; underlying tab copyright unsettled |
+| Basic Pitch | Apache-2.0 | yes (Phase 1 pitch ensemble) | github.com/spotify/basic-pitch |
+| highres (xavriley) | MIT | yes — current production audio backend | github.com/xavriley/hf_midi_transcription |
+| MediaPipe Hands | Apache-2.0 | yes — video pipeline | per LICENSES.md |
+| YOLO-OBB (ultralytics) | AGPL-3.0 (accepted per DECISIONS.md) | yes (portfolio is AGPL-OK) | per LICENSES.md |
+| Free amp/cab IRs | varies (most free-public) | yes for default if redistribution terms allow; verify per-pack | Modern Music Solutions Declassified, Djammincabs |
+
+**Drops vs v2 plan:**
+- **SynthTab dropped** because the dataset is CC-BY-NC-4.0; pretraining
+  the shipping audio backend on it taints derived weights (SynthTab paper
+  treats trained models as derivative work). Distillation as a laundering
+  step is rejected — both legally murky and explicitly out of bounds
+  per the 2026-05-13 review.
+- **GOAT dropped** because it's request-only and research-only; a
+  public portfolio cannot be evaluated against it.
+
+**Hard rule:** any phase that depends on a "gated" or "no" row must
+produce evidence that the gate cleared (e.g., a written reply from the
+EGDB author) BEFORE that phase ships. No conditional commits, no
+"we'll-figure-it-out-later" merges.
+
+## 1. Decisions
+
+These supersede the v2 D1–D10 set. Append to `docs/DECISIONS.md` per
+SPEC §0.5 once the plan is approved.
 
 | # | Decision | Rationale |
 |---|---|---|
-| D1 | Tab F1 evaluated **per tier**, not as a single aggregate. SPEC §1.4 aggregate 0.88 is retired. | Aggregate hides the real failure mode (string/fret assignment on solo lines). Per-tier targets force the conversation onto the right axis. |
-| D2 | Per-tier numeric targets (table below). | Strummed raised from SPEC 0.86 → 0.90; distorted-electric floor lowered 0.82 → 0.80. Middle tiers relaxed to reflect the gap between current 0.61 and any realistic ceiling. |
-| D3 | Eval set is a **multi-source composite**: GuitarSet + Guitar-TECHS + GOAT + EGDB (pending license) + synthetic. Personal videos banned from any role. | GuitarSet alone gives one player, one genre cluster, no electric/distorted. Per-tier evaluation requires per-tier sources. |
-| D4 | **SynthTab** pretrain → real-data fine-tune is the audio-side plan. No DIY DadaGP synthesis unless SynthTab proves insufficient. | SynthTab (CC-BY-4.0, ~6,700 h with string/fret labels) pre-empts the engineering cost of building a renderer. Literature (SynthTab paper, High-Res Domain Adaptation arXiv:2402.15258) shows pretrain+fine-tune lifts cross-dataset generalization. |
-| D5 | **No quantitative video-gate.** Video pipeline ships as a qualitative feature. Production runs audio-only; video is opt-in. | No public dataset has synchronized guitar video + per-note string/fret labels. Confirmed via 2026-05-12 research pass (see §3.1). |
-| D6 | **Free-tier compute first.** Order: local CPU > Lightning Studios free (22 GPU-hr/mo) > Kaggle (30 hr/wk T4) > Colab > Modal. | Per CLAUDE.md operating rule 6 and SPEC §6.3 §1.5 hard constraint. The earlier $30-80 fine-tune estimate was Modal pricing; free tier fits a highres fine-tune comfortably. |
+| D1 | Tab F1 evaluated **per tier**, not as a single aggregate. SPEC §1.4 aggregate 0.88 is retired. | Aggregate hides the real failure mode (string/fret assignment on solo lines). |
+| D2 | Per-tier v1 acceptance targets: **0.85 / 0.90 / 0.87 / 0.80** for clean acoustic single-line / strummed / clean electric / distorted electric. | User-stated floor (0.80) and strummed (≥ 0.90); middle tiers proposed and accepted. Original SPEC numbers (0.94 / 0.86 / 0.90 / 0.82) become the v1.1 / portfolio stretch reference. |
+| D3 | Eval set is a **multi-source public-corpus composite**: GuitarSet + Guitar-TECHS + EGDB (license-pending) + qualifying synthetic. Personal videos banned. GOAT dropped. SynthTab dropped from default path. | Per-tier evaluation requires per-tier sources; portfolio constraint excludes NC and request-only data from the shipping path. |
+| D4 | **No SynthTab in the default pipeline.** Audio-side lift comes from priors + cheap pitch post-processing + GuitarSet fine-tune. DadaGP-derived synthetic remains acceptable for **internal training/dev only** if it's never shipped. | SynthTab CC-BY-NC-4.0 taints derived weights; SPEC §1.5 bars NC from default. |
+| D5 | **No quantitative video-gate.** Video pipeline ships as a qualitative feature; per-tier Tab F1 measured audio-only. | No public dataset has video + per-note string/fret labels (verified 2026-05-12). |
+| D6 | **Free-tier compute first.** Order per CLAUDE.md operating rule 6 and SPEC §6.3: **Local CPU > Colab > Kaggle > Lightning Studios > Modal**. Modal is the last resort. | Project rule, plus Lightning's 22 GPU-hr/month free tier covers any fine-tune we'd plausibly run. |
 | D7 | **1-2 month cadence.** No fixed deadline. | User-stated. |
-| D8 | Stretch goals (bends / slides / hammer-ons / pull-offs) **out of scope** for v1; SPEC §1.4 already marks them v1.1. | Confirmed in conversation. |
-| D9 | Top-K is acceptable as an editor UX feature but the **0.80 floor and per-tier targets apply to the top-1 prediction**. | User-stated. |
-| D10 | Personal training clips (the 20-video set) **off the table entirely** — not as accuracy gate, not as dev set, not as label source. | User-stated. |
-
-### Per-tier Tab F1 targets (D2)
-
-| Tier | SPEC §1.4 | This plan |
-|---|---:|---:|
-| Clean acoustic single-line | 0.94 | **0.85** |
-| Clean acoustic strummed | 0.86 | **0.90** |
-| Clean electric | 0.90 | **0.87** |
-| Distorted electric | 0.82 | **0.80** |
-
-All on the multi-source composite test set (D3). Top-1 prediction only.
-Onset F1 (≥ 0.92) and Pitch F1 (≥ 0.90) from SPEC §1.4 remain unchanged
-— audio already clears them on GuitarSet.
+| D8 | Stretch goals (bends / slides / hammer-ons / pull-offs) **out of scope** for v1. | SPEC §1.4 already marks them v1.1. |
+| D9 | Top-K acceptable as an editor UX feature; the D2 numbers are on **top-1 only**. | User-stated. |
+| D10 | Personal training clips off the table entirely — not as accuracy gate, not as dev set, not as label source. | User-stated. |
+| D11 | This document is a **SPEC §1.4 amendment**, not a SPEC-achievement plan. Land the SPEC.md update (§1.4.1) in the same change set. | Honest framing of relaxed targets; reviewer's approval bar. |
 
-## 1. Goal
+## 2. Goal & framing
 
-Hit the D2 per-tier Tab F1 targets on the D3 composite eval set within
-1-2 months using free-tier compute, while keeping the production system
-the SPEC §8 contract-conformant v1 pipeline.
+**v1 acceptance:** hit the D2 per-tier Tab F1 targets on the D3
+public-corpus composite eval set within 1-2 months on free-tier
+compute, with the existing v1 pipeline (no §8 contract changes).
 
-## 2. Current evidence
+**Stretch / portfolio reference:** the original SPEC §1.4 numbers
+(0.94 / 0.86 / 0.90 / 0.82). If we hit them, that's the portfolio
+narrative; v1 acceptance does not require them.
 
-GuitarSet validation, 60 tracks, 8715 gold notes, 2026-05-08 production
-candidate (highres + `guitarset-v1` prior, no video, no melodic prior):
+**Out of v1 acceptance:** any quantitative claim that video fusion
+improves Tab F1 (no public dataset supports one; video is tracked as
+qualitative only).
 
-| Metric | Current | SPEC | Status |
-|---|---:|---:|---|
-| Onset F1 (50 ms) | 0.9218 | ≥ 0.92 | pass |
-| Pitch F1 (50 ms) | 0.9022 | ≥ 0.90 | pass |
-| Tab F1 aggregate | 0.6104 | ≥ 0.88 (deprecated) | retired metric |
+## 3. Current evidence
 
-Per-track distribution (2026-05-12 diagnostic):
+GuitarSet validation, 60 tracks, 8715 gold notes, 2026-05-08
+production candidate (highres + `guitarset-v1` prior, audio-only):
 
-- Tab F1 mean **0.589**, median 0.620, min 0.166, max 0.933
-- Comp tracks (n=30) mean **0.670**; solo tracks (n=30) mean **0.508**
-- Worst 10 tracks: 7 are solos. Best 5: 4 are comps.
-- Tab/Pitch ratio: comp 0.744, **solo 0.546** — solos lose 45% of
-  pitch-correct notes to wrong string/fret assignment.
+| Metric | Current | Status |
+|---|---:|---|
+| Onset F1 (50 ms) | 0.9218 | passes SPEC §1.4 ≥ 0.92 |
+| Pitch F1 (50 ms) | 0.9022 | passes SPEC §1.4 ≥ 0.90 |
+| Tab F1 aggregate (retired) | 0.6104 | — |
+| Tab F1, comp subset | 0.670 mean | — |
+| Tab F1, solo subset | 0.508 mean | — |
 
-**Bottleneck is string/fret assignment on single-line passages where
-chord-cluster context is absent.** Audio is essentially at spec; only the
-Tab F1 numbers are red, and only on the solo regime.
+The 27 pp gap (0.61 vs the **retired** 0.88 aggregate target) is almost
+entirely string/fret assignment on single-line passages. Audio is at
+spec; only fusion-side assignment falls short. This frames the per-tier
+work: **strummed (chord context) is closest to its target; single-line
+needs the most lift.**
 
-The single-tier mapping of GuitarSet is "clean acoustic strummed" for
-comp tracks and "clean acoustic single-line" for solo tracks. The
-electric and distorted-electric tiers (D2) have no current measurement
-and must be acquired (D3).
+**Coverage gap:** GuitarSet covers only the clean acoustic tiers.
+Clean-electric and distorted-electric have **no current measurement**
+on a public corpus and must be acquired in Phase 0.
 
-## 3. Resource inventory
+## 4. Resource inventory
 
-### 3.1 Datasets
+### 4.1 Datasets (default-pipeline path only)
 
-Verified by the 2026-05-12 research pass. Italics = on-disk now;
-**bold** = to acquire.
+| Source | License | Modality | Labels | Tier coverage |
+|---|---|---|---|---|
+| GuitarSet (on-disk) | CC-BY-4.0 | audio (hex + DI) | JAMS (string + fret + pitch) | clean acoustic single-line, strummed |
+| Guitar-TECHS (acquire) | CC-BY-4.0 | audio (multi-mic + DI) | 6-track per-string MIDI | clean acoustic single-line, clean electric |
+| EGDB (acquire, license pending) | none on repo — author email required | audio (DI + 5 amp sims) | GuitarPro tabs + aligned MIDI | clean electric, distorted electric |
+| Free IR-augmented GuitarSet | CC-BY-4.0 (with IR pack licenses verified) | derived audio | inherited string + fret | distorted electric (fallback if EGDB blocks) |
 
-| Source | License | Modality | Labels | Size | Tier coverage |
-|---|---|---|---|---|---|
-| *GuitarSet* | CC-BY-4.0 | audio (hex + DI) | JAMS (string + fret + pitch) | 3 h, 6 players | clean acoustic single-line, strummed |
-| **Guitar-TECHS** | CC-BY-4.0 | audio (multi-mic + DI) | 6-track MIDI per string | 5h12m | clean acoustic single-line, clean electric |
-| **GOAT** | CC-BY-4.0 | DI audio | tablature | 5.9 h | clean electric |
-| **EGDB** | None on repo — **email author for portfolio-use permission** | audio (DI + 5 amp sims, ~6 renders) | GuitarPro tabs + aligned MIDI (string + fret) | ~12 h synthesized | clean electric, distorted electric |
-| **SynthTab** | CC-BY-4.0 | synthesized audio | string + fret + onset | ~6,700 h | all four tiers (pretrain only) |
-| GAPS | CC-BY-NC-SA | YouTube video links + MIDI pitch | pitch-only, **not tab** | 14 h | reject — non-commercial taint |
-| DadaGP | research-access (email) | symbolic GP files | tab natively | 26,181 files | fallback synthesis source if SynthTab insufficient |
-| ~~The 20 personal clips~~ | n/a | n/a | n/a | n/a | **banned** (D10) |
+### 4.2 Datasets (research / dev only — NEVER in the default pipeline)
 
-**Confirmed gap:** no public dataset combines guitar video with per-note
-string+fret labels. This is the load-bearing finding behind D5.
+| Source | License | Use |
+|---|---|---|
+| DadaGP | access-by-email, research-only | possible internal-training augmentation; not shipped, not redistributed |
+| SynthTab | CC-BY-NC-4.0 | reference only; not a substrate for any shipped weight |
 
-### 3.2 Compute
+### 4.3 Compute accounts (free-tier first, per D6 order)
 
-| Account | Free allowance | Status | Use |
-|---|---|---|---|
-| Lightning Studios | 22 GPU-hr/month | Phase 0 setup | SynthTab pretrain, highres fine-tune |
-| Kaggle | ~30 GPU-hr/week T4 | Phase 0 setup | overflow for long sweeps |
-| Colab | ~12 hr/day with limits | Phase 0 setup | quick experiments |
-| W&B | unlimited (academic) | Phase 0 setup | experiment tracking |
-| HuggingFace Hub | unlimited public | already used | weights / checkpoints |
-| Modal | pay-per-use | already used | production smoke retests only |
-| Local CPU | 6 cores WSL2 | available | eval, priors, light tuning |
-
-Per CLAUDE.md operating rule 6: Local > Colab > Kaggle > Lightning >
-Modal. Modal is the resort, not the default.
-
-### 3.3 Code already in tree
-
-- `tabvision.audio.highres` — production pitch backend, 0.92 / 0.90 on GuitarSet.
-- `tabvision.fusion.position_prior` — `guitarset-v1` prior, +22pp Tab F1.
-- `tabvision.fusion.{viterbi,chord,playability,neck_prior,melodic_prior}` — Phase 5 shipped, cluster Viterbi + chord state enumeration + playability emission/transition costs.
-- `tabvision.video.{guitar,fretboard,hand}` — Phase 4 shipped (1603 LOC).
-- `tabvision.pipeline.run_pipeline` — composes all of the above; production runs through it via `tabvision-server/app/v1_adapter.py`.
-- `tabvision-server/tools/eval_basic_pitch_baseline.py` + `tabvision/scripts/eval/guitarset_audio_eval.py` — current evaluation harness; needs extension for multi-source composite.
-- `tabvision-server/tools/outputs/errors-2026-04-28_185743.md` — apr-28 error-decomposition methodology (proven on personal clips); port the same 7-bucket harness to the composite eval set.
-
-### 3.4 What has been tried (lessons)
-
-| Attempt | Date | Outcome | Lesson |
-|---|---|---|---|
-| Learned-fusion LightGBM ranker | 2026-04-29 | +0.3pp LOOCV vs +5pp gate; **catastrophic −27.8pp regression on training-17** | Small data + over-fit on one held-out group. **Critically: video features were `null` on every row** (`audio_only=True`) — so this wasn't actually a test of learned-fusion-with-video. Re-attempt with proper feature instrumentation is justified. |
-| Basic Pitch fine-tune on GuitarSet | 2026-04-29/30 | Did happen; superseded by highres backend swap before final integration | Fine-tune infrastructure is reusable for highres; SynthTab pretrain is the missing first step. |
-| Melodic prior | current | Regresses aggregate Tab F1 from 0.6104 to 0.5989 | Helps solo, hurts comp. Needs solo-density gating, not a flat enable. |
-| Position prior `guitarset-v1` | 2026-05-08 | +22pp Tab F1 vs no prior | Per-pitch tabular priors are the largest-leverage cheap intervention. Style/structure-conditional versions are the natural extension. |
-| Phase 5 cluster Viterbi + chord enumeration | 2026-05-06 | Shipped, drives current production | The audio-only structured search is already well-tuned. Further gain needs either better priors or different evidence (which video can't provide on the eval set). |
-
-## 4. Plan
-
-10 phases. Phases 0–2 sequential; 3–8 parallelizable. Decision tree
-inside each phase determines whether to continue, branch, or escalate.
-
-### Phase 0 — Foundation (parallel, no compute, 1 week wall-clock)
-
-**Goal:** assemble the evidence base + accounts the rest of the plan
-depends on. No production code changes; setup only.
-
-Concurrent tracks:
-
-- **0A. Acquisition.**
-  - [user] Email EGDB author (`f08946011@ntu.edu.tw`) for written
-    portfolio-use permission. Template draft in §10.
-  - [code] Download Guitar-TECHS and GOAT (both CC-BY-4.0, no email).
-  - [code] Sample SynthTab to a 500-clip pilot subset (~50 h). Full
-    download deferred until Phase 2.
-- **0B. Compute accounts.**
-  - [user] Lightning Studios, Kaggle, Colab, W&B sign-ups per SPEC §6.3.
-  - [code] Verify each by running a hello-world (W&B init + a GPU
-    `nvidia-smi` job on each platform).
-- **0C. Eval harness extension.**
-  - [code] Build `tabvision/scripts/eval/composite_eval.py`. Reads a
-    manifest TOML (per-clip tier label + source + audio path + tab
-    annotation path) and runs the same `guitarset_audio_eval.py` logic
-    across all sources. Outputs per-tier Tab F1, per-source CSVs, and a
-    consolidated Markdown report.
-  - Manifest schema follows the placeholder in
-    `tabvision/data/eval/manifest.toml`. Tier label is one of
-    `clean_acoustic_single_line`, `clean_acoustic_strummed`,
-    `clean_electric`, `distorted_electric`.
-- **0D. Error decomposition.**
-  - [code] Port `tools/error_analysis.py` (apr-28 7-bucket harness)
-    from personal-clip input to the composite eval set. Output:
-    `docs/EVAL_REPORTS/error_decomposition_<date>.md` with per-tier
-    bucket counts.
-- **0E. Baseline measurement.**
-  - [code] Run `composite_eval.py` against the current production
-    pipeline. Get the per-tier numbers. These are the Phase 1+
-    starting points.
-
-**Phase 0 acceptance gate:**
-- Per-tier Tab F1 baseline numbers exist for at least 3 of the 4 tiers
-  (distorted electric is EGDB-dependent; deferred OK).
-- Per-tier 7-bucket error decomposition exists.
-- All free-tier compute accounts verified.
-- EGDB email sent.
-- No production code changes.
-
-**Decision tree:**
-- If baseline already hits some tier (e.g., strummed at 0.92) → drop
-  that tier from later phases' work.
-- If pitch-side metrics regress vs the 2026-05-08 GuitarSet numbers on
-  the composite set → STOP and investigate before any further work.
-  The composite eval should not change audio-side numbers on GuitarSet.
-
-### Phase 1 — Pitch ceiling lift, cheap moves (local CPU, 2-3 days)
-
-**Goal:** Pitch F1 from 0.915 → ≥ 0.93 on GuitarSet validation, without
-training. Gives Tab F1 mathematical headroom regardless of fusion-side
-work.
-
-Moves, in order:
-
-1. **Voicing/silence gate** on highres pitch posteriors. Tune the
-   joint onset+pitch confidence threshold. Trade some recall for
-   precision; expect net F1 gain.
-2. **Onset peak-picking adjustment.** The 50 ms tolerance is generous;
-   misaligned within-tolerance peaks still produce pitch mis-reads.
-   Improve peak localization → tighter onset match → higher pitch TP
-   count.
-3. **Basic Pitch pitch-only ensemble.** Run Basic Pitch alongside
-   highres. Use Basic Pitch's pitch output (not onset) as a tiebreaker
-   on pitch-disagreement events; downweight (or drop) events where the
-   two backends disagree on pitch. SPEC §6.1 path; LICENSES.md
-   confirms Basic Pitch is Apache-2.0 default-pipeline-safe.
-
-**Phase 1 acceptance:**
-- Pitch F1 ≥ 0.93 on GuitarSet validation.
-- Onset F1 ≥ 0.92 (no regression).
-- Aggregate Tab F1 ≥ 0.62 (no regression beyond mathematical
-  pitch-improvement bound).
-
-**Decision tree:**
-- 0.93 met → continue.
-- 0.92–0.93 → continue; Phase 2 fine-tune still useful as ceiling lift.
-- < 0.92 → diagnose. Could be a threshold-sweep artifact rather than a
-  real regression. Inspect on 3-5 representative tracks before
-  escalating.
-
-### Phase 2 — SynthTab pretrain + highres fine-tune (Lightning, 1 week)
-
-**Goal:** Pitch F1 ≥ 0.94 on GuitarSet validation. Lift the audio
-ceiling beyond what threshold-tuning alone can do.
-
-**Approach.** Per the SynthTab paper (ICASSP 2024) and arXiv:2402.15258,
-the proven recipe is: pretrain on synthetic, fine-tune on real.
-
-- **Pretrain corpus:** SynthTab 500-clip pilot (Phase 0). Full set
-  (~6,700 h) is overkill at this stage and won't fit in the free tier
-  monthly budget.
-- **Pretrain head:** the highres model's pitch+onset head. Backbone
-  frozen for the pretrain phase to avoid catastrophic forgetting on
-  the spectral feature extractor.
-- **Fine-tune:** GuitarSet train split (4 players, 240 tracks ≈ 2 h),
-  unfrozen, 5-10 epochs with early stopping on Pitch F1.
-- **Compute:** Lightning Studios free tier (22 GPU-hr/month). Estimate:
-  pretrain ~6 GPU-hr, fine-tune ~3 GPU-hr. Buffer for re-runs ~5 GPU-hr.
-  Stays inside the monthly allowance.
-
-**Phase 2 acceptance:**
-- GuitarSet validation Pitch F1 ≥ 0.94.
-- No Onset F1 regression > 1 pp.
-- Cross-dataset sanity: on Guitar-TECHS (held out from training),
-  Pitch F1 ≥ 0.90 (no catastrophic transfer loss).
-
-**Decision tree:**
-- Met all three → continue. New `audio_backend = "highres-synthtab"`
-  becomes the candidate for production replacement.
-- GuitarSet met, Guitar-TECHS regresses > 5 pp → over-fit on the
-  pretrain distribution. Reduce pretrain epochs, increase fine-tune
-  weight, retry once.
-- GuitarSet ≤ 0.93 → SynthTab pretrain didn't transfer; abandon
-  Phase 2 and revisit with the actual diagnostic (Pitch P/R curves)
-  before any further training spend.
-
-### Phase 3 — Style/structure-conditional priors (local CPU, 3 days)
-
-**Goal:** lift Tab F1 on solos via finer-grained per-pitch position
-priors. Expected +1 to +5 pp on solo subsets.
-
-- **Buckets:** {bn, jazz, funk, rock, ss} × {solo, comp} = 10 priors.
-  GuitarSet's `style` field gives the genre axis directly; structure
-  axis derived from cluster-singleton density (already computable in
-  fusion).
-- **Train:** GuitarSet train split (players 00, 01, 02, 03, 04).
-  Per-bucket Laplace-smoothed counts. Empty cells fall back to
-  `guitarset-v1`.
-- **Validate:** leave-one-player-out CV (not LOOCV per-clip — too
-  small). Primary metric: per-bucket Tab F1 delta vs `guitarset-v1`
-  baseline on the held-out player.
-- **Risk:** the apr-29 learned-fusion attempt failed with one
-  catastrophic regression. Same class of risk here — small data,
-  bucketing on 4 training players. **Hard regression guard:** abort
-  the bucket if any cross-validation fold regresses by > 3 pp.
-
-**Phase 3 acceptance:**
-- Mean Tab F1 over solo buckets: +2 pp vs `guitarset-v1` baseline.
-- No bucket regresses by > 1 pp on comp.
-- No cross-validation fold regresses by > 3 pp on any bucket.
-
-**Decision tree:**
-- Met → ship the prior set, expose `position_prior = "guitarset-styled-v1"`.
-- Solo gain < 2 pp → drop the structure axis, ship style-only.
-- Any bucket fails the regression guard → drop that bucket only;
-  fall back to `guitarset-v1` for it. Don't kill the whole experiment
-  on one bad bucket.
-
-### Phase 4 — Style+structure-aware capo/tuning audit (local, 1 day)
-
-**Goal:** verify the capo / instrument / tone / style fields from the
-upload UI are actually flowing into prior selection and playability
-weights as designed.
-
-- **Trace:** unit-test that with `capo_fret = 5` the position prior
-  shifts correctly (frets 0-19 become frets 5-24).
-- **Smoke:** run a known capo-3 clip from GuitarSet (if any exist)
-  and confirm the output tab is rendered against the capo.
-- **Audit playability:** confirm `instrument = electric` doesn't apply
-  the open-string bonus differently when it shouldn't, etc.
-
-Small phase; mostly a correctness-check before later phases compound
-any bugs here.
-
-**Phase 4 acceptance:**
-- All upload-form fields measurably affect at least one pipeline
-  decision per a unit test.
-- No silent fallback to defaults on any field.
-
-### Phase 5 — Learned fusion v2 (local, 3-5 days)
-
-**Goal:** the 2026-04-24 plan's learned-fusion approach, redone with
-proper feature instrumentation. **Per-pitch + chord-context ranker**,
-not the audio-only ranker that flat-lined at +0.3 pp in 2026-04-29.
-
-**Why this can work this time:** the apr-29 attempt's per-candidate
-features were limited (no fusion-prior values, no neck-anchor values
-because video was off, no chord-cluster context). With Phase 5
-shipping the structured search already, those values are now exposed
-and can be features.
-
-**Per-candidate features:**
-- `pitch`, `confidence`, `duration`, `amplitude` (audio).
-- `position_prior_log_prob`, `melodic_prior_log_prob`,
-  `neck_prior_log_prob` (fusion priors at this candidate).
-- `cluster_size`, `cluster_span`, `is_singleton`, `singleton_density_2s`
-  (chord context).
-- `emission_cost`, `transition_cost_to_prev` (playability).
-- `cand_string`, `cand_fret`, `is_open`, `is_low_position` (identity).
-- `style`, `instrument`, `tone` (from session config; flow-from-UI
-  audited in Phase 4).
-
-**Training:** GuitarSet train split, leave-one-player-out CV. LightGBM
-`lambdarank` with hard regression guard at -3 pp per held-out player.
-
-**Phase 5 acceptance:**
-- Mean Tab F1 across all held-out players: +3 pp vs Phase 3-or-earlier
-  baseline.
-- No held-out player regresses by > 3 pp.
-- Margin-based fallback to structured-search pick when learned-fusion
-  margin is below a threshold (mitigates OOD behavior in production).
-
-**Decision tree:**
-- Met → ship behind a flag, default off, with the margin fallback.
-  Default-on requires a separate review pass with at least one week of
-  production smoke clean.
-- Per-player regression > 3 pp on any fold → the apr-29 failure mode
-  repeats. Stop Phase 5 and pivot to Phase 7 instead.
-
-### Phase 6 — Video pipeline qualitative integration (1-2 days)
-
-Goal: re-enable the video stack in production for users whose uploads
-have usable video, without claiming any quantitative Tab F1 improvement.
-**No video accuracy gate** (D5).
-
-- Flip `TABVISION_VIDEO_ENABLED=true` in `tabvision-server/modal_app.py`
-  in dev.
-- Verify pipeline runs end-to-end on at least one synthetic
-  fretboard-rendered clip (Phase 6A) and the qualitative output is
-  reasonable.
-- Add a runtime quality gate (the one the v1_adapter currently fakes):
-  reject video evidence when `handDetectionRate < 0.3` or
-  `fretboardDetectionConfidence < 0.5`. Diagnostics in result JSON.
-- Production smoke: end-to-end on the existing `test_a440.mp4` (audio
-  ceiling) and one real-world iPhone clip (qualitative inspection only,
-  not gated).
-
-**Phase 6A — Synthetic fretboard video** (optional, 2-3 days):
-- Render a procedurally-generated fretboard animation (Blender or
-  pyrender) against SynthTab audio. Synchronized by-construction.
-- Use for video-pipeline smoke + regression tests (does turning video
-  on/off change anything?), NOT for accuracy claims.
-
-**Phase 6 acceptance:**
-- Video enable in dev does not regress GuitarSet audio-only Pitch /
-  Onset / Tab F1 metrics (delta within ±0.5 pp).
-- At least one synthetic clip produces a non-empty `fingerings` list
-  in the result.
-- Production smoke clean.
-
-**Decision tree:**
-- Audio-only metrics regress when video enabled → video is making
-  things worse on no-video-content clips. Add a fail-fast that
-  disables video output when `videoObservationCount == 0`, retry.
-- No regression but no positive signal either → ship video as opt-in,
-  default off. Revisit when a public video+tab dataset emerges.
-- Positive signal on some clips → ship default-on with the quality
-  gate.
-
-### Phase 7 — Solo-gated melodic prior (local, 2 days)
-
-**Goal:** re-enable the existing melodic prior in the regime where it
-helps (solo passages) without re-introducing the comp regression that
-caused the current ship-disable.
-
-- Gate the melodic prior on rolling-window singleton density: apply
-  only when ≥ 80% of clusters in the last 2 seconds are singletons.
-- Re-tune the 35/65 prior-blend ratio currently hard-coded in
-  `tabvision/tabvision/fusion/melodic_prior.py:64`.
-
-**Phase 7 acceptance:**
-- Solo subset Tab F1 +3 pp vs Phase 3 baseline.
-- Comp subset Tab F1 within ±1 pp.
-- No per-track regression > 3 pp.
-
-### Phase 8 — Tier shortfall recovery (as needed, 1-2 weeks)
-
-Triggered only if a tier still misses its D2 target after Phases 1-7.
-
-- **Distorted electric < 0.80:**
-  - If EGDB acquired: oversample EGDB distorted variants in Phase 2
-    fine-tune; re-run.
-  - If EGDB blocked: synthesize a distorted training subset via
-    SynthTab clean audio + free IR pack convolution (Modern Music
-    Solutions Declassified, Djammincabs).
-- **Clean acoustic single-line < 0.85:**
-  - Re-tune Phase 7 melodic-prior strength on the single-line subset.
-  - If still short: add a position-shift smoothing prior (events
-    within < 200 ms shouldn't span > 5 frets unless audio amplitude
-    suggests a deliberate slide).
-- **Clean acoustic strummed < 0.90:**
-  - Chord-shape template prior: for each detected chord cluster,
-    boost candidate fingerings that match a curated set of 30-50
-    common guitar chord shapes (port from
-    `tabvision-server/app/chord_shapes.py`, 790 LOC).
-- **Clean electric < 0.87:**
-  - Likely co-resolves with one of the above. Investigate per-tier
-    error decomposition before adding tier-specific work.
-
-### Phase 9 — Final eval + documentation
-
-- Run `composite_eval.py` with full per-tier table.
-- Write `docs/EVAL_REPORTS/per_tier_acceptance_<date>.md`.
-- Update `docs/DECISIONS.md` with each Dn entry actually taken.
-- Final SPEC §1.4 amendment proposal: tier table replaces aggregate
-  target. Land as a SPEC PR.
-
-## 5. Sequencing
-
-```
-Phase 0 (parallel setup)  [week 1]
-    ↓
-Phase 1 (pitch ceiling cheap)  [week 1]
-    ↓
-Phase 2 (SynthTab + fine-tune)  [week 2]
-    ↓
-┌────────────────────────────────────────┐
-│ Phase 3 (style priors)          [w3]   │
-│ Phase 4 (UI fields audit)       [w3]   │  parallel
-│ Phase 5 (learned fusion v2)     [w3-4] │
-│ Phase 6 (video qualitative)     [w3]   │
-│ Phase 7 (solo melodic prior)    [w3]   │
-└────────────────────────────────────────┘
-    ↓
-Phase 8 (tier recovery)          [w5-6 as needed]
-    ↓
-Phase 9 (final eval + docs)      [w6]
-```
-
-Total wall-clock: **4-6 weeks engineering**, plus 1-2 weeks waiting
-time on the EGDB email if it gates Phase 8 distorted-electric work.
-
-## 6. Risk register
+| Account | Free allowance | Use |
+|---|---|---|
+| Local CPU | 6 cores WSL2 | eval runs, prior training, cheap post-processing experiments |
+| Colab | ~12 hr/day with limits | quick experiments, prior sweeps |
+| Kaggle | ~30 GPU-hr/week T4 | longer sweeps, baseline checks |
+| Lightning Studios | 22 GPU-hr/month | any fine-tune work, batched in one monthly window |
+| W&B | unlimited (academic) | experiment tracking — required before any GPU job |
+| Hugging Face Hub | unlimited public | weight / checkpoint hosting |
+| Modal | pay-per-use | **production smoke retests only**; never default training |
+
+### 4.4 Code already on `main`
+
+- `tabvision.audio.*` — production pitch backends (highres, basicpitch).
+- `tabvision.fusion.{viterbi,chord,playability,position_prior,neck_prior,melodic_prior}` — Phase 5 shipped.
+- `tabvision.video.{guitar,fretboard,hand}` — Phase 4 shipped.
+- `tabvision.pipeline.run_pipeline` — production-facing orchestrator.
+- `tabvision.eval.{manifest,metrics,runner,guitarset_audio}` — eval scaffolding with `REQUIRED_TIERS = ("clean_acoustic_single_line", "clean_acoustic_strummed", "clean_electric", "distorted_electric")` already encoded ([tabvision/tabvision/eval/manifest.py](tabvision/tabvision/eval/manifest.py)).
+- `tabvision-server/{modal_app.py, app/v1_adapter.py}` — Modal production adapter.
+
+### 4.5 What's been tried (lessons carried forward)
+
+| Attempt | Outcome | Lesson |
+|---|---|---|
+| Learned-fusion LightGBM ranker (2026-04-29) | +0.3 pp LOOCV vs +5 pp gate; **-27.8 pp** regression on training-17 | Catastrophic single-fold regression with small data. **Re-try only with strict per-fold regression guard AND with video features actually populated**, which the apr-29 run lacked. |
+| Basic Pitch fine-tune (2026-04-30) | Superseded by highres backend swap | Fine-tune infra reusable; ceiling lift now lives in highres post-processing and possibly a GuitarSet-only highres fine-tune. |
+| Melodic prior | Regresses aggregate by 1.15 pp | Helps solo, hurts comp. Needs solo-density gating. |
+| Position prior `guitarset-v1` | +22 pp Tab F1 | Per-pitch tabular priors are the largest cheap intervention. Style/structure-conditional priors are the natural extension. |
+
+## 5. Composite eval policy
+
+Each tier in the composite eval set must satisfy these rules. The
+manifest schema (`tabvision/tabvision/eval/manifest.py`) already
+encodes tier names and required clip fields; the Phase 0 impl plan
+extends it for source-specific annotation paths and CI reporting.
+
+**Per-tier minimums:**
+- Each of the four required tiers: **≥ 20 clips** and **≥ 500 gold
+  notes**. Below this the bootstrap CI is too wide to claim acceptance.
+- Total composite: ≥ 80 clips, ≥ 2,000 notes.
+
+**Split policy:**
+- GuitarSet: held-out **by player** (player 05 = validation; this is
+  the existing convention from `guitarset_audio_eval.py`). No
+  train/test leak at player level.
+- Guitar-TECHS: split by **performer** if performer metadata is
+  available; else by clip with a deterministic seed.
+- EGDB: split by **source track** (the 240 clean DIs); amp-sim
+  renders of the same track go to the same split. Required to avoid
+  amp-render leakage.
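+
+One way to keep every amp-sim render of a track in the same split is to
+derive the split from a hash of the grouping key rather than from the clip
+itself. The sketch below assumes hypothetical clip fields (`clip_id`,
+`source_track`), a hypothetical salt, and an illustrative 20% eval
+fraction; none of these is fixed by this plan.
+
+```python
+import hashlib
+
+
+def assign_split(group_key, eval_fraction=0.2, salt="composite"):
+    """Deterministically map a grouping key (track, performer) to a split."""
+    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).digest()
+    # Interpret the first 8 bytes as a uniform number in [0, 1).
+    bucket = int.from_bytes(digest[:8], "big") / 2**64
+    return "test" if bucket < eval_fraction else "train"
+
+
+def split_clips(clips, group_field="source_track"):
+    """Attach a split to each clip dict; clips sharing a group share a split."""
+    return [{**clip, "split": assign_split(clip[group_field])} for clip in clips]
+```
+
+Because the split depends only on the grouping key, adding or removing
+clips later never moves an existing track across the train/test boundary.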
+
+**Source weighting:**
+- Per-tier metrics are reported **un-weighted across sources within a
+  tier** (every clip has equal weight). The strategic question "is
+  GuitarSet over-represented in clean acoustic" gets a separate
+  per-source breakdown in the report; the headline number is the
+  un-weighted clip mean.
+
+**Leakage rules:**
+- No clip used for prior training (`guitarset-v1` etc.) appears in
+  evaluation. Currently `guitarset-v1` is trained on GuitarSet train
+  split, evaluated on player 05 — compliant.
+- Fine-tune sets must be disjoint from eval sets by player / performer.
+- DadaGP-derived synthetic, if used, is training-only and never
+  appears in the eval manifest.
+
+**Confidence intervals:**
+- Every per-tier number reported with a **95% bootstrap CI** over
+  clips (resample clips with replacement, recompute the tier-mean,
+  10,000 resamples). The acceptance test is `lower_95_CI ≥ target`,
+  not just `mean ≥ target` — this disciplines small-sample wishful
+  thinking.
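+
+A minimal sketch of the percentile bootstrap described above, using only
+the standard library. The function name and return shape follow the
+`bootstrap_ci` helper planned for Phase 0; the percentile method, the
+`fmean` default, and the `meets_target` helper are assumptions made for
+illustration, not the final implementation.
+
+```python
+import random
+from statistics import fmean
+
+
+def bootstrap_ci(values, statistic=fmean, n=10_000, seed=0):
+    """Return (point_estimate, lower_95, upper_95) via percentile bootstrap."""
+    if not values:
+        raise ValueError("need at least one per-clip score")
+    rng = random.Random(seed)  # seeded so reports are reproducible
+    k = len(values)
+    # Resample clips with replacement and recompute the statistic each time.
+    resampled = sorted(statistic(rng.choices(values, k=k)) for _ in range(n))
+    lower = resampled[int(0.025 * (n - 1))]
+    upper = resampled[int(0.975 * (n - 1))]
+    return statistic(values), lower, upper
+
+
+def meets_target(values, target, **kwargs):
+    """Hypothetical helper: acceptance uses the CI lower bound, not the mean."""
+    _, lower, _ = bootstrap_ci(values, **kwargs)
+    return lower >= target
+```
+
+With a small tier, the mean can clear a target while the lower bound does
+not, which is the case the acceptance rule is meant to catch.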
+
+**Parsers:**
+- One parser per source, named by the annotation format (not the
+  source). Phase 0 ships: `guitarset_jams`, `guitar_techs_midi`,
+  `egdb_gp`. Each parser converts source-native annotations into the
+  §8 `TabEvent` dataclass list. Round-trip parser tests required.
+
+## 6. Phase outline (high-level only)
+
+Each phase has a goal + acceptance bar here. **Per-phase implementation
+plans** (exact files / tests / commands / acceptance outputs) are
+written **separately**, one phase at a time, only after the prior
+phase's evidence justifies starting it.
+
+- **Phase 0 — Foundation.** Per-tier baselines + error decomposition on
+  the composite eval. Acquire Guitar-TECHS; send EGDB email; verify free
+  compute accounts. **No production code changes.** Acceptance: per-tier
+  baseline numbers exist for ≥ 3 of 4 tiers with bootstrap CIs;
+  per-tier 7-bucket error breakdown exists. [Companion:
+  `2026-05-13-tab-f1-phase-0-implementation.md`.]
+- **Phase 1 — Pitch ceiling lift (cheap moves).** Voicing/silence gate
+  + peak-picking + Basic Pitch pitch-only ensemble. Acceptance: Pitch
+  F1 ≥ 0.93 on GuitarSet validation, no Onset F1 regression > 1 pp.
+- **Phase 2 — Highres fine-tune on GuitarSet only.** Lightning
+  free-tier; ~3 GPU-hr. **No SynthTab pretrain.** Acceptance: Pitch F1
+  ≥ 0.94, no Onset regression > 1 pp; cross-dataset sanity ≥ 0.90 on
+  Guitar-TECHS held-out.
+- **Phase 3 — Style/structure-conditional priors.** Leave-one-player-out
+  CV with hard regression guard. Acceptance: solo Tab F1 +2 pp vs
+  `guitarset-v1`, no per-bucket regression > 1 pp on comp, no fold
+  regression > 3 pp.
+- **Phase 4 — UI-field audit (capo/tuning/instrument/tone/style).**
+  Unit tests confirm each field propagates into a pipeline decision.
+- **Phase 5 — Learned fusion v2.** Re-attempt with proper features
+  (chord-context, prior-values, playability-cost, video-when-on).
+  Acceptance: +3 pp mean Tab F1, no per-fold regression > 3 pp,
+  margin-fallback to structured search baked in.
+- **Phase 6 — Video pipeline qualitative integration.** Enable
+  `TABVISION_VIDEO_ENABLED=true` in dev with a runtime quality gate.
+  Acceptance: video on/off does not regress audio-only metrics by > 0.5 pp.
+- **Phase 7 — Solo-gated melodic prior.** Acceptance: solo +3 pp,
+  comp ±1 pp.
+- **Phase 8 — Tier shortfall recovery.** Only if a tier still misses
+  its D2 target. Per-tier tactics (chord-shape templates for strummed,
+  IR-augmentation for distorted, etc.).
+- **Phase 9 — Final eval + DECISIONS.md update + SPEC.md PR.**
+
+Sequencing: 0 → 1 → 2 in series; 3–7 parallelizable after 2; 8 only
+on shortfall; 9 closes. Total wall-clock estimate: **4-6 weeks
+engineering** + ~1 week EGDB-email turnaround.
+
+## 7. Risks
 
 | # | Risk | Likelihood | Mitigation |
 |---|---|---|---|
-| R1 | SynthTab pretrain doesn't transfer to real audio (domain gap) | medium | Literature shows pretrain+fine-tune works (SynthTab paper, arXiv:2402.15258). Smoke on Guitar-TECHS held-out before committing to full pretrain spend. |
-| R2 | EGDB license never resolves | low-medium | Author replies are usually fast; if blocked, synthetic IR-based distorted electric via Phase 8 fallback. |
-| R3 | SynthTab labels are noisy (DadaGP human-transcribed varies in quality) | medium | Use SynthTab as pretrain only, never as eval gate. Phase 0 spot-check a 50-clip random sample. |
-| R4 | Per-tier composite eval set has too few clips per tier for statistical significance | medium | Bootstrap 95% CIs in all per-tier reports. State the CI explicitly when reporting against the D2 target. |
-| R5 | Video pipeline degrades audio-only metrics when enabled | low | Quality gate in Phase 6 + audio-only fallback. Phase 6 acceptance explicitly checks this. |
-| R6 | Phase 5 learned-fusion reproduces the apr-29 single-fold catastrophe | medium | Hard regression guard per-fold + margin fallback to structured search. Phase 5 decision tree pivots to Phase 7 if it triggers. |
-| R7 | Free-tier compute monthly allowance insufficient for Phase 2 + 8 retries | low | Lightning 22 hr/mo + Kaggle 30 hr/wk + Colab is ~150 GPU-hr/mo combined; Phase 2 needs ~14 hr. Plenty of buffer. |
-| R8 | LICENSES.md needs updates for Guitar-TECHS, GOAT, SynthTab, EGDB | certain | Update in Phase 0. Each is CC-BY-4.0 (or pending in EGDB's case); attribution must appear in README and any blog. |
+| R1 | EGDB license never resolves | medium | Phase 8 fallback: free-IR-augmented GuitarSet for distorted-electric tier; explicitly flagged as synthesized in reports. |
+| R2 | Guitar-TECHS clips don't span all promised tiers (some clean-electric tracks may be missing) | low-medium | Phase 0 acceptance only requires ≥ 3 of 4 tiers; distorted-electric can wait on EGDB. |
+| R3 | GuitarSet-only fine-tune (Phase 2) over-fits player 05's adjacent training distribution | medium | Cross-dataset sanity on Guitar-TECHS held-out; abort if Guitar-TECHS regresses > 5 pp. |
+| R4 | Per-tier composite has too few clips for statistical significance | medium | D2 acceptance requires `lower_95_CI ≥ target`, not mean. Per-tier minimum 20 clips / 500 notes (§5). |
+| R5 | Phase 5 learned fusion reproduces apr-29 single-fold catastrophe | medium | Strict per-fold regression guard + margin fallback. Decision tree pivots to Phase 7 if it triggers. |
+| R6 | LICENSES.md updates required for Guitar-TECHS / EGDB / IR packs | certain | Update in Phase 0 alongside acquisition. |
+| R7 | Free-tier monthly compute allowance exhausted before Phase 2 + 5 retries | low | Phase 2 ≈ 3 GPU-hr; Phase 5 is CPU. Combined < 10 hr/month, well inside Lightning's 22 hr cap. |
+| R8 | Synthetic data (DadaGP) inadvertently ends up in shipped weights via training/eval pipeline cross-contamination | low | Synthetic clips never appear in `tabvision/data/eval/manifest.toml`; an explicit assert in Phase 0 manifest validator rejects any synthetic-source clip in the default eval set. |
 
-## 7. Out of scope
+## 8. Out of scope
 
 - Personal training clips (D10).
-- Single-aggregate Tab F1 ≥ 0.88 (D1).
-- Stretch v1.1 (bends/slides/hammer-ons) per D8.
-- Quantitative video-gate (D5). Video ships qualitative-only.
-- Top-K UX surface — UI work is separate. D2 targets apply to top-1.
-- New SPEC §8 contracts — none of these phases changes signatures.
-- Real-money compute except for production smoke retests on Modal.
-
-## 8. Phase 0 user actions (the things only you can do)
-
-1. Sign up / verify free-tier compute accounts:
-   - Lightning Studios (https://lightning.ai)
-   - Kaggle (https://kaggle.com)
-   - Colab (https://colab.research.google.com)
-   - Weights & Biases (https://wandb.ai, free academic tier)
-2. Email the EGDB author for portfolio-use written permission.
-   Template:
-
-   > Subject: TabVision portfolio project — request to use EGDB
-   >
-   > Dr. Chen,
-   >
-   > I'm a developer building TabVision, a portfolio guitar
-   > transcription project (public GitHub repo, blog post, recorded
-   > demo). I would like to use EGDB as the distorted-electric
-   > evaluation tier of my multi-source test set, and cite your
-   > ICASSP 2022 paper. The repo has no LICENSE file, so I'm asking
-   > for written permission to use EGDB in this portfolio context,
-   > including reporting evaluation metrics computed on it.
-   >
-   > Thank you,
-   > Patrick Gilhooley
-
-3. Confirm or push back on the D2 per-tier targets (table in §0).
-4. Approve the plan; I cut a branch from `refactor/v1` and start
-   Phase 0E (the baseline measurement, since 0A and 0B are blocked on
-   the above).
-
-## 9. Things still genuinely unresolved
-
-These can be answered in flight; don't gate the plan on them.
-
-- The exact size of the SynthTab pilot (500 clips is a guess; the
-  right number is "smallest subset that produces a fine-tune gain"
-  and emerges from Phase 2's first run).
-- Whether Phase 4 finds any actual capo/tuning regressions worth
-  fixing, or if it's a 30-minute box-tick.
-- Phase 6A: whether procedural fretboard rendering is 2 days or 2
-  weeks of work. Defer until we know whether Phase 6 alone is enough.
-
-## 10. Open invitation to redirect
-
-This plan favors free compute over fast iteration; SynthTab over DIY
-synthesis; per-tier targets over single-aggregate; audio-only gates
-over speculative video-gate construction. If any of those defaults are
-wrong for what you actually want, say so before Phase 0 starts —
-backtracking from Phase 3 is expensive.
+- SynthTab in any shipped configuration (D4).
+- GOAT (license).
+- Aggregate Tab F1 ≥ 0.88 as an acceptance gate (D1).
+- Stretch v1.1 (bends / slides / hammer-ons) per D8.
+- Quantitative video-gate (D5).
+- Top-K UI optimization — UI work is separate; D2 applies to top-1.
+- §8 contract changes — no SPEC §8 signature edits in this plan.
+- Modal as a default training surface (D6).
+
+## 9. Open questions (do not gate the plan)
+
+- EGDB author reply timing — assumed ~1 week.
+- Whether Guitar-TECHS subdivides cleanly into "clean acoustic" vs
+  "clean electric" subsets at clip-level metadata, or whether we'll
+  need to inspect waveforms.
+- Whether free IR pack licenses (Modern Music Solutions, Djammincabs)
+  permit redistribution of derived audio in evaluation reports.
+  Phase 8 fallback only.
+
+## 10. Companion docs in this PR
+
+- `SPEC.md` — §1.4.1 amendment block (per-tier targets + composite test set).
+- `CLAUDE.md` — active-branch update (`main`, not `refactor/v1`).
+- `docs/plans/2026-05-13-tab-f1-phase-0-implementation.md` — Phase 0
+  implementation: exact files, tests, commands, acceptance outputs.
+
+Later phase implementation plans (`docs/plans/2026-05-NN-tab-f1-phase-N-implementation.md`)
+will be written one phase at a time, only after the prior phase's
+evidence is in.
diff --git a/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md b/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md
new file mode 100644
index 0000000..0a9cd5f
--- /dev/null
+++ b/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md
@@ -0,0 +1,305 @@
+# Tab F1 — Phase 0 Implementation Plan
+
+**Date:** 2026-05-13
+**Author:** Patrick (brainstormed with Claude)
+**Status:** Proposed — pending sign-off
+**Strategy doc:** `docs/plans/2026-05-12-tab-f1-to-spec-design.md`
+**Implementation branch:** to be cut as `impl/tab-f1-phase-0` off `main`
+after the strategy / SPEC amendment lands.
+
+## 0. Phase 0 goal recap
+
+Establish the per-tier baseline and error decomposition needed to
+sequence Phases 1+. **No production code changes; no shipped behavior
+changes; no compute spend on training.**
+
+Acceptance, copied from the strategy doc §6:
+
+- Per-tier baseline numbers for ≥ 3 of 4 D2 tiers with **bootstrap
+  95% CIs**, on the composite eval set.
+- Per-tier 7-bucket error decomposition on the same set.
+- Free-tier compute accounts (Local / Colab / Kaggle / Lightning / W&B)
+  verified.
+- EGDB author email sent; reply tracked in `docs/DECISIONS.md`.
+
+## 1. Files to add / modify
+
+### 1.1 New files
+
+| Path | Purpose |
+|---|---|
+| `tabvision/tabvision/eval/parsers/__init__.py` | Parser registry |
+| `tabvision/tabvision/eval/parsers/guitarset_jams.py` | JAMS → `list[TabEvent]` |
+| `tabvision/tabvision/eval/parsers/guitar_techs_midi.py` | 6-track MIDI → `list[TabEvent]` |
+| `tabvision/tabvision/eval/parsers/egdb_gp.py` | GuitarPro tab + MIDI → `list[TabEvent]` (skipped at import-time if PyGuitarPro not installed; runs only when EGDB license clears) |
+| `tabvision/tabvision/eval/composite.py` | `run_composite_eval(manifest_path) -> CompositeReport` — dispatches to per-source parsers and aggregates per-tier |
+| `tabvision/tabvision/eval/bootstrap.py` | Bootstrap CI helper: `bootstrap_ci(values, statistic=mean, n=10_000, seed=int) -> tuple[float, float, float]` returning `(mean, lower_95, upper_95)` |
+| `tabvision/tabvision/eval/error_decomposition.py` | Port of `tabvision-server/tools/error_analysis.py` (apr-28 7-bucket harness) targeting `list[TabEvent]` pairs |
+| `tabvision/scripts/eval/composite_eval.py` | CLI wrapper: `python tabvision/scripts/eval/composite_eval.py --manifest tabvision/data/eval/composite.toml --output docs/EVAL_REPORTS/composite_baseline_<date>.md` (same invocation as §3.4) |
+| `tabvision/scripts/eval/decompose_tab_errors.py` | CLI wrapper for error_decomposition.py |
+| `tabvision/scripts/eval/build_composite_manifest.py` | Generates `composite.toml` from on-disk datasets (invoked in §3.3) |
+| `tabvision/data/eval/composite.toml` | Composite-eval manifest (live; populated incrementally as datasets arrive) |
+| `tabvision/data/fixtures/eval/guitarset_05_BN1-129-Eb_comp.jams` | Single-clip JAMS fixture for parser round-trip test |
+| `tabvision/data/fixtures/eval/guitar_techs_sample.mid` | Single-clip 6-track MIDI fixture |
+| `tabvision/tests/unit/test_parser_guitarset_jams.py` | JAMS parser round-trip test |
+| `tabvision/tests/unit/test_parser_guitar_techs_midi.py` | MIDI parser round-trip test |
+| `tabvision/tests/unit/test_bootstrap_ci.py` | CI helper correctness on known distributions |
+| `tabvision/tests/unit/test_error_decomposition.py` | 7-bucket assignment correctness on synthetic predicted/gold pairs |
+| `tabvision/tests/integration/test_composite_eval_smoke.py` | End-to-end smoke: 5-clip manifest → tier numbers exist + CIs computed |
+| `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` | First baseline report (output of Phase 0E) |
+| `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` | First 7-bucket decomposition (output of Phase 0D) |
+
+### 1.2 Modified files
+
+| Path | Lines | Change |
+|---|---|---|
+| `tabvision/tabvision/eval/manifest.py` | the `REQUIRED_CLIP_FIELDS` block (currently ~lines 21-28) | Add `annotation_format` field so parser-dispatch can route by source |
+| `tabvision/tabvision/eval/manifest.py` | `validate_manifest()` | Reject any clip whose `source` indicates synthetic origin (e.g. starts with `synthtab/` or `dadagp/`) from a non-train split. This is the R8 cross-contamination guard from the strategy doc. |
+| `LICENSES.md` | datasets table | Add Guitar-TECHS (CC-BY-4.0), EGDB (pending), free IR packs as they're acquired |
+| `docs/DECISIONS.md` | append | D1–D11 from strategy doc §1 |
+| `pyproject.toml` (in `tabvision/`) | `[project.optional-dependencies]` | Add `eval` extra with `pretty_midi`, `pyguitarpro`, `jams` (already used elsewhere — verify before adding) |
+
+### 1.3 NOT modified
+
+- `tabvision/tabvision/pipeline.py` — no behavior change in Phase 0.
+- `tabvision/tabvision/fusion/**` — no fusion changes.
+- `tabvision-server/modal_app.py`, `tabvision-server/app/v1_adapter.py` — no production changes.
+- `tabvision-server/app/v1_adapter.py:91` `videoIgnoredByQualityGate` — flagged in strategy doc as a faked diagnostic, but the fix is Phase 6's job, not Phase 0's.
+
+## 2. Test plan
+
+Every test must be runnable via `pytest tabvision/tests/...` and skip
+cleanly when an optional dependency is missing (PyGuitarPro, jams).
+Fixtures go under `tabvision/data/fixtures/eval/`.
+
+### 2.1 Unit tests
+
+| Test name | Fixture | Assertion |
+|---|---|---|
+| `test_parser_guitarset_jams.py::test_jams_round_trip_pitch_string_fret` | `guitarset_05_BN1-129-Eb_comp.jams` (small, ~50 notes) | Every emitted `TabEvent` has `0 ≤ string_idx ≤ 5`, `0 ≤ fret ≤ 24`, monotonically non-decreasing `onset_s`. Total event count matches the JAMS namespace's note count. |
+| `test_parser_guitarset_jams.py::test_jams_pitch_consistency` | same | For each emitted event, MIDI pitch implied by `(string_idx, fret)` matches the JAMS-reported pitch. |
+| `test_parser_guitar_techs_midi.py::test_midi_round_trip_per_string` | `guitar_techs_sample.mid` (6 tracks, 1 per string) | Track index → `string_idx` mapping correct: track 0 → low E (`string_idx=0`), track 5 → high E (`string_idx=5`). |
+| `test_parser_guitar_techs_midi.py::test_midi_pitch_to_fret` | same | Per-string MIDI pitch → fret derivation matches expected standard-tuning offsets: E2=40 → fret 0 string 0, A2=45 → fret 5 string 0, etc. |
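+| `test_bootstrap_ci.py::test_ci_known_normal` | synthetic Gaussian N(0.85, 0.05), n=100 | Returned 95% CI brackets the true mean in roughly 95% of 1000 trials, within a stated tolerance (calibration check). A hard `≥ 95%` assertion would be flaky, because percentile-bootstrap coverage at n=100 sits at or slightly below nominal. |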
+| `test_bootstrap_ci.py::test_ci_handles_small_samples` | n=5 | No exception; CI width sane (≥ standard error). |
+| `test_bootstrap_ci.py::test_ci_deterministic_with_seed` | any | Same seed → same CI. |
+| `test_error_decomposition.py::test_seven_buckets_assigned` | synthetic gold + predicted `TabEvent` lists, one per bucket | Each gold event, plus each unmatched predicted event for `extra_detection`, lands in the expected bucket: `correct`, `wrong_position_same_pitch`, `pitch_off`, `timing_only`, `missed_onset`, `muted_undetectable`, `extra_detection`. |
+| `test_error_decomposition.py::test_share_of_loss_sums_to_one` | mixed gold + predicted | Per-bucket share-of-loss percentages sum to 100% (excluding the `correct` bucket). |
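+
+The pitch-to-fret expectation in `test_midi_pitch_to_fret` reduces to a
+subtraction against the open-string pitch. A minimal sketch, assuming
+standard tuning and the `string_idx=0` = low E convention from the table
+above; the constant and function names are hypothetical, not the parser's
+actual API.
+
+```python
+# Open-string MIDI pitches in standard tuning, low E (string_idx=0) to high E.
+STANDARD_TUNING_MIDI = (40, 45, 50, 55, 59, 64)  # E2 A2 D3 G3 B3 E4
+
+
+def pitch_to_fret(string_idx, midi_pitch, max_fret=24):
+    """Return the fret that produces midi_pitch on the given string."""
+    if not 0 <= string_idx < len(STANDARD_TUNING_MIDI):
+        raise ValueError(f"string_idx out of range: {string_idx}")
+    fret = midi_pitch - STANDARD_TUNING_MIDI[string_idx]
+    if not 0 <= fret <= max_fret:
+        raise ValueError(f"pitch {midi_pitch} not playable on string {string_idx}")
+    return fret
+
+
+def fret_to_pitch(string_idx, fret):
+    """Inverse mapping, useful for the JAMS pitch-consistency check."""
+    return STANDARD_TUNING_MIDI[string_idx] + fret
+```
+
+For example, `pitch_to_fret(0, 45)` gives fret 5 (A2 on the low E string),
+matching the fixture expectation in the table.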
+
+### 2.2 Integration tests
+
+| Test name | Setup | Assertion |
+|---|---|---|
+| `test_composite_eval_smoke.py::test_five_clip_manifest` | A 5-clip composite manifest using checked-in fixtures (3 GuitarSet, 2 Guitar-TECHS) | `run_composite_eval(manifest)` returns a `CompositeReport` whose tiers include both `clean_acoustic_single_line` and `clean_acoustic_strummed`. Each tier has a non-null `tab_f1_mean` and `tab_f1_ci_95`. |
+| `test_composite_eval_smoke.py::test_synthetic_clip_rejected_from_eval` | A manifest with one clip whose `source = "synthtab/test"` and `split = "test"` | `validate_manifest()` raises with a message mentioning the cross-contamination guard. |
+| `test_composite_eval_smoke.py::test_egdb_skipped_when_pyguitarpro_missing` | Manifest with an EGDB clip but PyGuitarPro not installed | Run completes successfully; the EGDB clip is reported as `skipped` with reason `parser_dependency_missing`. Other clips still evaluated. |
+
+### 2.3 What's NOT tested in Phase 0
+
+- The actual D2 acceptance numbers — those are the *output* of running
+  the harness, not a unit-test assertion. The CI gate is what's tested;
+  whether the system *hits* 0.85/0.90/0.87/0.80 is a question Phases
+  1-8 answer.
+- Bootstrap confidence on real production data — covered by the
+  smoke test on fixtures; running on production data is a one-shot
+  command, not a CI test.
+
+## 3. Commands
+
+All commands run from the repo root, in the WSL Ubuntu shell, with the
+`tabvision` package importable: either activate the venv
+(`source tabvision/venv/bin/activate`) or install it into the current
+environment (`pip install -e 'tabvision[dev,eval]'`).
+
+### 3.1 One-time setup
+
+```bash
+# Install eval extras (PyGuitarPro, pretty_midi, jams)
+cd tabvision && pip install -e '.[dev,eval]' && cd -
+
+# Verify tests pass on the base
+pytest tabvision/tests/unit/test_parser_guitarset_jams.py -v
+pytest tabvision/tests/unit/test_bootstrap_ci.py -v
+```
+
+### 3.2 Acquire Guitar-TECHS
+
+```bash
+# Guitar-TECHS is CC-BY-4.0, hosted on Zenodo (see strategy doc §4.1)
+mkdir -p ~/mir_datasets/guitar_techs
+# Download the dataset archive from the URL in arXiv:2501.03720
+# (resolved at acquisition time; not committed to repo)
+# Extract into ~/mir_datasets/guitar_techs/
+ls ~/mir_datasets/guitar_techs/
+```
+
+### 3.3 Build the manifest
+
+```bash
+# Generate composite.toml from on-disk datasets
+python tabvision/scripts/eval/build_composite_manifest.py \
+  --guitarset ~/mir_datasets/guitarset \
+  --guitar-techs ~/mir_datasets/guitar_techs \
+  --output tabvision/data/eval/composite.toml
+
+# Validate it
+python -c "from tabvision.eval.manifest import validate_manifest; print(validate_manifest('tabvision/data/eval/composite.toml'))"
+```
+
+### 3.4 Run the baseline composite eval
+
+```bash
+python tabvision/scripts/eval/composite_eval.py \
+  --manifest tabvision/data/eval/composite.toml \
+  --backend highres \
+  --position-prior guitarset-v1 \
+  --bootstrap-n 10000 \
+  --bootstrap-seed 42 \
+  --output docs/EVAL_REPORTS/composite_baseline_2026-05-13.md
+```
+
+### 3.5 Run the error decomposition
+
+```bash
+python tabvision/scripts/eval/decompose_tab_errors.py \
+  --manifest tabvision/data/eval/composite.toml \
+  --backend highres \
+  --position-prior guitarset-v1 \
+  --output docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md
+```
+
+### 3.6 Verify free-tier compute accounts
+
+```bash
+# W&B: confirm login + a tiny no-op run
+wandb login
+python -c "import wandb; r = wandb.init(project='tabvision-phase0', mode='online'); r.log({'hello': 1}); r.finish()"
+
+# Lightning Studios: open a Studio in the browser, run `nvidia-smi`, screenshot for the DECISIONS.md log
+
+# Kaggle: open a notebook in the browser, run `!nvidia-smi`
+
+# Colab: same
+
+# Modal: skip — used only as last resort per D6
+```
+
+### 3.7 Send the EGDB email
+
+User action — not a command. Template in strategy doc; log the
+date sent and the reply (when it arrives) in `docs/DECISIONS.md`.
+
+## 4. Acceptance outputs
+
+These are the artifacts whose existence + content gates Phase 1.
+
+### 4.1 `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md`
+
+Must contain:
+
+- A per-tier table:
+  - Tier name
+  - Clip count (≥ 20 for any tier claimed against D2)
+  - Mean Tab F1
+  - **95% bootstrap CI lower bound**
+  - Mean Onset F1
+  - Mean Pitch F1
+- Per-source breakdown within each tier (GuitarSet / Guitar-TECHS /
+  EGDB) so we can see whether a tier number is dominated by one
+  source.
+- A "Status vs D2 target" column with one of: **pass** (CI lower ≥
+  target), **gap** (mean ≥ target but CI lower below), **fail** (mean
+  below target).
+- Methodology footer: bootstrap N, seed, parser versions, backend +
+  prior versions, eval-harness commit SHA.
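+
+The "Status vs D2 target" column follows directly from the three rules
+above. A minimal sketch; the function name is hypothetical, and the 0.90
+target in the test cases is the strummed-tier D2 value.
+
+```python
+def d2_status(mean_f1, ci_lower, target):
+    """Classify a tier against its D2 target using the CI lower bound."""
+    if ci_lower >= target:
+        return "pass"  # even the pessimistic end of the CI clears the bar
+    if mean_f1 >= target:
+        return "gap"   # mean clears, but the sample is too noisy to claim it
+    return "fail"      # mean itself is below target
+```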
+
+### 4.2 `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md`
+
+Must contain:
+
+- Aggregate 7-bucket table (counts + share-of-loss).
+- Per-tier 7-bucket table.
+- A "biggest lever per tier" callout: which bucket dominates each
+  tier's loss. Phase 1+ priorities derive from this.
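+
+Share-of-loss and the "biggest lever" callout can both be computed from a
+flat list of per-event bucket labels. A minimal sketch, assuming the bucket
+names from §2.1 and excluding `correct` from the denominator; the function
+names are hypothetical.
+
+```python
+from collections import Counter
+
+
+def share_of_loss(bucket_labels):
+    """Map each non-correct bucket to its fraction of all lost events."""
+    counts = Counter(bucket_labels)
+    losses = {b: c for b, c in counts.items() if b != "correct"}
+    total = sum(losses.values())
+    if total == 0:
+        return {}  # nothing lost, so there is no share to report
+    return {bucket: count / total for bucket, count in losses.items()}
+
+
+def biggest_lever(bucket_labels):
+    """Return the bucket that dominates the loss, or None if there is none."""
+    shares = share_of_loss(bucket_labels)
+    return max(shares, key=shares.get) if shares else None
+```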
+
+### 4.3 `tabvision/data/eval/composite.toml`
+
+Must satisfy `validate_manifest()` and contain:
+
+- ≥ 20 clips for each of: `clean_acoustic_single_line`,
+  `clean_acoustic_strummed`. (Guitar-TECHS additions may bring
+  `clean_electric` to ≥ 20 in Phase 0E; if not, that tier waits for
+  EGDB.)
+- `clean_electric` and `distorted_electric` populated as much as
+  Guitar-TECHS + EGDB-license-resolved allow.
+- No `source = synthtab/...` or `source = dadagp/...` rows in `split =
+  validation` or `split = test`.
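+
+The no-synthetic-rows rule can be checked mechanically. The sketch below
+is illustrative only: it assumes clips are dicts with `id`, `source`, and
+`split` keys, which may not match the real manifest schema, and it is not
+the actual `validate_manifest()` implementation.
+
+```python
+SYNTHETIC_PREFIXES = ("synthtab/", "dadagp/")
+EVAL_SPLITS = frozenset({"validation", "test"})
+
+
+def synthetic_eval_clips(clips):
+    """Return clips that are synthetic-sourced but sit in an eval split."""
+    return [
+        clip
+        for clip in clips
+        if clip["split"] in EVAL_SPLITS
+        and clip["source"].startswith(SYNTHETIC_PREFIXES)
+    ]
+
+
+def assert_no_synthetic_in_eval(clips):
+    """Raise if the cross-contamination guard is violated."""
+    offenders = synthetic_eval_clips(clips)
+    if offenders:
+        ids = ", ".join(clip.get("id", "<unknown>") for clip in offenders)
+        raise ValueError(
+            f"cross-contamination guard: synthetic clips in eval split: {ids}"
+        )
+```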
+
+### 4.4 `docs/DECISIONS.md` entries
+
+D1–D11 from strategy doc §1, dated 2026-05-13. EGDB email send-date
+and reply (when it arrives) as a separate entry.
+
+### 4.5 CI verification
+
+`pytest tabvision/tests/unit tabvision/tests/integration -v` passes
+on `main` HEAD plus this Phase 0 branch.
+
+## 5. Decision tree
+
+What to do after Phase 0E baseline is in:
+
+- **All four tiers' CI lower bound clears D2** — surprising; sanity
+  check the eval harness, then declare v1 acceptance and skip to
+  Phase 9. This is unlikely given the 2026-05-08 0.61 aggregate.
+- **Strummed CI lower bound clears D2, other tiers gap or fail** —
+  expected case. Proceed to Phase 1 (pitch ceiling lift). The
+  error-decomposition report tells us whether Phase 2 (fine-tune) or
+  Phase 3 (style priors) is the next priority after Phase 1.
+- **All tiers fail** — Phase 0 implementation has a bug, or the
+  highres backend regressed on the broader corpus. Inspect 3-5
+  worst-case clips by hand before any further compute spend.
+- **`distorted_electric` has < 20 clips** — EGDB license is the
+  blocker. Set the tier aside; document the gap in the report; do not
+  publish D2 acceptance until the EGDB row clears.
+
+## 6. Time + compute budget
+
+| Item | Effort | Compute |
+|---|---|---|
+| Parser implementations + tests (1.1) | 1.5 days | none |
+| Manifest extensions + validator hardening (1.2) | 0.5 day | none |
+| Composite + bootstrap + error-decomposition modules (1.1) | 1 day | none |
+| Guitar-TECHS acquisition + manifest population | 0.5 day | none |
+| Baseline + decomposition runs (3.4 + 3.5) | 4-8 wall-clock hours | local CPU |
+| Free-tier compute account verification | 0.5 day | none |
+| EGDB email + DECISIONS.md updates | 15 minutes | none |
+| Report writing | 0.5 day | none |
+| **Total** | **4-5 days engineering** | **~$0** |
+
+## 7. Out of scope for Phase 0
+
+- Any production-pipeline change. No edits to `pipeline.py`, `fusion/`,
+  `audio/`, `video/`, `tabvision-server/`.
+- Fine-tuning, training, or model weight changes.
+- Anything depending on the EGDB license reply (defer to Phase 8 or
+  later).
+- Style-conditional priors (Phase 3).
+- Video pipeline experiments (Phase 6).
+- Synthetic-data generation (research/dev only; not part of Phase 0).
+
+## 8. Done definition
+
+Phase 0 is **done** when:
+
+- All items in §1.1 and §1.2 exist on the impl branch.
+- All tests in §2.1 and §2.2 pass green.
+- `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` exists and meets
+  §4.1.
+- `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` exists
+  and meets §4.2.
+- `tabvision/data/eval/composite.toml` exists and validates.
+- `docs/DECISIONS.md` includes D1–D11.
+- EGDB email send-date recorded.
+- Free-tier compute accounts verified (W&B at minimum; Lightning /
+  Kaggle / Colab logged in `docs/DECISIONS.md`).
+
+Then — and only then — the Phase 1 implementation plan gets written.