From 6f2117d1f4fe61842f775c34641936dec786490c Mon Sep 17 00:00:00 2001
From: Patrick Gilhooley
Date: Wed, 13 May 2026 06:15:58 -0400
Subject: [PATCH 1/2] docs(plan): tab F1 per-tier targets and 10-phase work breakdown
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Plan for closing the Tab F1 gap on GuitarSet validation (currently 0.61
aggregate). Replaces the single-aggregate SPEC §1.4 target with per-tier
targets (clean acoustic single-line 0.85, strummed 0.90, clean electric
0.87, distorted electric 0.80) and locks in 10 decisions on the eval
composite, compute, and scope.

10-phase plan: foundation setup → pitch ceiling lift → SynthTab pretrain
+ highres fine-tune → style priors → UI-field audit → learned fusion v2
→ video qualitative integration → solo-gated melodic prior → tier
shortfall recovery → final eval.

Plan-only, no code changes. Phase 0 user actions enumerated in §8.
---
 .../plans/2026-05-12-tab-f1-to-spec-design.md | 536 ++++++++++++++++++
 1 file changed, 536 insertions(+)
 create mode 100644 docs/plans/2026-05-12-tab-f1-to-spec-design.md

diff --git a/docs/plans/2026-05-12-tab-f1-to-spec-design.md b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
new file mode 100644
index 0000000..41c14a9
--- /dev/null
+++ b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
@@ -0,0 +1,536 @@
+# Tab F1 → Per-Tier Targets — Design
+
+**Date:** 2026-05-12
+**Author:** Patrick (brainstormed with Claude)
+**Status:** Proposed — pending sign-off
+**Spec source:** `SPEC.md` §1.4 (per-tier table), §5 Phase 5, §8 contracts, §1.5 hard constraints, §6.3 free compute accounts.
+**Branch:** to be cut off `refactor/v1` once approved.
+**Depends on:** `docs/plans/2026-05-06-phase5-fusion-design.md`, `docs/plans/2026-05-06-video-pipeline-integration-design.md`, `docs/EVAL_REPORTS/guitarset_accuracy_boost_2026-05-08.md`.
+**Replaces:** earlier 2026-05-12 single-aggregate-target draft (never committed).
+
+## 0. Decisions taken on 2026-05-12
+
+These were locked in during the planning conversation; record them in
+`docs/DECISIONS.md` per SPEC §0.5 once the plan is approved.
+
+| # | Decision | Rationale |
+|---|---|---|
+| D1 | Tab F1 evaluated **per tier**, not as a single aggregate. SPEC §1.4 aggregate 0.88 is retired. | Aggregate hides the real failure mode (string/fret assignment on solo lines). Per-tier targets force the conversation onto the right axis. |
+| D2 | Per-tier numeric targets (table below). | Strummed raised from SPEC 0.86 → 0.90; distorted-electric floor lowered 0.82 → 0.80. Middle tiers relaxed to reflect the gap between current 0.61 and any realistic ceiling. |
+| D3 | Eval set is a **multi-source composite**: GuitarSet + Guitar-TECHS + GOAT + EGDB (pending license) + synthetic. Personal videos banned from any role. | GuitarSet alone gives one player, one genre cluster, no electric/distorted. Per-tier evaluation requires per-tier sources. |
+| D4 | **SynthTab** pretrain → real-data fine-tune is the audio-side plan. No DIY DadaGP synthesis unless SynthTab proves insufficient. | SynthTab (CC-BY-4.0, ~6,700 h with string/fret labels) pre-empts the engineering cost of building a renderer. Literature (SynthTab paper, High-Res Domain Adaptation arXiv:2402.15258) shows pretrain+fine-tune lifts cross-dataset generalization. |
+| D5 | **No quantitative video-gate.** Video pipeline ships as a qualitative feature. Production runs audio-only; video is opt-in. | No public dataset has synchronized guitar video + per-note string/fret labels. Confirmed via 2026-05-12 research pass (see §3.1). |
+| D6 | **Free-tier compute first.** Order: local CPU > Lightning Studios free (22 GPU-hr/mo) > Kaggle (30 hr/wk T4) > Colab > Modal. | Per CLAUDE.md operating rule 6, SPEC §6.3, and the SPEC §1.5 hard constraint. The earlier $30-80 fine-tune estimate was Modal pricing; free tier fits a highres fine-tune comfortably. |
+| D7 | **1-2 month cadence.** No fixed deadline. | User-stated. |
+| D8 | Stretch goals (bends / slides / hammer-ons / pull-offs) **out of scope** for v1; SPEC §1.4 already marks them v1.1. | Confirmed in conversation. |
+| D9 | Top-K is acceptable as an editor UX feature but the **0.80 floor and per-tier targets apply to the top-1 prediction**. | User-stated. |
+| D10 | Personal training clips (the 20-video set) **off the table entirely** — not as accuracy gate, not as dev set, not as label source. | User-stated. |
+
+### Per-tier Tab F1 targets (D2)
+
+| Tier | SPEC §1.4 | This plan |
+|---|---:|---:|
+| Clean acoustic single-line | 0.94 | **0.85** |
+| Clean acoustic strummed | 0.86 | **0.90** |
+| Clean electric | 0.90 | **0.87** |
+| Distorted electric | 0.82 | **0.80** |
+
+All on the multi-source composite test set (D3). Top-1 prediction only.
+Onset F1 (≥ 0.92) and Pitch F1 (≥ 0.90) from SPEC §1.4 remain unchanged
+— audio already clears them on GuitarSet.
+
+## 1. Goal
+
+Hit the D2 per-tier Tab F1 targets on the D3 composite eval set within
+1-2 months using free-tier compute, while keeping the production system
+the SPEC §8 contract-conformant v1 pipeline.
+
+## 2. Current evidence
+
+GuitarSet validation, 60 tracks, 8715 gold notes, 2026-05-08 production
+candidate (highres + `guitarset-v1` prior, no video, no melodic prior):
+
+| Metric | Current | SPEC | Status |
+|---|---:|---:|---|
+| Onset F1 (50 ms) | 0.9218 | ≥ 0.92 | pass |
+| Pitch F1 (50 ms) | 0.9022 | ≥ 0.90 | pass |
+| Tab F1 aggregate | 0.6104 | ≥ 0.88 (deprecated) | retired metric |
+
+Per-track distribution (2026-05-12 diagnostic):
+
+- Tab F1 mean **0.589**, median 0.620, min 0.166, max 0.933
+- Comp tracks (n=30) mean **0.670**; solo tracks (n=30) mean **0.508**
+- Worst 10 tracks: 7 are solos. Best 5: 4 are comps.
+- Tab/Pitch ratio: comp 0.744, **solo 0.546** — solos lose 45% of
+  pitch-correct notes to wrong string/fret assignment.
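The "solos lose 45%" figure follows from the Tab/Pitch ratio by simple subtraction. A minimal sketch of that arithmetic, assuming the ratio can be read as the share of pitch-correct notes that also get the right string/fret (an approximation, since it is a ratio of F1 scores rather than of note counts); the function name is hypothetical:

```python
def assignment_loss(tab_over_pitch_ratio: float) -> float:
    """Approximate share of pitch-correct notes lost to wrong string/fret.

    Assumption: the Tab/Pitch F1 ratio approximates the fraction of
    pitch-correct notes whose string/fret assignment is also correct.
    """
    return 1.0 - tab_over_pitch_ratio


# Ratios quoted in the per-track diagnostic above.
ratios = {"comp": 0.744, "solo": 0.546}

for regime, ratio in ratios.items():
    loss = assignment_loss(ratio)
    print(f"{regime}: roughly {loss:.0%} of pitch-correct notes lose string/fret")
```

Read this way, comp tracks lose roughly a quarter of their pitch-correct notes at the assignment step and solo tracks lose close to half.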
+
+**Bottleneck is string/fret assignment on single-line passages where
+chord-cluster context is absent.** Audio is essentially at spec; only the
+Tab F1 numbers are red, and only on the solo regime.
+
+The single-tier mapping of GuitarSet is "clean acoustic strummed" for
+comp tracks and "clean acoustic single-line" for solo tracks. The
+electric and distorted-electric tiers (D2) have no current measurement
+and must be acquired (D3).
+
+## 3. Resource inventory
+
+### 3.1 Datasets
+
+Verified by the 2026-05-12 research pass. Italics = on-disk now;
+**bold** = to acquire.
+
+| Source | License | Modality | Labels | Size | Tier coverage |
+|---|---|---|---|---|---|
+| *GuitarSet* | CC-BY-4.0 | audio (hex + DI) | JAMS (string + fret + pitch) | 3 h, 6 players | clean acoustic single-line, strummed |
+| **Guitar-TECHS** | CC-BY-4.0 | audio (multi-mic + DI) | 6-track MIDI per string | 5h12m | clean acoustic single-line, clean electric |
+| **GOAT** | CC-BY-4.0 | DI audio | tablature | 5.9 h | clean electric |
+| **EGDB** | None on repo — **email author for portfolio-use permission** | audio (DI + 5 amp sims, ~6 renders) | GuitarPro tabs + aligned MIDI (string + fret) | ~12 h synthesized | clean electric, distorted electric |
+| **SynthTab** | CC-BY-4.0 | synthesized audio | string + fret + onset | ~6,700 h | all four tiers (pretrain only) |
+| GAPS | CC-BY-NC-SA | YouTube video links + MIDI pitch | pitch-only, **not tab** | 14 h | reject — non-commercial taint |
+| DadaGP | research-access (email) | symbolic GP files | tab natively | 26,181 files | fallback synthesis source if SynthTab insufficient |
+| ~~The 20 personal clips~~ | n/a | n/a | n/a | n/a | **banned** (D10) |
+
+**Confirmed gap:** no public dataset combines guitar video with per-note
+string+fret labels. This is the load-bearing finding behind D5.
+
+### 3.2 Compute
+
+| Account | Free allowance | Status | Use |
+|---|---|---|---|
+| Lightning Studios | 22 GPU-hr/month | Phase 0 setup | SynthTab pretrain, highres fine-tune |
+| Kaggle | ~30 GPU-hr/week T4 | Phase 0 setup | overflow for long sweeps |
+| Colab | ~12 hr/day with limits | Phase 0 setup | quick experiments |
+| W&B | unlimited (academic) | Phase 0 setup | experiment tracking |
+| HuggingFace Hub | unlimited public | already used | weights / checkpoints |
+| Modal | pay-per-use | already used | production smoke retests only |
+| Local CPU | 6 cores WSL2 | available | eval, priors, light tuning |
+
+Per CLAUDE.md operating rule 6: Local > Colab > Kaggle > Lightning >
+Modal. Modal is the last resort, not the default.
+
+### 3.3 Code already in tree
+
+- `tabvision.audio.highres` — production pitch backend, 0.92 / 0.90 on GuitarSet.
+- `tabvision.fusion.position_prior` — `guitarset-v1` prior, +22pp Tab F1.
+- `tabvision.fusion.{viterbi,chord,playability,neck_prior,melodic_prior}` — Phase 5 shipped, cluster Viterbi + chord state enumeration + playability emission/transition costs.
+- `tabvision.video.{guitar,fretboard,hand}` — Phase 4 shipped (1603 LOC).
+- `tabvision.pipeline.run_pipeline` — composes all of the above; production runs through it via `tabvision-server/app/v1_adapter.py`.
+- `tabvision-server/tools/eval_basic_pitch_baseline.py` + `tabvision/scripts/eval/guitarset_audio_eval.py` — current evaluation harness; needs extension for multi-source composite.
+- `tabvision-server/tools/outputs/errors-2026-04-28_185743.md` — apr-28 error-decomposition methodology (proven on personal clips); port the same 7-bucket harness to the composite eval set.
+
+### 3.4 What has been tried (lessons)
+
+| Attempt | Date | Outcome | Lesson |
+|---|---|---|---|
+| Learned-fusion LightGBM ranker | 2026-04-29 | +0.3pp LOOCV vs +5pp gate; **catastrophic −27.8pp regression on training-17** | Small data + over-fit on one held-out group. **Critically: video features were `null` on every row** (`audio_only=True`) — so this wasn't actually a test of learned-fusion-with-video. Re-attempt with proper feature instrumentation is justified. |
+| Basic Pitch fine-tune on GuitarSet | 2026-04-29/30 | Did happen; superseded by highres backend swap before final integration | Fine-tune infrastructure is reusable for highres; SynthTab pretrain is the missing first step. |
+| Melodic prior | current | Regresses aggregate Tab F1 from 0.6104 to 0.5989 | Helps solo, hurts comp. Needs solo-density gating, not a flat enable. |
+| Position prior `guitarset-v1` | 2026-05-08 | +22pp Tab F1 vs no prior | Per-pitch tabular priors are the largest-leverage cheap intervention. Style/structure-conditional versions are the natural extension. |
+| Phase 5 cluster Viterbi + chord enumeration | 2026-05-06 | Shipped, drives current production | The audio-only structured search is already well-tuned. Further gain needs either better priors or different evidence (which video can't provide on the eval set). |
+
+## 4. Plan
+
+10 phases. Phases 0–2 sequential; 3–7 parallelizable; 8–9 follow in
+order. Decision tree inside each phase determines whether to continue,
+branch, or escalate.
+
+### Phase 0 — Foundation (parallel, no compute, 1 week wall-clock)
+
+**Goal:** assemble the evidence base + accounts the rest of the plan
+depends on. No production code changes; setup only.
+
+Concurrent tracks:
+
+- **0A. Acquisition.**
+  - [user] Email EGDB author (`f08946011@ntu.edu.tw`) for written
+    portfolio-use permission. Template draft in §8.
+  - [code] Download Guitar-TECHS and GOAT (both CC-BY-4.0, no email).
+  - [code] Sample SynthTab to a 500-clip pilot subset (~50 h). Full
+    download deferred until Phase 2.
+- **0B. Compute accounts.**
+  - [user] Lightning Studios, Kaggle, Colab, W&B sign-ups per SPEC §6.3.
+  - [code] Verify each by running a hello-world (W&B init + a GPU
+    `nvidia-smi` job on each platform).
+- **0C. Eval harness extension.**
+  - [code] Build `tabvision/scripts/eval/composite_eval.py`. Reads a
+    manifest TOML (per-clip tier label + source + audio path + tab
+    annotation path) and runs the same `guitarset_audio_eval.py` logic
+    across all sources. Outputs per-tier Tab F1, per-source CSVs, and a
+    consolidated Markdown report.
+  - Manifest schema follows the placeholder in
+    `tabvision/data/eval/manifest.toml`. Tier label is one of
+    `clean_acoustic_single_line`, `clean_acoustic_strummed`,
+    `clean_electric`, `distorted_electric`.
+- **0D. Error decomposition.**
+  - [code] Port `tools/error_analysis.py` (apr-28 7-bucket harness)
+    from personal-clip input to the composite eval set. Output:
+    `docs/EVAL_REPORTS/error_decomposition_.md` with per-tier
+    bucket counts.
+- **0E. Baseline measurement.**
+  - [code] Run `composite_eval.py` against the current production
+    pipeline. Get the per-tier numbers. These are the Phase 1+
+    starting points.
+
+**Phase 0 acceptance gate:**
+- Per-tier Tab F1 baseline numbers exist for at least 3 of the 4 tiers
+  (distorted electric is EGDB-dependent; deferred OK).
+- Per-tier 7-bucket error decomposition exists.
+- All free-tier compute accounts verified.
+- EGDB email sent.
+- No production code changes.
+
+**Decision tree:**
+- If baseline already hits some tier (e.g., strummed at 0.92) → drop
+  that tier from later phases' work.
+- If pitch-side metrics regress vs the 2026-05-08 GuitarSet numbers on
+  the composite set → STOP and investigate before any further work.
+  The composite eval should not change audio-side numbers on GuitarSet.
+
+### Phase 1 — Pitch ceiling lift, cheap moves (local CPU, 2-3 days)
+
+**Goal:** Pitch F1 from 0.915 → ≥ 0.93 on GuitarSet validation, without
+training. Gives Tab F1 mathematical headroom regardless of fusion-side
+work.
+
+Moves, in order:
+
+1. **Voicing/silence gate** on highres pitch posteriors. Tune the
+   joint onset+pitch confidence threshold. Trade some recall for
+   precision; expect net F1 gain.
+2. **Onset peak-picking adjustment.** The 50 ms tolerance is generous;
+   misaligned within-tolerance peaks still produce pitch mis-reads.
+   Improve peak localization → tighter onset match → higher pitch TP
+   count.
+3. **Basic Pitch pitch-only ensemble.** Run Basic Pitch alongside
+   highres. Use Basic Pitch's pitch output (not onset) as a tiebreaker
+   on pitch-disagreement events; downweight (or drop) events where the
+   two backends disagree on pitch. SPEC §6.1 path; LICENSES.md
+   confirms Basic Pitch is Apache-2.0 default-pipeline-safe.
+
+**Phase 1 acceptance:**
+- Pitch F1 ≥ 0.93 on GuitarSet validation.
+- Onset F1 ≥ 0.92 (no regression).
+- Aggregate Tab F1 ≥ 0.62 (no regression beyond mathematical
+  pitch-improvement bound).
+
+**Decision tree:**
+- 0.93 met → continue.
+- 0.92–0.93 → continue; Phase 2 fine-tune still useful as ceiling lift.
+- < 0.92 → diagnose. Could be a threshold-sweep artifact rather than a
+  real regression. Inspect on 3-5 representative tracks before
+  escalating.
+
+### Phase 2 — SynthTab pretrain + highres fine-tune (Lightning, 1 week)
+
+**Goal:** Pitch F1 ≥ 0.94 on GuitarSet validation. Lift the audio
+ceiling beyond what threshold-tuning alone can do.
+
+**Approach.** Per the SynthTab paper (ICASSP 2024) and arXiv:2402.15258,
+the proven recipe is: pretrain on synthetic, fine-tune on real.
+
+- **Pretrain corpus:** SynthTab 500-clip pilot (Phase 0). Full set
+  (~6,700 h) is overkill at this stage and won't fit in the free tier
+  monthly budget.
+- **Pretrain head:** the highres model's pitch+onset head. Backbone
+  frozen for the pretrain phase to avoid catastrophic forgetting on
+  the spectral feature extractor.
+- **Fine-tune:** GuitarSet train split (4 players, 240 tracks ≈ 2 h),
+  unfrozen, 5-10 epochs with early stopping on Pitch F1.
+- **Compute:** Lightning Studios free tier (22 GPU-hr/month). Estimate:
+  pretrain ~6 GPU-hr, fine-tune ~3 GPU-hr. Buffer for re-runs ~5 GPU-hr.
+  Stays inside the monthly allowance.
+
+**Phase 2 acceptance:**
+- GuitarSet validation Pitch F1 ≥ 0.94.
+- No Onset F1 regression > 1 pp.
+- Cross-dataset sanity: on Guitar-TECHS (held out from training),
+  Pitch F1 ≥ 0.90 (no catastrophic transfer loss).
+
+**Decision tree:**
+- Met all three → continue. New `audio_backend = "highres-synthtab"`
+  becomes the candidate for production replacement.
+- GuitarSet met, Guitar-TECHS regresses > 5 pp → over-fit on the
+  pretrain distribution. Reduce pretrain epochs, increase fine-tune
+  weight, retry once.
+- GuitarSet ≤ 0.93 → SynthTab pretrain didn't transfer; abandon
+  Phase 2 and revisit with the actual diagnostic (Pitch P/R curves)
+  before any further training spend.
+
+### Phase 3 — Style/structure-conditional priors (local CPU, 3 days)
+
+**Goal:** lift Tab F1 on solos via finer-grained per-pitch position
+priors. Expected +1 to +5 pp on solo subsets.
+
+- **Buckets:** {bn, jazz, funk, rock, ss} × {solo, comp} = 10 priors.
+  GuitarSet's `style` field gives the genre axis directly; structure
+  axis derived from cluster-singleton density (already computable in
+  fusion).
+- **Train:** GuitarSet train split (players 00, 01, 02, 03, 04).
+  Per-bucket Laplace-smoothed counts. Empty cells fall back to
+  `guitarset-v1`.
+- **Validate:** leave-one-player-out CV (not LOOCV per-clip — too
+  small). Primary metric: per-bucket Tab F1 delta vs `guitarset-v1`
+  baseline on the held-out player.
+- **Risk:** the apr-29 learned-fusion attempt failed with one
+  catastrophic regression. Same class of risk here — small data,
+  bucketing on 4 training players. **Hard regression guard:** abort
+  the bucket if any cross-validation fold regresses by > 3 pp.
+
+**Phase 3 acceptance:**
+- Mean Tab F1 over solo buckets: +2 pp vs `guitarset-v1` baseline.
+- No bucket regresses by > 1 pp on comp.
+- No cross-validation fold regresses by > 3 pp on any bucket.
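The per-bucket Laplace-smoothed counts with an empty-cell fallback described for Phase 3 could be sketched roughly as below. This is an illustration under stated assumptions, not the `tabvision.fusion.position_prior` implementation: every function name is hypothetical, strings are indexed 0 (low E) to 5 (high E) in standard tuning with frets 0-19, and the fallback callable stands in for the `guitarset-v1` prior.

```python
import math
from collections import Counter, defaultdict

# Assumption: standard tuning, string index 0 = low E ... 5 = high E.
OPEN_STRING_MIDI = [40, 45, 50, 55, 59, 64]
MAX_FRET = 19  # assumed fret range for the sketch


def candidate_positions(pitch):
    """All (string, fret) pairs that can sound the given MIDI pitch."""
    return [
        (string, pitch - open_midi)
        for string, open_midi in enumerate(OPEN_STRING_MIDI)
        if 0 <= pitch - open_midi <= MAX_FRET
    ]


def fit_bucket_counts(notes):
    """Count (string, fret) usage per (bucket, pitch) cell.

    `notes` is an iterable of (bucket, midi_pitch, string, fret) tuples,
    where bucket is a label such as "jazz_solo" (hypothetical naming).
    """
    counts = defaultdict(Counter)
    for bucket, pitch, string, fret in notes:
        counts[(bucket, pitch)][(string, fret)] += 1
    return counts


def bucket_log_prob(counts, bucket, pitch, position, fallback_log_prob, alpha=1.0):
    """Laplace-smoothed log P(position | bucket, pitch).

    Empty cells defer to `fallback_log_prob(pitch, position)`, which plays
    the role of the existing un-bucketed prior.
    """
    cell = counts.get((bucket, pitch))
    if not cell:
        return fallback_log_prob(pitch, position)
    candidates = candidate_positions(pitch)
    total = sum(cell.values()) + alpha * len(candidates)
    return math.log((cell[position] + alpha) / total)
```

Smoothing over the playable candidates only keeps each cell a proper distribution, provided the training labels themselves fall inside the assumed fret range.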
+
+**Decision tree:**
+- Met → ship the prior set, expose `position_prior = "guitarset-styled-v1"`.
+- Solo gain < 2 pp → drop the structure axis, ship style-only.
+- Any bucket fails the regression guard → drop that bucket only;
+  fall back to `guitarset-v1` for it. Don't kill the whole experiment
+  on one bad bucket.
+
+### Phase 4 — Style+structure-aware capo/tuning audit (local, 1 day)
+
+**Goal:** verify the capo / instrument / tone / style fields from the
+upload UI are actually flowing into prior selection and playability
+weights as designed.
+
+- **Trace:** unit-test that with `capo_fret = 5` the position prior
+  shifts correctly (frets 0-19 become frets 5-24).
+- **Smoke:** run a known capo-3 clip from GuitarSet (if any exist)
+  and confirm the output tab is rendered against the capo.
+- **Audit playability:** confirm `instrument = electric` doesn't apply
+  the open-string bonus differently when it shouldn't, etc.
+
+Small phase; mostly a correctness-check before later phases compound
+any bugs here.
+
+**Phase 4 acceptance:**
+- All upload-form fields measurably affect at least one pipeline
+  decision per a unit test.
+- No silent fallback to defaults on any field.
+
+### Phase 5 — Learned fusion v2 (local, 3-5 days)
+
+**Goal:** the 2026-04-24 plan's learned-fusion approach, redone with
+proper feature instrumentation. **Per-pitch + chord-context ranker**,
+not the audio-only ranker that flat-lined at +0.3 pp in 2026-04-29.
+
+**Why this can work this time:** the apr-29 attempt's per-candidate
+features were limited (no fusion-prior values, no neck-anchor values
+because video was off, no chord-cluster context). With Phase 5
+shipping the structured search already, those values are now exposed
+and can be features.
+
+**Per-candidate features:**
+- `pitch`, `confidence`, `duration`, `amplitude` (audio).
+- `position_prior_log_prob`, `melodic_prior_log_prob`,
+  `neck_prior_log_prob` (fusion priors at this candidate).
+- `cluster_size`, `cluster_span`, `is_singleton`, `singleton_density_2s`
+  (chord context).
+- `emission_cost`, `transition_cost_to_prev` (playability).
+- `cand_string`, `cand_fret`, `is_open`, `is_low_position` (identity).
+- `style`, `instrument`, `tone` (from session config; flow-from-UI
+  audited in Phase 4).
+
+**Training:** GuitarSet train split, leave-one-player-out CV. LightGBM
+`lambdarank` with hard regression guard at -3 pp per held-out player.
+
+**Phase 5 acceptance:**
+- Mean Tab F1 across all held-out players: +3 pp vs Phase 3-or-earlier
+  baseline.
+- No held-out player regresses by > 3 pp.
+- Margin-based fallback to structured-search pick when learned-fusion
+  margin is below a threshold (mitigates OOD behavior in production).
+
+**Decision tree:**
+- Met → ship behind a flag, default off, with the margin fallback.
+  Default-on requires a separate review pass with at least one week of
+  production smoke clean.
+- Per-player regression > 3 pp on any fold → the apr-29 failure mode
+  repeats. Stop Phase 5 and pivot to Phase 7 instead.
+
+### Phase 6 — Video pipeline qualitative integration (1-2 days)
+
+Goal: re-enable the video stack in production for users whose uploads
+have usable video, without claiming any quantitative Tab F1 improvement.
+**No video accuracy gate** (D5).
+
+- Flip `TABVISION_VIDEO_ENABLED=true` in `tabvision-server/modal_app.py`
+  in dev.
+- Verify pipeline runs end-to-end on at least one synthetic
+  fretboard-rendered clip (Phase 6A) and the qualitative output is
+  reasonable.
+- Add a runtime quality gate (the one the v1_adapter currently fakes):
+  reject video evidence when `handDetectionRate < 0.3` or
+  `fretboardDetectionConfidence < 0.5`. Diagnostics in result JSON.
+- Production smoke: end-to-end on the existing `test_a440.mp4` (audio
+  ceiling) and one real-world iPhone clip (qualitative inspection only,
+  not gated).
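The runtime quality gate in the Phase 6 list could look something like the sketch below. It only illustrates the two thresholds named above; the dataclass, the function, and the shape of the returned diagnostics are hypothetical and are not taken from `v1_adapter.py`.

```python
from dataclasses import dataclass

# Thresholds quoted in the Phase 6 bullet above.
MIN_HAND_DETECTION_RATE = 0.3
MIN_FRETBOARD_CONFIDENCE = 0.5


@dataclass(frozen=True)
class VideoDiagnostics:
    """Hypothetical container for per-clip video statistics."""
    hand_detection_rate: float
    fretboard_detection_confidence: float


def video_quality_gate(diag: VideoDiagnostics) -> dict:
    """Decide whether video evidence may be used, and record why not.

    Returns a JSON-serialisable dict so the decision can be attached to
    the result payload as diagnostics (field names are assumptions).
    """
    reasons = []
    if diag.hand_detection_rate < MIN_HAND_DETECTION_RATE:
        reasons.append("handDetectionRate below 0.3")
    if diag.fretboard_detection_confidence < MIN_FRETBOARD_CONFIDENCE:
        reasons.append("fretboardDetectionConfidence below 0.5")
    return {
        "videoEvidenceUsed": not reasons,
        "rejectReasons": reasons,
        "handDetectionRate": diag.hand_detection_rate,
        "fretboardDetectionConfidence": diag.fretboard_detection_confidence,
    }
```

Returning the reasons alongside the raw values keeps the audio-only fallback explainable in the result JSON rather than silent.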
+
+**Phase 6A — Synthetic fretboard video** (optional, 2-3 days):
+- Render a procedurally-generated fretboard animation (Blender or
+  pyrender) against SynthTab audio. Synchronized by-construction.
+- Use for video-pipeline smoke + regression tests (does turning video
+  on/off change anything?), NOT for accuracy claims.
+
+**Phase 6 acceptance:**
+- Video enable in dev does not regress GuitarSet audio-only Pitch /
+  Onset / Tab F1 metrics (delta within ±0.5 pp).
+- At least one synthetic clip produces a non-empty `fingerings` list
+  in the result.
+- Production smoke clean.
+
+**Decision tree:**
+- Audio-only metrics regress when video enabled → video is making
+  things worse on no-video-content clips. Add a fail-fast that
+  disables video output when `videoObservationCount == 0`, retry.
+- No regression but no positive signal either → ship video as opt-in,
+  default off. Revisit when a public video+tab dataset emerges.
+- Positive signal on some clips → ship default-on with the quality
+  gate.
+
+### Phase 7 — Solo-gated melodic prior (local, 2 days)
+
+**Goal:** re-enable the existing melodic prior in the regime where it
+helps (solo passages) without re-introducing the comp regression that
+caused the current ship-disable.
+
+- Gate the melodic prior on rolling-window singleton density: apply
+  only when ≥ 80% of clusters in the last 2 seconds are singletons.
+- Re-tune the 35/65 prior-blend ratio currently hard-coded in
+  `tabvision/tabvision/fusion/melodic_prior.py:64`.
+
+**Phase 7 acceptance:**
+- Solo subset Tab F1 +3 pp vs Phase 3 baseline.
+- Comp subset Tab F1 within ±1 pp.
+- No per-track regression > 3 pp.
+
+### Phase 8 — Tier shortfall recovery (as needed, 1-2 weeks)
+
+Triggered only if a tier still misses its D2 target after Phases 1-7.
+
+- **Distorted electric < 0.80:**
+  - If EGDB acquired: oversample EGDB distorted variants in Phase 2
+    fine-tune; re-run.
+  - If EGDB blocked: synthesize a distorted training subset via
+    SynthTab clean audio + free IR pack convolution (Modern Music
+    Solutions Declassified, Djammincabs).
+- **Clean acoustic single-line < 0.85:**
+  - Re-tune Phase 7 melodic-prior strength on the single-line subset.
+  - If still short: add a position-shift smoothing prior (events
+    within < 200 ms shouldn't span > 5 frets unless audio amplitude
+    suggests a deliberate slide).
+- **Clean acoustic strummed < 0.90:**
+  - Chord-shape template prior: for each detected chord cluster,
+    boost candidate fingerings that match a curated set of 30-50
+    common guitar chord shapes (port from
+    `tabvision-server/app/chord_shapes.py`, 790 LOC).
+- **Clean electric < 0.87:**
+  - Likely co-resolves with one of the above. Investigate per-tier
+    error decomposition before adding tier-specific work.
+
+### Phase 9 — Final eval + documentation
+
+- Run `composite_eval.py` with full per-tier table.
+- Write `docs/EVAL_REPORTS/per_tier_acceptance_.md`.
+- Update `docs/DECISIONS.md` with each Dn entry actually taken.
+- Final SPEC §1.4 amendment proposal: tier table replaces aggregate
+  target. Land as a SPEC PR.
+
+## 5. Sequencing
+
+```
+Phase 0 (parallel setup)          [week 1]
+        ↓
+Phase 1 (pitch ceiling cheap)     [week 1]
+        ↓
+Phase 2 (SynthTab + fine-tune)    [week 2]
+        ↓
+┌────────────────────────────────────────┐
+│  Phase 3 (style priors)        [w3]    │
+│  Phase 4 (UI fields audit)     [w3]    │  parallel
+│  Phase 5 (learned fusion v2)   [w3-4]  │
+│  Phase 6 (video qualitative)   [w3]    │
+│  Phase 7 (solo melodic prior)  [w3]    │
+└────────────────────────────────────────┘
+        ↓
+Phase 8 (tier recovery)           [w5-6 as needed]
+        ↓
+Phase 9 (final eval + docs)       [w6]
+```
+
+Total wall-clock: **4-6 weeks engineering**, plus 1-2 weeks waiting
+time on the EGDB email if it gates Phase 8 distorted-electric work.
+
+## 6. Risk register
+
+| # | Risk | Likelihood | Mitigation |
+|---|---|---|---|
+| R1 | SynthTab pretrain doesn't transfer to real audio (domain gap) | medium | Literature shows pretrain+fine-tune works (SynthTab paper, arXiv:2402.15258). Smoke on Guitar-TECHS held-out before committing to full pretrain spend. |
+| R2 | EGDB license never resolves | low-medium | Author replies are usually fast; if blocked, synthetic IR-based distorted electric via Phase 8 fallback. |
+| R3 | SynthTab labels are noisy (DadaGP human-transcribed varies in quality) | medium | Use SynthTab as pretrain only, never as eval gate. Phase 0 spot-check a 50-clip random sample. |
+| R4 | Per-tier composite eval set has too few clips per tier for statistical significance | medium | Bootstrap 95% CIs in all per-tier reports. State the CI explicitly when reporting against the D2 target. |
+| R5 | Video pipeline degrades audio-only metrics when enabled | low | Quality gate in Phase 6 + audio-only fallback. Phase 6 acceptance explicitly checks this. |
+| R6 | Phase 5 learned-fusion reproduces the apr-29 single-fold catastrophe | medium | Hard regression guard per-fold + margin fallback to structured search. Phase 5 decision tree pivots to Phase 7 if it triggers. |
+| R7 | Free-tier compute monthly allowance insufficient for Phase 2 + 8 retries | low | Lightning 22 hr/mo + Kaggle 30 hr/wk + Colab is ~150 GPU-hr/mo combined; Phase 2 needs ~14 hr. Plenty of buffer. |
+| R8 | LICENSES.md needs updates for Guitar-TECHS, GOAT, SynthTab, EGDB | certain | Update in Phase 0. Each is CC-BY-4.0 (or pending in EGDB's case); attribution must appear in README and any blog. |
+
+## 7. Out of scope
+
+- Personal training clips (D10).
+- Single-aggregate Tab F1 ≥ 0.88 (D1).
+- Stretch v1.1 (bends/slides/hammer-ons) per D8.
+- Quantitative video-gate (D5). Video ships qualitative-only.
+- Top-K UX surface — UI work is separate. D2 targets apply to top-1.
+- New SPEC §8 contracts — none of these phases changes signatures.
+- Real-money compute except for production smoke retests on Modal.
+
+## 8. Phase 0 user actions (the things only you can do)
+
+1. Sign up / verify free-tier compute accounts:
+   - Lightning Studios (https://lightning.ai)
+   - Kaggle (https://kaggle.com)
+   - Colab (https://colab.research.google.com)
+   - Weights & Biases (https://wandb.ai, free academic tier)
+2. Email the EGDB author for portfolio-use written permission.
+   Template:
+
+   > Subject: TabVision portfolio project — request to use EGDB
+   >
+   > Dr. Chen,
+   >
+   > I'm a developer building TabVision, a portfolio guitar
+   > transcription project (public GitHub repo, blog post, recorded
+   > demo). I would like to use EGDB as the distorted-electric
+   > evaluation tier of my multi-source test set, and cite your
+   > ICASSP 2022 paper. The repo has no LICENSE file, so I'm asking
+   > for written permission to use EGDB in this portfolio context,
+   > including reporting evaluation metrics computed on it.
+   >
+   > Thank you,
+   > Patrick Gilhooley
+
+3. Confirm or push back on the D2 per-tier targets (table in §0).
+4. Approve the plan; I cut a branch from `refactor/v1` and start
+   Phase 0E (the baseline measurement, since 0A and 0B are blocked on
+   the above).
+
+## 9. Things still genuinely unresolved
+
+These can be answered in flight; don't gate the plan on them.
+
+- The exact size of the SynthTab pilot (500 clips is a guess; the
+  right number is "smallest subset that produces a fine-tune gain"
+  and emerges from Phase 2's first run).
+- Whether Phase 4 finds any actual capo/tuning regressions worth
+  fixing, or if it's a 30-minute box-tick.
+- Phase 6A: whether procedural fretboard rendering is 2 days or 2
+  weeks of work. Defer until we know whether Phase 6 alone is enough.
+
+## 10. Open invitation to redirect
+
+This plan favors free compute over fast iteration; SynthTab over DIY
+synthesis; per-tier targets over single-aggregate; audio-only gates
+over speculative video-gate construction. If any of those defaults are
+wrong for what you actually want, say so before Phase 0 starts —
+backtracking from Phase 3 is expensive.

From b7f139dbf461825feb2b1d95e5ca3888905e6685 Mon Sep 17 00:00:00 2001
From: Patrick Gilhooley
Date: Wed, 13 May 2026 09:30:31 -0400
Subject: [PATCH 2/2] =?UTF-8?q?docs(plan):=20revise=20per=20#10=20review?=
 =?UTF-8?q?=20=E2=80=94=20license=20fixes=20+=20SPEC=20amend?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address PR #10 review (2026-05-13):

- Drop SynthTab (CC-BY-NC-4.0, taints derived weights per SPEC §1.5) and
  GOAT (request-only research-only) from the default-pipeline path.
- Frame the design doc as a SPEC §1.4 amendment proposal; commit the
  SPEC.md update (§1.4.1) in the same change set, keeping the original
  SPEC numbers as v1.1 / portfolio stretch reference.
- Split the doc: strategy / decision-record kept under the original
  filename; new Phase 0 implementation plan
  (2026-05-13-tab-f1-phase-0-implementation.md) with exact files, tests,
  commands, acceptance outputs.
- Add explicit License Gate section (§0) verifying every resource before
  any compute spend.
- Define composite eval policy: ≥ 20 clips / 500 notes per tier,
  player-split, 95% bootstrap CIs with lower-bound acceptance test,
  parser-per-source, no synthetic-source clips in eval splits.
- Update CLAUDE.md 'Active branch' to reflect main (Modal production
  deploy landed there; refactor/v1 is 23 commits behind).

Plan-only commit; no production code changes.
---
 CLAUDE.md | 16 +-
 SPEC.md | 35 +
 .../plans/2026-05-12-tab-f1-to-spec-design.md | 773 ++++++------------
 ...026-05-13-tab-f1-phase-0-implementation.md | 305 +++++++
 4 files changed, 618 insertions(+), 511 deletions(-)
 create mode 100644 docs/plans/2026-05-13-tab-f1-phase-0-implementation.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 71537df..65dc78c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -18,9 +18,19 @@ parallel under `refactor/v1`.
 - `LICENSES.md` — dependency license map; ⚠️ items gate respective phase entry.
 - `docs/DECISIONS.md` — non-obvious branches taken (per SPEC §0.5).
-**Active branch:** `refactor/v1` (cut off `feature/audio-finetune-phase1`,
-not `main` — see `docs/DECISIONS.md`). `main` is 33 commits behind v0.
-Phase 0 in progress; sign-off pending on AUDIT + LICENSES.
+**Active branch (2026-05-13):** `main`. The Modal production deploy
+(`936a5cc`) and v1 CI hardening landed on `main`; `refactor/v1` is now
+**23 commits behind `main`** and should be treated as historical. Cut new
+work branches off `main`. Older design docs (and earlier paragraphs in
+this file) may reference paths that exist on `main` but not on
+`refactor/v1` — verify with `git cat-file -e origin/main:` before
+relying on them. The full pipeline (`tabvision/tabvision/pipeline.py`),
+the Modal production adapter (`tabvision-server/modal_app.py`,
+`tabvision-server/app/v1_adapter.py`), and the highres audio backend all
+live on `main`. Phase 5 fusion has shipped. See
+`docs/2026-05-12-session-handoff.md` for the production state and
+`docs/plans/2026-05-12-tab-f1-to-spec-design.md` (+ companion Phase 0
+implementation plan) for current accuracy work.
 ## Layout

diff --git a/SPEC.md b/SPEC.md
index 3fe8f5f..e666752 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -121,6 +121,41 @@ The targets above are aggregate over the full eval set. Per-difficulty-tier expe
 If the aggregate hits 0.88 but distorted electric scores below 0.75, treat that as a partial pass and prioritize Phase 7 distortion-augmented fine-tuning before final acceptance.
+### 1.4.1 v1 acceptance amendment — per-tier targets (2026-05-13)
+
+Per the 2026-05-13 design plan
+(`docs/plans/2026-05-12-tab-f1-to-spec-design.md`), v1 acceptance moves
+from the aggregate 0.88 Tab F1 in §1.4 to **per-tier targets on a
+public-corpus composite eval set**:
+
+| Tier | §1.4 stretch reference | v1 acceptance |
+|---|---:|---:|
+| Clean acoustic single-line | 0.94 | **0.85** |
+| Clean acoustic strummed | 0.86 | **0.90** |
+| Clean electric | 0.90 | **0.87** |
+| Distorted electric | 0.82 | **0.80** |
+
+Rationale: 2026-05-08 GuitarSet validation showed aggregate Tab F1 = 0.61
+with comp tracks at 0.67 and solo tracks at 0.51 despite both being near
+0.92 Pitch F1. The aggregate hid the structural failure mode (single-line
+string/fret assignment). Per-tier targets force the conversation onto the
+right axis and let work be sequenced (strummed first, distorted electric
+last).
+
+**Test-set composition amendment:** the "user's own playing" test set in
+§1.4 paragraph 1 is replaced by a public-corpus composite (GuitarSet
+held-out + Guitar-TECHS + EGDB pending license + qualifying synthetic
+training/dev material). See the design plan §5 for composite policy
+(per-tier minimums, splits, leakage rules, bootstrap CIs).
+
+**Stretch / portfolio reference:** the original §1.4 per-tier table
+(0.94 / 0.86 / 0.90 / 0.82) remains the v1.1 / portfolio stretch bar.
+Hitting it is welcome; v1 acceptance requires only the amended table.
+
+**Aggregate Tab F1** is retired as an acceptance metric. **Onset F1
+(≥ 0.92), Pitch F1 (≥ 0.90), chord-instance accuracy (≥ 0.85), and
+latency (≤ 5 min)** from §1.4 are unchanged.
+
 ### 1.5 Hard constraints

 - All training/inference dependencies must be free or have a free tier sufficient for this project (see §6).
diff --git a/docs/plans/2026-05-12-tab-f1-to-spec-design.md b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
index 41c14a9..ff1569b 100644
--- a/docs/plans/2026-05-12-tab-f1-to-spec-design.md
+++ b/docs/plans/2026-05-12-tab-f1-to-spec-design.md
@@ -1,536 +1,293 @@
-# Tab F1 → Per-Tier Targets — Design
+# Tab F1 v1 acceptance — Strategy & Decision Record

-**Date:** 2026-05-12
+**Date:** 2026-05-12 (revised 2026-05-13 per PR #10 review)
 **Author:** Patrick (brainstormed with Claude)
-**Status:** Proposed — pending sign-off
-**Spec source:** `SPEC.md` §1.4 (per-tier table), §5 Phase 5, §8 contracts, §1.5 hard constraints, §6.3 free compute accounts.
-**Branch:** to be cut off `refactor/v1` once approved.
-**Depends on:** `docs/plans/2026-05-06-phase5-fusion-design.md`, `docs/plans/2026-05-06-video-pipeline-integration-design.md`, `docs/EVAL_REPORTS/guitarset_accuracy_boost_2026-05-08.md`.
-**Replaces:** earlier 2026-05-12 single-aggregate-target draft (never committed).
-
-## 0. Decisions taken on 2026-05-12
-
-These were locked in during the planning conversation; record them in
-`docs/DECISIONS.md` per SPEC §0.5 once the plan is approved.
+**Status:** v3 — strategy / decision-record only; **not** an implementation plan
+**Scope note:** This is a **SPEC §1.4 amendment proposal** plus
+  strategy. Implementation detail lives in companion docs.
+**Companions:**
+- `SPEC.md` §1.4.1 (the amendment table; committed in the same change set)
+- `docs/plans/2026-05-13-tab-f1-phase-0-implementation.md` (Phase 0 impl)
+- Later phase impl plans (write after Phase 0 evidence)
+**Replaces:** v1 + v2 (2026-05-12 single-aggregate-target drafts; both
+  had load-bearing license errors and stale path references
+  and have been superseded by this rewrite).
+
+## 0. License gate (must clear before any compute spend)
+
+Per SPEC §1.5 the **shipping default pipeline** must be portfolio-clean.
+NC-licensed material is acceptable in research/experiment configurations
+that are NOT shipped.
Each resource is verified 2026-05-13:
+
+| Resource | License | Portfolio-default usable? | Source / verification |
+|---|---|---|---|
+| GuitarSet | CC-BY-4.0 | **yes** | https://zenodo.org/records/3371780 |
+| Guitar-TECHS | CC-BY-4.0 | **yes** | arXiv:2501.03720 §4 distribution |
+| EGDB | none on repo — **author email pending** | **gated** | https://ss12f32v.github.io/Guitar-Transcription/ (LICENSES.md ⚠️) |
+| GOAT | request-only, research-only | **no — DROPPED** | arXiv:2509.22655 §4.2 *"made available by request to better control its use for research purposes only"* |
+| SynthTab dataset | **CC-BY-NC-4.0** | **no — DROPPED** | github.com/yongyizang/SynthTab README *"SynthTab is released with CC BY-NC 4.0 license"* |
+| SynthTab rendering code | CC-BY-4.0 | n/a (we're not redistributing the code) | repo `LICENSE` file |
+| DadaGP | access-by-email research-only; underlying GP tabs derive from copyrighted songs | **research/dev only** — NOT in default path | github.com/dada-bots/dadaGP README; underlying tab copyright unsettled |
+| Basic Pitch | Apache-2.0 | yes (Phase 1 pitch ensemble) | github.com/spotify/basic-pitch |
+| highres (xavriley) | MIT | yes — current production audio backend | github.com/xavriley/hf_midi_transcription |
+| MediaPipe Hands | Apache-2.0 | yes — video pipeline | per LICENSES.md |
+| YOLO-OBB (ultralytics) | AGPL-3.0 (accepted per DECISIONS.md) | yes (portfolio is AGPL-OK) | per LICENSES.md |
+| Free amp/cab IRs | varies (most free-public) | yes for default if redistribution terms allow; verify per-pack | Modern Music Solutions Declassified, Djammincabs |
+
+**Drops vs v2 plan:**
+- **SynthTab dropped** because the dataset is CC-BY-NC-4.0; pretraining
+  the shipping audio backend on it taints derived weights (SynthTab paper
+  treats trained models as derivative work). Distillation as a laundering
+  step is rejected — both legally murky and explicitly out of bounds
+  per the 2026-05-13 review.
+- **GOAT dropped** because it's request-only research-only. Cannot
+  evaluate a public portfolio against it.
+
+**Hard rule:** any phase that depends on a "gated" or "no" row must
+produce evidence that the gate cleared (e.g., a written reply from the
+EGDB author) BEFORE that phase ships. No conditional commits, no
+"we'll-figure-it-out-later" merges.
+
+## 1. Decisions
+
+These supersede the v2 D1–D10 set. Append to `docs/DECISIONS.md` per
+SPEC §0.5 once the plan is approved.

 | # | Decision | Rationale |
 |---|---|---|
-| D1 | Tab F1 evaluated **per tier**, not as a single aggregate. SPEC §1.4 aggregate 0.88 is retired. | Aggregate hides the real failure mode (string/fret assignment on solo lines). Per-tier targets force the conversation onto the right axis. |
-| D2 | Per-tier numeric targets (table below). | Strummed raised from SPEC 0.86 → 0.90; distorted-electric floor lowered 0.82 → 0.80. Middle tiers relaxed to reflect the gap between current 0.61 and any realistic ceiling. |
-| D3 | Eval set is a **multi-source composite**: GuitarSet + Guitar-TECHS + GOAT + EGDB (pending license) + synthetic. Personal videos banned from any role. | GuitarSet alone gives one player, one genre cluster, no electric/distorted. Per-tier evaluation requires per-tier sources. |
-| D4 | **SynthTab** pretrain → real-data fine-tune is the audio-side plan. No DIY DadaGP synthesis unless SynthTab proves insufficient. | SynthTab (CC-BY-4.0, ~6,700 h with string/fret labels) pre-empts the engineering cost of building a renderer. Literature (SynthTab paper, High-Res Domain Adaptation arXiv:2402.15258) shows pretrain+fine-tune lifts cross-dataset generalization. |
-| D5 | **No quantitative video-gate.** Video pipeline ships as a qualitative feature. Production runs audio-only; video is opt-in. | No public dataset has synchronized guitar video + per-note string/fret labels. Confirmed via 2026-05-12 research pass (see §3.1).
|
-| D6 | **Free-tier compute first.** Order: local CPU > Lightning Studios free (22 GPU-hr/mo) > Kaggle (30 hr/wk T4) > Colab > Modal. | Per CLAUDE.md operating rule 6 and SPEC §6.3 §1.5 hard constraint. The earlier $30-80 fine-tune estimate was Modal pricing; free tier fits a highres fine-tune comfortably. |
+| D1 | Tab F1 evaluated **per tier**, not as a single aggregate. SPEC §1.4 aggregate 0.88 is retired. | Aggregate hides the real failure mode (string/fret assignment on solo lines). |
+| D2 | Per-tier v1 acceptance targets: **0.85 / 0.90 / 0.87 / 0.80** for clean acoustic single-line / strummed / clean electric / distorted electric. | User-stated floor (0.80) and strummed (≥ 0.90); middle tiers proposed and accepted. Original SPEC numbers (0.94 / 0.86 / 0.90 / 0.82) become the v1.1 / portfolio stretch reference. |
+| D3 | Eval set is a **multi-source public-corpus composite**: GuitarSet + Guitar-TECHS + EGDB (license-pending) + qualifying synthetic. Personal videos banned. GOAT dropped. SynthTab dropped from default path. | Per-tier evaluation requires per-tier sources; portfolio constraint excludes NC and request-only data from the shipping path. |
+| D4 | **No SynthTab in the default pipeline.** Audio-side lift comes from priors + cheap pitch post-processing + GuitarSet fine-tune. DadaGP-derived synthetic remains acceptable for **internal training/dev only** if it's never shipped. | SynthTab CC-BY-NC-4.0 taints derived weights; SPEC §1.5 bars NC from default. |
+| D5 | **No quantitative video-gate.** Video pipeline ships as a qualitative feature; per-tier Tab F1 measured audio-only. | No public dataset has video + per-note string/fret labels (verified 2026-05-12). |
+| D6 | **Free-tier compute first.** Order per CLAUDE.md operating rule 6 and SPEC §6.3: **Local CPU > Colab > Kaggle > Lightning Studios > Modal**. Modal is the last resort. | Project rule, plus Lightning's 22 GPU-hr/month free tier covers any fine-tune we'd plausibly run.
|
 | D7 | **1-2 month cadence.** No fixed deadline. | User-stated. |
-| D8 | Stretch goals (bends / slides / hammer-ons / pull-offs) **out of scope** for v1; SPEC §1.4 already marks them v1.1. | Confirmed in conversation. |
-| D9 | Top-K is acceptable as an editor UX feature but the **0.80 floor and per-tier targets apply to the top-1 prediction**. | User-stated. |
-| D10 | Personal training clips (the 20-video set) **off the table entirely** — not as accuracy gate, not as dev set, not as label source. | User-stated. |
-
-### Per-tier Tab F1 targets (D2)
-
-| Tier | SPEC §1.4 | This plan |
-|---|---:|---:|
-| Clean acoustic single-line | 0.94 | **0.85** |
-| Clean acoustic strummed | 0.86 | **0.90** |
-| Clean electric | 0.90 | **0.87** |
-| Distorted electric | 0.82 | **0.80** |
-
-All on the multi-source composite test set (D3). Top-1 prediction only.
-Onset F1 (≥ 0.92) and Pitch F1 (≥ 0.90) from SPEC §1.4 remain unchanged
-— audio already clears them on GuitarSet.
+| D8 | Stretch goals (bends / slides / hammer-ons / pull-offs) **out of scope** for v1. | SPEC §1.4 already marks them v1.1. |
+| D9 | Top-K acceptable as an editor UX feature; the D2 numbers are on **top-1 only**. | User-stated. |
+| D10 | Personal training clips off the table entirely — not as accuracy gate, not as dev set, not as label source. | User-stated. |
+| D11 | This document is a **SPEC §1.4 amendment**, not a SPEC-achievement plan. Land the SPEC.md update (§1.4.1) in the same change set. | Honest framing of relaxed targets; reviewer's approval bar. |

-## 1. Goal
+## 2. Goal & framing

-Hit the D2 per-tier Tab F1 targets on the D3 composite eval set within
-1-2 months using free-tier compute, while keeping the production system
-the SPEC §8 contract-conformant v1 pipeline.
+**v1 acceptance:** hit the D2 per-tier Tab F1 targets on the D3
+public-corpus composite eval set within 1-2 months on free-tier
+compute, with the existing v1 pipeline (no §8 contract changes).

-## 2.
Current evidence
+**Stretch / portfolio reference:** the original SPEC §1.4 numbers
+(0.94 / 0.86 / 0.90 / 0.82). If we hit them, that's the portfolio
+narrative; v1 acceptance does not require them.

-GuitarSet validation, 60 tracks, 8715 gold notes, 2026-05-08 production
-candidate (highres + `guitarset-v1` prior, no video, no melodic prior):
+**Out of v1 acceptance:** quantitative video-fusion Tab F1
+improvement claim (no public dataset for it; tracked as qualitative
+only).

-| Metric | Current | SPEC | Status |
-|---|---:|---:|---|
-| Onset F1 (50 ms) | 0.9218 | ≥ 0.92 | pass |
-| Pitch F1 (50 ms) | 0.9022 | ≥ 0.90 | pass |
-| Tab F1 aggregate | 0.6104 | ≥ 0.88 (deprecated) | retired metric |
+## 3. Current evidence

-Per-track distribution (2026-05-12 diagnostic):
+GuitarSet validation, 60 tracks, 8715 gold notes, 2026-05-08
+production candidate (highres + `guitarset-v1` prior, audio-only):

-- Tab F1 mean **0.589**, median 0.620, min 0.166, max 0.933
-- Comp tracks (n=30) mean **0.670**; solo tracks (n=30) mean **0.508**
-- Worst 10 tracks: 7 are solos. Best 5: 4 are comps.
-- Tab/Pitch ratio: comp 0.744, **solo 0.546** — solos lose 45% of
-  pitch-correct notes to wrong string/fret assignment.
+| Metric | Current | Status |
+|---|---:|---|
+| Onset F1 (50 ms) | 0.9218 | passes SPEC §1.4 ≥ 0.92 |
+| Pitch F1 (50 ms) | 0.9022 | passes SPEC §1.4 ≥ 0.90 |
+| Tab F1 aggregate (retired) | 0.6104 | — |
+| Tab F1, comp subset | 0.670 mean | — |
+| Tab F1, solo subset | 0.508 mean | — |

-**Bottleneck is string/fret assignment on single-line passages where
-chord-cluster context is absent.** Audio is essentially at spec; only the
-Tab F1 numbers are red, and only on the solo regime.
+The 27 pp gap to the **retired** 0.88 aggregate target is almost
+entirely string/fret assignment on single-line passages. Audio is at
+spec; only fusion-side assignment is short.
This frames the per-tier
+work: **strummed (chord context) is closest to its target; single-line
+needs the most lift.**

-The single-tier mapping of GuitarSet is "clean acoustic strummed" for
-comp tracks and "clean acoustic single-line" for solo tracks. The
-electric and distorted-electric tiers (D2) have no current measurement
-and must be acquired (D3).
+**Coverage gap:** GuitarSet covers only the clean acoustic tiers.
+Clean-electric and distorted-electric have **no current measurement**
+on a public corpus and must be acquired in Phase 0.

-## 3. Resource inventory
+## 4. Resource inventory

-### 3.1 Datasets
+### 4.1 Datasets (default-pipeline path only)

-Verified by the 2026-05-12 research pass. Italics = on-disk now;
-**bold** = to acquire.
+| Source | License | Modality | Labels | Tier coverage |
+|---|---|---|---|---|
+| GuitarSet (on-disk) | CC-BY-4.0 | audio (hex + DI) | JAMS (string + fret + pitch) | clean acoustic single-line, strummed |
+| Guitar-TECHS (acquire) | CC-BY-4.0 | audio (multi-mic + DI) | 6-track per-string MIDI | clean acoustic single-line, clean electric |
+| EGDB (acquire, license pending) | none on repo — author email required | audio (DI + 5 amp sims) | GuitarPro tabs + aligned MIDI | clean electric, distorted electric |
+| Free IR-augmented GuitarSet | CC-BY-4.0 (with IR pack licenses verified) | derived audio | inherited string + fret | distorted electric (fallback if EGDB blocks) |

-| Source | License | Modality | Labels | Size | Tier coverage |
-|---|---|---|---|---|---|
-| *GuitarSet* | CC-BY-4.0 | audio (hex + DI) | JAMS (string + fret + pitch) | 3 h, 6 players | clean acoustic single-line, strummed |
-| **Guitar-TECHS** | CC-BY-4.0 | audio (multi-mic + DI) | 6-track MIDI per string | 5h12m | clean acoustic single-line, clean electric |
-| **GOAT** | CC-BY-4.0 | DI audio | tablature | 5.9 h | clean electric |
-| **EGDB** | None on repo — **email author for portfolio-use permission** | audio (DI + 5 amp sims, ~6 renders) |
GuitarPro tabs + aligned MIDI (string + fret) | ~12 h synthesized | clean electric, distorted electric |
-| **SynthTab** | CC-BY-4.0 | synthesized audio | string + fret + onset | ~6,700 h | all four tiers (pretrain only) |
-| GAPS | CC-BY-NC-SA | YouTube video links + MIDI pitch | pitch-only, **not tab** | 14 h | reject — non-commercial taint |
-| DadaGP | research-access (email) | symbolic GP files | tab natively | 26,181 files | fallback synthesis source if SynthTab insufficient |
-| ~~The 20 personal clips~~ | n/a | n/a | n/a | n/a | **banned** (D10) |
+### 4.2 Datasets (research / dev only — NEVER in the default pipeline)

-**Confirmed gap:** no public dataset combines guitar video with per-note
-string+fret labels. This is the load-bearing finding behind D5.
+| Source | License | Use |
+|---|---|---|
+| DadaGP | access-by-email, research-only | possible internal-training augmentation; not shipped, not redistributed |
+| SynthTab | CC-BY-NC-4.0 | reference only; not a substrate for any shipped weight |

-### 3.2 Compute
+### 4.3 Compute accounts (free-tier first, per D6 order)

-| Account | Free allowance | Status | Use |
-|---|---|---|---|
-| Lightning Studios | 22 GPU-hr/month | Phase 0 setup | SynthTab pretrain, highres fine-tune |
-| Kaggle | ~30 GPU-hr/week T4 | Phase 0 setup | overflow for long sweeps |
-| Colab | ~12 hr/day with limits | Phase 0 setup | quick experiments |
-| W&B | unlimited (academic) | Phase 0 setup | experiment tracking |
-| HuggingFace Hub | unlimited public | already used | weights / checkpoints |
-| Modal | pay-per-use | already used | production smoke retests only |
-| Local CPU | 6 cores WSL2 | available | eval, priors, light tuning |
-
-Per CLAUDE.md operating rule 6: Local > Colab > Kaggle > Lightning >
-Modal. Modal is the resort, not the default.
-
-### 3.3 Code already in tree
-
-- `tabvision.audio.highres` — production pitch backend, 0.92 / 0.90 on GuitarSet.
-- `tabvision.fusion.position_prior` — `guitarset-v1` prior, +22pp Tab F1.
-- `tabvision.fusion.{viterbi,chord,playability,neck_prior,melodic_prior}` — Phase 5 shipped, cluster Viterbi + chord state enumeration + playability emission/transition costs.
-- `tabvision.video.{guitar,fretboard,hand}` — Phase 4 shipped (1603 LOC).
-- `tabvision.pipeline.run_pipeline` — composes all of the above; production runs through it via `tabvision-server/app/v1_adapter.py`.
-- `tabvision-server/tools/eval_basic_pitch_baseline.py` + `tabvision/scripts/eval/guitarset_audio_eval.py` — current evaluation harness; needs extension for multi-source composite.
-- `tabvision-server/tools/outputs/errors-2026-04-28_185743.md` — apr-28 error-decomposition methodology (proven on personal clips); port the same 7-bucket harness to the composite eval set.
-
-### 3.4 What has been tried (lessons)
-
-| Attempt | Date | Outcome | Lesson |
-|---|---|---|---|
-| Learned-fusion LightGBM ranker | 2026-04-29 | +0.3pp LOOCV vs +5pp gate; **catastrophic −27.8pp regression on training-17** | Small data + over-fit on one held-out group. **Critically: video features were `null` on every row** (`audio_only=True`) — so this wasn't actually a test of learned-fusion-with-video. Re-attempt with proper feature instrumentation is justified. |
-| Basic Pitch fine-tune on GuitarSet | 2026-04-29/30 | Did happen; superseded by highres backend swap before final integration | Fine-tune infrastructure is reusable for highres; SynthTab pretrain is the missing first step. |
-| Melodic prior | current | Regresses aggregate Tab F1 from 0.6104 to 0.5989 | Helps solo, hurts comp. Needs solo-density gating, not a flat enable. |
-| Position prior `guitarset-v1` | 2026-05-08 | +22pp Tab F1 vs no prior | Per-pitch tabular priors are the largest-leverage cheap intervention. Style/structure-conditional versions are the natural extension.
|
-| Phase 5 cluster Viterbi + chord enumeration | 2026-05-06 | Shipped, drives current production | The audio-only structured search is already well-tuned. Further gain needs either better priors or different evidence (which video can't provide on the eval set). |
-
-## 4. Plan
-
-10 phases. Phases 0–2 sequential; 3–8 parallelizable. Decision tree
-inside each phase determines whether to continue, branch, or escalate.
-
-### Phase 0 — Foundation (parallel, no compute, 1 week wall-clock)
-
-**Goal:** assemble the evidence base + accounts the rest of the plan
-depends on. No production code changes; setup only.
-
-Concurrent tracks:
-
-- **0A. Acquisition.**
-  - [user] Email EGDB author (`f08946011@ntu.edu.tw`) for written
-    portfolio-use permission. Template draft in §10.
-  - [code] Download Guitar-TECHS and GOAT (both CC-BY-4.0, no email).
-  - [code] Sample SynthTab to a 500-clip pilot subset (~50 h). Full
-    download deferred until Phase 2.
-- **0B. Compute accounts.**
-  - [user] Lightning Studios, Kaggle, Colab, W&B sign-ups per SPEC §6.3.
-  - [code] Verify each by running a hello-world (W&B init + a GPU
-    `nvidia-smi` job on each platform).
-- **0C. Eval harness extension.**
-  - [code] Build `tabvision/scripts/eval/composite_eval.py`. Reads a
-    manifest TOML (per-clip tier label + source + audio path + tab
-    annotation path) and runs the same `guitarset_audio_eval.py` logic
-    across all sources. Outputs per-tier Tab F1, per-source CSVs, and a
-    consolidated Markdown report.
-  - Manifest schema follows the placeholder in
-    `tabvision/data/eval/manifest.toml`. Tier label is one of
-    `clean_acoustic_single_line`, `clean_acoustic_strummed`,
-    `clean_electric`, `distorted_electric`.
-- **0D. Error decomposition.**
-  - [code] Port `tools/error_analysis.py` (apr-28 7-bucket harness)
-    from personal-clip input to the composite eval set. Output:
-    `docs/EVAL_REPORTS/error_decomposition_.md` with per-tier
-    bucket counts.
-- **0E.
Baseline measurement.**
-  - [code] Run `composite_eval.py` against the current production
-    pipeline. Get the per-tier numbers. These are the Phase 1+
-    starting points.
-
-**Phase 0 acceptance gate:**
-- Per-tier Tab F1 baseline numbers exist for at least 3 of the 4 tiers
-  (distorted electric is EGDB-dependent; deferred OK).
-- Per-tier 7-bucket error decomposition exists.
-- All free-tier compute accounts verified.
-- EGDB email sent.
-- No production code changes.
-
-**Decision tree:**
-- If baseline already hits some tier (e.g., strummed at 0.92) → drop
-  that tier from later phases' work.
-- If pitch-side metrics regress vs the 2026-05-08 GuitarSet numbers on
-  the composite set → STOP and investigate before any further work.
-  The composite eval should not change audio-side numbers on GuitarSet.
-
-### Phase 1 — Pitch ceiling lift, cheap moves (local CPU, 2-3 days)
-
-**Goal:** Pitch F1 from 0.915 → ≥ 0.93 on GuitarSet validation, without
-training. Gives Tab F1 mathematical headroom regardless of fusion-side
-work.
-
-Moves, in order:
-
-1. **Voicing/silence gate** on highres pitch posteriors. Tune the
-   joint onset+pitch confidence threshold. Trade some recall for
-   precision; expect net F1 gain.
-2. **Onset peak-picking adjustment.** The 50 ms tolerance is generous;
-   misaligned within-tolerance peaks still produce pitch mis-reads.
-   Improve peak localization → tighter onset match → higher pitch TP
-   count.
-3. **Basic Pitch pitch-only ensemble.** Run Basic Pitch alongside
-   highres. Use Basic Pitch's pitch output (not onset) as a tiebreaker
-   on pitch-disagreement events; downweight (or drop) events where the
-   two backends disagree on pitch. SPEC §6.1 path; LICENSES.md
-   confirms Basic Pitch is Apache-2.0 default-pipeline-safe.
-
-**Phase 1 acceptance:**
-- Pitch F1 ≥ 0.93 on GuitarSet validation.
-- Onset F1 ≥ 0.92 (no regression).
-- Aggregate Tab F1 ≥ 0.62 (no regression beyond mathematical
-  pitch-improvement bound).
-
-**Decision tree:**
-- 0.93 met → continue.
-- 0.92–0.93 → continue; Phase 2 fine-tune still useful as ceiling lift.
-- < 0.92 → diagnose. Could be a threshold-sweep artifact rather than a
-  real regression. Inspect on 3-5 representative tracks before
-  escalating.
-
-### Phase 2 — SynthTab pretrain + highres fine-tune (Lightning, 1 week)
-
-**Goal:** Pitch F1 ≥ 0.94 on GuitarSet validation. Lift the audio
-ceiling beyond what threshold-tuning alone can do.
-
-**Approach.** Per the SynthTab paper (ICASSP 2024) and arXiv:2402.15258,
-the proven recipe is: pretrain on synthetic, fine-tune on real.
-
-- **Pretrain corpus:** SynthTab 500-clip pilot (Phase 0). Full set
-  (~6,700 h) is overkill at this stage and won't fit in the free tier
-  monthly budget.
-- **Pretrain head:** the highres model's pitch+onset head. Backbone
-  frozen for the pretrain phase to avoid catastrophic forgetting on
-  the spectral feature extractor.
-- **Fine-tune:** GuitarSet train split (4 players, 240 tracks ≈ 2 h),
-  unfrozen, 5-10 epochs with early stopping on Pitch F1.
-- **Compute:** Lightning Studios free tier (22 GPU-hr/month). Estimate:
-  pretrain ~6 GPU-hr, fine-tune ~3 GPU-hr. Buffer for re-runs ~5 GPU-hr.
-  Stays inside the monthly allowance.
-
-**Phase 2 acceptance:**
-- GuitarSet validation Pitch F1 ≥ 0.94.
-- No Onset F1 regression > 1 pp.
-- Cross-dataset sanity: on Guitar-TECHS (held out from training),
-  Pitch F1 ≥ 0.90 (no catastrophic transfer loss).
-
-**Decision tree:**
-- Met all three → continue. New `audio_backend = "highres-synthtab"`
-  becomes the candidate for production replacement.
-- GuitarSet met, Guitar-TECHS regresses > 5 pp → over-fit on the
-  pretrain distribution. Reduce pretrain epochs, increase fine-tune
-  weight, retry once.
-- GuitarSet ≤ 0.93 → SynthTab pretrain didn't transfer; abandon
-  Phase 2 and revisit with the actual diagnostic (Pitch P/R curves)
-  before any further training spend.
-
-### Phase 3 — Style/structure-conditional priors (local CPU, 3 days)
-
-**Goal:** lift Tab F1 on solos via finer-grained per-pitch position
-priors. Expected +1 to +5 pp on solo subsets.
-
-- **Buckets:** {bn, jazz, funk, rock, ss} × {solo, comp} = 10 priors.
-  GuitarSet's `style` field gives the genre axis directly; structure
-  axis derived from cluster-singleton density (already computable in
-  fusion).
-- **Train:** GuitarSet train split (players 00, 01, 02, 03, 04).
-  Per-bucket Laplace-smoothed counts. Empty cells fall back to
-  `guitarset-v1`.
-- **Validate:** leave-one-player-out CV (not LOOCV per-clip — too
-  small). Primary metric: per-bucket Tab F1 delta vs `guitarset-v1`
-  baseline on the held-out player.
-- **Risk:** the apr-29 learned-fusion attempt failed with one
-  catastrophic regression. Same class of risk here — small data,
-  bucketing on 4 training players. **Hard regression guard:** abort
-  the bucket if any cross-validation fold regresses by > 3 pp.
-
-**Phase 3 acceptance:**
-- Mean Tab F1 over solo buckets: +2 pp vs `guitarset-v1` baseline.
-- No bucket regresses by > 1 pp on comp.
-- No cross-validation fold regresses by > 3 pp on any bucket.
-
-**Decision tree:**
-- Met → ship the prior set, expose `position_prior = "guitarset-styled-v1"`.
-- Solo gain < 2 pp → drop the structure axis, ship style-only.
-- Any bucket fails the regression guard → drop that bucket only;
-  fall back to `guitarset-v1` for it. Don't kill the whole experiment
-  on one bad bucket.
-
-### Phase 4 — Style+structure-aware capo/tuning audit (local, 1 day)
-
-**Goal:** verify the capo / instrument / tone / style fields from the
-upload UI are actually flowing into prior selection and playability
-weights as designed.
-
-- **Trace:** unit-test that with `capo_fret = 5` the position prior
-  shifts correctly (frets 0-19 become frets 5-24).
-- **Smoke:** run a known capo-3 clip from GuitarSet (if any exist)
-  and confirm the output tab is rendered against the capo.
-- **Audit playability:** confirm `instrument = electric` doesn't apply
-  the open-string bonus differently when it shouldn't, etc.
-
-Small phase; mostly a correctness-check before later phases compound
-any bugs here.
-
-**Phase 4 acceptance:**
-- All upload-form fields measurably affect at least one pipeline
-  decision per a unit test.
-- No silent fallback to defaults on any field.
-
-### Phase 5 — Learned fusion v2 (local, 3-5 days)
-
-**Goal:** the 2026-04-24 plan's learned-fusion approach, redone with
-proper feature instrumentation. **Per-pitch + chord-context ranker**,
-not the audio-only ranker that flat-lined at +0.3 pp in 2026-04-29.
-
-**Why this can work this time:** the apr-29 attempt's per-candidate
-features were limited (no fusion-prior values, no neck-anchor values
-because video was off, no chord-cluster context). With Phase 5
-shipping the structured search already, those values are now exposed
-and can be features.
-
-**Per-candidate features:**
-- `pitch`, `confidence`, `duration`, `amplitude` (audio).
-- `position_prior_log_prob`, `melodic_prior_log_prob`,
-  `neck_prior_log_prob` (fusion priors at this candidate).
-- `cluster_size`, `cluster_span`, `is_singleton`, `singleton_density_2s`
-  (chord context).
-- `emission_cost`, `transition_cost_to_prev` (playability).
-- `cand_string`, `cand_fret`, `is_open`, `is_low_position` (identity).
-- `style`, `instrument`, `tone` (from session config; flow-from-UI
-  audited in Phase 4).
-
-**Training:** GuitarSet train split, leave-one-player-out CV. LightGBM
-`lambdarank` with hard regression guard at -3 pp per held-out player.
-
-**Phase 5 acceptance:**
-- Mean Tab F1 across all held-out players: +3 pp vs Phase 3-or-earlier
-  baseline.
-- No held-out player regresses by > 3 pp.
-- Margin-based fallback to structured-search pick when learned-fusion
-  margin is below a threshold (mitigates OOD behavior in production).
-
-**Decision tree:**
-- Met → ship behind a flag, default off, with the margin fallback.
-  Default-on requires a separate review pass with at least one week of
-  production smoke clean.
-- Per-player regression > 3 pp on any fold → the apr-29 failure mode
-  repeats. Stop Phase 5 and pivot to Phase 7 instead.
-
-### Phase 6 — Video pipeline qualitative integration (1-2 days)
-
-Goal: re-enable the video stack in production for users whose uploads
-have usable video, without claiming any quantitative Tab F1 improvement.
-**No video accuracy gate** (D5).
-
-- Flip `TABVISION_VIDEO_ENABLED=true` in `tabvision-server/modal_app.py`
-  in dev.
-- Verify pipeline runs end-to-end on at least one synthetic
-  fretboard-rendered clip (Phase 6A) and the qualitative output is
-  reasonable.
-- Add a runtime quality gate (the one the v1_adapter currently fakes):
-  reject video evidence when `handDetectionRate < 0.3` or
-  `fretboardDetectionConfidence < 0.5`. Diagnostics in result JSON.
-- Production smoke: end-to-end on the existing `test_a440.mp4` (audio
-  ceiling) and one real-world iPhone clip (qualitative inspection only,
-  not gated).
-
-**Phase 6A — Synthetic fretboard video** (optional, 2-3 days):
-- Render a procedurally-generated fretboard animation (Blender or
-  pyrender) against SynthTab audio. Synchronized by-construction.
-- Use for video-pipeline smoke + regression tests (does turning video
-  on/off change anything?), NOT for accuracy claims.
-
-**Phase 6 acceptance:**
-- Video enable in dev does not regress GuitarSet audio-only Pitch /
-  Onset / Tab F1 metrics (delta within ±0.5 pp).
-- At least one synthetic clip produces a non-empty `fingerings` list
-  in the result.
-- Production smoke clean.
-
-**Decision tree:**
-- Audio-only metrics regress when video enabled → video is making
-  things worse on no-video-content clips. Add a fail-fast that
-  disables video output when `videoObservationCount == 0`, retry.
-- No regression but no positive signal either → ship video as opt-in,
-  default off. Revisit when a public video+tab dataset emerges.
-- Positive signal on some clips → ship default-on with the quality
-  gate.
-
-### Phase 7 — Solo-gated melodic prior (local, 2 days)
-
-**Goal:** re-enable the existing melodic prior in the regime where it
-helps (solo passages) without re-introducing the comp regression that
-caused the current ship-disable.
-
-- Gate the melodic prior on rolling-window singleton density: apply
-  only when ≥ 80% of clusters in the last 2 seconds are singletons.
-- Re-tune the 35/65 prior-blend ratio currently hard-coded in
-  `tabvision/tabvision/fusion/melodic_prior.py:64`.
-
-**Phase 7 acceptance:**
-- Solo subset Tab F1 +3 pp vs Phase 3 baseline.
-- Comp subset Tab F1 within ±1 pp.
-- No per-track regression > 3 pp.
-
-### Phase 8 — Tier shortfall recovery (as needed, 1-2 weeks)
-
-Triggered only if a tier still misses its D2 target after Phases 1-7.
-
-- **Distorted electric < 0.80:**
-  - If EGDB acquired: oversample EGDB distorted variants in Phase 2
-    fine-tune; re-run.
-  - If EGDB blocked: synthesize a distorted training subset via
-    SynthTab clean audio + free IR pack convolution (Modern Music
-    Solutions Declassified, Djammincabs).
-- **Clean acoustic single-line < 0.85:**
-  - Re-tune Phase 7 melodic-prior strength on the single-line subset.
-  - If still short: add a position-shift smoothing prior (events
-    within < 200 ms shouldn't span > 5 frets unless audio amplitude
-    suggests a deliberate slide).
-- **Clean acoustic strummed < 0.90:**
-  - Chord-shape template prior: for each detected chord cluster,
-    boost candidate fingerings that match a curated set of 30-50
-    common guitar chord shapes (port from
-    `tabvision-server/app/chord_shapes.py`, 790 LOC).
-- **Clean electric < 0.87:**
-  - Likely co-resolves with one of the above. Investigate per-tier
-    error decomposition before adding tier-specific work.
- -### Phase 9 — Final eval + documentation - -- Run `composite_eval.py` with full per-tier table. -- Write `docs/EVAL_REPORTS/per_tier_acceptance_.md`. -- Update `docs/DECISIONS.md` with each Dn entry actually taken. -- Final SPEC §1.4 amendment proposal: tier table replaces aggregate - target. Land as a SPEC PR. - -## 5. Sequencing - -``` -Phase 0 (parallel setup) [week 1] - ↓ -Phase 1 (pitch ceiling cheap) [week 1] - ↓ -Phase 2 (SynthTab + fine-tune) [week 2] - ↓ -┌────────────────────────────────────────┐ -│ Phase 3 (style priors) [w3] │ -│ Phase 4 (UI fields audit) [w3] │ parallel -│ Phase 5 (learned fusion v2) [w3-4] │ -│ Phase 6 (video qualitative) [w3] │ -│ Phase 7 (solo melodic prior) [w3] │ -└────────────────────────────────────────┘ - ↓ -Phase 8 (tier recovery) [w5-6 as needed] - ↓ -Phase 9 (final eval + docs) [w6] -``` - -Total wall-clock: **4-6 weeks engineering**, plus 1-2 weeks waiting -time on the EGDB email if it gates Phase 8 distorted-electric work. - -## 6. Risk register +| Account | Free allowance | Use | +|---|---|---| +| Local CPU | 6 cores WSL2 | eval runs, prior training, cheap post-processing experiments | +| Colab | ~12 hr/day with limits | quick experiments, prior sweeps | +| Kaggle | ~30 GPU-hr/week T4 | longer sweeps, baseline checks | +| Lightning Studios | 22 GPU-hr/month | any fine-tune work, batched in one monthly window | +| W&B | unlimited (academic) | experiment tracking — required before any GPU job | +| Hugging Face Hub | unlimited public | weight / checkpoint hosting | +| Modal | pay-per-use | **production smoke retests only**; never default training | + +### 4.4 Code already on `main` + +- `tabvision.audio.*` — production pitch backends (highres, basicpitch). +- `tabvision.fusion.{viterbi,chord,playability,position_prior,neck_prior,melodic_prior}` — Phase 5 shipped. +- `tabvision.video.{guitar,fretboard,hand}` — Phase 4 shipped. +- `tabvision.pipeline.run_pipeline` — production-facing orchestrator. 
+- `tabvision.eval.{manifest,metrics,runner,guitarset_audio}` — eval scaffolding with `REQUIRED_TIERS = ("clean_acoustic_single_line", "clean_acoustic_strummed", "clean_electric", "distorted_electric")` already encoded ([tabvision/tabvision/eval/manifest.py](tabvision/tabvision/eval/manifest.py)). +- `tabvision-server/{modal_app.py, app/v1_adapter.py}` — Modal production adapter. + +### 4.5 What's been tried (lessons carried forward) + +| Attempt | Outcome | Lesson | +|---|---|---| +| Learned-fusion LightGBM ranker (2026-04-29) | +0.3 pp LOOCV vs +5 pp gate; **-27.8 pp** regression on training-17 | Catastrophic single-fold regression with small data. **Re-try only with strict per-fold regression guard AND with video features actually populated**, which the apr-29 run lacked. | +| Basic Pitch fine-tune (2026-04-30) | Superseded by highres backend swap | Fine-tune infra reusable; ceiling lift now lives in highres post-processing and possibly a GuitarSet-only highres fine-tune. | +| Melodic prior | Regresses aggregate by 1.15 pp | Helps solo, hurts comp. Needs solo-density gating. | +| Position prior `guitarset-v1` | +22 pp Tab F1 | Per-pitch tabular priors are the largest cheap intervention. Style/structure-conditional priors are the natural extension. | + +## 5. Composite eval policy + +Each tier in the composite eval set must satisfy these rules. The +manifest schema (`tabvision/tabvision/eval/manifest.py`) already +encodes tier names and required clip fields; the Phase 0 impl plan +extends it for source-specific annotation paths and CI reporting. + +**Per-tier minimums:** +- Each of the four required tiers: **≥ 20 clips** and **≥ 500 gold + notes**. Below this the bootstrap CI is too wide to claim acceptance. +- Total composite: ≥ 80 clips, ≥ 2,000 notes. + +**Split policy:** +- GuitarSet: held-out **by player** (player 05 = validation; this is + the existing convention from `guitarset_audio_eval.py`). No + train/test leak at player level. 
+- Guitar-TECHS: split by **performer** if performer metadata is + available; else by clip with a deterministic seed. +- EGDB: split by **source track** (the 240 clean DIs); amp-sim + renders of the same track go to the same split. Required to avoid + amp-render leakage. + +**Source weighting:** +- Per-tier metrics are reported **un-weighted across sources within a + tier** (every clip has equal weight). The strategic question "is + GuitarSet over-represented in clean acoustic" gets a separate + per-source breakdown in the report; the headline number is the + un-weighted clip mean. + +**Leakage rules:** +- No clip used for prior training (`guitarset-v1` etc.) appears in + evaluation. Currently `guitarset-v1` is trained on GuitarSet train + split, evaluated on player 05 — compliant. +- Fine-tune sets must be disjoint from eval sets by player / performer. +- DadaGP-derived synthetic, if used, is training-only and never + appears in the eval manifest. + +**Confidence intervals:** +- Every per-tier number reported with a **95% bootstrap CI** over + clips (resample clips with replacement, recompute the tier-mean, + 10 000 resamples). The acceptance test is `lower_95_CI ≥ target`, + not just `mean ≥ target` — this disciplines small-sample wishful + thinking. + +**Parsers:** +- One parser per source, named by the annotation format (not the + source). Phase 0 ships: `guitarset_jams`, `guitar_techs_midi`, + `egdb_gp`. Each parser converts source-native annotations into the + §8 `TabEvent` dataclass list. Round-trip parser tests required. + +## 6. Phase outline (high-level only) + +Each phase has a goal + acceptance bar here. **Per-phase implementation +plans** (exact files / tests / commands / acceptance outputs) are +written **separately**, one phase at a time, only after the prior +phase's evidence justifies starting it. + +- **Phase 0 — Foundation.** Per-tier baselines + error decomposition on + the composite eval. 
Acquire Guitar-TECHS; send EGDB email; verify free + compute accounts. **No production code changes.** Acceptance: per-tier + baseline numbers exist for ≥ 3 of 4 tiers with bootstrap CIs; + per-tier 7-bucket error breakdown exists. [Companion: + `2026-05-13-tab-f1-phase-0-implementation.md`.] +- **Phase 1 — Pitch ceiling lift (cheap moves).** Voicing/silence gate + + peak-picking + Basic Pitch pitch-only ensemble. Acceptance: Pitch + F1 ≥ 0.93 on GuitarSet validation, no Onset F1 regression > 1 pp. +- **Phase 2 — Highres fine-tune on GuitarSet only.** Lightning + free-tier; ~3 GPU-hr. **No SynthTab pretrain.** Acceptance: Pitch F1 + ≥ 0.94, no Onset regression > 1 pp; cross-dataset sanity ≥ 0.90 on + Guitar-TECHS held-out. +- **Phase 3 — Style/structure-conditional priors.** Leave-one-player-out + CV with hard regression guard. Acceptance: solo Tab F1 +2 pp vs + `guitarset-v1`, no per-bucket regression > 1 pp on comp, no fold + regression > 3 pp. +- **Phase 4 — UI-field audit (capo/tuning/instrument/tone/style).** + Unit tests confirm each field propagates into a pipeline decision. +- **Phase 5 — Learned fusion v2.** Re-attempt with proper features + (chord-context, prior-values, playability-cost, video-when-on). + Acceptance: +3 pp mean Tab F1, no per-fold regression > 3 pp, + margin-fallback to structured search baked in. +- **Phase 6 — Video pipeline qualitative integration.** Enable + `TABVISION_VIDEO_ENABLED=true` in dev with a runtime quality gate. + Acceptance: video on/off does not regress audio-only metrics by > 0.5 pp. +- **Phase 7 — Solo-gated melodic prior.** Acceptance: solo +3 pp, + comp ±1 pp. +- **Phase 8 — Tier shortfall recovery.** Only if a tier still misses + its D2 target. Per-tier tactics (chord-shape templates for strummed, + IR-augmentation for distorted, etc.). +- **Phase 9 — Final eval + DECISIONS.md update + SPEC.md PR.** + +Sequencing: 0 → 1 → 2 in series; 3–7 parallelizable after 2; 8 only +on shortfall; 9 closes. 
Total wall-clock estimate: **4-6 weeks +engineering** + ~1 week EGDB-email turnaround. + +## 7. Risks | # | Risk | Likelihood | Mitigation | |---|---|---|---| -| R1 | SynthTab pretrain doesn't transfer to real audio (domain gap) | medium | Literature shows pretrain+fine-tune works (SynthTab paper, arXiv:2402.15258). Smoke on Guitar-TECHS held-out before committing to full pretrain spend. | -| R2 | EGDB license never resolves | low-medium | Author replies are usually fast; if blocked, synthetic IR-based distorted electric via Phase 8 fallback. | -| R3 | SynthTab labels are noisy (DadaGP human-transcribed varies in quality) | medium | Use SynthTab as pretrain only, never as eval gate. Phase 0 spot-check a 50-clip random sample. | -| R4 | Per-tier composite eval set has too few clips per tier for statistical significance | medium | Bootstrap 95% CIs in all per-tier reports. State the CI explicitly when reporting against the D2 target. | -| R5 | Video pipeline degrades audio-only metrics when enabled | low | Quality gate in Phase 6 + audio-only fallback. Phase 6 acceptance explicitly checks this. | -| R6 | Phase 5 learned-fusion reproduces the apr-29 single-fold catastrophe | medium | Hard regression guard per-fold + margin fallback to structured search. Phase 5 decision tree pivots to Phase 7 if it triggers. | -| R7 | Free-tier compute monthly allowance insufficient for Phase 2 + 8 retries | low | Lightning 22 hr/mo + Kaggle 30 hr/wk + Colab is ~150 GPU-hr/mo combined; Phase 2 needs ~14 hr. Plenty of buffer. | -| R8 | LICENSES.md needs updates for Guitar-TECHS, GOAT, SynthTab, EGDB | certain | Update in Phase 0. Each is CC-BY-4.0 (or pending in EGDB's case); attribution must appear in README and any blog. | +| R1 | EGDB license never resolves | medium | Phase 8 fallback: free-IR-augmented GuitarSet for distorted-electric tier; explicitly flagged as synthesized in reports. 
| +| R2 | Guitar-TECHS clips don't span all promised tiers (some clean-electric tracks may be missing) | low-medium | Phase 0 acceptance only requires ≥ 3 of 4 tiers; distorted-electric can wait on EGDB. | +| R3 | GuitarSet-only fine-tune (Phase 2) over-fits player 05's adjacent training distribution | medium | Cross-dataset sanity on Guitar-TECHS held-out; abort if Guitar-TECHS regresses > 5 pp. | +| R4 | Per-tier composite has too few clips for statistical significance | medium | D2 acceptance requires `lower_95_CI ≥ target`, not mean. Per-tier minimum 20 clips / 500 notes (§5). | +| R5 | Phase 5 learned fusion reproduces apr-29 single-fold catastrophe | medium | Strict per-fold regression guard + margin fallback. Decision tree pivots to Phase 7 if it triggers. | +| R6 | LICENSES.md updates required for Guitar-TECHS / EGDB / IR packs | certain | Update in Phase 0 alongside acquisition. | +| R7 | Free-tier monthly compute allowance exhausted before Phase 2 + 5 retries | low | Phase 2 ≈ 3 GPU-hr; Phase 5 is CPU. Combined < 10 hr/month, well inside Lightning's 22 hr cap. | +| R8 | Synthetic data (DadaGP) inadvertently ends up in shipped weights via training/eval pipeline cross-contamination | low | Synthetic clips never appear in `tabvision/data/eval/manifest.toml`; an explicit assert in Phase 0 manifest validator rejects any synthetic-source clip in the default eval set. | -## 7. Out of scope +## 8. Out of scope - Personal training clips (D10). -- Single-aggregate Tab F1 ≥ 0.88 (D1). -- Stretch v1.1 (bends/slides/hammer-ons) per D8. -- Quantitative video-gate (D5). Video ships qualitative-only. -- Top-K UX surface — UI work is separate. D2 targets apply to top-1. -- New SPEC §8 contracts — none of these phases changes signatures. -- Real-money compute except for production smoke retests on Modal. - -## 8. Phase 0 user actions (the things only you can do) - -1. 
Sign up / verify free-tier compute accounts: - - Lightning Studios (https://lightning.ai) - - Kaggle (https://kaggle.com) - - Colab (https://colab.research.google.com) - - Weights & Biases (https://wandb.ai, free academic tier) -2. Email the EGDB author for portfolio-use written permission. - Template: - - > Subject: TabVision portfolio project — request to use EGDB - > - > Dr. Chen, - > - > I'm a developer building TabVision, a portfolio guitar - > transcription project (public GitHub repo, blog post, recorded - > demo). I would like to use EGDB as the distorted-electric - > evaluation tier of my multi-source test set, and cite your - > ICASSP 2022 paper. The repo has no LICENSE file, so I'm asking - > for written permission to use EGDB in this portfolio context, - > including reporting evaluation metrics computed on it. - > - > Thank you, - > Patrick Gilhooley - -3. Confirm or push back on the D2 per-tier targets (table in §0). -4. Approve the plan; I cut a branch from `refactor/v1` and start - Phase 0E (the baseline measurement, since 0A and 0B are blocked on - the above). - -## 9. Things still genuinely unresolved - -These can be answered in flight; don't gate the plan on them. - -- The exact size of the SynthTab pilot (500 clips is a guess; the - right number is "smallest subset that produces a fine-tune gain" - and emerges from Phase 2's first run). -- Whether Phase 4 finds any actual capo/tuning regressions worth - fixing, or if it's a 30-minute box-tick. -- Phase 6A: whether procedural fretboard rendering is 2 days or 2 - weeks of work. Defer until we know whether Phase 6 alone is enough. - -## 10. Open invitation to redirect - -This plan favors free compute over fast iteration; SynthTab over DIY -synthesis; per-tier targets over single-aggregate; audio-only gates -over speculative video-gate construction. If any of those defaults are -wrong for what you actually want, say so before Phase 0 starts — -backtracking from Phase 3 is expensive. 
+- SynthTab in any shipped configuration (D4). +- GOAT (license). +- Aggregate Tab F1 ≥ 0.88 as an acceptance gate (D1). +- Stretch v1.1 (bends / slides / hammer-ons) per D8. +- Quantitative video-gate (D5). +- Top-K UI optimization — UI work is separate; D2 applies to top-1. +- §8 contract changes — no SPEC §8 signature edits in this plan. +- Modal as a default training surface (D6). + +## 9. Open questions (do not gate the plan) + +- EGDB author reply timing — assumed ~1 week. +- Whether Guitar-TECHS subdivides cleanly into "clean acoustic" vs + "clean electric" subsets at clip-level metadata, or whether we'll + need to inspect waveforms. +- Whether free IR pack licenses (Modern Music Solutions, Djammincabs) + permit redistribution of derived audio in evaluation reports. + Phase 8 fallback only. + +## 10. Companion docs in this PR + +- `SPEC.md` — §1.4.1 amendment block (per-tier targets + composite test set). +- `CLAUDE.md` — active-branch update (`main`, not `refactor/v1`). +- `docs/plans/2026-05-13-tab-f1-phase-0-implementation.md` — Phase 0 + implementation: exact files, tests, commands, acceptance outputs. + +Later phase implementation plans (`docs/plans/2026-05-NN-tab-f1-phase-N-implementation.md`) +will be written one phase at a time, only after the prior phase's +evidence is in. diff --git a/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md b/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md new file mode 100644 index 0000000..0a9cd5f --- /dev/null +++ b/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md @@ -0,0 +1,305 @@ +# Tab F1 — Phase 0 Implementation Plan + +**Date:** 2026-05-13 +**Author:** Patrick (brainstormed with Claude) +**Status:** Proposed — pending sign-off +**Strategy doc:** `docs/plans/2026-05-12-tab-f1-to-spec-design.md` +**Implementation branch:** to be cut as `impl/tab-f1-phase-0` off `main` + after the strategy / SPEC amendment lands. + +## 0. 
Phase 0 goal recap + +Establish the per-tier baseline and error decomposition needed to +sequence Phases 1+. **No production code changes; no shipped behavior +changes; no compute spend on training.** + +Acceptance, copied from the strategy doc §6: + +- Per-tier baseline numbers for ≥ 3 of 4 D2 tiers with **bootstrap + 95% CIs**, on the composite eval set. +- Per-tier 7-bucket error decomposition on the same set. +- Free-tier compute accounts (Local / Colab / Kaggle / Lightning / W&B) + verified. +- EGDB author email sent; reply tracked in `docs/DECISIONS.md`. + +## 1. Files to add / modify + +### 1.1 New files + +| Path | Purpose | +|---|---| +| `tabvision/tabvision/eval/parsers/__init__.py` | Parser registry | +| `tabvision/tabvision/eval/parsers/guitarset_jams.py` | JAMS → `list[TabEvent]` | +| `tabvision/tabvision/eval/parsers/guitar_techs_midi.py` | 6-track MIDI → `list[TabEvent]` | +| `tabvision/tabvision/eval/parsers/egdb_gp.py` | GuitarPro tab + MIDI → `list[TabEvent]` (skipped at import-time if PyGuitarPro not installed; runs only when EGDB license clears) | +| `tabvision/tabvision/eval/composite.py` | `run_composite_eval(manifest_path) -> CompositeReport` — dispatches to per-source parsers and aggregates per-tier | +| `tabvision/tabvision/eval/bootstrap.py` | Bootstrap CI helper: `bootstrap_ci(values, statistic=mean, n=10_000, seed=int) -> tuple[float, float, float]` returning `(mean, lower_95, upper_95)` | +| `tabvision/tabvision/eval/error_decomposition.py` | Port of `tabvision-server/tools/error_analysis.py` (apr-28 7-bucket harness) targeting `list[TabEvent]` pairs | +| `tabvision/scripts/eval/composite_eval.py` | CLI wrapper: `tabvision-composite-eval --manifest data/eval/composite.toml --output docs/EVAL_REPORTS/composite_baseline_.md` | +| `tabvision/scripts/eval/decompose_tab_errors.py` | CLI wrapper for error_decomposition.py | +| `tabvision/data/eval/composite.toml` | Composite-eval manifest (live; populated incrementally as datasets arrive) | 
+| `tabvision/data/fixtures/eval/guitarset_05_BN1-129-Eb_comp.jams` | Single-clip JAMS fixture for parser round-trip test | +| `tabvision/data/fixtures/eval/guitar_techs_sample.mid` | Single-clip 6-track MIDI fixture | +| `tabvision/tests/unit/test_parser_guitarset_jams.py` | JAMS parser round-trip test | +| `tabvision/tests/unit/test_parser_guitar_techs_midi.py` | MIDI parser round-trip test | +| `tabvision/tests/unit/test_bootstrap_ci.py` | CI helper correctness on known distributions | +| `tabvision/tests/unit/test_error_decomposition.py` | 7-bucket assignment correctness on synthetic predicted/gold pairs | +| `tabvision/tests/integration/test_composite_eval_smoke.py` | End-to-end smoke: 5-clip manifest → tier numbers exist + CIs computed | +| `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` | First baseline report (output of Phase 0E) | +| `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` | First 7-bucket decomposition (output of Phase 0D) | + +### 1.2 Modified files + +| Path | Lines | Change | +|---|---|---| +| `tabvision/tabvision/eval/manifest.py` | the `REQUIRED_CLIP_FIELDS` block (currently ~lines 21-28) | Add `annotation_format` field so parser-dispatch can route by source | +| `tabvision/tabvision/eval/manifest.py` | `validate_manifest()` | Reject any clip whose `source` indicates synthetic origin (e.g. starts with `synthtab/` or `dadagp/`) from a non-train split. This is the R8 cross-contamination guard from the strategy doc. | +| `LICENSES.md` | datasets table | Add Guitar-TECHS (CC-BY-4.0), EGDB (pending), free IR packs as they're acquired | +| `docs/DECISIONS.md` | append | D1–D11 from strategy doc §1 | +| `pyproject.toml` (in `tabvision/`) | `[project.optional-dependencies]` | Add `eval` extra with `pretty_midi`, `pyguitarpro`, `jams` (already used elsewhere — verify before adding) | + +### 1.3 NOT modified + +- `tabvision/tabvision/pipeline.py` — no behavior change in Phase 0. +- `tabvision/tabvision/fusion/**` — no fusion changes. 
+- `tabvision-server/modal_app.py`, `tabvision-server/app/v1_adapter.py` — no production changes. +- `tabvision-server/app/v1_adapter.py:91` `videoIgnoredByQualityGate` — flagged in strategy doc as a faked diagnostic, but the fix is Phase 6's job, not Phase 0's. + +## 2. Test plan + +Every test must be runnable via `pytest tabvision/tests/...` and skip +cleanly when an optional dependency is missing (PyGuitarPro, jams). +Fixtures go under `tabvision/data/fixtures/eval/`. + +### 2.1 Unit tests + +| Test name | Fixture | Assertion | +|---|---|---| +| `test_parser_guitarset_jams.py::test_jams_round_trip_pitch_string_fret` | `guitarset_05_BN1-129-Eb_comp.jams` (small, ~50 notes) | Every emitted `TabEvent` has `0 ≤ string_idx ≤ 5`, `0 ≤ fret ≤ 24`, monotonically non-decreasing `onset_s`. Total event count matches the JAMS namespace's note count. | +| `test_parser_guitarset_jams.py::test_jams_pitch_consistency` | same | For each emitted event, MIDI pitch implied by `(string_idx, fret)` matches the JAMS-reported pitch. | +| `test_parser_guitar_techs_midi.py::test_midi_round_trip_per_string` | `guitar_techs_sample.mid` (6 tracks, 1 per string) | Track index → `string_idx` mapping correct: track 0 → low E (`string_idx=0`), track 5 → high E (`string_idx=5`). | +| `test_parser_guitar_techs_midi.py::test_midi_pitch_to_fret` | same | Per-string MIDI pitch → fret derivation matches expected standard-tuning offsets: E2=40 → fret 0 string 0, A2=45 → fret 5 string 0, etc. | +| `test_bootstrap_ci.py::test_ci_known_normal` | synthetic Gaussian N(0.85, 0.05), n=100 | Returned 95% CI brackets the true mean in approximately 95% of 1000 trials (calibration check; assert against a tolerance band rather than a hard ≥ 95% floor, since the observed coverage is itself a noisy estimate). | +| `test_bootstrap_ci.py::test_ci_handles_small_samples` | n=5 | No exception; CI width sane (≥ standard error). | +| `test_bootstrap_ci.py::test_ci_deterministic_with_seed` | any | Same seed → same CI.
| +| `test_error_decomposition.py::test_seven_buckets_assigned` | synthetic gold + predicted `TabEvent` lists, one per bucket | Each ground-truth event lands in the expected bucket: `correct`, `wrong_position_same_pitch`, `pitch_off`, `timing_only`, `missed_onset`, `muted_undetectable`, `extra_detection`. | +| `test_error_decomposition.py::test_share_of_loss_sums_to_one` | mixed gold + predicted | Per-bucket share-of-loss percentages sum to 100% (excluding the `correct` bucket). | + +### 2.2 Integration tests + +| Test name | Setup | Assertion | +|---|---|---| +| `test_composite_eval_smoke.py::test_five_clip_manifest` | A 5-clip composite manifest using checked-in fixtures (3 GuitarSet, 2 Guitar-TECHS) | `run_composite_eval(manifest)` returns a `CompositeReport` whose tiers include both `clean_acoustic_single_line` and `clean_acoustic_strummed`. Each tier has a non-null `tab_f1_mean` and `tab_f1_ci_95`. | +| `test_composite_eval_smoke.py::test_synthetic_clip_rejected_from_eval` | A manifest with one clip whose `source = "synthtab/test"` and `split = "test"` | `validate_manifest()` raises with a message mentioning the cross-contamination guard. | +| `test_composite_eval_smoke.py::test_egdb_skipped_when_pyguitarpro_missing` | Manifest with an EGDB clip but PyGuitarPro not installed | Run completes successfully; the EGDB clip is reported as `skipped` with reason `parser_dependency_missing`. Other clips still evaluated. | + +### 2.3 What's NOT tested in Phase 0 + +- The actual D2 acceptance numbers — those are the *output* of running + the harness, not a unit-test assertion. The CI gate is what's tested; + whether the system *hits* 0.85/0.90/0.87/0.80 is a question Phases + 1-8 answer. +- Bootstrap confidence on real production data — covered by the + smoke test on fixtures; running on production data is a one-shot + command, not a CI test. + +## 3. 
Commands + +All commands run from repo root, in the WSL Ubuntu shell, with the +`tabvision` venv active (`source tabvision/venv/bin/activate` or +`pip install -e tabvision[dev,eval]`). + +### 3.1 One-time setup + +```bash +# Install eval extras (PyGuitarPro, pretty_midi, jams) +cd tabvision && pip install -e '.[dev,eval]' && cd - + +# Verify tests pass on the base +pytest tabvision/tests/unit/test_parser_guitarset_jams.py -v +pytest tabvision/tests/unit/test_bootstrap_ci.py -v +``` + +### 3.2 Acquire Guitar-TECHS + +```bash +# Guitar-TECHS is CC-BY-4.0, hosted on Zenodo (see strategy doc §4.1) +mkdir -p ~/mir_datasets/guitar_techs +# Download the dataset archive from the URL in arXiv:2501.03720 +# (resolved at acquisition time; not committed to repo) +# Extract into ~/mir_datasets/guitar_techs/ +ls ~/mir_datasets/guitar_techs/ +``` + +### 3.3 Build the manifest + +```bash +# Generate composite.toml from on-disk datasets +python tabvision/scripts/eval/build_composite_manifest.py \ + --guitarset ~/mir_datasets/guitarset \ + --guitar-techs ~/mir_datasets/guitar_techs \ + --output tabvision/data/eval/composite.toml + +# Validate it +python -c "from tabvision.eval.manifest import validate_manifest; print(validate_manifest('tabvision/data/eval/composite.toml'))" +``` + +### 3.4 Run the baseline composite eval + +```bash +python tabvision/scripts/eval/composite_eval.py \ + --manifest tabvision/data/eval/composite.toml \ + --backend highres \ + --position-prior guitarset-v1 \ + --bootstrap-n 10000 \ + --bootstrap-seed 42 \ + --output docs/EVAL_REPORTS/composite_baseline_2026-05-13.md +``` + +### 3.5 Run the error decomposition + +```bash +python tabvision/scripts/eval/decompose_tab_errors.py \ + --manifest tabvision/data/eval/composite.toml \ + --backend highres \ + --position-prior guitarset-v1 \ + --output docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md +``` + +### 3.6 Verify free-tier compute accounts + +```bash +# W&B: confirm login + a tiny no-op run +wandb 
login +python -c "import wandb; r = wandb.init(project='tabvision-phase0', mode='online'); r.log({'hello': 1}); r.finish()" + +# Lightning Studios: open a Studio in the browser, run `nvidia-smi`, screenshot for the DECISIONS.md log + +# Kaggle: open a notebook in the browser, run `!nvidia-smi` + +# Colab: same + +# Modal: skip — used only as last resort per D6 +``` + +### 3.7 Send the EGDB email + +User action — not a command. Template in strategy doc; log the +date sent and the reply (when it arrives) in `docs/DECISIONS.md`. + +## 4. Acceptance outputs + +These are the artifacts whose existence + content gates Phase 1. + +### 4.1 `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` + +Must contain: + +- A per-tier table: + - Tier name + - Clip count (≥ 20 for any tier claimed against D2) + - Mean Tab F1 + - **95% bootstrap CI lower bound** + - Mean Onset F1 + - Mean Pitch F1 +- Per-source breakdown within each tier (GuitarSet / Guitar-TECHS / + EGDB) so we can see whether a tier number is dominated by one + source. +- A "Status vs D2 target" column with one of: **pass** (CI lower ≥ + target), **gap** (mean ≥ target but CI lower below), **fail** (mean + below target). +- Methodology footer: bootstrap N, seed, parser versions, backend + + prior versions, eval-harness commit SHA. + +### 4.2 `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` + +Must contain: + +- Aggregate 7-bucket table (counts + share-of-loss). +- Per-tier 7-bucket table. +- A "biggest lever per tier" callout: which bucket dominates each + tier's loss. Phase 1+ priorities derive from this. + +### 4.3 `tabvision/data/eval/composite.toml` + +Must satisfy `validate_manifest()` and contain: + +- ≥ 20 clips for each of: `clean_acoustic_single_line`, + `clean_acoustic_strummed`. (Guitar-TECHS additions may bring + `clean_electric` to ≥ 20 in Phase 0E; if not, that tier waits for + EGDB.) 
+- `clean_electric` and `distorted_electric` populated as much as + Guitar-TECHS + EGDB-license-resolved allow. +- No `source = synthtab/...` or `source = dadagp/...` rows in `split = + validation` or `split = test`. + +### 4.4 `docs/DECISIONS.md` entries + +D1–D11 from strategy doc §1, dated 2026-05-13. EGDB email send-date +and reply (when it arrives) as a separate entry. + +### 4.5 CI verification + +`pytest tabvision/tests/unit tabvision/tests/integration -v` passes +on `main` HEAD plus this Phase 0 branch. + +## 5. Decision tree + +What to do after Phase 0E baseline is in: + +- **All four tiers' CI lower bound clears D2** — surprising; sanity + check the eval harness, then declare v1 acceptance and skip to + Phase 9. This is unlikely given the 2026-05-08 0.61 aggregate. +- **Strummed CI lower bound clears D2, other tiers gap or fail** — + expected case. Proceed to Phase 1 (pitch ceiling lift). The + error-decomposition report tells us whether Phase 2 (fine-tune) or + Phase 3 (style priors) is the next priority after Phase 1. +- **All tiers fail** — Phase 0 implementation has a bug, or the + highres backend regressed on the broader corpus. Inspect 3-5 + worst-case clips by hand before any further compute spend. +- **`distorted_electric` has < 20 clips** — EGDB license is the + blocker. Set the tier aside; document the gap in the report; do not + publish D2 acceptance until the EGDB row clears. + +## 6. 
Time + compute budget + +| Item | Effort | Compute | +|---|---|---| +| Parser implementations + tests (1.1) | 1.5 days | none | +| Manifest extensions + validator hardening (1.2) | 0.5 day | none | +| Composite + bootstrap + error-decomposition modules (1.1) | 1 day | none | +| Guitar-TECHS acquisition + manifest population | 0.5 day | none | +| Baseline + decomposition runs (3.4 + 3.5) | 4-8 wall-clock hours | local CPU | +| Free-tier compute account verification | 0.5 day | none | +| EGDB email + DECISIONS.md updates | 15 minutes | none | +| Report writing | 0.5 day | none | +| **Total** | **4-5 days engineering** | **~$0** | + +## 7. Out of scope for Phase 0 + +- Any production-pipeline change. No edits to `pipeline.py`, `fusion/`, + `audio/`, `video/`, `tabvision-server/`. +- Fine-tuning, training, or model weight changes. +- Anything depending on the EGDB license reply (defer to Phase 8 or + later). +- Style-conditional priors (Phase 3). +- Video pipeline experiments (Phase 6). +- Synthetic-data generation (research/dev only; not part of Phase 0). + +## 8. Done definition + +Phase 0 is **done** when: + +- All items in §1.1 and §1.2 exist on the impl branch. +- All tests in §2.1 and §2.2 pass green. +- `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` exists and meets + §4.1. +- `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` exists + and meets §4.2. +- `tabvision/data/eval/composite.toml` exists and validates. +- `docs/DECISIONS.md` includes D1–D11. +- EGDB email send-date recorded. +- Free-tier compute accounts verified (W&B at minimum; Lightning / + Kaggle / Colab logged in `docs/DECISIONS.md`). + +Then — and only then — the Phase 1 implementation plan gets written.
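As a rough illustration of the acceptance rule this plan leans on (resample clips with replacement, recompute the tier mean, then compare the lower 95% bound to the D2 target), a minimal sketch might look like the following. It is not the planned `tabvision/tabvision/eval/bootstrap.py`: the `bootstrap_ci` signature is simplified from the one listed in §1.1, the percentile method is an assumption (the plan does not name a CI method), `status_vs_target` is a hypothetical helper name, and the scores are made-up values.

```python
import random
from statistics import fmean


def bootstrap_ci(values, statistic=fmean, n=10_000, seed=42):
    """Percentile bootstrap over clips; returns (mean, lower_95, upper_95).

    Assumption: a plain percentile interval. The plan only specifies
    "resample clips with replacement, recompute the tier-mean".
    """
    if not values:
        raise ValueError("need at least one per-clip score")
    rng = random.Random(seed)  # seeded so the same input gives the same CI
    k = len(values)
    # Each resample draws k clips with replacement and recomputes the statistic.
    stats = sorted(statistic(rng.choices(values, k=k)) for _ in range(n))
    lower = stats[int(0.025 * (n - 1))]
    upper = stats[int(0.975 * (n - 1))]
    return statistic(values), lower, upper


def status_vs_target(mean, lower_95, target):
    """Hypothetical helper mirroring the 'Status vs D2 target' column in 4.1."""
    if lower_95 >= target:
        return "pass"  # CI lower bound clears the target
    if mean >= target:
        return "gap"   # mean clears the target, CI lower bound does not
    return "fail"      # mean is below the target


# Illustrative per-clip Tab F1 scores for one tier (not real results).
scores = [0.91, 0.88, 0.93, 0.86, 0.90, 0.92, 0.89, 0.94, 0.87, 0.91]
mean, lo, hi = bootstrap_ci(scores, n=2_000, seed=42)
print(f"mean={mean:.3f} CI=[{lo:.3f}, {hi:.3f}] -> {status_vs_target(mean, lo, 0.90)}")
```

Gating on the lower bound rather than the mean is what makes the per-tier minimum of 20 clips matter: with few clips the interval widens, and a tier whose mean sits just above its target will tend to report **gap** rather than **pass**.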