diff --git a/CLAUDE.md b/CLAUDE.md index 71537df..65dc78c 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -18,9 +18,19 @@ parallel under `refactor/v1`. - `LICENSES.md` — dependency license map; ⚠️ items gate respective phase entry. - `docs/DECISIONS.md` — non-obvious branches taken (per SPEC §0.5). -**Active branch:** `refactor/v1` (cut off `feature/audio-finetune-phase1`, -not `main` — see `docs/DECISIONS.md`). `main` is 33 commits behind v0. -Phase 0 in progress; sign-off pending on AUDIT + LICENSES. +**Active branch (2026-05-13):** `main`. The Modal production deploy +(`936a5cc`) and v1 CI hardening landed on `main`; `refactor/v1` is now +**23 commits behind `main`** and should be treated as historical. Cut new +work branches off `main`. Older design docs (and earlier paragraphs in +this file) may reference paths that exist on `main` but not on +`refactor/v1` — verify with `git cat-file -e origin/main:` before +relying on them. The full pipeline (`tabvision/tabvision/pipeline.py`), +the Modal production adapter (`tabvision-server/modal_app.py`, +`tabvision-server/app/v1_adapter.py`), and the highres audio backend all +live on `main`. Phase 5 fusion has shipped. See +`docs/2026-05-12-session-handoff.md` for the production state and +`docs/plans/2026-05-12-tab-f1-to-spec-design.md` (+ companion Phase 0 +implementation plan) for current accuracy work. ## Layout diff --git a/SPEC.md b/SPEC.md index 3fe8f5f..e666752 100644 --- a/SPEC.md +++ b/SPEC.md @@ -121,6 +121,41 @@ The targets above are aggregate over the full eval set. Per-difficulty-tier expe If the aggregate hits 0.88 but distorted electric scores below 0.75, treat that as a partial pass and prioritize Phase 7 distortion-augmented fine-tuning before final acceptance. 
+### 1.4.1 v1 acceptance amendment — per-tier targets (2026-05-13) + +Per the 2026-05-13 design plan +(`docs/plans/2026-05-12-tab-f1-to-spec-design.md`), v1 acceptance moves +from the aggregate 0.88 Tab F1 in §1.4 to **per-tier targets on a +public-corpus composite eval set**: + +| Tier | §1.4 stretch reference | v1 acceptance | +|---|---:|---:| +| Clean acoustic single-line | 0.94 | **0.85** | +| Clean acoustic strummed | 0.86 | **0.90** | +| Clean electric | 0.90 | **0.87** | +| Distorted electric | 0.82 | **0.80** | + +Rationale: 2026-05-08 GuitarSet validation showed aggregate Tab F1 = 0.61 +with comp tracks at 0.67 and solo tracks at 0.51 despite both being near +0.92 Pitch F1. The aggregate hid the structural failure mode (single-line +string/fret assignment). Per-tier targets force the conversation onto the +right axis and let work be sequenced (strummed first, distorted electric +last). + +**Test-set composition amendment:** the "user's own playing" test set in +§1.4 paragraph 1 is replaced by a public-corpus composite (GuitarSet +held-out + Guitar-TECHS + EGDB pending license + qualifying synthetic +training/dev material). See the design plan §5 for composite policy +(per-tier minimums, splits, leakage rules, bootstrap CIs). + +**Stretch / portfolio reference:** the original §1.4 per-tier table +(0.94 / 0.86 / 0.90 / 0.82) remains the v1.1 / portfolio stretch bar. +Hitting it is welcome; v1 acceptance requires only the amended table. + +**Aggregate Tab F1** is retired as an acceptance metric. **Onset F1 +(≥ 0.92), Pitch F1 (≥ 0.90), chord-instance accuracy (≥ 0.85), and +latency (≤ 5 min)** from §1.4 are unchanged. + ### 1.5 Hard constraints - All training/inference dependencies must be free or have a free tier sufficient for this project (see §6). 
diff --git a/docs/plans/2026-05-12-tab-f1-to-spec-design.md b/docs/plans/2026-05-12-tab-f1-to-spec-design.md new file mode 100644 index 0000000..ff1569b --- /dev/null +++ b/docs/plans/2026-05-12-tab-f1-to-spec-design.md @@ -0,0 +1,293 @@ +# Tab F1 v1 acceptance — Strategy & Decision Record + +**Date:** 2026-05-12 (revised 2026-05-13 per PR #10 review) +**Author:** Patrick (brainstormed with Claude) +**Status:** v3 — strategy / decision-record only; **not** an implementation plan +**Scope note:** This is a **SPEC §1.4 amendment proposal** plus + strategy. Implementation detail lives in companion docs. +**Companions:** +- `SPEC.md` §1.4.1 (the amendment table; committed in the same change set) +- `docs/plans/2026-05-13-tab-f1-phase-0-implementation.md` (Phase 0 impl) +- Later phase impl plans (write after Phase 0 evidence) +**Replaces:** v1 + v2 (2026-05-12 single-aggregate-target drafts; both + had load-bearing license errors and stale path references + and have been superseded by this rewrite). + +## 0. License gate (must clear before any compute spend) + +Per SPEC §1.5 the **shipping default pipeline** must be portfolio-clean. +NC-licensed material is acceptable in research/experiment configurations +that are NOT shipped. Each resource is verified 2026-05-13: + +| Resource | License | Portfolio-default usable? 
| Source / verification | +|---|---|---|---| +| GuitarSet | CC-BY-4.0 | **yes** | https://zenodo.org/records/3371780 | +| Guitar-TECHS | CC-BY-4.0 | **yes** | arXiv:2501.03720 §4 distribution | +| EGDB | none on repo — **author email pending** | **gated** | https://ss12f32v.github.io/Guitar-Transcription/ (LICENSES.md ⚠️) | +| GOAT | request-only, research-only | **no — DROPPED** | arXiv:2509.22655 §4.2 *"made available by request to better control its use for research purposes only"* | +| SynthTab dataset | **CC-BY-NC-4.0** | **no — DROPPED** | github.com/yongyizang/SynthTab README *"SynthTab is released with CC BY-NC 4.0 license"* | +| SynthTab rendering code | CC-BY-4.0 | n/a (we're not redistributing the code) | repo `LICENSE` file | +| DadaGP | access-by-email research-only; underlying GP tabs derive from copyrighted songs | **research/dev only** — NOT in default path | github.com/dada-bots/dadaGP README; underlying tab copyright unsettled | +| Basic Pitch | Apache-2.0 | yes (Phase 1 pitch ensemble) | github.com/spotify/basic-pitch | +| highres (xavriley) | MIT | yes — current production audio backend | github.com/xavriley/hf_midi_transcription | +| MediaPipe Hands | Apache-2.0 | yes — video pipeline | per LICENSES.md | +| YOLO-OBB (ultralytics) | AGPL-3.0 (accepted per DECISIONS.md) | yes (portfolio is AGPL-OK) | per LICENSES.md | +| Free amp/cab IRs | varies (most free-public) | yes for default if redistribution terms allow; verify per-pack | Modern Music Solutions Declassified, Djammincabs | + +**Drops vs v2 plan:** +- **SynthTab dropped** because the dataset is CC-BY-NC-4.0; pretraining + the shipping audio backend on it taints derived weights (SynthTab paper + treats trained models as derivative work). Distillation as a laundering + step is rejected — both legally murky and explicitly out of bounds + per the 2026-05-13 review. +- **GOAT dropped** because it's request-only research-only. Cannot + evaluate a public portfolio against it. 
+ +**Hard rule:** any phase that depends on a "gated" or "no" row must +produce evidence that the gate cleared (e.g., a written reply from the +EGDB author) BEFORE that phase ships. No conditional commits, no +"we'll-figure-it-out-later" merges. + +## 1. Decisions + +These supersede the v2 D1–D10 set. Append to `docs/DECISIONS.md` per +SPEC §0.5 once the plan is approved. + +| # | Decision | Rationale | +|---|---|---| +| D1 | Tab F1 evaluated **per tier**, not as a single aggregate. SPEC §1.4 aggregate 0.88 is retired. | Aggregate hides the real failure mode (string/fret assignment on solo lines). | +| D2 | Per-tier v1 acceptance targets: **0.85 / 0.90 / 0.87 / 0.80** for clean acoustic single-line / strummed / clean electric / distorted electric. | User-stated floor (0.80) and strummed (≥ 0.90); middle tiers proposed and accepted. Original SPEC numbers (0.94 / 0.86 / 0.90 / 0.82) become the v1.1 / portfolio stretch reference. | +| D3 | Eval set is a **multi-source public-corpus composite**: GuitarSet + Guitar-TECHS + EGDB (license-pending) + qualifying synthetic. Personal videos banned. GOAT dropped. SynthTab dropped from default path. | Per-tier evaluation requires per-tier sources; portfolio constraint excludes NC and request-only data from the shipping path. | +| D4 | **No SynthTab in the default pipeline.** Audio-side lift comes from priors + cheap pitch post-processing + GuitarSet fine-tune. DadaGP-derived synthetic remains acceptable for **internal training/dev only** if it's never shipped. | SynthTab CC-BY-NC-4.0 taints derived weights; SPEC §1.5 bars NC from default. | +| D5 | **No quantitative video-gate.** Video pipeline ships as a qualitative feature; per-tier Tab F1 measured audio-only. | No public dataset has video + per-note string/fret labels (verified 2026-05-12). | +| D6 | **Free-tier compute first.** Order per CLAUDE.md operating rule 6 and SPEC §6.3: **Local CPU > Colab > Kaggle > Lightning Studios > Modal**. Modal is the last resort. 
| Project rule, plus Lightning's 22 GPU-hr/month free tier covers any fine-tune we'd plausibly run. | +| D7 | **1-2 month cadence.** No fixed deadline. | User-stated. | +| D8 | Stretch goals (bends / slides / hammer-ons / pull-offs) **out of scope** for v1. | SPEC §1.4 already marks them v1.1. | +| D9 | Top-K acceptable as an editor UX feature; the D2 numbers are on **top-1 only**. | User-stated. | +| D10 | Personal training clips off the table entirely — not as accuracy gate, not as dev set, not as label source. | User-stated. | +| D11 | This document is a **SPEC §1.4 amendment**, not a SPEC-achievement plan. Land the SPEC.md update (§1.4.1) in the same change set. | Honest framing of relaxed targets; reviewer's approval bar. | + +## 2. Goal & framing + +**v1 acceptance:** hit the D2 per-tier Tab F1 targets on the D3 +public-corpus composite eval set within 1-2 months on free-tier +compute, with the existing v1 pipeline (no §8 contract changes). + +**Stretch / portfolio reference:** the original SPEC §1.4 numbers +(0.94 / 0.86 / 0.90 / 0.82). If we hit them, that's the portfolio +narrative; v1 acceptance does not require them. + +**Out of v1 acceptance:** quantitative video-fusion Tab F1 +improvement claim (no public dataset for it; tracked as qualitative +only). + +## 3. Current evidence + +GuitarSet validation, 60 tracks, 8715 gold notes, 2026-05-08 +production candidate (highres + `guitarset-v1` prior, audio-only): + +| Metric | Current | Status | +|---|---:|---| +| Onset F1 (50 ms) | 0.9218 | passes SPEC §1.4 ≥ 0.92 | +| Pitch F1 (50 ms) | 0.9022 | passes SPEC §1.4 ≥ 0.90 | +| Tab F1 aggregate (retired) | 0.6104 | — | +| Tab F1, comp subset | 0.670 mean | — | +| Tab F1, solo subset | 0.508 mean | — | + +The 27 pp gap to the **retired** 0.88 aggregate target is almost +entirely string/fret assignment on single-line passages. Audio is at +spec; only fusion-side assignment is short. 
This frames the per-tier +work: **strummed (chord context) is closest to its target; single-line +needs the most lift.** + +**Coverage gap:** GuitarSet covers only the clean acoustic tiers. +Clean-electric and distorted-electric have **no current measurement** +on a public corpus and must be acquired in Phase 0. + +## 4. Resource inventory + +### 4.1 Datasets (default-pipeline path only) + +| Source | License | Modality | Labels | Tier coverage | +|---|---|---|---|---| +| GuitarSet (on-disk) | CC-BY-4.0 | audio (hex + DI) | JAMS (string + fret + pitch) | clean acoustic single-line, strummed | +| Guitar-TECHS (acquire) | CC-BY-4.0 | audio (multi-mic + DI) | 6-track per-string MIDI | clean acoustic single-line, clean electric | +| EGDB (acquire, license pending) | none on repo — author email required | audio (DI + 5 amp sims) | GuitarPro tabs + aligned MIDI | clean electric, distorted electric | +| Free IR-augmented GuitarSet | CC-BY-4.0 (with IR pack licenses verified) | derived audio | inherited string + fret | distorted electric (fallback if EGDB blocks) | + +### 4.2 Datasets (research / dev only — NEVER in the default pipeline) + +| Source | License | Use | +|---|---|---| +| DadaGP | access-by-email, research-only | possible internal-training augmentation; not shipped, not redistributed | +| SynthTab | CC-BY-NC-4.0 | reference only; not a substrate for any shipped weight | + +### 4.3 Compute accounts (free-tier first, per D6 order) + +| Account | Free allowance | Use | +|---|---|---| +| Local CPU | 6 cores WSL2 | eval runs, prior training, cheap post-processing experiments | +| Colab | ~12 hr/day with limits | quick experiments, prior sweeps | +| Kaggle | ~30 GPU-hr/week T4 | longer sweeps, baseline checks | +| Lightning Studios | 22 GPU-hr/month | any fine-tune work, batched in one monthly window | +| W&B | unlimited (academic) | experiment tracking — required before any GPU job | +| Hugging Face Hub | unlimited public | weight / checkpoint hosting | +| Modal | 
pay-per-use | **production smoke retests only**; never default training | + +### 4.4 Code already on `main` + +- `tabvision.audio.*` — production pitch backends (highres, basicpitch). +- `tabvision.fusion.{viterbi,chord,playability,position_prior,neck_prior,melodic_prior}` — Phase 5 shipped. +- `tabvision.video.{guitar,fretboard,hand}` — Phase 4 shipped. +- `tabvision.pipeline.run_pipeline` — production-facing orchestrator. +- `tabvision.eval.{manifest,metrics,runner,guitarset_audio}` — eval scaffolding with `REQUIRED_TIERS = ("clean_acoustic_single_line", "clean_acoustic_strummed", "clean_electric", "distorted_electric")` already encoded ([tabvision/tabvision/eval/manifest.py](tabvision/tabvision/eval/manifest.py)). +- `tabvision-server/{modal_app.py, app/v1_adapter.py}` — Modal production adapter. + +### 4.5 What's been tried (lessons carried forward) + +| Attempt | Outcome | Lesson | +|---|---|---| +| Learned-fusion LightGBM ranker (2026-04-29) | +0.3 pp LOOCV vs +5 pp gate; **-27.8 pp** regression on training-17 | Catastrophic single-fold regression with small data. **Re-try only with strict per-fold regression guard AND with video features actually populated**, which the apr-29 run lacked. | +| Basic Pitch fine-tune (2026-04-30) | Superseded by highres backend swap | Fine-tune infra reusable; ceiling lift now lives in highres post-processing and possibly a GuitarSet-only highres fine-tune. | +| Melodic prior | Regresses aggregate by 1.15 pp | Helps solo, hurts comp. Needs solo-density gating. | +| Position prior `guitarset-v1` | +22 pp Tab F1 | Per-pitch tabular priors are the largest cheap intervention. Style/structure-conditional priors are the natural extension. | + +## 5. Composite eval policy + +Each tier in the composite eval set must satisfy these rules. 
The +manifest schema (`tabvision/tabvision/eval/manifest.py`) already +encodes tier names and required clip fields; the Phase 0 impl plan +extends it for source-specific annotation paths and CI reporting. + +**Per-tier minimums:** +- Each of the four required tiers: **≥ 20 clips** and **≥ 500 gold + notes**. Below this the bootstrap CI is too wide to claim acceptance. +- Total composite: ≥ 80 clips, ≥ 2,000 notes. + +**Split policy:** +- GuitarSet: held-out **by player** (player 05 = validation; this is + the existing convention from `guitarset_audio_eval.py`). No + train/test leak at player level. +- Guitar-TECHS: split by **performer** if performer metadata is + available; else by clip with a deterministic seed. +- EGDB: split by **source track** (the 240 clean DIs); amp-sim + renders of the same track go to the same split. Required to avoid + amp-render leakage. + +**Source weighting:** +- Per-tier metrics are reported **un-weighted across sources within a + tier** (every clip has equal weight). The strategic question "is + GuitarSet over-represented in clean acoustic" gets a separate + per-source breakdown in the report; the headline number is the + un-weighted clip mean. + +**Leakage rules:** +- No clip used for prior training (`guitarset-v1` etc.) appears in + evaluation. Currently `guitarset-v1` is trained on GuitarSet train + split, evaluated on player 05 — compliant. +- Fine-tune sets must be disjoint from eval sets by player / performer. +- DadaGP-derived synthetic, if used, is training-only and never + appears in the eval manifest. + +**Confidence intervals:** +- Every per-tier number reported with a **95% bootstrap CI** over + clips (resample clips with replacement, recompute the tier-mean, + 10 000 resamples). The acceptance test is `lower_95_CI ≥ target`, + not just `mean ≥ target` — this disciplines small-sample wishful + thinking. + +**Parsers:** +- One parser per source, named by the annotation format (not the + source). 
Phase 0 ships: `guitarset_jams`, `guitar_techs_midi`, + `egdb_gp`. Each parser converts source-native annotations into the + §8 `TabEvent` dataclass list. Round-trip parser tests required. + +## 6. Phase outline (high-level only) + +Each phase has a goal + acceptance bar here. **Per-phase implementation +plans** (exact files / tests / commands / acceptance outputs) are +written **separately**, one phase at a time, only after the prior +phase's evidence justifies starting it. + +- **Phase 0 — Foundation.** Per-tier baselines + error decomposition on + the composite eval. Acquire Guitar-TECHS; send EGDB email; verify free + compute accounts. **No production code changes.** Acceptance: per-tier + baseline numbers exist for ≥ 3 of 4 tiers with bootstrap CIs; + per-tier 7-bucket error breakdown exists. [Companion: + `2026-05-13-tab-f1-phase-0-implementation.md`.] +- **Phase 1 — Pitch ceiling lift (cheap moves).** Voicing/silence gate + + peak-picking + Basic Pitch pitch-only ensemble. Acceptance: Pitch + F1 ≥ 0.93 on GuitarSet validation, no Onset F1 regression > 1 pp. +- **Phase 2 — Highres fine-tune on GuitarSet only.** Lightning + free-tier; ~3 GPU-hr. **No SynthTab pretrain.** Acceptance: Pitch F1 + ≥ 0.94, no Onset regression > 1 pp; cross-dataset sanity ≥ 0.90 on + Guitar-TECHS held-out. +- **Phase 3 — Style/structure-conditional priors.** Leave-one-player-out + CV with hard regression guard. Acceptance: solo Tab F1 +2 pp vs + `guitarset-v1`, no per-bucket regression > 1 pp on comp, no fold + regression > 3 pp. +- **Phase 4 — UI-field audit (capo/tuning/instrument/tone/style).** + Unit tests confirm each field propagates into a pipeline decision. +- **Phase 5 — Learned fusion v2.** Re-attempt with proper features + (chord-context, prior-values, playability-cost, video-when-on). + Acceptance: +3 pp mean Tab F1, no per-fold regression > 3 pp, + margin-fallback to structured search baked in. 
+- **Phase 6 — Video pipeline qualitative integration.** Enable + `TABVISION_VIDEO_ENABLED=true` in dev with a runtime quality gate. + Acceptance: video on/off does not regress audio-only metrics by > 0.5 pp. +- **Phase 7 — Solo-gated melodic prior.** Acceptance: solo +3 pp, + comp ±1 pp. +- **Phase 8 — Tier shortfall recovery.** Only if a tier still misses + its D2 target. Per-tier tactics (chord-shape templates for strummed, + IR-augmentation for distorted, etc.). +- **Phase 9 — Final eval + DECISIONS.md update + SPEC.md PR.** + +Sequencing: 0 → 1 → 2 in series; 3–7 parallelizable after 2; 8 only +on shortfall; 9 closes. Total wall-clock estimate: **4-6 weeks +engineering** + ~1 week EGDB-email turnaround. + +## 7. Risks + +| # | Risk | Likelihood | Mitigation | +|---|---|---|---| +| R1 | EGDB license never resolves | medium | Phase 8 fallback: free-IR-augmented GuitarSet for distorted-electric tier; explicitly flagged as synthesized in reports. | +| R2 | Guitar-TECHS clips don't span all promised tiers (some clean-electric tracks may be missing) | low-medium | Phase 0 acceptance only requires ≥ 3 of 4 tiers; distorted-electric can wait on EGDB. | +| R3 | GuitarSet-only fine-tune (Phase 2) over-fits player 05's adjacent training distribution | medium | Cross-dataset sanity on Guitar-TECHS held-out; abort if Guitar-TECHS regresses > 5 pp. | +| R4 | Per-tier composite has too few clips for statistical significance | medium | D2 acceptance requires `lower_95_CI ≥ target`, not mean. Per-tier minimum 20 clips / 500 notes (§5). | +| R5 | Phase 5 learned fusion reproduces apr-29 single-fold catastrophe | medium | Strict per-fold regression guard + margin fallback. Decision tree pivots to Phase 7 if it triggers. | +| R6 | LICENSES.md updates required for Guitar-TECHS / EGDB / IR packs | certain | Update in Phase 0 alongside acquisition. | +| R7 | Free-tier monthly compute allowance exhausted before Phase 2 + 5 retries | low | Phase 2 ≈ 3 GPU-hr; Phase 5 is CPU. 
Combined < 10 hr/month, well inside Lightning's 22 hr cap. | +| R8 | Synthetic data (DadaGP) inadvertently ends up in shipped weights via training/eval pipeline cross-contamination | low | Synthetic clips never appear in `tabvision/data/eval/manifest.toml`; an explicit assert in Phase 0 manifest validator rejects any synthetic-source clip in the default eval set. | + +## 8. Out of scope + +- Personal training clips (D10). +- SynthTab in any shipped configuration (D4). +- GOAT (license). +- Aggregate Tab F1 ≥ 0.88 as an acceptance gate (D1). +- Stretch v1.1 (bends / slides / hammer-ons) per D8. +- Quantitative video-gate (D5). +- Top-K UI optimization — UI work is separate; D2 applies to top-1. +- §8 contract changes — no SPEC §8 signature edits in this plan. +- Modal as a default training surface (D6). + +## 9. Open questions (do not gate the plan) + +- EGDB author reply timing — assumed ~1 week. +- Whether Guitar-TECHS subdivides cleanly into "clean acoustic" vs + "clean electric" subsets at clip-level metadata, or whether we'll + need to inspect waveforms. +- Whether free IR pack licenses (Modern Music Solutions, Djammincabs) + permit redistribution of derived audio in evaluation reports. + Phase 8 fallback only. + +## 10. Companion docs in this PR + +- `SPEC.md` — §1.4.1 amendment block (per-tier targets + composite test set). +- `CLAUDE.md` — active-branch update (`main`, not `refactor/v1`). +- `docs/plans/2026-05-13-tab-f1-phase-0-implementation.md` — Phase 0 + implementation: exact files, tests, commands, acceptance outputs. + +Later phase implementation plans (`docs/plans/2026-05-NN-tab-f1-phase-N-implementation.md`) +will be written one phase at a time, only after the prior phase's +evidence is in. 
diff --git a/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md b/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md new file mode 100644 index 0000000..0a9cd5f --- /dev/null +++ b/docs/plans/2026-05-13-tab-f1-phase-0-implementation.md @@ -0,0 +1,305 @@ +# Tab F1 — Phase 0 Implementation Plan + +**Date:** 2026-05-13 +**Author:** Patrick (brainstormed with Claude) +**Status:** Proposed — pending sign-off +**Strategy doc:** `docs/plans/2026-05-12-tab-f1-to-spec-design.md` +**Implementation branch:** to be cut as `impl/tab-f1-phase-0` off `main` + after the strategy / SPEC amendment lands. + +## 0. Phase 0 goal recap + +Establish the per-tier baseline and error decomposition needed to +sequence Phases 1+. **No production code changes; no shipped behavior +changes; no compute spend on training.** + +Acceptance, copied from the strategy doc §6: + +- Per-tier baseline numbers for ≥ 3 of 4 D2 tiers with **bootstrap + 95% CIs**, on the composite eval set. +- Per-tier 7-bucket error decomposition on the same set. +- Free-tier compute accounts (Local / Colab / Kaggle / Lightning / W&B) + verified. +- EGDB author email sent; reply tracked in `docs/DECISIONS.md`. + +## 1. 
Files to add / modify + +### 1.1 New files + +| Path | Purpose | +|---|---| +| `tabvision/tabvision/eval/parsers/__init__.py` | Parser registry | +| `tabvision/tabvision/eval/parsers/guitarset_jams.py` | JAMS → `list[TabEvent]` | +| `tabvision/tabvision/eval/parsers/guitar_techs_midi.py` | 6-track MIDI → `list[TabEvent]` | +| `tabvision/tabvision/eval/parsers/egdb_gp.py` | GuitarPro tab + MIDI → `list[TabEvent]` (skipped at import-time if PyGuitarPro not installed; runs only when EGDB license clears) | +| `tabvision/tabvision/eval/composite.py` | `run_composite_eval(manifest_path) -> CompositeReport` — dispatches to per-source parsers and aggregates per-tier | +| `tabvision/tabvision/eval/bootstrap.py` | Bootstrap CI helper: `bootstrap_ci(values, statistic=mean, n=10_000, seed=int) -> tuple[float, float, float]` returning `(mean, lower_95, upper_95)` | +| `tabvision/tabvision/eval/error_decomposition.py` | Port of `tabvision-server/tools/error_analysis.py` (apr-28 7-bucket harness) targeting `list[TabEvent]` pairs | +| `tabvision/scripts/eval/composite_eval.py` | CLI wrapper: `tabvision-composite-eval --manifest data/eval/composite.toml --output docs/EVAL_REPORTS/composite_baseline_.md` | +| `tabvision/scripts/eval/decompose_tab_errors.py` | CLI wrapper for error_decomposition.py | +| `tabvision/data/eval/composite.toml` | Composite-eval manifest (live; populated incrementally as datasets arrive) | +| `tabvision/data/fixtures/eval/guitarset_05_BN1-129-Eb_comp.jams` | Single-clip JAMS fixture for parser round-trip test | +| `tabvision/data/fixtures/eval/guitar_techs_sample.mid` | Single-clip 6-track MIDI fixture | +| `tabvision/tests/unit/test_parser_guitarset_jams.py` | JAMS parser round-trip test | +| `tabvision/tests/unit/test_parser_guitar_techs_midi.py` | MIDI parser round-trip test | +| `tabvision/tests/unit/test_bootstrap_ci.py` | CI helper correctness on known distributions | +| `tabvision/tests/unit/test_error_decomposition.py` | 7-bucket assignment 
correctness on synthetic predicted/gold pairs | +| `tabvision/tests/integration/test_composite_eval_smoke.py` | End-to-end smoke: 5-clip manifest → tier numbers exist + CIs computed | +| `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` | First baseline report (output of Phase 0E) | +| `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` | First 7-bucket decomposition (output of Phase 0D) | + +### 1.2 Modified files + +| Path | Lines | Change | +|---|---|---| +| `tabvision/tabvision/eval/manifest.py` | the `REQUIRED_CLIP_FIELDS` block (currently ~lines 21-28) | Add `annotation_format` field so parser-dispatch can route by source | +| `tabvision/tabvision/eval/manifest.py` | `validate_manifest()` | Reject any clip whose `source` indicates synthetic origin (e.g. starts with `synthtab/` or `dadagp/`) from a non-train split. This is the R8 cross-contamination guard from the strategy doc. | +| `LICENSES.md` | datasets table | Add Guitar-TECHS (CC-BY-4.0), EGDB (pending), free IR packs as they're acquired | +| `docs/DECISIONS.md` | append | D1–D11 from strategy doc §1 | +| `pyproject.toml` (in `tabvision/`) | `[project.optional-dependencies]` | Add `eval` extra with `pretty_midi`, `pyguitarpro`, `jams` (already used elsewhere — verify before adding) | + +### 1.3 NOT modified + +- `tabvision/tabvision/pipeline.py` — no behavior change in Phase 0. +- `tabvision/tabvision/fusion/**` — no fusion changes. +- `tabvision-server/modal_app.py`, `tabvision-server/app/v1_adapter.py` — no production changes. +- `tabvision-server/app/v1_adapter.py:91` `videoIgnoredByQualityGate` — flagged in strategy doc as a faked diagnostic, but the fix is Phase 6's job, not Phase 0's. + +## 2. Test plan + +Every test must be runnable via `pytest tabvision/tests/...` and skip +cleanly when an optional dependency is missing (PyGuitarPro, jams). +Fixtures go under `tabvision/data/fixtures/eval/`. 
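The bootstrap tests below exercise the helper whose signature is given in §1.1. As a rough, non-authoritative sketch of what that helper might look like, the block below implements a percentile bootstrap over per-clip scores using only the standard library. The percentile method, the default `seed=0`, and the `acceptance_status` function name are assumptions made for illustration; the pass / gap / fail labels follow the rule stated for the baseline report (CI lower bound against target, then mean against target).

```python
from __future__ import annotations

import random
from statistics import fmean
from typing import Callable, Sequence


def bootstrap_ci(
    values: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = fmean,
    n: int = 10_000,
    seed: int = 0,  # assumption: the plan only says the seed is an int
) -> tuple[float, float, float]:
    """Percentile bootstrap over clips: (point_estimate, lower_95, upper_95)."""
    if not values:
        raise ValueError("need at least one per-clip value")
    data = list(values)
    rng = random.Random(seed)  # local RNG, so the same seed gives the same CI
    # Resample clips with replacement and recompute the statistic each time.
    resampled = sorted(
        statistic(rng.choices(data, k=len(data))) for _ in range(n)
    )
    lower = resampled[int(0.025 * (n - 1))]
    upper = resampled[int(0.975 * (n - 1))]
    return statistic(data), lower, upper


def acceptance_status(mean: float, lower_95: float, target: float) -> str:
    """Hypothetical helper: map a tier result to pass / gap / fail."""
    if lower_95 >= target:
        return "pass"  # CI lower bound clears the target
    if mean >= target:
        return "gap"  # mean clears the target, CI lower bound does not
    return "fail"  # mean is below the target
```

A tier result would then be classified with something like `acceptance_status(mean, lower, target=0.90)`. Keeping the random generator local to the call, rather than seeding the global one, is what makes a determinism test such as `test_ci_deterministic_with_seed` straightforward to satisfy.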
+ +### 2.1 Unit tests + +| Test name | Fixture | Assertion | +|---|---|---| +| `test_parser_guitarset_jams.py::test_jams_round_trip_pitch_string_fret` | `guitarset_05_BN1-129-Eb_comp.jams` (small, ~50 notes) | Every emitted `TabEvent` has `0 ≤ string_idx ≤ 5`, `0 ≤ fret ≤ 24`, monotonically non-decreasing `onset_s`. Total event count matches the JAMS namespace's note count. | +| `test_parser_guitarset_jams.py::test_jams_pitch_consistency` | same | For each emitted event, MIDI pitch implied by `(string_idx, fret)` matches the JAMS-reported pitch. | +| `test_parser_guitar_techs_midi.py::test_midi_round_trip_per_string` | `guitar_techs_sample.mid` (6 tracks, 1 per string) | Track index → `string_idx` mapping correct: track 0 → low E (`string_idx=0`), track 5 → high E (`string_idx=5`). | +| `test_parser_guitar_techs_midi.py::test_midi_pitch_to_fret` | same | Per-string MIDI pitch → fret derivation matches expected standard-tuning offsets: E2=40 → fret 0 string 0, A2=45 → fret 5 string 0, etc. | +| `test_bootstrap_ci.py::test_ci_known_normal` | synthetic Gaussian N(0.85, 0.05), n=100 | Returned 95% CI brackets the true mean in approximately 95% of 1000 trials, within a stated tolerance (calibration check; a strict ≥ 95% assertion would fail by chance about half the time even for a perfectly calibrated interval). | +| `test_bootstrap_ci.py::test_ci_handles_small_samples` | n=5 | No exception; CI width sane (≥ standard error). | +| `test_bootstrap_ci.py::test_ci_deterministic_with_seed` | any | Same seed → same CI. | +| `test_error_decomposition.py::test_seven_buckets_assigned` | synthetic gold + predicted `TabEvent` lists, one per bucket | Each ground-truth event lands in the expected bucket: `correct`, `wrong_position_same_pitch`, `pitch_off`, `timing_only`, `missed_onset`, `muted_undetectable`, `extra_detection`. | +| `test_error_decomposition.py::test_share_of_loss_sums_to_one` | mixed gold + predicted | Per-bucket share-of-loss percentages sum to 100% (excluding the `correct` bucket).
| + +### 2.2 Integration tests + +| Test name | Setup | Assertion | +|---|---|---| +| `test_composite_eval_smoke.py::test_five_clip_manifest` | A 5-clip composite manifest using checked-in fixtures (3 GuitarSet, 2 Guitar-TECHS) | `run_composite_eval(manifest)` returns a `CompositeReport` whose tiers include both `clean_acoustic_single_line` and `clean_acoustic_strummed`. Each tier has a non-null `tab_f1_mean` and `tab_f1_ci_95`. | +| `test_composite_eval_smoke.py::test_synthetic_clip_rejected_from_eval` | A manifest with one clip whose `source = "synthtab/test"` and `split = "test"` | `validate_manifest()` raises with a message mentioning the cross-contamination guard. | +| `test_composite_eval_smoke.py::test_egdb_skipped_when_pyguitarpro_missing` | Manifest with an EGDB clip but PyGuitarPro not installed | Run completes successfully; the EGDB clip is reported as `skipped` with reason `parser_dependency_missing`. Other clips still evaluated. | + +### 2.3 What's NOT tested in Phase 0 + +- The actual D2 acceptance numbers — those are the *output* of running + the harness, not a unit-test assertion. The CI gate is what's tested; + whether the system *hits* 0.85/0.90/0.87/0.80 is a question Phases + 1-8 answer. +- Bootstrap confidence on real production data — covered by the + smoke test on fixtures; running on production data is a one-shot + command, not a CI test. + +## 3. Commands + +All commands run from repo root, in the WSL Ubuntu shell, with the +`tabvision` venv active (`source tabvision/venv/bin/activate` or +`pip install -e tabvision[dev,eval]`). 
+ +### 3.1 One-time setup + +```bash +# Install eval extras (PyGuitarPro, pretty_midi, jams) +cd tabvision && pip install -e '.[dev,eval]' && cd - + +# Verify tests pass on the base +pytest tabvision/tests/unit/test_parser_guitarset_jams.py -v +pytest tabvision/tests/unit/test_bootstrap_ci.py -v +``` + +### 3.2 Acquire Guitar-TECHS + +```bash +# Guitar-TECHS is CC-BY-4.0, hosted on Zenodo (see strategy doc §4.1) +mkdir -p ~/mir_datasets/guitar_techs +# Download the dataset archive from the URL in arXiv:2501.03720 +# (resolved at acquisition time; not committed to repo) +# Extract into ~/mir_datasets/guitar_techs/ +ls ~/mir_datasets/guitar_techs/ +``` + +### 3.3 Build the manifest + +```bash +# Generate composite.toml from on-disk datasets +python tabvision/scripts/eval/build_composite_manifest.py \ + --guitarset ~/mir_datasets/guitarset \ + --guitar-techs ~/mir_datasets/guitar_techs \ + --output tabvision/data/eval/composite.toml + +# Validate it +python -c "from tabvision.eval.manifest import validate_manifest; print(validate_manifest('tabvision/data/eval/composite.toml'))" +``` + +### 3.4 Run the baseline composite eval + +```bash +python tabvision/scripts/eval/composite_eval.py \ + --manifest tabvision/data/eval/composite.toml \ + --backend highres \ + --position-prior guitarset-v1 \ + --bootstrap-n 10000 \ + --bootstrap-seed 42 \ + --output docs/EVAL_REPORTS/composite_baseline_2026-05-13.md +``` + +### 3.5 Run the error decomposition + +```bash +python tabvision/scripts/eval/decompose_tab_errors.py \ + --manifest tabvision/data/eval/composite.toml \ + --backend highres \ + --position-prior guitarset-v1 \ + --output docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md +``` + +### 3.6 Verify free-tier compute accounts + +```bash +# W&B: confirm login + a tiny no-op run +wandb login +python -c "import wandb; r = wandb.init(project='tabvision-phase0', mode='online'); r.log({'hello': 1}); r.finish()" + +# Lightning Studios: open a Studio in the browser, run 
+# `nvidia-smi`, screenshot for the DECISIONS.md log
+
+# Kaggle: open a notebook in the browser, run `!nvidia-smi`
+
+# Colab: same
+
+# Modal: skip — used only as last resort per D6
+```
+
+### 3.7 Send the EGDB email
+
+User action — not a command. Template in strategy doc; log the
+date sent and the reply (when it arrives) in `docs/DECISIONS.md`.
+
+## 4. Acceptance outputs
+
+These are the artifacts whose existence + content gates Phase 1.
+
+### 4.1 `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md`
+
+Must contain:
+
+- A per-tier table:
+  - Tier name
+  - Clip count (≥ 20 for any tier claimed against D2)
+  - Mean Tab F1
+  - **95% bootstrap CI lower bound**
+  - Mean Onset F1
+  - Mean Pitch F1
+- Per-source breakdown within each tier (GuitarSet / Guitar-TECHS /
+  EGDB) so we can see whether a tier number is dominated by one
+  source.
+- A "Status vs D2 target" column with one of: **pass** (CI lower ≥
+  target), **gap** (mean ≥ target but CI lower below), **fail** (mean
+  below target).
+- Methodology footer: bootstrap N, seed, parser versions, backend +
+  prior versions, eval-harness commit SHA.
+
+### 4.2 `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md`
+
+Must contain:
+
+- Aggregate 7-bucket table (counts + share-of-loss).
+- Per-tier 7-bucket table.
+- A "biggest lever per tier" callout: which bucket dominates each
+  tier's loss. Phase 1+ priorities derive from this.
+
+### 4.3 `tabvision/data/eval/composite.toml`
+
+Must satisfy `validate_manifest()` and contain:
+
+- ≥ 20 clips for each of: `clean_acoustic_single_line`,
+  `clean_acoustic_strummed`. (Guitar-TECHS additions may bring
+  `clean_electric` to ≥ 20 in Phase 0E; if not, that tier waits for
+  EGDB.)
+- `clean_electric` and `distorted_electric` populated as much as
+  Guitar-TECHS + EGDB-license-resolved allow.
+- No `source = synthtab/...` or `source = dadagp/...` rows in `split =
+  validation` or `split = test`.
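
Editorial sketch (not part of the diff): the §4.3 manifest rules can be illustrated with a small standalone check. This is a minimal sketch under stated assumptions, not the project's `validate_manifest()`. The row layout (dicts with `id`, `tier`, `source`, `split` keys), the `check_rows` name, and the choice to count clips across all splits are hypothetical; only the tier names, the `synthtab/` and `dadagp/` source prefixes, the split names, and the 20-clip minimum come from the text above.

```python
from collections import Counter

# Tiers that §4.3 requires to reach the minimum clip count in Phase 0.
REQUIRED_TIERS = ("clean_acoustic_single_line", "clean_acoustic_strummed")
# Source prefixes that must never appear in held-out splits.
SYNTHETIC_PREFIXES = ("synthtab/", "dadagp/")
HELD_OUT_SPLITS = ("validation", "test")
MIN_CLIPS = 20


def check_rows(rows, min_clips=MIN_CLIPS):
    """Return a list of problems; an empty list means the rows pass.

    `rows` is a hypothetical in-memory form of the manifest: one dict per
    clip with `id`, `tier`, `source`, and `split` keys.
    """
    problems = []

    # Rule 1: each required tier has at least `min_clips` clips
    # (counted across all splits, which is an assumption).
    counts = Counter(row["tier"] for row in rows)
    for tier in REQUIRED_TIERS:
        if counts[tier] < min_clips:
            problems.append(f"{tier}: {counts[tier]} clips, need >= {min_clips}")

    # Rule 2: no synthetic-source rows in validation or test.
    for row in rows:
        synthetic = row["source"].startswith(SYNTHETIC_PREFIXES)
        if synthetic and row["split"] in HELD_OUT_SPLITS:
            problems.append(
                f"{row['id']}: synthetic source {row['source']!r} "
                f"in split {row['split']!r}"
            )

    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every violation in a single pass; the real validator, per §2.2, raises instead.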
+
+### 4.4 `docs/DECISIONS.md` entries
+
+D1–D11 from strategy doc §1, dated 2026-05-13. EGDB email send-date
+and reply (when it arrives) as a separate entry.
+
+### 4.5 CI verification
+
+`pytest tabvision/tests/unit tabvision/tests/integration -v` passes
+on `main` HEAD plus this Phase 0 branch.
+
+## 5. Decision tree
+
+What to do after Phase 0E baseline is in:
+
+- **All four tiers' CI lower bound clears D2** — surprising; sanity
+  check the eval harness, then declare v1 acceptance and skip to
+  Phase 9. This is unlikely given the 2026-05-08 0.61 aggregate.
+- **Strummed CI lower bound clears D2, other tiers gap or fail** —
+  expected case. Proceed to Phase 1 (pitch ceiling lift). The
+  error-decomposition report tells us whether Phase 2 (fine-tune) or
+  Phase 3 (style priors) is the next priority after Phase 1.
+- **All tiers fail** — Phase 0 implementation has a bug, or the
+  highres backend regressed on the broader corpus. Inspect 3-5
+  worst-case clips by hand before any further compute spend.
+- **`distorted_electric` has < 20 clips** — EGDB license is the
+  blocker. Set the tier aside; document the gap in the report; do not
+  publish D2 acceptance until the EGDB row clears.
+
+## 6. Time + compute budget
+
+| Item | Effort | Compute |
+|---|---|---|
+| Parser implementations + tests (1.1) | 1.5 days | none |
+| Manifest extensions + validator hardening (1.2) | 0.5 day | none |
+| Composite + bootstrap + error-decomposition modules (1.1) | 1 day | none |
+| Guitar-TECHS acquisition + manifest population | 0.5 day | none |
+| Baseline + decomposition runs (3.4 + 3.5) | 4-8 wall-clock hours | local CPU |
+| Free-tier compute account verification | 0.5 day | none |
+| EGDB email + DECISIONS.md updates | 15 minutes | none |
+| Report writing | 0.5 day | none |
+| **Total** | **4-5 days engineering** | **~$0** |
+
+## 7. Out of scope for Phase 0
+
+- Any production-pipeline change.
+  No edits to `pipeline.py`, `fusion/`,
+  `audio/`, `video/`, `tabvision-server/`.
+- Fine-tuning, training, or model weight changes.
+- Anything depending on the EGDB license reply (defer to Phase 8 or
+  later).
+- Style-conditional priors (Phase 3).
+- Video pipeline experiments (Phase 6).
+- Synthetic-data generation (research/dev only; not part of Phase 0).
+
+## 8. Done definition
+
+Phase 0 is **done** when:
+
+- All items in §1.1 and §1.2 exist on the impl branch.
+- All tests in §2.1 and §2.2 pass green.
+- `docs/EVAL_REPORTS/composite_baseline_2026-05-13.md` exists and meets
+  §4.1.
+- `docs/EVAL_REPORTS/tab_f1_error_decomposition_2026-05-13.md` exists
+  and meets §4.2.
+- `tabvision/data/eval/composite.toml` exists and validates.
+- `docs/DECISIONS.md` includes D1–D11.
+- EGDB email send-date recorded.
+- Free-tier compute accounts verified (W&B at minimum; Lightning /
+  Kaggle / Colab logged in `docs/DECISIONS.md`).
+
+Then — and only then — the Phase 1 implementation plan gets written.
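
Editorial sketch (not part of the diff): the §4.1 "Status vs D2 target" rule, which the §5 decision tree keys off, can be written down in a few lines. This is a minimal sketch assuming a percentile bootstrap over per-clip Tab F1 scores; the diff does not show how the harness actually computes its interval, so `bootstrap_ci_lower` and `status_vs_target` are hypothetical names and the resampling scheme is an assumption. The defaults mirror the `--bootstrap-n 10000` and `--bootstrap-seed 42` flags in §3.4.

```python
import random
from statistics import fmean


def bootstrap_ci_lower(scores, n_resamples=10_000, seed=42, alpha=0.05):
    """Lower bound of a (1 - alpha) percentile-bootstrap CI on the mean.

    Assumption: clips are resampled with replacement and the alpha/2
    quantile of the resampled means is taken as the lower bound.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    return means[int((alpha / 2) * n_resamples)]


def status_vs_target(scores, target, **bootstrap_kwargs):
    """Classify a tier as 'pass', 'gap', or 'fail' per the §4.1 wording."""
    lower = bootstrap_ci_lower(scores, **bootstrap_kwargs)
    if lower >= target:
        return "pass"  # CI lower bound clears the target
    if fmean(scores) >= target:
        return "gap"   # mean clears the target, CI lower bound does not
    return "fail"      # mean is below the target
```

For example, `status_vs_target(tier_scores, 0.90)` would classify the strummed tier against its v1 acceptance target, where `tier_scores` is a hypothetical list of per-clip Tab F1 values. Fixing the seed keeps the classification reproducible between runs, which is why the methodology footer records it.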