Land v1 acoustic: composite eval, acoustic scope + honest targets, fusion continuity #13
Merged
First Phase 0 chunk per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md §1.1. Foundations for the composite-eval workflow; no production behavior changes.
- tabvision.eval.parsers.registry: ParserFn protocol + register_parser / get_parser / list_parsers. Each source-specific annotation format gets a parser that registers itself at import time; composite-eval dispatches by Manifest.clip.annotation_format.
- tabvision.eval.parsers.guitarset_jams: thin wrapper exposing the existing tabvision.eval.guitarset_audio.parse_guitarset_jams under the new uniform interface. No logic duplication.
- tabvision.eval.bootstrap: bootstrap_ci() returning a BootstrapResult (statistic, lower, upper, n_observations, n_bootstrap, confidence). Implements the per-tier acceptance gate from the strategy doc §5 (lower_95_CI >= target, not just mean >= target).
- 21 unit tests, all passing. Existing test_guitarset_audio_eval.py unchanged and still green. Ruff + mypy clean on the new files.
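A minimal sketch of the registry pattern described above, assuming a decorator-style register_parser and a plain callable in place of the ParserFn protocol. The real signatures in tabvision.eval.parsers.registry are not shown in this log, so the decorator form, the module-level dict, and the stub parser are all illustrative assumptions.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for the ParserFn protocol: a callable that takes an
# annotation path and returns a list of events.
ParserFn = Callable[[str], List[Any]]

_PARSERS: Dict[str, ParserFn] = {}


def register_parser(annotation_format: str) -> Callable[[ParserFn], ParserFn]:
    """Return a decorator that registers a parser under a format name."""

    def decorator(fn: ParserFn) -> ParserFn:
        if annotation_format in _PARSERS:
            raise ValueError(f"parser already registered: {annotation_format}")
        _PARSERS[annotation_format] = fn
        return fn

    return decorator


def get_parser(annotation_format: str) -> ParserFn:
    """Look up a parser, listing the known formats when the lookup fails."""
    try:
        return _PARSERS[annotation_format]
    except KeyError:
        known = ", ".join(sorted(_PARSERS)) or "<none>"
        raise KeyError(f"no parser for {annotation_format!r}; known: {known}") from None


def list_parsers() -> List[str]:
    return sorted(_PARSERS)


@register_parser("guitarset_jams")
def parse_guitarset_jams_stub(path: str) -> List[Any]:
    # Placeholder body; the real wrapper delegates to the existing parser.
    return []
```

Registering at import time means a composite run only needs to import the parser modules; dispatch is then a dictionary lookup keyed by each clip's annotation_format.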
…tar-techs parser
Phase 0 items 1-2 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
Manifest (tabvision/tabvision/eval/manifest.py):
- Add 'annotation_format' to REQUIRED_CLIP_FIELDS so composite-eval can route each clip to the correct parser via the registry.
- Add SYNTHETIC_SOURCE_PREFIXES + cross-contamination guard: clips whose source starts with 'synthtab/', 'dadagp/', or 'synthetic/' are rejected in 'validation' and 'test' splits. Permitted in 'train'. Implements R8 from the strategy doc §7.
Guitar-TECHS parser (tabvision/tabvision/eval/parsers/guitar_techs_midi.py):
- Parses 6-track MIDI (one track per string, low E first) into list[TabEvent] via pretty_midi. Per-string fret derived from MIDI pitch minus open-string pitch. Drops out-of-range frets.
- Optional 'track_to_string' kwarg for releases with a different ordering. Default = identity (low E = 0, high E = 5).
- 9 unit tests using pretty_midi-built fixtures; importorskip when pretty_midi not installed.
Updated manifest placeholder TOML schema with annotation_format and synthetic-source guard documentation. 4 new manifest validator tests. All 15 new tests pass; existing test_eval_manifest.py / test_parsers_registry.py still green. Ruff + mypy clean.
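The fret derivation can be sketched as follows, assuming standard tuning and an upper fret bound of 24. Both are assumptions: the log states neither the tuning table nor the range the parser enforces, and fret_for is a hypothetical helper name.

```python
from typing import Dict, Optional

# Standard-tuning open-string MIDI pitches, low E first (E2 A2 D3 G3 B3 E4).
OPEN_STRING_MIDI = (40, 45, 50, 55, 59, 64)
MAX_FRET = 24  # assumed upper bound; the real parser's limit is not stated


def fret_for(
    track_index: int,
    midi_pitch: int,
    track_to_string: Optional[Dict[int, int]] = None,
) -> Optional[int]:
    """Map a note on a per-string MIDI track to a fret, or None if out of range."""
    # Default mapping is the identity: track 0 is low E, track 5 is high E.
    string = track_index if track_to_string is None else track_to_string[track_index]
    fret = midi_pitch - OPEN_STRING_MIDI[string]
    if 0 <= fret <= MAX_FRET:
        return fret
    return None  # dropped, mirroring the "drops out-of-range frets" behavior
```

For example, MIDI pitch 69 (A4) on the high-E track resolves to fret 5, while pitch 39 on the low-E track is below the open string and is dropped.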
Phase 0 item 3 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md. Six-bucket decomposition matching the apr-28 methodology in tabvision-server/tools/outputs/errors-2026-04-28_185743.md, ported to operate on v1 §8 TabEvent lists:
- correct: string + fret + onset all match within tolerance
- wrong_position_same_pitch: pitch matches, position doesn't
- pitch_off: onset matches but pitch and position differ
- timing_only: pos or pitch matches outside strict tolerance but within extended tolerance
- missed_onset: gold event with no nearby predicted event
- extra_detection: predicted event unmatched by either pass
(The seventh apr-28 bucket, muted_undetectable, needs a muted/X flag the v1 TabEvent contract does not yet carry; deferred.)
Two-pass greedy matcher prioritizes (a) strict-tolerance closest onset, then (b) extended-tolerance pos-or-pitch match for timing_only. share_of_loss() returns per-bucket percentages of recoverable loss. aggregate_decompositions() sums per-track decompositions for the per-tier rollup that composite.py will produce.
16 unit tests covering each bucket in isolation, the mixed scenario, share-of-loss math, aggregation, and edge cases (multiple gold at same time, greedy onset-closest selection, invalid tolerances). Ruff + mypy clean.
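The share-of-loss and aggregation arithmetic can be illustrated with plain dictionaries of bucket counts. This is a sketch under the assumption that "recoverable loss" means every bucket other than correct; the real share_of_loss() and aggregate_decompositions() operate on the project's own decomposition type, which is not shown here.

```python
from typing import Dict, Iterable

# Assumption: every non-"correct" bucket counts toward recoverable loss.
LOSS_BUCKETS = (
    "wrong_position_same_pitch",
    "pitch_off",
    "timing_only",
    "missed_onset",
    "extra_detection",
)


def share_of_loss(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each loss bucket as a percentage of the total loss."""
    total = sum(counts.get(bucket, 0) for bucket in LOSS_BUCKETS)
    if total == 0:
        return {bucket: 0.0 for bucket in LOSS_BUCKETS}
    return {bucket: 100.0 * counts.get(bucket, 0) / total for bucket in LOSS_BUCKETS}


def aggregate(decompositions: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum per-track bucket counts into one rollup."""
    rollup: Dict[str, int] = {"correct": 0, **{b: 0 for b in LOSS_BUCKETS}}
    for counts in decompositions:
        for bucket, value in counts.items():
            rollup[bucket] = rollup.get(bucket, 0) + value
    return rollup
```

With 30 wrong-position, 10 pitch-off, 5 missed, and 5 extra events, the total loss is 50, so wrong_position_same_pitch accounts for 60% regardless of how many events were correct.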
Phase 0 item 4 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
tabvision.eval.composite.run_composite_eval:
- Reads + validates a multi-source manifest, dispatches each clip through the registered parser, runs a user-supplied predictor over the media, and computes onset / pitch / tab F1 + 95% bootstrap CIs per tier plus the 6-bucket error decomposition.
- Predictor is injected so the harness is testable without the heavy audio backend; CLI wires up tabvision.pipeline.run_pipeline.
- Train-split clips skipped by default (DEFAULT_EVAL_SPLITS = validation + test).
- CompositeReport.tab_f1_acceptance(targets) classifies each tier as pass / gap / fail / missing based on the lower_95_CI >= target gate from strategy doc §5.
tabvision.eval.metrics: added public event_f1() + EventF1Result for onset-only and onset+pitch matching. The private _score_event_f1 in guitarset_audio is left untouched (Phase 0 ground rule: no production behavior changes).
11 integration smoke tests covering perfect predictor (all tiers pass), shifted predictor (wrong_position_same_pitch dominates), train-split skipping, manifest validation failures, parser-format lookup failures, TABVISION_DATA_ROOT substitution via env + function arg, empty gold edge case, and the acceptance helper. Ruff + mypy clean.
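To make the lower_95_CI >= target gate concrete, here is a minimal percentile-bootstrap sketch over per-clip scores. The BootstrapResult field names come from the commit message; the percentile method, the default arguments, and the passes_gate helper are assumptions rather than the harness's actual implementation.

```python
import random
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BootstrapResult:
    statistic: float
    lower: float
    upper: float
    n_observations: int
    n_bootstrap: int
    confidence: float


def bootstrap_ci(
    values: Sequence[float],
    n_bootstrap: int = 1000,
    confidence: float = 0.95,
    seed: int = 0,
) -> BootstrapResult:
    """Percentile bootstrap of the mean (assumed method), seeded for repeatability."""
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_bootstrap))
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * (n_bootstrap - 1))]
    upper = means[int((1.0 - alpha) * (n_bootstrap - 1))]
    return BootstrapResult(sum(values) / n, lower, upper, n, n_bootstrap, confidence)


def passes_gate(result: BootstrapResult, target: float) -> bool:
    # The gate compares the lower confidence bound, not the mean, to the target.
    return result.lower >= target
```

Gating on the lower bound means a tier with a mean just above target but a wide interval still does not pass, which is the stricter reading the strategy doc asks for.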
Phase 0 item 5 per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
tabvision.eval.composite:
- DEFAULT_TIER_TARGETS = {0.85/0.90/0.87/0.80} from SPEC §1.4.1.
- format_baseline_markdown(report, targets, ...) renders the per-tier
baseline table with pass/gap/fail/missing status, per-source
breakdown, and methodology footer per Phase 0 impl plan §4.1.
- format_decomposition_markdown(report) renders the aggregate +
per-tier 7-bucket (currently 6) error breakdown per §4.2.
- make_run_pipeline_predictor(...) wraps tabvision.pipeline.run_pipeline
with lazy import — composite-eval --help works without the
audio-highres extras installed.
- main() — argparse CLI exposed as 'tabvision-composite-eval'.
Supports --backend, --position-prior (or 'none'), --melodic-prior,
--enable-video, --bootstrap-{n,seed}, --onset-tolerance-s,
--splits, --media-root, --annotation-root, --eval-harness-sha.
Single run can emit both the baseline and decomposition reports
via --decomposition-output, so the separate decompose_tab_errors.py
script listed in the Phase 0 plan is consolidated into this one CLI.
tabvision/scripts/eval/composite_eval.py: 5-line shim that invokes
the module's main().
7 unit tests on the formatters: required sections, pass/gap/fail/missing
classification, methodology fields, decomposition aggregate sums,
default-target coverage. All 20 composite tests + 73 Phase 0 eval tests
pass. Ruff + mypy clean.
Phase 0 item 6a per docs/plans/2026-05-13-tab-f1-phase-0-implementation.md.
tabvision.eval.manifest_builder:
- scan_guitarset(root, validation_player): discovers <root>/annotation/*.jams paired with <root>/audio_mono-mic/*_mic.wav; maps _comp/_solo suffix to clean_acoustic_strummed/single_line tier.
- scan_guitar_techs(root): stub returning [] until the dataset is acquired and its on-disk layout is verified.
- apply_limits(entries, max_clips_per_tier, total_limit): deterministic per-tier cap + total cap, sorted by clip id first so re-runs produce byte-stable output.
- build_manifest(splits=...): full pipeline; supports filtering by split so smoke runs target the validation set directly.
- render_toml(entries, header_comment): TOML output with proper escaping and a generated-by header.
- _refuse_synthetic_in_eval_splits: pre-write guard mirroring the validator's R8 cross-contamination check.
- main() CLI: --guitarset, --guitar-techs, --output, --splits, --max-clips-per-tier, --limit. Returns rc=1 on no clips, rc=2 on validation failure, rc=0 on success.
tabvision/scripts/eval/build_composite_manifest.py: thin CLI shim.
Hygiene pass per PR feedback:
- manifest.toml schema comment now lists guitar_techs_midi alongside guitarset_jams under 'known formats'.
- Error-decomposition framing in composite.py and error_decomposition.py now uses 'six-bucket port of the apr-28 7-bucket harness' instead of '7-bucket' (we only populate 6; muted_undetectable is deferred).
- composite.py and manifest_builder.py both gain if __name__ == '__main__' blocks so 'python -m tabvision.eval.composite' and 'python -m tabvision.eval.manifest_builder' invoke main() cleanly.
20 manifest-builder tests pass (scan, limits, render, summarise, build_manifest, --splits filter, end-to-end CLI). Full Phase 0 test suite still green. Ruff + mypy clean.
Smoke-validated against on-disk GuitarSet: --max-clips-per-tier 2 --splits validation produces a 4-clip manifest that the composite eval CLI processes end-to-end via the real highres backend + guitarset-v1 prior, emitting baseline + decomposition reports with sensible numbers (strummed Tab F1 ~0.75, single-line ~0.29 on this tiny sample).
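The deterministic capping described for apply_limits can be sketched as below. The Entry dataclass is a hypothetical stand-in for the builder's manifest entry type, and the order of operations (per-tier cap first, then total cap) is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Entry:
    # Hypothetical minimal manifest entry; the real type carries more fields.
    clip_id: str
    tier: str


def apply_limits(
    entries: Iterable[Entry],
    max_clips_per_tier: Optional[int] = None,
    total_limit: Optional[int] = None,
) -> List[Entry]:
    """Cap clips per tier, then overall, in clip-id order for stable output."""
    kept: List[Entry] = []
    per_tier: Dict[str, int] = defaultdict(int)
    # Sorting first makes the result independent of filesystem scan order.
    for entry in sorted(entries, key=lambda e: e.clip_id):
        if max_clips_per_tier is not None and per_tier[entry.tier] >= max_clips_per_tier:
            continue
        per_tier[entry.tier] += 1
        kept.append(entry)
    if total_limit is not None:
        kept = kept[:total_limit]
    return kept
```

With two tiers and max_clips_per_tier=2 this yields four clips, which matches the shape of the smoke run above.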
Closes the Phase 0 acceptance gate for the 2 tiers reachable from on-disk data (clean acoustic single-line + strummed via GuitarSet held-out validation). Clean electric and distorted electric remain 'missing' pending Guitar-TECHS / EGDB acquisition.
Matcher fix (tabvision/tabvision/eval/error_decomposition.py):
- decompose_errors() now uses priority-based selection within each onset tolerance window: same (string, fret) > same pitch_midi > onset-closest. Previously a greedy onset-only matcher mis-paired chord-cluster events whose on-the-wire ordering differed from ground truth, inflating pitch_off on strummed (3387 → 486 with the fix). event_f1's pitch-matching semantics are now mirrored in the decomposition.
- Added test_chord_cluster_priority_pitch_over_onset and test_chord_cluster_priority_falls_back_to_position_match_then_pitch to lock the new behavior.
Reports (docs/EVAL_REPORTS/*):
- composite_baseline_2026-05-13.md: first artifact under SPEC §1.4.1: per-tier Tab F1 + Onset/Pitch F1 + 95% bootstrap CI + pass/gap/fail/missing status. Headline: both covered tiers FAIL by ~25-35 pp (single-line mean 0.5076, strummed 0.6708).
- tab_f1_error_decomposition_2026-05-13.md: companion 6-bucket breakdown. Headline: wrong_position_same_pitch dominates loss on every tier (77% of single-line, 50% of strummed, 57% aggregate). Confirms the strategy doc §2 diagnostic.
Eval manifest (tabvision/data/eval/composite.toml):
- 60 player-05 validation clips, byte-stable output of the manifest builder. Strummed and single-line tiers fully covered.
LICENSES.md:
- GuitarSet: marked '✅ used for 2026-05-13 baseline'.
- Guitar-TECHS: added as planned acquisition (CC-BY-4.0).
- EGDB: status updated; author email pending.
- GOAT: marked ❌ DROPPED (request-only research-only).
- SynthTab: marked ❌ DROPPED from default pipeline (CC-BY-NC-4.0).
- User clips: marked ⛔ banned per D10.
- DadaGP: marked research/dev only; not in default pipeline.
DECISIONS.md: single 2026-05-13 entry summarising D1-D11 from the design plan, with per-tier targets table and the 2026-05-13 baseline numbers inlined so the decision record stands alone. 104 tests pass; ruff + mypy clean.
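The priority order from the matcher fix (same position, then same pitch, then closest onset) can be expressed as a sort key. This is an illustrative sketch only: the Event dataclass, the pick_match helper, and the 50 ms default tolerance are assumptions, and the real decompose_errors() also tracks which predictions are already consumed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    # Hypothetical minimal event; the real TabEvent contract has more fields.
    onset_s: float
    string: int
    fret: int
    pitch_midi: int


def pick_match(
    gold: Event, candidates: Sequence[Event], tolerance_s: float = 0.05
) -> Optional[Event]:
    """Choose the best prediction for one gold event inside the onset window."""
    in_window = [c for c in candidates if abs(c.onset_s - gold.onset_s) <= tolerance_s]
    if not in_window:
        return None

    def rank(c: Event) -> Tuple[int, float]:
        if (c.string, c.fret) == (gold.string, gold.fret):
            tier = 0  # exact position match wins
        elif c.pitch_midi == gold.pitch_midi:
            tier = 1  # same pitch on another string comes next
        else:
            tier = 2  # otherwise fall back to onset distance alone
        return (tier, abs(c.onset_s - gold.onset_s))

    return min(in_window, key=rank)
```

In a strummed chord several predictions share nearly the same onset, so ranking by onset distance alone pairs notes almost arbitrarily; ranking by position and pitch first keeps a correctly transcribed note from being counted as pitch_off.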
…ording
Three small fixes flagged in review of the Phase 0 baseline:
(a) Portable manifest. tabvision.eval.manifest_builder now accepts
--data-root PATH; render_toml rewrites media/annotation paths
that fall under that root as a TABVISION_DATA_ROOT token plus '/<rest>'. The
composite-eval CLI already expanded that token via env var or
--media-root/--annotation-root, so checked-in manifests are now
portable across developer machines. Re-generated
tabvision/data/eval/composite.toml with the new flag so the
committed manifest no longer carries /home/gilhooleyp/... paths.
+3 unit tests covering the rewrite + the no-data-root path.
(b) Real SHA in the baseline report. The 'Eval-harness SHA' field
in docs/EVAL_REPORTS/composite_baseline_2026-05-13.md now cites
2ec4849 (the commit that landed both the baseline and the
chord-cluster matcher fix), instead of the ad-hoc
'354571b-matcher-fix' label used at run time.
(c) Stale '7-bucket' wording cleared in the planning docs and one
test docstring. The implementation is a six-bucket port; only
references to the original apr-28 7-bucket harness keep the
historical name.
Verification ran in WSL:
- ruff: passes on changed files.
- mypy: clean on the 8 Phase 0 eval source files (parsers/, bootstrap,
error_decomposition, composite, manifest_builder). Broader
tabvision-wide mypy hits older Phase 5 diagnostics not in this PR's
scope.
- 107 tests pass across the focused Phase 0 + existing eval suite.
No production behavior change; the manifest still resolves to the
same 60 player-05 validation clips.
…otstrap CI, error decomposition
Lands origin/impl/tab-f1-phase-0 (9 commits): composite.toml eval manifest, guitarset_jams + guitar_techs_midi parsers, bootstrap CI helper, six-bucket error decomposition, and first per-tier baseline. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… relaxation) SPEC §1.4.1 rewritten to supersede the 2026-05-13 amendment: v1 commits to the original §1.4 per-tier targets (0.94/0.86/0.90/0.82) AND aggregate Tab F1 >= 0.88. The relaxed 0.85/0.90/0.87/0.80 table is withdrawn; the aggregate is un-retired. Keeps the amendment's methodology (public-corpus composite, per-tier bootstrap CIs, lower_95_CI >= target). SPEC §1.4 is now the single source of truth; CLAUDE.md notes the commitment and the design doc D1/D2 are bannered as historical. Honest framing retained in-spec: single-line tier must go 0.51 -> 0.94; a stretch goal adopted as the gate, not a forecast. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add an 'egdb' subcommand to scripts.acquire.datasets mirroring the roboflow pattern: downloads from the author-granted access URL (--url / $EGDB_DOWNLOAD_URL), optional SHA-256 verify, zip/tar extract, idempotent. No URL/data is hard-coded or committed. LICENSES.md flips EGDB to author-granted eval-use (2026-06-01), eval-only, not redistributed, not a shipped-weight substrate. .env.example gains EGDB_DOWNLOAD_URL. ACTION REQUIRED (user): drop in the grant URL to run it, and file the grant email under docs/ + log in docs/DECISIONS.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… AGENTS.md Remove abandoned multi-agent dev experiment (.claude-agent-farm.json, tabvision_agent_farm_config.json, tabvision_agent_farm_prompt.txt, tabvision_agent_config.json, tabvision_prompt.txt) and the stale coordination/ work queue (referenced frozen v0 paths). Remove stray combined_typechecker_and_linter_problems.txt. Banner tabvision_specification.md as historical/non-canonical (SPEC.md is canonical; still linked from AUDIT/README so kept, not deleted). Track AGENTS.md (Codex sibling of CLAUDE.md). All recoverable via git history. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Verified 2026-06-01 against the project page (https://ss12f32v.github.io/Guitar-Transcription/): EGDB audio is a *public* Google Drive folder; access is open and the *license* was the only gate (repo has no LICENSE file -> author's portfolio-use grant on record clears it).
- egdb acquirer now defaults to the public Drive folder and downloads via gdown (folder-aware), with a clean manual-download fallback when gdown is absent. Direct-archive path kept for mirrors.
- LICENSES.md / .env.example corrected: access-open, license-is-the-gate; EGDB_DOWNLOAD_URL is now an optional mirror override, not a required secret.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… scanner, runbook
Wires the cross-dataset prior-generalization check to run locally on CPU:
- scripts.acquire.datasets gains 'guitarset' (mirdata → the layout scan_guitarset/composite.toml expect) and 'guitar-techs' (Zenodo record 14963133 via the public API, no hard-coded filenames; prints the tree to verify layout). Both CC-BY-4.0, eval-only, idempotent.
- Implements the stubbed manifest_builder.scan_guitar_techs: pairs 6-track MIDI with same-stem/prefix-stem audio (DI/clean preferred), tier=clean_electric (the tier GuitarSet can't cover + the #2 cross-dataset target), performer split, skips stretch-technique clips. Layout inferred from arXiv:2501.03720; flagged to verify against the first real download.
- test_scan_guitar_techs.py pins the heuristics on a synthetic tree (runs under pytest or as a plain script; validated here without the dep).
- docs/plans/2026-06-02-tab-f1-phase-0-local-run.md: turnkey runbook (install → acquire → build manifests → prior on/off → read the verdict).
- LICENSES.md: Guitar-TECHS row → acquirer/scanner landed, eval-only.
#3 fine-tune stays on free GPU (no CUDA locally). EGDB folds in a 4th tier later.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The acquirers printed Unicode arrows/ellipses/em-dashes; on a Windows cp1252 console print() raised UnicodeEncodeError on U+2192 before mirdata ran, killing the guitarset download. Replace them with their ASCII equivalents (->, ..., -). Run acquirers with PYTHONUTF8=1 as belt-and-suspenders (also shields third-party console output). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
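The failure mode is easy to reproduce without a Windows console, because cp1252 has no mapping for U+2192 (the right arrow) even though it does cover the ellipsis and the em-dash. The helper names below are hypothetical; the sketch only shows the encoding check and the ASCII substitution.

```python
# Unicode punctuation mapped to ASCII equivalents, written as escapes so this
# file itself stays ASCII-only.
ASCII_REPLACEMENTS = {
    "\u2192": "->",   # rightwards arrow
    "\u2026": "...",  # horizontal ellipsis
    "\u2014": "-",    # em dash
}


def to_ascii(text: str) -> str:
    """Replace the known Unicode punctuation with ASCII equivalents."""
    for source, replacement in ASCII_REPLACEMENTS.items():
        text = text.replace(source, replacement)
    return text


def encodable(text: str, encoding: str = "cp1252") -> bool:
    """Report whether print() could emit this text on a console using `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True
```

Only the arrow actually breaks cp1252, which is consistent with the error being reported on U+2192; replacing all three keeps the output readable on any console.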
mirdata download() pulled all partitions (~10GB incl. 3.36GB hex-pickup zips + mix) but the composite eval reads only annotation/*.jams + audio_mono-mic/*_mic.wav. Pass partial_download=['annotations','audio_mic']; harden idempotency to require both annotation jams AND mono-mic wavs (so a partial leftover won't false-skip). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Verified against Zenodo record 14963133: clips are <Pn_category>/midi/midi_<content>.mid paired with <Pn_category>/audio/<capture>/<capture>_<content>.<ext>. MIDI and audio share the <content> token, NOT a prefix — the inferred prefix-matcher would have found ZERO clips. Now: pair by content token scoped to the Pn_category group, prefer direct-input over mic'd amp, performer split from the 'Pn'/'playerNN' prefix, skip __MACOSX cruft + stretch-technique paths. Validated on the real partial download (58 clips paired correctly). Test rewritten to the real layout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
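A sketch of content-token pairing over the layout described above, working on relative path strings. The capture directory names (directinput, micamp) are hypothetical placeholders, since the log does not give the real ones, and pair_midi_with_audio is an illustrative helper rather than the scanner's actual function.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple

# Hypothetical capture names, most preferred first (direct input before mic'd amp).
CAPTURE_PREFERENCE = ("directinput", "micamp")


def pair_midi_with_audio(paths: List[str]) -> List[Tuple[str, str]]:
    """Pair each MIDI file with audio sharing its content token in the same group."""
    midi: Dict[Tuple[str, str], str] = {}
    audio: Dict[Tuple[str, str], Dict[str, str]] = {}
    for raw in paths:
        p = PurePosixPath(raw)
        parts = p.parts
        if "__MACOSX" in parts:
            continue  # skip archive cruft
        if len(parts) >= 3 and parts[-2] == "midi" and p.stem.startswith("midi_"):
            # <group>/midi/midi_<content>.mid
            midi[(parts[-3], p.stem[len("midi_"):])] = raw
        elif len(parts) >= 4 and parts[-3] == "audio":
            # <group>/audio/<capture>/<capture>_<content>.<ext>
            capture = parts[-2]
            prefix = capture + "_"
            if p.stem.startswith(prefix):
                key = (parts[-4], p.stem[len(prefix):])
                audio.setdefault(key, {})[capture] = raw
    pairs: List[Tuple[str, str]] = []
    for key in sorted(midi):
        captures = audio.get(key, {})
        for capture in CAPTURE_PREFERENCE:
            if capture in captures:
                pairs.append((midi[key], captures[capture]))
                break
    return pairs
```

Keying on (group, content) rather than on a filename prefix is what avoids the zero-match outcome: the MIDI stem starts with midi_ and the audio stem starts with the capture name, so the two never share a prefix.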
The whole-dir idempotency false-skipped any partial download, and one network blip (mid P1_scales.zip over VPN) aborted the entire multi-GB fetch. Now: skip per-file when the extracted dir already exists (re-run resumes), drop partials and continue past a failed file instead of aborting, and handle corrupt zips. Re-running the command now completes only the missing categories. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Four local CPU eval reports + cross-dataset summary + DECISIONS entry. GuitarSet acoustic reproduces the +22pp prior lift (single 0.219->0.508, strummed 0.475->0.671, onset/pitch ~0.93). Guitar-TECHS electric: prior lift +1.3pp (within 95% CI), onset/pitch collapse to 0.75/0.73. Dominant finding: the highres acoustic backbone doesn't generalize to electric, capping Tab F1 ~0.12 and blocking the SPEC clean/distorted-electric tiers. Next step pivots from a GuitarSet-only fine-tune to evaluating an electric-capable backbone. (Machine-local manifests with absolute paths not committed — harness _relativize_to_data_root has a Windows-separator bug; gitignored + flagged.) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… help electric highres-fl was dead code — it passed instrument='guitar_fl', but the pinned hf_midi_transcription only knows saxophone/bass/guitar/piano. guitar-fl.pth does exist in the HF repo, so load it by passing the full repo/file path as checkpoint_path (instrument='guitar' for the architecture). Verified end-to-end. Result (paired, 12 Guitar-TECHS chord clips): guitar_fl ~= guitar_gaps on electric (pitch 0.687 vs 0.679, onset 0.715 vs 0.732 — within noise). The cheap checkpoint swap does NOT close the electric gap; both ~0.68 pitch vs ~0.93 acoustic. Electric needs fine-tuning on electric data. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Decision: train a SEPARATE guitar-electric checkpoint (fine-tuned from gaps), routed by the declared tone — avoids catastrophic forgetting of the acoustic 0.93; the architecture already routes by checkpoint (highres vs highres-fl). Honest blocker captured: no highres training code in-repo or in the inference packages (audio_finetune.py is a scaffold; the 2026-04-24 design targets Basic Pitch). Step 0 is standing up the upstream hFT-Transformer/piano_transcription training code. Data (Guitar-TECHS, CC-BY) is on disk; split by performer; free GPU per D6; acceptance = electric pitch F1 0.73 -> >=0.88, acoustic unchanged. Includes a Basic-Pitch fallback path and the highres-electric integration steps. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Evidence-based scope (DECISIONS 2026-06-02): clean-electric measured 0.12 (acoustic-trained backbone, no in-repo training code), so the electric tiers move to v2, delivered as a SEPARATE highres-electric checkpoint routed by the declared instrument (avoids catastrophic forgetting of the acoustic 0.93; the architecture already routes by checkpoint).
- backend.py registers highres-electric; highres.py adds the guitar_electric variant guarded by TABVISION_HIGHRES_ELECTRIC_CKPT (fails fast with a clear message until the v2 checkpoint is trained).
- pipeline.audio_backend_for_session() routes electric -> highres-electric; run_pipeline(audio_backend_name='auto') enables the toggle. Acoustic untouched.
- tests/unit/test_audio_routing.py (routing + guard).
- SPEC §1.4.1 + CLAUDE.md: v1 = acoustic tiers (0.94/0.86) + aggregate 0.88; electric deferred to v2 with the toggle shipped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
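The routing plus fail-fast guard can be sketched as two small functions. The backend names and the environment variable come from the commit message; the function signatures, the instrument string values, and the error wording are assumptions, since the real audio_backend_for_session() takes a session object not shown here.

```python
import os
from typing import Mapping, Optional

ELECTRIC_CKPT_ENV = "TABVISION_HIGHRES_ELECTRIC_CKPT"


def audio_backend_for_instrument(instrument: str) -> str:
    """Route by declared instrument (hypothetical simplified signature)."""
    return "highres-electric" if instrument == "electric" else "highres"


def resolve_electric_checkpoint(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the electric checkpoint path, failing fast when it is not configured."""
    env = os.environ if env is None else env
    path = env.get(ELECTRIC_CKPT_ENV)
    if not path:
        raise RuntimeError(
            f"{ELECTRIC_CKPT_ENV} is not set; the electric checkpoint is a v2 "
            "deliverable and must be provided explicitly"
        )
    return path
```

Keeping the guard separate from the routing means acoustic sessions never touch the electric code path, while an electric session fails immediately with an actionable message instead of silently falling back to the acoustic checkpoint.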
Diagnosed the single-line gap (docs/EVAL_REPORTS/acoustic_single_line_2026-06-02.md): the loss is 322 wrong_position_same_pitch vs 8 pitch_off — audio can't resolve which STRING a (correct) pitch was played on. Melodic prior regresses it; hand-position continuity (POSITION_SHIFT_COST 0.05 -> 2.5, now the default + env knob) gives a real but small lift (single 0.508->0.523, strummed 0.671->0.676, no regression) and does NOT reach 0.94. Single-line is information-limited. SPEC §1.4.1 + CLAUDE.md: honest audio-only v1 targets — single-line >= 0.45, strummed >= 0.60, aggregate >= 0.55 (lower_95 >= target); the 0.94/0.86 become the v1.1 video-assisted reference (video resolves the string ambiguity). DECISIONS records the evidence chain so the dead ends aren't re-ground. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The old prefix check hard-coded a forward slash, so on Windows (backslash absolute paths) it never matched and leaked absolute drive paths into checked-in manifests. Switch to Path.relative_to + as_posix, separator-correct on the native platform, always emitting forward-slash TABVISION_DATA_ROOT tokens. Adds a PureWindowsPath regression test exercising Windows behaviour from POSIX CI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
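A minimal sketch of the separator-correct rewrite using pure paths, which is also why the behaviour can be tested for Windows from a POSIX machine. The token spelling ${TABVISION_DATA_ROOT} and the function name are assumptions; the log only says the output carries forward-slash TABVISION_DATA_ROOT tokens.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath

TOKEN = "${TABVISION_DATA_ROOT}"  # assumed spelling of the data-root token


def relativize_to_data_root(path: PurePath, data_root: PurePath) -> str:
    """Rewrite a path under data_root as TOKEN/<rest>, always with forward slashes."""
    try:
        rest = path.relative_to(data_root)
    except ValueError:
        # Not under the data root: leave the path alone, normalised to '/'.
        return path.as_posix()
    return f"{TOKEN}/{rest.as_posix()}"


# Pure paths apply Windows or POSIX rules regardless of the host platform.
windows_example = relativize_to_data_root(
    PureWindowsPath(r"C:\data\guitarset\audio\clip_mic.wav"),
    PureWindowsPath(r"C:\data"),
)
posix_example = relativize_to_data_root(
    PurePosixPath("/data/guitarset/audio/clip_mic.wav"),
    PurePosixPath("/data"),
)
```

Both examples produce the same tokenised string, so a manifest generated on Windows and one generated on Linux are identical for clips under the data root.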
Pre-existing Phase 0 files were committed unformatted and failed CI's ruff format --check. Mechanical formatting only; no behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Lands the full v1 acoustic program onto main (26 commits). Supersedes #11: Phase 0 is a strict subset of this branch; #11 will be closed once this merges.
What this lands
- _relativize_to_data_root uses Path.relative_to / as_posix instead of a hard-coded / prefix, so checked-in manifests no longer leak C:\... paths. Adds a PureWindowsPath regression test.
- ruff format pass over 12 pre-existing unformatted Phase 0 files, the only thing red on the CI for #11 (Phase 0: per-tier composite eval + first GuitarSet baseline).
Verification (local)
- ruff check clean, ruff format --check clean, mypy tabvision clean (56 files), eval/unit tests pass.
- docs/EVAL_REPORTS/ + docs/DECISIONS.md.
🤖 Generated with Claude Code