Skip to content

[Followup] Generalize compilation_track_artist → library_track (cross-cache-identity §3.2 step 2) #801

@jakebromberg

Description

@jakebromberg

Summary

Step 2 of the two-step plan in compilation-track-schema-design.md. Generalize compilation_track_artistlibrary_track so the track model covers all releases (not just V/A), supports a release-identity correction UI for compilations, and accommodates the librarian-artist link as a first-class concept.

Blocked by #792 (step 1 — library_track_identity_source sibling). Do not start until #792 ships and the sibling identity table is exercised in production.

Aligned with the cross-cache-identity architecture pivot (2026-05-09). Per BS#800, Backend doesn't read LML's PG; identity flows in via POST /api/v1/identity/bulk-resolve-libraries. The "non-V/A track data" deliverable below uses bulk-resolve-libraries' kind: single_artist track listings rather than a new direct-write LML script. The legacy match_compilations.py / merge_cta.py scripts retire under pivot Phase 4-5; the per-track matcher work moves into LML (library-metadata-lookup#271).

Why later, not now

Today's compilation_track_artist covers only V/A library rows (~6,300 of 64,676), was machine-populated by an offline LML backfill (not librarian-curated), and uses a unique key on (library_id, artist_name, track_title) — wrong for the future where track_position is the right key but is currently nullable in some rows.

The librarian directive (2026-05-09) opens onto a UI plan we can't reach without restructuring:

  1. Tracklisting view for compilations.
  2. Search artists + tracks across all releases, not just comps.
  3. Release-identity correction UI for compilations — at the release level, not the track level. Two failure modes: (a) LML matched the comp to the wrong upstream release; (b) the correct release isn't captured in any upstream system at all.

Goals 1 and partial 2 work for V/A on current data once #792 lands. Goal 2 for non-V/A and goal 3 require this ticket.

Scope (tracking-level; sub-issues to be carved out closer to execution)

Schema

  • Rename compilation_track_artistlibrary_track, OR keep the V/A table and add a parallel library_track for non-V/A. Decide in the design phase.
  • Add track_artist_id integer REFERENCES wxyc_schema.artists(id) (nullable) for internal canonicalization — the "library-artist link" question from the design doc.
  • Add track_artist_link_confidence real CHECK (>= 0 AND <= 1) and track_artist_link_method text for the link itself.
  • Tighten track_position to NOT NULL where data permits; backfill or null-segregate the rest before adding the constraint.
  • Update the unique index. Current index is (library_id, artist_name, track_title). The forward-compatible key — and the one [E2] Compilations / Various Artists treatment under library_identity (§3.2 design decision) #792's library_track_identity_source sibling uses — is (library_id, track_position).
  • Refresh the CDC trigger (currently cdc_compilation_track_artist from migration 0046) to cover the renamed/replacement table.

Population — non-V/A track data

  • Under the pivot, non-V/A per-track data flows through bulk-resolve-libraries: extend the response shape so kind: single_artist results carry tracks too (when the caller asks for them). LML matches per-track; Backend writes the response into library_track. The pre-pivot framing of "new direct-write LML script mirroring match_compilations.py" is obsolete.
  • Cardinality estimate: 64K library rows × N tracks ≈ 500K–1M new track rows.
  • Backend-side: a one-shot backfill job (cf. jobs/library-canonical-entity-backfill/ or jobs/library-artwork-url-backfill/ patterns) that pages through library rows, calls bulk-resolve-libraries, and UPSERTs the tracks block into library_track (and identity into library_track_identity_source).

library-etl reconciliation

  • Today's .onConflictDoNothing() keyed on (library_id, artist_name, track_title) (Backend-Service/jobs/library-etl/job.ts:735-743) creates orphan rows when upstream artist_name is corrected. Fine today (no upstream correction UI), real bug once corrections land.
  • Either grow real upsert semantics keyed on (library_id, track_position) (requires the NOT NULL change above), or detect upstream deletions and propagate.
  • Decide cascade semantics for library_track_identity_source when the underlying track is deleted vs. when an artist_name is corrected (true delete → cascade is right; correction → identity may still be valid, depends on whether the correction changed the artist).

Reader updates

  • Backend-Service/apps/backend/services/suggest.service.ts:30-100 reads compilation_track_artist as a secondary autocomplete source. Update to read library_track.
  • Catalog views in dj-site that display tracklists (don't exist yet; add as part of this work or as a downstream UI ticket).
  • Any flowsheet → library track resolution that needs the per-track artist link.

Release-identity correction UI

The librarian-facing correction is at the release level, not the track level. When LML matches a comp to a Discogs/MB release, the tracklist is downstream of that release pointer; if the pointer is wrong, the entire tracklist is wrong. Track-level corrections aren't the primary need.

Two failure modes the UI must handle:

  1. Wrong upstream release matched. LML picked Discogs#X for this comp, but Discogs#Y (or a different MBID, or "none of these") is the correct one. The librarian re-points the comp's release-identity pointer; the tracklist + per-track external identity gets re-fetched and replaced wholesale (sidecar rows from [E2] Compilations / Various Artists treatment under library_identity (§3.2 design decision) #792 deleted and re-inserted against the new release).
  2. No upstream release exists. The comp exists in WXYC's catalog but isn't captured in Discogs / MB / any upstream system. The librarian needs to declare this — a positive "looked, doesn't exist, won't exist" state — distinct from "LML hasn't tried yet" and from "LML tried and failed." For these cases the tracklist would have to be librarian-entered (or stay empty); they'd be the only librarian-curated rows in library_track.

Schema implications:

  • library_identity (the parent library_id's row) needs a state for case 2. Likely a library_identity_source row with method = 'manual_no_upstream' or similar — captures the librarian's positive declaration so the matcher won't keep retrying.
  • library_track rows for case-2 comps would have track_artist_link_method = 'librarian' (vs. 'lml_backfill' for the default machine-populated case). The dual-author concern that was framed away under [E2] Compilations / Various Artists treatment under library_identity (§3.2 design decision) #792 returns here, but only for a small minority of rows.
  • library_track_identity_source (the step-1 sidecar) stays empty for case-2 comps — there's no upstream identity to source.
  • The CDC trigger and sync interactions need to recognize "release pointer changed" as a wholesale-replace event for the dependent track + identity rows.

Open question: lives in tubafrenzy/webapps/wxycdb (extending the existing JSP catalog) or a new dj-site / BS-served surface? Either path has different sync implications. Decide before implementing.

Tubafrenzy schema

  • Open question: does step 2 require schema change in tubafrenzy MySQL, or BS-side-only?
    • BS-only is cleaner but means BS becomes the source of truth for track data, inverting today's tubafrenzy → BS direction.
    • Tubafrenzy-also keeps the source-of-truth direction but means tubafrenzy needs the same schema and the librarian UI gets the new fields.

Hard constraints (carried forward from #792)

The librarian invariants still apply — V/A filing conventions stay invariant from the librarian's perspective. See #792's "Hard constraint" section and the design doc. Specifically:

  • library.artist_name / library.album_artist rendering for V/A rows is unchanged.
  • Z-prefixed call letters file as today.
  • Catalog navigation surfaces bucket-specific shelves, not a conflated __VARIOUS__ page.

The step 1 sidecar (library_track_identity_source) keys on (library_id, track_position, source); step 2 must preserve that keying or migrate the sidecar in lockstep.

End state

  • library_track (renamed or parallel) covers all library rows.
  • Per-track artist link to artists, nullable + fuzzy + with confidence/method.
  • Non-V/A track data populated from a new LML pull.
  • library-etl handles upstream corrections without orphaning rows.
  • Release-identity correction UI for compilations exists somewhere (TBD which repo).
  • Step 1 sidecar continues to work unchanged or is migrated cleanly.
  • Plan §3.2 in the parent canonicalization doc updated to reflect the post-step-2 model.

Acceptance (tracking-level)

Per-deliverable acceptance lives with each sub-issue when this gets broken down. At the tracking-issue level:

  • All deliverables above either shipped or explicitly carved out with follow-ups.
  • No regression in librarian-facing V/A behavior (cf. project_librarian_va_invariant).
  • Track search across all releases works in dj-site or whichever consumer is the target.
  • The librarian-correction UI handles the wrong-release-match scenario the design doc names.

Open questions (must resolve before breaking into sub-issues)

  1. Where does the librarian-correction UI live — tubafrenzy wxycdb or a new BS-served / dj-site surface?
  2. Population strategy for non-V/A track data — fold into library-etl or run as a separate one-shot backfill?
  3. library-etl upsert reconciliation: (library_id, track_position)-keyed upsert, or upstream-deletion propagation, or both?
  4. Tubafrenzy MySQL schema impact: also-update or BS-only?
  5. Rename vs parallel-table: structural choice for the schema migration.

Related

  • Design doc (full analysis): WXYC/wiki/plans/compilation-track-schema-design.md.
  • Blocked by: [E2] Compilations / Various Artists treatment under library_identity (§3.2 design decision) #792 (step 1 — library_track_identity_source sibling).
  • Parent plan: library-hook-canonicalization.md §3.2.
  • Source-of-truth schema: tubafrenzy/docs/database-schema.md (LIBRARY_CODE / LIBRARY_RELEASE / COMPILATION_TRACK_ARTIST).
  • Current BS code touched:
    • Backend-Service/jobs/library-etl/job.ts:701-748 — etl.
    • Backend-Service/apps/backend/services/suggest.service.ts:30-100 — reader.
    • Backend-Service/shared/database/src/migrations/0037_etl-schema-sync.sql — existing schema.
    • Backend-Service/shared/database/src/migrations/0046_cdc_notify_triggers.sql — CDC trigger.
  • LML scripts that wrote the original V/A data:
    • library-metadata-lookup/scripts/match_compilations.py
    • library-metadata-lookup/scripts/merge_cta.py

Metadata

Metadata

Assignees

No one assigned

    Labels

    concern:identityTouches library_identity / library_identity_source / library_identity_historyconcern:schema-migrationPR ships a Drizzle / sqlx / alembic migrationcross-cache-identity-followupPost-MVP follow-up that won't ship in this projectenhancementNew feature or requestkind:followupPost-MVP follow-up (paired with cross-cache-identity-followup)status:readyActionable now — no upstream blockers

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions