POST-PIVOT (2026-05-09). This epic was rescoped on 2026-05-09. Backend no longer composes identity from multiple sources or reads LML's discogs-cache PG. The matcher cascade, §3.4.1.1 composition (Rules 2-6), and §3.2.5 cross-ref detection move to LML; Backend consumes the verdict over HTTP via a new POST /api/v1/identity/bulk-resolve-libraries endpoint and writes locally via the §3.2.2.2 dual-table writer. Decision record: Backend-Service:plans/library-hook-canonicalization/architecture-pivot-2026-05-09.md (merged in #800). Replacement tickets: wxyc-shared#103 (contract), library-metadata-lookup#272 (LML implementation), #802 (Backend consumer). Wiki amendment in WXYC/wiki#25. The substrate from #790 (already merged) and the S1 self-migration leg both survive; the per-source legs S2/S3/S4/S5 and the §3.2.2.1 manual-override-skip endpoint are descoped. The problem statement and goals below remain accurate; the Backend half of the implementation is what changed.
Summary
Introduce a Backend-resident canonical identity record (library_identity + library_identity_source + library_identity_history) that links every library row to its Discogs / MusicBrainz / Wikidata / Spotify / Apple Music IDs with confidence and provenance, plus the LML write contract that populates it.
Today, identity is scattered across flowsheet_match, fuzzy_resolved, wxyc_release_match (Discogs), mb_* tables (MB), discogs_mapping + entity (Wikidata), reconciliation_log (semantic-index, all 53,849 rows have NULL confidence), and prod discogs-cache's entity.identity (23,816 rows, separate schema, no local replica). Backend already has canonical_entity_id + canonical_entity_confidence + canonical_entity_resolved_at columns at ~46% coverage; this epic migrates that opaque-string pattern into a per-source columnar model with a history table.
Two halves, one contract
E2-BS — Backend half (Backend-Service)
library_identity schema (one row per library.id) with per-source ID columns, method enum, confidence, plus a library_identity_source sidecar (per-source forensic detail) and library_identity_history (supersedure log with retention policy from §3.2.0). Writer service at apps/backend/services/identity/library-identity-writer.ts consumes LML's extended /lookup response and applies the §3.2.2 write rules (manual-override protection, higher-confidence supersedes, equal-confidence noop, sanity-check rejection on out-of-range confidence).
Backfill: §4 step 2 reads the existing canonical_entity_id triple and populates library_identity_source per-source columns. Reuses the existing jobs/library-canonical-entity-backfill/ job, refactored to write to the new tables (per §3.2.0.1 audit). The existing jobs/artist-identity-etl/ is kept as-is (per-artist; out of scope here).
Manual-override MVP: a --apply CSV job (§3.2.4) writing structured reason_category + failure_reason enums to library_identity_history.
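The §3.2.2 write rules named above (manual-override protection, higher-confidence supersedes, equal-confidence noop, sanity-check rejection) can be read as a single decision function. The sketch below is a minimal illustration, not the writer's actual code: the `IdentityCandidate` and `WriteDecision` types and the outcome labels are hypothetical, the valid confidence range is assumed to be [0, 1], and the handling of lower-confidence and manual-over-automated writes is an assumption the text does not spell out.

```typescript
// Illustrative only: type names, field names, and outcome labels are hypothetical.
interface IdentityCandidate {
  method: string; // "manual" marks a human override; other values are matcher methods
  confidence: number; // assumed to be valid only within [0, 1]
}

type WriteDecision = "reject" | "insert" | "skip_manual" | "supersede" | "noop";

function decideWrite(
  existing: IdentityCandidate | null,
  incoming: IdentityCandidate,
): WriteDecision {
  // Sanity check: out-of-range (or NaN) confidence is rejected outright.
  if (!(incoming.confidence >= 0 && incoming.confidence <= 1)) return "reject";
  // No row yet for this library.id: nothing to protect or compare against.
  if (existing === null) return "insert";
  // Manual-override protection: only a manual op may replace a manual row.
  if (existing.method === "manual") {
    return incoming.method === "manual" ? "supersede" : "skip_manual";
  }
  // Assumption: a manual op always wins over an automated row.
  if (incoming.method === "manual") return "supersede";
  // Higher confidence supersedes; equal confidence is a noop.
  // Assumption: lower confidence also leaves the existing row untouched.
  return incoming.confidence > existing.confidence ? "supersede" : "noop";
}
```

Keeping the decision pure (no database access) would let the rules be unit-tested separately from the dual-table write and the history insert.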
E2-LML — LML half (library-metadata-lookup)
Matcher cascade extension (§3.2.5): label, year, and Wikidata cross-reference signals layer above the existing trigram matcher. Cascade decision trees A-E (§3.2.5.1) handle disambiguation when multi-candidate matches arise. New cross_source_agreement method emitted when ≥2 sources resolve to entities sharing a discogs_mapping cross-ref → main row confidence 0.95+.
/lookup response extension (§3.2.2): when request includes include_identity: true, response carries api_version: 2 + identity block with a complete (not sparse) resolved array — one entry per source LML attempted, including attempted: false for legs that didn't run. Backend always passes include_identity: true; other consumers (catalog search, dj-site proxy, iOS apps) omit it and continue to receive the byte-identical v0.5.0 response (oneOf with discriminator on the request flag — §3.2.2 locked).
Manual-override skip: §3.2.2.1 read contract — LML refuses to overwrite a library_identity row with method='manual' unless the override is itself a manual op.
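To make the "complete, not sparse" requirement on the /lookup extension concrete, here is a hedged sketch of how a consumer might type the v2 response and check that every source leg is represented. Only `api_version`, `identity`, `resolved`, and `attempted` come from the description above; the source keys, the `id` and `confidence` fields, and both helper functions are hypothetical. The real shapes are whatever the wxyc-shared api.yaml defines.

```typescript
// Hypothetical source keys, derived from the five ID families named in the Summary.
type IdentitySource = "discogs" | "musicbrainz" | "wikidata" | "spotify" | "apple_music";

const ALL_SOURCES: readonly IdentitySource[] = [
  "discogs", "musicbrainz", "wikidata", "spotify", "apple_music",
];

interface ResolvedEntry {
  source: IdentitySource;
  attempted: boolean; // false for legs that did not run
  id: string | null; // hypothetical field: external ID, null when unresolved
  confidence: number | null; // hypothetical field
}

interface LookupResponseV2 {
  api_version: 2;
  identity: { resolved: ResolvedEntry[] };
}

// Type guard: the legacy v0.5.0 body carries neither api_version nor identity.
function hasIdentity(body: unknown): body is LookupResponseV2 {
  if (typeof body !== "object" || body === null) return false;
  const b = body as { api_version?: unknown; identity?: { resolved?: unknown } };
  return b.api_version === 2 && Array.isArray(b.identity?.resolved);
}

// Returns the sources with no entry at all; a complete array yields [].
function missingSources(resp: LookupResponseV2): IdentitySource[] {
  const seen = new Set(resp.identity.resolved.map((e) => e.source));
  return ALL_SOURCES.filter((s) => !seen.has(s));
}
```

A writer could treat a non-empty `missingSources` result as a contract violation rather than as "leg not attempted", since the latter is supposed to arrive as an explicit `attempted: false` entry.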
Phase 0 prerequisites (filed separately, all block E2-BS step 1)
- E2 step 0a: wxyc-shared api.yaml v0.5.0 → v0.6.0 with identity block + include_identity request param. Two-step deliverable: schema PR merge + release tag publishing the new TS/Python/Swift/Kotlin types. The release tag is the contract-readiness gate.
- E2 step 0b: Backend CLAUDE.md notes (cross-epic precondition guards, manual-override 8-source extensibility) + flag-only timeout workflow + CLAUDE.md note in catalog-audits. Per §3.2.3's four-artifact checklist.
- E2 step 0c: LML confidence-threshold review. 15-min sync with the LML maintainer to confirm §3.4.1's locked thresholds (≥0.85 authoritative; 0.70-0.85 flagged) match LML's existing implicit thresholds. Written summary comment on the issue is the artifact.
- E2 step 0d: Cross-repo env-var documentation. Backend CLAUDE.md is the canonical source of truth for the §4.2 feature flags; consumer repos (LML, semantic-index, discogs-etl, musicbrainz-cache, wikidata-cache) cross-reference.
- E2 step 0e: Existing-jobs lock-pattern audit. Confirms library-canonical-entity-backfill and artist-identity-etl row-level lock patterns are compatible with the new dual-table writer.
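For the step 0c review, the §3.4.1 thresholds quoted above can be written down as a small classifier so both teams compare the same boundaries. This is only a sketch: the tier name `below_threshold` is hypothetical (the text does not say how values under 0.70 are treated), and treating 0.70 itself as flagged is an assumption about how the "0.70-0.85" range closes.

```typescript
// Tier labels: "authoritative" and "flagged" come from the text; "below_threshold" is hypothetical.
type ConfidenceTier = "authoritative" | "flagged" | "below_threshold";

function classifyConfidence(confidence: number): ConfidenceTier {
  if (confidence >= 0.85) return "authoritative"; // ≥0.85 per §3.4.1
  if (confidence >= 0.7) return "flagged"; // 0.70-0.85; lower bound assumed inclusive
  return "below_threshold"; // treatment not specified in this issue
}
```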
Dependencies
Existing issues folded in
WXYC/library-metadata-lookup#162 — Artist-name variant discovery → input to backfill (§4 step 2). Folds in.
WXYC/library-metadata-lookup#163 — Canonicalization service design → §3.2 specifies it. Folds in / replaces.
WXYC/Backend-Service#624 — Upgrade canonical_entity_id to Wikidata QIDs → becomes the migration step from canonical_entity_id into library_identity_source per §3.2.0. Closed with a comment explaining supersession.
WXYC/Backend-Service#572 — Run LML entity-resolution pipeline to populate entity.identity in prod → superseded by E2-BS write path.
WXYC/library-metadata-lookup#207, #216, #217, #218 — Multi-source legs (MB / WD / Spotify-AM-BC) → coordinate; legs fill in library_identity_source columns.
Acceptance
- library_identity populated for ≥99% of library rows (with NULL per-source columns where unresolved).
- §3.2.3 gate-check query PASSES (truly_unresolved_rows < 1000).
- 30-day dual-run audit (E5) shows zero divergence between library_identity and the legacy canonical_entity_id columns.
- canonical_entity_id columns dropped per §4 step 5.
- Manual-override CSV job runs cleanly on a sample fixture; rejected rows surface in report with structured failure_reason.
Plan reference
plans/library-hook-canonicalization.md §3.2.0 through §3.2.5.1, §3.4, §4 step 1-5, §4.2, §8.5.1 (E2-BS test strategy).