
[Epic] E2 — Canonical identity record (library_identity) + LML write contract #663

@jakebromberg


POST-PIVOT (2026-05-09). This epic was rescoped on 2026-05-09. Backend no longer composes identity from multiple sources or reads LML's discogs-cache PG. The matcher cascade, §3.4.1.1 composition (Rules 2-6), and §3.2.5 cross-ref detection move to LML; Backend consumes the verdict over HTTP via a new POST /api/v1/identity/bulk-resolve-libraries endpoint and writes locally via the §3.2.2.2 dual-table writer. Decision record: Backend-Service:plans/library-hook-canonicalization/architecture-pivot-2026-05-09.md (merged in #800). Replacement tickets: wxyc-shared#103 (contract), library-metadata-lookup#272 (LML implementation), #802 (Backend consumer). Wiki amendment in WXYC/wiki#25. The substrate from #790 (already merged) and the S1 self-migration leg both survive; the per-source legs S2/S3/S4/S5 and the §3.2.2.1 manual-override-skip endpoint are descoped. The problem statement and goals below remain accurate; the Backend half of the implementation is what changed.


Summary

Introduce a Backend-resident canonical identity record (library_identity + library_identity_source + library_identity_history) that links every library row to its Discogs / MusicBrainz / Wikidata / Spotify / Apple Music IDs with confidence and provenance, plus the LML write contract that populates it.

Today, identity is scattered across flowsheet_match, fuzzy_resolved, wxyc_release_match (Discogs), mb_* tables (MB), discogs_mapping + entity (Wikidata), reconciliation_log (semantic-index, all 53,849 rows have NULL confidence), and prod discogs-cache's entity.identity (23,816 rows, separate schema, no local replica). Backend already has canonical_entity_id + canonical_entity_confidence + canonical_entity_resolved_at columns at ~46% coverage; this epic migrates that opaque-string pattern into a per-source columnar model with a history table.

Two halves, one contract

E2-BS — Backend half (Backend-Service)

library_identity schema (one row per library.id) with per-source ID columns, method enum, confidence, plus a library_identity_source sidecar (per-source forensic detail) and library_identity_history (supersedure log with retention policy from §3.2.0). Writer service at apps/backend/services/identity/library-identity-writer.ts consumes LML's extended /lookup response and applies the §3.2.2 write rules (manual-override protection, higher-confidence supersedes, equal-confidence noop, sanity-check rejection on out-of-range confidence).
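
The §3.2.2 write rules named above can be read as a small decision function. The following is a minimal sketch, not the actual writer: the `IdentityRow` shape, the `WriteDecision` names, and the `[0, 1]` confidence range are assumptions made for illustration. The source also does not say what happens on a lower-confidence write or when a manual op meets an existing manual row; the sketch assumes a noop and a supersede, respectively.

```typescript
// Hypothetical shapes: field names are illustrative, not the real schema.
type Method = "manual" | "automated";

interface IdentityRow {
  method: Method;
  confidence: number; // assumed to lie in [0, 1]
}

type WriteDecision = "insert" | "supersede" | "noop" | "skip_manual" | "reject";

function decideWrite(existing: IdentityRow | null, incoming: IdentityRow): WriteDecision {
  // Sanity check: reject NaN, infinite, or out-of-range confidence.
  if (!isFinite(incoming.confidence) || incoming.confidence < 0 || incoming.confidence > 1) {
    return "reject";
  }
  // No existing row: nothing to protect or compare against.
  if (existing === null) return "insert";
  // Manual-override protection: a non-manual write never replaces a manual row.
  if (existing.method === "manual" && incoming.method !== "manual") return "skip_manual";
  // Assumption: a manual op always wins, regardless of confidence.
  if (incoming.method === "manual") return "supersede";
  // Higher confidence supersedes.
  if (incoming.confidence > existing.confidence) return "supersede";
  // Equal confidence is a noop; lower confidence is assumed to be a noop too.
  return "noop";
}
```

Keeping the decision pure (no database access) would make each rule testable in isolation; a real writer would presumably also record superseded rows in library_identity_history.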

Backfill: §4 step 2 reads the existing canonical_entity_id triple and populates library_identity_source per-source columns. Reuses the existing jobs/library-canonical-entity-backfill/ job, refactored to write to the new tables (per §3.2.0.1 audit). The existing jobs/artist-identity-etl/ is kept as-is (per-artist; out of scope here).

Manual-override MVP: a --apply CSV job (§3.2.4) writing structured reason_category + failure_reason enums to library_identity_history.

E2-LML — LML half (library-metadata-lookup)

Matcher cascade extension (§3.2.5): label, year, and Wikidata cross-reference signals layer above the existing trigram matcher. Cascade decision trees A-E (§3.2.5.1) handle disambiguation when multi-candidate matches arise. New cross_source_agreement method emitted when ≥2 sources resolve to entities sharing a discogs_mapping cross-ref → main row confidence 0.95+.
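
To make the cross_source_agreement rule concrete, here is a rough sketch of the grouping step. It is not LML's matcher: `SourceMatch`, `crossRefId` (standing in for a shared discogs_mapping cross-ref), and the choice to lift confidence to at least 0.95 are all assumptions for illustration.

```typescript
// Hypothetical per-source match; crossRefId stands in for a discogs_mapping cross-ref.
interface SourceMatch {
  source: string; // e.g. "discogs", "musicbrainz", "wikidata"
  crossRefId: string | null;
  confidence: number;
}

interface Agreement {
  method: "cross_source_agreement";
  crossRefId: string;
  sources: string[];
  confidence: number;
}

// Assumed floor, taken from the "0.95+" wording in the text.
const AGREEMENT_FLOOR = 0.95;

function detectAgreement(matches: SourceMatch[]): Agreement | null {
  // Group matches by the cross-ref they resolve to.
  const byRef: Record<string, SourceMatch[]> = {};
  for (const m of matches) {
    if (m.crossRefId === null) continue;
    (byRef[m.crossRefId] = byRef[m.crossRefId] || []).push(m);
  }
  for (const crossRefId of Object.keys(byRef)) {
    const group = byRef[crossRefId] || [];
    const sources: string[] = [];
    let best = 0;
    for (const m of group) {
      if (sources.indexOf(m.source) === -1) sources.push(m.source);
      if (m.confidence > best) best = m.confidence;
    }
    // Agreement needs at least two distinct sources on the same cross-ref.
    if (sources.length >= 2) {
      return {
        method: "cross_source_agreement",
        crossRefId,
        sources,
        confidence: Math.max(AGREEMENT_FLOOR, best),
      };
    }
  }
  return null;
}
```

Counting distinct sources rather than raw matches keeps two candidates from the same source from being mistaken for agreement.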

/lookup response extension (§3.2.2): when request includes include_identity: true, response carries api_version: 2 + identity block with a complete (not sparse) resolved array — one entry per source LML attempted, including attempted: false for legs that didn't run. Backend always passes include_identity: true; other consumers (catalog search, dj-site proxy, iOS apps) omit it and continue to receive the byte-identical v0.5.0 response (oneOf with discriminator on the request flag — §3.2.2 locked).
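
The "complete, not sparse" requirement on the resolved array can be illustrated with a small normalizer. This is a sketch under stated assumptions: the `Source` list is taken from the sources named in the Summary, and the `ResolvedEntry` fields other than `attempted` are hypothetical; the real schema lives in wxyc-shared api.yaml.

```typescript
// Assumed source list, based on the IDs named in the Summary.
type Source = "discogs" | "musicbrainz" | "wikidata" | "spotify" | "apple_music";

const ALL_SOURCES: Source[] = ["discogs", "musicbrainz", "wikidata", "spotify", "apple_music"];

// Hypothetical entry shape; only `attempted` is named in the text.
interface ResolvedEntry {
  source: Source;
  attempted: boolean;
  id: string | null;
  confidence: number | null;
}

// Pad a sparse list so every source has exactly one entry,
// marking legs that did not run with attempted: false.
function completeResolved(partial: ResolvedEntry[]): ResolvedEntry[] {
  return ALL_SOURCES.map((source): ResolvedEntry => {
    const found = partial.filter((e) => e.source === source)[0];
    return found || { source, attempted: false, id: null, confidence: null };
  });
}
```

With a complete array, a consumer can tell "not attempted" apart from "attempted and unresolved" without checking for missing keys.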

Manual-override skip: §3.2.2.1 read contract — LML refuses to overwrite a library_identity row with method='manual' unless the override is itself a manual op.

Phase 0 prerequisites (filed separately, all block E2-BS step 1)

  • E2 step 0a — wxyc-shared api.yaml v0.5.0 → v0.6.0 with identity block + include_identity request param. Two-step deliverable: schema PR merge + release tag publishing the new TS/Python/Swift/Kotlin types. The release tag is the contract-readiness gate.
  • E2 step 0b — Backend CLAUDE.md notes (cross-epic precondition guards, manual-override 8-source extensibility) + flag-only timeout workflow + CLAUDE.md note in catalog-audits. Per §3.2.3's four-artifact checklist.
  • E2 step 0c — LML confidence-threshold review. 15-min sync with the LML maintainer to confirm §3.4.1's locked thresholds (≥0.85 authoritative; 0.70-0.85 flagged) match LML's existing implicit thresholds. Written summary comment on the issue is the artifact.
  • E2 step 0d — Cross-repo env-var documentation. Backend CLAUDE.md is the canonical source of truth for the §4.2 feature flags; consumer repos (LML, semantic-index, discogs-etl, musicbrainz-cache, wikidata-cache) cross-reference.
  • E2 step 0e — Existing-jobs lock-pattern audit. Confirms library-canonical-entity-backfill and artist-identity-etl row-level lock patterns are compatible with the new dual-table writer.
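
For reference, the §3.4.1 thresholds cited in step 0c can be written as a simple classifier. This is a sketch, not the locked implementation: the tier names are hypothetical, the boundary at exactly 0.85 is assumed to count as authoritative, and the source does not say how scores below 0.70 are treated, so that tier is a placeholder.

```typescript
// Hypothetical tier names; only the two thresholds come from the text.
type Tier = "authoritative" | "flagged" | "below_threshold";

function classifyConfidence(confidence: number): Tier {
  if (confidence >= 0.85) return "authoritative"; // >= 0.85 authoritative
  if (confidence >= 0.7) return "flagged"; // 0.70 up to 0.85 flagged
  return "below_threshold"; // assumption: not specified in the source
}
```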

Dependencies

Existing issues folded in

  • WXYC/library-metadata-lookup#162 — Artist-name variant discovery → input to backfill (§4 step 2). Folds in.
  • WXYC/library-metadata-lookup#163 — Canonicalization service design → §3.2 specifies it. Folds in / replaces.
  • WXYC/Backend-Service#624 — Upgrade canonical_entity_id to Wikidata QIDs → becomes the migration step from canonical_entity_id into library_identity_source per §3.2.0. Closed with a comment explaining supersession.
  • WXYC/Backend-Service#572 — Run LML entity-resolution pipeline to populate entity.identity in prod → superseded by E2-BS write path.
  • WXYC/library-metadata-lookup#207, #216, #217, #218 — Multi-source legs (MB / WD / Spotify-AM-BC) → coordinate; legs fill in library_identity_source columns.

Acceptance

  • library_identity populated for ≥99% of library rows (with NULL per-source columns where unresolved).
  • §3.2.3 gate-check query PASSES (truly_unresolved_rows < 1000).
  • 30-day dual-run audit (E5) shows zero divergence between library_identity and the legacy canonical_entity_id columns.
  • canonical_entity_id columns dropped per §4 step 5.
  • Manual-override CSV job runs cleanly on a sample fixture; rejected rows surface in report with structured failure_reason.
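
The numeric criteria above could be checked mechanically once the counts are available. The sketch below assumes a hypothetical `AcceptanceStats` shape; how those counts are produced (the §3.2.3 gate-check query, the E5 audit) is outside what this example covers.

```typescript
// Hypothetical input; the counts would come from the gate-check query and the E5 audit.
interface AcceptanceStats {
  libraryRows: number;
  identityRows: number;
  trulyUnresolvedRows: number;
  divergentRows: number;
}

// Returns the list of failed criteria; an empty list means all numeric gates pass.
function checkAcceptance(s: AcceptanceStats): string[] {
  const failures: string[] = [];
  const coverage = s.libraryRows === 0 ? 0 : s.identityRows / s.libraryRows;
  if (coverage < 0.99) failures.push("library_identity coverage below 99%");
  if (s.trulyUnresolvedRows >= 1000) failures.push("truly_unresolved_rows not below 1000");
  if (s.divergentRows !== 0) failures.push("dual-run audit shows divergence");
  return failures;
}
```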

Plan reference

plans/library-hook-canonicalization.md §3.2.0 through §3.2.5.1, §3.4, §4 steps 1-5, §4.2, §8.5.1 (E2-BS test strategy).


Labels

  • cross-cache-identity (Project tag for the cross-cache-identity initiative (library hook + identity record + normalization))
  • enhancement (New feature or request)
  • epic:e2-bs (Parent epic E2 — Backend half — library_identity schema + writer + backfill)
  • epic:e2-lml (Parent epic E2 — LML half — matcher cascade + /lookup response)
  • kind:epic (Parent epic issue)
  • phase:0 (Mojibake phase 0 — audit)


    Projects

    Status
    Done
