§4 step 2 sub-PR 2.1 — S2 LML entity.identity reader + cross-ref pre-index #793

@jakebromberg

Add the LML/discogs-cache PG leg as the second source (S2) for the library-identity-backfill (cross-cache-identity §4 step 2). This enables cross-source agreement detection: when a library row has BOTH a discogs_release from S1 AND S2 identifiers (e.g. a discogs_artist + mb_artist) that cross-reference via Wikidata's discogs_mapping, the main row becomes method='cross_source_agreement', confidence=0.95.

Plan: plans/library-hook-canonicalization/section-4-step-2-backfill-plan.md sub-PR 2.1.

Scope

  • New module jobs/library-identity-backfill/sources/lml-entity-identity.ts. Connects to DATABASE_URL_DISCOGS (read-only). Bulk-reads entity.identity joined to a Backend-side mapping (library.artist_id → artists.artist_name).
  • Each entity.identity row produces up to 6 per-source rows in library_identity_source (one per non-NULL external ID column: discogs_artist, mb_artist, wikidata, spotify, apple_music, bandcamp).
  • §5.2 cross-ref pre-index: TS in-memory Map<discogs_id, Set<mb_id|wikidata_qid>> built at job start from wikidata-cache + musicbrainz-cache. Per-row cross-ref check is O(1).
  • §3.2.5 cross_ref_present(s1, s2) helper. Fires when discogs_release_id (from S1) + discogs_artist_id (from S2) cross-reference, OR when discogs_artist_id + mb_artist_mbid + wikidata_qid cross-reference per Wikidata's discogs_mapping.
  • Granularity fanout: S2 is artist-level; the new schema is library-row-level. Resolution: a library row with artist_id=N and any release_id receives the same six artist-level external IDs as every other library row with the same artist. Document the fanout in the resolver — it's correct because identity (Spotify ID, MB UUID) is artist-level, not release-level.
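
The §5.2 pre-index and the artist-level leg of the cross-ref check could be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the `CrossRefRow` shape, the function names `buildCrossRefIndex` and `hasArtistCrossRef`, and the sample IDs are all hypothetical, since the wikidata-cache and musicbrainz-cache schemas are not shown in this issue.

```typescript
// Hypothetical row shape: one (discogs_id, external_id) pair read from
// wikidata-cache or musicbrainz-cache. Real column names may differ.
type CrossRefRow = { discogsId: number; externalId: string };

// Mirrors the Map<discogs_id, Set<mb_id|wikidata_qid>> described above.
type CrossRefIndex = Map<number, Set<string>>;

// Built once at job start, so each per-row check is a Map + Set lookup.
function buildCrossRefIndex(rows: readonly CrossRefRow[]): CrossRefIndex {
  const index: CrossRefIndex = new Map();
  for (const { discogsId, externalId } of rows) {
    let ids = index.get(discogsId);
    if (ids === undefined) {
      ids = new Set<string>();
      index.set(discogsId, ids);
    }
    ids.add(externalId);
  }
  return index;
}

// Assumed semantics: the check passes when at least one candidate ID
// (MB MBID or Wikidata QID) is mapped to the given Discogs artist ID.
function hasArtistCrossRef(
  index: CrossRefIndex,
  discogsArtistId: number | null,
  candidates: ReadonlyArray<string | null>,
): boolean {
  if (discogsArtistId === null) return false;
  const ids = index.get(discogsArtistId);
  if (ids === undefined) return false;
  return candidates.some((c) => c !== null && ids.has(c));
}

// Illustrative data only; these are not real identifiers.
const exampleIndex = buildCrossRefIndex([
  { discogsId: 101, externalId: "Q-example" },
  { discogsId: 101, externalId: "mbid-example" },
]);
const exampleHit = hasArtistCrossRef(exampleIndex, 101, ["mbid-example", null]);
```

The release-level leg (discogs_release_id from S1 against discogs_artist_id from S2) would need a release-to-artist mapping that this sketch does not model.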

Confidence (locked, conservative interim)

method='alias_match', confidence=0.85 for all six external-ID fields, tagged notes='backfill:S2,trust=lml-aggregate'. Rationale: see plan sub-PR 2.1 details + §5.1.1. Above the 0.70 audit threshold but below name_variation's 0.90+ floor — the right "trust the aggregate, don't promote to exact" tier.
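
As a rough sketch of how one entity.identity row might expand into per-source rows under this blanket tier, including the artist-to-library-row fanout from Scope: the type names, the `expandIdentity` function, and the assumption that the identity columns are keyed exactly by the six source names are all hypothetical.

```typescript
// The six S2 sources listed in Scope. Assumption: the external-ID
// columns can be keyed by these names; the real columns may differ.
const S2_SOURCES = [
  "discogs_artist",
  "mb_artist",
  "wikidata",
  "spotify",
  "apple_music",
  "bandcamp",
] as const;
type S2Source = (typeof S2_SOURCES)[number];

type IdentityIds = Record<S2Source, string | null>;

// Hypothetical shape of a library_identity_source row for S2.
type PerSourceRow = {
  libraryId: number;
  source: S2Source;
  externalId: string;
  method: "alias_match";
  confidence: number;
  notes: string;
};

// Emits one row per non-NULL external ID for every library row that
// shares the artist, so each library row gets the same artist-level IDs.
function expandIdentity(
  identity: IdentityIds,
  libraryIds: readonly number[],
): PerSourceRow[] {
  const rows: PerSourceRow[] = [];
  for (const libraryId of libraryIds) {
    for (const source of S2_SOURCES) {
      const externalId = identity[source];
      if (externalId === null) continue; // skip NULL columns
      rows.push({
        libraryId,
        source,
        externalId,
        method: "alias_match",
        confidence: 0.85,
        notes: "backfill:S2,trust=lml-aggregate",
      });
    }
  }
  return rows;
}
```

With two non-NULL IDs and three library rows for the same artist, this yields six rows; a fully populated identity row yields six rows per library row.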

Upside path: once LML adds per-row method+confidence to entity.identity (see WXYC/library-metadata-lookup#270), a follow-up sub-PR re-derives S2 confidences per-row and supersedes the blanket 0.85 via §3.2.2 Rule 4 (existing < new → supersede).
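
The supersede step can be illustrated with a small sketch. Only the "existing < new → supersede" comparison comes from this issue; the `SourceClaim` type, the `resolveClaim` name, and the handling of a missing existing row are assumptions, and the other rules of §3.2.2 are not modeled.

```typescript
// Hypothetical minimal view of a per-source row's trust fields.
type SourceClaim = { method: string; confidence: number };

// Rule 4 as quoted above: the incoming claim wins only when the
// existing confidence is strictly lower. Ties keep the existing row.
function resolveClaim(
  existing: SourceClaim | undefined,
  incoming: SourceClaim,
): SourceClaim {
  if (existing === undefined) return incoming; // assumption: first write wins
  return existing.confidence < incoming.confidence ? incoming : existing;
}
```

Under this reading, a later per-row confidence above 0.85 would replace the blanket value, while a per-row confidence at or below 0.85 would leave it in place.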

Acceptance

  • DRY_RUN=true reports per-source row counts as JSON without writing.
  • Real run populates up to 6 library_identity_source rows per entity.identity row (one per non-NULL external ID), multiplied by the library_id fanout.
  • Cross-source agreement triggers on rows that S1 already wrote: main row method becomes cross_source_agreement, confidence=0.95, agreement_sources populated. Verified by SELECTing a known-corroborated library_id post-run.
  • Unit tests for the cross-ref pre-index (build + lookup) against fixture data.
  • Integration test (or SQL contract test, per #791, "Restore end-to-end TS-modules integration tests for library-identity-backfill") for the per-source UPSERT shape.
  • §3.2.3 gate-check query reports lower truly_unresolved_rows than after 2.0.
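
The DRY_RUN criterion could be exercised with something like the sketch below. The report shape (`dryRun`, `perSourceCounts`), the `runBackfill` name, and the synchronous `write` callback are all hypothetical; a real writer would presumably be async and the flag would be read from the DRY_RUN environment variable.

```typescript
// Only the field needed for counting; hypothetical shape.
type CountableRow = { source: string };

// Tallies how many per-source rows would be written for each source.
function countBySource(rows: readonly CountableRow[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const { source } of rows) {
    counts[source] = (counts[source] ?? 0) + 1;
  }
  return counts;
}

// Always returns the JSON report; calls `write` only when not a dry run.
function runBackfill(
  rows: readonly CountableRow[],
  dryRun: boolean,
  write: (rows: readonly CountableRow[]) => void,
): string {
  const report = JSON.stringify({ dryRun, perSourceCounts: countBySource(rows) });
  if (!dryRun) write(rows);
  return report;
}
```

Passing the writer in as a callback keeps the dry-run path testable without a database connection.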

Estimated LOC

~400 (reader + cross-ref function + tests) per plan §4 sub-PR 2.1.
