Add the LML/discogs-cache PG leg as the second source (S2) for the library-identity-backfill (cross-cache-identity §4 step 2). This enables cross-source agreement detection: when a library row has a discogs_release from S1 and, for example, a discogs_artist + mb_artist from S2 that cross-reference via Wikidata's discogs_mapping, the main row is set to method='cross_source_agreement', confidence=0.95.
New module jobs/library-identity-backfill/sources/lml-entity-identity.ts. Connects to DATABASE_URL_DISCOGS (read-only). Bulk-reads entity.identity joined to a Backend-side mapping (library.artist_id → artists.artist_name).
Each entity.identity row produces up to 6 per-source rows in library_identity_source (one per non-NULL external ID column: discogs_artist, mb_artist, wikidata, spotify, apple_music, bandcamp).
§5.2 cross-ref pre-index: TS in-memory Map<discogs_id, Set<mb_id|wikidata_qid>> built at job start from wikidata-cache + musicbrainz-cache. Per-row cross-ref check is O(1).
§3.2.5 cross_ref_present(s1, s2) helper. Fires when discogs_release_id (from S1) + discogs_artist_id (from S2) cross-reference, OR when discogs_artist_id + mb_artist_mbid + wikidata_qid cross-reference per Wikidata's discogs_mapping.
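A minimal sketch of the pre-index and the artist-level half of the cross-ref check, under stated assumptions. The `MappingRow` shape and the names `buildCrossRefIndex` and `artistIdsCrossReference` are hypothetical, and the rule "every supplied ID must be linked" is one possible reading of the condition above. The release-to-artist half (S1 discogs_release_id against S2 discogs_artist_id) needs a release lookup that is not described here, so it is left out.

```typescript
// Hypothetical shape of one mapping row read from wikidata-cache or musicbrainz-cache.
interface MappingRow {
  discogsId: number;
  mbId?: string | null;
  wikidataQid?: string | null;
}

// Discogs artist ID -> every MusicBrainz MBID / Wikidata QID linked to it.
type CrossRefIndex = Map<number, Set<string>>;

// Built once at job start so that each per-row check is a Map + Set lookup.
function buildCrossRefIndex(rows: Iterable<MappingRow>): CrossRefIndex {
  const index: CrossRefIndex = new Map();
  for (const row of rows) {
    let linked = index.get(row.discogsId);
    if (!linked) {
      linked = new Set<string>();
      index.set(row.discogsId, linked);
    }
    if (row.mbId) linked.add(row.mbId);
    if (row.wikidataQid) linked.add(row.wikidataQid);
  }
  return index;
}

// Assumption: agreement requires at least one supplied ID, and every supplied
// ID must be linked to the Discogs artist in the index.
function artistIdsCrossReference(
  index: CrossRefIndex,
  discogsArtistId: number,
  mbArtistMbid: string | null,
  wikidataQid: string | null,
): boolean {
  const linked = index.get(discogsArtistId);
  if (!linked) return false;
  const supplied = [mbArtistMbid, wikidataQid].filter(
    (id): id is string => id != null,
  );
  return supplied.length > 0 && supplied.every((id) => linked.has(id));
}
```

Storing MBIDs and QIDs in one `Set<string>` works because the two ID formats cannot collide; separate sets per ID type would be an equally reasonable choice.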
Granularity fanout: S2 is artist-level; the new schema is library-row-level. Resolution: a library row with artist_id=N and any release_id receives the same six artist-level external IDs as every other library row with the same artist. Document the fanout in the resolver — it's correct because identity (Spotify ID, MB UUID) is artist-level, not release-level.
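The two fanouts described above (one row per non-NULL external ID column, repeated for every library row sharing the artist) could be sketched roughly as follows. The row shapes and the name `fanOut` are hypothetical, and all IDs are treated as strings for simplicity; only the six column names come from this issue.

```typescript
// The six external-ID columns named in this issue.
const EXTERNAL_ID_COLUMNS = [
  "discogs_artist",
  "mb_artist",
  "wikidata",
  "spotify",
  "apple_music",
  "bandcamp",
] as const;
type ExternalIdColumn = (typeof EXTERNAL_ID_COLUMNS)[number];

// Hypothetical row shapes; the real entity.identity and library schemas may differ.
type EntityIdentityRow = { artist_name: string } & Partial<
  Record<ExternalIdColumn, string | null>
>;
interface LibraryRow {
  library_id: number;
  artist_id: number;
}
interface SourceRow {
  library_id: number;
  field: ExternalIdColumn;
  value: string;
}

// Every library row for the artist receives the same artist-level IDs,
// one output row per non-NULL column.
function fanOut(
  identity: EntityIdentityRow,
  libraryRowsForArtist: LibraryRow[],
): SourceRow[] {
  const out: SourceRow[] = [];
  for (const lib of libraryRowsForArtist) {
    for (const field of EXTERNAL_ID_COLUMNS) {
      const value = identity[field];
      if (value == null) continue; // skip NULL / missing columns
      out.push({ library_id: lib.library_id, field, value });
    }
  }
  return out;
}
```

Under these assumptions an artist with three non-NULL IDs and two library rows yields six source rows, which is why the per-row count is "up to 6" rather than a fixed 6.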
Confidence (locked, conservative interim)
method='alias_match', confidence=0.85 for all six external-ID fields, tagged notes='backfill:S2,trust=lml-aggregate'. Rationale: see plan sub-PR 2.1 details + §5.1.1. Above the 0.70 audit threshold but below name_variation's 0.90+ floor — the right "trust the aggregate, don't promote to exact" tier.
Upside path: once LML adds per-row method+confidence to entity.identity (see WXYC/library-metadata-lookup#270), a follow-up sub-PR re-derives S2 confidences per-row and supersedes the blanket 0.85 via §3.2.2 Rule 4 (existing < new → supersede).
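To make the tiering and the supersede comparison concrete, here is a small sketch. The constants come from this issue; the `Scored` shape and the name `shouldSupersede` are hypothetical, and only the strict "existing < new" comparison is taken from Rule 4 as quoted above. Treating a missing existing row as "write it" is an added assumption.

```typescript
// Hypothetical shape of the scoring fields on a library_identity_source row.
interface Scored {
  method: string;
  confidence: number;
  notes: string;
}

// Values stated in this issue.
const AUDIT_THRESHOLD = 0.7;
const NAME_VARIATION_FLOOR = 0.9;
const S2_BLANKET: Scored = {
  method: "alias_match",
  confidence: 0.85,
  notes: "backfill:S2,trust=lml-aggregate",
};

// Rule 4 as quoted above: supersede only when existing < new (strictly).
// Assumption: with no existing row, the incoming row is simply written.
function shouldSupersede(existing: Scored | undefined, incoming: Scored): boolean {
  return existing === undefined || existing.confidence < incoming.confidence;
}
```

With these values, a later per-row confidence above 0.85 would replace the blanket S2 row, while one at or below 0.85 would leave the blanket row in place.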
Acceptance
DRY_RUN=true reports per-source row counts as JSON without writing.
Real run populates up to 6 library_identity_source rows (one per non-NULL external ID column) per entity.identity row × library_id fanout.
Cross-source agreement triggers on rows that S1 already wrote: main row method becomes cross_source_agreement, confidence=0.95, agreement_sources populated. Verified by SELECTing a known-corroborated library_id post-run.
Unit tests for the cross-ref pre-index (build + lookup) against fixture data.
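The DRY_RUN criterion above could be met with something like the following sketch. The names `countBySource` and `runBackfill`, the `PendingRow` shape, and the injected `write` callback are all hypothetical; the environment is passed in as a plain object so the example stays self-contained.

```typescript
// Hypothetical shape of a row queued for library_identity_source.
interface PendingRow {
  source: string; // e.g. "S1" or "S2"
  library_id: number;
}
type Env = Record<string, string | undefined>;

// Tally queued rows per source for the JSON report.
function countBySource(rows: PendingRow[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    counts[row.source] = (counts[row.source] ?? 0) + 1;
  }
  return counts;
}

// Always returns the per-source counts as JSON; only calls `write`
// when DRY_RUN is not exactly "true".
function runBackfill(
  rows: PendingRow[],
  env: Env,
  write: (rows: PendingRow[]) => void,
): string {
  const report = JSON.stringify(countBySource(rows));
  if (env["DRY_RUN"] !== "true") write(rows);
  return report;
}
```

Injecting `write` keeps the dry-run branch testable without a database connection.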
[…] truly_unresolved_rows than after 2.0.
Plan: plans/library-hook-canonicalization/section-4-step-2-backfill-plan.md sub-PR 2.1.
Estimated LOC
~400 (reader + cross-ref function + tests) per plan §4 sub-PR 2.1.
Refs