Add fuzzy_resolved as the S4 source leg for the library-identity-backfill (cross-cache-identity §4 step 2). Trigram-resolved Backend library.id linker — the only source that contributes identity at the Backend library_id level (not the Discogs entity level).
Plan: plans/library-hook-canonicalization/section-4-step-2-backfill-plan.md sub-PR 2.2b.
Spike: plans/library-hook-canonicalization/audits/discogs-cache-match-score-shape.md (lands in PR #794).
Scope
- New module jobs/library-identity-backfill/sources/discogs-cache-fuzzy-resolved.ts. Reads from DATABASE_URL_DISCOGS. The table is materialized by Backend-Service/scripts/fuzzy-trigram-flowsheet.sql:146-176 during the backfill window.
- Maps each fuzzy_resolved row → one library_identity_source row tagged notes='backfill:S4'.
Important: no Discogs ID
fuzzy_resolved carries no Discogs ID at all — only (artist_norm, album_norm, resolved_library_id). S4 contributes identity at the Backend library_id level, not at the Discogs entity level. This is the source of the writer-contract dispensation below.
Method assignment (post-spike)
The trigram score lives only on the upstream un-persisted fuzzy_full staging table as combined = similarity(artist) + similarity(album) ∈ [1.55, 2.0]. Two options:
- (a, preferred) modify fuzzy-trigram-flowsheet.sql (~3 lines) to retain trgm_artist_score + trgm_album_score columns on fuzzy_resolved. Map via confidence = 0.7 + 0.3 * (trgm_artist_score * trgm_album_score) clamped to [0.70, 0.95]. Cost: one re-run of the source SQL.
- (b, fallback) stamp fixed confidence=0.85 for all fuzzy_resolved rows. Matches the existing linkage_confidence=0.85 baked into the prod-side UPDATE at fuzzy-trigram-flowsheet.sql:39, 86. Lower-risk if a re-run is undesirable; doesn't scale with similarity.
This PR picks one; the default is (a) unless the source-SQL re-run would hold up landing.
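The option (a) mapping can be sketched as a small pure function. This is a minimal sketch, not the module's actual code: the interface name, the function name, and the choice to fall back to the option (b) constant when a score column is NULL are all assumptions made for illustration.

```typescript
// Hypothetical row shape: trgm_artist_score / trgm_album_score are the columns
// option (a) would retain on fuzzy_resolved; they are not there today.
interface FuzzyResolvedScores {
  trgm_artist_score: number | null;
  trgm_album_score: number | null;
}

const FIXED_CONFIDENCE = 0.85; // option (b): same value as linkage_confidence
const MIN_CONFIDENCE = 0.7;
const MAX_CONFIDENCE = 0.95;

// confidence = 0.7 + 0.3 * (artist * album), clamped to [0.70, 0.95].
// Assumption: a row without retained scores gets the option (b) constant.
function trigramConfidence(row: FuzzyResolvedScores): number {
  const artist = row.trgm_artist_score;
  const album = row.trgm_album_score;
  if (artist === null || album === null) {
    return FIXED_CONFIDENCE;
  }
  const raw = 0.7 + 0.3 * (artist * album);
  return Math.min(MAX_CONFIDENCE, Math.max(MIN_CONFIDENCE, raw));
}
```

For example, artist and album similarities of 0.8 each give a product of 0.64 and a confidence of about 0.892, while two perfect scores give 1.0 before clamping and 0.95 after. If the 1.55 floor on the combined score holds for every retained row, the product is at least 0.55, so the lower clamp at 0.70 would never actually engage and the effective range is roughly [0.865, 0.95].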
Writer contract dispensation (NEW for §3.2.2.2)
S4 cannot populate any external-ID column. The per-source row needs:
- external_id = 'wxyc:library:N' (synthetic; or just stamp the Backend library.id itself, depending on what fits the type system cleanly)
- source = 'trigram_match'
- method = 'trigram', confidence per the option above
- notes = 'backfill:S4'
The §3.2.2.2 main-row recompute treats this row as agreement-only: its (method, confidence) factor into the MIN-of-confidences fallback (Rule 4) but its external_id never lands on a library_identity main column. The recompute's column-mapping table needs a carve-out to ignore source='trigram_match'.
No schema change; only the recompute behavior amendment + a notes qualifier.
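A rough sketch of the per-source row and the recompute carve-out follows. Everything beyond the four field values listed above is assumed for illustration: the row interface, the function names, the placeholder 'example_source' → 'example_external_id' mapping entry, and the reading of Rule 4 as "take the lowest-confidence row's (method, confidence)". Rules 1 to 3 are not modelled.

```typescript
// Hypothetical shape of a library_identity_source row.
interface LibraryIdentitySourceRow {
  library_id: number;
  source: string;
  external_id: string;
  method: string;
  confidence: number;
  notes: string;
}

// Builds the S4 row using the synthetic external_id variant.
function toS4SourceRow(resolvedLibraryId: number, confidence: number): LibraryIdentitySourceRow {
  return {
    library_id: resolvedLibraryId,
    source: "trigram_match",
    external_id: `wxyc:library:${resolvedLibraryId}`,
    method: "trigram",
    confidence,
    notes: "backfill:S4",
  };
}

// Placeholder column mapping; the real table lives in the §3.2.2.2 recompute.
const COLUMN_FOR_SOURCE: Record<string, string> = {
  example_source: "example_external_id",
};

// The carve-out: these sources never populate a main external-ID column.
const AGREEMENT_ONLY_SOURCES = new Set<string>(["trigram_match"]);

function externalIdColumns(rows: LibraryIdentitySourceRow[]): Record<string, string> {
  const columns: Record<string, string> = {};
  for (const row of rows) {
    if (AGREEMENT_ONLY_SOURCES.has(row.source)) continue;
    const column = COLUMN_FOR_SOURCE[row.source];
    if (column !== undefined) columns[column] = row.external_id;
  }
  return columns;
}

// Assumed reading of the Rule 4 fallback: the lowest-confidence row wins,
// and agreement-only rows still take part.
function minConfidenceFallback(
  rows: LibraryIdentitySourceRow[],
): { method: string; confidence: number } | null {
  let lowest: LibraryIdentitySourceRow | null = null;
  for (const row of rows) {
    if (lowest === null || row.confidence < lowest.confidence) lowest = row;
  }
  return lowest === null ? null : { method: lowest.method, confidence: lowest.confidence };
}
```

With only an S4 row for a library_id, externalIdColumns returns an empty object (all external IDs stay NULL) while minConfidenceFallback still yields a non-null (method, confidence). Keeping the carve-out as a set of source names rather than a check on notes means the exclusion does not depend on the backfill tag.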
Acceptance
- DRY_RUN=true reports per-source row counts as JSON without writing.
- Real run populates one library_identity_source row per fuzzy_resolved row.
- Main-row recompute correctly excludes S4 rows from external-ID column population (only contributes to confidence/agreement).
- Unit test for the recompute carve-out: an S4-only library_id produces a main row with all NULL external IDs but a non-null (method, confidence).
- Cross-source agreement: when S1/S2/S3 already wrote external IDs for the library_id, S4's contribution boosts main confidence via Rule 2 (cross_source_agreement) when applicable.
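The dry-run criterion could be met with a planning step that is separate from the write. The sketch below is illustrative only: the function names, the env parameter, and the JSON report shape ({ dryRun, counts }) are assumptions, since the plan does not fix an output format.

```typescript
// Minimal row shape for counting; only the notes tag is needed here.
interface TaggedRow {
  notes: string;
}

// Counts rows per source tag, e.g. { "backfill:S4": 1234 }.
function countBySourceTag(rows: TaggedRow[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    counts[row.notes] = (counts[row.notes] ?? 0) + 1;
  }
  return counts;
}

// Hypothetical planning step: always produces the JSON report, and hands back
// the rows to write only when DRY_RUN is not "true".
function planBackfill<T extends TaggedRow>(
  rows: T[],
  env: Record<string, string | undefined>,
): { report: string; toWrite: T[] } {
  const dryRun = env["DRY_RUN"] === "true";
  const report = JSON.stringify({ dryRun, counts: countBySourceTag(rows) });
  return { report, toWrite: dryRun ? [] : rows };
}
```

The caller would print report and pass toWrite to whatever writer the job uses; because a dry run returns an empty toWrite, the writer is never handed rows in that mode.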
Estimated LOC
~300 (reader + recompute carve-out + tests). Plus optional ~3-line edit to fuzzy-trigram-flowsheet.sql if option (a) is chosen.
Refs
flowsheet_match)