§4 step 2 sub-PR 2.2b — S4 discogs-cache fuzzy_resolved reader + writer dispensation #796

@jakebromberg

Description

Add fuzzy_resolved as the S4 source leg for the library-identity-backfill (cross-cache-identity §4 step 2). It is a trigram-resolved linker to Backend library.id, and the only source that contributes identity at the Backend library_id level rather than at the Discogs entity level.

Plan: plans/library-hook-canonicalization/section-4-step-2-backfill-plan.md sub-PR 2.2b.
Spike: plans/library-hook-canonicalization/audits/discogs-cache-match-score-shape.md (lands in PR #794).

Scope

  • New module jobs/library-identity-backfill/sources/discogs-cache-fuzzy-resolved.ts. Reads from DATABASE_URL_DISCOGS. The table is materialized by Backend-Service/scripts/fuzzy-trigram-flowsheet.sql:146-176 during the backfill window.
  • Maps each fuzzy_resolved row → one library_identity_source row tagged notes='backfill:S4'.

Important: no Discogs ID

fuzzy_resolved carries no Discogs ID at all — only (artist_norm, album_norm, resolved_library_id). S4 contributes identity at the Backend library_id level, not at the Discogs entity level. This is the source of the writer-contract dispensation below.

Method assignment (post-spike)

The trigram score lives only on the upstream un-persisted fuzzy_full staging table as combined = similarity(artist) + similarity(album) ∈ [1.55, 2.0]. Two options:

  • (a, preferred) modify fuzzy-trigram-flowsheet.sql (~3 lines) to retain trgm_artist_score + trgm_album_score columns on fuzzy_resolved. Map via confidence = 0.7 + 0.3 * (trgm_artist_score * trgm_album_score) clamped to [0.70, 0.95]. Cost: one re-run of the source SQL.
  • (b, fallback) stamp fixed confidence=0.85 for all fuzzy_resolved rows. Matches the existing linkage_confidence=0.85 baked into the prod-side UPDATE at fuzzy-trigram-flowsheet.sql:39, 86. Lower-risk if a re-run is undesirable; doesn't scale with similarity.

This PR picks one. The default is (a), unless the source-SQL re-run would delay landing.
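The two options above can be sketched as a small mapping function. This is a minimal illustration, not the job's actual code: the names `TrigramScores`, `scaledConfidence`, and `confidenceFor` are hypothetical, and falling back to the fixed value when a score is null is an assumption about how missing columns would surface. Only the formula, the clamp bounds, and the 0.85 constant come from the text above.

```typescript
// Hypothetical names throughout; only the formula and constants come from the issue.
const MIN_CONFIDENCE = 0.7;
const MAX_CONFIDENCE = 0.95;
const FIXED_CONFIDENCE = 0.85; // option (b): same value as the prod-side linkage_confidence

interface TrigramScores {
  trgmArtistScore: number | null;
  trgmAlbumScore: number | null;
}

function clamp(value: number, lo: number, hi: number): number {
  return Math.min(hi, Math.max(lo, value));
}

// Option (a): confidence = 0.7 + 0.3 * (artist * album), clamped to [0.70, 0.95].
function scaledConfidence(artistScore: number, albumScore: number): number {
  const raw = MIN_CONFIDENCE + 0.3 * (artistScore * albumScore);
  return clamp(raw, MIN_CONFIDENCE, MAX_CONFIDENCE);
}

// Assumption: if either score column is absent (null), fall back to option (b).
function confidenceFor(scores: TrigramScores): number {
  const { trgmArtistScore, trgmAlbumScore } = scores;
  if (trgmArtistScore === null || trgmAlbumScore === null) {
    return FIXED_CONFIDENCE;
  }
  return scaledConfidence(trgmArtistScore, trgmAlbumScore);
}

// Example: artist 0.8, album 0.75 gives a product of 0.6, so roughly 0.88.
const example = confidenceFor({ trgmArtistScore: 0.8, trgmAlbumScore: 0.75 });
console.log(example.toFixed(2));
```

If the [1.55, 2.0] bound on combined carries over to the rows in fuzzy_resolved and each similarity is at most 1, the product of the two scores is at least 0.55. Option (a) would then yield values between roughly 0.865 and 0.95, and the lower clamp would act only as a guard.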

Writer contract dispensation (NEW for §3.2.2.2)

S4 cannot populate any external-ID column. The per-source row needs:

  • external_id = 'wxyc:library:N' (synthetic; or just stamp the Backend library.id itself, depending on what fits the type system cleanly)
  • source = 'trigram_match'
  • method = 'trigram', confidence per the option above
  • notes = 'backfill:S4'
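As a rough sketch of the per-source row described above, the mapping from one fuzzy_resolved row might look like the following. The interface shapes and the `library_id` column name are assumptions made for illustration. The synthetic 'wxyc:library:N' form is one of the two external_id choices mentioned, and the confidence is passed in so the sketch stays neutral between options (a) and (b).

```typescript
// Hypothetical row shapes; the real table definitions may differ.
interface FuzzyResolvedRow {
  artist_norm: string;
  album_norm: string;
  resolved_library_id: number;
}

interface LibraryIdentitySourceRow {
  library_id: number; // assumed column name
  source: "trigram_match";
  external_id: string;
  method: "trigram";
  confidence: number;
  notes: "backfill:S4";
}

// Maps one fuzzy_resolved row to one library_identity_source row.
function toSourceRow(
  row: FuzzyResolvedRow,
  confidence: number,
): LibraryIdentitySourceRow {
  return {
    library_id: row.resolved_library_id,
    source: "trigram_match",
    // Synthetic ID: S4 has no Discogs ID to offer.
    external_id: `wxyc:library:${row.resolved_library_id}`,
    method: "trigram",
    confidence,
    notes: "backfill:S4",
  };
}

const sample = toSourceRow(
  { artist_norm: "example artist", album_norm: "example album", resolved_library_id: 42 },
  0.85,
);
console.log(sample.external_id);
```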

The §3.2.2.2 main-row recompute treats this row as agreement-only: its (method, confidence) factor into the MIN-of-confidences fallback (Rule 4) but its external_id never lands on a library_identity main column. The recompute's column-mapping table needs a carve-out to ignore source='trigram_match'.

No schema change is needed, only the recompute behavior amendment and a notes qualifier.
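The carve-out can be pictured with a simplified recompute. This sketch makes several assumptions: the column-mapping entries are placeholders rather than the real S1/S2/S3 sources, only the MIN-of-confidences fallback (Rule 4) is modelled, and taking the method from the lowest-confidence row is a guess at how the fallback picks a method. The point it shows is that a 'trigram_match' row feeds (method, confidence) but never an external-ID column.

```typescript
// Hypothetical shapes and mapping entries, for illustration only.
interface SourceRow {
  source: string;
  external_id: string;
  method: string;
  confidence: number;
}

interface MainRow {
  externalIds: Record<string, string | null>;
  method: string | null;
  confidence: number | null;
}

// Placeholder column mapping; the real table lists the actual sources.
const COLUMN_FOR_SOURCE: Record<string, string> = {
  example_source_a: "example_a_id",
  example_source_b: "example_b_id",
};

// The carve-out: these sources count toward confidence but own no column.
const AGREEMENT_ONLY_SOURCES = new Set<string>(["trigram_match"]);

function recomputeMainRow(rows: SourceRow[]): MainRow {
  const externalIds: Record<string, string | null> = {};
  for (const column of Object.values(COLUMN_FOR_SOURCE)) {
    externalIds[column] = null;
  }

  let weakest: SourceRow | null = null;
  for (const row of rows) {
    // Every row, including S4, takes part in the MIN-of-confidences fallback.
    if (weakest === null || row.confidence < weakest.confidence) {
      weakest = row;
    }
    // Agreement-only rows never write an external-ID column.
    if (AGREEMENT_ONLY_SOURCES.has(row.source)) continue;
    const column = COLUMN_FOR_SOURCE[row.source];
    if (column !== undefined) externalIds[column] = row.external_id;
  }

  return {
    externalIds,
    method: weakest === null ? null : weakest.method,
    confidence: weakest === null ? null : weakest.confidence,
  };
}
```

An S4-only input yields a main row whose external IDs are all null while method and confidence are set, which is the shape the unit test in the acceptance list checks for.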

Acceptance

  • DRY_RUN=true reports per-source row counts as JSON without writing.
  • Real run populates one library_identity_source row per fuzzy_resolved row.
  • Main-row recompute correctly excludes S4 rows from external-ID column population (only contributes to confidence/agreement).
  • Unit test for the recompute carve-out: an S4-only library_id produces a main row with all NULL external IDs but a non-null (method, confidence).
  • Cross-source agreement: when S1/S2/S3 have already written external IDs for the library_id, S4's contribution boosts main confidence via Rule 2 (cross_source_agreement) where that rule applies.
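The DRY_RUN criterion could be met along these lines. This is a sketch under assumptions: `runBackfill`, `countByNotes`, and the injected `Writer` are hypothetical names, the report shape is invented for illustration, and rows are grouped by their notes tag as a stand-in for "per-source". The caller is assumed to derive `dryRun` from the DRY_RUN environment variable.

```typescript
// Hypothetical names; the real job may structure this differently.
interface PendingRow {
  notes: string; // for example "backfill:S4"
}

type Writer = (rows: PendingRow[]) => Promise<void>;

// Counts pending rows per notes tag.
function countByNotes(rows: PendingRow[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    counts[row.notes] = (counts[row.notes] ?? 0) + 1;
  }
  return counts;
}

// Returns the JSON report; calls the writer only when dryRun is false.
async function runBackfill(
  rows: PendingRow[],
  dryRun: boolean,
  write: Writer,
): Promise<string> {
  const report = JSON.stringify({ dryRun, counts: countByNotes(rows) });
  if (!dryRun) {
    await write(rows);
  }
  return report;
}
```

Injecting the writer keeps the dry-run path testable without a database connection.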

Estimated LOC

~300 (reader + recompute carve-out + tests). Plus optional ~3-line edit to fuzzy-trigram-flowsheet.sql if option (a) is chosen.
