Skip to content

§4 step 2 sub-PR 2.2a — S3 discogs-cache flowsheet_match reader #795

@jakebromberg

Description

@jakebromberg

Add flowsheet_match as the S3 source leg for the library-identity-backfill (cross-cache-identity §4 step 2). One of the highest-volume sources (~23,408 rows from the 2026-04-28 reconciliation run).

Plan: plans/library-hook-canonicalization/section-4-step-2-backfill-plan.md sub-PR 2.2a.
Spike: plans/library-hook-canonicalization/audits/discogs-cache-match-score-shape.md (lands in PR #794).

Scope

  • New module jobs/library-identity-backfill/sources/discogs-cache-flowsheet-match.ts. Reads from DATABASE_URL_DISCOGS. The table is materialized by Backend-Service/scripts/discogs-bridge-flowsheet.sql:144-161 during the backfill window — the reader either re-materializes it inline or reads from a snapshot the operator has already produced.
  • Filters to rows where the join columns can be normalized to (library.norm_artist, library.norm_title) per the §3.3 normalization function (already deployed via wxyc-etl 0.3.0).
  • Maps each flowsheet_match row → one library_identity_source row tagged notes='backfill:S3'.

Method assignment (post-spike)

  • distinct_entities = 1method='exact_match', confidence=1.00
  • distinct_entities > 1 (the 4,594 ambiguous rows) → method='alias_match', confidence=0.75. The plan's prior method='trigram', confidence=0.7+0.3*<trgm_score> branch is dead code per the spike — flowsheet_match is an exact equi-join on normalized strings; no per-row score exists.

Discogs ID mapping (post-spike, option (c))

  • flowsheet_match.master_id IS NOT NULL → write discogs_master_id = master_id, leave discogs_release_id NULL.
  • flowsheet_match.master_id IS NULL → write discogs_release_id = release_id, leave discogs_master_id NULL.
  • Never both, never neither. Mirrors the upstream bridge's master-preferred / release-fallback pattern.

Writer contract

Standard. One per-source row per flowsheet_match × library_id, populating exactly one of the two release-level Discogs ID columns. The §3.2.2.2 main-row recompute uses S3's contributions in the cross-source agreement detector (§3.2.5) — when S1 and S3 both pin discogs_release_id for the same library row, that's a corroboration.

Acceptance

  • DRY_RUN=true reports per-source row counts as JSON without writing.
  • Real run populates one library_identity_source row per matched flowsheet pair × library_id.
  • Cross-source agreement triggers on rows where S1+S3 both pin a Discogs entity: main row method becomes cross_source_agreement, confidence=0.95, agreement_sources populated.
  • Unit tests for the master-vs-release option-(c) selector against fixture data.
  • Integration test (or SQL contract test) for the per-source UPSERT shape.
  • §3.2.3 gate-check query reports lower truly_unresolved_rows than after 2.1.

Estimated LOC

~300 (reader + master/release selector + tests).

Refs

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Done

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions