Add flowsheet_match as the S3 source leg for the library-identity-backfill (cross-cache-identity §4 step 2). One of the highest-volume sources (~23,408 rows from the 2026-04-28 reconciliation run).
Plan: plans/library-hook-canonicalization/section-4-step-2-backfill-plan.md sub-PR 2.2a.
Spike: plans/library-hook-canonicalization/audits/discogs-cache-match-score-shape.md (lands in PR #794).
Scope
- New module
jobs/library-identity-backfill/sources/discogs-cache-flowsheet-match.ts. Reads from DATABASE_URL_DISCOGS. The table is materialized by Backend-Service/scripts/discogs-bridge-flowsheet.sql:144-161 during the backfill window — the reader either re-materializes it inline or reads from a snapshot the operator has already produced.
- Filters to rows where the join columns can be normalized to
(library.norm_artist, library.norm_title) per the §3.3 normalization function (already deployed via wxyc-etl 0.3.0).
- Maps each
flowsheet_match row → one library_identity_source row tagged notes='backfill:S3'.
Method assignment (post-spike)
distinct_entities = 1 → method='exact_match', confidence=1.00
distinct_entities > 1 (the 4,594 ambiguous rows) → method='alias_match', confidence=0.75. The plan's prior method='trigram', confidence=0.7+0.3*<trgm_score> branch is dead code per the spike — flowsheet_match is an exact equi-join on normalized strings; no per-row score exists.
Discogs ID mapping (post-spike, option (c))
flowsheet_match.master_id IS NOT NULL → write discogs_master_id = master_id, leave discogs_release_id NULL.
flowsheet_match.master_id IS NULL → write discogs_release_id = release_id, leave discogs_master_id NULL.
- Never both, never neither. Mirrors the upstream bridge's master-preferred / release-fallback pattern.
Writer contract
Standard. One per-source row per flowsheet_match × library_id, populating exactly one of the two release-level Discogs ID columns. The §3.2.2.2 main-row recompute uses S3's contributions in the cross-source agreement detector (§3.2.5) — when S1 and S3 both pin discogs_release_id for the same library row, that's a corroboration.
Acceptance
DRY_RUN=true reports per-source row counts as JSON without writing.
- Real run populates one
library_identity_source row per matched flowsheet pair × library_id.
- Cross-source agreement triggers on rows where S1+S3 both pin a Discogs entity: main row method becomes
cross_source_agreement, confidence=0.95, agreement_sources populated.
- Unit tests for the master-vs-release option-(c) selector against fixture data.
- Integration test (or SQL contract test) for the per-source UPSERT shape.
- §3.2.3 gate-check query reports lower
truly_unresolved_rows than after 2.1.
Estimated LOC
~300 (reader + master/release selector + tests).
Refs
Add
flowsheet_matchas the S3 source leg for the library-identity-backfill (cross-cache-identity §4 step 2). One of the highest-volume sources (~23,408 rows from the 2026-04-28 reconciliation run).Plan:
plans/library-hook-canonicalization/section-4-step-2-backfill-plan.mdsub-PR 2.2a.Spike:
plans/library-hook-canonicalization/audits/discogs-cache-match-score-shape.md(lands in PR #794).Scope
jobs/library-identity-backfill/sources/discogs-cache-flowsheet-match.ts. Reads fromDATABASE_URL_DISCOGS. The table is materialized byBackend-Service/scripts/discogs-bridge-flowsheet.sql:144-161during the backfill window — the reader either re-materializes it inline or reads from a snapshot the operator has already produced.(library.norm_artist, library.norm_title)per the §3.3 normalization function (already deployed via wxyc-etl 0.3.0).flowsheet_matchrow → onelibrary_identity_sourcerow taggednotes='backfill:S3'.Method assignment (post-spike)
distinct_entities = 1→method='exact_match', confidence=1.00distinct_entities > 1(the 4,594 ambiguous rows) →method='alias_match', confidence=0.75. The plan's priormethod='trigram', confidence=0.7+0.3*<trgm_score>branch is dead code per the spike —flowsheet_matchis an exact equi-join on normalized strings; no per-row score exists.Discogs ID mapping (post-spike, option (c))
flowsheet_match.master_id IS NOT NULL→ writediscogs_master_id = master_id, leavediscogs_release_idNULL.flowsheet_match.master_id IS NULL→ writediscogs_release_id = release_id, leavediscogs_master_idNULL.Writer contract
Standard. One per-source row per
flowsheet_match× library_id, populating exactly one of the two release-level Discogs ID columns. The §3.2.2.2 main-row recompute uses S3's contributions in the cross-source agreement detector (§3.2.5) — when S1 and S3 both pindiscogs_release_idfor the same library row, that's a corroboration.Acceptance
DRY_RUN=truereports per-source row counts as JSON without writing.library_identity_sourcerow per matched flowsheet pair × library_id.cross_source_agreement,confidence=0.95,agreement_sourcespopulated.truly_unresolved_rowsthan after 2.1.Estimated LOC
~300 (reader + master/release selector + tests).
Refs