Fix performer Wikipedia-link script + harden nickname matching #211

Merged
dprodger merged 4 commits into main from wikipedia-performer-matching on May 31, 2026

@dprodger

Owner

Summary

Fixes the performer Wikipedia-link batch script (backend/scripts/verify_performer_references.py) so it actually runs, makes its bulk output usable, and substantially hardens the shared Wikipedia matcher against false-positive matches for performers whose stored name carries a quoted nickname (e.g. "Brother" Jack McDuff, 'Papa' John DeFrancesco).

Four focused commits:

1. Load .env in the script

The script rolled its own sys.path bootstrap instead of using ScriptBase, so it never loaded .env. db_utils then read empty DB_* vars and fell back to the local Postgres socket (host=None). Added the same dotenv-loading block script_base.py uses.
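
A minimal sketch of the idea, assuming a plain `KEY=value` file. The actual block in script_base.py is not shown in this PR, so `parse_env_lines` and `apply_env` are hypothetical stand-ins written with the standard library only. The point is that the variables must be in the environment before db_utils reads its DB_* settings.

```python
import os
from pathlib import Path


def parse_env_lines(lines):
    """Parse KEY=value lines, skipping blanks, comments and malformed rows."""
    parsed = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key:
            # Drop optional surrounding quotes around the value.
            parsed[key] = value.strip().strip("'\"")
    return parsed


def apply_env(parsed, environ=os.environ):
    """Copy parsed values into environ without overriding existing ones."""
    added = 0
    for key, value in parsed.items():
        if key not in environ:
            environ[key] = value
            added += 1
    return added


def load_env_file(path, environ=os.environ):
    """Load a .env file if it exists; return how many variables were added."""
    path = Path(path)
    if not path.is_file():
        return 0
    lines = path.read_text(encoding="utf-8").splitlines()
    return apply_env(parse_env_lines(lines), environ)
```

In a script this would run before any import that reads DB_* at import time. Values already set in the real environment win over the file.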

2. Reduce noise and idle time

For bulk runs over 30k+ performers:

  • Demote unchanged/skipped rows to DEBUG so the INFO stream shows only real changes and items needing review.
  • Add a progress heartbeat every 50 performers with running counts and an ETA.
  • Drop the 1.5s per-performer loop sleep — the Wikipedia and MusicBrainz paths already throttle their own live requests, so it was pure idle time (~13h over 30k rows).
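
The logging changes above can be sketched roughly as follows. This is an illustration, not the script's code: `run`, `process`, `estimate_remaining` and `format_eta` are hypothetical names, and only the interval of 50 and the DEBUG demotion of unchanged/skipped rows come from this PR.

```python
import logging
import time
from collections import Counter

PROGRESS_INTERVAL = 50  # heartbeat every 50 performers, as described above
QUIET_OUTCOMES = {"unchanged", "skipped"}  # demoted to DEBUG


def estimate_remaining(done, total, elapsed):
    """Linear ETA in seconds: average time per row times rows left."""
    if done <= 0:
        return 0.0
    return elapsed * (total - done) / done


def format_eta(seconds):
    """Render a duration in seconds as e.g. 12h30m00s."""
    hours, rem = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:d}h{minutes:02d}m{secs:02d}s"


def run(performers, process, logger, clock=time.monotonic):
    """Process each performer; process() returns an outcome string (assumed)."""
    counts = Counter()
    total = len(performers)
    start = clock()
    for done, performer in enumerate(performers, start=1):
        outcome = process(performer)
        counts[outcome] += 1
        level = logging.DEBUG if outcome in QUIET_OUTCOMES else logging.INFO
        logger.log(level, "%s: %s", performer, outcome)
        if done % PROGRESS_INTERVAL == 0:
            eta = estimate_remaining(done, total, clock() - start)
            logger.info("progress %d/%d %s ETA %s",
                        done, total, dict(counts), format_eta(eta))
    return counts
```

As a check on the idle-time figure: 30,000 rows at 1.5 s each is 45,000 s, which `format_eta(30000 * 1.5)` renders as `12h30m00s`, in line with the roughly 13h quoted for runs of 30k+ rows.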

3. Strip quoted nicknames in matching (shared WikipediaSearcher)

Names with a decorative quoted nickname failed lookup in two ways: OpenSearch returned an album/redirect instead of the canonical article, and the name comparison scored only a partial match (45, below the 50-point threshold).

  • _strip_nickname() removes paired double quotes and smart single quotes; lone apostrophes (O'Brien, D'Angelo) are untouched. The stripped form is only used when it still has >= 2 tokens, so a bare surname ('Doc' West -> West) is not used — it would fuzzy-match unrelated famous people.
  • Verify compares the stripped name on both sides (exact, not partial, match).
  • Search queries the canonical form first, then the stored name, merging candidates. Extracted _opensearch(), collapsing duplicated branches and no longer caching transient request failures.
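
A rough sketch of the stripping rule and the query order described above. The regex and the helper names `strip_nickname` and `candidate_queries` are assumptions for illustration; the real `_strip_nickname()` may differ in which quote characters it accepts.

```python
import re

# A quoted nickname is a paired double-quoted segment, or a single-quoted
# segment whose opening quote starts a word. An apostrophe inside a word
# (O'Brien, D'Angelo) has no such opener and never matches.
_QUOTED_NICKNAME = re.compile(
    r'"[^"]+"'                       # straight double quotes
    r"|“[^”]+”"                      # curly double quotes
    r"|(?<!\S)['‘][^'’]+['’](?!\S)"  # single quotes opening at a word start
)


def strip_nickname(name):
    """Drop quoted nicknames; keep the original if fewer than 2 tokens remain."""
    stripped = " ".join(_QUOTED_NICKNAME.sub(" ", name).split())
    if len(stripped.split()) >= 2:
        return stripped
    # A bare surname would fuzzy-match unrelated people, so do not use it.
    return name


def candidate_queries(name):
    """Canonical (stripped) form first, then the stored name, without duplicates."""
    canonical = strip_nickname(name)
    return [canonical] if canonical == name else [canonical, name]
```

For example, `strip_nickname('"Brother" Jack McDuff')` gives `Jack McDuff`, while `'Doc' West` is returned unchanged because only one token would remain.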

4. Reject non-musician and ambiguous matches

Nickname stripping can yield a common "First Last" that collides with a famous different person ("Captain" Kirk Douglas -> the actor; "Virginia" Joe Jones -> the Fluxus musician).

  • Non-musician guard: reject a page whose infobox/lead establishes a non-musician subject (actor, athlete, politician) and carries no music signal; a music term in the infobox/lead protects genuine musicians.
  • Disambiguation-corroboration guard: a (...)-disambiguated title requires a birth/death-year or song match before acceptance.
  • Strip hatnotes ("For the musician, see ...") before scoring so cross-references don't leak the other subject's keywords.

Scope note

Commits 3 and 4 touch the shared WikipediaSearcher, so they also improve the core ingestion path (#208) and create_artist.py, not just the batch script.

Verification

Verified end-to-end against live Wikipedia and production data:

| Name | Result |
| --- | --- |
| "Brother" Jack McDuff | -> canonical Jack_McDuff |
| 'Papa' John DeFrancesco | -> John_DeFrancesco |
| Miles Davis (plain) | still valid |
| 'Doc' West, "Bugs" Bower, "Bumps" Myers | no match (was: Kanye West, Kris Bowers, Myers) |
| "Captain" Kirk Douglas | no match (was: actor Kirk Douglas) |
| "Virginia" Joe Jones | no match (was: wrong Joe Jones) |

Adds 20 offline unit tests (test_wikipedia_nickname.py, test_wikipedia_verify_guards.py) covering _strip_nickname and all the guards via crafted HTML.

Related issues

🤖 Generated with Claude Code

dprodger and others added 4 commits May 31, 2026 15:28
The script hand-rolls its own sys.path bootstrap instead of using
ScriptBase, and so never loaded .env. db_utils then read empty DB_*
vars and fell back to the local Postgres socket, failing with
host=None. Add the same dotenv-loading block script_base.py uses so
the script connects using the configured database.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Bulk runs over 30k+ performers were dominated by per-row log lines and
loop-level sleep:

- Demote "unchanged"/"skipped" rows to DEBUG so the INFO stream shows
  only actual changes and items needing review.
- Add a progress heartbeat every PROGRESS_INTERVAL (50) performers with
  running counts and an ETA.
- Drop the 1.5s per-performer sleep. WikipediaSearcher.rate_limit() and
  the MusicBrainz path already throttle their own live requests, so the
  extra loop sleep was pure idle time (~13h over 30k rows).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Performer names stored with a decorative quoted nickname (e.g.
"Brother" Jack McDuff, 'Papa' John DeFrancesco) failed Wikipedia lookup
two ways: OpenSearch returned an album/redirect instead of the canonical
article, and the name comparison scored only a partial match (45 < the
50-point threshold).

- Add WikipediaSearcher._strip_nickname() - removes paired double quotes
  and smart single quotes. Lone apostrophes (O'Brien, D'Angelo) have no
  opener and are left untouched. The stripped form is only used when it
  still has >= 2 tokens (a plausible first + last name); collapsing to a
  single bare surname is rejected and the original kept, since a lone
  surname fuzzy-matches unrelated famous people ('Doc' West -> Kanye West,
  'Bugs' Bower -> Kris Bowers).
- verify_wikipedia_reference now strips nicknames from both the performer
  name and the page title before matching, so the stripped legal name is
  an exact (not partial) match.
- search_wikipedia queries the nickname-stripped (canonical) form first,
  then the stored name, merging candidates - so the canonical article is
  found and preferred over an album/redirect. Extracted the OpenSearch
  call into _opensearch(), collapsing the duplicated force/non-force
  branches and no longer caching transient request failures.
- Add unit tests for _strip_nickname, including the single-surname guard.

Improves the shared searcher used by both the batch verifier and the
core ingestion path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Nickname stripping can yield a common "First Last" that collides with a
famous *different* person, and exact-name + generic music keywords (50)
was too weak to tell them apart. Two false positives seen in the wild:
"Captain" Kirk Douglas matched the actor Kirk Douglas, and "Virginia"
Joe Jones matched a different Joe Jones (the Fluxus musician).

Add two precision guards to verify_wikipedia_reference:

- Non-musician guard: if the infobox/lead establishes a non-musician
  subject (actor, athlete, politician, ...) and carries no music signal,
  reject. A music term in the infobox/lead protects genuine musicians
  (e.g. "jazz organist"), so McDuff/DeFrancesco are unaffected.
- Disambiguation-corroboration guard: a parenthetically disambiguated
  title ("Joe Jones (Fluxus musician)") means several same-named people
  exist, so require a birth/death-year or song match before accepting.

Also strip hatnotes ("For the musician, see ...") before reading the
page text. They are cross-references to other subjects; letting their
keywords leak in mis-scored pages (the Kirk Douglas actor hatnote even
mentions "musician" and points at the real performer).

Both bad cases now resolve to no match; McDuff, DeFrancesco and Miles
Davis still verify. Adds offline unit tests (crafted HTML) for the
guards.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dprodger dprodger added the data-cleanup (projects related to the underlying metadata, scrapers, ingesters, etc.) label May 31, 2026
@dprodger dprodger merged commit 9fa32e1 into main May 31, 2026
3 checks passed
@dprodger dprodger deleted the wikipedia-performer-matching branch May 31, 2026 19:35