Fix performer Wikipedia-link script + harden nickname matching #211
Merged
The script hand-rolls its own sys.path bootstrap instead of using ScriptBase, so it never loaded .env. db_utils then read empty DB_* vars and fell back to the local Postgres socket, failing with host=None. Add the same dotenv-loading block script_base.py uses so the script connects using the configured database.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
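A minimal, stdlib-only sketch of what loading .env before reading DB_* variables amounts to. The actual block copied from script_base.py is not shown in this PR text and presumably uses a dotenv library; `load_env_file` and its parsing rules are hypothetical and exist only to illustrate the ordering problem (environment must be populated before db_utils reads it).

```python
import os
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file and export them.

    Hypothetical helper, not the project's code. Variables already set in
    the process environment win, so a shell export is never overwritten.
    """
    loaded: dict[str, str] = {}
    if not path.is_file():
        return loaded
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("\"'")
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling something like this at the top of the script, before any module reads DB_* variables, is what prevents the silent fallback to the local socket.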
Bulk runs over 30k+ performers were dominated by per-row log lines and a loop-level sleep:
- Demote "unchanged"/"skipped" rows to DEBUG so the INFO stream shows only actual changes and items needing review.
- Add a progress heartbeat every PROGRESS_INTERVAL (50) performers with running counts and an ETA.
- Drop the 1.5s per-performer sleep. WikipediaSearcher.rate_limit() and the MusicBrainz path already throttle their own live requests, so the extra loop sleep was pure idle time (~13h over 30k rows).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
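The idle-time figure and the heartbeat can be sketched as below. Only PROGRESS_INTERVAL = 50 and the 1.5s sleep come from the commit message; the function names, the message format, and the ETA formula (remaining rows divided by the observed rate) are assumptions, not the script's actual code.

```python
from typing import Optional

PROGRESS_INTERVAL = 50  # value stated in the commit message


def idle_hours(rows: int, sleep_seconds: float) -> float:
    """Total time spent sleeping if every row sleeps for sleep_seconds."""
    return rows * sleep_seconds / 3600


def heartbeat(done: int, total: int, elapsed: float, changed: int) -> Optional[str]:
    """Return a progress line every PROGRESS_INTERVAL rows, else None."""
    if done == 0 or done % PROGRESS_INTERVAL != 0:
        return None
    rate = done / elapsed if elapsed > 0 else 0.0  # rows per second so far
    eta = (total - done) / rate if rate > 0 else float("inf")
    return f"{done}/{total} processed, {changed} changed, ETA {eta / 60:.1f} min"
```

At exactly 30,000 rows the removed sleep accounts for 30,000 x 1.5 s = 12.5 hours, consistent with the "~13h" quoted for a run of slightly more than 30k performers.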
Performer names stored with a decorative quoted nickname (e.g.
"Brother" Jack McDuff, 'Papa' John DeFrancesco) failed Wikipedia lookup
two ways: OpenSearch returned an album/redirect instead of the canonical
article, and the name comparison scored only a partial match (45 < the
50-point threshold).
- Add WikipediaSearcher._strip_nickname() - removes paired double quotes
and smart single quotes. Lone apostrophes (O'Brien, D'Angelo) have no
opener and are left untouched. The stripped form is only used when it
still has >= 2 tokens (a plausible first + last name); collapsing to a
single bare surname is rejected and the original kept, since a lone
surname fuzzy-matches unrelated famous people ('Doc' West -> Kanye West,
'Bugs' Bower -> Kris Bowers).
- verify_wikipedia_reference now strips nicknames from both the performer
name and the page title before matching, so the stripped legal name is
an exact (not partial) match.
- search_wikipedia queries the nickname-stripped (canonical) form first,
then the stored name, merging candidates - so the canonical article is
found and preferred over an album/redirect. Extracted the OpenSearch
call into _opensearch(), collapsing the duplicated force/non-force
branches and no longer caching transient request failures.
- Add unit tests for _strip_nickname, including the single-surname guard.
Improves the shared searcher used by both the batch verifier and the
core ingestion path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
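A rough sketch of the stripping rule described in this commit, assuming a quoted nickname is a whitespace-delimited token wrapped in a matching pair of quotes. The regex and the standalone function name `strip_nickname` are illustrative; the real WikipediaSearcher._strip_nickname() may differ in detail.

```python
import re

# A quoted run that starts at a word boundary (start of string or after
# whitespace) and ends before whitespace or end of string. An apostrophe
# inside a word (O'Brien) has no opener at a boundary, so it never matches.
_QUOTED_NICKNAME = re.compile(
    r"""(?<!\S)(?:"[^"]+"|“[^”]+”|'[^']+'|‘[^’]+’)(?!\S)"""
)


def strip_nickname(name: str) -> str:
    """Drop quoted nicknames, unless that would leave a single bare token."""
    stripped = " ".join(_QUOTED_NICKNAME.sub(" ", name).split())
    # Single-surname guard: keep the original rather than match on "West".
    if len(stripped.split()) >= 2:
        return stripped
    return name


print(strip_nickname('"Brother" Jack McDuff'))  # Jack McDuff
print(strip_nickname("'Doc' West"))             # 'Doc' West (kept as stored)
```

The guard deliberately errs toward the stored name: a failed lookup is cheaper than linking a performer to an unrelated famous namesake.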
Nickname stripping can yield a common "First Last" that collides with a
famous *different* person, and exact-name + generic music keywords (50)
was too weak to tell them apart. Two false positives seen in the wild:
"Captain" Kirk Douglas matched the actor Kirk Douglas, and "Virginia"
Joe Jones matched a different Joe Jones (the Fluxus musician).
Add two precision guards to verify_wikipedia_reference:
- Non-musician guard: if the infobox/lead establishes a non-musician
subject (actor, athlete, politician, ...) and carries no music signal,
reject. A music term in the infobox/lead protects genuine musicians
(e.g. "jazz organist"), so McDuff/DeFrancesco are unaffected.
- Disambiguation-corroboration guard: a parenthetically disambiguated
title ("Joe Jones (Fluxus musician)") means several same-named people
exist, so require a birth/death-year or song match before accepting.
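The two guards can be pictured as a pair of predicates applied before a candidate is accepted. This is a sketch under assumptions: the keyword tuples are tiny illustrative samples, and `passes_guards` and its parameters are hypothetical names. The real verify_wikipedia_reference also scores name and keyword matches, which is omitted here.

```python
import re

# Illustrative samples only; the matcher's real vocabularies are not shown.
NON_MUSICIAN_TERMS = ("actor", "actress", "athlete", "politician")
MUSIC_TERMS = ("musician", "jazz", "organist", "saxophonist", "singer", "composer")


def _has_term(text: str, terms: tuple) -> bool:
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(term in words for term in terms)


def is_non_musician(lead: str) -> bool:
    """True when the lead names a non-music occupation and no music signal."""
    return _has_term(lead, NON_MUSICIAN_TERMS) and not _has_term(lead, MUSIC_TERMS)


def needs_corroboration(title: str) -> bool:
    """True for parenthetically disambiguated titles like 'Name (role)'."""
    return re.search(r"\s\([^()]+\)$", title.replace("_", " ").strip()) is not None


def passes_guards(title: str, lead: str, year_match: bool, song_match: bool) -> bool:
    if is_non_musician(lead):
        return False
    if needs_corroboration(title) and not (year_match or song_match):
        return False
    return True
```

With these inputs, an actor's page with no music term is rejected outright, while a disambiguated title is accepted only once a birth/death year or a song corroborates it.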
Also strip hatnotes ("For the musician, see ...") before reading the
page text. They are cross-references to other subjects; letting their
keywords leak in mis-scored pages (the Kirk Douglas actor hatnote even
mentions "musician" and points at the real performer).
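Hatnote stripping can be sketched with the standard-library HTML parser, assuming hatnotes are rendered as div elements carrying a "hatnote" class. The class and function names below are hypothetical, and the project's own implementation may use a different HTML library.

```python
from html.parser import HTMLParser


class _TextWithoutHatnotes(HTMLParser):
    """Collect page text while skipping everything inside hatnote divs."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside a hatnote <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Track nested divs so the matching close tag ends the skip.
        if self._skip_depth or "hatnote" in classes:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def text_without_hatnotes(html: str) -> str:
    parser = _TextWithoutHatnotes()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Run on a crafted page whose hatnote reads "For the musician, see ...", the returned text no longer contains the word "musician", so the cross-reference cannot contribute a music keyword to the actor's page.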
Both bad cases now resolve to no match; McDuff, DeFrancesco and Miles
Davis still verify. Adds offline unit tests (crafted HTML) for the
guards.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Fixes the performer Wikipedia-link batch script (backend/scripts/verify_performer_references.py) so it actually runs, makes its bulk output usable, and substantially hardens the shared Wikipedia matcher against false-positive matches for performers whose stored name carries a quoted nickname (e.g. "Brother" Jack McDuff, 'Papa' John DeFrancesco).

Four focused commits:

1. Load .env in the script
The script rolled its own sys.path bootstrap instead of using ScriptBase, so it never loaded .env. db_utils then read empty DB_* vars and fell back to the local Postgres socket (host=None). Added the same dotenv-loading block script_base.py uses.

2. Reduce noise and idle time
For bulk runs over 30k+ performers:
- Demote unchanged/skipped rows to DEBUG so the INFO stream shows only real changes and items needing review.

3. Strip quoted nicknames in matching (shared WikipediaSearcher)
Names with a decorative quoted nickname failed lookup two ways: OpenSearch returned an album/redirect instead of the canonical article, and the name comparison scored only a partial match (45, below the 50-point threshold).
- _strip_nickname() removes paired double quotes and smart single quotes; lone apostrophes (O'Brien, D'Angelo) are untouched. The stripped form is only used when it still has >= 2 tokens, so a bare surname ('Doc' West -> West) is not used, since it would fuzzy-match unrelated famous people.
- Extracted the OpenSearch call into _opensearch(), collapsing duplicated branches and no longer caching transient request failures.

4. Reject non-musician and ambiguous matches
Nickname stripping can yield a common "First Last" that collides with a famous different person ("Captain" Kirk Douglas -> the actor; "Virginia" Joe Jones -> the Fluxus musician).
- A (...)-disambiguated title requires a birth/death-year or song match before acceptance.

Scope note
Commits 3 and 4 touch the shared WikipediaSearcher, so they also improve the core ingestion path (#208) and create_artist.py, not just the batch script.

Verification
Verified end-to-end against live Wikipedia and production data:
- "Brother" Jack McDuff -> Jack_McDuff
- 'Papa' John DeFrancesco -> John_DeFrancesco
- Miles Davis (plain)
- 'Doc' West, "Bugs" Bower, "Bumps" Myers
- "Captain" Kirk Douglas
- "Virginia" Joe Jones

Adds 20 offline unit tests (test_wikipedia_nickname.py, test_wikipedia_verify_guards.py) covering _strip_nickname and all the guards via crafted HTML.

Related issues
🤖 Generated with Claude Code