feat(import): add work_identifiers support for work-level matching (#9472) #12945
Draft
mekarpeles wants to merge 2 commits into
Adds a new optional `work_identifiers` field to the import schema, parallel to `identifiers` (edition). When present, the import pipeline uses these IDs to locate the matching Work before falling back to title comparison.

Changes:
- import.schema.json: add `work_identifiers` field (same shape as `identifiers`)
- find_matching_work(): identifier-first lookup via infobase things() query, title-based match retained as fallback
- update_work_with_rec_data(): merges incoming work_identifiers onto the matched Work, deduplicated
- 5 new tests covering happy path, identifier-wins-over-title conflict, enrichment write-back, unknown-identifier fallback, and absent-field backward compat
This branch had a messy development history: multiple fixup passes and a pre-commit.ci loop. That is now resolved. Current state:
History (2 commits above master):
CI: all checks passing
Note on the pre-commit.ci loop: local ruff and CI ruff disagree on import blank-line placement in
Pierre, via PAM
When a record with work_identifiers matched an existing work via load_data() (the no-edition-match code path), the identifiers were never merged onto the work object. Only the load() → update_work_with_rec_data() path handled this. Merges incoming identifiers into the work, deduplicates, and marks for update.
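The merge step described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `merge_work_identifiers` is a hypothetical helper name, and the Work is modelled as a plain dict rather than an infobase object.

```python
def merge_work_identifiers(work, incoming):
    """Merge incoming identifiers into work["identifiers"] in place.

    `work` is a plain dict standing in for a Work (an assumption for
    illustration). `incoming` has the shape {key: [val, ...]}.
    Returns True if anything was added, i.e. the Work needs saving.
    """
    changed = False
    for name, values in incoming.items():
        for value in values:
            # Create the containers lazily so an empty `incoming`
            # leaves the Work untouched.
            current = work.setdefault("identifiers", {}).setdefault(name, [])
            if value not in current:  # deduplicate against existing values
                current.append(value)
                changed = True
    return changed


work = {"key": "/works/OL1W", "identifiers": {"goodreads": ["123"]}}
incoming = {"goodreads": ["123", "456"], "wikidata": ["Q42"]}

first = merge_work_identifiers(work, incoming)   # adds "456" and "Q42"
second = merge_work_identifiers(work, incoming)  # nothing new the second time
```

Returning a boolean lets the caller decide whether to mark the Work for update, so a record that carries only already-known identifiers does not trigger a write.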
Summary
Closes #9472.
Adds a new optional `work_identifiers` field to the OL import record schema. Importers that know the Work-level identity of a book (e.g. a Goodreads work ID or Wikidata Q-number) can now supply it, and the import pipeline will use it to find the right Work before falling back to title matching.
Motivation
The existing `identifiers` field is for edition-level IDs. There was no way to supply a Work-level identifier, which made it hard to correctly group new editions under the right Work, especially for books without ISBNs (e.g. Wikidata/Wikisource items, aggregator feeds).
Changes
openlibrary/schemata/import.schema.json
- `work_identifiers` property: same shape as `identifiers` (`{key: [val, ...]}`)
- `identifiers` description to clarify it's edition-scoped
openlibrary/catalog/add_book/__init__.py
- `find_matching_work()`: tries `work_identifiers` first (via `site.things()` query), falls back to existing title-based match
- `update_work_with_rec_data()`: merges new `work_identifiers` onto the matched Work, deduplicating
openlibrary/catalog/add_book/tests/test_add_book.py
- `test_work_identifiers_match_existing_work`: Goodreads ID matches the right Work
- `test_work_identifiers_match_wins_over_title`: when identifier and title conflict, identifier wins
- `test_work_identifiers_enriched_onto_matched_work`: new identifier written back to Work
- `test_work_identifiers_no_match_falls_back_to_title`: unknown identifier → title fallback, no error
- `test_work_identifiers_absent_preserves_existing_behaviour`: no `work_identifiers` key → original behaviour unchanged
Design decisions
Identifier-first, title-fallback: same precedence model as `find_author()` (remote_ids check first). A caller supplying a work identifier is asserting they know which Work this is; that's a stronger signal than a title match.
Conflict resolution: identifier wins. If the identifier matches Work A but the title matches Work B, the identifier wins. Same as `find_author()`. The identifier came from an external catalogue; the title match is fuzzy by design.
Match + enrich: if a work identifier isn't already on the Work, it's written back. Follows the pattern already used for edition identifiers (line 887) and author remote_ids (line 269).
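The precedence rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the PR's `find_matching_work()`: `find_work_sketch`, `lookup_identifier`, and `lookup_title` are hypothetical names, the two lookups are injected callables backed by toy dicts, and the title lookup is an exact match where the real one is fuzzier.

```python
from typing import Callable, Optional

# Hypothetical lookup signatures: each returns a Work key or None.
IdentifierLookup = Callable[[str, str], Optional[str]]
TitleLookup = Callable[[str], Optional[str]]


def find_work_sketch(
    rec: dict, by_identifier: IdentifierLookup, by_title: TitleLookup
) -> Optional[str]:
    """Return a Work key, trying work_identifiers before the title."""
    for name, values in rec.get("work_identifiers", {}).items():
        for value in values:
            key = by_identifier(name, value)
            if key is not None:
                # Identifier match wins, even if the title points elsewhere.
                return key
    # No identifier matched, or none was supplied: fall back to the title.
    return by_title(rec["title"])


# Toy indexes standing in for the database (assumed data).
IDENTIFIER_INDEX = {("goodreads", "111"): "/works/OL1W"}
TITLE_INDEX = {"Dune Messiah": "/works/OL2W"}


def lookup_identifier(name: str, value: str) -> Optional[str]:
    return IDENTIFIER_INDEX.get((name, value))


def lookup_title(title: str) -> Optional[str]:
    return TITLE_INDEX.get(title)


conflict = find_work_sketch(
    {"title": "Dune Messiah", "work_identifiers": {"goodreads": ["111"]}},
    lookup_identifier,
    lookup_title,
)  # identifier and title disagree: "/works/OL1W"
fallback = find_work_sketch(
    {"title": "Dune Messiah", "work_identifiers": {"goodreads": ["999"]}},
    lookup_identifier,
    lookup_title,
)  # unknown identifier: "/works/OL2W"
legacy = find_work_sketch({"title": "Dune Messiah"}, lookup_identifier, lookup_title)
```

A record without a `work_identifiers` key never reaches the identifier loop, so the title path runs exactly as it would have before.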
What this does NOT do
- `work_identifiers`: out of scope
- `identifiers` (edition) semantics
Open question
Whether infobase `things()` queries support `identifiers.{key}` lookups on Works the same way they do on Editions needs Docker verification. Editions use this pattern (line 452: `identifiers.wikisource`); Works may or may not be indexed identically. Flagged here for reviewer attention.
🤖 Draft opened by Richy, via PAM (Open Library's Project AI Manager), on behalf of @mekarpeles.
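To make the open question concrete, the query shape under discussion can be sketched as below. `work_identifier_queries` is a hypothetical helper, and the dotted `identifiers.{key}` field on `/type/work` is exactly the unverified assumption: the sketch only builds the query dicts and does not call infobase.

```python
def work_identifier_queries(rec):
    """Build one candidate things() query per (name, value) pair.

    Assumes Works can be queried by a dotted "identifiers.<name>" field
    the way Editions are; that still needs verification in Docker.
    """
    queries = []
    for name, values in rec.get("work_identifiers", {}).items():
        for value in values:
            queries.append({"type": "/type/work", f"identifiers.{name}": value})
    return queries


rec = {
    "title": "Example title",
    "work_identifiers": {"wikidata": ["Q42"], "goodreads": ["111", "222"]},
}
queries = work_identifier_queries(rec)
```

Running each of these dicts through `things()` against a Work known to carry the identifier would answer the question either way.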