feat(import): add work_identifiers support for work-level matching (#9472) #12945

Draft
mekarpeles wants to merge 2 commits into master from 9472/work-identifiers-import
@mekarpeles

Member

Summary

Closes #9472.

Adds a new optional work_identifiers field to the OL import record schema. Importers that know the Work-level identity of a book (e.g. a Goodreads work ID or Wikidata Q-number) can now supply it, and the import pipeline will use it to find the right Work before falling back to title matching.

Motivation

The existing identifiers field is for edition-level IDs. There was no way to supply a Work-level identifier — making it hard to correctly group new editions under the right Work, especially for books without ISBNs (e.g. Wikidata/Wikisource items, aggregator feeds).

Changes

openlibrary/schemata/import.schema.json

  • Added work_identifiers property — same shape as identifiers ({key: [val, ...]})
  • Updated identifiers description to clarify it's edition-scoped
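As a rough illustration of the shape described above, an import record carrying both fields might look like the following. The identifier keys and values are placeholders, not taken from the PR, and `is_identifier_map` is a hypothetical helper that only checks the `{key: [val, ...]}` shape; it is not the project's schema validator.

```python
# Hypothetical import record; every value here is a placeholder.
record = {
    "title": "An Example Title",
    "source_records": ["example:0001"],
    # Edition-scoped identifiers (existing field).
    "identifiers": {"wikisource": ["en:Example_Page"]},
    # Work-scoped identifiers (new optional field, same shape).
    "work_identifiers": {"goodreads": ["0000001"], "wikidata": ["Q0000001"]},
}


def is_identifier_map(value):
    """Return True if value looks like {str: [str, ...]} (hypothetical check)."""
    if not isinstance(value, dict):
        return False
    return all(
        isinstance(key, str)
        and isinstance(vals, list)
        and all(isinstance(v, str) for v in vals)
        for key, vals in value.items()
    )
```

Because the field is optional, a record without `work_identifiers` stays valid under the same check applied only when the key is present.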

openlibrary/catalog/add_book/__init__.py

  • find_matching_work(): tries work_identifiers first (via site.things() query), falls back to existing title-based match
  • update_work_with_rec_data(): merges new work_identifiers onto the matched Work, deduplicating
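The identifier-first lookup with a title fallback can be sketched as below. This is not the PR's code: `FakeSite` is a stand-in that models only a `things()` call returning matching keys, the `identifiers.{key}` query form is the one the PR itself flags as unverified for Works, and `find_work_by_identifiers`, `find_matching_work_sketch`, and `title_match` are hypothetical names.

```python
class FakeSite:
    """Stand-in for the infobase site object; only things() is modelled."""

    def __init__(self, docs):
        self.docs = docs  # key -> document dict

    def things(self, query):
        # Return the keys of documents that satisfy every query term.
        return [
            key
            for key, doc in self.docs.items()
            if all(self._matches(doc, f, v) for f, v in query.items())
        ]

    @staticmethod
    def _matches(doc, field, value):
        if field.startswith("identifiers."):
            name = field.split(".", 1)[1]
            return value in doc.get("identifiers", {}).get(name, [])
        return doc.get(field) == value


def find_work_by_identifiers(site, work_identifiers):
    """Return the first Work key matching any supplied identifier, else None."""
    for name, values in work_identifiers.items():
        for value in values:
            keys = site.things({"type": "/type/work", f"identifiers.{name}": value})
            if keys:
                return keys[0]
    return None


def find_matching_work_sketch(site, rec, title_match):
    """Identifier match first; fall back to the caller's title matcher."""
    key = find_work_by_identifiers(site, rec.get("work_identifiers", {}))
    if key is not None:
        return key
    return title_match(rec)
```

An unknown identifier or an absent `work_identifiers` key both end up in `title_match`, which is how the sketch keeps the pre-existing behaviour as the default.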

openlibrary/catalog/add_book/tests/test_add_book.py

  • test_work_identifiers_match_existing_work — Goodreads ID matches the right Work
  • test_work_identifiers_match_wins_over_title — when identifier and title conflict, identifier wins
  • test_work_identifiers_enriched_onto_matched_work — new identifier written back to Work
  • test_work_identifiers_no_match_falls_back_to_title — unknown identifier → title fallback, no error
  • test_work_identifiers_absent_preserves_existing_behaviour — no work_identifiers key → original behaviour unchanged

Design decisions

  1. Identifier-first, title-fallback: same precedence model as find_author() (remote_ids check first). A caller supplying a work identifier is asserting they know which Work this is; that is a stronger signal than a title match.

  2. Conflict resolution: identifier wins — if identifier matches Work A but title matches Work B, identifier wins. Same as find_author(). The identifier came from an external catalogue; the title match is fuzzy by design.

  3. Match + enrich — if a work identifier isn't already on the Work, it's written back. Follows the pattern already used for edition identifiers (line 887) and author remote_ids (line 269).
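The match-and-enrich step in point 3 amounts to a deduplicating merge. A minimal sketch follows, assuming the Work stores its identifiers as a plain dict of lists under an `identifiers` key; that storage location and the function name `merge_work_identifiers` are assumptions, not confirmed by the PR.

```python
def merge_work_identifiers(work, incoming):
    """Merge incoming {key: [val, ...]} onto work["identifiers"] in place.

    Returns True when at least one new value was added, so the caller
    can decide whether the Work needs to be saved.
    """
    changed = False
    for name, values in incoming.items():
        for value in values:
            current = work.setdefault("identifiers", {}).setdefault(name, [])
            if value not in current:
                current.append(value)
                changed = True
    return changed
```

Returning a changed flag keeps the write-back conditional: a record that only repeats identifiers already on the Work does not need to trigger an update.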

What this does NOT do

  • Does not add a UI for work_identifiers — out of scope
  • Does not require any migration — purely additive; existing imports unaffected
  • Does not change identifiers (edition) semantics

Open question

It is not yet verified whether infobase things() queries support identifiers.{key} lookups on Works the same way they do on Editions; this needs to be checked in a local Docker environment. Editions use this pattern (line 452: identifiers.wikisource); Works may or may not be indexed identically. Flagged here for reviewer attention.


🤖 Draft opened by Richy, via PAM (Open Library's Project AI Manager), on behalf of @mekarpeles.

Adds a new optional `work_identifiers` field to the import schema,
parallel to `identifiers` (edition). When present, the import pipeline
uses these IDs to locate the matching Work before falling back to title
comparison.

Changes:
- import.schema.json: add `work_identifiers` field (same shape as `identifiers`)
- find_matching_work(): identifier-first lookup via infobase things() query,
  title-based match retained as fallback
- update_work_with_rec_data(): merges incoming work_identifiers onto the
  matched Work, deduplicated
- 5 new tests covering happy path, identifier-wins-over-title conflict,
  enrichment write-back, unknown-identifier fallback, and absent-field
  backward compat
@mekarpeles mekarpeles force-pushed the 9472/work-identifiers-import branch 2 times, most recently from 7839ca3 to fbac84a Compare June 17, 2026 00:15
@mekarpeles mekarpeles force-pushed the 9472/work-identifiers-import branch from 40e237e to bdcfe33 Compare June 17, 2026 00:30
@mekarpeles mekarpeles force-pushed the 9472/work-identifiers-import branch 2 times, most recently from 250d8cb to 6912633 Compare June 17, 2026 00:44
@mekarpeles

mekarpeles commented Jun 17, 2026

Member Author

This branch had a messy development history — multiple fixup passes and a pre-commit.ci loop. That is now resolved. Current state:

History (2 commits above master):

  1. feat(import): add work_identifiers support for work matching (#9472) — the feature
  2. fix(import): write back work_identifiers to matched work in load_data() — plugs a gap where the load_data() code path (no-edition-match) never wrote identifiers back to the matched work; also fixes a list-aliasing bug in the defaultdict construction
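The PR does not show the list-aliasing bug itself, so the following is only one common way such a bug arises when a defaultdict is built from an existing mapping. The variable names are illustrative and the actual fix may differ.

```python
from collections import defaultdict

# Aliased construction: the defaultdict holds the very same list objects
# as the source dict, so appending through it also mutates the source.
existing = {"goodreads": ["111"]}
aliased = defaultdict(list, existing)
aliased["goodreads"].append("222")

# Copying each list first keeps the source mapping untouched.
original = {"goodreads": ["111"]}
copied = defaultdict(list, {key: list(vals) for key, vals in original.items()})
copied["goodreads"].append("222")
```

With the aliased form, a later "is this value new?" check against the source sees the value as already present, which is the kind of failure that can silently suppress a write-back.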

CI: all checks passing (python_tests ✓, pre-commit.ci ✓, javascript_tests ✓, CodeQL ✓)

Note on the pre-commit.ci loop: local ruff and CI ruff disagree on import blank-line placement in __init__.py and test_add_book.py. The resolution is to keep the CI ruff version (the one pre-commit.ci already accepted) and not re-run local ruff on those lines.

— Pierre, via PAM

When a record with work_identifiers matched an existing work via load_data()
(the no-edition-match code path), the identifiers were never merged onto the
work object. Only the load() → update_work_with_rec_data() path handled this.

Merges incoming identifiers into the work, deduplicates, and marks for update.
@mekarpeles mekarpeles force-pushed the 9472/work-identifiers-import branch from 54bdbc0 to 8cdccfb Compare June 17, 2026 03:58

Development

Successfully merging this pull request may close these issues.

Import endpoint should allow for any (known) work identifiers

1 participant