feat(import): add work_identifiers support for work-level matching (#9472) by mekarpeles · Pull Request #12945 · internetarchive/openlibrary

mekarpeles · 2026-06-16T23:39:45Z

Summary

Closes #9472.

Adds a new optional work_identifiers field to the OL import record schema. Importers that know the Work-level identity of a book (e.g. a Goodreads work ID or Wikidata Q-number) can now supply it, and the import pipeline will use it to find the right Work before falling back to title matching.

Motivation

The existing identifiers field is for edition-level IDs. There was no way to supply a Work-level identifier — making it hard to correctly group new editions under the right Work, especially for books without ISBNs (e.g. Wikidata/Wikisource items, aggregator feeds).

Changes

openlibrary/schemata/import.schema.json

Added work_identifiers property — same shape as identifiers ({key: [val, ...]})
Updated identifiers description to clarify it's edition-scoped
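As an illustration of the schema change, an import record carrying both fields might look like the sketch below. Only the `{key: [val, ...]}` shape comes from this PR; the titles, source record, identifier keys, and values are hypothetical, and `has_identifier_shape` is a helper invented here to show the shape, not part of the import code.

```python
# Hypothetical import record; every value below is made up for illustration.
record = {
    "title": "An Example Title",
    "source_records": ["example:0001"],
    # Edition-scoped IDs (existing field).
    "identifiers": {"wikisource": ["en:Example_Title"]},
    # Work-scoped IDs (new optional field), same {key: [val, ...]} shape.
    "work_identifiers": {"goodreads": ["1234567"], "wikidata": ["Q00000"]},
}


def has_identifier_shape(value):
    """Return True if value is a dict mapping str keys to lists of str."""
    return isinstance(value, dict) and all(
        isinstance(key, str)
        and isinstance(vals, list)
        and all(isinstance(item, str) for item in vals)
        for key, vals in value.items()
    )
```

A record that omits `work_identifiers` entirely remains valid, since the field is optional.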

openlibrary/catalog/add_book/__init__.py

find_matching_work(): tries work_identifiers first (via site.things() query), falls back to existing title-based match
update_work_with_rec_data(): merges new work_identifiers onto the matched Work, deduplicating
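The identifier-first, title-fallback order described above can be sketched roughly as follows. This is not the code in the PR: `find_work_by_identifiers`, `find_matching_work_sketch`, the `title_match` callable, and `FakeSite` are names invented for the example, and the `identifiers.{key}` query shape on Works is an assumption (it is the subject of the open question in this description).

```python
def find_work_by_identifiers(site, work_identifiers):
    """Return the key of the first Work matching any supplied identifier, else None."""
    for id_type, values in (work_identifiers or {}).items():
        for value in values:
            # Assumed query shape, mirroring the edition-level identifiers.{key} pattern.
            keys = site.things({"type": "/type/work", f"identifiers.{id_type}": value})
            if keys:
                return keys[0]
    return None


def find_matching_work_sketch(site, rec, title_match):
    """Try work identifiers first; fall back to the title-based matcher."""
    by_id = find_work_by_identifiers(site, rec.get("work_identifiers"))
    return by_id or title_match(rec)


class FakeSite:
    """In-memory stand-in for the infobase site, for demonstration only."""

    def __init__(self, works):
        self.works = works  # {work_key: {"identifiers": {id_type: [values]}}}

    def things(self, query):
        field = next(k for k in query if k.startswith("identifiers."))
        id_type = field.split(".", 1)[1]
        return [
            key
            for key, work in self.works.items()
            if query[field] in work.get("identifiers", {}).get(id_type, [])
        ]
```

With this shape, an unknown identifier simply yields no keys, so the title matcher runs exactly as it did before the change.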

openlibrary/catalog/add_book/tests/test_add_book.py

test_work_identifiers_match_existing_work — Goodreads ID matches the right Work
test_work_identifiers_match_wins_over_title — when identifier and title conflict, identifier wins
test_work_identifiers_enriched_onto_matched_work — new identifier written back to Work
test_work_identifiers_no_match_falls_back_to_title — unknown identifier → title fallback, no error
test_work_identifiers_absent_preserves_existing_behaviour — no work_identifiers key → original behaviour unchanged

Design decisions

Identifier-first, title-fallback — same precedence model as find_author() (remote_ids check first). A caller supplying a work identifier is asserting they know which Work this is; that's stronger signal than a title match.
Conflict resolution: identifier wins — if identifier matches Work A but title matches Work B, identifier wins. Same as find_author(). The identifier came from an external catalogue; the title match is fuzzy by design.
Match + enrich — if a work identifier isn't already on the Work, it's written back. Follows the pattern already used for edition identifiers (line 887) and author remote_ids (line 269).
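The match-and-enrich step amounts to a deduplicating merge of `{key: [val, ...]}` maps. A minimal sketch, assuming the Work stores its identifiers under an `identifiers` key (the field name is an assumption, and `merge_work_identifiers` is a hypothetical helper, not the function in the PR):

```python
def merge_work_identifiers(work, incoming):
    """Merge incoming {key: [val, ...]} into the work; return True if anything was added."""
    # Copy the existing lists so the originals are never mutated in place.
    merged = {key: list(vals) for key, vals in work.get("identifiers", {}).items()}
    changed = False
    for id_type, values in incoming.items():
        for value in values:
            bucket = merged.setdefault(id_type, [])
            if value not in bucket:  # skip values already on the work
                bucket.append(value)
                changed = True
    if changed:
        work["identifiers"] = merged
    return changed
```

The boolean return value lets a caller decide whether the Work needs to be saved at all, which keeps re-imports of the same record from producing empty updates.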

What this does NOT do

Does not add a UI for work_identifiers — out of scope
Does not require any migration — purely additive; existing imports unaffected
Does not change identifiers (edition) semantics

Open question

It still needs to be verified in a Docker environment whether infobase things() queries support identifiers.{key} lookups on Works the same way they do on Editions. Editions use this pattern (line 452: identifiers.wikisource); Works may or may not be indexed identically. Flagged here for reviewer attention.

🤖 Draft opened by Richy, via PAM (Open Library's Project AI Manager), on behalf of @mekarpeles.

Adds a new optional `work_identifiers` field to the import schema, parallel to `identifiers` (edition). When present, the import pipeline uses these IDs to locate the matching Work before falling back to title comparison.

Changes:
- import.schema.json: add `work_identifiers` field (same shape as `identifiers`)
- find_matching_work(): identifier-first lookup via infobase things() query, title-based match retained as fallback
- update_work_with_rec_data(): merges incoming work_identifiers onto the matched Work, deduplicated
- 5 new tests covering happy path, identifier-wins-over-title conflict, enrichment write-back, unknown-identifier fallback, and absent-field backward compat

mekarpeles · 2026-06-17T01:04:42Z

This branch had a messy development history — multiple fixup passes and a pre-commit.ci loop. That is now resolved. Current state:

History (2 commits above master):

feat(import): add work_identifiers support for work matching (#9472) — the feature
fix(import): write back work_identifiers to matched work in load_data() — plugs a gap where the load_data() code path (no-edition-match) never wrote identifiers back to the matched work; also fixes a list-aliasing bug in the defaultdict construction
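The offending code is not shown in this thread, so the following is only a generic illustration of how list aliasing can arise when a `defaultdict` is built from an existing mapping. The variable names are hypothetical and this is not claimed to be the exact bug that was fixed.

```python
from collections import defaultdict

existing = {"goodreads": ["111"]}

# Aliased: the defaultdict holds the very same list objects as `existing`,
# so appending through it silently mutates the original mapping too.
aliased = defaultdict(list, existing)
aliased["goodreads"].append("222")

# Independent: copy each list before building the defaultdict.
source = {"goodreads": ["111"]}
independent = defaultdict(list, {key: list(vals) for key, vals in source.items()})
independent["goodreads"].append("222")
```

In the aliased case a later before/after comparison would see no difference, because both sides point at the same mutated list.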

CI: all checks passing — python_tests ✓, pre-commit.ci ✓, javascript_tests ✓, CodeQL ✓

Note on the pre-commit.ci loop: local ruff and CI ruff disagree on import blank-line placement in __init__.py and test_add_book.py. The resolution is to keep the CI ruff version (the one pre-commit.ci already accepted) and not re-run local ruff on those lines.

— Pierre, via PAM

When a record with work_identifiers matched an existing work via load_data() (the no-edition-match code path), the identifiers were never merged onto the work object. Only the load() → update_work_with_rec_data() path handled this. This commit merges incoming identifiers into the work, deduplicates them, and marks the work for update.

mekarpeles force-pushed the 9472/work-identifiers-import branch 2 times, most recently from 7839ca3 to fbac84a Compare June 17, 2026 00:15

mekarpeles force-pushed the 9472/work-identifiers-import branch from 40e237e to bdcfe33 Compare June 17, 2026 00:30

mekarpeles force-pushed the 9472/work-identifiers-import branch 2 times, most recently from 250d8cb to 6912633 Compare June 17, 2026 00:44

mekarpeles force-pushed the 9472/work-identifiers-import branch from 54bdbc0 to 8cdccfb Compare June 17, 2026 03:58

mekarpeles wants to merge 2 commits into master from 9472/work-identifiers-import
