Add transaction_details table#279

Merged
mekarpeles merged 5 commits into master from txn_details
Apr 8, 2026

@jimchamp jimchamp commented Apr 8, 2026

Collaborator

Adds DDL, persistence logic, and feature flag for a new transaction_details table.

When the use_transaction_details_table configuration option is set to True, detailed information about each save and save_many transaction is written to the new table.

The persistence logic leverages IndexUtil, which diffs a Thing with its previous version in order to write to the property and datum_* tables (and other index tables such as edition_* and work_*). IndexUtil.update_index was modified to return the collection of deletes and inserts used to update the index tables.

This data, along with the transaction_id, author key, and author's bot status, is passed to SaveImpl._add_transaction_details for persistence in the new table.

jimchamp added 4 commits April 6, 2026 23:28
All save and save_many transactions will
fail if we attempt to write to the new table
before it exists.

This commit prevents writing to
`transaction_details` by default.  This can,
and likely should, be overridden in the
`infogami.yml` configuration file.
Comment thread infogami/infobase/_dbstore/save.py Outdated

Collaborator Author


I believe that this file is only executed when the install action runs.

LANGUAGE SQL;

create table transaction_details (
id serial primary key,

Collaborator Author


Suggested change
- id serial primary key,
+ id bigserial primary key,

@mekarpeles

Member

Walkthrough: transaction_details Table (infogami PR #279)

Related: internetarchive/infogami#279 · internetarchive/openlibrary#11955

Note

This document was auto-generated by PAM (Project AI Manager) based on a hands-on exploration session with Mek. It is intended to help contributors understand and review Jim's PR.


Why This Change

Open Library's Participation Score initiative aims to track what contributors are actually editing — which fields on which books and authors they're changing. The existing transaction table tells you that a save happened (who, when, comment), but not what changed at the field level. The new transaction_details table fills that gap.


How Infogami's Data Store Works

Open Library's backend is built on Infogami, a wiki-style versioned data store. Understanding a few key concepts makes the PR much easier to follow.

Everything Is a Thing

Every entity — works (/works/OL123W), editions (/books/OL456M), authors (/authors/OL789A), even types and users — is a row in the thing table. Things have a key, a type (itself a thing), and a latest_revision.

Two Copies of Every Edit

When something is saved, Infogami writes data in two complementary ways:

  1. Full JSON blob in the data table — every field, every revision, verbatim. This is the authoritative record.
  2. Index tables — scalar values extracted from the blob and stored in typed tables (work_str, edition_str, author_str, datum_str, work_ref, edition_ref, etc.) for efficient querying. Only str, int, and ref datatypes are indexed.

The property Table

The property table registers which field names are known for each type. It is populated lazily: a property row is only written the first time infogami encounters that field name on a given type during a save.
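
The lazy registration described above can be pictured with a small in-memory stand-in. This is only a sketch: PropertyRegistry and get_or_create are hypothetical names, not infogami's API, and the real table lives in PostgreSQL.

```python
class PropertyRegistry:
    """Toy in-memory stand-in for the property table (hypothetical)."""

    def __init__(self):
        self._ids = {}       # (type_key, name) -> property id
        self._next_id = 1

    def get_or_create(self, type_key, name):
        """Return the property id, writing a new row only on first use."""
        key = (type_key, name)
        if key not in self._ids:
            self._ids[key] = self._next_id
            self._next_id += 1
        return self._ids[key]


registry = PropertyRegistry()
first = registry.get_or_create("/type/edition", "title")   # row is created here
again = registry.get_or_create("/type/edition", "title")   # existing row is reused
other = registry.get_or_create("/type/work", "title")      # same name, different type
```

The same field name on two different types gets two separate rows, which is why the registry is keyed by the (type, name) pair.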

Index Table Routing

When indexing a field, infogami calls find_table(type, datatype, name) to decide which index table to write to:

  • Well-known type-specific fields → edition_str, work_str, author_str, etc.
  • Everything else → datum_str, datum_int, datum_ref (the catch-all tables)

For example, edition.title → edition_str, but edition.description (a dynamically discovered property) → datum_str.
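
A rough sketch of that routing decision follows. It is not infogami's find_table: the KNOWN_FIELDS mapping and the find_table_sketch name are assumptions made for illustration, and only the edition.title entry comes from this walkthrough.

```python
# Hypothetical registry of well-known fields per type: type key -> (table prefix, field names).
# Only edition.title is taken from the walkthrough; the real mapping lives in infogami's schema.
KNOWN_FIELDS = {
    "/type/edition": ("edition", {"title"}),
}


def find_table_sketch(type_key, datatype, name):
    """Pick the index table for a field, falling back to the datum_* catch-all."""
    prefix, names = KNOWN_FIELDS.get(type_key, ("datum", set()))
    if name not in names:
        prefix = "datum"
    return f"{prefix}_{datatype}"


title_table = find_table_sketch("/type/edition", "str", "title")
description_table = find_table_sketch("/type/edition", "str", "description")
```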

What Does NOT Get Indexed

Fields whose value is a rich-text object like {"type": "/type/text", "value": "..."} are never indexed, because /type/text is not in the set of indexable datatypes (str, int, ref). These fields exist only in the data JSON blob. Editing them produces no index rows and, prior to this PR, no observable signal in any table.

If you save that same description as a plain Python string "Some text", it will be indexed (in datum_str) and will appear in transaction_details.
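
The difference between the two forms of description can be sketched as a datatype check. The function name is hypothetical, and treating a dict that contains only a "key" entry as a reference is an assumption about how references appear in the JSON, not something confirmed by the PR.

```python
def indexable_datatype(value):
    """Return 'str', 'int', or 'ref' for indexable values, otherwise None."""
    if isinstance(value, str):
        return "str"
    if isinstance(value, int):
        return "int"
    if isinstance(value, dict) and set(value) == {"key"}:
        # Assumption: a reference is a dict holding only the target's key.
        return "ref"
    # Rich-text objects such as {"type": "/type/text", ...} land here.
    return None


plain = indexable_datatype("Some text")
rich = indexable_datatype({"type": "/type/text", "value": "Some text"})
ref = indexable_datatype({"key": "/authors/OL789A"})
```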

The Save Pipeline

The entry point is SaveImpl.save() in infogami/infobase/_dbstore/save.py. In order, a save:

  1. Loads existing records from the thing and data tables (with FOR UPDATE locks to prevent conflicts).
  2. Writes updated rows to thing and data.
  3. Inserts a row into transaction (the "who/when/why" record).
  4. Calls IndexUtil.update_index(), which:
    • Diffs old and new docs to produce deletes and inserts dictionaries keyed by (table, thing_id, property_id).
    • Compiles the doc-level index into a db-level index (resolving thing IDs, property IDs, target tables).
    • Applies the deletes and inserts to the index tables.
  5. (New in this PR) Calls SaveImpl._add_transaction_details() with those same deletes and inserts.
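
Step 4's diff can be approximated as follows. This is a simplified sketch, not IndexUtil's code: it assumes a single target table, a flat document, and a prebuilt property_ids lookup, and it replaces all values of a changed property rather than diffing value by value, which matches the "add a value to a list gives update" result reported later in this walkthrough.

```python
def _as_list(value):
    """Normalise a field value to a list; None and "" count as absent."""
    if value is None or value == "":
        return []
    return value if isinstance(value, list) else [value]


def diff_index(table, thing_id, old_doc, new_doc, property_ids):
    """Return (deletes, inserts) keyed by (table, thing_id, property_id)."""
    deletes, inserts = {}, {}
    for name in set(old_doc) | set(new_doc):
        old_vals = _as_list(old_doc.get(name))
        new_vals = _as_list(new_doc.get(name))
        if old_vals == new_vals:
            continue  # unchanged property: no index work, no details row
        key = (table, thing_id, property_ids[name])
        if old_vals:
            deletes[key] = old_vals
        if new_vals:
            inserts[key] = new_vals
    return deletes, inserts


property_ids = {"title": 1, "subjects": 2, "subtitle": 3}  # hypothetical ids
old_doc = {"title": "A", "subjects": ["Science"]}
new_doc = {"title": "A", "subjects": ["Science", "Math"], "subtitle": "B"}
deletes, inserts = diff_index("datum_str", 10, old_doc, new_doc, property_ids)
```

Here subjects appears in both dictionaries and subtitle only in inserts, while the untouched title appears in neither.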

What Jim's PR Adds

The DDL

CREATE TABLE transaction_details (
    id              serial primary key,
    transaction_id  integer references transaction(id),
    thing_id        integer references thing(id),
    key_id          integer references property(id),
    property_action text,       -- 'create', 'update', or 'delete'
    author_id       integer references thing(id),
    is_bot          boolean,
    created         timestamp without time zone default (current_timestamp at time zone 'utc')
);

One row per (transaction, thing, property) combination. Not one row per changed value — if subjects changes from ["Science"] to ["Science", "Math"], that's one row for the subjects property.

The Feature Flag

In infogami/infobase/config.py:

use_transaction_details_table = False  # off by default

In your conf/infobase.yml, opt in with:

use_transaction_details_table: true

Infogami's update_config() applies every YAML key as a setattr on the config module at startup, so the flag is live without any code change.
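
The setattr mechanism can be sketched like this. The update_config_sketch function and the SimpleNamespace stand-in for the config module are illustrative assumptions; the real update_config() may do more than this.

```python
import types

# Stand-in for the infogami.infobase.config module, with the flag's default value.
config = types.SimpleNamespace(use_transaction_details_table=False)


def update_config_sketch(module, settings):
    """Apply each parsed YAML key as an attribute on the config object."""
    for key, value in settings.items():
        setattr(module, key, value)


# Equivalent of a YAML file containing "use_transaction_details_table: true".
update_config_sketch(config, {"use_transaction_details_table": True})
```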

The Logic

_add_transaction_details classifies each changed property as:

| Condition | Action |
| --- | --- |
| Key in inserts only | create: property didn't exist before |
| Key in deletes only | delete: property was removed |
| Key in both | update: property existed but its value changed |

It then bulk-inserts those rows into transaction_details.
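
The classification rule can be written in a few lines. This is a sketch of the rule as described, with a hypothetical function name, not the body of _add_transaction_details.

```python
def classify_actions(deletes, inserts):
    """Map each changed (table, thing_id, property_id) key to an action name."""
    actions = {}
    for key in deletes.keys() | inserts.keys():
        if key in deletes and key in inserts:
            actions[key] = "update"   # old values removed, new values written
        elif key in inserts:
            actions[key] = "create"   # property did not exist before
        else:
            actions[key] = "delete"   # property was removed
    return actions


deletes = {("datum_str", 10, 2): ["Science"], ("datum_str", 10, 4): ["old"]}
inserts = {("datum_str", 10, 2): ["Science", "Math"], ("datum_str", 10, 3): ["B"]}
actions = classify_actions(deletes, inserts)
```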


Useful Query

SELECT
    tx.comment,
    tx.created                  AS tx_time,
    t.key                       AS thing_edited,
    tt.key                      AS thing_type,
    p.name                      AS property_name,
    td.property_action          AS action,
    au.key                      AS editor,
    td.is_bot
FROM transaction_details td
JOIN transaction  tx ON tx.id  = td.transaction_id
JOIN thing        t  ON t.id   = td.thing_id
JOIN thing        tt ON tt.id  = t.type
JOIN property     p  ON p.id   = td.key_id
JOIN thing        au ON au.id  = td.author_id
ORDER BY td.id DESC
LIMIT 50;

Note: rows with key_id = NULL (possible during type-change saves) will be silently excluded by this inner join. Use a LEFT JOIN on property if you want to see them.
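
The effect of the join type can be reproduced on a toy dataset. The sketch below uses Python's sqlite3 with trimmed-down tables rather than the real PostgreSQL schema, so it only illustrates the join semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE property (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transaction_details (
        id INTEGER PRIMARY KEY, key_id INTEGER, property_action TEXT
    );
    INSERT INTO property VALUES (1, 'title');
    INSERT INTO transaction_details VALUES (1, 1, 'update'), (2, NULL, 'delete');
""")

# The inner join drops the row whose key_id is NULL.
inner_rows = conn.execute(
    "SELECT td.id, p.name FROM transaction_details td "
    "JOIN property p ON p.id = td.key_id ORDER BY td.id"
).fetchall()

# The left join keeps it, with a NULL property name.
left_rows = conn.execute(
    "SELECT td.id, p.name FROM transaction_details td "
    "LEFT JOIN property p ON p.id = td.key_id ORDER BY td.id"
).fetchall()
```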


What We Verified (28/29 Tests Pass)

| Scenario | Result |
| --- | --- |
| Edit a single string field (title, name, birth_date…) | Exactly one row, correct property name |
| Edit multiple fields in one save | One row per changed property, nothing extra |
| Add a value to a list (subjects, publishers…) | update |
| Remove all values from a list field | delete |
| Add a field that didn't previously exist | create |
| Remove a field entirely | delete |
| No-op save (same data re-saved) | Zero rows |
| Empty string "" on a field | Treated as removal → delete |
| description saved as {"type": "/type/text", …} | Zero rows, not indexed at all |
| description saved as a plain string | create row, value lands in datum_str |
| works.authors change on a work | Row for authors.author (the ref property) |
| Multi-doc save (work + author in one call) | Rows scoped correctly to each thing |
| create → update → delete lifecycle | All three actions classified correctly |

Known Bug (not yet fixed in PR)

Anonymous edits (author=None) crash with KeyError: None inside _add_transaction_details:

author_id = self.thing_ids[author_key]  # KeyError when author_key is None

Fix: author_id = author_key and self.thing_ids.get(author_key)
DDL: author_id is already nullable, so no schema change needed.
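
A standalone version of the guard might look like the sketch below. The helper name is hypothetical and thing_ids is assumed to behave like a dict; note that .get also returns None for a key that is missing from the mapping instead of raising.

```python
def resolve_author_id(thing_ids, author_key):
    """Return the author's thing id, or None for anonymous edits."""
    if author_key is None:
        return None  # anonymous edit: store NULL in author_id
    return thing_ids.get(author_key)


thing_ids = {"/people/alice": 42}  # hypothetical key -> thing id mapping
known = resolve_author_id(thing_ids, "/people/alice")
anonymous = resolve_author_id(thing_ids, None)
```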


How to Run This Locally

1. Worktree setup

cd ~/Projects/openlibrary
git fetch origin master
git worktree add ~/Projects/openlibrary-tx-details -b 11955/feature/transaction-details origin/master
cd ~/Projects/openlibrary-tx-details
git submodule update --init vendor/infogami
git -C vendor/infogami fetch origin txn_details
git -C vendor/infogami checkout txn_details

2. Enable the feature flag

In conf/infobase.yml, add:

use_transaction_details_table: true

3. Start Docker with local infogami

The compose.infogami-local.yaml overlay bind-mounts ./vendor/infogami over the named ol-vendor volume so the PR branch is live inside all containers:

OL_MOUNT_DIR="$(pwd)" docker compose \
  -f compose.yaml \
  -f compose.override.yaml \
  -f compose.infogami-local.yaml \
  up -d

4. Create the table

docker compose exec db psql -U openlibrary -c "
CREATE TABLE transaction_details (
    id              serial primary key,
    transaction_id  integer references transaction(id),
    thing_id        integer references thing(id),
    key_id          integer references property(id),
    property_action text,
    author_id       integer references thing(id),
    is_bot          boolean,
    created         timestamp without time zone default (current_timestamp at time zone 'utc')
);"

5. Run the test suite

docker compose exec infobase python3 /openlibrary/scripts/test_transaction_details_comprehensive.py

This runs 29 tests covering works, authors, and editions across create/update/delete scenarios, non-indexed fields, multi-doc saves, and edge cases.


Things Worth Raising in Review

  1. Anonymous edit bug: author=None raises KeyError. Needs a guard in _add_transaction_details.
  2. /type/text fields are invisible: description, notes, and other rich-text fields don't appear in transaction_details. This is consistent with how infogami indexes work, but worth an explicit callout since participation scoring may want to count these edits.
  3. No indexes on transaction_details — at scale, queries filtering by transaction_id, thing_id, or author_id will need indexes. Worth adding at least (transaction_id) and (author_id, created) before this goes to production.
  4. key_id can be NULL on type-change saves (when infogami deletes all properties with name=None). Those rows are silently dropped from inner-join queries.

Comment on lines +75 to +78
deletes, inserts = self._update_index(records)

if config.use_transaction_details_table:
    self._add_transaction_details(deletes, inserts, tx_id, author, bot)

Collaborator Author


Reverting these lines should fix things if things go horribly wrong.
