Skip to content

test(tables): stale-base concurrent-commit regression tests#620

Open
mkuchenbecker wants to merge 14 commits into
linkedin:mainfrom
mkuchenbecker:mkuchenb/incident-12185-repro-test
Open

test(tables): stale-base concurrent-commit regression tests#620
mkuchenbecker wants to merge 14 commits into
linkedin:mainfrom
mkuchenbecker:mkuchenb/incident-12185-repro-test

Conversation

@mkuchenbecker
Copy link
Copy Markdown
Collaborator

@mkuchenbecker mkuchenbecker commented May 29, 2026

Summary

Concurrency regression tests (StaleBaseLostUpdateTest, H2-backed, single-JVM) for committing a snapshot set against a stale base version: a writer that still declares an older base must not silently drop a snapshot another writer committed concurrently.

Scenarios

Each test loads a table at version L1, lets a second writer commit a new snapshot (advancing the catalog to L2), then commits a writer that still declares L1 as its base with a snapshot set computed at L1 — which omits the second writer's snapshot. The concurrently added snapshot must survive, and the stale commit must be rejected.

  • testStaleInsertDropsConcurrentDataCommitOnFreshTable — two writers race the first insert into a fresh table.
  • testExpireSnapshotsDropsConcurrentDataCommit — an expiration (keep only the current head) races a concurrent insert.
  • testStaleMetadataUpdateDropsConcurrentDataCommit — a metadata change that adds no snapshot races a concurrent insert.
  • testConcurrentInsertOnPopulatedTableIsRejected — two writers insert into a table that already holds data.

Results across commits

Test before #406 ff9a40c1 #406 14fe4603 before #612 d4fc9fe7 #612
fresh-table insert FAIL FAIL FAIL PASS
expire vs insert PASS FAIL FAIL PASS
metadata-update vs insert PASS FAIL FAIL PASS
populated-table insert PASS PASS PASS PASS

Why the results differ

Before #406, the catalog rejected a stale commit that deletes the latest snapshot only when the commit adds no snapshot of its own (SnapshotInspector.validateSnapshotsUpdate, which exempts snapshot-adding commits). So:

The single check that covers all of these is #612's base-divergence abort: it rejects any commit whose declared base no longer matches the catalog.

Testing

./gradlew :services:tables:test --tests "*StaleBaseLostUpdateTest"

All four pass with #612. Each test asserts the stale commit is rejected (throws) and that the table holds exactly the prior snapshots plus the concurrently committed snapshot. Test-only; no production changes.

🤖 Generated with Claude Code

mkuchenbecker and others added 6 commits May 29, 2026 10:39
…cident-12185)

Single-JVM, no-threads, no-docker reproduction of the BaseTransaction.applyUpdates
silent-rebase lost-update. A stale writer stages its L1 snapshot view + COMMIT_KEY=L1
in a held transaction; a racing writer advances the catalog to L2; committing the
stale transaction rebases the stale payload onto L2 so doCommit would subtract the
racing snapshot. Asserts the racing snapshot survives.

Passes on main (with the linkedin#612 stale-base abort CAS); reproduces the silent drop on
the pre-fix tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Factor the staged-conflict concurrency into a shared core and add a second test
where the table already holds committed data history (two snapshots) before the
race begins. Both assert the racing snapshot survives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both contending writers now append data: the stale writer commits its own data
snapshot at base L1 (payload omitting the racing snapshot), modeling a genuine
concurrent-insert race rather than a property-only update. Also assert the stale
writer's conflicting insert is rejected wholesale, not merged onto the racing commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Covers the race on a freshly created table with no committed data snapshots: the
stale writer's first insert races a concurrent first insert. Confirms the racing
snapshot survives even when the base has no prior history.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Models the production maintenance shape: an expire-snapshots commit (keep-subset
payload computed at base L1) racing a concurrent data insert. The stale expire's
subset omits the racing snapshot, so the subtractive merge would expire the fresh
data commit. Asserts the data commit survives and the stale expire is rejected
wholesale.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
count=0 (plain create) already reproduces the insert race; the single-snapshot
case added no coverage. Keep: zero-prior-data insert, two-snapshot prior-data
insert, and the expire-vs-insert race (which requires base snapshots).
@mkuchenbecker mkuchenbecker changed the title (wip | no review) test(tables): deterministic H2 repro for stale-base snapshot loss (incident-12185) test(tables): deterministic repro for stale-base snapshot loss May 29, 2026
mkuchenbecker and others added 3 commits May 29, 2026 11:10
Co-authored-by: Mike Kuchenbecker <mike@michaelkuchenbecker.com>
Validated by running pre-fix (d4fc9fe): both tests FAIL (racing snapshot silently
dropped); post-fix (linkedin#612): both PASS.

Three layered guards must all be bypassed for the lost update:
  1. HTS optimistic-version CAS  -> bypassed by applyUpdates' silent rebase (held txn)
  2. failIfRetryUpdate per-JVM cache -> bypassed cross-replica (cleared in the test)
  3. Iceberg snapshot sequence-number validation -> only bypassed by a SUBTRACTIVE
     stale commit (adds no new snapshot). A stale writer adding its own snapshot on a
     multi-snapshot base is rejected by guard #3, so it cannot lose data -- which is
     why incident-12185 was an expire/optimizer drop, not an insert race.

Tests (both subtractive): expire racing a data insert (prior history); two concurrent
first-inserts on a fresh table.
…scribe behavior only

Third case: two writers insert into a table that already holds data; the catalog
rejects the stale commit and the concurrent insert remains. Comments rewritten to
describe observable behavior without external references.
@mkuchenbecker mkuchenbecker changed the title test(tables): deterministic repro for stale-base snapshot loss (wip | no review) test(tables): stale-base concurrent-commit regression tests May 29, 2026
Inline commitThroughRepository, snapshotJson, idOf, and refs at their call sites.
Assert the table holds exactly the prior snapshots plus the concurrently committed
snapshot, regardless of whether the stale commit throws.
Prepare each commit's snapshot list explicitly at the call site so the base + new
snapshot is visible in the flow.
Use assertThrows(CommitFailedException) for the rejection, then assert the table
holds exactly the prior snapshots plus the concurrent commit.
@mkuchenbecker mkuchenbecker marked this pull request as ready for review May 29, 2026 20:15
…ate case

Assert the stale commit is rejected (any exception) rather than a specific type, so
the contract is rejection + final table state. Add a case where the stale commit
changes metadata without adding a snapshot.
@mkuchenbecker mkuchenbecker changed the title (wip | no review) test(tables): stale-base concurrent-commit regression tests test(tables): stale-base concurrent-commit regression tests May 29, 2026
Set an actual table property (plus the base snapshot view, no new snapshot) so the
case is a genuine property-changing commit rather than internal bookkeeping only.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant