fix(catalog): abort doCommit on stale-base divergence #612
Merged
mkuchenbecker merged 1 commit on May 29, 2026
mkuchenbecker
commented
May 28, 2026
mkuchenbecker
added a commit
to mkuchenbecker/openhouse
that referenced
this pull request
May 28, 2026
…eck, guard URI.create
- Pull COMMIT_KEY capture into abortIfWriterBaseDivergedFromCatalog so the read+check stay colocated (addresses #3319525189). Signature now takes Map<String, String> properties; caller's only job is to invoke it before failIfRetryUpdate strips COMMIT_KEY.
- Guard isSameMetadataPath against malformed URI inputs that would throw IllegalArgumentException out of doCommit as an unclassified runtime error (addresses #3319521822). Fall back to literal-inequality so divergence is reported as CommitFailedException via the caller, which Iceberg's retry loop classifies as retriable.
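The malformed-input guard described in this commit could look roughly like the sketch below. It is an illustration only, not the code from the PR: the class name MetadataPathCompare is hypothetical, and only the java.net.URI behaviour (URI.create throwing IllegalArgumentException on illegal characters) is taken as given.

```java
import java.net.URI;

final class MetadataPathCompare {
  private MetadataPathCompare() {}

  /**
   * Returns true when both locations point at the same path, ignoring scheme
   * and authority. Never throws on malformed input.
   */
  static boolean isSameMetadataPath(String claimed, String persisted) {
    if (claimed == null || persisted == null) {
      return claimed == persisted; // equal only when both are null
    }
    try {
      String claimedPath = URI.create(claimed).getPath();
      String persistedPath = URI.create(persisted).getPath();
      if (claimedPath != null && persistedPath != null) {
        return claimedPath.equals(persistedPath);
      }
      // Opaque URIs have no path component; compare the raw strings instead.
      return claimed.equals(persisted);
    } catch (IllegalArgumentException e) {
      // Malformed URI (for example an unescaped space): fall back to a literal
      // comparison so the caller reports divergence instead of leaking a
      // runtime error.
      return claimed.equals(persisted);
    }
  }

  public static void main(String[] args) {
    // Same path, different scheme/authority.
    System.out.println(
        isSameMetadataPath("hdfs://nn1:8020/db/t/v1.metadata.json", "/db/t/v1.metadata.json"));
    // Malformed first argument: literal comparison, reported as different.
    System.out.println(isSameMetadataPath("/db/t/bad path.json", "/db/t/v2.metadata.json"));
  }
}
```

With this shape, a malformed location that differs from the persisted one simply compares as unequal, so the caller can raise the retriable commit failure rather than an unclassified exception.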
mkuchenbecker
commented
May 28, 2026
mkuchenbecker
added a commit
to mkuchenbecker/openhouse
that referenced
this pull request
May 28, 2026
…elper
Single rewrite of abortIfWriterBaseDivergedFromCatalog addressing all five inline comments:
- Inlined writerClaimsNoExistingBase (trivial check; #3320098729).
- Inlined isSameMetadataPath (#3320106771).
- Switched URI.create -> org.apache.hadoop.fs.Path(...).toUri().getPath(); Hadoop Path is lenient and does not require the try/catch wrapper guarding malformed inputs (#3320093363).
- Collapsed the two-throw structure (no-existing-base + path-mismatch) into a single positive check writerBaseMatchesCatalog, throwing on negation (#3320108911).
- Stripped HTML formatting from the javadoc (#3320103880).
Net: one helper, one check, one throw. Same semantics: racing-CREATE (writer claims null/INITIAL against persisted catalog) and racing-UPDATE (path mismatch) are both caught by the single negation. The error message collapses to the path-mismatch flavor; the test testDoCommitAbortsOnStaleClaimedBase still asserts on "Cannot commit" which is preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mkuchenbecker
commented
May 28, 2026
mkuchenbecker
added a commit
to mkuchenbecker/openhouse
that referenced
this pull request
May 28, 2026
…view) Per inline review on line 611: base non-null with claimed base null/INITIAL is its own error class — make it explicit instead of folding into the combined check. Distinct error message names the case.
kamanavishnu
added a commit
to kamanavishnu/openhouse
that referenced
this pull request
May 29, 2026
…ase lost update Reconstructs the BaseTransaction.applyUpdates silent-rebase inputs at the doCommit boundary (base=T_Y containing the racing snapshot, but COMMIT_KEY=T_X) and asserts doCommit aborts instead of silently dropping the racing snapshot via the subtractive merge; also asserts save() is never reached. Fails on main (racing snapshot silently dropped); passes once the doCommit stale-base CAS (OSS PR linkedin#612) lands.
mkuchenbecker
added a commit
to mkuchenbecker/openhouse
that referenced
this pull request
May 29, 2026
…base-repro-test test(internalcatalog): assert linkedin#612 prevents the stale-base lost update (no silent snapshot drop)
mkuchenbecker
commented
May 29, 2026
mkuchenbecker
added a commit
to mkuchenbecker/openhouse
that referenced
this pull request
May 29, 2026
Backport of PR linkedin#612 onto hotfix/0.5.417. Aborts doCommit when the writer's declared base (COMMIT_KEY) diverges from the catalog's actual persisted base, preventing the BaseTransaction.applyUpdates silent-rebase lost-update (incident-12185). Propagates IS_REPLACE_COMMIT_KEY so wholesale replace commits are exempted from the CAS, and strips it before persisting. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c49eaec to
869cff8
Compare
mkuchenbecker
added a commit
to mkuchenbecker/openhouse
that referenced
this pull request
May 29, 2026
Backport of PR linkedin#612 onto hotfix/0.5.417. Aborts doCommit when the writer's declared base (COMMIT_KEY) diverges from the catalog's actual persisted base, preventing the BaseTransaction.applyUpdates silent-rebase lost-update (incident-12185), where applyUpdates re-stamps the writer's original non-null COMMIT_KEY on top of a concurrently-advanced base. Commits that leave COMMIT_KEY unset (wholesale replace/create) are authoritative over the snapshot set and are intentionally not defended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
869cff8 to
8cca103
Compare
mkuchenbecker
commented
May 29, 2026
mkuchenbecker
commented
May 29, 2026
Aborts doCommit when the writer's declared base (COMMIT_KEY) diverges from the catalog's actual persisted base, preventing the BaseTransaction.applyUpdates silent-rebase lost-update (incident-12185), where applyUpdates re-stamps the writer's original non-null COMMIT_KEY on top of a concurrently-advanced base. Commits that leave COMMIT_KEY unset (wholesale replace/create) are authoritative over the snapshot set and are intentionally not defended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
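A minimal sketch of the check this commit message describes, under stated assumptions: the class and method names are hypothetical, the constant values are placeholders, CommitFailedException is a local stand-in for the Iceberg exception, and pathOnly is a simple regex stand-in for the Hadoop Path normalization the PR uses. Skipping the check when no persisted base exists is also an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

final class StaleBaseGuard {
  // Placeholder values; the real constants live in the OpenHouse catalog code.
  static final String COMMIT_KEY = "commitKey";
  static final String INITIAL_VERSION = "INITIAL_VERSION";

  /** Local stand-in for org.apache.iceberg.exceptions.CommitFailedException. */
  static final class CommitFailedException extends RuntimeException {
    CommitFailedException(String message) {
      super(message);
    }
  }

  /** Stand-in normalizer: strips a leading scheme://authority prefix, if any. */
  static String pathOnly(String location) {
    return location.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://[^/]*", "");
  }

  /** Must run before COMMIT_KEY is stripped from the properties. */
  static void abortIfWriterBaseDiverged(Map<String, String> properties, String catalogBase) {
    String claimedBase = properties.get(COMMIT_KEY);
    if (claimedBase == null || catalogBase == null) {
      // Not defended: COMMIT_KEY unset (wholesale replace/create), or nothing
      // persisted yet to compare against (assumption of this sketch).
      return;
    }
    boolean writerBaseMatchesCatalog =
        !INITIAL_VERSION.equals(claimedBase)
            && pathOnly(claimedBase).equals(pathOnly(catalogBase));
    if (!writerBaseMatchesCatalog) {
      throw new CommitFailedException(
          String.format(
              "Cannot commit: writer base %s does not match catalog base %s",
              claimedBase, catalogBase));
    }
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put(COMMIT_KEY, "hdfs://nn1:8020/db/t/v1.metadata.json");
    abortIfWriterBaseDiverged(props, "/db/t/v1.metadata.json"); // same path: passes
    try {
      abortIfWriterBaseDiverged(props, "/db/t/v2.metadata.json"); // catalog moved on
    } catch (CommitFailedException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Throwing a commit-failure exception rather than a generic runtime error is what lets the caller's retry loop refresh and try again against the current base.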
f8e0c8f to
ebd7230
Compare
cbb330
approved these changes
May 29, 2026
Collaborator
cbb330
left a comment
This works as a mitigation for testing purposes though it'll need further analysis before we can deploy safely.
Short-term, I think it's worth taking a closer look at this code: https://github.com/linkedin/openhouse/pull/612/changes#diff-526cc78ff45a139113a29951f4db0a28242b9a7f3bc9caef9ceec26b04c5aeb8R331
it may point us toward a more targeted, reliable, and race-condition-free solution.
mkuchenbecker
added a commit
to mkuchenbecker/openhouse
that referenced
this pull request
May 29, 2026
Validated by running pre-fix (d4fc9fe): both tests FAIL (racing snapshot silently dropped); post-fix (linkedin#612): both PASS.
Three layered guards must all be bypassed for the lost update:
1. HTS optimistic-version CAS -> bypassed by applyUpdates' silent rebase (held txn)
2. failIfRetryUpdate per-JVM cache -> bypassed cross-replica (cleared in the test)
3. Iceberg snapshot sequence-number validation -> only bypassed by a SUBTRACTIVE stale commit (adds no new snapshot).
A stale writer adding its own snapshot on a multi-snapshot base is rejected by guard #3, so it cannot lose data -- which is why incident-12185 was an expire/optimizer drop, not an insert race.
Tests (both subtractive): expire racing a data insert (prior history); two concurrent first-inserts on a fresh table.
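Why a subtractive stale commit loses data can be illustrated with plain set subtraction. The sketch below is a toy model, not the catalog's merge code: snapshot IDs are bare longs, and the class and method names are hypothetical. It models an expire job that planned against T_X and is then applied to the advanced base T_Y.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SubtractiveMergeDemo {
  private SubtractiveMergeDemo() {}

  /** Snapshots present on the base that the writer's payload no longer lists. */
  static Set<Long> toRemove(Set<Long> baseSnapshots, Set<Long> payloadSnapshots) {
    Set<Long> removed = new LinkedHashSet<>(baseSnapshots);
    removed.removeAll(payloadSnapshots);
    return removed;
  }

  public static void main(String[] args) {
    // T_X: the base the expire job planned against.
    Set<Long> tX = new LinkedHashSet<>(List.of(1L, 2L));
    // Payload computed against T_X: expire snapshot 1, keep snapshot 2.
    Set<Long> stalePayload = new LinkedHashSet<>(List.of(2L));
    // T_Y: a racing insert added snapshot 3 before the expire committed.
    Set<Long> tY = new LinkedHashSet<>(List.of(1L, 2L, 3L));

    System.out.println(toRemove(tX, stalePayload)); // [1]    intended removal
    System.out.println(toRemove(tY, stalePayload)); // [1, 3] racing snapshot 3 is dropped too
  }
}
```

Because the stale payload adds no snapshot of its own, nothing in it conflicts with the newer base; the only visible signal is that the writer's declared base no longer matches the catalog's.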
Summary
Adds a catalog-level CAS in OpenHouseInternalTableOperations.doCommit that aborts when the writer's declared base (COMMIT_KEY) does not match the catalog's current persisted base, closing the BaseTransaction.applyUpdates silent-rebase variant of the stale-base lost-update bug (incident-12185).
Mechanism
1. A writer stages an update with COMMIT_KEY = T_X and a SNAPSHOTS_JSON_KEY payload computed against T_X.
2. A racing commit advances the catalog T_X → T_Y, adding a snapshot.
3. BaseTransaction.applyUpdates silently refreshes the in-flight base to T_Y and re-applies the staged update, re-stamping COMMIT_KEY = T_X on top of T_Y while leaving the stale SNAPSHOTS_JSON_KEY.
4. doCommit runs with base = T_Y but COMMIT_KEY = T_X. Without the check, the subtractive snapshot merge computes toRemove = T_Y.snapshots() − stale payload = {racing snapshot} and silently drops it.
The fix reads COMMIT_KEY before failIfRetryUpdate strips it, URI-normalizes both paths via Hadoop Path, and throws CommitFailedException on mismatch so Iceberg retries against the fresh base.
Scope of the check
- Defended: COMMIT_KEY is a concrete location differing from the catalog base, or is INITIAL_VERSION.
- Not defended: COMMIT_KEY unset. Wholesale replace/create commits (replaceTable, stage-create, stage-replace) are authoritative over the snapshot set, so there is no stale base to compare against.
Changes
Bug fix + unit test, internal catalog only (2 files).
- doCommit may now throw CommitFailedException on a stale-base commit that previously silently dropped a racing snapshot.
Testing
Unit test testDoCommitMustAbortStaleBaseRebaseToPreventSnapshotLoss in OpenHouseInternalTableOperationsTest round-trips the post-refresh TableMetadata through TableMetadataParser so base.metadataFileLocation() is non-null (matching a loaded-from-disk base) and the URI-normalized comparison runs. Asserts CommitFailedException is thrown and houseTableRepository.save is never invoked.
Not covered
A Spark concurrent-insert behavior test (SparkConcurrentInsertFunctionalTest, PR #614) was explored and removed: it only reproduces against the H2 test fixture, not production MySQL+HTS. A prod-realistic black-box repro would need the real HTS app or a deployed instance.
🤖 Generated with Claude Code