Spark: Bound runaway serializable-isolation and concurrent-refresh tests #16562

Open: wombatu-kun wants to merge 1 commit into apache:main from wombatu-kun:issue/16359-bound-runaway-concurrency-tests

@wombatu-kun
Contributor

Closes #16359

Summary

Several spark-extensions concurrency tests ran two worker threads in a barrier-synchronized loop bounded only by Integer.MAX_VALUE, while the main thread blocked on assertThatThrownBy(<op>Future::get) with no timeout, expecting a conflict exception. If that exception was never thrown, nothing stopped either thread: the append thread kept committing data files until CI hit its wall-clock limit or ran out of disk. That happened in #16303, where the runaway test filled the GitHub Actions disk and the run was retriggered several times before the cause was found. This PR makes those tests fail fast instead of relying on external limits.

What changed

  • Bounded both worker loops with a shared MAX_OPERATIONS = 20 constant, so the loops can no longer run unbounded. This is the same value the sibling *WithSnapshotIsolation tests already use with the same harness.
  • Wrapped the wait in Future.get(OPERATION_TIMEOUT_MINUTES, TimeUnit.MINUTES), with a 5-minute timeout, and cancelled the operation future in finally, so a stuck operation is interrupted and the wait can't block forever.
  • Hoisted MAX_OPERATIONS and OPERATION_TIMEOUT_MINUTES into SparkRowLevelOperationsTestBase so these concurrency tests share one bound (the snapshot-isolation siblings now reference the same constant).
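
For readers unfamiliar with the harness, the three changes above can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the class BoundedConcurrencySketch, the Operation interface, and the awaitOutcome helper are hypothetical names, the append step is simulated, and the method returns a string where the real tests assert on the thrown exception. Only MAX_OPERATIONS, the timed Future.get, and the cancel-in-finally pattern are taken from the description.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedConcurrencySketch {
  // Same bound the PR description names; the surrounding class is hypothetical.
  static final int MAX_OPERATIONS = 20;

  // Hypothetical stand-in for the MERGE/DELETE/UPDATE statement under test.
  interface Operation {
    void run(int iteration) throws Exception;
  }

  // Runs the two-thread harness and reports how the wait on the operation ended.
  static String awaitOutcome(Operation operation, long timeout, TimeUnit unit)
      throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    AtomicInteger barrier = new AtomicInteger(0);
    AtomicBoolean shouldAppend = new AtomicBoolean(true);

    // Operation thread: bounded by MAX_OPERATIONS instead of Integer.MAX_VALUE.
    Future<?> operationFuture = executor.submit(() -> {
      for (int i = 0; i < MAX_OPERATIONS; i++) {
        while (barrier.get() < i * 2) {
          Thread.sleep(1);
        }
        operation.run(i);
        barrier.incrementAndGet();
      }
      return null;
    });

    // Append thread: same bound; the real tests commit a data file here.
    Future<?> appendFuture = executor.submit(() -> {
      for (int i = 0; i < MAX_OPERATIONS; i++) {
        while (shouldAppend.get() && barrier.get() < i * 2) {
          Thread.sleep(1);
        }
        if (!shouldAppend.get()) {
          return null;
        }
        barrier.incrementAndGet();
      }
      return null;
    });

    try {
      // Timed wait: a stuck operation can no longer block the main thread forever.
      operationFuture.get(timeout, unit);
      return "no conflict after " + MAX_OPERATIONS + " operations";
    } catch (ExecutionException e) {
      return "conflict: " + e.getCause().getMessage();
    } catch (TimeoutException e) {
      return "timed out";
    } finally {
      // Stop the append loop and interrupt whichever worker is still running.
      shouldAppend.set(false);
      operationFuture.cancel(true);
      appendFuture.cancel(true);
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(awaitOutcome(i -> {
      if (i == 1) {
        throw new IllegalStateException("simulated conflict");
      }
    }, 5, TimeUnit.SECONDS));
    System.out.println(awaitOutcome(i -> { }, 5, TimeUnit.SECONDS));
    System.out.println(awaitOutcome(i -> Thread.sleep(60_000), 200, TimeUnit.MILLISECONDS));
  }
}
```

In this sketch the first call models a passing run (the conflict surfaces through Future.get as an ExecutionException), while the second and third model the two regression paths: the loop bound ends the run when no conflict is ever raised, and the timeout plus cancel(true) ends it when the operation hangs.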

Affected methods (across Spark v3.5/v4.0/v4.1 where present): testMergeWithSerializableIsolation, testDeleteWithSerializableIsolation, testUpdateWithSerializableIsolation, and the copy-on-write testMergeWithConcurrentTableRefresh, testDeleteWithConcurrentTableRefresh, testUpdateWithConcurrentTableRefresh.

In a passing run the conflict still fires within the first couple of iterations, so behavior is unchanged; in a regression the bounded loop plus timeout make the test fail fast with a clear assertion instead of exhausting CI resources.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions bot added the spark label on May 25, 2026

Linked issue (may be closed when this pull request is merged): Spark: serializable-isolation tests can run until external CI/resource limits