Flink: Fix flaky TestMonitorSource.testStateRestore by wombatu-kun · Pull Request #16548 · apache/iceberg

wombatu-kun · 2026-05-24T04:21:31Z

Summary

TestMonitorSource.testStateRestore (the testStateRestore(File, ClusterClient) variant in the Flink v2.0/v2.1 trees) intermittently fails with a TimeoutException from CollectingSink.poll (example CI run). The timeout only means the sink queue stayed empty for 5s; the real cause is a savepoint-completion race in the shared test helper, not slow startup.

OperatorTestBase.closeJobClient(JobClient, File) discarded the CompletableFuture<String> returned by stopWithSavepoint and instead waited for the savepoint directory to appear on disk. That directory is created early in the savepoint process, before the _metadata/state files finish writing. Phase 2 of the test then restores from that path via clusterClient.submitJob; when the restore races savepoint completion, the restored job never comes up and emits nothing, so the poll times out.

What changed

OperatorTestBase.closeJobClient now awaits the stopWithSavepoint(...) future and returns the path it resolves to, so the savepoint is guaranteed to be fully written before any job restores from it. This mirrors the existing idiom in TestIcebergSourceFailover.testBoundedWithSavepoint, which awaits the savepoint future with .get(). The only caller that passes a non-null savepoint directory is TestMonitorSource.testStateRestore, so the change is scoped to this test.
As a backstop for restored-job startup latency on busy CI, the Phase 2 poll in testStateRestore is raised from 5s to 30s. The assertion stays strict — a genuine re-read emits a non-empty event quickly and still fails fast — so the longer timeout only extends the wait for the (correct) first event, mirroring the "deterministic fix + generous timeout backstop" pattern used when TestIcebergSourceFailover was de-flaked.

Both changes apply to the Flink v1.20, v2.0, and v2.1 trees. The observed CI failure was the testStateRestore(File, ClusterClient) variant that only exists in v2.0/v2.1, but the same savepoint-completion race is latent in v1.20's closeJobClient, so the fix is applied there as well. The v1.20 testStateRestore restores via env.executeAsync from a Configuration carrying SavepointConfigOptions.SAVEPOINT_PATH rather than clusterClient.submitJob, so its closeJobClient now sets that path from the awaited stopWithSavepoint(...) future and the Phase 2 poll backstop is raised to 30s to match v2.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pvary · 2026-05-25T06:29:39Z

Is this generated by the help of an AI agent?
If so, please check https://iceberg.apache.org/contribute/#guidelines-for-ai-assisted-contributions

The stopWithSavepoint().get() seems like a good solution, but I think it should be changed in all Flink versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wombatu-kun · 2026-05-25T07:45:55Z

@pvary thanks for the review!

On the AI question: yes, this PR was prepared with AI assistance (drafting the fix, the refactor, and the test changes). Per the AI-assisted contribution guidelines I reviewed and verified it end-to-end — I understand the savepoint-completion race and ran the affected tests locally — and kept the wording and style aligned with the codebase. Happy to walk through any part during review.

On applying it to all Flink versions: done in 806876f. I'd originally scoped it to v2.0/v2.1 because the observed CI failure was the testStateRestore(File, ClusterClient) variant that only exists there, but you're right that the same latent race lives in v1.20's closeJobClient. PR description is updated.

pvary · 2026-05-26T11:35:34Z


-      assertThat(resultWithSavepoint.poll(Duration.ofSeconds(5L))).isEqualTo(EMPTY_EVENT);
+      // Restoring from a savepoint on a busy cluster may take longer than the default 5s poll
+      assertThat(resultWithSavepoint.poll(Duration.ofSeconds(30L))).isEqualTo(EMPTY_EVENT);


Are we sure we need this? Wasn't this only a concurrency issue?

The deterministic fix in closeJobClient removes the savepoint-completion race, which was the actual cause of the failure here. I kept the timeout bump as a separate, defensive measure: restoring a job on a busy CI runner (resubmit → schedule → deploy tasks) can occasionally take longer than 5s even when the savepoint is fully written, so the larger timeout guards against that kind of CI lag.

It doesn't slow the normal run down, though — poll(...) returns as soon as the expected event arrives, so on an unloaded runner / in the happy path the extra time is never actually spent; the 30s only caps how long we're willing to wait before failing. That's why I'd lean towards keeping it as a low-cost robustness backstop.

That said, I don't feel strongly about it — if you'd rather rely solely on the deterministic fix and keep the change minimal, just let me know and I'll revert the timeout part.

Flink: Fix flaky TestMonitorSource.testStateRestore

fe70f2d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot added the flink label May 24, 2026

Flink: Apply TestMonitorSource state-restore fix to 1.20

806876f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pvary reviewed May 26, 2026

View reviewed changes

wombatu-kun requested a review from pvary May 26, 2026 13:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flink: Fix flaky TestMonitorSource.testStateRestore#16548

Flink: Fix flaky TestMonitorSource.testStateRestore#16548
wombatu-kun wants to merge 2 commits into
apache:mainfrom
wombatu-kun:issue/16546-flaky-test-monitor-source-state-restore

wombatu-kun commented May 24, 2026 •

edited

Loading

Uh oh!

pvary commented May 25, 2026

Uh oh!

wombatu-kun commented May 25, 2026

Uh oh!

pvary May 26, 2026

Uh oh!

wombatu-kun May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wombatu-kun commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Uh oh!

pvary commented May 25, 2026

Uh oh!

wombatu-kun commented May 25, 2026

Uh oh!

pvary May 26, 2026

Choose a reason for hiding this comment

Uh oh!

wombatu-kun May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wombatu-kun commented May 24, 2026 •

edited

Loading