Skip to content

Flink: Fix flaky TestMonitorSource.testStateRestore#16548

Open
wombatu-kun wants to merge 2 commits into
apache:mainfrom
wombatu-kun:issue/16546-flaky-test-monitor-source-state-restore
Open

Flink: Fix flaky TestMonitorSource.testStateRestore#16548
wombatu-kun wants to merge 2 commits into
apache:mainfrom
wombatu-kun:issue/16546-flaky-test-monitor-source-state-restore

Conversation

@wombatu-kun
Copy link
Copy Markdown
Contributor

@wombatu-kun wombatu-kun commented May 24, 2026

Closes #16546.

Summary

TestMonitorSource.testStateRestore (the testStateRestore(File, ClusterClient) variant in the Flink v2.0/v2.1 trees) intermittently fails with a TimeoutException from CollectingSink.poll (example CI run). The timeout only means the sink queue stayed empty for 5s; the real cause is a savepoint-completion race in the shared test helper, not slow startup.

OperatorTestBase.closeJobClient(JobClient, File) discarded the CompletableFuture<String> returned by stopWithSavepoint and instead waited for the savepoint directory to appear on disk. That directory is created early in the savepoint process, before the _metadata/state files finish writing. Phase 2 of the test then restores from that path via clusterClient.submitJob; when the restore races savepoint completion, the restored job never comes up and emits nothing, so the poll times out.

What changed

  • OperatorTestBase.closeJobClient now awaits the stopWithSavepoint(...) future and returns the path it resolves to, so the savepoint is guaranteed to be fully written before any job restores from it. This mirrors the existing idiom in TestIcebergSourceFailover.testBoundedWithSavepoint, which awaits the savepoint future with .get(). The only caller that passes a non-null savepoint directory is TestMonitorSource.testStateRestore, so the change is scoped to this test.
  • As a backstop for restored-job startup latency on busy CI, the Phase 2 poll in testStateRestore is raised from 5s to 30s. The assertion stays strict — a genuine re-read emits a non-empty event quickly and still fails fast — so the longer timeout only extends the wait for the (correct) first event, mirroring the "deterministic fix + generous timeout backstop" pattern used when TestIcebergSourceFailover was de-flaked.

Both changes apply to the Flink v1.20, v2.0, and v2.1 trees. The observed CI failure was the testStateRestore(File, ClusterClient) variant that only exists in v2.0/v2.1, but the same savepoint-completion race is latent in v1.20's closeJobClient, so the fix is applied there as well. The v1.20 testStateRestore restores via env.executeAsync from a Configuration carrying SavepointConfigOptions.SAVEPOINT_PATH rather than clusterClient.submitJob, so its closeJobClient now sets that path from the awaited stopWithSavepoint(...) future and the Phase 2 poll backstop is raised to 30s to match v2.x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the flink label May 24, 2026
@pvary
Copy link
Copy Markdown
Contributor

pvary commented May 25, 2026

Is this generated by the help of an AI agent?
If so, please check https://iceberg.apache.org/contribute/#guidelines-for-ai-assisted-contributions

The stopWithSavepoint().get() seems like a good solution, but I think it should be changed in all Flink versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wombatu-kun
Copy link
Copy Markdown
Contributor Author

@pvary thanks for the review!

On the AI question: yes, this PR was prepared with AI assistance (drafting the fix, the refactor, and the test changes). Per the AI-assisted contribution guidelines I reviewed and verified it end-to-end — I understand the savepoint-completion race and ran the affected tests locally — and kept the wording and style aligned with the codebase. Happy to walk through any part during review.

On applying it to all Flink versions: done in 806876f. I'd originally scoped it to v2.0/v2.1 because the observed CI failure was the testStateRestore(File, ClusterClient) variant that only exists there, but you're right that the same latent race lives in v1.20's closeJobClient. PR description is updated.


assertThat(resultWithSavepoint.poll(Duration.ofSeconds(5L))).isEqualTo(EMPTY_EVENT);
// Restoring from a savepoint on a busy cluster may take longer than the default 5s poll
assertThat(resultWithSavepoint.poll(Duration.ofSeconds(30L))).isEqualTo(EMPTY_EVENT);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we need this? Wasn't this only a concurrency issue?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deterministic fix in closeJobClient removes the savepoint-completion race, which was the actual cause of the failure here. I kept the timeout bump as a separate, defensive measure: restoring a job on a busy CI runner (resubmit → schedule → deploy tasks) can occasionally take longer than 5s even when the savepoint is fully written, so the larger timeout guards against that kind of CI lag.

It doesn't slow the normal run down, though — poll(...) returns as soon as the expected event arrives, so on an unloaded runner / in the happy path the extra time is never actually spent; the 30s only caps how long we're willing to wait before failing. That's why I'd lean towards keeping it as a low-cost robustness backstop.

That said, I don't feel strongly about it — if you'd rather rely solely on the deterministic fix and keep the change minimal, just let me know and I'll revert the timeout part.

@wombatu-kun wombatu-kun requested a review from pvary May 26, 2026 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test TestMonitorSource > testStateRestore

2 participants