Skip to content

test(trino): de-flake TestHudi*FileOperations by disabling async table statistics#18995

Merged
wombatu-kun merged 1 commit into
apache:masterfrom
wombatu-kun:deflake-trino-fileoperations
Jun 14, 2026
Merged

test(trino): de-flake TestHudi*FileOperations by disabling async table statistics#18995
wombatu-kun merged 1 commit into
apache:masterfrom
wombatu-kun:deflake-trino-fileoperations

Conversation

@wombatu-kun

Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

The Trino-plugin tests TestHudiMemoryCacheFileOperations, TestHudiNoCacheFileOperations and TestHudiAlluxioCacheFileOperations are flaky. They intermittently fail in CI with a symmetric off-by-N mismatch in the metadata-table file-operation counts between the two paired measurements (for example testJoin short by 18 cacheLength / 24 cacheStream / 18 lastModified while testSelectWithFilter is long by exactly the same amounts). This was last observed on the CI run for PR #18988, a change confined to hudi-client-common that cannot affect the Trino read path. The earlier de-flake attempt in #18766 (a span-stability poll) reduced the failure rate but did not eliminate it.

Summary and Changelog

These tests assert the exact multiset of low-level filesystem-access spans that a single query emits against the Hudi metadata table. The root cause of the flake is that HudiMetadata.getTableStatistics submits an asynchronous table-statistics refresh on a shared background executor for every query during planning. That task reads the metadata-table column-stats partition and emits METADATA_TABLE spans that can outlive the synchronous query and arrive in the next test's measurement window, scrambling the counts (the symmetric off-by-N signature).

This PR disables the async refresh in the three test query runners by setting hudi.table-statistics-enabled=false. With the only asynchronous metadata reader gone, the file-operation counts are deterministic as soon as the query returns, so the waitForStableSpans polling helper added in #18766 is removed and the expected multisets are recalibrated to the synchronous query I/O. Only testJoin's first-query expectations change: the first query no longer performs the extra column-stats read, so its counts now equal the already-warm second query. testSelectWithFilter is unchanged. The throws InterruptedException declarations that existed only for the removed Thread.sleep are dropped.

No production code is changed, and no code was copied from third-party sources.

Impact

Test-only change, scoped to three test classes in hudi-trino-plugin. No production code, public API, configuration default, or runtime behavior changes. The hudi.table-statistics-enabled setting is toggled only inside these tests' query runners.

Risk Level

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…e statistics

The three TestHudi*FileOperations tests (Memory, NoCache, Alluxio) assert the exact multiset of filesystem-access spans a query emits against the Hudi metadata table, and were intermittently failing in CI with a symmetric off-by-N mismatch between the paired testJoin and testSelectWithFilter measurements.

Root cause: HudiMetadata.getTableStatistics submits an asynchronous table-statistics refresh on a shared background executor for every query during planning. That task reads the metadata-table column-stats partition and emits METADATA_TABLE spans that can outlive the synchronous query and arrive in the next test's measurement window, scrambling the counts. The earlier span-stability poll (apache#18766) narrowed the race but did not close it.

Fix: disable the async refresh in these tests via hudi.table-statistics-enabled=false. With the only asynchronous metadata reader gone, the counts are deterministic the moment the query returns, so the waitForStableSpans poll is removed and the expected multisets are recalibrated to the synchronous query I/O. Only testJoin's first-query counts change (they now match the already-warm second query); testSelectWithFilter is unchanged.

Verified under JDK 23 / trino-root 472: the three classes pass 20/20 consecutive runs with zero checkstyle violations.
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 13, 2026
@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.25%. Comparing base (97c03f7) to head (452a99b).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18995   +/-   ##
=========================================
  Coverage     68.25%   68.25%           
- Complexity    29509    29512    +3     
=========================================
  Files          2544     2544           
  Lines        142744   142754   +10     
  Branches      17816    17816           
=========================================
+ Hits          97433    97442    +9     
  Misses        37304    37304           
- Partials       8007     8008    +1     
Flag Coverage Δ
common-and-other-modules 44.78% <ø> (+<0.01%) ⬆️
hadoop-mr-java-client 44.63% <ø> (-0.08%) ⬇️
spark-client-hadoop-common 48.11% <ø> (-0.01%) ⬇️
spark-java-tests 48.78% <ø> (+0.01%) ⬆️
spark-scala-tests 44.77% <ø> (-0.02%) ⬇️
utilities 37.21% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 18 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@voonhous

Copy link
Copy Markdown
Member

I'm okay with disabling this for now. We need to find a more reliable way of testing this without having this get in the way of CI flaking.

@voonhous voonhous left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR de-flakes three Trino plugin file-operation tests by disabling the async hudi.table-statistics-enabled refresh at the query-runner level, removes the waitForStableSpans polling helper, and recalibrates testJoin's first-query counts to the now-synchronous I/O. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@wombatu-kun wombatu-kun merged commit b75a5ac into apache:master Jun 14, 2026
119 of 123 checks passed
@voonhous

Copy link
Copy Markdown
Member

@wombatu-kun
Tests on master are still flaking: https://github.com/apache/hudi/actions/runs/27489081724/job/81250705383

Do you mind taking a look at it?

@wombatu-kun

Copy link
Copy Markdown
Contributor Author

@wombatu-kun

Tests on master are still flaking: https://github.com/apache/hudi/actions/runs/27489081724/job/81250705383

Do you mind taking a look at it?

yes, i'll do it today

@wombatu-kun

Copy link
Copy Markdown
Contributor Author

@voonhous dug into the master failure (run 27489081724). #18995 removed only one of several asynchronous sources of these spans, so the race stayed open.

The tests assert the exact per-query filesystem-span multiset, but Trino resets the span exporter at the start of each executeWithPlan. Any span a background thread emits after the synchronous query returns therefore lands in the next test's measurement window, which is the symmetric off-by-N you see. Besides the stats refresh, the metadata table is read by the background split-loader / partition-listing / index-support pools, and the Alluxio variant additionally depends on async cache warmth (a read is Alluxio.readCached vs readExternal depending on whether an earlier cache write finished). Every observed flake delta was a METADATA_TABLE op (plus Alluxio.readCached on the Alluxio test).

Fix in #19004: stop asserting the racy quantities and keep only the synchronous foreground reads, i.e. filter out METADATA_TABLE ops in all three classes and all Alluxio.* ops in the Alluxio class. hudi.table-statistics-enabled=false stays, because the stats executor also reads index.json and the hoodie.properties files on a background thread, which are part of the surviving asserted set (dropping the flag reproduced an off-by-one on exactly those files). Verified locally: build with zero checkstyle violations and 20 consecutive green runs of the three classes.

wombatu-kun pushed a commit to wombatu-kun/hudi that referenced this pull request Jun 14, 2026
…ronous reads

PR apache#18995 disabled the async table-statistics refresh to de-flake the three TestHudi*FileOperations classes, but they kept flaking on master with the same symmetric off-by-N mismatch in the metadata-table span counts. The stats refresh was only one of several asynchronous sources of filesystem-access spans.

These tests assert the exact multiset of low-level filesystem spans a query emits. Trino resets the span exporter at the start of each executeWithPlan, so any span a background thread emits after the synchronous query returns lands in the next test's measurement window and scrambles the counts. The Hudi read path emits such spans from shared background pools that read the metadata table (split loading, partition listing, index support) and, for the Alluxio variant, from asynchronous cache population whose hit/miss outcome depends on whether an earlier cache write had completed.

Rather than continue trying to time these background tasks, this change stops asserting the quantities they produce and keeps only the operations that happen synchronously on the foreground planning/scan path. getFileOperations now excludes METADATA_TABLE operations in all three classes, and additionally excludes all Alluxio.* operations in the Alluxio class; the corresponding expected entries are removed. hudi.table-statistics-enabled=false is kept because the stats executor also reads the index definition and table-property files on a background thread, which are part of the surviving asserted set.

No production code is changed. Verified under JDK 23 / trino-root 472: the build succeeds with zero checkstyle violations and the three classes pass 20 consecutive runs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants