test(trino): de-flake TestHudi*FileOperations by disabling async table statistics #18995
…e statistics The three TestHudi*FileOperations tests (Memory, NoCache, Alluxio) assert the exact multiset of filesystem-access spans a query emits against the Hudi metadata table, and were intermittently failing in CI with a symmetric off-by-N mismatch between the paired testJoin and testSelectWithFilter measurements. Root cause: HudiMetadata.getTableStatistics submits an asynchronous table-statistics refresh on a shared background executor for every query during planning. That task reads the metadata-table column-stats partition and emits METADATA_TABLE spans that can outlive the synchronous query and arrive in the next test's measurement window, scrambling the counts. The earlier span-stability poll (apache#18766) narrowed the race but did not close it. Fix: disable the async refresh in these tests via hudi.table-statistics-enabled=false. With the only asynchronous metadata reader gone, the counts are deterministic the moment the query returns, so the waitForStableSpans poll is removed and the expected multisets are recalibrated to the synchronous query I/O. Only testJoin's first-query counts change (they now match the already-warm second query); testSelectWithFilter is unchanged. Verified under JDK 23 / trino-root 472: the three classes pass 20/20 consecutive runs with zero checkstyle violations.
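The symmetric off-by-N signature described above can be reproduced with a small deterministic model. This is only a sketch under stated assumptions: `SpanWindowRace`, `measure`, and the span counts are hypothetical names and numbers chosen for illustration, not the Trino or Hudi test code, and the background task is modeled as a deferred list rather than a real thread so the outcome is repeatable.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanWindowRace {
    // Hypothetical stand-in for a span exporter that is cleared at the start of each measurement.
    static final List<String> exporter = new ArrayList<>();
    // Spans that a background task had not yet emitted when the previous query returned.
    static final List<String> pendingBackgroundSpans = new ArrayList<>();

    // Models one measured query and returns how many spans its window observed.
    static int measure(int foregroundSpans, int backgroundSpans, boolean backgroundFinishesInTime) {
        exporter.clear();                          // the window starts empty
        exporter.addAll(pendingBackgroundSpans);   // late spans from the previous query land here
        pendingBackgroundSpans.clear();
        for (int i = 0; i < foregroundSpans; i++) {
            exporter.add("foreground");            // synchronous reads are always in the window
        }
        for (int i = 0; i < backgroundSpans; i++) {
            if (backgroundFinishesInTime) {
                exporter.add("METADATA_TABLE");
            } else {
                pendingBackgroundSpans.add("METADATA_TABLE"); // outlives the query
            }
        }
        return exporter.size();
    }

    public static void main(String[] args) {
        int first = measure(10, 18, false);  // expected 28 if the background task kept up
        int second = measure(5, 0, true);    // expected 5
        System.out.println("first=" + first + " second=" + second);
    }
}
```

In this model the first measurement is short by 18 and the second is long by the same 18, which matches the paired shortfall and surplus reported for testJoin and testSelectWithFilter.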
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18995   +/-   ##
=========================================
  Coverage     68.25%   68.25%
- Complexity    29509    29512    +3
=========================================
  Files          2544     2544
  Lines        142744   142754   +10
  Branches      17816    17816
=========================================
+ Hits          97433    97442    +9
  Misses        37304    37304
- Partials       8007     8008    +1
I'm okay with disabling this for now. We need to find a more reliable way of testing this that does not make CI flaky.
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR de-flakes three Trino plugin file-operation tests by disabling the async hudi.table-statistics-enabled refresh at the query-runner level, removes the waitForStableSpans polling helper, and recalibrates testJoin's first-query counts to the now-synchronous I/O. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
@wombatu-kun Do you mind taking a look at it?
Yes, I'll do it today.
@voonhous dug into the master failure (run 27489081724). #18995 removed only one of several asynchronous sources of these spans, so the race stayed open. The tests assert the exact per-query filesystem-span multiset, but Trino resets the span exporter at the start of each Fix in #19004: stop asserting the racy quantities and keep only the synchronous foreground reads, i.e. filter out
…ronous reads PR apache#18995 disabled the async table-statistics refresh to de-flake the three TestHudi*FileOperations classes, but they kept flaking on master with the same symmetric off-by-N mismatch in the metadata-table span counts. The stats refresh was only one of several asynchronous sources of filesystem-access spans. These tests assert the exact multiset of low-level filesystem spans a query emits. Trino resets the span exporter at the start of each executeWithPlan, so any span a background thread emits after the synchronous query returns lands in the next test's measurement window and scrambles the counts. The Hudi read path emits such spans from shared background pools that read the metadata table (split loading, partition listing, index support) and, for the Alluxio variant, from asynchronous cache population whose hit/miss outcome depends on whether an earlier cache write had completed. Rather than continue trying to time these background tasks, this change stops asserting the quantities they produce and keeps only the operations that happen synchronously on the foreground planning/scan path. getFileOperations now excludes METADATA_TABLE operations in all three classes, and additionally excludes all Alluxio.* operations in the Alluxio class; the corresponding expected entries are removed. hudi.table-statistics-enabled=false is kept because the stats executor also reads the index definition and table-property files on a background thread, which are part of the surviving asserted set. No production code is changed. Verified under JDK 23 / trino-root 472: the build succeeds with zero checkstyle violations and the three classes pass 20 consecutive runs.
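The filtering approach described in this commit message could look roughly like the following. This is a minimal sketch, not the actual getFileOperations implementation: the `FileOperation` record, the `countAsserted` helper, and the sample operation names are hypothetical; only the METADATA_TABLE type and the Alluxio.* prefix come from the text above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ForegroundOperationFilter {
    // Hypothetical shape of one recorded filesystem operation; the real test type is not shown here.
    record FileOperation(String fileType, String operationName) {}

    // Counts only the operations that stay in the asserted set.
    static Map<String, Long> countAsserted(List<FileOperation> operations, boolean excludeAlluxio) {
        return operations.stream()
                // Metadata-table reads may come from background pools, so they are never asserted.
                .filter(op -> !op.fileType().equals("METADATA_TABLE"))
                // In the Alluxio variant, cache operations depend on async cache population.
                .filter(op -> !(excludeAlluxio && op.operationName().startsWith("Alluxio.")))
                .collect(Collectors.groupingBy(
                        op -> op.fileType() + " " + op.operationName(),
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<FileOperation> ops = List.of(
                new FileOperation("DATA", "InputFile.newStream"),
                new FileOperation("DATA", "InputFile.newStream"),
                new FileOperation("METADATA_TABLE", "InputFile.length"),
                new FileOperation("DATA", "Alluxio.readCached"),
                new FileOperation("INDEX_DEFINITION", "InputFile.newStream"));
        System.out.println(countAsserted(ops, true));
    }
}
```

Counting into a map keyed by type and operation keeps the multiset semantics the tests rely on, while the two filters drop exactly the categories whose counts depend on background timing.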
Describe the issue this Pull Request addresses
The Trino-plugin tests TestHudiMemoryCacheFileOperations, TestHudiNoCacheFileOperations and TestHudiAlluxioCacheFileOperations are flaky. They intermittently fail in CI with a symmetric off-by-N mismatch in the metadata-table file-operation counts between the two paired measurements (for example testJoin short by 18 cacheLength / 24 cacheStream / 18 lastModified while testSelectWithFilter is long by exactly the same amounts). This was last observed on the CI run for PR #18988, a change confined to hudi-client-common that cannot affect the Trino read path. The earlier de-flake attempt in #18766 (a span-stability poll) reduced the failure rate but did not eliminate it.
Summary and Changelog
These tests assert the exact multiset of low-level filesystem-access spans that a single query emits against the Hudi metadata table. The root cause of the flake is that HudiMetadata.getTableStatistics submits an asynchronous table-statistics refresh on a shared background executor for every query during planning. That task reads the metadata-table column-stats partition and emits METADATA_TABLE spans that can outlive the synchronous query and arrive in the next test's measurement window, scrambling the counts (the symmetric off-by-N signature).
This PR disables the async refresh in the three test query runners by setting hudi.table-statistics-enabled=false. With the only asynchronous metadata reader gone, the file-operation counts are deterministic as soon as the query returns, so the waitForStableSpans polling helper added in #18766 is removed and the expected multisets are recalibrated to the synchronous query I/O. Only testJoin's first-query expectations change: the first query no longer performs the extra column-stats read, so its counts now equal the already-warm second query. testSelectWithFilter is unchanged. The throws InterruptedException declarations that existed only for the removed Thread.sleep are dropped.
No production code is changed, and no code was copied from third-party sources.
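As a rough illustration of keeping the setting local to the tests, the property can be held in a test-only map that is handed to whatever builds the query runner. The class and method names below are hypothetical; only the property key and value come from this description, and the way the real query runners accept connector properties is not shown here.

```java
import java.util.Map;

public class HudiFileOperationsTestConfig {
    // Hypothetical helper: connector properties used only by the file-operation tests.
    static Map<String, String> connectorProperties() {
        // Disables the asynchronous table-statistics refresh for these tests only.
        return Map.of("hudi.table-statistics-enabled", "false");
    }

    public static void main(String[] args) {
        System.out.println(connectorProperties());
    }
}
```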
Impact
Test-only change, scoped to three test classes in hudi-trino-plugin. No production code, public API, configuration default, or runtime behavior changes. The hudi.table-statistics-enabled setting is toggled only inside these tests' query runners.
Risk Level
none
Documentation Update
none
Contributor's checklist