fix(spark): support consistent hashing clustering on non-partitioned tables by ad1happy2go · Pull Request #18968 · apache/hudi

ad1happy2go · 2026-06-10T15:38:51Z

Describe the issue this Pull Request addresses

Consistent hashing bucket-index clustering (bucket resizing) fails on non-partitioned tables.

For a non-partitioned table, the partition path stored in the clustering group metadata is an empty string (""). SingleSparkJobConsistentHashingExecutionStrategy validated the partition with a "not null or empty" guard and threw "IllegalArgumentException: Partition should not be null or empty" before any clustering work could run, so split/merge resizing was impossible on non-partitioned tables.

GitHub issue: #18161

Summary and Changelog

Relax the partition guard in SingleSparkJobConsistentHashingExecutionStrategy (both the merge and split paths) from !StringUtils.isNullOrEmpty(partition) to partition != null. An empty partition path is valid for non-partitioned tables; a genuinely absent metadata key (null) is still rejected.
Add a parameterized test testResizingNonPartitioned in TestSparkConsistentBucketClustering, mirroring the existing testResizing, covering split and merge resizing on a non-partitioned table across the single-job and multi-job execution strategies and the row-writer on/off paths. A setup(..., boolean nonPartitioned) overload configures the non-partition key generator and an empty partition-path field.
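
The difference between the two guards can be illustrated with a minimal, self-contained sketch. It does not reproduce the Hudi source: the class PartitionGuardSketch, its method names, and the local isNullOrEmpty helper (standing in for StringUtils.isNullOrEmpty, with assumed "null or empty string" semantics) are hypothetical and exist only for illustration.

```java
public class PartitionGuardSketch {

  // Local stand-in for StringUtils.isNullOrEmpty (assumed semantics: null or "" yields true).
  public static boolean isNullOrEmpty(String s) {
    return s == null || s.isEmpty();
  }

  // Guard as described before the change: rejects both null and the empty string.
  public static boolean passesOldGuard(String partition) {
    return !isNullOrEmpty(partition);
  }

  // Guard as described after the change: rejects only null.
  public static boolean passesNewGuard(String partition) {
    return partition != null;
  }

  public static void main(String[] args) {
    // null = missing metadata key, "" = non-partitioned table, last = an example partitioned path.
    String[] samples = {null, "", "2026/06/10"};
    for (String partition : samples) {
      System.out.println("partition=" + partition
          + " old=" + passesOldGuard(partition)
          + " new=" + passesNewGuard(partition));
    }
  }
}
```

Under these assumptions the two guards disagree only for the empty string, which is the case the PR description says non-partitioned tables produce.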

Impact

Consistent hashing clustering now works on non-partitioned tables. No behavior change for partitioned tables: when the partition path is non-empty (as it is for every partitioned table), the guard evaluates identically and the code path is unchanged. No public API or config changes.

Risk Level

low

Partitioned-table behavior is byte-for-byte identical (the relaxed guard only differs when the partition path is the empty string). Verified via the new parameterized test and an end-to-end spark-shell run on Spark 4.0.2 against a non-partitioned MOR table with consistent-hashing bucket index: a follow-up write triggered inline split clustering through SingleSparkJobConsistentHashingExecutionStrategy, increasing the bucket count from 2 to 4 with all records preserved.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

…tables

Consistent hashing clustering failed on non-partitioned tables because SingleSparkJobConsistentHashingExecutionStrategy rejected the empty ("") partition path with a "not null or empty" guard. Relax the guard to only reject null, since an empty partition path is valid for non-partitioned tables.

Add a parameterized test (testResizingNonPartitioned) covering split and merge resizing on a non-partitioned table for both the single-job and multi-job execution strategies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

lokeshj1703

@ad1happy2go Thanks for working on this! I have one minor comment below. Also it seems like HUDI-18161 is not accessible anymore. Can we update the reference?

Address review feedback on PR apache#18968:

- Fold testResizingNonPartitioned into testResizing via a nonPartitioned parameter and resizingConfigParams() cross-product, since the two tests were identical apart from the partition dimension.
- Update the stale HUDI-18161 reference to GitHub issue apache#18161.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov-commenter · 2026-06-12T12:18:22Z

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.24%. Comparing base (782bf65) to head (ce8febd).
⚠️ Report is 15 commits behind head on master.

Files with missing lines	Patch %	Lines
...gleSparkJobConsistentHashingExecutionStrategy.java	0.00%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18968      +/-   ##
============================================
- Coverage     68.26%   68.24%   -0.02%     
+ Complexity    29486    29478       -8     
============================================
  Files          2542     2542              
  Lines        142580   142541      -39     
  Branches      17781    17798      +17     
============================================
- Hits          97330    97284      -46     
- Misses        37237    37253      +16     
+ Partials       8013     8004       -9

Flag	Coverage Δ
common-and-other-modules	`44.77% <0.00%> (+0.02%)`	⬆️
hadoop-mr-java-client	`44.75% <ø> (+0.01%)`	⬆️
spark-client-hadoop-common	`48.06% <0.00%> (+<0.01%)`	⬆️
spark-java-tests	`48.76% <0.00%> (-0.03%)`	⬇️
spark-scala-tests	`44.80% <0.00%> (-0.05%)`	⬇️
utilities	`37.22% <0.00%> (-0.06%)`	⬇️


Files with missing lines	Coverage Δ
...gleSparkJobConsistentHashingExecutionStrategy.java	`91.35% <0.00%> (ø)`

... and 77 files with indirect coverage changes


hudi-bot · 2026-06-12T12:43:51Z

CI report:

ce8febd Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR relaxes the partition guard in SingleSparkJobConsistentHashingExecutionStrategy from !isNullOrEmpty to != null so consistent-hashing clustering can run on non-partitioned tables (whose partition path is the empty string), and folds non-partitioned coverage into the existing testResizing parameterized test. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor naming nit in the new test helper method; the rest of the changes are clean and well-commented.

cc @yihua

hudi-agent · 2026-06-13T19:47:56Z

+  // configParams crossed with the partitioned / non-partitioned dimension (isSplit, rowWriterEnable, single, nonPartitioned).
+  private static Stream<Arguments> resizingConfigParams() {
+    return configParams().flatMap(args -> {
+      Object[] a = args.get();


🤖 nit: could you rename a to something like origArgs or baseArgs? The single-letter name is a bit hard to follow when it's indexed three times on the lines below.

- AI-generated; verify before applying. React 👍/👎 to flag quality.
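
The cross-product helper under review can be sketched without JUnit to show the suggested rename in context. This is an illustration under stated assumptions, not the test code from the PR: each parameter set is modeled as a plain Object[] rather than JUnit's Arguments, baseParams() is a hypothetical stand-in for configParams() with made-up values, and the lambda parameter uses the baseArgs name proposed in the comment.

```java
import java.util.Arrays;
import java.util.stream.Stream;

public class ResizingParamsSketch {

  // Hypothetical stand-in for configParams(): (isSplit, rowWriterEnable, single).
  // The concrete values are invented for illustration.
  public static Stream<Object[]> baseParams() {
    return Stream.of(
        new Object[] {true, true, true},
        new Object[] {true, false, false},
        new Object[] {false, true, false});
  }

  // Crosses every base parameter set with the partitioned / non-partitioned dimension,
  // yielding (isSplit, rowWriterEnable, single, nonPartitioned).
  public static Stream<Object[]> resizingConfigParams() {
    return baseParams().flatMap(baseArgs ->
        Stream.of(false, true).map(nonPartitioned -> {
          // Copy the base arguments and append the extra flag as the last element.
          Object[] extended = Arrays.copyOf(baseArgs, baseArgs.length + 1);
          extended[baseArgs.length] = nonPartitioned;
          return extended;
        }));
  }

  public static void main(String[] args) {
    resizingConfigParams().forEach(params -> System.out.println(Arrays.toString(params)));
  }
}
```

Copying the array and appending the flag avoids indexing the base arguments one by one, which is the readability concern the nit raises; whether the actual test should take this shape is a judgment call for the author.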

github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 10, 2026

lokeshj1703 reviewed Jun 12, 2026

Comment thread ...hudi-spark/src/test/java/org/apache/hudi/functional/TestSparkConsistentBucketClustering.java Outdated

wombatu-kun approved these changes Jun 13, 2026

hudi-agent reviewed Jun 13, 2026

ad1happy2go wants to merge 2 commits into apache:master from ad1happy2go:HUDI-18161-consistent-hashing-non-partitioned

6 participants
