Skip to content

fix(spark): support consistent hashing clustering on non-partitioned tables#18968

Open
ad1happy2go wants to merge 2 commits into
apache:masterfrom
ad1happy2go:HUDI-18161-consistent-hashing-non-partitioned
Open

fix(spark): support consistent hashing clustering on non-partitioned tables#18968
ad1happy2go wants to merge 2 commits into
apache:masterfrom
ad1happy2go:HUDI-18161-consistent-hashing-non-partitioned

Conversation

@ad1happy2go

@ad1happy2go ad1happy2go commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Describe the issue this Pull Request addresses

Consistent hashing bucket-index clustering (bucket resizing) fails on non-partitioned tables.

For a non-partitioned table the partition path stored in the clustering group metadata is an empty string (""). SingleSparkJobConsistentHashingExecutionStrategy validated the partition with a "not null or empty" guard and threw IllegalArgumentException: Partition should not be null or empty before any clustering work could run, so split/merge resizing was impossible on non-partitioned tables.

GitHub issue: #18161

Summary and Changelog

  • Relax the partition guard in SingleSparkJobConsistentHashingExecutionStrategy (both the merge and split paths) from !StringUtils.isNullOrEmpty(partition) to partition != null. An empty partition path is valid for non-partitioned tables; a genuinely absent metadata key (null) is still rejected.
  • Add a parameterized test testResizingNonPartitioned in TestSparkConsistentBucketClustering, mirroring the existing testResizing, covering split and merge resizing on a non-partitioned table across the single-job and multi-job execution strategies and the row-writer on/off paths. A setup(..., boolean nonPartitioned) overload configures the non-partition key generator and an empty partition-path field.

Impact

Consistent hashing clustering now works on non-partitioned tables. No behavior change for partitioned tables — when the partition path is non-empty (every partitioned table), the guard evaluates identically and the code path is unchanged. No public API or config changes.

Risk Level

low

Partitioned-table behavior is byte-for-byte identical (the relaxed guard only differs when the partition path is the empty string). Verified via the new parameterized test and an end-to-end spark-shell run on Spark 4.0.2 against a non-partitioned MOR table with consistent-hashing bucket index: a follow-up write triggered inline split clustering through SingleSparkJobConsistentHashingExecutionStrategy, increasing the bucket count from 2 to 4 with all records preserved.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…tables

Consistent hashing clustering failed on non-partitioned tables because
SingleSparkJobConsistentHashingExecutionStrategy rejected the empty ("")
partition path with a "not null or empty" guard. Relax the guard to only
reject null, since an empty partition path is valid for non-partitioned
tables.

Add a parameterized test (testResizingNonPartitioned) covering split and
merge resizing on a non-partitioned table for both the single-job and
multi-job execution strategies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Jun 10, 2026

@lokeshj1703 lokeshj1703 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ad1happy2go Thanks for working on this! I have one minor comment below. Also it seems like HUDI-18161 is not accessible anymore. Can we update the reference?

Address review feedback on PR apache#18968:
- Fold testResizingNonPartitioned into testResizing via a nonPartitioned
  parameter and resizingConfigParams() cross-product, since the two tests
  were identical apart from the partition dimension.
- Update the stale HUDI-18161 reference to GitHub issue apache#18161.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.24%. Comparing base (782bf65) to head (ce8febd).
⚠️ Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
...gleSparkJobConsistentHashingExecutionStrategy.java 0.00% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18968      +/-   ##
============================================
- Coverage     68.26%   68.24%   -0.02%     
+ Complexity    29486    29478       -8     
============================================
  Files          2542     2542              
  Lines        142580   142541      -39     
  Branches      17781    17798      +17     
============================================
- Hits          97330    97284      -46     
- Misses        37237    37253      +16     
+ Partials       8013     8004       -9     
Flag Coverage Δ
common-and-other-modules 44.77% <0.00%> (+0.02%) ⬆️
hadoop-mr-java-client 44.75% <ø> (+0.01%) ⬆️
spark-client-hadoop-common 48.06% <0.00%> (+<0.01%) ⬆️
spark-java-tests 48.76% <0.00%> (-0.03%) ⬇️
spark-scala-tests 44.80% <0.00%> (-0.05%) ⬇️
utilities 37.22% <0.00%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...gleSparkJobConsistentHashingExecutionStrategy.java 91.35% <0.00%> (ø)

... and 77 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent hudi-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR relaxes the partition guard in SingleSparkJobConsistentHashingExecutionStrategy from !isNullOrEmpty to != null so consistent-hashing clustering can run on non-partitioned tables (whose partition path is the empty string), and folds non-partitioned coverage into the existing testResizing parameterized test. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor naming nit in the new test helper method; the rest of the changes are clean and well-commented.

cc @yihua

// configParams crossed with the partitioned / non-partitioned dimension (isSplit, rowWriterEnable, single, nonPartitioned).
private static Stream<Arguments> resizingConfigParams() {
return configParams().flatMap(args -> {
Object[] a = args.get();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: could you rename a to something like origArgs or baseArgs? The single-letter name is a bit hard to follow when it's indexed three times on the lines below.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants