fix(spark): support consistent hashing clustering on non-partitioned tables#18968
fix(spark): support consistent hashing clustering on non-partitioned tables#18968ad1happy2go wants to merge 2 commits into
Conversation
…tables
Consistent hashing clustering failed on non-partitioned tables because
SingleSparkJobConsistentHashingExecutionStrategy rejected the empty ("")
partition path with a "not null or empty" guard. Relax the guard to only
reject null, since an empty partition path is valid for non-partitioned
tables.
Add a parameterized test (testResizingNonPartitioned) covering split and
merge resizing on a non-partitioned table for both the single-job and
multi-job execution strategies.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
lokeshj1703
left a comment
There was a problem hiding this comment.
@ad1happy2go Thanks for working on this! I have one minor comment below. Also it seems like HUDI-18161 is not accessible anymore. Can we update the reference?
Address review feedback on PR apache#18968: - Fold testResizingNonPartitioned into testResizing via a nonPartitioned parameter and resizingConfigParams() cross-product, since the two tests were identical apart from the partition dimension. - Update the stale HUDI-18161 reference to GitHub issue apache#18161. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #18968 +/- ##
============================================
- Coverage 68.26% 68.24% -0.02%
+ Complexity 29486 29478 -8
============================================
Files 2542 2542
Lines 142580 142541 -39
Branches 17781 17798 +17
============================================
- Hits 97330 97284 -46
- Misses 37237 37253 +16
+ Partials 8013 8004 -9
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR relaxes the partition guard in SingleSparkJobConsistentHashingExecutionStrategy from !isNullOrEmpty to != null so consistent-hashing clustering can run on non-partitioned tables (whose partition path is the empty string), and folds non-partitioned coverage into the existing testResizing parameterized test. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor naming nit in the new test helper method; the rest of the changes are clean and well-commented.
cc @yihua
| // configParams crossed with the partitioned / non-partitioned dimension (isSplit, rowWriterEnable, single, nonPartitioned). | ||
| private static Stream<Arguments> resizingConfigParams() { | ||
| return configParams().flatMap(args -> { | ||
| Object[] a = args.get(); |
There was a problem hiding this comment.
🤖 nit: could you rename a to something like origArgs or baseArgs? The single-letter name is a bit hard to follow when it's indexed three times on the lines below.
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Describe the issue this Pull Request addresses
Consistent hashing bucket-index clustering (bucket resizing) fails on non-partitioned tables.
For a non-partitioned table the partition path stored in the clustering group metadata is an empty string (
"").SingleSparkJobConsistentHashingExecutionStrategyvalidated the partition with a "not null or empty" guard and threwIllegalArgumentException: Partition should not be null or emptybefore any clustering work could run, so split/merge resizing was impossible on non-partitioned tables.GitHub issue: #18161
Summary and Changelog
SingleSparkJobConsistentHashingExecutionStrategy(both the merge and split paths) from!StringUtils.isNullOrEmpty(partition)topartition != null. An empty partition path is valid for non-partitioned tables; a genuinely absent metadata key (null) is still rejected.testResizingNonPartitionedinTestSparkConsistentBucketClustering, mirroring the existingtestResizing, covering split and merge resizing on a non-partitioned table across the single-job and multi-job execution strategies and the row-writer on/off paths. Asetup(..., boolean nonPartitioned)overload configures the non-partition key generator and an empty partition-path field.Impact
Consistent hashing clustering now works on non-partitioned tables. No behavior change for partitioned tables — when the partition path is non-empty (every partitioned table), the guard evaluates identically and the code path is unchanged. No public API or config changes.
Risk Level
low
Partitioned-table behavior is byte-for-byte identical (the relaxed guard only differs when the partition path is the empty string). Verified via the new parameterized test and an end-to-end
spark-shellrun on Spark 4.0.2 against a non-partitioned MOR table with consistent-hashing bucket index: a follow-up write triggered inline split clustering throughSingleSparkJobConsistentHashingExecutionStrategy, increasing the bucket count from 2 to 4 with all records preserved.Documentation Update
none
Contributor's checklist