Summary
Characterize the performance of Variant in Hudi so we can make data-backed claims
about the shredded vs unshredded tradeoff. This is the benchmarking item referenced
in the Variant epic roadmap (#17744). It is intentionally deferred out of the initial
1.2 functional work; results will also back the planned Variant blog post.
Motivation
- We currently ship shredded and unshredded Variant (Spark 4.0+ write, Spark 4.1+
shredded read) without Hudi-specific performance numbers.
- Shredding trades write cost for read pruning, predicate pushdown, and data
skipping. We need to quantify that tradeoff on Hudi tables (COW and MOR) rather
than borrow numbers from other projects.
Baselines to compare
Same logical data and same queries across:
- JSON stored as a STRING column
- Semi-structured data stored as a nested STRUCT
- Variant, unshredded
- Variant, shredded
Table types: COW and MOR. Engine: Spark 4.1+ (required for shredded read-back).
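To make "same queries across" concrete, here is a minimal sketch of how a harness might enumerate the comparison cells and render the same single-field read for each encoding. The encoding labels and the helper names (`field_expr`, `benchmark_matrix`) are hypothetical and not part of Hudi; `get_json_object` and `variant_get` are the Spark SQL functions assumed for the JSON and Variant cases.

```python
from itertools import product

# Hypothetical labels for the four encodings and two table types listed above.
ENCODINGS = ["json_string", "struct", "variant_unshredded", "variant_shredded"]
TABLE_TYPES = ["COW", "MOR"]


def field_expr(encoding: str, column: str, field: str, sql_type: str) -> str:
    """Return a Spark SQL expression that reads one top-level field."""
    if encoding == "json_string":
        # get_json_object returns a string, so cast to the target type.
        return f"CAST(get_json_object({column}, '$.{field}') AS {sql_type})"
    if encoding == "struct":
        # Struct fields are already typed; plain dot access is enough.
        return f"{column}.{field}"
    if encoding in ("variant_unshredded", "variant_shredded"):
        # Shredding changes the physical layout, not the query text.
        return f"variant_get({column}, '$.{field}', '{sql_type}')"
    raise ValueError(f"unknown encoding: {encoding}")


def benchmark_matrix():
    """Enumerate every (encoding, table type) cell of the comparison."""
    return list(product(ENCODINGS, TABLE_TYPES))
```

Keeping the query text identical for shredded and unshredded Variant means any read-side difference between those two cells comes from the storage layout rather than from the query.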
Metrics
Write path:
- write throughput / wall-clock for bulk insert and for upsert / MERGE
- write amplification: base file size, log file size (MOR), total bytes
- shredding overhead vs unshredded write
Read path:
- targeted single-field access latency (variant_get on a shredded field)
- full-row reconstruction latency
- bytes scanned and row groups pruned (predicate pushdown + data skipping on
shredded fields)
- projection pruning effectiveness
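For the latency metrics above, a rough sketch of a timing helper, assuming warm-up runs are discarded and the median of the timed runs is reported. `time_query` is a hypothetical name; in a Spark harness `run_query` would need to force full execution of the query (for example by collecting the result), which is left to the caller here.

```python
import statistics
import time


def time_query(run_query, warmup=2, repeats=5):
    """Return the median wall-clock seconds of run_query over the timed runs."""
    # Warm-up runs are executed but not measured (JIT, caches, file listing).
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

The median is used rather than the mean so that a single slow run (GC pause, cold cache) does not dominate the reported number.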
Storage:
- on-disk size by encoding, with and without shredding
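One possible way to record the size metrics, sketched under two assumptions: write amplification is taken as bytes on disk divided by logical input bytes, and on-disk size is compared as a multiple of a chosen baseline encoding. `SizeSample`, `write_amplification`, and `relative_size` are hypothetical names; the final definitions are up to whoever builds the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeSample:
    encoding: str
    base_file_bytes: int
    log_file_bytes: int = 0  # stays 0 for COW, which has no log files

    @property
    def total_bytes(self) -> int:
        return self.base_file_bytes + self.log_file_bytes


def write_amplification(sample: SizeSample, input_bytes: int) -> float:
    """Bytes on disk per byte of logical input (assumed definition)."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return sample.total_bytes / input_bytes


def relative_size(sample: SizeSample, baseline: SizeSample) -> float:
    """On-disk size of sample as a multiple of the baseline encoding."""
    if baseline.total_bytes == 0:
        raise ValueError("baseline has no bytes on disk")
    return sample.total_bytes / baseline.total_bytes
```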
Workload knobs
- field cardinality / number of top-level keys
- nesting depth
- shredding coverage (fraction of fields shredded vs left in the residual value)
- selectivity of predicates over shredded fields
- null density and type heterogeneity (to exercise the fallback from typed_value to the residual value)
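A minimal sketch of a seeded synthetic-document generator covering some of these knobs (key count, nesting depth, null density, type heterogeneity). All names are hypothetical. Shredding coverage and predicate selectivity are properties of the table schema and the queries, so they are not modeled here.

```python
import json
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadKnobs:
    num_keys: int = 8             # keys per object, including the top level
    nesting_depth: int = 1        # 1 means a flat document
    null_density: float = 0.0     # probability that a leaf is null
    mixed_type_rate: float = 0.0  # probability that a leaf is a string, not an int


def _leaf(rng, knobs):
    if rng.random() < knobs.null_density:
        return None
    value = rng.randint(0, 1_000_000)
    # Occasionally emit the number as a string to create type heterogeneity.
    if rng.random() < knobs.mixed_type_rate:
        return str(value)
    return value


def make_document(rng, knobs, depth=1):
    doc = {}
    for i in range(knobs.num_keys):
        key = f"k{i}"
        # Nest only under the first key so size grows linearly with depth.
        if i == 0 and depth < knobs.nesting_depth:
            doc[key] = make_document(rng, knobs, depth + 1)
        else:
            doc[key] = _leaf(rng, knobs)
    return doc


def make_json_rows(knobs, num_rows, seed=0):
    """Return num_rows JSON strings; the same seed yields the same rows."""
    rng = random.Random(seed)
    return [json.dumps(make_document(rng, knobs)) for _ in range(num_rows)]
```

Generating JSON strings once and deriving the other encodings from them is one way to keep the logical data identical across the baselines.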
Deliverables
- A reproducible benchmark harness (prefer extending existing Hudi benchmark
tooling over a new one).
- A short results writeup (tables / plots) suitable to feed the Variant blog post.
- Recommendations: default shredding guidance, when shredding hurts (e.g.
write-heavy MOR), and config tuning notes.
Out of scope (tracked elsewhere)
Notes