Benchmark Variant performance: shredded vs unshredded (write cost, read, data skipping, storage) #19035

## Summary

Characterize the performance of Variant in Hudi so we can make data-backed claims
about the shredded vs unshredded tradeoff. This is the benchmarking item referenced
in the Variant epic roadmap (#17744). It is intentionally deferred from the initial
1.2 functional work; the results will also back the planned Variant blog post.

## Motivation

- We currently ship shredded and unshredded Variant (Spark 4.0+ write, Spark 4.1+
  shredded read) without Hudi-specific performance numbers.
- Shredding trades write cost for read pruning, predicate pushdown, and data
  skipping. We need to quantify that tradeoff on Hudi tables (COW and MOR) rather
  than borrow numbers from other projects.

## Baselines to compare

Same logical data and same queries across:

1. JSON stored as a STRING column
2. Semi-structured data stored as a nested STRUCT
3. Variant, unshredded
4. Variant, shredded

Table types: COW and MOR. Engine: Spark 4.1+ (required for shredded read-back).
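
To make the coverage explicit, the comparison matrix can be enumerated up front. This is only a sketch: the baseline labels are illustrative names for the four items above, not existing Hudi identifiers.

```python
from itertools import product

# Illustrative labels for the four baselines listed above (hypothetical names).
BASELINES = ["json_string", "nested_struct", "variant_unshredded", "variant_shredded"]
TABLE_TYPES = ["COW", "MOR"]


def benchmark_matrix():
    """Return every (baseline, table_type) pair the benchmark should cover."""
    return list(product(BASELINES, TABLE_TYPES))


if __name__ == "__main__":
    for baseline, table_type in benchmark_matrix():
        print(f"{baseline:>20} on {table_type}")
```

Each pair would run the same logical data and the same queries, so a gap in the matrix is visible before any numbers are collected.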

## Metrics

Write path:
- write throughput and wall-clock time for bulk insert and for upsert / MERGE
- write amplification: base file size, log file size (MOR), total bytes
- shredding overhead vs unshredded write
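
One possible way to record these write-path numbers per run is sketched below. The `WriteRun` record and `shredding_overhead` helper are hypothetical names, and defining overhead as relative extra wall-clock time is an assumption; the final harness may define it differently.

```python
from dataclasses import dataclass


@dataclass
class WriteRun:
    """Measurements from one write operation (hypothetical record shape)."""
    rows: int
    seconds: float
    base_bytes: int
    log_bytes: int = 0  # expected to be non-zero only for MOR tables

    @property
    def rows_per_second(self) -> float:
        return self.rows / self.seconds

    @property
    def total_bytes(self) -> int:
        return self.base_bytes + self.log_bytes


def shredding_overhead(shredded: WriteRun, unshredded: WriteRun) -> float:
    """Relative extra wall-clock time of the shredded write.

    A value of 0.25 means the shredded write took 25% longer.
    Assumes both runs wrote the same logical data.
    """
    return shredded.seconds / unshredded.seconds - 1.0
```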

Read path:
- targeted single-field access latency (`variant_get` on a shredded field)
- full-row reconstruction latency
- bytes scanned and row groups pruned (predicate pushdown + data skipping on
  shredded fields)
- projection pruning effectiveness
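
A minimal sketch of how the read-path numbers could be collected, assuming the query is wrapped in a zero-argument callable and that row-group counts are obtained separately (for example from scan metrics). Both helper names are hypothetical.

```python
import statistics
import time


def median_latency(run_query, repeats=5, warmup=1):
    """Median wall-clock seconds of run_query() after discarding warmup runs."""
    for _ in range(warmup):
        run_query()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def pruned_fraction(total_row_groups, scanned_row_groups):
    """Fraction of row groups skipped; 0.0 when there is nothing to scan."""
    if total_row_groups == 0:
        return 0.0
    return 1.0 - scanned_row_groups / total_row_groups
```

Reporting the median rather than the mean keeps a single slow run (JIT warmup, GC pause) from dominating the result.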

Storage:
- on-disk size by encoding, with and without shredding
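
For tables written to a local filesystem, on-disk size could be tallied with a sketch like the one below. It assumes base files end in `.parquet`, log file names contain `.log.`, and table metadata lives under `.hoodie`; the helper name is hypothetical, and object stores would need a different listing call.

```python
import os
from collections import Counter


def size_by_file_kind(table_path):
    """Sum file sizes under table_path, grouped as 'base', 'log', or 'other'."""
    sizes = Counter()
    for root, dirs, files in os.walk(table_path):
        # Assumption: skip table metadata, which lives under ".hoodie".
        dirs[:] = [d for d in dirs if d != ".hoodie"]
        for name in files:
            if name.endswith(".parquet"):
                kind = "base"   # assumption: Parquet base files
            elif ".log." in name:
                kind = "log"    # assumption: MOR log file naming
            else:
                kind = "other"  # partition metadata, checksum files, etc.
            sizes[kind] += os.path.getsize(os.path.join(root, name))
    return dict(sizes)
```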

## Workload knobs

- field cardinality / number of top-level keys
- nesting depth
- shredding coverage (fraction of fields shredded vs left in the residual value)
- selectivity of predicates over shredded fields
- null density and type heterogeneity (exercise the `typed_value` fallback path)
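
Several of these knobs can be driven from a seeded synthetic-document generator. The sketch below is one possible shape, not the harness design: the function and parameter names are hypothetical, nesting is modeled by placing a child object under the last key, and type heterogeneity is modeled by occasionally emitting a string where an integer is expected.

```python
import json
import random


def make_document(rng, num_keys=8, depth=2, null_density=0.1, mixed_type_rate=0.05):
    """Build one semi-structured document as a dict."""
    doc = {}
    for i in range(num_keys):
        key = f"k{i}"
        if rng.random() < null_density:
            doc[key] = None
        elif depth > 1 and i == num_keys - 1:
            # The last key carries the nested object, one level shallower.
            doc[key] = make_document(rng, num_keys, depth - 1,
                                     null_density, mixed_type_rate)
        elif rng.random() < mixed_type_rate:
            # Off-type value: a string where an integer is the common type.
            doc[key] = str(rng.randint(0, 1000))
        else:
            doc[key] = rng.randint(0, 1000)
    return doc


def generate(n, seed=42, **knobs):
    """Return n JSON strings; the same seed and knobs give the same data."""
    rng = random.Random(seed)
    return [json.dumps(make_document(rng, **knobs)) for _ in range(n)]
```

Because the output is plain JSON text, the same rows can be loaded into all four baselines, which keeps the logical data identical across them. Shredding coverage and predicate selectivity are table-side and query-side knobs and are not covered by this sketch.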

## Deliverables

- A reproducible benchmark harness (prefer extending existing Hudi benchmark
  tooling over a new one).
- A short results writeup (tables / plots) suitable to feed the Variant blog post.
- Recommendations: default shredding guidance, when shredding hurts (e.g.
  write-heavy MOR), and config tuning notes.

## Out of scope (tracked elsewhere)

- read-then-reshred / compaction over shredded base files: #18931
- colstats for shredded variants: #17988
- auto shredding schema inference: #18038, #18937

## Notes

- Parent epic: #17744
- Extension of RFC-99.

