Skip to content

Benchmark Variant performance: shredded vs unshredded (write cost, read, data skipping, storage) #19035

@voonhous

Description

@voonhous

Summary

Characterize the performance of Variant in Hudi so we can make data-backed claims
about the shredded vs unshredded tradeoff. This is the benchmarking item referenced
in the Variant epic roadmap (#17744). It is intentionally deferred out of the initial
1.2 functional work; results will also back the planned Variant blog post.

Motivation

  • We currently ship shredded and unshredded Variant (Spark 4.0+ write, Spark 4.1+
    shredded read) without Hudi-specific performance numbers.
  • Shredding trades write cost for read pruning, predicate pushdown, and data
    skipping. We need to quantify that tradeoff on Hudi tables (COW and MOR) rather
    than borrow numbers from other projects.

Baselines to compare

Same logical data and same queries across:

  1. JSON stored as a STRING column
  2. Semi-structured data stored as a nested STRUCT
  3. Variant, unshredded
  4. Variant, shredded

Table types: COW and MOR. Engine: Spark 4.1+ (required for shredded read-back).

Metrics

Write path:

  • write throughput / wall-clock for bulk insert and for upsert / MERGE
  • write amplification: base file size, log file size (MOR), total bytes
  • shredding overhead vs unshredded write

Read path:

  • targeted single-field access latency (variant_get on a shredded field)
  • full-row reconstruction latency
  • bytes scanned and row groups pruned (predicate pushdown + data skipping on
    shredded fields)
  • projection pruning effectiveness

Storage:

  • on-disk size by encoding, with and without shredding

Workload knobs

  • field cardinality / number of top-level keys
  • nesting depth
  • shredding coverage (fraction of fields shredded vs left in the residual value)
  • selectivity of predicates over shredded fields
  • null density and type heterogeneity (exercise the typed_value fallback path)

Deliverables

  • A reproducible benchmark harness (prefer extending existing Hudi benchmark
    tooling over a new one).
  • A short results writeup (tables / plots) suitable to feed the Variant blog post.
  • Recommendations: default shredding guidance, when shredding hurts (e.g.
    write-heavy MOR), and config tuning notes.

Out of scope (tracked elsewhere)

Notes

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:schemaSchema evolution and data typestype:devtaskDevelopment tasks and maintenance work

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Open

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions