feat(variant): [DNM] auto-infer per-file variant shredding schemas on shredding inference#18961

Open
voonhous wants to merge 18 commits into apache:master from voonhous:feat-18937-variant-shredding-inference

@voonhous voonhous commented Jun 10, 2026

Member

Describe the issue this Pull Request addresses

Closes #18937
Closes #18038

Hudi can write and read shredded variants, but typed_value only ever comes from an explicit table schema or the test-only force-shredding DDL, so production tables never shred. Spark 4.1 infers a per-file shredding schema from the data (SPARK-53659, on by default), but that logic lives in Spark's own writer stack, which Hudi bypasses.

Stacked on #18938 (read-side reconstruction), which stacks on #18065. Do not merge until both land.

Summary and Changelog

When hoodie.parquet.variant.shredding.schema.inference.enabled is set (default false), Hudi infers a shredding schema per base file from a sample of the records written to it, for both record types (SPARK, AVRO) and the bulk-insert row writer. Requires Spark 4.1+ on the writer classpath; Spark 4.0/Flink/Java silently keep writing unshredded.

  • New config hoodie.parquet.variant.shredding.schema.inference.enabled in HoodieStorageConfig.
  • New VariantShreddingSchemaInferrer SPI in hudi-common, loaded by classpath detection (VariantShreddingRuntime, which also consolidates the duplicated provider-candidate arrays).
  • Spark41VariantShreddingSchemaInferrer (hudi-spark4.1.x) delegates to Spark's InferVariantShreddingSchema, so Hudi inherits Spark's heuristics verbatim (no code copied). One call per file covers all variant columns (preserves the global width budget); object keys that are not valid Avro names are dropped and legally fall back to the residual value column.
  • VariantShreddingInferenceFileWriter (+ a row-writer sibling): buffers up to 4096 records / 64MB (mirrors Spark's ParquetOutputWriterWithVariantShredding), infers once, creates the real writer with the inferred typed_value spliced in, then replays in order. Inference failures fall back to unshredded (a throwing inference must not fail compaction); writer-creation or replay failures latch and rethrow through close() so buffered records cannot be dropped silently.
  • Wiring: the AVRO factory splices the schema argument; the SPARK and row-writer factories splice a copied config, since HoodieRowParquetWriteSupport resolves its schema from hoodie.write.schema / hoodie.avro.schema rather than the factory argument.
  • Fixes latent issues this feature trips: Variant.getPlainTypedValueSchema is now recursive (nested objects, arrays, value-only wrappers), Avro "Field already used" in stripVariantShredding / VariantReconstruction, and the table-schema footer fallback now strips typed_value by shape so per-file layouts never leak into the resolved table schema.
  • Tests: unit coverage for the decorator, schema utils and HoodieSchema recursion; functional tests in TestVariantDataType for COW (multi-column with declines, update over a shredded base), MOR inline compaction, and the bulk-insert row writer, all gated on Spark 4.1+.
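The buffer, infer once, replay flow described for VariantShreddingInferenceFileWriter can be sketched roughly as below. This is a simplified illustration, not the PR's code: RecordSink, BufferingInferenceWriter, finish() and the two Function parameters are hypothetical names, only the record-count limit is modeled (the 64MB byte budget is omitted), and the latch-and-rethrow handling for writer-creation or replay failures is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for a file writer; not Hudi's HoodieFileWriter API. */
interface RecordSink<R> {
  void write(R record);
}

/** Buffers the first records, infers a schema once, then replays the buffer in order. */
final class BufferingInferenceWriter<R, S> implements RecordSink<R> {
  private final int maxBufferedRecords;                    // the PR description uses 4096
  private final Function<List<R>, S> inferrer;             // may throw
  private final Function<S, RecordSink<R>> writerFactory;  // null schema means "unshredded"
  private final List<R> buffer = new ArrayList<>();
  private RecordSink<R> delegate;                          // created only after inference

  BufferingInferenceWriter(int maxBufferedRecords,
                           Function<List<R>, S> inferrer,
                           Function<S, RecordSink<R>> writerFactory) {
    this.maxBufferedRecords = maxBufferedRecords;
    this.inferrer = inferrer;
    this.writerFactory = writerFactory;
  }

  @Override
  public void write(R record) {
    if (delegate != null) {
      delegate.write(record);   // schema already decided: pass straight through
      return;
    }
    buffer.add(record);
    if (buffer.size() >= maxBufferedRecords) {
      materialize();
    }
  }

  /** Call before closing so files smaller than the buffer are still written. */
  void finish() {
    if (delegate == null) {
      materialize();
    }
  }

  private void materialize() {
    S schema;
    try {
      schema = inferrer.apply(buffer);
    } catch (RuntimeException e) {
      schema = null;            // inference failure falls back to unshredded
    }
    delegate = writerFactory.apply(schema);
    for (R r : buffer) {
      delegate.write(r);        // replay in arrival order
    }
    buffer.clear();
  }
}
```

Deferring writer creation until the sample is available is what lets the inferred typed_value be part of the file schema from the first record on; the cost is holding the sample in memory until inference runs.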

Impact

No behavior change unless the new config is enabled. When enabled on Spark 4.1+, base files carry a per-file inferred typed_value; readers already handle shredded files (#18938 on the AVRO path, Spark native otherwise). MOR log files always stay unshredded; shredding materializes at compaction. Flipping the default on is a follow-up tracked in #18937.
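For anyone trying this out, a minimal sketch of opting in per write. Only the config key comes from this PR; the ShreddingInferenceOptions helper and its method name are made up for illustration, and the key is written as a literal string rather than via a HoodieStorageConfig constant.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; only the config key below is taken from this PR. */
final class ShreddingInferenceOptions {
  static final String INFERENCE_ENABLED =
      "hoodie.parquet.variant.shredding.schema.inference.enabled";

  /** Returns a copy of the given write options with inference switched on. */
  static Map<String, String> withInferenceEnabled(Map<String, String> base) {
    Map<String, String> opts = new LinkedHashMap<>(base); // copy, leave the caller's map untouched
    opts.put(INFERENCE_ENABLED, "true");                  // default is false
    return opts;
  }
}
```

The resulting map would be passed along with the usual Hudi write options. Per the description above, the setting only takes effect when the Spark 4.1 inferrer is on the writer classpath.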

Risk Level

Low. The feature is off by default and engines without the Spark 4.1 inferrer on the classpath are unaffected even when it is on. Verified with new unit and functional tests across both record types and all three write paths, plus compile checks under the spark3.5, spark4.0 and spark4.1 profiles.

Documentation Update

New config documented via its withDocumentation text (picked up by the generated config reference). Website updates deferred to the default-flip follow-up.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous voonhous force-pushed the feat-18937-variant-shredding-inference branch from 6e26ed9 to 7e260ad June 10, 2026 10:20
@github-actions github-actions Bot added the size:XL PR with lines of changes > 1000 label Jun 10, 2026
@voonhous

Member Author

Heads up on a latent bug inherited from the #18938 read path, found while testing inference here and fixed in this PR:

VariantReconstruction never engaged on real files. The reader derives the file schema by converting the parquet footer MessageType, and that conversion loses the variant logical type, so the getType() == VARIANT check never matched. The shredded base file was then read with the unshredded {metadata, value} projection, silently dropping all typed_value fields on the AVRO read path (a reconstructed row came back as just the residual, e.g. {"bad-key":false}).

Fixed in this PR with shape-based detection anchored on the requested column: the requested side (from the table schema, logical type intact) must be a variant, and the on-disk side is matched by the shredded shape {metadata: bytes, value: [nullable] bytes, typed_value}, so a plain footer-derived record still triggers reconstruction.
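To make the shape check concrete, here is a rough sketch of the decision, assuming a heavily simplified schema model. The ShreddedShapeDetector class and the Kind enum are hypothetical; the real check works on the requested and footer-derived schemas and may apply further conditions not shown here.

```java
import java.util.Map;

/** Hypothetical sketch of shape-based detection; not the PR's implementation. */
final class ShreddedShapeDetector {
  /** Simplified field kinds; real code would inspect Avro or Parquet types instead. */
  enum Kind { BYTES, NULLABLE_BYTES, OTHER }

  /**
   * @param requestedIsVariant whether the requested column (from the table schema) is a variant
   * @param onDiskFields       field name to kind for the footer-derived record
   */
  static boolean needsReconstruction(boolean requestedIsVariant, Map<String, Kind> onDiskFields) {
    if (!requestedIsVariant) {
      return false; // anchor on the requested side, whose logical type is intact
    }
    Kind value = onDiskFields.get("value");
    return onDiskFields.get("metadata") == Kind.BYTES
        && (value == Kind.BYTES || value == Kind.NULLABLE_BYTES)
        && onDiskFields.containsKey("typed_value");
  }
}
```

Anchoring on the requested column keeps an ordinary record that merely happens to have fields named metadata, value and typed_value from being treated as a shredded variant.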

Worth noting why #18938 did not catch it: the new COW inference test here is the first to read a shredded base file end to end through the AVRO reader. The existing MOR compaction test compacts a log-only file group (no shredded base to read), and post-compaction queries go through the Spark native reader.

@codecov-commenter


Codecov Report

❌ Patch coverage is 70.35611% with 308 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.29%. Comparing base (c40f765) to head (5e10346).
⚠️ Report is 24 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/hudi/variant/Spark4VariantShreddingProvider.java | 48.10% | 95 Missing and 28 partials ⚠️ |
| ...ariantShreddingInferenceInternalRowFileWriter.java | 57.42% | 29 Missing and 14 partials ⚠️ |
| ...a/org/apache/hudi/avro/HoodieAvroWriteSupport.java | 73.68% | 19 Missing and 16 partials ⚠️ |
| ...o/storage/VariantShreddingInferenceFileWriter.java | 82.79% | 13 Missing and 3 partials ⚠️ |
| ...ariant/Spark41VariantShreddingSchemaInferrer.scala | 65.21% | 5 Missing and 11 partials ⚠️ |
| .../hudi/io/storage/hadoop/VariantReconstruction.java | 79.10% | 7 Missing and 7 partials ⚠️ |
| ...va/org/apache/hudi/common/schema/HoodieSchema.java | 71.73% | 3 Missing and 10 partials ⚠️ |
| ...g/apache/hudi/avro/AvroVariantSampleExtractor.java | 64.70% | 7 Missing and 5 partials ⚠️ |
| .../java/org/apache/hudi/avro/VariantSchemaUtils.java | 90.00% | 2 Missing and 10 partials ⚠️ |
| ...e/hudi/io/storage/SparkVariantSampleExtractor.java | 73.68% | 2 Missing and 3 partials ⚠️ |

... and 8 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18961      +/-   ##
============================================
- Coverage     68.77%   68.29%   -0.49%     
- Complexity    29091    29721     +630     
============================================
  Files          2517     2552      +35     
  Lines        139822   143673    +3851     
  Branches      17209    18002     +793     
============================================
+ Hits          96167    98116    +1949     
- Misses        35883    37444    +1561     
- Partials       7772     8113     +341     
| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.80% <41.63%> (+0.54%) ⬆️ |
| hadoop-mr-java-client | 44.55% <18.83%> (-0.20%) ⬇️ |
| spark-client-hadoop-common | 47.75% <14.79%> (-0.30%) ⬇️ |
| spark-java-tests | 48.65% <33.49%> (-0.67%) ⬇️ |
| spark-scala-tests | 45.09% <63.90%> (-0.13%) ⬇️ |
| utilities | 37.13% <15.64%> (-0.16%) ⬇️ |

| Files with missing lines | Coverage Δ |
|---|---|
| ...ache/hudi/avro/VariantShreddingSchemaInferrer.java | 100.00% <100.00%> (ø) |
| ...apache/hudi/common/config/HoodieStorageConfig.java | 89.48% <100.00%> (+0.63%) ⬆️ |
| .../apache/hudi/common/table/TableSchemaResolver.java | 87.84% <100.00%> (+0.13%) ⬆️ |
| ...udi/io/storage/hadoop/HoodieAvroParquetReader.java | 93.44% <100.00%> (+0.46%) ⬆️ |
| ...rces/parquet/Spark40HoodieParquetReadSupport.scala | 78.94% <ø> (ø) |
| ...ion/datasources/parquet/Spark40ParquetReader.scala | 95.73% <100.00%> (+0.73%) ⬆️ |
| ...ion/datasources/parquet/Spark41ParquetReader.scala | 95.53% <100.00%> (+0.10%) ⬆️ |
| .../hudi/io/storage/HoodieSparkFileWriterFactory.java | 90.14% <90.90%> (+0.14%) ⬆️ |
| ...i/io/storage/row/HoodieRowParquetWriteSupport.java | 78.36% <75.00%> (+2.09%) ⬆️ |
| ...scala/org/apache/spark/sql/hudi/SparkAdapter.scala | 56.25% <0.00%> (-3.75%) ⬇️ |

... and 15 more

... and 119 files with indirect coverage changes

@voonhous voonhous changed the title feat(variant): [DNM] auto-infer per-file variant shredding schemas on writedding inference feat(variant): [DNM] auto-infer per-file variant shredding schemas on shredding inference Jun 12, 2026

@hudi-agent hudi-agent left a comment

Contributor

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds per-file variant shredding schema inference for both the Avro record write path and the row writer, with a decorator-based file writer, an Avro<->VariantSchema bridge for write/rebuild, and classpath-detection of the Spark 4.1 InferVariantShreddingSchema provider. The implementation is generally careful (defensive copies, null/missing-field handling, identifier sanitization, precedence-aware schema splicing), but a few spots are worth a second look: a session-level SQLConf mutation inside HoodieHadoopFsRelationFactory that could affect other queries, a redundant delegate.close() pattern in VariantShreddingInferenceFileWriter, and an ambiguity in the Avro timestamp/local-timestamp logical-type mapping in Spark4VariantShreddingProvider. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. There are a couple of naming and ordering nits below and a couple of minor readability nits in Spark4VariantShreddingProvider; the code is otherwise clean and well-documented.

}
if ("timestamp-micros".equals(name)) {
  return new VariantSchema.TimestampType();
}

Contributor

🤖 Mapping both timestamp-millis and timestamp-micros (and similarly local-* variants) to the same VariantSchema.TimestampType()/TimestampNTZType() looks suspicious — if Spark's VariantSchema.TimestampType has fixed micros precision (variant binary format stores µs), then a millis Long would be interpreted as µs and be off by 1000x. In the inferred-write flow this branch is dead (Hudi always generates timestamp-micros via HoodieSchema.createTimestampMicros()), but the read/rebuild path could in principle hit an Avro schema with millis. Is this branch intended as a true equivalence, or should millis schemas decline shredding (return null) so rebuild doesn't silently misinterpret values?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Member Author

Good catch -- it's currently unreachable but a real latent trap, so I hardened it. The Variant binary spec stores timestamps in microseconds, so a timestamp-millis typed_value can't represent a variant timestamp; mapping it to the micros TimestampType would scale the value 1000x on rebuild. Made millis decline instead, so the value stays in the residual unshredded binary:

// The Variant binary spec stores timestamps in microseconds, so a millisecond-precision
// typed_value cannot represent a variant timestamp. Decline to shred it as a scalar (the value
// stays in the residual unshredded binary) rather than mapping it to the micros TimestampType,
// which would silently scale the value by 1000x.
if ("timestamp-millis".equals(name) || "local-timestamp-millis".equals(name)) {
  return null;
}

In practice this never fires (Hudi writes typed_value via createTimestampMicros, the read path only reads Hudi-shredded files, and the Spark inferrer produces micros from micros-only variant samples), so behavior is unchanged -- this just removes the silent-corruption path.

avroTypeToScalarType is shared write/read code and the concern is sharpest on the rebuild path, so this landed in the base PR #18938; it'll flow into this PR on the next rebase.
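For reference, the 1000x scaling is easy to reproduce with plain java.time. This is standalone arithmetic for illustration, not code from the PR; TimestampUnitMismatch and its methods are hypothetical names.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Illustrates what happens when a millisecond count is decoded as microseconds. */
final class TimestampUnitMismatch {
  /** Decodes a count of microseconds since the epoch. */
  static Instant fromMicros(long micros) {
    return Instant.EPOCH.plus(micros, ChronoUnit.MICROS);
  }

  /** Decodes a count of milliseconds since the epoch. */
  static Instant fromMillis(long millis) {
    return Instant.ofEpochMilli(millis);
  }

  /** Ratio between the correct and the mis-decoded epoch seconds for a millis value. */
  static long scaleError(long millis) {
    return fromMillis(millis).getEpochSecond() / fromMicros(millis).getEpochSecond();
  }
}
```

With a stored value of 1_700_000_000_000 (milliseconds), fromMillis yields epoch second 1_700_000_000, while decoding the same long as microseconds yields epoch second 1_700_000, so the timestamp lands 1000x closer to the epoch than intended.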

voonhous added 9 commits June 18, 2026 17:06
…pache#18931)

Compaction/clustering reading an already-shredded base file via the AVRO record
path now rebuilds the unshredded {metadata, value} variant before records reach
the merger/writer, replacing the prior fail-fast guard.

- VariantShreddingProvider: add rebuildVariantRecord (inverse of shredVariantRecord);
  Spark4VariantShreddingProvider implements it via Spark's ShreddingUtils.rebuild over
  Avro-backed ShreddedRow rows.
- HoodieAvroParquetReader: detect shredded variant columns, read them at the file's
  shredded schema, and reconstruct to unshredded per record (VariantReconstruction);
  provider loaded via config or classpath, gated on
  hoodie.parquet.variant.allow.reading.shredded.
- Extract stripVariantShredding into VariantSchemaUtils (shared by reader/writer).
- Remove the read-then-reshred guard from HoodieAvroWriteSupport and its unit test.
- Extend the MOR compaction test in TestVariantDataType to write shredded and read
  back (AVRO reconstruction + SPARK native via withRecordType).

The compaction base file is written by the AVRO shredding writer as
[metadata, value, typed_value]. Spark 4.0's reorderVariantFields rebuilds
that group as [value, metadata] and drops typed_value, so the native read
after compaction fails with MALFORMED_VARIANT. Spark 4.1+ reads variant
fields by name (SPARK-54410) and reconstructs correctly.

Fixes the spark4.0 leg of apache#18931.

withSchema(replacement) rebuilds the field from replacement, which already
mirrors the original nullability, so the intermediate makeNullable() was a
no-op (and could reset field order for non-nullable fields).

VariantReconstruction and HoodieAvroFileWriterFactory each hardcoded the
provider FQN and a Class.forName lookup. Move the candidate list and detection
into a shared VariantShreddingProvider.detectProviderClassOnClasspath(), so a
new provider impl is registered in one place.
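A minimal sketch of this kind of candidate-list detection, assuming nothing about the real candidate names: ClasspathDetector and firstLoadable are hypothetical, and the actual helper may differ in how it loads, caches, or instantiates the provider.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of classpath detection over a single shared candidate list. */
final class ClasspathDetector {
  /** Returns the first candidate class name that can be loaded, if any. */
  static Optional<String> firstLoadable(List<String> candidates) {
    for (String fqn : candidates) {
      try {
        // initialize=false: only check that the class is present, do not run static initializers
        Class.forName(fqn, false, ClasspathDetector.class.getClassLoader());
        return Optional.of(fqn);
      } catch (ClassNotFoundException | LinkageError e) {
        // not usable on this classpath; try the next candidate
      }
    }
    return Optional.empty();
  }
}
```

Keeping the candidates in one list means a new provider implementation is registered in a single place, which is the point of the commit above.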
…ruction

Every other class in org.apache.hudi.io.storage.hadoop carries the Hoodie
prefix (HoodieAvroParquetReader, HoodieHadoopIOFactory, ...); this was the lone
exception. Package-private with a single caller, so the rename is contained.

…onstruction

The helper unwraps a nullable union type, not a null-assertion guard; the new
name reflects what it does at each call site.

…on provider

When a requested column is shredded in the base file and reading shredded
variants is enabled but no provider is loadable, create() returned null and the
reader fell back to the unshredded requested schema. parquet-avro then reads
value/metadata and drops typed_value, silently corrupting variants whose payload
lived in typed_value. Throw HoodieException instead, mirroring the write path.
Removes the now-unused logger.

metadata is REQUIRED in the shredded parquet schema, so a null metadata on the
read path means a malformed base file. Returning null let the caller null out the
whole variant column, silently dropping data; throw HoodieException instead. The
separate null-record guard stays (genuine null variant passes through).

…back comment

The Spark SQL read-back is gated on Spark 4.1+ because Spark 4.0's native parquet
reader rejects the 3-field shredded layout (SPARK-54410) - not because of apache#18931.
apache#18931 is the AVRO reader reconstruction and does not affect the Spark native read,
so drop the misattributed pointer.

@voonhous voonhous force-pushed the feat-18937-variant-shredding-inference branch from 5e10346 to 86061dd June 19, 2026 02:29
…d_value directly

The scalar getters ignore ordinal and read typed_value directly while
isNullAt/getBinary go through fieldNameFor(ordinal). Spark only calls the
scalar getters for the scalar typed_value, so the asymmetry is intentional;
add a comment so it does not read as an oversight.

@voonhous voonhous force-pushed the feat-18937-variant-shredding-inference branch from 6ebf27f to 0f7027b June 20, 2026 08:33
voonhous added 8 commits June 20, 2026 18:57
…ype mapping

avroTypeToScalarType mapped timestamp-millis / local-timestamp-millis to the
micros-based TimestampType / TimestampNTZType. The Variant binary spec stores
timestamps in microseconds, so a millis-precision typed_value would be read back
as micros and scaled 1000x. Unreachable today (Hudi/Spark only produce micros
typed_value), but decline to shred millis as a scalar so it can never silently
corrupt; the value stays in the residual unshredded binary.

Today typed_value comes only from an explicit table schema or the test-only
force-shredding DDL, so production tables never shred variants. This adds
per-file inference of the shredding schema from the first records of each
base file, for both HoodieRecordType paths and the bulk-insert row writer,
reusing Spark 4.1's InferVariantShreddingSchema heuristics verbatim
(SPARK-53659). Gated by hoodie.parquet.variant.shredding.schema.inference.enabled
(default off); on Spark 4.0/Flink/Java the inferrer is absent from the
classpath and writes silently stay unshredded.

A buffering HoodieFileWriter decorator (mirroring Spark's
ParquetOutputWriterWithVariantShredding: 4096 records / 64MB, infer once,
replay in order) defers the parquet writer until the schema is known. The
AVRO factory splices the inferred typed_value into the schema argument; the
SPARK and row-writer factories splice a copied config because the row write
support resolves its schema from hoodie.write.schema/hoodie.avro.schema.
Inference failures decline to unshredded (a throwing inference must not fail
compaction); writer-creation/replay failures latch and rethrow through
close() so buffered records cannot be dropped silently. The Spark 4.1
inferrer batches all variant columns into one call (global width budget) and
drops avro-illegal object keys, which legally fall back to the residual
value column.

Also fixes latent issues this feature would trip: recursive
Variant.getPlainTypedValueSchema (depth>=2 objects, arrays, value-only
wrappers), avro field-reuse (Field already used) in stripVariantShredding
and VariantReconstruction, and the table-schema footer fallback now strips
typed_value by shape so per-file layouts never leak into the resolved
table schema.

Stacked on apache#18938 (read-side reconstruction); part of apache#18937.

…i-variant tables

Fixes surfaced by the auto-inference COW test, which is the first to read a
shredded base file end to end:

- VariantReconstruction never engaged on real files: the reader's file schema
  comes from converting the parquet footer MessageType, which loses the
  variant logical type, so the shredded group was projected down to
  {metadata, value} and typed_value was silently dropped on the AVRO
  read-then-rewrite path. Detect the on-disk side by shape, anchored on the
  requested column being a variant.
- The inferred-shredding config splice aliased every variant column: columns
  share one record type named 'variant', which Avro serializes as name
  references after the first occurrence, so replacing one column's record
  with a same-named shredded definition shredded all of them on re-parse.
  Spliced records now get a per-column unique name.
- getPlainTypedValueSchema named every nesting level '<name>_plain' with a
  null namespace; for nested objects (every spec level is named typed_value)
  the inner record became an Avro self-reference of its ancestor, which
  Spark rejects as recursion. Names now chain the field path.
- HoodieRowParquetWriteSupport warned 'no corresponding HoodieSchema' for
  every nullable variant column because the top-level check did not unwrap
  the field's nullable union; shredding still happened via the nested
  fallthrough, so this only silenced a misleading warning.
- Test fixes: cast(variant as string) on a string-typed variant extracts the
  raw string (not its JSON form), and the decline column now uses per-row
  empty objects: inference is per file and a multi-row insert can fan out to
  one file per row, so cross-row type conflicts never reach one inference
  call and cannot decline deterministically.

…ence is off

resolveConfigSchema parses the avro schema string; gate it behind the
inference flag so the default-off path adds no per-file cost.

…rk version

PR CI runs on the merge ref against current master, whose spark4.2 profile
satisfies gteqSpark4_1 but builds only hudi-spark4.2.x, which has no
shredding-schema inferrer yet; the three inference tests then ran without
one and failed their typed_value assertions on silently-unshredded files.
Gate them on VariantShreddingRuntime.lookupInferrer() instead so any
profile without an inferrer cancels rather than fails, and starts running
again as soon as that version module ships its inferrer.

…riantShreddingRuntime

The inference work centralizes engine-component classpath detection in
VariantShreddingRuntime (provider + schema inferrer, memoized). The merge
converged both call sites on it, leaving the interface helper added earlier in
this stack unused; remove it and its CLASSPATH_CANDIDATES constant.

… assertions

The inference tests cherry-picked here called the groupContainsField test helper,
which the merged apache#18065 inlined to GroupType.containsField. The helper no longer
exists, so the calls failed to compile; switch them to the inlined form.

…ose()

On the success path the try already called delegate.close(); if it threw, the
catch closed it again, relying on delegate close() being idempotent. Track a
delegateClosed flag (set before the try-path close) so the catch only cleans up
a delegate that materialize() created but never closed. Applied to both
VariantShreddingInferenceFileWriter and its InternalRow sibling.
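The close-once pattern from this commit message can be sketched roughly as follows. CloseOnceWriter and the flush parameter are hypothetical stand-ins for the real writer and its materialize/replay step; only the delegateClosed flag and its placement follow the description above.

```java
/** Hypothetical sketch of closing a delegate at most once; not the PR's code. */
final class CloseOnceWriter implements AutoCloseable {
  private final Runnable flush;            // stands in for materialize/replay work, may throw
  private final AutoCloseable delegate;
  private boolean delegateClosed;

  CloseOnceWriter(Runnable flush, AutoCloseable delegate) {
    this.flush = flush;
    this.delegate = delegate;
  }

  @Override
  public void close() throws Exception {
    try {
      flush.run();
      delegateClosed = true;               // set before closing so a throwing close() is not retried
      delegate.close();
    } catch (Exception e) {
      if (!delegateClosed) {
        // the failure happened before the success-path close; still release the delegate
        try {
          delegate.close();
        } catch (Exception cleanup) {
          e.addSuppressed(cleanup);
        }
      }
      throw e;
    }
  }
}
```

Setting the flag before the success-path close() means the delegate is closed exactly once whether the failure comes from the flush step or from close() itself, without relying on close() being idempotent.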
@voonhous voonhous force-pushed the feat-18937-variant-shredding-inference branch from 9aa0ac1 to 1593306 June 20, 2026 11:09
@hudi-bot

Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Labels

size:XL PR with lines of changes > 1000

Development

Successfully merging this pull request may close these issues.

Auto-infer variant shredding schema and enable shredding by default (SPARK + AVRO record types)
Add auto shredding field inference logic

4 participants