feat(variant): [DNM] auto-infer per-file variant shredding schemas on shredding inference by voonhous · Pull Request #18961 · apache/hudi

voonhous · 2026-06-10T09:36:07Z

Describe the issue this Pull Request addresses

Hudi can write and read shredded variants, but typed_value only ever comes from an explicit table schema or the test-only force-shredding DDL, so production tables never shred. Spark 4.1 infers a per-file shredding schema from the data (SPARK-53659, on by default), but that lives in Spark's own writer stack which Hudi bypasses.

Stacked on #18938 (read-side reconstruction), which stacks on #18065. Do not merge until both land.

Summary and Changelog

When hoodie.parquet.variant.shredding.schema.inference.enabled is set (default false), Hudi infers a shredding schema per base file from a sample of the records written to it, for both record types (SPARK, AVRO) and the bulk-insert row writer. Requires Spark 4.1+ on the writer classpath; Spark 4.0/Flink/Java silently keep writing unshredded.

New config hoodie.parquet.variant.shredding.schema.inference.enabled in HoodieStorageConfig.
New VariantShreddingSchemaInferrer SPI in hudi-common, loaded by classpath detection (VariantShreddingRuntime, which also consolidates the duplicated provider-candidate arrays).
Spark41VariantShreddingSchemaInferrer (hudi-spark4.1.x) delegates to Spark's InferVariantShreddingSchema, so Hudi inherits Spark's heuristics verbatim (no code copied). One call per file covers all variant columns (preserves the global width budget); object keys that are not valid Avro names are dropped and legally fall back to the residual value column.
VariantShreddingInferenceFileWriter (+ a row-writer sibling): buffers up to 4096 records / 64MB (mirrors Spark's ParquetOutputWriterWithVariantShredding), infers once, creates the real writer with the inferred typed_value spliced in, then replays in order. Inference failures fall back to unshredded (a throwing inference must not fail compaction); writer-creation or replay failures latch and rethrow through close() so buffered records cannot be dropped silently.
Wiring: the AVRO factory splices the schema argument; the SPARK and row-writer factories splice a copied config, since HoodieRowParquetWriteSupport resolves its schema from hoodie.write.schema / hoodie.avro.schema rather than the factory argument.
Fixes latent issues this feature trips: Variant.getPlainTypedValueSchema is now recursive (nested objects, arrays, value-only wrappers), Avro "Field already used" in stripVariantShredding / VariantReconstruction, and the table-schema footer fallback now strips typed_value by shape so per-file layouts never leak into the resolved table schema.
Tests: unit coverage for the decorator, schema utils and HoodieSchema recursion; functional tests in TestVariantDataType for COW (multi-column with declines, update over a shredded base), MOR inline compaction, and the bulk-insert row writer, all gated on Spark 4.1+.
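
The buffer-infer-replay flow described above can be sketched roughly as follows. This is a minimal illustration, not the PR's VariantShreddingInferenceFileWriter: `RecordSink`, `BufferingInferenceSink`, and the function-typed `inferrer`/`factory` parameters are hypothetical names, only the record-count limit is modeled (the 64MB byte limit is omitted), and failures are simplified to unchecked exceptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a file writer; this is NOT Hudi's HoodieFileWriter interface.
interface RecordSink<R> {
  void write(R record);
  void close();
}

// Buffers records up to a limit, infers a schema once, creates the real writer, replays in order.
final class BufferingInferenceSink<R, S> implements RecordSink<R> {
  private final int maxRecords;                      // the PR description mentions 4096
  private final Function<List<R>, S> inferrer;       // may throw; failure means "stay unshredded"
  private final S fallbackSchema;                    // the unshredded schema
  private final Function<S, RecordSink<R>> factory;  // creates the real writer for a schema
  private final List<R> buffer = new ArrayList<>();
  private RecordSink<R> delegate;                    // null until the schema is known
  private RuntimeException latched;                  // writer-creation or replay failure

  BufferingInferenceSink(int maxRecords, Function<List<R>, S> inferrer, S fallbackSchema,
                         Function<S, RecordSink<R>> factory) {
    this.maxRecords = maxRecords;
    this.inferrer = inferrer;
    this.fallbackSchema = fallbackSchema;
    this.factory = factory;
  }

  @Override
  public void write(R record) {
    if (latched != null) {
      throw latched;
    }
    if (delegate != null) {
      delegate.write(record);   // schema already decided: pass straight through
      return;
    }
    buffer.add(record);
    if (buffer.size() >= maxRecords) {
      materialize();
    }
  }

  private void materialize() {
    S schema;
    try {
      schema = inferrer.apply(buffer);
    } catch (RuntimeException e) {
      schema = fallbackSchema;  // an inference failure must not fail the write
    }
    try {
      delegate = factory.apply(schema);
      for (R r : buffer) {
        delegate.write(r);      // replay in arrival order
      }
      buffer.clear();
    } catch (RuntimeException e) {
      latched = e;              // remember the failure so close() cannot hide it
      throw e;
    }
  }

  @Override
  public void close() {
    if (latched != null) {
      throw latched;            // buffered records must not be dropped silently
    }
    if (delegate == null) {
      materialize();            // fewer records than the limit were written
    }
    delegate.close();
  }
}
```

The point of the latch is that a caller who ignores a failed write still sees the failure on close(), while a failed inference only downgrades the file to the fallback schema.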

Impact

No behavior change unless the new config is enabled. When enabled on Spark 4.1+, base files carry a per-file inferred typed_value; readers already handle shredded files (#18938 on the AVRO path, Spark native otherwise). MOR log files always stay unshredded; shredding materializes at compaction. Flipping the default on is a follow-up tracked in #18937.

Risk Level

Low. The feature is off by default and engines without the Spark 4.1 inferrer on the classpath are unaffected even when it is on. Verified with new unit and functional tests across both record types and all three write paths, plus compile checks under the spark3.5, spark4.0 and spark4.1 profiles.

Documentation Update

New config documented via its withDocumentation text (picked up by the generated config reference). Website updates deferred to the default-flip follow-up.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

voonhous · 2026-06-11T08:45:07Z

Heads up on a latent bug inherited from the #18938 read path, found while testing inference here and fixed in this PR:

VariantReconstruction never engaged on real files. The reader derives the file schema by converting the parquet footer MessageType, and that conversion loses the variant logical type, so the getType() == VARIANT check never matched. The shredded base file was then read with the unshredded {metadata, value} projection, silently dropping all typed_value fields on the AVRO read path (a reconstructed row came back as just the residual, e.g. {"bad-key":false}).

Fixed in this PR with shape-based detection anchored on the requested column: the requested side (from the table schema, logical type intact) must be a variant, and the on-disk side is matched by the shredded shape {metadata: bytes, value: [nullable] bytes, typed_value}, so a plain footer-derived record still triggers reconstruction.
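
A rough sketch of the shape check described above, for illustration only. `SchemaNode` and `ShreddedShape` are hypothetical stand-ins for the real Avro/HoodieSchema types, and the exact conditions in the PR may differ (for example, how strictly nullability or extra fields are checked is an assumption here).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified schema node; a stand-in for the real Avro/HoodieSchema types.
final class SchemaNode {
  final String type;       // e.g. "variant", "record", "bytes", "string"
  final boolean nullable;
  final Map<String, SchemaNode> fields = new LinkedHashMap<>();

  SchemaNode(String type, boolean nullable) {
    this.type = type;
    this.nullable = nullable;
  }

  SchemaNode with(String name, SchemaNode child) {
    fields.put(name, child);
    return this;
  }
}

final class ShreddedShape {
  // requested: from the table schema, logical type intact.
  // onDisk: derived from the parquet footer, where the variant logical type is lost.
  static boolean needsReconstruction(SchemaNode requested, SchemaNode onDisk) {
    if (!"variant".equals(requested.type)) {
      return false;  // anchor on the requested column being a variant
    }
    if (!"record".equals(onDisk.type)) {
      return false;
    }
    SchemaNode metadata = onDisk.fields.get("metadata");
    SchemaNode value = onDisk.fields.get("value");
    return metadata != null && "bytes".equals(metadata.type)
        && value != null && "bytes".equals(value.type)  // value may be nullable
        && onDisk.fields.containsKey("typed_value");
  }
}
```

Anchoring on the requested side keeps an ordinary user record that happens to have fields named metadata, value, and typed_value from being mistaken for a shredded variant.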

Worth noting why #18938 did not catch it: the new COW inference test here is the first to read a shredded base file end to end through the AVRO reader. The existing MOR compaction test compacts a log-only file group (no shredded base to read), and post-compaction queries go through the Spark native reader.

codecov-commenter · 2026-06-11T12:08:54Z

Codecov Report

❌ Patch coverage is 70.35611% with 308 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.29%. Comparing base (c40f765) to head (5e10346).
⚠️ Report is 24 commits behind head on master.

Files with missing lines	Patch %	Lines
...e/hudi/variant/Spark4VariantShreddingProvider.java	48.10%	95 Missing and 28 partials ⚠️
...ariantShreddingInferenceInternalRowFileWriter.java	57.42%	29 Missing and 14 partials ⚠️
...a/org/apache/hudi/avro/HoodieAvroWriteSupport.java	73.68%	19 Missing and 16 partials ⚠️
...o/storage/VariantShreddingInferenceFileWriter.java	82.79%	13 Missing and 3 partials ⚠️
...ariant/Spark41VariantShreddingSchemaInferrer.scala	65.21%	5 Missing and 11 partials ⚠️
.../hudi/io/storage/hadoop/VariantReconstruction.java	79.10%	7 Missing and 7 partials ⚠️
...va/org/apache/hudi/common/schema/HoodieSchema.java	71.73%	3 Missing and 10 partials ⚠️
...g/apache/hudi/avro/AvroVariantSampleExtractor.java	64.70%	7 Missing and 5 partials ⚠️
.../java/org/apache/hudi/avro/VariantSchemaUtils.java	90.00%	2 Missing and 10 partials ⚠️
...e/hudi/io/storage/SparkVariantSampleExtractor.java	73.68%	2 Missing and 3 partials ⚠️
... and 8 more

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18961      +/-   ##
============================================
- Coverage     68.77%   68.29%   -0.49%     
- Complexity    29091    29721     +630     
============================================
  Files          2517     2552      +35     
  Lines        139822   143673    +3851     
  Branches      17209    18002     +793     
============================================
+ Hits          96167    98116    +1949     
- Misses        35883    37444    +1561     
- Partials       7772     8113     +341

Flag	Coverage Δ
common-and-other-modules	`44.80% <41.63%> (+0.54%)`	⬆️
hadoop-mr-java-client	`44.55% <18.83%> (-0.20%)`	⬇️
spark-client-hadoop-common	`47.75% <14.79%> (-0.30%)`	⬇️
spark-java-tests	`48.65% <33.49%> (-0.67%)`	⬇️
spark-scala-tests	`45.09% <63.90%> (-0.13%)`	⬇️
utilities	`37.13% <15.64%> (-0.16%)`	⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...ache/hudi/avro/VariantShreddingSchemaInferrer.java	`100.00% <100.00%> (ø)`
...apache/hudi/common/config/HoodieStorageConfig.java	`89.48% <100.00%> (+0.63%)`	⬆️
.../apache/hudi/common/table/TableSchemaResolver.java	`87.84% <100.00%> (+0.13%)`	⬆️
...udi/io/storage/hadoop/HoodieAvroParquetReader.java	`93.44% <100.00%> (+0.46%)`	⬆️
...rces/parquet/Spark40HoodieParquetReadSupport.scala	`78.94% <ø> (ø)`
...ion/datasources/parquet/Spark40ParquetReader.scala	`95.73% <100.00%> (+0.73%)`	⬆️
...ion/datasources/parquet/Spark41ParquetReader.scala	`95.53% <100.00%> (+0.10%)`	⬆️
.../hudi/io/storage/HoodieSparkFileWriterFactory.java	`90.14% <90.90%> (+0.14%)`	⬆️
...i/io/storage/row/HoodieRowParquetWriteSupport.java	`78.36% <75.00%> (+2.09%)`	⬆️
...scala/org/apache/spark/sql/hudi/SparkAdapter.scala	`56.25% <0.00%> (-3.75%)`	⬇️
... and 15 more

... and 119 files with indirect coverage changes

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds per-file variant shredding schema inference for both the Avro record write path and the row writer, with a decorator-based file writer, an Avro<->VariantSchema bridge for write/rebuild, and classpath detection of the Spark 4.1 InferVariantShreddingSchema provider. The implementation is generally careful (defensive copies, null/missing-field handling, identifier sanitization, precedence-aware schema splicing), but a few spots are worth a second look: a session-level SQLConf mutation inside HoodieHadoopFsRelationFactory that could affect other queries, a redundant delegate.close() pattern in VariantShreddingInferenceFileWriter, and an ambiguity in the Avro timestamp/local-timestamp logical-type mapping in Spark4VariantShreddingProvider. Please take a look at the inline comments; after that, this should be ready for a Hudi committer or PMC member to take it from here. There are a couple of naming and ordering nits below, plus a couple of minor readability nits in Spark4VariantShreddingProvider; the code is otherwise clean and well-documented.

hudi-agent · 2026-06-14T00:53:57Z

+      }
+      if ("timestamp-micros".equals(name)) {
+        return new VariantSchema.TimestampType();
+      }


🤖 Mapping both timestamp-millis and timestamp-micros (and similarly local-* variants) to the same VariantSchema.TimestampType()/TimestampNTZType() looks suspicious — if Spark's VariantSchema.TimestampType has fixed micros precision (variant binary format stores µs), then a millis Long would be interpreted as µs and be off by 1000x. In the inferred-write flow this branch is dead (Hudi always generates timestamp-micros via HoodieSchema.createTimestampMicros()), but the read/rebuild path could in principle hit an Avro schema with millis. Is this branch intended as a true equivalence, or should millis schemas decline shredding (return null) so rebuild doesn't silently misinterpret values?

AI-generated; verify before applying. React 👍/👎 to flag quality.

Good catch -- it's currently unreachable but a real latent trap, so I hardened it. The Variant binary spec stores timestamps in microseconds, so a timestamp-millis typed_value can't represent a variant timestamp; mapping it to the micros TimestampType would scale the value 1000x on rebuild. Made millis decline instead, so the value stays in the residual unshredded binary:

// The Variant binary spec stores timestamps in microseconds, so a millisecond-precision
// typed_value cannot represent a variant timestamp. Decline to shred it as a scalar (the value
// stays in the residual unshredded binary) rather than mapping it to the micros TimestampType,
// which would silently scale the value by 1000x.
if ("timestamp-millis".equals(name) || "local-timestamp-millis".equals(name)) {
  return null;
}

In practice this never fires (Hudi writes typed_value via createTimestampMicros, the read path only reads Hudi-shredded files, and the Spark inferrer produces micros from micros-only variant samples), so behavior is unchanged -- this just removes the silent-corruption path.
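
To make the 1000x concern concrete, here is a small worked example using only java.time. It is an illustration of the unit mismatch, not code from the PR; the class name `TimestampUnitMismatch` and the sample value are made up for the example.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Illustrates what happens when a millisecond epoch value is read back as microseconds.
final class TimestampUnitMismatch {
  // Correct reading: the stored long counts milliseconds since the epoch.
  static Instant asMillis(long v) {
    return Instant.ofEpochMilli(v);
  }

  // Wrong reading: the same long is treated as microseconds since the epoch.
  static Instant asMicros(long v) {
    return Instant.EPOCH.plus(v, ChronoUnit.MICROS);
  }

  public static void main(String[] args) {
    long storedMillis = 1_700_000_000_000L;
    System.out.println(asMillis(storedMillis)); // 2023-11-14T22:13:20Z
    System.out.println(asMicros(storedMillis)); // 1970-01-20T16:13:20Z
  }
}
```

The same 64-bit value lands in November 2023 when read as milliseconds and in January 1970 when read as microseconds, which is the kind of silent misinterpretation the decline branch rules out.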

avroTypeToScalarType is shared write/read code and the concern is sharpest on the rebuild path, so this landed in the base PR #18938; it'll flow into this PR on the next rebase.

…pache#18931) Compaction/clustering reading an already-shredded base file via the AVRO record path now rebuilds the unshredded {metadata, value} variant before records reach the merger/writer, replacing the prior fail-fast guard.

- VariantShreddingProvider: add rebuildVariantRecord (inverse of shredVariantRecord); Spark4VariantShreddingProvider implements it via Spark's ShreddingUtils.rebuild over Avro-backed ShreddedRow rows.
- HoodieAvroParquetReader: detect shredded variant columns, read them at the file's shredded schema, and reconstruct to unshredded per record (VariantReconstruction); provider loaded via config or classpath, gated on hoodie.parquet.variant.allow.reading.shredded.
- Extract stripVariantShredding into VariantSchemaUtils (shared by reader/writer).
- Remove the read-then-reshred guard from HoodieAvroWriteSupport and its unit test.
- Extend the MOR compaction test in TestVariantDataType to write shredded and read back (AVRO reconstruction + SPARK native via withRecordType).

The compaction base file is written by the AVRO shredding writer as [metadata, value, typed_value]. Spark 4.0's reorderVariantFields rebuilds that group as [value, metadata] and drops typed_value, so the native read after compaction fails with MALFORMED_VARIANT. Spark 4.1+ reads variant fields by name (SPARK-54410) and reconstructs correctly. Fixes the spark4.0 leg of apache#18931.

withSchema(replacement) rebuilds the field from replacement, which already mirrors the original nullability, so the intermediate makeNullable() was a no-op (and could reset field order for non-nullable fields).

VariantReconstruction and HoodieAvroFileWriterFactory each hardcoded the provider FQN and a Class.forName lookup. Move the candidate list and detection into a shared VariantShreddingProvider.detectProviderClassOnClasspath(), so a new provider impl is registered in one place.

…ruction Every other class in org.apache.hudi.io.storage.hadoop carries the Hoodie prefix (HoodieAvroParquetReader, HoodieHadoopIOFactory, ...); this was the lone exception. Package-private with a single caller, so the rename is contained.

…onstruction The helper unwraps a nullable union type, not a null-assertion guard; the new name reflects what it does at each call site.

…on provider When a requested column is shredded in the base file and reading shredded variants is enabled but no provider is loadable, create() returned null and the reader fell back to the unshredded requested schema. parquet-avro then reads value/metadata and drops typed_value, silently corrupting variants whose payload lived in typed_value. Throw HoodieException instead, mirroring the write path. Removes the now-unused logger.

metadata is REQUIRED in the shredded parquet schema, so a null metadata on the read path means a malformed base file. Returning null let the caller null out the whole variant column, silently dropping data; throw HoodieException instead. The separate null-record guard stays (genuine null variant passes through).

…back comment The Spark SQL read-back is gated on Spark 4.1+ because Spark 4.0's native parquet reader rejects the 3-field shredded layout (SPARK-54410) - not because of apache#18931. apache#18931 is the AVRO reader reconstruction and does not affect the Spark native read, so drop the misattributed pointer.

…d_value directly The scalar getters ignore ordinal and read typed_value directly while isNullAt/getBinary go through fieldNameFor(ordinal). Spark only calls the scalar getters for the scalar typed_value, so the asymmetry is intentional; add a comment so it does not read as an oversight.

…ype mapping avroTypeToScalarType mapped timestamp-millis / local-timestamp-millis to the micros-based TimestampType / TimestampNTZType. The Variant binary spec stores timestamps in microseconds, so a millis-precision typed_value would be read back as micros and scaled 1000x. Unreachable today (Hudi/Spark only produce micros typed_value), but decline to shred millis as a scalar so it can never silently corrupt; the value stays in the residual unshredded binary.

Today typed_value comes only from an explicit table schema or the test-only force-shredding DDL, so production tables never shred variants. This adds per-file inference of the shredding schema from the first records of each base file, for both HoodieRecordType paths and the bulk-insert row writer, reusing Spark 4.1's InferVariantShreddingSchema heuristics verbatim (SPARK-53659). Gated by hoodie.parquet.variant.shredding.schema.inference.enabled (default off); on Spark 4.0/Flink/Java the inferrer is absent from the classpath and writes silently stay unshredded.

A buffering HoodieFileWriter decorator (mirroring Spark's ParquetOutputWriterWithVariantShredding: 4096 records / 64MB, infer once, replay in order) defers the parquet writer until the schema is known. The AVRO factory splices the inferred typed_value into the schema argument; the SPARK and row-writer factories splice a copied config because the row write support resolves its schema from hoodie.write.schema/hoodie.avro.schema. Inference failures decline to unshredded (a throwing inference must not fail compaction); writer-creation/replay failures latch and rethrow through close() so buffered records cannot be dropped silently. The Spark 4.1 inferrer batches all variant columns into one call (global width budget) and drops avro-illegal object keys, which legally fall back to the residual value column.

Also fixes latent issues this feature would trip: recursive Variant.getPlainTypedValueSchema (depth>=2 objects, arrays, value-only wrappers), avro field-reuse (Field already used) in stripVariantShredding and VariantReconstruction, and the table-schema footer fallback now strips typed_value by shape so per-file layouts never leak into the resolved table schema.

Stacked on apache#18938 (read-side reconstruction); part of apache#18937.

…i-variant tables Fixes surfaced by the auto-inference COW test, which is the first to read a shredded base file end to end:

- VariantReconstruction never engaged on real files: the reader's file schema comes from converting the parquet footer MessageType, which loses the variant logical type, so the shredded group was projected down to {metadata, value} and typed_value was silently dropped on the AVRO read-then-rewrite path. Detect the on-disk side by shape, anchored on the requested column being a variant.
- The inferred-shredding config splice aliased every variant column: columns share one record type named 'variant', which Avro serializes as name references after the first occurrence, so replacing one column's record with a same-named shredded definition shredded all of them on re-parse. Spliced records now get a per-column unique name.
- getPlainTypedValueSchema named every nesting level '<name>_plain' with a null namespace; for nested objects (every spec level is named typed_value) the inner record became an Avro self-reference of its ancestor, which Spark rejects as recursion. Names now chain the field path.
- HoodieRowParquetWriteSupport warned 'no corresponding HoodieSchema' for every nullable variant column because the top-level check did not unwrap the field's nullable union; shredding still happened via the nested fallthrough, so this only silenced a misleading warning.
- Test fixes: cast(variant as string) on a string-typed variant extracts the raw string (not its JSON form), and the decline column now uses per-row empty objects: inference is per file and a multi-row insert can fan out to one file per row, so cross-row type conflicts never reach one inference call and cannot decline deterministically.

…ence is off resolveConfigSchema parses the avro schema string; gate it behind the inference flag so the default-off path adds no per-file cost.

…rk version PR CI runs on the merge ref against current master, whose spark4.2 profile satisfies gteqSpark4_1 but builds only hudi-spark4.2.x, which has no shredding-schema inferrer yet; the three inference tests then ran without one and failed their typed_value assertions on silently-unshredded files. Gate them on VariantShreddingRuntime.lookupInferrer() instead so any profile without an inferrer cancels rather than fails, and starts running again as soon as that version module ships its inferrer.

…riantShreddingRuntime The inference work centralizes engine-component classpath detection in VariantShreddingRuntime (provider + schema inferrer, memoized). The merge converged both call sites on it, leaving the interface helper added earlier in this stack unused; remove it and its CLASSPATH_CANDIDATES constant.
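
Memoized classpath detection of the kind described here can be sketched as below. This is an assumption-laden illustration, not the PR's VariantShreddingRuntime: `ClasspathDetector` is a hypothetical name, and the real candidate class names and lookup methods are not reproduced.

```java
import java.util.Optional;

// Hypothetical sketch: resolve the first loadable class from a candidate list, once.
final class ClasspathDetector {
  private final String[] candidates;
  private Optional<Class<?>> cached;  // null until the first lookup has run

  ClasspathDetector(String... candidates) {
    this.candidates = candidates.clone();
  }

  // Returns the first candidate found on the classpath, or empty if none is present.
  synchronized Optional<Class<?>> lookup() {
    if (cached == null) {
      Optional<Class<?>> found = Optional.empty();
      for (String name : candidates) {
        try {
          Class<?> clazz = Class.forName(name);
          found = Optional.of(clazz);
          break;
        } catch (ClassNotFoundException | LinkageError e) {
          // This candidate is not usable here; try the next one.
        }
      }
      cached = found;  // memoize both hits and misses
    }
    return cached;
  }
}
```

Keeping the candidate list in one place means a new engine module registers its implementation once, and a test can cancel itself when `lookup()` comes back empty instead of failing on an unshredded file.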

… assertions The inference tests cherry-picked here called the groupContainsField test helper, which the merged apache#18065 inlined to GroupType.containsField. The helper no longer exists, so the calls failed to compile; switch them to the inlined form.

…ose() On the success path the try already called delegate.close(); if it threw, the catch closed it again, relying on delegate close() being idempotent. Track a delegateClosed flag (set before the try-path close) so the catch only cleans up a delegate that materialize() created but never closed. Applied to both VariantShreddingInferenceFileWriter and its InternalRow sibling.
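
The close-once pattern in this commit message might look roughly like the sketch below. It is a simplified guess at the shape of the fix, not the actual VariantShreddingInferenceFileWriter code: `CloseOnceWriter` and its nested `Delegate` interface are hypothetical, and materialization is reduced to a supplier plus a replay loop.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of closing a lazily created delegate exactly once.
final class CloseOnceWriter {
  interface Delegate {
    void write(String record);
    void close();
  }

  private final Supplier<Delegate> factory;
  private final List<String> buffered;
  private Delegate delegate;
  private boolean delegateClosed;

  CloseOnceWriter(Supplier<Delegate> factory, List<String> buffered) {
    this.factory = factory;
    this.buffered = buffered;
  }

  void close() {
    try {
      delegate = factory.get();        // materialize the real writer
      for (String r : buffered) {
        delegate.write(r);             // replay; may throw before any close happens
      }
      delegateClosed = true;           // set before the try-path close
      delegate.close();
    } catch (RuntimeException e) {
      // Only clean up a delegate that was created but never reached the close above.
      if (delegate != null && !delegateClosed) {
        try {
          delegate.close();
        } catch (RuntimeException suppressed) {
          e.addSuppressed(suppressed);
        }
      }
      throw e;
    }
  }
}
```

With the flag in place, a delegate whose close() throws is closed once rather than twice, so correctness no longer depends on close() being idempotent.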

hudi-bot · 2026-06-20T13:22:30Z

CI report:

1593306 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

voonhous force-pushed the feat-18937-variant-shredding-inference branch from 6e26ed9 to 7e260ad on June 10, 2026 10:20

github-actions bot added the size:XL (PR with lines of changes > 1000) label on Jun 10, 2026

voonhous changed the title ~~feat(variant): [DNM] auto-infer per-file variant shredding schemas on writedding inference~~ feat(variant): [DNM] auto-infer per-file variant shredding schemas on shredding inference Jun 12, 2026

hudi-agent reviewed Jun 14, 2026

voonhous added 9 commits June 18, 2026 17:06

review(variant): drop no-op makeNullable in applyForcedShreddingSchema

c8168ff

withSchema(replacement) rebuilds the field from replacement, which already mirrors the original nullability, so the intermediate makeNullable() was a no-op (and could reset field order for non-nullable fields).

review(variant): rename nonNull to unwrapNullable in HoodieVariantRec…

436cd15

…onstruction The helper unwraps a nullable union type, not a null-assertion guard; the new name reflects what it does at each call site.

voonhous force-pushed the feat-18937-variant-shredding-inference branch from 5e10346 to 86061dd on June 19, 2026 02:29

voonhous force-pushed the feat-18937-variant-shredding-inference branch from 6ebf27f to 0f7027b on June 20, 2026 08:33

voonhous added 8 commits June 20, 2026 18:57

perf(variant): skip schema parse in the row-writer factory when infer…

b447332

…ence is off resolveConfigSchema parses the avro schema string; gate it behind the inference flag so the default-off path adds no per-file cost.

voonhous force-pushed the feat-18937-variant-shredding-inference branch from 9aa0ac1 to 1593306 on June 20, 2026 11:09

voonhous wants to merge 18 commits into apache:master from voonhous:feat-18937-variant-shredding-inference

4 participants
