Revisit projection for variant type columns on Spark #18739

@yihua

Context

PR #18674 added HoodieReaderContext.getLogBlockRecordProjection, a per-row InternalRow → InternalRow projector that runs inside FileGroupRecordBuffer to rewrite VariantType cells in log-block records into the struct<0: binary value, 1: binary metadata> shape that Spark 4.1's PushVariantIntoScan produces on the base-file side. Without it, log and base rows reach the merger with mismatched physical layouts and corrupt downstream UnsafeRow reads.
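The per-row rewrite described above can be pictured with a minimal sketch. It is not the code from PR #18674: rows are modeled as plain `Object[]` instead of `InternalRow`, and `RawVariant` and `variantProjector` are hypothetical names introduced only for illustration. The one detail taken from the description is the target shape, a two-field struct with the value at ordinal 0 and the metadata at ordinal 1.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

// Illustrative stand-ins only: none of these types are Hudi or Spark classes.
final class VariantCellRewriteSketch {

    /** Hypothetical variant cell as a log-block reader might surface it. */
    static final class RawVariant {
        final byte[] value;
        final byte[] metadata;

        RawVariant(byte[] value, byte[] metadata) {
            this.value = value;
            this.metadata = metadata;
        }
    }

    /**
     * Builds a row -> row projector. Columns flagged in isVariant are rewritten
     * from RawVariant into a two-slot struct: ordinal 0 = value, ordinal 1 = metadata.
     */
    static UnaryOperator<Object[]> variantProjector(boolean[] isVariant) {
        return row -> {
            Object[] out = row.clone(); // leave the input row untouched
            for (int i = 0; i < out.length; i++) {
                if (isVariant[i] && out[i] instanceof RawVariant) {
                    RawVariant v = (RawVariant) out[i];
                    out[i] = new Object[] {v.value, v.metadata};
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Object[] logRow = {42L, new RawVariant(new byte[] {1, 2}, new byte[] {9})};
        Object[] aligned = variantProjector(new boolean[] {false, true}).apply(logRow);
        Object[] struct = (Object[]) aligned[1];
        System.out.println(Arrays.toString((byte[]) struct[0])); // value bytes
        System.out.println(Arrays.toString((byte[]) struct[1])); // metadata bytes
    }
}
```

Null cells and non-variant columns pass through unchanged in this sketch, so only the flagged columns change physical layout.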

Why revisit

The merge buffer should not have to know about variant alignment. Today FileGroupRecordBuffer.getProjectedTransformer composes schema evolution with an engine-specific log-block projection hook — a leak of Spark-4.1 / PushVariantIntoScan concerns into engine-neutral merge code. Each log block format already has a natural place to align records to the projected read schema; pushing the rewrite down to the readers lets the buffer stay format-agnostic.

Goal

Remove getLogBlockRecordProjection and the buffer-level projection composition, and make each log reader responsible for emitting rows aligned to the projected read schema. Concretely:

  1. Parquet log blocks — thread the projected StructType down to HoodieSparkParquetReader.getUnsafeRowIterator (mirroring what the base-file path already does via parquetReadStructType in SparkFileFormatInternalRowReaderContext.getFileRecordIterator) so parquet-mr natively decodes variants into the projected struct shape.
  2. Avro log blocks — perform the variant-cell rewrite inside HoodieAvroDataBlock / its convertAvroRecord path, so rows leave the Avro reader already in the projected shape. The Avro deserializer has no native variant-projection hook, but a rewrite localized to the Avro reader is simpler than a global buffer hook and keeps the format-specific logic with the format.
  3. FileGroupRecordBuffer — drop getProjectedTransformer; getSchemaTransformerWithEvolvedSchema is the only transformer the buffer needs to compose.
  4. Investigate whether the HoodieSchema ↔ Spark StructType round-trip can preserve the variant-projection struct so the parallel sparkRequiredSchema overlay can also go away.
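One way to picture the target layering is the sketch below. It is a sketch under stated assumptions, not a proposed implementation: `LogBlockReader`, `avroLikeReader`, `bufferRows`, and `RawVariant` are hypothetical names, rows are plain `Object[]` rather than `InternalRow`, and the schema transformer is reduced to a `UnaryOperator`. The point it tries to show is that the variant rewrite lives inside the format-specific reader, while the buffer applies only the schema transformer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative stand-ins only: none of these types are Hudi or Spark classes.
final class ReaderSideAlignmentSketch {

    /** Hypothetical variant cell before projection. */
    static final class RawVariant {
        final byte[] value;
        final byte[] metadata;

        RawVariant(byte[] value, byte[] metadata) {
            this.value = value;
            this.metadata = metadata;
        }
    }

    /** Hypothetical reader contract: rows come out already in the projected shape. */
    interface LogBlockReader {
        Iterator<Object[]> read(boolean[] variantColumns);
    }

    /** Avro-like reader: the variant rewrite sits next to the format-specific decode. */
    static LogBlockReader avroLikeReader(List<Object[]> decodedRows) {
        return variantColumns -> {
            List<Object[]> aligned = new ArrayList<>();
            for (Object[] row : decodedRows) {
                Object[] out = row.clone();
                for (int i = 0; i < out.length; i++) {
                    if (variantColumns[i] && out[i] instanceof RawVariant) {
                        RawVariant v = (RawVariant) out[i];
                        out[i] = new Object[] {v.value, v.metadata}; // 0 = value, 1 = metadata
                    }
                }
                aligned.add(out);
            }
            return aligned.iterator();
        };
    }

    /** Buffer stand-in: composes only the schema transformer, no engine-specific hook. */
    static List<Object[]> bufferRows(LogBlockReader reader, boolean[] variantColumns,
                                     UnaryOperator<Object[]> schemaTransformer) {
        List<Object[]> buffered = new ArrayList<>();
        Iterator<Object[]> rows = reader.read(variantColumns);
        while (rows.hasNext()) {
            buffered.add(schemaTransformer.apply(rows.next()));
        }
        return buffered;
    }

    public static void main(String[] args) {
        List<Object[]> decoded = new ArrayList<>();
        decoded.add(new Object[] {1L, new RawVariant(new byte[] {5}, new byte[] {6})});
        List<Object[]> out = bufferRows(avroLikeReader(decoded), new boolean[] {false, true},
                UnaryOperator.identity());
        Object[] struct = (Object[]) out.get(0)[1];
        System.out.println(Arrays.toString((byte[]) struct[0])); // value bytes
    }
}
```

A Parquet-backed reader would implement the same hypothetical contract by passing the projected schema to its decoder instead of rewriting cells after the fact, which is why the buffer no longer needs to know which format produced a row.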

Labels

type:devtask (Development tasks and maintenance work)
