Context
PR #18674 added HoodieReaderContext.getLogBlockRecordProjection, a per-row InternalRow → InternalRow projector that runs inside FileGroupRecordBuffer to rewrite VariantType cells in log-block records into the struct<0: binary value, 1: binary metadata> shape that Spark 4.1's PushVariantIntoScan produces on the base-file side. Without it, log and base rows reach the merger with mismatched physical layouts and corrupt downstream UnsafeRow reads.
Why revisit
The merge buffer should not have to know about variant alignment. Today FileGroupRecordBuffer.getProjectedTransformer composes schema evolution with an engine-specific log-block projection hook — a leak of Spark-4.1 / PushVariantIntoScan concerns into engine-neutral merge code. Each log block format already has a natural place to align records to the projected read schema; pushing the rewrite down to the readers lets the buffer stay format-agnostic.
Goal
Remove getLogBlockRecordProjection and the buffer-level projection composition. Each log reader is responsible for emitting rows aligned to the projected read schema:
- Parquet log blocks: thread the projected StructType down to HoodieSparkParquetReader.getUnsafeRowIterator (mirroring what the base-file path already does via parquetReadStructType in SparkFileFormatInternalRowReaderContext.getFileRecordIterator) so parquet-mr natively decodes variants into the projected struct shape.
- Avro log blocks: perform the variant-cell rewrite inside HoodieAvroDataBlock / its convertAvroRecord path, so rows leave the Avro reader already in the projected shape. The Avro deserializer has no native variant-projection hook, but a rewrite localized to the Avro reader is simpler than a global buffer hook and keeps the format-specific logic with the format.
- FileGroupRecordBuffer: drop getProjectedTransformer; getSchemaTransformerWithEvolvedSchema is the only transformer the buffer needs to compose.
- Investigate whether the HoodieSchema ↔ Spark StructType round-trip can preserve the variant-projection struct so the parallel sparkRequiredSchema overlay can also go away.
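To make the intended division of responsibility concrete, here is a minimal, engine-free sketch of reader-side alignment. It is not Hudi or Spark code: ReaderSideVariantAlignment, VariantCell, variantAligner and alignedIterator are hypothetical names, rows are modeled as plain Object[] rather than InternalRow, and the struct is modeled as a two-element array in the (value, metadata) order described above. The point is only that the rewrite wraps the reader's iterator, so whatever consumes the rows (the merge buffer here) never sees an unprojected variant cell.

```java
import java.util.Iterator;
import java.util.function.UnaryOperator;

/** Illustrative only: all names here are hypothetical, not Hudi or Spark APIs. */
public class ReaderSideVariantAlignment {

    /** Stand-in for a variant cell as a log reader might decode it before projection. */
    public static final class VariantCell {
        final byte[] value;
        final byte[] metadata;

        public VariantCell(byte[] value, byte[] metadata) {
            this.value = value;
            this.metadata = metadata;
        }
    }

    /**
     * Builds a per-row rewrite that replaces variant cells at the given ordinals
     * with a two-field struct, modeled as Object[] {value, metadata}.
     * The input row is copied, not mutated.
     */
    public static UnaryOperator<Object[]> variantAligner(final int... variantOrdinals) {
        return row -> {
            Object[] out = row.clone();
            for (int ordinal : variantOrdinals) {
                if (out[ordinal] instanceof VariantCell) {
                    VariantCell cell = (VariantCell) out[ordinal];
                    out[ordinal] = new Object[] {cell.value, cell.metadata};
                }
            }
            return out;
        };
    }

    /**
     * Wraps a reader's decoded-row iterator so alignment happens inside the reader,
     * before any row reaches the merge buffer.
     */
    public static Iterator<Object[]> alignedIterator(
            final Iterator<Object[]> decoded, final UnaryOperator<Object[]> aligner) {
        return new Iterator<Object[]>() {
            @Override
            public boolean hasNext() {
                return decoded.hasNext();
            }

            @Override
            public Object[] next() {
                return aligner.apply(decoded.next());
            }
        };
    }
}
```

Under this shape, the Avro reader would apply something like variantAligner to its decoded records, the Parquet reader would need no wrapper because the projected schema is handed to the decoder, and the buffer would compose only its schema-evolution transformer.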
Related