Revisit projection for variant type columns on Spark

### Context

PR #18674 added `HoodieReaderContext.getLogBlockRecordProjection`, a per-row `InternalRow → InternalRow` projector that runs inside `FileGroupRecordBuffer` to rewrite `VariantType` cells in log-block records into the `struct<0: binary value, 1: binary metadata>` shape that Spark 4.1's `PushVariantIntoScan` produces on the base-file side. Without it, log and base rows reach the merger with mismatched physical layouts, which corrupts downstream `UnsafeRow` reads.
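To make the shape of that rewrite concrete, here is a minimal sketch in plain Java. It is not the code from PR #18674: `VariantProjectionSketch`, `VariantCell`, and `variantToStructProjector` are hypothetical stand-ins, and a row is modeled as an `Object[]` rather than a Spark `InternalRow`. It only illustrates the idea of a per-row projector that turns each variant cell into a two-field `{value, metadata}` struct and passes every other cell through.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

/** Illustrative stand-ins only; none of these types are Hudi or Spark classes. */
final class VariantProjectionSketch {

    /** Hypothetical logical variant cell as a log-block record might carry it. */
    static final class VariantCell {
        final byte[] value;
        final byte[] metadata;

        VariantCell(byte[] value, byte[] metadata) {
            this.value = value;
            this.metadata = metadata;
        }
    }

    /**
     * Builds a per-row projector that replaces every VariantCell with a
     * two-slot array {value, metadata}, mimicking struct<0: binary, 1: binary>.
     * Non-variant cells are copied through unchanged.
     */
    static UnaryOperator<Object[]> variantToStructProjector() {
        return row -> {
            Object[] out = new Object[row.length];
            for (int i = 0; i < row.length; i++) {
                Object cell = row[i];
                if (cell instanceof VariantCell) {
                    VariantCell v = (VariantCell) cell;
                    out[i] = new Object[] {v.value, v.metadata};
                } else {
                    out[i] = cell;
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Object[] logRow = {42L, new VariantCell(new byte[] {1, 2}, new byte[] {9})};
        Object[] aligned = variantToStructProjector().apply(logRow);
        Object[] struct = (Object[]) aligned[1];
        System.out.println(aligned[0] + " "
            + Arrays.toString((byte[]) struct[0]) + " "
            + Arrays.toString((byte[]) struct[1]));
    }
}
```

In this model the log-side row ends up with the same positional layout as a base-file row whose variant column was already projected to a struct, which is the alignment the merger depends on.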

### Why revisit

The merge buffer should not have to know about variant alignment. Today `FileGroupRecordBuffer.getProjectedTransformer` composes schema evolution with an engine-specific log-block projection hook — a leak of Spark-4.1 / `PushVariantIntoScan` concerns into engine-neutral merge code. Each log block format already has a natural place to align records to the projected read schema; pushing the rewrite down to the readers lets the buffer stay format-agnostic.

### Goal

Remove `getLogBlockRecordProjection` and the buffer-level projection composition. Each log reader is responsible for emitting rows aligned to the projected read schema:

1. **Parquet log blocks** — thread the projected `StructType` down to `HoodieSparkParquetReader.getUnsafeRowIterator` (mirroring what the base-file path already does via `parquetReadStructType` in `SparkFileFormatInternalRowReaderContext.getFileRecordIterator`) so parquet-mr natively decodes variants into the projected struct shape.
2. **Avro log blocks** — perform the variant-cell rewrite inside `HoodieAvroDataBlock` / its `convertAvroRecord` path, so rows leave the Avro reader already in the projected shape. The Avro deserializer has no native variant-projection hook, but a rewrite localized to the Avro reader is simpler than a global buffer hook and keeps the format-specific logic with the format.
3. **`FileGroupRecordBuffer`** — drop `getProjectedTransformer`; `getSchemaTransformerWithEvolvedSchema` is the only transformer the buffer needs to compose.
4. Investigate whether the `HoodieSchema` ↔ Spark `StructType` round-trip can preserve the variant-projection struct so the parallel `sparkRequiredSchema` overlay can also go away.
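
The division of responsibility proposed in the list above could look roughly like the following sketch. Everything in it is an assumption made for illustration: `LogBlockReader`, `AvroLikeReader`, `RawVariant`, and `bufferRows` are hypothetical names, rows are plain `Object[]` arrays, and the projected read schema is reduced to a `boolean[]` marking which columns are variants. The point is only that the reader owns the variant rewrite, while the buffer applies nothing but a schema-evolution transformer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Illustrative stand-ins only; none of these types are Hudi or Spark classes. */
final class ReaderAlignedRowsSketch {

    /** Hypothetical variant cell as decoded from an Avro-like log block. */
    static final class RawVariant {
        final byte[] value;
        final byte[] metadata;

        RawVariant(byte[] value, byte[] metadata) {
            this.value = value;
            this.metadata = metadata;
        }
    }

    /** Hypothetical reader contract: rows come back already in the projected shape. */
    interface LogBlockReader {
        List<Object[]> readAligned(boolean[] variantColumns);
    }

    /** No native variant projection here, so the rewrite lives inside the reader. */
    static final class AvroLikeReader implements LogBlockReader {
        private final List<Object[]> decoded;

        AvroLikeReader(List<Object[]> decoded) {
            this.decoded = decoded;
        }

        @Override
        public List<Object[]> readAligned(boolean[] variantColumns) {
            List<Object[]> out = new ArrayList<>();
            for (Object[] row : decoded) {
                Object[] aligned = row.clone();
                for (int i = 0; i < aligned.length; i++) {
                    if (variantColumns[i] && aligned[i] instanceof RawVariant) {
                        RawVariant v = (RawVariant) aligned[i];
                        aligned[i] = new Object[] {v.value, v.metadata};
                    }
                }
                out.add(aligned);
            }
            return out;
        }
    }

    /** Buffer stand-in: format-agnostic, applies only the schema-evolution step. */
    static List<Object[]> bufferRows(LogBlockReader reader,
                                     boolean[] variantColumns,
                                     UnaryOperator<Object[]> schemaEvolution) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : reader.readAligned(variantColumns)) {
            out.add(schemaEvolution.apply(row));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> decoded = new ArrayList<>();
        decoded.add(new Object[] {1L, new RawVariant(new byte[] {5}, new byte[] {6})});
        // Toy schema evolution: append one null column to every row.
        UnaryOperator<Object[]> addColumn = r -> Arrays.copyOf(r, r.length + 1);
        List<Object[]> rows =
            bufferRows(new AvroLikeReader(decoded), new boolean[] {false, true}, addColumn);
        System.out.println(rows.get(0).length + " columns after evolution");
    }
}
```

A Parquet-backed implementation of the same hypothetical interface would not need the per-cell loop at all if the projected schema is handed to the underlying decoder, which is the asymmetry items 1 and 2 describe.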

### Related

- PR #18674 — original fix
- Review thread: https://github.com/apache/hudi/pull/18674#discussion_r3243720956

Revisit projection for variant type columns on Spark #18739
