perf: AvroRecordContext rebuilds a HoodieSchema wrapper and field map on every field access #18966

### Describe the problem

`AvroRecordContext#getFieldValueFromIndexedRecord` is the implementation behind `RecordContext#getValue` for the Avro engine. It runs once per record per accessed field in the file group reader flow: MOR snapshot reads, compaction, upsert log merging, and metadata table reads. Typical accesses are ordering values, delete-flag checks, and column value reads.

Per invocation it:

- calls `HoodieSchema.fromAvroSchema(record.getSchema())`, allocating a fresh wrapper and re-deriving the schema type
- splits the field path with regex-based `String.split`
- calls `HoodieSchema#getField` on the fresh wrapper, which lazily rebuilds the entire field list and field map: one new `HoodieSchemaField` per column plus a HashMap collect, i.e. O(schema width) allocations per call
- wraps every union branch into another new `HoodieSchema` via `getNonNullType`

None of it is cached because the wrapper is thrown away after each call. For a 200-column table this is hundreds of allocations per record per accessed field.
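The cost pattern can be illustrated with a minimal, self-contained sketch. `RawSchema` and `Wrapper` below are hypothetical stand-ins, not Hudi or Avro classes; they only mimic a wrapper that lazily builds a per-column field map on first lookup. The counter shows how a wrapper created per access repeats the O(schema width) work every time, while a reused wrapper pays it once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these are Hudi or Avro types.
final class WrapperRebuildSketch {
    // Counts how many per-column field objects the wrappers have allocated.
    static long fieldAllocations = 0;

    // Stand-in for a raw schema: an ordered list of column names.
    static final class RawSchema {
        final List<String> columns;
        RawSchema(List<String> columns) { this.columns = columns; }
    }

    // Stand-in for a wrapper that builds its name -> field map lazily.
    static final class Wrapper {
        private final RawSchema raw;
        private Map<String, Object> fieldMap; // built on first lookup

        Wrapper(RawSchema raw) { this.raw = raw; }

        Object getField(String name) {
            if (fieldMap == null) {
                Map<String, Object> built = new HashMap<>();
                for (String column : raw.columns) {
                    built.put(column, new Object()); // one field object per column
                    fieldAllocations++;
                }
                fieldMap = built;
            }
            return fieldMap.get(name);
        }
    }

    static RawSchema wideSchema(int width) {
        List<String> columns = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            columns.add("col_" + i);
        }
        return new RawSchema(columns);
    }

    // Fresh wrapper per access: the lazily built map never survives the call.
    static long freshWrapperPerAccess(RawSchema raw, int accesses) {
        fieldAllocations = 0;
        for (int i = 0; i < accesses; i++) {
            new Wrapper(raw).getField("col_0");
        }
        return fieldAllocations;
    }

    // One wrapper reused across accesses: the map is built exactly once.
    static long reusedWrapper(RawSchema raw, int accesses) {
        fieldAllocations = 0;
        Wrapper wrapper = new Wrapper(raw);
        for (int i = 0; i < accesses; i++) {
            wrapper.getField("col_0");
        }
        return fieldAllocations;
    }

    public static void main(String[] args) {
        RawSchema raw = wideSchema(200);
        System.out.println(freshWrapperPerAccess(raw, 1000)); // 200000
        System.out.println(reusedWrapper(raw, 1000));         // 200
    }
}
```

The sketch counts only the per-column field objects; the wrapper allocation, path split, and union-branch wrapping listed above would come on top of that in the real code path.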

The same per-record `HoodieSchema.fromAvroSchema(...)` rebuild also appears on other hot read / merge paths: Spark / Flink `RecordContext#convertAvroRecord`, the Hive MOR `RealtimeCompactedRecordReader` merge, `HoodieAvroUtils#getRecordColumnValues`, `HoodieJsonPayload#getInsertValue`, and the MERGE-INTO `ExpressionPayload` evaluators.

### Proposed fix

Intern the Avro-schema -> `HoodieSchema` conversion so that the canonical wrapper, along with its lazily built field list and map, is reused across calls instead of being allocated fresh per record. `HoodieSchema` stays the facade and the lookup path is unchanged; the fix does not bypass it with raw Avro traversal.

- Add an Avro-`Schema`-keyed cache (`AvroToHoodieSchemaCache`, identity / `weakKeys`) that converts and value-interns through `HoodieSchemaCache` on a miss, so equal-but-distinct Avro schema instances converge on one canonical `HoodieSchema`.
- Route the per-record call sites above through it; hoist the loop-invariant `fromAvroSchema` out of the per-record write loop in `HoodieAvroDataBlock#getBytes`. Leave cold / one-time and per-block sites unchanged.
- Make `HoodieSchema#getFields` / `#getFieldMap` publish their lazily built caches safely (`volatile` fields plus immutable wrappers), since interned instances are shared across executor task threads.
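The first and third bullets can be sketched together in plain Java. This is an illustration under stated assumptions, not the proposed implementation: `RawSchema`, `Wrapper`, and `intern` are hypothetical names, and the identity level uses a strong-reference `IdentityHashMap` for brevity, whereas the issue calls for weak keys so that unused schemas can be collected. The sketch shows an identity lookup that falls back to value interning on a miss, and a lazily built map published through a `volatile` field as an unmodifiable view.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for illustration only; none of these are Hudi or Avro types.
final class SchemaInternSketch {
    // Stand-in for an Avro schema with value equality over its column names.
    static final class RawSchema {
        final List<String> columns;
        RawSchema(List<String> columns) {
            this.columns = Collections.unmodifiableList(new ArrayList<>(columns));
        }
        @Override public boolean equals(Object other) {
            return other instanceof RawSchema && ((RawSchema) other).columns.equals(columns);
        }
        @Override public int hashCode() { return columns.hashCode(); }
    }

    // Stand-in for the wrapper; its lazy map is safe to share across threads.
    static final class Wrapper {
        private final RawSchema raw;
        private volatile Map<String, Integer> fieldPositions; // safely published

        Wrapper(RawSchema raw) { this.raw = raw; }

        Map<String, Integer> fieldPositions() {
            Map<String, Integer> local = fieldPositions; // single volatile read
            if (local == null) {
                Map<String, Integer> built = new LinkedHashMap<>();
                for (int i = 0; i < raw.columns.size(); i++) {
                    built.put(raw.columns.get(i), i);
                }
                local = Collections.unmodifiableMap(built);
                // Racing threads may each build a map; the results are equal
                // and immutable, so whichever write lands last is fine.
                fieldPositions = local;
            }
            return local;
        }
    }

    // Level 1: identity lookup, cheap for the common case of a repeated instance.
    // Assumption: strong keys here; a real cache would hold keys weakly.
    private static final Map<RawSchema, Wrapper> BY_IDENTITY =
        Collections.synchronizedMap(new IdentityHashMap<>());

    // Level 2: value interning, so equal-but-distinct schemas share one wrapper.
    private static final Map<RawSchema, Wrapper> BY_VALUE = new ConcurrentHashMap<>();

    static Wrapper intern(RawSchema raw) {
        Wrapper hit = BY_IDENTITY.get(raw);
        if (hit != null) {
            return hit;
        }
        Wrapper canonical = BY_VALUE.computeIfAbsent(raw, Wrapper::new);
        BY_IDENTITY.put(raw, canonical);
        return canonical;
    }
}
```

Splitting the lookup into two levels keeps the per-record path to an identity comparison, and the structural `equals` / `hashCode` cost is paid only the first time a given schema instance is seen.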

Results are identical for all valid inputs; the schema must keep coming from the record itself since log block schemas can differ from the reader schema.

