perf: AvroRecordContext rebuilds a HoodieSchema wrapper and field map on every field access #18966

@voonhous

Describe the problem

AvroRecordContext#getFieldValueFromIndexedRecord is the implementation behind RecordContext#getValue for the Avro engine and runs once per record per accessed field in the file group reader flow: MOR snapshot reads, compaction, upsert log merging, and metadata table reads (ordering values, delete-flag checks, column value access).

Per invocation it:

  • calls HoodieSchema.fromAvroSchema(record.getSchema()), allocating a fresh wrapper and re-deriving the schema type
  • splits the field path with regex-based String.split
  • calls HoodieSchema#getField on the fresh wrapper, which lazily rebuilds the entire field list and field map: one new HoodieSchemaField per column plus a HashMap collect, i.e. O(schema width) allocations per call
  • wraps every union branch into another new HoodieSchema via getNonNullType

None of it is cached because the wrapper is thrown away after each call. For a 200-column table this is hundreds of allocations per record per accessed field.
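The cost pattern above can be illustrated with a minimal, self-contained sketch. It does not use Hudi or Avro: LazyFieldWrapper is a hypothetical stand-in for a wrapper that builds its field map lazily, and the counter only approximates "allocations" as map entries built. It compares a fresh wrapper per field access with one shared wrapper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a schema wrapper that builds its field map lazily. */
final class LazyFieldWrapper {
  // Counts map entries built across all wrappers (illustration only, single-threaded).
  static long entriesBuilt = 0;

  private final List<String> columnNames;
  private Map<String, Integer> fieldPositions; // built on first lookup, lost with the wrapper

  LazyFieldWrapper(List<String> columnNames) {
    this.columnNames = columnNames;
  }

  Integer positionOf(String name) {
    if (fieldPositions == null) {
      Map<String, Integer> built = new HashMap<>();
      for (int i = 0; i < columnNames.size(); i++) {
        built.put(columnNames.get(i), i); // one entry per column: O(schema width)
        entriesBuilt++;
      }
      fieldPositions = built;
    }
    return fieldPositions.get(name);
  }
}

final class PerCallRebuildDemo {
  static List<String> columns(int width) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < width; i++) {
      names.add("col_" + i);
    }
    return names;
  }

  /** Mimics the shape described above: a new wrapper for every field access. */
  static long freshWrapperPerAccess(List<String> columns, int records) {
    LazyFieldWrapper.entriesBuilt = 0;
    for (int r = 0; r < records; r++) {
      new LazyFieldWrapper(columns).positionOf("col_7");
    }
    return LazyFieldWrapper.entriesBuilt;
  }

  /** One wrapper reused across records: the field map is built once. */
  static long sharedWrapper(List<String> columns, int records) {
    LazyFieldWrapper.entriesBuilt = 0;
    LazyFieldWrapper shared = new LazyFieldWrapper(columns);
    for (int r = 0; r < records; r++) {
      shared.positionOf("col_7");
    }
    return LazyFieldWrapper.entriesBuilt;
  }

  public static void main(String[] args) {
    List<String> cols = columns(200);
    System.out.println("fresh wrapper per access: " + freshWrapperPerAccess(cols, 1000));
    System.out.println("shared wrapper:           " + sharedWrapper(cols, 1000));
  }
}
```

With 200 columns and 1000 records, the fresh-wrapper path builds 200 entries per access while the shared wrapper builds 200 in total. The real path also allocates a HoodieSchemaField per column and wraps union branches, which this sketch leaves out.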

The same per-record HoodieSchema.fromAvroSchema(...) rebuild also appears on other hot read / merge paths: Spark / Flink RecordContext#convertAvroRecord, the Hive MOR RealtimeCompactedRecordReader merge, HoodieAvroUtils#getRecordColumnValues, HoodieJsonPayload#getInsertValue, and the MERGE-INTO ExpressionPayload evaluators.

Proposed fix

Intern the Avro-schema -> HoodieSchema conversion so that the canonical wrapper, together with its lazily built field list and field map, is reused across calls instead of being allocated once per record. HoodieSchema stays the facade and the lookup path is unchanged; the fix does not bypass it with raw Avro traversal.

  • Add an Avro-Schema-keyed cache (AvroToHoodieSchemaCache, identity / weakKeys) that converts and value-interns through HoodieSchemaCache on a miss, so equal-but-distinct Avro schema instances converge on one canonical HoodieSchema.
  • Route the per-record call sites above through it; hoist the loop-invariant fromAvroSchema out of the per-record write loop in HoodieAvroDataBlock#getBytes. Leave cold / one-time and per-block sites unchanged.
  • Make HoodieSchema#getFields / #getFieldMap publish their lazily built caches safely (volatile fields plus immutable wrappers), since interned instances are shared across executor task threads.
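The first and third bullets can be sketched together as follows. This is a rough illustration under stated assumptions, not the proposed implementation: SourceSchema, WrappedSchema and InterningSchemaCache are hypothetical stand-ins for the Avro Schema, HoodieSchema and AvroToHoodieSchemaCache. The identity level here holds strong references for brevity, whereas the proposal calls for weak keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an Avro Schema: structural equals, many distinct instances. */
final class SourceSchema {
  final List<String> fieldNames;

  SourceSchema(List<String> fieldNames) {
    this.fieldNames = Collections.unmodifiableList(new ArrayList<>(fieldNames));
  }

  @Override public boolean equals(Object o) {
    return o instanceof SourceSchema && ((SourceSchema) o).fieldNames.equals(fieldNames);
  }

  @Override public int hashCode() {
    return fieldNames.hashCode();
  }
}

/** Hypothetical stand-in for the wrapper, with a lazily built and safely published field map. */
final class WrappedSchema {
  private final SourceSchema source;
  private volatile Map<String, Integer> fieldMap; // volatile: safe publication across threads

  WrappedSchema(SourceSchema source) {
    this.source = source;
  }

  Map<String, Integer> getFieldMap() {
    Map<String, Integer> local = fieldMap;
    if (local == null) {
      Map<String, Integer> built = new LinkedHashMap<>();
      for (int i = 0; i < source.fieldNames.size(); i++) {
        built.put(source.fieldNames.get(i), i);
      }
      // Racing threads may each build a map; all results are equal and immutable,
      // so whichever write lands last is safe to share.
      local = Collections.unmodifiableMap(built);
      fieldMap = local;
    }
    return local;
  }
}

/** Two-level lookup: identity fast path, then value interning on a miss. */
final class InterningSchemaCache {
  // Assumption: strong references for brevity; the proposal uses weak keys here.
  private final Map<SourceSchema, WrappedSchema> byIdentity =
      Collections.synchronizedMap(new IdentityHashMap<>());
  private final ConcurrentHashMap<SourceSchema, WrappedSchema> byValue =
      new ConcurrentHashMap<>();
  final AtomicInteger conversions = new AtomicInteger();

  WrappedSchema get(SourceSchema schema) {
    return byIdentity.computeIfAbsent(schema, s ->
        byValue.computeIfAbsent(s, k -> {
          conversions.incrementAndGet(); // the expensive conversion runs once per distinct value
          return new WrappedSchema(k);
        }));
  }
}
```

In this sketch, two equal-but-distinct SourceSchema instances resolve to the same WrappedSchema, so the field map is built once and then read by every caller. The identity level avoids hashing the whole schema on the hot path; the value level is consulted only the first time a given instance is seen.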

Results are identical for all valid inputs. The schema must still come from the record itself, since log block schemas can differ from the reader schema.
