HoodieSchema collapses TINYINT/SMALLINT into INT and loses engine type width needed by writer paths #18974

@cshuo

Task Description

Describe the problem

HoodieSchema currently collapses engine-level small integer types into a single INT type, which loses integer width information needed to faithfully reconstruct engine-native schemas.

Examples:

  • Spark: ByteType | ShortType -> HoodieSchemaType.INT
  • Flink: TINYINT | SMALLINT -> HoodieSchemaType.INT

This becomes a problem because writer paths are built from HoodieSchema, and later reconstruct engine-native schemas from it.

On Spark, the path is roughly:

  1. HoodieSparkSchemaConverters.toHoodieType(...) maps ByteType/ShortType/IntegerType to HoodieSchemaType.INT
  2. HoodieSparkFileWriterFactory reconstructs StructType from HoodieSchema
  3. HoodieSparkSchemaConverters.toSqlType(...) maps HoodieSchemaType.INT back to IntegerType
  4. HoodieRowParquetWriteSupport makes type-dependent row accessor decisions from that reconstructed StructType

At that point, the original Spark type width is already lost. For example, an original ShortType field is reconstructed as IntegerType, and writer code may go through row.getInt(...) instead of row.getShort(...).
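The four steps above can be modelled in a few lines. This is only a toy sketch: `EngineType` and `HoodieType` are hypothetical stand-in enums, not the real Spark `DataType` or `HoodieSchemaType` classes, and the method names merely echo the converter names mentioned in the steps.

```java
// Toy stand-ins only: none of these are the real Spark or Hudi classes.
final class WidthLossSketch {
    enum EngineType { BYTE, SHORT, INTEGER } // stands in for ByteType / ShortType / IntegerType
    enum HoodieType { INT }                  // stands in for HoodieSchemaType.INT

    // Mirrors step 1: every small integer type is mapped to INT.
    static HoodieType toHoodieType(EngineType t) {
        switch (t) {
            case BYTE:
            case SHORT:
            case INTEGER:
                return HoodieType.INT;
            default:
                throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    // Mirrors step 3: INT can only be mapped back to one engine type.
    static EngineType toSqlType(HoodieType t) {
        return EngineType.INTEGER;
    }

    // Mirrors step 4: the accessor is chosen from the reconstructed type.
    static String accessorFor(EngineType t) {
        switch (t) {
            case BYTE:
                return "getByte";
            case SHORT:
                return "getShort";
            default:
                return "getInt";
        }
    }

    // Steps 1 to 3 combined: engine type -> HoodieSchema type -> engine type.
    static EngineType roundTrip(EngineType t) {
        return toSqlType(toHoodieType(t));
    }

    public static void main(String[] args) {
        for (EngineType original : EngineType.values()) {
            EngineType rebuilt = roundTrip(original);
            System.out.println(original + " -> " + rebuilt + ", accessor " + accessorFor(rebuilt));
        }
    }
}
```

In this model a `SHORT` field and an `INTEGER` field are indistinguishable after step 3, so step 4 cannot pick the narrower accessor even if it wanted to.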

Flink has the same issue in principle:

  • TINYINT/SMALLINT are also collapsed into HoodieSchemaType.INT
  • writer code later reconstructs RowType from HoodieSchema
  • the original integer width is no longer recoverable from HoodieSchema alone

So this is not just a schema round-trip fidelity issue. HoodieSchema currently does not preserve enough information for engine-native writer construction.

To Reproduce

  1. Create a Spark StructType containing a ByteType or ShortType field, or a Flink RowType containing a TINYINT or SMALLINT field
  2. Convert it to HoodieSchema
  3. Reconstruct engine-native schema from that HoodieSchema
  4. Observe that the field has already been widened to INT / IntegerType
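A regression check for these steps could take roughly the following shape. It is a sketch under assumptions: `RoundTripCheck`, `widenedFields`, and `lossyRoundTrip` are hypothetical names, types are represented as plain strings, and `lossyRoundTrip` only imitates the behaviour described in this issue rather than calling the real converters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of Hudi: lists fields whose type changes across a round trip.
final class RoundTripCheck {
    static List<String> widenedFields(Map<String, String> schema, UnaryOperator<String> roundTrip) {
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, String> field : schema.entrySet()) {
            String before = field.getValue();
            String after = roundTrip.apply(before);
            if (!after.equals(before)) {
                changed.add(field.getKey() + ": " + before + " -> " + after);
            }
        }
        return changed;
    }

    // Imitates the behaviour reported here: TINYINT and SMALLINT come back as INT.
    static String lossyRoundTrip(String type) {
        return (type.equals("TINYINT") || type.equals("SMALLINT")) ? "INT" : type;
    }

    public static void main(String[] args) {
        Map<String, String> schema = new LinkedHashMap<>();
        schema.put("id", "INT");
        schema.put("flag", "TINYINT");
        schema.put("count", "SMALLINT");
        // Prints one line per field that did not survive the round trip.
        widenedFields(schema, RoundTripCheck::lossyRoundTrip).forEach(System.out::println);
    }
}
```

A real test would replace `lossyRoundTrip` with the actual engine-to-HoodieSchema and HoodieSchema-to-engine conversions and assert that the returned list is empty.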

Expected behavior

HoodieSchema should preserve integer width information so that the following types do not all collapse into the same schema type:

  • Spark ByteType, ShortType
  • Flink TINYINT, SMALLINT

Environment Description

Hudi version: current master / current branch

Possible solutions

  1. Add new types: add native integer-width types such as TINYINT and SMALLINT to HoodieSchema, and preserve them across the Spark/Flink converters.

  2. Fix the writer builders: avoid reconstructing the engine writer schema from HoodieSchema alone in paths that need engine-native getter semantics, and instead pass the original engine schema through the writer boundary.

The first direction seems more consistent with the long-term schema-system design: RFC-99 already describes primitive integer widths such as TINYINT and SMALLINT as first-class logical types.
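To illustrate the first direction, here is a minimal sketch in which the schema type set has one entry per integer width, so the mapping becomes one-to-one. The enums and method names are hypothetical stand-ins; the actual shape of the new HoodieSchema types and converter changes is still to be decided.

```java
// Toy stand-ins only: a possible shape for width-preserving conversion, not the Hudi API.
final class WidthPreservingSketch {
    enum EngineType { BYTE, SHORT, INTEGER }    // stands in for ByteType / ShortType / IntegerType
    enum HoodieType { TINYINT, SMALLINT, INT }  // assumed new types alongside the existing INT

    // Each engine width gets its own schema type.
    static HoodieType toHoodieType(EngineType t) {
        switch (t) {
            case BYTE:
                return HoodieType.TINYINT;
            case SHORT:
                return HoodieType.SMALLINT;
            case INTEGER:
                return HoodieType.INT;
            default:
                throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    // The reverse mapping is now unambiguous.
    static EngineType toEngineType(HoodieType t) {
        switch (t) {
            case TINYINT:
                return EngineType.BYTE;
            case SMALLINT:
                return EngineType.SHORT;
            case INT:
                return EngineType.INTEGER;
            default:
                throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    public static void main(String[] args) {
        for (EngineType original : EngineType.values()) {
            EngineType rebuilt = toEngineType(toHoodieType(original));
            System.out.println(original + " -> " + toHoodieType(original) + " -> " + rebuilt);
        }
    }
}
```

With a one-to-one mapping, a writer that rebuilds the engine schema from HoodieSchema would see the original width again and could choose the matching accessor without any extra side channel.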

Task Type

Code improvement/refactoring
