Task Description
Describe the problem
HoodieSchema currently collapses engine-level small integer types into a single INT type, which loses integer width information needed to faithfully reconstruct engine-native schemas.
Examples:
- Spark: ByteType | ShortType -> HoodieSchemaType.INT
- Flink: TINYINT | SMALLINT -> HoodieSchemaType.INT
This becomes a problem because writer paths are built from HoodieSchema, and later reconstruct engine-native schemas from it.
On Spark, the path is roughly:
- HoodieSparkSchemaConverters.toHoodieType(...) maps ByteType/ShortType/IntegerType to HoodieSchemaType.INT
- HoodieSparkFileWriterFactory reconstructs StructType from HoodieSchema
- HoodieSparkSchemaConverters.toSqlType(...) maps HoodieSchemaType.INT back to IntegerType
- HoodieRowParquetWriteSupport makes type-dependent row accessor decisions from that reconstructed StructType
At that point, the original Spark type width is already lost. For example, an original ShortType field is reconstructed as IntegerType, and writer code may go through row.getInt(...) instead of row.getShort(...).
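The collapse can be modelled without any Hudi or Spark dependency. The sketch below is a simplified illustration only: EngineType and SchemaType are hypothetical stand-ins for Spark's ByteType/ShortType/IntegerType and for HoodieSchemaType, and the two methods merely mimic the mappings described above rather than reproduce the real converter code.

```java
// Simplified model of the lossy round trip; none of these types are real Hudi or Spark classes.
class WidthLossSketch {
    // Hypothetical stand-ins for Spark's ByteType, ShortType and IntegerType.
    enum EngineType { BYTE, SHORT, INTEGER }

    // Hypothetical stand-in for HoodieSchemaType, reduced to the single member relevant here.
    enum SchemaType { INT }

    // Mimics the mapping described for toHoodieType(...): all three widths become INT.
    static SchemaType toHoodieType(EngineType t) {
        switch (t) {
            case BYTE:
            case SHORT:
            case INTEGER:
                return SchemaType.INT;
            default:
                throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    // Mimics the mapping described for toSqlType(...): INT always comes back as INTEGER.
    static EngineType toSqlType(SchemaType t) {
        return EngineType.INTEGER;
    }

    // Engine type -> schema type -> engine type, as the writer path does.
    static EngineType roundTrip(EngineType t) {
        return toSqlType(toHoodieType(t));
    }

    public static void main(String[] args) {
        for (EngineType t : EngineType.values()) {
            System.out.println(t + " -> " + toHoodieType(t) + " -> " + roundTrip(t));
        }
    }
}
```

In this model BYTE and SHORT both come back as INTEGER, so any code that chooses an accessor from the reconstructed type has no way to pick the narrower one.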
Flink has the same issue in principle:
- TINYINT/SMALLINT are also collapsed into HoodieSchemaType.INT
- writer code later reconstructs RowType from HoodieSchema
- the original integer width is no longer recoverable from HoodieSchema alone
So this is not just a schema round-trip fidelity issue. HoodieSchema currently does not preserve enough information for engine-native writer construction.
To Reproduce
- Create a Spark StructType containing ByteType or ShortType, or a Flink RowType containing TINYINT or SMALLINT
- Convert it to HoodieSchema
- Reconstruct the engine-native schema from that HoodieSchema
- Observe that the field has already been widened to INT / IntegerType
Expected behavior
HoodieSchema should preserve integer width information so that:
- Spark: ByteType, ShortType
- Flink: TINYINT, SMALLINT
do not all collapse into the same schema type.
Environment Description
Hudi version: current master / current branch
Possible solutions
- Add new types: add native integer-width types such as TINYINT and SMALLINT to HoodieSchema, and preserve them across Spark/Flink converters.
- Fix the writer builders: avoid reconstructing engine writer schema from HoodieSchema alone in paths that need engine-native getter semantics, and instead pass the original engine schema through the writer boundary.
The first direction seems more consistent with the long-term schema-system design. rfc-99 already describes primitive integer widths such as TINYINT and SMALLINT as first-class logical types.
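A minimal sketch of what the first direction could look like, again as a self-contained model rather than actual Hudi code. The TINYINT and SMALLINT members on SchemaType are the proposed additions and do not exist in this form today; EngineType is a hypothetical stand-in for the Spark types; and read(...) uses a boxed Object[] as a stand-in for an engine row, with the cast playing the role of getByte/getShort/getInt. Real row implementations may behave differently on a mismatched accessor.

```java
// Model of a width-preserving mapping; the types here are illustrative, not real Hudi or Spark classes.
class WidthPreservingSketch {
    // Hypothetical stand-ins for Spark's ByteType, ShortType and IntegerType.
    enum EngineType { BYTE, SHORT, INTEGER }

    // Hypothetical schema enum with the proposed TINYINT and SMALLINT members added.
    enum SchemaType { TINYINT, SMALLINT, INT }

    // One schema type per engine width, so nothing is collapsed.
    static SchemaType toHoodieType(EngineType t) {
        switch (t) {
            case BYTE:    return SchemaType.TINYINT;
            case SHORT:   return SchemaType.SMALLINT;
            case INTEGER: return SchemaType.INT;
            default: throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    // Exact inverse of toHoodieType, so the round trip is the identity.
    static EngineType toSqlType(SchemaType t) {
        switch (t) {
            case TINYINT:  return EngineType.BYTE;
            case SMALLINT: return EngineType.SHORT;
            case INT:      return EngineType.INTEGER;
            default: throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    // Width-specific read over a boxed row; a wrong width fails with ClassCastException.
    static int read(Object[] row, int ordinal, EngineType t) {
        switch (t) {
            case BYTE:    return (Byte) row[ordinal];
            case SHORT:   return (Short) row[ordinal];
            case INTEGER: return (Integer) row[ordinal];
            default: throw new IllegalArgumentException("unsupported: " + t);
        }
    }

    public static void main(String[] args) {
        Object[] row = { (short) 7 }; // the field was written as a 16-bit value
        EngineType reconstructed = toSqlType(toHoodieType(EngineType.SHORT));
        System.out.println(reconstructed + " " + read(row, 0, reconstructed)); // prints "SHORT 7"
    }
}
```

With the reconstructed type still SHORT, the model picks the 16-bit accessor. Passing INTEGER to read(...) for the same row throws ClassCastException in this model, which is the kind of mismatch the collapsed mapping makes possible.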
Task Type
Code improvement/refactoring
Related Issues
Parent feature issue: (if applicable )
Related issues: