The validator code
(descriptorType, f.dataType) match {
  case (HoodieSchemaType.BLOB, st: StructType) => validateBlobStructure(st)
  case (HoodieSchemaType.VARIANT, st: StructType) => validateVariantStructure(st)
  case _ => // <-- silently no-op
}
The first pattern matches only when the tag says BLOB and the data type is a
StructType; the second does the same for VARIANT. Anything else falls into
case _ and does nothing.
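To see which inputs reach the catch-all arm, here is a minimal, self-contained sketch. It does not use Spark or Hudi: `Tag`, `DataType`, and `armTaken` are hypothetical stand-ins that only mirror the shape of the match above.

```scala
// Hypothetical stand-ins; these are not the real Spark or Hudi types.
object MatchArms {
  sealed trait Tag
  case object Blob extends Tag
  case object Variant extends Tag

  sealed trait DataType
  case object LongType extends DataType
  final case class StructType(fieldNames: List[String]) extends DataType

  // Same shape as the validator's match: a typed pattern on the second tuple element.
  // Returns a label for the arm that was taken instead of running any validation.
  def armTaken(tag: Tag, dataType: DataType): String =
    (tag, dataType) match {
      case (Blob, _: StructType)    => "blob-validation"
      case (Variant, _: StructType) => "variant-validation"
      case _                        => "no-op"
    }
}
```

With these stand-ins, `MatchArms.armTaken(MatchArms.Blob, MatchArms.LongType)` returns `"no-op"`: the tag alone is not enough to reach a validation arm.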
The bug, concretely
Suppose a user (or a buggy upstream transform) builds this schema:
val blobMetadata = new MetadataBuilder()
  .putString(HoodieSchema.TYPE_METADATA_FIELD, HoodieSchemaType.BLOB.name())
  .build()
val schema = new StructType()
  .add("id", LongType)
  .add("payload", LongType, nullable = true, metadata = blobMetadata)
//                ^^^^^^^^                              ^^^^^^^^^^^^
//                wrong type                            says "I'm a BLOB"
The user is asserting "payload is a BLOB" via the metadata, but the data type is
LongType, not the canonical BLOB struct.
What happens today
1. validateCustomTypeStructures(schema) runs.
2. It sees the hudi_type=BLOB tag on payload.
3. The match tuple is (BLOB, LongType) — neither pattern matches → falls into case _
→ returns without throwing.
4. Then convertStructTypeToHoodieSchema runs.
5. The BLOB case in toHoodieTypeNested is
   case blobStruct: StructType if metadata.contains(...) && ...isCanonicalBlobStruct(blobStruct) =>
   It requires a StructType, so it doesn't match either.
6. LongType falls through to the normal case LongType => HoodieSchema.create(LONG)
arm.
7. Result: the field is silently written as a plain LONG. The BLOB tag is ignored,
no error.
The user thinks they wrote a BLOB column; the table actually has a LONG column.
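The steps above can be condensed into a small sketch. Everything in it is a hypothetical stand-in: the `Field` class, the string labels returned by `convert`, and the placeholder check inside `validate` are assumptions for illustration, since the real `validateBlobStructure` and `toHoodieTypeNested` bodies are not shown here. It models only the control flow, not the actual Hudi code.

```scala
// Hypothetical model of the validate-then-convert flow; not the real Hudi code.
object SilentDowngrade {
  sealed trait DataType
  case object LongType extends DataType
  final case class StructType(fieldNames: List[String]) extends DataType

  final case class Field(name: String, dataType: DataType, metadata: Map[String, String])

  // Steps 1-3: the validator only acts on (BLOB, StructType); anything else is a no-op.
  def validate(f: Field): Unit =
    (f.metadata.get("hudi_type"), f.dataType) match {
      case (Some("BLOB"), st: StructType) =>
        // Placeholder check standing in for the real structural validation.
        require(st.fieldNames.nonEmpty, s"Field '${f.name}' has an empty BLOB struct")
      case _ => // (Some("BLOB"), LongType) lands here and returns normally
    }

  // Steps 4-6: the BLOB arm is guarded by a StructType pattern, so LongType skips it.
  def convert(f: Field): String =
    f.dataType match {
      case _: StructType if f.metadata.get("hudi_type").contains("BLOB") => "BLOB"
      case _: StructType => "RECORD"
      case LongType      => "LONG"
    }

  // The mis-tagged field from the example schema.
  val payload: Field = Field("payload", LongType, Map("hudi_type" -> "BLOB"))
}
```

In this model, `SilentDowngrade.validate(SilentDowngrade.payload)` returns without throwing and `SilentDowngrade.convert(SilentDowngrade.payload)` yields `"LONG"`, which is step 7: the tag is present on the field but never consulted for a non-struct type.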
The fix
Add an explicit reject for "tag says BLOB/VARIANT but the type is wrong":
(descriptorType, f.dataType) match {
  case (HoodieSchemaType.BLOB, st: StructType) => validateBlobStructure(st)
  case (HoodieSchemaType.VARIANT, st: StructType) => validateVariantStructure(st)
  case (HoodieSchemaType.BLOB, other) =>
    throw new IllegalArgumentException(
      s"Field '${f.name}' is tagged hudi_type=BLOB but has type $other; expected a StructType.")
  case (HoodieSchemaType.VARIANT, other) =>
    throw new IllegalArgumentException(
      s"Field '${f.name}' is tagged hudi_type=VARIANT but has type $other; expected a StructType.")
  case _ =>
}
Now the misuse fails fast at the write boundary instead of silently producing the
wrong on-disk schema.
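One way to check the fail-fast behavior is a sketch like the following. The types are hypothetical stand-ins rather than Spark or Hudi classes, and `validateFixed` is an assumed name that only mirrors the shape of the patched match. For brevity it collapses the two reject arms into one.

```scala
import scala.util.Try

// Hypothetical stand-ins; these are not the real Spark or Hudi types.
object FailFast {
  sealed trait Tag
  case object Blob extends Tag
  case object Variant extends Tag

  sealed trait DataType
  case object LongType extends DataType
  final case class StructType(fieldNames: List[String]) extends DataType

  final case class Field(name: String, dataType: DataType, tag: Option[Tag])

  def validateFixed(f: Field): Unit =
    (f.tag, f.dataType) match {
      case (Some(Blob | Variant), _: StructType) =>
        () // structural validation would run here
      case (Some(tag), other) =>
        // Tagged as a custom type, but the data type is not a struct: reject.
        throw new IllegalArgumentException(
          s"Field '${f.name}' is tagged hudi_type=$tag but has type $other; expected a StructType.")
      case (None, _) =>
        () // untagged fields are not custom types
    }

  // Try captures the exception so both outcomes can be inspected side by side.
  val mistagged: Try[Unit] = Try(validateFixed(Field("payload", LongType, Some(Blob))))
  val untagged: Try[Unit]  = Try(validateFixed(Field("id", LongType, None)))
}
```

Here `FailFast.mistagged` is a `Failure` wrapping an `IllegalArgumentException`, while `FailFast.untagged` is a `Success`, so ordinary untagged columns are unaffected by the stricter match.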
@voonhous LGTM, but can you check this one weird case in case a user tries this (unlikely, but sharing below):
Originally posted by @rahil-c in #18566 (comment)