Check custom type edge case #18603

@voonhous

@voonhous LGTM, but can you check one odd case that a user might hit (unlikely, but sharing below):


The validator code
                                                                                      
  (descriptorType, f.dataType) match {
    case (HoodieSchemaType.BLOB,    st: StructType) => validateBlobStructure(st)      
    case (HoodieSchemaType.VARIANT, st: StructType) => validateVariantStructure(st)   
    case _ =>   // <-- silently no-op                                                 
  }                                                                                   
                                                                                      
  The patterns only match when the tag says BLOB or VARIANT and the data type is a
  StructType. Anything else falls into case _ and does nothing.
                                                                                      
  The bug, concretely                                                                 
   
  Suppose a user (or a buggy upstream transform) builds this schema:                  
   
  val blobMetadata = new MetadataBuilder()                                            
    .putString(HoodieSchema.TYPE_METADATA_FIELD, HoodieSchemaType.BLOB.name())        
    .build()                                                                          
                                                                                      
  val schema = new StructType()                                                       
    .add("id",      LongType)
    .add("payload", LongType, nullable = true, metadata = blobMetadata)               
    //              ^^^^^^^^                              ^^^^^^^^^^^^
    //              wrong type                            says "I'm a BLOB"           
                                                                                      
  The user is asserting "payload is a BLOB" via the metadata, but the data type is a  
  LongType, not the canonical BLOB struct.                                            
                                                                                      
  What happens today

  1. validateCustomTypeStructures(schema) runs.                                       
  2. It sees the hudi_type=BLOB tag on payload.
  3. The match tuple is (BLOB, LongType) — neither pattern matches → falls into case _
   → returns without throwing.                                                        
  4. Then convertStructTypeToHoodieSchema runs.                                       
  5. The BLOB case in toHoodieTypeNested is case blobStruct: StructType if            
  metadata.contains(...) && ...isCanonicalBlobStruct(blobStruct) => — requires a      
  StructType, so it doesn't match either.
  6. LongType falls through to the normal case LongType => HoodieSchema.create(LONG)  
  arm.                                                                                
  7. Result: the field is silently written as a plain LONG. The BLOB tag is ignored, 
  no error.                                                                           
   
  The user thinks they wrote a BLOB column; the table actually has a LONG column.     
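
  The fall-through in steps 1 to 7 can be reproduced with a small stand-alone model. This is only a sketch: `DataType`, `Tag`, `Field`, `validate`, and `convert` are hypothetical stand-ins for the Spark and Hudi types, not the real API, and the canonical-struct checks are left out.

```scala
object SilentFallThrough {
  // Hypothetical stand-ins for Spark's DataType and the hudi_type tag; not the real API.
  sealed trait DataType
  case object LongType extends DataType
  final case class StructType(fieldNames: List[String]) extends DataType

  sealed trait Tag
  case object Blob extends Tag
  case object Variant extends Tag

  final case class Field(name: String, dataType: DataType, tag: Option[Tag])

  // Mirrors the current validator: only (tag, StructType) pairs are inspected.
  def validate(f: Field): Unit = (f.tag, f.dataType) match {
    case (Some(Blob), _: StructType)    => () // real code: validateBlobStructure(st)
    case (Some(Variant), _: StructType) => () // real code: validateVariantStructure(st)
    case _                              => () // (Blob, LongType) lands here silently
  }

  // Mirrors the converter: the tagged arms require a StructType, so a tagged
  // LongType takes the plain LONG arm and the tag is dropped.
  def convert(f: Field): String = (f.tag, f.dataType) match {
    case (Some(Blob), _: StructType)    => "BLOB"
    case (Some(Variant), _: StructType) => "VARIANT"
    case (_, LongType)                  => "LONG"
    case (_, _: StructType)             => "RECORD"
  }

  def main(args: Array[String]): Unit = {
    val payload = Field("payload", LongType, Some(Blob))
    validate(payload)         // no exception is thrown
    println(convert(payload)) // prints "LONG": the BLOB tag had no effect
  }
}
```

  Both stages accept the field independently, which is why nothing surfaces the mismatch before the schema is written.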
   
  The fix                                                                             

  Add an explicit reject for "tag says BLOB/VARIANT but the type is wrong":           
   
  (descriptorType, f.dataType) match {
    case (HoodieSchemaType.BLOB,    st: StructType) => validateBlobStructure(st)
    case (HoodieSchemaType.VARIANT, st: StructType) => validateVariantStructure(st)
    case (HoodieSchemaType.BLOB,    other) =>
      throw new IllegalArgumentException(
        s"Field '${f.name}' is tagged hudi_type=BLOB but has type $other; " +
          "expected a StructType.")
    case (HoodieSchemaType.VARIANT, other) =>
      throw new IllegalArgumentException(
        s"Field '${f.name}' is tagged hudi_type=VARIANT but has type $other; " +
          "expected a StructType.")
    case _ => // any other descriptor type: nothing to validate
  }

  Now the misuse fails fast at the write boundary instead of silently producing the   
  wrong on-disk schema.
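
  A regression check for this could look roughly like the following. It is a sketch under assumptions: the types are hypothetical stand-ins rather than the Spark and Hudi classes, the struct field names are placeholders (the canonical BLOB layout is not modeled), and the two reject arms are folded into one pattern purely for brevity.

```scala
import scala.util.Try

object FailFastSketch {
  // Hypothetical stand-ins for Spark's DataType and the hudi_type tag; not the real API.
  sealed trait DataType
  case object LongType extends DataType
  final case class StructType(fieldNames: List[String]) extends DataType

  sealed trait Tag
  case object Blob extends Tag
  case object Variant extends Tag

  final case class Field(name: String, dataType: DataType, tag: Option[Tag])

  // A tagged field must carry a StructType; any other data type is rejected up front.
  def validate(f: Field): Unit = (f.tag, f.dataType) match {
    case (Some(_), _: StructType) => () // real code: structural validation per tag
    case (Some(tag), other) =>
      throw new IllegalArgumentException(
        s"Field '${f.name}' is tagged hudi_type=$tag but has type $other; " +
          "expected a StructType.")
    case (None, _) => () // untagged fields are not custom types
  }

  def main(args: Array[String]): Unit = {
    val bad      = Field("payload", LongType, Some(Blob))
    val good     = Field("payload", StructType(List("a", "b")), Some(Blob)) // placeholder names
    val untagged = Field("id", LongType, None)

    println(Try(validate(bad)).isFailure)      // true: the mismatch is rejected
    println(Try(validate(good)).isSuccess)     // true: tagged struct still passes
    println(Try(validate(untagged)).isSuccess) // true: plain columns are unaffected
  }
}
```

  The three cases cover the misuse itself plus the two paths that must keep working: a correctly tagged struct and an ordinary untagged column.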

Originally posted by @rahil-c in #18566 (comment)
