Description
When a Parquet file stores timestamps as INT96 (written from Spark's TimestampType, which has local-time-zone (LTZ) semantics: an instant normalized to UTC) and the read schema requests TimestampNTZ, the native_datafusion scan silently returns wall-clock values that disagree with what was written.
Spark 3.x itself raises an error in this scenario (SPARK-36182) to prevent silent reinterpretation of an LTZ instant as NTZ. Comet's native scan should either match Spark's behavior by raising an error, or correctly handle the timezone conversion.
Steps to Reproduce
val sessionTz = "America/Los_Angeles"
val written = "2020-01-01 12:00:00"
withSQLConf(
  SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz,
  SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> "INT96",
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Write "2020-01-01 12:00:00" America/Los_Angeles as INT96.
    // The bits encode the UTC instant 2020-01-01 20:00:00.
    Seq(Timestamp.valueOf(written)).toDF("ts").write.parquet(path)
    // Spark refuses to read INT96 as TimestampNTZ (SPARK-36182)
    withSQLConf(CometConf.COMET_ENABLED.key -> "false") {
      intercept[SparkException] {
        spark.read.schema("ts timestamp_ntz").parquet(path).collect()
      }
    }
    // native_datafusion silently returns a shifted value
    withSQLConf(CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION) {
      val rows = spark.read.schema("ts timestamp_ntz").parquet(path).collect()
      val actual = rows.head.getAs[LocalDateTime](0)
      // actual != LocalDateTime.parse("2020-01-01T12:00:00")
      // The value is silently wrong: shifted by the timezone offset
    }
  }
}
Expected Behavior
Comet should match Spark's behavior and raise an error when asked to read INT96 timestamps as TimestampNTZ, since the LTZ→NTZ reinterpretation cannot be done safely without explicit conversion.
Actual Behavior
The native DataFusion scan returns a result without error, but the timestamp value is silently incorrect (shifted by the session timezone offset).
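The size of the shift can be worked out with plain java.time, independent of Spark or Comet. The sketch below uses the values from the reproduction. It assumes, without confirmation from the scan's code, that the wrong value comes from taking the stored UTC fields as wall-clock time with no conversion. The object name Int96NtzShift and its members are hypothetical, introduced only for illustration.

```scala
import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}

// Hypothetical illustration only; not part of Spark or Comet.
object Int96NtzShift {
  // Values taken from the reproduction steps.
  val sessionZone: ZoneId = ZoneId.of("America/Los_Angeles")
  val written: LocalDateTime = LocalDateTime.parse("2020-01-01T12:00:00")

  // LTZ semantics: the file holds the instant, normalized to UTC.
  val stored: Instant = written.atZone(sessionZone).toInstant

  // Assumed faulty path: the UTC fields are used as wall-clock time as-is.
  val reinterpreted: LocalDateTime =
    LocalDateTime.ofInstant(stored, ZoneOffset.UTC)

  // Explicit conversion: render the instant in the session time zone first.
  val converted: LocalDateTime =
    LocalDateTime.ofInstant(stored, sessionZone)

  def main(args: Array[String]): Unit = {
    println(s"stored instant: $stored")
    println(s"reinterpreted:  $reinterpreted")
    println(s"converted:      $converted")
  }
}
```

Under that assumption the reinterpreted value is eight hours ahead of what was written, because America/Los_Angeles is at UTC-8 on that date. Only the explicit conversion recovers 12:00, which is why an unconverted LTZ to NTZ read is unsafe.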
Related