
fix: reject disallowed type promotions in native_datafusion scan #4229

Open
andygrove wants to merge 5 commits into apache:main from andygrove:native-df-type-promotion-validation

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #3720.

Rationale for this change

When spark.comet.scan.impl=native_datafusion and COMET_SCHEMA_EVOLUTION_ENABLED is false, several Spark SQL tests that expect SchemaColumnConvertNotSupportedException on incompatible Parquet reads never see that error: DataFusion's reader is more permissive and coerces mismatched numeric types instead of erroring, so the reads silently succeed.

This PR makes the native_datafusion scan path reject the same numeric widening cases that Spark's vectorized reader rejects, and formats the resulting error so it matches Spark's _LEGACY_ERROR_TEMP_2063 template byte-for-byte.

What changes are included in this PR?

  • Pass COMET_SCHEMA_EVOLUTION_ENABLED from JVM to native via protobuf (allow_type_promotion on the common scan options).
  • In replace_with_spark_cast, reject INT32 -> INT64, FLOAT -> DOUBLE, and INT32 -> DOUBLE when allow_type_promotion is false, raising SparkError::ParquetSchemaConvert (mirrors TypeUtil.checkParquetType in the JVM code); a sketch of this check follows the list.
  • Format the column name as [name] and emit Spark catalog names (bigint, int) plus Parquet primitive names (INT32, INT64) so the message matches Spark's vectorized reader output exactly.
  • Update dev/diffs/3.4.3.diff to drop the IgnoreCometNativeDataFusion tags for the tests that now pass.
  • Revert an incidental #[cfg(test)] gate on parquet/util/test_common so the parquet_read benchmark builds.
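
A minimal sketch of the check described in the second bullet, assuming the flag has already been threaded into the schema adapter. Apart from the behaviour of replace_with_spark_cast and the SparkError::ParquetSchemaConvert variant named above, the type and function names here are invented for illustration:

```rust
use arrow::datatypes::DataType;

/// Stand-in for the real SparkError::ParquetSchemaConvert variant.
#[derive(Debug)]
pub struct ParquetSchemaConvertError {
    pub column: String,
    pub file_type: DataType,
    pub required_type: DataType,
}

/// Rejects the widenings that Spark's vectorized reader rejects when
/// schema evolution (type promotion) is disabled.
pub fn check_type_promotion(
    column: &str,
    file_type: &DataType,
    required_type: &DataType,
    allow_type_promotion: bool,
) -> Result<(), ParquetSchemaConvertError> {
    if allow_type_promotion {
        return Ok(());
    }
    // The three disallowed widenings called out in this PR.
    let disallowed = matches!(
        (file_type, required_type),
        (DataType::Int32, DataType::Int64)
            | (DataType::Float32, DataType::Float64)
            | (DataType::Int32, DataType::Float64)
    );
    if disallowed {
        return Err(ParquetSchemaConvertError {
            column: column.to_string(),
            file_type: file_type.clone(),
            required_type: required_type.clone(),
        });
    }
    Ok(())
}
```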

How are these changes tested?

Verified locally against Apache Spark 3.4.3 with the regenerated diff and ENABLE_COMET=true ENABLE_COMET_ONHEAP=true build/sbt sql/testOnly:

  • ParquetIOSuite > SPARK-35640: int as long should throw schema incompatible error passes.
  • ParquetV1QuerySuite > row group skipping doesn't overflow when reading into larger type passes.
  • ParquetV1QuerySuite > SPARK-36182 and ParquetV1QuerySuite > SPARK-34212 are correctly ignored under the kept tags.
  • Existing schema-adapter Rust tests still pass, and cargo clippy --all-targets --workspace -- -D warnings is clean.

andygrove and others added 5 commits May 5, 2026 12:52
When COMET_SCHEMA_EVOLUTION_ENABLED is false, the native_datafusion scan
path now rejects reading Parquet INT32 as INT64, FLOAT as DOUBLE, and
INT32 as DOUBLE — matching the existing validation in native_iceberg_compat.

The allow_type_promotion flag is passed from JVM via protobuf and checked
in replace_with_spark_cast() before allowing widening casts.

Closes apache#3720

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
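
For orientation, a rough sketch of how the protobuf-provided flag could be carried into the adapter on the native side; all struct and method names below are assumptions, not the actual Comet or generated-protobuf definitions:

```rust
/// Assumed shape of the deserialized common scan options; the real struct is
/// generated from Comet's protobuf definitions.
pub struct ParquetScanOptions {
    /// Set on the JVM side from COMET_SCHEMA_EVOLUTION_ENABLED.
    pub allow_type_promotion: bool,
}

/// Hypothetical adapter that keeps the flag around for later checks.
pub struct CometSchemaAdapter {
    allow_type_promotion: bool,
}

impl CometSchemaAdapter {
    pub fn new(options: &ParquetScanOptions) -> Self {
        Self {
            allow_type_promotion: options.allow_type_promotion,
        }
    }

    /// replace_with_spark_cast consults this before inserting a widening
    /// cast for a mismatched column (see the check sketched earlier).
    pub fn allows_widening(&self) -> bool {
        self.allow_type_promotion
    }
}
```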
Format the SchemaColumnConvertNotSupportedException message produced by
the type-promotion check so it matches Spark's vectorized reader output:
column rendered as [name], expected as Spark catalog string (bigint),
found as Parquet primitive name (INT32). This lets the SPARK-35640 and
"row group skipping doesn't overflow" tests pass, and updates 3.4.3.diff
to remove their IgnoreCometNativeDataFusion tags.

The TimestampLTZ to TimestampNTZ case (SPARK-36182) and decimal
precision/scale case (SPARK-34212) remain ignored, tracked under apache#4219
and apache#3720 respectively. Also reverts the cfg(test) gate on
parquet/util/test_common so the parquet_read benchmark builds.
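
A hedged sketch of the name mapping this commit describes. Only the rendered values ([name], bigint, INT32) come from the commit text; the helper names and the message labels are placeholders, since the real output follows Spark's _LEGACY_ERROR_TEMP_2063 template exactly:

```rust
use arrow::datatypes::DataType;

/// Spark catalog (SQL) name for the requested read type.
fn spark_catalog_name(dt: &DataType) -> &'static str {
    match dt {
        DataType::Int32 => "int",
        DataType::Int64 => "bigint",
        DataType::Float32 => "float",
        DataType::Float64 => "double",
        _ => "unsupported",
    }
}

/// Parquet primitive (physical) name for the type stored in the file.
fn parquet_primitive_name(dt: &DataType) -> &'static str {
    match dt {
        DataType::Int32 => "INT32",
        DataType::Int64 => "INT64",
        DataType::Float32 => "FLOAT",
        DataType::Float64 => "DOUBLE",
        _ => "UNSUPPORTED",
    }
}

fn main() {
    // Reading an INT32 file column "a" as bigint renders roughly as:
    //   column: [a], expected: bigint, found: INT32
    // (labels here are illustrative; the PR matches Spark's exact wording)
    let msg = format!(
        "column: [{}], expected: {}, found: {}",
        "a",
        spark_catalog_name(&DataType::Int64),
        parquet_primitive_name(&DataType::Int32),
    );
    println!("{msg}");
}
```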
Run the 3720-tagged tests in dev/diffs/3.5.8.diff, 4.0.2.diff, and
4.1.1.diff against patched Spark trees with the type-promotion fix
applied, then drop the IgnoreCometNativeDataFusion tag for tests that
now pass and keep it on tests that still fail.

3.5.8: Drop tags for SPARK-35640 (int as long) and "row group skipping
doesn't overflow", repoint SPARK-36182 at issue apache#4219. Same scope as
3.4.3, since the test source matches.

4.0.2 and 4.1.1: Drop tags for SPARK-47447 (TimestampLTZ as TimestampNTZ)
and "row group skipping". 4.1.1 also drops the tag for SPARK-45604
(timestamp_ntz to array<timestamp_ntz>). Tests for SPARK-35640 (binary as
timestamp), SPARK-34212 (decimal precision/scale), the schema-mismatch
vectorized-reader test, and the parameterized ParquetTypeWideningSuite
cases (unsupported parquet conversion, unsupported parquet timestamp
conversion, parquet decimal precision change, parquet decimal precision
and scale change) still fail and remain ignored under apache#3720.
The test reads a partitioned dataset where one partition is an
empty parquet file written with INT32 schema and the other has 10
rows of INT64. Spark's vectorized reader silently skips the type
check for the empty file because no row groups are scanned. The
native_datafusion adapter rejects the INT32 to INT64 promotion at
plan time regardless of file row count, so the test now fails
when allow_type_promotion is false (Spark 3.x default).

Tag the test with IgnoreCometNativeDataFusion under the existing
3720 umbrella in 3.4.3.diff and 3.5.8.diff. Spark 4.x defaults
allow_type_promotion to true so its diffs are unaffected.
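
To make the behavioural difference concrete, here is a small illustration (assumed names) of why the empty partition is rejected: the native check only compares the file schema with the read schema, so it fires even when the file contains zero row groups:

```rust
use arrow::datatypes::{DataType, Field, Schema};

/// Rejects disallowed widenings purely from the two schemas; no row-group
/// data is read, which is why an empty file still triggers the error.
fn validate_no_widening(file_schema: &Schema, read_schema: &Schema) -> Result<(), String> {
    for read_field in read_schema.fields() {
        let Ok(file_field) = file_schema.field_with_name(read_field.name()) else {
            continue; // missing columns are handled by normal schema evolution
        };
        if matches!(
            (file_field.data_type(), read_field.data_type()),
            (DataType::Int32, DataType::Int64)
                | (DataType::Float32, DataType::Float64)
                | (DataType::Int32, DataType::Float64)
        ) {
            return Err(format!(
                "column: [{}] stored as {:?} cannot be read as {:?}",
                read_field.name(),
                file_field.data_type(),
                read_field.data_type()
            ));
        }
    }
    Ok(())
}

fn main() {
    // The empty partition's file schema (INT32) vs. the table's read schema (bigint).
    let file_schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]);
    let read_schema = Schema::new(vec![Field::new("a", DataType::Int64, true)]);
    assert!(validate_no_widening(&file_schema, &read_schema).is_err());
}
```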
@andygrove andygrove marked this pull request as ready for review May 5, 2026 23:04


Development

Successfully merging this pull request may close these issues.

native_datafusion: no error thrown for schema mismatch when reading Parquet with incompatible types
