
fix: reject disallowed type promotions in native_datafusion scan #4229

Open
andygrove wants to merge 5 commits into apache:main from andygrove:native-df-type-promotion-validation

Conversation

@andygrove
Member

Which issue does this PR close?

Closes #3720.

Rationale for this change

When spark.comet.scan.impl=native_datafusion and COMET_SCHEMA_EVOLUTION_ENABLED is false, several Spark SQL tests that expect SchemaColumnConvertNotSupportedException on incompatible Parquet reads never see that error: DataFusion's reader is more permissive and coerces mismatched numeric types instead of erroring, so the reads silently succeed.

This PR makes the native_datafusion scan path reject the same numeric widening cases that Spark's vectorized reader rejects, and formats the resulting error so it matches Spark's _LEGACY_ERROR_TEMP_2063 template byte-for-byte.

What changes are included in this PR?

  • Pass COMET_SCHEMA_EVOLUTION_ENABLED from JVM to native via protobuf (allow_type_promotion on the common scan options).
  • In replace_with_spark_cast, reject INT32 -> INT64, FLOAT -> DOUBLE, and INT32 -> DOUBLE when allow_type_promotion is false, raising SparkError::ParquetSchemaConvert (mirrors TypeUtil.checkParquetType in the JVM code); a sketch of this check follows the list.
  • Format the column name as [name] and emit Spark catalog names (bigint, int) plus Parquet primitive names (INT32, INT64) so the message matches Spark's vectorized reader output exactly.
  • Update dev/diffs/3.4.3.diff to drop the IgnoreCometNativeDataFusion tags for the tests that now pass.
  • Revert an incidental #[cfg(test)] gate on parquet/util/test_common so the parquet_read benchmark builds.
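
A minimal sketch of the check described in the second bullet, assuming the flag has already been threaded into the schema adapter. Apart from the behaviour of replace_with_spark_cast and the SparkError::ParquetSchemaConvert variant named above, the type and function names here are invented for illustration:

```rust
use arrow::datatypes::DataType;

/// Stand-in for the real SparkError::ParquetSchemaConvert variant.
#[derive(Debug)]
pub struct ParquetSchemaConvertError {
    pub column: String,
    pub file_type: DataType,
    pub required_type: DataType,
}

/// Rejects the widenings that Spark's vectorized reader rejects when
/// schema evolution (type promotion) is disabled.
pub fn check_type_promotion(
    column: &str,
    file_type: &DataType,
    required_type: &DataType,
    allow_type_promotion: bool,
) -> Result<(), ParquetSchemaConvertError> {
    if allow_type_promotion {
        return Ok(());
    }
    // The three disallowed widenings called out in this PR.
    let disallowed = matches!(
        (file_type, required_type),
        (DataType::Int32, DataType::Int64)
            | (DataType::Float32, DataType::Float64)
            | (DataType::Int32, DataType::Float64)
    );
    if disallowed {
        return Err(ParquetSchemaConvertError {
            column: column.to_string(),
            file_type: file_type.clone(),
            required_type: required_type.clone(),
        });
    }
    Ok(())
}
```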

How are these changes tested?

Verified locally against Apache Spark 3.4.3 with the regenerated diff and ENABLE_COMET=true ENABLE_COMET_ONHEAP=true build/sbt sql/testOnly:

  • ParquetIOSuite > SPARK-35640: int as long should throw schema incompatible error passes.
  • ParquetV1QuerySuite > row group skipping doesn't overflow when reading into larger type passes.
  • ParquetV1QuerySuite > SPARK-36182 and ParquetV1QuerySuite > SPARK-34212 are correctly ignored under the kept tags.
  • Existing schema-adapter Rust tests still pass, and cargo clippy --all-targets --workspace -- -D warnings is clean.

andygrove and others added 5 commits May 5, 2026 12:52
When COMET_SCHEMA_EVOLUTION_ENABLED is false, the native_datafusion scan
path now rejects reading Parquet INT32 as INT64, FLOAT as DOUBLE, and
INT32 as DOUBLE — matching the existing validation in native_iceberg_compat.

The allow_type_promotion flag is passed from JVM via protobuf and checked
in replace_with_spark_cast() before allowing widening casts.

Closes apache#3720

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
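
For orientation, a rough sketch of how the protobuf-provided flag could be carried into the adapter on the native side; all struct and method names below are assumptions, not the actual Comet or generated-protobuf definitions:

```rust
/// Assumed shape of the deserialized common scan options; the real struct is
/// generated from Comet's protobuf definitions.
pub struct ParquetScanOptions {
    /// Set on the JVM side from COMET_SCHEMA_EVOLUTION_ENABLED.
    pub allow_type_promotion: bool,
}

/// Hypothetical adapter that keeps the flag around for later checks.
pub struct CometSchemaAdapter {
    allow_type_promotion: bool,
}

impl CometSchemaAdapter {
    pub fn new(options: &ParquetScanOptions) -> Self {
        Self {
            allow_type_promotion: options.allow_type_promotion,
        }
    }

    /// replace_with_spark_cast consults this before inserting a widening
    /// cast for a mismatched column (see the check sketched earlier).
    pub fn allows_widening(&self) -> bool {
        self.allow_type_promotion
    }
}
```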
Format the SchemaColumnConvertNotSupportedException message produced by
the type-promotion check so it matches Spark's vectorized reader output:
column rendered as [name], expected as Spark catalog string (bigint),
found as Parquet primitive name (INT32). This lets the SPARK-35640 and
"row group skipping doesn't overflow" tests pass, and updates 3.4.3.diff
to remove their IgnoreCometNativeDataFusion tags.

The TimestampLTZ to TimestampNTZ case (SPARK-36182) and decimal
precision/scale case (SPARK-34212) remain ignored, tracked under apache#4219
and apache#3720 respectively. Also reverts the cfg(test) gate on
parquet/util/test_common so the parquet_read benchmark builds.
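
A hedged sketch of the name mapping this commit describes. Only the rendered values ([name], bigint, INT32) come from the commit text; the helper names and the message labels are placeholders, since the real output follows Spark's _LEGACY_ERROR_TEMP_2063 template exactly:

```rust
use arrow::datatypes::DataType;

/// Spark catalog (SQL) name for the requested read type.
fn spark_catalog_name(dt: &DataType) -> &'static str {
    match dt {
        DataType::Int32 => "int",
        DataType::Int64 => "bigint",
        DataType::Float32 => "float",
        DataType::Float64 => "double",
        _ => "unsupported",
    }
}

/// Parquet primitive (physical) name for the type stored in the file.
fn parquet_primitive_name(dt: &DataType) -> &'static str {
    match dt {
        DataType::Int32 => "INT32",
        DataType::Int64 => "INT64",
        DataType::Float32 => "FLOAT",
        DataType::Float64 => "DOUBLE",
        _ => "UNSUPPORTED",
    }
}

fn main() {
    // Reading an INT32 file column "a" as bigint renders roughly as:
    //   column: [a], expected: bigint, found: INT32
    // (labels here are illustrative; the PR matches Spark's exact wording)
    let msg = format!(
        "column: [{}], expected: {}, found: {}",
        "a",
        spark_catalog_name(&DataType::Int64),
        parquet_primitive_name(&DataType::Int32),
    );
    println!("{msg}");
}
```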
Run the 3720-tagged tests in dev/diffs/3.5.8.diff, 4.0.2.diff, and
4.1.1.diff against patched Spark trees with the type-promotion fix
applied, then drop the IgnoreCometNativeDataFusion tag for tests that
now pass and keep it on tests that still fail.

3.5.8: Drop tags for SPARK-35640 (int as long) and "row group skipping
doesn't overflow", repoint SPARK-36182 at issue apache#4219. Same scope as
3.4.3, since the test source matches.

4.0.2 and 4.1.1: Drop tags for SPARK-47447 (TimestampLTZ as TimestampNTZ)
and "row group skipping". 4.1.1 also drops the tag for SPARK-45604
(timestamp_ntz to array<timestamp_ntz>). Tests for SPARK-35640 (binary as
timestamp), SPARK-34212 (decimal precision/scale), the schema-mismatch
vectorized-reader test, and the parameterized ParquetTypeWideningSuite
cases (unsupported parquet conversion, unsupported parquet timestamp
conversion, parquet decimal precision change, parquet decimal precision
and scale change) still fail and remain ignored under apache#3720.
The test reads a partitioned dataset where one partition is an
empty parquet file written with INT32 schema and the other has 10
rows of INT64. Spark's vectorized reader silently skips the type
check for the empty file because no row groups are scanned. The
native_datafusion adapter rejects the INT32 to INT64 promotion at
plan time regardless of file row count, so the test now fails
when allow_type_promotion is false (Spark 3.x default).

Tag the test with IgnoreCometNativeDataFusion under the existing
3720 umbrella in 3.4.3.diff and 3.5.8.diff. Spark 4.x defaults
allow_type_promotion to true so its diffs are unaffected.
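
To make the behavioural difference concrete, here is a small illustration (assumed names) of why the empty partition is rejected: the native check only compares the file schema with the read schema, so it fires even when the file contains zero row groups:

```rust
use arrow::datatypes::{DataType, Field, Schema};

/// Rejects disallowed widenings purely from the two schemas; no row-group
/// data is read, which is why an empty file still triggers the error.
fn validate_no_widening(file_schema: &Schema, read_schema: &Schema) -> Result<(), String> {
    for read_field in read_schema.fields() {
        let Ok(file_field) = file_schema.field_with_name(read_field.name()) else {
            continue; // missing columns are handled by normal schema evolution
        };
        if matches!(
            (file_field.data_type(), read_field.data_type()),
            (DataType::Int32, DataType::Int64)
                | (DataType::Float32, DataType::Float64)
                | (DataType::Int32, DataType::Float64)
        ) {
            return Err(format!(
                "column: [{}] stored as {:?} cannot be read as {:?}",
                read_field.name(),
                file_field.data_type(),
                read_field.data_type()
            ));
        }
    }
    Ok(())
}

fn main() {
    // The empty partition's file schema (INT32) vs. the table's read schema (bigint).
    let file_schema = Schema::new(vec![Field::new("a", DataType::Int32, true)]);
    let read_schema = Schema::new(vec![Field::new("a", DataType::Int64, true)]);
    assert!(validate_no_widening(&file_schema, &read_schema).is_err());
}
```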
@andygrove andygrove marked this pull request as ready for review May 5, 2026 23:04


Development

Successfully merging this pull request may close these issues.

native_datafusion: no error thrown for schema mismatch when reading Parquet with incompatible types
