fix: reject disallowed type promotions in native_datafusion scan #4229
Open
andygrove wants to merge 5 commits into apache:main from
When COMET_SCHEMA_EVOLUTION_ENABLED is false, the native_datafusion scan path now rejects reading Parquet INT32 as INT64, FLOAT as DOUBLE, and INT32 as DOUBLE, matching the existing validation in native_iceberg_compat. The allow_type_promotion flag is passed from the JVM via protobuf and checked in replace_with_spark_cast() before allowing widening casts.

Closes apache#3720

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
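As a rough illustration of the check described above (not Comet's actual code: the enum, function name, and error shape here are simplified stand-ins for what lives in replace_with_spark_cast), the rejection logic amounts to refusing three specific widenings when the flag is off:

```rust
// Hypothetical sketch of the type-promotion check; names are illustrative only.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PrimitiveType {
    Int32,
    Int64,
    Float,
    Double,
}

/// Returns Err for the widening conversions the PR rejects when
/// `allow_type_promotion` is false; Ok otherwise.
fn check_promotion(
    found: PrimitiveType,
    expected: PrimitiveType,
    allow_type_promotion: bool,
) -> Result<(), String> {
    use PrimitiveType::*;
    if allow_type_promotion || found == expected {
        return Ok(());
    }
    match (found, expected) {
        (Int32, Int64) | (Float, Double) | (Int32, Double) => Err(format!(
            "cannot read Parquet {found:?} as {expected:?} without type promotion"
        )),
        _ => Ok(()),
    }
}

fn main() {
    use PrimitiveType::*;
    assert!(check_promotion(Int32, Int64, false).is_err());
    assert!(check_promotion(Int32, Int64, true).is_ok());
    assert!(check_promotion(Float, Double, false).is_err());
    assert!(check_promotion(Int32, Int32, false).is_ok());
    println!("ok");
}
```

With schema evolution enabled the same conversions pass through unchanged, which is why the behavior only differs on Spark 3.x defaults.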
Format the SchemaColumnConvertNotSupportedException message produced by the type-promotion check so it matches Spark's vectorized reader output: column rendered as [name], expected as Spark catalog string (bigint), found as Parquet primitive name (INT32). This lets the SPARK-35640 and "row group skipping doesn't overflow" tests pass, and updates 3.4.3.diff to remove their IgnoreCometNativeDataFusion tags. The TimestampLTZ to TimestampNTZ case (SPARK-36182) and decimal precision/scale case (SPARK-34212) remain ignored, tracked under apache#4219 and apache#3720 respectively. Also reverts the cfg(test) gate on parquet/util/test_common so the parquet_read benchmark builds.
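A minimal sketch of the message formatting described above, assuming Spark's `_LEGACY_ERROR_TEMP_2063` template as recalled here (the function name is hypothetical, and the exact template should be checked against Spark's error-conditions file):

```rust
// Illustrative only: renders the column as [name], the expected type as a
// Spark catalog string (e.g. "bigint"), and the found type as a Parquet
// primitive name (e.g. "INT32"), per the commit message above.
fn convert_error_message(path: &str, column: &str, expected: &str, found: &str) -> String {
    format!(
        "Parquet column cannot be converted in file {path}. \
         Column: [{column}], Expected: {expected}, Found: {found}"
    )
}

fn main() {
    let msg = convert_error_message("part-0000.parquet", "a", "bigint", "INT32");
    assert!(msg.contains("Column: [a], Expected: bigint, Found: INT32"));
    println!("{msg}");
}
```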
Run the 3720-tagged tests in dev/diffs/3.5.8.diff, 4.0.2.diff, and 4.1.1.diff against patched Spark trees with the type-promotion fix applied, then drop the IgnoreCometNativeDataFusion tag for tests that now pass and keep it on tests that still fail.

- 3.5.8: drop tags for SPARK-35640 (int as long) and "row group skipping doesn't overflow"; repoint SPARK-36182 at issue apache#4219. Same scope as 3.4.3, since the test source matches.
- 4.0.2 and 4.1.1: drop tags for SPARK-47447 (TimestampLTZ as TimestampNTZ) and "row group skipping". 4.1.1 also drops the tag for SPARK-45604 (timestamp_ntz to array<timestamp_ntz>).

Tests for SPARK-35640 (binary as timestamp), SPARK-34212 (decimal precision/scale), the schema-mismatch vectorized-reader test, and the parameterized ParquetTypeWideningSuite cases (unsupported parquet conversion, unsupported parquet timestamp conversion, parquet decimal precision change, parquet decimal precision and scale change) still fail and remain ignored under apache#3720.
The test reads a partitioned dataset where one partition is an empty parquet file written with INT32 schema and the other has 10 rows of INT64. Spark's vectorized reader silently skips the type check for the empty file because no row groups are scanned. The native_datafusion adapter rejects the INT32 to INT64 promotion at plan time regardless of file row count, so the test now fails when allow_type_promotion is false (Spark 3.x default). Tag the test with IgnoreCometNativeDataFusion under the existing 3720 umbrella in 3.4.3.diff and 3.5.8.diff. Spark 4.x defaults allow_type_promotion to true so its diffs are unaffected.
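The protobuf plumbing mentioned in the first commit (allow_type_promotion on the common scan options) could look roughly like this; the message name, surrounding fields, and field number are assumptions for illustration, not Comet's actual schema:

```proto
// Hypothetical fragment; the real definition lives in Comet's .proto files.
message NativeScan {
  // ...existing common scan options...

  // When false, the native reader rejects INT32 -> INT64, FLOAT -> DOUBLE,
  // and INT32 -> DOUBLE widening, matching Spark's vectorized reader
  // behavior when COMET_SCHEMA_EVOLUTION_ENABLED is off.
  bool allow_type_promotion = 12;  // field number is illustrative
}
```

Carrying the flag on the scan options means the check happens at plan time on the native side, which is exactly why the empty-partition test above behaves differently from Spark's row-group-driven check.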
Which issue does this PR close?
Closes #3720.
Rationale for this change
When `spark.comet.scan.impl=native_datafusion` and `COMET_SCHEMA_EVOLUTION_ENABLED` is false, several Spark SQL tests that expect `SchemaColumnConvertNotSupportedException` on incompatible Parquet reads pass silently because DataFusion's reader is more permissive and coerces mismatched numeric types instead of erroring. This PR makes the native_datafusion scan path reject the same numeric widening cases that Spark's vectorized reader rejects, and formats the resulting error so it matches Spark's `_LEGACY_ERROR_TEMP_2063` template byte-for-byte.

What changes are included in this PR?
- Pass `COMET_SCHEMA_EVOLUTION_ENABLED` from JVM to native via protobuf (`allow_type_promotion` on the common scan options).
- In `replace_with_spark_cast`, reject `INT32 -> INT64`, `FLOAT -> DOUBLE`, and `INT32 -> DOUBLE` when `allow_type_promotion` is false, raising `SparkError::ParquetSchemaConvert` (mirrors `TypeUtil.checkParquetType` in the JVM code).
- Render the column as `[name]` and emit Spark catalog names (`bigint`, `int`) plus Parquet primitive names (`INT32`, `INT64`) so the message matches Spark's vectorized reader output exactly.
- In `dev/diffs/3.4.3.diff`:
  - Remove `IgnoreCometNativeDataFusion` from `SPARK-35640: int as long` and `row group skipping doesn't overflow when reading into larger type` (now passing).
  - Keep the tag on `SPARK-36182: can't read TimestampLTZ as TimestampNTZ`, repointed at #4219 (native_datafusion more permissive than Spark 3.x when reading Parquet TimestampNTZ columns; out of scope here).
- Revert the `#[cfg(test)]` gate on `parquet/util/test_common` so the `parquet_read` benchmark builds.

How are these changes tested?
Verified locally against Apache Spark 3.4.3 with the regenerated diff and `ENABLE_COMET=true ENABLE_COMET_ONHEAP=true build/sbt sql/testOnly`:

- `ParquetIOSuite > SPARK-35640: int as long should throw schema incompatible error` passes.
- `ParquetV1QuerySuite > row group skipping doesn't overflow when reading into larger type` passes.
- `ParquetV1QuerySuite > SPARK-36182` and `ParquetV1QuerySuite > SPARK-34212` are correctly ignored under the kept tags.
- `cargo clippy --all-targets --workspace -- -D warnings`.