feat: support Parquet field ID matching in native_datafusion scan #4216

Open
mbutrovich wants to merge 8 commits into apache:main from mbutrovich:fix_3434

@mbutrovich mbutrovich commented May 4, 2026

Which issue does this PR close?

Closes #3434. Closes #4189.

Rationale for this change

The native_datafusion scan was the only Comet read path that did not honor Parquet field IDs. When spark.sql.parquet.fieldId.read.enabled=true and the requested schema carried IDs, CometScanRule fell back to Spark's reader (added in #3415). The legacy V1 reader and native_iceberg_compat already handle field-ID matching.

This change resolves columns by parquet.field.id end-to-end on the native_datafusion path, matching Spark's ParquetReadSupport.clipParquetGroupFields precedence rules so behavior stays consistent with the V1 reader.

What changes are included in this PR?

  • Drop the fallback gate in CometScanRule.transformV1Scan that forced Spark to handle reads with field IDs.
  • Propagate field IDs through the proto by adding a metadata map on SparkStructField and a parallel field_metadata on StructInfo, plus use_field_id/ignore_missing_field_id flags on NativeScanCommon. The JVM serde emits parquet.field.id under arrow-rs's PARQUET_FIELD_ID_META_KEY so both sides agree on the key. A new CometParquetUtils.PARQUET_FIELD_ID_META_KEY constant centralizes the string.
  • Match Spark's per-field precedence on the native side: at the top level, SparkPhysicalExprAdapterFactory extends its existing case-insensitive remap with ID-first matching; at nested struct levels, parquet_convert_struct_to_struct does the same. ID-bearing logical fields match only by ID (a missing ID resolves to NULL); non-ID-bearing fields fall back to name. This mirrors clipParquetGroupFields.
  • Spark-compatible error handling for the field-ID edge cases:
    • Multiple Parquet fields sharing one requested ID raise _LEGACY_ERROR_TEMP_2094 "Found duplicate field(s)" (mirrors foundDuplicateFieldInFieldIdLookupModeError).
    • Reading a file with no field IDs while the read schema requests them raises a RuntimeException ("Spark read schema expects field Ids, but Parquet file schema doesn't contain any field Ids"), unless spark.sql.parquet.fieldId.read.ignoreMissing=true.
    • When ignoreMissing=true and the file has no IDs, ID-bearing logical fields resolve to NULL (matching Spark's generateFakeColumnName trick that prevents accidental name fallback).
    • Two new SparkError variants (DuplicateFieldByFieldId, ParquetMissingFieldIds) plus shim cases in all three ShimSparkErrorConverter files (3.4 / 3.5 / 4.x).
  • Common case stays free: the new flags are set only when both spark.sql.parquet.fieldId.read.enabled=true and the requested schema actually carries IDs. When either condition is false, every path runs the same code as before.
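
The precedence and error rules above can be sketched in a few lines of Rust. This is a minimal illustration only, not the code in this PR: `Field`, `Resolution`, `ResolveError`, and `resolve_fields` are hypothetical names, the sketch covers a single flat level (the real change also handles nested structs, and the "file has no IDs" check here looks only at that one level), and the name fallback is assumed to be case-insensitive.

```rust
/// Hypothetical, simplified field: a name plus an optional Parquet field ID.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    field_id: Option<i32>,
}

/// Outcome of resolving one requested field against the file schema.
#[derive(Debug, PartialEq)]
enum Resolution {
    /// Index of the matching field in the file schema.
    Matched(usize),
    /// No match: the column is read as all NULLs.
    Null,
}

/// Illustrative counterparts of the two error cases described above.
#[derive(Debug, PartialEq)]
enum ResolveError {
    DuplicateFieldByFieldId { id: i32, matches: Vec<String> },
    ParquetMissingFieldIds,
}

/// Resolve each requested field against one level of the file schema.
fn resolve_fields(
    requested: &[Field],
    file: &[Field],
    ignore_missing_field_id: bool,
) -> Result<Vec<Resolution>, ResolveError> {
    let request_has_ids = requested.iter().any(|f| f.field_id.is_some());
    let file_has_ids = file.iter().any(|f| f.field_id.is_some());

    // Strict mode: the read schema asks for IDs but the file carries none.
    if request_has_ids && !file_has_ids && !ignore_missing_field_id {
        return Err(ResolveError::ParquetMissingFieldIds);
    }

    requested
        .iter()
        .map(|req| match req.field_id {
            // ID-bearing fields match only by ID, never by name.
            Some(id) => {
                let hits: Vec<usize> = file
                    .iter()
                    .enumerate()
                    .filter(|(_, f)| f.field_id == Some(id))
                    .map(|(i, _)| i)
                    .collect();
                match hits.as_slice() {
                    [] => Ok(Resolution::Null),
                    [i] => Ok(Resolution::Matched(*i)),
                    _ => Err(ResolveError::DuplicateFieldByFieldId {
                        id,
                        matches: hits.iter().map(|&i| file[i].name.clone()).collect(),
                    }),
                }
            }
            // Fields without an ID fall back to a case-insensitive name match.
            None => Ok(file
                .iter()
                .position(|f| f.name.eq_ignore_ascii_case(&req.name))
                .map_or(Resolution::Null, Resolution::Matched)),
        })
        .collect()
}
```

In this sketch, with `ignore_missing_field_id` set and a file that has no IDs, an ID-bearing requested field resolves to `Null` even when a file field has the same name, which is the behavior the `generateFakeColumnName` bullet describes.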

How are these changes tested?

Four new tests in ParquetReadSuite exercise the path under SCAN_NATIVE_DATAFUSION with checkSparkAnswerAndOperator:

  • a primitive case where the read-schema names differ from the file's,
  • a nested case (struct + array + map) with renames at every level,
  • a duplicate-ID case asserting Found duplicate field(s),
  • a missing-IDs case asserting both the strict error and the ignoreMissing=true NULL behavior.

Spark's ParquetFieldIdIOSuite and ParquetFieldIdSchemaSuite are already unignored in dev/diffs/{3.4.3,3.5.8,4.0.2,4.1.1}.diff and currently pass via the Spark fallback; with this change they exercise the native path.

mbutrovich added 8 commits May 4, 2026 17:14
# Conflicts:
#	native/core/src/parquet/parquet_support.rs
#	native/proto/src/proto/operator.proto
#	spark/src/main/scala/org/apache/comet/serde/operator/CometNativeScan.scala
@mbutrovich mbutrovich marked this pull request as ready for review May 5, 2026 14:45
@mbutrovich mbutrovich requested a review from andygrove May 5, 2026 16:12