# feat: support Parquet field ID matching in `native_datafusion` scan (#4216)
**Open** · mbutrovich wants to merge 8 commits into `apache:main`
Merge-conflict resolution touched:

- `native/core/src/parquet/parquet_support.rs`
- `native/proto/src/proto/operator.proto`
- `spark/src/main/scala/org/apache/comet/serde/operator/CometNativeScan.scala`
## Which issue does this PR close?
Closes #3434. Closes #4189.
## Rationale for this change
The `native_datafusion` scan was the only Comet read path that did not honor Parquet field IDs. When `spark.sql.parquet.fieldId.read.enabled=true` and the requested schema carried IDs, `CometScanRule` fell back to Spark's reader (added in #3415). The legacy V1 reader and `native_iceberg_compat` already handle field-ID matching.

This change resolves columns by `parquet.field.id` end-to-end on the `native_datafusion` path, matching the precedence rules of Spark's `ParquetReadSupport.clipParquetGroupFields` so behavior stays consistent with the V1 reader.

## What changes are included in this PR?
- Removes the check in `CometScanRule.transformV1Scan` that forced Spark to handle reads with field IDs.
- Proto changes: a `metadata` map on `SparkStructField` and a parallel `field_metadata` on `StructInfo`, plus `use_field_id`/`ignore_missing_field_id` flags on `NativeScanCommon`. The JVM serde emits `parquet.field.id` under arrow-rs's `PARQUET_FIELD_ID_META_KEY` so both sides agree on the key. A new `CometParquetUtils.PARQUET_FIELD_ID_META_KEY` constant centralizes the string.
- `SparkPhysicalExprAdapterFactory` extends its existing case-insensitive remap with ID-first matching; at nested struct levels, `parquet_convert_struct_to_struct` does the same. ID-bearing logical fields match only by ID (a missing ID resolves to NULL); non-ID-bearing fields fall back to name. This mirrors `clipParquetGroupFields`.
- Duplicate field IDs in a file raise `_LEGACY_ERROR_TEMP_2094` "Found duplicate field(s)" (mirrors `foundDuplicateFieldInFieldIdLookupModeError`).
- A read schema that carries IDs against a file with none raises `RuntimeException("Spark read schema expects field Ids, but Parquet file schema doesn't contain any field Ids")`, unless `spark.sql.parquet.fieldId.read.ignoreMissing=true`.
- With `ignoreMissing=true` and a file without IDs, ID-bearing logical fields resolve to NULL (matching Spark's `generateFakeColumnName` trick that prevents accidental name fallback).
- New `SparkError` variants (`DuplicateFieldByFieldId`, `ParquetMissingFieldIds`) plus shim cases in all three `ShimSparkErrorConverter` files (3.4 / 3.5 / 4.x).
- The new behavior is gated: it applies only when `spark.sql.parquet.fieldId.read.enabled=true` and the requested schema actually carries IDs. When false, all paths take the same code as before.
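The ID-first precedence above can be sketched as follows. This is a hypothetical, simplified model — the `Field` and `Resolution` types and the `resolve` function are illustrative, not Comet's actual code; the real logic lives in `SparkPhysicalExprAdapterFactory` and `parquet_convert_struct_to_struct`.

```rust
// Simplified sketch of the matching precedence described in this PR.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    field_id: Option<i32>,
}

#[derive(Debug, PartialEq)]
enum Resolution {
    /// No match: the logical column is filled with NULLs.
    Null,
    /// Matched a file column, by index into the file schema.
    File(usize),
}

fn resolve(logical: &Field, file_fields: &[Field]) -> Result<Resolution, String> {
    match logical.field_id {
        // ID-bearing logical fields match ONLY by ID; when the ID is absent
        // from the file, resolve to NULL rather than falling back to a
        // same-named column by accident.
        Some(id) => {
            let mut hits = file_fields
                .iter()
                .enumerate()
                .filter(|(_, f)| f.field_id == Some(id));
            match (hits.next(), hits.next()) {
                (None, _) => Ok(Resolution::Null),
                (Some((i, _)), None) => Ok(Resolution::File(i)),
                (Some(_), Some(_)) => Err("Found duplicate field(s)".to_string()),
            }
        }
        // Non-ID-bearing logical fields fall back to case-insensitive
        // name matching, as in the existing remap.
        None => Ok(file_fields
            .iter()
            .position(|f| f.name.eq_ignore_ascii_case(&logical.name))
            .map(Resolution::File)
            .unwrap_or(Resolution::Null)),
    }
}
```

Note the duplicate-ID case surfaces as an error, mirroring the "Found duplicate field(s)" behavior listed above.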
## How are these changes tested?

Four new tests in `ParquetReadSuite` exercise the path under `SCAN_NATIVE_DATAFUSION` with `checkSparkAnswerAndOperator`, covering (among other cases) the "Found duplicate field(s)" error and the `ignoreMissing=true` NULL behavior.

Spark's `ParquetFieldIdIOSuite` and `ParquetFieldIdSchemaSuite` are already unignored in `dev/diffs/{3.4.3,3.5.8,4.0.2,4.1.1}.diff` and currently pass via the Spark fallback; with this change they exercise the native path.