type:devtask empty projection (COUNT(*)) fails on Lance-backed tables #18727

@rahil-c

TL;DR

SELECT COUNT(*) FROM <lance-backed hudi table> fails with:

Lance batch column count 14 does not match expected Spark schema size 0
  for file: .../category=Abyssinian/....lance
  at org.apache.hudi.io.storage.LanceRecordIterator.hasNext(LanceRecordIterator.java:124)

Any query shape that triggers Spark's "no columns needed, just count rows" optimization (COUNT(*), EXISTS, CREATE TABLE AS SELECT 1 FROM ...) blows up on a Lance-backed Hudi table. Parquet-backed tables work fine.

Why it happens

LanceRecordIterator.java:122-127 has a strict equality check when building ColumnVector[]:

StructField[] sparkFields = sparkSchema.fields();
if (sparkFields.length != fieldVectors.size()) {
  throw new HoodieException("Lance batch column count " + fieldVectors.size()
      + " does not match expected Spark schema size " + sparkFields.length + ...);
}

When Spark's optimizer prunes all columns for an aggregate-only read (COUNT, EXISTS), the request arrives with sparkSchema.fields().length == 0, but the Lance file's batch always has the full column set. The reader sees 0 != 14 and throws.
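To make the failure mode concrete, here is a minimal stand-alone sketch of the same equality check. It is not the Hudi code: `StrictColumnCheck` and `check` are hypothetical names, plain `int` counts stand in for `sparkSchema.fields().length` and `fieldVectors.size()`, and `IllegalStateException` stands in for `HoodieException`.

```java
/** Hypothetical stand-in for the strict check quoted above; not the actual Hudi class. */
final class StrictColumnCheck {

  /**
   * Rejects any mismatch between the number of requested Spark fields and the
   * number of columns in the Lance batch, including the empty-projection case.
   */
  static void check(int sparkFieldCount, int batchColumnCount) {
    if (sparkFieldCount != batchColumnCount) {
      // IllegalStateException is a stand-in for HoodieException.
      throw new IllegalStateException("Lance batch column count " + batchColumnCount
          + " does not match expected Spark schema size " + sparkFieldCount);
    }
  }

  public static void main(String[] args) {
    check(14, 14); // full projection: counts agree, no exception

    try {
      check(0, 14); // COUNT(*)-style read: Spark asks for zero columns
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The second call is the shape of the reported failure: zero requested fields is a legitimate request, but the check treats it like any other schema mismatch.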

The Parquet reader already handles this case: ParquetFileFormat has a zero-column fast path that yields N empty rows (where N is the file's row count), so the aggregate can count them without reading any data. Lance needs the equivalent.

Workaround

Use COUNT(<named_col>) instead of COUNT(*). On a non-null primary key the two are semantically equivalent, but the former forces Spark to request one column, satisfying the check.

Proposed fix

In LanceRecordIterator.hasNext():

  • If sparkSchema.fields().length == 0, skip the ColumnVector[] build entirely.
  • Still call arrowReader.loadNextBatch() to advance, and yield one empty row for each row reported by the Arrow VectorSchemaRoot.getRowCount(), so downstream count aggregators work.
  • Add a test in TestLanceDataSource.scala exercising spark.sql("SELECT COUNT(*) FROM …") over a Lance-backed table and df.count() on the same.
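The first two bullets can be sketched roughly as below. This is a simplified, dependency-free model rather than a patch: `EmptyProjectionSketch`, `Batch`, `RowIterator` and `count` are hypothetical names, `Batch` stands in for an Arrow batch (only its row and column counts), `Object[]` stands in for a Spark row, and `IllegalStateException` stands in for `HoodieException`. The real change would have to fit the actual `LanceRecordIterator` fields and row types.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Simplified model of the proposed zero-column path; all names are hypothetical. */
final class EmptyProjectionSketch {

  /** Stand-in for one Arrow batch; only its shape matters here. */
  static final class Batch {
    final int rowCount;
    final int columnCount;

    Batch(int rowCount, int columnCount) {
      this.rowCount = rowCount;
      this.columnCount = columnCount;
    }
  }

  /** Yields one row per batch row, skipping the column check for empty projections. */
  static final class RowIterator implements Iterator<Object[]> {
    private static final Object[] EMPTY_ROW = new Object[0];

    private final Iterator<Batch> batches; // plays the role of arrowReader.loadNextBatch()
    private final int requestedFields;     // plays the role of sparkSchema.fields().length
    private int rowsLeft;                  // rows not yet emitted from the current batch

    RowIterator(Iterator<Batch> batches, int requestedFields) {
      this.batches = batches;
      this.requestedFields = requestedFields;
    }

    @Override
    public boolean hasNext() {
      // Keep advancing until a batch with rows is found; zero-row batches are skipped.
      while (rowsLeft == 0) {
        if (!batches.hasNext()) {
          return false;
        }
        Batch batch = batches.next();
        // Empty projection: no column vectors are needed, so the size check does not apply.
        if (requestedFields != 0 && requestedFields != batch.columnCount) {
          throw new IllegalStateException("Lance batch column count " + batch.columnCount
              + " does not match expected Spark schema size " + requestedFields);
        }
        rowsLeft = batch.rowCount;
      }
      return true;
    }

    @Override
    public Object[] next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      rowsLeft--;
      // With no requested fields, a shared empty row is enough for a counter.
      return requestedFields == 0 ? EMPTY_ROW : new Object[requestedFields];
    }
  }

  /** Consumes the iterator the way a COUNT(*) aggregate would: one increment per row. */
  static long count(int requestedFields, Batch... batches) {
    Iterator<Object[]> rows = new RowIterator(Arrays.asList(batches).iterator(), requestedFields);
    long n = 0;
    while (rows.hasNext()) {
      rows.next();
      n++;
    }
    return n;
  }
}
```

In this model the mismatch check is kept for non-empty projections, so a genuine schema disagreement still fails loudly; only the zero-field request bypasses it. Whether the real iterator should keep the strict check in that case is a design choice for the fix.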

Related code paths

Environment

  • Hudi master @ commit 4d0e9cd47f9e
  • Spark datasource path with Lance-backed base files
