type:devtask empty projection (COUNT(*)) fails on Lance-backed tables #18727

## TL;DR

`SELECT COUNT(*) FROM <lance-backed hudi table>` fails with:

```
Lance batch column count 14 does not match expected Spark schema size 0
  for file: .../category=Abyssinian/....lance
  at org.apache.hudi.io.storage.LanceRecordIterator.hasNext(LanceRecordIterator.java:124)
```

Any query shape that triggers Spark's "no columns needed, just count rows" optimization (`COUNT(*)`, `EXISTS`, `CREATE TABLE AS SELECT 1 FROM ...`) blows up on a Lance-backed Hudi table. Parquet-backed tables work fine.

## Why it happens

[`LanceRecordIterator.java:122-127`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java#L122-L127) has a strict equality check when building `ColumnVector[]`:

```java
StructField[] sparkFields = sparkSchema.fields();
if (sparkFields.length != fieldVectors.size()) {
  throw new HoodieException("Lance batch column count " + fieldVectors.size()
      + " does not match expected Spark schema size " + sparkFields.length + ...);
}
```

When Spark's optimizer prunes all columns for a read that only needs row counts (`COUNT(*)`, `EXISTS`), the request arrives with `sparkSchema.fields().length == 0`, but the Lance batch still carries the file's full column set (14 columns in the trace above). The reader sees `0 != 14` and throws.

The Parquet reader handles this case: `ParquetFileFormat` has a zero-column fast path that yields N empty rows (where N is the row count), so the aggregate can count them without reading any column data. Lance needs the equivalent.

## Workaround

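Use `COUNT(<named_col>)` instead of `COUNT(*)`. `COUNT(<named_col>)` skips rows where the column is NULL, so the two return the same result only for a column that is never null, such as a primary key. Naming a column forces Spark to request one field, so the projection is no longer empty and the check passes.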
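The non-null caveat can be seen in a small model of the two aggregates. This is plain Java over an in-memory list, not Spark code; `CountSemanticsSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical model of COUNT(*) vs COUNT(col); not Spark or Hudi code. */
class CountSemanticsSketch {

  /** Models COUNT(*): every row is counted, whatever its values. */
  static long countStar(List<String> column) {
    return column.size();
  }

  /** Models COUNT(col): rows where the column is NULL are skipped. */
  static long countColumn(List<String> column) {
    return column.stream().filter(Objects::nonNull).count();
  }
}
```

On a column with no nulls both methods return the same number; as soon as one value is null, `countColumn` comes out lower, which is why the workaround should name a non-nullable column.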

## Proposed fix

In `LanceRecordIterator.hasNext()`:
- If `sparkSchema.fields().length == 0`, skip the `ColumnVector[]` build entirely.
- Still call `arrowReader.loadNextBatch()` to advance, and yield empty rows matching the Arrow `VectorSchemaRoot.getRowCount()` so downstream count aggregators work.
- Add a test in [`TestLanceDataSource.scala`](https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala) exercising `spark.sql("SELECT COUNT(*) FROM …")` over a Lance-backed table and `df.count()` on the same.
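
The steps above can be sketched as a simplified, self-contained model. Everything in it is hypothetical: `EmptyProjectionSketch`, `Batch`, `Row`, and `SketchIterator` are stand-ins invented for illustration, not the real `LanceRecordIterator` or Arrow classes, and a batch is reduced to a row count and a column count. It only shows the intended control flow: advance through batches as before, skip the column-count check when no fields are requested, and emit one empty row per row in the batch.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical model of the proposed empty-projection branch; not the real Hudi class. */
class EmptyProjectionSketch {

  /** Stand-in for one Arrow batch: only the two counts matter here. */
  static final class Batch {
    final int rowCount;
    final int columnCount;

    Batch(int rowCount, int columnCount) {
      this.rowCount = rowCount;
      this.columnCount = columnCount;
    }
  }

  /** Stand-in for a row; an empty projection yields rows with zero fields. */
  static final class Row {
    final int numFields;

    Row(int numFields) {
      this.numFields = numFields;
    }
  }

  static final class SketchIterator implements Iterator<Row> {
    private final Iterator<Batch> batches; // models advancing with loadNextBatch()
    private final int requestedFields;     // models sparkSchema.fields().length
    private int rowsLeftInBatch = 0;

    SketchIterator(List<Batch> batches, int requestedFields) {
      this.batches = batches.iterator();
      this.requestedFields = requestedFields;
    }

    @Override
    public boolean hasNext() {
      // Keep loading batches until one has rows; empty batches are skipped.
      while (rowsLeftInBatch == 0 && batches.hasNext()) {
        Batch batch = batches.next();
        // Empty projection: skip the column check (and the vector build it guards).
        if (requestedFields != 0 && requestedFields != batch.columnCount) {
          throw new IllegalStateException("batch column count " + batch.columnCount
              + " does not match expected schema size " + requestedFields);
        }
        rowsLeftInBatch = batch.rowCount;
      }
      return rowsLeftInBatch > 0;
    }

    @Override
    public Row next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      rowsLeftInBatch--;
      return new Row(requestedFields);
    }
  }

  /** Counts rows the way a downstream COUNT(*) aggregator would consume the iterator. */
  static long count(List<Batch> batches, int requestedFields) {
    long n = 0;
    Iterator<Row> it = new SketchIterator(batches, requestedFields);
    while (it.hasNext()) {
      it.next();
      n++;
    }
    return n;
  }
}
```

In this model, `count(batches, 0)` over batches of 2, 0, and 3 rows returns 5 even though each batch reports 14 columns, while a non-empty request that disagrees with the batch still fails the strict check. A real patch would need to produce whatever row type the Spark side expects, which this sketch does not attempt.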

## Related code paths

- [`LanceRecordIterator.java`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java)
- [`HoodieSparkLanceReader.java`](https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceReader.java)
- [`TestLanceDataSource.scala`](https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala)

## Environment

- Hudi `master` @ commit `4d0e9cd47f9e`
- Spark datasource path with Lance-backed base files
