TL;DR
SELECT COUNT(*) FROM <lance-backed hudi table> fails with:
Lance batch column count 14 does not match expected Spark schema size 0
for file: .../category=Abyssinian/....lance
at org.apache.hudi.io.storage.LanceRecordIterator.hasNext(LanceRecordIterator.java:124)
Any query shape that triggers Spark's "no columns needed, just count rows" optimization (COUNT(*), EXISTS, CREATE TABLE AS SELECT 1 FROM ...) blows up on a Lance-backed Hudi table. Parquet-backed tables work fine.
Why it happens
LanceRecordIterator.java:122-127 has a strict equality check when building ColumnVector[]:
StructField[] sparkFields = sparkSchema.fields();
if (sparkFields.length != fieldVectors.size()) {
throw new HoodieException("Lance batch column count " + fieldVectors.size()
+ " does not match expected Spark schema size " + sparkFields.length + ...);
}
When Spark's optimizer prunes all columns for an aggregate-only read (COUNT, EXISTS), the request arrives with sparkSchema.fields().length == 0, but the Lance file's batch always has the full column set. The reader sees 0 != 14 and throws.
The Parquet reader handles this naturally — ParquetFileFormat has a zero-column fast path where it just yields N empty rows (where N is the row count) so the aggregate can count them without reading any data. Lance needs the equivalent.
Workaround
Use COUNT(<named_col>) instead of COUNT(*). On a non-null primary key the two are semantically equivalent, but the former forces Spark to request one column, satisfying the check.
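The "non-null" qualifier matters because SQL's COUNT(<col>) skips NULL values while COUNT(*) counts every row. A minimal plain-Java illustration of that difference follows; it models the SQL semantics only, is not Spark or Hudi code, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Objects;

/** Illustrative model of SQL COUNT semantics; not Spark or Hudi code. */
final class CountSemanticsSketch {

    /** COUNT(*): every row counts, whatever the column holds. */
    static long countStar(List<String> column) {
        return column.size();
    }

    /** COUNT(col): only rows whose value is non-null count. */
    static long countColumn(List<String> column) {
        return column.stream().filter(Objects::nonNull).count();
    }
}
```

If the chosen column can contain NULLs, the workaround under-counts, so pick a column that is guaranteed non-null, such as the record key.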
Proposed fix
In LanceRecordIterator.hasNext():
- If sparkSchema.fields().length == 0, skip the ColumnVector[] build entirely.
- Still call arrowReader.loadNextBatch() to advance, and yield empty rows matching the Arrow VectorSchemaRoot.getRowCount() so downstream count aggregators work.
- Add a test in TestLanceDataSource.scala exercising spark.sql("SELECT COUNT(*) FROM …") over a Lance-backed table and df.count() on the same.
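The control flow of the first two bullets can be sketched in plain Java. This is a rough illustration under stated assumptions, not the actual LanceRecordIterator code: FakeBatch stands in for a loaded Arrow VectorSchemaRoot, the batch iterator stands in for arrowReader.loadNextBatch(), requestedColumns stands in for sparkSchema.fields().length, and rows are modeled as Object[] rather than Spark's row types. All class and method names are hypothetical.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for one loaded Arrow batch. */
final class FakeBatch {
    final int rowCount;     // models VectorSchemaRoot.getRowCount()
    final int columnCount;  // models fieldVectors.size()

    FakeBatch(int rowCount, int columnCount) {
        this.rowCount = rowCount;
        this.columnCount = columnCount;
    }
}

/** Sketch of a row iterator with a zero-column fast path. */
final class ZeroColumnAwareIterator implements Iterator<Object[]> {
    private static final Object[] EMPTY_ROW = new Object[0];

    private final Iterator<FakeBatch> batches; // models arrowReader.loadNextBatch()
    private final int requestedColumns;        // models sparkSchema.fields().length
    private int rowsLeftInBatch = 0;

    ZeroColumnAwareIterator(Iterator<FakeBatch> batches, int requestedColumns) {
        this.batches = batches;
        this.requestedColumns = requestedColumns;
    }

    @Override
    public boolean hasNext() {
        // Keep advancing until a batch with rows is found; empty batches are skipped.
        while (rowsLeftInBatch == 0) {
            if (!batches.hasNext()) {
                return false;
            }
            FakeBatch batch = batches.next();
            // Enforce the strict check only when columns were actually requested.
            if (requestedColumns != 0 && requestedColumns != batch.columnCount) {
                throw new IllegalStateException("batch column count " + batch.columnCount
                    + " does not match expected schema size " + requestedColumns);
            }
            rowsLeftInBatch = batch.rowCount;
        }
        return true;
    }

    @Override
    public Object[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        rowsLeftInBatch--;
        // Zero-column read: one shared empty row per input row.
        // Otherwise a placeholder row; the real reader would wrap column vectors here.
        return requestedColumns == 0 ? EMPTY_ROW : new Object[requestedColumns];
    }

    /** Drains the iterator the way a count aggregate would. */
    static long count(List<FakeBatch> batches, int requestedColumns) {
        Iterator<Object[]> rows = new ZeroColumnAwareIterator(batches.iterator(), requestedColumns);
        long n = 0;
        while (rows.hasNext()) {
            rows.next();
            n++;
        }
        return n;
    }
}
```

With a zero-column request the sketch returns the summed batch row counts without touching any column data, while a genuine mismatch (for example 3 requested against 14 present) still fails fast.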
Related code paths
- LanceRecordIterator.java
- HoodieSparkLanceReader.java
- TestLanceDataSource.scala
Environment
- Hudi master @ commit 4d0e9cd47f9e
- Spark datasource path with Lance-backed base files