fix: support multivector IVF centroids in segment builds#6995
Conversation
|
@claude review |
|
|
||
| def _vector_dimension(self, data_type): | ||
| if pa.types.is_fixed_size_list(data_type): | ||
| return data_type.list_size | ||
| if pa.types.is_list(data_type) and pa.types.is_fixed_size_list( | ||
| data_type.value_type | ||
| ): | ||
| return data_type.value_type.list_size | ||
| if isinstance(data_type, pa.FixedShapeTensorType) and len(data_type.shape) == 1: | ||
| return data_type.shape[0] | ||
| raise TypeError( | ||
| "Vector column must be FixedSizeListArray " | ||
| f"1-dimensional FixedShapeTensorArray, got {data_type}" |
There was a problem hiding this comment.
🔴 Three call sites in IndicesBuilder still use self.dataset.schema.field(self.column[0]).type.list_size directly instead of the new _vector_dimension helper, so calling train_pq (line 204), transform_vectors (line 364), or load_shuffled_vectors (line 457) on a multivector column raises AttributeError: ListType object has no attribute list_size. These are exactly the distributed-build steps the PR description says it enables for multivectors — the new test only exercises create_index_uncommitted (Rust path), so CI doesn't catch the regression. Fix: replace each with self.dimension (already set correctly via _vector_dimension in __init__) and delete the dead expression statement at line 204.
Extended reasoning...
What's wrong
This PR adds a _vector_dimension helper (builder.py:599-611) to support multivector columns whose type is pa.list_(pa.list_(pa.float32(), N)). It's correctly invoked from __init__ to set self.dimension. However, three other places in IndicesBuilder still reach into the raw column type with the pre-PR pattern self.dataset.schema.field(self.column[0]).type.list_size:
- builder.py:204 in
train_pq— a bare expression whose value is discarded (dead code), but the attribute access still executes. - builder.py:364 in
transform_vectors— assigned todimensionand passed toindices.transform_vectors. - builder.py:457 in
load_shuffled_vectors— assigned topq_dimensionand passed toindices.load_shuffled_vectors.
For a multivector column the outer arrow type is a variable-size ListType (not FixedSizeListType). ListType has no list_size attribute — only the inner FixedSizeListType does — so this raises AttributeError.
Why CI doesn't catch it
The new test_ivf_centroids_multivector_fragment_ids only constructs an IndicesBuilder (to check builder.dimension == 4) and then calls ds.create_index_uncommitted(...), which goes through the Rust path prepare_vector_index_params rather than the Python distributed-build helpers. None of train_pq, transform_vectors, or load_shuffled_vectors are exercised on a multivector column anywhere in the test suite.
Impact
The PR description explicitly states that the change "allows IndicesBuilder to infer the inner vector type for distributed builds." But transform_vectors → shuffle_transformed_vectors → load_shuffled_vectors is the distributed build pipeline, and train_pq is the PQ training step for it. With these three sites still using the broken pattern, the PR's stated goal is only half-achieved — the Rust-side centroid cast is fixed, but a Python user invoking the distributed-build helpers on a multivector dataset will crash before reaching the Rust code.
Step-by-step proof
import pyarrow as pa
# Multivector column type (matches the new test):
mv_type = pa.list_(pa.list_(pa.float32(), 4))
print(type(mv_type).__name__) # 'ListType'
mv_type.list_size # AttributeError: 'pyarrow.lib.ListType' object has no attribute 'list_size'So when a user runs the documented distributed-build flow on a multivector dataset:
builder = IndicesBuilder(ds, "embeddings") # ok — uses _vector_dimension
ivf = builder.train_ivf(num_partitions=2)
pq = builder.train_pq(ivf) # AttributeError at builder.py:204…the call fails at the very first distributed-build step. Even if line 204 is deleted, transform_vectors and load_shuffled_vectors would hit the same error at lines 364 and 457.
Fix
Each of the three sites should use self.dimension, which __init__ already populates correctly via the new _vector_dimension helper. The line at builder.py:204 is a vestigial expression statement (its result is unused) and should simply be deleted.
🔬 also observed by verify-runtime
| f"1-dimensional FixedShapeTensorArray, got {data_type}" | ||
| ) | ||
|
|
||
| def _vector_value_type(self, data_type): |
There was a problem hiding this comment.
The two method bodies have the same judgment logic. Maybe we can extract them into a common place, e.g. _describe_vector_type(data_type) -> Optional[(dim, scalar_type)].
There was a problem hiding this comment.
Could there still be any problems here?
yanghua
left a comment
There was a problem hiding this comment.
One comment left, otherwise LGTM.
| "Vector column must be FixedSizeListArray " | ||
| f"1-dimensional FixedShapeTensorArray, got {data_type}" |
There was a problem hiding this comment.
What about this? Vector column must be FixedSizeListArray, list<FixedSizeList> (multivector), or 1-dimensional FixedShapeTensorArray, got {data_type}?
There was a problem hiding this comment.
Thanks for reviewing, fixed this.
| } | ||
|
|
||
| let centroid_type = match column_type { | ||
| DataType::List(field) |
There was a problem hiding this comment.
Maybe we can also support LargeList<FixedSizeList>. But not must have it in this PR.
There was a problem hiding this comment.
Thanks for pointing this out. I checked the Rust and Python vector index paths, and LargeList is not supported end-to-end today. Most multivector code only handles List.
This PR keeps the scope to fixing the existing List distributed/segment index build issue. LargeList support should be handled separately.
Distributed IVF index builds failed for multivector columns when using precomputed centroids:
This fixes the centroid type handling for
list<fixed_size_list<...>>columns and allowsIndicesBuilderto infer the inner vector type for distributed builds.