-
Notifications
You must be signed in to change notification settings - Fork 687
fix: support multivector IVF centroids in segment builds #6995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
9af404d
f1f3939
df54f7b
690bd8f
2382b6f
eab5e8f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4337,17 +4337,27 @@ fn prepare_vector_index_params( | |
| )); | ||
| } | ||
|
|
||
| let centroid_type = match column_type { | ||
| DataType::List(field) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we can also support
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for pointing this out. I checked the Rust and Python vector index paths, and LargeList is not supported end-to-end today. Most multivector code only handles List. This PR keeps the scope to fixing the existing List distributed/segment index build issue. LargeList support should be handled separately. |
||
| if matches!(field.data_type(), DataType::FixedSizeList(_, _)) => | ||
| { | ||
| field.data_type() | ||
| } | ||
| _ => column_type, | ||
| }; | ||
|
|
||
| // It's important that the centroids are the same data type | ||
| // as the vectors that will be indexed. | ||
| let mut centroids: Arc<dyn Array> = batch.column(0).clone(); | ||
| if centroids.data_type() != column_type { | ||
| centroids = cast_with_options(centroids.as_ref(), column_type, &Default::default()) | ||
| .map_err(|e| { | ||
| PyValueError::new_err(format!( | ||
| "Failed to cast centroids to column type: {}", | ||
| e | ||
| )) | ||
| })?; | ||
| if centroids.data_type() != centroid_type { | ||
| centroids = | ||
| cast_with_options(centroids.as_ref(), centroid_type, &Default::default()) | ||
| .map_err(|e| { | ||
| PyValueError::new_err(format!( | ||
| "Failed to cast centroids to vector type: {}", | ||
| e | ||
| )) | ||
| })?; | ||
| } | ||
| let centroids = as_fixed_size_list_array(centroids.as_ref()); | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 Three call sites in IndicesBuilder still use
self.dataset.schema.field(self.column[0]).type.list_sizedirectly instead of the new_vector_dimensionhelper, so callingtrain_pq(line 204),transform_vectors(line 364), orload_shuffled_vectors(line 457) on a multivector column raisesAttributeError: ListType object has no attribute list_size. These are exactly the distributed-build steps the PR description says it enables for multivectors — the new test only exercisescreate_index_uncommitted(Rust path), so CI doesn't catch the regression. Fix: replace each withself.dimension(already set correctly via_vector_dimensionin__init__) and delete the dead expression statement at line 204.Extended reasoning...
What's wrong
This PR adds a
_vector_dimensionhelper (builder.py:599-611) to support multivector columns whose type ispa.list_(pa.list_(pa.float32(), N)). It's correctly invoked from__init__to setself.dimension. However, three other places inIndicesBuilderstill reach into the raw column type with the pre-PR patternself.dataset.schema.field(self.column[0]).type.list_size:train_pq— a bare expression whose value is discarded (dead code), but the attribute access still executes.transform_vectors— assigned todimensionand passed toindices.transform_vectors.load_shuffled_vectors— assigned topq_dimensionand passed toindices.load_shuffled_vectors.For a multivector column the outer arrow type is a variable-size
ListType(notFixedSizeListType).ListTypehas nolist_sizeattribute — only the innerFixedSizeListTypedoes — so this raisesAttributeError.Why CI doesn't catch it
The new
test_ivf_centroids_multivector_fragment_idsonly constructs anIndicesBuilder(to checkbuilder.dimension == 4) and then callsds.create_index_uncommitted(...), which goes through the Rust pathprepare_vector_index_paramsrather than the Python distributed-build helpers. None oftrain_pq,transform_vectors, orload_shuffled_vectorsare exercised on a multivector column anywhere in the test suite.Impact
The PR description explicitly states that the change "allows IndicesBuilder to infer the inner vector type for distributed builds." But
transform_vectors → shuffle_transformed_vectors → load_shuffled_vectorsis the distributed build pipeline, andtrain_pqis the PQ training step for it. With these three sites still using the broken pattern, the PR's stated goal is only half-achieved — the Rust-side centroid cast is fixed, but a Python user invoking the distributed-build helpers on a multivector dataset will crash before reaching the Rust code.Step-by-step proof
So when a user runs the documented distributed-build flow on a multivector dataset:
…the call fails at the very first distributed-build step. Even if line 204 is deleted,
transform_vectorsandload_shuffled_vectorswould hit the same error at lines 364 and 457.Fix
Each of the three sites should use
self.dimension, which__init__already populates correctly via the new_vector_dimensionhelper. The line at builder.py:204 is a vestigial expression statement (its result is unused) and should simply be deleted.🔬 also observed by verify-runtime