Multi segment cagra search by jamxia155 · Pull Request #133 · NVIDIA/cuvs-lucene

jamxia155 · 2026-04-22T23:26:04Z

Companion PR to cuvs!2035, addresses #124.

Existing CAGRA search code path for each search query:

1. Call CAGRA search API on one index segment
2. Copy results back to host
3. Add results into host-side global top-k priority queue
4. Repeat for all index segments

Proposed change:

1. Leverage new multi-segment CAGRA search API to launch all per-segment searches in one API call
2. Leave results on device and run GPU-accelerated select-k API to compute global top-k
3. Copy final top-k results to host
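For reference, the host-side merge that the existing path performs can be sketched as a bounded min-heap. This is a minimal illustration, not code from this PR: the `HostTopKMerge` class and `Hit` record are hypothetical, and the sketch assumes similarity scores where higher is better (a distance-based merge would flip the comparison). The proposed change moves this selection step onto the GPU so that only the final k results cross to the host.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration of a host-side global top-k merge; not code from this PR. */
final class HostTopKMerge {

  /** One hit: segment index, segment-local ordinal, score (assumed: higher is better). */
  record Hit(int segment, int ord, float score) {}

  static List<Hit> mergeTopK(List<List<Hit>> perSegmentHits, int k) {
    if (k <= 0) {
      return new ArrayList<>();
    }
    Comparator<Hit> byScore = Comparator.comparingDouble(Hit::score);
    // Min-heap on score: the head is always the weakest hit currently kept.
    PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
    for (List<Hit> segmentHits : perSegmentHits) { // one pass per index segment
      for (Hit hit : segmentHits) {
        if (heap.size() < k) {
          heap.add(hit);
        } else if (hit.score() > heap.peek().score()) {
          heap.poll(); // evict the weakest hit
          heap.add(hit);
        }
      }
    }
    List<Hit> result = new ArrayList<>(heap);
    result.sort(byScore.reversed()); // best hit first
    return result;
  }
}
```

In the per-segment path, every inner list has to be copied from device to host before this loop can see it, which is the cost the single multi-segment call is meant to avoid.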

Overrides rewrite() to run all segment searches into a shared device buffer and merge with cuvsSelectK entirely on GPU, eliminating per-segment D2H copies and CPU-side TopDocs.merge(). Falls back to the standard Lucene per-segment path when any segment lacks a CAGRA index, an explicit filter is set, or k > 1024.

Also adds ordToDoc() and getCagraIndexForField() helpers to CuVS2510GPUVectorsReader to support result decoding.

Fixes for Lucene 10.2 API changes:

- CodecReader moved to org.apache.lucene.index.
- createRewrittenQuery() removed and replaced with an inline docAndScoreQuery() implementation using the public Weight/ScorerSupplier API.

- CuVS2510GPUVectorsFormat: call CuVSProvider.provider().enableRMMAsyncMemory() in the static initializer so that cuda_async_memory_resource is active for the lifetime of the codec. This makes CAGRA workspace deallocations stream-ordered and non-blocking, which is required for the CudaStreamPool to provide any parallelism benefit.
- GPUKnnFloatVectorQuery: upload the query vector to device once before the per-segment loop and share the resulting CuVSMatrix across all CagraQuery instances, reducing host-to-device copies from O(numSegments) to 1 per query. Wrap the shared device matrix in try-with-resources to close the RMM allocation promptly after MultiSegmentCagraSearch.search() returns.
- FilterCuVSProvider: delegate enableRMMAsyncMemory() to the wrapped provider.

GPUKnnFloatVectorQuery / GPUPerLeafCuVSKnnCollector:

- Add persistent, persistentLifetime, and persistentDeviceUsage parameters, threaded through all constructor overloads and forwarded to CagraSearchParams.Builder in both the multi-segment rewrite() path and the per-segment approximateSearch() fallback path.
- Add threadBlockSize parameter (0 = auto) to allow tuning of the persistent kernel's worker_queue_size, which determines how many concurrent query threads can run without latency increase.

Fix persistent-runner hash instability across segments (rewrite() path):

- When max_iterations is 0 (auto), CAGRA computes it from each segment's dataset size. Different-sized segments produce different values, causing a distinct runner hash per segment and a destroy/recreate cycle on every search call.
- Add computeMaxIterations(), which mirrors adjust_search_params() from search_plan.cuh, and call it once using the largest segment's graph size and degree. All segments then share the same max_iterations, producing a stable runner hash across the full multi-segment query.

CuVS2510GPUVectorsReader:

- Forward threadBlockSize, persistent, persistentLifetime, and persistentDeviceUsage from GPUPerLeafCuVSKnnCollector to CagraSearchParams.Builder in the per-segment fallback path.

Remove persistent kernel mode:

- Drop persistent, persistentLifetime, and persistentDeviceUsage fields and parameters from GPUKnnFloatVectorQuery, GPUPerLeafCuVSKnnCollector, and CuVS2510GPUVectorsReader. The persistent kernel is superseded by the native multi-segment kernel (cuvsCagraSearchMultiSegment), which achieves better concurrency without the per-runner lifecycle overhead.
- Collapse the 11-argument GPUKnnFloatVectorQuery constructor (which only existed to accept persistent parameters) into the standard 8-argument form.
- Remove stale comments that described max_iterations uniformity in terms of persistent-runner hash stability; replace with accurate explanation (consistent search quality across segments of different sizes).

Add workspace pool configuration:

- Add WORKSPACE_POOL_SIZE_PROPERTY constant to ThreadLocalCuVSResourcesProvider.
- On resources creation, read com.nvidia.cuvs.workspacePoolSize system property and call setWorkspacePool() if set, so callers can pre-warm the per-thread RMM pool without modifying cuvs-lucene source.
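A minimal sketch of the property lookup described above, under stated assumptions: the property name comes from the commit message, but the `WorkspacePoolConfig` class, the interpretation of the value as a byte count, and the choice to ignore missing or malformed values are illustrative guesses rather than the PR's actual behavior. The call to setWorkspacePool() is left out because its signature is not shown here.

```java
import java.util.OptionalLong;

/** Hypothetical helper; only the property name is taken from the commit message. */
final class WorkspacePoolConfig {

  static final String WORKSPACE_POOL_SIZE_PROPERTY = "com.nvidia.cuvs.workspacePoolSize";

  /**
   * Returns the configured pool size (assumed to be in bytes), or empty when the
   * property is unset, blank, non-numeric, or not positive.
   */
  static OptionalLong readWorkspacePoolSize() {
    String raw = System.getProperty(WORKSPACE_POOL_SIZE_PROPERTY);
    if (raw == null || raw.isBlank()) {
      return OptionalLong.empty();
    }
    try {
      long bytes = Long.parseLong(raw.trim());
      return bytes > 0 ? OptionalLong.of(bytes) : OptionalLong.empty();
    } catch (NumberFormatException e) {
      // Assumption: a malformed value is treated as "not configured".
      return OptionalLong.empty();
    }
  }
}
```

A caller would then set the property on the JVM command line, for example `-Dcom.nvidia.cuvs.workspacePoolSize=1073741824`, and apply the value only when the returned `OptionalLong` is present.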

copy-pr-bot · 2026-04-22T23:26:08Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.


GPUKnnFloatVectorQuery previously fell back to the per-segment CPU path whenever a filter query was present. This change extends the GPU fast path to handle filters by packing the accepted-ordinal bitset for every segment into a single FilterBitsetHandle and passing it to MultiSegmentCagraSearch.

FilterBitsetCache (new): 16-entry shared LRU cache keyed by (filter Query, per-segment reader-cache keys, field). Per-reader keys are used rather than core keys so that liveDocs changes caused by a reader reopen automatically invalidate the cached bitset.

GPUKnnFloatVectorQuery changes:

- Remove the early bail-out for filter != null
- Add buildOrGetCachedFilterHandle: checks FilterBitsetCache, falls back to buildFilterHandle on miss
- Add buildFilterHandle: evaluates the filter Weight per segment via FloatVectorValues.getAcceptOrds (intersecting with liveDocs), then packs the result into combinedLongs / segBitOffsets for FilterBitsetHandle
- When a handle is present, CagraQuery is built without a per-query prefilter (the bitset already encodes the intersection)
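The caching idea can be sketched with a bounded, access-ordered `LinkedHashMap`. This is an illustration only, not the PR's FilterBitsetCache: the class name, the `Key` record, and the generic value type are hypothetical, and the sketch does not release evicted values, which a real cache holding device-side handles would presumably need to do.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical bounded LRU cache; not the FilterBitsetCache from this PR. */
final class LruFilterCacheSketch<V> {

  /** Composite key: filter query, per-segment reader-cache keys, field name. */
  record Key(Object filterQuery, List<Object> readerKeys, String field) {}

  private final Map<Key, V> map;

  LruFilterCacheSketch(int maxEntries) {
    // accessOrder = true makes iteration order least-recently-used first.
    this.map =
        new LinkedHashMap<Key, V>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<Key, V> eldest) {
            return size() > maxEntries; // evict once the bound is exceeded
          }
        };
  }

  synchronized V get(Key key) {
    return map.get(key); // a hit also refreshes the entry's recency
  }

  synchronized void put(Key key, V value) {
    map.put(key, value);
  }

  synchronized int size() {
    return map.size();
  }
}
```

Because the key includes the per-segment reader keys, a reader reopen yields a different key and therefore a miss, so the stale entry simply ages out of the cache.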

cuVS's search_multi_partition was refactored to take a single queries matrix and return globally merged top-k outputs (partition_ids, neighbors, distances) instead of per-partition vectors of matrix views. This commit updates GPUKnnFloatVectorQuery to match.

Changes in rewrite():

- Build one shared CagraQuery (queryVector + searchParams), not one per Lucene segment; drop the per-segment construction loop.
- The remaining loop only collects the CagraIndex per segment for the cuVS partition list.
- Single call to MultiPartitionCagraSearch.search(...), passing the optional FilterBitsetHandle as the last argument (no more filterHandle != null branching between overloads).

Filter handling:

- Since a single shared CagraQuery cannot carry per-segment liveDocs via withPrefilter anymore, fold liveDocs into the FilterBitsetHandle alongside any explicit Lucene filter. The handle is now built whenever filter != null OR any segment has deletes.
- buildFilterHandle extended to accept filter == null, in which case acceptDocs is just liveDocs per segment.
- Removed buildCagraQuery helper that constructed per-segment liveDocs bitsets; the unused java.util.BitSet import is removed with it.
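The folding of liveDocs into the filter bitset can be sketched on plain `long[]` words. Everything here is an assumption made for illustration: the class and method names are hypothetical, both inputs are taken to be already expressed in vector-ordinal space, and the packed layout (each segment starting on a 64-bit word boundary, offsets counted in bits) is a guess, since the actual FilterBitsetHandle layout is not shown in this PR text. Only the names combinedLongs and segBitOffsets come from the commit message.

```java
/** Hypothetical sketch of per-segment accept-bitset construction and packing. */
final class FilterPackingSketch {

  /** Packed bitsets for all segments plus each segment's starting bit offset. */
  record Packed(long[] combinedLongs, long[] segBitOffsets) {}

  /** Intersects an optional filter with optional liveDocs; null means "all bits set". */
  static long[] acceptWords(long[] filterWords, long[] liveWords, int numOrds) {
    int numWords = (numOrds + 63) >>> 6;
    long[] out = new long[numWords];
    for (int w = 0; w < numWords; w++) {
      long f = filterWords == null ? -1L : filterWords[w];
      long l = liveWords == null ? -1L : liveWords[w];
      out[w] = f & l;
    }
    int tail = numOrds & 63;
    if (tail != 0) {
      out[numWords - 1] &= (1L << tail) - 1; // clear padding bits past numOrds
    }
    return out;
  }

  /** Concatenates per-segment words; assumes word-aligned segments (illustrative layout). */
  static Packed pack(long[][] perSegmentWords) {
    long[] offsets = new long[perSegmentWords.length];
    int totalWords = 0;
    for (int s = 0; s < perSegmentWords.length; s++) {
      offsets[s] = (long) totalWords * 64;
      totalWords += perSegmentWords[s].length;
    }
    long[] combined = new long[totalWords];
    int pos = 0;
    for (long[] words : perSegmentWords) {
      System.arraycopy(words, 0, combined, pos, words.length);
      pos += words.length;
    }
    return new Packed(combined, offsets);
  }
}
```

With this shape, the filter == null case needs no special path: passing null for the filter words leaves just the liveDocs bits for that segment.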

The k>1024 short-circuit at the top of rewrite() forced large-k queries through Lucene's per-leaf approximateSearch fallback, where they fail either at the codec reader's k<=1024 gate or, when that gate doesn't fire, by trying to use a brute-force index that callers typically do not build. The multi-partition path is the right place to serve these queries: with N segments at SINGLE_CTA's itopk_size cap of 512, the per-partition × num_segments candidate pool is more than enough to assemble a global top-k well above 1024.

Companion to the cuVS change that decouples the multi-partition kernel's per-partition output count from the global topk, allowing each partition to emit up to itopk_size candidates and merging across partitions to produce the final top-k.

- Drop the k>1024 early-return in rewrite(). Feasibility is now enforced by cuVS's itopk_size * num_partitions >= topk RAFT_EXPECTS in search_multi_partition.
- Clamp itopk_size to 512 when building CagraSearchParams, since SINGLE_CTA rejects larger values. The clamp is a no-op when the caller-requested itopk_size is already <= 512.
- Update the class Javadoc to reflect that the multi-partition path is no longer k-capped.

The per-leaf approximateSearch fallback's k<=1024 gate and brute-force-index assumption are unchanged; they fire only when some segment lacks a GPU reader, which is independent of this change.

This reverts commit 7f394de.

computeMaxIterations was added to stabilize the persistent-kernel runner hash across segments by pre-computing max_iterations on the Java side from the largest segment, so every segment received the same value. Persistent-kernel support was subsequently removed, eliminating that rationale.

The cuVS multi-partition C++ entry (search_multi_partition in cagra_search.cuh) already iterates partitions to find max_dataset_size and max_graph_degree, then calls adjust_search_params(), which produces the same max_iterations value the Java side was computing. Passing max_iterations = 0 (the CagraSearchParams default) lets that path run naturally.

- Delete the per-segment loop that computed max_dataset_size and graph_degree.
- Drop .withMaxIterations() from the multi-partition search builder.
- Delete the computeMaxIterations helper and its Javadoc.

No behavior change; the C++ side computes the identical value.

The gate forced k>1024 queries through Lucene's per-leaf fallback, where they hit the codec reader's k<=1024 check and tried to invoke a brute-force index that GPU-indexed segments typically do not write, producing a confusing NPE.

cuVS now handles k>1024 in the multi-partition path via MULTI_KERNEL, routed automatically by the AUTO heuristic on itopk_size or selected explicitly via cagraSearchAlgo. Feasibility is validated at the C++ side per algo: SINGLE_CTA still requires itopk_size * num_partitions >= topk, MULTI_KERNEL requires topk <= itopk_size. Failures surface as RAFT_EXPECTS messages instead of NPEs.
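The two per-algorithm feasibility conditions stated above can be written out as a small predicate. This is a sketch for reasoning about parameter choices on the Java side, not the cuVS validation code: the `TopKFeasibilitySketch` class and its `Algo` enum are hypothetical, and the real checks run in C++ via RAFT_EXPECTS.

```java
/** Hypothetical restatement of the per-algorithm feasibility rules; not cuVS code. */
final class TopKFeasibilitySketch {

  enum Algo {
    SINGLE_CTA,
    MULTI_KERNEL
  }

  /** Returns true when the requested topK can be served under the stated rule for algo. */
  static boolean isFeasible(Algo algo, int topK, int itopkSize, int numPartitions) {
    return switch (algo) {
      // Each partition contributes up to itopkSize candidates to the global merge.
      case SINGLE_CTA -> (long) itopkSize * numPartitions >= topK;
      // The internal top-k list must be at least as large as the requested topK.
      case MULTI_KERNEL -> topK <= itopkSize;
    };
  }
}
```

For example, with itopk_size = 512 a SINGLE_CTA search over 4 segments can supply up to 2048 candidates, so k = 2000 passes the check, while the same request over 3 segments (1536 candidates) does not.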


jamxia155 added 5 commits April 22, 2026 16:05

Expose CAGRA SearchAlgo parameter

7a8b26b

jamxia155 mentioned this pull request Apr 22, 2026

Multi partition cagra search NVIDIA/cuvs#2035

Draft

jamxia155 and others added 10 commits May 8, 2026 17:54

Merge remote-tracking branch 'origin/main' into multi-segment-cagra-s…

19773fa

…earch

Modify example to demo prefilter.

9abe58a

Revert "Allow rewrite() to use multi-partition path for k beyond 1024"

1b0355a

This reverts commit 7f394de.

Merge remote-tracking branch 'origin/main' into multi-segment-cagra-s…

32e07bf

…earch

Updates to pom to build snapshot

51e03b1

jamxia155 wants to merge 15 commits into NVIDIA:main from jamxia155:multi-segment-cagra-search
