Multi segment cagra search #133

Draft
jamxia155 wants to merge 15 commits into NVIDIA:main from jamxia155:multi-segment-cagra-search

@jamxia155

Companion PR to cuvs!2035, addresses #124.

Existing CAGRA search code path for each search query:

  • Call CAGRA search API on one index segment
  • Copy results back to host
  • Add results into host-side global top-k priority queue
  • Repeat for all index segments
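
The host-side merge in the existing path can be sketched as a bounded min-heap. This is a minimal illustration, not the cuvs-lucene code: `HostTopKMerge` and `Hit` are hypothetical names, and scores are assumed to be "higher is better".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: HostTopKMerge and Hit are hypothetical, not cuvs-lucene classes. */
final class HostTopKMerge {
  /** One search hit: a doc ID plus a score where higher is better (assumption). */
  record Hit(int doc, float score) {}

  /** Merges per-segment hit lists into a global top-k, best score first. */
  static List<Hit> merge(List<List<Hit>> perSegment, int k) {
    if (k <= 0) {
      return new ArrayList<>();
    }
    Comparator<Hit> byScore = Comparator.comparingDouble(Hit::score);
    // Min-heap on score: the head is the weakest hit currently kept.
    PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
    for (List<Hit> segmentHits : perSegment) { // one segment search + D2H copy each
      for (Hit h : segmentHits) {
        if (heap.size() < k) {
          heap.add(h);
        } else if (h.score() > heap.peek().score()) {
          heap.poll(); // evict the weakest kept hit
          heap.add(h);
        }
      }
    }
    List<Hit> out = new ArrayList<>(heap);
    out.sort(byScore.reversed());
    return out;
  }
}
```

Every segment contributes a device-to-host copy before its hits reach the heap, which is the cost the proposed change removes.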

Proposed change:

  • Leverage new multi-segment CAGRA search API to launch all per-segment searches in one API call
  • Leave results on device and run GPU-accelerated select-k API to compute global top-k
  • Copy final top-k results to host

  Overrides rewrite() to run all segment searches into a shared device
  buffer and merge with cuvsSelectK entirely on GPU, eliminating
  per-segment D2H copies and CPU-side TopDocs.merge(). Falls back to the
  standard Lucene per-segment path when any segment lacks a CAGRA index,
  an explicit filter is set, or k > 1024.

  Also adds ordToDoc() and getCagraIndexForField() helpers to
  CuVS2510GPUVectorsReader to support result decoding.

  Fixes for Lucene 10.2 API changes: CodecReader moved to
  org.apache.lucene.index; createRewrittenQuery() removed and replaced
  with an inline docAndScoreQuery() implementation using the public
  Weight/ScorerSupplier API.

- CuVS2510GPUVectorsFormat: call CuVSProvider.provider().enableRMMAsyncMemory()
  in the static initializer so that cuda_async_memory_resource is active for
  the lifetime of the codec. This makes CAGRA workspace deallocations
  stream-ordered and non-blocking, which is required for the CudaStreamPool
  to provide any parallelism benefit.

- GPUKnnFloatVectorQuery: upload the query vector to device once before the
  per-segment loop and share the resulting CuVSMatrix across all CagraQuery
  instances, reducing host-to-device copies from O(numSegments) to 1 per
  query. Wrap the shared device matrix in try-with-resources to close the
  RMM allocation promptly after MultiSegmentCagraSearch.search() returns.
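
The upload-once pattern can be sketched without any GPU dependency. In the sketch below `FakeDeviceMatrix` is a hypothetical stand-in for the shared device matrix; it only counts simulated host-to-device copies and records when it is closed.

```java
import java.util.List;

/** Illustrative only: FakeDeviceMatrix stands in for a device-resident query matrix. */
final class SharedQueryUpload {
  static int uploads = 0; // counts simulated host-to-device copies

  static final class FakeDeviceMatrix implements AutoCloseable {
    final float[] data;
    boolean closed = false;

    FakeDeviceMatrix(float[] host) {
      this.data = host.clone(); // stands in for the H2D copy
      uploads++;
    }

    @Override
    public void close() {
      closed = true; // stands in for releasing the device allocation
    }
  }

  /** Runs one simulated search per segment, all reading the same uploaded query. */
  static int searchAllSegments(float[] query, List<String> segments) {
    int searched = 0;
    try (FakeDeviceMatrix deviceQuery = new FakeDeviceMatrix(query)) {
      for (String segment : segments) {
        // A real implementation would pass deviceQuery to this segment's search.
        if (!deviceQuery.closed) {
          searched++;
        }
      }
    } // deviceQuery is closed here, right after the last segment search
    return searched;
  }
}
```

The number of uploads stays at one regardless of how many segments are searched.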

- FilterCuVSProvider: delegate enableRMMAsyncMemory() to the wrapped provider.

GPUKnnFloatVectorQuery / GPUPerLeafCuVSKnnCollector:
- Add persistent, persistentLifetime, and persistentDeviceUsage
  parameters, threaded through all constructor overloads and forwarded
  to CagraSearchParams.Builder in both the multi-segment rewrite() path
  and the per-segment approximateSearch() fallback path.
- Add threadBlockSize parameter (0 = auto) to allow tuning of the
  persistent kernel's worker_queue_size, which determines how many
  concurrent query threads can run without latency increase.

Fix persistent-runner hash instability across segments (rewrite() path):
- When max_iterations is 0 (auto), CAGRA computes it from each
  segment's dataset size. Different-sized segments produce different
  values, causing a distinct runner hash per segment and a
  destroy/recreate cycle on every search call.
- Add computeMaxIterations(), which mirrors adjust_search_params() from
  search_plan.cuh, and call it once using the largest segment's graph
  size and degree. All segments then share the same max_iterations,
  producing a stable runner hash across the full multi-segment query.

CuVS2510GPUVectorsReader:
- Forward threadBlockSize, persistent, persistentLifetime, and
  persistentDeviceUsage from GPUPerLeafCuVSKnnCollector to
  CagraSearchParams.Builder in the per-segment fallback path.

Remove persistent kernel mode:
- Drop persistent, persistentLifetime, and persistentDeviceUsage fields
  and parameters from GPUKnnFloatVectorQuery, GPUPerLeafCuVSKnnCollector,
  and CuVS2510GPUVectorsReader. The persistent kernel is superseded by
  the native multi-segment kernel (cuvsCagraSearchMultiSegment) which
  achieves better concurrency without the per-runner lifecycle overhead.
- Collapse the 11-argument GPUKnnFloatVectorQuery constructor (which only
  existed to accept persistent parameters) into the standard 8-argument
  form.
- Remove stale comments that described max_iterations uniformity in terms
  of persistent-runner hash stability; replace with accurate explanation
  (consistent search quality across segments of different sizes).

Add workspace pool configuration:
- Add WORKSPACE_POOL_SIZE_PROPERTY constant to
  ThreadLocalCuVSResourcesProvider.
- On resources creation, read com.nvidia.cuvs.workspacePoolSize system
  property and call setWorkspacePool() if set, so callers can pre-warm
  the per-thread RMM pool without modifying cuvs-lucene source.
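
Reading the property might look roughly like the sketch below. The property name and constant name come from the description above; the parsing, the validation, and the assumption that the value is a positive integer size are illustrative, and the real provider would pass the value to setWorkspacePool() rather than return it.

```java
import java.util.OptionalLong;

/** Illustrative only: shows the property lookup, not the actual provider code. */
final class WorkspacePoolConfig {
  static final String WORKSPACE_POOL_SIZE_PROPERTY = "com.nvidia.cuvs.workspacePoolSize";

  /** Returns the configured pool size, or empty when the property is unset. */
  static OptionalLong configuredPoolSize() {
    String raw = System.getProperty(WORKSPACE_POOL_SIZE_PROPERTY);
    if (raw == null || raw.isBlank()) {
      return OptionalLong.empty(); // property not set: leave the default pool alone
    }
    long size = Long.parseLong(raw.trim());
    if (size <= 0) {
      throw new IllegalArgumentException(
          WORKSPACE_POOL_SIZE_PROPERTY + " must be positive, got " + size);
    }
    return OptionalLong.of(size);
  }
}
```

A caller would then start the JVM with something like `-Dcom.nvidia.cuvs.workspacePoolSize=1073741824`; the unit of that value is not stated here and is assumed for the example.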

copy-pr-bot Bot commented Apr 22, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

jamxia155 and others added 10 commits May 8, 2026 17:54

GPUKnnFloatVectorQuery previously fell back to the per-segment CPU path
whenever a filter query was present. This change extends the GPU fast path
to handle filters by packing the accepted-ordinal bitset for every segment
into a single FilterBitsetHandle and passing it to MultiSegmentCagraSearch.

FilterBitsetCache (new): 16-entry shared LRU cache keyed by (filter Query,
per-segment reader-cache keys, field). Per-reader keys are used rather than
core keys so that liveDocs changes caused by a reader reopen automatically
invalidate the cached bitset.
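
A small LRU cache with this key shape can be sketched on top of `LinkedHashMap` in access order. The class and record names below are hypothetical, and the real FilterBitsetCache may be built differently; only the 16-entry capacity and the key components are taken from the description.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a bounded LRU map keyed by (filter, reader keys, field). */
final class FilterBitsetLruCache<V> {
  /** Composite key; readerKeys holds one per-reader cache key per segment. */
  record Key(Object filter, List<Object> readerKeys, String field) {}

  private final int capacity;
  private final Map<Key, V> map;

  FilterBitsetLruCache(int capacity) {
    this.capacity = capacity;
    // accessOrder=true keeps iteration order least-recently-used first.
    this.map = new LinkedHashMap<Key, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Key, V> eldest) {
        return size() > FilterBitsetLruCache.this.capacity;
      }
    };
  }

  synchronized V get(Key key) {
    return map.get(key); // also marks the entry as most recently used
  }

  synchronized void put(Key key, V value) {
    map.put(key, value); // may evict the least recently used entry
  }

  synchronized int size() {
    return map.size();
  }
}
```

Because a reopened reader carries a different per-reader key, a lookup after a reopen misses, and the stale entry eventually ages out through LRU eviction.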

GPUKnnFloatVectorQuery changes:
- Remove the early bail-out for filter != null
- Add buildOrGetCachedFilterHandle: checks FilterBitsetCache, falls back
  to buildFilterHandle on miss
- Add buildFilterHandle: evaluates the filter Weight per segment via
  FloatVectorValues.getAcceptOrds (intersecting with liveDocs), then packs
  the result into combinedLongs / segBitOffsets for FilterBitsetHandle
- When a handle is present, CagraQuery is built without a per-query
  prefilter (the bitset already encodes the intersection)
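
The packing step can be sketched as follows. The names combinedLongs and segBitOffsets come from the description above; the layout (each segment starting on a 64-bit word boundary, offsets expressed in bits) is an assumption made for the example, and `FilterBitsetPacking` is a hypothetical class.

```java
/** Illustrative only: packs per-segment accept bitsets into one long[] plus bit offsets. */
final class FilterBitsetPacking {
  /** All segments' words back to back, plus the starting bit of each segment. */
  record Packed(long[] combinedLongs, long[] segBitOffsets) {}

  /** Concatenates the per-segment words; each segment starts word-aligned (assumption). */
  static Packed pack(long[][] perSegmentWords) {
    int totalWords = 0;
    for (long[] words : perSegmentWords) {
      totalWords += words.length;
    }
    long[] combined = new long[totalWords];
    long[] offsets = new long[perSegmentWords.length];
    int wordPos = 0;
    for (int s = 0; s < perSegmentWords.length; s++) {
      offsets[s] = (long) wordPos * Long.SIZE; // first bit belonging to segment s
      System.arraycopy(perSegmentWords[s], 0, combined, wordPos, perSegmentWords[s].length);
      wordPos += perSegmentWords[s].length;
    }
    return new Packed(combined, offsets);
  }

  /** Tests whether vector ordinal ord of the given segment is accepted. */
  static boolean accepts(Packed p, int segment, int ord) {
    long bit = p.segBitOffsets()[segment] + ord;
    return (p.combinedLongs()[(int) (bit >>> 6)] & (1L << (bit & 63))) != 0;
  }
}
```

With this layout a single device buffer serves all segments, and a lookup needs only the segment index and the ordinal.
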
cuVS's search_multi_partition was refactored to take a single queries
matrix and return globally merged top-k outputs (partition_ids,
neighbors, distances) instead of per-partition vectors of matrix views.
This commit updates GPUKnnFloatVectorQuery to match.

Changes in rewrite():
- Build one shared CagraQuery (queryVector + searchParams), not one
  per Lucene segment; drop the per-segment construction loop.
- The remaining loop only collects the CagraIndex per segment for the
  cuVS partition list.
- Single call to MultiPartitionCagraSearch.search(...), passing the
  optional FilterBitsetHandle as the last argument (no more
  filterHandle != null branching between overloads).
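
Decoding the merged outputs back into Lucene doc IDs can be sketched like this. The array element types, the `MergedResultDecoder` name, and the representation of the ordinal-to-doc mapping as an `int[]` table per segment are assumptions; docBase is the usual Lucene per-leaf offset.

```java
/** Illustrative only: maps (partition id, ordinal) pairs to index-wide doc IDs. */
final class MergedResultDecoder {
  /**
   * @param partitionIds segment index of each hit
   * @param neighbors vector ordinal of each hit within its segment
   * @param docBases Lucene docBase of each segment
   * @param ordToDoc per-segment ordinal to segment-local doc ID table; null means identity
   */
  static int[] toGlobalDocs(
      int[] partitionIds, int[] neighbors, int[] docBases, int[][] ordToDoc) {
    int[] docs = new int[neighbors.length];
    for (int i = 0; i < neighbors.length; i++) {
      int seg = partitionIds[i];
      int[] table = ordToDoc[seg];
      // Dense segments map ordinal i to doc i; sparse ones need the lookup table.
      int localDoc = (table == null) ? neighbors[i] : table[neighbors[i]];
      docs[i] = docBases[seg] + localDoc;
    }
    return docs;
  }
}
```

The distances array keeps the same positional order as the decoded doc IDs, so scores can be attached by index.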

Filter handling:
- Since a single shared CagraQuery cannot carry per-segment liveDocs
  via withPrefilter anymore, fold liveDocs into the FilterBitsetHandle
  alongside any explicit Lucene filter. The handle is now built whenever
  filter != null OR any segment has deletes.
- buildFilterHandle extended to accept filter == null, in which case
  acceptDocs is just liveDocs per segment.
- Removed buildCagraQuery helper that constructed per-segment liveDocs
  bitsets; the unused java.util.BitSet import is removed with it.
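
The folding rule can be sketched on raw bitset words. `AcceptDocsFolding` is a hypothetical helper; a null array stands for "all bits set", which is how a missing filter or a segment without deletes is modeled here.

```java
/** Illustrative only: combines an optional filter with optional liveDocs per segment. */
final class AcceptDocsFolding {
  /** A handle is needed when a filter exists or any segment has deleted docs. */
  static boolean needsHandle(boolean hasFilter, boolean[] segmentHasDeletes) {
    if (hasFilter) {
      return true;
    }
    for (boolean hasDeletes : segmentHasDeletes) {
      if (hasDeletes) {
        return true;
      }
    }
    return false;
  }

  /** Returns filter AND liveDocs over numBits bits; a null input means "accept all". */
  static long[] acceptWords(long[] filterWords, long[] liveWords, int numBits) {
    int numWords = (numBits + 63) >>> 6;
    long[] out = new long[numWords];
    for (int i = 0; i < numWords; i++) {
      long f = (filterWords == null) ? -1L : filterWords[i];
      long l = (liveWords == null) ? -1L : liveWords[i];
      out[i] = f & l;
    }
    int tail = numBits & 63;
    if (tail != 0 && numWords > 0) {
      out[numWords - 1] &= (1L << tail) - 1; // clear bits past the last valid ordinal
    }
    return out;
  }
}
```

With filter == null the result is just liveDocs, matching the extended buildFilterHandle behavior described above.
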
The k>1024 short-circuit at the top of rewrite() forced large-k queries
through Lucene's per-leaf approximateSearch fallback, where they fail
either at the codec reader's k<=1024 gate or, when that gate doesn't
fire, by trying to use a brute-force index that callers typically do
not build. The multi-partition path is the right place to serve these
queries: each of the N segments can emit up to itopk_size candidates
(capped at 512 under SINGLE_CTA), so the combined pool of up to 512 × N
candidates can supply a global top-k well above 1024 once N is large
enough.

Companion to the cuVS change that decouples the multi-partition kernel's
per-partition output count from the global topk, allowing each partition
to emit up to itopk_size candidates and merging across partitions to
produce the final top-k.

- Drop the k>1024 early-return in rewrite(). Feasibility is now enforced
  by cuVS's itopk_size * num_partitions >= topk RAFT_EXPECTS in
  search_multi_partition.
- Clamp itopk_size to 512 when building CagraSearchParams, since
  SINGLE_CTA rejects larger values. The clamp is a no-op when the
  caller-requested itopk_size is already <= 512.
- Update the class Javadoc to reflect that the multi-partition path is
  no longer k-capped.

The per-leaf approximateSearch fallback's k<=1024 gate and brute-force-
index assumption are unchanged; they fire only when some segment lacks
a GPU reader, which is independent of this change.

computeMaxIterations was added to stabilize the persistent-kernel
runner hash across segments by pre-computing max_iterations on the
Java side from the largest segment, so every segment received the
same value. Persistent-kernel support was subsequently removed,
eliminating that rationale.

The cuVS multi-partition C++ entry (search_multi_partition in
cagra_search.cuh) already iterates partitions to find max_dataset_size
and max_graph_degree, then calls adjust_search_params(), which produces
the same max_iterations value the Java side was computing. Passing
max_iterations = 0 (the CagraSearchParams default) lets that path run
naturally.

- Delete the per-segment loop that computed max_dataset_size and
  graph_degree.
- Drop .withMaxIterations() from the multi-partition search builder.
- Delete the computeMaxIterations helper and its Javadoc.

No behavior change; the C++ side computes the identical value.

The gate forced k>1024 queries through Lucene's per-leaf fallback,
where they hit the codec reader's k<=1024 check and tried to invoke
a brute-force index that GPU-indexed segments typically do not
write, which produced a confusing NPE.

cuVS now handles k>1024 in the multi-partition path via MULTI_KERNEL,
routed automatically by the AUTO heuristic on itopk_size or selected
explicitly via cagraSearchAlgo. Feasibility is validated at the C++
side per algo: SINGLE_CTA still requires itopk_size * num_partitions
>= topk, MULTI_KERNEL requires topk <= itopk_size. Failures surface
as RAFT_EXPECTS messages instead of NPEs.
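
The per-algorithm rules can be mirrored in a Java-side sketch for readers who want to see them side by side. This is not the cuVS validation code: `MultiPartitionFeasibility`, its enum, and the exception messages are hypothetical, and only the two conditions themselves are taken from the text above.

```java
/** Illustrative only: restates the per-algo feasibility rules described above. */
final class MultiPartitionFeasibility {
  enum Algo { SINGLE_CTA, MULTI_KERNEL }

  /** Throws IllegalArgumentException when the request cannot be served. */
  static void check(Algo algo, int itopkSize, int numPartitions, int topK) {
    switch (algo) {
      case SINGLE_CTA -> {
        // Each partition emits up to itopk_size candidates; the pool must cover topk.
        if ((long) itopkSize * numPartitions < topK) {
          throw new IllegalArgumentException(
              "SINGLE_CTA needs itopk_size * num_partitions >= topk");
        }
      }
      case MULTI_KERNEL -> {
        if (topK > itopkSize) {
          throw new IllegalArgumentException("MULTI_KERNEL needs topk <= itopk_size");
        }
      }
    }
  }
}
```

A check like this would only give an earlier, Java-level error; the authoritative validation remains the RAFT_EXPECTS on the C++ side.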

3 participants