Improve CAGRA-Q performance and add support for PQ_LEN=8 by enp1s0 · Pull Request #1533 · NVIDIA/cuvs

enp1s0 · 2025-11-12T04:25:03Z

This PR:

- Introduces the E5M2 data type as the internal storage type to improve performance
- Adds support for PQ_LEN = 8 to achieve a higher compression ratio

E5M2 as smem data type

Using a lower-precision data type helps reduce shared memory bank conflicts and can improve throughput.
Since the quantization error from VQ+PQ is typically larger than the representation error of E5M2, the impact on search recall is expected to be negligible.
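
To give a feel for the representation error mentioned above, here is a minimal host-side sketch (not code from this PR). It decodes an E5M2 byte (1 sign bit, 5 exponent bits with bias 15, 2 mantissa bits) and quantizes a float by brute-force nearest-value search. The helper names `e5m2_to_float` and `float_to_e5m2_nearest` are hypothetical. Device code would normally use the conversions in `cuda_fp8.h`, and the tie-breaking here (lowest code wins) is a simplification, not round-to-nearest-even.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one E5M2 byte to float: 1 sign bit, 5 exponent bits (bias 15), 2 mantissa bits.
inline float e5m2_to_float(std::uint8_t b)
{
  const int sign = b >> 7;
  const int exp  = (b >> 2) & 0x1F;
  const int man  = b & 0x3;
  float v;
  if (exp == 0x1F) {
    v = (man == 0) ? INFINITY : NAN;  // all-ones exponent: infinity or NaN
  } else if (exp == 0) {
    v = std::ldexp(static_cast<float>(man), -16);  // subnormal: (man / 4) * 2^-14
  } else {
    v = std::ldexp(1.0f + man / 4.0f, exp - 15);  // normal: (1 + man / 4) * 2^(exp - 15)
  }
  return sign ? -v : v;
}

// Hypothetical helper: nearest finite E5M2 code for x, found by scanning all 256 codes.
// Ties resolve to the lowest code, which is simpler than round-to-nearest-even.
inline std::uint8_t float_to_e5m2_nearest(float x)
{
  assert(std::isfinite(x));
  std::uint8_t best = 0;
  float best_err    = INFINITY;
  for (int code = 0; code < 256; ++code) {
    const float v = e5m2_to_float(static_cast<std::uint8_t>(code));
    if (!std::isfinite(v)) { continue; }
    const float err = std::fabs(v - x);
    if (err < best_err) {
      best_err = err;
      best     = static_cast<std::uint8_t>(code);
    }
  }
  return best;
}
```

With 2 mantissa bits, neighbouring values are spaced a quarter of the leading power of two apart, so the relative rounding error for normal values stays below 12.5%. For example, 1.2 rounds to 1.25 and 1.1 rounds to 1.0.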

Support for PQ_LEN=8

The current cuVS implementation supports only PQ_LEN = 2 (4 bits per vector element) and 4 (2 bits per vector element).
This PR adds support for PQ_LEN = 8 to enable a higher compression ratio (1 bit per vector element).
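
The bits-per-element figures above follow from dividing the PQ code width by PQ_LEN. Here is a small sketch of that arithmetic, assuming 8-bit PQ codes (an assumption that matches the quoted numbers). The function names are hypothetical, and the ratio ignores the VQ code and codebook storage.

```cpp
#include <cassert>
#include <cstdint>

// Bits of PQ code stored per original vector element:
// one pq_bits-wide code covers pq_len consecutive elements.
constexpr double pq_bits_per_element(std::uint32_t pq_bits, std::uint32_t pq_len)
{
  assert(pq_len > 0);
  return static_cast<double>(pq_bits) / pq_len;
}

// Compression ratio of the PQ codes relative to an uncompressed element of elem_bits bits.
constexpr double pq_compression_ratio(std::uint32_t elem_bits,
                                      std::uint32_t pq_bits,
                                      std::uint32_t pq_len)
{
  return elem_bits / pq_bits_per_element(pq_bits, pq_len);
}

static_assert(pq_bits_per_element(8, 2) == 4.0, "PQ_LEN = 2 -> 4 bits per element");
static_assert(pq_bits_per_element(8, 4) == 2.0, "PQ_LEN = 4 -> 2 bits per element");
static_assert(pq_bits_per_element(8, 8) == 1.0, "PQ_LEN = 8 -> 1 bit per element");
```

Under the same assumption, `pq_compression_ratio(32, 8, 8)` gives 32 for an fp32 dataset, twice the ratio of PQ_LEN = 4.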

coderabbitai · 2026-04-26T13:23:31Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Add a configurable shared-memory dtype (F16/E5M2) for CAGRA VPQ: new enum and search param, packed-fp8 utilities, descriptor and cache updates, JIT-LTO planner dispatch and fragment-tagging, threaded SmemDType through setup_workspace and compute_distance device kernels, build/template instantiation updates, kernel matrix metadata, and tests.

Changes

CAGRA VPQ Shared-Memory Dtype Support

Layer / File(s)	Summary
Enum and search parameters for smem dtype selection `cpp/include/cuvs/neighbors/cagra.hpp`, `cpp/src/neighbors/detail/cagra/cagra_search.cuh`	Adds `internal_dtype` and `search_params::smem_dtype`; strided search enforces F16.
Packed fp8 utilities and JIT fragment tags `cpp/src/neighbors/detail/cagra/packed_type.hpp`, `cpp/include/cuvs/detail/jit_lto/cagra/cagra_fragments.hpp`	Adds fp8 packing utilities (`fp8xN`) and `tag_smem_f16`/`tag_smem_e5m2`; fragment tag templates accept SmemTag.
Descriptor host and shared-memory ops `cpp/src/neighbors/detail/cagra/compute_distance.hpp`, `cpp/src/neighbors/detail/cagra/device_memory_ops.hpp`	`dataset_descriptor_host` gains `smem_dtype`; add 32/64-bit shared-memory load/store helpers.
VPQ smem value config and descriptor types `cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh`	Introduce `vpq_smem_value_config` trait, add `SmemDType` template arg to `cagra_q_dataset_descriptor_t`, and update SMEM sizing and descriptor init/launch wiring.
vpq_descriptor_spec template and priority filtering `cpp/src/neighbors/detail/cagra/compute_distance_vpq.hpp`	`vpq_descriptor_spec` gains `SmemDType` template parameter and rejects candidates whose SmemDType doesn't match runtime `params.smem_dtype`.
Build system and template instantiation `cpp/CMakeLists.txt`, `cpp/src/neighbors/detail/cagra/compute_distance_vpq_inst.cu.in`	Thread `@smem_dtype@` into explicit instantiations and extend output filename / JIT kernel name/fragment tag formats to include smem/codebook tokens.
Descriptor cache key update `cpp/src/neighbors/detail/cagra/factory.cuh`	Add `smem_dtype` to cache key and include it in make_key/operator==/key_hash.
JIT-LTO launcher factory wiring `cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_jit_launcher_factory.hpp`	Pass `dataset_desc.smem_dtype` into planner registrations for VPQ tag paths.
JIT-LTO planner dispatch helpers `cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp`	Add smem dtype->tag mapping, split standard vs VPQ team-dim dispatchers, accept smem_dtype in VPQ setup/compute overloads, extend pq_len to {2,4,8}.
Compute-distance device kernel implementation `cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_impl.cuh`, `cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_kernel.cu.in`	Parameterize with SmemDType, derive PQ_CODEBOOK_LOAD_T and packed shared-memory types, rewrite even-PQ_LEN distance accumulation to use packed loads and generalized indexing; add k_smem_dtype specialization.
Setup-workspace device kernel implementation `cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_impl.cuh`, `cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_kernel.cu.in`	Parameterize with SmemDType; generalize codebook/query staging to use smem_val_pack_t and num_packed_elements with transpose/write variants.
Kernel matrices and metadata `cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json`, `cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json`, `cpp/src/neighbors/detail/cagra/compute_distance_vpq_matrix.json`	Add `_smem` metadata entries (F16/E5M2) and expand pq_len/team-dim coverage (including pq_len=8).
Tests and utilities `cpp/tests/neighbors/vpq_utils.cuh`, `cpp/tests/neighbors/ann_cagra.cuh`, `cpp/tests/neighbors/ann_utils.cuh`	Add VPQ decode utility, make ann_cagra VPQ-aware (compute reference_recall from decoded dataset), propagate smem_dtype in tests, and expand default pq_len choices.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

rapidsai/cuvs#1807: Related JIT-LTO CAGRA planner/fragment infrastructure changes referenced by this PR.

Suggested labels

improvement, C++, cpp

Suggested reviewers

divyegala
KyleFromNVIDIA
dantegd

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main objectives: introducing E5M2 support and adding PQ_LEN=8 support for improved CAGRA-Q performance.
Description check	✅ Passed	The description provides relevant context for both major changes: E5M2 as internal storage type and PQ_LEN=8 support, with rationale for each.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

cpp/src/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh (1)
489-507: ⚠️ Potential issue | 🟡 Minor

Match the _CLK_BREAKDOWN placeholders with arguments.

The format string now prints both distance and hash, but this call only passes one uint64_t after clk_pickup_parents. With _CLK_BREAKDOWN enabled, that is undefined behavior and will emit garbage timing data.
💡 Suggested fix
       clk_init,
       clk_compute_1st_distance,
       clk_topk,
       clk_pickup_parents,
+      clk_compute_actual_distance,
       clk_compute_distance - clk_compute_actual_distance);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh` around lines
489 - 507, The printf call in search_multi_cta_kernel-inl.cuh prints both
"distance" and "hash" but only passes one uint64_t after clk_pickup_parents,
causing mismatched arguments when _CLK_BREAKDOWN is enabled; update the printf
argument list used in the printf near the debug block (the printf that
references __FILE__, __LINE__, query_id, threadIdx.x and clk_* variables) to
pass both timing values (e.g., clk_compute_distance and
clk_compute_actual_distance or the intended clk_hash value) so the number and
order of format specifiers match the provided arguments; ensure the variables
clk_init, clk_compute_1st_distance, clk_topk, clk_pickup_parents,
clk_compute_distance, and clk_compute_actual_distance (or the correct hash
timing variable) are all supplied in the same order as the format string.
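
One way to catch this class of mismatch early is to count the conversion specifiers at compile time. This is a sketch only, not something the PR does. The helpers `count_format_specs` and `format_matches` are hypothetical, the format string is illustrative, and `*` width/precision fields (which consume extra arguments) are not handled. It checks argument count only, not argument types.

```cpp
#include <cassert>
#include <cstddef>

// Count conversion specifiers in a printf-style format string.
// "%%" is a literal percent sign and is not counted.
constexpr std::size_t count_format_specs(const char* fmt)
{
  std::size_t n = 0;
  for (std::size_t i = 0; fmt[i] != '\0'; ++i) {
    if (fmt[i] != '%') { continue; }
    if (fmt[i + 1] == '%') {
      ++i;  // skip the escaped percent sign
      continue;
    }
    ++n;
  }
  return n;
}

// True if the format string expects exactly num_args arguments.
constexpr bool format_matches(const char* fmt, std::size_t num_args)
{
  return count_format_specs(fmt) == num_args;
}

static_assert(format_matches("distance=%lu, hash=%lu\n", 2), "two specifiers, two arguments");
static_assert(!format_matches("distance=%lu, hash=%lu\n", 1), "one argument is a mismatch");
```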
cpp/include/cuvs/neighbors/cagra.hpp (1)
273-351: ⚠️ Potential issue | 🟡 Minor

Document the new internal_dtype API surface.

internal_dtype and search_params::smem_dtype are now public API, but this header does not document the enum values or the key constraints users need to know: FP8 is VPQ-only, E5M2 is ignored for strided datasets (cpp/src/neighbors/detail/cagra/cagra_search.cuh Lines 155-159), and AUTO is device-dependent for VPQ selection. Please add Doxygen on the enum/field and flag this new knob in the user-facing CAGRA docs.

As per coding guidelines, "For public C++ API headers, additionally check: Doxygen documentation for all public functions/classes" and "API changes flagged for docs/ updates".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuvs/neighbors/cagra.hpp` around lines 273 - 351, The public enum
internal_dtype and search_params::smem_dtype lack Doxygen and user-facing
guidance; add Doxygen comments to the internal_dtype enum (document each value:
F16, E5M2, AUTO) and to search_params::smem_dtype explaining constraints: FP8
(if present) is VPQ-only, E5M2 is ignored for strided datasets, AUTO behavior is
device-dependent and selects VPQ when appropriate, and valid value
ranges/compatibility with other params (e.g., smem usage, VPQ). Also add a short
API-note in the CAGRA user docs indicating this new knob, its VPQ-only/strided
dataset caveats, and guidance on when to choose AUTO vs explicit types.
Reference symbols: internal_dtype, search_params::smem_dtype, and VPQ behavior
in cagra_search logic.
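
As a rough illustration of the requested documentation, the sketch below shows a Doxygen-commented enum and field in a stand-in namespace. It is hypothetical and is not the actual cuVS header. The value names follow this comment, while the default value and the `resolve_smem_dtype` helper are assumptions made for the example.

```cpp
#include <cassert>

namespace sketch {  // hypothetical stand-in; not the cuVS API

/**
 * @brief Data type used for values staged in shared memory during search.
 */
enum class internal_dtype {
  AUTO,  ///< Let the implementation choose; per the review, the choice is device-dependent for VPQ.
  F16,   ///< IEEE half precision (16 bits per value).
  E5M2   ///< 8-bit float with 5 exponent and 2 mantissa bits; VPQ datasets only.
};

struct search_params {
  /**
   * @brief Shared-memory data type.
   *
   * E5M2 only takes effect for VPQ-compressed datasets; strided datasets use F16.
   */
  internal_dtype smem_dtype = internal_dtype::AUTO;  // default is an assumption
};

/// Apply the strided-dataset constraint described above. AUTO is returned unchanged for
/// VPQ datasets because the device-dependent selection rule is not spelled out in this thread.
inline internal_dtype resolve_smem_dtype(internal_dtype requested, bool is_vpq_dataset)
{
  if (!is_vpq_dataset) { return internal_dtype::F16; }
  return requested;
}

}  // namespace sketch
```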
cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh (1)
1086-1103: ⚠️ Potential issue | 🟡 Minor

Minor: misleading column ordering in _CLK_BREAKDOWN printf.

The header strings are emitted in the order ..., distance, hash but the corresponding values are clk_compute_actual_distance, clk_compute_distance - clk_compute_actual_distance. Since clk_compute_distance accumulates the wall time of compute_distance_to_child_nodes (which includes both the hashmap inserts and the actual distance computation), calling the residual "hash" is approximately right, but the column label "distance" now refers to the actual distance kernel time rather than the previously reported total. This is a behavioral break for anyone parsing the existing _CLK_BREAKDOWN output. Consider renaming the labels (e.g. actual_distance, non_distance) so log consumers don't silently misinterpret the numbers.

This is debug-only instrumentation gated by #ifdef _CLK_BREAKDOWN, so it does not affect release builds.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh` around lines
1086 - 1103, The `_CLK_BREAKDOWN` printf currently labels the last two columns
as "distance, hash" but passes clk_compute_actual_distance and
(clk_compute_distance - clk_compute_actual_distance), which is misleading;
update the header labels to match the values (for example change "distance,
hash" to "actual_distance, non_distance" or similar) so the printed column names
align with the passed variables, and ensure the printf string near
_CLK_BREAKDOWN and the associated argument list (using
clk_compute_actual_distance and clk_compute_distance) are kept consistent.

🧹 Nitpick comments (13)

cpp/tests/neighbors/ann_utils.cuh (2)
289-290: index_based_actual_recall is destructured but never used here.

eval_neighbours only checks actual_recall; the new index-only recall is consumed by callers (e.g., the VPQ path in ann_cagra.cuh via std::get<1>). To keep this internal binding intentional and avoid a potential -Wunused-variable on some toolchains, mark it [[maybe_unused]] or use a discard.
Proposed change
   auto [actual_recall, index_based_actual_recall, match_count, total_count] =
     calc_recall(expected_idx, actual_idx, expected_dist, actual_dist, rows, cols, eps);
+  (void)index_based_actual_recall;  // currently unused in eval_neighbours
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/ann_utils.cuh` around lines 289 - 290, The destructured
variable index_based_actual_recall from the calc_recall call in eval_neighbours
is not used and can trigger -Wunused-variable; update the destructuring to
indicate it's intentionally unused (e.g., mark index_based_actual_recall with
[[maybe_unused]] or replace it with a discard/ignore) so the binding remains for
callers that consume the second return (calc_recall) but avoids unused-variable
warnings in eval_neighbours.
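
For reference, a minimal sketch of the `[[maybe_unused]]` variant. `calc_recall_stub` and `recall_at_least` are hypothetical stand-ins, not the test helpers from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical stand-in for calc_recall: returns
// (recall, index-only recall, match count, total count).
inline std::tuple<double, double, std::size_t, std::size_t> calc_recall_stub()
{
  return {0.75, 1.0, 3, 4};
}

inline bool recall_at_least(double min_recall)
{
  // The attribute applies to the whole structured binding declaration,
  // so it covers every name introduced here, including the unused ones.
  [[maybe_unused]] auto [recall, index_recall, match_count, total_count] = calc_recall_stub();
  return recall >= min_recall;
}
```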
252-273: Avoid the duplicate O(rows·cols²) pass for index-only recall.

The new "Index based recall" loop reproduces the structure of the loop above, just without the distance check. Folding the index-only counter into the first loop saves one full O(rows·cols²) traversal and keeps the two recall metrics consistent by construction.
Proposed merge
   for (size_t i = 0; i < rows; ++i) {
     for (size_t k = 0; k < cols; ++k) {
       size_t idx_k  = i * cols + k;  // row major assumption!
       auto act_idx  = actual_idx[idx_k];
       auto act_dist = actual_dist[idx_k];
+      bool idx_matched = false;
       for (size_t j = 0; j < cols; ++j) {
         size_t idx    = i * cols + j;  // row major assumption!
         auto exp_idx  = expected_idx[idx];
         auto exp_dist = expected_dist[idx];
+        if (!idx_matched && act_idx == exp_idx) {
+          index_match_count++;
+          idx_matched = true;
+        }
         idx_dist_pair exp_kvp(exp_idx, exp_dist, cuvs::CompareApprox<DistT>(eps));
         idx_dist_pair act_kvp(act_idx, act_dist, cuvs::CompareApprox<DistT>(eps));
         if (exp_kvp == act_kvp) {
           match_count++;
           break;
         }
       }
     }
   }
-
-  // Index based recall
-  for (size_t i = 0; i < rows; ++i) {
-    ...
-  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/ann_utils.cuh` around lines 252 - 273, The second "Index
based recall" triple loop duplicates the earlier O(rows·cols²) traversal;
instead, inside the original nested loops where you compute match_count (the
first double loop that iterates i, k and inner j over expected_idx and checks
distances), also check for index equality (if act_idx == exp_idx) and increment
index_match_count and break just like the separate loop did, ensuring you only
count once per (i,k); then remove the duplicate loop entirely so
index_match_count is updated in the same pass as match_count (use the same
variables actual_idx, expected_idx, act_idx, exp_idx, index_match_count,
match_count, total_count to locate and modify the code).
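
A self-contained sketch of the single-pass idea follows. It assumes that a pair counts as a match when either the indices are equal or the distances agree within `eps`; that rule is an assumption about the comparison helper, not something confirmed in this thread. Under that rule, breaking on the first pair match could skip a later index match, so this version stops scanning only once an index match is found. `recall_counts` and `count_matches` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct recall_counts {
  std::size_t match_count       = 0;  // index match or distance within eps
  std::size_t index_match_count = 0;  // index match only
  std::size_t total_count       = 0;
};

inline recall_counts count_matches(const std::vector<int>& expected_idx,
                                   const std::vector<int>& actual_idx,
                                   const std::vector<float>& expected_dist,
                                   const std::vector<float>& actual_dist,
                                   std::size_t rows,
                                   std::size_t cols,
                                   float eps)
{
  assert(expected_idx.size() == rows * cols && actual_idx.size() == rows * cols);
  assert(expected_dist.size() == rows * cols && actual_dist.size() == rows * cols);
  recall_counts c;
  c.total_count = rows * cols;
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t k = 0; k < cols; ++k) {
      const std::size_t idx_k = i * cols + k;  // row-major layout
      bool idx_hit            = false;
      bool pair_hit           = false;
      for (std::size_t j = 0; j < cols; ++j) {
        const std::size_t idx = i * cols + j;
        const bool same_idx   = actual_idx[idx_k] == expected_idx[idx];
        const bool close      = std::fabs(actual_dist[idx_k] - expected_dist[idx]) <= eps;
        idx_hit               = idx_hit || same_idx;
        pair_hit              = pair_hit || same_idx || close;
        if (idx_hit) { break; }  // an index hit implies a pair hit; nothing left to find
      }
      c.index_match_count += idx_hit ? 1 : 0;
      c.match_count += pair_hit ? 1 : 0;
    }
  }
  return c;
}
```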
cpp/tests/neighbors/ann_cagra.cuh (2)
563-563: Initialize reference_recall at declaration.

Currently the member is left uninitialized; it is assigned inside testCagra() before being read at line 514, but a default initializer makes the invariant explicit and avoids latent UB if any future code path reads it earlier.
Proposed change
-  double reference_recall;
+  double reference_recall = 1.0;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/ann_cagra.cuh` at line 563, The variable reference_recall
is declared uninitialized; initialize it at declaration (e.g., double
reference_recall = 0.0;) to make the invariant explicit and avoid potential UB
if read before assignment—update the declaration of reference_recall near the
top of the file so testCagra() can still assign it later but the variable has a
safe default value.
503-503: Replace raw printf with the RAFT logger.

Other test diagnostics use RAFT_LOG_INFO. A bare printf is harder to silence and bypasses the logger configuration.
Proposed change
-          printf("reference_recall = %e\n", reference_recall);
+          RAFT_LOG_INFO("reference_recall = %e", reference_recall);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/ann_cagra.cuh` at line 503, Replace the raw printf call
in ann_cagra.cuh that prints reference_recall with the RAFT logger: change
printf("reference_recall = %e\n", reference_recall); to a RAFT_LOG_INFO call
(e.g., RAFT_LOG_INFO("reference_recall = %e", reference_recall)); ensure the
RAFT logging header is included where needed so the logger symbol RAFT_LOG_INFO
is available.
cpp/tests/neighbors/vpq_utils.cuh (2)
25-28: Verify alignment of the 4-byte vq_code read.

reinterpret_cast<const uint32_t*>(local_data_ptr) requires data_ptr + ldi * batch_id to be 4-byte aligned. As long as the encoded row stride ldi is a multiple of 4 and the base pointer is aligned (which RMM/raft allocations typically are), this is fine — but nothing in this file enforces it. A RAFT_EXPECTS(vpq_dataset.data.stride(0) % 4 == 0, ...) in the host wrapper would make the requirement explicit and fail loudly instead of silently returning misaligned-load garbage on platforms that don't tolerate it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/vpq_utils.cuh` around lines 25 - 28, The code reads a
4-byte vq_code via reinterpret_cast<const uint32_t*>(local_data_ptr) (symbols:
local_data_ptr, vq_code, data_ptr, ldi, batch_id), but there is no guarantee
that local_data_ptr is 4-byte aligned; add an explicit runtime check in the host
wrapper that prepares vpq data (e.g., validate vpq_dataset.data.stride(0) / ldi)
using RAFT_EXPECTS(vpq_dataset.data.stride(0) % 4 == 0, "stride must be 4-byte
aligned") so misaligned strides fail loudly; ensure this check runs before any
device/kernel launch that uses local_data_ptr and document the alignment
requirement in the wrapper's API comment.
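
A host-side sketch of the two options: validate the alignment up front, or sidestep it by copying bytes. Both helper names are hypothetical. In the real wrapper the check would presumably be expressed with `RAFT_EXPECTS`, as suggested above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if every row start (base + i * row_stride_bytes) is 4-byte aligned.
inline bool rows_are_u32_aligned(const void* base, std::size_t row_stride_bytes)
{
  return reinterpret_cast<std::uintptr_t>(base) % 4 == 0 && row_stride_bytes % 4 == 0;
}

// Alignment-agnostic alternative: copy the 4 bytes instead of casting the pointer.
inline std::uint32_t load_u32(const std::uint8_t* p)
{
  std::uint32_t v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}
```

The check keeps the fast aligned load in the kernel and fails loudly on the host. The `memcpy` form tolerates any stride, with the cost depending on how the compiler lowers it.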
14-20: pq_table_size is computed and passed but never used inside the kernel.

The host wrapper computes 1u << vpq_dataset.pq_bits() and forwards it as pq_table_size, but decode_vpq_dataset_kernel doesn't reference that parameter anywhere. Either drop it from the signature or use it (e.g., to bounds-check pq_code against the codebook size).
Proposed cleanup
 __global__ void decode_vpq_dataset_kernel(data_t* const decoded_dataset_ptr,
                                           const uint32_t ldd,
                                           const math_t* const vq_codebook_ptr,
                                           const uint32_t ldv,
                                           const math_t* const pq_codebook_ptr,
                                           const uint32_t pq_subspace_dim,
-                                          const uint32_t pq_table_size,
                                           const uint32_t dataset_dim,
                                           const size_t dataset_size,
                                           const uint8_t* const data_ptr,
                                           const uint32_t ldi)
…and remove the corresponding argument at the launch site (line 60).
Also applies to: 53-64
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/neighbors/vpq_utils.cuh` around lines 14 - 20, The kernel
decode_vpq_dataset_kernel currently accepts pq_table_size but never uses it;
either remove pq_table_size from the kernel signature and from any launch-site
argument lists, or use it to validate decoded PQ indices (e.g., check pq_code <
pq_table_size before indexing the codebook) to prevent out-of-bounds access.
Locate the kernel function decode_vpq_dataset_kernel and all call sites that
pass the computed 1u << vpq_dataset.pq_bits(), then either delete the
pq_table_size parameter from the function signature and corresponding launches,
or add a bounds-check against pq_table_size where pq_code (or similar PQ index)
is used to index the codebook. Ensure consistency across all occurrences (also
apply same change to the analogous kernel referenced at lines 53-64).
cpp/src/neighbors/detail/cagra/device_common.hpp (3)
254-289: fp8xN only safely supports even NumPacked; document or constrain it.

uintN_t is only specialized for 32 and 64, so fp8xN<NumPacked, 5> is only instantiable for NumPacked ∈ {4, 8} (1 → uintN_t<8>, 2 → uintN_t<16> aren't defined). Additionally:

data.x2[num_elements / 2] silently truncates when num_elements is odd.

as_half2(i) indexes the x2 member, so it implicitly assumes pairs, i.e. even num_elements — there's no guard.

Today the only callers (in compute_distance_vpq-impl.cuh) pass PQ_LEN ∈ {4, 8}, so this is safe. To prevent surprises if someone later instantiates with an odd NumPacked, please add a static_assert(NumPacked % 2 == 0 && (NumPacked == 4 || NumPacked == 8), ...) (or specialize uintN_t for the supported widths only) in this struct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/device_common.hpp` around lines 254 - 289, The
fp8xN<NumPacked, 5> specialization currently assumes an even NumPacked and
relies on uintN_t being defined only for 32/64-bit widths; add a compile-time
guard to prevent accidental odd or unsupported instantiations by inserting a
static_assert in the fp8xN<NumPacked,5> body (near the union/data and methods)
that enforces NumPacked % 2 == 0 and restricts NumPacked to the supported sizes
(e.g., NumPacked == 4 || NumPacked == 8), and mention uintN_t and
as_half2/data.x2 in the assertion message so users know the reason;
alternatively, explicitly specialize uintN_t for the required widths and
document the even-element requirement in fp8xN.
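
The guard could look roughly like the sketch below. `uint_n` and `packed8` are simplified, hypothetical stand-ins for `uintN_t` and `fp8xN`; the real struct has more members, and this version reads the raw word through `memcpy` instead of a union.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw integer type for a given bit width; only 32 and 64 are provided,
// mirroring the situation described in the comment above.
template <unsigned Bits>
struct uint_n;
template <>
struct uint_n<32> {
  using type = std::uint32_t;
};
template <>
struct uint_n<64> {
  using type = std::uint64_t;
};

// Hypothetical packed container of NumPacked 8-bit values.
template <unsigned NumPacked>
struct packed8 {
  static_assert(NumPacked == 4 || NumPacked == 8,
                "packed8 needs NumPacked in {4, 8}: the raw type exists only for 32/64 bits "
                "and elements are consumed in pairs");
  using raw_t = typename uint_n<NumPacked * 8>::type;

  static constexpr unsigned num_pairs = NumPacked / 2;  // exact because NumPacked is even

  std::uint8_t x1[NumPacked] = {};

  // Reassemble the packed bytes into one raw word without type punning.
  raw_t raw() const
  {
    raw_t r;
    std::memcpy(&r, x1, sizeof(r));
    return r;
  }
};
```

Instantiating `packed8<3>` then stops with the assertion message instead of silently truncating `NumPacked / 2`.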
357-365: Redundant reinterpret_cast in sts overloads.

x already has type const uint32_t& / const uint64_t&, so the reinterpret_cast<const uint32_t&>(x) / reinterpret_cast<const uint64_t&>(x) are no-ops — they can simply be x. Not a correctness issue, just dead casting that obscures the intent of these helpers and differs in style from the templated overloads above.
♻️ Suggested cleanup
 RAFT_DEVICE_INLINE_FUNCTION void sts(uint32_t addr, const uint32_t& x)
 {
-  asm volatile("st.shared.u32 [%0], %1;" : : "r"(addr), "r"(reinterpret_cast<const uint32_t&>(x)));
+  asm volatile("st.shared.u32 [%0], %1;" : : "r"(addr), "r"(x));
 }

 RAFT_DEVICE_INLINE_FUNCTION void sts(uint32_t addr, const uint64_t& x)
 {
-  asm volatile("st.shared.u64 [%0], %1;" : : "r"(addr), "l"(reinterpret_cast<const uint64_t&>(x)));
+  asm volatile("st.shared.u64 [%0], %1;" : : "r"(addr), "l"(x));
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/device_common.hpp` around lines 357 - 365, The
two sts overloads use redundant reinterpret_casts—change the asm operand
expressions to use x directly (i.e., replace reinterpret_cast<const
uint32_t&>(x) and reinterpret_cast<const uint64_t&>(x) with x) in the functions
sts(uint32_t, const uint32_t&) and sts(uint32_t, const uint64_t&) while keeping
the existing asm volatile constraints ("r" for u32 and "l" for u64) and
signatures unchanged.
323-328: Pre-existing bug surfaced by adjacent diff: lds(uint8_t&) cast is wrong.

This isn't a line you changed, but it sits one block above the new lds(uint64_t&) and sts additions, so flagging while the surrounding code is under review:
RAFT_DEVICE_INLINE_FUNCTION void lds(uint8_t& x, uint32_t addr)
{
  uint32_t res;
  asm volatile("ld.shared.u8 {%0}, [%1];" : "=r"(res) : "r"(addr));
  x = static_cast<uint32_t>(res);   // <- assigning uint32_t to uint8_t&; should narrow
}
The final assignment uses static_cast<uint32_t> instead of static_cast<uint8_t>. It works only because narrowing is implicit for fundamental types, but it's misleading and contradicts the function's declared semantics. Consider fixing while you're touching this region:
-  x = static_cast<uint32_t>(res);
+  x = static_cast<uint8_t>(res);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/device_common.hpp` around lines 323 - 328, The
lds(uint8_t& x, uint32_t addr) function casts the loaded 32-bit register to
uint32_t before assigning to the uint8_t reference, which is misleading; change
the assignment to narrow explicitly by using static_cast<uint8_t>(res) (or
declare a uint8_t temp and assign that) so the function's semantics match its
signature and the narrowing is explicit.
cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh (4)
211-231: Dead branch when PQ_LEN < num_elements_per_bank is silent.

The if constexpr (PQ_LEN >= num_elements_per_bank) guard on line 211 elides the codebook copy when the assumption is violated — but with no else and no static_assert, an inadvertent future combination of PQ_LEN / EnableFP8 would just silently skip writing the PQ codebook to SMEM, which would corrupt search results without any compile-time signal. All currently instantiated combinations satisfy the guard ({2,4,8} PQ_LEN with num_elements_per_bank ∈ {2,4}), so this is fine today.

Please replace the silent if constexpr with a static_assert so that the build fails loudly if someone adds a new PQ_LEN/FP8 combination later:
🛡️ Suggested hardening
-      if constexpr (PQ_LEN >= num_elements_per_bank) {  // safety
+      static_assert(PQ_LEN >= num_elements_per_bank,
+                    "PQ_LEN must be >= number of FP8 elements per 32-bit bank");
+      {
         constexpr auto num_banks_per_subspace = PQ_LEN / num_elements_per_bank;
         ...
       }
The same comment applies to the analogous if constexpr (PQ_LEN >= num_packed_elements) guard at line 329 of compute_distance_vpq_worker.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh` around lines
211 - 231, Replace the silent conditional guard with a compile-time check: add a
static_assert that PQ_LEN >= num_elements_per_bank (and similarly
static_assert(PQ_LEN >= num_packed_elements) in the analogous guard inside
compute_distance_vpq_worker) so any unsupported PQ_LEN / FP8 combinations fail
at compile time instead of silently eliding the SMEM codebook copy; keep the
existing copy logic (the body that uses PQ_LEN, num_elements_per_bank,
num_packed_elements, codebook_buf, smem_val_pack_t, device::sts, and
r->pq_code_book_ptr()) unchanged, but remove or tighten the surrounding if
constexpr to rely on the static_assert to enforce the invariant.
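
The difference between the two guards can be shown in isolation. In the sketch below, `banks_per_subspace` is a hypothetical helper and the divisibility check is an extra assumption added for the example. An unsupported combination fails to compile instead of skipping the body.

```cpp
#include <cassert>

// Number of 32-bit banks occupied by one PQ subspace.
// With `if constexpr (PQ_LEN >= ElemsPerBank)` an unsupported combination would
// compile and do nothing; with static_assert it is rejected at compile time.
template <int PQ_LEN, int ElemsPerBank>
constexpr int banks_per_subspace()
{
  static_assert(ElemsPerBank > 0, "ElemsPerBank must be positive");
  static_assert(PQ_LEN >= ElemsPerBank && PQ_LEN % ElemsPerBank == 0,
                "PQ_LEN must be a positive multiple of the number of elements per bank");
  return PQ_LEN / ElemsPerBank;
}

static_assert(banks_per_subspace<8, 4>() == 2, "PQ_LEN = 8 with four 8-bit values per bank");
static_assert(banks_per_subspace<4, 2>() == 2, "PQ_LEN = 4 with two 16-bit values per bank");
```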
39-67: Adding EnableFP8 as the trailing template parameter is a non-breaking, clean extension.

Putting the new bool EnableFP8 template parameter at the end and exposing it as a kEnableFP8 constexpr keeps the descriptor type discoverable from kernels (e.g. DescriptorT::kEnableFP8 is what setup_workspace_vpq and compute_distance_vpq_worker rely on). The propagation through vpq_dataset_descriptor_init_kernel and vpq_descriptor_spec::init_ is consistent.

One follow-up worth considering: add a static_assert(!EnableFP8 || (PQ_LEN == 4 || PQ_LEN == 8), "FP8 SMEM is only supported for PQ_LEN in {4, 8}") here, so an accidental instantiation with EnableFP8=true, PQ_LEN=2 fails loudly at compile time instead of silently falling through to the half2 specialization in smem_val_type_t.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh` around lines 39
- 67, Add a compile-time guard to cagra_q_dataset_descriptor_t to prevent
invalid FP8 configurations: inside the template struct
cagra_q_dataset_descriptor_t (the type that defines kEnableFP8), add a
static_assert that checks !EnableFP8 || (PQ_LEN == 4 || PQ_LEN == 8) with a
clear message like "FP8 SMEM is only supported for PQ_LEN in {4, 8}" so
instantiations with EnableFP8=true and unsupported PQ_LEN (e.g., 2) fail at
compile time; reference cagra_q_dataset_descriptor_t, kEnableFP8, EnableFP8, and
PQ_LEN when locating where to add the assertion.
300-326: PQ_CODEBOOK_LOAD_T is hard-coded to uint32_t; the else branch on lines 323-325 is dead.

PQ_CODEBOOK_LOAD_T is defined locally as using PQ_CODEBOOK_LOAD_T = uint32_t; (line 290) and is never given a different type, so the if constexpr (std::is_same_v<PQ_CODEBOOK_LOAD_T, uint32_t>) always takes the first branch and the fallback *reinterpret_cast<const PQ_CODEBOOK_LOAD_T*>(...) is unreachable. Either:

promote PQ_CODEBOOK_LOAD_T to a real template parameter / trait if the intent is to support other widths in the future, or

drop the dead else and the surrounding if constexpr until that flexibility is actually needed (KISS / YAGNI).

This is purely a maintainability nit — no functional impact.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh` around lines
300 - 326, The code has a dead conditional because PQ_CODEBOOK_LOAD_T is locally
typedef'd to uint32_t so the if constexpr(std::is_same_v<PQ_CODEBOOK_LOAD_T,
uint32_t>) always selects the device::ldg_cg path and the else branch is
unreachable; fix by either (A) making PQ_CODEBOOK_LOAD_T a template parameter or
trait used by compute_distance_vpq (so the reinterpret_cast fallback can be
meaningful for other widths), or (B) remove the if constexpr and else branch and
always use device::ldg_cg/pq_codes assignment for the current uint32_t type to
keep the code simple—update the code around PQ_CODEBOOK_LOAD_T, the loading loop
that writes pq_codes[e], and the device::ldg_cg/dataset_ptr usage accordingly.
222-230: Remove unnecessary intermediate float conversion in codebook setup.

The code currently performs a wasteful double conversion:
buf.data.x1[k] =
  static_cast<smem_val_t>(static_cast<float>(r->pq_code_book_ptr()[i + k]));
Since pq_code_book_ptr()[i + k] is half and smem_val_t is __nv_fp8_e5m2, the intermediate float conversion adds unnecessary overhead. __nv_fp8_e5m2 provides a direct constructor from __half, supported in CUDA 11.8+ (cuVS targets CUDA 12.9+).
♻️ Suggested simplification
-buf.data.x1[k] =
-  static_cast<smem_val_t>(static_cast<float>(r->pq_code_book_ptr()[i + k]));
+buf.data.x1[k] = static_cast<smem_val_t>(r->pq_code_book_ptr()[i + k]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh` around lines
222 - 230, The assignment in the codebook setup uses an unnecessary intermediate
float cast; change the line inside the loop that sets buf.data.x1[k] so it
converts directly from the source half value to smem_val_t (i.e., use a direct
static_cast or constructor from r->pq_code_book_ptr()[i + k] to smem_val_t)
instead of static_cast<smem_val_t>(static_cast<float>(...)); update the
assignment in the loop that writes into buf (within the branch handling
num_packed_elements == 4 || 8) so buf.data.x1, smem_val_t,
r->pq_code_book_ptr(), and device::sts remain used but the extra float
conversion is removed.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh`:
- Around line 245-249: The branch for num_packed_elements == 2 contains dead
outer casts and a redundant inner if: the expressions
static_cast<smem_val_t>(static_cast<float>(buf.x = mapping(queries_ptr[i])))
discard the cast result and only perform buf.x = mapping(...); fix by applying
the intended cast to the stored value (e.g. buf.x =
static_cast<smem_val_t>(static_cast<float>(mapping(queries_ptr[i])))) or simply
remove the casts entirely if unnecessary, and drop the inner if (i < dim)
because the surrounding loop already ensures i < dim; update both buf.x and
buf.y assignments accordingly (references: num_packed_elements, smem_val_t,
buf.x, buf.y, mapping, queries_ptr, i, dim).
- Around line 242-269: The half2 branch leaves parts of local buf
(smem_val_pack_t) uninitialized when dim % num_packed_elements != 0; initialize
the unused lane(s) before writing buf to shared memory to match the fp8 path's
zeroed lanes. In the num_packed_elements == 2 path (where buf is a half2), after
assigning buf.x and buf.y conditionally based on i and i+1, explicitly set the
unused lane to zero (using smem_val_t/zero cast or equivalent) when i or i+1 >=
dim so that the stored buf (written via
reinterpret_cast<smem_val_pack_t*>(smem_query_ptr)[transpose<...>(...)] or the
fallback write) contains no garbage; keep existing mapping(queries_ptr[...] )
assignments and do not change the transpose, compute_distance_vpq_worker,
PQ_BITS or PQ_LEN logic.
- Around line 386-411: Add unit tests covering the FP8 VQ-PQ path to validate
indexing and recall: create tests that run the kernel exercising the branch
where num_packed_elements == 4 and == 8 (with PQ_LEN=4 and PQ_LEN=8), feed known
vq_vals and pq_codebook inputs, and assert correct consumption of FP8 elements
by checking final vq_half2_index and end-to-end recall against a dense baseline;
also add a short comment near the loop that mentions the E5M2 precision tradeoff
and include measured recall deltas (e.g., SIFT-1M or DEEP-100M results for
PQ_LEN=4 and 8) in the PR/body so reviewers can see empirical impact.
- Around line 18-37: The template specialization for smem_val_type_t (the
partial specialization keyed by "PQ_LEN == 2 || !EnableFP8") causes
smem_val_type_t<2,true> to compile the same half2 path, so EnableFP8 is
effectively ignored for PQ_LEN==2 and doubles compile artifacts; fix by either
removing EnableFP8=true entries for PQ_LEN==2 from the test/matrix generator
(compute_distance_vpq_matrix.json) so those instantiations are not emitted, or
add a concise explanatory comment immediately above the smem_val_type_t
specializations documenting that the PQ_LEN==2 branch intentionally covers both
EnableFP8 values and that compute_distance_vpq.hpp’s priority function (the if
(use_fp8 != EnableFP8) check) filters the unwanted runtime case—choose one of
these two actions to avoid redundant compilation.
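
The last item above can be reproduced in isolation. The following is a minimal sketch, not the code from the PR: `half2_tag` and `fp8_pack_tag` are hypothetical stand-ins for the CUDA types, and `smem_pack_sketch_t` only mirrors the selection condition quoted in the comment (`PQ_LEN == 2 || !EnableFP8`).

```cpp
#include <type_traits>

// Hypothetical stand-ins for the real shared-memory pack types.
struct half2_tag {};
struct fp8_pack_tag {};

// Same selection condition as described in the review comment:
// the half2 path is taken when PQ_LEN == 2 or when FP8 is disabled.
template <unsigned PQ_LEN, bool EnableFP8>
using smem_pack_sketch_t =
  std::conditional_t<(PQ_LEN == 2 || !EnableFP8), half2_tag, fp8_pack_tag>;

// For PQ_LEN == 2 the EnableFP8 flag has no effect on the selected type,
// so both instantiations compile the same half2 code path.
static_assert(std::is_same_v<smem_pack_sketch_t<2, true>, smem_pack_sketch_t<2, false>>,
              "EnableFP8 is ignored for PQ_LEN == 2");
static_assert(std::is_same_v<smem_pack_sketch_t<4, true>, fp8_pack_tag>,
              "PQ_LEN == 4 with FP8 selects the FP8 pack");
```

Under these assumptions, emitting both `EnableFP8` values for `PQ_LEN == 2` in the instantiation matrix would produce two copies of equivalent code, which is the redundancy the comment points at.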

In `@cpp/tests/neighbors/ann_cagra.cuh`:
- Around line 1697-1698: The inline comment next to the pq_len loop in
ann_cagra.cuh is stale — it says "only pq_len = 2 is supported" while the loop
now iterates {2,4,8}; update or remove that comment and similarly revise/remove
the identical stale remarks in the generate_addnode_inputs and
generate_filtering_inputs blocks so comments reflect actual supported PQ lengths
(or note why those functions still only use {2} if intentional); locate the
loops by searching for the pq_len variable and the function names
generate_addnode_inputs and generate_filtering_inputs to apply the fixes.
- Around line 466-504: The code assumes vpq datasets use half by doing
dynamic_cast to vpq_dataset<half, int64_t>& before decode_vpq_dataset; change
this to a type-safe check: replace the throwing reference dynamic_cast with a
pointer dynamic_cast to vpq_dataset<half, int64_t>* and verify it is non-null,
otherwise query the index's actual math type (or a provided math_type()/type()
accessor on index.data()) and call decode_vpq_dataset with the correct template
specialization (or emit a clear fatal error message stating the unexpected
codebook math type). Update the block around
decode_vpq_dataset/dynamic_cast/naive_knn (symbols: decode_vpq_dataset,
vpq_dataset, dynamic_cast, vpq_build, reference_recall) so the cast is validated
and the decode is chosen based on the runtime math type.

In `@cpp/tests/neighbors/vpq_utils.cuh`:
- Around line 1-7: This header file is missing an include guard: add a
top-of-file header guard (preferably a single-line `#pragma once`) to prevent
multiple inclusion of the test header (vpq_utils.cuh) which defines
kernel/function templates in namespace cuvs::neighbors; simply insert `#pragma
once` at the very beginning of vpq_utils.cuh so the kernel and template
definitions in the cuvs::neighbors namespace are not redefined when included by
multiple translation units.
- Around line 9-38: The decoder kernel decode_vpq_dataset_kernel wrongly assumes
pq_bits==8 by reading PQ codes with pq_code_ptr[i / pq_subspace_dim]; either
enforce pq_bits==8 up front or implement proper bit-packed reads: add a
precondition/assertion that the vpq_params.pq_bits == 8 (and log/error if not)
or replace the single-byte read with bitfield unpacking using the existing
bitfield_view_t (or equivalent) to extract the pq_code for each subspace index
before indexing pq_codebook_ptr; update references in decode_vpq_dataset_kernel
(pq_code_ptr usage, loop that computes pq_code) to use the chosen fix.
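
For the second option in the last item (bit-packed reads), a rough host-side sketch of what extracting a `pq_bits`-wide code could look like is shown below. It is not the cuVS `bitfield_view_t`; the function name `read_packed_code`, the LSB-first bit order, and the sample buffer are assumptions made for illustration, and it only handles widths from 1 to 8 bits.

```cpp
#include <cstdint>

// Hypothetical helper: read the idx-th code of width `bits` (1..8) from a
// byte buffer in which codes are packed back to back, least significant bit first.
constexpr std::uint32_t read_packed_code(const std::uint8_t* data,
                                         std::uint32_t idx,
                                         std::uint32_t bits)
{
  const std::uint32_t bit_offset = idx * bits;
  const std::uint32_t byte       = bit_offset / 8u;
  const std::uint32_t shift      = bit_offset % 8u;
  std::uint32_t word             = data[byte];
  // A code may straddle a byte boundary; pull in the next byte only if needed.
  if (shift + bits > 8u) { word |= static_cast<std::uint32_t>(data[byte + 1u]) << 8; }
  return (word >> shift) & ((1u << bits) - 1u);
}

// Sample buffer (assumption): the 5-bit codes 3, 17, 9 packed LSB-first.
constexpr std::uint8_t kPacked5[] = {0x23, 0x26};
static_assert(read_packed_code(kPacked5, 1u, 5u) == 17u, "code straddling two bytes");

// With bits == 8 the helper degenerates to a plain byte read.
constexpr std::uint8_t kPacked8[] = {7, 200};
static_assert(read_packed_code(kPacked8, 1u, 8u) == 200u, "8-bit codes are plain bytes");
```

The simpler alternative is the first option, rejecting anything other than `pq_bits == 8` up front; a later commit in this PR is titled "Add pq_bits assert".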

---

Outside diff comments:
In `@cpp/include/cuvs/neighbors/cagra.hpp`:
- Around line 273-351: The public enum internal_dtype and
search_params::smem_dtype lack Doxygen and user-facing guidance; add Doxygen
comments to the internal_dtype enum (document each value: F16, E5M2, AUTO) and
to search_params::smem_dtype explaining constraints: FP8 (if present) is
VPQ-only, E5M2 is ignored for strided datasets, AUTO behavior is
device-dependent and selects VPQ when appropriate, and valid value
ranges/compatibility with other params (e.g., smem usage, VPQ). Also add a short
API-note in the CAGRA user docs indicating this new knob, its VPQ-only/strided
dataset caveats, and guidance on when to choose AUTO vs explicit types.
Reference symbols: internal_dtype, search_params::smem_dtype, and VPQ behavior
in cagra_search logic.

In `@cpp/src/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh`:
- Around line 489-507: The printf call in search_multi_cta_kernel-inl.cuh prints
both "distance" and "hash" but only passes one uint64_t after
clk_pickup_parents, causing mismatched arguments when _CLK_BREAKDOWN is enabled;
update the printf argument list used in the printf near the debug block (the
printf that references __FILE__, __LINE__, query_id, threadIdx.x and clk_*
variables) to pass both timing values (e.g., clk_compute_distance and
clk_compute_actual_distance or the intended clk_hash value) so the number and
order of format specifiers match the provided arguments; ensure the variables
clk_init, clk_compute_1st_distance, clk_topk, clk_pickup_parents,
clk_compute_distance, and clk_compute_actual_distance (or the correct hash
timing variable) are all supplied in the same order as the format string.

In `@cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh`:
- Around line 1086-1103: The `_CLK_BREAKDOWN` printf currently labels the last
two columns as "distance, hash" but passes clk_compute_actual_distance and
(clk_compute_distance - clk_compute_actual_distance), which is misleading;
update the header labels to match the values (for example change "distance,
hash" to "actual_distance, non_distance" or similar) so the printed column names
align with the passed variables, and ensure the printf string near
_CLK_BREAKDOWN and the associated argument list (using
clk_compute_actual_distance and clk_compute_distance) are kept consistent.

---

Nitpick comments:
In `@cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh`:
- Around line 211-231: Replace the silent conditional guard with a compile-time
check: add a static_assert that PQ_LEN >= num_elements_per_bank (and similarly
static_assert(PQ_LEN >= num_packed_elements) in the analogous guard inside
compute_distance_vpq_worker) so any unsupported PQ_LEN / FP8 combinations fail
at compile time instead of silently eliding the SMEM codebook copy; keep the
existing copy logic (the body that uses PQ_LEN, num_elements_per_bank,
num_packed_elements, codebook_buf, smem_val_pack_t, device::sts, and
r->pq_code_book_ptr()) unchanged, but remove or tighten the surrounding if
constexpr to rely on the static_assert to enforce the invariant.
- Around line 39-67: Add a compile-time guard to cagra_q_dataset_descriptor_t to
prevent invalid FP8 configurations: inside the template struct
cagra_q_dataset_descriptor_t (the type that defines kEnableFP8), add a
static_assert that checks !EnableFP8 || (PQ_LEN == 4 || PQ_LEN == 8) with a
clear message like "FP8 SMEM is only supported for PQ_LEN in {4, 8}" so
instantiations with EnableFP8=true and unsupported PQ_LEN (e.g., 2) fail at
compile time; reference cagra_q_dataset_descriptor_t, kEnableFP8, EnableFP8, and
PQ_LEN when locating where to add the assertion.
- Around line 300-326: The code has a dead conditional because
PQ_CODEBOOK_LOAD_T is locally typedef'd to uint32_t so the if
constexpr(std::is_same_v<PQ_CODEBOOK_LOAD_T, uint32_t>) always selects the
device::ldg_cg path and the else branch is unreachable; fix by either (A) making
PQ_CODEBOOK_LOAD_T a template parameter or trait used by compute_distance_vpq
(so the reinterpret_cast fallback can be meaningful for other widths), or (B)
remove the if constexpr and else branch and always use device::ldg_cg/pq_codes
assignment for the current uint32_t type to keep the code simple—update the code
around PQ_CODEBOOK_LOAD_T, the loading loop that writes pq_codes[e], and the
device::ldg_cg/dataset_ptr usage accordingly.
- Around line 222-230: The assignment in the codebook setup uses an unnecessary
intermediate float cast; change the line inside the loop that sets
buf.data.x1[k] so it converts directly from the source half value to smem_val_t
(i.e., use a direct static_cast or constructor from r->pq_code_book_ptr()[i + k]
to smem_val_t) instead of static_cast<smem_val_t>(static_cast<float>(...));
update the assignment in the loop that writes into buf (within the branch
handling num_packed_elements == 4 || 8) so buf.data.x1, smem_val_t,
r->pq_code_book_ptr(), and device::sts remain used but the extra float
conversion is removed.
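
As a small illustration of the compile-time guard proposed for lines 39-67, the sketch below uses a hypothetical `descriptor_sketch` struct in place of `cagra_q_dataset_descriptor_t`; only the `static_assert` condition and message are taken from the comment.

```cpp
// Hypothetical stand-in for the descriptor template; the real type has many
// more parameters and members.
template <unsigned PQ_LEN, bool EnableFP8>
struct descriptor_sketch {
  // Unsupported FP8 configurations fail at compile time instead of
  // silently falling back to another code path.
  static_assert(!EnableFP8 || PQ_LEN == 4 || PQ_LEN == 8,
                "FP8 SMEM is only supported for PQ_LEN in {4, 8}");
  static constexpr bool kEnableFP8  = EnableFP8;
  static constexpr unsigned kPqLen  = PQ_LEN;
};

// These instantiate cleanly; descriptor_sketch<2, true> would not compile.
static_assert(descriptor_sketch<8, true>::kEnableFP8, "FP8 allowed for PQ_LEN == 8");
static_assert(!descriptor_sketch<2, false>::kEnableFP8, "F16 allowed for PQ_LEN == 2");
```

Such a guard only works if the instantiation matrix stops emitting `EnableFP8=true` for `PQ_LEN == 2`; otherwise the build itself would fail.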

In `@cpp/src/neighbors/detail/cagra/device_common.hpp`:
- Around line 254-289: The fp8xN<NumPacked, 5> specialization currently assumes
an even NumPacked and relies on uintN_t being defined only for 32/64-bit widths;
add a compile-time guard to prevent accidental odd or unsupported instantiations
by inserting a static_assert in the fp8xN<NumPacked,5> body (near the union/data
and methods) that enforces NumPacked % 2 == 0 and restricts NumPacked to the
supported sizes (e.g., NumPacked == 4 || NumPacked == 8), and mention uintN_t
and as_half2/data.x2 in the assertion message so users know the reason;
alternatively, explicitly specialize uintN_t for the required widths and
document the even-element requirement in fp8xN.
- Around line 357-365: The two sts overloads use redundant
reinterpret_casts—change the asm operand expressions to use x directly (i.e.,
replace reinterpret_cast<const uint32_t&>(x) and reinterpret_cast<const
uint64_t&>(x) with x) in the functions sts(uint32_t, const uint32_t&) and
sts(uint32_t, const uint64_t&) while keeping the existing asm volatile
constraints ("r" for u32 and "l" for u64) and signatures unchanged.
- Around line 323-328: The lds(uint8_t& x, uint32_t addr) function casts the
loaded 32-bit register to uint32_t before assigning to the uint8_t reference,
which is misleading; change the assignment to narrow explicitly by using
static_cast<uint8_t>(res) (or declare a uint8_t temp and assign that) so the
function's semantics match its signature and the narrowing is explicit.

In `@cpp/tests/neighbors/ann_cagra.cuh`:
- Line 563: The variable reference_recall is declared uninitialized; initialize
it at declaration (e.g., double reference_recall = 0.0;) to make the invariant
explicit and avoid potential UB if read before assignment—update the declaration
of reference_recall near the top of the file so testCagra() can still assign it
later but the variable has a safe default value.
- Line 503: Replace the raw printf call in ann_cagra.cuh that prints
reference_recall with the RAFT logger: change printf("reference_recall = %e\n",
reference_recall); to a RAFT_LOG_INFO call (e.g.,
RAFT_LOG_INFO("reference_recall = %e", reference_recall)); ensure the RAFT
logging header is included where needed so the logger symbol RAFT_LOG_INFO is
available.

In `@cpp/tests/neighbors/ann_utils.cuh`:
- Around line 289-290: The destructured variable index_based_actual_recall from
the calc_recall call in eval_neighbours is not used and can trigger
-Wunused-variable; update the destructuring to indicate it's intentionally
unused (e.g., mark index_based_actual_recall with [[maybe_unused]] or replace it
with a discard/ignore) so the binding remains for callers that consume the
second return (calc_recall) but avoids unused-variable warnings in
eval_neighbours.
- Around line 252-273: The second "Index based recall" triple loop duplicates
the earlier O(rows·cols²) traversal; instead, inside the original nested loops
where you compute match_count (the first double loop that iterates i, k and
inner j over expected_idx and checks distances), also check for index equality
(if act_idx == exp_idx) and increment index_match_count and break just like the
separate loop did, ensuring you only count once per (i,k); then remove the
duplicate loop entirely so index_match_count is updated in the same pass as
match_count (use the same variables actual_idx, expected_idx, act_idx, exp_idx,
index_match_count, match_count, total_count to locate and modify the code).
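
The single-pass idea from the last item can be sketched as follows. This is not the code in `ann_utils.cuh`: the struct `RecallCounts`, the function `count_matches`, the flat row-major layout, and the rule "a neighbor matches if its index is found or its distance is within `eps` of an expected distance" are assumptions. One detail worth noting is that the inner loop stops early only on an index match, since stopping on a distance match could miss an index match at a later position.

```cpp
#include <cstddef>
#include <cstdint>

struct RecallCounts {
  std::size_t match_count;        // index match or distance within eps
  std::size_t index_match_count;  // index match only
};

constexpr double abs_diff(double a, double b) { return a > b ? a - b : b - a; }

// Hypothetical single-pass counter over row-major [rows x cols] arrays.
constexpr RecallCounts count_matches(const std::int64_t* expected_idx,
                                     const std::int64_t* actual_idx,
                                     const double* expected_dist,
                                     const double* actual_dist,
                                     std::size_t rows,
                                     std::size_t cols,
                                     double eps)
{
  RecallCounts counts{0, 0};
  for (std::size_t i = 0; i < rows; ++i) {
    for (std::size_t k = 0; k < cols; ++k) {
      const std::size_t act = i * cols + k;
      bool idx_hit  = false;
      bool dist_hit = false;
      for (std::size_t j = 0; j < cols; ++j) {
        const std::size_t exp = i * cols + j;
        if (actual_idx[act] == expected_idx[exp]) {
          idx_hit = true;
          break;  // an index match settles both counters
        }
        if (abs_diff(actual_dist[act], expected_dist[exp]) <= eps) { dist_hit = true; }
      }
      if (idx_hit) { ++counts.index_match_count; }
      if (idx_hit || dist_hit) { ++counts.match_count; }
    }
  }
  return counts;
}

// Sample data (assumption): one query, three neighbors.
constexpr std::int64_t kExpectedIdx[] = {10, 11, 12};
constexpr std::int64_t kActualIdx[]   = {10, 99, 12};
constexpr double kExpectedDist[]      = {1.0, 2.0, 3.0};
constexpr double kActualDist[]        = {1.0, 2.0, 3.5};
static_assert(
  count_matches(kExpectedIdx, kActualIdx, kExpectedDist, kActualDist, 1, 3, 1e-3).match_count == 3,
  "two index matches plus one distance-only match");
```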

In `@cpp/tests/neighbors/vpq_utils.cuh`:
- Around line 25-28: The code reads a 4-byte vq_code via reinterpret_cast<const
uint32_t*>(local_data_ptr) (symbols: local_data_ptr, vq_code, data_ptr, ldi,
batch_id), but there is no guarantee that local_data_ptr is 4-byte aligned; add
an explicit runtime check in the host wrapper that prepares vpq data (e.g.,
validate vpq_dataset.data.stride(0) / ldi) using
RAFT_EXPECTS(vpq_dataset.data.stride(0) % 4 == 0, "stride must be 4-byte
aligned") so misaligned strides fail loudly; ensure this check runs before any
device/kernel launch that uses local_data_ptr and document the alignment
requirement in the wrapper's API comment.
- Around line 14-20: The kernel decode_vpq_dataset_kernel currently accepts
pq_table_size but never uses it; either remove pq_table_size from the kernel
signature and from any launch-site argument lists, or use it to validate decoded
PQ indices (e.g., check pq_code < pq_table_size before indexing the codebook) to
prevent out-of-bounds access. Locate the kernel function
decode_vpq_dataset_kernel and all call sites that pass the computed 1u <<
vpq_dataset.pq_bits(), then either delete the pq_table_size parameter from the
function signature and corresponding launches, or add a bounds-check against
pq_table_size where pq_code (or similar PQ index) is used to index the codebook.
Ensure consistency across all occurrences (also apply same change to the
analogous kernel referenced at lines 53-64).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a9bd4e85-0f4b-4457-b66b-991a60f896db

📥 Commits

Reviewing files that changed from the base of the PR and between f2bffb6 and 183082d.

📒 Files selected for processing (13)

cpp/CMakeLists.txt
cpp/include/cuvs/neighbors/cagra.hpp
cpp/src/neighbors/detail/cagra/cagra_search.cuh
cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh
cpp/src/neighbors/detail/cagra/compute_distance_vpq.hpp
cpp/src/neighbors/detail/cagra/compute_distance_vpq_inst.cu.in
cpp/src/neighbors/detail/cagra/compute_distance_vpq_matrix.json
cpp/src/neighbors/detail/cagra/device_common.hpp
cpp/src/neighbors/detail/cagra/search_multi_cta_kernel-inl.cuh
cpp/src/neighbors/detail/cagra/search_single_cta_kernel-inl.cuh
cpp/tests/neighbors/ann_cagra.cuh
cpp/tests/neighbors/ann_utils.cuh
cpp/tests/neighbors/vpq_utils.cuh

irina-resh-nvda

Reviewed mostly the inner nest in compute_distance_vpq_worker (lines 311–417): the loops, the FP8 unpack via as_half2(bi), and the codebook-transpose math in setup_workspace_vpq. Built a minimal Compiler-Explorer reproducer of the consumer-side lds + unpack chain and inspected the SASS on sm_100. Confirmed one LDS.64 per call (with the cbook/query offset folded into the immediate), one F2FP.F16.E5M2.UNPACK_B per as_half2, and no LDL/STL.

Learned a lot from this one. LGTM.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp`:
- Around line 163-176: The helper dispatch_cagra_smem_dtype must handle
internal_dtype::AUTO before the VPQ smem-tag switch; update
dispatch_cagra_smem_dtype to detect cuvs::neighbors::cagra::internal_dtype::AUTO
and remap it to the same concrete dtype used elsewhere (e.g.,
internal_dtype::E5M2) before or inside the switch, then continue to call the
lambda with tag_smem_e5m2 or tag_smem_f16 as appropriate (keep existing calls to
operator()<tag_smem_f16>() and operator()<tag_smem_e5m2>()); this ensures values
forwarded from cagra_jit_launcher_factory.hpp match the resolution logic in
compute_distance_vpq.hpp and avoid the RAFT_FAIL for AUTO.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e46b33e7-4a90-4f61-b1c3-f89f27d64bb7

📥 Commits

Reviewing files that changed from the base of the PR and between 183082d and a577563.

📒 Files selected for processing (20)

cpp/CMakeLists.txt
cpp/include/cuvs/detail/jit_lto/cagra/cagra_fragments.hpp
cpp/include/cuvs/neighbors/cagra.hpp
cpp/src/neighbors/detail/cagra/cagra_search.cuh
cpp/src/neighbors/detail/cagra/compute_distance.hpp
cpp/src/neighbors/detail/cagra/compute_distance_vpq-impl.cuh
cpp/src/neighbors/detail/cagra/compute_distance_vpq.hpp
cpp/src/neighbors/detail/cagra/compute_distance_vpq_inst.cu.in
cpp/src/neighbors/detail/cagra/compute_distance_vpq_matrix.json
cpp/src/neighbors/detail/cagra/device_memory_ops.hpp
cpp/src/neighbors/detail/cagra/factory.cuh
cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_jit_launcher_factory.hpp
cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp
cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_impl.cuh
cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_kernel.cu.in
cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json
cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_impl.cuh
cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_kernel.cu.in
cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json
cpp/src/neighbors/detail/cagra/packed_type.hpp

💤 Files with no reviewable changes (7)

cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_kernel.cu.in
cpp/src/neighbors/detail/cagra/packed_type.hpp
cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_matrix.json
cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_impl.cuh
cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_impl.cuh
cpp/src/neighbors/detail/cagra/jit_lto_kernels/compute_distance_kernel.cu.in
cpp/src/neighbors/detail/cagra/jit_lto_kernels/setup_workspace_matrix.json

🚧 Files skipped from review as they are similar to previous changes (3)

cpp/src/neighbors/detail/cagra/cagra_search.cuh
cpp/include/cuvs/neighbors/cagra.hpp
cpp/src/neighbors/detail/cagra/compute_distance_vpq_matrix.json

achirkin

Thanks for working on this and sorry for such a long delay in the review process!

Also, I like the packed structure; maybe we should consolidate it with the vectorized types in raft in the future.

achirkin · 2026-06-09T14:06:25Z

+        device::ldg_cg(pq_codes[e],
+                       reinterpret_cast<const PQ_CODEBOOK_LOAD_T*>(dataset_ptr + 4 + k));
+      } else {
+        pq_codes[e] = *reinterpret_cast<const PQ_CODEBOOK_LOAD_T*>(dataset_ptr + 4 + k);


Is there a specific reason not to use ldg_cg? If it's just the overload that's missing, we can add that overload in https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/detail/cagra/device_memory_ops.hpp

achirkin · 2026-06-09T15:11:58Z

+        switch (dataset_block_dim) {
+          case 128: std::forward<Lambda>(l).template operator()<8u, 128u>(); return;
+          case 256: std::forward<Lambda>(l).template operator()<8u, 256u>(); return;
+          case 512: std::forward<Lambda>(l).template operator()<8u, 512u>(); return;
+          default: break;


Would be nice to reduce these repeated lines (e.g. by introducing a template struct for each template parameter like we do in https://github.com/rapidsai/cuvs/blob/main/cpp/src/neighbors/ivf_pq/ivf_pq_compute_similarity_impl.cuh).
But you follow an already established pattern in this file, so I don't insist we must do it here.
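
One possible way to shrink such a switch is sketched below. It is a different technique from the per-parameter template structs used in the linked IVF-PQ file: a C++17 fold expression walks a list of candidate values and hands the matching one to the callback as a `std::integral_constant`. The names `dispatch_value`, `instantiated_id`, and `select_for_block_dim` are hypothetical and do not exist in cuVS.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical helper: call f with the compile-time constant that equals
// runtime_value. Returns false if no candidate matches.
template <unsigned... Candidates, typename F>
constexpr bool dispatch_value(unsigned runtime_value, F&& f)
{
  return ((runtime_value == Candidates
             ? (f(std::integral_constant<unsigned, Candidates>{}), true)
             : false) ||
          ...);
}

// Stand-in for a kernel instantiation keyed by two template parameters.
template <unsigned PqLen, unsigned BlockDim>
constexpr unsigned instantiated_id()
{
  return PqLen * 1000u + BlockDim;
}

// One line replaces the three repeated case labels for PQ_LEN = 8.
constexpr unsigned select_for_block_dim(unsigned dataset_block_dim)
{
  unsigned id = 0;  // stays 0 when the block dim is unsupported
  dispatch_value<128u, 256u, 512u>(dataset_block_dim, [&id](auto dim) {
    constexpr unsigned kDim = decltype(dim)::value;  // usable as a template argument
    id                      = instantiated_id<8u, kDim>();
  });
  return id;
}

static_assert(select_for_block_dim(256u) == 8256u, "dispatches to <8, 256>");
static_assert(select_for_block_dim(64u) == 0u, "unsupported value falls through");
```

Nesting two such calls would cover both template parameters, at the cost of a less familiar pattern than the existing switch statements in this file.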

enp1s0 requested review from a team as code owners November 12, 2025 04:25

github-project-automation Bot added this to Unstructured Data Processing Nov 12, 2025

github-project-automation Bot moved this to Todo in Unstructured Data Processing Nov 12, 2025

enp1s0 self-assigned this Nov 12, 2025

enp1s0 added the feature request, improvement, and non-breaking labels and removed the improvement label Nov 12, 2025

cjnolet moved this from Todo to In Progress in Unstructured Data Processing Nov 17, 2025

enp1s0 changed the title ~~[WIP] Improve CAGRA-Q performance and add support for PQ_LEN=8~~ Improve CAGRA-Q performance and add support for PQ_LEN=8 Apr 6, 2026

coderabbitai Bot reviewed Apr 26, 2026

cjnolet added the stale-active label May 13, 2026

irina-resh-nvda approved these changes May 22, 2026

enp1s0 added 7 commits May 29, 2026 17:50

Add pq_len=8

2539833

Update cagra-q test

19d5a0a

Update the compute distance kernel

09deae5

Merge branch 'main' into cagra-q-pq_len-8-alpha

c1a2ce6

Add FP8 support

5fa5321

Update EnableFP8

c323fa1

Update vpq test

a577563

enp1s0 force-pushed the cagra-q-pq_len-8 branch from 183082d to a577563 Compare June 4, 2026 15:57

coderabbitai Bot reviewed Jun 4, 2026

Comment thread cpp/src/neighbors/detail/cagra/jit_lto_kernels/cagra_planner_base.hpp

enp1s0 and others added 6 commits June 5, 2026 14:55

Remove internal_dtype::AUTO

d05f552

Update fp8xN to used SW emulated FP8 when FP8 is not natively supported

9020739

Merge branch 'main' into cagra-q-pq_len-8

07d29d2

Fix VPQ test

627ee0d

Fix compilation error

e7e4205

Merge branch 'main' into cagra-q-pq_len-8

b788dbb

enp1s0 and others added 3 commits June 8, 2026 13:34

Update VPQ test to use VpqMathT

1032ffb

Add pq_bits assert

02e3726

Merge branch 'main' into cagra-q-pq_len-8

d8c8844

achirkin mentioned this pull request Jun 9, 2026

[BUG] Fix CAGRA search recall with a graph built by NN Descent #819

Open

achirkin approved these changes Jun 9, 2026

enp1s0 and others added 11 commits June 10, 2026 00:18

Remove SW emulated FP8

c608bd1

Update dispatch funcs

f706baa

Fix ldg_cg use

0eef38f

Merge branch 'main' into cagra-q-pq_len-8

f777809

Merge branch 'cagra-q-pq_len-8' of github.com:enp1s0/cuvs into cagra-…

0a74ac6

…q-pq_len-8

Merge branch 'main' into cagra-q-pq_len-8

dd2500a

Merge branch 'main' into cagra-q-pq_len-8

ba1a5cf

Merge branch 'main' into cagra-q-pq_len-8

025659a

Merge branch 'main' into cagra-q-pq_len-8

9fca67c

Merge branch 'main' into cagra-q-pq_len-8

7a21676

Merge branch 'main' into cagra-q-pq_len-8

59951d1

enp1s0 requested review from a team as code owners June 29, 2026 05:19


4 participants
