11 changes: 11 additions & 0 deletions cpp/src/neighbors/detail/cagra/compute_distance.hpp
@@ -217,6 +217,7 @@ struct dataset_descriptor_host {
std::mutex mutex;
std::atomic<bool> ready; // Not sure if std::holds_alternative is thread-safe
std::variant<ready_t, init_f> value;
cudaEvent_t init_event{nullptr};

template <typename InitF>
state(InitF init, size_t size) : ready{false}, value{std::make_tuple(init, size)}
@@ -229,6 +230,7 @@ struct dataset_descriptor_host {
auto& [ptr, stream] = std::get<ready_t>(value);
RAFT_CUDA_TRY_NO_THROW(cudaFreeAsync(ptr, stream));
}
if (init_event != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaEventDestroy(init_event)); }
}

void eval(rmm::cuda_stream_view stream)
@@ -237,8 +239,12 @@ struct dataset_descriptor_host {
if (std::holds_alternative<init_f>(value)) {
auto& [fun, size] = std::get<init_f>(value);
dev_descriptor_t* ptr = nullptr;
RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));

Contributor

Based on the feedback received on my PR (https://github.com/rapidsai/cuvs/pull/1771/changes#diff-52f864438c5274d5d365954ba71d9988bb00f8c415e2b3c081aa532f3a1a8f07R34-R36), one wonders if debugging might be easier if one used RAFT_EXPECTS with a message here instead?

What is the generally accepted convention?

Contributor Author

Good question. The convention appears to be RAFT_CUDA_TRY for error handling of CUDA API calls, with RAFT_EXPECTS used mostly for assertion-style checks. RAFT_CUDA_TRY should report the file and line where the error happened. That said, RAFT_EXPECTS may indeed come in handy when one wants to display a specific error message.
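
To make the distinction in this exchange concrete, here is a minimal CPU-only sketch. It does not reproduce RAFT's macros: `SKETCH_API_TRY`, `SKETCH_EXPECTS`, `fake_status`, and `fake_alloc` are hypothetical names invented for illustration, and the exact behavior of RAFT_CUDA_TRY and RAFT_EXPECTS should be checked against the RAFT headers.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an API status code, so the sketch builds without CUDA.
enum class fake_status { success = 0, out_of_memory = 2 };

inline const char* fake_status_name(fake_status s)
{
  return s == fake_status::success ? "success" : "out_of_memory";
}

// Hypothetical "try"-style macro: wraps an API call and, on failure, throws
// with the call text, the status name, and the file/line of the call site.
#define SKETCH_API_TRY(call)                                         \
  do {                                                               \
    const fake_status status_ = (call);                              \
    if (status_ != fake_status::success) {                           \
      std::ostringstream oss_;                                       \
      oss_ << "API call failed: " << #call << " -> "                 \
           << fake_status_name(status_) << " at " << __FILE__ << ":" \
           << __LINE__;                                              \
      throw std::runtime_error(oss_.str());                          \
    }                                                                \
  } while (0)

// Hypothetical "expects"-style macro: checks a condition and throws with a
// caller-supplied message that describes the violated expectation.
#define SKETCH_EXPECTS(cond, msg)                                          \
  do {                                                                     \
    if (!(cond)) {                                                         \
      throw std::logic_error(std::string("Expectation failed: ") + (msg)); \
    }                                                                      \
  } while (0)

// Fake API call that fails when asked for zero bytes.
inline fake_status fake_alloc(std::size_t bytes)
{
  return bytes == 0 ? fake_status::out_of_memory : fake_status::success;
}

// Returns the message thrown by the API wrapper, or an empty string on success.
inline std::string try_alloc(std::size_t bytes)
{
  try {
    SKETCH_API_TRY(fake_alloc(bytes));
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "";
}

// Returns the message thrown by the expectation check, or an empty string.
inline std::string check_size(std::size_t bytes)
{
  try {
    SKETCH_EXPECTS(bytes > 0, "descriptor size must be positive");
  } catch (const std::logic_error& e) {
    return e.what();
  }
  return "";
}
```

In this sketch the try-style wrapper needs no message from the caller because the failing call and its location are captured automatically, while the expects-style check carries whatever context the author chooses to write.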

RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
fun(ptr, stream);
// Record an event after initialization so that other streams can establish
// a GPU-side dependency without expensive host synchronization.
RAFT_CUDA_TRY(cudaEventRecord(init_event, stream));
value = std::make_tuple(ptr, stream);
ready.store(true, std::memory_order_release);
Comment on lines +242 to 249

@coderabbitai coderabbitai Bot May 13, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle partial init failures without leaking event/device memory

If initialization fails after Line 242 (e.g., Line 243, Line 244, or Line 247), value stays in init_f, so the allocated ptr is not reclaimed, and a later retry can overwrite init_event and leak the previous handle. Please keep event/pointer local until success, and clean both in a failure path before rethrow.

Proposed fix
 void eval(rmm::cuda_stream_view stream)
 {
   std::lock_guard<std::mutex> lock(mutex);
   if (std::holds_alternative<init_f>(value)) {
     auto& [fun, size]     = std::get<init_f>(value);
     dev_descriptor_t* ptr = nullptr;
-    RAFT_CUDA_TRY(cudaEventCreateWithFlags(&init_event, cudaEventDisableTiming));
-    RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
-    fun(ptr, stream);
-    // Record an event after initialization so that other streams can establish
-    // a GPU-side dependency without expensive host synchronization.
-    RAFT_CUDA_TRY(cudaEventRecord(init_event, stream));
-    value = std::make_tuple(ptr, stream);
-    ready.store(true, std::memory_order_release);
+    cudaEvent_t local_event{nullptr};
+    try {
+      RAFT_CUDA_TRY(cudaEventCreateWithFlags(&local_event, cudaEventDisableTiming));
+      RAFT_CUDA_TRY(cudaMallocAsync(&ptr, size, stream));
+      fun(ptr, stream);
+      // Record an event after initialization so that other streams can establish
+      // a GPU-side dependency without expensive host synchronization.
+      RAFT_CUDA_TRY(cudaEventRecord(local_event, stream));
+      init_event = local_event;
+      value      = std::make_tuple(ptr, stream);
+      ready.store(true, std::memory_order_release);
+    } catch (...) {
+      if (ptr != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaFreeAsync(ptr, stream)); }
+      if (local_event != nullptr) { RAFT_CUDA_TRY_NO_THROW(cudaEventDestroy(local_event)); }
+      throw;
+    }
   }
 }

As per coding guidelines: "Device memory allocations (cudaMalloc, RMM functions) must have corresponding deallocations; use RAII patterns ... to prevent GPU memory leaks on error paths".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/neighbors/detail/cagra/compute_distance.hpp` around lines 242 - 249,
The code currently assigns init_event and ptr into outer state before all
subsequent calls complete, so failures in cudaMallocAsync, fun(ptr, stream), or
cudaEventRecord can leak the device pointer and event; change the routine to
keep local variables (e.g., local cudaEvent_t init_event_local and void*
ptr_local) and perform cudaEventCreateWithFlags, cudaMallocAsync, fun(ptr_local,
stream), and cudaEventRecord on those locals, and only after all succeed assign
value = std::make_tuple(ptr_local, stream) and ready.store(...); on any failure
catch/handle by destroying/freeing the local event and freeing ptr_local (using
RAII wrappers or explicit cudaEventDestroy/cudaFreeAsync calls) before
rethrowing so no GPU resources are leaked.

Contributor

Hmm. This is a good point, IMHO. If the cudaEventRecord line throws, that's the last we'll see of the ptr allocation. :/

@viclafargue, do we have an RAII-wrapper specifically for cases like this one?
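
As an illustration of what an RAII-based alternative to the try/catch in the proposed fix could look like, here is a minimal CPU-only sketch. Everything in it is hypothetical: `fake_event`, `fake_buffer`, the `create_`/`destroy_`/`alloc_`/`free_` helpers, and `lazy_state` stand in for the CUDA event, the device allocation, and the descriptor state. It does not answer whether the codebase already has a suitable wrapper.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-ins for GPU resources; the counters track what is alive.
struct fake_event {};
struct fake_buffer {};
int live_events  = 0;
int live_buffers = 0;

inline fake_event* create_event() { ++live_events; return new fake_event{}; }
inline void destroy_event(fake_event* e) { --live_events; delete e; }
inline fake_buffer* alloc_buffer() { ++live_buffers; return new fake_buffer{}; }
inline void free_buffer(fake_buffer* b) { --live_buffers; delete b; }

// Deleters let std::unique_ptr release each resource automatically.
struct event_deleter {
  void operator()(fake_event* e) const noexcept { destroy_event(e); }
};
struct buffer_deleter {
  void operator()(fake_buffer* b) const noexcept { free_buffer(b); }
};
using event_handle  = std::unique_ptr<fake_event, event_deleter>;
using buffer_handle = std::unique_ptr<fake_buffer, buffer_deleter>;

struct lazy_state {
  event_handle event;
  buffer_handle buffer;

  // fail_during_init simulates an exception thrown partway through init.
  void eval(bool fail_during_init)
  {
    // Acquire into locals first; if anything below throws, stack unwinding
    // destroys the locals and releases both resources.
    event_handle local_event{create_event()};
    buffer_handle local_buffer{alloc_buffer()};
    if (fail_during_init) { throw std::runtime_error("init failed"); }
    // Commit only after every step has succeeded; these moves do not throw.
    event  = std::move(local_event);
    buffer = std::move(local_buffer);
  }
};

// Helper: reports whether a failing eval() propagated its exception.
inline bool eval_throws(lazy_state& s)
{
  try {
    s.eval(true);
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

After a failed `eval(true)` both counters are back at zero and the members are still empty, so a retry starts from a clean state. A real deleter for a stream-ordered allocation would also have to carry the stream passed to `cudaFreeAsync`, which this sketch leaves out. Whether an existing wrapper such as `rmm::device_buffer` fits this case is a separate question.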

}
@@ -247,6 +253,11 @@ struct dataset_descriptor_host {
auto get(rmm::cuda_stream_view stream) -> dev_descriptor_t*
{
if (!ready.load(std::memory_order_acquire)) { eval(stream); }
// Make the caller's stream wait for the init to complete. This is a
// lightweight GPU-side dependency with no host blocking. On the same
// stream that performed the init (or after the event has already
// completed) this is essentially a no-op.
if (init_event != nullptr) { RAFT_CUDA_TRY(cudaStreamWaitEvent(stream, init_event)); }
return std::get<0>(std::get<ready_t>(value));
}
};
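
The `get()`/`eval()` pair in this hunk follows a double-checked lazy initialization pattern: a lock-free acquire load on the fast path, then a second check under the mutex before doing the work. A minimal CPU-only sketch of that pattern, using a hypothetical `lazy_value` class and leaving out the event wait and the `std::variant` state, might look like this:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical, CPU-only analogue of the get()/eval() pair.
class lazy_value {
 public:
  // Returns the value, running the initializer at most once.
  int get()
  {
    // Fast path: this acquire load pairs with the release store in eval(),
    // so a thread that sees ready_ == true also sees the write to value_.
    if (!ready_.load(std::memory_order_acquire)) { eval(); }
    return value_;
  }

  // Test-only accessor; not safe to call while other threads may be in eval().
  int init_count() const { return init_count_; }

 private:
  void eval()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    // Second check under the lock: another thread may have finished the
    // initialization between our first check and acquiring the mutex.
    if (ready_.load(std::memory_order_relaxed)) { return; }
    value_ = 42;  // stands in for the expensive initialization
    ++init_count_;
    ready_.store(true, std::memory_order_release);
  }

  std::mutex mutex_;
  std::atomic<bool> ready_{false};
  int value_{0};
  int init_count_{0};
};
```

The sketch only covers host-side visibility. In the diff, the additional `cudaStreamWaitEvent` call is what orders the caller's stream after the GPU work enqueued during initialization, which the atomic flag alone does not guarantee.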