MB-72262: Probe GPU devices for health by CascadingRadium · Pull Request #69 · blevesearch/go-faiss

CascadingRadium · 2026-06-03T11:09:58Z

Currently, we use the getNumDevices API to catch any CUDA initialization issues and disable GPU usage if we fail to get the number of usable GPUs. Examples of errors caught this way:
- CUDA error 100: cudaErrorNoDevice
- CUDA error 35: cudaErrorInsufficientDriver
- CUDA error 999: cudaErrorUnknown
However, in a test we observed a crash stemming from CUDA error 46 (CUDA-capable device(s) is/are busy or unavailable). The crash is caused by polling the GPUs for their memory information, which initializes the CUDA driver context and promptly crashes the process.
To ensure that we do not crash the process, use the GPU probe API, which initializes the CUDA driver context gracefully.
Requires: MB-72262: Add GPU Prober faiss#86
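
The two-stage check described above can be sketched in plain Go. This is a minimal illustration, not the go-faiss implementation: `cudaError`, `usableDevices`, and the `count` and `probe` callbacks are hypothetical names standing in for the getNumDevices call and the GPU probe API, and the real bindings may report failures differently.

```go
package main

import "fmt"

// cudaError is a hypothetical Go-side wrapper for a CUDA status code.
// It is not part of go-faiss; it only keeps this sketch self-contained.
type cudaError int

func (e cudaError) Error() string {
	return fmt.Sprintf("CUDA error %d", int(e))
}

// Status codes named in the description above.
const (
	errInsufficientDriver cudaError = 35  // cudaErrorInsufficientDriver
	errDevicesUnavailable cudaError = 46  // device(s) busy or unavailable
	errNoDevice           cudaError = 100 // cudaErrorNoDevice
	errUnknown            cudaError = 999 // cudaErrorUnknown
)

// usableDevices runs the two stages in order:
//  1. count reports how many GPUs are visible; any error disables GPU usage.
//  2. probe is called once per device; devices that fail are skipped
//     instead of being touched by later calls.
// Both callbacks are assumptions standing in for the real bindings.
func usableDevices(count func() (int, error), probe func(device int) error) []int {
	n, err := count()
	if err != nil {
		return nil // initialization failed: run without GPUs
	}
	var usable []int
	for d := 0; d < n; d++ {
		if probe(d) == nil {
			usable = append(usable, d)
		}
	}
	return usable
}
```

With three visible devices and a probe that returns `errDevicesUnavailable` for device 1, `usableDevices` returns devices 0 and 2. If `count` itself fails, the result is nil and no device is probed.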

Copilot

Pull request overview

This PR hardens GPU initialization by probing each CUDA device for health before attempting to query free memory, with the goal of avoiding process crashes caused by CUDA context initialization failures (e.g., “device busy or unavailable”).

Changes:

- Adds a per-device health probe (faiss_probe_gpu) prior to calling faiss_gpu_free_memory during GPU snapshot initialization.
- Gates free-memory collection on a successful probe, defaulting snapshots to 0 bytes when probing/memory querying fails.

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

gpu.go:232

The comment below is now slightly inaccurate: totMemory can remain 0 not only when free-memory lookup fails, but also when the GPU probe fails / reports the device as unusable. This makes it harder to understand why a device gets a 0-memory snapshot.

		// if we fail to get the free memory for the GPU,
		// we still create a snapshot with 0 total and free memory,
		// which will cause all reservation attempts to fail but won't cause any crashes.

Probe GPU devices for health

fc3c920

CascadingRadium requested a review from Copilot June 3, 2026 11:10

Copilot started reviewing on behalf of CascadingRadium June 3, 2026 11:10

Copilot AI reviewed Jun 3, 2026

Comment thread gpu.go

fixup API

11119bb

CascadingRadium requested a review from Copilot June 3, 2026 12:40

Copilot started reviewing on behalf of CascadingRadium June 3, 2026 12:40

Copilot AI reviewed Jun 3, 2026

Comment thread gpu.go

CascadingRadium requested review from a team, Likith101, Samsonnyyeet, Thejas-bhat, capemox, maneuvertomars and steveyen June 3, 2026 12:59

CascadingRadium changed the title ~~MB-71999: Probe GPU devices for health~~ MB-72262: Probe GPU devices for health Jun 5, 2026

capemox approved these changes Jun 5, 2026

Likith101 approved these changes Jun 5, 2026

CascadingRadium merged commit d0292eb into master Jun 5, 2026
1 check passed

CascadingRadium deleted the gpuProbe branch June 5, 2026 10:51

CascadingRadium merged 2 commits into master from gpuProbe
