MB-72262: Probe GPU devices for health #69

Merged
CascadingRadium merged 2 commits into master from gpuProbe on Jun 5, 2026

CascadingRadium (Member) commented on Jun 3, 2026

  • Currently, we use the getNumDevices API to catch any CUDA initialization issues, and we disable GPU usage if we fail to get the number of usable GPUs. Examples of caught errors are:
    • CUDA error 100: cudaErrorNoDevice
    • CUDA error 35: cudaErrorInsufficientDriver
    • CUDA error 999: cudaErrorUnknown
  • However, in a test we observed a crash stemming from CUDA error 46 (cudaErrorDevicesUnavailable), "CUDA-capable device(s) is/are busy or unavailable". It was caused by polling the GPUs for memory information, which initializes the CUDA driver context and promptly crashes the process.
  • To avoid crashing the process, use the GPU probe API, which initializes the CUDA driver context gracefully.
  • Requires: MB-72262: Add GPU Prober faiss#86
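
The two-stage check described above could look roughly like the sketch below. It is illustrative only: `deviceCounter`, `deviceProber`, `usableDevices`, and the sentinel errors are hypothetical names, and the function values stand in for the cgo wrappers around the real faiss calls, whose signatures are not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for CUDA failures.
var (
	errNoDevice   = errors.New("CUDA error 100: cudaErrorNoDevice")
	errDeviceBusy = errors.New("CUDA error 46: device busy or unavailable")
)

// deviceCounter and deviceProber are assumed stand-ins for the cgo wrappers;
// the real signatures may differ.
type deviceCounter func() (int, error)
type deviceProber func(device int) error

// usableDevices returns the IDs of the devices that pass the health probe.
// If the device count cannot be obtained, GPU usage is disabled (nil result).
func usableDevices(count deviceCounter, probe deviceProber) []int {
	n, err := count()
	if err != nil {
		return nil // CUDA initialization failed: no GPU is used at all
	}
	var healthy []int
	for dev := 0; dev < n; dev++ {
		if err := probe(dev); err != nil {
			continue // unhealthy device: skip it instead of touching it further
		}
		healthy = append(healthy, dev)
	}
	return healthy
}

func main() {
	count := func() (int, error) { return 3, nil }
	probe := func(device int) error {
		if device == 1 {
			return errDeviceBusy
		}
		return nil
	}
	fmt.Println(usableDevices(count, probe))
}
```

The point of the sketch is ordering: a device is only considered for further queries after its probe has returned without error.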

Copilot AI left a comment

Pull request overview

This PR hardens GPU initialization by probing each CUDA device for health before attempting to query free memory, with the goal of avoiding process crashes caused by CUDA context initialization failures (e.g., “device busy or unavailable”).

Changes:

  • Adds a per-device health probe (faiss_probe_gpu) prior to calling faiss_gpu_free_memory during GPU snapshot initialization.
  • Gates free-memory collection on a successful probe, defaulting snapshots to 0 bytes when probing/memory querying fails.
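
A minimal sketch of the gating summarized above, assuming a hypothetical `gpuSnapshot` type and `newSnapshot` helper. The `probe` and `freeMem` parameters stand in for wrappers around `faiss_probe_gpu` and `faiss_gpu_free_memory`, and treating the free memory at startup as the total budget is an assumption, not something confirmed by the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeviceBusy is a hypothetical error returned by a failed probe.
var errDeviceBusy = errors.New("device busy or unavailable")

// gpuSnapshot is a hypothetical record of one device's memory at startup.
type gpuSnapshot struct {
	device      int
	totalMemory uint64
	freeMemory  uint64
}

// newSnapshot probes the device first and queries free memory only when the
// probe succeeds. Any failure leaves both memory fields at 0.
func newSnapshot(device int, probe func(int) error, freeMem func(int) (uint64, error)) gpuSnapshot {
	snap := gpuSnapshot{device: device}
	if err := probe(device); err != nil {
		return snap // probe failed: never touch the memory query
	}
	free, err := freeMem(device)
	if err != nil {
		return snap // memory query failed: keep the 0-byte default
	}
	// Assumption: free memory at startup is used as the total budget.
	snap.totalMemory, snap.freeMemory = free, free
	return snap
}

func main() {
	probe := func(device int) error {
		if device == 1 {
			return errDeviceBusy
		}
		return nil
	}
	freeMem := func(device int) (uint64, error) { return 8 << 30, nil }
	for dev := 0; dev < 2; dev++ {
		fmt.Printf("%+v\n", newSnapshot(dev, probe, freeMem))
	}
}
```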

Comment thread gpu.go

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

gpu.go:232

  • The comment below is now slightly inaccurate: totMemory can remain 0 not only when free-memory lookup fails, but also when the GPU probe fails / reports the device as unusable. This makes it harder to understand why a device gets a 0-memory snapshot.
		// if we fail to get the free memory for the GPU,
		// we still create a snapshot with 0 total and free memory,
		// which will cause all reservation attempts to fail but won't cause any crashes.
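
To illustrate why a 0-memory snapshot makes reservations fail without crashing, here is a small sketch. The `snapshot` type and `reserve` method are hypothetical and do not mirror the actual gpu.go code; rejecting zero-byte requests is an added assumption so that every attempt against an empty snapshot fails.

```go
package main

import "fmt"

// snapshot is a hypothetical per-device memory record.
type snapshot struct {
	totalMemory uint64
	freeMemory  uint64
}

// reserve deducts n bytes from the free budget and reports whether it
// succeeded. With a 0-byte snapshot every request is rejected.
func (s *snapshot) reserve(n uint64) bool {
	if n == 0 || n > s.freeMemory {
		return false
	}
	s.freeMemory -= n
	return true
}

func main() {
	dead := &snapshot{} // probe or free-memory lookup failed
	live := &snapshot{totalMemory: 8 << 30, freeMemory: 8 << 30}
	fmt.Println(dead.reserve(1<<20), live.reserve(1<<20))
}
```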

Comment thread gpu.go
CascadingRadium changed the title from "MB-71999: Probe GPU devices for health" to "MB-72262: Probe GPU devices for health" on Jun 5, 2026
CascadingRadium merged commit d0292eb into master on Jun 5, 2026
1 check passed
CascadingRadium deleted the gpuProbe branch on June 5, 2026 10:51