MB-72262: Probe GPU devices for health #69
Merged
Pull request overview
This PR hardens GPU initialization by probing each CUDA device for health before attempting to query free memory, with the goal of avoiding process crashes caused by CUDA context initialization failures (e.g., “device busy or unavailable”).
Changes:
- Adds a per-device health probe (`faiss_probe_gpu`) prior to calling `faiss_gpu_free_memory` during GPU snapshot initialization.
- Gates free-memory collection on a successful probe, defaulting snapshots to 0 bytes when probing/memory querying fails.
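The gating described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the code in gpu.go: `gpuSnapshot`, `probeFunc`, `freeMemFunc`, `snapshotDevices`, and `errDeviceUnavailable` are hypothetical names, and the function-typed parameters stand in for the cgo wrappers around `faiss_probe_gpu` and `faiss_gpu_free_memory`, whose real signatures are not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeviceUnavailable is a hypothetical sentinel for a failed probe.
var errDeviceUnavailable = errors.New("device busy or unavailable")

// gpuSnapshot is a hypothetical per-device record; the real type in gpu.go may differ.
type gpuSnapshot struct {
	device    int
	healthy   bool
	totMemory uint64
}

// probeFunc and freeMemFunc are assumed stand-ins for the wrappers around
// faiss_probe_gpu and faiss_gpu_free_memory.
type probeFunc func(device int) error
type freeMemFunc func(device int) (uint64, error)

// snapshotDevices probes each device first and queries free memory only
// when the probe succeeds. Every device still gets a snapshot; a failed
// probe or failed memory query leaves totMemory at 0.
func snapshotDevices(n int, probe probeFunc, freeMem freeMemFunc) []gpuSnapshot {
	snaps := make([]gpuSnapshot, 0, n)
	for d := 0; d < n; d++ {
		s := gpuSnapshot{device: d}
		if err := probe(d); err == nil {
			s.healthy = true
			if free, err := freeMem(d); err == nil {
				s.totMemory = free
			}
		}
		snaps = append(snaps, s)
	}
	return snaps
}

func main() {
	// Simulated backends: device 1 fails its probe.
	probe := func(d int) error {
		if d == 1 {
			return errDeviceUnavailable
		}
		return nil
	}
	freeMem := func(d int) (uint64, error) { return 8 << 30, nil }

	for _, s := range snapshotDevices(2, probe, freeMem) {
		fmt.Printf("device %d healthy=%t totMemory=%d\n", s.device, s.healthy, s.totMemory)
	}
}
```

The point of the ordering is that the free-memory query is never reached for a device that failed its probe, so the call that would initialize a CUDA context on a bad device is skipped entirely.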
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
gpu.go:232
- The comment below is now slightly inaccurate: `totMemory` can remain 0 not only when free-memory lookup fails, but also when the GPU probe fails / reports the device as unusable. This makes it harder to understand why a device gets a 0-memory snapshot.
// if we fail to get the free memory for the GPU,
// we still create a snapshot with 0 total and free memory,
// which will cause all reservation attempts to fail but won't cause any crashes.
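To illustrate why a 0-memory snapshot rejects reservations without crashing, here is a minimal sketch. The `memSnapshot` type, its `reserve` method, and `errInsufficientMemory` are hypothetical and only assume that reservations are checked against the snapshot's remaining capacity; the actual reservation logic in gpu.go is not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errInsufficientMemory is a hypothetical error for a rejected reservation.
var errInsufficientMemory = errors.New("insufficient GPU memory")

// memSnapshot is a hypothetical stand-in for the snapshot type in gpu.go.
type memSnapshot struct {
	totMemory uint64 // 0 when the probe or the free-memory lookup failed
	reserved  uint64 // invariant: reserved <= totMemory
}

// reserve succeeds only if the request fits in the unreserved remainder.
// With totMemory == 0, every non-zero request is rejected with an error
// instead of touching the device.
func (s *memSnapshot) reserve(n uint64) error {
	if n > s.totMemory-s.reserved {
		return errInsufficientMemory
	}
	s.reserved += n
	return nil
}

func main() {
	unusable := &memSnapshot{} // snapshot of a device that failed its probe
	fmt.Println(unusable.reserve(1 << 20))

	usable := &memSnapshot{totMemory: 4 << 30}
	fmt.Println(usable.reserve(1 << 30))
}
```

Under this assumption, callers see an ordinary error for an unusable device and can fall back, which matches the comment's claim that reservations fail but nothing crashes.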
capemox
approved these changes
Jun 5, 2026
Likith101
approved these changes
Jun 5, 2026
- CUDA error 100: cudaErrorNoDevice
- CUDA error 35: cudaErrorInsufficientDriver
- CUDA error 999: cudaErrorUnknown
- CUDA error 46: CUDA-capable device(s) is/are busy or unavailable

These errors are triggered by polling the GPUs for memory information, which in turn initializes the CUDA driver context and promptly crashes the process.