Skip to content

Fix DCGM group cleanup on watch and health error paths#122

Merged
rvatkar merged 3 commits into
mainfrom
rvatkar/fix/group-cleanup-watch-errors
May 27, 2026
Merged

Fix DCGM group cleanup on watch and health error paths#122
rvatkar merged 3 commits into
mainfrom
rvatkar/fix/group-cleanup-watch-errors

Conversation

@rvatkar

@rvatkar rvatkar commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes DCGM group lifecycle issues:

  • DCGM-6901: destroy health-check groups on error from healthCheckByGpuId.
  • DCGM-6902: destroy PID-watch groups on error from watchPidFields.
  • DCGM-6903: destroy field-watch groups on error from WatchFields.
  • DCGM-6917: return UpdateAllFields errors from WatchFields and watchPidFields, matching WatchFieldsWithGroupEx.
  • DCGM-6921: destroy the REST sample PID-watch group after getProcessInfo.

The watch helpers now clean up only on failure and preserve group ownership on success. PID-watch DCGM errors still return *dcgm.Error, and the conditional getSupportedDevices() path is unchanged.

Tests

  • go test ./pkg/dcgm -run 'Test(GroupCapErrorMatcher|HealthCheckByGpuIdDoesNotLeakGroupOnError|WatchFieldsDoesNotLeakGroupOnError|WatchPidFieldsDoesNotLeakGroupOnError| WatchPidFieldsDoesNotLeakGroupOnNonRootWatchError|WatchFieldsDoesNotDestroyGroupOnSuccess|WatchPidFieldsDoesNotDestroyGroupOnSuccess|WatchFieldsReturnsUpdateAllFieldsErrorAndDestroysGroup| WatchPidFieldsReturnsUpdateAllFieldsErrorAndDestroysGroup)' -count=1 -v
  • go test ./samples/restApi/... -count=1

@rvatkar rvatkar requested a review from nccurry May 20, 2026 05:15
Comment thread pkg/dcgm/group_lifecycle_test.go Outdated
nccurry
nccurry previously approved these changes May 20, 2026
@nccurry

nccurry commented May 22, 2026

Copy link
Copy Markdown
Collaborator

Applies to tests/processinfo_test.go:21-23, which is outside this PR diff:

group, err := dcgm.WatchPidFields()
if err != nil {
t.Fatalf("Failed to watch PID fields: %v", err)

The PR added non-root-aware coverage for the new package tests, but the existing process-info tests still hard-fail when WatchPidFields returns Host engine is running as non-root. Because this PR changes PID-watch cleanup, go test ./... cannot validate the repo on the same non-root environment where the new regression tests now skip correctly. Consider applying the same root-aware skip/helper here.

Comment thread pkg/dcgm/fields.go Outdated
Comment thread samples/restApi/handlers/dcgm.go
@rvatkar rvatkar merged commit 5753e4e into main May 27, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants