improve idle detection robustness by wseaton · Pull Request #4 · wseaton/gpu-pruner

wseaton · 2026-04-01T00:57:15Z

Summary

avg_over_time → max_over_time: If the peak GPU activity over the lookback window is 0, the GPU was truly dead the entire time. Eliminates false positives on bursty training workloads where averaging can smooth active jobs to look idle.
H100/H200 fallback: DCGM_FI_PROF_GR_ENGINE_ACTIVE can vanish entirely on newer GPUs when the profiling module fails to load (NVIDIA/DCGM#226). Added DCGM_FI_DEV_GPU_UTIL as a PromQL `or` fallback, normalized from the 0-100 scale to 0-1.
Optional power-draw corroboration: New --power-threshold flag (watts). When set, pods whose peak power usage exceeds the threshold are excluded from idle candidates via PromQL `unless`, even if compute utilization reads zero. Safe default: disabled. Missing power metrics don't block detection.
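
To make the avg_over_time vs. max_over_time point concrete, here is a minimal Rust sketch that mimics the two aggregations on an in-memory window of samples. It is only an illustration: the real aggregation happens inside Prometheus, and the function names and the averaging threshold below are hypothetical, not taken from gpu-pruner.

```rust
/// Mean of the samples in a lookback window (roughly what `avg_over_time` reports).
fn avg_over_window(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Peak of the samples in a lookback window (roughly what `max_over_time` reports).
fn max_over_window(samples: &[f64]) -> Option<f64> {
    samples.iter().copied().reduce(f64::max)
}

/// Hypothetical average-based check: idle when the mean is at or below `threshold`.
/// An empty window yields no decision (assumption: no data means "not idle").
fn idle_by_avg(samples: &[f64], threshold: f64) -> bool {
    avg_over_window(samples).map_or(false, |avg| avg <= threshold)
}

/// Peak-based check: idle only when every sample in the window is exactly zero.
fn idle_by_max(samples: &[f64]) -> bool {
    max_over_window(samples).map_or(false, |peak| peak == 0.0)
}
```

With 58 zero samples followed by a short burst of 0.9 and 0.8, the mean is about 0.028, so an average-based check with a threshold of 0.05 would flag the GPU as idle, while the peak-based check would not.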

All three changes live in the PromQL template with one new CLI flag. No changes to Rust processing logic.
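
As a rough picture of how the three pieces could compose into one query, the sketch below builds a PromQL string in plain Rust. It is not the project's Jinja template: `build_idle_query` is a hypothetical helper, the `== 0` comparison and the use of DCGM_FI_DEV_POWER_USAGE as the power metric are assumptions, and label filters and `on(...)` matching are omitted for brevity.

```rust
/// Hypothetical helper that assembles an idle-detection PromQL query.
/// `window` is a PromQL duration such as "30m"; `power_threshold_watts`
/// mirrors the optional --power-threshold flag (None = disabled).
fn build_idle_query(window: &str, power_threshold_watts: Option<f64>) -> String {
    // Peak profiling-engine activity, falling back to the 0-100 utilization
    // gauge (scaled to 0-1) for series where the profiling metric is absent.
    let base = format!(
        "(max_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[{w}]) \
         or (max_over_time(DCGM_FI_DEV_GPU_UTIL[{w}]) / 100)) == 0",
        w = window
    );

    match power_threshold_watts {
        // Drop candidates whose peak power draw exceeds the threshold.
        // If the power metric is missing, the right-hand side is empty and
        // `unless` removes nothing, so detection is not blocked.
        Some(watts) => format!(
            "({base}) unless (max_over_time(DCGM_FI_DEV_POWER_USAGE[{w}]) > {watts})",
            base = base,
            w = window,
            watts = watts
        ),
        None => base,
    }
}
```

Calling `build_idle_query("30m", None)` yields a query with no `unless` clause, which matches the disabled-by-default behavior described above.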

Test plan

8 new unit tests covering template rendering (max_over_time, fallback, `unless` toggling, filter propagation)
cargo clippy clean
All 44 tests pass
Manual validation against a live Prometheus with DCGM metrics

- propagate scale errors for notebook/inferenceservice instead of swallowing
- replace .expect() on missing pod timestamp with warning + skip
- fix QueryResposne typo -> QueryResponse
- fix misleading "ReplicaSet" log message in statefulset context
- eliminate double Cli::parse() in setup_logging
- add delegate_resource_ext! macro to reduce match-arm boilerplate
- parallelize pod processing with buffer_unordered(10)
- remove unnecessary shutdown_events clone
- safely handle missing pod status/phase

…hold
- switch avg_over_time to max_over_time for stricter idle detection
- add DCGM_FI_DEV_GPU_UTIL fallback via PromQL `or` for H100/H200 where profiling metrics can go missing (NVIDIA/DCGM#226)
- add optional --power-threshold flag (watts) that excludes pods with high power draw from idle candidates via PromQL `unless`
- add 8 unit tests for template rendering

- template uses jinja variables for pod/namespace/container label names
- --honor-labels switches between exported_* (Prometheus default) and native dcgm-exporter labels (pod/namespace/container)
- PodMetricData extraction tries both label conventions as safety net
- node_type defaults to "unknown" when node_dmi_info is absent
- 3 new tests for honor-labels template rendering
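
The label handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `LabelNames`, `label_names`, and `pod_identity` are hypothetical names, and the real PodMetricData extraction may differ in shape and lookup order.

```rust
use std::collections::HashMap;

/// Label names substituted into the query template (hypothetical struct).
struct LabelNames {
    pod: &'static str,
    namespace: &'static str,
    container: &'static str,
}

/// With honor-labels enabled, use the native dcgm-exporter labels;
/// otherwise use the `exported_*` names that Prometheus produces by default.
fn label_names(honor_labels: bool) -> LabelNames {
    if honor_labels {
        LabelNames { pod: "pod", namespace: "namespace", container: "container" }
    } else {
        LabelNames {
            pod: "exported_pod",
            namespace: "exported_namespace",
            container: "exported_container",
        }
    }
}

/// Safety net when reading a result series: try both conventions and
/// return (namespace, pod) if either pair of labels is present.
fn pod_identity(labels: &HashMap<String, String>) -> Option<(String, String)> {
    let pod = labels.get("exported_pod").or_else(|| labels.get("pod"))?;
    let namespace = labels
        .get("exported_namespace")
        .or_else(|| labels.get("namespace"))?;
    Some((namespace.clone(), pod.clone()))
}
```

Trying both conventions at extraction time means a mismatch between the flag and the actual Prometheus scrape configuration degrades gracefully instead of producing series with no pod attached.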


wseaton added 5 commits March 31, 2026 20:36

make node_dmi_info join optional for node_type enrichment

e1288e6

the node_dmi_info metric (from node-exporter) is not always available. use PromQL `or` fallback so idle detection works without it, node_type defaults to "unknown" when absent.

clean up logging, deduplicate daemon/single-run branches, fix error m…

116a3ff

…essages

wseaton merged commit 60c2ac7 into main Apr 1, 2026
4 checks passed

wseaton merged 5 commits into main from feat/idle-detection-robustness
