
improve idle detection robustness #4

Merged
wseaton merged 5 commits into main from feat/idle-detection-robustness on Apr 1, 2026

@wseaton commented Apr 1, 2026

Summary

  • `avg_over_time` → `max_over_time`: If the peak GPU activity over the lookback window is 0, the GPU was truly dead the entire time. This eliminates false positives on bursty training workloads, where averaging can smooth an active job until it looks idle.
  • H100/H200 fallback: `DCGM_FI_PROF_GR_ENGINE_ACTIVE` can vanish entirely on newer GPUs when the profiling module fails to load (NVIDIA/DCGM#226). Added `DCGM_FI_DEV_GPU_UTIL` as a PromQL `or` fallback, normalized from the 0-100 scale to 0-1.
  • Optional power-draw corroboration: New `--power-threshold` flag (watts). When set, pods whose peak power usage is above the threshold are excluded from idle candidates via PromQL `unless`, even if compute utilization reads zero. Safe default: disabled. Missing power metrics don't block detection.
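To make the first bullet concrete, here is a minimal Rust sketch (not code from this PR) of why a mean can hide a burst while a peak cannot. The function names, the sample values, and the 0.1 threshold are all illustrative assumptions; the real aggregation happens inside Prometheus, not in Rust.

```rust
/// Mean of the samples in a lookback window, i.e. what `avg_over_time` reports.
fn avg_over_window(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Peak of the samples in a lookback window, i.e. what `max_over_time` reports.
fn max_over_window(samples: &[f64]) -> Option<f64> {
    samples.iter().copied().reduce(f64::max)
}

/// Hypothetical idle check: a window counts as idle only when a value exists
/// and sits at or below the threshold. A window with no samples is treated
/// as "not idle" here; that choice is an assumption of this sketch.
fn is_idle(value: Option<f64>, threshold: f64) -> bool {
    matches!(value, Some(v) if v <= threshold)
}
```

With nine samples at 0.0 and one at 0.9, the mean is roughly 0.09, so a 0.1 threshold on the average would flag the pod as idle, while the peak of 0.9 would not. An all-zero window is idle under either aggregation.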

All three changes live in the PromQL template with one new CLI flag. No changes to Rust processing logic.
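As a rough illustration of how the three pieces could compose into one query, the sketch below builds a PromQL string with plain `format!`. It is an assumption-laden approximation: the PR renders a jinja-style template, whereas `IdleQuery` and `render_idle_query` are hypothetical names, the `== 0` comparison stands in for whatever idle condition the real template uses, and label filters are omitted. The metric names `DCGM_FI_PROF_GR_ENGINE_ACTIVE` and `DCGM_FI_DEV_GPU_UTIL` come from the summary; `DCGM_FI_DEV_POWER_USAGE` is assumed to be the power metric.

```rust
/// Hypothetical parameters for this sketch; not the PR's actual types.
struct IdleQuery<'a> {
    /// Lookback window as a PromQL duration, e.g. "30m".
    lookback: &'a str,
    /// Mirrors the optional --power-threshold flag (watts); None means disabled.
    power_threshold_watts: Option<f64>,
}

/// Render an idle-candidate query from the parameters above.
fn render_idle_query(q: &IdleQuery) -> String {
    let w = q.lookback;

    // Peak profiling metric, falling back to the device-level utilization
    // metric (scaled from 0-100 to 0-1) when the profiling series is missing.
    // Parentheses are explicit because `or` binds more loosely than `/` and `==`.
    let utilization = format!(
        "(max_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[{}]) or (max_over_time(DCGM_FI_DEV_GPU_UTIL[{}]) / 100))",
        w, w
    );
    let idle = format!("({} == 0)", utilization);

    match q.power_threshold_watts {
        // `unless` drops idle candidates that have a matching high-power series.
        // If no power series exists, nothing matches and nothing is dropped.
        Some(watts) => format!(
            "{} unless (max_over_time(DCGM_FI_DEV_POWER_USAGE[{}]) > {})",
            idle, w, watts
        ),
        None => idle,
    }
}
```

Keeping the power clause behind an `Option` mirrors the "safe default: disabled" behavior: with `None`, the rendered query contains no `unless` at all.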

Test plan

  • 8 new unit tests covering template rendering (`max_over_time`, fallback, `unless` toggling, filter propagation)
  • cargo clippy clean
  • All 44 tests pass
  • Manual validation against a live Prometheus with DCGM metrics

wseaton added 5 commits March 31, 2026 20:36
- propagate scale errors for notebook/inferenceservice instead of swallowing
- replace .expect() on missing pod timestamp with warning + skip
- fix QueryResposne typo -> QueryResponse
- fix misleading "ReplicaSet" log message in statefulset context
- eliminate double Cli::parse() in setup_logging
- add delegate_resource_ext! macro to reduce match-arm boilerplate
- parallelize pod processing with buffer_unordered(10)
- remove unnecessary shutdown_events clone
- safely handle missing pod status/phase
…hold

- switch avg_over_time to max_over_time for stricter idle detection
- add DCGM_FI_DEV_GPU_UTIL fallback via PromQL `or` for H100/H200
  where profiling metrics can go missing (NVIDIA/DCGM#226)
- add optional --power-threshold flag (watts) that excludes pods with
  high power draw from idle candidates via PromQL `unless`
- add 8 unit tests for template rendering

- template uses jinja variables for pod/namespace/container label names
- --honor-labels switches between exported_* (Prometheus default) and
  native dcgm-exporter labels (pod/namespace/container)
- PodMetricData extraction tries both label conventions as safety net
- node_type defaults to "unknown" when node_dmi_info is absent
- 3 new tests for honor-labels template rendering

the node_dmi_info metric (from node-exporter) is not always available.
use a PromQL `or` fallback so idle detection works without it; node_type
defaults to "unknown" when absent.
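The label handling described in the last two commit messages (try both label conventions, default `node_type` to "unknown") could look roughly like the sketch below. `PodRef`, `first_label`, and `extract_pod_ref` are hypothetical names and not the PR's `PodMetricData` code; the lookup order (`exported_*` first) is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
struct PodRef {
    pod: String,
    namespace: String,
    node_type: String,
}

/// Return the value of the first key that is present in the label set.
fn first_label<'a>(labels: &'a HashMap<String, String>, keys: &[&str]) -> Option<&'a str> {
    keys.iter()
        .find_map(|k| labels.get(*k))
        .map(String::as_str)
}

/// Extract pod identity from a series' labels, accepting either the
/// `exported_*` names or the native dcgm-exporter names.
fn extract_pod_ref(labels: &HashMap<String, String>) -> Option<PodRef> {
    let pod = first_label(labels, &["exported_pod", "pod"])?;
    let namespace = first_label(labels, &["exported_namespace", "namespace"])?;
    // node_type may be missing when node_dmi_info is unavailable.
    let node_type = first_label(labels, &["node_type"]).unwrap_or("unknown");
    Some(PodRef {
        pod: pod.to_string(),
        namespace: namespace.to_string(),
        node_type: node_type.to_string(),
    })
}
```

Returning `None` when neither convention yields a pod or namespace lets the caller skip the series with a warning instead of panicking.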
@wseaton merged commit 60c2ac7 into main on Apr 1, 2026
4 checks passed