improve idle detection robustness #4
Merged
- propagate scale errors for notebook/inferenceservice instead of swallowing
- replace .expect() on missing pod timestamp with warning + skip
- fix QueryResposne typo -> QueryResponse
- fix misleading "ReplicaSet" log message in statefulset context
- eliminate double Cli::parse() in setup_logging
- add delegate_resource_ext! macro to reduce match-arm boilerplate
- parallelize pod processing with buffer_unordered(10)
- remove unnecessary shutdown_events clone
- safely handle missing pod status/phase
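The "warning + skip" and "missing pod status/phase" items above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `PodInfo`, `eligible_pods`, and the age check are hypothetical stand-ins, and the real implementation presumably works on Kubernetes API types and a logging macro rather than `eprintln!`.

```rust
/// Hypothetical, simplified stand-in for a Kubernetes pod (not the PR's type).
#[derive(Debug, Clone)]
struct PodInfo {
    name: String,
    /// Start time as Unix seconds; `None` when the object carries no timestamp.
    start_time: Option<i64>,
    /// Pod phase such as "Running"; `None` when status or phase is missing.
    phase: Option<String>,
}

/// Returns the names of running pods that are at least `min_age_secs` old.
/// Pods without a phase are ignored; pods without a timestamp are skipped
/// with a warning instead of panicking via `.expect()`.
fn eligible_pods(pods: &[PodInfo], now: i64, min_age_secs: i64) -> Vec<String> {
    let mut out = Vec::new();
    for pod in pods {
        // A missing status/phase is treated as "not running".
        if pod.phase.as_deref() != Some("Running") {
            continue;
        }
        // Warn and move on rather than aborting the whole pass.
        let Some(started) = pod.start_time else {
            eprintln!("warning: pod {} has no start timestamp, skipping", pod.name);
            continue;
        };
        if now - started >= min_age_secs {
            out.push(pod.name.clone());
        }
    }
    out
}
```

One pod with incomplete metadata then costs a log line rather than a crash of the whole detection loop.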
…hold
- switch avg_over_time to max_over_time for stricter idle detection
- add DCGM_FI_DEV_GPU_UTIL fallback via PromQL `or` for H100/H200 where profiling metrics can go missing (NVIDIA/DCGM#226)
- add optional --power-threshold flag (watts) that excludes pods with high power draw from idle candidates via PromQL `unless`
- add 8 unit tests for template rendering
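To see why the switch from averaging to peak matters, here is a small worked example in plain Rust. It only mimics the aggregation arithmetic over a lookback window. The sample values and the 0.2 threshold are made up for illustration, and unlike PromQL, which returns no series for an empty window, these helpers return 0.0.

```rust
/// Mean utilization over the window (0.0 for an empty window, a simplification).
fn avg_over_window(samples: &[f64]) -> f64 {
    if samples.is_empty() {
        return 0.0;
    }
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Peak utilization over the window (samples are assumed non-negative).
fn max_over_window(samples: &[f64]) -> f64 {
    samples.iter().copied().fold(0.0, f64::max)
}

/// A window counts as idle when the aggregated value does not exceed `threshold`.
fn is_idle(aggregated: f64, threshold: f64) -> bool {
    aggregated <= threshold
}
```

For a bursty window such as `[0.0, 0.0, 0.0, 0.9, 0.0, 0.0]`, the mean is about 0.15 while the peak is 0.9. With a hypothetical threshold of 0.2, the averaged value reads as idle, but the peak does not.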
- template uses jinja variables for pod/namespace/container label names
- --honor-labels switches between exported_* (Prometheus default) and native dcgm-exporter labels (pod/namespace/container)
- PodMetricData extraction tries both label conventions as safety net
- node_type defaults to "unknown" when node_dmi_info is absent
- 3 new tests for honor-labels template rendering
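A rough sketch of the label handling described above, under stated assumptions: `LabelNames`, `label_names`, `pod_from_labels`, and `node_type_or_unknown` are hypothetical names, and the lookup order (exported_* first, then native) is a guess rather than something the commit message specifies.

```rust
use std::collections::HashMap;

/// Label names substituted into the query template (hypothetical struct).
#[derive(Debug, PartialEq, Eq)]
struct LabelNames {
    pod: &'static str,
    namespace: &'static str,
    container: &'static str,
}

/// Picks the label convention based on the --honor-labels flag.
fn label_names(honor_labels: bool) -> LabelNames {
    if honor_labels {
        // Native dcgm-exporter labels are kept as-is.
        LabelNames { pod: "pod", namespace: "namespace", container: "container" }
    } else {
        // Prometheus default: conflicting scraped labels get an exported_ prefix.
        LabelNames {
            pod: "exported_pod",
            namespace: "exported_namespace",
            container: "exported_container",
        }
    }
}

/// Safety net at extraction time: try both conventions (order is an assumption).
fn pod_from_labels(labels: &HashMap<String, String>) -> Option<&str> {
    labels
        .get("exported_pod")
        .or_else(|| labels.get("pod"))
        .map(String::as_str)
}

/// Falls back to "unknown" when no node_type label is present.
fn node_type_or_unknown(labels: &HashMap<String, String>) -> String {
    labels
        .get("node_type")
        .cloned()
        .unwrap_or_else(|| "unknown".to_string())
}
```

Reading both conventions at extraction time means a mismatch between the flag and the actual scrape configuration is less likely to silently drop pods.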
the node_dmi_info metric (from node-exporter) is not always available. use a PromQL `or` fallback so idle detection works without it; node_type defaults to "unknown" when the metric is absent.
Summary
- `avg_over_time` → `max_over_time`: If the peak GPU activity over the lookback window is 0, the GPU was truly dead the entire time. Eliminates false positives on bursty training workloads where averaging can smooth active jobs to look idle.
- `DCGM_FI_PROF_GR_ENGINE_ACTIVE` can vanish entirely on newer GPUs when the profiling module fails to load (NVIDIA/DCGM#226). Added `DCGM_FI_DEV_GPU_UTIL` as a PromQL `or` fallback, normalized from 0-100 to 0-1 scale.
- `--power-threshold` flag (watts). When set, pods with peak power usage above the threshold are excluded from idle candidates via PromQL `unless`, even if compute utilization reads zero. Safe default: disabled. Missing power metrics don't block detection.

All three changes live in the PromQL template with one new CLI flag. No changes to Rust processing logic.
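The shape of the resulting query can be sketched as follows. This is not the PR's actual template: `render_idle_query` is a hypothetical helper, the `== 0` comparison and the `on (...)` matching labels are assumptions, and `DCGM_FI_DEV_POWER_USAGE` is assumed to be the power metric (it is a standard dcgm-exporter gauge in watts, but the PR text does not name it).

```rust
/// Renders an illustrative idle-detection PromQL query (hypothetical helper).
/// `window` is a PromQL duration such as "30m"; `power_threshold_watts`
/// mirrors the optional --power-threshold flag (None = disabled).
fn render_idle_query(
    window: &str,
    pod_label: &str,
    ns_label: &str,
    power_threshold_watts: Option<f64>,
) -> String {
    // Peak profiling metric, falling back to GPU_UTIL scaled from 0-100 to 0-1.
    let mut query = format!(
        "(\n  max_over_time(DCGM_FI_PROF_GR_ENGINE_ACTIVE[{window}])\n  or\n  max_over_time(DCGM_FI_DEV_GPU_UTIL[{window}]) / 100\n) == 0"
    );
    if let Some(watts) = power_threshold_watts {
        // `unless` drops idle candidates whose peak power exceeds the threshold.
        // If the power metric is missing, the right side is empty and nothing
        // is dropped, so detection is not blocked.
        query = format!(
            "({query})\nunless on ({pod_label}, {ns_label}) (\n  max_over_time(DCGM_FI_DEV_POWER_USAGE[{window}]) > {watts}\n)"
        );
    }
    query
}
```

Calling `render_idle_query("30m", "exported_pod", "exported_namespace", None)` yields only the utilization part; passing `Some(150.0)` appends the `unless` clause.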
Test plan
- `cargo clippy` clean