Which component are you using?
/area cluster-autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.
Debugging transient latency spikes in the Cluster Autoscaler (CA) main loop remains a significant challenge. As previously discussed in #9351, the existing "debugging snapshot" is a manual, reactive mechanism that is difficult to time correctly and is due for deprecation in favor of more robust diagnostics.
The core problem is that in large-scale fleets, the "smoking gun" for a stall (e.g., API contention, scheduler latency, or cloud provider throttles) is often gone by the time a human operator can trigger a manual profile or an alert fires. We need a way for CA to detect its own "unhealthy" iterations and capture high-fidelity data exactly when the problem is occurring.
Describe the solution you'd like.
I propose adding an Automated Performance Diagnostic Collection system. This watchdog-style feature will monitor the duration of the RunOnce iteration and automatically trigger a collection event if thresholds are exceeded.
Key Features:
- Threshold-Based Triggers:
  - --diagnostic-max-loop-time: The time an iteration can run before profiling starts (e.g., 2m).
  - --diagnostic-collection-time: Max duration for active collection (e.g., 30s).
  - --diagnostic-cooldown: Minimum wait between triggers to protect resource usage (e.g., 15m).
- DiagnosticReport & Pluggable Sinks:
  - Uses a DiagnosticReport struct (a map of profile types to byte buffers) to allow for easy extension (CPU, Trace, Heap, etc.).
  - Introduces a DiagnosticSink interface. The OSS version will include a FileSink for local storage, while downstream forks can implement cloud-native storage (e.g., GCS or S3).
- Dashcam Trace Capture: Uses FlightRecorder (https://go.dev/blog/flight-recorder) to snapshot the execution trace preceding the trigger, ensuring we capture the lead-up to the stall.
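To make the shape of the proposal concrete, here is a rough Go sketch of how these pieces could hang together. This is illustrative only: the names (Watchdog, Wrap, FileSink, etc.) and the wiring are assumptions for discussion, not existing CA code; only the standard-library calls (time.AfterFunc, runtime/pprof) are real APIs. A real implementation would also capture an execution trace via the flight recorder and synchronize access to the cooldown state.

```go
// Illustrative sketch only: DiagnosticReport, DiagnosticSink, FileSink and
// Watchdog are hypothetical names for this proposal, not existing CA code.
package diagnostics

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
	"time"
)

// ProfileType identifies one artifact in a report (CPU, Trace, Heap, ...).
type ProfileType string

const (
	ProfileCPU   ProfileType = "cpu"
	ProfileTrace ProfileType = "trace"
	ProfileHeap  ProfileType = "heap"
)

// DiagnosticReport maps profile types to raw bytes, so new profile kinds
// can be added without changing the sink interface.
type DiagnosticReport map[ProfileType][]byte

// DiagnosticSink persists a report; forks can implement GCS/S3 variants.
type DiagnosticSink interface {
	Store(report DiagnosticReport, triggeredAt time.Time) error
}

// FileSink writes each profile in the report to a local directory.
type FileSink struct{ Dir string }

func (s FileSink) Store(report DiagnosticReport, triggeredAt time.Time) error {
	for kind, data := range report {
		name := fmt.Sprintf("%s-%s.pb", triggeredAt.Format("20060102-150405"), kind)
		if err := os.WriteFile(filepath.Join(s.Dir, name), data, 0o644); err != nil {
			return err
		}
	}
	return nil
}

// Watchdog arms a timer around each RunOnce call and collects diagnostics
// when the loop overruns MaxLoopTime, respecting the cooldown.
type Watchdog struct {
	MaxLoopTime    time.Duration // --diagnostic-max-loop-time, e.g. 2m
	CollectionTime time.Duration // --diagnostic-collection-time, e.g. 30s
	Cooldown       time.Duration // --diagnostic-cooldown, e.g. 15m
	Sink           DiagnosticSink

	lastTrigger time.Time // a real implementation needs synchronization here
}

// Wrap runs one iteration; if it is still running when MaxLoopTime elapses,
// collection fires in the background while the stalled iteration continues.
func (w *Watchdog) Wrap(runOnce func() error) error {
	timer := time.AfterFunc(w.MaxLoopTime, func() {
		if time.Since(w.lastTrigger) < w.Cooldown {
			return // still cooling down from a previous trigger
		}
		w.lastTrigger = time.Now()
		w.collect()
	})
	defer timer.Stop() // healthy iterations never trigger collection
	return runOnce()
}

// collect grabs a CPU profile for CollectionTime and ships it to the sink;
// trace capture (e.g. via the flight recorder) would be added the same way.
func (w *Watchdog) collect() {
	var cpu bytes.Buffer
	if err := pprof.StartCPUProfile(&cpu); err != nil {
		return // another profiler is already running
	}
	time.Sleep(w.CollectionTime)
	pprof.StopCPUProfile()

	_ = w.Sink.Store(DiagnosticReport{ProfileCPU: cpu.Bytes()}, w.lastTrigger)
}
```

The main loop would simply call something like w.Wrap around its existing RunOnce invocation, and the DiagnosticSink split keeps the OSS FileSink trivial while letting downstream forks swap in object-storage backends without touching the watchdog or the loop itself.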
Describe any alternative solutions you've considered.
- Continuous Profiling: Not really feasible for monitoring a heterogeneous fleet of clusters: we only need data from, say, 1% of "interesting" (stalled) clusters, not the 99% that are healthy.
- External Watchdogs: Hard to implement accurately without "inside" knowledge of when a RunOnce iteration specifically begins and ends.
Additional context.
I think this will play nicely together with #9351 - we can get detailed information from logs when debugging an ongoing issue, while also being able to debug performance after the fact.