CA: Automated performance diagnostic collection #9530

@x13n

Description

Which component are you using?

/area cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.

Debugging transient latency spikes in the Cluster Autoscaler (CA) main loop remains a significant challenge. As previously discussed in #9351, the existing "debugging snapshot" is a manual, reactive mechanism that is difficult to time correctly and is due for deprecation in favor of more robust diagnostics.

The core problem is that in large-scale fleets, the "smoking gun" for a stall (e.g., API contention, scheduler latency, or cloud provider throttling) is often gone by the time an alert fires or a human operator can trigger a manual profile. We need a way for CA to detect its own "unhealthy" iterations and capture high-fidelity data exactly when the problem is occurring.

Describe the solution you'd like.

I propose adding an Automated Performance Diagnostic Collection system. This watchdog-style feature will monitor the duration of each RunOnce iteration and automatically trigger a collection event when thresholds are exceeded. Rough sketches of the sink types and the watchdog wiring follow the feature list below.

Key Features:

  • Threshold-Based Triggers:
    • --diagnostic-max-loop-time: The time an iteration can run before profiling starts (e.g., 2m).
    • --diagnostic-collection-time: Max duration for active collection (e.g., 30s).
    • --diagnostic-cooldown: Minimum wait between triggers to limit resource overhead (e.g., 15m).
  • DiagnosticReport & Pluggable Sinks:
    • Uses a DiagnosticReport struct (a map of profile types to byte buffers) to allow for easy extension (CPU, Trace, Heap, etc.).
    • Introduces a DiagnosticSink interface. The OSS version will include a FileSink for local storage, while downstream forks can implement cloud-native storage (e.g., GCS or S3).
  • Dashcam Trace Capture:
    • Keeps a rolling, in-memory execution trace (in the spirit of a car dashcam) so a triggered report can include the moments leading up to a stall, not just what happens after detection.
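
As a concrete illustration, here is a minimal sketch of the DiagnosticReport and DiagnosticSink pieces described above. The type and interface names come from this proposal, but the field layout, method signatures, and file naming are illustrative assumptions rather than a settled API:

```go
package diagnostics

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// ProfileType identifies one kind of collected artifact (CPU, heap, trace, ...).
type ProfileType string

const (
	ProfileCPU   ProfileType = "cpu"
	ProfileHeap  ProfileType = "heap"
	ProfileTrace ProfileType = "trace"
)

// DiagnosticReport maps each collected profile type to its raw bytes.
type DiagnosticReport struct {
	Timestamp time.Time
	Profiles  map[ProfileType][]byte
}

// DiagnosticSink persists a report. Downstream forks can implement
// cloud-native sinks (e.g., GCS or S3) behind the same interface.
type DiagnosticSink interface {
	Store(report *DiagnosticReport) error
}

// FileSink is the OSS default: one file per profile type under a local directory.
type FileSink struct {
	Dir string
}

func (s *FileSink) Store(report *DiagnosticReport) error {
	stamp := report.Timestamp.Format("20060102-150405")
	for typ, data := range report.Profiles {
		path := filepath.Join(s.Dir, fmt.Sprintf("%s-%s.pprof", stamp, typ))
		if err := os.WriteFile(path, data, 0o644); err != nil {
			return fmt.Errorf("writing %s profile: %w", typ, err)
		}
	}
	return nil
}
```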

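Building on those types, a rough sketch of how the watchdog could wrap a RunOnce iteration, using the three flags above and the standard runtime/pprof CPU profiler. The Watchdog struct, its method names, and the goroutine wiring are all hypothetical, for illustration only:

```go
package diagnostics

import (
	"bytes"
	"runtime/pprof"
	"time"
)

// Watchdog arms a timer around each main-loop iteration and triggers a
// diagnostic collection when the iteration overruns MaxLoopTime.
type Watchdog struct {
	MaxLoopTime    time.Duration // --diagnostic-max-loop-time, e.g. 2m
	CollectionTime time.Duration // --diagnostic-collection-time, e.g. 30s
	Cooldown       time.Duration // --diagnostic-cooldown, e.g. 15m
	Sink           DiagnosticSink
	lastTrigger    time.Time
}

// Watch runs fn (a single RunOnce iteration) and fires a collection if fn
// is still executing once MaxLoopTime has elapsed.
func (w *Watchdog) Watch(fn func()) {
	done := make(chan struct{})
	go func() {
		fn()
		close(done)
	}()
	select {
	case <-done:
		// Healthy iteration; nothing to collect.
	case <-time.After(w.MaxLoopTime):
		w.maybeCollect()
		<-done // let the slow iteration finish
	}
}

func (w *Watchdog) maybeCollect() {
	if time.Since(w.lastTrigger) < w.Cooldown {
		return // still cooling down from the previous trigger
	}
	w.lastTrigger = time.Now()

	var cpu bytes.Buffer
	if err := pprof.StartCPUProfile(&cpu); err != nil {
		return // another CPU profile is already in progress
	}
	time.Sleep(w.CollectionTime) // sample while the stall is still happening
	pprof.StopCPUProfile()

	_ = w.Sink.Store(&DiagnosticReport{
		Timestamp: time.Now(),
		Profiles:  map[ProfileType][]byte{ProfileCPU: cpu.Bytes()},
	})
}
```

Since the CA main loop calls RunOnce sequentially, the sketch assumes Watch is invoked once per iteration from the main goroutine, so no extra locking is shown.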
Describe any alternative solutions you've considered.

  • Continuous Profiling: Not really feasible for monitoring a heterogeneous fleet of clusters: we only need data from, say, 1% of "interesting" (stalled) clusters, not the 99% that are healthy.
  • External Watchdogs: Hard to implement accurately without "inside" knowledge of when a RunOnce iteration specifically begins and ends.

Additional context.

I think this will play nicely with #9351 - we can get detailed information from logs when debugging an ongoing issue, while also being able to debug performance after the fact.

Metadata

    Labels

    area/cluster-autoscaler · kind/feature · triage/accepted
