CA: Automated performance diagnostic collection #9530

@x13n

Description

Which component are you using?

/area cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.

Debugging transient latency spikes in the Cluster Autoscaler (CA) main loop remains a significant challenge. As previously discussed in #9351, the existing "debugging snapshot" is a manual, reactive mechanism that is difficult to time correctly and is due for deprecation in favor of more robust diagnostics.

The core problem is that in large-scale fleets, the "smoking gun" for a stall (e.g., API contention, scheduler latency, or cloud provider throttling) is often gone by the time an alert fires or a human operator can trigger a manual profile. We need a way for CA to detect its own "unhealthy" iterations and capture high-fidelity data exactly when the problem is occurring.

Describe the solution you'd like.

I propose adding an Automated Performance Diagnostic Collection system. This watchdog-style feature will monitor the duration of each RunOnce iteration and automatically trigger a collection event when thresholds are exceeded. Rough sketches of the sink types and the watchdog wiring follow the feature list below.

Key Features:

  • Threshold-Based Triggers:
    • --diagnostic-max-loop-time: The time an iteration can run before profiling starts (e.g., 2m).
    • --diagnostic-collection-time: Max duration for active collection (e.g., 30s).
    • --diagnostic-cooldown: Minimum wait between triggers to limit resource overhead (e.g., 15m).
  • DiagnosticReport & Pluggable Sinks:
    • Uses a DiagnosticReport struct (a map of profile types to byte buffers) to allow for easy extension (CPU, Trace, Heap, etc.).
    • Introduces a DiagnosticSink interface. The OSS version will include a FileSink for local storage, while downstream forks can implement cloud-native storage (e.g., GCS or S3).
  • Dashcam Trace Capture:
    • Keeps a rolling, in-memory execution trace (in the spirit of a car dashcam) so a triggered report can include the moments leading up to a stall, not just what happens after detection.
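
As a concrete illustration, here is a minimal sketch of the DiagnosticReport and DiagnosticSink pieces described above. The type and interface names come from this proposal, but the field layout, method signatures, and file naming are illustrative assumptions rather than a settled API:

```go
package diagnostics

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// ProfileType identifies one kind of collected artifact (CPU, heap, trace, ...).
type ProfileType string

const (
	ProfileCPU   ProfileType = "cpu"
	ProfileHeap  ProfileType = "heap"
	ProfileTrace ProfileType = "trace"
)

// DiagnosticReport maps each collected profile type to its raw bytes.
type DiagnosticReport struct {
	Timestamp time.Time
	Profiles  map[ProfileType][]byte
}

// DiagnosticSink persists a report. Downstream forks can implement
// cloud-native sinks (e.g., GCS or S3) behind the same interface.
type DiagnosticSink interface {
	Store(report *DiagnosticReport) error
}

// FileSink is the OSS default: one file per profile type under a local directory.
type FileSink struct {
	Dir string
}

func (s *FileSink) Store(report *DiagnosticReport) error {
	stamp := report.Timestamp.Format("20060102-150405")
	for typ, data := range report.Profiles {
		path := filepath.Join(s.Dir, fmt.Sprintf("%s-%s.pprof", stamp, typ))
		if err := os.WriteFile(path, data, 0o644); err != nil {
			return fmt.Errorf("writing %s profile: %w", typ, err)
		}
	}
	return nil
}
```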

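Building on those types, a rough sketch of how the watchdog could wrap a RunOnce iteration, using the three flags above and the standard runtime/pprof CPU profiler. The Watchdog struct, its method names, and the goroutine wiring are all hypothetical, for illustration only:

```go
package diagnostics

import (
	"bytes"
	"runtime/pprof"
	"time"
)

// Watchdog arms a timer around each main-loop iteration and triggers a
// diagnostic collection when the iteration overruns MaxLoopTime.
type Watchdog struct {
	MaxLoopTime    time.Duration // --diagnostic-max-loop-time, e.g. 2m
	CollectionTime time.Duration // --diagnostic-collection-time, e.g. 30s
	Cooldown       time.Duration // --diagnostic-cooldown, e.g. 15m
	Sink           DiagnosticSink
	lastTrigger    time.Time
}

// Watch runs fn (a single RunOnce iteration) and fires a collection if fn
// is still executing once MaxLoopTime has elapsed.
func (w *Watchdog) Watch(fn func()) {
	done := make(chan struct{})
	go func() {
		fn()
		close(done)
	}()
	select {
	case <-done:
		// Healthy iteration; nothing to collect.
	case <-time.After(w.MaxLoopTime):
		w.maybeCollect()
		<-done // let the slow iteration finish
	}
}

func (w *Watchdog) maybeCollect() {
	if time.Since(w.lastTrigger) < w.Cooldown {
		return // still cooling down from the previous trigger
	}
	w.lastTrigger = time.Now()

	var cpu bytes.Buffer
	if err := pprof.StartCPUProfile(&cpu); err != nil {
		return // another CPU profile is already in progress
	}
	time.Sleep(w.CollectionTime) // sample while the stall is still happening
	pprof.StopCPUProfile()

	_ = w.Sink.Store(&DiagnosticReport{
		Timestamp: time.Now(),
		Profiles:  map[ProfileType][]byte{ProfileCPU: cpu.Bytes()},
	})
}
```

Since the CA main loop calls RunOnce sequentially, the sketch assumes Watch is invoked once per iteration from the main goroutine, so no extra locking is shown.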
Describe any alternative solutions you've considered.

  • Continuous Profiling: Not really feasible for monitoring a heterogeneous fleet of clusters: we only need data from, say, 1% of "interesting" (stalled) clusters, not the 99% that are healthy.
  • External Watchdogs: Hard to implement accurately without "inside" knowledge of when a RunOnce iteration specifically begins and ends.

Additional context.

I think this will play nicely with #9351 - we can get detailed information from logs when debugging an ongoing issue, while also being able to debug performance after the fact.

Metadata

    Labels

    area/cluster-autoscaler · kind/feature · triage/accepted
