Skip to content

feat(autoscaler): expose scale-down tuning β€” ignore-daemonsets-utilization and scale-down-utilization-thresholdΒ #5673

Description

@devantler

πŸ€– Filed via Claude Code from a prod root-cause session on devantler-tech/platform (maintainer-requested investigation).

Problem / motivation

On the devantler-tech/platform prod cluster (KSail-managed Talos Γ— Hetzner, autoscaler.node enabled), four autoscale-cx33 nodes created during a demand burst on 2026-06-30 could never be scaled back down, even when holding no meaningful movable workload.

Root cause: the Cluster Autoscaler's scale-down eligibility uses a node utilization threshold (default 0.5) computed over requests, counting DaemonSet pods. This cluster runs a substantial per-node agent stack (Cilium, coroot-node-agent, kubescape node-agent, longhorn-manager, Tetragon, SPIRE agent, CSI plugins, …) whose requests summed to ~3.5Gi β€” β‰ˆ50% of a cx33's allocatable on their own. An empty cx33 therefore already sat at the threshold, and any single movable pod made it permanently unremovable: memory requested … above the scale-down utilization threshold (confirmed in CA logs). Every burst mints a permanent node β€” a one-way ratchet.

Upstream cluster-autoscaler has two flags for exactly this topology, but KSail's NodeAutoscalerConfig (pkg/apis/cluster/v1alpha1/autoscaler.go) exposes only expander, maxNodesTotal, scaleDownUnneededTime, capacityBuffers, and pools:

  • --ignore-daemonsets-utilization (bool, default false) β€” exclude DaemonSet requests from the utilization calculation. The canonical fix for heavy per-node agent stacks; with it, an empty node computes near 0% and scales down regardless of agent weight.
  • --scale-down-utilization-threshold (float, default 0.5) β€” the eligibility gate itself, for finer tuning where needed.

Proposal

Add to NodeAutoscalerConfig (mirroring scaleDownUnneededTime's pass-through style):

autoscaler:
  node:
    ignoreDaemonsetsUtilization: true     # β†’ --ignore-daemonsets-utilization
    scaleDownUtilizationThreshold: "0.5"  # β†’ --scale-down-utilization-threshold

Both map 1:1 onto upstream cluster-autoscaler flags supported by the Hetzner provider build KSail already installs.

Workaround in the platform repo (partial)

devantler-tech/platform#2370 attacks the demand side (stops VPA from inflating DaemonSet memory requests to the busiest node's peak Γ— node count), dropping the per-node stack to ~31% of a cx33. That un-wedges scale-down for near-empty nodes, but the structural fix for agent-heavy clusters is ignore-daemonsets-utilization, which only KSail can pass to the autoscaler it manages.

πŸ€– Generated with Claude Code

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    Status
    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions