feat(autoscaler): expose scale-down tuning — ignore-daemonsets-utilization and scale-down-utilization-threshold

> 🤖 Filed via Claude Code from a prod root-cause session on devantler-tech/platform (maintainer-requested investigation).

## Problem / motivation

On the `devantler-tech/platform` prod cluster (KSail-managed Talos × Hetzner, `autoscaler.node` enabled), four `autoscale-cx33` nodes created during a demand burst on 2026-06-30 could **never be scaled back down**, even when holding no meaningful movable workload.

Root cause: the Cluster Autoscaler's scale-down eligibility uses a **node utilization threshold (default 0.5) computed over *requests*, counting DaemonSet pods**. This cluster runs a substantial per-node agent stack (Cilium, coroot-node-agent, kubescape node-agent, longhorn-manager, Tetragon, SPIRE agent, CSI plugins, …) whose memory requests summed to ~3.5Gi, **≈50% of a cx33's allocatable memory on their own**. An *empty* cx33 therefore already sat at the threshold, and any single movable pod made it permanently `unremovable: memory requested … above the scale-down utilization threshold` (confirmed in CA logs). As a result, every burst mints a permanent node: a one-way ratchet.

Upstream cluster-autoscaler has two flags for exactly this topology, but KSail's `NodeAutoscalerConfig` (`pkg/apis/cluster/v1alpha1/autoscaler.go`) exposes only `expander`, `maxNodesTotal`, `scaleDownUnneededTime`, `capacityBuffers`, and `pools`:

- `--ignore-daemonsets-utilization` (bool, default false) — exclude DaemonSet requests from the utilization calculation. The canonical fix for heavy per-node agent stacks; with it, an empty node computes near 0% and scales down regardless of agent weight.
- `--scale-down-utilization-threshold` (float, default 0.5) — the eligibility gate itself, for finer tuning where needed.
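
To make the arithmetic concrete, here is a minimal Go sketch of the eligibility gate. It is a simplified model and not the upstream implementation: it considers memory only, the `pod` type and both function names are hypothetical, the 7168Mi allocatable figure is an assumed round number chosen so that the ~3.5Gi agent stack is exactly 50%, and it assumes that ignored DaemonSet requests are also subtracted from allocatable.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for a scheduled pod:
// only its memory request (in Mi) and whether a DaemonSet owns it.
type pod struct {
	name         string
	memRequestMi int64
	daemonSet    bool
}

// memoryUtilization models the request-based utilization of one node.
// When ignoreDaemonSets is true, DaemonSet requests are left out of the
// numerator and (an assumption of this sketch) subtracted from allocatable.
func memoryUtilization(pods []pod, allocatableMi int64, ignoreDaemonSets bool) float64 {
	var requested, daemonSetRequested int64
	for _, p := range pods {
		if ignoreDaemonSets && p.daemonSet {
			daemonSetRequested += p.memRequestMi
			continue
		}
		requested += p.memRequestMi
	}
	capacity := allocatableMi - daemonSetRequested
	if capacity <= 0 {
		return 1 // treat a node fully consumed by agents as saturated
	}
	return float64(requested) / float64(capacity)
}

// scaleDownCandidate applies the gate: utilization strictly below threshold.
func scaleDownCandidate(utilization, threshold float64) bool {
	return utilization < threshold
}

func main() {
	const allocatableMi = 7168 // assumed round figure, not a measured cx33 value
	pods := []pod{
		{name: "node-agents", memRequestMi: 3584, daemonSet: true}, // ~3.5Gi agent stack
		{name: "movable-app", memRequestMi: 512},                   // hypothetical workload
	}
	for _, ignore := range []bool{false, true} {
		u := memoryUtilization(pods, allocatableMi, ignore)
		fmt.Printf("ignoreDaemonSets=%t utilization=%.2f candidate=%t\n",
			ignore, u, scaleDownCandidate(u, 0.5))
	}
}
```

With these illustrative numbers the node is blocked from scale-down when DaemonSet requests are counted and becomes a candidate once they are ignored, which is the behaviour the first flag is meant to unlock.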

## Proposal

Add to `NodeAutoscalerConfig` (mirroring `scaleDownUnneededTime`'s pass-through style):

```yaml
autoscaler:
  node:
    ignoreDaemonsetsUtilization: true     # → --ignore-daemonsets-utilization
    scaleDownUtilizationThreshold: "0.5"  # → --scale-down-utilization-threshold
```

Both map 1:1 onto upstream cluster-autoscaler flags supported by the Hetzner provider build KSail already installs.
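
For illustration only, the pass-through could look roughly like the Go sketch below. The field names and types, the `scaleDownArgs` helper, and the `[0, 1]` range check are all assumptions; the real `NodeAutoscalerConfig` has more fields, and KSail may hand these values to the autoscaler through a different mechanism than raw arguments.

```go
package main

import (
	"fmt"
	"strconv"
)

// NodeAutoscalerConfig is a hypothetical excerpt showing only the two
// proposed fields; names, types, and tags are assumptions.
type NodeAutoscalerConfig struct {
	IgnoreDaemonsetsUtilization   *bool  `json:"ignoreDaemonsetsUtilization,omitempty"`
	ScaleDownUtilizationThreshold string `json:"scaleDownUtilizationThreshold,omitempty"`
}

// scaleDownArgs renders the fields that are set as cluster-autoscaler flags.
// Unset fields emit nothing, so the upstream defaults keep applying.
func scaleDownArgs(cfg NodeAutoscalerConfig) ([]string, error) {
	var args []string
	if cfg.IgnoreDaemonsetsUtilization != nil {
		args = append(args, "--ignore-daemonsets-utilization="+
			strconv.FormatBool(*cfg.IgnoreDaemonsetsUtilization))
	}
	if cfg.ScaleDownUtilizationThreshold != "" {
		v, err := strconv.ParseFloat(cfg.ScaleDownUtilizationThreshold, 64)
		if err != nil {
			return nil, fmt.Errorf("scaleDownUtilizationThreshold: %w", err)
		}
		// Assumed sanity check; written this way so NaN is rejected too.
		if !(v >= 0 && v <= 1) {
			return nil, fmt.Errorf("scaleDownUtilizationThreshold %q is outside [0, 1]",
				cfg.ScaleDownUtilizationThreshold)
		}
		args = append(args, "--scale-down-utilization-threshold="+
			cfg.ScaleDownUtilizationThreshold)
	}
	return args, nil
}

func main() {
	ignore := true
	args, err := scaleDownArgs(NodeAutoscalerConfig{
		IgnoreDaemonsetsUtilization:   &ignore,
		ScaleDownUtilizationThreshold: "0.5",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(args)
}
```

Using a pointer for the boolean keeps "not set" distinct from an explicit `false`, so existing clusters that omit the field would see no change in the arguments passed.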

## Workaround in the platform repo (partial)

devantler-tech/platform#2370 attacks the demand side (stops VPA from inflating DaemonSet memory requests to the busiest node's peak × node count), dropping the per-node stack to ~31% of a cx33. That un-wedges scale-down for near-empty nodes, but the structural fix for agent-heavy clusters is `ignore-daemonsets-utilization`, which only KSail can pass to the autoscaler it manages.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

feat(autoscaler): expose scale-down tuning — ignore-daemonsets-utilization and scale-down-utilization-threshold #5673

Problem / motivation

Proposal

Workaround in the platform repo (partial)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Uh oh!

feat(autoscaler): expose scale-down tuning — ignore-daemonsets-utilization and scale-down-utilization-threshold #5673

Description

Problem / motivation

Proposal

Workaround in the platform repo (partial)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions