design: add doc for cluster autoscaling and background reconfiguration#36691

Draft
aljoscha wants to merge 17 commits into MaterializeInc:main from aljoscha:design-cluster-autoscaling

@aljoscha aljoscha commented May 22, 2026

Rendered

Resolves SQL-315


Two user-facing capabilities motivate this work:

1. **Background graceful cluster reconfiguration.** Today, `ALTER CLUSTER ... SET (SIZE = ...)` with the graceful (zero-downtime) strategy requires the SQL session to remain open for the duration of the reconfiguration — the session holds the wait-for-hydration stage. Long-running reconfigurations are fragile: any process or session interruption — a network blip, a client timeout, an SQL tool closing, an `environmentd` restart — aborts the reconfiguration. The user experience we want is: the statement returns immediately, and the reconfiguration continues in the background, surviving restarts and disconnects.
Contributor

yaay!


## Out of Scope

- `HYDRATION_SIZE` combined with `SCHEDULE = ('on-refresh', ...)`. Initial version rejects; see [Open Questions](#open-questions).
Contributor

agreed

## Out of Scope

- `HYDRATION_SIZE` combined with `SCHEDULE = ('on-refresh', ...)`. Initial version rejects; see [Open Questions](#open-questions).
- More than one concurrent burst replica per cluster. Initial version supports exactly one; revisit if needed.
Contributor

also agreed

- A new autoscaling strategy can be added without restructuring the framework.
- Operators can disable the burst behavior across an environment via a break-glass flag without disabling other autoscaling.

## Out of Scope
Contributor

I agree with the calls here


1. **Stuck-reconfiguration recovery policy.** When a reconfiguration's pending replicas have not hydrated within the system timeout, do we (a) park the reconfiguration indefinitely with a clear signal in the introspection view for an operator to act, (b) auto-cancel and revert to the prior steady state, or (c) make the policy a dyncfg with one of (a)/(b) as the default? Same question for stuck burst replicas.

2. **`HYDRATION_SIZE` + `SCHEDULE = ('on-refresh', ...)` combination.** v1 rejects this combination. Semantically the combination is interesting (every refresh window, burst comes up first to accelerate hydration, steady catches up, then the schedule turns the cluster off), but our strong recommendation is to keep this rejected indefinitely: there is currently no appetite to invest further in the `SCHEDULE` syntax, and supporting the combination would expand its surface area.
Contributor

agreed


2. **`HYDRATION_SIZE` + `SCHEDULE = ('on-refresh', ...)` combination.** v1 rejects this combination. Semantically the combination is interesting (every refresh window, burst comes up first to accelerate hydration, steady catches up, then the schedule turns the cluster off), but our strong recommendation is to keep this rejected indefinitely: there is currently no appetite to invest further in the `SCHEDULE` syntax, and supporting the combination would expand its surface area.

3. **Multiple burst replicas (one per steady replica vs. one total).** v1 supports exactly one burst replica per cluster regardless of replication factor. Our strong recommendation is to keep it that way: the burst replica is by design transient, and provisioning N burst replicas to mirror an N-replica steady set would multiply cost for diminishing benefit — burst tear-down only requires one steady replica to have caught up. Revisit only if real-world usage proves the single-burst model insufficient.
Contributor

agreed


7. **Foreground/synchronous graceful reconfiguration retention.** Our strong recommendation is to deprecate the current foreground (session-bound) mechanism in favor of the background model. During rollout, the foreground experience is preserved as a thin session-side wait shim over the background mechanism (see [SQL surface](#sql-surface)) — *not* by retaining the existing parallel state machine. This means the `pending: bool` flag on replicas and the associated three-stage machinery can be removed up front; deprecating the foreground experience later is simply deleting the wait shim. The one behavioral difference vs. today is that session disconnect during the wait no longer aborts the reconfiguration (arguably a feature; the durable target stays set and the controller continues).

8. **Hydration burst during graceful reconfiguration.** Should burst kick in while a graceful reconfig is in flight (target size differs from current replicas)? Leaning toward no: the new-size replicas are themselves transient hydration capacity, and stacking burst on top risks confusing billing and behavior. Burst resumes once the reconfig settles.
Contributor

Nope. If a user types `ALTER CLUSTER SET SIZE 200cc`, that shouldn't trigger a burst. It should trigger a 200cc replica. Once the 200cc replica is hydrated, retire the original replica.

Contributor Author

So the nope is a confirmation of my "leaning towards no", yes? 😅

Contributor

Correct :)


- **Burst and reconfiguration-transient replicas appear in billing and metering identically to ordinary replicas.** A user with `HYDRATION_SIZE` set sees additional billing during hydration windows; a user issuing a background `ALTER CLUSTER` sees additional billing during the overlap between the old and new replica sets.
- **Background `ALTER CLUSTER` returns immediately** after writing the new target to the catalog. The actual replica transition happens asynchronously and is observable via the new introspection view. This matches the existing pattern for other async DDL (e.g., `CREATE INDEX` returns once the catalog entry exists; hydration happens afterwards).
- **`SHOW CLUSTERS` reports the new (target) size immediately on ALTER**, not the old size. Mid-reconfiguration the durable cluster configuration already reflects the user's intent, so `SHOW CLUSTERS` does too. This is a change from today's behavior, where the old size is reported until the graceful reconfiguration finalizes.
Contributor

Hmm. What I would really like is for SHOW CLUSTERS to tell me whether a reconfiguration is in flight or not, tell me what the current size is, and what the target size is

Contributor Author

Yeah, I can see why you would like that. I think we can do something good there!


The following behaviors fall out of the design rather than being its headline outcomes. They are user-observable and worth flagging in release notes and user-facing documentation.

- **Burst and reconfiguration-transient replicas appear in billing and metering identically to ordinary replicas.** A user with `HYDRATION_SIZE` set sees additional billing during hydration windows; a user issuing a background `ALTER CLUSTER` sees additional billing during the overlap between the old and new replica sets.
Contributor

@maheshwarip maheshwarip May 22, 2026

Hmm this makes sense. But is this new behavior? I assumed that this was how it always worked!

Contributor

Maybe I was wrong!

Contributor Author

Nah, it is how it worked, but put it in there because the bursting is new. For graceful reconfig it was always like this

Contributor

Gotcha gotcha. Ok, no concerns!


- **Current state — introspection view.** A new builtin view (working name: `mz_internal.mz_cluster_reconfigurations`, naming TBD) reports, per cluster: desired vs. actual size and replication factor, presence of an in-flight reconfiguration, currently-running burst replicas, per-replica hydration progress, and the strategies' current decisions and reasons. Pull-only: users `SELECT` to learn status. No push notifications in v1.

- **`SHOW CLUSTERS` surfaces in-flight reconfigurations.** Today's `SHOW CLUSTERS` is implemented as SQL over `mz_clusters` and related views; we extend that SQL to join with the new introspection view so a user sees, per cluster, both the target size (from the durable cluster config) and the current size (derived from the actual replica set), plus an indication of whether a reconfiguration is in flight. The exact column set is left to implementation, but the design requires that a single `SHOW CLUSTERS` answers the questions "what did I ask for", "what's actually there now", and "is something in progress" without requiring the user to know about the introspection view.
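The three questions in the `SHOW CLUSTERS` bullet can be illustrated with a small Rust sketch. Everything here is hypothetical: `ClusterStatusRow`, `status_row`, and the rule "a reconfiguration is in flight when any replica is not at the target size" are assumptions for illustration, and the rule ignores burst replicas. The real column set is left to implementation.

```rust
/// Hypothetical row answering: what did I ask for, what is there now,
/// and is something in progress.
#[derive(Debug, PartialEq)]
pub struct ClusterStatusRow {
    pub target_size: String,
    pub current_sizes: Vec<String>,
    pub reconfiguration_in_flight: bool,
}

/// Derives the row from the durable target size and the sizes of the
/// replicas that currently exist.
pub fn status_row(target_size: &str, replica_sizes: &[&str]) -> ClusterStatusRow {
    // Collapse the replica set into a sorted list of distinct sizes.
    let mut current_sizes: Vec<String> = replica_sizes.iter().map(|s| s.to_string()).collect();
    current_sizes.sort();
    current_sizes.dedup();
    // Assumption: any replica whose size differs from the target means
    // a reconfiguration has not finished yet.
    let reconfiguration_in_flight = current_sizes.iter().any(|s| s.as_str() != target_size);
    ClusterStatusRow {
        target_size: target_size.to_string(),
        current_sizes,
        reconfiguration_in_flight,
    }
}
```

In this sketch a cluster altered from `100cc` to `200cc` reports both sizes until the old replica is retired, after which only the target size remains.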
Contributor Author

@maheshwarip what do you think?

Contributor

@ggevay ggevay left a comment

Releasing some comments, will continue after lunch.

);
```

When the cluster has any object that is not fully hydrated, the cluster controller spins up an additional replica at the configured `HYDRATION_SIZE` to accelerate hydration. Once any steady-state replica has fully hydrated all currently-existing objects on the cluster, the burst replica is shut down. The block's `COOL DOWN` bounds how often the controller may take consecutive scaling actions, providing built-in hysteresis.
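The burst rule described above can be sketched as a pure function. This is a minimal illustration, not the proposed implementation: `ReplicaStatus`, `BurstAction`, and `decide_burst` are hypothetical names, "the cluster has an unhydrated object" is approximated as "no steady-state replica is fully hydrated", and `COOL DOWN` is omitted.

```rust
/// Hypothetical snapshot of one replica as the controller might observe it.
#[derive(Debug, Clone, PartialEq)]
pub struct ReplicaStatus {
    pub name: String,
    /// True for the transient burst replica, false for steady-state replicas.
    pub is_burst: bool,
    /// True once the replica has hydrated every object currently on the cluster.
    pub fully_hydrated: bool,
}

/// Hypothetical outcome of evaluating the burst rule once.
#[derive(Debug, Clone, PartialEq)]
pub enum BurstAction {
    CreateBurst { size: String },
    DropBurst { name: String },
    NoOp,
}

pub fn decide_burst(hydration_size: Option<&str>, replicas: &[ReplicaStatus]) -> BurstAction {
    // Without a configured HYDRATION_SIZE this sketch never acts.
    let Some(size) = hydration_size else {
        return BurstAction::NoOp;
    };
    let burst = replicas.iter().find(|r| r.is_burst);
    let steady_hydrated = replicas.iter().any(|r| !r.is_burst && r.fully_hydrated);
    match (burst, steady_hydrated) {
        // A steady-state replica has caught up: the burst replica can go.
        (Some(b), true) => BurstAction::DropBurst { name: b.name.clone() },
        // Nothing steady is hydrated yet and no burst replica exists: start one.
        (None, false) => BurstAction::CreateBurst { size: size.to_string() },
        // Burst already running, or nothing left to accelerate.
        _ => BurstAction::NoOp,
    }
}
```

Keeping the rule free of side effects means it can be re-evaluated on every tick, and the at-most-one-burst constraint follows from the `(None, false)` arm being the only one that creates a replica.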
Contributor

Is it enough to wait for hydration, or do we want something like the 0dt cutover's caught-up check? Specifically:

  • We might want to wait for frontiers to catch up (either to the wall clock, or to the hydration replica).
  • But we might not need to bother with looking for OOM loops that happen on both replicas and ignoring them, since this feature is intended for functioning clusters. That is, any OOM loop should prevent the cutover, unlike with the 0dt cutover.

- The existing `SCHEDULE = ('on-refresh', ...)` behavior is preserved, ideally as one strategy within a common autoscaling framework rather than as a parallel mechanism.
- All autoscaling decisions and replica lifecycle changes are recorded with reasons and surfaced to the user.
- A new autoscaling strategy can be added without restructuring the framework.
- Operators can disable the burst behavior across an environment via a break-glass flag without disabling other autoscaling.
Contributor

@ggevay ggevay May 26, 2026

What does this mean? I mean, if it's "across an environment", then what "other autoscaling" is there?


1. **Stuck-reconfiguration recovery policy.** When a reconfiguration's pending replicas have not hydrated within the system timeout, do we (a) park the reconfiguration indefinitely with a clear signal in the introspection view for an operator to act, (b) auto-cancel and revert to the prior steady state, or (c) make the policy a dyncfg with one of (a)/(b) as the default? Same question for stuck burst replicas.

2. **`AUTO SCALING STRATEGY` + `SCHEDULE = ('on-refresh', ...)` combination.** v1 rejects this combination. Semantically the combination is interesting (every refresh window, burst comes up first to accelerate hydration, steady catches up, then the schedule turns the cluster off), but our strong recommendation is to keep this rejected indefinitely: there is currently no appetite to invest further in the `SCHEDULE` syntax, and supporting the combination would expand its surface area.
Contributor

I agree, no need to support this combination!


- `AUTO SCALING STRATEGY` combined with `SCHEDULE = ('on-refresh', ...)`. Initial version rejects; see [Open Questions](#open-questions).
- More than one concurrent burst replica per cluster. Initial version supports exactly one; revisit if needed.
- Per-`ALTER` timeout configuration. The current session-bound `ALTER ... WAIT UNTIL READY (TIMEOUT = ...)` semantics do not carry over directly to background reconfiguration. A system-wide timeout (via dyncfg) is used instead in v1.
Contributor

@ggevay ggevay May 26, 2026

Which semantics don't carry over? Cancelling the reconfiguration by cancelling the command certainly doesn't carry over, but the TIMEOUT and the ON TIMEOUT settings could carry over, right?

Or is it better to make these global so that the user can influence an in-progress reconfiguration? That is, instead of having to give IDs to reconfigurations and make a new SQL command for resolving (cancelling or rolling forward) an in-progress reconfiguration, one can simply alter the global timeout settings.

Contributor

One more issue if we make the graceful reconfiguration settings global is that if a user has development clusters and prod clusters in the same env, then they would probably need different reconfiguration settings. I guess during development, they would often not need graceful reconfigurations, and might even actively not want them, e.g. to save costs (or just for simplicity).

SIZE = '100cc', -- steady-state replica size
AUTO SCALING STRATEGY = (
ON HYDRATION (
HYDRATION_SIZE = '3200cc' -- burst replica size while objects are not hydrated
Contributor

@ggevay ggevay May 26, 2026

nit: For consistency with other SQL syntax, we shouldn't have that underscore in HYDRATION_SIZE. Our SQL design principles also mention this.

- `AUTO SCALING STRATEGY` combined with `SCHEDULE = ('on-refresh', ...)`. Initial version rejects; see [Open Questions](#open-questions).
- More than one concurrent burst replica per cluster. Initial version supports exactly one; revisit if needed.
- Per-`ALTER` timeout configuration. The current session-bound `ALTER ... WAIT UNTIL READY (TIMEOUT = ...)` semantics do not carry over directly to background reconfiguration. A system-wide timeout (via dyncfg) is used instead in v1.
- Push notifications for reconfiguration completion. Pull-only via introspection in v1.
Contributor

But a user could SUBSCRIBE to the introspection, right?


### Summary

We introduce a **cluster controller** as a dedicated task inside `environmentd`. It is the single decision-maker for the replica set of every cluster. It operates as a reconciler: it reads desired cluster state from the durable catalog, observes the actual replica set and live status signals, computes a desired replica set per cluster by combining a set of **strategies**, and emits catalog-change commands to the Coordinator via a message channel. The Coordinator remains the sole writer of catalog state.
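The reconciler shape in this summary can be sketched as a diff between a desired and an actual replica set. This is a minimal sketch under stated assumptions: `ReplicaSpec`, `Command`, and `reconcile` are hypothetical names, and the sketch ignores transition conditions such as waiting for hydration before dropping an old replica.

```rust
use std::collections::BTreeSet;

/// Hypothetical replica identity: a name plus a size.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct ReplicaSpec {
    pub name: String,
    pub size: String,
}

/// Hypothetical catalog-change command sent to the Coordinator.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Command {
    CreateReplica { spec: ReplicaSpec, reason: String },
    DropReplica { spec: ReplicaSpec, reason: String },
}

/// Computes the commands that move `actual` toward `desired`.
/// The function only emits commands; it never writes catalog state itself.
pub fn reconcile(
    desired: &BTreeSet<ReplicaSpec>,
    actual: &BTreeSet<ReplicaSpec>,
    reason: &str,
) -> Vec<Command> {
    // Replicas that are wanted but do not exist yet.
    let creates = desired
        .difference(actual)
        .cloned()
        .map(|spec| Command::CreateReplica { spec, reason: reason.to_string() });
    // Replicas that exist but are no longer wanted.
    let drops = actual
        .difference(desired)
        .cloned()
        .map(|spec| Command::DropReplica { spec, reason: reason.to_string() });
    creates.chain(drops).collect()
}
```

When the two sets already match, the function returns no commands, which is the steady state a reconciler converges to.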
Contributor

@ggevay ggevay May 26, 2026

The term "cluster controller" seems like a natural choice, but I'm wondering if people would get confused all the time with our existing use of the word "controller". Edit: See e.g. my next comment already :D

Contributor Author

Yeah, I think the name's just too useful and I don't like the alternatives I have in the last section 🙈

The controller runs as its own task within `environmentd`. Its inputs are:

- **Durable cluster configuration**, observed via the catalog (delivered as updates or snapshotted).
- **Live status signals** from the compute and storage controllers: replica readiness, hydration state, replica failures, current frontiers. Hydration is consumed directly from the controller's in-process state.
Contributor

@ggevay ggevay May 26, 2026

In theory, both the "Durable cluster configuration" and the "Live status signals" could be obtained also through subscribing to builtin relations. That setup could be more future-proof for when we have multiple envd processes, where not all of them have controller (It has begun... I mean, the old sense of "controller") state for all clusters.

Contributor Author

adding a section about that, but would keep as future work


Its outputs are **decisions** sent as messages to the Coordinator's existing internal command channel. Each decision is a request to create, drop, or modify a replica, accompanied by a reason. The Coordinator transacts them through the catalog and orchestrator as it does today for `Message::SchedulingDecisions`.

The controller is not the catalog writer. Coordinator remains the only writer. The controller's commands must be idempotent (replayable on retry without harm), because crashes and re-elections may cause the controller to recompute and re-emit a decision that was already partially applied.
Contributor

Could you write a bit more on how to make these idempotent? For example, how will we make a replica creation idempotent? I can imagine either

  • making the command a compare-and-swap, e.g., a command could be: "if exactly these replicas exist for the cluster, then create this replica"
  • or adding some reconciliation-like logic, e.g. a command could be: "I'd like at least r number of replicas with size s for this cluster".
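The two options above can be compared with a small sketch over an in-memory replica list. Both functions are hypothetical illustrations and not part of the design doc: `create_if_unchanged` models the compare-and-swap variant, `ensure_at_least` models the reconciliation-like variant, and the generated replica names are placeholders.

```rust
/// Hypothetical in-memory stand-in for a cluster's replica set.
#[derive(Debug, Clone, PartialEq)]
pub struct Replica {
    pub name: String,
    pub size: String,
}

/// Option 1: compare-and-swap. The command applies only if the replica names
/// (in order) still match what the sender observed.
pub fn create_if_unchanged(replicas: &mut Vec<Replica>, expected: &[&str], new: Replica) -> bool {
    let current: Vec<&str> = replicas.iter().map(|r| r.name.as_str()).collect();
    if current.as_slice() != expected {
        // The replica set changed since it was observed: reject, so a replay is a no-op.
        return false;
    }
    replicas.push(new);
    true
}

/// Option 2: declarative. "At least `count` replicas of `size` should exist."
/// Returns how many replicas were created; replaying the command creates none.
pub fn ensure_at_least(replicas: &mut Vec<Replica>, size: &str, count: usize) -> usize {
    let existing = replicas.iter().filter(|r| r.size == size).count();
    let missing = count.saturating_sub(existing);
    for i in 0..missing {
        replicas.push(Replica {
            // Placeholder naming scheme, for illustration only.
            name: format!("{}-{}", size, existing + i),
            size: size.to_string(),
        });
    }
    missing
}
```

In both variants a second delivery of the same command leaves the replica set unchanged: the compare-and-swap rejects it because the precondition no longer holds, and the declarative form finds nothing missing.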

Contributor

ggevay commented May 26, 2026

(I think the "Rendered" link in the PR description is pointing to a slightly outdated version.)

@aljoscha
Contributor Author

> (I think the "Rendered" link in the PR description is pointing to a slightly outdated version.)

fixed!

- **Live status signals** from the compute and storage controllers: replica readiness, hydration state, replica failures, current frontiers. Hydration is consumed directly from the controller's in-process state.
- **A periodic tick** for time-based strategies.

Its outputs are **decisions** sent as messages to the Coordinator's existing internal command channel. Each decision is a request to create, drop, or modify a replica, accompanied by a reason. The Coordinator transacts them through the catalog and orchestrator as it does today for `Message::SchedulingDecisions`.
Contributor

> Each decision is a request to create, drop, or modify a replica, accompanied by a reason

Plus it can also happen that a decision is to modify a cluster's configuration, not just its replicas, right? For example, for ON TIMEOUT = 'ROLLBACK', it would observe if a reconfiguration replica is taking too long to hydrate and drop not just the replica, but also roll back the cluster's desired size.

The cluster's durable configuration represents the user's **desired target state**, not in-flight mechanics:

- `size`, `replication_factor`, `availability_zones`, `schedule`, `logging`, `optimizer_feature_overrides` — preserved.
- `auto_scaling_strategy: Option<AutoScalingStrategy>` — new optional field carrying the strategy block (currently only `ON HYDRATION` with its `HYDRATION_SIZE`, plus the shared `COOL DOWN`). The exact in-catalog shape is left to the implementer; the design only requires that it durably captures everything the user expressed in the SQL block.
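One possible in-catalog shape for the new field is sketched below. The doc leaves the exact shape to the implementer, so every type and field here other than `auto_scaling_strategy: Option<AutoScalingStrategy>` is an assumption, and only a subset of the preserved cluster fields is shown.

```rust
use std::time::Duration;

/// Hypothetical durable form of the `ON HYDRATION (...)` sub-block.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OnHydration {
    pub hydration_size: String,
}

/// Hypothetical durable form of the `AUTO SCALING STRATEGY` block.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AutoScalingStrategy {
    pub on_hydration: Option<OnHydration>,
    /// Shared `COOL DOWN`; `None` here stands for "use the system default".
    pub cool_down: Option<Duration>,
}

/// Subset of the durable cluster configuration (desired target state only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ClusterConfig {
    pub size: String,
    pub replication_factor: u32,
    pub auto_scaling_strategy: Option<AutoScalingStrategy>,
}

impl ClusterConfig {
    /// Returns the burst replica size, if the user configured one.
    pub fn burst_size(&self) -> Option<&str> {
        self.auto_scaling_strategy
            .as_ref()?
            .on_hydration
            .as_ref()
            .map(|h| h.hydration_size.as_str())
    }
}
```

A cluster with no strategy block carries `None` for the field, so `burst_size` returns `None` for it.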
Contributor

So, a strategy can be either explicitly recorded, like the "Hydration burst" strategy, or it can be implicit, like a graceful reconfiguration? That is, would a graceful reconfiguration be visible only from the fact that the replica set doesn't match the cluster's desired state?

(Also, does the "currently only" refer to some prototype implementation, which hasn't yet lifted the REFRESH stuff?)

Contributor

@ggevay ggevay May 26, 2026

I'm not sure it's good to make graceful reconfiguration implicit, as it would be an asymmetry among strategies. Also, it might make it complex to see at a glance whether a graceful reconfiguration is in progress. For example, some other strategies might exclude a graceful reconfiguration, but some might not:

  • REFRESH excludes it (as per the design decision above), so if there is a REFRESH strategy, then not having enough replicas does not mean that there is a graceful reconfiguration in progress.
  • I guess the "Hydration burst" strategy doesn't necessarily have to exclude the graceful reconfiguration.

Edit: Although, the latter concern could be addressed with adding good observability, e.g., with the proposed smartening of SHOW CLUSTERS. But I'd say that asymmetry argument still stands.


### Strategies

A **strategy** is a function of the form `(durable cluster config, observed status, current time) → partial intent`. Each strategy returns, for a cluster, the contribution it wants to make to the cluster's desired replica set: replicas to add, replicas to keep, replicas to drop, and the conditions that govern transitions (e.g., "drop replica R once replica R' has hydrated all current objects").
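The function form above maps naturally onto a trait. The sketch below is an assumption-laden illustration: `Strategy`, `PartialIntent`, `TargetSize`, and `combine` are hypothetical names, replicas are reduced to their sizes, the current time is passed as a `Duration` since an arbitrary epoch, and transition conditions are left out.

```rust
use std::time::Duration;

/// Hypothetical, heavily reduced inputs to a strategy.
pub struct ClusterConfig {
    pub size: String,
}
pub struct ObservedStatus {
    pub replica_sizes: Vec<String>,
}

/// Hypothetical contribution of one strategy to the desired replica set.
#[derive(Debug, Default, PartialEq)]
pub struct PartialIntent {
    pub to_add: Vec<String>,
    pub to_drop: Vec<String>,
    pub reasons: Vec<String>,
}

/// `(durable cluster config, observed status, current time) -> partial intent`.
pub trait Strategy {
    fn evaluate(&self, config: &ClusterConfig, observed: &ObservedStatus, now: Duration) -> PartialIntent;
}

/// Example strategy: ask for a replica at the target size if none exists.
pub struct TargetSize;

impl Strategy for TargetSize {
    fn evaluate(&self, config: &ClusterConfig, observed: &ObservedStatus, _now: Duration) -> PartialIntent {
        let mut intent = PartialIntent::default();
        if !observed.replica_sizes.iter().any(|s| s == &config.size) {
            intent.to_add.push(config.size.clone());
            intent.reasons.push(format!("no replica at target size {}", config.size));
        }
        intent
    }
}

/// Merges the contributions of all strategies by concatenation.
pub fn combine(
    strategies: &[&dyn Strategy],
    config: &ClusterConfig,
    observed: &ObservedStatus,
    now: Duration,
) -> PartialIntent {
    let mut merged = PartialIntent::default();
    for strategy in strategies {
        let part = strategy.evaluate(config, observed, now);
        merged.to_add.extend(part.to_add);
        merged.to_drop.extend(part.to_drop);
        merged.reasons.extend(part.reasons);
    }
    merged
}
```

Adding a strategy in this sketch means adding one `impl Strategy` and passing it to `combine`; conflict resolution between strategies is not modeled here.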
Contributor

> to the cluster's desired replica set

(And maybe also to the cluster's overall config, as noted elsewhere for the case when we are rolling back a graceful cluster reconfiguration.)

Contributor

@ggevay ggevay May 26, 2026

> is a function of the form `(durable cluster config, observed status, current time) → partial intent`

How does the implementation of hysteresis know that something happened recently? Will it look at something like mz_cluster_replica_history and try to guess what happened? Or will it look at the audit log to not have to guess what strategy fired recently? If the latter, then we should add the audit log to the input of strategies.

- A new autoscaling strategy can be added without restructuring the framework.
- Operators can disable the burst behavior across an environment via a break-glass flag without disabling other autoscaling.

## Out of Scope
Contributor

Are multi-replica (in steady state) clusters in scope? I think the current state of the design document doesn't consider them. For example, in that case a hydration burst might want to wait for multiple replicas to be hydrated (or caught up) rather than any (non-burst) replica being hydrated (or caught up).

Contributor Author

currently we just wait for one to be hydrated, but we could change that in the future, imo


4. **Cancellation syntax.** v1 uses ALTER-back. Do we want explicit `ALTER CLUSTER ... CANCEL ALTER` syntax for clarity?

5. **Strategy roadmap.** No concrete next strategies are committed. Our strong recommendation is to keep the strategy interface minimal — sufficient for graceful reconfiguration, hydration burst, and `ON REFRESH` — and resist extending it preemptively. Capability gets added when concrete needs arrive.
Contributor

@ggevay ggevay May 26, 2026

There have been requests for automated cluster pause/resume for reasons other than REFRESH MVs, e.g., automatically spinning down a cluster when nobody is using it, and spinning it up when somebody queries it. See: https://github.com/MaterializeInc/database-issues/issues/6966

And another strategy could be automatically scaling up/down based on load: https://github.com/MaterializeInc/database-issues/issues/6821


- **Break-glass dyncfg.** A dedicated flag disables the hydration-burst strategy environment-wide without disabling graceful reconfiguration or `ON REFRESH`. Operations can flip this if burst behavior misfires.

- **Feature gate.** The full feature ships behind a dyncfg for staged rollout. Both `AUTO SCALING STRATEGY` SQL acceptance and the background reconfiguration behavior are gated.
Contributor

(It might have to be in feature_flags! instead of a dyncfg, because we need enable_for_item_parsing.)


- **Cluster controller as a separate process.** Deferred. v1 places it as a task inside `environmentd`. The message-channel boundary between controller and coordinator is clean enough that extraction to a separate process later is a localized refactor, not a redesign.

## Open Questions
Contributor

@ggevay ggevay May 26, 2026

How does all this interact with 0dt upgrades when we have two non-read-only envs in the future? The catalog is shared between the old and new envs, so a strategy making a decision affects both envs. Would this mean that they'd need to look at the states of both envs before making a decision? (I think you also mentioned this in one of our 1:1s.)

Contributor Author

Yeah, this keeps being a problem, and I decided to largely punt it so we can make progress on this. I think there is a way where we either ignore the other env, or state is in the catalog and everyone needs to consider what the others are doing/whether they are ready. 🙈

### Migration

- **Add `auto_scaling_strategy: Option<AutoScalingStrategy>`** to the durable cluster configuration, durably capturing whatever the user expressed in the `AUTO SCALING STRATEGY` SQL block (for v1: optional `ON HYDRATION { hydration_size }`, optional `COOL DOWN`). Standard catalog migration with `None` default for existing clusters.
- **Remove `pending: bool`** from the durable replica location. Forward-only migration: drop the field from the proto/Rust types and stop reading it. Any pending replicas existing at the moment of upgrade are picked up by the new controller, which converges them via the diff (a pending-flagged replica at the new size is functionally identical to "a replica the controller is tracking as part of a reconfiguration").
Contributor

(This makes it sound like we'll continue in-progress old-style reconfigurations at the moment when we migrate to the new code, but I think we'll actually roll them back, because the durably recorded size will still be the old size, because the old-style reconfiguration doesn't durably record the new size.)


The following behaviors fall out of the design rather than being its headline outcomes. They are user-observable and worth flagging in release notes and user-facing documentation.

- **Burst and reconfiguration-transient replicas appear in billing and metering identically to ordinary replicas.** A user with an `AUTO SCALING STRATEGY` set sees additional billing during hydration windows (at the configured `HYDRATION_SIZE`); a user issuing a background `ALTER CLUSTER` sees additional billing during the overlap between the old and new replica sets.
Contributor

Isn't this already the case with the old-style graceful cluster reconfiguration? As far as I know, the billing pipeline operates based on the set of replicas that actually exist, which would include overlapping replicas during old-style graceful cluster reconfigurations.


- `AUTO SCALING STRATEGY` combined with `SCHEDULE = ('on-refresh', ...)`. Initial version rejects; see [Open Questions](#open-questions).
- More than one concurrent burst replica per cluster. Initial version supports exactly one; revisit if needed.
- Per-`ALTER` timeout configuration. The current session-bound `ALTER ... WAIT UNTIL READY (TIMEOUT = ...)` semantics do not carry over directly to background reconfiguration. A system-wide timeout (via dyncfg) is used instead in v1.
Contributor

One more issue with making the graceful reconfiguration settings global: if a user has development clusters and prod clusters in the same env, they would probably need different reconfiguration settings. I guess during development they would often not need graceful reconfigurations, and might even actively not want them, e.g. to save costs (or just for simplicity).

@aljoscha aljoscha force-pushed the design-cluster-autoscaling branch 2 times, most recently from b8b168e to d7b96bb Compare May 26, 2026 19:33
aljoscha and others added 14 commits May 26, 2026 20:56
…guration

Proposes a cluster controller that runs alongside the Coordinator as a
reconciler over durable cluster config. Reshapes graceful reconfiguration to
run in the background by making the user's target the durable cluster config
and removing session-bound intent. Introduces HYDRATION_SIZE for burst replicas
during hydration, with the existing ON REFRESH scheduling lifted into the same
strategy framework.

Replaces the flat HYDRATION_SIZE cluster option with a structured
AUTO SCALING STRATEGY block that holds strategy specs (ON HYDRATION
today) and shared parameters (COOL DOWN). All key=value pairs use =
to match the established cluster-option convention (SCHEDULE precedent).
COOL DOWN replaces the dyncfg-only hysteresis hooks with a user-facing
parameter; the dyncfg default applies when the SQL block omits it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per review feedback: SHOW CLUSTERS is SQL over mz_clusters and friends, so we
can join it with the new introspection view to expose target size, current
size, and reconfiguration-in-flight status in one place rather than requiring
the user to find the introspection view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Companion to 20260522_cluster_autoscaling.md. Cuts the design into nine
PR-sized chunks (controller skeleton, ON REFRESH migration, catalog
field, SQL parsing, graceful reconfig + background ALTER, pending-flag
removal, hydration burst, introspection view, SHOW CLUSTERS extension),
each with explicit scope, deliverables, and tests at the boundaries.

The file leads with a prompt for an implementer agent so that a single
invocation picks the first TODO PR, implements it, and updates the
tracking sections — letting us drive the work one self-contained PR at a
time with the cross-cutting interface decisions locked in up front
rather than re-litigated per PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread .bk.yaml Outdated
Contributor

Why is this here?

Contributor Author

hrm, snuck in by accident 🙈


- `AUTO SCALING STRATEGY` combined with `SCHEDULE = ('on-refresh', ...)`. Initial version rejects; see [Open Questions](#open-questions).
- More than one concurrent burst replica per cluster. Initial version supports exactly one; revisit if needed.
- Dedicated push channel for reconfiguration completion (e.g., NOTICE-style). The introspection view is `SUBSCRIBE`-able, which is enough for clients that want push semantics; we're not building a separate notification surface in v1.
Contributor

I think a nice variant (particularly for testing) is a version that explicitly IS blocking. This could just be syntactic sugar around the two building blocks, but I think it could be useful.

Contributor Author

You mean a version of the ALTER that is blocking? We do have that, and the doc mentions that we can keep it using a shim. I mentioned that we only want to keep it initially, but we could also keep it indefinitely.

A reconciler computes "desired replica set" purely from durable state plus observable status. This has several properties we want:

- **Crash-safe by construction.** Any state needed to continue an operation across a restart is, by definition, in the catalog or derivable from it.
- **Composable.** Multiple strategies (graceful reconfig, hydration burst, on-refresh, future ones) each contribute to the desired set; the controller merges them with explicit precedence. No state-machine interactions between strategies.
Contributor

This makes sense, and I can't think of something better, but one challenge will be: how do you explain the decisions to the user after they create dumb policies, and how altering one affects the other.

I am guessing this is just a problem we solve with docs, but I wonder if we could somehow surface conflicting strategies.

Contributor Author

Agreed, we could show that in `SHOW CLUSTERS` once (if ever) we allow multiple strategies.
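
To make the reconciler idea from the excerpt above concrete, here is a minimal sketch of a pure "desired set versus actual set" diff. It is not taken from the PR: `ReplicaConfig`, `Command`, and `reconcile` are hypothetical names, replicas are keyed by their deterministic names, and only the size is modeled. A name that exists with a different config is deliberately not handled here.

```rust
use std::collections::BTreeMap;

/// Hypothetical replica config; only the size is modeled in this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReplicaConfig {
    pub size: String,
}

/// Hypothetical desired-state assertion emitted by the reconciler.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Command {
    EnsureExists { name: String, config: ReplicaConfig },
    EnsureDropped { name: String },
}

/// Pure function over the two sets: no hidden state, so rerunning it after a
/// restart with the same inputs yields the same commands in the same order.
pub fn reconcile(
    desired: &BTreeMap<String, ReplicaConfig>,
    actual: &BTreeMap<String, ReplicaConfig>,
) -> Vec<Command> {
    let mut commands = Vec::new();
    // Assert every desired replica that does not exist yet.
    for (name, config) in desired {
        if !actual.contains_key(name) {
            commands.push(Command::EnsureExists {
                name: name.clone(),
                config: config.clone(),
            });
        }
    }
    // Assert the removal of every replica that is no longer desired.
    for name in actual.keys() {
        if !desired.contains_key(name) {
            commands.push(Command::EnsureDropped { name: name.clone() });
        }
    }
    commands
}
```

Because the function reads only its two arguments, crash safety reduces to both maps being derivable from durable state plus observed status. This sketch also drops unwanted replicas immediately; the graceful strategy described in the doc would instead hold back the drop until the new replicas hydrate.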


**New cluster option: `AUTO SCALING STRATEGY = (...)`.** Available at `CREATE CLUSTER` and `ALTER CLUSTER ... SET (...)`. The value is a paren-enclosed list mixing strategy specs (currently only `ON HYDRATION (HYDRATION_SIZE = '...')`) with shared parameters (`COOL DOWN = '...'`). The whole block is set or replaced atomically; omitting it on ALTER (or specifying an empty list) disables auto-scaling for the cluster. Fine-grained ALTERs that change one sub-parameter are out of scope for v1. `COOL DOWN` is itself optional; if omitted, a system-wide default (via dyncfg) applies.

**Cancellation.** v1 uses ALTER-back semantics: a user cancels an in-flight reconfiguration by issuing another `ALTER CLUSTER` with the original (or any other) size. The controller observes the new target and converges. No "remembered original size" is needed because the durable `cluster.size` is the desired state; the user issues the size they want.
Contributor

Right, I'm not sure I understand this ... somewhere something has to know the previous state right? Maybe you are saying this is somehow implicitly remembered as part of a catalog mutation?


- **Graceful reconfiguration.** When the actual replica set differs from the cluster's durable managed config — covering whatever today's graceful reconfig supports (size, replication factor, availability zones, etc.) — this strategy produces an overlapping transition: create matching replicas to reach the desired set, and mark the existing wrong-config replicas for drop *after* the new replicas hydrate. This is the same end behavior as today's three-stage pipeline, but driven by the controller diffing durable state. The strategy also reads `reconfiguration_deadline`: while `now <= deadline` it drives the transition; once `now > deadline` with the new replicas still un-hydrated, it instead proposes dropping those un-hydrated target replicas and offers no replacement, leaving the existing replicas and `cluster.size` in place. The past deadline keeps it backed off, so the timed-out state is stable rather than a retry loop (see [Failure handling](#failure-handling-and-safety-rails)).

- **Hydration burst.** When the cluster's `AUTO SCALING STRATEGY` includes `ON HYDRATION` with a `HYDRATION_SIZE`, `cluster.replication_factor > 0` (i.e., the cluster is On), and the cluster has at least one object that is not fully hydrated on every existing steady replica, this strategy contributes a single additional replica at the configured `HYDRATION_SIZE`. The burst replica is removed once **at least one steady replica has all currently-existing cluster objects fully hydrated *and* its output frontiers have caught up to the burst replica's output frontiers**. Tying the catch-up check to the burst replica's frontiers (rather than wall clock) keeps tear-down decoupled from ingest cadence: we only need the steady replica to be no worse than what the burst replica gave us. We deliberately do *not* port 0dt's OOM-tolerance logic here. For burst the desired behavior is the opposite: a steady replica that is OOM-looping or otherwise unable to keep up should *prevent* tear-down, which falls out naturally from the catch-up check. The strategy block's `COOL DOWN` bounds how soon the controller may take a follow-on scaling action after a transition. Burst is bounded by a durable `burst_deadline`, taken from the optional `ON HYDRATION` `TIMEOUT` or else the dyncfg floor, and written when the burst replica is created. If the deadline passes before tear-down, the strategy drops the burst replica and **retains the deadline as a tombstone**, keeping burst suppressed while the same objects stay un-hydrated; this is the same deadline-as-tombstone behavior reconfiguration uses (see [Failure handling](#failure-handling-and-safety-rails)). The tombstone resets when the un-hydrated set clears or on a reconfiguring `ALTER`, so a burst that cannot help a too-small steady set stops thrashing rather than re-launching every cool-down. One documented edge: a newly-added object does not re-engage burst while an earlier object remains stuck; resolving or resizing past the stuck object resets it.
Contributor

> at least one steady replica has all currently-existing cluster objects fully hydrated

Are these somehow snapshotted when the reconfiguration starts? Is this an existing choice?
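
As a reading aid for the tear-down rule quoted above, the following sketch evaluates it against the object set passed in at evaluation time. Everything here is an assumption for illustration: `ObjectStatus`, `ReplicaStatus`, and `can_tear_down_burst` are hypothetical names, object IDs are plain integers, and frontiers are simplified to a single `u64` each rather than the antichains a real implementation would compare.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-object status as observed on one replica.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ObjectStatus {
    pub hydrated: bool,
    /// Simplified output frontier; an assumption made for this sketch.
    pub frontier: u64,
}

/// Object id -> status on a single replica.
pub type ReplicaStatus = BTreeMap<u64, ObjectStatus>;

/// True if this steady replica has every object hydrated and is at least as
/// far along as the burst replica on each of them.
fn steady_caught_up(objects: &[u64], steady: &ReplicaStatus, burst: &ReplicaStatus) -> bool {
    objects.iter().all(|id| match steady.get(id) {
        Some(s) if s.hydrated => {
            // If the burst replica reports nothing for this object, there is
            // no frontier to catch up to.
            burst.get(id).map_or(true, |b| s.frontier >= b.frontier)
        }
        // Missing or un-hydrated on the steady replica blocks tear-down.
        _ => false,
    })
}

/// The burst replica may go once at least one steady replica has caught up.
pub fn can_tear_down_burst(
    objects: &[u64],
    steady: &[ReplicaStatus],
    burst: &ReplicaStatus,
) -> bool {
    steady.iter().any(|s| steady_caught_up(objects, s, burst))
}
```

A steady replica that keeps restarting never reports its objects as hydrated, so under this predicate it holds the burst replica in place, which matches the stated intent of not porting the OOM-tolerance logic.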


Commands from the controller to the coordinator are **desired-state assertions** ("ensure a replica with config X exists for cluster C under name N", "ensure replica N is dropped"), not imperative actions. Each strategy contributes a fixed-shape set of replicas under deterministic names derived from the cluster and the strategy's intent (today's `r1..rN` for steady managed replicas is an example of this style), so a recomputed command is byte-identical to the previous one.

The coordinator's `CreateClusterReplica` handler today errors on a duplicate name (`SqlCatalogError::DuplicateReplica`). For the controller path we extend it to check-by-name within the transaction: if a replica under the asserted name already exists with the asserted config, the transaction commits as a no-op; if it exists with a different config, that's a controller bug and is surfaced as an error.
Contributor

So does this mean we allow two concurrent alter cluster commands?

Contributor Author

We don't: one ALTER "cancels" any previous one. But because these commands go to the Coordinator, there is concurrency and there can be races.
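
The check-by-name behavior described in the excerpt above could look roughly like the sketch below, which is also why a raced or repeated assertion is harmless. This is an illustration under assumptions, not the coordinator's actual handler: the catalog is modeled as an in-memory map, and `ensure_replica`, `EnsureOutcome`, and `EnsureError` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Hypothetical replica config; only the size is modeled in this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReplicaConfig {
    pub size: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EnsureOutcome {
    /// The replica did not exist and was recorded.
    Created,
    /// The same name with the same config already existed: a no-op.
    AlreadyPresent,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EnsureError {
    /// Same name, different config: treated as a controller bug.
    ConfigMismatch { name: String },
}

/// Idempotent "ensure this replica exists" against an in-memory stand-in for
/// the catalog. Repeating the same assertion never fails and never duplicates.
pub fn ensure_replica(
    catalog: &mut BTreeMap<String, ReplicaConfig>,
    name: &str,
    config: ReplicaConfig,
) -> Result<EnsureOutcome, EnsureError> {
    if let Some(existing) = catalog.get(name) {
        return if *existing == config {
            Ok(EnsureOutcome::AlreadyPresent)
        } else {
            Err(EnsureError::ConfigMismatch { name: name.to_string() })
        };
    }
    catalog.insert(name.to_string(), config);
    Ok(EnsureOutcome::Created)
}
```

In the real system the lookup and the insert would have to happen inside one catalog transaction, as the excerpt says; the single `&mut` borrow stands in for that here.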

@aljoscha aljoscha force-pushed the design-cluster-autoscaling branch from d7b96bb to d71bddd Compare May 27, 2026 12:05