Scale-down during failed rolling replacement can skip decommission
Environment
- Kubernetes: local kind cluster
- CockroachDB operator image: cockroachdb/cockroach-operator:v2.18.3
- Operator args: default released v2.18.3 args
- CockroachDB image: cockroachdb/cockroach:v25.2.12
- Test topology: 4-node insecure CrdbCluster, then shrink 4 -> 3
What Happened
When the user scales the CockroachDB CR down (spec.nodes 4 → 3) while the operator is in the middle of a StatefulSet rolling replacement of the highest-ordinal pod, the operator skips node decommission and shrinks the StatefulSet directly. This causes the operator to delete the CockroachDB node without decommissioning it first.
Specifically, the cluster started with four joined CockroachDB members, pod-0 to pod-3. The operator then started to restart the last pod (pod-3). While the StatefulSet was in that intermediate state, the user changed spec.nodes from 4 to 3. Instead of blocking the scale-down or decommissioning pod-3, the operator updated the StatefulSet to spec.replicas=3, deleting the last pod directly.
As a result, Kubernetes and CockroachDB had different views of the same node: Kubernetes had deleted pod-3, but the node still remained in the CockroachDB membership list, where it is considered crashed.
I have attached a reproduction script here: https://gist.github.com/oyjhl/cd672bf67a70219514355a7c8bd3c607
Expected Behavior
The operator must not lower StatefulSet replicas until CockroachDB reports that the node being removed is fully decommissioned.
The operator should either:
- decommission the CockroachDB member pod-3 before shrinking;
- block scale-down while the replacement is stuck;
- or recover the replacement first, rediscover the node identity, and then run the normal decommission flow.
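The ordering described above can be sketched in Go as a small decision function. This is only an illustration of the expected behavior, not the operator's code: the `action` type, `nextScaleDownStep`, and the `decommissioned` map (keyed by pod ordinal, true when CockroachDB reports that node as fully decommissioned) are hypothetical names introduced here.

```go
package main

import "fmt"

// action is a hypothetical label for what the reconciler should do next.
type action string

const (
	actionNone         action = "none"
	actionDecommission action = "decommission"
	actionShrink       action = "shrink"
)

// nextScaleDownStep is a hypothetical helper, not part of the operator.
// specReplicas stands for *ss.Spec.Replicas, desiredNodes for cluster.Spec().Nodes.
// decommissioned maps a pod ordinal to whether CockroachDB reports the node
// behind that pod as fully decommissioned.
func nextScaleDownStep(specReplicas, desiredNodes int32, decommissioned map[int32]bool) action {
	if specReplicas <= desiredNodes {
		// Nothing to remove.
		return actionNone
	}
	// StatefulSets remove the highest ordinal first.
	highest := specReplicas - 1
	if !decommissioned[highest] {
		// Never shrink while CockroachDB still counts this node as a member.
		return actionDecommission
	}
	return actionShrink
}

func main() {
	// pod-3 has not been decommissioned yet: decommission before shrinking.
	fmt.Println(nextScaleDownStep(4, 3, map[int32]bool{}))
	// pod-3 is reported as decommissioned: shrinking is now safe.
	fmt.Println(nextScaleDownStep(4, 3, map[int32]bool{3: true}))
	// Already at the desired size.
	fmt.Println(nextScaleDownStep(3, 3, nil))
}
```

The point of the sketch is that the decision to shrink depends on CockroachDB membership state rather than on StatefulSet rollout status, so a stuck replacement cannot bypass it.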
Where The Source Code Goes Wrong
Decommission can be skipped while Deploy is still allowed to shrink the StatefulSet.
In the bad run, the source behaves like this:
pod-3 had joined CockroachDB
pod-3 replacement was stuck
user requested 4 -> 3 scale-down
operator skipped Decommission
operator ran Deploy
Deploy wrote StatefulSet spec.replicas=3
CockroachDB still reported pod-3's node as active with nonzero replicas
The decommission gate depends on StatefulSet rollout status:
ss.Spec.Replicas            desired Kubernetes pod count stored in the StatefulSet
ss.Status.Replicas          pods currently observed by the StatefulSet controller
ss.Status.CurrentReplicas   pods already updated to the latest StatefulSet pod spec generated from the CR
cluster.Spec().Nodes        desired CockroachDB node count from the CrdbCluster
CurrentReplicas is not "how many CockroachDB nodes are safe to remove". It is only Kubernetes rollout status: how many observed pods already match the current StatefulSet revision. During a failed rolling replacement, the StatefulSet can still have four observed pods, but only three current-revision pods:
ss.Spec.Replicas = 4
ss.Status.Replicas = 4
ss.Status.CurrentReplicas = 3
cluster.Spec().Nodes = 3
In that state, the first of the two gate predicates below is false (3 != 4), so Decommission is skipped:
ss.Status.CurrentReplicas == ss.Status.Replicas
ss.Status.CurrentReplicas > cluster.Spec().Nodes
But the deploy path still sees that the desired StatefulSet should now have only three replicas and writes that smaller replica count. Instead, the decommission gate should check:
*ss.Spec.Replicas > cluster.Spec().Nodes
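To make the difference concrete, here is a minimal Go sketch that evaluates both gates against the stuck-rollout state shown earlier. It assumes the two reported predicates are combined with a logical AND, which is how this report reads them. The `rolloutState` struct and both function names are hypothetical stand-ins, not the operator's actual types.

```go
package main

import "fmt"

// rolloutState is a hypothetical, simplified stand-in for the four values
// discussed in this report.
type rolloutState struct {
	SpecReplicas    int32 // *ss.Spec.Replicas
	StatusReplicas  int32 // ss.Status.Replicas
	CurrentReplicas int32 // ss.Status.CurrentReplicas
	DesiredNodes    int32 // cluster.Spec().Nodes
}

// gateAsReported mirrors the two predicates quoted in this report,
// assuming they are ANDed together.
func gateAsReported(s rolloutState) bool {
	return s.CurrentReplicas == s.StatusReplicas &&
		s.CurrentReplicas > s.DesiredNodes
}

// gateProposed uses the comparison suggested in this report.
func gateProposed(s rolloutState) bool {
	return s.SpecReplicas > s.DesiredNodes
}

func main() {
	// State during the failed rolling replacement of pod-3.
	stuck := rolloutState{
		SpecReplicas:    4,
		StatusReplicas:  4,
		CurrentReplicas: 3,
		DesiredNodes:    3,
	}
	fmt.Println("reported gate:", gateAsReported(stuck)) // false: decommission skipped
	fmt.Println("proposed gate:", gateProposed(stuck))   // true: decommission runs
}
```

With the proposed comparison, the gate depends only on the desired pod count versus the desired node count, so an in-progress or stuck rollout no longer hides the pending scale-down.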