Scale-down during failed rolling replacement can skip decommission #1151

Description

@oyjhl

Scale-down during failed rolling replacement can skip decommission

Environment

  • Kubernetes: local kind cluster
  • CockroachDB operator image: cockroachdb/cockroach-operator:v2.18.3
  • Operator args: default released v2.18.3 args
  • CockroachDB image: cockroachdb/cockroach:v25.2.12
  • Test topology: 4-node insecure CrdbCluster, then shrink 4 -> 3

What Happened

When the user scales the CrdbCluster CR down (spec.nodes 4 → 3) while the operator is in the middle of a StatefulSet rolling replacement of the highest-ordinal pod, the operator skips node decommission and shrinks the StatefulSet directly. This causes the operator to delete the CockroachDB node without properly decommissioning it.

Specifically, the cluster started with four joined CockroachDB members, pod-0 to pod-3. The operator then started to restart the last pod (pod-3). While the StatefulSet was in that intermediate state, the user changed spec.nodes from 4 to 3. Instead of blocking the scale-down or decommissioning pod-3, the operator updated the StatefulSet to spec.replicas=3, deleting the last pod directly.

As a result, Kubernetes and CockroachDB had different views of the same node: pod-3 had been deleted by Kubernetes, but its node still remained in the CockroachDB membership list, where it was considered crashed.

I have attached a reproduction script here: https://gist.github.com/oyjhl/cd672bf67a70219514355a7c8bd3c607

Expected Behavior

The operator must not lower StatefulSet replicas until CockroachDB reports the same node is fully decommissioned.

The operator should either:

  • decommission the CockroachDB member pod-3 before shrinking;
  • block scale-down while the replacement is stuck;
  • or recover the replacement first, rediscover the node identity, and then run
    the normal decommission flow.
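
As a rough illustration of the invariant described above, the following is a minimal sketch of a guard that refuses to lower the replica count until every pod ordinal being removed maps to a fully decommissioned node. The `nodeStatus` type, the `safeToShrink` function, and the replica count used in `main` are hypothetical and made up for illustration; they are not taken from the operator's source.

```go
package main

import "fmt"

// nodeStatus is a hypothetical, simplified view of one CockroachDB member.
// It is NOT the operator's real type.
type nodeStatus struct {
	Ordinal        int32 // StatefulSet pod ordinal backing this node
	Decommissioned bool  // true once CockroachDB reports decommission complete
	Replicas       int   // range replicas still stored on the node
}

// safeToShrink reports whether the StatefulSet may go from current to desired
// replicas. Every ordinal in [desired, current) must map to a known node that
// is fully decommissioned and holds no replicas. An unknown node identity
// (for example, a pod stuck mid-replacement) blocks the shrink.
func safeToShrink(current, desired int32, nodes map[int32]nodeStatus) bool {
	for ord := desired; ord < current; ord++ {
		n, known := nodes[ord]
		if !known || !n.Decommissioned || n.Replicas > 0 {
			return false
		}
	}
	return true
}

func main() {
	// pod-3 is still an active member with data (illustrative replica count).
	nodes := map[int32]nodeStatus{
		3: {Ordinal: 3, Decommissioned: false, Replicas: 12},
	}
	fmt.Println(safeToShrink(4, 3, nodes)) // false

	// After decommission completes, the shrink is allowed.
	nodes[3] = nodeStatus{Ordinal: 3, Decommissioned: true, Replicas: 0}
	fmt.Println(safeToShrink(4, 3, nodes)) // true
}
```

The sketch treats a missing node entry as unsafe, which corresponds to the "block scale-down while the replacement is stuck" option rather than guessing the node identity.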

Where The Source Code Goes Wrong

Decommission can be skipped while Deploy is still allowed to shrink the StatefulSet.

In the bad run, the source behaves like this:

pod-3 had joined CockroachDB
pod-3 replacement was stuck
user requested 4 -> 3 scale-down
operator skipped Decommission
operator ran Deploy
Deploy wrote StatefulSet spec.replicas=3
CockroachDB still reported pod-3's node as active with nonzero replicas

The decommission gate depends on StatefulSet rollout status:

ss.Spec.Replicas             desired Kubernetes pod count stored in the StatefulSet
ss.Status.Replicas           pods currently observed by the StatefulSet controller
ss.Status.CurrentReplicas    pods already updated to the latest StatefulSet pod spec generated from the CR
cluster.Spec().Nodes         desired CockroachDB node count from the CrdbCluster

CurrentReplicas is not "how many CockroachDB nodes are safe to remove". It is only Kubernetes rollout status: how many observed pods already match the current StatefulSet revision. During a failed rolling replacement, the StatefulSet can still have four observed pods, but only three current-revision pods:

ss.Spec.Replicas             = 4
ss.Status.Replicas           = 4
ss.Status.CurrentReplicas    = 3
cluster.Spec().Nodes         = 3

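In that state, the first of the two gate predicates below is false (3 != 4), and the second is false as well (3 > 3 does not hold), so the gate does not fire and Decommission is skipped: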

ss.Status.CurrentReplicas == ss.Status.Replicas
ss.Status.CurrentReplicas > cluster.Spec().Nodes
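
To make the evaluation concrete, here is a small sketch that plugs the values from the stuck state into the two predicates. The `rolloutState` struct and `currentGate` function are stand-ins written for this issue, and combining the predicates with a logical AND is an assumption based on the behavior observed, not a copy of the operator's code.

```go
package main

import "fmt"

// rolloutState mirrors only the four values quoted in this issue.
// It is a hypothetical stand-in, not the operator's or Kubernetes' type.
type rolloutState struct {
	SpecReplicas    int32 // ss.Spec.Replicas
	StatusReplicas  int32 // ss.Status.Replicas
	CurrentReplicas int32 // ss.Status.CurrentReplicas
	DesiredNodes    int32 // cluster.Spec().Nodes
}

// currentGate combines the two predicates above.
// Assumption: the operator requires both to hold before decommissioning.
func currentGate(s rolloutState) bool {
	return s.CurrentReplicas == s.StatusReplicas &&
		s.CurrentReplicas > s.DesiredNodes
}

func main() {
	// Failed rolling replacement: 4 observed pods, only 3 on the current revision.
	stuck := rolloutState{SpecReplicas: 4, StatusReplicas: 4, CurrentReplicas: 3, DesiredNodes: 3}
	fmt.Println(currentGate(stuck)) // false: decommission is skipped

	// Fully rolled-out StatefulSet with the same scale-down request.
	healthy := rolloutState{SpecReplicas: 4, StatusReplicas: 4, CurrentReplicas: 4, DesiredNodes: 3}
	fmt.Println(currentGate(healthy)) // true: decommission runs
}
```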

But the deploy path can still see that the desired StatefulSet should now have only three replicas, and it writes that smaller replica count. Instead, the decommission gate should compare the StatefulSet's desired replica count with the CR:

*ss.Spec.Replicas > cluster.Spec().Nodes
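
A minimal sketch of the proposed check follows, assuming the replica count is passed as a `*int32` the way `StatefulSetSpec.Replicas` is declared in `k8s.io/api/apps/v1`. The `proposedGate` function name and the nil handling are illustrative assumptions, not a patch against the operator.

```go
package main

import "fmt"

// proposedGate reports whether a scale-down is pending, based only on the
// StatefulSet's desired replica count and the CR's desired node count.
// It does not depend on rollout status such as CurrentReplicas.
// A nil specReplicas is treated as 1, the Kubernetes default for StatefulSets.
func proposedGate(specReplicas *int32, desiredNodes int32) bool {
	replicas := int32(1)
	if specReplicas != nil {
		replicas = *specReplicas
	}
	return replicas > desiredNodes
}

func main() {
	four := int32(4)
	three := int32(3)

	// Stuck replacement from this issue: spec.replicas=4, spec.nodes=3.
	fmt.Println(proposedGate(&four, 3)) // true: decommission is required first

	// Nothing to remove: spec.replicas already matches spec.nodes.
	fmt.Println(proposedGate(&three, 3)) // false
}
```

With this comparison the gate stays true for the whole time the StatefulSet still asks for more pods than the CR, regardless of how far the rolling replacement has progressed.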
