CNTRLPLANE-3631: Predictable NodePool Rollout Control#2042
Conversation
|
@csrwng: This pull request references CNTRLPLANE-3631 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| 1. An automated image-updater bumps the HAProxy image digest in the HyperShift | ||
| operator deployment. | ||
| 2. The operator reconciles all NodePools. The full `Hash()` changes (because it | ||
| includes HAProxy), so a new user-data secret is generated with the latest |
There was a problem hiding this comment.
why is a new userdata secret generated here if there's no consumers for it?
Besides, as soon as a new userdata is created, the existing one gets deleted via token.cleanupOutdated
which would make the nodepool unable to perform scaling operations.
And the token secret gets an expiration timestamp (IgnitionServerTokenExpirationTimestampAnnotation), which will result in the payload being deleted from the ignition server
There was a problem hiding this comment.
You're right — this is the same issue I addressed in my response to your comment on the workflow section header. The current wording in step 2 is wrong: when only management-side content changes and no rollout is triggered, the controller should NOT create new token/user-data secrets or run cleanupOutdated(). The existing secrets remain valid and the MachineDeployment continues to reference them.
I'll rewrite this workflow to reflect the corrected behavior:
- Operator reconciles all NodePools. The
RolloutHashWithoutVersion()does NOT change. - No new token or user-data secrets are created. The existing secrets remain valid.
- The
ConfigUpdatePendingcondition transitions toTruewith reasonManagementConfigDrift. - No MachineDeployment or MachineSet spec change occurs. Existing nodes remain undisturbed. Scale-up continues to work using the existing user-data secret.
| `nodePoolCurrentRolloutConfig` annotation. | ||
| 4. The MachineDeployment spec is updated with the new version and | ||
| `DataSecretName`, triggering a CAPI rolling Replace. | ||
| 5. When the rollout completes (`MachineDeploymentComplete()`), the controller |
There was a problem hiding this comment.
can we articulate the steps for what happens if during a rollout the service provider config changes again.
There was a problem hiding this comment.
Good question. I'll add a fourth workflow scenario covering this. The behavior:
Mid-rollout management-side config change:
- A spec-driven rollout is in progress — the MachineDeployment is rolling,
UpdatingConfigcondition isTrue. - While the rollout is running, a management-side change occurs (e.g., HAProxy image bump).
- The rollout hash has NOT changed (management-side content is excluded), so no new rollout is triggered.
- The in-progress rollout continues using the existing token/user-data secrets. No new secrets are created (rollout hash unchanged).
- When the rollout completes, the
nodePoolCurrentRolloutConfigannotation is updated to the current rollout hash. The nodes that were just replaced have the ignition payload that was generated at the start of the rollout — they do NOT automatically pick up the mid-rollout management-side change. - The
ConfigUpdatePendingcondition may transition toTrueif the management-side change means the current payload differs from what newly-created nodes would get on a fresh provision.
Mid-rollout spec-driven config change:
This is an existing behavior and is unchanged by this enhancement. CAPI handles this via the MachineDeployment's rolling update strategy — the new desired state supersedes the in-progress one, and CAPI continues rolling until all machines match the latest template.
|
|
||
| ### Goals | ||
|
|
||
| 1. Management-plane image digest bumps (e.g., HAProxy, CPO) MUST NOT trigger |
There was a problem hiding this comment.
can we be specific about what things owned by the platform might cause a rollout today? haproxy dataplane image/overrides. What else?
It's also probably worth mentioning that haproxy dataplane image bumping is a particular exception because of early haproxy needs. There's no reason for that image to not come from payload now, unless we ever support shared ingress in selfhosted.
Also should probably include as still a risk to be addressed separately that some config.openshift.io changes to defaults might result in accidental intent for a rollout
There was a problem hiding this comment.
Good points. I'll update the Goals section to enumerate the specific platform-owned inputs that can trigger rollouts today:
-
HAProxy data plane image — the kube-apiserver-proxy static pod image. For shared ingress clusters (ROSA HCP, ARO HCP), this comes from the operator's
IMAGE_SHARED_INGRESS_HAPROXYenv var rather than the NodePool's release payload. For non-shared-ingress (self-hosted) clusters, it already comes from the NodePool's release payload (haproxy-routercomponent), so it's not an issue there. This is a historical artifact from early shared ingress bootstrapping — there's no reason it can't come from the payload now. -
Registry overrides applied to the HAProxy image —
--registry-overrideson the management cluster rewrites the image reference that gets embedded in the ignition payload, even though data plane CRI-O handles mirroring natively (tracked in OCPBUGS-86415). -
config.openshift.iocomputed defaults — theglobalConfigString()function reconciles proxy and image configs with platform-specific defaults (e.g.,Status.NoProxyentries like network CIDRs,169.254.169.254for AWS). If the operator code changes these defaults, the serialized config changes and triggers a rollout even though the user's spec didn't change.
For (3), the two-hash architecture can address this by hashing only the user's raw spec inputs (HostedCluster.Spec.Configuration.Proxy, HostedCluster.Spec.Configuration.Image) in the rollout hash — without reconciliation or computed defaults. The full reconciled config continues to be used for payload generation. This follows the same pattern as HAProxy: user intent drives rollouts, platform-computed values flow into the payload silently.
I'll update the enhancement to cover this as part of the rollout hash design rather than calling it out as a separate risk.
| not yet applied to existing nodes. | ||
|
|
||
| ### Workflow Description | ||
|
|
There was a problem hiding this comment.
can we articulate how any workflow impact the lifecycle of the token secrets and payload cache generation and expiration
There was a problem hiding this comment.
Good call — this is a gap in the current design that needs to be addressed. Let me trace the problem:
When only management-side content changes (e.g., HAProxy bump):
Hash()changes →isOutdated()returns truecleanupOutdated()expires the old token secret (2hr TTL) and deletes the old user-data secret- New token/user-data secrets are created with names based on the new
Hash() - No rollout triggered (rollout hash unchanged) → MachineDeployment still references the old (now deleted) user-data secret
- Scale-up would reference a non-existent secret
The fix: when the rollout hash has not changed, the controller should not create new token/user-data secrets or cleanup existing ones. The existing secrets remain valid — they contain a working ignition payload, and the MachineDeployment continues to reference them. Scale-up nodes get the same config as existing nodes, which is the correct behavior since we're explicitly choosing not to roll out.
For spec-driven rollouts, the lifecycle stays the same as today: new secrets are created, old ones are expired/cleaned up, and the MachineDeployment is updated to reference the new secret.
I'll add a "Token secret and payload cache lifecycle" section to each workflow scenario covering this.
| return cg.doParse(configs, cg.haproxyRawConfig) | ||
| } | ||
|
|
||
| func (cg *ConfigGenerator) parseWithoutHaproxy(configs []corev1.ConfigMap) (string, error) { |
There was a problem hiding this comment.
might want to name this after rollout vs non rollout so it's extendable
There was a problem hiding this comment.
Agreed. I'll rename the table headers and descriptions to use "rollout" vs "non-rollout" terminology:
| Hash | Category | Inputs | Used for |
|---|---|---|---|
Hash() / HashWithoutVersion() |
Non-rollout | Full MCO config including HAProxy, pull secret name, additional trust bundle name, reconciled global config | User-data secret naming, payload generation |
RolloutHash() / RolloutHashWithoutVersion() |
Rollout | MCO config excluding HAProxy, pull secret name, additional trust bundle name, user-set global config (proxy spec, image spec — without computed defaults) | Rollout decisions |
This makes it clearer that the categories are extensible — if new management-side content is added in the future, it goes into the "non-rollout" hash only.
| ignition payload so new nodes always receive the latest configuration, but they | ||
| no longer trigger rollouts of existing nodes. |
There was a problem hiding this comment.
This doesn't sound correct. When only HAProxy changes:
- A new user-data secret IS generated (different name from Hash())
- But
propagateVersionAndTemplatedoes NOT updateMachineDeployment.Spec.Template.Spec.Bootstrap.DataSecretName(because rollout hash didn't change) - Therefore scale-up nodes created by the MachineDeployment will reference the OLD DataSecretName
- Scale-up nodes get stale management-side config, not "the full payload"
There was a problem hiding this comment.
You're right — the summary as written is incorrect. With the corrected design (discussed in the thread with @enxebre), when only management-side content changes, no new token/user-data secrets are created at all. The existing secrets remain valid, and the MachineDeployment continues to reference them. Scale-up nodes get the same config as existing nodes — which is the correct and intended behavior, since we're explicitly choosing not to roll out.
I'll update the summary to reflect this accurately: "Management-side changes do not trigger rollouts, and both existing and scale-up nodes retain the current configuration until the next spec-driven rollout."
Introduces a two-hash architecture in the NodePool controller to decouple rollout decisions from management-side configuration changes. A new "rollout hash" derived only from customer-facing spec inputs (user MachineConfigs, release version, pull secret, trust bundle, user-set proxy/image config) determines whether to trigger Replace or InPlace rollouts. Management-side changes (HAProxy image bumps, registry overrides, config.openshift.io computed defaults) no longer trigger rollouts. Tracking: CNTRLPLANE-3631
|
@csrwng: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| reference embedded in the ignition payload, even though data plane CRI-O | ||
| handles mirroring natively | ||
| ([OCPBUGS-86415](https://issues.redhat.com/browse/OCPBUGS-86415)). | ||
| - **`config.openshift.io` computed defaults**: The `globalConfigString()` |
There was a problem hiding this comment.
how would we differentiate default changes from user intent changes? I don't think this is addressed by this enhacement
| updated to the current rollout hash. The nodes that were replaced have the | ||
| ignition payload that was generated at the start of the rollout — they do NOT | ||
| automatically pick up the mid-rollout management-side change. | ||
| 6. The `ConfigUpdatePending` condition may transition to `True` if the |
There was a problem hiding this comment.
can we detail how would this be implemented in the impl details section?
how does the controller decides what to set to current not rollout config, if there are two targets, the latest (written in the target annotation) and the one in flight (which is not stored in the annotation anymore because latest overrode that)?
Summary
Enhancement proposal for OCPSTRAT-3298 — Predictable NodePool Rollout Control for Hosted Control Planes.
Today, the HyperShift NodePool controller uses a single hash over the entire rendered ignition config to drive rollout decisions. Any change — including automated management-side image digest bumps (e.g., HAProxy) — triggers a full Replace rollout of all worker nodes. This has caused production incidents.
This enhancement proposes:
nodePoolCurrentRolloutConfigannotation with safe first-reconcile seeding for existing NodePoolsConfigUpdatePendingcondition for observability into management-side configuration driftJira
Test plan
🤖 Generated with Claude Code