
fix: re-land CPO registryOverride for FSI compliance (AROSLSRE-1363)#5846

Draft
avollmer-redhat wants to merge 1 commit into Azure:main from avollmer-redhat:fix/reland-cpo-registry-override-AROSLSRE-1363

Conversation

@avollmer-redhat

Collaborator

Why

The Control Plane Operator (CPO) images are pulled from quay.io/redhat-user-workloads/crt-redhat-acm-tenant/..., which is not on the FSI image registry allowlist. When the ValidatingAdmissionPolicy (VAP) is switched from Audit to Deny mode, HCP cluster creation hangs because CPO's init containers (e.g. availability-prober) are blocked.
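The difference between Audit and Deny mode can be sketched as a small admission check. This is a toy model, not the actual VAP CEL expression: the allowlist below (MCR plus any `*.azurecr.io` host) is an assumption based on the epic title, and the image name `example-cpo:tag` is hypothetical because the real path is elided above.

```python
# Toy model of an image allowlist check. The allowlist rules are an assumption;
# the real policy is a ValidatingAdmissionPolicy CEL expression not shown in this PR.

def is_allowlisted(image: str) -> bool:
    """Return True if the image's registry host is MCR or an ACR host."""
    host = image.split("/", 1)[0]
    return host == "mcr.microsoft.com" or host.endswith(".azurecr.io")


def admit(image: str, mode: str) -> bool:
    """Audit admits every image (violations are only recorded); Deny rejects violations."""
    if mode not in ("Audit", "Deny"):
        raise ValueError(f"unknown mode: {mode!r}")
    return mode == "Audit" or is_allowlisted(image)


# Hypothetical image name under the quay.io path named in the description.
cpo_image = "quay.io/redhat-user-workloads/crt-redhat-acm-tenant/example-cpo:tag"
print(admit(cpo_image, "Audit"), admit(cpo_image, "Deny"))
```

In this model the same quay.io image passes under Audit and is blocked under Deny, which is why the init containers only start failing once the mode is switched.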

This PR re-adds the --registry-overrides mapping that remaps the CPO image source to ACR, so the VAP will allow these images. This mapping was originally landed in #5162 but reverted in #5189 due to an incident where it triggered unexpected customer node pool re-rolls (ITN-2026-00126).
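As a rough illustration of what such a mapping does, the sketch below parses a flag value and rewrites matching image references by prefix. It assumes the flag takes comma-separated `source=destination` pairs; the ACR host `example.azurecr.io` and the image name `example-cpo:tag` are hypothetical, and the functions are illustrative rather than HyperShift's implementation.

```python
# Illustrative prefix-rewrite model of a registry override mapping.
# Assumption: the flag value is a comma-separated list of source=destination pairs.

def parse_overrides(flag_value: str) -> dict[str, str]:
    """Parse 'src1=dst1,src2=dst2' into a {source: destination} mapping."""
    overrides = {}
    for pair in flag_value.split(","):
        if not pair:
            continue
        source, sep, dest = pair.partition("=")
        if not sep or not source or not dest:
            raise ValueError(f"malformed override: {pair!r}")
        overrides[source] = dest
    return overrides


def apply_overrides(image: str, overrides: dict[str, str]) -> str:
    """Rewrite the image if it sits under an overridden source prefix."""
    for source, dest in overrides.items():
        if image == source or image.startswith(source + "/"):
            return dest + image[len(source):]
    return image


# Hypothetical destination registry and image name.
flag = (
    "quay.io/redhat-user-workloads/crt-redhat-acm-tenant="
    "example.azurecr.io/redhat-user-workloads/crt-redhat-acm-tenant"
)
overrides = parse_overrides(flag)
print(apply_overrides(
    "quay.io/redhat-user-workloads/crt-redhat-acm-tenant/example-cpo:tag", overrides
))
```

Images outside the mapped prefix are returned unchanged, so only the CPO source path is affected in this model.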

What unblocked re-landing

Two upstream HyperShift fixes, #8509 and #8824, have merged and resolve the root cause.

With these fixes, when we provide the registry override mapping, the CPO will correctly rewrite all image references (including init containers) from quay.io to ACR.

What changed since the original #5162

The Helm values template syntax was updated after the original PR to use {{ .acrDNSSuffix }} instead of hardcoded azurecr.io. This PR uses the updated template syntax.
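To show why the placeholder matters, here is a minimal stand-in for the substitution step. It is not the repository's templating engine: it only handles `{{ .name }}` placeholders, and the registry name `exampleregistry` and the alternative suffix `azurecr.example` are hypothetical.

```python
import re

# Minimal stand-in for template rendering; handles only {{ .name }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*\.(\w+)\s*\}\}")


def render(template: str, values: dict[str, str]) -> str:
    """Replace each {{ .key }} with values[key]; fail loudly on unknown keys."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for template key {key!r}")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical registry name; only the DNS suffix is templated.
template = "exampleregistry.{{ .acrDNSSuffix }}/redhat-user-workloads/crt-redhat-acm-tenant"
print(render(template, {"acrDNSSuffix": "azurecr.io"}))
print(render(template, {"acrDNSSuffix": "azurecr.example"}))
```

With a hardcoded `azurecr.io`, the second environment would get the wrong destination host; the templated form follows whatever suffix the environment supplies.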

Node pool re-roll impact

⚠️ Adding this registryOverrides entry changes the HyperShift operator deployment spec → HO pod rolls → ignition configs re-rendered → NodePool config hash changes → CAPI MachineSet replacements across all customer clusters. This is the same mechanism that caused the Apr 28 incident.

Deployment must be coordinated with SRE. The node-rollout-detection e2e test is tracked in ARO-27491.
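The cascade above hinges on a config hash changing when any input changes. The sketch below models that idea with a SHA-256 over a canonical JSON encoding. This is an assumption for illustration only: HyperShift's actual hash inputs and algorithm are not described in this PR, and the config keys and registry host are hypothetical.

```python
import hashlib
import json

# Toy model of hash-driven rollouts. Not HyperShift's real hashing scheme.

def config_hash(config: dict) -> str:
    """Hash a config deterministically by serializing it with sorted keys."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def needs_rollout(current: dict, desired: dict) -> bool:
    """A rollout is triggered whenever the rendered config hash differs."""
    return config_hash(current) != config_hash(desired)


before = {"registryOverrides": {}}
after = {
    "registryOverrides": {
        # Hypothetical destination registry.
        "quay.io/redhat-user-workloads/crt-redhat-acm-tenant":
        "example.azurecr.io/redhat-user-workloads/crt-redhat-acm-tenant",
    }
}
print(needs_rollout(before, before), needs_rollout(before, after))
```

Because the hash covers the whole rendered config, a single added override entry is enough to mark every node pool that consumes it as out of date.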

Follow-up

  • AROSLSRE-861: Remove the hardcoded CPO exception from the VAP CEL expression (depends on this change being deployed and verified in int/stg first)
  • Switch VAP from Audit to Deny mode in production

Tracking

  • Jira: AROSLSRE-1363
  • Epic: ARO-22152 (Ensure All Deployed Images Come from MCR or ACR for MSFT FSI Compliance)

Add registry override mapping for the Control Plane Operator (CPO) image
source (quay.io/redhat-user-workloads/crt-redhat-acm-tenant) to ACR.
This was originally landed in Azure#5162 but reverted in Azure#5189 due to
customer node pool re-rolls (ITN-2026-00126).

Re-landing is now safe because upstream HyperShift fixes (#8509, #8824)
ensure the CPO correctly propagates registry overrides to its init
containers (e.g. availability-prober), which was the root cause of HCP
cluster creation hanging when the VAP was in Deny mode.

⚠️ This change will trigger a node pool re-roll across all hosted
clusters in affected regions. Coordinate deployment with SRE.

Follow-up: AROSLSRE-861 will remove the hardcoded CPO exception from
the VAP CEL expression once this override is deployed and verified.
@openshift-ci

openshift-ci Bot commented Jun 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: avollmer-redhat
Once this PR has been reviewed and has the lgtm label, please assign stevekuznetsov for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci Bot commented Jun 29, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@avollmer-redhat

Collaborator Author

/test e2e-aro-hcp-dev

@avollmer-redhat

Collaborator Author

/test all

@avollmer-redhat

avollmer-redhat commented Jun 30, 2026

Collaborator Author

Status: On Hold — Node Re-Roll Coordination Needed

Context

Mariusz (mmazur) flagged that the upstream CPO fixes (#8509, #8824) landed in HyperShift main targeting 5.0, but the CPO is a versioned image — each OCP release (4.20, 4.21, 4.22) ships its own CPO binary. Our production fleet runs 4.20–4.22, so the CPO in those versions doesn't have the registryoverride.Replace() fix for init containers yet. Backports are needed.

Why this PR is still correct

  1. It's a prerequisite — without the override mapping, the CPO never gets told to remap. This must be in place before the backports matter.
  2. The hardcoded VAP exception for CPO stays — it won't be removed (AROSLSRE-861) until backports are verified in the fleet.
  3. No regression — old CPO versions simply won't apply the override to init containers, same as today.
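The "no regression" point can be sketched with a toy pod spec. The flag `include_init_containers` is a hypothetical switch standing in for the difference between CPO versions with and without the upstream fix; the container image names and ACR host are also hypothetical, and none of this mirrors the real `registryoverride` code.

```python
# Toy comparison of CPO behavior with and without the init-container fix.
# The pod spec here is a plain dict of container lists, not a real Kubernetes object.

def rewrite(image: str, overrides: dict[str, str]) -> str:
    """Rewrite an image reference if it sits under an overridden source prefix."""
    for source, dest in overrides.items():
        if image.startswith(source + "/"):
            return dest + image[len(source):]
    return image


def apply_to_pod(pod_spec: dict, overrides: dict[str, str],
                 include_init_containers: bool) -> dict:
    """Return a copy of pod_spec with overrides applied to the selected container lists."""
    fields = ["containers"]
    if include_init_containers:
        fields.append("initContainers")
    result = {key: [dict(c) for c in value] for key, value in pod_spec.items()}
    for field in fields:
        for container in result.get(field, []):
            container["image"] = rewrite(container["image"], overrides)
    return result


SOURCE = "quay.io/redhat-user-workloads/crt-redhat-acm-tenant"
overrides = {SOURCE: "example.azurecr.io/redhat-user-workloads/crt-redhat-acm-tenant"}
pod = {
    "initContainers": [{"name": "availability-prober", "image": SOURCE + "/example-cpo:tag"}],
    "containers": [{"name": "example-main", "image": SOURCE + "/example-cpo:tag"}],
}

old_cpo = apply_to_pod(pod, overrides, include_init_containers=False)
fixed_cpo = apply_to_pod(pod, overrides, include_init_containers=True)
print(old_cpo["initContainers"][0]["image"])
print(fixed_cpo["initContainers"][0]["image"])
```

In this model the old behavior leaves the init container on quay.io, which is why the hardcoded VAP exception has to stay until the backports are verified.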

The re-roll concern

Merging this PR changes the --registry-overrides flag on the HO deployment, which cascades to CPO deployment spec → ignition/MachineConfig re-render → NodePool config hash change → all customer worker nodes roll. This is the same mechanism as the Apr 28 incident (ITN-2026-00126).

This would be one re-roll, not two — when CPO backports eventually land in 4.20–4.22, customers pick them up during normal OCP version upgrades, which already re-roll nodes.

Decision needed

  • Land now: accept one coordinated re-roll, get the prerequisite in place
  • Defer and bundle: wait for another planned HO config change that already triggers a re-roll, bundle this with it
  • Defer until backports land: wait months, then land override + remove VAP exception together

Keeping as draft until we decide on timing. See ARO-22152 and AROSLSRE-1363 for full analysis.
