fix: re-land CPO registryOverride for FSI compliance (AROSLSRE-1363) #5846
avollmer-redhat wants to merge 1 commit into
Add registry override mapping for the Control Plane Operator (CPO) image source (quay.io/redhat-user-workloads/crt-redhat-acm-tenant) to ACR. This was originally landed in Azure#5162 but reverted in Azure#5189 due to customer node pool re-rolls (ITN-2026-00126).

Re-landing is now safe because upstream HyperShift fixes (#8509, #8824) ensure the CPO correctly propagates registry overrides to its init containers (e.g. availability-prober), which was the root cause of HCP cluster creation hanging when the VAP was in Deny mode.

⚠️ This change will trigger a node pool re-roll across all hosted clusters in affected regions. Coordinate deployment with SRE.

Follow-up: AROSLSRE-861 will remove the hardcoded CPO exception from the VAP CEL expression once this override is deployed and verified.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: avollmer-redhat

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing

Skipping CI for Draft Pull Request.

/test e2e-aro-hcp-dev

/test all
Status: On Hold, Node Re-Roll Coordination Needed

Context

Mariusz (mmazur) flagged that the upstream CPO fixes (#8509, #8824) landed in HyperShift main targeting 5.0, but the CPO is a versioned image: each OCP release (4.20, 4.21, 4.22) ships its own CPO binary. Our production fleet runs 4.20–4.22, so the CPO in those versions doesn't have the

Why this PR is still correct

The re-roll concern

Merging this PR changes the

This would be one re-roll, not two: when CPO backports eventually land in 4.20–4.22, customers pick them up during normal OCP version upgrades, which already re-roll nodes.

Decision needed

Keeping as draft until we decide on timing. See ARO-22152 and AROSLSRE-1363 for full analysis.
Why
The Control Plane Operator (CPO) images are pulled from quay.io/redhat-user-workloads/crt-redhat-acm-tenant/..., which is not on the FSI image registry allowlist. When the ValidatingAdmissionPolicy (VAP) is switched from Audit to Deny mode, HCP cluster creation hangs because the CPO's init containers (e.g. availability-prober) are blocked.

This PR re-adds the --registry-overrides mapping that remaps the CPO image source to ACR, so the VAP will allow these images. This mapping was originally landed in #5162 but reverted in #5189 due to an incident where it triggered unexpected customer node pool re-rolls (ITN-2026-00126).

What unblocked re-landing
Two upstream HyperShift fixes have merged that resolve the root cause:
--registry-overrides to CPO init containers (merged Jun 22)
registryoverride.Replace to handle @sha256: digest and :tag separators (merged Jun 25)

With these fixes, when we provide the registry override mapping, the CPO will correctly rewrite all image references (including init containers) from quay.io to ACR.
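To make the separator handling concrete, here is a minimal Python sketch of prefix-based image rewriting. It is not HyperShift's registryoverride.Replace (that code is Go and is not reproduced here); the function name replace_registry, the ACR host example.azurecr.io, and the image name cpo are all hypothetical and chosen only for illustration. The assumed rule is that a source prefix matches only when it is followed by the end of the string or by one of the separators /, : or @, so tags and digests survive the rewrite.

```python
# Illustrative only: NOT the HyperShift implementation.
# All names and the target registry below are hypothetical.

SOURCE = "quay.io/redhat-user-workloads/crt-redhat-acm-tenant"
TARGET = "example.azurecr.io/redhat-user-workloads/crt-redhat-acm-tenant"  # assumed ACR name
OVERRIDES = {SOURCE: TARGET}


def replace_registry(image, overrides):
    """Rewrite the repository prefix of an image reference.

    A source matches only if it is followed by the end of the string or by
    a separator ("/", ":" or "@"). This keeps the tag or digest intact and
    prevents "quay.io/foo" from matching "quay.io/foobar".
    """
    # Try longer sources first so the most specific mapping wins.
    for source in sorted(overrides, key=len, reverse=True):
        if not image.startswith(source):
            continue
        rest = image[len(source):]
        if rest == "" or rest[0] in "/:@":
            return overrides[source] + rest
    # No mapping applies: leave the reference untouched.
    return image


if __name__ == "__main__":
    for ref in (
        SOURCE + "/cpo:v1",             # tag separator
        SOURCE + "/cpo@sha256:abc123",  # digest separator (digest shortened)
        SOURCE + "-other/cpo:v1",       # similar prefix, must not match
    ):
        print(ref, "->", replace_registry(ref, OVERRIDES))
```

The boundary check is the point of the sketch: a plain string replacement would also rewrite the third reference, which belongs to a different repository.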
What changed since the original #5162
The Helm values template syntax was updated after the original PR to use {{ .acrDNSSuffix }} instead of hardcoded azurecr.io. This PR uses the updated template syntax.

Node pool re-roll impact
registryOverrides entry changes the HyperShift operator deployment spec → HO pod rolls → ignition configs re-rendered → NodePool config hash changes → CAPI MachineSet replacements across all customer clusters. This is the same mechanism that caused the Apr 28 incident.

Deployment must be coordinated with SRE. The node-rollout-detection e2e test is tracked in ARO-27491.
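The chain above hinges on a config hash changing when the override is added. The following Python sketch shows that general pattern under stated assumptions: it hashes a canonical JSON rendering of a config and treats any hash difference as a required rollout. The field names, the truncated SHA-256 hash, and the target registry example.azurecr.io are hypothetical; the real inputs and hash function used by HyperShift are not described in this PR.

```python
import hashlib
import json

# Illustrative only: field names, hash choice and registry names are assumptions.


def config_hash(spec):
    """Return a short, order-independent hash of a config dictionary."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def needs_reroll(old_spec, new_spec):
    """A rollout is needed whenever the rendered config hash differs."""
    return config_hash(old_spec) != config_hash(new_spec)


before = {"release": "4.20", "registryOverrides": {}}
after = {
    "release": "4.20",
    "registryOverrides": {
        "quay.io/redhat-user-workloads/crt-redhat-acm-tenant":
            "example.azurecr.io/redhat-user-workloads/crt-redhat-acm-tenant",
    },
}

if __name__ == "__main__":
    print("unchanged config re-rolls:", needs_reroll(before, dict(before)))
    print("added override re-rolls:", needs_reroll(before, after))
```

Because the hash covers the whole rendered config, adding a single override entry is enough to change it, which is why the change affects every cluster that consumes the config rather than only new ones.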
Follow-up
Audit to Deny mode in production

Tracking