chore: bump all component images except clusters-service (AROSLSRE-1395) by raelga · Pull Request #5926 · Azure/ARO-HCP

raelga · 2026-07-03T19:11:15Z

Advances every component in config/config.yaml to its latest digest except
clusters-service, which is intentionally pinned at the last-known-good digest
sha256:6a49b32… (vcs-ref 18b5a25, 2026-06-19).

Why

The automated bulk bump #5789 started failing ci/prow/e2e-parallel after the
clusters-service image moved to sha256:b17f6fe… (vcs-ref ee741db). Bisecting
the bump into per-component PRs isolated the culprit:

- #5910 (hypershift only): e2e-parallel green
- #5911 (ACM/MCE only): e2e-parallel green (merged)
- #5920 (clusters-service only): e2e-parallel fails reproducibly (2 consecutive runs) on the exact same specs: test/e2e/complete_cluster_create_multiversion.go:172 "verify simple web app runs" for the candidate 4.22 and 5.0 channels, with `route was never reachable: dial tcp 10.0.0.5:443: i/o timeout`. The control plane and node pools provision, but the data-plane ingress on the newest OCP channels never becomes reachable. 4.20/4.21 are unaffected.
- #5912 (everything except CS): only flaky/environmental failures (1 then 16, non-reproducible signature), consistent with the shared-CI ARM-throttling episode also hitting unrelated PRs (e.g. #5915, "fix: Remove all cluster-service SLO alerts pending rework", which edits only alert YAML).

An image-digest bump is a behavioral change (the cluster runs the new build), and
the CS-only PR reproduces a version-specific data-plane regression, so we pin CS
at the good digest while letting the remaining components advance. Follow-up on
the bad CS image is tracked in the Jira below.
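
The bisect reasoning above can be sketched as a small set computation. This is only an illustration of the logic, not tooling from the repo: the `BisectRun` type, the `isolate_culprits` helper, and the group labels (`"acm-mce"`, `"rest"`) are hypothetical names, and "reproducible failure" is taken to mean the same specs failing on consecutive runs, as described for #5920.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BisectRun:
    pr: str                      # PR number as listed in the description
    components: frozenset        # component groups bumped by that PR (labels are illustrative)
    reproducible_failure: bool   # True only when the same specs fail on consecutive runs


def isolate_culprits(runs):
    """Return groups seen in a reproducibly failing run that no clean run has cleared."""
    cleared = set()
    for run in runs:
        if not run.reproducible_failure:
            cleared |= run.components
    suspects = set()
    for run in runs:
        if run.reproducible_failure:
            suspects |= run.components - cleared
    return suspects


# Results as reported in the PR description; flaky failures on #5912 are
# treated as non-reproducible and therefore do not count against it.
RUNS = [
    BisectRun("#5910", frozenset({"hypershift"}), False),
    BisectRun("#5911", frozenset({"acm-mce"}), False),
    BisectRun("#5912", frozenset({"rest"}), False),
    BisectRun("#5920", frozenset({"clusters-service"}), True),
]

if __name__ == "__main__":
    print(isolate_culprits(RUNS))
```

With the four bisect PRs as input, the only group left standing is clusters-service, which matches the decision to pin it and advance everything else.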

Jira: https://redhat.atlassian.net/browse/AROSLSRE-1395

Components bumped (clusters-service pinned, not bumped)

| Component | Old | New |
| --- | --- | --- |
| acrpull | v0.1.23 | v0.1.24 |
| arobit forwarder | v5.0.4 (06-19) | v5.0.4 (07-03) |
| mdsd | 1.42.0-20260615 | 1.42.0-20260629 |
| kube-events | 20260621.1 | 20260701.1 |
| maestro (provider) | v1.8.2 (06-11) | v1.8.2 (06-26) |
| hypershift | 9aeb1f3 | 488ef0e |
| OADP velero-server | 1.6.1 (06-25) | 1.6.1 (07-02) |
| OADP velero azure-plugin | 1.6.1 (06-25) | 1.6.1 (07-02) |
| OADP velero hypershift-plugin | 1.6.1 (06-25) | 1.6.1 (07-02) |
| kube-state-metrics | v2.19.0 (06-12) | v2.19.0 (06-30) |
| maestro-agent-sidecar (nginx) | azl3.0.20260602 | azl3.0.20260616 |
| image-sync/oc-mirror | 690892d | 5bfc996 |
| **clusters-service** | **6a49b32 (pinned, good)** | **not bumped (b17f6fe held back)** |


Copilot

Pull request overview

This PR updates the repo’s source-of-truth configuration (config/config.yaml) and the materialized dev rendered configs to advance component image digests to their latest versions, while intentionally keeping clusters-service pinned to a last-known-good digest due to a known E2E regression tracked in AROSLSRE-1395.

Changes:

- Bump multiple component image digests (e.g., acrPull, arobit forwarder/mdsd, kube-events, hypershift, OADP velero + plugins, kube-state-metrics, prometheus operator/prometheus/config-reloader, oc-mirror).
- Update Istio istioctlVersion from 1.30.1 → 1.30.2.
- Commit regenerated config/rendered/dev/* outputs consistent with the updated defaults.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| config/config.yaml | Updates default component image digests/versions while leaving clusters-service pinned to the known-good digest. |
| config/rendered/dev/pers/westus3.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/perf/westus3.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/dev/westus3.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/cspr/westus3.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/ci01/centralus.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/ci00/centralus.yaml | Materialized dev config reflecting updated component digests/versions. |

Add an inline note on the pinned clustersService digest matching the existing shared-ingress HAProxy convention, so an automated bump does not silently reintroduce the known-bad digest without context. Addresses PR review feedback.

raelga · 2026-07-03T19:26:13Z

Confirmed root cause — clusters-service ARO-26913 (api.listening → HostedCluster Topology)

Bisecting the #5789 bulk digest bump isolated clusters-service as the sole culprit (bisect PRs: #5910 hypershift ✅, #5911 ACM/MCE ✅ merged, #5912 rest-minus-CS ✅, #5920 CS-only ❌ reproducible).

Single-variable proof: #5920 changes only the CS digest vs main. #5912 (no CS) passed candidate 4.22 & 5.0 on the identical 5.0.0-ec.3 payload at overlapping time; #5920 failed both, twice. Same OCP payload → not a version-inventory change → a CS install-behavior regression.

Mechanism (CS 18b5a25 → ee741db): the delta is merge ARO-26913, which wires the CS api.listening field to the HyperShift AzurePlatformSpec.Topology/Private fields on the HostedCluster CR (AzureTopologyFromCluster: external→PublicAndPrivate, internal→Private; commit 4229a315f) and upgrades the HyperShift API. Setting Topology/Private is gated to Swift-enabled clusters (658939fb6) — and the CI dev stamps (ci00/ci01) are Swift-enabled, so every e2e HostedCluster now gets Topology: PublicAndPrivate.
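
As a rough illustration of the mapping described above, the following sketch models how `api.listening` could translate into the HostedCluster topology value, including the Swift gate. It is not the clusters-service implementation: the function name, its signature, and the use of `None` for "field left unset" are assumptions made for this example; only the external/internal mapping and the Swift gating come from the comment.

```python
from typing import Optional

# Mapping as described in the comment: external -> PublicAndPrivate, internal -> Private.
_LISTENING_TO_TOPOLOGY = {
    "external": "PublicAndPrivate",
    "internal": "Private",
}


def azure_topology_from_cluster(listening: str, swift_enabled: bool) -> Optional[str]:
    """Hypothetical model of the api.listening -> Topology wiring.

    Returns None when the field would be left unset (non-Swift clusters).
    """
    if not swift_enabled:
        # Gate described in the comment: Topology is only set for Swift-enabled clusters.
        return None
    try:
        return _LISTENING_TO_TOPOLOGY[listening]
    except KeyError:
        raise ValueError(f"unknown api.listening value: {listening!r}") from None


if __name__ == "__main__":
    # An external cluster on a Swift-enabled stamp, i.e. the e2e case described above.
    print(azure_topology_from_cluster("external", swift_enabled=True))
```

Under this model, every external cluster on a Swift-enabled stamp ends up with `PublicAndPrivate`, which is why the CI stamps are affected while a non-Swift environment would see no change.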

These Topology/Private fields are only honored by the newest control-plane-operator shipped in candidate 4.22 & 5.0; on 4.20/4.21 the older CPO ignores them (legacy fallback) → those channels stay green. On 4.22/5.0 the new topology handling makes the guest *.apps ingress route resolve to an internal address, so cluster operators (console et al.) never complete, CVO stays Progressing on 5.0.0-ec.3, and the e2e client hits route was never reachable: dial tcp 10.0.0.5:443: i/o timeout (complete_cluster_create_multiversion.go:172).

This PR pins clusters-service to the last-known-good digest (sha256:6a49b32…, vcs-ref 18b5a25) while taking all other latest digests, unblocking the bump. The CS-side ARO-26913 topology behavior needs a fix (or a coordinated HyperShift payload) before the new CS digest can land.

Tracking: AROSLSRE-1395

raelga · 2026-07-03T19:45:12Z

Corroboration — hcpctl snapshot `analyze` (independent) closes the loop

An independent hcpctl snapshot analyze of the #5920 candidate-5.0 failure (which only saw the single failing snapshot, not the bisect) pinpointed the guest-side mechanism, and its HyperShift code citation ties it directly back to the CS trigger:

Complete causal chain

1. New CS (ARO-26913, 4229a315f) sets AzurePlatformSpec.Topology = PublicAndPrivate on the HostedCluster for external Swift clusters, i.e. every e2e cluster.
2. HyperShift HCCO .../ingress/params.go:51-54 sets the guest default IngressController Scope: Internal whenever Azure.Topology == AzureTopologyPrivate || AzureTopologyPublicAndPrivate. So the new CS flips the guest ingress from External → Internal. HCCO publishes wildcard *.apps → 10.0.0.5 (internal LB frontend) and, per reconcile.go:18, never mutates an existing IngressController (create-only).
3. On candidate 4.22 & 5.0 only, the payload's guest cluster-ingress-operator then flips the IngressController Scope: Internal → External (gen 1→2), stripping the azure-load-balancer-internal annotation; CCM migrates router-default onto the public LB and tears down the internal LB (ARM-throttled), but the wildcard DNS is never updated away from 10.0.0.5 → the route dials a deprovisioning internal LB → dial tcp 10.0.0.5:443: i/o timeout.
4. Passing siblings 4.20/4.21 show zero router-default LB churn and never flip; #5912 (old CS, no Topology set) keeps ingress External on the identical 5.0.0-ec.3 payload → pass.

So the CS ARO-26913 Topology wiring is the trigger that puts the cluster into the Internal-ingress state where the candidate payload's ingress-operator behavior breaks external reachability. Two independent fronts are viable: (a) the CS/topology side (what #5926 addresses by pinning CS), and (b) a HyperShift/ARO-HCP guardrail that reconciles the guest ingress scope so a payload can't silently flip an Azure private cluster's ingress to External.
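
The causal chain can be walked through with a toy model. This is a sketch of the reasoning only, not HyperShift or ingress-operator code: the `Guest` type and all function names are hypothetical, `203.0.113.10` is a placeholder documentation address for the public LB, and the "flip" function simply encodes the behavior reported for the candidate 4.22/5.0 payloads.

```python
from dataclasses import dataclass
from typing import Optional

INTERNAL_LB_IP = "10.0.0.5"    # address from the failure message
PUBLIC_LB_IP = "203.0.113.10"  # hypothetical placeholder (documentation range)


@dataclass
class Guest:
    scope: str          # IngressController scope: "Internal" or "External"
    wildcard_dns: str   # address the *.apps wildcard record points at
    live_lbs: set       # load balancer frontends that currently exist


def default_ingress_scope(topology: Optional[str]) -> str:
    """Scope rule from step 2: Internal for Private or PublicAndPrivate."""
    return "Internal" if topology in ("Private", "PublicAndPrivate") else "External"


def hcco_create(topology: Optional[str]) -> Guest:
    """Create-only step: set the scope and publish the wildcard record once."""
    scope = default_ingress_scope(topology)
    ip = INTERNAL_LB_IP if scope == "Internal" else PUBLIC_LB_IP
    return Guest(scope=scope, wildcard_dns=ip, live_lbs={ip})


def payload_flips_to_external(guest: Guest) -> None:
    """Step 3 as reported: scope goes External and the internal LB is torn down.

    The wildcard record is deliberately left untouched, because the
    create-only reconcile never revisits it.
    """
    if guest.scope == "Internal":
        guest.scope = "External"
        guest.live_lbs = {PUBLIC_LB_IP}


def dns_points_at_live_lb(guest: Guest) -> bool:
    return guest.wildcard_dns in guest.live_lbs


if __name__ == "__main__":
    new_cs = hcco_create("PublicAndPrivate")
    payload_flips_to_external(new_cs)
    print("new CS, candidate payload:", dns_points_at_live_lb(new_cs))

    old_cs = hcco_create(None)
    payload_flips_to_external(old_cs)
    print("old CS, candidate payload:", dns_points_at_live_lb(old_cs))
```

In this model the stale record is the whole failure: with Topology set, the wildcard still targets 10.0.0.5 after that frontend is gone, while with Topology unset the flip is a no-op and DNS and load balancer stay in agreement.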

… 18b5a25 The automated image bump (tag: latest) advances clusters-service to vcs-ref ee741db (ARO-26913), which sets AzurePlatformSpec.Topology=PublicAndPrivate on the HostedCluster; on candidate 4.22/5.0 payloads the guest ingress-operator flips the ingress LB scope back to External, leaving the *.apps route unreachable and failing e2e-parallel. Pin the image-updater source tag to the last-known-good build 18b5a25 so the bump cannot re-advance it. See AROSLSRE-1395.

openshift-ci · 2026-07-03T21:05:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: raelga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~config/OWNERS~~ [raelga]
~~tooling/image-updater/OWNERS~~ [raelga]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

raelga · 2026-07-03T21:05:51Z

Added a hard pin in the image-updater (commit `eba392c`)

The config.yaml digest pin alone wasn't durable — the automated bumper (tooling/image-updater/config.yaml, clusters-service.source.tag: "latest") resolves latest to the bad b17f6fe (vcs-ref ee741db) and would re-advance CS on the next run.

Pinned the source tag to the last-known-good immutable build tag 18b5a25 → digest sha256:6a49b32…, mirroring the existing hypershift-shared-ingress fixed-tag pin pattern (README §"Pinning Images and Rolling Back").

Verified: with the bad digest injected, the pinned updater proposes reverting to the good 6a49b32 (Tag 18b5a25); make -C config materialize + detect-change are clean. Restore tag: "latest" once the CS ARO-26913 topology regression is fixed.
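
Why the digest pin alone was not durable can be shown with a minimal sketch of a tag-following updater. This is not the image-updater's code: `propose_update` and the `TAGS` lookup table are hypothetical stand-ins for a registry query, and the digests are kept truncated exactly as they appear in this thread.

```python
from typing import Optional

# Hypothetical stand-in for a registry lookup: source tag -> resolved digest.
# Digests are truncated as in the PR discussion.
TAGS = {
    "latest": "sha256:b17f6fe…",   # resolves to the bad build (vcs-ref ee741db)
    "18b5a25": "sha256:6a49b32…",  # last-known-good immutable build tag
}


def propose_update(current_digest: str, source_tag: str, tags: dict) -> Optional[str]:
    """Return the digest the updater would propose, or None if already in sync."""
    target = tags[source_tag]
    return None if target == current_digest else target


if __name__ == "__main__":
    good, bad = TAGS["18b5a25"], TAGS["latest"]
    # Pinning only the digest: the next run following "latest" re-advances to the bad build.
    print(propose_update(good, "latest", TAGS))
    # Pinning the source tag: a bad digest gets reverted, a good one is left alone.
    print(propose_update(bad, "18b5a25", TAGS))
    print(propose_update(good, "18b5a25", TAGS))
```

The second call mirrors the verification described above: with the bad digest injected and the source tag pinned, the proposal is a revert to the good digest.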

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

raelga · 2026-07-04T07:30:05Z

/retest-required

Copilot AI review requested due to automatic review settings July 3, 2026 19:11

openshift-ci Bot requested review from hbhushan3 and janboll July 3, 2026 19:11

openshift-ci Bot added the approved label Jul 3, 2026

Copilot started reviewing on behalf of raelga July 3, 2026 19:11

raelga mentioned this pull request Jul 3, 2026

chore(config): bump clusters-service digest (bisect e2e #5789) #5920

Open

Copilot AI reviewed Jul 3, 2026

Comment thread config/config.yaml

Copilot AI review requested due to automatic review settings July 3, 2026 21:05

Copilot started reviewing on behalf of raelga July 3, 2026 21:06

Copilot AI reviewed Jul 3, 2026

Comment thread config/config.yaml Outdated

docs(config): reference AROSLSRE-1395 in clusters-service pin comment

7243211

raelga mentioned this pull request Jul 3, 2026

chore(config): bump component digests except ACM/hypershift/clusters-service (bisect e2e #5789) #5912

Closed

raelga wants to merge 4 commits into Azure:main from raelga:bump-all-except-clusters-service
