chore: bump all component images except clusters-service (AROSLSRE-1395) #5926

Open
raelga wants to merge 4 commits into Azure:main from raelga:bump-all-except-clusters-service
Conversation

raelga commented Jul 3, 2026

Collaborator

Advances every component in config/config.yaml to its latest digest **except
clusters-service**, which is intentionally pinned at the last-known-good digest
`sha256:6a49b32…` (vcs-ref `18b5a25`, 2026-06-19).

## Why

The automated bulk bump Azure#5789 started failing `ci/prow/e2e-parallel` after the
clusters-service image moved to `sha256:b17f6fe…` (vcs-ref `ee741db`). Bisecting
the bump into per-component PRs isolated the culprit:

- Azure#5910 (hypershift only) — e2e-parallel green
- Azure#5911 (ACM/MCE only) — e2e-parallel green (merged)
- Azure#5920 (**clusters-service only**) — e2e-parallel **fails reproducibly** (2 consecutive runs) on the exact same specs:
  `test/e2e/complete_cluster_create_multiversion.go:172` "verify simple web app runs" for the **candidate 4.22 and 5.0** channels, with `route was never reachable: dial tcp 10.0.0.5:443: i/o timeout` — the control plane + node pools provision, but the data-plane ingress on the newest OCP channels never becomes reachable. 4.20/4.21 are unaffected.
- Azure#5912 (everything except CS) — only flaky/environmental failures (1 then 16, non-reproducible signature), consistent with the shared-CI ARM-throttling episode also hitting unrelated PRs (e.g. Azure#5915, which edits only alert YAML).

An image-digest bump is a behavioral change (the cluster runs the new build), and
the CS-only PR reproduces a version-specific data-plane regression, so we pin CS
at the good digest while letting the remaining components advance. Follow-up on
the bad CS image is tracked in the Jira below.
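
The bisect reasoning above can be sketched as a small check: a PR counts as a real regression only when repeated runs fail on the same specs. This is an illustrative Python sketch, not tooling from this repo. `BisectRun`, `reproducible_failures`, and the `placeholder-spec-*` names are hypothetical; only the PR numbers, the failing spec name, and the failure counts (1 then 16) come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BisectRun:
    pr: str               # bisect PR, as cited in the list above
    failing_specs: tuple  # failing spec identifiers for one e2e-parallel run


def reproducible_failures(runs):
    """Return {pr: specs} for PRs with 2+ runs that all fail on the same non-empty spec set."""
    by_pr = {}
    for run in runs:
        by_pr.setdefault(run.pr, []).append(frozenset(run.failing_specs))
    result = {}
    for pr, signatures in by_pr.items():
        first = signatures[0]
        if len(signatures) >= 2 and first and all(s == first for s in signatures):
            result[pr] = first
    return result


WEB_APP = ("verify simple web app runs [candidate 4.22]",
           "verify simple web app runs [candidate 5.0]")

RUNS = [
    BisectRun("#5910", ()),        # hypershift only: green
    BisectRun("#5911", ()),        # ACM/MCE only: green
    BisectRun("#5920", WEB_APP),   # clusters-service only: run 1
    BisectRun("#5920", WEB_APP),   # clusters-service only: run 2, same specs
    # Everything except CS: 1 failure, then 16, with no stable signature.
    # The spec names are placeholders; the source does not list them.
    BisectRun("#5912", ("placeholder-spec-0",)),
    BisectRun("#5912", tuple(f"placeholder-spec-{i}" for i in range(1, 17))),
]

culprits = reproducible_failures(RUNS)
```

With this input, only `#5920` ends up in `culprits`, which is the single-variable result the pin relies on.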

Jira: https://redhat.atlassian.net/browse/AROSLSRE-1395

## Components bumped (clusters-service pinned, not bumped)

| Component | Old | New |
| --- | --- | --- |
| acrpull | v0.1.23 | v0.1.24 |
| arobit forwarder | v5.0.4 (06-19) | v5.0.4 (07-03) |
| mdsd | 1.42.0-20260615 | 1.42.0-20260629 |
| kube-events | 20260621.1 | 20260701.1 |
| maestro (provider) | v1.8.2 (06-11) | v1.8.2 (06-26) |
| hypershift | 9aeb1f3 | 488ef0e |
| OADP velero-server | 1.6.1 (06-25) | 1.6.1 (07-02) |
| OADP velero azure-plugin | 1.6.1 (06-25) | 1.6.1 (07-02) |
| OADP velero hypershift-plugin | 1.6.1 (06-25) | 1.6.1 (07-02) |
| kube-state-metrics | v2.19.0 (06-12) | v2.19.0 (06-30) |
| maestro-agent-sidecar (nginx) | azl3.0.20260602 | azl3.0.20260616 |
| image-sync/oc-mirror | 690892d | 5bfc996 |
| **clusters-service** | **6a49b32 (pinned — good)** | **— (b17f6fe held back)** |

Copilot AI left a comment

Contributor


Pull request overview

This PR updates the repo’s source-of-truth configuration (config/config.yaml) and the materialized dev rendered configs to advance component image digests to their latest versions, while intentionally keeping clusters-service pinned to a last-known-good digest due to a known E2E regression tracked in AROSLSRE-1395.

Changes:

  • Bump multiple component image digests (e.g., acrPull, arobit forwarder/mdsd, kube-events, hypershift, OADP velero + plugins, kube-state-metrics, prometheus operator/prometheus/config-reloader, oc-mirror).
  • Update Istio istioctlVersion from 1.30.1 → 1.30.2.
  • Commit regenerated config/rendered/dev/* outputs consistent with the updated defaults.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| config/config.yaml | Updates default component image digests/versions while leaving clusters-service pinned to the known-good digest. |
| config/rendered/dev/pers/westus3.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/perf/westus3.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/dev/westus3.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/cspr/westus3.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/ci01/centralus.yaml | Materialized dev config reflecting updated component digests/versions. |
| config/rendered/dev/ci00/centralus.yaml | Materialized dev config reflecting updated component digests/versions. |

Comment thread config/config.yaml
Add an inline note on the pinned clustersService digest matching the existing
shared-ingress HAProxy convention, so an automated bump does not silently
reintroduce the known-bad digest without context. Addresses PR review feedback.

raelga commented Jul 3, 2026

Collaborator Author

Confirmed root cause — clusters-service ARO-26913 (api.listening → HostedCluster Topology)

Bisecting the #5789 bulk digest bump isolated clusters-service as the sole culprit (bisect PRs: #5910 hypershift ✅, #5911 ACM/MCE ✅ merged, #5912 rest-minus-CS ✅, #5920 CS-only ❌ reproducible).

Single-variable proof: #5920 changes only the CS digest vs main. #5912 (no CS) passed candidate 4.22 & 5.0 on the identical 5.0.0-ec.3 payload at overlapping time; #5920 failed both, twice. Same OCP payload → not a version-inventory change → a CS install-behavior regression.

Mechanism (CS 18b5a25 → ee741db): the delta is the ARO-26913 merge, which wires the CS api.listening field to the HyperShift AzurePlatformSpec.Topology/Private fields on the HostedCluster CR (AzureTopologyFromCluster: external→PublicAndPrivate, internal→Private; commit 4229a315f) and upgrades the HyperShift API. Setting Topology/Private is gated to Swift-enabled clusters (658939fb6), and the CI dev stamps (ci00/ci01) are Swift-enabled, so every e2e HostedCluster now gets Topology: PublicAndPrivate.
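
The mapping described above can be sketched as follows. This is a Python illustration of the described behavior, not the clusters-service code. The function name `azure_topology_from_cluster`, its parameters, and the use of `None` for "field left unset" are assumptions; only the external/internal mapping and the Swift gate come from the comment.

```python
from typing import Optional

# Mapping as stated in the comment: external -> PublicAndPrivate, internal -> Private.
LISTENING_TO_TOPOLOGY = {"external": "PublicAndPrivate", "internal": "Private"}


def azure_topology_from_cluster(listening: str, swift_enabled: bool) -> Optional[str]:
    """Return the Topology to set on the HostedCluster, or None to leave it unset.

    Assumption: non-Swift clusters keep the old behavior (no Topology set).
    """
    if not swift_enabled:
        return None
    return LISTENING_TO_TOPOLOGY.get(listening)
```

Under this sketch, an external cluster on a Swift-enabled stamp such as ci00/ci01 gets `PublicAndPrivate`, while the same cluster on a stamp without Swift gets no Topology at all.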

These Topology/Private fields are only honored by the newest control-plane-operator shipped in candidate 4.22 & 5.0; on 4.20/4.21 the older CPO ignores them (legacy fallback) → those channels stay green. On 4.22/5.0 the new topology handling makes the guest *.apps ingress route resolve to an internal address, so cluster operators (console et al.) never complete, CVO stays Progressing on 5.0.0-ec.3, and the e2e client hits route was never reachable: dial tcp 10.0.0.5:443: i/o timeout (complete_cluster_create_multiversion.go:172).

This PR pins clusters-service to the last-known-good digest (sha256:6a49b32…, vcs-ref 18b5a25) while taking all other latest digests, unblocking the bump. The CS-side ARO-26913 topology behavior needs a fix (or a coordinated HyperShift payload) before the new CS digest can land.

Tracking: AROSLSRE-1395

raelga commented Jul 3, 2026

Collaborator Author

Corroboration — hcpctl snapshot analyze (independent) closes the loop

An independent hcpctl snapshot analyze of the #5920 candidate-5.0 failure (which only saw the single failing snapshot, not the bisect) pinpointed the guest-side mechanism, and its HyperShift code citation ties it directly back to the CS trigger:

Complete causal chain

  1. New CS (ARO-26913, 4229a315f) sets AzurePlatformSpec.Topology = PublicAndPrivate on the HostedCluster for external Swift clusters — i.e. every e2e cluster.
  2. HyperShift HCCO .../ingress/params.go:51-54 sets the guest default IngressController Scope: Internal whenever Azure.Topology == AzureTopologyPrivate || AzureTopologyPublicAndPrivate. So the new CS flips the guest ingress from External → Internal. HCCO publishes wildcard *.apps → 10.0.0.5 (internal LB frontend) and, per reconcile.go:18, never mutates an existing IngressController (create-only).
  3. On candidate 4.22 & 5.0 only, the payload's guest cluster-ingress-operator then flips the IngressController Scope: Internal → External (gen 1→2), stripping the azure-load-balancer-internal annotation; CCM migrates router-default onto the public LB and tears down the internal LB (ARM-throttled), but the wildcard DNS is never updated away from 10.0.0.5 → the route dials a deprovisioning internal LB → dial tcp 10.0.0.5:443: i/o timeout.
  4. Passing siblings 4.20/4.21 show zero router-default LB churn and never flip; #5912 (old CS, no Topology set) keeps ingress External on the identical 5.0.0-ec.3 payload → pass.
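
The chain above can be modeled with a simplified Python sketch. It is not HyperShift or ingress-operator code. The names `default_ingress_scope` and `dns_matches_live_lb` are hypothetical, and the model assumes that the wildcard DNS record is published once from the initial scope and that a flip only ever moves Internal → External.

```python
from typing import Optional

# Topologies for which step 2 says the guest default IngressController gets Scope: Internal.
INTERNAL_TOPOLOGIES = {"Private", "PublicAndPrivate"}


def default_ingress_scope(topology: Optional[str]) -> str:
    """Step 2: scope chosen when the IngressController is first created."""
    return "Internal" if topology in INTERNAL_TOPOLOGIES else "External"


def dns_matches_live_lb(topology: Optional[str], payload_flips_scope: bool) -> bool:
    """True when the wildcard *.apps record still points at the LB serving the router."""
    initial = default_ingress_scope(topology)
    dns_target = initial  # published once at creation, never updated (create-only)
    flipped = initial == "Internal" and payload_flips_scope  # step 3
    live = "External" if flipped else initial
    return dns_target == live
```

In this model, `dns_matches_live_lb("PublicAndPrivate", True)` is the #5920 case on candidate 4.22/5.0 and is the only combination that returns `False`. The 4.20/4.21 siblings (no flip) and #5912 (no Topology set) both keep DNS and the load balancer in agreement.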

So the CS ARO-26913 Topology wiring is the trigger that puts the cluster into the Internal-ingress state where the candidate payload's ingress-operator behavior breaks external reachability. Two independent fronts are viable: (a) the CS/topology side (what #5926 addresses by pinning CS), and (b) a HyperShift/ARO-HCP guardrail that reconciles the guest ingress scope so a payload can't silently flip an Azure private cluster's ingress to External.

… 18b5a25

The automated image bump (tag: latest) advances clusters-service to vcs-ref
ee741db (ARO-26913), which sets AzurePlatformSpec.Topology=PublicAndPrivate on
the HostedCluster; on candidate 4.22/5.0 payloads the guest ingress-operator
flips the ingress LB scope back to External, leaving the *.apps route
unreachable and failing e2e-parallel. Pin the image-updater source tag to the
last-known-good build 18b5a25 so the bump cannot re-advance it. See AROSLSRE-1395.
Copilot AI review requested due to automatic review settings July 3, 2026 21:05

openshift-ci Bot commented Jul 3, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: raelga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

raelga commented Jul 3, 2026

Collaborator Author

Added a hard pin in the image-updater (commit eba392c)

The config.yaml digest pin alone wasn't durable — the automated bumper (tooling/image-updater/config.yaml, clusters-service.source.tag: "latest") resolves latest to the bad b17f6fe (vcs-ref ee741db) and would re-advance CS on the next run.

Pinned the source tag to the last-known-good immutable build tag 18b5a25 → digest sha256:6a49b32…, mirroring the existing hypershift-shared-ingress fixed-tag pin pattern (README §"Pinning Images and Rolling Back").

Verified: with the bad digest injected, the pinned updater proposes reverting to the good 6a49b32 (Tag 18b5a25); make -C config materialize + detect-change are clean. Restore tag: "latest" once the CS ARO-26913 topology regression is fixed.
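
Why the source-tag pin is durable where the digest pin alone was not can be shown with a small Python sketch. It is a model of the behavior described above, not the image-updater's implementation. `REGISTRY` and `propose_digest` are hypothetical, and the digests are the short forms used in this thread.

```python
from typing import Optional

# Tag -> short digest, as reported in this thread.
REGISTRY = {
    "latest": "b17f6fe",   # resolves to the bad build (vcs-ref ee741db)
    "18b5a25": "6a49b32",  # immutable tag of the last-known-good build
}


def propose_digest(source_tag: str, current_digest: str, registry=REGISTRY) -> Optional[str]:
    """Return the digest the updater would propose, or None when config already matches."""
    target = registry[source_tag]
    return None if target == current_digest else target
```

With `tag: "latest"`, `propose_digest("latest", "6a49b32")` proposes `b17f6fe`, so the next automated run would undo the digest pin. With the tag pinned, `propose_digest("18b5a25", "b17f6fe")` proposes `6a49b32`, matching the verification described above.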

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comment thread config/config.yaml Outdated

raelga commented Jul 4, 2026

Collaborator Author

/retest-required
