chore(config): bump clusters-service digest (bisect e2e #5789) by raelga · Pull Request #5920 · Azure/ARO-HCP

raelga · 2026-07-03T13:49:48Z

Bisecting the e2e-parallel regression on the automated image-digest PR #5789. Split out from the "rest" PR #5912.

What

Bumps only the clusters-service image digest to the latest value from #5789, on top of current main. All other components stay at their main digests.

app-sre/aro-hcp-clusters-service
  sha256:6a49b32…  (main)
  sha256:b17f6fe…  (#5789 latest)

Why

hypershift (#5910) and ACM/MCE (#5911) were cleared. clusters-service was not part of the original pass→fail boundary at 128feb1, but it is bumped in the overall #5789 update, so this PR verifies it in isolation. clusters-service is the component most directly involved in HCP cluster provisioning.

Testing

Relies on ci/prow/e2e-parallel. make -C config/ detect-change is clean. Diagnostic bisect PR — not intended to merge as-is.

Special notes for your reviewer

Do not merge. Diagnostic bisect of #5789.

Copilot

Pull request overview

Diagnostic PR to bisect an e2e-parallel regression by bumping only the clusters-service container image digest to the latest value from #5789, while keeping all other component digests aligned with current main.

Changes:

Update clusters-service default image digest in config/config.yaml.
Regenerate/update affected rendered dev configs to reflect the new digest across West US 3 and Central US dev environments.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| config/config.yaml | Bumps the default `clusters-service` image digest to `sha256:b17f6fe…` (with updated metadata comment). |
| config/rendered/dev/dev/westus3.yaml | Updates rendered dev environment `clustersService.image.digest` to the new digest. |
| config/rendered/dev/pers/westus3.yaml | Updates rendered pers environment `clustersService.image.digest` to the new digest. |
| config/rendered/dev/perf/westus3.yaml | Updates rendered perf environment `clustersService.image.digest` to the new digest. |
| config/rendered/dev/cspr/westus3.yaml | Updates rendered cspr environment `clustersService.image.digest` to the new digest. |
| config/rendered/dev/ci00/centralus.yaml | Updates rendered ci00 environment `clustersService.image.digest` to the new digest. |
| config/rendered/dev/ci01/centralus.yaml | Updates rendered ci01 environment `clustersService.image.digest` to the new digest. |

ahitacat · 2026-07-03T14:14:35Z

/lgtm

openshift-ci · 2026-07-03T14:14:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahitacat, raelga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~config/OWNERS~~ [raelga]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

raelga · 2026-07-03T19:11:48Z

/hold

Holding this PR: the clusters-service digest bump it carries (sha256:b17f6fe…, vcs-ref ee741db) is confirmed to break ci/prow/e2e-parallel. Across two consecutive runs it reproducibly fails the same specs, test/e2e/complete_cluster_create_multiversion.go:172 "verify simple web app runs" for the candidate 4.22 and 5.0 channels, with the error `route was never reachable: dial tcp 10.0.0.5:443: i/o timeout`. The control plane and node pools provision, but the data-plane ingress on the newest channels never becomes reachable. 4.20/4.21 are unaffected.

This is the isolated culprit behind #5789. Do not merge. The good CS digest stays pinned via #5926 (bump-all-except-clusters-service); follow-up tracked in AROSLSRE-1395.

raelga · 2026-07-03T21:33:36Z

Root-cause evidence — this clusters-service bump breaks e2e-parallel (candidate 4.22 & 5.0)

This CS-only digest bump (18b5a25 → ee741db, i.e. old sha256:6a49b32… → new sha256:b17f6fe9…) is the confirmed culprit behind the automated bulk bump #5789 failing ci/prow/e2e-parallel. Updated with hard evidence gathered from the actual failing runs (Kusto datadump + oc adm inspect artifacts), not just repeated failures.

How it was isolated

Bisected #5789's bulk digest bump into per-component subset PRs, all driven through e2e-parallel:

| PR | Scope | Result |
| --- | --- | --- |
| #5910 | hypershift only | ✅ pass (merged) |
| #5911 | ACM/MCE only | ✅ pass (merged) |
| #5912 | everything except ACM/CS/HS | ✅ pass (closed, superseded) |
| #5926 | all bumps except CS (good CS) | ✅ candidate 4.22 & 5.0 pass, 0× `10.0.0.5` |
| #5920 (this) | clusters-service only | ❌ reproducible fail: candidate 4.22 & 5.0, twice |

Single-variable proof. #5920 changes only the CS digest vs main; #5926 changes everything except CS. Toggling only the CS image flips candidate 4.22/5.0 between PASS (good CS) and FAIL-with-10.0.0.5 (bad CS). #5926's sole failure is an unrelated back-level-4.19 nodepool provisioning timeout (cluster_version_backlevel.go:194, context deadline exceeded).
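The elimination argument can be written down as a small set computation. This is only a sketch: the component labels (`"rest"` in particular) are shorthand invented here, the pass/fail flags refer to the candidate 4.22/5.0 specs rather than the whole job, and it assumes a single culprit with no interaction between components.

```python
# Shorthand labels for the bumped component groups (names are illustrative).
ALL = {"hypershift", "acm-mce", "clusters-service", "rest"}

# (PR, components bumped relative to main, candidate 4.22/5.0 specs passed?)
runs = [
    ("#5910", {"hypershift"}, True),
    ("#5911", {"acm-mce"}, True),
    ("#5912", ALL - {"acm-mce", "clusters-service", "hypershift"}, True),
    ("#5926", ALL - {"clusters-service"}, True),
    ("#5920", {"clusters-service"}, False),
]


def suspects(runs, all_components):
    """Components present in every failing run and absent from every passing run."""
    cleared = set()
    implicated = set(all_components)
    for _pr, bumped, passed in runs:
        if passed:
            cleared |= bumped      # a passing run clears everything it bumped
        else:
            implicated &= bumped   # a failing run must contain the culprit
    return implicated - cleared


print(suspects(runs, ALL))
```

With the outcomes from the table, the only component left is `clusters-service`.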

Failure signature (from the run's test log)

test/e2e/complete_cluster_create_multiversion.go:172 — the guest *.apps route is never reachable:

route was never reachable: Get "https://agnhost-...apps.aro...": dial tcp 10.0.0.5:443: i/o timeout

Failed: candidate 5.0 and 4.22 (3 failed / 47 passed). Control plane Available, node pool Ready — failure is confined to the guest data-plane ingress path. 10.0.0.5 is a private IP.
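A quick way to triage this signature in other runs is to pull the dialed address out of the error and check whether it is private. The helper below is a hypothetical sketch (the function name and regex are not from the test suite); it relies only on Python's standard `ipaddress` module.

```python
import ipaddress
import re

# Matches the "dial tcp <ipv4>:<port>" fragment of a Go net error.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dialed_private_address(log_line):
    """Return (ip, is_private) for the dialed address, or None if no dial is found."""
    match = DIAL_RE.search(log_line)
    if match is None:
        return None
    ip = ipaddress.ip_address(match.group(1))
    return str(ip), ip.is_private


line = ('route was never reachable: Get "https://agnhost-...apps.aro...": '
        'dial tcp 10.0.0.5:443: i/o timeout')
print(dialed_private_address(line))
```

For the failing runs this reports `10.0.0.5` as private, which is why a probe from outside the cluster network times out.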

Hard evidence 1 — CS sets `Topology=PublicAndPrivate` (real HostedCluster manifests, Kusto)

Queried ServiceLogs.backendLogs datadump (kubeContent) for HostedCluster.spec.platform.azure:

| cluster | CS | `topology` | `private.type` |
| --- | --- | --- | --- |
| failing candidate-5.0 (this run) | bad | PublicAndPrivate | Swift |
| a good-CS candidate-5.0 | good | (absent / null) | (null) |

Across 24h, 32 of 315 candidate clusters carried topology=PublicAndPrivate — exactly the bad-CS runs; the other 283 are empty. This is the output of merge ARO-26913:

4229a315f wire api.listening to HyperShift Topology — AzureTopologyFromCluster: external→PublicAndPrivate.
658939fb6 make Topology/Private conditional on Swift enablement — gates it to Swift clusters. CI stamps (ci00/ci01) are Swift ⇒ every e2e HostedCluster gets Topology=PublicAndPrivate.

Hard evidence 2 — the guest ingress LB frontend (from `oc adm inspect`, directly observed)

Captured the failing 4.22 and passing 4.20 guest IngressController/Service/DNSRecord in the same run:

| Guest version | IngressController scope | `router-default` LB frontend | wildcard `*.apps` DNS | ingress-operator status |
| --- | --- | --- | --- | --- |
| 4.22 (FAIL) | External | `10.0.0.5` (private) | `10.0.0.5` | LoadBalancerReady=True, Progressing=False, DNSReady=True |
| 4.20 (PASS) | External | `20.106.106.144` (public) | `20.106.106.144` | same (healthy) |

Both IngressControllers are scope: External at generation 2, and neither Service carries azure-load-balancer-internal. The discriminator is not the k8s scope — it is the Azure LB frontend IP: on the 4.22/5.0 payload the ingress LB is provisioned with a private frontend (10.0.0.5, from the node subnet) despite scope: External, the operator reports full success, and DNS faithfully publishes that private IP as the wildcard. On 4.20/4.21 the frontend is public → reachable. 10.0.0.5 recurs across both failing runs/clusters ⇒ it is the internal LB frontend, not a random address.

Mechanism

1. CS ARO-26913 sets HostedCluster.spec.platform.azure.topology=PublicAndPrivate for Swift clusters (all CI versions). (evidence 1)
2. On the 4.22/5.0 guest payload the ingress LoadBalancer is provisioned with a private frontend (10.0.0.5) even though the Service is External/no-internal-annotation. The internalization derives from the PublicAndPrivate topology wiring, not the k8s annotation. The operator deems it healthy and publishes the private IP. On 4.20/4.21 the frontend is public. (evidence 2)
3. e2e probes *.apps from a public vantage ⇒ dial 10.0.0.5:443 i/o timeout. (test log)
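The first step can be modelled in a few lines to make the gating explicit. This is not the clusters-service code: `topology_for` is a hypothetical name, and the model covers only the two behaviours described in this thread (the external→PublicAndPrivate mapping from 4229a315f and the Swift gate from 658939fb6).

```python
from typing import Optional


def topology_for(listening: str, swift_enabled: bool) -> Optional[str]:
    """Hypothetical model of the Topology value CS emits, per the two commits above."""
    if not swift_enabled:
        # Gate described for 658939fb6: Topology is left unset without Swift.
        return None
    if listening == "external":
        # Mapping described for 4229a315f.
        return "PublicAndPrivate"
    # Other api.listening values are not described in this thread.
    raise NotImplementedError(f"unmodelled api.listening value: {listening!r}")


# CI stamps (ci00/ci01) are Swift, so a public e2e cluster still gets a topology.
print(topology_for("external", swift_enabled=True))
print(topology_for("external", swift_enabled=False))
```

Under this model, a public (`external`) cluster on a Swift stamp is stamped `PublicAndPrivate`, which matches the manifests in evidence 1.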

Note: this supersedes an earlier hypothesis that 4.22/5.0 retained scope: Internal. Direct inspect shows scope: External on both pass and fail — the difference is purely the provisioned frontend IP.

Open question for the CS / HyperShift owners

Why does the newer payload internalize the ingress frontend for PublicAndPrivate while 4.20/4.21 keep it public? Pinning that down needs guest CCM / cloud-network-config-controller logs (guest-side, not in CI artifacts). Regardless, the trigger is CS now emitting Topology=PublicAndPrivate for these (public) e2e clusters:

Should api.listening=external map to PublicAndPrivate, or leave Topology unset for non-private clusters? (AzureTopologyFromCluster, 4229a315f.)
Confirm the Swift-enablement gate (658939fb6) is intended to apply to public e2e clusters.

Tracking: AROSLSRE-1395 · investigation/fix AROSLSRE-1404. Mitigation: #5926 pins CS to the good 18b5a25 (config digest + image-updater source tag) until resolved. Keep this PR on hold — retained as the reproduction/evidence.

raelga · 2026-07-03T21:51:38Z

/test e2e-parallel

openshift-ci · 2026-07-04T00:06:23Z

@raelga: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-parallel | `cf2c936` | link | true | `/test e2e-parallel` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

chore(config): bump clusters-service digest (bisect e2e Azure#5789)

cf2c936

Copilot AI review requested due to automatic review settings July 3, 2026 13:49

openshift-ci Bot requested review from janboll and sclarkso July 3, 2026 13:49

openshift-ci Bot added the approved label Jul 3, 2026

Copilot started reviewing on behalf of raelga July 3, 2026 13:50

Copilot AI reviewed Jul 3, 2026

openshift-ci Bot assigned ahitacat Jul 3, 2026

openshift-ci Bot added the lgtm label Jul 3, 2026

raelga mentioned this pull request Jul 3, 2026

chore: bump all component images except clusters-service (AROSLSRE-1395) #5926

Open

openshift-ci Bot added the do-not-merge/hold label Jul 3, 2026

raelga wants to merge 1 commit into Azure:main from raelga:bisect/cs-only
