chore(config): bump clusters-service digest (bisect e2e #5789) #5920

Open
raelga wants to merge 1 commit into Azure:main from raelga:bisect/cs-only

raelga commented Jul 3, 2026

Bisecting the e2e-parallel regression on the automated image-digest PR #5789. Split out from the "rest" PR #5912.

What

Bumps only the clusters-service image digest to the latest value from #5789, on top of current main. All other components stay at their main digests.

app-sre/aro-hcp-clusters-service
  sha256:6a49b32…  (main)
  sha256:b17f6fe…  (#5789 latest)

Why

hypershift (#5910) and ACM/MCE (#5911) were cleared. clusters-service was not part of the original pass→fail boundary at 128feb1, but it is bumped in the overall #5789 update, so this PR verifies it in isolation. clusters-service is the component most directly involved in HCP cluster provisioning.

Testing

Relies on ci/prow/e2e-parallel. make -C config/ detect-change is clean. Diagnostic bisect PR — not intended to merge as-is.

Special notes for your reviewer

Do not merge. Diagnostic bisect of #5789.

Copilot AI review requested due to automatic review settings July 3, 2026 13:49
@openshift-ci openshift-ci Bot requested review from janboll and sclarkso July 3, 2026 13:49
@openshift-ci openshift-ci Bot added the approved label Jul 3, 2026

Copilot AI left a comment

Pull request overview

Diagnostic PR to bisect an e2e-parallel regression by bumping only the clusters-service container image digest to the latest value from #5789, while keeping all other component digests aligned with current main.

Changes:

  • Update clusters-service default image digest in config/config.yaml.
  • Regenerate/update affected rendered dev configs to reflect the new digest across West US 3 and Central US dev environments.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| config/config.yaml | Bumps the default clusters-service image digest to sha256:b17f6fe… (with updated metadata comment). |
| config/rendered/dev/dev/westus3.yaml | Updates rendered dev environment clustersService.image.digest to the new digest. |
| config/rendered/dev/pers/westus3.yaml | Updates rendered pers environment clustersService.image.digest to the new digest. |
| config/rendered/dev/perf/westus3.yaml | Updates rendered perf environment clustersService.image.digest to the new digest. |
| config/rendered/dev/cspr/westus3.yaml | Updates rendered cspr environment clustersService.image.digest to the new digest. |
| config/rendered/dev/ci00/centralus.yaml | Updates rendered ci00 environment clustersService.image.digest to the new digest. |
| config/rendered/dev/ci01/centralus.yaml | Updates rendered ci01 environment clustersService.image.digest to the new digest. |

ahitacat commented Jul 3, 2026

/lgtm

openshift-ci Bot commented Jul 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahitacat, raelga

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

raelga commented Jul 3, 2026

/hold

Holding this PR: the clusters-service digest bump it carries (sha256:b17f6fe…, vcs-ref ee741db) is confirmed to break ci/prow/e2e-parallel. Two consecutive runs fail the same specs: test/e2e/complete_cluster_create_multiversion.go:172 "verify simple web app runs" for the candidate 4.22 and 5.0 channels. The error is: route was never reachable: dial tcp 10.0.0.5:443: i/o timeout. The control plane and node pools provision, but the data-plane ingress on the newest channels never becomes reachable. 4.20 and 4.21 are unaffected.

This is the isolated culprit behind #5789. Do not merge. The good CS digest stays pinned via #5926 (bump-all-except-clusters-service); follow-up tracked in AROSLSRE-1395.

raelga commented Jul 3, 2026

Root-cause evidence — this clusters-service bump breaks e2e-parallel (candidate 4.22 & 5.0)

This CS-only digest bump (18b5a25 → ee741db, i.e. old sha256:6a49b32… → new sha256:b17f6fe9…) is the confirmed culprit behind the automated bulk bump #5789 failing ci/prow/e2e-parallel. This comment has been updated with hard evidence gathered from the actual failing runs (Kusto datadump + oc adm inspect artifacts), not just repeated failures.

How it was isolated

Bisected #5789's bulk digest bump into per-component subset PRs, all driven through e2e-parallel:

| PR | Scope | Result |
| --- | --- | --- |
| #5910 | hypershift only | ✅ pass (merged) |
| #5911 | ACM/MCE only | ✅ pass (merged) |
| #5912 | everything except ACM/CS/HS | ✅ pass (closed, superseded) |
| #5926 | all bumps except CS (good CS) | ✅ candidate 4.22 & 5.0 pass, 0× 10.0.0.5 |
| #5920 (this) | clusters-service only | reproducible fail: candidate 4.22 & 5.0, twice |

Single-variable proof. #5920 changes only the CS digest vs main; #5926 changes everything except CS. Toggling only the CS image flips candidate 4.22/5.0 between PASS (good CS) and FAIL-with-10.0.0.5 (bad CS). #5926's sole failure is an unrelated back-level-4.19 nodepool provisioning timeout (cluster_version_backlevel.go:194, context deadline exceeded).
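
The elimination logic behind the bisect table can be sketched in a few lines of Python. The component labels (`hypershift`, `acm-mce`, `clusters-service`, `rest`) and the `isolate_culprits` helper are illustrative names, not identifiers from the repository; the pass/fail values are the candidate 4.22/5.0 results reported for each PR. A minimal sketch, assuming a component is cleared as soon as it appears in any passing PR:

```python
from typing import Dict, FrozenSet, Set, Tuple

# PR number -> (components whose digests were bumped, candidate 4.22/5.0 passed?).
# The labels are illustrative groupings of the table rows, not repo identifiers.
RESULTS: Dict[int, Tuple[FrozenSet[str], bool]] = {
    5910: (frozenset({"hypershift"}), True),
    5911: (frozenset({"acm-mce"}), True),
    5912: (frozenset({"rest"}), True),
    5926: (frozenset({"hypershift", "acm-mce", "rest"}), True),
    5920: (frozenset({"clusters-service"}), False),
}


def isolate_culprits(results: Dict[int, Tuple[FrozenSet[str], bool]]) -> Set[str]:
    """Return components present in every failing PR and in no passing PR."""
    cleared: Set[str] = set()
    failing_sets = []
    for components, passed in results.values():
        if passed:
            cleared |= components  # anything in a passing PR is exonerated
        else:
            failing_sets.append(set(components))
    if not failing_sets:
        return set()
    suspects = set.intersection(*failing_sets)
    return suspects - cleared


print(isolate_culprits(RESULTS))  # {'clusters-service'}
```

The sketch only models the candidate 4.22/5.0 specs; the unrelated back-level-4.19 timeout on #5926 is deliberately excluded from the pass/fail flag.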

Failure signature (from the run's test log)

test/e2e/complete_cluster_create_multiversion.go:172 — the guest *.apps route is never reachable:

route was never reachable: Get "https://agnhost-...apps.aro...": dial tcp 10.0.0.5:443: i/o timeout

Failed: candidate 5.0 and 4.22 (3 failed / 47 passed). Control plane Available, node pool Ready — failure is confined to the guest data-plane ingress path. 10.0.0.5 is a private IP.
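
The "private IP" observation can be checked mechanically. The following is a minimal sketch, assuming the log line has the shape quoted above; `dialed_address` and `DIAL_RE` are hypothetical helper names, and only the Python standard library is used:

```python
import ipaddress
import re
from typing import Optional

# Matches the "dial tcp <ipv4>:<port>" fragment of a Go net error message.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dialed_address(log_line: str) -> Optional[ipaddress.IPv4Address]:
    """Extract the IPv4 address the client tried to dial, if any."""
    match = DIAL_RE.search(log_line)
    return ipaddress.IPv4Address(match.group(1)) if match else None


line = (
    'route was never reachable: Get "https://agnhost-...apps.aro...": '
    "dial tcp 10.0.0.5:443: i/o timeout"
)
addr = dialed_address(line)
print(addr, addr.is_private)  # 10.0.0.5 True
```

An address in 10.0.0.0/8 is not routable from a public vantage point, which is consistent with the i/o timeout rather than a connection refusal.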

Hard evidence 1 — CS sets Topology=PublicAndPrivate (real HostedCluster manifests, Kusto)

Queried ServiceLogs.backendLogs datadump (kubeContent) for HostedCluster.spec.platform.azure:

| cluster | CS | topology | private.type |
| --- | --- | --- | --- |
| failing candidate-5.0 (this run) | bad | PublicAndPrivate | Swift |
| a good-CS candidate-5.0 | good | (absent / null) | (null) |

Across 24h, 32 of 315 candidate clusters carried topology=PublicAndPrivate — exactly the bad-CS runs; the other 283 are empty. This is the output of merge ARO-26913:

  • 4229a315f wire api.listening to HyperShift Topology (AzureTopologyFromCluster: external → PublicAndPrivate).
  • 658939fb6 make Topology/Private conditional on Swift enablement — gates it to Swift clusters. CI stamps (ci00/ci01) are Swift ⇒ every e2e HostedCluster gets Topology=PublicAndPrivate.
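
The combined effect of the two commits, as described in this thread, can be modelled as below. This is a hypothetical Python model and not the clusters-service code: the function name, its parameters, and the handling of values other than `external` are assumptions made for illustration.

```python
from typing import Optional


def topology_for_cluster(listening: str, swift_enabled: bool) -> Optional[str]:
    """Hypothetical model of the Topology value CS emits for a HostedCluster.

    Encodes only what the thread states:
      - 658939fb6: Topology/Private is set only when Swift is enabled.
      - 4229a315f: api.listening=external maps to PublicAndPrivate.
    """
    if not swift_enabled:
        return None  # Topology left unset on non-Swift clusters
    if listening == "external":
        return "PublicAndPrivate"
    # Other listening values are not described in this thread.
    raise ValueError(f"mapping for listening={listening!r} is not covered here")


# CI stamps (ci00/ci01) are Swift, so public e2e clusters get the new topology.
print(topology_for_cluster("external", swift_enabled=True))   # PublicAndPrivate
print(topology_for_cluster("external", swift_enabled=False))  # None
```

Under this model every public e2e cluster on a Swift stamp receives PublicAndPrivate, which matches the 32 clusters seen in the Kusto data.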

Hard evidence 2 — the guest ingress LB frontend (from oc adm inspect, directly observed)

Captured the failing 4.22 and passing 4.20 guest IngressController/Service/DNSRecord in the same run:

| Version | IngressController scope | router-default LB frontend | wildcard *.apps DNS | ingress-operator status |
| --- | --- | --- | --- | --- |
| 4.22 (FAIL) | External | 10.0.0.5 (private) | 10.0.0.5 | LoadBalancerReady=True, Progressing=False, DNSReady=True |
| 4.20 (PASS) | External | 20.106.106.144 (public) | 20.106.106.144 | same (healthy) |

Both IngressControllers are scope: External at generation 2, and neither Service carries azure-load-balancer-internal. The discriminator is not the k8s scope — it is the Azure LB frontend IP: on the 4.22/5.0 payload the ingress LB is provisioned with a private frontend (10.0.0.5, from the node subnet) despite scope: External, the operator reports full success, and DNS faithfully publishes that private IP as the wildcard. On 4.20/4.21 the frontend is public → reachable. 10.0.0.5 recurs across both failing runs/clusters ⇒ it is the internal LB frontend, not a random address.
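
The same comparison can be expressed as a field-by-field diff of the two observations. A minimal sketch, assuming the inspected values are flattened into dictionaries; the key names and the `differing_fields` helper are illustrative, while the values are the ones reported for the 4.22 and 4.20 guests:

```python
from typing import Dict, Tuple

# Values observed via oc adm inspect; key names are illustrative.
FAIL_4_22: Dict[str, str] = {
    "scope": "External",
    "lb_frontend": "10.0.0.5",
    "wildcard_dns": "10.0.0.5",
    "load_balancer_ready": "True",
    "progressing": "False",
    "dns_ready": "True",
}
PASS_4_20: Dict[str, str] = {
    "scope": "External",
    "lb_frontend": "20.106.106.144",
    "wildcard_dns": "20.106.106.144",
    "load_balancer_ready": "True",
    "progressing": "False",
    "dns_ready": "True",
}


def differing_fields(a: Dict[str, str], b: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {field: (value_in_a, value_in_b)} for fields whose values differ."""
    return {key: (a[key], b[key]) for key in a if a[key] != b.get(key)}


diff = differing_fields(FAIL_4_22, PASS_4_20)
print(sorted(diff))  # ['lb_frontend', 'wildcard_dns']
```

Scope and operator status drop out of the diff, leaving only the frontend IP and the DNS record that mirrors it.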

Mechanism

  1. CS ARO-26913 sets HostedCluster.spec.platform.azure.topology=PublicAndPrivate for Swift clusters (all CI versions). (evidence 1)
  2. On the 4.22/5.0 guest payload the ingress LoadBalancer is provisioned with a private frontend (10.0.0.5) even though the Service is External/no-internal-annotation — the internalization derives from the PublicAndPrivate topology wiring, not the k8s annotation. The operator deems it healthy and publishes the private IP. On 4.20/4.21 the frontend is public. (evidence 2)
  3. e2e probes *.apps from a public vantage ⇒ dial 10.0.0.5:443 i/o timeout. (test log)

Note: this supersedes an earlier hypothesis that 4.22/5.0 retained scope: Internal. Direct inspect shows scope: External on both pass and fail — the difference is purely the provisioned frontend IP.

Open question for the CS / HyperShift owners

Why does the newer payload internalize the ingress frontend for PublicAndPrivate while 4.20/4.21 keeps it public? Pinning that exactly needs guest CCM / cloud-network-config-controller logs (guest-side, not in CI artifacts). Regardless, the trigger is CS now emitting Topology=PublicAndPrivate for these (public) e2e clusters:

  • Should api.listening=external map to PublicAndPrivate, or leave Topology unset for non-private clusters? (AzureTopologyFromCluster, 4229a315f.)
  • Confirm the Swift-enablement gate (658939fb6) is intended to apply to public e2e clusters.

Tracking: AROSLSRE-1395 · investigation/fix AROSLSRE-1404. Mitigation: #5926 pins CS to the good 18b5a25 (config digest + image-updater source tag) until resolved. Keep this PR on hold — retained as the reproduction/evidence.

raelga commented Jul 3, 2026

/test e2e-parallel

openshift-ci Bot commented Jul 4, 2026

@raelga: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-parallel | cf2c936 | link | true | /test e2e-parallel |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
