chore(config): bump clusters-service digest (bisect e2e #5789) #5920
raelga wants to merge 1 commit into
Pull request overview
Diagnostic PR to bisect an e2e-parallel regression by bumping only the clusters-service container image digest to the latest value from #5789, while keeping all other component digests aligned with current main.
Changes:
- Update clusters-service default image digest in config/config.yaml.
- Regenerate/update affected rendered dev configs to reflect the new digest across West US 3 and Central US dev environments.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| config/config.yaml | Bumps the default clusters-service image digest to sha256:b17f6fe… (with updated metadata comment). |
| config/rendered/dev/dev/westus3.yaml | Updates rendered dev environment clustersService.image.digest to the new digest. |
| config/rendered/dev/pers/westus3.yaml | Updates rendered pers environment clustersService.image.digest to the new digest. |
| config/rendered/dev/perf/westus3.yaml | Updates rendered perf environment clustersService.image.digest to the new digest. |
| config/rendered/dev/cspr/westus3.yaml | Updates rendered cspr environment clustersService.image.digest to the new digest. |
| config/rendered/dev/ci00/centralus.yaml | Updates rendered ci00 environment clustersService.image.digest to the new digest. |
| config/rendered/dev/ci01/centralus.yaml | Updates rendered ci01 environment clustersService.image.digest to the new digest. |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahitacat, raelga |
|
/hold Holding this PR: the clusters-service digest bump it carries ( This is the isolated culprit behind #5789. Do not merge. The good CS digest stays pinned via #5926 (bump-all-except-clusters-service); follow-up tracked in AROSLSRE-1395. |
Root-cause evidence: this clusters-service bump breaks e2e-parallel (candidate 4.22 & 5.0)
This CS-only digest bump (
How it was isolated
Bisected #5789's bulk digest bump into per-component subset PRs, all driven through e2e-parallel:
Single-variable proof. #5920 changes only the CS digest vs
Failure signature (from the run's test log)
Failed: candidate 5.0 and 4.22 (3 failed / 47 passed). Control plane Available, node pool Ready; failure is confined to the guest data-plane ingress path.
Hard evidence 1: CS sets
| cluster | CS | topology | private.type |
|---|---|---|---|
| failing candidate-5.0 (this run) | bad | PublicAndPrivate | Swift |
| a good-CS candidate-5.0 | good | (absent / null) | (null) |
Across 24h, 32 of 315 candidate clusters carried topology=PublicAndPrivate, exactly the bad-CS runs; the other 283 are empty. This is the output of merge ARO-26913:
- 4229a315f wire api.listening to HyperShift Topology (AzureTopologyFromCluster: external→PublicAndPrivate).
- 658939fb6 make Topology/Private conditional on Swift enablement; this gates it to Swift clusters.
CI stamps (ci00/ci01) are Swift ⇒ every e2e HostedCluster gets Topology=PublicAndPrivate.
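To make the combined effect of the two commits easier to reason about, here is a toy model in Python. It is not the clusters-service code: the function name, parameter names, and string values are hypothetical, and it only encodes what this comment states (external maps to PublicAndPrivate, and only when Swift is enabled). Behavior for other listening values is not described here, so the sketch leaves the topology unset for them.

```python
from typing import Optional


def topology_as_described(listening: str, swift_enabled: bool) -> Optional[str]:
    """Toy model of the ARO-26913 behavior as described in this thread.

    Hypothetical helper for illustration only; not the real
    AzureTopologyFromCluster implementation.
    """
    if not swift_enabled:
        # 658939fb6: Topology/Private is only set on Swift clusters.
        return None
    if listening == "external":
        # 4229a315f: external is mapped to PublicAndPrivate.
        return "PublicAndPrivate"
    # Other listening values are not covered by this thread (assumption:
    # leave the topology unset in this sketch).
    return None


# CI stamps (ci00/ci01) are Swift, so a public e2e cluster gets a topology.
print(topology_as_described("external", swift_enabled=True))
# Without Swift the field stays unset, matching the good-CS row.
print(topology_as_described("external", swift_enabled=False))
```

Under this model the open design question reduces to whether the `listening == "external"` branch should return a value at all for public clusters.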
Hard evidence 2 — the guest ingress LB frontend (from oc adm inspect, directly observed)
Captured the failing 4.22 and passing 4.20 guest IngressController/Service/DNSRecord in the same run:
| guest payload | IngressController scope | router-default LB frontend | wildcard *.apps DNS | ingress-operator status |
|---|---|---|---|---|
| 4.22 (FAIL) | External | 10.0.0.5 (private) | 10.0.0.5 | LoadBalancerReady=True, Progressing=False, DNSReady=True |
| 4.20 (PASS) | External | 20.106.106.144 (public) | 20.106.106.144 | same (healthy) |
Both IngressControllers are scope: External at generation 2, and neither Service carries azure-load-balancer-internal. The discriminator is not the k8s scope — it is the Azure LB frontend IP: on the 4.22/5.0 payload the ingress LB is provisioned with a private frontend (10.0.0.5, from the node subnet) despite scope: External, the operator reports full success, and DNS faithfully publishes that private IP as the wildcard. On 4.20/4.21 the frontend is public → reachable. 10.0.0.5 recurs across both failing runs/clusters ⇒ it is the internal LB frontend, not a random address.
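Since the operator status is identical on both sides, a triage check has to look at the address itself rather than at conditions. A minimal sketch of that check, using Python's standard ipaddress module on the two frontend IPs observed above; the helper name frontend_kind is made up for illustration and is not part of any e2e tooling.

```python
import ipaddress


def frontend_kind(ip: str) -> str:
    """Classify an LB frontend / wildcard DNS target as private or public."""
    addr = ipaddress.ip_address(ip)
    return "private" if addr.is_private else "public"


# Frontend IPs taken from the oc adm inspect capture in this comment.
observed = {
    "4.22 (FAIL)": "10.0.0.5",
    "4.20 (PASS)": "20.106.106.144",
}

for payload, ip in observed.items():
    print(f"{payload}: {ip} is {frontend_kind(ip)}")
```

A check like this could flag a scope: External ingress whose published address is private, which is the exact combination that the ingress-operator conditions do not surface.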
Mechanism
- CS ARO-26913 sets HostedCluster.spec.platform.azure.topology=PublicAndPrivate for Swift clusters (all CI versions). (evidence 1)
- On the 4.22/5.0 guest payload the ingress LoadBalancer is provisioned with a private frontend (10.0.0.5) even though the Service is External/no-internal-annotation; the internalization derives from the PublicAndPrivate topology wiring, not the k8s annotation. The operator deems it healthy and publishes the private IP. On 4.20/4.21 the frontend is public. (evidence 2)
- e2e probes *.apps from a public vantage ⇒ dial 10.0.0.5:443 i/o timeout. (test log)
Note: this supersedes an earlier hypothesis that 4.22/5.0 retained scope: Internal. Direct inspect shows scope: External on both pass and fail; the difference is purely the provisioned frontend IP.
Open question for the CS / HyperShift owners
Why does the newer payload internalize the ingress frontend for PublicAndPrivate while 4.20/4.21 keeps it public? Pinning that exactly needs guest CCM / cloud-network-config-controller logs (guest-side, not in CI artifacts). Regardless, the trigger is CS now emitting Topology=PublicAndPrivate for these (public) e2e clusters:
- Should api.listening=external map to PublicAndPrivate, or leave Topology unset for non-private clusters? (AzureTopologyFromCluster, 4229a315f.)
- Confirm the Swift-enablement gate (658939fb6) is intended to apply to public e2e clusters.
Tracking: AROSLSRE-1395 · investigation/fix AROSLSRE-1404. Mitigation: #5926 pins CS to the good 18b5a25 (config digest + image-updater source tag) until resolved. Keep this PR on hold — retained as the reproduction/evidence.
|
/test e2e-parallel |
|
@raelga: The following test failed, say |
Bisecting the e2e-parallel regression on the automated image-digest PR #5789. Split out from the "rest" PR #5912.
What
Bumps only the clusters-service image digest to the latest value from #5789, on top of current main. All other components stay at their main digests.
Why
hypershift (#5910) and ACM/MCE (#5911) were cleared. clusters-service was not part of the original pass→fail boundary at 128feb1, but it is bumped in the overall #5789 update, so this PR verifies it in isolation. clusters-service is the component most directly involved in HCP cluster provisioning.
Testing
Relies on ci/prow/e2e-parallel. make -C config/ detect-change is clean. Diagnostic bisect PR, not intended to merge as-is.
Special notes for your reviewer
Do not merge. Diagnostic bisect of #5789.
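For reference, the subset-bisect idea behind this series of PRs can be sketched as a halving search. This is an illustration, not the procedure that was actually run (the real bisect used hand-picked per-component PRs). It assumes exactly one culprit and a deterministic oracle; find_culprit, the passes callback, and the component labels are hypothetical names, and the stand-in oracle simply encodes the outcome reported in this PR.

```python
from typing import Callable, FrozenSet, Sequence


def find_culprit(
    components: Sequence[str],
    passes: Callable[[FrozenSet[str]], bool],
) -> str:
    """Return the one component whose bump makes the suite fail.

    passes(subset) stands for: bump only `subset` on top of main and
    run e2e-parallel. Assumes a single culprit and a non-flaky suite.
    """
    candidates = list(components)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if passes(frozenset(half)):
            # The bumped half is clean, so the culprit is in the rest.
            candidates = candidates[len(half):]
        else:
            candidates = half
    return candidates[0]


def stand_in_oracle(subset: FrozenSet[str]) -> bool:
    # Stand-in for a real CI run: fails whenever clusters-service is bumped.
    return "clusters-service" not in subset


components = ["hypershift", "acm-mce", "clusters-service", "rest"]
print(find_culprit(components, stand_in_oracle))
```

Each oracle call corresponds to one full e2e-parallel run, so halving keeps the number of CI runs logarithmic in the number of bumped components.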