
OCPBUGS-86415: Use canonical image for kube-apiserver-proxy static pod #8742

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from csrwng:OCPBUGS-86415
Jun 25, 2026

Conversation

@csrwng

@csrwng csrwng commented Jun 16, 2026

Contributor

Summary

  • Fix the kube-apiserver-proxy static pod image in the ignition payload being unnecessarily rewritten by --registry-overrides. Data plane nodes use CRI-O which handles mirroring natively via IDMS/ICSP, so the canonical image reference should be used.
  • Gate the fix behind a hypershift.openshift.io/canonical-data-plane-images annotation to avoid triggering rollouts on existing stable NodePools. The annotation is set automatically on new NodePools and during version upgrades.

Test plan

  • Verify existing HAProxy image resolution tests pass (TestResolveHAProxyImage)
  • Verify new NodePools get canonical (non-overridden) HAProxy image in static pod manifest
  • Verify upgrading NodePools switch to canonical image during version upgrade
  • Verify stable NodePools with no annotation preserve the existing (overridden) image
  • Verify annotation-specified HAProxy images are not affected by override reversal
  • Run e2e-aws-upgrade-hypershift-operator to validate rollout safety

Fixes: https://issues.redhat.com/browse/OCPBUGS-86415

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added the hypershift.openshift.io/canonical-data-plane-images annotation to control whether HAProxy uses canonical (pre-override) data-plane component images.
    • HAProxy image resolution can now automatically prefer canonical images for new or upgrading node pools, or follow the annotation when set.
  • Bug Fixes

    • Registry mirror and release payload handling now retain canonical component image mappings, preserving intended HAProxy image behavior (including honoring explicitly set HAProxy image annotations).
  • Tests

    • Expanded HAProxy image resolution scenarios and coverage for stable, upgrading, and registry override cases.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 16, 2026
@openshift-ci

openshift-ci Bot commented Jun 16, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jun 16, 2026
@openshift-ci-robot


@csrwng: This pull request references Jira Issue OCPBUGS-86415, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added do-not-merge/needs-area area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release labels Jun 16, 2026
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This PR adds canonical component image tracking to support HAProxy image resolution when registry overrides are active. ReleaseImage gains a new canonicalComponentImages map and methods to store and retrieve pre-override component image references. The RegistryMirrorProviderDecorator captures canonical images from the delegate before applying overrides, preserving them in the returned ReleaseImage. A new NodePool annotation hypershift.openshift.io/canonical-data-plane-images is introduced. resolveHAProxyImage accepts a useCanonicalImages flag to select HAProxy images from the canonical mapping when set. generateHAProxyRawConfig detects new or upgrading NodePools by comparing Status.Version to releaseImage.Version() and automatically sets the annotation to "true" for those cases. Test coverage is extended with new parameterized cases covering upgrade scenarios, annotation behavior, and stable NodePool handling.
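The gating and selection flow described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `nodePool` and `releaseImage` types, the `useCanonicalImages` helper, and the `haproxy-router` component name are assumptions made for the example. Only the annotation key and the "status version differs from release version" rule come from the text above. Handling of an explicitly configured HAProxy image is omitted.

```go
package main

import "fmt"

// Annotation key as named in the PR description.
const canonicalImagesAnnotation = "hypershift.openshift.io/canonical-data-plane-images"

// haproxyComponent is an assumed component name used only for this sketch.
const haproxyComponent = "haproxy-router"

// nodePool is a hypothetical, stripped-down stand-in for the real NodePool type.
type nodePool struct {
	Annotations   map[string]string
	StatusVersion string // empty until the NodePool has rolled out a version
}

// releaseImage is a hypothetical stand-in that carries both image mappings.
type releaseImage struct {
	Version    string
	Overridden map[string]string // component images after registry overrides
	Canonical  map[string]string // component images before registry overrides
}

// useCanonicalImages sets the annotation to "true" on new or upgrading
// NodePools (status version differs from the release version) unless the
// annotation is already present, then reports whether it is "true".
func useCanonicalImages(np *nodePool, rel releaseImage) bool {
	if _, set := np.Annotations[canonicalImagesAnnotation]; !set && np.StatusVersion != rel.Version {
		if np.Annotations == nil {
			np.Annotations = map[string]string{}
		}
		np.Annotations[canonicalImagesAnnotation] = "true"
	}
	return np.Annotations[canonicalImagesAnnotation] == "true"
}

// resolveHAProxyImage picks the HAProxy image from the canonical or the
// overridden mapping, depending on the flag.
func resolveHAProxyImage(rel releaseImage, canonical bool) (string, bool) {
	images := rel.Overridden
	if canonical {
		images = rel.Canonical
	}
	ref, ok := images[haproxyComponent]
	return ref, ok
}

func main() {
	rel := releaseImage{
		Version:    "4.19.0",
		Overridden: map[string]string{haproxyComponent: "mirror.example.com/ocp/haproxy:v1"},
		Canonical:  map[string]string{haproxyComponent: "registry.example.com/ocp/haproxy:v1"},
	}
	for _, np := range []*nodePool{
		{StatusVersion: "4.19.0"}, // stable: status matches the release
		{StatusVersion: "4.18.0"}, // upgrading
		{},                        // new: no status version yet
	} {
		canonical := useCanonicalImages(np, rel)
		image, _ := resolveHAProxyImage(rel, canonical)
		fmt.Printf("status=%q canonical=%t image=%s\n", np.StatusVersion, canonical, image)
	}
}
```

In this sketch a stable NodePool without the annotation keeps the overridden reference, so its ignition payload does not change and no rollout is triggered.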

Sequence Diagram

sequenceDiagram
  participant DelegateProvider as Delegate<br/>ReleaseProvider
  participant RegistryMirror as RegistryMirror<br/>Decorator.Lookup
  participant ReleaseImage
  participant NPController as NodePool<br/>Controller
  participant resolveHAProxy as resolveHAProxyImage

  DelegateProvider->>ReleaseImage: returns with ComponentImages()
  RegistryMirror->>ReleaseImage: captures CanonicalComponentImages()
  RegistryMirror->>RegistryMirror: rewrites ImageStream tags
  RegistryMirror->>ReleaseImage: SetCanonicalComponentImages()<br/>(pre-override mapping)
  RegistryMirror->>NPController: returns ReleaseImage
  NPController->>NPController: detect new/upgrading<br/>via Status.Version
  NPController->>NPController: set canonical-data-plane-images<br/>annotation = "true"
  NPController->>resolveHAProxy: useCanonicalImages flag
  resolveHAProxy->>ReleaseImage: CanonicalComponentImages()<br/>or ComponentImages()
  resolveHAProxy->>NPController: selected HAProxy image

Suggested Reviewers

  • jparrill
  • devguyio
  • muraee
🚥 Pre-merge checks | ✅ 10 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Structure And Quality | ⚠️ Warning | TestResolveHAProxyImage lacks meaningful failure messages on 6 assertions (lines 2799, 2803-2804, 2809, 2814, 2821), violating the assertion message quality requirement specified in the check. | Add descriptive failure messages to all assertions: e.g., g.Expect(err).ToNot(HaveOccurred(), "failed to generate HAProxy config"). Lines 2826 and 2830-2831 provide good examples. |

✅ Passed checks (10 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: introducing canonical image support for the kube-apiserver-proxy static pod, addressing the specific Jira issue OCPBUGS-86415. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All 13 test cases in TestResolveHAProxyImage use static, descriptive string literals. Test names are defined in test case struct and contain no dynamic values, timestamps, IDs, pod/node/namespace n... |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces no scheduling constraints. Changes are limited to image selection logic via annotation-gated canonical image references. No pod affinity, topology spread, node selectors, or replica c... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR does not add new Ginkgo e2e tests. It only modifies Go unit tests (standard testing package) and implementation code. Check is not applicable. |
| No-Weak-Crypto | ✅ Passed | PR introduces no weak cryptography, custom crypto, or non-constant-time secret comparisons. Changes only affect image reference metadata and annotation handling. |
| Container-Privileges | ✅ Passed | PR does not introduce privileged containers, hostPID, hostNetwork, hostIPC, SYS_ADMIN capabilities, or allowPrivilegeEscalation settings. Changes are image reference handling only. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No logging introduced that exposes passwords, tokens, API keys, PII, or sensitive data. Changes only manipulate image references and annotation states without exposing credentials. |


@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Jun 16, 2026
@codecov

codecov Bot commented Jun 16, 2026


Codecov Report

❌ Patch coverage is 74.19355% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.51%. Comparing base (8efac9c) to head (b12120c).
⚠️ Report is 41 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| support/releaseinfo/releaseinfo.go | 28.57% | 4 Missing and 1 partial ⚠️ |
| support/releaseinfo/registry_mirror_provider.go | 62.50% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8742      +/-   ##
==========================================
+ Coverage   41.91%   42.51%   +0.60%     
==========================================
  Files         769      768       -1     
  Lines       96763    95298    -1465     
==========================================
- Hits        40557    40519      -38     
+ Misses      53402    51973    -1429     
- Partials     2804     2806       +2     
| Files with missing lines | Coverage Δ |
| --- | --- |
| ...erator/controllers/nodepool/nodepool_controller.go | 44.00% <100.00%> (+0.82%) ⬆️ |
| support/releaseinfo/registry_mirror_provider.go | 48.27% <62.50%> (+0.44%) ⬆️ |
| support/releaseinfo/releaseinfo.go | 52.15% <28.57%> (+0.66%) ⬆️ |

... and 6 files with indirect coverage changes

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 35.47% <46.66%> (+0.01%) ⬆️ |
| cpo-hostedcontrolplane | 44.84% <ø> (ø) |
| cpo-other | 44.32% <ø> (-0.25%) ⬇️ |
| hypershift-operator | 53.08% <100.00%> (+2.90%) ⬆️ |
| other | 31.69% <ø> (ø) |


nodePoolAnnotationTaints = "hypershift.openshift.io/nodePoolTaints"
nodePoolAnnotationPlatformMachineTemplate = "hypershift.openshift.io/nodePoolPlatformMachineTemplate"
nodePoolAnnotationTaints = "hypershift.openshift.io/nodePoolTaints"
nodePoolAnnotationCanonicalDataPlaneImages = "hypershift.openshift.io/canonical-data-plane-images"

Member


Can we add a // doc comment for this annotation? Your PR description seems pretty explanatory: "Gate the fix behind a hypershift.openshift.io/canonical-data-plane-images annotation to avoid triggering rollouts on existing stable NodePools. The annotation is set automatically on new NodePools and during version upgrades."

Contributor Author


Done — added a doc comment on the nodePoolAnnotationCanonicalDataPlaneImages constant explaining its purpose and when it's set automatically.
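For illustration, a doc comment along the lines requested might look like the sketch below. The comment wording is an assumption assembled from the PR description, not the text that was actually committed. Only the constant name and the annotation key come from the thread.

```go
package main

import "fmt"

const (
	// nodePoolAnnotationCanonicalDataPlaneImages, when set to "true" on a
	// NodePool, makes the kube-apiserver-proxy (HAProxy) static pod use the
	// canonical image reference instead of one rewritten by
	// --registry-overrides. Data plane nodes run CRI-O, which handles
	// mirroring via IDMS/ICSP. The annotation gates the change so that
	// existing stable NodePools do not roll out; it is set automatically on
	// new NodePools and during version upgrades.
	// (Comment wording is hypothetical, derived from the PR description.)
	nodePoolAnnotationCanonicalDataPlaneImages = "hypershift.openshift.io/canonical-data-plane-images"
)

func main() {
	fmt.Println(nodePoolAnnotationCanonicalDataPlaneImages)
}
```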

@hasueki

hasueki commented Jun 17, 2026

Contributor

@csrwng Does this change partially address this as well? https://redhat.atlassian.net/browse/OCPBUGS-44470

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 17, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 2

🧹 Nitpick comments (1)
support/releaseinfo/releaseinfo.go (1)

61-69: ⚡ Quick win

Defensively copy canonical component maps at API boundaries.

CanonicalComponentImages() and SetCanonicalComponentImages() currently share the same map instances with callers. That makes internal state externally mutable and can leak state across provider/decorator flows.

Suggested change
 func (i *ReleaseImage) CanonicalComponentImages() map[string]string {
-	if i.canonicalComponentImages != nil {
-		return i.canonicalComponentImages
-	}
-	return i.ComponentImages()
+	source := i.canonicalComponentImages
+	if source == nil {
+		source = i.ComponentImages()
+	}
+	out := make(map[string]string, len(source))
+	for k, v := range source {
+		out[k] = v
+	}
+	return out
 }
 
 // SetCanonicalComponentImages stores the pre-override component images.
 func (i *ReleaseImage) SetCanonicalComponentImages(images map[string]string) {
-	i.canonicalComponentImages = images
+	if images == nil {
+		i.canonicalComponentImages = nil
+		return
+	}
+	copied := make(map[string]string, len(images))
+	for k, v := range images {
+		copied[k] = v
+	}
+	i.canonicalComponentImages = copied
 }

Also applies to: 71-74
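A more compact way to express the same defensive copy, assuming Go 1.21 or newer, is the standard library's `maps.Clone`. The sketch below uses a simplified, hypothetical `releaseImage` type rather than the real `ReleaseImage`. One behavioral difference from the suggested diff: `maps.Clone` returns nil for a nil map, whereas the explicit loop in the getter always returns a non-nil map.

```go
package main

import (
	"fmt"
	"maps"
)

// releaseImage is a simplified stand-in for the real ReleaseImage type.
type releaseImage struct {
	componentImages          map[string]string
	canonicalComponentImages map[string]string
}

func (i *releaseImage) ComponentImages() map[string]string { return i.componentImages }

// CanonicalComponentImages returns a copy so callers cannot mutate internal state.
func (i *releaseImage) CanonicalComponentImages() map[string]string {
	source := i.canonicalComponentImages
	if source == nil {
		source = i.ComponentImages()
	}
	return maps.Clone(source)
}

// SetCanonicalComponentImages stores a copy of the pre-override component images.
func (i *releaseImage) SetCanonicalComponentImages(images map[string]string) {
	i.canonicalComponentImages = maps.Clone(images)
}

func main() {
	input := map[string]string{"haproxy-router": "quay.io/example/haproxy:v1"}
	rel := &releaseImage{}
	rel.SetCanonicalComponentImages(input)

	// Neither mutation below reaches the stored mapping.
	input["haproxy-router"] = "mutated-by-caller"
	out := rel.CanonicalComponentImages()
	out["haproxy-router"] = "mutated-by-reader"

	fmt.Println(rel.CanonicalComponentImages()["haproxy-router"])
}
```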

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@support/releaseinfo/releaseinfo.go` around lines 61 - 69, The
CanonicalComponentImages() method returns a direct reference to the internal
canonicalComponentImages map, allowing callers to mutate internal state.
Similarly, the SetCanonicalComponentImages() method stores a direct reference to
the provided map parameter. To fix this, defensively copy the map in
CanonicalComponentImages() before returning it so callers cannot modify the
internal state, and defensively copy the provided map parameter in
SetCanonicalComponentImages() before storing it to prevent external mutations
from affecting internal state.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go`:
- Around line 2599-2605: The test case at line 2599 intends to validate the
"stable NodePool without annotation" scenario but the FakeReleaseProvider
builder (around lines 2715-2720) does not set an explicit Version field, causing
the controller's nodePool.Status.Version comparison check to treat this as an
upgrade case instead of stable. Set an explicit Version value on the
FakeReleaseProvider builder that matches the nodePoolStatusVersion of "4.18.0"
so the test properly exercises the stable-without-annotation branch and
validates that registry overrides are preserved in the stable case.

In `@support/releaseinfo/registry_mirror_provider.go`:
- Around line 35-38: The current implementation snapshots ComponentImages() only
when local registry overrides exist, which causes decorator chains to lose
canonical state and replace upstream canonical data with already-overridden
values. Instead of conditionally calling releaseImage.ComponentImages() inside
the len(p.RegistryOverrides) check, access the canonical images directly from
the delegate regardless of whether overrides exist. This preserves upstream
canonical state in decorator chains. Apply this same fix to both occurrences of
this pattern (the one around lines 35-38 and the additional one noted at lines
51-53).
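The decorator-chain concern in the second inline comment can be illustrated with the sketch below. It is a simplified model, not the repository's code: the `provider` interface, `staticProvider`, and the single `from`/`to` override on `mirrorDecorator` are assumptions made for the example. The point it shows is that each decorator reads the canonical mapping from its delegate unconditionally, so an outer decorator without overrides still passes the original references through.

```go
package main

import (
	"fmt"
	"strings"
)

// releaseImage is a simplified stand-in for the real ReleaseImage type.
type releaseImage struct {
	componentImages          map[string]string
	canonicalComponentImages map[string]string
}

func (i *releaseImage) ComponentImages() map[string]string { return i.componentImages }

// CanonicalComponentImages falls back to ComponentImages when no canonical
// mapping was recorded, mirroring the getter shown in the review.
func (i *releaseImage) CanonicalComponentImages() map[string]string {
	if i.canonicalComponentImages != nil {
		return i.canonicalComponentImages
	}
	return i.ComponentImages()
}

// provider is a hypothetical, minimal release provider interface.
type provider interface {
	Lookup(image string) (*releaseImage, error)
}

// staticProvider plays the role of the innermost delegate.
type staticProvider struct{ images map[string]string }

func (p staticProvider) Lookup(string) (*releaseImage, error) {
	return &releaseImage{componentImages: p.images}, nil
}

// mirrorDecorator applies at most one registry override (hypothetical shape).
type mirrorDecorator struct {
	delegate provider
	from, to string // an empty "from" means no override is configured
}

func (p mirrorDecorator) Lookup(image string) (*releaseImage, error) {
	rel, err := p.delegate.Lookup(image)
	if err != nil {
		return nil, err
	}
	// Read canonical images from the delegate regardless of local overrides,
	// so upstream canonical state survives every layer of the chain.
	canonical := rel.CanonicalComponentImages()

	rewritten := make(map[string]string, len(rel.ComponentImages()))
	for name, ref := range rel.ComponentImages() {
		if p.from != "" {
			ref = strings.Replace(ref, p.from, p.to, 1)
		}
		rewritten[name] = ref
	}
	return &releaseImage{componentImages: rewritten, canonicalComponentImages: canonical}, nil
}

func main() {
	base := staticProvider{images: map[string]string{"haproxy-router": "quay.io/example/haproxy:v1"}}
	inner := mirrorDecorator{delegate: base, from: "quay.io", to: "mirror-a.example.com"}
	outer := mirrorDecorator{delegate: inner} // no overrides of its own

	rel, err := outer.Lookup("release:example")
	if err != nil {
		panic(err)
	}
	fmt.Println("overridden:", rel.ComponentImages()["haproxy-router"])
	fmt.Println("canonical: ", rel.CanonicalComponentImages()["haproxy-router"])
}
```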


📥 Commits

Reviewing files that changed from the base of the PR and between e7c2be0 and d9d6577.

📒 Files selected for processing (5)
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/nodepool_controller_test.go
  • support/releaseinfo/fake/fake.go
  • support/releaseinfo/registry_mirror_provider.go
  • support/releaseinfo/releaseinfo.go

Comment thread support/releaseinfo/registry_mirror_provider.go Outdated
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 23, 2026
@openshift-ci-robot


@csrwng: This pull request references Jira Issue OCPBUGS-86415, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary by CodeRabbit

Release Notes

  • New Features

  • Added a hypershift.openshift.io/canonical-data-plane-images annotation to control whether HAProxy uses canonical data-plane component images.

  • HAProxy image selection now prefers canonical component images when enabled, while preserving any explicitly configured HAProxy image. For new or upgrading node pools, the annotation is auto-enabled unless set otherwise.

  • Bug Fixes

  • Improved registry override handling so canonical component image mappings are retained and applied consistently.

  • Tests

  • Expanded HAProxy image resolution tests for new, upgrading, and stable node pools, including registry override scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot


@csrwng: This pull request references Jira Issue OCPBUGS-86415, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

Summary by CodeRabbit

Release Notes

  • New Features

  • Added the hypershift.openshift.io/canonical-data-plane-images annotation to control whether HAProxy uses canonical (pre-override) data-plane component images.

  • HAProxy image resolution now prefers canonical component images when enabled and auto-enables the annotation for new or upgrading node pools.

  • Bug Fixes

  • Registry mirror and release payload handling now retain canonical component image mappings, ensuring consistent HAProxy image selection and preserving explicitly set HAProxy images.

  • Tests

  • Expanded HAProxy image resolution coverage for stable, upgrading, annotation-forced, and registry override scenarios.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@csrwng csrwng force-pushed the OCPBUGS-86415 branch 2 times, most recently from 8c2fac4 to cdb758d June 23, 2026 20:32
The kube-apiserver-proxy static pod image was being unnecessarily
rewritten by registry overrides. Data plane nodes run CRI-O which
handles mirroring natively via IDMS/ICSP, so they should receive the
canonical (non-overridden) image reference.

Capture canonical component images before registry overrides are applied
in RegistryMirrorProviderDecorator and expose them via
CanonicalComponentImages(). Gate the fix behind a
hypershift.openshift.io/canonical-data-plane-images annotation to avoid
triggering rollouts on existing stable NodePools. The annotation is set
automatically on new NodePools and during version upgrades.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@csrwng

csrwng commented Jun 23, 2026

Contributor Author

Does this change partially address this as well? https://redhat.atlassian.net/browse/OCPBUGS-44470

@hasueki yes it does

@csrwng csrwng marked this pull request as ready for review June 23, 2026 20:35
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 23, 2026
@openshift-ci openshift-ci Bot requested review from bryan-cox and sjenning June 23, 2026 20:35
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2026
@openshift-merge-bot

Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke

@hypershift-jira-solve-ci


AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2069765335543713792 | Cost: $2.944201499999999 | Failed step: hypershift-aws-run-e2e-nested


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci


The early i/o timeouts around hosted cluster guest API servers are normal during initial cluster spinup. The critical event is the management cluster API server going completely unreachable starting at ~14:12 UTC.

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Management cluster kube-apiserver (52.6.176.64:6443) became unreachable starting at ~14:12 UTC,
causing cascading failures across all 4 tests that were still running.

TestUpgradeControlPlane: controlPlaneVersion state is Partial — openshift-oauth-apiserver stuck
rolling out (1 of 3 replicas updated), cluster-network-operator waiting on network-node-identity.

TestKarpenterUpgradeControlPlane: Degraded=True: UnavailableReplicas([kube-apiserver deployment
has 1 unavailable replicas, openshift-apiserver deployment has 1 unavailable replicas, router
deployment has 1 unavailable replicas])

TestNodePool/HostedCluster2: NodePool config update timed out — UpdatingConfig=True stuck for 20m.

TestKarpenter: Instance profile annotation propagation test failed with i/o timeout to management API.

Summary

All four test failures share a single root cause: the shared management cluster's kube-apiserver (accessed via ELB at afb01b07274764c35b532105f7aa2443-6ede9dae57f3db6c.elb.us-east-1.amazonaws.com:6443) became unreachable starting at approximately 14:12 UTC. The connectivity loss manifested as alternating dial tcp 52.6.176.64:6443: i/o timeout and connection refused errors, indicating the API server pod was crash-looping or being rescheduled. This outage persisted for the remainder of the test run (roughly 85 minutes) until the 2-hour timeout killed the process at 15:38 UTC. Because all hosted clusters were managed by this single management cluster, every active test that needed to communicate with the management API (to check HostedCluster status, ControlPlaneComponent rollouts, or NodePool conditions) failed simultaneously. The PR's code changes (adding canonical image support for HAProxy static pods in NodePool ignition config) do not affect the management cluster's kube-apiserver availability and are not related to these failures.

Root Cause

The root cause is a management cluster kube-apiserver outage — an infrastructure-level failure unrelated to this PR's code changes.

Timeline:

  1. 13:38 UTC — Test phase began; hypershift-aws-run-e2e-nested step started executing E2E tests against the shared management cluster (3e99c10be6-mgmt).
  2. ~14:12:50 UTC — First signs of management API instability: portforward.go errors show broken pipes and timeout on port 6443, followed by http2: client connection lost errors.
  3. 14:13:44 UTC — The controller-runtime cache lost its watch connection to the management cluster (Failed to watch *v1.Pod: http2: client connection lost), confirming the management cluster API server went down.
  4. 14:12–15:38 UTC — For ~85 minutes, all tests polling the management API experienced continuous dial tcp 52.6.176.64:6443: i/o timeout and connection refused errors. The alternating pattern (timeout vs. refused) indicates the API server pod was repeatedly crashing and restarting without recovering.
  5. 14:39:30 UTC — Port-forward to a hosted cluster control plane pod also failed: failed to connect to localhost:6443 inside namespace ... connection refused, showing that the API server container inside the hosted cluster control plane pod was also affected.
  6. 15:38:27 UTC — The 2-hour Prow timeout killed the process. Tests were still blocked on management API connectivity.

Specific test failure mechanisms:

  • TestUpgradeControlPlane: The control plane upgrade from 4.22.0-0.ci-2026-06-23-052716 to 4.22.0-0.ci-2026-06-24-045629 was in progress. Two ControlPlaneComponents could not complete rollout: openshift-oauth-apiserver (1/3 replicas updated) and cluster-network-operator (network-node-identity not ready). The 30-minute wait timed out because the management cluster was unreachable to orchestrate the rollout.
  • TestKarpenterUpgradeControlPlane: Same upgrade scenario. The hosted cluster reported Degraded=True with kube-apiserver, openshift-apiserver, and router having unavailable replicas. Rollout never completed — controlPlaneVersion state is Partial.
  • TestNodePool/HostedCluster2: TestAdditionalTrustBundlePropagation was waiting for NodePool config update to complete (UpdatingConfig=True), but the management API was unreachable to orchestrate the config rollout. 20-minute timeout exceeded.
  • TestKarpenter: Instance_profile_annotation_propagation failed when attempting to read HostedCluster status — got i/o timeout to the management API.

Why this is NOT related to PR #8742:
The PR modifies resolveHAProxyImage() in the NodePool controller to use canonical (pre-override) images for the HAProxy static pod. These changes:

  1. Only affect ignition config generation for data plane nodes
  2. Are gated behind a new annotation (hypershift.openshift.io/canonical-data-plane-images) that is only set on new NodePools or during upgrades
  3. Do not touch management cluster kube-apiserver, ELB configuration, or cluster networking
  4. Could not have caused the outage through NodePool config changes, because the management API outage started during cluster setup, before any such changes would have been applied
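
To make the gating in point 2 concrete, here is a minimal sketch of the decision as the PR description states it. It is not the actual `resolveHAProxyImage()` implementation: `nodePoolState`, `useCanonicalImage`, `pickHAProxyImage`, the example image names, and the assumption that the annotation value `"true"` enables the behavior are all hypothetical. Only the annotation name comes from the PR.

```go
package main

import "fmt"

// Annotation name taken from the PR description.
const canonicalImagesAnnotation = "hypershift.openshift.io/canonical-data-plane-images"

// nodePoolState is a hypothetical, simplified stand-in for the NodePool
// inputs the controller would look at.
type nodePoolState struct {
	Annotations map[string]string
	IsNew       bool // NodePool is being created
	IsUpgrading bool // NodePool is in a version upgrade
}

// useCanonicalImage reports whether the canonical (pre-override) image
// applies. Assumption: the annotation value "true" opts in.
func useCanonicalImage(np nodePoolState) bool {
	if np.Annotations[canonicalImagesAnnotation] == "true" {
		return true
	}
	// New or upgrading NodePools roll out anyway, so switching images
	// there does not cause an extra rollout.
	return np.IsNew || np.IsUpgrading
}

// pickHAProxyImage chooses between an explicitly configured image, the
// canonical image, and the registry-overridden image.
func pickHAProxyImage(np nodePoolState, explicit, overridden, canonical string) string {
	if explicit != "" {
		return explicit // explicitly set images are left untouched
	}
	if useCanonicalImage(np) {
		return canonical
	}
	return overridden // stable NodePools keep the existing image
}

func main() {
	stable := nodePoolState{Annotations: map[string]string{}}
	upgrading := nodePoolState{IsUpgrading: true}

	// Example image references are placeholders, not real images.
	fmt.Println(pickHAProxyImage(stable, "", "mirror.example.com/haproxy", "registry.example.com/haproxy"))
	fmt.Println(pickHAProxyImage(upgrading, "", "mirror.example.com/haproxy", "registry.example.com/haproxy"))
}
```

Under these assumptions a stable NodePool without the annotation keeps the overridden image, so nothing in this path changes for existing NodePools and none of it touches the management cluster API.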

The tests that passed (TestCreateCluster, TestCreateClusterProxy, TestCreateClusterCustomConfig, TestAutoscaling, TestNodePool/HostedCluster0) all completed their work before the management cluster went down at ~14:12 UTC.

Recommendations
  1. Re-run the CI job — This is a transient management cluster infrastructure failure. The PR's code changes are not related to the failures. A re-run on a healthy management cluster should pass.

  2. If re-run also fails, investigate the shared management cluster 3e99c10be6-mgmt for:

    • Node pressure or resource exhaustion on management cluster worker nodes
    • kube-apiserver pod OOMKills or crash-loops
    • ELB health check failures
    • AWS infrastructure issues in us-east-1
  3. Consider the AWS KeyPair limit — The logs show KeyPairLimitExceeded: Maximum of 5000 keypairs reached during teardown, which indicates resource exhaustion in the shared AWS account that could affect other CI runs.

Evidence
| Evidence | Detail |
| --- | --- |
| Management API outage start | ~14:12:50 UTC: portforward.go:391 error copying from remote stream: broken pipe |
| API server down confirmation | 14:13:44 UTC: controller-runtime.cache: Failed to watch *v1.Pod: http2: client connection lost |
| Outage duration | ~85 minutes (14:12 – 15:38 UTC, ended by Prow 2h timeout) |
| Affected ELB endpoint | afb01b07274764c35b532105f7aa2443-6ede9dae57f3db6c.elb.us-east-1.amazonaws.com:6443 → 52.6.176.64:6443 |
| Error pattern | Alternating i/o timeout and connection refused (API server crash-looping) |
| TestUpgradeControlPlane stuck components | openshift-oauth-apiserver (1/3 replicas), cluster-network-operator (network-node-identity not ready) |
| TestKarpenterUpgradeControlPlane degraded | kube-apiserver, openshift-apiserver, router (all with unavailable replicas) |
| TestNodePool stuck condition | UpdatingConfig=True: Updating config in progress. Target config: 0c11fe71 (20m timeout) |
| Prow timeout | 15:38:27 UTC: Process did not finish before 2h0m0s timeout |
| Container exit code | 127 (killed by signal after grace period) |
| AWS resource exhaustion | KeyPairLimitExceeded: Maximum of 5000 keypairs reached |
| PR files changed | nodepool_controller.go, releaseinfo.go, registry_mirror_provider.go, fake.go (none affect mgmt cluster API) |
| Tests that passed before outage | TestCreateCluster, TestCreateClusterProxy, TestCreateClusterCustomConfig, TestCreateClusterPrivate, etc. |

@csrwng

csrwng commented Jun 24, 2026

Contributor Author

Both test failures are due to the mgmt cluster refusing connections at some point, which points more to an infrastructure issue than a code issue.

/retest-required

@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, cblecker, csrwng

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [bryan-cox,cblecker,csrwng]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@csrwng

csrwng commented Jun 24, 2026

Contributor Author

The e2e-aws failure is due to the recurring EnsureGlobalPullSecret flake.
/retest-required

@cblecker

Member

/verified by e2e and unit tests

@openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) on Jun 25, 2026
@openshift-ci-robot

@cblecker: This PR has been marked as verified by e2e and unit tests.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Contributor

@csrwng: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit fc582c5 into openshift:main Jun 25, 2026
43 checks passed
@openshift-ci-robot

@csrwng: Jira Issue Verification Checks: Jira Issue OCPBUGS-86415
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-86415 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

@openshift-merge-robot

Contributor

Fix included in release 5.0.0-0.nightly-2026-06-25-194049

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria

7 participants