OCPBUGS-86415: Use canonical image for kube-apiserver-proxy static pod#8742
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@csrwng: This pull request references Jira Issue OCPBUGS-86415, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR adds canonical component image tracking to support HAProxy image resolution when registry overrides are active. Sequence DiagramsequenceDiagram
participant DelegateProvider as Delegate<br/>ReleaseProvider
participant RegistryMirror as RegistryMirror<br/>Decorator.Lookup
participant ReleaseImage
participant NPController as NodePool<br/>Controller
participant resolveHAProxy as resolveHAProxyImage
DelegateProvider->>ReleaseImage: returns with ComponentImages()
RegistryMirror->>ReleaseImage: captures CanonicalComponentImages()
RegistryMirror->>RegistryMirror: rewrites ImageStream tags
RegistryMirror->>ReleaseImage: SetCanonicalComponentImages()<br/>(pre-override mapping)
RegistryMirror->>NPController: returns ReleaseImage
NPController->>NPController: detect new/upgrading<br/>via Status.Version
NPController->>NPController: set canonical-data-plane-images<br/>annotation = "true"
NPController->>resolveHAProxy: useCanonicalImages flag
resolveHAProxy->>ReleaseImage: CanonicalComponentImages()<br/>or ComponentImages()
resolveHAProxy->>NPController: selected HAProxy image
Suggested Reviewers
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8742 +/- ##
==========================================
+ Coverage 41.91% 42.51% +0.60%
==========================================
Files 769 768 -1
Lines 96763 95298 -1465
==========================================
- Hits 40557 40519 -38
+ Misses 53402 51973 -1429
- Partials 2804 2806 +2
... and 6 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
| nodePoolAnnotationTaints = "hypershift.openshift.io/nodePoolTaints" | ||
| nodePoolAnnotationPlatformMachineTemplate = "hypershift.openshift.io/nodePoolPlatformMachineTemplate" | ||
| nodePoolAnnotationTaints = "hypershift.openshift.io/nodePoolTaints" | ||
| nodePoolAnnotationCanonicalDataPlaneImages = "hypershift.openshift.io/canonical-data-plane-images" |
There was a problem hiding this comment.
can we // doc this annotation? You pr desc seems pretty explanatory "Gate the fix behind a hypershift.openshift.io/canonical-data-plane-images annotation to avoid triggering rollouts on existing stable NodePools. The annotation is set automatically on new NodePools and during version upgrades."
There was a problem hiding this comment.
Done — added a doc comment on the nodePoolAnnotationCanonicalDataPlaneImages constant explaining its purpose and when it's set automatically.
|
@csrwng Does this change partially address this as well? https://redhat.atlassian.net/browse/OCPBUGS-44470 |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
support/releaseinfo/releaseinfo.go (1)
61-69: ⚡ Quick winDefensively copy canonical component maps at API boundaries.
CanonicalComponentImages()andSetCanonicalComponentImages()currently share the same map instances with callers. That makes internal state externally mutable and can leak state across provider/decorator flows.Suggested change
func (i *ReleaseImage) CanonicalComponentImages() map[string]string { - if i.canonicalComponentImages != nil { - return i.canonicalComponentImages - } - return i.ComponentImages() + source := i.canonicalComponentImages + if source == nil { + source = i.ComponentImages() + } + out := make(map[string]string, len(source)) + for k, v := range source { + out[k] = v + } + return out } // SetCanonicalComponentImages stores the pre-override component images. func (i *ReleaseImage) SetCanonicalComponentImages(images map[string]string) { - i.canonicalComponentImages = images + if images == nil { + i.canonicalComponentImages = nil + return + } + copied := make(map[string]string, len(images)) + for k, v := range images { + copied[k] = v + } + i.canonicalComponentImages = copied }Also applies to: 71-74
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@support/releaseinfo/releaseinfo.go` around lines 61 - 69, The CanonicalComponentImages() method returns a direct reference to the internal canonicalComponentImages map, allowing callers to mutate internal state. Similarly, the SetCanonicalComponentImages() method stores a direct reference to the provided map parameter. To fix this, defensively copy the map in CanonicalComponentImages() before returning it so callers cannot modify the internal state, and defensively copy the provided map parameter in SetCanonicalComponentImages() before storing it to prevent external mutations from affecting internal state.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller_test.go`:
- Around line 2599-2605: The test case at line 2599 intends to validate the
"stable NodePool without annotation" scenario but the FakeReleaseProvider
builder (around lines 2715-2720) does not set an explicit Version field, causing
the controller's nodePool.Status.Version comparison check to treat this as an
upgrade case instead of stable. Set an explicit Version value on the
FakeReleaseProvider builder that matches the nodePoolStatusVersion of "4.18.0"
so the test properly exercises the stable-without-annotation branch and
validates that registry overrides are preserved in the stable case.
In `@support/releaseinfo/registry_mirror_provider.go`:
- Around line 35-38: The current implementation snapshots ComponentImages() only
when local registry overrides exist, which causes decorator chains to lose
canonical state and replace upstream canonical data with already-overridden
values. Instead of conditionally calling releaseImage.ComponentImages() inside
the len(p.RegistryOverrides) check, access the canonical images directly from
the delegate regardless of whether overrides exist. This preserves upstream
canonical state in decorator chains. Apply this same fix to both occurrences of
this pattern (the one around lines 35-38 and the additional one noted at lines
51-53).
---
Nitpick comments:
In `@support/releaseinfo/releaseinfo.go`:
- Around line 61-69: The CanonicalComponentImages() method returns a direct
reference to the internal canonicalComponentImages map, allowing callers to
mutate internal state. Similarly, the SetCanonicalComponentImages() method
stores a direct reference to the provided map parameter. To fix this,
defensively copy the map in CanonicalComponentImages() before returning it so
callers cannot modify the internal state, and defensively copy the provided map
parameter in SetCanonicalComponentImages() before storing it to prevent external
mutations from affecting internal state.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: c42a1fb8-b195-4446-b50d-62dbc37a4694
📒 Files selected for processing (5)
hypershift-operator/controllers/nodepool/nodepool_controller.gohypershift-operator/controllers/nodepool/nodepool_controller_test.gosupport/releaseinfo/fake/fake.gosupport/releaseinfo/registry_mirror_provider.gosupport/releaseinfo/releaseinfo.go
|
@csrwng: This pull request references Jira Issue OCPBUGS-86415, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@csrwng: This pull request references Jira Issue OCPBUGS-86415, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
8c2fac4 to
cdb758d
Compare
The kube-apiserver-proxy static pod image was being unnecessarily rewritten by registry overrides. Data plane nodes run CRI-O which handles mirroring natively via IDMS/ICSP, so they should receive the canonical (non-overridden) image reference. Capture canonical component images before registry overrides are applied in RegistryMirrorProviderDecorator and expose them via CanonicalComponentImages(). Gate the fix behind a hypershift.openshift.io/canonical-data-plane-images annotation to avoid triggering rollouts on existing stable NodePools. The annotation is set automatically on new NodePools and during version upgrades. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hasueki yes it does |
|
Scheduling tests matching the |
AI Test Failure AnalysisJob: Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6 |
|
Now I have the full picture. The early i/o timeouts around hosted cluster guest API servers are normal during initial cluster spinup. The critical event is the management cluster API server going completely unreachable starting at ~14:12 UTC. Let me produce the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryAll four test failures share a single root cause: the shared management cluster's kube-apiserver (accessed via ELB at Root CauseThe root cause is a management cluster kube-apiserver outage — an infrastructure-level failure unrelated to this PR's code changes. Timeline:
Specific test failure mechanisms:
Why this is NOT related to PR #8742:
The tests that passed (TestCreateCluster, TestCreateClusterProxy, TestCreateClusterCustomConfig, TestAutoscaling, TestNodePool/HostedCluster0) all completed their work before the management cluster went down at ~14:12 UTC. Recommendations
Evidence
|
|
Both test failures are due to the mgmt cluster refusing connections at some point, which points more to an infrastructure issue than a code issue. /retest-required |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox, cblecker, csrwng The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
e2e-aws failure is due to recurring EnsureGlobalPullSecret flake |
|
/verified by e2e and unit tests |
|
@cblecker: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@csrwng: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@csrwng: Jira Issue Verification Checks: Jira Issue OCPBUGS-86415 Jira Issue OCPBUGS-86415 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Fix included in release 5.0.0-0.nightly-2026-06-25-194049 |
Summary
--registry-overrides. Data plane nodes use CRI-O which handles mirroring natively via IDMS/ICSP, so the canonical image reference should be used.hypershift.openshift.io/canonical-data-plane-imagesannotation to avoid triggering rollouts on existing stable NodePools. The annotation is set automatically on new NodePools and during version upgrades.Test plan
TestResolveHAProxyImage)e2e-aws-upgrade-hypershift-operatorto validate rollout safetyFixes: https://issues.redhat.com/browse/OCPBUGS-86415
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
hypershift.openshift.io/canonical-data-plane-imagesannotation to control whether HAProxy uses canonical (pre-override) data-plane component images.Bug Fixes
Tests