OCPBUGS-92013: fix NodePool stuck in UpdatingVersion/UpdatingConfig due to stale conversion-data annotation#8821
Conversation
…on for MachineDeploymentComplete The conversion-data annotation is metadata, which is preserved from etcd on status-subresource writes. Since CAPI updates MachineDeployment status via the status subresource, the annotation is never refreshed by those writes. If the last main-resource v1beta2 write captured mid-rollout state, the annotation becomes permanently stale, causing MachineDeploymentComplete to return false forever and deadlocking the NodePool in UpdatingVersion/UpdatingConfig. Replace the annotation-based cross-check with a read of Status.V1Beta2.UpToDateReplicas and Status.V1Beta2.AvailableReplicas. These fields are part of the v1beta1 Status struct (not metadata), so they are correctly updated on every status-subresource write via the v1beta2-to-v1beta1 conversion layer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@csrwng: This pull request references Jira Issue OCPBUGS-92013, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@csrwng: This pull request references Jira Issue OCPBUGS-92013, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthrough
🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@csrwng: This pull request references Jira Issue OCPBUGS-92013, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 816-818: The completion check in NodePool reconciliation is
incorrectly treating nil v1beta2 replica status fields as zero when comparing
against deployment.Spec.Replicas, which can incorrectly mark a scale-to-zero
MachineDeployment as complete. Update the logic in the NodePool controller’s
replica comparison path to require UpToDateReplicas and AvailableReplicas to be
present before comparing values, rather than using ptr.Deref defaults, and keep
the desired replica comparison in sync with that presence check. Add a test for
the zero-replica case where v1beta2 status fields are nil to verify it is not
considered complete.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 479d46e9-9a2a-4496-beb2-c77137789c23
📒 Files selected for processing (2)
hypershift-operator/controllers/nodepool/capi_test.gohypershift-operator/controllers/nodepool/nodepool_controller.go
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8821 +/- ##
==========================================
- Coverage 42.55% 42.55% -0.01%
==========================================
Files 768 768
Lines 95297 95283 -14
==========================================
- Hits 40558 40544 -14
Misses 51932 51932
Partials 2807 2807
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…g to zero Nil UpToDateReplicas/AvailableReplicas means "not yet set", not "zero replicas". Using ptr.Deref with a zero default would incorrectly match a desired count of zero, marking incomplete deployments as complete. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Scheduling tests matching the |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bryan-cox, csrwng The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold |
|
Now I have a very clear picture. The key observations:
The root cause is clear: ExternalDNS failed to create DNS records for these 3 hosted clusters within the 10-minute timeout. The ExternalDNS operator didn't create Route53 records for This is a transient ExternalDNS/Route53 propagation failure — a flaky CI infrastructure issue, not caused by the PR changes (which only affect NodePool controller's Now I have enough information to write the final report. Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryThree hosted clusters ( Root CauseTransient ExternalDNS/Route53 propagation failure — a CI infrastructure flake. All hosted clusters in this job initially failed DNS resolution ( Evidence that the control planes themselves were healthy for the failing clusters:
The PR #8821 changes are limited to Recommendations
Evidence
|
Test Resultse2e-aws
e2e-aks
|
|
The aws-4-22 failure was an infra issue
/retest-required |
|
/verified by @csrwng |
|
@csrwng: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@csrwng: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@csrwng: Jira Issue OCPBUGS-92013: Some pull requests linked via external trackers have merged: The following pull request, linked via external tracker, has not merged:
All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-92013 has not been moved to the MODIFIED state. This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Summary
cluster.x-k8s.io/conversion-dataannotation-based v1beta2 cross-check inMachineDeploymentCompletewith a direct read ofStatus.V1Beta2fieldsUpdatingVersion=TrueandUpdatingConfig=Trueafter a rollout completesRoot cause
The conversion-data annotation is metadata. When CAPI updates MachineDeployment status via the v1beta2 status subresource, the API server's
PrepareForUpdatepreserves metadata from etcd — so the annotation is never refreshed by status writes. If the last main-resource v1beta2 write captured mid-rollout state (e.g.replicas: 4, upToDateReplicas: null), the annotation becomes permanently stale.machineDeploymentCompleteFromConversionDatathen returnsfalseforever, preventingreconcileMachineDeploymentStatusfrom updatingnodePool.Status.Versionand the config annotation.Fix
The v1beta1
MachineDeploymentStatusalready has aV1Beta2nested field that stores the v1beta2-native replica counters (UpToDateReplicas,AvailableReplicas). These fields are part of the status (not metadata), so they are correctly updated on every status-subresource write viaConvert_v1beta2_MachineDeploymentStatus_To_v1beta1_MachineDeploymentStatus.The new
machineDeploymentCompleteFromV1Beta2Statusreads these fields directly instead of parsing the stale annotation.Jira: https://issues.redhat.com/browse/OCPBUGS-92013
Test plan
TestMachineDeploymentComplete— 7 cases covering v1beta2 disagreement, nil V1Beta2, nil UpToDateReplicas, v1beta1-only fallback)go test ./hypershift-operator/controllers/nodepool/)🤖 Generated with Claude Code
Summary by CodeRabbit