OCPBUGS-92013: MachineDeployment stale status#8827
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@clebs: This pull request references Jira Issue OCPBUGS-92013, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThe controller now calls 🚥 Pre-merge checks | ✅ 11✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: clebs The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8827 +/- ##
==========================================
+ Coverage 42.55% 42.64% +0.08%
==========================================
Files 768 768
Lines 95297 95399 +102
==========================================
+ Hits 40558 40686 +128
+ Misses 51932 51901 -31
- Partials 2807 2812 +5
... and 12 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 820-827: `MachineDeploymentComplete` is swallowing the MachineSet
`List` failure by logging and returning false, which prevents reconciliation
from retrying; update this helper to return the `c.List` error instead of a
boolean fallback, and propagate that error through
`reconcileMachineDeploymentStatus` and `Reconcile` so transient failures
requeue. Use the existing `MachineDeploymentComplete` and
`reconcileMachineDeploymentStatus` symbols to locate the change and keep the
normal completion path unchanged when no error occurs.
- Around line 829-832: MachineDeploymentComplete is incorrectly marking rollout
as complete as soon as a MachineSet matches targetMachineTemplate, even if that
MachineSet is not yet current/available/up-to-date. Update the logic in
MachineDeploymentComplete to inspect the matching MachineSet’s status fields and
only return true when the matching MachineSet has the expected healthy replicas;
use the existing machineSets.Items loop and targetMachineTemplate match, but
gate completion on that MachineSet’s status rather than the template reference
alone.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6fecb323-ee08-4ad4-8af1-ff5542e6392f
📒 Files selected for processing (3)
hypershift-operator/controllers/nodepool/capi.gohypershift-operator/controllers/nodepool/capi_test.gohypershift-operator/controllers/nodepool/nodepool_controller.go
| machineSets := &capiv1.MachineSetList{} | ||
| if err := c.List(ctx, machineSets, | ||
| client.InNamespace(deployment.Namespace), | ||
| client.MatchingLabels{capiv1.MachineDeploymentNameLabel: deployment.Name}, | ||
| ); err != nil { | ||
| log.Error(err, "Failed to list MachineSets for MachineDeployment, cannot verify completion") | ||
| return false | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify all MachineDeploymentComplete callers that need signature updates.
rg -nP '\bMachineDeploymentComplete\s*\(' hypershift-operator/controllers/nodepoolRepository: openshift/hypershift
Length of output: 706
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '780,860p' hypershift-operator/controllers/nodepool/nodepool_controller.go
printf '\n---\n'
sed -n '560,640p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '3520,3605p' hypershift-operator/controllers/nodepool/capi_test.goRepository: openshift/hypershift
Length of output: 8831
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '780,860p' hypershift-operator/controllers/nodepool/nodepool_controller.go
printf '\n---\n'
sed -n '560,640p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '3520,3605p' hypershift-operator/controllers/nodepool/capi_test.goRepository: openshift/hypershift
Length of output: 8831
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "reconcileMachineDeploymentStatus|MachineDeploymentComplete\(" hypershift-operator/controllers/nodepool/nodepool_controller.go hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '640,780p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '1,220p' hypershift-operator/controllers/nodepool/nodepool_controller.goRepository: openshift/hypershift
Length of output: 16787
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "reconcileMachineDeploymentStatus|MachineDeploymentComplete\(" hypershift-operator/controllers/nodepool/nodepool_controller.go hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '640,780p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '1,220p' hypershift-operator/controllers/nodepool/nodepool_controller.goRepository: openshift/hypershift
Length of output: 16787
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '120,190p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
rg -n "MachineDeploymentComplete" hypershift-operator/controllers/nodepool -g'*.go'
printf '\n---\n'
rg -n "return ctrl.Result\{\}, nil" hypershift-operator/controllers/nodepool/capi.goRepository: openshift/hypershift
Length of output: 3677
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '120,190p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
rg -n "MachineDeploymentComplete" hypershift-operator/controllers/nodepool -g'*.go'
printf '\n---\n'
rg -n "return ctrl.Result\{\}, nil" hypershift-operator/controllers/nodepool/capi.goRepository: openshift/hypershift
Length of output: 3677
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '960,1010p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '1,120p' hypershift-operator/controllers/nodepool/capi_test.goRepository: openshift/hypershift
Length of output: 5992
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '960,1010p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '3308,3575p' hypershift-operator/controllers/nodepool/capi_test.goRepository: openshift/hypershift
Length of output: 10500
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '960,1010p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '3308,3575p' hypershift-operator/controllers/nodepool/capi_test.goRepository: openshift/hypershift
Length of output: 10500
Propagate MachineSet list failures so reconciliation retries.
MachineDeploymentComplete logs a List error and returns false, so reconcileMachineDeploymentStatus keeps going and Reconcile can exit cleanly. Return the error from MachineDeploymentComplete and bubble it up so a transient cache/API failure requeues instead of leaving the NodePool version/config annotations stale.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go` around lines
820 - 827, `MachineDeploymentComplete` is swallowing the MachineSet `List`
failure by logging and returning false, which prevents reconciliation from
retrying; update this helper to return the `c.List` error instead of a boolean
fallback, and propagate that error through `reconcileMachineDeploymentStatus`
and `Reconcile` so transient failures requeue. Use the existing
`MachineDeploymentComplete` and `reconcileMachineDeploymentStatus` symbols to
locate the change and keep the normal completion path unchanged when no error
occurs.
Source: Path instructions
MachineDeploymentComplete() has a regression when introducing v1beta2 where it has flakiness due to lossy conversion and stale status. Replaced conversion-data annotation check with deployment.Status.V1Beta2. Additionally, guard against CAPI status race by adding a MachineSet template cross-verification to MachineDeploymentComplete, and move status reconciliation outside CreateOrPatch so it reads post-persist state. Signed-off-by: Borja Clemente <bclement@redhat.com>
|
/test e2e-aws |
Test Resultse2e-aws
|
|
@clebs: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryTwo independent issues blocked this PR. (1) e2e-aws: The Root Causee2e-aws failure — Pre-existing flake, NOT caused by this PR: The failure chain is:
Why this is unrelated to the PR: PR #8827 modifies three files, all in
None of these changes affect: the tide failure — Merge conflict: The PR branch Recommendations
Evidence
|
|
@clebs: This pull request references Jira Issue OCPBUGS-92013. The bug has been updated to no longer refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Closed in favor of #8821 |
What this PR does / why we need it:
MachineDeploymentComplete() has a regression when introducing v1beta2 where it has flakiness due to lossy conversion and stale status.
Replaced conversion-data annotation check with
deployment.Status.V1Beta2.
Additionally, guard against CAPI status race by adding a MachineSet template cross-verification to MachineDeploymentComplete, and move status reconciliation outside CreateOrPatch so it reads post-persist state.
Which issue(s) this PR fixes:
Fixes OCPBUGS-92013
Special notes for your reviewer:
Checklist:
Summary by CodeRabbit