OCPBUGS-92013: MachineDeployment stale status #8827

Closed
clebs wants to merge 1 commit into openshift:main from clebs:capi-stale-status
Conversation

@clebs

@clebs clebs commented Jun 24, 2026

Member

What this PR does / why we need it:

MachineDeploymentComplete() regressed when v1beta2 was introduced: it is flaky because of lossy conversion and stale status.

This PR replaces the conversion-data annotation check with deployment.Status.V1Beta2.

Additionally, it guards against the CAPI status race by adding a MachineSet template cross-verification to MachineDeploymentComplete, and moves status reconciliation outside CreateOrPatch so that it reads post-persist state.
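
A minimal sketch of the completion check described above, not the PR's actual code. The types are simplified, hypothetical stand-ins for the CAPI MachineDeployment and MachineSet types, and the exact fields the real MachineDeploymentComplete compares are an assumption.

```go
package main

import "fmt"

// V1Beta2Status is a hypothetical stand-in for deployment.Status.V1Beta2.
// The counters are pointers so "not reported yet" can be told apart from zero.
type V1Beta2Status struct {
	UpToDateReplicas  *int32
	AvailableReplicas *int32
}

// MachineDeployment is a trimmed-down, illustrative type, not the CAPI one.
type MachineDeployment struct {
	Generation         int64
	ObservedGeneration int64
	SpecReplicas       int32
	V1Beta2            *V1Beta2Status
}

// MachineSet only records which machine template it references.
type MachineSet struct {
	TemplateName string
}

// deploymentComplete reads the native v1beta2 counters and then
// cross-checks that some MachineSet references the target template.
func deploymentComplete(md MachineDeployment, sets []MachineSet, targetTemplate string) bool {
	s := md.V1Beta2
	if s == nil || s.UpToDateReplicas == nil || s.AvailableReplicas == nil {
		return false // status not populated yet
	}
	if md.ObservedGeneration < md.Generation {
		return false // status describes an older spec
	}
	if *s.UpToDateReplicas != md.SpecReplicas || *s.AvailableReplicas != md.SpecReplicas {
		return false
	}
	for _, ms := range sets {
		if ms.TemplateName == targetTemplate {
			return true
		}
	}
	return false // counters look fine, but no MachineSet uses the new template
}

func main() {
	three := int32(3)
	md := MachineDeployment{Generation: 2, ObservedGeneration: 2, SpecReplicas: 3,
		V1Beta2: &V1Beta2Status{UpToDateReplicas: &three, AvailableReplicas: &three}}

	stale := []MachineSet{{TemplateName: "template-old"}}
	fresh := append(stale, MachineSet{TemplateName: "template-new"})

	fmt.Println(deploymentComplete(md, stale, "template-new")) // false
	fmt.Println(deploymentComplete(md, fresh, "template-new")) // true
}
```

The first call shows the case the cross-verification is meant to catch: the counters still describe the previous rollout, so they look complete even though no MachineSet references the new template.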

Which issue(s) this PR fixes:

Fixes OCPBUGS-92013

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes
    • Improved node pool upgrade status tracking by reconciling MachineDeployment status immediately after updates during replace upgrades.
    • Reduced false “complete” signals by basing readiness on native status fields and only marking ready after confirming MachineSets reference the expected machine template.
    • Status/annotation updates now run more consistently and are gated on the updated completeness check, improving upgrade progress visibility.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 24, 2026
@openshift-ci-robot


@clebs: This pull request references Jira Issue OCPBUGS-92013, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b6575aa4-ac1f-4eda-8024-89d5d47233b4

📥 Commits

Reviewing files that changed from the base of the PR and between 2df1a7a and e897a13.

📒 Files selected for processing (3)
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go
  • hypershift-operator/controllers/nodepool/capi_test.go

📝 Walkthrough

Walkthrough

The controller now calls reconcileMachineDeploymentStatus after CreateOrPatch of the MachineDeployment. reconcileMachineDeployment no longer handles status reconciliation inline, and propagateVersionAndTemplate no longer returns a boolean. MachineDeploymentComplete now accepts context.Context, a reader, and a target template name, and it checks v1beta2 status plus matching MachineSet references before status updates proceed. Tests were updated for the new inputs.
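
Why the call order matters can be illustrated with a toy model. This is a sketch under stated assumptions, not the controller's code: `fakeStore`, `object`, and `looksComplete` are hypothetical, and the model assumes the stale read comes from Generation only being bumped when the object is persisted.

```go
package main

import "fmt"

// object is a hypothetical, heavily simplified MachineDeployment.
type object struct {
	Template           string
	Generation         int64
	ObservedGeneration int64
}

// fakeStore mimics an API server that bumps Generation on spec changes.
type fakeStore struct{ saved object }

// createOrPatch loads the stored object, runs mutate, then persists.
// Anything read inside mutate is therefore pre-persist state.
func (s *fakeStore) createOrPatch(obj *object, mutate func()) {
	*obj = s.saved
	before := obj.Template
	mutate()
	if obj.Template != before {
		obj.Generation++
	}
	s.saved = *obj
}

// looksComplete is a stand-in for a generation-based completeness check.
func looksComplete(o object) bool { return o.ObservedGeneration >= o.Generation }

// demo evaluates the check once inside mutate and once after persisting.
func demo() (inside, after bool) {
	store := &fakeStore{saved: object{Template: "template-a", Generation: 1, ObservedGeneration: 1}}
	md := &object{}
	store.createOrPatch(md, func() {
		md.Template = "template-b"
		inside = looksComplete(*md) // Generation has not been bumped yet
	})
	after = looksComplete(*md) // reads the persisted object
	return inside, after
}

func main() {
	inside, after := demo()
	fmt.Println("inside mutate:", inside, "after persist:", after)
}
```

In this model the check passes inside the mutate callback and fails once the object has been persisted, which is the kind of false "complete" signal that reading post-persist state avoids.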

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title matches the PR’s main goal of fixing MachineDeployment stale status behavior. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo titles or dynamic test names were added; the touched tests use static, descriptive t.Run names only. |
| Test Structure And Quality | ✅ Passed | PASS: The touched tests are plain table-driven unit tests, not Ginkgo blocks; they use fake clients, have no waits/cleanup hazards, and each subtest targets one behavior. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PASS: changes only adjust MachineDeployment completion/status logic and tests; no affinity, nodeSelector, topology spread, or arbiter/control-plane scheduling constraints were added. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PASS: The touched tests are plain unit tests (t.Run/Gomega), with no new Ginkgo e2e specs, hardcoded IPs, or external-network calls in the changed sections. |
| No-Weak-Crypto | ✅ Passed | Touched files add no crypto imports or weak algorithms; comparisons are status ints and template names, not secrets/tokens. |
| Container-Privileges | ✅ Passed | Touched files are Go controller/tests only; searches found no privileged, hostNetwork, hostIPC, allowPrivilegeEscalation, or root settings in any manifest. |
| No-Sensitive-Data-In-Logs | ✅ Passed | Checked changed controller/test files: logs only emit resource names, template names, counts, and hashes; no passwords, tokens, PII, or raw secret contents. |

@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Jun 24, 2026
@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clebs
Once this PR has been reviewed and has the lgtm label, please assign muraee for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented Jun 24, 2026


Codecov Report

❌ Patch coverage is 89.18919% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.64%. Comparing base (bc3bda9) to head (e897a13).
⚠️ Report is 14 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...erator/controllers/nodepool/nodepool_controller.go | 87.50% | 3 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8827      +/-   ##
==========================================
+ Coverage   42.55%   42.64%   +0.08%     
==========================================
  Files         768      768              
  Lines       95297    95399     +102     
==========================================
+ Hits        40558    40686     +128     
+ Misses      51932    51901      -31     
- Partials     2807     2812       +5     

| Files with missing lines | Coverage Δ |
| --- | --- |
| hypershift-operator/controllers/nodepool/capi.go | 71.58% <100.00%> (-0.19%) ⬇️ |
| ...erator/controllers/nodepool/nodepool_controller.go | 43.30% <87.50%> (+0.12%) ⬆️ |

... and 12 files with indirect coverage changes

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 35.62% <ø> (+0.15%) ⬆️ |
| cpo-hostedcontrolplane | 44.88% <ø> (+0.04%) ⬆️ |
| cpo-other | 44.94% <ø> (+0.24%) ⬆️ |
| hypershift-operator | 53.04% <89.18%> (-0.02%) ⬇️ |
| other | 31.69% <ø> (+<0.01%) ⬆️ |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@hypershift-operator/controllers/nodepool/nodepool_controller.go`:
- Around line 820-827: `MachineDeploymentComplete` is swallowing the MachineSet
`List` failure by logging and returning false, which prevents reconciliation
from retrying; update this helper to return the `c.List` error instead of a
boolean fallback, and propagate that error through
`reconcileMachineDeploymentStatus` and `Reconcile` so transient failures
requeue. Use the existing `MachineDeploymentComplete` and
`reconcileMachineDeploymentStatus` symbols to locate the change and keep the
normal completion path unchanged when no error occurs.
- Around line 829-832: MachineDeploymentComplete is incorrectly marking rollout
as complete as soon as a MachineSet matches targetMachineTemplate, even if that
MachineSet is not yet current/available/up-to-date. Update the logic in
MachineDeploymentComplete to inspect the matching MachineSet’s status fields and
only return true when the matching MachineSet has the expected healthy replicas;
use the existing machineSets.Items loop and targetMachineTemplate match, but
gate completion on that MachineSet’s status rather than the template reference
alone.
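
The second finding can be sketched as follows. This is an illustration under stated assumptions rather than the fix itself: `machineSet` is a hypothetical, trimmed-down type, and which status counters the real check should require (available, ready, up-to-date) is left open by the review comment.

```go
package main

import "fmt"

// machineSet is a hypothetical stand-in for a CAPI MachineSet.
type machineSet struct {
	TemplateName      string
	DesiredReplicas   int32
	AvailableReplicas int32
}

// targetSetHealthy reports whether a MachineSet referencing targetTemplate
// exists AND already serves all desired replicas. Matching the template
// reference alone is not treated as completion.
func targetSetHealthy(sets []machineSet, targetTemplate string, desired int32) bool {
	for _, ms := range sets {
		if ms.TemplateName != targetTemplate {
			continue
		}
		if ms.DesiredReplicas == desired && ms.AvailableReplicas == desired {
			return true
		}
	}
	return false
}

func main() {
	// Mid-rollout: the new MachineSet exists but has no available replicas.
	rolling := []machineSet{
		{TemplateName: "template-old", DesiredReplicas: 2, AvailableReplicas: 2},
		{TemplateName: "template-new", DesiredReplicas: 1, AvailableReplicas: 0},
	}
	// Finished: the new MachineSet carries every replica.
	done := []machineSet{
		{TemplateName: "template-old", DesiredReplicas: 0, AvailableReplicas: 0},
		{TemplateName: "template-new", DesiredReplicas: 3, AvailableReplicas: 3},
	}
	fmt.Println(targetSetHealthy(rolling, "template-new", 3)) // false
	fmt.Println(targetSetHealthy(done, "template-new", 3))    // true
}
```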
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6fecb323-ee08-4ad4-8af1-ff5542e6392f

📥 Commits

Reviewing files that changed from the base of the PR and between 64c8878 and 2df1a7a.

📒 Files selected for processing (3)
  • hypershift-operator/controllers/nodepool/capi.go
  • hypershift-operator/controllers/nodepool/capi_test.go
  • hypershift-operator/controllers/nodepool/nodepool_controller.go

Comment on lines +820 to +827
machineSets := &capiv1.MachineSetList{}
if err := c.List(ctx, machineSets,
	client.InNamespace(deployment.Namespace),
	client.MatchingLabels{capiv1.MachineDeploymentNameLabel: deployment.Name},
); err != nil {
	log.Error(err, "Failed to list MachineSets for MachineDeployment, cannot verify completion")
	return false
}

Contributor

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify all MachineDeploymentComplete callers that need signature updates.
rg -nP '\bMachineDeploymentComplete\s*\(' hypershift-operator/controllers/nodepool

Repository: openshift/hypershift

Length of output: 706


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '780,860p' hypershift-operator/controllers/nodepool/nodepool_controller.go
printf '\n---\n'
sed -n '560,640p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '3520,3605p' hypershift-operator/controllers/nodepool/capi_test.go

Repository: openshift/hypershift

Length of output: 8831


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "reconcileMachineDeploymentStatus|MachineDeploymentComplete\(" hypershift-operator/controllers/nodepool/nodepool_controller.go hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '640,780p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '1,220p' hypershift-operator/controllers/nodepool/nodepool_controller.go

Repository: openshift/hypershift

Length of output: 16787


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '120,190p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
rg -n "MachineDeploymentComplete" hypershift-operator/controllers/nodepool -g'*.go'
printf '\n---\n'
rg -n "return ctrl.Result\{\}, nil" hypershift-operator/controllers/nodepool/capi.go

Repository: openshift/hypershift

Length of output: 3677


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '960,1010p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '1,120p' hypershift-operator/controllers/nodepool/capi_test.go

Repository: openshift/hypershift

Length of output: 5992


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '960,1010p' hypershift-operator/controllers/nodepool/capi.go
printf '\n---\n'
sed -n '3308,3575p' hypershift-operator/controllers/nodepool/capi_test.go

Repository: openshift/hypershift

Length of output: 10500


Propagate MachineSet list failures so reconciliation retries.
MachineDeploymentComplete logs a List error and returns false, so reconcileMachineDeploymentStatus keeps going and Reconcile can exit cleanly. Return the error from MachineDeploymentComplete and bubble it up so a transient cache/API failure requeues instead of leaving the NodePool version/config annotations stale.

Source: Path instructions

Comment thread hypershift-operator/controllers/nodepool/nodepool_controller.go
MachineDeploymentComplete() has a regression when introducing v1beta2
where it has flakiness due to lossy conversion and stale status.

Replaced conversion-data annotation check with
deployment.Status.V1Beta2.

Additionally, guard against CAPI status race by adding a
MachineSet template cross-verification to MachineDeploymentComplete,
and move status reconciliation outside CreateOrPatch so it reads
post-persist state.

Signed-off-by: Borja Clemente <bclement@redhat.com>
@clebs clebs force-pushed the capi-stale-status branch from 2df1a7a to e897a13 Compare June 25, 2026 06:30
@clebs

clebs commented Jun 25, 2026

Member Author

/test e2e-aws

@cwbotbot


Test Results

e2e-aws

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Contributor

@clebs: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws | e897a13 | link | true | /test e2e-aws |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

PR needs rebase.


@hypershift-jira-solve-ci


Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws
  • Build ID: 2070036687915847680
  • Target: e2e-aws
  • Test Step: e2e-aws-hypershift-aws-run-e2e-nested
  • Result: 5 failures (2 leaf, 3 cascading parent), 623 tests total, 30 skipped
  • Tide Status: Not mergeable — PR has a merge conflict with main

Test Failure Analysis

Error

TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout (1205.07s):
  failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded
  DaemonSet stuck at 2/3 pods ready for ~20 minutes

TestCreateCluster/Main/EnsureGlobalPullSecret/Check_if_the_config.json_is_correct_in_all_of_the_nodes (0.02s):
  daemonsets.apps "kubelet-config-verifier" already exists (409 Conflict)

tide: Not mergeable. PR has a merge conflict.

Summary

Two independent issues blocked this PR.

(1) e2e-aws: The global-pull-secret-syncer DaemonSet was stuck at 2/3 ready pods for the entire 20-minute timeout during EnsureGlobalPullSecret. Because cleanup was skipped, the leftover kubelet-config-verifier DaemonSet then caused an AlreadyExists error in the second subtest. All 5 reported failures trace to these 2 leaf failures. This is a pre-existing test flake unrelated to the PR: the PR only modifies MachineDeployment completion detection logic in nodepool_controller.go and capi.go, which has no connection to global pull secret syncing or its DaemonSet readiness. For comparison, the same e2e-aws job passed all EnsureGlobalPullSecret tests on PR #8824 (different failures there), which is consistent with a non-deterministic infrastructure issue.

(2) tide: The PR branch capi-stale-status has a merge conflict with main (mergeable state: CONFLICTING), preventing Tide from merging.

Root Cause

e2e-aws failure — Pre-existing flake, NOT caused by this PR:

The failure chain is:

  1. Primary failure: The EnsureGlobalPullSecret test patches the management-cluster pull secret with a dummy auth entry, then waits for the global-pull-secret-syncer DaemonSet to become fully ready on all 3 hosted cluster nodes. One pod never became ready — the log shows DaemonSet global-pull-secret-syncer not ready: 2/3 pods ready repeated continuously for the full 20-minute timeout window (util.go:2290).

  2. Cascading failure: Because the first subtest hit a context deadline via gomega Eventually, deferred cleanup (DaemonSet deletion) was skipped. The second subtest Check_if_the_config.json_is_correct_in_all_of_the_nodes then tried to create the same kubelet-config-verifier DaemonSet, which already existed, producing a 409 AlreadyExists error.

  3. Parent test propagation: The two leaf failures cascaded up through EnsureGlobalPullSecret → Main → TestCreateCluster, producing 5 total reported failures from 2 root causes.

Why this is unrelated to the PR: PR #8827 modifies three files, all in hypershift-operator/controllers/nodepool/:

  • capi.go: Changes call order of reconcileMachineDeploymentStatus (moved after Reconcile, removed early-return from propagateVersionAndTemplate)
  • nodepool_controller.go: Rewrites MachineDeploymentComplete() to use Status.V1Beta2 instead of conversion-data annotation, adds MachineSet template verification
  • capi_test.go: Updates tests for the new signatures

None of these changes affect: the global-pull-secret-syncer DaemonSet, pull secret propagation, DaemonSet scheduling, or the EnsureGlobalPullSecret test code. The DaemonSet pod scheduling issue (1 of 3 pods failing to become ready on a node) is an infrastructure-level issue.

tide failure — Merge conflict:

The PR branch capi-stale-status has diverged from main and has conflicting changes. GitHub reports mergeable: false, mergeStateStatus: DIRTY. The branch needs to be rebased onto current main to resolve the conflict before Tide can process it.

Recommendations
  1. Resolve the merge conflict: Rebase the capi-stale-status branch onto current main to fix the Tide error. This is a blocker — CI cannot proceed until the conflict is resolved.

  2. Retest after rebase: After rebasing, run /test e2e-aws to trigger a fresh e2e run. The EnsureGlobalPullSecret failure is a flake and is expected to pass on retry.

  3. No code changes needed for the test failure: The global-pull-secret-syncer DaemonSet readiness timeout is a known class of flake (DaemonSet pod scheduling delays on hosted cluster nodes). It is not caused by the PR's MachineDeployment status detection changes.

Evidence
| Evidence | Detail |
| --- | --- |
| Failing test | TestCreateCluster/Main/EnsureGlobalPullSecret/When_management-cluster_hostedCluster.Spec.PullSecret_is_updated_in-place_it_should_propagate_to_guest_without_rollout |
| Error | failed to wait for DaemonSet global-pull-secret-syncer to be ready: context deadline exceeded |
| DaemonSet state | Stuck at 2/3 pods ready for 20 minutes (build-log.txt lines 2144–7234, ~366 repetitions) |
| Cascading error | kubelet-config-verifier DaemonSet AlreadyExists (409) due to skipped cleanup |
| PR files changed | capi.go, capi_test.go, nodepool_controller.go (all in controllers/nodepool/) |
| PR scope | MachineDeploymentComplete() rewrite: v1beta2 status fields + MachineSet template verification |
| Test relevance | PR changes have zero overlap with global pull secret syncing, DaemonSet scheduling, or the failing test code |
| Comparison run | PR #8824 build 2069964113513025536: same job, EnsureGlobalPullSecret passed (all tests passed) |
| Tide status | mergeable: false, mergeStateStatus: DIRTY (PR has a merge conflict with main) |
| Test result totals | 623 tests, 30 skipped, 5 failures (2 leaf + 3 parent cascades) |

@clebs clebs closed this Jun 26, 2026
@openshift-ci-robot


@clebs: This pull request references Jira Issue OCPBUGS-92013. The bug has been updated to no longer refer to the pull request using the external bug tracker.

@clebs

clebs commented Jun 26, 2026

Member Author

Closed in favor of #8821

@clebs clebs deleted the capi-stale-status branch June 26, 2026 13:50

Labels

area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

3 participants