OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy#8406
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Skipping CI for Draft Pull Request. |
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughWhen an existing etcd recovery Job finishes unsuccessfully, the reconciler now fetches the etcd StatefulSet; if the StatefulSet reports ReadyReplicas==3 and AvailableReplicas==3 it deletes the recovery Job and related objects, clears the EtcdRecoveryActive condition (Status=False, Reason=AsExpectedReason) on the HostedCluster, updates HostedCluster status, and returns without marking manual-intervention. If the StatefulSet is not fully ready, the reconciler retains the prior failure handling (EtcdRecoveryJobFailedReason). Separately, when no failing etcd pod is detected and the StatefulSet is fully available, the reconciler clears a stale EtcdRecoveryActive failure condition only if its prior Reason was EtcdRecoveryJobFailedReason, updating status. A unit test verifies these behaviors. Sequence DiagramsequenceDiagram
participant Reconciler
participant ETCDJob as ETCD Recovery Job
participant ETCDSet as ETCD StatefulSet
participant HostedCluster as HostedCluster Status
Reconciler->>ETCDJob: Check if Job exists and failed
alt Job Failed
Reconciler->>ETCDSet: Fetch StatefulSet readiness
alt ReadyReplicas==3 && AvailableReplicas==3
Reconciler->>ETCDJob: Delete recovery Job and objects
Reconciler->>HostedCluster: Set EtcdRecoveryActive=False (Reason: AsExpectedReason)
Reconciler->>HostedCluster: Update status
else StatefulSet Not Fully Ready
Reconciler->>HostedCluster: Set EtcdRecoveryActive=True/Failed (Reason: EtcdRecoveryJobFailedReason)
Reconciler->>HostedCluster: Update status
end
else No Failing Pod Detected
Reconciler->>ETCDSet: Check if fully available
alt StatefulSet Fully Available
Reconciler->>HostedCluster: If prior Reason==EtcdRecoveryJobFailedReason, clear EtcdRecoveryActive (Status=False, Reason=AsExpectedReason)
Reconciler->>HostedCluster: Update status
end
end
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: vsolanki12 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #8406 +/- ##
==========================================
+ Coverage 39.85% 40.44% +0.59%
==========================================
Files 751 755 +4
Lines 92554 93265 +711
==========================================
+ Hits 36889 37725 +836
+ Misses 52992 52830 -162
- Partials 2673 2710 +37
... and 45 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)
6622-6627: ⚡ Quick winAssert the failed recovery job is gone in the cleanup case.
The recovered-after-failed-job case only checks the condition reason. If cleanup regresses and the stale
Jobstays behind, this test still passes even though one of the main behaviors in this PR is broken.Suggested assertion
testCases := []struct { name string objects []crclient.Object conditions []metav1.Condition expectedReason string conditionExists bool + expectJobDeleted bool }{ { name: "When failed job exists but etcd recovered it should cleanup job and clear condition", conditions: []metav1.Condition{staleCondition}, objects: append(healthyEtcdPods(), healthyStatefulSet, failedJob), expectedReason: hyperv1.AsExpectedReason, conditionExists: true, + expectJobDeleted: true, }, } @@ if tc.conditionExists { g.Expect(condition).ToNot(BeNil()) g.Expect(condition.Reason).To(Equal(tc.expectedReason)) } else { g.Expect(condition).To(BeNil()) } + + if tc.expectJobDeleted { + job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS) + err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job) + g.Expect(errors2.IsNotFound(err)).To(BeTrue()) + } }) } }Also applies to: 6677-6686
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 6622 - 6627, The test cases that use the failedJob fixture (the case with name "When failed job exists but etcd recovered it should cleanup job and clear condition" and the similar case around lines 6677-6686) only assert condition reasons; add an assertion after reconciliation that the failed Job (failedJob) has been removed—e.g., attempt to Get the Job by failedJob.Name in the test namespace and assert the client returns NotFound or that listing Jobs returns zero matching entries. Locate the assertions around expectedReason/conditionExists in hostedcluster_controller_test.go and add the cleanup check for failedJob for both test cases so a lingering Job causes the test to fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`:
- Around line 6622-6627: The test cases that use the failedJob fixture (the case
with name "When failed job exists but etcd recovered it should cleanup job and
clear condition" and the similar case around lines 6677-6686) only assert
condition reasons; add an assertion after reconciliation that the failed Job
(failedJob) has been removed—e.g., attempt to Get the Job by failedJob.Name in
the test namespace and assert the client returns NotFound or that listing Jobs
returns zero matching entries. Locate the assertions around
expectedReason/conditionExists in hostedcluster_controller_test.go and add the
cleanup check for failedJob for both test cases so a lingering Job causes the
test to fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d69011b7-ad73-45d5-aa19-6f7fc30bdb73
📒 Files selected for processing (4)
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.gocontrol-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.gohypershift-operator/controllers/hostedcluster/etcd_recovery.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
d28aec7 to
a9c44fb
Compare
|
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
9462d10 to
5873d33
Compare
|
/auto-cc |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
|
/retest |
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)
6640-6677: ⚡ Quick winAdd a regression case for StatefulSet read failure in failed-job path.
Current cases validate healthy/unhealthy outcomes, but not the new non-
NotFounderror branch (production Line 61-64 inreconcileETCDMemberRecovery). A dedicated case would lock in that behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 6640 - 6677, Add a regression test case in the testCases slice that covers the non-NotFound StatefulSet read error path in reconcileETCDMemberRecovery: include failedJob and the etcd pod objects (e.g., use healthyEtcdPods()), ensure the test harness simulates a GET error for the StatefulSet (not a NotFound) instead of returning unhealthyStatefulSet/healthyStatefulSet, and assert the behavior you expect for that branch (e.g., expectedReason remains hyperv1.EtcdRecoveryJobFailedReason and expectJobDeleted is false). Reference the existing identifiers failedJob, reconcileETCDMemberRecovery, healthyEtcdPods, unhealthyStatefulSet/healthyStatefulSet and add the new case to the testCases array so the fake client error path is exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go`:
- Around line 6640-6677: Add a regression test case in the testCases slice that
covers the non-NotFound StatefulSet read error path in
reconcileETCDMemberRecovery: include failedJob and the etcd pod objects (e.g.,
use healthyEtcdPods()), ensure the test harness simulates a GET error for the
StatefulSet (not a NotFound) instead of returning
unhealthyStatefulSet/healthyStatefulSet, and assert the behavior you expect for
that branch (e.g., expectedReason remains hyperv1.EtcdRecoveryJobFailedReason
and expectJobDeleted is false). Reference the existing identifiers failedJob,
reconcileETCDMemberRecovery, healthyEtcdPods,
unhealthyStatefulSet/healthyStatefulSet and add the new case to the testCases
array so the fake client error path is exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6fb89972-970a-4f3f-b1e7-2cadf40e46f3
📒 Files selected for processing (2)
hypershift-operator/controllers/hostedcluster/etcd_recovery.gohypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
a2a7e34 to
01953c1
Compare
| etcdRecoveryActiveCondition.Status = metav1.ConditionFalse | ||
| etcdRecoveryActiveCondition.Reason = hyperv1.AsExpectedReason | ||
| etcdRecoveryActiveCondition.LastTransitionTime = r.now() | ||
| meta.SetStatusCondition(&hcluster.Status.Conditions, etcdRecoveryActiveCondition) |
There was a problem hiding this comment.
What about etcdRecoveryActiveCondition.Message in this path?
There was a problem hiding this comment.
Thanks, I have made all the changes suggested on the etcd_recovery.go & hostedcluster_controller_test.go file.
| Type: string(hyperv1.EtcdRecoveryActive), | ||
| Status: metav1.ConditionFalse, | ||
| Reason: hyperv1.AsExpectedReason, | ||
| ObservedGeneration: hcluster.Generation, |
There was a problem hiding this comment.
What about etcdRecoveryActiveCondition.Message in this path?
There was a problem hiding this comment.
Thanks, I have made all the changes suggested on the etcd_recovery.go & hostedcluster_controller_test.go file.
| var pods []crclient.Object | ||
| for i := 0; i < 3; i++ { | ||
| pods = append(pods, &corev1.Pod{ | ||
| ObjectMeta: metav1.ObjectMeta{ |
There was a problem hiding this comment.
May you add a case where RestartCount is > 0 and Running state?
To cover case when POD recovers.
There was a problem hiding this comment.
Thanks, I have made all the changes suggested on the etcd_recovery.go & hostedcluster_controller_test.go file.
|
/cc @sdminonne |
01953c1 to
7788d6b
Compare
|
/retest |
|
/verified by me in local env |
|
@vsolanki12: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
supported logs: After fix: |
|
/test e2e-v2-aws |
7788d6b to
6e9192b
Compare
sdminonne
left a comment
There was a problem hiding this comment.
Generally OK (aside that we're hard-coding helthy etcd repliacs = 3) Need to improve coverage (for example in case of errors).
Can we create instead a isETCDHealthy func?
| Namespace: hcpNS, | ||
| }, | ||
| Status: appsv1.StatefulSetStatus{ | ||
| ReadyReplicas: 2, |
There was a problem hiding this comment.
Is this unhealthy 'cause we hardcode etcd spec.replicas = 3? Mind adding the spec then?
There was a problem hiding this comment.
Yes, made the changed as per suggestions.
| if !apierrors.IsNotFound(err) { | ||
| return false, fmt.Errorf("failed to get etcd statefulset: %w", err) | ||
| } | ||
| } else if etcdStatefulSet.Status.ReadyReplicas == 3 && etcdStatefulSet.Status.AvailableReplicas == 3 { |
There was a problem hiding this comment.
this is hardcoding etcdStatefulSet.Status.ReadyReplicas = 3 right?
There was a problem hiding this comment.
Yes, made the changed as per suggestions.
6e9192b to
d34dc0d
Compare
sdminonne
left a comment
There was a problem hiding this comment.
Thanks nice work! One follow-up only and one pre-existing bug :(
Mind having a look?
There was a problem hiding this comment.
Mind using here the brand new isEtcdStatefulSetHealthy
There was a problem hiding this comment.
Thanks I have updated and push in latest commit.
|
|
||
| func (r *HostedClusterReconciler) detectAndTriggerEtcdRecovery(ctx context.Context, log logr.Logger, hcluster *hyperv1.HostedCluster, hcpNS string, recoveryJob *batchv1.Job, createOrUpdate upsert.CreateOrUpdateFN) (*time.Duration, error) { | ||
| etcdStatefulSet := etcdrecoverymanifests.EtcdStatefulSet(hcpNS) | ||
| if err := r.Get(ctx, crclient.ObjectKeyFromObject(etcdStatefulSet), etcdStatefulSet); err != nil { |
There was a problem hiding this comment.
Here we've a bug, I know you didn't modify this part but since we're here may you fix it?
In case err is different that NotFound we pass through and keep running (for example API transient error).
There was a problem hiding this comment.
Thanks I have updated and push in latest commit.
d34dc0d to
9b22420
Compare
bryan-cox
left a comment
There was a problem hiding this comment.
Staff-Level Review
The fix correctly addresses the reported problem — stale EtcdRecoveryJobFailed condition persisting after etcd self-heals. The approach is architecturally sound, isEtcdStatefulSetHealthy is a good extraction, and the metric integration is correct. See inline comments for items to address before merge.
| return false, err | ||
| } | ||
| return true, nil | ||
| } |
There was a problem hiding this comment.
[blocking] The fallthrough to this existing setEtcdRecoveryCondition call is the "etcd is still unhealthy" case, but it is easy to miss after the new health-check block above. A short comment would help:
// Etcd is still unhealthy after the failed recovery job; report manual intervention needed.There was a problem hiding this comment.
Thank you @bryan-cox,
Done, added the comment.
|
|
||
| oldCondition := meta.FindStatusCondition(hcluster.Status.Conditions, string(hyperv1.EtcdRecoveryActive)) | ||
| if oldCondition == nil || oldCondition.Status != condition.Status { | ||
| if oldCondition == nil || oldCondition.Status != condition.Status || oldCondition.Reason != condition.Reason { |
There was a problem hiding this comment.
[blocking] This guard change is necessary for the fix (both old and new condition use Status=False), but the intent is non-obvious. Please add a comment explaining why Reason is now compared:
// Update the condition if the status or reason changed. The reason check
// is needed to transition from EtcdRecoveryJobFailed -> AsExpected when
// etcd self-heals (both use Status=False).There was a problem hiding this comment.
Thank you @bryan-cox,
Done, added the comment.
| log.Info("etcd is healthy but EtcdRecoveryActive has stale failure condition, clearing it") | ||
| if err := r.setEtcdRecoveryCondition(ctx, hcluster, metav1.ConditionFalse, hyperv1.AsExpectedReason, "ETCD cluster is healthy."); err != nil { | ||
| return nil, err | ||
| } |
There was a problem hiding this comment.
[question] This only clears stale conditions with Reason == EtcdRecoveryJobFailedReason. If someone manually deletes the failed recovery Job while recovery is active, the condition could be stuck at Status=True/Reason=AsExpected — that would not be caught here.
Is manual job deletion during active recovery a scenario worth guarding against? If not, a brief comment explaining the scope would be helpful.
There was a problem hiding this comment.
Added scope comment. Not a scenario worth guarding against active recovery conditions are managed by handleExistingEtcdRecoveryJob.
| return false, fmt.Errorf("failed to get etcd statefulset: %w", err) | ||
| } | ||
| } else if isEtcdStatefulSetHealthy(etcdStatefulSet) { | ||
| log.Info("etcd recovered despite failed recovery job, cleaning up") |
There was a problem hiding this comment.
[nit] New messages use "ETCD" (all-caps) while existing code at line 93 uses "Etcd" (title-case). Not blocking, but consider unifying the capitalization.
There was a problem hiding this comment.
Done, unified to "Etcd" throughout.
| g.Expect(err).To(HaveOccurred()) | ||
| g.Expect(err.Error()).To(ContainSubstring("failed to get etcd statefulset")) | ||
| }) | ||
| } |
There was a problem hiding this comment.
[blocking] Missing test: the transient error test only exercises detectAndTriggerEtcdRecovery (no job seeded). Please add a test where a failed job exists AND the StatefulSet Get fails, to verify the new Get in handleExistingEtcdRecoveryJob (line 76) properly returns the error rather than falling through.
There was a problem hiding this comment.
Done, added when failed job exists and StatefulSet Get fails with transient error it should return the error.
9b22420 to
73b3599
Compare
…when etcd is healthy When the etcd recovery job fails but etcd self-heals, the EtcdRecoveryJobFailed condition was never cleared. This caused the OpenShift Console to display a stale error message even when the cluster was fully healthy. This fix adds two checks: - When a failed recovery job exists but etcd StatefulSet is fully available (3/3), clean up the job and clear the condition. Transient API errors are propagated instead of silently falling through to the failure path. - When no failing etcd pods exist and etcd is healthy, clear any stale EtcdRecoveryJobFailed condition. Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
73b3599 to
b6be386
Compare
|
@vsolanki12: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
I now have the complete root cause analysis. Here is the final report: Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryBoth Enterprise Contract (EC) failures are caused by outdated Tekton task bundle digests in the PR branch's Root CauseThe Konflux build pipeline ( PR #8406's branch is based on commit Specifically, 3 tasks had version bumps (not just digest rotations):
The remaining 15 tasks had digest-only rotations (same version, new build). Two of these stale references are flagged as EC violations (failures), producing the Timeline proof:
Recommendations
Evidence
|
What this PR does / why we need it:
When the etcd recovery job fails but etcd self-heals, the
EtcdRecoveryJobFailedcondition was never cleared. This caused the OpenShift Console to display a stale error message ("Error in Etcd Recovery job: the Etcd cluster requires manual intervention.") on the HostedCluster overview page, even when the cluster was fully healthy (Available=True,Degraded=False,EtcdAvailable=True).This fix adds two checks in
reconcileETCDMemberRecovery:EtcdRecoveryJobFailedcondition.Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-84577
Special notes for your reviewer:
ENABLE_ETCD_RECOVERYenv var and only applies to managed, highly-available etcd clusters.etcd_manual_intervention_requiredmetric (which reads this condition) will also correctly reset.Checklist:
Summary by CodeRabbit
Bug Fixes
Tests