OCPBUGS-84695: update node-to-pacemaker reconciliation logic to prevent pacemaker split-brain#1605
Conversation
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
WalkthroughTNF lifecycle orchestration is expanded across startup, health monitoring, cleanup, reconciliation, and day-2 runners. Shared node/IP and job helpers, Pacemaker helper APIs, RBAC/manifests, and lifecycle documentation are also updated. ChangesTNF lifecycle orchestration
Sequence Diagram(s)sequenceDiagram
participant ControllerFramework
participant PacemakerLifecycleManager
participant StaticPodOperatorClient
participant StartJobControllers
participant CleanupOrphanedJobs
participant MonitorHealth
participant ReconcilePacemakerConfig
ControllerFramework->>PacemakerLifecycleManager: sync()
PacemakerLifecycleManager->>StaticPodOperatorClient: HasExternalEtcdCompletedTransition
PacemakerLifecycleManager->>StartJobControllers: start controllers
PacemakerLifecycleManager->>CleanupOrphanedJobs: cleanup orphaned jobs and pods
alt external etcd transition completed
PacemakerLifecycleManager->>MonitorHealth: MonitorHealth(ctx)
PacemakerLifecycleManager->>ReconcilePacemakerConfig: ReconcilePacemakerConfig(ctx)
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery |
|
@jaypoulz: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3ba7540-43e1-11f1-8734-dc441d1b18ce-0 |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/tnf/update-setup/runner_test.go (1)
18-102: Tests validate API primitives, but not the runner branch decisions end-to-end.This table test confirms fake-client
Get/IsNotFoundbehavior, but it doesn’t exerciseRunTnfUpdateSetup(or a helper) to assert “skip vs proceed vs fail” return-path decisions directly. For this critical flow, consider extracting the node-existence decision into a small function and unit-test that function.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner_test.go` around lines 18 - 102, The test is only exercising the fake client Get behavior, not the runner's branch decisions; extract the node-existence decision into a small function (e.g., func checkNodeExists(ctx context.Context, kube kubernetes.Interface, nodeName string) (exists bool, err error)) and update RunTnfUpdateSetup to call that helper; then replace the table test to unit-test checkNodeExists directly (using fake.NewClientset) to assert the three return paths (exists/skip, not found/proceed, unexpected error/fail). Ensure the helper is named uniquely (checkNodeExists) and referenced from RunTnfUpdateSetup so the new unit tests verify the runner branch logic end-to-end.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 124-126: The test passes a non-error value (&runtime.Unknown{}) to
apierrors.NewInternalError via simulateError; replace the runtime.Unknown value
with an actual error (e.g., errors.New(...) or fmt.Errorf(...)) or format the
runtime.Unknown into an error string so apierrors.NewInternalError receives an
error interface; update the simulateError assignment that calls
apierrors.NewInternalError in runner_test.go accordingly (referencing
simulateError, apierrors.NewInternalError, and runtime.Unknown).
---
Nitpick comments:
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 18-102: The test is only exercising the fake client Get behavior,
not the runner's branch decisions; extract the node-existence decision into a
small function (e.g., func checkNodeExists(ctx context.Context, kube
kubernetes.Interface, nodeName string) (exists bool, err error)) and update
RunTnfUpdateSetup to call that helper; then replace the table test to unit-test
checkNodeExists directly (using fake.NewClientset) to assert the three return
paths (exists/skip, not found/proceed, unexpected error/fail). Ensure the helper
is named uniquely (checkNodeExists) and referenced from RunTnfUpdateSetup so the
new unit tests verify the runner branch logic end-to-end.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b46e7555-1b53-4882-b0b2-fa6f26787667
📒 Files selected for processing (2)
pkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
|
/jira refresh |
|
@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
e9ed644 to
0533a44
Compare
|
|
||
| // Check if offline node still exists in Kubernetes | ||
| // Node replacement requires user to DELETE the node from Kubernetes | ||
| // If the node still exists, it's likely just rebooted (e.g., during upgrade) |
There was a problem hiding this comment.
This job only runs when we have 2 ready nodes available... but yeah, pacemaker status might be behind, or reboot by upgrade just started... good catch 🙈
Might be good enough to compare offlineNodeName with currentNodeName and otherNodeName... when one of them matches, skip. But the result is the same I guess :)
There was a problem hiding this comment.
What I realized from this comment is that both approaches have the same problem. Same-name node replacement requires us to re-add the node. So the k8s node check doesn't work because it would block it on the same name, and the currentNodeName and otherNodeName check doesn't work they would match the replaced node.
I think we need to run update-setup to reconcile on node delete because offline is insufficient to distinguish between temporary outage and node-that-needs-replacement.
There was a problem hiding this comment.
OK - I'm pushing a new approach. Don't run setup until we have 2 nodes for initial setup, but after that, always reconcile pacemaker state to k8s state. This means that deleting the node in k8s results in us removing it from pacemaker.
This allows us to determine user intention based on the declarative k8s state. So now if you reboot, the k8s nodes are the same, so no-op. If you delete a node, that's an intentional change, so we clean it up on the backend.
It's more involved though, so I'll need to do some soak testing for this.
There was a problem hiding this comment.
Same-name node replacement
Ah yes, of course. Has been too long ago already, me dealing with TNF 😎
Don't run setup until we have 2 nodes for initial setup, but after that, always reconcile pacemaker state to k8s state
Hm, not sure about that tbh...
The idea still was: we have 2 ready nodes, but it's not the initial setup, so "update" pacemaker if needed to reflect reality. Changing the trigger would be a risky change IMHO, and can introduce regressions. I mean, this worked well so far for the intended use case, not?
IMHO we only need to better understand the situation during job execution: do we need to deal with the offline node or not? So: can we check something on the node resource to find out?
Pragmatic idea: check the creation date...? If the node is "new" run the setup, else skip it. 🤔
Maybe there is something better.
Or: I don't remember details, but I think I considered putting a label on nodes after they were successfully configured for pacemaker, but that didn't happen... could be useful in this case.
There was a problem hiding this comment.
I think we can afford the extra iteration time to get this right since the bug is easy enough to work around at upgrade time. Beyond it fixing this edge case, the other reason I like reworking the reconciliation approach is that I think we will need to in the future to support Adaptable topology.
Scaling from 1 to 2 to 3 nodes with automatic quorum recovery will require us to set up pacemaker on a single node cluster first, then scale up as we go. This would be one of multiple things that we would need to harden to handle that scale-up / scale-down logic.
There was a problem hiding this comment.
One thing is absolutely true though - we can't move forward with this patch before updated podman-etcd to handle the single-node case in the monitor operation first.
0533a44 to
3bf1712
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/operator/nodehandler.go`:
- Around line 121-139: The current logic uses jobsExist (from tnfSetupJobsExist)
as a post-setup signal which is set true as soon as startTnfJobcontrollers
creates setup/auth/after-setup jobs; change the gate to require a real
completion signal instead: replace or extend tnfSetupJobsExist with a check for
a TNF setup completion marker (for example a specific Job completion condition,
a "tnf-setup-completed" ConfigMap/CR, or a Job status check) and expose it as
something like tnfSetupCompleted(ctx, kubeClient) (or return a second boolean
from tnfSetupJobsExist indicating completion); then use that completion boolean
in the node count checks (i.e. when evaluating len(nodeList) < 2, consult
setupCompleted not jobsExist) so a partially-created setup after a CEO restart
is treated as not completed and the initial 2-node flow is preserved.
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 149-215: The test in runner_test.go reimplements the
reconciliation logic instead of exercising the production code in runner.go;
extract the reconciliation decision into a single exported or package-private
helper (e.g., ReconcileNodes or decideNodeAddRemove) in runner.go that takes
k8sNodes []string and pacemakerOnline/offline []string and returns
(nodeToRemove, nodeToAdd, skip bool), then update the table-driven tests to call
that helper directly (or alternatively drive RunTnfUpdateSetup using
fakes/mocks) so the tests validate the actual implementation rather than a
duplicate inline algorithm.
In `@pkg/tnf/update-setup/runner.go`:
- Around line 313-326: Replace the unsafe substring check
strings.Contains(peerURL, nodeIP) with exact host comparison: parse each peerURL
(use url.Parse) inside the loop over member.PeerURLs, extract the host portion
(handle optional port with net.SplitHostPort and IPv6 brackets), then compare
the extracted host string exactly to nodeIP before proceeding to format
member.ID, log, and call exec.Execute to remove the member; only call member
remove when the host equals nodeIP to avoid accidental matches.
- Around line 168-176: The removal path must not abort when nodeToRemove isn't
present in cfg.NodeName1/NodeName2; instead, stop using
cfg.NodeName1/NodeName2/NodeIP1/NodeIP2 to derive the member to remove and query
Pacemaker/etcd state directly (e.g., list pacemaker nodes or etcd members) to
find the member ID and peerURL/IP for nodeToRemove, then use that ID to perform
the removal; update the logic in the block that references nodeToRemove (and the
similar code at the other instance around lines 239-248) to call the
pacemaker/etcd discovery functions, handle not-found as a no-op or
log-and-continue, and only fall back to cfg lookups if the live discovery fails.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3032f008-368a-4f16-b3ad-fdf0e5c3b3e4
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
78f3141 to
fad9968
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/tnf/update-setup/runner.go (2)
271-283:⚠️ Potential issue | 🔴 CriticalDeleted nodes cannot be resolved through the current cluster config.
nodeNameshere are the pacemaker members that Kubernetes no longer has. For that case,getNodeIP(nodeName)fails by definition, so the code aborts after pacemaker removal and leaves the stale etcd member behind.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner.go` around lines 271 - 283, The loop in removeEtcdMembersForNodes currently aborts when getNodeIP(nodeName) fails (which happens for pacemaker-only/deleted nodes) and leaves stale etcd members; change the logic so a failed getNodeIP does not return but instead logs a warning and attempts removal by node identity: implement a fallback removal path (e.g., add removeEtcdMemberByName(nodeName) or enhance removeEtcdMemberByIP to accept a nodeName fallback) and call that when getNodeIP returns an error; ensure removeEtcdMembersForNodes still logs success/failure via klog and only returns on unrecoverable errors from the removal call (removeEtcdMemberByIP or new removeEtcdMemberByName).
311-324:⚠️ Potential issue | 🔴 CriticalMatch the etcd peer host exactly before removing a member.
strings.Contains(peerURL, nodeIP)is unsafe for values like10.0.0.1vs10.0.0.10, and it is brittle with ports/IPv6 formatting. Parse the peer URL and compare the hostname exactly before callingmember remove.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner.go` around lines 311 - 324, The code in the loop over memberList.Members uses strings.Contains(peerURL, nodeIP) which can match substrings and fail on ports/IPv6; update the logic in the loop that iterates member.PeerURLs (the block around strings.Contains(peerURL, nodeIP)) to parse each peerURL (use url.Parse), extract the host component and if it contains a port use net.SplitHostPort to get the exact hostname (handling IPv6 brackets), then compare that hostname exactly to nodeIP before constructing memberIDHex and running exec.Execute for member remove; keep the existing memberIDHex formatting and logging but only call etcdctl member remove when the parsed host equals nodeIP.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 96-99: When building the k8sNodes set, skip empty node names so
the map does not contain "" which causes reconciliation to treat a non-existent
node as present; update the construction of k8sNodes to only insert
cfg.NodeName1 and cfg.NodeName2 when they are non-empty (e.g. check
len(cfg.NodeName1) > 0 and len(cfg.NodeName2) > 0 before adding), keeping the
map type and usage same so downstream logic that checks membership remains
unchanged.
---
Duplicate comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 271-283: The loop in removeEtcdMembersForNodes currently aborts
when getNodeIP(nodeName) fails (which happens for pacemaker-only/deleted nodes)
and leaves stale etcd members; change the logic so a failed getNodeIP does not
return but instead logs a warning and attempts removal by node identity:
implement a fallback removal path (e.g., add removeEtcdMemberByName(nodeName) or
enhance removeEtcdMemberByIP to accept a nodeName fallback) and call that when
getNodeIP returns an error; ensure removeEtcdMembersForNodes still logs
success/failure via klog and only returns on unrecoverable errors from the
removal call (removeEtcdMemberByIP or new removeEtcdMemberByName).
- Around line 311-324: The code in the loop over memberList.Members uses
strings.Contains(peerURL, nodeIP) which can match substrings and fail on
ports/IPv6; update the logic in the loop that iterates member.PeerURLs (the
block around strings.Contains(peerURL, nodeIP)) to parse each peerURL (use
url.Parse), extract the host component and if it contains a port use
net.SplitHostPort to get the exact hostname (handling IPv6 brackets), then
compare that hostname exactly to nodeIP before constructing memberIDHex and
running exec.Execute for member remove; keep the existing memberIDHex formatting
and logging but only call etcdctl member remove when the parsed host equals
nodeIP.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 975b05c2-c5db-40cd-ab9f-96db7d22ec11
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- pkg/tnf/operator/nodehandler_test.go
- pkg/tnf/update-setup/runner_test.go
- pkg/tnf/operator/nodehandler.go
fad9968 to
03a2a57
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
pkg/tnf/operator/nodehandler_test.go (1)
354-360: Consider using a prefix check instead of exact name matches.The hardcoded list of job names is fragile. If tests add new master names (e.g.,
tnf-after-setup-master-2), this condition would need updating.♻️ Suggested improvement
labels := map[string]string{ "app.kubernetes.io/component": "two-node-fencing-setup", } // Mark after-setup jobs to indicate initial setup completion - if name == "tnf-after-setup" || name == "tnf-after-setup-master-0" || name == "tnf-after-setup-master-1" { + if strings.HasPrefix(name, "tnf-after-setup") { labels["app.kubernetes.io/name"] = "tnf-after-setup" }This would require adding
"strings"to the imports.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/operator/nodehandler_test.go` around lines 354 - 360, The code currently checks exact job names when setting labels for after-setup jobs (the block that updates the labels map when name == "tnf-after-setup" || name == "tnf-after-setup-master-0" || name == "tnf-after-setup-master-1"); replace these exact-match checks with a prefix check (e.g., use strings.HasPrefix(name, "tnf-after-setup")) so new masters like "tnf-after-setup-master-2" are handled automatically, and add the "strings" import to the test file.pkg/tnf/update-setup/runner.go (1)
293-296: Unused parameters:nodeNamesandgetNodeIPare never used.The function removes all unstarted etcd members based on
member.Name == "", but the passed parameters are never referenced. This makes the function signature misleading and the caller's effort to passgetNodeIPpointless.♻️ Suggested signature simplification
If the intent is to remove all unstarted members regardless of which nodes were removed:
-func removeEtcdMembersForNodes(ctx context.Context, nodeNames []string, getNodeIP func(string) (string, error)) error { - if len(nodeNames) == 0 { - return nil - } +func removeUnstartedEtcdMembers(ctx context.Context) error {Update call sites at lines 180 and 230 accordingly:
- if err := removeEtcdMembersForNodes(ctx, nodesToRemove, getNodeIP); err != nil { + if len(nodesToRemove) > 0 { + if err := removeUnstartedEtcdMembers(ctx); err != nil { + return err + } + }Alternatively, if you want to keep the node-specific behavior for future use, consider using
getNodeIPto match members by IP viaremoveEtcdMemberByIP.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/tnf/update-setup/runner.go` around lines 293 - 296, The function removeEtcdMembersForNodes has unused parameters nodeNames and getNodeIP; either simplify the signature by removing those parameters and update callers to call removeEtcdMembersForNodes(ctx) (i.e., make it remove all unstarted members as it already does), or implement node-specific removal by iterating nodeNames, calling getNodeIP(node) to obtain IPs and then call removeEtcdMemberByIP(ctx, ip) for matches; update removeEtcdMembersForNodes to reference getNodeIP and nodeNames (or remove them) and adjust call sites accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 349-412: The function removeEtcdMemberByIP is dead code (never
called); either delete it or wire it into the existing workflow — specifically
replace or refactor the current removeEtcdMembersForNodes usage to call
removeEtcdMemberByIP for each nodeIP, or remove the entire removeEtcdMemberByIP
function and clean up any now-unused imports (e.g., url, net, strings) and
tests; locate the existing caller(s) of removeEtcdMembersForNodes and either
iterate node IPs invoking removeEtcdMemberByIP (preserving its parsing/hex-ID
logic) or remove the function and ensure no references remain.
---
Nitpick comments:
In `@pkg/tnf/operator/nodehandler_test.go`:
- Around line 354-360: The code currently checks exact job names when setting
labels for after-setup jobs (the block that updates the labels map when name ==
"tnf-after-setup" || name == "tnf-after-setup-master-0" || name ==
"tnf-after-setup-master-1"); replace these exact-match checks with a prefix
check (e.g., use strings.HasPrefix(name, "tnf-after-setup")) so new masters like
"tnf-after-setup-master-2" are handled automatically, and add the "strings"
import to the test file.
In `@pkg/tnf/update-setup/runner.go`:
- Around line 293-296: The function removeEtcdMembersForNodes has unused
parameters nodeNames and getNodeIP; either simplify the signature by removing
those parameters and update callers to call removeEtcdMembersForNodes(ctx)
(i.e., make it remove all unstarted members as it already does), or implement
node-specific removal by iterating nodeNames, calling getNodeIP(node) to obtain
IPs and then call removeEtcdMemberByIP(ctx, ip) for matches; update
removeEtcdMembersForNodes to reference getNodeIP and nodeNames (or remove them)
and adjust call sites accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 273e7f58-c686-46d5-b227-f828930b53ef
📒 Files selected for processing (4)
pkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/update-setup/runner_test.go
fonta-rh
left a comment
There was a problem hiding this comment.
Review: OCPBUGS-84695 — Fix TNF Pacemaker Split-Brain
The core reconciliation approach is solid — comparing k8s ConfigMap (desired state) vs pacemaker XML membership (actual state) is fundamentally better than the old grep Offline | awk shell parsing. Three findings from review:
1. Dead code: removeEtcdMemberByIP() + unused getNodeIP parameter
removeEtcdMemberByIP() (runner.go:472-535, 63 lines) is defined but never called anywhere in this PR. Additionally, the getNodeIP callback is passed to removeEtcdMembersForNodes() at lines 320 and 370, but it's never dereferenced inside the function body — removeEtcdMembersForNodes removes ALL unstarted members regardless of node identity.
Was removeEtcdMemberByIP intended to be wired in as a fallback (e.g., when unstarted-name detection fails) but forgotten? If it's not needed, removing both the function and the dead getNodeIP parameter would clean things up. If removing, consider renaming removeEtcdMembersForNodes → removeAllUnstartedEtcdMembers to match the actual behavior (blanket cleanup, not per-node targeting).
2. Missing tests for intermediate setup states
The tnfSetupJobsExist() semantics changed significantly — from "any TNF job exists" to "after-setup job completed successfully (Succeeded > 0)." This is the behavioral hinge of the fix, but two intermediate states aren't tested:
-
After-setup in progress:
tnf-after-setup-master-0exists withSucceeded: 0→tnfSetupJobsExist()returnsfalse. With 1 node present,handleNodesblocks. ThecreateTNFJob()fixture hardcodesSucceeded: 1, so this path can't currently be exercised. -
Partial setup (setup running, after-setup not yet created):
tnf-setupjob exists but notnf-after-setupjobs. The old test usedcreateTNFJob("tnf-setup")— that coverage was lost when fixtures were changed totnf-after-setup-*.
Both are realistic timing windows during normal cluster lifecycle. Suggested additions:
"1 node, after-setup job in progress (not completed) - skip reconciliation""2 nodes, setup jobs exist but no after-setup jobs - start controllers only"
3. removeEtcdMembersForNodes docstring vs behavior mismatch
The docstring says "removes unstarted etcd members after nodes are removed from pacemaker," implying per-node targeting. The actual behavior removes ALL unstarted members regardless of the nodeNames parameter (which is only used for logging). The behavior is correct and safe for a 2-node cluster, but the docstring could mislead a future engineer debugging member removal issues. Updating it to reflect the blanket-cleanup behavior would prevent misunderstandings.
Overall the approach is a clear improvement. The reconciliation logic in reconcileNodes() is clean, well-extracted for testability, and the test coverage for the core diff logic is thorough (14 cases). Nice work.
f3779b1 to
880d8ca
Compare
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/tnf/operator/ARCHITECTURE.md (1)
1-984: 💤 Low valueDocumentation is comprehensive and clear.
The architecture documentation provides excellent coverage of the PacemakerLifecycleManager flows, data structures, and concurrency protections. The ASCII diagrams clearly illustrate the control flow for each operation.
Static analysis flagged missing language specifiers on fenced code blocks (e.g., line 9) and minor grammar suggestions. These are optional style improvements.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/operator/ARCHITECTURE.md` around lines 1 - 984, The markdown has fenced code blocks missing language specifiers and a few minor grammar issues; update ARCHITECTURE.md by adding appropriate language tags (e.g., ```go for Go snippets, ```text or ```bash where applicable) to every fenced code block (notably the File Organization block and all code examples like NewPacemakerLifecycleManager(), sync(), StartJobControllers(), startTnfJobcontrollers(), NodeTarget, and PacemakerLifecycleManager sections) and apply the small grammar fixes suggested by static analysis (straighten pluralization/tense and punctuation in headings and bullets) so the file passes markdown linters and improves readability.pkg/tnf/update-setup/runner_test.go (1)
96-118: ⚡ Quick winAdd an “all generations invalid” case to lock the error-path contract.
Current coverage checks empty input and mixed valid/invalid generations, but not the case where every ConfigMap has a non-numeric
generation. That is a distinct branch in the picker and worth pinning with a test.Proposed diff
{ name: "skips invalid generation", items: []corev1.ConfigMap{ { ObjectMeta: metav1.ObjectMeta{Name: "valid"}, Data: map[string]string{ "generation": "5", "eventType": "add", }, }, { ObjectMeta: metav1.ObjectMeta{Name: "invalid"}, Data: map[string]string{ "generation": "not-a-number", "eventType": "add", }, }, }, expectName: "valid", expectGen: "5", expectError: false, }, + { + name: "all invalid generations returns error", + items: []corev1.ConfigMap{ + { + ObjectMeta: metav1.ObjectMeta{Name: "invalid-1"}, + Data: map[string]string{ + "generation": "x", + }, + }, + { + ObjectMeta: metav1.ObjectMeta{Name: "invalid-2"}, + Data: map[string]string{ + "generation": "y", + }, + }, + }, + expectError: true, + expectErrorMsg: "no update-setup ConfigMap found", + }, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/update-setup/runner_test.go` around lines 96 - 118, Add a new test case in pkg/tnf/update-setup/runner_test.go within the existing test table (the slice of test cases used by the test function) named "all generations invalid" that supplies only ConfigMaps whose Data["generation"] values are non-numeric (e.g., "not-a-number" and "also-bad"); set expectError to true and leave expectName/expectGen empty (or their zero values) so the test exercises the picker’s error path when every generation fails to parse. Ensure the test case mirrors the structure of the existing cases (using corev1.ConfigMap with ObjectMeta and Data) so the same test loop uses it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/operator/lifecycle_helpers.go`:
- Around line 1-11: The file uses klog.Warningf in lifecycle_helpers.go but the
klog package is not imported; add an import for "k8s.io/klog/v2" to the import
block so calls like klog.Warningf(...) compile. Locate the import section at the
top of lifecycle_helpers.go (near existing imports such as corev1, pacmkrv1,
ceohelpers, pacemaker) and include the klog import; no code changes to the klog
call sites (e.g., where klog.Warningf is used) are needed.
In `@pkg/tnf/operator/lifecycle_manager.go`:
- Around line 130-145: PacemakerLifecycleManager currently creates lifecycleCtx
and lifecycleCtxCancel in NewPacemakerLifecycleManager but never calls
lifecycleCtxCancel, leaking goroutines spawned by the event handlers; fix by
making the lifecycle context tied to the controller Run context or by exposing a
Shutdown that calls lifecycleCtxCancel: either (A) change
NewPacemakerLifecycleManager to not create an independent background context and
instead set lifecycleCtx = ctx passed into Run (or derive with
context.WithCancel(ctx)) so cancellation propagates when controller.Run(ctx,
...) exits, or (B) add a Shutdown() method on PacemakerLifecycleManager that
calls lifecycleCtxCancel and ensure callers (e.g., where controller is started)
invoke Shutdown after controller stops; update event handler usages that read
lifecycleCtx to use the manager's lifecycleCtx so they are cancelled properly.
In `@pkg/tnf/pkg/pcs/auth.go`:
- Around line 49-53: The pcs auth command is vulnerable to shell injection
because auth.go builds a fmt.Sprintf string with cfg.NodeName1/2 and
cfg.NodeIP1/2 that is passed to the shell via pkg/tnf/pkg/exec/exec.go; fix by
switching to argv-style execution or strict validation: update the code that
constructs the command in the function that sets the `command` variable in
pkg/tnf/pkg/pcs/auth.go to build an argument slice (e.g.,
[]string{"/usr/sbin/pcs","host","auth",...}) and modify the exec caller in
pkg/tnf/pkg/exec/exec.go to accept and exec without "/bin/bash -c", or
alternatively add tight allow-list/regex validation for NodeName* and NodeIP*
before using them (ensure only valid hostnames and IP formats are permitted) so
that the string interpolation cannot inject shell metacharacters.
---
Nitpick comments:
In `@pkg/tnf/operator/ARCHITECTURE.md`:
- Around line 1-984: The markdown has fenced code blocks missing language
specifiers and a few minor grammar issues; update ARCHITECTURE.md by adding
appropriate language tags (e.g., ```go for Go snippets, ```text or ```bash where
applicable) to every fenced code block (notably the File Organization block and
all code examples like NewPacemakerLifecycleManager(), sync(),
StartJobControllers(), startTnfJobcontrollers(), NodeTarget, and
PacemakerLifecycleManager sections) and apply the small grammar fixes suggested
by static analysis (straighten pluralization/tense and punctuation in headings
and bullets) so the file passes markdown linters and improves readability.
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 96-118: Add a new test case in pkg/tnf/update-setup/runner_test.go
within the existing test table (the slice of test cases used by the test
function) named "all generations invalid" that supplies only ConfigMaps whose
Data["generation"] values are non-numeric (e.g., "not-a-number" and "also-bad");
set expectError to true and leave expectName/expectGen empty (or their zero
values) so the test exercises the picker’s error path when every generation
fails to parse. Ensure the test case mirrors the structure of the existing cases
(using corev1.ConfigMap with ObjectMeta and Data) so the same test loop uses it.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e48cf5b1-9520-49a4-ace1-71e0fc89d0fb
📒 Files selected for processing (32)
bindata/tnfdeployment/role.yamlpkg/operator/ceohelpers/common.gopkg/operator/ceohelpers/common_test.gopkg/operator/ceohelpers/node_helpers.gopkg/operator/ceohelpers/node_helpers_test.gopkg/tnf/auth/runner.gopkg/tnf/operator/ARCHITECTURE.mdpkg/tnf/operator/lifecycle_cleanup.gopkg/tnf/operator/lifecycle_health.gopkg/tnf/operator/lifecycle_helpers.gopkg/tnf/operator/lifecycle_job_controllers.gopkg/tnf/operator/lifecycle_job_controllers_test.gopkg/tnf/operator/lifecycle_manager.gopkg/tnf/operator/lifecycle_manager_test.gopkg/tnf/operator/lifecycle_reconciliation.gopkg/tnf/operator/lifecycle_reconciliation_test.gopkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/operator/starter.gopkg/tnf/operator/starter_test.gopkg/tnf/pkg/jobs/tnf.gopkg/tnf/pkg/jobs/tnf_test.gopkg/tnf/pkg/pacemaker/client_helpers.gopkg/tnf/pkg/pacemaker/healthcheck.gopkg/tnf/pkg/pacemaker/healthcheck_test.gopkg/tnf/pkg/pacemaker/statuscollector.gopkg/tnf/pkg/pacemaker/statuscollector_test.gopkg/tnf/pkg/pacemaker/test_helpers.gopkg/tnf/pkg/pcs/auth.gopkg/tnf/pkg/tools/tnf_labels.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
💤 Files with no reviewable changes (3)
- pkg/tnf/pkg/pacemaker/healthcheck_test.go
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/operator/nodehandler_test.go
🚧 Files skipped from review as they are similar to previous changes (21)
- pkg/tnf/pkg/pacemaker/statuscollector.go
- pkg/tnf/auth/runner.go
- pkg/operator/ceohelpers/common.go
- pkg/tnf/pkg/pacemaker/client_helpers.go
- pkg/operator/ceohelpers/node_helpers.go
- bindata/tnfdeployment/role.yaml
- pkg/tnf/pkg/pacemaker/statuscollector_test.go
- pkg/operator/ceohelpers/node_helpers_test.go
- pkg/operator/ceohelpers/common_test.go
- pkg/tnf/pkg/jobs/tnf.go
- pkg/tnf/operator/lifecycle_reconciliation_test.go
- pkg/tnf/pkg/jobs/tnf_test.go
- pkg/tnf/operator/lifecycle_cleanup.go
- pkg/tnf/operator/starter_test.go
- pkg/tnf/operator/lifecycle_job_controllers_test.go
- pkg/tnf/operator/lifecycle_job_controllers.go
- pkg/tnf/pkg/pacemaker/test_helpers.go
- pkg/tnf/update-setup/runner.go
- pkg/tnf/operator/lifecycle_reconciliation.go
- pkg/tnf/operator/lifecycle_health.go
- pkg/tnf/pkg/pacemaker/healthcheck.go
| // Create lifecycle context for goroutines spawned by event handlers | ||
| lifecycleCtx, lifecycleCtxCancel := context.WithCancel(context.Background()) | ||
|
|
||
| c := &PacemakerLifecycleManager{ | ||
| operatorClient: operatorClient, | ||
| kubeClient: kubeClient, | ||
| eventRecorder: eventRecorder, | ||
| pacemakerInformer: informer, | ||
| nodeInformer: nodeInformer, | ||
| controllerContext: controllerContext, | ||
| kubeInformersForNamespaces: kubeInformersForNamespaces, | ||
| etcdInformer: etcdInformer, | ||
| recordedEvents: make(map[string]time.Time), | ||
| lifecycleCtx: lifecycleCtx, | ||
| lifecycleCtxCancel: lifecycleCtxCancel, | ||
| } |
There was a problem hiding this comment.
Add lifecycle context cancellation on controller shutdown.
The lifecycleCtx is created but lifecycleCtxCancel is never called. This violates the Go coding guideline requiring context.Context for cancellation and timeouts, and leads to goroutine leaks when the controller stops. Event handlers spawn goroutines at lines 197-201, 210-214, 226-230, and 249-253 that rely on this context; without cancellation, these goroutines will continue running after the controller shuts down.
As per coding guidelines, context.Context must be used for cancellation and timeouts.
Recommended fix
The controller lifecycle is managed by the library-go framework, which doesn't expose a shutdown hook in the factory.Controller interface returned by NewPacemakerLifecycleManager. The typical pattern is to cancel the context when the parent context (passed to controller.Run(ctx, 1) in starter.go:259) is cancelled.
However, since event handlers need the lifecycle context to survive across sync cycles, you should either:
Option 1: Store a reference to the controller and add a Shutdown() method:
func NewPacemakerLifecycleManager(...) (factory.Controller, *PacemakerLifecycleManager, cache.SharedIndexInformer, error) {
// ... existing code ...
controller := factory.New().
WithSyncContext(syncCtx).
ResyncEvery(pacemaker.HealthCheckResyncInterval).
WithSync(c.sync).
WithInformers(
operatorClient.Informer(),
informer,
nodeInformer,
).ToController(controllerNamePacemakerLifecycle, syncCtx.Recorder())
+ // Wrap the sync function to cancel lifecycle context on shutdown
+ // This ensures event handler goroutines are cleaned up
return controller, c, informer, nil
}
+// Shutdown cancels the lifecycle context, stopping all event handler goroutines
+func (c *PacemakerLifecycleManager) Shutdown() {
+ c.lifecycleCtxCancel()
+}Then call Shutdown() in starter.go after the controller stops, or better yet, derive lifecycleCtx from the controller context passed to Run().
Option 2 (simpler): Change event handlers to use the sync context instead of a separate lifecycle context, and pass the context from the caller:
- lifecycleCtx, lifecycleCtxCancel := context.WithCancel(context.Background())
-
c := &PacemakerLifecycleManager{
// ... existing fields ...
- lifecycleCtx: lifecycleCtx,
- lifecycleCtxCancel: lifecycleCtxCancel,
}And update event handlers to use context.Background() or a context derived from the controller's Run context.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/operator/lifecycle_manager.go` around lines 130 - 145,
PacemakerLifecycleManager currently creates lifecycleCtx and lifecycleCtxCancel
in NewPacemakerLifecycleManager but never calls lifecycleCtxCancel, leaking
goroutines spawned by the event handlers; fix by making the lifecycle context
tied to the controller Run context or by exposing a Shutdown that calls
lifecycleCtxCancel: either (A) change NewPacemakerLifecycleManager to not create
an independent background context and instead set lifecycleCtx = ctx passed into
Run (or derive with context.WithCancel(ctx)) so cancellation propagates when
controller.Run(ctx, ...) exits, or (B) add a Shutdown() method on
PacemakerLifecycleManager that calls lifecycleCtxCancel and ensure callers
(e.g., where controller is started) invoke Shutdown after controller stops;
update event handler usages that read lifecycleCtx to use the manager's
lifecycleCtx so they are cancelled properly.
Source: Coding guidelines
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/tnf-lifecycle-architecture.md (1)
589-611: ⚡ Quick winDocument the Node-exists gate before
updateSetup.The reconciliation flow still reads like “any drift →
updateSetup”, but the PR’s central rule is to skip replacement when pacemaker marks a node offline and the Kubernetes Node still exists. Please call that branch out here so the doc matches the implemented behavior.♻️ Suggested doc patch
+ Pacemaker node offline but Kubernetes Node still exists? + ├── Yes → skip replacement / return nil + └── No → continue to updateSetup()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tnf-lifecycle-architecture.md` around lines 589 - 611, The diagram and text currently imply any detected drift leads to updateSetup, but the implementation skips replacement when Pacemaker marks a node offline while the Kubernetes Node still exists; update the flow to insert a "Node-exists gate" before calling updateSetup: after detectDrift(...) and isUpdateSetupRunning(...) pass, call getIntersection(k8sNodes, pacemakerNodes) and if the intersection is non-empty proceed to updateSetup(), otherwise return/skip; update the diagram and accompanying prose to explicitly name this gate and reference the functions detectDrift, isUpdateSetupRunning, getIntersection, and updateSetup so the doc matches the code behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/operator/lifecycle_manager_test.go`:
- Line 129: The test currently ignores the error returned by store.Add(obj);
update the test to handle that error by either asserting success with
require.NoError(t, store.Add(obj)) if you have access to t *testing.T, or by
capturing and failing on error (e.g., if err := store.Add(obj); err != nil {
t.Fatalf("store.Add failed: %v", err) }); locate the call to store.Add in
lifecycle_manager_test.go and replace the blank-assignment `_ = store.Add(obj)`
with one of these explicit error-handling patterns.
- Around line 1294-1297: The loop is ignoring the error return from
nodeIndexer.Add which can hide failures; update the loop that iterates
tt.k8sNodes to check the returned error from nodeIndexer.Add and fail the test
on error (e.g., if err := nodeIndexer.Add(node); err != nil { t.Fatalf("failed
to add node to indexer: %v", err) }) so that problems with
cache.MetaNamespaceKeyFunc or Add surface immediately; the change touches the
nodeIndexer.Add call in the test using the existing t testing.T instance.
---
Nitpick comments:
In `@docs/tnf-lifecycle-architecture.md`:
- Around line 589-611: The diagram and text currently imply any detected drift
leads to updateSetup, but the implementation skips replacement when Pacemaker
marks a node offline while the Kubernetes Node still exists; update the flow to
insert a "Node-exists gate" before calling updateSetup: after detectDrift(...)
and isUpdateSetupRunning(...) pass, call getIntersection(k8sNodes,
pacemakerNodes) and if the intersection is non-empty proceed to updateSetup(),
otherwise return/skip; update the diagram and accompanying prose to explicitly
name this gate and reference the functions detectDrift, isUpdateSetupRunning,
getIntersection, and updateSetup so the doc matches the code behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8077d43e-3cb9-4d1a-9a15-475ccd180238
📒 Files selected for processing (32)
bindata/tnfdeployment/role.yamldocs/tnf-lifecycle-architecture.mdpkg/operator/ceohelpers/common.gopkg/operator/ceohelpers/common_test.gopkg/operator/ceohelpers/node_helpers.gopkg/operator/ceohelpers/node_helpers_test.gopkg/tnf/auth/runner.gopkg/tnf/operator/lifecycle_cleanup.gopkg/tnf/operator/lifecycle_health.gopkg/tnf/operator/lifecycle_helpers.gopkg/tnf/operator/lifecycle_job_controllers.gopkg/tnf/operator/lifecycle_job_controllers_test.gopkg/tnf/operator/lifecycle_manager.gopkg/tnf/operator/lifecycle_manager_test.gopkg/tnf/operator/lifecycle_reconciliation.gopkg/tnf/operator/lifecycle_reconciliation_test.gopkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/operator/starter.gopkg/tnf/operator/starter_test.gopkg/tnf/pkg/jobs/tnf.gopkg/tnf/pkg/jobs/tnf_test.gopkg/tnf/pkg/pacemaker/client_helpers.gopkg/tnf/pkg/pacemaker/healthcheck.gopkg/tnf/pkg/pacemaker/healthcheck_test.gopkg/tnf/pkg/pacemaker/statuscollector.gopkg/tnf/pkg/pacemaker/statuscollector_test.gopkg/tnf/pkg/pacemaker/test_helpers.gopkg/tnf/pkg/pcs/auth.gopkg/tnf/pkg/tools/tnf_labels.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
💤 Files with no reviewable changes (3)
- pkg/tnf/pkg/pacemaker/healthcheck_test.go
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/operator/nodehandler_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/tnf/pkg/pacemaker/statuscollector_test.go
🚧 Files skipped from review as they are similar to previous changes (23)
- pkg/tnf/pkg/pacemaker/statuscollector.go
- pkg/tnf/pkg/pcs/auth.go
- pkg/tnf/pkg/pacemaker/client_helpers.go
- pkg/tnf/pkg/tools/tnf_labels.go
- pkg/tnf/operator/lifecycle_reconciliation_test.go
- pkg/operator/ceohelpers/node_helpers.go
- pkg/operator/ceohelpers/common_test.go
- pkg/tnf/update-setup/runner_test.go
- bindata/tnfdeployment/role.yaml
- pkg/tnf/operator/lifecycle_cleanup.go
- pkg/tnf/operator/starter_test.go
- pkg/tnf/operator/lifecycle_job_controllers.go
- pkg/tnf/pkg/pacemaker/test_helpers.go
- pkg/tnf/operator/lifecycle_manager.go
- pkg/operator/ceohelpers/node_helpers_test.go
- pkg/tnf/pkg/jobs/tnf.go
- pkg/tnf/pkg/jobs/tnf_test.go
- pkg/tnf/update-setup/runner.go
- pkg/tnf/operator/starter.go
- pkg/tnf/operator/lifecycle_helpers.go
- pkg/tnf/operator/lifecycle_job_controllers_test.go
- pkg/tnf/pkg/pacemaker/healthcheck.go
- pkg/tnf/operator/lifecycle_health.go
888e2f1 to
ff4e65d
Compare
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/pkg/jobs/tnf_test.go (1)
404-417:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDecouple controller-start expectation from
expectWaitForStopped.
expectWaitForStoppedis a path flag, but Line 412 uses it as an outcome gate. That makes timeout/delete-error cases assert controller startup even when the test expects an error path. Add an explicitexpectControllerStartedfield and assert against that directly.Proposed fix
tests := []struct { name string jobType tools.JobType nodeTarget *NodeTarget scheduleOnNode *string setupClient func() *fake.Clientset expectError bool errorContains string expectJobDeleted bool expectWaitForStopped bool + expectControllerStarted bool }{ { name: "Job does not exist - just runs controller", @@ - expectWaitForStopped: false, + expectWaitForStopped: false, + expectControllerStarted: true, }, @@ - if !tt.expectError || tt.expectWaitForStopped { + if tt.expectControllerStarted { runningControllersMutex.Lock() isRunning := runningControllers[jobName] runningControllersMutex.Unlock() require.True(t, isRunning, "Expected controller to be started") }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/jobs/tnf_test.go` around lines 404 - 417, The test currently gates the controller-start assertion on expectWaitForStopped which is a path flag, causing tests that expect errors to still assert controller startup; add a new boolean test field expectControllerStarted and replace the condition "if !tt.expectError || tt.expectWaitForStopped" with a check against tt.expectControllerStarted, then use runningControllersMutex, runningControllers[jobName], and require.True/False accordingly to assert whether the controller was started or not; update test cases to set expectControllerStarted where appropriate and keep jobName computed via tt.jobType.GetJobName(&tt.nodeTarget.Name) as before.
♻️ Duplicate comments (1)
pkg/tnf/pkg/pcs/auth.go (1)
49-53:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winShell interpolation is still unsafe here.
cfg.NodeName*andcfg.NodeIP*are still embedded directly into thepcs host authshell command, so the injection risk from the earlier review remains in both branches. Please switch this path to argv-style execution or validate/escape the node values before building the command. As per coding guidelines, shell input from trust boundaries must not be parsed by a shell.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/pcs/auth.go` around lines 49 - 53, The code builds a shell command string using untrusted cfg.NodeName1/2 and cfg.NodeIP1/2 (and TokenPath) which allows shell injection; replace both fmt.Sprintf branches in auth.go so you do not invoke a shell: construct an argv-style slice (e.g. args := []string{"host","auth", cfg.NodeName1, "addr="+cfg.NodeIP1, ...} including the second node conditionally and the "--token", TokenPath and "--debug" flags) and call exec.Command / exec.CommandContext with the command "/usr/sbin/pcs" plus that args slice, or alternatively strictly validate/escape node names and IPs against a whitelist/regex before using them; ensure you update the code paths handling both the single-node and two-node cases to use the argv-style execution and remove any fmt.Sprintf shell interpolation.Source: Coding guidelines
🧹 Nitpick comments (1)
docs/tnf-lifecycle-architecture.md (1)
9-53: 💤 Low valueAdd language tags to the fenced blocks.
The document uses unlabeled fences in several places, which is why markdownlint is reporting MD040 here. Tag the diagrams/snippets as
text(or a more specific language where applicable) so the new architecture doc stays lint-clean. Per the markdownlint warnings, the same fix applies to the rest of the file.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tnf-lifecycle-architecture.md` around lines 9 - 53, Unlabeled fenced code blocks cause markdownlint MD040; update each fenced block (e.g., the diagram describing NewPacemakerLifecycleManager() in lifecycle_manager.go) to include a language tag such as ```text (or a more specific language where appropriate) at the opening fence, and apply the same change to all other fenced blocks in docs/tnf-lifecycle-architecture.md so the file passes markdownlint checks.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/operator/lifecycle_job_controllers_test.go`:
- Around line 340-342: Replace the ignored error from nodeIndexer.Add in the
test setup with a failing assertion so fixture setup fails fast: change the loop
that currently does `_ = nodeIndexer.Add(node)` to call `require.NoError(t,
nodeIndexer.Add(node))` (or `require.NoErrorf` with context) for each node;
ensure `t` (testing.T) and the testify/require import are available in the test
file.
---
Outside diff comments:
In `@pkg/tnf/pkg/jobs/tnf_test.go`:
- Around line 404-417: The test currently gates the controller-start assertion
on expectWaitForStopped which is a path flag, causing tests that expect errors
to still assert controller startup; add a new boolean test field
expectControllerStarted and replace the condition "if !tt.expectError ||
tt.expectWaitForStopped" with a check against tt.expectControllerStarted, then
use runningControllersMutex, runningControllers[jobName], and require.True/False
accordingly to assert whether the controller was started or not; update test
cases to set expectControllerStarted where appropriate and keep jobName computed
via tt.jobType.GetJobName(&tt.nodeTarget.Name) as before.
---
Duplicate comments:
In `@pkg/tnf/pkg/pcs/auth.go`:
- Around line 49-53: The code builds a shell command string using untrusted
cfg.NodeName1/2 and cfg.NodeIP1/2 (and TokenPath) which allows shell injection;
replace both fmt.Sprintf branches in auth.go so you do not invoke a shell:
construct an argv-style slice (e.g. args := []string{"host","auth",
cfg.NodeName1, "addr="+cfg.NodeIP1, ...} including the second node conditionally
and the "--token", TokenPath and "--debug" flags) and call exec.Command /
exec.CommandContext with the command "/usr/sbin/pcs" plus that args slice, or
alternatively strictly validate/escape node names and IPs against a
whitelist/regex before using them; ensure you update the code paths handling
both the single-node and two-node cases to use the argv-style execution and
remove any fmt.Sprintf shell interpolation.
---
Nitpick comments:
In `@docs/tnf-lifecycle-architecture.md`:
- Around line 9-53: Unlabeled fenced code blocks cause markdownlint MD040;
update each fenced block (e.g., the diagram describing
NewPacemakerLifecycleManager() in lifecycle_manager.go) to include a language
tag such as ```text (or a more specific language where appropriate) at the
opening fence, and apply the same change to all other fenced blocks in
docs/tnf-lifecycle-architecture.md so the file passes markdownlint checks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e70b5b5c-0861-4649-b961-103af3804f9e
📒 Files selected for processing (32)
bindata/tnfdeployment/role.yamldocs/tnf-lifecycle-architecture.mdpkg/operator/ceohelpers/common.gopkg/operator/ceohelpers/common_test.gopkg/operator/ceohelpers/node_helpers.gopkg/operator/ceohelpers/node_helpers_test.gopkg/tnf/auth/runner.gopkg/tnf/operator/lifecycle_cleanup.gopkg/tnf/operator/lifecycle_health.gopkg/tnf/operator/lifecycle_helpers.gopkg/tnf/operator/lifecycle_job_controllers.gopkg/tnf/operator/lifecycle_job_controllers_test.gopkg/tnf/operator/lifecycle_manager.gopkg/tnf/operator/lifecycle_manager_test.gopkg/tnf/operator/lifecycle_reconciliation.gopkg/tnf/operator/lifecycle_reconciliation_test.gopkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/operator/starter.gopkg/tnf/operator/starter_test.gopkg/tnf/pkg/jobs/tnf.gopkg/tnf/pkg/jobs/tnf_test.gopkg/tnf/pkg/pacemaker/client_helpers.gopkg/tnf/pkg/pacemaker/healthcheck.gopkg/tnf/pkg/pacemaker/healthcheck_test.gopkg/tnf/pkg/pacemaker/statuscollector.gopkg/tnf/pkg/pacemaker/statuscollector_test.gopkg/tnf/pkg/pacemaker/test_helpers.gopkg/tnf/pkg/pcs/auth.gopkg/tnf/pkg/tools/tnf_labels.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
💤 Files with no reviewable changes (3)
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/pkg/pacemaker/healthcheck_test.go
- pkg/tnf/operator/nodehandler_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/tnf/auth/runner.go
🚧 Files skipped from review as they are similar to previous changes (20)
- pkg/tnf/pkg/pacemaker/statuscollector.go
- pkg/tnf/pkg/tools/tnf_labels.go
- pkg/tnf/pkg/pacemaker/statuscollector_test.go
- pkg/operator/ceohelpers/common.go
- pkg/tnf/update-setup/runner_test.go
- pkg/operator/ceohelpers/common_test.go
- bindata/tnfdeployment/role.yaml
- pkg/operator/ceohelpers/node_helpers_test.go
- pkg/tnf/pkg/pacemaker/client_helpers.go
- pkg/tnf/pkg/jobs/tnf.go
- pkg/operator/ceohelpers/node_helpers.go
- pkg/tnf/operator/lifecycle_manager.go
- pkg/tnf/operator/starter_test.go
- pkg/tnf/pkg/pacemaker/test_helpers.go
- pkg/tnf/update-setup/runner.go
- pkg/tnf/operator/lifecycle_cleanup.go
- pkg/tnf/operator/lifecycle_health.go
- pkg/tnf/operator/starter.go
- pkg/tnf/pkg/pacemaker/healthcheck.go
- pkg/tnf/operator/lifecycle_reconciliation.go
bf8e8df to
3f964fb
Compare
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/pkg/pacemaker/healthcheck.go (1)
192-217:⚠️ Potential issue | 🟠 Major | ⚡ Quick winTreat missing node health metadata as unhealthy, not healthy.
Line 194 drops nodes with no conditions, and Line 215 also short-circuits when
NodeHealthyis absent. That lets a partially populatedPacemakerClusterstatus come backStatusHealthy, which then clearsPacemakerHealthCheckDegradeddownstream inpkg/tnf/operator/lifecycle_health.go. Missing per-node conditions/summary data should surface as an error orStatusUnknown, not a healthy node.Suggested direction
func checkNodeConditions(node pacmkrv1.PacemakerClusterNodeStatus, status *HealthStatus) { conditions := node.Conditions if len(conditions) == 0 { - return + status.Errors = append(status.Errors, fmt.Sprintf(msgNodeUnhealthy, node.NodeName, "missing node conditions")) + return } @@ healthyCondition := FindCondition(conditions, pacmkrv1.NodeHealthyConditionType) - if healthyCondition == nil || healthyCondition.Status == metav1.ConditionTrue { + if healthyCondition == nil { + status.Errors = append(status.Errors, fmt.Sprintf(msgNodeUnhealthy, node.NodeName, "missing Healthy condition")) + return + } + if healthyCondition.Status == metav1.ConditionTrue { return // Node is healthy (except for warnings already captured above) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/pacemaker/healthcheck.go` around lines 192 - 217, The function checkNodeConditions currently treats missing per-node conditions as healthy; change it so missing or incomplete node condition metadata is considered unhealthy/unknown: in checkNodeConditions, when len(node.Conditions) == 0 append an error (e.g., fmt.Sprintf(msgNodeMissingConditions, node.NodeName) or similar) to status.Errors or set status to unknown and return, and likewise when FindCondition(..., pacmkrv1.NodeHealthyConditionType) returns nil do not short-circuit as healthy — append an error/warning for node.NodeName and proceed to report fencing warnings and other conditions; update references in the function (checkNodeConditions, node.Conditions, FindCondition, NodeHealthyConditionType, NodeOnlineConditionType, getFencingWarnings, HealthStatus) accordingly.
🧹 Nitpick comments (1)
docs/tnf-lifecycle-architecture.md (1)
9-9: ⚡ Quick winAdd language specifier to all fenced code blocks.
The document contains 19 fenced code blocks (ASCII art flow diagrams) without a language specifier. This triggers markdown linting warnings (MD040). Since these blocks contain plain-text ASCII diagrams rather than executable code, add the
textlanguage identifier to each code block.Example fix: Change
```to```texton each affected line.♻️ Example: Fix code block at line 30 and others
-``` +```text ┌─────────────────────────────────────────────────────────────┐ │ NewPacemakerLifecycleManager() │Apply the same fix to lines: 9, 30, 65, 122, 199, 216, 259, 363, 393, 440, 471, 527, 592, 672, 724, 803, 827, 882, 908, 919 (all fenced code blocks containing ASCII art).
Also applies to: 30-30, 65-65, 122-122, 199-199, 216-216, 259-259, 363-363, 393-393, 440-440, 471-471, 527-527, 592-592, 672-672, 724-724, 803-803, 827-827, 882-882, 908-908, 919-919
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tnf-lifecycle-architecture.md` at line 9, Update every fenced ASCII-art code block in tnf-lifecycle-architecture.md to include the text language specifier by changing the opening fence from ``` to ```text for each ASCII diagram block (occurrences around the blocks referenced at lines 9, 30, 65, 122, 199, 216, 259, 363, 393, 440, 471, 527, 592, 672, 724, 803, 827, 882, 908, 919); ensure you only modify the opening fence of each diagram block so the linter MD040 is satisfied while preserving the ASCII content and closing fences unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/operator/lifecycle_health.go`:
- Around line 74-79: The code currently conflates "last valid (non-Unknown)" and
"last observed" status which breaks recovery event detection; change the APIs
and state so you track both separately: introduce a previousObserved (last
snapshot regardless of Unknown) and previousValid (last non-StatusUnknown) and
return/pass both from getPacemakerStatus, then update callers like
recordHealthTransitionEvents to use previousValid for grace-period calculations
and previousObserved (or a boolean previousWasUnknown derived from it) for
detecting recovery-from-Unknown transitions; update any logic that updates
stored state (the variables updated after getPacemakerStatus) to set both
previousObserved and previousValid appropriately and ensure comparisons that
check StatusUnknown use the previousValid/previousObserved distinction instead
of the single previous variable.
In `@pkg/tnf/update-setup/runner.go`:
- Around line 130-138: Replace any logging that prints full topology, node IPs
or raw command stdout/stderr with redacted or debug-only output: modify the
klog.Infof calls around decodeNodeList and getNodeNames to log counts and node
IDs (e.g., len(capturedNodes) and node.Name) or masked values instead of
cm.Data["nodes"] or full getNodeNames output; similarly ensure places that log
pcs/etcdctl command lines and command output (the code paths that format and log
stdout/stderr) only emit sanitized summaries or are gated behind a debug flag.
Locate usages in functions/methods referencing decodeNodeList, getNodeNames, and
any direct klog.Infof/klog.Errorf calls that include command strings or cm.Data
entries and change them to emit non-sensitive summaries (counts, partial IDs, or
“REDACTED”) or use debug-level logging.
---
Outside diff comments:
In `@pkg/tnf/pkg/pacemaker/healthcheck.go`:
- Around line 192-217: The function checkNodeConditions currently treats missing
per-node conditions as healthy; change it so missing or incomplete node
condition metadata is considered unhealthy/unknown: in checkNodeConditions, when
len(node.Conditions) == 0 append an error (e.g.,
fmt.Sprintf(msgNodeMissingConditions, node.NodeName) or similar) to
status.Errors or set status to unknown and return, and likewise when
FindCondition(..., pacmkrv1.NodeHealthyConditionType) returns nil do not
short-circuit as healthy — append an error/warning for node.NodeName and proceed
to report fencing warnings and other conditions; update references in the
function (checkNodeConditions, node.Conditions, FindCondition,
NodeHealthyConditionType, NodeOnlineConditionType, getFencingWarnings,
HealthStatus) accordingly.
---
Nitpick comments:
In `@docs/tnf-lifecycle-architecture.md`:
- Line 9: Update every fenced ASCII-art code block in
tnf-lifecycle-architecture.md to include the text language specifier by changing
the opening fence from ``` to ```text for each ASCII diagram block (occurrences
around the blocks referenced at lines 9, 30, 65, 122, 199, 216, 259, 363, 393,
440, 471, 527, 592, 672, 724, 803, 827, 882, 908, 919); ensure you only modify
the opening fence of each diagram block so the linter MD040 is satisfied while
preserving the ASCII content and closing fences unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bbbdeff4-4a04-4bc1-af16-105ebe6d367a
📒 Files selected for processing (32)
bindata/tnfdeployment/role.yamldocs/tnf-lifecycle-architecture.mdpkg/operator/ceohelpers/common.gopkg/operator/ceohelpers/common_test.gopkg/operator/ceohelpers/node_helpers.gopkg/operator/ceohelpers/node_helpers_test.gopkg/tnf/auth/runner.gopkg/tnf/operator/lifecycle_cleanup.gopkg/tnf/operator/lifecycle_health.gopkg/tnf/operator/lifecycle_helpers.gopkg/tnf/operator/lifecycle_job_controllers.gopkg/tnf/operator/lifecycle_job_controllers_test.gopkg/tnf/operator/lifecycle_manager.gopkg/tnf/operator/lifecycle_manager_test.gopkg/tnf/operator/lifecycle_reconciliation.gopkg/tnf/operator/lifecycle_reconciliation_test.gopkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/operator/starter.gopkg/tnf/operator/starter_test.gopkg/tnf/pkg/jobs/tnf.gopkg/tnf/pkg/jobs/tnf_test.gopkg/tnf/pkg/pacemaker/client_helpers.gopkg/tnf/pkg/pacemaker/healthcheck.gopkg/tnf/pkg/pacemaker/healthcheck_test.gopkg/tnf/pkg/pacemaker/statuscollector.gopkg/tnf/pkg/pacemaker/statuscollector_test.gopkg/tnf/pkg/pacemaker/test_helpers.gopkg/tnf/pkg/pcs/auth.gopkg/tnf/pkg/tools/tnf_labels.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
💤 Files with no reviewable changes (3)
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/pkg/pacemaker/healthcheck_test.go
- pkg/tnf/operator/nodehandler_test.go
🚧 Files skipped from review as they are similar to previous changes (23)
- pkg/operator/ceohelpers/common_test.go
- pkg/operator/ceohelpers/common.go
- pkg/tnf/pkg/pacemaker/statuscollector.go
- pkg/tnf/pkg/pcs/auth.go
- pkg/tnf/pkg/pacemaker/statuscollector_test.go
- pkg/tnf/pkg/tools/tnf_labels.go
- pkg/tnf/auth/runner.go
- bindata/tnfdeployment/role.yaml
- pkg/operator/ceohelpers/node_helpers.go
- pkg/tnf/operator/lifecycle_manager.go
- pkg/tnf/operator/lifecycle_reconciliation_test.go
- pkg/tnf/operator/starter_test.go
- pkg/tnf/update-setup/runner_test.go
- pkg/tnf/pkg/jobs/tnf_test.go
- pkg/tnf/operator/lifecycle_job_controllers.go
- pkg/tnf/operator/lifecycle_helpers.go
- pkg/tnf/pkg/jobs/tnf.go
- pkg/tnf/pkg/pacemaker/test_helpers.go
- pkg/tnf/operator/lifecycle_cleanup.go
- pkg/tnf/operator/starter.go
- pkg/tnf/operator/lifecycle_job_controllers_test.go
- pkg/tnf/operator/lifecycle_manager_test.go
- pkg/tnf/operator/lifecycle_reconciliation.go
47dec5c to
0a1dee8
Compare
There was a problem hiding this comment.
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/tnf/pkg/jobs/tnf.go (1)
60-67:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
scheduleOnNodeis not reapplied once a cluster-wide controller is already running.
RunTNFJobControllerdeduplicates only by job name, so a later call with a differentscheduleOnNodeis skipped and the existing controller keeps its original scheduling behavior. That makes update-setup placement effectively “first call wins” for the process lifetime.This is especially risky when a controller was first started without a scheduling hint and later restart requests provide one.
Also applies to: 85-92, 161-162
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/jobs/tnf.go` around lines 60 - 67, The deduplication logic in RunTNFJobController only includes the job name in the controllerKey but ignores the scheduleOnNode parameter, causing later calls with different scheduling configurations to be skipped and the existing controller to retain its original scheduling behavior. Modify the controllerKey construction to include the scheduleOnNode value in addition to jobType and node information, so that controllers with different scheduling parameters are treated as distinct and not incorrectly deduplicated. This ensures that scheduling hints provided in later calls are properly respected rather than being ignored due to "first call wins" behavior.
🧹 Nitpick comments (1)
pkg/tnf/operator/lifecycle_manager_test.go (1)
1369-1374: ⚡ Quick winUse explicit UID labels in orphaned-job fixtures.
Most test nodes use
UID == Name, andcreateTNFJobWithNodeLabelLMwrites thenodelabel from a name-like parameter. Since this PR’s cleanup contract is UID-based, give fixtures distinct names and UIDs and pass the UID label explicitly so the tests catch job creation/cleanup label mismatches.Also applies to: 1410-1418
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/operator/lifecycle_manager_test.go` around lines 1369 - 1374, The test fixtures currently use identical values for node Name and UID (using name as UID for simplicity), and rely on name-based node labels in job creation. Since the cleanup contract is UID-based, refactor the fixtures to have distinct names and UIDs: in the createReadyNodeLM function (anchor location), replace the UID assignment to use a unique identifier different from the name, and similarly update the companion fixture at line range 1410-1418 with distinct name and UID values. Additionally, when calling createTNFJobWithNodeLabelLM, pass the actual UID value explicitly via the node label parameter instead of deriving it from the name parameter, ensuring that all test cases verify proper job creation and cleanup using UID-based matching rather than name-based fallbacks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/ceohelpers/node_helpers.go`:
- Around line 23-30: The ListNodesFromInformer function is returning all nodes
via labels.Everything() on line 29, but downstream logic expects only
control-plane nodes to be returned. This causes worker nodes to be incorrectly
counted as control-plane nodes, leading to false positive checks and skipped
pacemaker actions. Modify the label selector in the lister.List() call to filter
for control-plane nodes only by using a label selector that matches the
control-plane node role label instead of labels.Everything().
In `@pkg/tnf/operator/lifecycle_health.go`:
- Around line 139-160: The assignment to c.previous at line 159 happens before
the updateOperatorStatus call completes, causing skipped retries on status
update failures. Move the c.previous = currentStatus assignment to execute only
after updateOperatorStatus succeeds, ensuring that the timestamp is only marked
as processed when the side effect (status update) completes successfully. This
prevents the sync loop from skipping the same CR timestamp due to stale data in
c.previous when retrying after a failed updateOperatorStatus call.
- Around line 191-200: The logic in the StatusUnknown case prevents degradation
when previous is nil, but this causes missing or unpopulated PacemakerCluster
CRs to never degrade since previous remains nil forever. Modify the health
tracking to record the initial time when StatusUnknown is first observed, then
check if the grace period (StatusUnknownDegradedThreshold) has elapsed since
that time. Update the condition that currently suppresses degradation to allow
degradation once the grace period expires, ensuring that missing CRs follow the
same grace-period behavior as stale ones. The key change is adding a field to
track when Unknown status began being observed and comparing the current time
against that tracked time plus the threshold, rather than only relying on the
previous.CRLastUpdated check.
- Around line 416-424: The current condition for recording
PacemakerWarningsCleared only checks previousHadWarnings and
currentHasNoWarnings, but StatusUnknown carries no warnings, causing the event
to fire incorrectly during Warning → Unknown transitions when visibility is lost
rather than warnings being resolved. Add an additional guard condition to the if
statement to verify the current status is a known status (not Unknown) before
calling c.eventRecorder.Eventf with EventReasonWarningsCleared. This ensures
warnings cleared is only recorded when the status is actually known and
confirmed to have no warnings.
In `@pkg/tnf/operator/lifecycle_job_controllers.go`:
- Around line 75-90: The variable k8sNodes is populated from
ceohelpers.ListNodesFromInformer which returns all nodes in the cluster, not
just control-plane nodes. However, the subsequent validation logic (checking if
length is greater than 2 or less than 2) assumes k8sNodes contains only
control-plane nodes. Filter k8sNodes to extract only control-plane nodes before
applying the control-plane specific gating checks that compare the count to 2.
This affects the logic in the initial section near line 75-90 (anchor site in
lifecycle_job_controllers.go) and also at lines 115-121 (sibling site in the
same file where similar gating checks are applied).
In `@pkg/tnf/operator/lifecycle_manager_test.go`:
- Around line 503-507: The synthetic status objects in the transition-event
tests (errorStatus at lines 503-507, and similar healthy/warning/error status
objects at lines 523-529, 560-564, 800-804, 815-819, and 833-837) are missing
the CRLastUpdated field, which causes recordHealthTransitionEvents to treat the
previous timestamp as Unknown and record unwanted extra PacemakerHealthy events.
Add a recent fixed CRLastUpdated timestamp to each of these status objects using
the same fixed timestamp value across all locations in the transition tests.
This ensures the test assertions accurately reflect the expected behavior when
previous and current statuses are separated by a known time interval.
In `@pkg/tnf/operator/lifecycle_reconciliation_test.go`:
- Around line 126-127: The test at line 126 calls determineReconciliationActions
which does not exist; the correct exported function is
DetermineReconciliationActions from the tools package located in
pkg/tnf/pkg/tools/nodes.go. Add an import for the tools package and replace the
function call determineReconciliationActions with
tools.DetermineReconciliationActions. Verify that the parameters tt.k8sNodes and
tt.pacemakerNodes match the expected function signature of
DetermineReconciliationActions(desiredNodes, currentNodes map[string]string),
adjusting the parameter types if necessary to ensure they are map[string]string.
In `@pkg/tnf/operator/lifecycle_reconciliation.go`:
- Around line 23-26: The process-local counter updateSetupGeneration resets to 0
on operator restart, causing stale ConfigMaps with higher generation values from
before the restart to be selected during reconciliation. Replace the in-memory
counter with a restart-safe generation mechanism such as using timestamps or
deriving the generation from existing ConfigMap metadata (like modification
timestamps or actual generation fields) to ensure ordering remains consistent
across operator restarts and stale snapshots cannot be ranked higher than new
reconciliations.
- Around line 99-138: The code acquires the reconcilePacemakerConfigMutex lock
but proceeds with stale drift state that was checked before acquiring the lock,
creating a race condition where a waiting goroutine can trigger reconciliation
with outdated information after another goroutine has already reconciled. After
acquiring the reconcilePacemakerConfigMutex lock and before calling
updateSetupFunc, re-validate the drift condition (the comparison between
k8sNodes and pacemakerNodes) to ensure the detected drift is still valid and
hasn't been resolved by another concurrent reconciliation that completed while
this goroutine was waiting for the lock.
- Around line 81-86: The error handling in the getPacemakerNodes call is
suppressing all errors as if they are expected NotFound errors during initial
setup, but this masks actual API, RBAC, and read failures that should be
returned. Modify the error handling to check the error type: only skip
reconciliation and log at V(4) if the error is a NotFound error (indicating the
CR doesn't exist yet), otherwise return the error to allow proper error
propagation up the call stack. This ensures legitimate failures are not silently
suppressed while still allowing the initial setup case to proceed.
In `@pkg/tnf/pkg/pacemaker/healthcheck.go`:
- Around line 83-95: The doc comment for the BuildHealthStatusFromCR function
states it returns nil when the CR has no status populated, but the
implementation actually returns a non-nil HealthStatus with OverallStatus set to
StatusUnknown and an error message appended. Update the doc comment to
accurately reflect this behavior so that callers understand the function will
always return a non-nil HealthStatus, and the status will contain StatusUnknown
with error details when the CR is unpopulated, rather than expecting a nil
return value.
- Around line 354-363: The GetEventReasonForError function's pattern matching is
out of sync with actual error messages being emitted by the health check logic.
Update the errorPatternsByPriority slice to include patterns that match the
actual error messages now being generated, such as "PacemakerCluster CR not
found in cache" for missing-CR cases and patterns that correctly identify
resource failures versus node failures (for example, messages containing "node
is unhealthy" should map to node error reasons while messages about specific
resources should map to resource error reasons). Consider also adding patterns
with a stable resource marker in the error format to ensure filtering by
resource event reasons continues to work correctly, and verify the patterns work
by testing with actual error messages produced by BuildHealthStatusFromCR.
In `@pkg/tnf/pkg/pcs/auth.go`:
- Around line 49-53: The conditional check for two-node cluster authentication
only proceeds with two-node auth when both cfg.NodeName2 and cfg.NodeIP2 are
present, but silently degrades to single-node auth if only one is set, which
masks incomplete configuration. Add validation to ensure cfg.NodeName2 and
cfg.NodeIP2 are treated as an atomic pair: if either one is non-empty, both must
be non-empty, otherwise return an error indicating inconsistent cluster
configuration. Only fall back to single-node auth when both are empty, and
proceed with two-node auth only when both are populated.
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 21-22: Update the coverage note on lines 21-22 of runner_test.go
that references the now-deleted nodehandler_test.go file. Remove or correct the
stale reference to point readers to the appropriate current source of truth for
reconciliation logic testing, or update the comment to accurately reflect where
this testing now occurs in the refactored codebase.
In `@pkg/tnf/update-setup/runner.go`:
- Around line 333-336: The line accessing strings.Fields(value)[0] can panic if
the value is empty or contains only whitespace, since strings.Fields will return
an empty slice. Add a guard check after the strings.Fields call to verify the
resulting slice is not empty before attempting to access index [0]. If the slice
is empty, return an appropriate error instead of allowing the panic. This
ensures the runner handles malformed node_ip_map values gracefully rather than
crashing.
- Around line 173-179: The code currently skips nodes from the desiredNodes map
when GetNodeIPForPacemaker fails, which can cause incorrect pacemaker member
removals. Instead of continuing with the warning log in the loop over
capturedNodes, return an error from the containing function to abort
reconciliation when a desired-state node cannot be represented with a valid IP.
This ensures the reconciliation fails fast rather than silently dropping the
node from desiredNodes.
---
Outside diff comments:
In `@pkg/tnf/pkg/jobs/tnf.go`:
- Around line 60-67: The deduplication logic in RunTNFJobController only
includes the job name in the controllerKey but ignores the scheduleOnNode
parameter, causing later calls with different scheduling configurations to be
skipped and the existing controller to retain its original scheduling behavior.
Modify the controllerKey construction to include the scheduleOnNode value in
addition to jobType and node information, so that controllers with different
scheduling parameters are treated as distinct and not incorrectly deduplicated.
This ensures that scheduling hints provided in later calls are properly
respected rather than being ignored due to "first call wins" behavior.
---
Nitpick comments:
In `@pkg/tnf/operator/lifecycle_manager_test.go`:
- Around line 1369-1374: The test fixtures currently use identical values for
node Name and UID (using name as UID for simplicity), and rely on name-based
node labels in job creation. Since the cleanup contract is UID-based, refactor
the fixtures to have distinct names and UIDs: in the createReadyNodeLM function
(anchor location), replace the UID assignment to use a unique identifier
different from the name, and similarly update the companion fixture at line
range 1410-1418 with distinct name and UID values. Additionally, when calling
createTNFJobWithNodeLabelLM, pass the actual UID value explicitly via the node
label parameter instead of deriving it from the name parameter, ensuring that
all test cases verify proper job creation and cleanup using UID-based matching
rather than name-based fallbacks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0f349e5f-58c2-4b60-80f4-0cea35630dbf
📒 Files selected for processing (33)
bindata/tnfdeployment/role.yamldocs/tnf-lifecycle-architecture.mdpkg/operator/ceohelpers/common.gopkg/operator/ceohelpers/common_test.gopkg/operator/ceohelpers/node_helpers.gopkg/operator/ceohelpers/node_helpers_test.gopkg/tnf/auth/runner.gopkg/tnf/operator/lifecycle_cleanup.gopkg/tnf/operator/lifecycle_health.gopkg/tnf/operator/lifecycle_helpers.gopkg/tnf/operator/lifecycle_job_controllers.gopkg/tnf/operator/lifecycle_job_controllers_test.gopkg/tnf/operator/lifecycle_manager.gopkg/tnf/operator/lifecycle_manager_test.gopkg/tnf/operator/lifecycle_reconciliation.gopkg/tnf/operator/lifecycle_reconciliation_test.gopkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/operator/starter.gopkg/tnf/operator/starter_test.gopkg/tnf/pkg/jobs/tnf.gopkg/tnf/pkg/jobs/tnf_test.gopkg/tnf/pkg/pacemaker/client_helpers.gopkg/tnf/pkg/pacemaker/healthcheck.gopkg/tnf/pkg/pacemaker/healthcheck_test.gopkg/tnf/pkg/pacemaker/statuscollector.gopkg/tnf/pkg/pacemaker/statuscollector_test.gopkg/tnf/pkg/pacemaker/test_helpers.gopkg/tnf/pkg/pcs/auth.gopkg/tnf/pkg/tools/nodes.gopkg/tnf/pkg/tools/tnf_labels.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
💤 Files with no reviewable changes (3)
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/operator/nodehandler_test.go
- pkg/tnf/pkg/pacemaker/healthcheck_test.go
| func ListNodesFromInformer(informer cache.SharedIndexInformer) ([]*corev1.Node, error) { | ||
| if informer == nil { | ||
| return nil, fmt.Errorf("informer is nil") | ||
| } | ||
|
|
||
| lister := corev1listers.NewNodeLister(informer.GetIndexer()) | ||
| return lister.List(labels.Everything()) | ||
| } |
There was a problem hiding this comment.
ListNodesFromInformer is returning all nodes while downstream logic treats it as control-plane input.
Line 29 uses labels.Everything(). Downstream reconciliation/startup paths consume this result as “control-plane nodes”, so worker nodes can cause false >2 checks and skip pacemaker actions.
Suggested fix
func ListNodesFromInformer(informer cache.SharedIndexInformer) ([]*corev1.Node, error) {
if informer == nil {
return nil, fmt.Errorf("informer is nil")
}
- lister := corev1listers.NewNodeLister(informer.GetIndexer())
- return lister.List(labels.Everything())
+ lister := corev1listers.NewNodeLister(informer.GetIndexer())
+ selector := labels.SelectorFromSet(map[string]string{
+ ControlPlaneNodeLabelSelector: "",
+ })
+ return lister.List(selector)
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/operator/ceohelpers/node_helpers.go` around lines 23 - 30, The
ListNodesFromInformer function is returning all nodes via labels.Everything() on
line 29, but downstream logic expects only control-plane nodes to be returned.
This causes worker nodes to be incorrectly counted as control-plane nodes,
leading to false positive checks and skipped pacemaker actions. Modify the label
selector in the lister.List() call to filter for control-plane nodes only by
using a label selector that matches the control-plane node role label instead of
labels.Everything().
| // Check if lastUpdated timestamp has changed since last sync. | ||
| // If unchanged, skip processing to avoid redundant work on timer-triggered syncs. | ||
| // Use previous.CRLastUpdated for comparison (only set for non-Unknown status). | ||
| if previous != nil && !previous.CRLastUpdated.IsZero() && crLastUpdated.Equal(previous.CRLastUpdated) { | ||
| klog.V(4).Infof("Skipping sync: lastUpdated timestamp unchanged (%v)", crLastUpdated) | ||
| return nil, nil, nil | ||
| } | ||
|
|
||
| // Build health status from the CRD status fields | ||
| currentStatus := pacemaker.BuildHealthStatusFromCR(pacemakerCR) | ||
| currentStatus.CRLastUpdated = crLastUpdated | ||
|
|
||
| // Only update previous for non-Unknown status (preserves last valid for grace period) | ||
| // Concurrency safety: check timestamp when updating to prevent stale writes from overwriting fresh data. | ||
| // Two concurrent syncs can read c.previous, process different CR versions, then both try to update. | ||
| // We only update if currentStatus is fresher than what's already in c.previous. | ||
| if currentStatus.OverallStatus != pacemaker.StatusUnknown { | ||
| c.previousMu.Lock() | ||
| // Only update if we have fresher data (or c.previous is nil/uninitialized) | ||
| if c.previous == nil || c.previous.CRLastUpdated.IsZero() || currentStatus.CRLastUpdated.After(c.previous.CRLastUpdated) { | ||
| c.previous = currentStatus | ||
| } |
There was a problem hiding this comment.
Don’t mark this CR timestamp as processed before side effects succeed.
Line 159 advances c.previous before updateOperatorStatus runs. If that update fails, the sync loop retries, but Lines 142-145 then skip the same LastUpdated, so the degraded condition is never retried until the CR changes again. Commit the processed/previous timestamp only after the status update path succeeds, or track “last valid” separately from “last successfully processed.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/operator/lifecycle_health.go` around lines 139 - 160, The assignment
to c.previous at line 159 happens before the updateOperatorStatus call
completes, causing skipped retries on status update failures. Move the
c.previous = currentStatus assignment to execute only after updateOperatorStatus
succeeds, ensuring that the timestamp is only marked as processed when the side
effect (status update) completes successfully. This prevents the sync loop from
skipping the same CR timestamp due to stale data in c.previous when retrying
after a failed updateOperatorStatus call.
| // Record PacemakerWarningsCleared when warnings are resolved. | ||
| // This fires whenever the previous status had warnings and the current one doesn't, | ||
| // regardless of overall cluster health. Warnings are independent of error state. | ||
| // Examples: | ||
| // - Warning → Healthy: warnings resolved, cluster fully healthy | ||
| // - Error+Warning → Error: warnings resolved, but cluster still degraded for other reasons | ||
| // - Error+Warning → Healthy: warnings resolved AND cluster recovered (fires both events) | ||
| if previousHadWarnings && currentHasNoWarnings { | ||
| c.eventRecorder.Eventf(pacemaker.EventReasonWarningsCleared, msgPacemakerWarningsCleared) |
There was a problem hiding this comment.
Don’t emit “warnings cleared” while status is Unknown.
StatusUnknown carries no warnings, so a Warning → Unknown transition currently records PacemakerWarningsCleared even though the controller lost visibility rather than confirmed remediation. Gate this transition on a known current status.
✅ Proposed guard
- if previousHadWarnings && currentHasNoWarnings {
+ if current.OverallStatus != pacemaker.StatusUnknown && previousHadWarnings && currentHasNoWarnings {
c.eventRecorder.Eventf(pacemaker.EventReasonWarningsCleared, msgPacemakerWarningsCleared)
klog.Infof("Pacemaker cluster warnings cleared")
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/operator/lifecycle_health.go` around lines 416 - 424, The current
condition for recording PacemakerWarningsCleared only checks previousHadWarnings
and currentHasNoWarnings, but StatusUnknown carries no warnings, causing the
event to fire incorrectly during Warning → Unknown transitions when visibility
is lost rather than warnings being resolved. Add an additional guard condition
to the if statement to verify the current status is a known status (not Unknown)
before calling c.eventRecorder.Eventf with EventReasonWarningsCleared. This
ensures warnings cleared is only recorded when the status is actually known and
confirmed to have no warnings.
| // Get K8s control plane nodes | ||
| k8sNodes, err := ceohelpers.ListNodesFromInformer(c.nodeInformer) | ||
| if err != nil { | ||
| return fmt.Errorf("failed to list control plane nodes: %w", err) | ||
| } | ||
|
|
||
| // Before transition: require exactly 2 ready control plane nodes for initial setup | ||
| // After transition: require all control plane nodes ready (handles single-node case) | ||
| if !transitionComplete { | ||
| // Pacemaker only supports 2 control plane nodes - initial setup requires exactly 2 | ||
| if len(k8sNodes) > 2 { | ||
| klog.Warningf("Found %d control plane nodes - pacemaker only supports 2, skipping initial setup", len(k8sNodes)) | ||
| return nil | ||
| } | ||
| if len(k8sNodes) < 2 { | ||
| klog.V(4).Infof("Waiting for 2 control plane nodes for initial setup (current: %d)", len(k8sNodes)) |
There was a problem hiding this comment.
Control-plane gating is being applied to the full node set.
The startup logic assumes k8sNodes is control-plane-only, but the source helper returns all nodes. This can incorrectly skip startup (>2 nodes) or block startup on non-control-plane readiness.
Proposed fix
- // Get K8s control plane nodes
- k8sNodes, err := ceohelpers.ListNodesFromInformer(c.nodeInformer)
+ // Get K8s nodes and filter to control-plane only
+ allNodes, err := ceohelpers.ListNodesFromInformer(c.nodeInformer)
if err != nil {
return fmt.Errorf("failed to list control plane nodes: %w", err)
}
+ k8sNodes := make([]*corev1.Node, 0, len(allNodes))
+ for _, node := range allNodes {
+ if _, ok := node.Labels["node-role.kubernetes.io/control-plane"]; ok {
+ k8sNodes = append(k8sNodes, node)
+ continue
+ }
+ if _, ok := node.Labels["node-role.kubernetes.io/master"]; ok {
+ k8sNodes = append(k8sNodes, node)
+ }
+ }Also applies to: 115-121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/operator/lifecycle_job_controllers.go` around lines 75 - 90, The
variable k8sNodes is populated from ceohelpers.ListNodesFromInformer which
returns all nodes in the cluster, not just control-plane nodes. However, the
subsequent validation logic (checking if length is greater than 2 or less than
2) assumes k8sNodes contains only control-plane nodes. Filter k8sNodes to
extract only control-plane nodes before applying the control-plane specific
gating checks that compare the count to 2. This affects the logic in the initial
section near line 75-90 (anchor site in lifecycle_job_controllers.go) and also
at lines 115-121 (sibling site in the same file where similar gating checks are
applied).
| // GetEventReasonForError determines the appropriate event reason based on error content. | ||
| // Used for both deduplication (key) and event recording. | ||
| func getEventReasonForError(errorMsg string) string { | ||
| func GetEventReasonForError(errorMsg string) string { | ||
| for _, entry := range errorPatternsByPriority { | ||
| if strings.Contains(errorMsg, entry.pattern) { | ||
| return entry.reason | ||
| } | ||
| } | ||
| return EventReasonGenericError | ||
| } |
There was a problem hiding this comment.
Align event-reason matching with the errors now emitted.
GetEventReasonForError does not match current missing-CR text like PacemakerCluster CR not found in cache, and resource failures built as "<node> node is unhealthy: Kubelet has failed" are classified as node errors instead of resource errors. Add patterns or adjust the emitted messages so filtering by CR/resource event reasons keeps working.
🧭 Possible direction
// CR-related errors (check before more generic patterns)
{"Failed to get PacemakerCluster CR", EventReasonCRNotFound},
+ {"PacemakerCluster CR not found", EventReasonCRNotFound},
{"CR has no status", EventReasonCRNotFound},Also consider making resource summaries include a stable resource marker or adding tests that feed actual BuildHealthStatusFromCR errors through GetEventReasonForError.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/pkg/pacemaker/healthcheck.go` around lines 354 - 363, The
GetEventReasonForError function's pattern matching is out of sync with actual
error messages being emitted by the health check logic. Update the
errorPatternsByPriority slice to include patterns that match the actual error
messages now being generated, such as "PacemakerCluster CR not found in cache"
for missing-CR cases and patterns that correctly identify resource failures
versus node failures (for example, messages containing "node is unhealthy"
should map to node error reasons while messages about specific resources should
map to resource error reasons). Consider also adding patterns with a stable
resource marker in the error format to ensure filtering by resource event
reasons continues to work correctly, and verify the patterns work by testing
with actual error messages produced by BuildHealthStatusFromCR.
| if cfg.NodeName2 != "" && cfg.NodeIP2 != "" { | ||
| command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, cfg.NodeName2, cfg.NodeIP2, TokenPath) | ||
| } else { | ||
| command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, TokenPath) | ||
| } |
There was a problem hiding this comment.
Validate node-2 config as an atomic pair before falling back to single-node auth.
If only one of cfg.NodeName2 / cfg.NodeIP2 is populated, this path silently degrades to single-node auth instead of failing fast on inconsistent cluster config. That can mask upstream state corruption and run an incomplete auth flow.
Suggested fix
- if cfg.NodeName2 != "" && cfg.NodeIP2 != "" {
+ hasNode2Name := cfg.NodeName2 != ""
+ hasNode2IP := cfg.NodeIP2 != ""
+ if hasNode2Name != hasNode2IP {
+ return false, fmt.Errorf("incomplete node2 configuration: both NodeName2 and NodeIP2 must be set together")
+ }
+ if hasNode2Name {
command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, cfg.NodeName2, cfg.NodeIP2, TokenPath)
} else {
command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, TokenPath)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if cfg.NodeName2 != "" && cfg.NodeIP2 != "" { | |
| command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, cfg.NodeName2, cfg.NodeIP2, TokenPath) | |
| } else { | |
| command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, TokenPath) | |
| } | |
| hasNode2Name := cfg.NodeName2 != "" | |
| hasNode2IP := cfg.NodeIP2 != "" | |
| if hasNode2Name != hasNode2IP { | |
| return false, fmt.Errorf("incomplete node2 configuration: both NodeName2 and NodeIP2 must be set together") | |
| } | |
| if hasNode2Name { | |
| command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, cfg.NodeName2, cfg.NodeIP2, TokenPath) | |
| } else { | |
| command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, TokenPath) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/pkg/pcs/auth.go` around lines 49 - 53, The conditional check for
two-node cluster authentication only proceeds with two-node auth when both
cfg.NodeName2 and cfg.NodeIP2 are present, but silently degrades to single-node
auth if only one is set, which masks incomplete configuration. Add validation to
ensure cfg.NodeName2 and cfg.NodeIP2 are treated as an atomic pair: if either
one is non-empty, both must be non-empty, otherwise return an error indicating
inconsistent cluster configuration. Only fall back to single-node auth when both
are empty, and proceed with two-node auth only when both are populated.
| - Reconciliation logic: tested in nodehandler_test.go (single source of truth) | ||
| - Helper functions (decodeStringList): implicitly tested through business logic |
There was a problem hiding this comment.
Update stale coverage note that still references nodehandler_test.go.
That file was removed in this refactor, so this comment now points readers to a non-existent source of truth.
Suggested fix
-- Reconciliation logic: tested in nodehandler_test.go (single source of truth)
+- Reconciliation logic: tested in lifecycle reconciliation/update-setup test suites🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/update-setup/runner_test.go` around lines 21 - 22, Update the
coverage note on lines 21-22 of runner_test.go that references the now-deleted
nodehandler_test.go file. Remove or correct the stale reference to point readers
to the appropriate current source of truth for reconciliation logic testing, or
update the comment to accurately reflect where this testing now occurs in the
refactored codebase.
| for _, node := range capturedNodes { | ||
| ip, err := tools.GetNodeIPForPacemaker(node) | ||
| if err != nil { | ||
| klog.Warningf("Skipping node %s from desired state (no valid IP): %v", node.Name, err) | ||
| continue | ||
| } | ||
| desiredNodes[node.Name] = ip |
There was a problem hiding this comment.
Fail fast when a desired node has no usable IP instead of silently dropping it.
Skipping a captured node here shrinks desiredNodes, which can incorrectly drive remove actions for still-valid pacemaker members. This path should return an error and abort reconciliation when a desired-state node cannot be represented.
Suggested fix
desiredNodes := make(map[string]string) // nodeName -> IP
for _, node := range capturedNodes {
ip, err := tools.GetNodeIPForPacemaker(node)
if err != nil {
- klog.Warningf("Skipping node %s from desired state (no valid IP): %v", node.Name, err)
- continue
+ return fmt.Errorf("failed to build desired state for node %q: %w", node.Name, err)
}
desiredNodes[node.Name] = ip
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/update-setup/runner.go` around lines 173 - 179, The code currently
skips nodes from the desiredNodes map when GetNodeIPForPacemaker fails, which
can cause incorrect pacemaker member removals. Instead of continuing with the
warning log in the loop over capturedNodes, return an error from the containing
function to abort reconciliation when a desired-state node cannot be represented
with a valid IP. This ensures the reconciliation fails fast rather than silently
dropping the node from desiredNodes.
| value := strings.TrimPrefix(parts[1], "\"") | ||
| value = strings.SplitN(value, "\"", 2)[0] // Get text before closing quote | ||
| value = strings.Fields(value)[0] // Get first word (handles unquoted case) | ||
| return value, nil |
There was a problem hiding this comment.
Guard against empty node_ip_map before indexing parsed fields.
strings.Fields(value)[0] can panic on empty values (e.g., node_ip_map=""). This should return a normal error instead of crashing the runner.
Suggested fix
value := strings.TrimPrefix(parts[1], "\"")
value = strings.SplitN(value, "\"", 2)[0] // Get text before closing quote
- value = strings.Fields(value)[0] // Get first word (handles unquoted case)
- return value, nil
+ fields := strings.Fields(value)
+ if len(fields) == 0 {
+ return "", fmt.Errorf("node_ip_map parameter is empty")
+ }
+ return fields[0], nil
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/update-setup/runner.go` around lines 333 - 336, The line accessing
strings.Fields(value)[0] can panic if the value is empty or contains only
whitespace, since strings.Fields will return an empty slice. Add a guard check
after the strings.Fields call to verify the resulting slice is not empty before
attempting to access index [0]. If the slice is empty, return an appropriate
error instead of allowing the panic. This ensures the runner handles malformed
node_ip_map values gracefully rather than crashing.
9722e18 to
12cff1f
Compare
Implements the foundational framework for multi-node job scheduling with round-robin retry logic and drift detection: - Multi-node job support with NodeTarget struct for node-specific jobs - Round-robin retry logic with state tracking (AttemptNumber, NodeIndex, LastJobConfig) - Job config drift detection (resets retry state when config changes) - Degradation when retry attempts exhausted - IsControllerRunning() helper to prevent duplicate controllers - Shared node utility functions (GetNodeNames, DetermineReconciliationActions, IsNodeReady) - Job type labels for filtering and cleanup See docs/tnf/job-controllers.md for detailed multi-node retry behavior and config drift examples. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements job controller startup with dual-mode behavior (bootstrap vs runtime) and individual job implementations: - StartJobControllers() with bootstrap vs runtime separation - Bootstrap mode: exponential backoff retry, requires exactly 2 ready nodes, sets degraded condition - Runtime mode: idempotent restart support, handles 1-2 nodes, fast path when setup job complete - startTnfJobcontrollers(): waits for etcd bootstrap, stable revision, creates per-node and cluster-wide controllers - Auth job implementation: sets hacluster password (required before Pacemaker operations) - After-setup job implementation: * Pre-transition: waits for setup job completion * Post-transition: waits for update-setup job containing this node * Disables kubelet service (Pacemaker owns it now) * Prevents kubelet disable race during Day 2 node addition - Update-setup controller check: on operator restart, starts controller for stopped update-setup job See docs/tnf/job-controllers.md for dual-mode operation and bootstrap sequence. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements drift detection and reconciliation for Pacemaker cluster membership: - ReconcilePacemakerConfig(): unified entry point for drift reconciliation - Detects membership drift: compares K8s nodes vs Pacemaker nodes - Detects IP changes: compares node IPs to detect node replacements - ConfigMap generation tracking: prevents re-running same update-setup job - Target node selection: intersection of K8s and Pacemaker nodes (update-setup must run on active Pacemaker node) - ConfigMap reuse: when desired state unchanged, reuses existing ConfigMap generation - Transition protection: only runs after HasExternalEtcdCompletedTransition + setup job complete - Update-setup implementation: adds/removes nodes, configures fencing, updates etcd resource - Setup job implementation: waits for auth jobs before Pacemaker operations Event handlers: - Node Add/Ready: both call ReconcilePacemakerConfig (fixes race) - Node Update: detects IP changes - Node Delete: triggers reconciliation See docs/tnf/reconciliation.md for drift detection scenarios and reconciliation workflow. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements the PacemakerLifecycleManager controller for continuous monitoring and maintenance: - PacemakerLifecycleManager: main controller with 30s sync loop - Event handlers: Node Add/Update/Delete, PacemakerCluster Update - Health monitoring: * Tracks PacemakerCluster CR staleness (grace period based on last transition time) * Detects stale status from status collector failures * Records events for health transitions (Healthy→Warning, Warning→Error, etc.) * Prevents duplicate event spam with deduplication - Orphaned job cleanup: * Detects jobs for deleted nodes (node UID no longer exists) * Deletes jobs when node permanently removed * Logs cleanup actions - Status collector CronJob: * Creates/updates CronJob for collecting Pacemaker status * Runs every 30s on control plane nodes * Updates PacemakerCluster CR with health status - Replaces nodehandler.go with lifecycle-based approach See docs/tnf/controller-lifecycle.md for sync loop, event handling, and health monitoring details. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds comprehensive architecture documentation and makes alert configuration non-fatal: Documentation: - docs/tnf/README.md: Quick start guide, component overview, workflow summary - docs/tnf/architecture-overview.md: Detailed architecture, dual-mode operation, job types table Alert configuration fix: - setup/update-setup jobs now log warnings instead of failing when alert configuration fails - Allows Pacemaker setup to succeed even if alerts can't be configured - Provides clear troubleshooting guidance in warning messages See docs/tnf/architecture-overview.md for canonical Job Types table and system architecture. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
12cff1f to
04fcb8d
Compare
There was a problem hiding this comment.
Actionable comments posted: 15
🧹 Nitpick comments (1)
pkg/tnf/pkg/tools/nodes.go (1)
80-100: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winSort reconciliation actions before returning.
desiredNodesandcurrentNodesare maps, sonodesToRemove/nodesToAddordering can change between runs. Sort both slices before returning to keep reconciliation command order and tests deterministic.Proposed fix
func DetermineReconciliationActions(desiredNodes, currentNodes map[string]string) (nodesToRemove, nodesToAdd []string) { @@ - return nodesToRemove, nodesToAdd + sort.Strings(nodesToRemove) + sort.Strings(nodesToAdd) + return nodesToRemove, nodesToAdd }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/tools/nodes.go` around lines 80 - 100, Sort the reconciliation results before returning from the node diff logic in the function that builds nodesToRemove and nodesToAdd, because iterating over currentNodes and desiredNodes maps produces nondeterministic ordering. Update the return path in this nodes.go reconciliation helper to sort both slices after they are populated so the command order and tests stay deterministic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/tnf/architecture-overview.md`:
- Around line 315-318: Update the Node-specific jobs section to match the
UID-based tracking model: replace the node-name labeling language with UID-based
references, and make the cleanup description consistent with job-controller
behavior. Use the same terminology as the job-controller docs and related
node-job handling so the architecture overview no longer implies node-scoped
cleanup by name.
In `@docs/tnf/job-controllers.md`:
- Around line 88-96: The json.Marshal example in fencingJobConfigFunc has a
return type mismatch because it directly returns []byte, error from json.Marshal
while the function expects string, error. Update the example to marshal config
into a byte slice, handle the error explicitly, and return the marshaled payload
as a string with nil error on success; use fencingJobConfigFunc and json.Marshal
as the key symbols to locate the snippet.
In `@docs/tnf/reconciliation.md`:
- Around line 280-284: The job flow in the reconciliation diagram still shows
fencing inside the update setup path, but fencing is now handled by the separate
fencing job. Remove this “Update Fencing” block from the update-setup flow and
place it under the fencing job section in the docs/reconciliation content so the
flow matches the runner behavior.
- Around line 140-148: The drift example in the reconciliation docs uses a
3-node Kubernetes cluster, but the current reconciliation flow only supports up
to two nodes. Update the example in the scenario section to use a supported
one-to-two-node case, or explicitly label the 3-node case as future work; keep
the guidance aligned with the reconciliation behavior described by the TNF
scenario text.
In `@manifests/0000_20_etcd-operator_06_deployment.yaml`:
- Line 59: The default deployment for the operator still hardcodes the verbose
logging flag, which should not be shipped cluster-wide. Remove the -v=4 entry
from the deployment manifest and, if needed, make verbose logging available only
through a temporary debug override or an explicit conditional flag in the
deployment configuration.
In `@pkg/tnf/after-setup/runner.go`:
- Around line 144-146: The ConfigMap lookup in after-setup is using the wrong
label selector, which can miss the update-setup ConfigMap and cause polling
timeouts. Update the selector in the ConfigMaps list call within runner.go to
match the same update-setup label used by the other readers, and keep it
consistent with the update-setup producer’s component label rather than the job
name label.
- Around line 195-208: The update-setup wait logic in runner.go is matching any
completed Job with the label, which can let an older generation satisfy the
check too early. Update the completion check in the after-setup loop to verify
the Job belongs to the selected generation by comparing its generation/config
hash against the chosen ConfigMap before returning true. Keep the change
localized around the updateSetupJobs list and the jobs.IsConditionTrue
completion path so only the intended generation can unblock.
In `@pkg/tnf/fencing/runner.go`:
- Around line 113-114: The info-level log in runner.go is exposing internal
hostnames by printing the full nodeNames list; update the logging around
ConfigureFencing to log only a count or summary at info level, and move the full
nodeNames details to a debug/redacted path using the existing fencing runner
context so sensitive hostnames are not emitted in normal logs.
- Around line 154-177: The waitForUpdateSetupCompletion helper currently treats
a missing or failed update-setup Job as success, which lets fencing continue too
early. Update the logic in waitForUpdateSetupCompletion so that only
jobs.IsComplete returns done=true, while apierrors.IsNotFound and jobs.IsFailed
should both return false with a non-nil error (or equivalent retry-triggering
failure) so reconciliation keeps waiting or retries instead of proceeding. Keep
the changes localized to updateSetupDone and preserve the existing logging
around kubeClient.BatchV1().Jobs(...).Get and jobs.IsComplete/jobs.IsFailed.
In `@pkg/tnf/operator/lifecycle_helpers.go`:
- Around line 49-55: Missing node addresses should make the Pacemaker CR look
unavailable, not produce a partial membership map. In the helper that builds the
node map from pacemakerCR.Status.Nodes, stop skipping entries with empty
Addresses and instead return an error (preferably a typed “CR
incomplete/unavailable” error) so ReconcilePacemakerConfig() and
getActivePacemakerNodes() do not treat incomplete status as authoritative state.
Keep the map-building logic in this helper focused on only fully populated node
entries, and let callers handle the error path rather than acting on a partial
snapshot.
In `@pkg/tnf/operator/lifecycle_reconciliation.go`:
- Around line 713-717: The desired-state snapshot in the node loop is being
truncated when tools.GetNodeIPForPacemaker fails, because the code only logs a
warning and continues. Update the reconciliation path in
lifecycle_reconciliation.go so this failure is returned from the surrounding
reconciliation function instead of skipping the node, using the existing node
iteration logic around tools.GetNodeIPForPacemaker and the ConfigMap assembly to
abort and retry on error. Keep the node list complete for the snapshot rather
than writing a partial membership.
In `@pkg/tnf/pkg/jobs/tnf.go`:
- Around line 133-135: The retry logic in the TNF job flow is treating
target-node callback order as semantic, which can cause state resets or
incorrect node indexing when the same nodes arrive in a different order.
Canonicalize the target node list by sorting a copy by node name before calling
GetNodeNames and before using targetNodes[nodeIndex]. Apply the same ordering
normalization anywhere the retry state is compared or reconstructed so the logic
in the TNF job path remains stable across retries.
- Around line 153-155: The config-drift log in the job reset path is exposing
raw serialized job config via the klog.Infof call in the configChanged branch.
Update the logging in the job handling flow around the
JobConfigFunc/state.LastJobConfig comparison so it only reports that drift was
detected, or emits a non-sensitive fingerprint/digest instead of Old/New config
values. Keep the reset/delete behavior unchanged, but remove any direct logging
of serialized config contents.
In `@pkg/tnf/update-setup/runner.go`:
- Around line 370-382: The removePacemakerNodeGlobally flow is incorrectly
treating a failed pacemaker removal as success when
pcsClusterNodeRemoveOutputContains401 matches, allowing later etcd cleanup and
node_ip_map mutation to proceed unsafely. Update removePacemakerNodeGlobally to
return a retryable or explicit partial-failure error instead of nil in the 401
path, and have the caller that performs etcd member cleanup and node_ip_map
updates stop before mutating state unless pacemaker removal truly succeeded.
- Around line 510-539: decodeNodeList currently trusts ConfigMap-provided name
and ip fields, which later flow into shell command strings and node_ip_map
usage. Update decodeNodeList to validate each nodeInfo before constructing
corev1.Node: reject empty or non-allow-listed names and reject invalid IP values
at decode time. Keep the validation localized in decodeNodeList so callers like
the cluster add/remove flow never receive unsafe values.
---
Nitpick comments:
In `@pkg/tnf/pkg/tools/nodes.go`:
- Around line 80-100: Sort the reconciliation results before returning from the
node diff logic in the function that builds nodesToRemove and nodesToAdd,
because iterating over currentNodes and desiredNodes maps produces
nondeterministic ordering. Update the return path in this nodes.go
reconciliation helper to sort both slices after they are populated so the
command order and tests stay deterministic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 71eb8ee1-fd94-45cd-a1fe-81ca12dc8c3a
📒 Files selected for processing (45)
bindata/tnfdeployment/cronjob.yamlbindata/tnfdeployment/role.yamldocs/tnf/README.mddocs/tnf/architecture-overview.mddocs/tnf/controller-lifecycle.mddocs/tnf/job-controllers.mddocs/tnf/reconciliation.mdmanifests/0000_20_etcd-operator_06_deployment.yamlpkg/operator/ceohelpers/common.gopkg/operator/ceohelpers/common_test.gopkg/operator/ceohelpers/node_helpers.gopkg/operator/ceohelpers/node_helpers_test.gopkg/tnf/after-setup/runner.gopkg/tnf/auth/runner.gopkg/tnf/fencing/runner.gopkg/tnf/operator/lifecycle_cleanup.gopkg/tnf/operator/lifecycle_health.gopkg/tnf/operator/lifecycle_helpers.gopkg/tnf/operator/lifecycle_job_controllers.gopkg/tnf/operator/lifecycle_job_controllers_test.gopkg/tnf/operator/lifecycle_manager.gopkg/tnf/operator/lifecycle_manager_test.gopkg/tnf/operator/lifecycle_reconciliation.gopkg/tnf/operator/lifecycle_reconciliation_test.gopkg/tnf/operator/lifecycle_status_collector.gopkg/tnf/operator/nodehandler.gopkg/tnf/operator/nodehandler_test.gopkg/tnf/operator/starter.gopkg/tnf/operator/starter_test.gopkg/tnf/pkg/jobs/jobcontroller.gopkg/tnf/pkg/jobs/tnf.gopkg/tnf/pkg/jobs/tnf_test.gopkg/tnf/pkg/jobs/utils.gopkg/tnf/pkg/pacemaker/client_helpers.gopkg/tnf/pkg/pacemaker/healthcheck.gopkg/tnf/pkg/pacemaker/healthcheck_test.gopkg/tnf/pkg/pacemaker/statuscollector.gopkg/tnf/pkg/pacemaker/statuscollector_test.gopkg/tnf/pkg/pacemaker/test_helpers.gopkg/tnf/pkg/pcs/auth.gopkg/tnf/pkg/tools/jobs.gopkg/tnf/pkg/tools/nodes.gopkg/tnf/setup/runner.gopkg/tnf/update-setup/runner.gopkg/tnf/update-setup/runner_test.go
💤 Files with no reviewable changes (3)
- pkg/tnf/operator/nodehandler_test.go
- pkg/tnf/operator/nodehandler.go
- pkg/tnf/pkg/pacemaker/healthcheck_test.go
✅ Files skipped from review due to trivial changes (2)
- bindata/tnfdeployment/cronjob.yaml
- docs/tnf/README.md
🚧 Files skipped from review as they are similar to previous changes (17)
- bindata/tnfdeployment/role.yaml
- pkg/tnf/operator/lifecycle_reconciliation_test.go
- pkg/tnf/auth/runner.go
- pkg/tnf/update-setup/runner_test.go
- pkg/tnf/pkg/pacemaker/client_helpers.go
- pkg/operator/ceohelpers/common.go
- pkg/operator/ceohelpers/common_test.go
- pkg/tnf/pkg/pacemaker/statuscollector_test.go
- pkg/tnf/operator/lifecycle_job_controllers_test.go
- pkg/tnf/operator/starter_test.go
- pkg/tnf/pkg/pcs/auth.go
- pkg/tnf/operator/lifecycle_manager.go
- pkg/tnf/pkg/jobs/tnf_test.go
- pkg/tnf/operator/starter.go
- pkg/tnf/pkg/pacemaker/test_helpers.go
- pkg/tnf/pkg/pacemaker/healthcheck.go
- pkg/tnf/operator/lifecycle_manager_test.go
| **Node-specific jobs** (auth, after-setup): | ||
| - Labeled with `node: <node-name>` | ||
| - Tied to node lifecycle via UID | ||
| - Cleaned up when node is deleted |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Align this section with the UID-based job tracking change.
This still says node-scoped jobs are labeled by node name, but the PR objective and the later job-controller doc describe UID-based tracking/cleanup. That inconsistency will mislead anyone debugging orphan cleanup behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/tnf/architecture-overview.md` around lines 315 - 318, Update the
Node-specific jobs section to match the UID-based tracking model: replace the
node-name labeling language with UID-based references, and make the cleanup
description consistent with job-controller behavior. Use the same terminology as
the job-controller docs and related node-job handling so the architecture
overview no longer implies node-scoped cleanup by name.
| ```go | ||
| // jobConfigFunc captures node UIDs and fencing secret UIDs to detect infrastructure changes | ||
| fencingJobConfigFunc := func() (string, error) { | ||
| config := map[string]interface{}{ | ||
| "nodeUIDs": []string{node1.UID, node2.UID}, // Detect node replacements | ||
| "secretUIDs": []string{secret1.UID, secret2.UID}, // Detect secret rotation | ||
| } | ||
| return json.Marshal(config) | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🌐 Web query:
What is the Go standard library signature of encoding/json.Marshal, and what is the idiomatic way to return its result from a function with signature func() (string, error)?
💡 Result:
The standard library signature for encoding/json.Marshal is: func Marshal(v any) ([]byte, error) [1][2] The idiomatic way to return its result from a function with signature func (string, error) is to perform the marshalling, check the error immediately, and then convert the resulting byte slice to a string if successful [3][4]. Here is the idiomatic implementation pattern: func GetJSONString (string, error) { bytes, err:= json.Marshal(someData) if err!= nil { return "", err } return string(bytes), nil } In Go, it is standard practice to separate the declaration of the assignment (bytes, err:=...) and the error check (if err!= nil) [3]. This approach is clear, readable, and consistent with how errors are handled throughout the Go ecosystem [3][4].
Citations:
- 1: https://pkg.go.dev/encoding/json@go1.26.4
- 2: https://tip.golang.org/pkg/json
- 3: https://stackoverflow.com/questions/46024818/idiomatic-go-happy-path
- 4: https://jsonic.io/guides/json-marshal-unmarshal-go
Fix the json.Marshal example. json.Marshal returns ([]byte, error), so return json.Marshal(config) does not match func() (string, error). Marshal into a byte slice, check the error, then return string(payload), nil.
Suggested fix
payload, err := json.Marshal(config)
if err != nil {
return "", err
}
return string(payload), nil📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ```go | |
| // jobConfigFunc captures node UIDs and fencing secret UIDs to detect infrastructure changes | |
| fencingJobConfigFunc := func() (string, error) { | |
| config := map[string]interface{}{ | |
| "nodeUIDs": []string{node1.UID, node2.UID}, // Detect node replacements | |
| "secretUIDs": []string{secret1.UID, secret2.UID}, // Detect secret rotation | |
| } | |
| return json.Marshal(config) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/tnf/job-controllers.md` around lines 88 - 96, The json.Marshal example
in fencingJobConfigFunc has a return type mismatch because it directly returns
[]byte, error from json.Marshal while the function expects string, error. Update
the example to marshal config into a byte slice, handle the error explicitly,
and return the marshaled payload as a string with nil error on success; use
fencingJobConfigFunc and json.Marshal as the key symbols to locate the snippet.
| **Scenario 1: Node Added to Kubernetes** | ||
|
|
||
| ```text | ||
| Kubernetes: [master-0, master-1, master-2] | ||
| Pacemaker: [master-0, master-1] | ||
|
|
||
| Drift: YES - master-2 exists in K8s but not Pacemaker | ||
| Action: Add master-2 to Pacemaker cluster | ||
| ``` |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Keep the drift example within supported TNF node counts.
This scenario shows adding master-2, but the implementation rejects clusters with more than two nodes. Use a one-to-two-node example or call out that 3-node reconciliation is future work.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/tnf/reconciliation.md` around lines 140 - 148, The drift example in the
reconciliation docs uses a 3-node Kubernetes cluster, but the current
reconciliation flow only supports up to two nodes. Update the example in the
scenario section to use a supported one-to-two-node case, or explicitly label
the 3-node case as future work; keep the guidance aligned with the
reconciliation behavior described by the TNF scenario text.
| ┌────────────────────────────────┐ | ||
| │ Update Fencing: │ | ||
| │ - Update STONITH configuration │ | ||
| │ - Add/remove fence agents │ | ||
| └────────────────────────────────┘ |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Update the job flow to show fencing as a separate job.
The runner now explicitly says fencing is handled by the fencing job, not inline in update-setup. This block should be removed or moved under the fencing job section.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/tnf/reconciliation.md` around lines 280 - 284, The job flow in the
reconciliation diagram still shows fencing inside the update setup path, but
fencing is now handled by the separate fencing job. Remove this “Update Fencing”
block from the update-setup flow and place it under the fencing job section in
the docs/reconciliation content so the flow matches the runner behavior.
| - "--terminate-on-files=/var/run/secrets/etcd-client/tls.key" | ||
| - "--terminate-on-files=/var/run/configmaps/etcd-ca/ca-bundle.crt" | ||
| - "--terminate-on-files=/var/run/configmaps/etcd-service-ca/service-ca.crt" | ||
| - "-v=4" |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Avoid shipping -v=4 in the default deployment.
This enables verbose operator logging cluster-wide, and the new TNF paths emit control-plane node and reconciliation details at V(4). That widens internal-hostname exposure in aggregated logs and increases log volume for every cluster. Prefer a temporary debug override or a conditional debug flag instead.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@manifests/0000_20_etcd-operator_06_deployment.yaml` at line 59, The default
deployment for the operator still hardcodes the verbose logging flag, which
should not be shipped cluster-wide. Remove the -v=4 entry from the deployment
manifest and, if needed, make verbose logging available only through a temporary
debug override or an explicit conditional flag in the deployment configuration.
| ip, err := tools.GetNodeIPForPacemaker(*node) | ||
| if err != nil { | ||
| klog.Warningf("Failed to get IP for node %s: %v - skipping from ConfigMap", node.Name, err) | ||
| continue | ||
| } |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win
Don’t silently drop nodes from the desired-state snapshot.
If tools.GetNodeIPForPacemaker fails here, the node is omitted from the ConfigMap instead of aborting reconciliation. update-setup then sees a smaller desired membership and can reconfigure pacemaker down to one node — exactly the split-brain path this PR is trying to prevent during reboot windows. Bubble the error up and retry instead of encoding a partial node list.
Suggested fix
for _, node := range nodes {
ip, err := tools.GetNodeIPForPacemaker(*node)
if err != nil {
- klog.Warningf("Failed to get IP for node %s: %v - skipping from ConfigMap", node.Name, err)
- continue
+ return "", fmt.Errorf("failed to get IP for node %s: %w", node.Name, err)
}
infos = append(infos, nodeInfo{
Name: node.Name,
IP: ip,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ip, err := tools.GetNodeIPForPacemaker(*node) | |
| if err != nil { | |
| klog.Warningf("Failed to get IP for node %s: %v - skipping from ConfigMap", node.Name, err) | |
| continue | |
| } | |
| ip, err := tools.GetNodeIPForPacemaker(*node) | |
| if err != nil { | |
| return "", fmt.Errorf("failed to get IP for node %s: %w", node.Name, err) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/operator/lifecycle_reconciliation.go` around lines 713 - 717, The
desired-state snapshot in the node loop is being truncated when
tools.GetNodeIPForPacemaker fails, because the code only logs a warning and
continues. Update the reconciliation path in lifecycle_reconciliation.go so this
failure is returned from the surrounding reconciliation function instead of
skipping the node, using the existing node iteration logic around
tools.GetNodeIPForPacemaker and the ConfigMap assembly to abort and retry on
error. Keep the node list complete for the snapshot rather than writing a
partial membership.
| currentTargetNodes := tools.GetNodeNames(targetNodes) | ||
| nodesChanged := !slicesEqual(state.TargetNodes, currentTargetNodes) | ||
|
|
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Canonicalize target-node order before comparing and indexing.
The retry state treats callback order as semantic. If targetNodesFunc returns the same nodes in a different order, this can reset/delete an in-flight job or schedule NodeIndex against a different node. Sort a copy of target nodes by name before GetNodeNames and before selecting targetNodes[nodeIndex].
Also applies to: 286-305, 310-320
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/pkg/jobs/tnf.go` around lines 133 - 135, The retry logic in the TNF
job flow is treating target-node callback order as semantic, which can cause
state resets or incorrect node indexing when the same nodes arrive in a
different order. Canonicalize the target node list by sorting a copy by node
name before calling GetNodeNames and before using targetNodes[nodeIndex]. Apply
the same ordering normalization anywhere the retry state is compared or
reconstructed so the logic in the TNF job path remains stable across retries.
| if configChanged { | ||
| klog.Infof("Job %s config changed - resetting retry state and deleting job. Old: %s, New: %s", | ||
| jobName, state.LastJobConfig, currentConfig) |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Avoid logging raw serialized job config.
JobConfigFunc is a generic serialized config surface; logging old/new values can expose node identifiers, secret identifiers, or future sensitive fields. Log only that drift was detected, or log a non-sensitive fingerprint.
Proposed fix
- klog.Infof("Job %s config changed - resetting retry state and deleting job. Old: %s, New: %s",
- jobName, state.LastJobConfig, currentConfig)
+ klog.Infof("Job %s config changed - resetting retry state and deleting job", jobName)As per coding guidelines, flag logging that may expose passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if configChanged { | |
| klog.Infof("Job %s config changed - resetting retry state and deleting job. Old: %s, New: %s", | |
| jobName, state.LastJobConfig, currentConfig) | |
| if configChanged { | |
| klog.Infof("Job %s config changed - resetting retry state and deleting job", jobName) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/pkg/jobs/tnf.go` around lines 153 - 155, The config-drift log in the
job reset path is exposing raw serialized job config via the klog.Infof call in
the configChanged branch. Update the logging in the job handling flow around the
JobConfigFunc/state.LastJobConfig comparison so it only reports that drift was
detected, or emits a non-sensitive fingerprint/digest instead of Old/New config
values. Keep the reset/delete behavior unchanged, but remove any direct logging
of serialized config contents.
Source: Coding guidelines
| func removePacemakerNodeGlobally(ctx context.Context, nodeToRemove string) error { | ||
| cmd := fmt.Sprintf("/usr/sbin/pcs cluster node remove %s --force --skip-offline", nodeToRemove) | ||
| stdOut, stdErr, err := exec.Execute(ctx, cmd) | ||
| if err != nil { | ||
| if pcsClusterNodeRemoveOutputContains401(stdOut, stdErr) { | ||
| klog.Warningf( | ||
| "pcs cluster node remove for %q did not succeed and output contained 401: "+ | ||
| "the cluster may still see a live endpoint for that member that rejects this node's PCSD credentials — "+ | ||
| "commonly a replacement node is answering at the same IP the cluster associates with the old member. "+ | ||
| "Skipping fatal error; reconcile sync/start and fencing may still help. stdout=%q stderr=%q err=%v", | ||
| nodeToRemove, stdOut, stdErr, err, | ||
| ) | ||
| return nil |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Do not treat a failed pacemaker removal as success.
Returning nil on 401 lets the caller continue with etcd member cleanup and node_ip_map updates even though the pacemaker member may still exist. Return a retryable error or an explicit partial-success signal before mutating etcd state.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/update-setup/runner.go` around lines 370 - 382, The
removePacemakerNodeGlobally flow is incorrectly treating a failed pacemaker
removal as success when pcsClusterNodeRemoveOutputContains401 matches, allowing
later etcd cleanup and node_ip_map mutation to proceed unsafely. Update
removePacemakerNodeGlobally to return a retryable or explicit partial-failure
error instead of nil in the 401 path, and have the caller that performs etcd
member cleanup and node_ip_map updates stop before mutating state unless
pacemaker removal truly succeeded.
| func decodeNodeList(data string) ([]*corev1.Node, error) { | ||
| type nodeInfo struct { | ||
| Name string `json:"name"` | ||
| IP string `json:"ip"` | ||
| } | ||
|
|
||
| var infos []nodeInfo | ||
| if err := json.Unmarshal([]byte(data), &infos); err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| nodes := make([]*corev1.Node, len(infos)) | ||
| for i, info := range infos { | ||
| // Create minimal node with name and IP address | ||
| nodes[i] = &corev1.Node{ | ||
| ObjectMeta: metav1.ObjectMeta{ | ||
| Name: info.Name, | ||
| }, | ||
| Status: corev1.NodeStatus{ | ||
| Addresses: []corev1.NodeAddress{ | ||
| { | ||
| Type: corev1.NodeInternalIP, | ||
| Address: info.IP, | ||
| }, | ||
| }, | ||
| }, | ||
| } | ||
| } | ||
|
|
||
| return nodes, nil |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | ⚡ Quick win
Validate ConfigMap node names and IPs before shell use.
decodeNodeList accepts arbitrary name / ip values, and those values later feed bash -c command strings such as pcs cluster node add/remove and node_ip_map. Reject empty or non-allow-listed node names and invalid IPs at decode time. As per path instructions, “Command: no shell=True ... with user input” and “Validate at trust boundaries with allow-lists, not deny-lists.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/tnf/update-setup/runner.go` around lines 510 - 539, decodeNodeList
currently trusts ConfigMap-provided name and ip fields, which later flow into
shell command strings and node_ip_map usage. Update decodeNodeList to validate
each nodeInfo before constructing corev1.Node: reject empty or non-allow-listed
names and reject invalid IP values at decode time. Keep the validation localized
in decodeNodeList so callers like the cluster add/remove flow never receive
unsafe values.
Source: Path instructions
OCPBUGS-84695: Prevent pacemaker split-brain during upgrades
Fixes race condition where node reboots during upgrades could leave pacemaker in split-brain state.
Root Cause:
During upgrades, there's a race condition:
Result: Split-brain (node-0 configured for 1-node cluster, node-1 configured for 2-node cluster)
The Fix:
The reconciliation logic now checks if the "missing" node still exists in Kubernetes before treating it as a replacement scenario:
This ensures update-setup only runs membership changes when a node is actually being replaced (deleted from K8s), not just temporarily offline during a reboot.
Additional Improvements:
Tests:
Related: Works in tandem with ClusterLabs/resource-agents#2155
Summary by CodeRabbit
New Features
Bug Fixes
Documentation