OCPBUGS-84695: update node-to-pacemaker reconciliation logic to prevent pacemaker split-brain by jaypoulz · Pull Request #1605 · openshift/cluster-etcd-operator

jaypoulz · 2026-04-29T15:34:39Z

OCPBUGS-84695: Prevent pacemaker split-brain during upgrades

Fixes race condition where node reboots during upgrades could leave pacemaker in split-brain state.

Root Cause:
During upgrades, there's a race condition:

First node reboots and becomes Ready
Both nodes are Ready → triggers update-setup reconciliation
Second node starts rebooting (goes offline in pacemaker but still exists in K8s)
Update-setup runs while second node is offline
Update-setup sees only 1 node online in pacemaker and reconfigures it as a 1-node cluster
Second node finishes reboot and comes back, still thinking it's in a 2-node cluster

Result: Split-brain (node-0 configured for 1-node cluster, node-1 configured for 2-node cluster)

The Fix:
The reconciliation logic now checks if the "missing" node still exists in Kubernetes before treating it as a replacement scenario:

Node offline in pacemaker BUT still exists in K8s → Skip replacement logic (node is just rebooting, will rejoin naturally)
Node offline in pacemaker AND deleted from K8s → Run replacement logic (actual node replacement)

This ensures update-setup only runs membership changes when a node is actually being replaced (deleted from K8s), not just temporarily offline during a reboot.

Additional Improvements:

Track jobs by node UID (not name) to handle true node replacement scenarios where same name reappears with different UID
Introduce NodeTarget struct to make node-specific vs cluster-wide jobs explicit
Refactor lifecycle manager into focused modules (health, reconciliation, job controllers, cleanup)
Add comprehensive ARCHITECTURE.md documenting lifecycle operations and data flows

Tests:

Node offline in pacemaker but exists in K8s (upgrade reboot) → skips replacement
Node offline in pacemaker and deleted from K8s (replacement) → runs replacement
UID-based job cleanup for true replacement scenarios
Drift detection and reconciliation flows

Related: Works in tandem with ClusterLabs/resource-agents#2155

Summary by CodeRabbit

New Features
- Added a more complete TNF lifecycle flow for startup, reconciliation, health monitoring, and cleanup.
- Improved handling of node changes, including safer job/controller recovery and support for node replacement scenarios.
- Added support for comparing IP addresses across different formats.
Bug Fixes
- Better cleanup of orphaned jobs and pods.
- Improved handling when cluster state is stale, missing, or partially unavailable.
- Made alerting and some setup steps non-fatal so workflows can continue when possible.
Documentation
- Added new TNF architecture, lifecycle, reconciliation, and job-controller docs.

openshift-ci-robot · 2026-04-29T15:34:47Z

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The update-setup job has a rare chance of incorrectly treating upgrade-related node reboots as node replacement scenarios, causing pacemaker cluster split-brain.

When a node reboots during upgrade, it appears offline in pacemaker. The update-setup job detects this and runs node replacement logic (removing the offline node from cluster config), even though the node is just temporarily rebooting and will rejoin. This results in mismatched pacemaker configurations:

Node-0: Pacemaker configured with 1 node, has quorum

Node-1: Pacemaker configured with 2 nodes, no quorum

The fix adds a check to verify if the offline node still exists in Kubernetes before treating it as a replacement scenario. Node replacement requires the user to DELETE the node from Kubernetes, so:

If node offline in pacemaker BUT still exists in k8s → Skip replacement logic (node just rebooted, will rejoin naturally)

If node offline in pacemaker AND deleted from k8s → Proceed with replacement logic (actual replacement scenario)

This ensures update-setup only runs replacement logic when both conditions are met, preventing split-brain during upgrades.

Tests cover:

Node exists in Kubernetes but offline in pacemaker (upgrade reboot) → Should skip replacement logic

Node deleted from Kubernetes and offline in pacemaker (replacement) → Should proceed with replacement logic

Error handling for NotFound vs other errors

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-04-29T15:34:55Z

Walkthrough

TNF lifecycle orchestration is expanded across startup, health monitoring, cleanup, reconciliation, and day-2 runners. Shared node/IP and job helpers, Pacemaker helper APIs, RBAC/manifests, and lifecycle documentation are also updated.

Changes

TNF lifecycle orchestration

Layer / File(s)	Summary
Shared node helpers `pkg/operator/ceohelpers/common.go`, `pkg/operator/ceohelpers/common_test.go`, `pkg/operator/ceohelpers/node_helpers.go`, `pkg/operator/ceohelpers/node_helpers_test.go`, `pkg/tnf/pkg/tools/nodes.go`	`IPAddressesEqual`, informer-backed node listing, node-name extraction, and node reconciliation action helpers are added with tests for IPv4/IPv6 normalization and informer-backed enumeration.
Pacemaker health helpers `pkg/tnf/pkg/pacemaker/client_helpers.go`, `pkg/tnf/pkg/pacemaker/healthcheck.go`, `pkg/tnf/pkg/pacemaker/test_helpers.go`, `pkg/tnf/pkg/pacemaker/statuscollector.go`, `pkg/tnf/pkg/pacemaker/statuscollector_test.go`	Pacemaker health is computed from CR status through exported helper functions, the controller-based healthcheck test suite is removed, and the status collector now uses exported client/config helpers and non-nil fencing-agent status slices.
TNF job runtime primitives `pkg/tnf/pkg/tools/jobs.go`, `pkg/tnf/pkg/pcs/auth.go`, `pkg/tnf/pkg/jobs/tnf.go`, `pkg/tnf/pkg/jobs/jobcontroller.go`, `pkg/tnf/pkg/jobs/utils.go`, `pkg/tnf/pkg/jobs/tnf_test.go`	TNF job controllers gain node-targeted and multi-node retry state, updated controller lifecycle detection, and improved job deletion and stuck-job recovery.
Lifecycle manager wiring `pkg/tnf/operator/lifecycle_manager.go`, `pkg/tnf/operator/lifecycle_helpers.go`, `pkg/tnf/operator/lifecycle_status_collector.go`, `pkg/tnf/operator/starter.go`, `pkg/tnf/operator/starter_test.go`, `bindata/tnfdeployment/cronjob.yaml`, `manifests/0000_20_etcd-operator_06_deployment.yaml`, `docs/tnf/README.md`, `docs/tnf/architecture-overview.md`, `docs/tnf/controller-lifecycle.md`	PacemakerLifecycleManager replaces the old nodehandler-driven startup path by watching PacemakerCluster, handling node events, and running periodic sync; starter wiring waits for CRD establishment before launching it and the status-collector CronJob, and the lifecycle docs and manifests are updated.
Bootstrap job orchestration `pkg/tnf/operator/lifecycle_job_controllers.go`, `pkg/tnf/operator/lifecycle_job_controllers_test.go`, `pkg/tnf/operator/lifecycle_cleanup.go`, `pkg/tnf/operator/lifecycle_manager_test.go`, `pkg/tnf/setup/runner.go`, `pkg/tnf/auth/runner.go`, `docs/tnf/job-controllers.md`	Job-controller startup now branches on external transition state, applies readiness gates and retry/degraded handling, cleans orphaned TNF jobs and pods, and the bootstrap auth/setup runners are updated to match that flow.
Day-2 reconciliation `pkg/tnf/operator/lifecycle_health.go`, `pkg/tnf/operator/lifecycle_reconciliation.go`, `pkg/tnf/operator/lifecycle_reconciliation_test.go`, `pkg/tnf/operator/lifecycle_manager_test.go`, `pkg/tnf/update-setup/runner.go`, `pkg/tnf/update-setup/runner_test.go`, `pkg/tnf/after-setup/runner.go`, `pkg/tnf/fencing/runner.go`, `pkg/tnf/operator/starter.go`, `pkg/tnf/operator/starter_test.go`, `bindata/tnfdeployment/role.yaml`, `docs/tnf/reconciliation.md`	Health monitoring reads PacemakerCluster status and records deduplicated events; reconciliation compares Kubernetes and Pacemaker membership, manages update-setup generations, and the fencing-secret, update-setup, and after-setup runners now follow the new ConfigMap-driven flow.

Sequence Diagram(s)

sequenceDiagram
  participant ControllerFramework
  participant PacemakerLifecycleManager
  participant StaticPodOperatorClient
  participant StartJobControllers
  participant CleanupOrphanedJobs
  participant MonitorHealth
  participant ReconcilePacemakerConfig

  ControllerFramework->>PacemakerLifecycleManager: sync()
  PacemakerLifecycleManager->>StaticPodOperatorClient: HasExternalEtcdCompletedTransition
  PacemakerLifecycleManager->>StartJobControllers: start controllers
  PacemakerLifecycleManager->>CleanupOrphanedJobs: cleanup orphaned jobs and pods
  alt external etcd transition completed
    PacemakerLifecycleManager->>MonitorHealth: MonitorHealth(ctx)
    PacemakerLifecycleManager->>ReconcilePacemakerConfig: ReconcilePacemakerConfig(ctx)
  end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

openshift/cluster-etcd-operator#1625: Updates the TNF alert-agent registration path and pcs.ConfigureAlerts(ctx) behavior that overlaps with the setup and update-setup runner changes in this PR.

Suggested reviewers

hasbro17
tjungblu
fonta-rh

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

openshift-ci-robot · 2026-04-29T15:36:36Z

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

The update-setup job has a rare chance of incorrectly treating upgrade-related node reboots as node replacement scenarios, causing pacemaker cluster split-brain.

When a node reboots during upgrade, it appears offline in pacemaker. The update-setup job detects this and runs node replacement logic (removing the offline node from cluster config), even though the node is just temporarily rebooting and will rejoin. This results in mismatched pacemaker configurations:

Node-0: Pacemaker configured with 1 node, has quorum

Node-1: Pacemaker configured with 2 nodes, no quorum

The fix adds a check to verify if the offline node still exists in Kubernetes before treating it as a replacement scenario. Node replacement requires the user to DELETE the node from Kubernetes, so:

If node offline in pacemaker BUT still exists in k8s → Skip replacement logic (node just rebooted, will rejoin naturally)

If node offline in pacemaker AND deleted from k8s → Proceed with replacement logic (actual replacement scenario)

This ensures update-setup only runs replacement logic when both conditions are met, preventing split-brain during upgrades.

Tests cover:

Node exists in Kubernetes but offline in pacemaker (upgrade reboot) → Should skip replacement logic

Node deleted from Kubernetes and offline in pacemaker (replacement) → Should proceed with replacement logic

Error handling for NotFound vs other errors

Summary by CodeRabbit

New Features

Added a pre-check during the update process to detect whether an offline pacemaker node has been deleted from the cluster. If the node no longer exists, the replacement sequence is automatically skipped; otherwise, replacement proceeds normally.

Tests

Added unit tests for node existence checking during update operations.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jaypoulz · 2026-04-29T15:37:57Z

/jira refresh

openshift-ci-robot · 2026-04-29T15:38:01Z

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jaypoulz · 2026-04-29T15:40:07Z

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery

openshift-ci · 2026-04-29T15:40:31Z

@jaypoulz: trigger 4 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-upgrade
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-1of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-2of3
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3ba7540-43e1-11f1-8734-dc441d1b18ce-0

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

pkg/tnf/update-setup/runner_test.go (1)
18-102: Tests validate API primitives, but not the runner branch decisions end-to-end.

This table test confirms fake-client Get/IsNotFound behavior, but it doesn’t exercise RunTnfUpdateSetup (or a helper) to assert “skip vs proceed vs fail” return-path decisions directly. For this critical flow, consider extracting the node-existence decision into a small function and unit-test that function.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner_test.go` around lines 18 - 102, The test is only
exercising the fake client Get behavior, not the runner's branch decisions;
extract the node-existence decision into a small function (e.g., func
checkNodeExists(ctx context.Context, kube kubernetes.Interface, nodeName string)
(exists bool, err error)) and update RunTnfUpdateSetup to call that helper; then
replace the table test to unit-test checkNodeExists directly (using
fake.NewClientset) to assert the three return paths (exists/skip, not
found/proceed, unexpected error/fail). Ensure the helper is named uniquely
(checkNodeExists) and referenced from RunTnfUpdateSetup so the new unit tests
verify the runner branch logic end-to-end.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 124-126: The test passes a non-error value (&runtime.Unknown{}) to
apierrors.NewInternalError via simulateError; replace the runtime.Unknown value
with an actual error (e.g., errors.New(...) or fmt.Errorf(...)) or format the
runtime.Unknown into an error string so apierrors.NewInternalError receives an
error interface; update the simulateError assignment that calls
apierrors.NewInternalError in runner_test.go accordingly (referencing
simulateError, apierrors.NewInternalError, and runtime.Unknown).

---

Nitpick comments:
In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 18-102: The test is only exercising the fake client Get behavior,
not the runner's branch decisions; extract the node-existence decision into a
small function (e.g., func checkNodeExists(ctx context.Context, kube
kubernetes.Interface, nodeName string) (exists bool, err error)) and update
RunTnfUpdateSetup to call that helper; then replace the table test to unit-test
checkNodeExists directly (using fake.NewClientset) to assert the three return
paths (exists/skip, not found/proceed, unexpected error/fail). Ensure the helper
is named uniquely (checkNodeExists) and referenced from RunTnfUpdateSetup so the
new unit tests verify the runner branch logic end-to-end.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b46e7555-1b53-4882-b0b2-fa6f26787667

📥 Commits

Reviewing files that changed from the base of the PR and between 93d2b65 and e9ed644.

📒 Files selected for processing (2)

pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

jaypoulz · 2026-04-29T15:48:40Z

/jira refresh

openshift-ci-robot · 2026-04-29T15:48:44Z

@jaypoulz: This pull request references Jira Issue OCPBUGS-84695, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

slintes · 2026-04-29T16:43:50Z


+	// Check if offline node still exists in Kubernetes
+	// Node replacement requires user to DELETE the node from Kubernetes
+	// If the node still exists, it's likely just rebooted (e.g., during upgrade)


This job only runs when we have 2 ready nodes available... but yeah, pacemaker status might be behind, or reboot by upgrade just started... good catch 🙈

Might be good enough to compare offlineNodeName with currentNodeName and otherNodeName... when one of them matches, skip. But the result is the same I guess :)

What I realized from this comment is that both approaches have the same problem. Same-name node replacement requires us to re-add the node. So the k8s node check doesn't work because it would block it on the same name, and the currentNodeName and otherNodeName check doesn't work they would match the replaced node.

I think we need to run update-setup to reconcile on node delete because offline is insufficient to distinguish between temporary outage and node-that-needs-replacement.

OK - I'm pushing a new approach. Don't run setup until we have 2 nodes for initial setup, but after that, always reconcile pacemaker state to k8s state. This means that deleting the node in k8s results in us removing it from pacemaker.

This allows us to determine user intention based on the declarative k8s state. So now if you reboot, the k8s nodes are the same, so no-op. If you delete a node, that's an intentional change, so we clean it up on the backend.

It's more involved though, so I'll need to do some soak testing for this.

Same-name node replacement

Ah yes, of course. Has been too long ago already, me dealing with TNF 😎

Don't run setup until we have 2 nodes for initial setup, but after that, always reconcile pacemaker state to k8s state

Hm, not sure about that tbh...
The idea still was: we have 2 ready nodes, but it's not the initial setup, so "update" pacemaker if needed to reflect reality. Changing the trigger would be a risky change IMHO, and can introduce regressions. I mean, this worked well so far for the intended use case, not?
IMHO we only need to better understand the situation during job execution: do we need to deal with the offline node or not? So: can we check something on the node resource to find out?
Pragmatic idea: check the creation date...? If the node is "new" run the setup, else skip it. 🤔
Maybe there is something better.
Or: I don't remember details, but I think I considered putting a label on nodes after they were successfully configured for pacemaker, but that didn't happen... could be useful in this case.

I think we can afford the extra iteration time to get this right since the bug is easy enough to work around at upgrade time. Beyond it fixing this edge case, the other reason I like reworking the reconciliation approach is that I think we will need to in the future to support Adaptable topology.

Scaling from 1 to 2 to 3 nodes with automatic quorum recovery will require us to set up pacemaker on a single node cluster first, then scale up as we go. This would be one of multiple things that we would need to harden to handle that scale-up / scale-down logic.

One thing is absolutely true though - we can't move forward with this patch before updated podman-etcd to handle the single-node case in the monitor operation first.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/operator/nodehandler.go`:
- Around line 121-139: The current logic uses jobsExist (from tnfSetupJobsExist)
as a post-setup signal which is set true as soon as startTnfJobcontrollers
creates setup/auth/after-setup jobs; change the gate to require a real
completion signal instead: replace or extend tnfSetupJobsExist with a check for
a TNF setup completion marker (for example a specific Job completion condition,
a "tnf-setup-completed" ConfigMap/CR, or a Job status check) and expose it as
something like tnfSetupCompleted(ctx, kubeClient) (or return a second boolean
from tnfSetupJobsExist indicating completion); then use that completion boolean
in the node count checks (i.e. when evaluating len(nodeList) < 2, consult
setupCompleted not jobsExist) so a partially-created setup after a CEO restart
is treated as not completed and the initial 2-node flow is preserved.

In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 149-215: The test in runner_test.go reimplements the
reconciliation logic instead of exercising the production code in runner.go;
extract the reconciliation decision into a single exported or package-private
helper (e.g., ReconcileNodes or decideNodeAddRemove) in runner.go that takes
k8sNodes []string and pacemakerOnline/offline []string and returns
(nodeToRemove, nodeToAdd, skip bool), then update the table-driven tests to call
that helper directly (or alternatively drive RunTnfUpdateSetup using
fakes/mocks) so the tests validate the actual implementation rather than a
duplicate inline algorithm.

In `@pkg/tnf/update-setup/runner.go`:
- Around line 313-326: Replace the unsafe substring check
strings.Contains(peerURL, nodeIP) with exact host comparison: parse each peerURL
(use url.Parse) inside the loop over member.PeerURLs, extract the host portion
(handle optional port with net.SplitHostPort and IPv6 brackets), then compare
the extracted host string exactly to nodeIP before proceeding to format
member.ID, log, and call exec.Execute to remove the member; only call member
remove when the host equals nodeIP to avoid accidental matches.
- Around line 168-176: The removal path must not abort when nodeToRemove isn't
present in cfg.NodeName1/NodeName2; instead, stop using
cfg.NodeName1/NodeName2/NodeIP1/NodeIP2 to derive the member to remove and query
Pacemaker/etcd state directly (e.g., list pacemaker nodes or etcd members) to
find the member ID and peerURL/IP for nodeToRemove, then use that ID to perform
the removal; update the logic in the block that references nodeToRemove (and the
similar code at the other instance around lines 239-248) to call the
pacemaker/etcd discovery functions, handle not-found as a no-op or
log-and-continue, and only fall back to cfg lookups if the live discovery fails.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3032f008-368a-4f16-b3ad-fdf0e5c3b3e4

📥 Commits

Reviewing files that changed from the base of the PR and between 0533a44 and 3bf1712.

📒 Files selected for processing (4)

pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

pkg/tnf/update-setup/runner.go (2)
271-283: ⚠️ Potential issue | 🔴 Critical

Deleted nodes cannot be resolved through the current cluster config.

nodeNames here are the pacemaker members that Kubernetes no longer has. For that case, getNodeIP(nodeName) fails by definition, so the code aborts after pacemaker removal and leaves the stale etcd member behind.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner.go` around lines 271 - 283, The loop in
removeEtcdMembersForNodes currently aborts when getNodeIP(nodeName) fails (which
happens for pacemaker-only/deleted nodes) and leaves stale etcd members; change
the logic so a failed getNodeIP does not return but instead logs a warning and
attempts removal by node identity: implement a fallback removal path (e.g., add
removeEtcdMemberByName(nodeName) or enhance removeEtcdMemberByIP to accept a
nodeName fallback) and call that when getNodeIP returns an error; ensure
removeEtcdMembersForNodes still logs success/failure via klog and only returns
on unrecoverable errors from the removal call (removeEtcdMemberByIP or new
removeEtcdMemberByName).
311-324: ⚠️ Potential issue | 🔴 Critical

Match the etcd peer host exactly before removing a member.

strings.Contains(peerURL, nodeIP) is unsafe for values like 10.0.0.1 vs 10.0.0.10, and it is brittle with ports/IPv6 formatting. Parse the peer URL and compare the hostname exactly before calling member remove.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner.go` around lines 311 - 324, The code in the loop
over memberList.Members uses strings.Contains(peerURL, nodeIP) which can match
substrings and fail on ports/IPv6; update the logic in the loop that iterates
member.PeerURLs (the block around strings.Contains(peerURL, nodeIP)) to parse
each peerURL (use url.Parse), extract the host component and if it contains a
port use net.SplitHostPort to get the exact hostname (handling IPv6 brackets),
then compare that hostname exactly to nodeIP before constructing memberIDHex and
running exec.Execute for member remove; keep the existing memberIDHex formatting
and logging but only call etcdctl member remove when the parsed host equals
nodeIP.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 96-99: When building the k8sNodes set, skip empty node names so
the map does not contain "" which causes reconciliation to treat a non-existent
node as present; update the construction of k8sNodes to only insert
cfg.NodeName1 and cfg.NodeName2 when they are non-empty (e.g. check
len(cfg.NodeName1) > 0 and len(cfg.NodeName2) > 0 before adding), keeping the
map type and usage same so downstream logic that checks membership remains
unchanged.

---

Duplicate comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 271-283: The loop in removeEtcdMembersForNodes currently aborts
when getNodeIP(nodeName) fails (which happens for pacemaker-only/deleted nodes)
and leaves stale etcd members; change the logic so a failed getNodeIP does not
return but instead logs a warning and attempts removal by node identity:
implement a fallback removal path (e.g., add removeEtcdMemberByName(nodeName) or
enhance removeEtcdMemberByIP to accept a nodeName fallback) and call that when
getNodeIP returns an error; ensure removeEtcdMembersForNodes still logs
success/failure via klog and only returns on unrecoverable errors from the
removal call (removeEtcdMemberByIP or new removeEtcdMemberByName).
- Around line 311-324: The code in the loop over memberList.Members uses
strings.Contains(peerURL, nodeIP) which can match substrings and fail on
ports/IPv6; update the logic in the loop that iterates member.PeerURLs (the
block around strings.Contains(peerURL, nodeIP)) to parse each peerURL (use
url.Parse), extract the host component and if it contains a port use
net.SplitHostPort to get the exact hostname (handling IPv6 brackets), then
compare that hostname exactly to nodeIP before constructing memberIDHex and
running exec.Execute for member remove; keep the existing memberIDHex formatting
and logging but only call etcdctl member remove when the parsed host equals
nodeIP.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 975b05c2-c5db-40cd-ab9f-96db7d22ec11

📥 Commits

Reviewing files that changed from the base of the PR and between 3bf1712 and 78f3141.

📒 Files selected for processing (4)

pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

🚧 Files skipped from review as they are similar to previous changes (3)

pkg/tnf/operator/nodehandler_test.go
pkg/tnf/update-setup/runner_test.go
pkg/tnf/operator/nodehandler.go

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

pkg/tnf/operator/nodehandler_test.go (1)
354-360: Consider using a prefix check instead of exact name matches.

The hardcoded list of job names is fragile. If tests add new master names (e.g., tnf-after-setup-master-2), this condition would need updating.
♻️ Suggested improvement
 	labels := map[string]string{
 		"app.kubernetes.io/component": "two-node-fencing-setup",
 	}
 	// Mark after-setup jobs to indicate initial setup completion
-	if name == "tnf-after-setup" || name == "tnf-after-setup-master-0" || name == "tnf-after-setup-master-1" {
+	if strings.HasPrefix(name, "tnf-after-setup") {
 		labels["app.kubernetes.io/name"] = "tnf-after-setup"
 	}
This would require adding "strings" to the imports.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/operator/nodehandler_test.go` around lines 354 - 360, The code
currently checks exact job names when setting labels for after-setup jobs (the
block that updates the labels map when name == "tnf-after-setup" || name ==
"tnf-after-setup-master-0" || name == "tnf-after-setup-master-1"); replace these
exact-match checks with a prefix check (e.g., use strings.HasPrefix(name,
"tnf-after-setup")) so new masters like "tnf-after-setup-master-2" are handled
automatically, and add the "strings" import to the test file.
pkg/tnf/update-setup/runner.go (1)
293-296: Unused parameters: nodeNames and getNodeIP are never used.

The function removes all unstarted etcd members based on member.Name == "", but the passed parameters are never referenced. This makes the function signature misleading and the caller's effort to pass getNodeIP pointless.
♻️ Suggested signature simplification

If the intent is to remove all unstarted members regardless of which nodes were removed:
-func removeEtcdMembersForNodes(ctx context.Context, nodeNames []string, getNodeIP func(string) (string, error)) error {
-	if len(nodeNames) == 0 {
-		return nil
-	}
+func removeUnstartedEtcdMembers(ctx context.Context) error {
Update call sites at lines 180 and 230 accordingly:
-	if err := removeEtcdMembersForNodes(ctx, nodesToRemove, getNodeIP); err != nil {
+	if len(nodesToRemove) > 0 {
+		if err := removeUnstartedEtcdMembers(ctx); err != nil {
+			return err
+		}
+	}
Alternatively, if you want to keep the node-specific behavior for future use, consider using getNodeIP to match members by IP via removeEtcdMemberByIP.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/tnf/update-setup/runner.go` around lines 293 - 296, The function
removeEtcdMembersForNodes has unused parameters nodeNames and getNodeIP; either
simplify the signature by removing those parameters and update callers to call
removeEtcdMembersForNodes(ctx) (i.e., make it remove all unstarted members as it
already does), or implement node-specific removal by iterating nodeNames,
calling getNodeIP(node) to obtain IPs and then call removeEtcdMemberByIP(ctx,
ip) for matches; update removeEtcdMembersForNodes to reference getNodeIP and
nodeNames (or remove them) and adjust call sites accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/tnf/update-setup/runner.go`:
- Around line 349-412: The function removeEtcdMemberByIP is dead code (never
called); either delete it or wire it into the existing workflow — specifically
replace or refactor the current removeEtcdMembersForNodes usage to call
removeEtcdMemberByIP for each nodeIP, or remove the entire removeEtcdMemberByIP
function and clean up any now-unused imports (e.g., url, net, strings) and
tests; locate the existing caller(s) of removeEtcdMembersForNodes and either
iterate node IPs invoking removeEtcdMemberByIP (preserving its parsing/hex-ID
logic) or remove the function and ensure no references remain.

---

Nitpick comments:
In `@pkg/tnf/operator/nodehandler_test.go`:
- Around line 354-360: The code currently checks exact job names when setting
labels for after-setup jobs (the block that updates the labels map when name ==
"tnf-after-setup" || name == "tnf-after-setup-master-0" || name ==
"tnf-after-setup-master-1"); replace these exact-match checks with a prefix
check (e.g., use strings.HasPrefix(name, "tnf-after-setup")) so new masters like
"tnf-after-setup-master-2" are handled automatically, and add the "strings"
import to the test file.

In `@pkg/tnf/update-setup/runner.go`:
- Around line 293-296: The function removeEtcdMembersForNodes has unused
parameters nodeNames and getNodeIP; either simplify the signature by removing
those parameters and update callers to call removeEtcdMembersForNodes(ctx)
(i.e., make it remove all unstarted members as it already does), or implement
node-specific removal by iterating nodeNames, calling getNodeIP(node) to obtain
IPs and then call removeEtcdMemberByIP(ctx, ip) for matches; update
removeEtcdMembersForNodes to reference getNodeIP and nodeNames (or remove them)
and adjust call sites accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 273e7f58-c686-46d5-b227-f828930b53ef

📥 Commits

Reviewing files that changed from the base of the PR and between 78f3141 and 03a2a57.

📒 Files selected for processing (4)

pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

pkg/tnf/update-setup/runner_test.go

fonta-rh

Review: OCPBUGS-84695 — Fix TNF Pacemaker Split-Brain

The core reconciliation approach is solid — comparing k8s ConfigMap (desired state) vs pacemaker XML membership (actual state) is fundamentally better than the old grep Offline | awk shell parsing. Three findings from review:

1. Dead code: `removeEtcdMemberByIP()` + unused `getNodeIP` parameter

removeEtcdMemberByIP() (runner.go:472-535, 63 lines) is defined but never called anywhere in this PR. Additionally, the getNodeIP callback is passed to removeEtcdMembersForNodes() at lines 320 and 370, but it's never dereferenced inside the function body — removeEtcdMembersForNodes removes ALL unstarted members regardless of node identity.

Was removeEtcdMemberByIP intended to be wired in as a fallback (e.g., when unstarted-name detection fails) but forgotten? If it's not needed, removing both the function and the dead getNodeIP parameter would clean things up. If removing, consider renaming removeEtcdMembersForNodes → removeAllUnstartedEtcdMembers to match the actual behavior (blanket cleanup, not per-node targeting).

2. Missing tests for intermediate setup states

The tnfSetupJobsExist() semantics changed significantly — from "any TNF job exists" to "after-setup job completed successfully (Succeeded > 0)." This is the behavioral hinge of the fix, but two intermediate states aren't tested:

After-setup in progress: tnf-after-setup-master-0 exists with Succeeded: 0 → tnfSetupJobsExist() returns false. With 1 node present, handleNodes blocks. The createTNFJob() fixture hardcodes Succeeded: 1, so this path can't currently be exercised.
Partial setup (setup running, after-setup not yet created): tnf-setup job exists but no tnf-after-setup jobs. The old test used createTNFJob("tnf-setup") — that coverage was lost when fixtures were changed to tnf-after-setup-*.

Both are realistic timing windows during normal cluster lifecycle. Suggested additions:

"1 node, after-setup job in progress (not completed) - skip reconciliation"
"2 nodes, setup jobs exist but no after-setup jobs - start controllers only"

3. `removeEtcdMembersForNodes` docstring vs behavior mismatch

The docstring says "removes unstarted etcd members after nodes are removed from pacemaker," implying per-node targeting. The actual behavior removes ALL unstarted members regardless of the nodeNames parameter (which is only used for logging). The behavior is correct and safe for a 2-node cluster, but the docstring could mislead a future engineer debugging member removal issues. Updating it to reflect the blanket-cleanup behavior would prevent misunderstandings.

Overall the approach is a clear improvement. The reconciliation logic in reconcileNodes() is clean, well-extracted for testability, and the test coverage for the core diff logic is thorough (14 cases). Nice work.

jaypoulz · 2026-06-11T20:10:17Z

@coderabbitai review

coderabbitai · 2026-06-11T20:13:44Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

pkg/tnf/operator/ARCHITECTURE.md (1)

1-984: 💤 Low value

Documentation is comprehensive and clear.

The architecture documentation provides excellent coverage of the PacemakerLifecycleManager flows, data structures, and concurrency protections. The ASCII diagrams clearly illustrate the control flow for each operation.

Static analysis flagged missing language specifiers on fenced code blocks (e.g., line 9) and minor grammar suggestions. These are optional style improvements.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/operator/ARCHITECTURE.md` around lines 1 - 984, The markdown has
fenced code blocks missing language specifiers and a few minor grammar issues;
update ARCHITECTURE.md by adding appropriate language tags (e.g., ```go for Go
snippets, ```text or ```bash where applicable) to every fenced code block
(notably the File Organization block and all code examples like
NewPacemakerLifecycleManager(), sync(), StartJobControllers(),
startTnfJobcontrollers(), NodeTarget, and PacemakerLifecycleManager sections)
and apply the small grammar fixes suggested by static analysis (straighten
pluralization/tense and punctuation in headings and bullets) so the file passes
markdown linters and improves readability.

pkg/tnf/update-setup/runner_test.go (1)

96-118: ⚡ Quick win

Add an “all generations invalid” case to lock the error-path contract.

Current coverage checks empty input and mixed valid/invalid generations, but not the case where every ConfigMap has a non-numeric generation. That is a distinct branch in the picker and worth pinning with a test.

Proposed diff

 		{
 			name: "skips invalid generation",
 			items: []corev1.ConfigMap{
 				{
 					ObjectMeta: metav1.ObjectMeta{Name: "valid"},
 					Data: map[string]string{
 						"generation": "5",
 						"eventType":  "add",
 					},
 				},
 				{
 					ObjectMeta: metav1.ObjectMeta{Name: "invalid"},
 					Data: map[string]string{
 						"generation": "not-a-number",
 						"eventType":  "add",
 					},
 				},
 			},
 			expectName:  "valid",
 			expectGen:   "5",
 			expectError: false,
 		},
+		{
+			name: "all invalid generations returns error",
+			items: []corev1.ConfigMap{
+				{
+					ObjectMeta: metav1.ObjectMeta{Name: "invalid-1"},
+					Data: map[string]string{
+						"generation": "x",
+					},
+				},
+				{
+					ObjectMeta: metav1.ObjectMeta{Name: "invalid-2"},
+					Data: map[string]string{
+						"generation": "y",
+					},
+				},
+			},
+			expectError:    true,
+			expectErrorMsg: "no update-setup ConfigMap found",
+		},
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/update-setup/runner_test.go` around lines 96 - 118, Add a new test
case in pkg/tnf/update-setup/runner_test.go within the existing test table (the
slice of test cases used by the test function) named "all generations invalid"
that supplies only ConfigMaps whose Data["generation"] values are non-numeric
(e.g., "not-a-number" and "also-bad"); set expectError to true and leave
expectName/expectGen empty (or their zero values) so the test exercises the
picker’s error path when every generation fails to parse. Ensure the test case
mirrors the structure of the existing cases (using corev1.ConfigMap with
ObjectMeta and Data) so the same test loop uses it.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tnf/operator/lifecycle_helpers.go`:
- Around line 1-11: The file uses klog.Warningf in lifecycle_helpers.go but the
klog package is not imported; add an import for "k8s.io/klog/v2" to the import
block so calls like klog.Warningf(...) compile. Locate the import section at the
top of lifecycle_helpers.go (near existing imports such as corev1, pacmkrv1,
ceohelpers, pacemaker) and include the klog import; no code changes to the klog
call sites (e.g., where klog.Warningf is used) are needed.

In `@pkg/tnf/operator/lifecycle_manager.go`:
- Around line 130-145: PacemakerLifecycleManager currently creates lifecycleCtx
and lifecycleCtxCancel in NewPacemakerLifecycleManager but never calls
lifecycleCtxCancel, leaking goroutines spawned by the event handlers; fix by
making the lifecycle context tied to the controller Run context or by exposing a
Shutdown that calls lifecycleCtxCancel: either (A) change
NewPacemakerLifecycleManager to not create an independent background context and
instead set lifecycleCtx = ctx passed into Run (or derive with
context.WithCancel(ctx)) so cancellation propagates when controller.Run(ctx,
...) exits, or (B) add a Shutdown() method on PacemakerLifecycleManager that
calls lifecycleCtxCancel and ensure callers (e.g., where controller is started)
invoke Shutdown after controller stops; update event handler usages that read
lifecycleCtx to use the manager's lifecycleCtx so they are cancelled properly.

In `@pkg/tnf/pkg/pcs/auth.go`:
- Around line 49-53: The pcs auth command is vulnerable to shell injection
because auth.go builds a fmt.Sprintf string with cfg.NodeName1/2 and
cfg.NodeIP1/2 that is passed to the shell via pkg/tnf/pkg/exec/exec.go; fix by
switching to argv-style execution or strict validation: update the code that
constructs the command in the function that sets the `command` variable in
pkg/tnf/pkg/pcs/auth.go to build an argument slice (e.g.,
[]string{"/usr/sbin/pcs","host","auth",...}) and modify the exec caller in
pkg/tnf/pkg/exec/exec.go to accept and exec without "/bin/bash -c", or
alternatively add tight allow-list/regex validation for NodeName* and NodeIP*
before using them (ensure only valid hostnames and IP formats are permitted) so
that the string interpolation cannot inject shell metacharacters.

---

Nitpick comments:
In `@pkg/tnf/operator/ARCHITECTURE.md`:
- Around line 1-984: The markdown has fenced code blocks missing language
specifiers and a few minor grammar issues; update ARCHITECTURE.md by adding
appropriate language tags (e.g., ```go for Go snippets, ```text or ```bash where
applicable) to every fenced code block (notably the File Organization block and
all code examples like NewPacemakerLifecycleManager(), sync(),
StartJobControllers(), startTnfJobcontrollers(), NodeTarget, and
PacemakerLifecycleManager sections) and apply the small grammar fixes suggested
by static analysis (straighten pluralization/tense and punctuation in headings
and bullets) so the file passes markdown linters and improves readability.

In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 96-118: Add a new test case in pkg/tnf/update-setup/runner_test.go
within the existing test table (the slice of test cases used by the test
function) named "all generations invalid" that supplies only ConfigMaps whose
Data["generation"] values are non-numeric (e.g., "not-a-number" and "also-bad");
set expectError to true and leave expectName/expectGen empty (or their zero
values) so the test exercises the picker’s error path when every generation
fails to parse. Ensure the test case mirrors the structure of the existing cases
(using corev1.ConfigMap with ObjectMeta and Data) so the same test loop uses it.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e48cf5b1-9520-49a4-ace1-71e0fc89d0fb

📥 Commits

Reviewing files that changed from the base of the PR and between 74474ac and 7fc5b3a.

📒 Files selected for processing (32)

bindata/tnfdeployment/role.yaml
pkg/operator/ceohelpers/common.go
pkg/operator/ceohelpers/common_test.go
pkg/operator/ceohelpers/node_helpers.go
pkg/operator/ceohelpers/node_helpers_test.go
pkg/tnf/auth/runner.go
pkg/tnf/operator/ARCHITECTURE.md
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/lifecycle_health.go
pkg/tnf/operator/lifecycle_helpers.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/operator/lifecycle_manager_test.go
pkg/tnf/operator/lifecycle_reconciliation.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/operator/starter.go
pkg/tnf/operator/starter_test.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/pkg/tools/tnf_labels.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

💤 Files with no reviewable changes (3)

pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go

🚧 Files skipped from review as they are similar to previous changes (21)

pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/auth/runner.go
pkg/operator/ceohelpers/common.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/operator/ceohelpers/node_helpers.go
bindata/tnfdeployment/role.yaml
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/operator/ceohelpers/node_helpers_test.go
pkg/operator/ceohelpers/common_test.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/starter_test.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/update-setup/runner.go
pkg/tnf/operator/lifecycle_reconciliation.go
pkg/tnf/operator/lifecycle_health.go
pkg/tnf/pkg/pacemaker/healthcheck.go

coderabbitai · 2026-06-11T20:18:33Z

+	// Create lifecycle context for goroutines spawned by event handlers
+	lifecycleCtx, lifecycleCtxCancel := context.WithCancel(context.Background())
+
+	c := &PacemakerLifecycleManager{
+		operatorClient:             operatorClient,
+		kubeClient:                 kubeClient,
+		eventRecorder:              eventRecorder,
+		pacemakerInformer:          informer,
+		nodeInformer:               nodeInformer,
+		controllerContext:          controllerContext,
+		kubeInformersForNamespaces: kubeInformersForNamespaces,
+		etcdInformer:               etcdInformer,
+		recordedEvents:             make(map[string]time.Time),
+		lifecycleCtx:               lifecycleCtx,
+		lifecycleCtxCancel:         lifecycleCtxCancel,
+	}


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Add lifecycle context cancellation on controller shutdown.

The lifecycleCtx is created but lifecycleCtxCancel is never called. This violates the Go coding guideline requiring context.Context for cancellation and timeouts, and leads to goroutine leaks when the controller stops. Event handlers spawn goroutines at lines 197-201, 210-214, 226-230, and 249-253 that rely on this context; without cancellation, these goroutines will continue running after the controller shuts down.

As per coding guidelines, context.Context must be used for cancellation and timeouts.

Recommended fix

The controller lifecycle is managed by the library-go framework, which doesn't expose a shutdown hook in the factory.Controller interface returned by NewPacemakerLifecycleManager. The typical pattern is to cancel the context when the parent context (passed to controller.Run(ctx, 1) in starter.go:259) is cancelled.

However, since event handlers need the lifecycle context to survive across sync cycles, you should either:

Option 1: Store a reference to the controller and add a Shutdown() method:

func NewPacemakerLifecycleManager(...) (factory.Controller, *PacemakerLifecycleManager, cache.SharedIndexInformer, error) { // ... existing code ... controller := factory.New(). WithSyncContext(syncCtx). ResyncEvery(pacemaker.HealthCheckResyncInterval). WithSync(c.sync). WithInformers( operatorClient.Informer(), informer, nodeInformer, ).ToController(controllerNamePacemakerLifecycle, syncCtx.Recorder()) + // Wrap the sync function to cancel lifecycle context on shutdown + // This ensures event handler goroutines are cleaned up return controller, c, informer, nil } +// Shutdown cancels the lifecycle context, stopping all event handler goroutines +func (c *PacemakerLifecycleManager) Shutdown() { + c.lifecycleCtxCancel() +}

Then call Shutdown() in starter.go after the controller stops, or better yet, derive lifecycleCtx from the controller context passed to Run().

Option 2 (simpler): Change event handlers to use the sync context instead of a separate lifecycle context, and pass the context from the caller:

- lifecycleCtx, lifecycleCtxCancel := context.WithCancel(context.Background()) - c := &PacemakerLifecycleManager{ // ... existing fields ... - lifecycleCtx: lifecycleCtx, - lifecycleCtxCancel: lifecycleCtxCancel, }

And update event handlers to use context.Background() or a context derived from the controller's Run context.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/operator/lifecycle_manager.go` around lines 130 - 145, PacemakerLifecycleManager currently creates lifecycleCtx and lifecycleCtxCancel in NewPacemakerLifecycleManager but never calls lifecycleCtxCancel, leaking goroutines spawned by the event handlers; fix by making the lifecycle context tied to the controller Run context or by exposing a Shutdown that calls lifecycleCtxCancel: either (A) change NewPacemakerLifecycleManager to not create an independent background context and instead set lifecycleCtx = ctx passed into Run (or derive with context.WithCancel(ctx)) so cancellation propagates when controller.Run(ctx, ...) exits, or (B) add a Shutdown() method on PacemakerLifecycleManager that calls lifecycleCtxCancel and ensure callers (e.g., where controller is started) invoke Shutdown after controller stops; update event handler usages that read lifecycleCtx to use the manager's lifecycleCtx so they are cancelled properly.

Source: Coding guidelines

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

docs/tnf-lifecycle-architecture.md (1)

589-611: ⚡ Quick win

Document the Node-exists gate before updateSetup.

The reconciliation flow still reads like “any drift → updateSetup”, but the PR’s central rule is to skip replacement when pacemaker marks a node offline and the Kubernetes Node still exists. Please call that branch out here so the doc matches the implemented behavior.

♻️ Suggested doc patch

+        Pacemaker node offline but Kubernetes Node still exists?
+            ├── Yes → skip replacement / return nil
+            └── No  → continue to updateSetup()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/tnf-lifecycle-architecture.md` around lines 589 - 611, The diagram and
text currently imply any detected drift leads to updateSetup, but the
implementation skips replacement when Pacemaker marks a node offline while the
Kubernetes Node still exists; update the flow to insert a "Node-exists gate"
before calling updateSetup: after detectDrift(...) and isUpdateSetupRunning(...)
pass, call getIntersection(k8sNodes, pacemakerNodes) and if the intersection is
non-empty proceed to updateSetup(), otherwise return/skip; update the diagram
and accompanying prose to explicitly name this gate and reference the functions
detectDrift, isUpdateSetupRunning, getIntersection, and updateSetup so the doc
matches the code behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tnf/operator/lifecycle_manager_test.go`:
- Line 129: The test currently ignores the error returned by store.Add(obj);
update the test to handle that error by either asserting success with
require.NoError(t, store.Add(obj)) if you have access to t *testing.T, or by
capturing and failing on error (e.g., if err := store.Add(obj); err != nil {
t.Fatalf("store.Add failed: %v", err) }); locate the call to store.Add in
lifecycle_manager_test.go and replace the blank-assignment `_ = store.Add(obj)`
with one of these explicit error-handling patterns.
- Around line 1294-1297: The loop is ignoring the error return from
nodeIndexer.Add which can hide failures; update the loop that iterates
tt.k8sNodes to check the returned error from nodeIndexer.Add and fail the test
on error (e.g., if err := nodeIndexer.Add(node); err != nil { t.Fatalf("failed
to add node to indexer: %v", err) }) so that problems with
cache.MetaNamespaceKeyFunc or Add surface immediately; the change touches the
nodeIndexer.Add call in the test using the existing t testing.T instance.

---

Nitpick comments:
In `@docs/tnf-lifecycle-architecture.md`:
- Around line 589-611: The diagram and text currently imply any detected drift
leads to updateSetup, but the implementation skips replacement when Pacemaker
marks a node offline while the Kubernetes Node still exists; update the flow to
insert a "Node-exists gate" before calling updateSetup: after detectDrift(...)
and isUpdateSetupRunning(...) pass, call getIntersection(k8sNodes,
pacemakerNodes) and if the intersection is non-empty proceed to updateSetup(),
otherwise return/skip; update the diagram and accompanying prose to explicitly
name this gate and reference the functions detectDrift, isUpdateSetupRunning,
getIntersection, and updateSetup so the doc matches the code behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8077d43e-3cb9-4d1a-9a15-475ccd180238

📥 Commits

Reviewing files that changed from the base of the PR and between 7fc5b3a and 880d8ca.

📒 Files selected for processing (32)

bindata/tnfdeployment/role.yaml
docs/tnf-lifecycle-architecture.md
pkg/operator/ceohelpers/common.go
pkg/operator/ceohelpers/common_test.go
pkg/operator/ceohelpers/node_helpers.go
pkg/operator/ceohelpers/node_helpers_test.go
pkg/tnf/auth/runner.go
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/lifecycle_health.go
pkg/tnf/operator/lifecycle_helpers.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/operator/lifecycle_manager_test.go
pkg/tnf/operator/lifecycle_reconciliation.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/operator/starter.go
pkg/tnf/operator/starter_test.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/pkg/tools/tnf_labels.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

💤 Files with no reviewable changes (3)

pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go

✅ Files skipped from review due to trivial changes (1)

pkg/tnf/pkg/pacemaker/statuscollector_test.go

🚧 Files skipped from review as they are similar to previous changes (23)

pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/tnf/pkg/tools/tnf_labels.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/operator/ceohelpers/node_helpers.go
pkg/operator/ceohelpers/common_test.go
pkg/tnf/update-setup/runner_test.go
bindata/tnfdeployment/role.yaml
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/starter_test.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/operator/lifecycle_manager.go
pkg/operator/ceohelpers/node_helpers_test.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/update-setup/runner.go
pkg/tnf/operator/starter.go
pkg/tnf/operator/lifecycle_helpers.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/operator/lifecycle_health.go

jaypoulz · 2026-06-11T21:49:48Z

@coderabbitai review

coderabbitai · 2026-06-11T21:49:54Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/tnf/pkg/jobs/tnf_test.go (1)

404-417: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Decouple controller-start expectation from expectWaitForStopped.

expectWaitForStopped is a path flag, but Line 412 uses it as an outcome gate. That makes timeout/delete-error cases assert controller startup even when the test expects an error path. Add an explicit expectControllerStarted field and assert against that directly.

Proposed fix

 tests := []struct {
   name                 string
   jobType              tools.JobType
   nodeTarget           *NodeTarget
   scheduleOnNode       *string
   setupClient          func() *fake.Clientset
   expectError          bool
   errorContains        string
   expectJobDeleted     bool
   expectWaitForStopped bool
+  expectControllerStarted bool
 }{
   {
     name:                  "Job does not exist - just runs controller",
@@
-    expectWaitForStopped:  false,
+    expectWaitForStopped:  false,
+    expectControllerStarted: true,
   },
@@
-  if !tt.expectError || tt.expectWaitForStopped {
+  if tt.expectControllerStarted {
     runningControllersMutex.Lock()
     isRunning := runningControllers[jobName]
     runningControllersMutex.Unlock()
     require.True(t, isRunning, "Expected controller to be started")
   }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/pkg/jobs/tnf_test.go` around lines 404 - 417, The test currently
gates the controller-start assertion on expectWaitForStopped which is a path
flag, causing tests that expect errors to still assert controller startup; add a
new boolean test field expectControllerStarted and replace the condition "if
!tt.expectError || tt.expectWaitForStopped" with a check against
tt.expectControllerStarted, then use runningControllersMutex,
runningControllers[jobName], and require.True/False accordingly to assert
whether the controller was started or not; update test cases to set
expectControllerStarted where appropriate and keep jobName computed via
tt.jobType.GetJobName(&tt.nodeTarget.Name) as before.

♻️ Duplicate comments (1)

pkg/tnf/pkg/pcs/auth.go (1)
49-53: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Shell interpolation is still unsafe here.

cfg.NodeName* and cfg.NodeIP* are still embedded directly into the pcs host auth shell command, so the injection risk from the earlier review remains in both branches. Please switch this path to argv-style execution or validate/escape the node values before building the command. As per coding guidelines, shell input from trust boundaries must not be parsed by a shell.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/pkg/pcs/auth.go` around lines 49 - 53, The code builds a shell
command string using untrusted cfg.NodeName1/2 and cfg.NodeIP1/2 (and TokenPath)
which allows shell injection; replace both fmt.Sprintf branches in auth.go so
you do not invoke a shell: construct an argv-style slice (e.g. args :=
[]string{"host","auth", cfg.NodeName1, "addr="+cfg.NodeIP1, ...} including the
second node conditionally and the "--token", TokenPath and "--debug" flags) and
call exec.Command / exec.CommandContext with the command "/usr/sbin/pcs" plus
that args slice, or alternatively strictly validate/escape node names and IPs
against a whitelist/regex before using them; ensure you update the code paths
handling both the single-node and two-node cases to use the argv-style execution
and remove any fmt.Sprintf shell interpolation.
Source: Coding guidelines

🧹 Nitpick comments (1)

docs/tnf-lifecycle-architecture.md (1)
9-53: 💤 Low value

Add language tags to the fenced blocks.

The document uses unlabeled fences in several places, which is why markdownlint is reporting MD040 here. Tag the diagrams/snippets as text (or a more specific language where applicable) so the new architecture doc stays lint-clean. Per the markdownlint warnings, the same fix applies to the rest of the file.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/tnf-lifecycle-architecture.md` around lines 9 - 53, Unlabeled fenced
code blocks cause markdownlint MD040; update each fenced block (e.g., the
diagram describing NewPacemakerLifecycleManager() in lifecycle_manager.go) to
include a language tag such as ```text (or a more specific language where
appropriate) at the opening fence, and apply the same change to all other fenced
blocks in docs/tnf-lifecycle-architecture.md so the file passes markdownlint
checks.
Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tnf/operator/lifecycle_job_controllers_test.go`:
- Around line 340-342: Replace the ignored error from nodeIndexer.Add in the
test setup with a failing assertion so fixture setup fails fast: change the loop
that currently does `_ = nodeIndexer.Add(node)` to call `require.NoError(t,
nodeIndexer.Add(node))` (or `require.NoErrorf` with context) for each node;
ensure `t` (testing.T) and the testify/require import are available in the test
file.

---

Outside diff comments:
In `@pkg/tnf/pkg/jobs/tnf_test.go`:
- Around line 404-417: The test currently gates the controller-start assertion
on expectWaitForStopped which is a path flag, causing tests that expect errors
to still assert controller startup; add a new boolean test field
expectControllerStarted and replace the condition "if !tt.expectError ||
tt.expectWaitForStopped" with a check against tt.expectControllerStarted, then
use runningControllersMutex, runningControllers[jobName], and require.True/False
accordingly to assert whether the controller was started or not; update test
cases to set expectControllerStarted where appropriate and keep jobName computed
via tt.jobType.GetJobName(&tt.nodeTarget.Name) as before.

---

Duplicate comments:
In `@pkg/tnf/pkg/pcs/auth.go`:
- Around line 49-53: The code builds a shell command string using untrusted
cfg.NodeName1/2 and cfg.NodeIP1/2 (and TokenPath) which allows shell injection;
replace both fmt.Sprintf branches in auth.go so you do not invoke a shell:
construct an argv-style slice (e.g. args := []string{"host","auth",
cfg.NodeName1, "addr="+cfg.NodeIP1, ...} including the second node conditionally
and the "--token", TokenPath and "--debug" flags) and call exec.Command /
exec.CommandContext with the command "/usr/sbin/pcs" plus that args slice, or
alternatively strictly validate/escape node names and IPs against a
whitelist/regex before using them; ensure you update the code paths handling
both the single-node and two-node cases to use the argv-style execution and
remove any fmt.Sprintf shell interpolation.

---

Nitpick comments:
In `@docs/tnf-lifecycle-architecture.md`:
- Around line 9-53: Unlabeled fenced code blocks cause markdownlint MD040;
update each fenced block (e.g., the diagram describing
NewPacemakerLifecycleManager() in lifecycle_manager.go) to include a language
tag such as ```text (or a more specific language where appropriate) at the
opening fence, and apply the same change to all other fenced blocks in
docs/tnf-lifecycle-architecture.md so the file passes markdownlint checks.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e70b5b5c-0861-4649-b961-103af3804f9e

📥 Commits

Reviewing files that changed from the base of the PR and between 880d8ca and ff4e65d.

📒 Files selected for processing (32)

bindata/tnfdeployment/role.yaml
docs/tnf-lifecycle-architecture.md
pkg/operator/ceohelpers/common.go
pkg/operator/ceohelpers/common_test.go
pkg/operator/ceohelpers/node_helpers.go
pkg/operator/ceohelpers/node_helpers_test.go
pkg/tnf/auth/runner.go
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/lifecycle_health.go
pkg/tnf/operator/lifecycle_helpers.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/operator/lifecycle_manager_test.go
pkg/tnf/operator/lifecycle_reconciliation.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/operator/starter.go
pkg/tnf/operator/starter_test.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/pkg/tools/tnf_labels.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

💤 Files with no reviewable changes (3)

pkg/tnf/operator/nodehandler.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/operator/nodehandler_test.go

✅ Files skipped from review due to trivial changes (1)

pkg/tnf/auth/runner.go

🚧 Files skipped from review as they are similar to previous changes (20)

pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/tools/tnf_labels.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/operator/ceohelpers/common.go
pkg/tnf/update-setup/runner_test.go
pkg/operator/ceohelpers/common_test.go
bindata/tnfdeployment/role.yaml
pkg/operator/ceohelpers/node_helpers_test.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/tnf/pkg/jobs/tnf.go
pkg/operator/ceohelpers/node_helpers.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/operator/starter_test.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/update-setup/runner.go
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/lifecycle_health.go
pkg/tnf/operator/starter.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/operator/lifecycle_reconciliation.go

jaypoulz · 2026-06-12T20:36:30Z

@coderabbitai review

coderabbitai · 2026-06-12T20:36:35Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/tnf/pkg/pacemaker/healthcheck.go (1)

192-217: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat missing node health metadata as unhealthy, not healthy.

Line 194 drops nodes with no conditions, and Line 215 also short-circuits when NodeHealthy is absent. That lets a partially populated PacemakerCluster status come back StatusHealthy, which then clears PacemakerHealthCheckDegraded downstream in pkg/tnf/operator/lifecycle_health.go. Missing per-node conditions/summary data should surface as an error or StatusUnknown, not a healthy node.

Suggested direction

 func checkNodeConditions(node pacmkrv1.PacemakerClusterNodeStatus, status *HealthStatus) {
 	conditions := node.Conditions
 	if len(conditions) == 0 {
-		return
+		status.Errors = append(status.Errors, fmt.Sprintf(msgNodeUnhealthy, node.NodeName, "missing node conditions"))
+		return
 	}
@@
 	healthyCondition := FindCondition(conditions, pacmkrv1.NodeHealthyConditionType)
-	if healthyCondition == nil || healthyCondition.Status == metav1.ConditionTrue {
+	if healthyCondition == nil {
+		status.Errors = append(status.Errors, fmt.Sprintf(msgNodeUnhealthy, node.NodeName, "missing Healthy condition"))
+		return
+	}
+	if healthyCondition.Status == metav1.ConditionTrue {
 		return // Node is healthy (except for warnings already captured above)
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/pkg/pacemaker/healthcheck.go` around lines 192 - 217, The function
checkNodeConditions currently treats missing per-node conditions as healthy;
change it so missing or incomplete node condition metadata is considered
unhealthy/unknown: in checkNodeConditions, when len(node.Conditions) == 0 append
an error (e.g., fmt.Sprintf(msgNodeMissingConditions, node.NodeName) or similar)
to status.Errors or set status to unknown and return, and likewise when
FindCondition(..., pacmkrv1.NodeHealthyConditionType) returns nil do not
short-circuit as healthy — append an error/warning for node.NodeName and proceed
to report fencing warnings and other conditions; update references in the
function (checkNodeConditions, node.Conditions, FindCondition,
NodeHealthyConditionType, NodeOnlineConditionType, getFencingWarnings,
HealthStatus) accordingly.

🧹 Nitpick comments (1)

docs/tnf-lifecycle-architecture.md (1)
9-9: ⚡ Quick win

Add language specifier to all fenced code blocks.

The document contains 19 fenced code blocks (ASCII art flow diagrams) without a language specifier. This triggers markdown linting warnings (MD040). Since these blocks contain plain-text ASCII diagrams rather than executable code, add the text language identifier to each code block.

Example fix: Change ``` to ```text on each affected line.
♻️ Example: Fix code block at line 30 and others
-```
+```text
 ┌─────────────────────────────────────────────────────────────┐
 │ NewPacemakerLifecycleManager()                              │
Apply the same fix to lines: 9, 30, 65, 122, 199, 216, 259, 363, 393, 440, 471, 527, 592, 672, 724, 803, 827, 882, 908, 919 (all fenced code blocks containing ASCII art).
Also applies to: 30-30, 65-65, 122-122, 199-199, 216-216, 259-259, 363-363, 393-393, 440-440, 471-471, 527-527, 592-592, 672-672, 724-724, 803-803, 827-827, 882-882, 908-908, 919-919
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/tnf-lifecycle-architecture.md` at line 9, Update every fenced ASCII-art
code block in tnf-lifecycle-architecture.md to include the text language
specifier by changing the opening fence from ``` to ```text for each ASCII
diagram block (occurrences around the blocks referenced at lines 9, 30, 65, 122,
199, 216, 259, 363, 393, 440, 471, 527, 592, 672, 724, 803, 827, 882, 908, 919);
ensure you only modify the opening fence of each diagram block so the linter
MD040 is satisfied while preserving the ASCII content and closing fences
unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/tnf/operator/lifecycle_health.go`:
- Around line 74-79: The code currently conflates "last valid (non-Unknown)" and
"last observed" status which breaks recovery event detection; change the APIs
and state so you track both separately: introduce a previousObserved (last
snapshot regardless of Unknown) and previousValid (last non-StatusUnknown) and
return/pass both from getPacemakerStatus, then update callers like
recordHealthTransitionEvents to use previousValid for grace-period calculations
and previousObserved (or a boolean previousWasUnknown derived from it) for
detecting recovery-from-Unknown transitions; update any logic that updates
stored state (the variables updated after getPacemakerStatus) to set both
previousObserved and previousValid appropriately and ensure comparisons that
check StatusUnknown use the previousValid/previousObserved distinction instead
of the single previous variable.

In `@pkg/tnf/update-setup/runner.go`:
- Around line 130-138: Replace any logging that prints full topology, node IPs
or raw command stdout/stderr with redacted or debug-only output: modify the
klog.Infof calls around decodeNodeList and getNodeNames to log counts and node
IDs (e.g., len(capturedNodes) and node.Name) or masked values instead of
cm.Data["nodes"] or full getNodeNames output; similarly ensure places that log
pcs/etcdctl command lines and command output (the code paths that format and log
stdout/stderr) only emit sanitized summaries or are gated behind a debug flag.
Locate usages in functions/methods referencing decodeNodeList, getNodeNames, and
any direct klog.Infof/klog.Errorf calls that include command strings or cm.Data
entries and change them to emit non-sensitive summaries (counts, partial IDs, or
“REDACTED”) or use debug-level logging.

---

Outside diff comments:
In `@pkg/tnf/pkg/pacemaker/healthcheck.go`:
- Around line 192-217: The function checkNodeConditions currently treats missing
per-node conditions as healthy; change it so missing or incomplete node
condition metadata is considered unhealthy/unknown: in checkNodeConditions, when
len(node.Conditions) == 0 append an error (e.g.,
fmt.Sprintf(msgNodeMissingConditions, node.NodeName) or similar) to
status.Errors or set status to unknown and return, and likewise when
FindCondition(..., pacmkrv1.NodeHealthyConditionType) returns nil do not
short-circuit as healthy — append an error/warning for node.NodeName and proceed
to report fencing warnings and other conditions; update references in the
function (checkNodeConditions, node.Conditions, FindCondition,
NodeHealthyConditionType, NodeOnlineConditionType, getFencingWarnings,
HealthStatus) accordingly.

---

Nitpick comments:
In `@docs/tnf-lifecycle-architecture.md`:
- Line 9: Update every fenced ASCII-art code block in
tnf-lifecycle-architecture.md to include the text language specifier by changing
the opening fence from ``` to ```text for each ASCII diagram block (occurrences
around the blocks referenced at lines 9, 30, 65, 122, 199, 216, 259, 363, 393,
440, 471, 527, 592, 672, 724, 803, 827, 882, 908, 919); ensure you only modify
the opening fence of each diagram block so the linter MD040 is satisfied while
preserving the ASCII content and closing fences unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bbbdeff4-4a04-4bc1-af16-105ebe6d367a

📥 Commits

Reviewing files that changed from the base of the PR and between ff4e65d and 3f964fb.

📒 Files selected for processing (32)

bindata/tnfdeployment/role.yaml
docs/tnf-lifecycle-architecture.md
pkg/operator/ceohelpers/common.go
pkg/operator/ceohelpers/common_test.go
pkg/operator/ceohelpers/node_helpers.go
pkg/operator/ceohelpers/node_helpers_test.go
pkg/tnf/auth/runner.go
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/lifecycle_health.go
pkg/tnf/operator/lifecycle_helpers.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/operator/lifecycle_manager_test.go
pkg/tnf/operator/lifecycle_reconciliation.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/operator/starter.go
pkg/tnf/operator/starter_test.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/pkg/tools/tnf_labels.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

💤 Files with no reviewable changes (3)

pkg/tnf/operator/nodehandler.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/operator/nodehandler_test.go

🚧 Files skipped from review as they are similar to previous changes (23)

pkg/operator/ceohelpers/common_test.go
pkg/operator/ceohelpers/common.go
pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/tnf/pkg/tools/tnf_labels.go
pkg/tnf/auth/runner.go
bindata/tnfdeployment/role.yaml
pkg/operator/ceohelpers/node_helpers.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/operator/starter_test.go
pkg/tnf/update-setup/runner_test.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/operator/lifecycle_helpers.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/starter.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/lifecycle_manager_test.go
pkg/tnf/operator/lifecycle_reconciliation.go

coderabbitai

Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/tnf/pkg/jobs/tnf.go (1)
60-67: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

scheduleOnNode is not reapplied once a cluster-wide controller is already running.

RunTNFJobController deduplicates only by job name, so a later call with a different scheduleOnNode is skipped and the existing controller keeps its original scheduling behavior. That makes update-setup placement effectively “first call wins” for the process lifetime.

This is especially risky when a controller was first started without a scheduling hint and later restart requests provide one.

Also applies to: 85-92, 161-162
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/pkg/jobs/tnf.go` around lines 60 - 67, The deduplication logic in
RunTNFJobController only includes the job name in the controllerKey but ignores
the scheduleOnNode parameter, causing later calls with different scheduling
configurations to be skipped and the existing controller to retain its original
scheduling behavior. Modify the controllerKey construction to include the
scheduleOnNode value in addition to jobType and node information, so that
controllers with different scheduling parameters are treated as distinct and not
incorrectly deduplicated. This ensures that scheduling hints provided in later
calls are properly respected rather than being ignored due to "first call wins"
behavior.

🧹 Nitpick comments (1)

pkg/tnf/operator/lifecycle_manager_test.go (1)
1369-1374: ⚡ Quick win

Use explicit UID labels in orphaned-job fixtures.

Most test nodes use UID == Name, and createTNFJobWithNodeLabelLM writes the node label from a name-like parameter. Since this PR’s cleanup contract is UID-based, give fixtures distinct names and UIDs and pass the UID label explicitly so the tests catch job creation/cleanup label mismatches.

Also applies to: 1410-1418
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/operator/lifecycle_manager_test.go` around lines 1369 - 1374, The
test fixtures currently use identical values for node Name and UID (using name
as UID for simplicity), and rely on name-based node labels in job creation.
Since the cleanup contract is UID-based, refactor the fixtures to have distinct
names and UIDs: in the createReadyNodeLM function (anchor location), replace the
UID assignment to use a unique identifier different from the name, and similarly
update the companion fixture at line range 1410-1418 with distinct name and UID
values. Additionally, when calling createTNFJobWithNodeLabelLM, pass the actual
UID value explicitly via the node label parameter instead of deriving it from
the name parameter, ensuring that all test cases verify proper job creation and
cleanup using UID-based matching rather than name-based fallbacks.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/operator/ceohelpers/node_helpers.go`:
- Around line 23-30: The ListNodesFromInformer function is returning all nodes
via labels.Everything() on line 29, but downstream logic expects only
control-plane nodes to be returned. This causes worker nodes to be incorrectly
counted as control-plane nodes, leading to false positive checks and skipped
pacemaker actions. Modify the label selector in the lister.List() call to filter
for control-plane nodes only by using a label selector that matches the
control-plane node role label instead of labels.Everything().

In `@pkg/tnf/operator/lifecycle_health.go`:
- Around line 139-160: The assignment to c.previous at line 159 happens before
the updateOperatorStatus call completes, causing skipped retries on status
update failures. Move the c.previous = currentStatus assignment to execute only
after updateOperatorStatus succeeds, ensuring that the timestamp is only marked
as processed when the side effect (status update) completes successfully. This
prevents the sync loop from skipping the same CR timestamp due to stale data in
c.previous when retrying after a failed updateOperatorStatus call.
- Around line 191-200: The logic in the StatusUnknown case prevents degradation
when previous is nil, but this causes missing or unpopulated PacemakerCluster
CRs to never degrade since previous remains nil forever. Modify the health
tracking to record the initial time when StatusUnknown is first observed, then
check if the grace period (StatusUnknownDegradedThreshold) has elapsed since
that time. Update the condition that currently suppresses degradation to allow
degradation once the grace period expires, ensuring that missing CRs follow the
same grace-period behavior as stale ones. The key change is adding a field to
track when Unknown status began being observed and comparing the current time
against that tracked time plus the threshold, rather than only relying on the
previous.CRLastUpdated check.
- Around line 416-424: The current condition for recording
PacemakerWarningsCleared only checks previousHadWarnings and
currentHasNoWarnings, but StatusUnknown carries no warnings, causing the event
to fire incorrectly during Warning → Unknown transitions when visibility is lost
rather than warnings being resolved. Add an additional guard condition to the if
statement to verify the current status is a known status (not Unknown) before
calling c.eventRecorder.Eventf with EventReasonWarningsCleared. This ensures
warnings cleared is only recorded when the status is actually known and
confirmed to have no warnings.

In `@pkg/tnf/operator/lifecycle_job_controllers.go`:
- Around line 75-90: The variable k8sNodes is populated from
ceohelpers.ListNodesFromInformer which returns all nodes in the cluster, not
just control-plane nodes. However, the subsequent validation logic (checking if
length is greater than 2 or less than 2) assumes k8sNodes contains only
control-plane nodes. Filter k8sNodes to extract only control-plane nodes before
applying the control-plane specific gating checks that compare the count to 2.
This affects the logic in the initial section near line 75-90 (anchor site in
lifecycle_job_controllers.go) and also at lines 115-121 (sibling site in the
same file where similar gating checks are applied).

In `@pkg/tnf/operator/lifecycle_manager_test.go`:
- Around line 503-507: The synthetic status objects in the transition-event
tests (errorStatus at lines 503-507, and similar healthy/warning/error status
objects at lines 523-529, 560-564, 800-804, 815-819, and 833-837) are missing
the CRLastUpdated field, which causes recordHealthTransitionEvents to treat the
previous timestamp as Unknown and record unwanted extra PacemakerHealthy events.
Add a recent fixed CRLastUpdated timestamp to each of these status objects using
the same fixed timestamp value across all locations in the transition tests.
This ensures the test assertions accurately reflect the expected behavior when
previous and current statuses are separated by a known time interval.

In `@pkg/tnf/operator/lifecycle_reconciliation_test.go`:
- Around line 126-127: The test at line 126 calls determineReconciliationActions
which does not exist; the correct exported function is
DetermineReconciliationActions from the tools package located in
pkg/tnf/pkg/tools/nodes.go. Add an import for the tools package and replace the
function call determineReconciliationActions with
tools.DetermineReconciliationActions. Verify that the parameters tt.k8sNodes and
tt.pacemakerNodes match the expected function signature of
DetermineReconciliationActions(desiredNodes, currentNodes map[string]string),
adjusting the parameter types if necessary to ensure they are map[string]string.

In `@pkg/tnf/operator/lifecycle_reconciliation.go`:
- Around line 23-26: The process-local counter updateSetupGeneration resets to 0
on operator restart, causing stale ConfigMaps with higher generation values from
before the restart to be selected during reconciliation. Replace the in-memory
counter with a restart-safe generation mechanism such as using timestamps or
deriving the generation from existing ConfigMap metadata (like modification
timestamps or actual generation fields) to ensure ordering remains consistent
across operator restarts and stale snapshots cannot be ranked higher than new
reconciliations.
- Around line 99-138: The code acquires the reconcilePacemakerConfigMutex lock
but proceeds with stale drift state that was checked before acquiring the lock,
creating a race condition where a waiting goroutine can trigger reconciliation
with outdated information after another goroutine has already reconciled. After
acquiring the reconcilePacemakerConfigMutex lock and before calling
updateSetupFunc, re-validate the drift condition (the comparison between
k8sNodes and pacemakerNodes) to ensure the detected drift is still valid and
hasn't been resolved by another concurrent reconciliation that completed while
this goroutine was waiting for the lock.
- Around line 81-86: The error handling in the getPacemakerNodes call is
suppressing all errors as if they are expected NotFound errors during initial
setup, but this masks actual API, RBAC, and read failures that should be
returned. Modify the error handling to check the error type: only skip
reconciliation and log at V(4) if the error is a NotFound error (indicating the
CR doesn't exist yet), otherwise return the error to allow proper error
propagation up the call stack. This ensures legitimate failures are not silently
suppressed while still allowing the initial setup case to proceed.

In `@pkg/tnf/pkg/pacemaker/healthcheck.go`:
- Around line 83-95: The doc comment for the BuildHealthStatusFromCR function
states it returns nil when the CR has no status populated, but the
implementation actually returns a non-nil HealthStatus with OverallStatus set to
StatusUnknown and an error message appended. Update the doc comment to
accurately reflect this behavior so that callers understand the function will
always return a non-nil HealthStatus, and the status will contain StatusUnknown
with error details when the CR is unpopulated, rather than expecting a nil
return value.
- Around line 354-363: The GetEventReasonForError function's pattern matching is
out of sync with actual error messages being emitted by the health check logic.
Update the errorPatternsByPriority slice to include patterns that match the
actual error messages now being generated, such as "PacemakerCluster CR not
found in cache" for missing-CR cases and patterns that correctly identify
resource failures versus node failures (for example, messages containing "node
is unhealthy" should map to node error reasons while messages about specific
resources should map to resource error reasons). Consider also adding patterns
with a stable resource marker in the error format to ensure filtering by
resource event reasons continues to work correctly, and verify the patterns work
by testing with actual error messages produced by BuildHealthStatusFromCR.

In `@pkg/tnf/pkg/pcs/auth.go`:
- Around line 49-53: The conditional check for two-node cluster authentication
only proceeds with two-node auth when both cfg.NodeName2 and cfg.NodeIP2 are
present, but silently degrades to single-node auth if only one is set, which
masks incomplete configuration. Add validation to ensure cfg.NodeName2 and
cfg.NodeIP2 are treated as an atomic pair: if either one is non-empty, both must
be non-empty, otherwise return an error indicating inconsistent cluster
configuration. Only fall back to single-node auth when both are empty, and
proceed with two-node auth only when both are populated.

In `@pkg/tnf/update-setup/runner_test.go`:
- Around line 21-22: Update the coverage note on lines 21-22 of runner_test.go
that references the now-deleted nodehandler_test.go file. Remove or correct the
stale reference to point readers to the appropriate current source of truth for
reconciliation logic testing, or update the comment to accurately reflect where
this testing now occurs in the refactored codebase.

In `@pkg/tnf/update-setup/runner.go`:
- Around line 333-336: The line accessing strings.Fields(value)[0] can panic if
the value is empty or contains only whitespace, since strings.Fields will return
an empty slice. Add a guard check after the strings.Fields call to verify the
resulting slice is not empty before attempting to access index [0]. If the slice
is empty, return an appropriate error instead of allowing the panic. This
ensures the runner handles malformed node_ip_map values gracefully rather than
crashing.
- Around line 173-179: The code currently skips nodes from the desiredNodes map
when GetNodeIPForPacemaker fails, which can cause incorrect pacemaker member
removals. Instead of continuing with the warning log in the loop over
capturedNodes, return an error from the containing function to abort
reconciliation when a desired-state node cannot be represented with a valid IP.
This ensures the reconciliation fails fast rather than silently dropping the
node from desiredNodes.

---

Outside diff comments:
In `@pkg/tnf/pkg/jobs/tnf.go`:
- Around line 60-67: The deduplication logic in RunTNFJobController only
includes the job name in the controllerKey but ignores the scheduleOnNode
parameter, causing later calls with different scheduling configurations to be
skipped and the existing controller to retain its original scheduling behavior.
Modify the controllerKey construction to include the scheduleOnNode value in
addition to jobType and node information, so that controllers with different
scheduling parameters are treated as distinct and not incorrectly deduplicated.
This ensures that scheduling hints provided in later calls are properly
respected rather than being ignored due to "first call wins" behavior.

---

Nitpick comments:
In `@pkg/tnf/operator/lifecycle_manager_test.go`:
- Around line 1369-1374: The test fixtures currently use identical values for
node Name and UID (using name as UID for simplicity), and rely on name-based
node labels in job creation. Since the cleanup contract is UID-based, refactor
the fixtures to have distinct names and UIDs: in the createReadyNodeLM function
(anchor location), replace the UID assignment to use a unique identifier
different from the name, and similarly update the companion fixture at line
range 1410-1418 with distinct name and UID values. Additionally, when calling
createTNFJobWithNodeLabelLM, pass the actual UID value explicitly via the node
label parameter instead of deriving it from the name parameter, ensuring that
all test cases verify proper job creation and cleanup using UID-based matching
rather than name-based fallbacks.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0f349e5f-58c2-4b60-80f4-0cea35630dbf

📥 Commits

Reviewing files that changed from the base of the PR and between f38807a and 9722e18.

📒 Files selected for processing (33)

bindata/tnfdeployment/role.yaml
docs/tnf-lifecycle-architecture.md
pkg/operator/ceohelpers/common.go
pkg/operator/ceohelpers/common_test.go
pkg/operator/ceohelpers/node_helpers.go
pkg/operator/ceohelpers/node_helpers_test.go
pkg/tnf/auth/runner.go
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/lifecycle_health.go
pkg/tnf/operator/lifecycle_helpers.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/operator/lifecycle_manager_test.go
pkg/tnf/operator/lifecycle_reconciliation.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/operator/starter.go
pkg/tnf/operator/starter_test.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/pkg/tools/nodes.go
pkg/tnf/pkg/tools/tnf_labels.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

💤 Files with no reviewable changes (3)

pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go

coderabbitai · 2026-06-16T18:21:20Z

+func ListNodesFromInformer(informer cache.SharedIndexInformer) ([]*corev1.Node, error) {
+	if informer == nil {
+		return nil, fmt.Errorf("informer is nil")
+	}
+
+	lister := corev1listers.NewNodeLister(informer.GetIndexer())
+	return lister.List(labels.Everything())
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

ListNodesFromInformer is returning all nodes while downstream logic treats it as control-plane input.

Line 29 uses labels.Everything(). Downstream reconciliation/startup paths consume this result as “control-plane nodes”, so worker nodes can cause false >2 checks and skip pacemaker actions.

Suggested fix

func ListNodesFromInformer(informer cache.SharedIndexInformer) ([]*corev1.Node, error) { if informer == nil { return nil, fmt.Errorf("informer is nil") } - lister := corev1listers.NewNodeLister(informer.GetIndexer()) - return lister.List(labels.Everything()) + lister := corev1listers.NewNodeLister(informer.GetIndexer()) + selector := labels.SelectorFromSet(map[string]string{ + ControlPlaneNodeLabelSelector: "", + }) + return lister.List(selector) }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/ceohelpers/node_helpers.go` around lines 23 - 30, The ListNodesFromInformer function is returning all nodes via labels.Everything() on line 29, but downstream logic expects only control-plane nodes to be returned. This causes worker nodes to be incorrectly counted as control-plane nodes, leading to false positive checks and skipped pacemaker actions. Modify the label selector in the lister.List() call to filter for control-plane nodes only by using a label selector that matches the control-plane node role label instead of labels.Everything().

coderabbitai · 2026-06-16T18:21:20Z

+	// Check if lastUpdated timestamp has changed since last sync.
+	// If unchanged, skip processing to avoid redundant work on timer-triggered syncs.
+	// Use previous.CRLastUpdated for comparison (only set for non-Unknown status).
+	if previous != nil && !previous.CRLastUpdated.IsZero() && crLastUpdated.Equal(previous.CRLastUpdated) {
+		klog.V(4).Infof("Skipping sync: lastUpdated timestamp unchanged (%v)", crLastUpdated)
+		return nil, nil, nil
+	}
+
+	// Build health status from the CRD status fields
+	currentStatus := pacemaker.BuildHealthStatusFromCR(pacemakerCR)
+	currentStatus.CRLastUpdated = crLastUpdated
+
+	// Only update previous for non-Unknown status (preserves last valid for grace period)
+	// Concurrency safety: check timestamp when updating to prevent stale writes from overwriting fresh data.
+	// Two concurrent syncs can read c.previous, process different CR versions, then both try to update.
+	// We only update if currentStatus is fresher than what's already in c.previous.
+	if currentStatus.OverallStatus != pacemaker.StatusUnknown {
+		c.previousMu.Lock()
+		// Only update if we have fresher data (or c.previous is nil/uninitialized)
+		if c.previous == nil || c.previous.CRLastUpdated.IsZero() || currentStatus.CRLastUpdated.After(c.previous.CRLastUpdated) {
+			c.previous = currentStatus
+		}


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don’t mark this CR timestamp as processed before side effects succeed.

Line 159 advances c.previous before updateOperatorStatus runs. If that update fails, the sync loop retries, but Lines 142-145 then skip the same LastUpdated, so the degraded condition is never retried until the CR changes again. Commit the processed/previous timestamp only after the status update path succeeds, or track “last valid” separately from “last successfully processed.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/operator/lifecycle_health.go` around lines 139 - 160, The assignment to c.previous at line 159 happens before the updateOperatorStatus call completes, causing skipped retries on status update failures. Move the c.previous = currentStatus assignment to execute only after updateOperatorStatus succeeds, ensuring that the timestamp is only marked as processed when the side effect (status update) completes successfully. This prevents the sync loop from skipping the same CR timestamp due to stale data in c.previous when retrying after a failed updateOperatorStatus call.

coderabbitai · 2026-06-16T18:21:20Z

+	// Record PacemakerWarningsCleared when warnings are resolved.
+	// This fires whenever the previous status had warnings and the current one doesn't,
+	// regardless of overall cluster health. Warnings are independent of error state.
+	// Examples:
+	// - Warning → Healthy: warnings resolved, cluster fully healthy
+	// - Error+Warning → Error: warnings resolved, but cluster still degraded for other reasons
+	// - Error+Warning → Healthy: warnings resolved AND cluster recovered (fires both events)
+	if previousHadWarnings && currentHasNoWarnings {
+		c.eventRecorder.Eventf(pacemaker.EventReasonWarningsCleared, msgPacemakerWarningsCleared)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t emit “warnings cleared” while status is Unknown.

StatusUnknown carries no warnings, so a Warning → Unknown transition currently records PacemakerWarningsCleared even though the controller lost visibility rather than confirmed remediation. Gate this transition on a known current status.

✅ Proposed guard

- if previousHadWarnings && currentHasNoWarnings { + if current.OverallStatus != pacemaker.StatusUnknown && previousHadWarnings && currentHasNoWarnings { c.eventRecorder.Eventf(pacemaker.EventReasonWarningsCleared, msgPacemakerWarningsCleared) klog.Infof("Pacemaker cluster warnings cleared") }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/operator/lifecycle_health.go` around lines 416 - 424, The current condition for recording PacemakerWarningsCleared only checks previousHadWarnings and currentHasNoWarnings, but StatusUnknown carries no warnings, causing the event to fire incorrectly during Warning → Unknown transitions when visibility is lost rather than warnings being resolved. Add an additional guard condition to the if statement to verify the current status is a known status (not Unknown) before calling c.eventRecorder.Eventf with EventReasonWarningsCleared. This ensures warnings cleared is only recorded when the status is actually known and confirmed to have no warnings.

coderabbitai · 2026-06-16T18:21:20Z

+	// Get K8s control plane nodes
+	k8sNodes, err := ceohelpers.ListNodesFromInformer(c.nodeInformer)
+	if err != nil {
+		return fmt.Errorf("failed to list control plane nodes: %w", err)
+	}
+
+	// Before transition: require exactly 2 ready control plane nodes for initial setup
+	// After transition: require all control plane nodes ready (handles single-node case)
+	if !transitionComplete {
+		// Pacemaker only supports 2 control plane nodes - initial setup requires exactly 2
+		if len(k8sNodes) > 2 {
+			klog.Warningf("Found %d control plane nodes - pacemaker only supports 2, skipping initial setup", len(k8sNodes))
+			return nil
+		}
+		if len(k8sNodes) < 2 {
+			klog.V(4).Infof("Waiting for 2 control plane nodes for initial setup (current: %d)", len(k8sNodes))


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Control-plane gating is being applied to the full node set.

The startup logic assumes k8sNodes is control-plane-only, but the source helper returns all nodes. This can incorrectly skip startup (>2 nodes) or block startup on non-control-plane readiness.

Proposed fix

- // Get K8s control plane nodes - k8sNodes, err := ceohelpers.ListNodesFromInformer(c.nodeInformer) + // Get K8s nodes and filter to control-plane only + allNodes, err := ceohelpers.ListNodesFromInformer(c.nodeInformer) if err != nil { return fmt.Errorf("failed to list control plane nodes: %w", err) } + k8sNodes := make([]*corev1.Node, 0, len(allNodes)) + for _, node := range allNodes { + if _, ok := node.Labels["node-role.kubernetes.io/control-plane"]; ok { + k8sNodes = append(k8sNodes, node) + continue + } + if _, ok := node.Labels["node-role.kubernetes.io/master"]; ok { + k8sNodes = append(k8sNodes, node) + } + }

Also applies to: 115-121

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/operator/lifecycle_job_controllers.go` around lines 75 - 90, The variable k8sNodes is populated from ceohelpers.ListNodesFromInformer which returns all nodes in the cluster, not just control-plane nodes. However, the subsequent validation logic (checking if length is greater than 2 or less than 2) assumes k8sNodes contains only control-plane nodes. Filter k8sNodes to extract only control-plane nodes before applying the control-plane specific gating checks that compare the count to 2. This affects the logic in the initial section near line 75-90 (anchor site in lifecycle_job_controllers.go) and also at lines 115-121 (sibling site in the same file where similar gating checks are applied).

coderabbitai · 2026-06-16T18:21:21Z

+// GetEventReasonForError determines the appropriate event reason based on error content.
 // Used for both deduplication (key) and event recording.
-func getEventReasonForError(errorMsg string) string {
+func GetEventReasonForError(errorMsg string) string {
 	for _, entry := range errorPatternsByPriority {
 		if strings.Contains(errorMsg, entry.pattern) {
 			return entry.reason
 		}
 	}
 	return EventReasonGenericError
 }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align event-reason matching with the errors now emitted.

GetEventReasonForError does not match current missing-CR text like PacemakerCluster CR not found in cache, and resource failures built as "<node> node is unhealthy: Kubelet has failed" are classified as node errors instead of resource errors. Add patterns or adjust the emitted messages so filtering by CR/resource event reasons keeps working.

🧭 Possible direction

// CR-related errors (check before more generic patterns) {"Failed to get PacemakerCluster CR", EventReasonCRNotFound}, + {"PacemakerCluster CR not found", EventReasonCRNotFound}, {"CR has no status", EventReasonCRNotFound},

Also consider making resource summaries include a stable resource marker or adding tests that feed actual BuildHealthStatusFromCR errors through GetEventReasonForError.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/pacemaker/healthcheck.go` around lines 354 - 363, The GetEventReasonForError function's pattern matching is out of sync with actual error messages being emitted by the health check logic. Update the errorPatternsByPriority slice to include patterns that match the actual error messages now being generated, such as "PacemakerCluster CR not found in cache" for missing-CR cases and patterns that correctly identify resource failures versus node failures (for example, messages containing "node is unhealthy" should map to node error reasons while messages about specific resources should map to resource error reasons). Consider also adding patterns with a stable resource marker in the error format to ensure filtering by resource event reasons continues to work correctly, and verify the patterns work by testing with actual error messages produced by BuildHealthStatusFromCR.

coderabbitai · 2026-06-16T18:21:21Z

+	if cfg.NodeName2 != "" && cfg.NodeIP2 != "" {
+		command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, cfg.NodeName2, cfg.NodeIP2, TokenPath)
+	} else {
+		command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, TokenPath)
+	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate node-2 config as an atomic pair before falling back to single-node auth.

If only one of cfg.NodeName2 / cfg.NodeIP2 is populated, this path silently degrades to single-node auth instead of failing fast on inconsistent cluster config. That can mask upstream state corruption and run an incomplete auth flow.

Suggested fix

- if cfg.NodeName2 != "" && cfg.NodeIP2 != "" { + hasNode2Name := cfg.NodeName2 != "" + hasNode2IP := cfg.NodeIP2 != "" + if hasNode2Name != hasNode2IP { + return false, fmt.Errorf("incomplete node2 configuration: both NodeName2 and NodeIP2 must be set together") + } + if hasNode2Name { command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, cfg.NodeName2, cfg.NodeIP2, TokenPath) } else { command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, TokenPath) }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if cfg.NodeName2 != "" && cfg.NodeIP2 != "" {

command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, cfg.NodeName2, cfg.NodeIP2, TokenPath)

} else {

command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, TokenPath)

}

hasNode2Name := cfg.NodeName2 != ""

hasNode2IP := cfg.NodeIP2 != ""

if hasNode2Name != hasNode2IP {

return false, fmt.Errorf("incomplete node2 configuration: both NodeName2 and NodeIP2 must be set together")

}

if hasNode2Name {

command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, cfg.NodeName2, cfg.NodeIP2, TokenPath)

} else {

command = fmt.Sprintf("/usr/sbin/pcs host auth %s addr=%s --token %s --debug", cfg.NodeName1, cfg.NodeIP1, TokenPath)

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/pcs/auth.go` around lines 49 - 53, The conditional check for two-node cluster authentication only proceeds with two-node auth when both cfg.NodeName2 and cfg.NodeIP2 are present, but silently degrades to single-node auth if only one is set, which masks incomplete configuration. Add validation to ensure cfg.NodeName2 and cfg.NodeIP2 are treated as an atomic pair: if either one is non-empty, both must be non-empty, otherwise return an error indicating inconsistent cluster configuration. Only fall back to single-node auth when both are empty, and proceed with two-node auth only when both are populated.

coderabbitai · 2026-06-16T18:21:21Z

+- Reconciliation logic: tested in nodehandler_test.go (single source of truth)
+- Helper functions (decodeStringList): implicitly tested through business logic


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update stale coverage note that still references nodehandler_test.go.

That file was removed in this refactor, so this comment now points readers to a non-existent source of truth.

Suggested fix

-- Reconciliation logic: tested in nodehandler_test.go (single source of truth) +- Reconciliation logic: tested in lifecycle reconciliation/update-setup test suites

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/update-setup/runner_test.go` around lines 21 - 22, Update the coverage note on lines 21-22 of runner_test.go that references the now-deleted nodehandler_test.go file. Remove or correct the stale reference to point readers to the appropriate current source of truth for reconciliation logic testing, or update the comment to accurately reflect where this testing now occurs in the refactored codebase.

coderabbitai · 2026-06-16T18:21:21Z

+	for _, node := range capturedNodes {
+		ip, err := tools.GetNodeIPForPacemaker(node)
+		if err != nil {
+			klog.Warningf("Skipping node %s from desired state (no valid IP): %v", node.Name, err)
+			continue
+		}
+		desiredNodes[node.Name] = ip


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when a desired node has no usable IP instead of silently dropping it.

Skipping a captured node here shrinks desiredNodes, which can incorrectly drive remove actions for still-valid pacemaker members. This path should return an error and abort reconciliation when a desired-state node cannot be represented.

Suggested fix

desiredNodes := make(map[string]string) // nodeName -> IP for _, node := range capturedNodes { ip, err := tools.GetNodeIPForPacemaker(node) if err != nil { - klog.Warningf("Skipping node %s from desired state (no valid IP): %v", node.Name, err) - continue + return fmt.Errorf("failed to build desired state for node %q: %w", node.Name, err) } desiredNodes[node.Name] = ip }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/update-setup/runner.go` around lines 173 - 179, The code currently skips nodes from the desiredNodes map when GetNodeIPForPacemaker fails, which can cause incorrect pacemaker member removals. Instead of continuing with the warning log in the loop over capturedNodes, return an error from the containing function to abort reconciliation when a desired-state node cannot be represented with a valid IP. This ensures the reconciliation fails fast rather than silently dropping the node from desiredNodes.

coderabbitai · 2026-06-16T18:21:21Z

+				value := strings.TrimPrefix(parts[1], "\"")
+				value = strings.SplitN(value, "\"", 2)[0] // Get text before closing quote
+				value = strings.Fields(value)[0]          // Get first word (handles unquoted case)
+				return value, nil


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard against empty node_ip_map before indexing parsed fields.

strings.Fields(value)[0] can panic on empty values (e.g., node_ip_map=""). This should return a normal error instead of crashing the runner.

Suggested fix

value := strings.TrimPrefix(parts[1], "\"") value = strings.SplitN(value, "\"", 2)[0] // Get text before closing quote - value = strings.Fields(value)[0] // Get first word (handles unquoted case) - return value, nil + fields := strings.Fields(value) + if len(fields) == 0 { + return "", fmt.Errorf("node_ip_map parameter is empty") + } + return fields[0], nil }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/update-setup/runner.go` around lines 333 - 336, The line accessing strings.Fields(value)[0] can panic if the value is empty or contains only whitespace, since strings.Fields will return an empty slice. Add a guard check after the strings.Fields call to verify the resulting slice is not empty before attempting to access index [0]. If the slice is empty, return an appropriate error instead of allowing the panic. This ensures the runner handles malformed node_ip_map values gracefully rather than crashing.

Implements the foundational framework for multi-node job scheduling with round-robin retry logic and drift detection: - Multi-node job support with NodeTarget struct for node-specific jobs - Round-robin retry logic with state tracking (AttemptNumber, NodeIndex, LastJobConfig) - Job config drift detection (resets retry state when config changes) - Degradation when retry attempts exhausted - IsControllerRunning() helper to prevent duplicate controllers - Shared node utility functions (GetNodeNames, DetermineReconciliationActions, IsNodeReady) - Job type labels for filtering and cleanup See docs/tnf/job-controllers.md for detailed multi-node retry behavior and config drift examples. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implements job controller startup with dual-mode behavior (bootstrap vs runtime) and individual job implementations: - StartJobControllers() with bootstrap vs runtime separation - Bootstrap mode: exponential backoff retry, requires exactly 2 ready nodes, sets degraded condition - Runtime mode: idempotent restart support, handles 1-2 nodes, fast path when setup job complete - startTnfJobcontrollers(): waits for etcd bootstrap, stable revision, creates per-node and cluster-wide controllers - Auth job implementation: sets hacluster password (required before Pacemaker operations) - After-setup job implementation: * Pre-transition: waits for setup job completion * Post-transition: waits for update-setup job containing this node * Disables kubelet service (Pacemaker owns it now) * Prevents kubelet disable race during Day 2 node addition - Update-setup controller check: on operator restart, starts controller for stopped update-setup job See docs/tnf/job-controllers.md for dual-mode operation and bootstrap sequence. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implements drift detection and reconciliation for Pacemaker cluster membership: - ReconcilePacemakerConfig(): unified entry point for drift reconciliation - Detects membership drift: compares K8s nodes vs Pacemaker nodes - Detects IP changes: compares node IPs to detect node replacements - ConfigMap generation tracking: prevents re-running same update-setup job - Target node selection: intersection of K8s and Pacemaker nodes (update-setup must run on active Pacemaker node) - ConfigMap reuse: when desired state unchanged, reuses existing ConfigMap generation - Transition protection: only runs after HasExternalEtcdCompletedTransition + setup job complete - Update-setup implementation: adds/removes nodes, configures fencing, updates etcd resource - Setup job implementation: waits for auth jobs before Pacemaker operations Event handlers: - Node Add/Ready: both call ReconcilePacemakerConfig (fixes race) - Node Update: detects IP changes - Node Delete: triggers reconciliation See docs/tnf/reconciliation.md for drift detection scenarios and reconciliation workflow. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implements the PacemakerLifecycleManager controller for continuous monitoring and maintenance: - PacemakerLifecycleManager: main controller with 30s sync loop - Event handlers: Node Add/Update/Delete, PacemakerCluster Update - Health monitoring: * Tracks PacemakerCluster CR staleness (grace period based on last transition time) * Detects stale status from status collector failures * Records events for health transitions (Healthy→Warning, Warning→Error, etc.) * Prevents duplicate event spam with deduplication - Orphaned job cleanup: * Detects jobs for deleted nodes (node UID no longer exists) * Deletes jobs when node permanently removed * Logs cleanup actions - Status collector CronJob: * Creates/updates CronJob for collecting Pacemaker status * Runs every 30s on control plane nodes * Updates PacemakerCluster CR with health status - Replaces nodehandler.go with lifecycle-based approach See docs/tnf/controller-lifecycle.md for sync loop, event handling, and health monitoring details. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Adds comprehensive architecture documentation and makes alert configuration non-fatal: Documentation: - docs/tnf/README.md: Quick start guide, component overview, workflow summary - docs/tnf/architecture-overview.md: Detailed architecture, dual-mode operation, job types table Alert configuration fix: - setup/update-setup jobs now log warnings instead of failing when alert configuration fails - Allows Pacemaker setup to succeed even if alerts can't be configured - Provides clear troubleshooting guidance in warning messages See docs/tnf/architecture-overview.md for canonical Job Types table and system architecture. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 15

🧹 Nitpick comments (1)

pkg/tnf/pkg/tools/nodes.go (1)
80-100: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Sort reconciliation actions before returning.

desiredNodes and currentNodes are maps, so nodesToRemove / nodesToAdd ordering can change between runs. Sort both slices before returning to keep reconciliation command order and tests deterministic.
Proposed fix
 func DetermineReconciliationActions(desiredNodes, currentNodes map[string]string) (nodesToRemove, nodesToAdd []string) {
@@
-	return nodesToRemove, nodesToAdd
+	sort.Strings(nodesToRemove)
+	sort.Strings(nodesToAdd)
+	return nodesToRemove, nodesToAdd
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/tnf/pkg/tools/nodes.go` around lines 80 - 100, Sort the reconciliation
results before returning from the node diff logic in the function that builds
nodesToRemove and nodesToAdd, because iterating over currentNodes and
desiredNodes maps produces nondeterministic ordering. Update the return path in
this nodes.go reconciliation helper to sort both slices after they are populated
so the command order and tests stay deterministic.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tnf/architecture-overview.md`:
- Around line 315-318: Update the Node-specific jobs section to match the
UID-based tracking model: replace the node-name labeling language with UID-based
references, and make the cleanup description consistent with job-controller
behavior. Use the same terminology as the job-controller docs and related
node-job handling so the architecture overview no longer implies node-scoped
cleanup by name.

In `@docs/tnf/job-controllers.md`:
- Around line 88-96: The json.Marshal example in fencingJobConfigFunc has a
return type mismatch because it directly returns []byte, error from json.Marshal
while the function expects string, error. Update the example to marshal config
into a byte slice, handle the error explicitly, and return the marshaled payload
as a string with nil error on success; use fencingJobConfigFunc and json.Marshal
as the key symbols to locate the snippet.

In `@docs/tnf/reconciliation.md`:
- Around line 280-284: The job flow in the reconciliation diagram still shows
fencing inside the update setup path, but fencing is now handled by the separate
fencing job. Remove this “Update Fencing” block from the update-setup flow and
place it under the fencing job section in the docs/reconciliation content so the
flow matches the runner behavior.
- Around line 140-148: The drift example in the reconciliation docs uses a
3-node Kubernetes cluster, but the current reconciliation flow only supports up
to two nodes. Update the example in the scenario section to use a supported
one-to-two-node case, or explicitly label the 3-node case as future work; keep
the guidance aligned with the reconciliation behavior described by the TNF
scenario text.

In `@manifests/0000_20_etcd-operator_06_deployment.yaml`:
- Line 59: The default deployment for the operator still hardcodes the verbose
logging flag, which should not be shipped cluster-wide. Remove the -v=4 entry
from the deployment manifest and, if needed, make verbose logging available only
through a temporary debug override or an explicit conditional flag in the
deployment configuration.

In `@pkg/tnf/after-setup/runner.go`:
- Around line 144-146: The ConfigMap lookup in after-setup is using the wrong
label selector, which can miss the update-setup ConfigMap and cause polling
timeouts. Update the selector in the ConfigMaps list call within runner.go to
match the same update-setup label used by the other readers, and keep it
consistent with the update-setup producer’s component label rather than the job
name label.
- Around line 195-208: The update-setup wait logic in runner.go is matching any
completed Job with the label, which can let an older generation satisfy the
check too early. Update the completion check in the after-setup loop to verify
the Job belongs to the selected generation by comparing its generation/config
hash against the chosen ConfigMap before returning true. Keep the change
localized around the updateSetupJobs list and the jobs.IsConditionTrue
completion path so only the intended generation can unblock.

In `@pkg/tnf/fencing/runner.go`:
- Around line 113-114: The info-level log in runner.go is exposing internal
hostnames by printing the full nodeNames list; update the logging around
ConfigureFencing to log only a count or summary at info level, and move the full
nodeNames details to a debug/redacted path using the existing fencing runner
context so sensitive hostnames are not emitted in normal logs.
- Around line 154-177: The waitForUpdateSetupCompletion helper currently treats
a missing or failed update-setup Job as success, which lets fencing continue too
early. Update the logic in waitForUpdateSetupCompletion so that only
jobs.IsComplete returns done=true, while apierrors.IsNotFound and jobs.IsFailed
should both return false with a non-nil error (or equivalent retry-triggering
failure) so reconciliation keeps waiting or retries instead of proceeding. Keep
the changes localized to updateSetupDone and preserve the existing logging
around kubeClient.BatchV1().Jobs(...).Get and jobs.IsComplete/jobs.IsFailed.

In `@pkg/tnf/operator/lifecycle_helpers.go`:
- Around line 49-55: Missing node addresses should make the Pacemaker CR look
unavailable, not produce a partial membership map. In the helper that builds the
node map from pacemakerCR.Status.Nodes, stop skipping entries with empty
Addresses and instead return an error (preferably a typed “CR
incomplete/unavailable” error) so ReconcilePacemakerConfig() and
getActivePacemakerNodes() do not treat incomplete status as authoritative state.
Keep the map-building logic in this helper focused on only fully populated node
entries, and let callers handle the error path rather than acting on a partial
snapshot.

In `@pkg/tnf/operator/lifecycle_reconciliation.go`:
- Around line 713-717: The desired-state snapshot in the node loop is being
truncated when tools.GetNodeIPForPacemaker fails, because the code only logs a
warning and continues. Update the reconciliation path in
lifecycle_reconciliation.go so this failure is returned from the surrounding
reconciliation function instead of skipping the node, using the existing node
iteration logic around tools.GetNodeIPForPacemaker and the ConfigMap assembly to
abort and retry on error. Keep the node list complete for the snapshot rather
than writing a partial membership.

In `@pkg/tnf/pkg/jobs/tnf.go`:
- Around line 133-135: The retry logic in the TNF job flow is treating
target-node callback order as semantic, which can cause state resets or
incorrect node indexing when the same nodes arrive in a different order.
Canonicalize the target node list by sorting a copy by node name before calling
GetNodeNames and before using targetNodes[nodeIndex]. Apply the same ordering
normalization anywhere the retry state is compared or reconstructed so the logic
in the TNF job path remains stable across retries.
- Around line 153-155: The config-drift log in the job reset path is exposing
raw serialized job config via the klog.Infof call in the configChanged branch.
Update the logging in the job handling flow around the
JobConfigFunc/state.LastJobConfig comparison so it only reports that drift was
detected, or emits a non-sensitive fingerprint/digest instead of Old/New config
values. Keep the reset/delete behavior unchanged, but remove any direct logging
of serialized config contents.

In `@pkg/tnf/update-setup/runner.go`:
- Around line 370-382: The removePacemakerNodeGlobally flow is incorrectly
treating a failed pacemaker removal as success when
pcsClusterNodeRemoveOutputContains401 matches, allowing later etcd cleanup and
node_ip_map mutation to proceed unsafely. Update removePacemakerNodeGlobally to
return a retryable or explicit partial-failure error instead of nil in the 401
path, and have the caller that performs etcd member cleanup and node_ip_map
updates stop before mutating state unless pacemaker removal truly succeeded.
- Around line 510-539: decodeNodeList currently trusts ConfigMap-provided name
and ip fields, which later flow into shell command strings and node_ip_map
usage. Update decodeNodeList to validate each nodeInfo before constructing
corev1.Node: reject empty or non-allow-listed names and reject invalid IP values
at decode time. Keep the validation localized in decodeNodeList so callers like
the cluster add/remove flow never receive unsafe values.

---

Nitpick comments:
In `@pkg/tnf/pkg/tools/nodes.go`:
- Around line 80-100: Sort the reconciliation results before returning from the
node diff logic in the function that builds nodesToRemove and nodesToAdd,
because iterating over currentNodes and desiredNodes maps produces
nondeterministic ordering. Update the return path in this nodes.go
reconciliation helper to sort both slices after they are populated so the
command order and tests stay deterministic.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 71eb8ee1-fd94-45cd-a1fe-81ca12dc8c3a

📥 Commits

Reviewing files that changed from the base of the PR and between 9722e18 and 04fcb8d.

📒 Files selected for processing (45)

bindata/tnfdeployment/cronjob.yaml
bindata/tnfdeployment/role.yaml
docs/tnf/README.md
docs/tnf/architecture-overview.md
docs/tnf/controller-lifecycle.md
docs/tnf/job-controllers.md
docs/tnf/reconciliation.md
manifests/0000_20_etcd-operator_06_deployment.yaml
pkg/operator/ceohelpers/common.go
pkg/operator/ceohelpers/common_test.go
pkg/operator/ceohelpers/node_helpers.go
pkg/operator/ceohelpers/node_helpers_test.go
pkg/tnf/after-setup/runner.go
pkg/tnf/auth/runner.go
pkg/tnf/fencing/runner.go
pkg/tnf/operator/lifecycle_cleanup.go
pkg/tnf/operator/lifecycle_health.go
pkg/tnf/operator/lifecycle_helpers.go
pkg/tnf/operator/lifecycle_job_controllers.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/operator/lifecycle_manager_test.go
pkg/tnf/operator/lifecycle_reconciliation.go
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/operator/lifecycle_status_collector.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/operator/nodehandler_test.go
pkg/tnf/operator/starter.go
pkg/tnf/operator/starter_test.go
pkg/tnf/pkg/jobs/jobcontroller.go
pkg/tnf/pkg/jobs/tnf.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/pkg/jobs/utils.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go
pkg/tnf/pkg/pacemaker/statuscollector.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/pkg/tools/jobs.go
pkg/tnf/pkg/tools/nodes.go
pkg/tnf/setup/runner.go
pkg/tnf/update-setup/runner.go
pkg/tnf/update-setup/runner_test.go

💤 Files with no reviewable changes (3)

pkg/tnf/operator/nodehandler_test.go
pkg/tnf/operator/nodehandler.go
pkg/tnf/pkg/pacemaker/healthcheck_test.go

✅ Files skipped from review due to trivial changes (2)

bindata/tnfdeployment/cronjob.yaml
docs/tnf/README.md

🚧 Files skipped from review as they are similar to previous changes (17)

bindata/tnfdeployment/role.yaml
pkg/tnf/operator/lifecycle_reconciliation_test.go
pkg/tnf/auth/runner.go
pkg/tnf/update-setup/runner_test.go
pkg/tnf/pkg/pacemaker/client_helpers.go
pkg/operator/ceohelpers/common.go
pkg/operator/ceohelpers/common_test.go
pkg/tnf/pkg/pacemaker/statuscollector_test.go
pkg/tnf/operator/lifecycle_job_controllers_test.go
pkg/tnf/operator/starter_test.go
pkg/tnf/pkg/pcs/auth.go
pkg/tnf/operator/lifecycle_manager.go
pkg/tnf/pkg/jobs/tnf_test.go
pkg/tnf/operator/starter.go
pkg/tnf/pkg/pacemaker/test_helpers.go
pkg/tnf/pkg/pacemaker/healthcheck.go
pkg/tnf/operator/lifecycle_manager_test.go

coderabbitai · 2026-06-26T21:11:25Z

+**Node-specific jobs** (auth, after-setup):
+- Labeled with `node: <node-name>`
+- Tied to node lifecycle via UID
+- Cleaned up when node is deleted


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Align this section with the UID-based job tracking change.

This still says node-scoped jobs are labeled by node name, but the PR objective and the later job-controller doc describe UID-based tracking/cleanup. That inconsistency will mislead anyone debugging orphan cleanup behavior.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tnf/architecture-overview.md` around lines 315 - 318, Update the Node-specific jobs section to match the UID-based tracking model: replace the node-name labeling language with UID-based references, and make the cleanup description consistent with job-controller behavior. Use the same terminology as the job-controller docs and related node-job handling so the architecture overview no longer implies node-scoped cleanup by name.

coderabbitai · 2026-06-26T21:11:25Z

+```go
+// jobConfigFunc captures node UIDs and fencing secret UIDs to detect infrastructure changes
+fencingJobConfigFunc := func() (string, error) {
+    config := map[string]interface{}{
+        "nodeUIDs": []string{node1.UID, node2.UID},           // Detect node replacements
+        "secretUIDs": []string{secret1.UID, secret2.UID},     // Detect secret rotation
+    }
+    return json.Marshal(config)
+}


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

What is the Go standard library signature of encoding/json.Marshal, and what is the idiomatic way to return its result from a function with signature func() (string, error)?

💡 Result:

The standard library signature for encoding/json.Marshal is: func Marshal(v any) ([]byte, error) [1][2] The idiomatic way to return its result from a function with signature func (string, error) is to perform the marshalling, check the error immediately, and then convert the resulting byte slice to a string if successful [3][4]. Here is the idiomatic implementation pattern: func GetJSONString (string, error) { bytes, err:= json.Marshal(someData) if err!= nil { return "", err } return string(bytes), nil } In Go, it is standard practice to separate the declaration of the assignment (bytes, err:=...) and the error check (if err!= nil) [3]. This approach is clear, readable, and consistent with how errors are handled throughout the Go ecosystem [3][4].

Citations:

1: https://pkg.go.dev/encoding/json@go1.26.4

2: https://tip.golang.org/pkg/json

3: https://stackoverflow.com/questions/46024818/idiomatic-go-happy-path

4: https://jsonic.io/guides/json-marshal-unmarshal-go

Fix the json.Marshal example. json.Marshal returns ([]byte, error), so return json.Marshal(config) does not match func() (string, error). Marshal into a byte slice, check the error, then return string(payload), nil.

Suggested fix

payload, err := json.Marshal(config) if err != nil { return "", err } return string(payload), nil

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```go

// jobConfigFunc captures node UIDs and fencing secret UIDs to detect infrastructure changes

fencingJobConfigFunc := func() (string, error) {

config := map[string]interface{}{

"nodeUIDs": []string{node1.UID, node2.UID}, // Detect node replacements

"secretUIDs": []string{secret1.UID, secret2.UID}, // Detect secret rotation

}

return json.Marshal(config)

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tnf/job-controllers.md` around lines 88 - 96, The json.Marshal example in fencingJobConfigFunc has a return type mismatch because it directly returns []byte, error from json.Marshal while the function expects string, error. Update the example to marshal config into a byte slice, handle the error explicitly, and return the marshaled payload as a string with nil error on success; use fencingJobConfigFunc and json.Marshal as the key symbols to locate the snippet.

coderabbitai · 2026-06-26T21:11:26Z

+**Scenario 1: Node Added to Kubernetes**
+
+```text
+Kubernetes:  [master-0, master-1, master-2]
+Pacemaker:   [master-0, master-1]
+
+Drift: YES - master-2 exists in K8s but not Pacemaker
+Action: Add master-2 to Pacemaker cluster
+```


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Keep the drift example within supported TNF node counts.

This scenario shows adding master-2, but the implementation rejects clusters with more than two nodes. Use a one-to-two-node example or call out that 3-node reconciliation is future work.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tnf/reconciliation.md` around lines 140 - 148, The drift example in the reconciliation docs uses a 3-node Kubernetes cluster, but the current reconciliation flow only supports up to two nodes. Update the example in the scenario section to use a supported one-to-two-node case, or explicitly label the 3-node case as future work; keep the guidance aligned with the reconciliation behavior described by the TNF scenario text.

coderabbitai · 2026-06-26T21:11:26Z

+        ┌────────────────────────────────┐
+        │ Update Fencing:                │
+        │ - Update STONITH configuration │
+        │ - Add/remove fence agents      │
+        └────────────────────────────────┘


📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Update the job flow to show fencing as a separate job.

The runner now explicitly says fencing is handled by the fencing job, not inline in update-setup. This block should be removed or moved under the fencing job section.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tnf/reconciliation.md` around lines 280 - 284, The job flow in the reconciliation diagram still shows fencing inside the update setup path, but fencing is now handled by the separate fencing job. Remove this “Update Fencing” block from the update-setup flow and place it under the fencing job section in the docs/reconciliation content so the flow matches the runner behavior.

coderabbitai · 2026-06-26T21:11:26Z

        - "--terminate-on-files=/var/run/secrets/etcd-client/tls.key"
        - "--terminate-on-files=/var/run/configmaps/etcd-ca/ca-bundle.crt"
        - "--terminate-on-files=/var/run/configmaps/etcd-service-ca/service-ca.crt"
+        - "-v=4"


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Avoid shipping -v=4 in the default deployment.

This enables verbose operator logging cluster-wide, and the new TNF paths emit control-plane node and reconciliation details at V(4). That widens internal-hostname exposure in aggregated logs and increases log volume for every cluster. Prefer a temporary debug override or a conditional debug flag instead.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@manifests/0000_20_etcd-operator_06_deployment.yaml` at line 59, The default deployment for the operator still hardcodes the verbose logging flag, which should not be shipped cluster-wide. Remove the -v=4 entry from the deployment manifest and, if needed, make verbose logging available only through a temporary debug override or an explicit conditional flag in the deployment configuration.

coderabbitai · 2026-06-26T21:11:26Z

+		ip, err := tools.GetNodeIPForPacemaker(*node)
+		if err != nil {
+			klog.Warningf("Failed to get IP for node %s: %v - skipping from ConfigMap", node.Name, err)
+			continue
+		}


🗄️ Data Integrity & Integration | 🔴 Critical | ⚡ Quick win

Don’t silently drop nodes from the desired-state snapshot.

If tools.GetNodeIPForPacemaker fails here, the node is omitted from the ConfigMap instead of aborting reconciliation. update-setup then sees a smaller desired membership and can reconfigure pacemaker down to one node — exactly the split-brain path this PR is trying to prevent during reboot windows. Bubble the error up and retry instead of encoding a partial node list.

Suggested fix

for _, node := range nodes { ip, err := tools.GetNodeIPForPacemaker(*node) if err != nil { - klog.Warningf("Failed to get IP for node %s: %v - skipping from ConfigMap", node.Name, err) - continue + return "", fmt.Errorf("failed to get IP for node %s: %w", node.Name, err) } infos = append(infos, nodeInfo{ Name: node.Name, IP: ip,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

ip, err := tools.GetNodeIPForPacemaker(*node)

if err != nil {

klog.Warningf("Failed to get IP for node %s: %v - skipping from ConfigMap", node.Name, err)

continue

}

ip, err := tools.GetNodeIPForPacemaker(*node)

if err != nil {

return "", fmt.Errorf("failed to get IP for node %s: %w", node.Name, err)

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/operator/lifecycle_reconciliation.go` around lines 713 - 717, The desired-state snapshot in the node loop is being truncated when tools.GetNodeIPForPacemaker fails, because the code only logs a warning and continues. Update the reconciliation path in lifecycle_reconciliation.go so this failure is returned from the surrounding reconciliation function instead of skipping the node, using the existing node iteration logic around tools.GetNodeIPForPacemaker and the ConfigMap assembly to abort and retry on error. Keep the node list complete for the snapshot rather than writing a partial membership.

coderabbitai · 2026-06-26T21:11:26Z

+	currentTargetNodes := tools.GetNodeNames(targetNodes)
+	nodesChanged := !slicesEqual(state.TargetNodes, currentTargetNodes)
+


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Canonicalize target-node order before comparing and indexing.

The retry state treats callback order as semantic. If targetNodesFunc returns the same nodes in a different order, this can reset/delete an in-flight job or schedule NodeIndex against a different node. Sort a copy of target nodes by name before GetNodeNames and before selecting targetNodes[nodeIndex].

Also applies to: 286-305, 310-320

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/jobs/tnf.go` around lines 133 - 135, The retry logic in the TNF job flow is treating target-node callback order as semantic, which can cause state resets or incorrect node indexing when the same nodes arrive in a different order. Canonicalize the target node list by sorting a copy by node name before calling GetNodeNames and before using targetNodes[nodeIndex]. Apply the same ordering normalization anywhere the retry state is compared or reconstructed so the logic in the TNF job path remains stable across retries.

coderabbitai · 2026-06-26T21:11:26Z

+		if configChanged {
+			klog.Infof("Job %s config changed - resetting retry state and deleting job. Old: %s, New: %s",
+				jobName, state.LastJobConfig, currentConfig)


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Avoid logging raw serialized job config.

JobConfigFunc is a generic serialized config surface; logging old/new values can expose node identifiers, secret identifiers, or future sensitive fields. Log only that drift was detected, or log a non-sensitive fingerprint.

Proposed fix

- klog.Infof("Job %s config changed - resetting retry state and deleting job. Old: %s, New: %s", - jobName, state.LastJobConfig, currentConfig) + klog.Infof("Job %s config changed - resetting retry state and deleting job", jobName)

As per coding guidelines, flag logging that may expose passwords, tokens, API keys, PII, session IDs, internal hostnames, or customer data.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if configChanged {

klog.Infof("Job %s config changed - resetting retry state and deleting job. Old: %s, New: %s",

jobName, state.LastJobConfig, currentConfig)

if configChanged {

klog.Infof("Job %s config changed - resetting retry state and deleting job", jobName)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/jobs/tnf.go` around lines 153 - 155, The config-drift log in the job reset path is exposing raw serialized job config via the klog.Infof call in the configChanged branch. Update the logging in the job handling flow around the JobConfigFunc/state.LastJobConfig comparison so it only reports that drift was detected, or emits a non-sensitive fingerprint/digest instead of Old/New config values. Keep the reset/delete behavior unchanged, but remove any direct logging of serialized config contents.

Source: Coding guidelines

coderabbitai · 2026-06-26T21:11:26Z

+func removePacemakerNodeGlobally(ctx context.Context, nodeToRemove string) error {
+	cmd := fmt.Sprintf("/usr/sbin/pcs cluster node remove %s --force --skip-offline", nodeToRemove)
+	stdOut, stdErr, err := exec.Execute(ctx, cmd)
+	if err != nil {
+		if pcsClusterNodeRemoveOutputContains401(stdOut, stdErr) {
+			klog.Warningf(
+				"pcs cluster node remove for %q did not succeed and output contained 401: "+
+					"the cluster may still see a live endpoint for that member that rejects this node's PCSD credentials — "+
+					"commonly a replacement node is answering at the same IP the cluster associates with the old member. "+
+					"Skipping fatal error; reconcile sync/start and fencing may still help. stdout=%q stderr=%q err=%v",
+				nodeToRemove, stdOut, stdErr, err,
+			)
+			return nil


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Do not treat a failed pacemaker removal as success.

Returning nil on 401 lets the caller continue with etcd member cleanup and node_ip_map updates even though the pacemaker member may still exist. Return a retryable error or an explicit partial-success signal before mutating etcd state.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/update-setup/runner.go` around lines 370 - 382, The removePacemakerNodeGlobally flow is incorrectly treating a failed pacemaker removal as success when pcsClusterNodeRemoveOutputContains401 matches, allowing later etcd cleanup and node_ip_map mutation to proceed unsafely. Update removePacemakerNodeGlobally to return a retryable or explicit partial-failure error instead of nil in the 401 path, and have the caller that performs etcd member cleanup and node_ip_map updates stop before mutating state unless pacemaker removal truly succeeded.

coderabbitai · 2026-06-26T21:11:26Z

+func decodeNodeList(data string) ([]*corev1.Node, error) {
+	type nodeInfo struct {
+		Name string `json:"name"`
+		IP   string `json:"ip"`
+	}
+
+	var infos []nodeInfo
+	if err := json.Unmarshal([]byte(data), &infos); err != nil {
+		return nil, err
+	}
+
+	nodes := make([]*corev1.Node, len(infos))
+	for i, info := range infos {
+		// Create minimal node with name and IP address
+		nodes[i] = &corev1.Node{
+			ObjectMeta: metav1.ObjectMeta{
+				Name: info.Name,
+			},
+			Status: corev1.NodeStatus{
+				Addresses: []corev1.NodeAddress{
+					{
+						Type:    corev1.NodeInternalIP,
+						Address: info.IP,
+					},
+				},
+			},
+		}
+	}
+
+	return nodes, nil


🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Validate ConfigMap node names and IPs before shell use.

decodeNodeList accepts arbitrary name / ip values, and those values later feed bash -c command strings such as pcs cluster node add/remove and node_ip_map. Reject empty or non-allow-listed node names and invalid IPs at decode time. As per path instructions, “Command: no shell=True ... with user input” and “Validate at trust boundaries with allow-lists, not deny-lists.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/update-setup/runner.go` around lines 510 - 539, decodeNodeList currently trusts ConfigMap-provided name and ip fields, which later flow into shell command strings and node_ip_map usage. Update decodeNodeList to validate each nodeInfo before constructing corev1.Node: reject empty or non-allow-listed names and reject invalid IP values at decode time. Keep the validation localized in decodeNodeList so callers like the cluster add/remove flow never receive unsafe values.

Source: Path instructions

openshift-ci Bot requested review from clobrano and slintes April 29, 2026 15:35

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2026

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 29, 2026

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread pkg/tnf/update-setup/runner_test.go Outdated

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from e9ed644 to 0533a44 Compare April 29, 2026 15:53

slintes reviewed Apr 29, 2026

View reviewed changes

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from 0533a44 to 3bf1712 Compare April 29, 2026 21:10

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread pkg/tnf/operator/nodehandler.go Outdated

Comment thread pkg/tnf/update-setup/runner_test.go

Comment thread pkg/tnf/update-setup/runner.go Outdated

Comment thread pkg/tnf/update-setup/runner.go Outdated

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch 2 times, most recently from 78f3141 to fad9968 Compare April 29, 2026 21:33

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread pkg/tnf/update-setup/runner.go Outdated

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from fad9968 to 03a2a57 Compare April 29, 2026 21:37

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread pkg/tnf/update-setup/runner.go Outdated

fonta-rh reviewed Apr 30, 2026

View reviewed changes

jaypoulz changed the title ~~OCPBUGS-84695: fix TNF pacemaker split-brain by checking node existence before replacement~~ OCPBUGS-84695: update node-to-pacemaker reconciliation logic to prevent pacemaker split-brain Apr 30, 2026

jaypoulz marked this pull request as draft April 30, 2026 16:23

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch 2 times, most recently from f3779b1 to 880d8ca Compare June 11, 2026 20:09

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread pkg/tnf/operator/lifecycle_manager_test.go Outdated

Comment thread pkg/tnf/operator/lifecycle_manager_test.go

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch 2 times, most recently from 888e2f1 to ff4e65d Compare June 11, 2026 21:49

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread pkg/tnf/operator/lifecycle_job_controllers_test.go

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch 3 times, most recently from bf8e8df to 3f964fb Compare June 12, 2026 20:36

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

Comment thread pkg/tnf/operator/lifecycle_health.go

Comment thread pkg/tnf/update-setup/runner.go Outdated

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch 3 times, most recently from 47dec5c to 0a1dee8 Compare June 15, 2026 16:03

coderabbitai Bot reviewed Jun 16, 2026

View reviewed changes

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from 9722e18 to 12cff1f Compare June 25, 2026 20:03

jaypoulz and others added 6 commits June 25, 2026 16:05

Capturing WIP review updates

04fcb8d

jaypoulz force-pushed the OCPBUGS-84695-tnf-update-setup-node-check branch from 12cff1f to 04fcb8d Compare June 26, 2026 20:53

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

		- Reconciliation logic: tested in nodehandler_test.go (single source of truth)
		- Helper functions (decodeStringList): implicitly tested through business logic

		currentTargetNodes := tools.GetNodeNames(targetNodes)
		nodesChanged := !slicesEqual(state.TargetNodes, currentTargetNodes)

Uh oh!

Conversation

jaypoulz commented Apr 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

OCPBUGS-84695: Prevent pacemaker split-brain during upgrades

Summary by CodeRabbit

Uh oh!

openshift-ci-robot commented Apr 29, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

openshift-ci-robot commented Apr 29, 2026

Summary by CodeRabbit

Uh oh!

jaypoulz commented Apr 29, 2026

Uh oh!

openshift-ci-robot commented Apr 29, 2026

Uh oh!

jaypoulz commented Apr 29, 2026

Uh oh!

openshift-ci Bot commented Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jaypoulz commented Apr 29, 2026

Uh oh!

openshift-ci-robot commented Apr 29, 2026

Uh oh!

slintes Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

jaypoulz Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

jaypoulz Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

slintes Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jaypoulz Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jaypoulz Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fonta-rh left a comment

Choose a reason for hiding this comment

Review: OCPBUGS-84695 — Fix TNF Pacemaker Split-Brain

1. Dead code: removeEtcdMemberByIP() + unused getNodeIP parameter

2. Missing tests for intermediate setup states

3. removeEtcdMembersForNodes docstring vs behavior mismatch

Uh oh!

jaypoulz commented Jun 11, 2026

Uh oh!

jaypoulz commented Apr 29, 2026 •

edited by coderabbitai Bot

Loading

openshift-ci-robot commented Apr 29, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading

1. Dead code: `removeEtcdMemberByIP()` + unused `getNodeIP` parameter

3. `removeEtcdMembersForNodes` docstring vs behavior mismatch

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading