OCPEDGE-2705: Add TNF Pacemaker metrics controller#1634
Conversation
Implement a metrics controller that projects PacemakerCluster CR conditions into Prometheus gauges for the TNF monitoring pipeline. The controller shares the existing pacemakerInformer with HealthCheck, registers 20 metric families (~53 series on a 2-node cluster), and uses table-driven sync logic with GaugeVec.Reset() to prevent stale data from removed nodes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@fonta-rh: This pull request references OCPEDGE-2705 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughA Pacemaker metrics controller is added to collect cluster, node, and resource Prometheus metrics from ChangesPacemaker Metrics Reporting
Estimated code review effort: 3 (Moderate) | ~25 minutes 🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/tnf/pkg/metriccontroller/controller_test.go (1)
21-68: 💤 Low valueConsider extracting fakeInformer to a shared test utility.
The comment on line 22 notes this is copied from
pacemaker/healthcheck_test.go. To reduce duplication, consider extracting this to a shared test helper package (e.g.,pkg/tnf/pkg/testutil) so both test files can reuse the same implementation.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/tnf/pkg/metriccontroller/controller_test.go` around lines 21 - 68, The fakeInformer struct and createFakeInformer function are duplicated between controller_test.go and pacemaker/healthcheck_test.go as noted in the comment. Extract both the createFakeInformer function and the fakeInformer type definition (along with all its method implementations) to a shared test utility package such as pkg/tnf/pkg/testutil. Update both test files to import and use the shared implementation instead of maintaining duplicate code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/operator/starter.go`:
- Around line 320-328: The type assertion on legacyregistry.DefaultGatherer to
metrics.KubeRegistry is unsafe and will cause a panic if the assertion fails at
runtime. Replace the bare type assertion with the comma-ok idiom by capturing
both the converted value and a boolean success indicator, then check the boolean
before proceeding. If the assertion fails, handle the error appropriately (e.g.,
log an error and return or exit) before creating the metricsController with
NewPacemakerMetricsController.
---
Nitpick comments:
In `@pkg/tnf/pkg/metriccontroller/controller_test.go`:
- Around line 21-68: The fakeInformer struct and createFakeInformer function are
duplicated between controller_test.go and pacemaker/healthcheck_test.go as noted
in the comment. Extract both the createFakeInformer function and the
fakeInformer type definition (along with all its method implementations) to a
shared test utility package such as pkg/tnf/pkg/testutil. Update both test files
to import and use the shared implementation instead of maintaining duplicate
code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6cfb780f-0e07-4d69-909e-d7f8d653442a
📒 Files selected for processing (4)
pkg/tnf/operator/starter.gopkg/tnf/pkg/metriccontroller/controller.gopkg/tnf/pkg/metriccontroller/controller_test.gopkg/tnf/pkg/metriccontroller/metrics.go
Both metriccontroller and pacemaker tests had identical 47-line fakeInformer implementations. Move to pkg/tnf/internal/testutil/ so future interface changes only need one update. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/tnf/internal/testutil/fakeinformer.go`:
- Around line 19-21: The `store.Add(obj)` call in the function containing the if
statement is suppressing the error with the blank identifier (_), which can hide
test setup failures. Modify the function signature to return an error type, and
instead of discarding the error, return it if `store.Add(obj)` fails. Then
update all five call sites of this function across controller_test.go and
healthcheck_test.go to handle the returned error appropriately by checking for
nil and failing the test if an error is returned.
In `@pkg/tnf/pkg/pacemaker/healthcheck_test.go`:
- Around line 53-59: The test setup in createTestHealthCheckWithMockStatus has
an implicit coupling between how the fake informer indexes objects (by
PacemakerCluster.Name) and how getPacemakerStatus fetches them (by
PacemakerClusterResourceName constant). Make this contract explicit by setting
mockStatus.Name to match the expected lookup key before passing it to
testutil.CreateFakeInformer(mockStatus), ensuring the mock object can be found
when getPacemakerStatus attempts to retrieve it by its fixed resource name.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9c1e7295-27cc-4c14-9264-76fe10a05aa9
📒 Files selected for processing (3)
pkg/tnf/internal/testutil/fakeinformer.gopkg/tnf/pkg/metriccontroller/controller_test.gopkg/tnf/pkg/pacemaker/healthcheck_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/tnf/pkg/metriccontroller/controller_test.go
| {pacmkrv1.ResourceSchedulableConditionType, resourceSchedulableGauge}, | ||
| } | ||
|
|
||
| func conditionToFloat64(cond *metav1.Condition) float64 { |
There was a problem hiding this comment.
One thing to note is that this choice eliminates the distinction between status unknown (nil) and status unhealthy (0).
That said, I'm pretty sure this choice was made at the API level as well. The only time I'm not sure this should apply is during initial setup, before the pacemakercluster CR is populated.
There was a problem hiding this comment.
The CR is always created with conditions populated — StatusCollector writes Status: *status in the initial create (statuscollector.go:825), then immediately PUTs the status subresource. There's no window where the CR exists without conditions.
Before the CR exists, the informer store is empty, so sync() hits the "not found" path and resets all gauges to 0. That's the intended behavior — during initial setup, nothing is healthy yet, and 0 is the safe default for alerting (== 0 fires without needing absent()).
There was a problem hiding this comment.
Excellent. Glad we've thought this through.
| klog.V(4).Infof("syncing Pacemaker metrics") | ||
|
|
||
| item, exists, err := c.pacemakerInformer.GetStore().GetByKey(pacemaker.PacemakerClusterResourceName) | ||
| if err != nil { |
There was a problem hiding this comment.
should we log this error?
There was a problem hiding this comment.
The factory controller framework (library-go) already logs returned errors and retries with exponential backoff — adding a log here would duplicate the message on each retry attempt.
There was a problem hiding this comment.
Awesome - good that the error is logged. :)
| metricsController := metriccontroller.NewPacemakerMetricsController( | ||
| pacemakerInformer, | ||
| controllerContext.EventRecorder, | ||
| legacyregistry.DefaultGatherer.(metrics.KubeRegistry), |
There was a problem hiding this comment.
Is default gatherer guaranteed to be a KubeRegistry? if not, it might be worth check if it is first, so that we can catch the error and log it.
There was a problem hiding this comment.
This matches the existing pattern at pkg/operator/starter.go:440 (non-TNF starter). DefaultGatherer is initialized by legacyregistry as a KubeRegistry — it can't be a different type at runtime.
Happy to add a guard, but this is the only file we touch outside of our owned scope (pkg/tnf/), so it's the one I want to be most consistent with the existing codebase in.
There was a problem hiding this comment.
Agreed. I'd follow the existing pattern and minimize the diff, like you said.
|
|
||
| var ( | ||
| clusterHealthyGauge = metrics.NewGauge(&metrics.GaugeOpts{ | ||
| Name: "tnf_cluster_healthy", |
There was a problem hiding this comment.
This is chai-bot, but it claims:
Using Namespace: "tnf" + Subsystem: "cluster" in GaugeOpts is the k8s convention and would produce the same wire name while making the code more structured
I'm not one way or another but I feel like this would be low effort for slightly cleaner structure.
There was a problem hiding this comment.
Didn't know about this convention — good point. I was going with the existing CEO pattern from their metric (signerExpirationGauge). All the fields in GaugeOpts are concatenated, so the final wire name is still tnf_cluster_healthy, etc. — no breaking change.
Adopted it: Namespace: "tnf" with Subsystem per tier (cluster, node, resource).
| # TYPE tnf_resource_started gauge | ||
| tnf_resource_started{node="master-0",resource="Etcd"} 0 | ||
| tnf_resource_started{node="master-0",resource="Kubelet"} 1 | ||
| tnf_resource_started{node="master-1",resource="Etcd"} 1 |
There was a problem hiding this comment.
etcd cannot be active while kubelet is dead :) probably not relevant but something I noticed
There was a problem hiding this comment.
Yeah, not a realistic combination — but the test is just exercising the per-resource mapping logic, not simulating a real failure. The fixture sets one resource unhealthy per node to verify conditions don't bleed across resources.
jaypoulz
left a comment
There was a problem hiding this comment.
Overall this is a great extension to the pacemaker healthcheck family of controllers. :)
As a standalone patch that gathers up these stats, I think this is quite reasonable. My only concern is - what happens if a node was populating information and then suddenly stops. PacemakerCluster would drop it entirely. Maybe an improvement here would be to initiate the node list from the k8s node list, and then populate what's available from the CR. This way my lifecycle manager work plays nicely as it adds and removes nodes.
| registry := setupAndSync(t, buildCluster(withNodes("node-alpha", "node-beta"))) | ||
|
|
||
| require.NoError(t, testutil.GatherAndCompare(registry, strings.NewReader(` | ||
| # HELP tnf_node_online [ALPHA] Whether a TNF node is online (1=online, 0=offline) |
There was a problem hiding this comment.
I'm not crazy about the string fixture approach. Feels like there's maybe a more programmatic way of synthesizing these, or we throw them in a fixtures file to make this more readable.
There was a problem hiding this comment.
These are the standard k8s testutil.GatherAndCompare assertions — they require Prometheus text exposition format, so there's no programmatic builder to use here. Each string is only 3-6 lines and is tightly coupled to the buildCluster() options in the same test, so I think keeping them inline makes each test easier to read as a self-contained unit. Moving them to fixture files would separate the "what we set up" from the "what we expect" without reducing the overall line count.
Review feedback: structure metric definitions with Namespace: "tnf" and Subsystem per tier (cluster, node, resource). Wire names unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@fonta-rh: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jaypoulz The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary
pacemakerInformerwith HealthCheck — zero additional API server loadGaugeVec.Reset()to prevent stale data from removed nodesDetails
New package
pkg/tnf/pkg/metriccontroller/with three files:metrics.goconditionToFloat64helpercontroller.gometrics.KubeRegistryinjection, sync logiccontroller_test.goWiring in
pkg/tnf/operator/starter.go— controller created after HealthCheck, sharingpacemakerInformer.Key design decisions
EtcdCertSignerControllerpattern)== 0vsabsent())StabilityLevel: metrics.ALPHA— explicit on all metricsMetrics exposed
GaugeGaugeVecnodeGaugeVecnode,resourceTest plan
go build ./...— all binaries compilego test ./pkg/tnf/pkg/metriccontroller/ -v -count=5— 55/55 pass, no registration pollutiongo test ./pkg/tnf/... -count=1— 454 existing TNF tests pass (no regressions)make build— cleanmake test-unit— all tests pass with-race:8443/metricsand in PrometheusE2E validation (11 scenarios on live cluster)
17/20 metrics dynamically validated via reversible
pcsoperations (maintenance mode, unmanage, disable, standby, fence disable, cluster stop, force-stop, double-ban, virsh destroy, kubelet kill). 3 remaining are not dynamically testable:node_ready(too-transient, 1-3s vs 60s poll),node_clean(API blind spot — quorum loss blocks CR writes),cluster_node_count_as_expected(structurally unreachable — configured membership fixed at 2). All code paths covered by unit tests.Related
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
PacemakerCluster, including cluster condition gauges, per-node online status, and per-resource state.Tests