STOR-3011: Implement CSO upgrade check and Prometheus alerts for SELinuxMount readiness#714
STOR-3011: Implement CSO upgrade check and Prometheus alerts for SELinuxMount readiness#714rvagner78 wants to merge 9 commits into
Conversation
- Introduced a new alert `SELinuxMountGAReadinessWorkloadsDetected` to monitor workloads with SELinux volume conflicts, providing a summary and runbook URL for remediation. - Integrated the SELinuxMount readiness controller into both Standalone and HyperShift operator starters to enhance monitoring capabilities.
|
@rvagner78: This pull request references STOR-3011 which is a valid jira issue. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Caution Review failedAn error occurred during the review process. Please try again later. WalkthroughA new ChangesSELinux Mount Readiness
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rvagner78 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
pkg/operator/selinuxmountreadiness/controller.go (1)
114-151: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDeduplicate conflict parsing. The method
conflictsPresent(lines 122-128) and the exportedConflictsPresent(lines 146-150) implement identical key-lookup andConditionTruecomparison. Have the method delegate to the exported function so the parsing contract stays single-sourced and the tests incontroller_test.goactually cover the production path.♻️ Proposed refactor
func (c *Controller) conflictsPresent() (present bool, found bool, err error) { cm, err := c.configMapLister.Get(selinuxConflictsConfigMapName) if apierrors.IsNotFound(err) { return false, false, nil } if err != nil { return false, false, err } - value, ok := cm.Data[selinuxConflictsDataKey] - if !ok { - klog.V(2).Infof("ConfigMap %s/%s exists but is missing key %q, treating as no conflicts", - csoclients.CloudConfigNamespace, selinuxConflictsConfigMapName, selinuxConflictsDataKey) - return false, true, nil - } - return metav1.ConditionStatus(value) == metav1.ConditionTrue, true, nil + present, found = ConflictsPresent(cm) + if found && cm.Data[selinuxConflictsDataKey] == "" { + klog.V(2).Infof("ConfigMap %s/%s exists but is missing key %q, treating as no conflicts", + csoclients.CloudConfigNamespace, selinuxConflictsConfigMapName, selinuxConflictsDataKey) + } + return present, found, nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/selinuxmountreadiness/controller.go` around lines 114 - 151, The conflict parsing logic is duplicated between conflictsPresent and ConflictsPresent, so update conflictsPresent to delegate to ConflictsPresent after retrieving the ConfigMap and handling NotFound/errors. Keep the ConfigMap lookup and missing-key logging in conflictsPresent, but centralize the key lookup and metav1.ConditionTrue check in ConflictsPresent so the parsing behavior stays single-sourced and controller_test.go covers the production path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@manifests/09_selinux-mount-readiness-configmap-reader-role.yaml`:
- Around line 16-25: The ConfigMap Role is too narrowly scoped for the
unfiltered informer used by selinux mount readiness. In
pkg/operator/selinuxmountreadiness/controller.go, the ConfigMap informer created
via InformersFor(...).Core().V1().ConfigMaps().Informer() LIST/WATCHes all
ConfigMaps, so update the RBAC in
manifests/09_selinux-mount-readiness-configmap-reader-role.yaml to keep get
restricted to selinux-conflicts but allow list/watch without resourceNames, or
alternatively change the informer to use a fieldSelector for
metadata.name=selinux-conflicts so the current scoping still works.
In `@manifests/12_prometheusrules.yaml`:
- Around line 95-106: The SELinuxMount readiness alert’s runbook link is
pointing to an enhancement document instead of a published operational runbook.
Update the SELinuxMountGAReadinessWorkloadsDetected alert annotation in the
Prometheus rules to use a valid openshift/runbooks URL (or an approved KCS
article) that SREs can act on, matching the pattern used by the other alerts in
this file.
In `@pkg/operator/selinuxmountreadiness/controller.go`:
- Around line 32-33: Replace the placeholder KCSArticleURL with the final
customer-facing runbook/KCS link before GA, since upgradeBlockedMessage() and
the alert runbook_url surface it directly to users. Update the constant in
controller.go to point to the published support article, and verify any
references in the selinuxmountreadiness controller and related alerting code
continue to use KCSArticleURL so the change is centralized.
---
Nitpick comments:
In `@pkg/operator/selinuxmountreadiness/controller.go`:
- Around line 114-151: The conflict parsing logic is duplicated between
conflictsPresent and ConflictsPresent, so update conflictsPresent to delegate to
ConflictsPresent after retrieving the ConfigMap and handling NotFound/errors.
Keep the ConfigMap lookup and missing-key logging in conflictsPresent, but
centralize the key lookup and metav1.ConditionTrue check in ConflictsPresent so
the parsing behavior stays single-sourced and controller_test.go covers the
production path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b0f5d8f0-da7a-4e5b-b5ef-2937e1faaa1c
📒 Files selected for processing (5)
manifests/09_selinux-mount-readiness-configmap-reader-role.yamlmanifests/12_prometheusrules.yamlpkg/operator/operator_starter.gopkg/operator/selinuxmountreadiness/controller.gopkg/operator/selinuxmountreadiness/controller_test.go
| - apiGroups: | ||
| - "" | ||
| resources: | ||
| - configmaps | ||
| resourceNames: | ||
| - selinux-conflicts | ||
| verbs: | ||
| - get | ||
| - list | ||
| - watch |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Confirm the configmaps informer is unscoped (no fieldSelector) for the openshift-config namespace.
fd -t f csoclients.go pkg/csoclients | head
rg -nP 'InformersFor\(.*CloudConfigNamespace|NewFilteredSharedInformerFactory|FieldSelector|tweakListOptions' pkg/csoclients pkg/operator/selinuxmountreadiness -C2Repository: openshift/cluster-storage-operator
Length of output: 1352
🏁 Script executed:
#!/bin/bash
sed -n '1,120p' manifests/09_selinux-mount-readiness-configmap-reader-role.yaml
printf '\n---\n'
sed -n '1,120p' pkg/operator/selinuxmountreadiness/controller.goRepository: openshift/cluster-storage-operator
Length of output: 5864
🏁 Script executed:
#!/bin/bash
sed -n '1,120p' manifests/09_selinux-mount-readiness-configmap-reader-role.yaml
printf '\n---\n'
sed -n '1,120p' pkg/operator/selinuxmountreadiness/controller.goRepository: openshift/cluster-storage-operator
Length of output: 5864
resourceNames does not cover the unscoped ConfigMap informer
pkg/operator/selinuxmountreadiness/controller.go builds an informer with clients.KubeInformers.InformersFor(csoclients.CloudConfigNamespace).Core().V1().ConfigMaps().Informer(), which LIST/WATCHes all ConfigMaps in openshift-config. This Role only grants get/list/watch on resourceNames: ["selinux-conflicts"], so the informer requests will be forbidden under least-privilege.
Keep get scoped to selinux-conflicts, but either remove resourceNames from list/watch or switch the informer to a fieldSelector on metadata.name=selinux-conflicts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@manifests/09_selinux-mount-readiness-configmap-reader-role.yaml` around lines
16 - 25, The ConfigMap Role is too narrowly scoped for the unfiltered informer
used by selinux mount readiness. In
pkg/operator/selinuxmountreadiness/controller.go, the ConfigMap informer created
via InformersFor(...).Core().V1().ConfigMaps().Informer() LIST/WATCHes all
ConfigMaps, so update the RBAC in
manifests/09_selinux-mount-readiness-configmap-reader-role.yaml to keep get
restricted to selinux-conflicts but allow list/watch without resourceNames, or
alternatively change the informer to use a fieldSelector for
metadata.name=selinux-conflicts so the current scoping still works.
| - alert: SELinuxMountGAReadinessWorkloadsDetected | ||
| expr: sum(cluster:selinux_warning_controller_selinux_volume_conflict:count) > 0 | ||
| for: 10m | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: "Workloads incompatible with SELinuxMount GA detected" | ||
| description: | | ||
| Pods with SELinux volume conflicts were detected. These workloads may fail to start | ||
| after upgrade when SELinuxMount becomes GA. Query metric | ||
| selinux_warning_controller_selinux_volume_conflict to list affected pods. | ||
| runbook_url: https://github.com/openshift/enhancements/blob/master/enhancements/storage/selinuxmount-ga-block-upgrade.md |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
runbook_url should point to a published runbook. The other alerts in this file reference github.com/openshift/runbooks/.... This new alert links to an enhancement doc, which won't pass the usual runbook-link expectations and isn't actionable for SREs. Align it with an openshift/runbooks entry (or a KCS article) before merge.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@manifests/12_prometheusrules.yaml` around lines 95 - 106, The SELinuxMount
readiness alert’s runbook link is pointing to an enhancement document instead of
a published operational runbook. Update the
SELinuxMountGAReadinessWorkloadsDetected alert annotation in the Prometheus
rules to use a valid openshift/runbooks URL (or an approved KCS article) that
SREs can act on, matching the pattern used by the other alerts in this file.
There was a problem hiding this comment.
coderabbit is right here, we should create a PR in openshift/runbooks. I created a new story https://redhat.atlassian.net/browse/STOR-3027
There was a problem hiding this comment.
@jsafrane, thanks for confirming — good call. The enhancement doc isn't actionable for SREs, so having a proper runbook entry is the right fix.
Once the runbook PR in openshift/runbooks is merged, the runbook_url here should be updated to point to it (following the pattern of the other alerts in this file, e.g. https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/<AlertName>.md). That can be done in a follow-up PR linked to STOR-3027.
Would you also like me to open a GitHub issue in this repository to track the runbook URL update alongside STOR-3027?
| // KCSArticleURL is linked from Prometheus alerts. Update when the KCS article is published. | ||
| KCSArticleURL = "https://github.com/openshift/enhancements/blob/master/enhancements/storage/selinuxmount-ga-block-upgrade.md" |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Placeholder runbook URL surfaced to users. KCSArticleURL points to an enhancement doc, not a customer runbook, yet it is embedded in upgradeBlockedMessage() (operator condition) and the alert runbook_url. Track replacing this before GA so customers are not directed to a developer enhancement doc.
Want me to open an issue to track publishing the KCS article and swapping in the final URL?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/operator/selinuxmountreadiness/controller.go` around lines 32 - 33,
Replace the placeholder KCSArticleURL with the final customer-facing runbook/KCS
link before GA, since upgradeBlockedMessage() and the alert runbook_url surface
it directly to users. Update the constant in controller.go to point to the
published support article, and verify any references in the
selinuxmountreadiness controller and related alerting code continue to use
KCSArticleURL so the change is centralized.
- Updated the expression for the `SELinuxMountGAReadinessWorkloadsDetected` alert to calculate the number of conflicting pods more accurately by using floor(sum/2). - Enhanced alert annotations to provide clearer information on the number of pods with SELinux conflicts and included a query metric for affected pods.
| @@ -0,0 +1,48 @@ | |||
| apiVersion: rbac.authorization.k8s.io/v1 | |||
There was a problem hiding this comment.
CSO already has cluster-admin in the standalone clusters from here.
IMO this role does not does nothing.
There was a problem hiding this comment.
yes, it does make much sense right now, however there is a TODO item "remove cluster-admin from the operator, then a specific UID won't be needed." in manifests/10_deployment.yaml, so i guess that is the reason why it was added, to indicate that operator needs just read CM selinux-conflicts. But it can be removed for now.
There was a problem hiding this comment.
Since we're going to remove the controller in 5.1, I think this RBAC will be useless much earlier than we finally implement fine-grained RBAC roles.
| resourceNames: | ||
| - selinux-conflicts |
There was a problem hiding this comment.
something tells me that it's not possible to limit list by resourceName and coderabbit is actually right
| - alert: SELinuxMountGAReadinessWorkloadsDetected | ||
| expr: sum(cluster:selinux_warning_controller_selinux_volume_conflict:count) > 0 | ||
| for: 10m | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: "Workloads incompatible with SELinuxMount GA detected" | ||
| description: | | ||
| Pods with SELinux volume conflicts were detected. These workloads may fail to start | ||
| after upgrade when SELinuxMount becomes GA. Query metric | ||
| selinux_warning_controller_selinux_volume_conflict to list affected pods. | ||
| runbook_url: https://github.com/openshift/enhancements/blob/master/enhancements/storage/selinuxmount-ga-block-upgrade.md |
There was a problem hiding this comment.
coderabbit is right here, we should create a PR in openshift/runbooks. I created a new story https://redhat.atlassian.net/browse/STOR-3027
|
|
||
| // SELinuxMountGAReadinessFeatureGate gates the upgrade readiness check until | ||
| // openshift/api exposes features.FeatureGateSELinuxMountGAReadiness. | ||
| SELinuxMountGAReadinessFeatureGate = configv1.FeatureGateName("SELinuxMountGAReadiness") |
There was a problem hiding this comment.
Bump openshift/api once openshift/api#2882 is merged and import the feature gate from there.
There was a problem hiding this comment.
It's merged now (finally!)
| return factory.New().WithSync(c.sync).WithSyncDegradedOnError(clients.OperatorClient).WithInformers( | ||
| clients.OperatorClient.Informer(), | ||
| clients.KubeInformers.InformersFor(csoclients.CloudConfigNamespace).Core().V1().ConfigMaps().Informer(), | ||
| clients.ConfigInformers.Config().V1().FeatureGates().Informer(), |
There was a problem hiding this comment.
The controller watches FeatureGate CR, but actually never reads it when it changes. It always uses the c.featureGates set at the controller creation.
What we usually do is that we read FeatureGates at the operator start and don't start a whole controller if the controller is not enabled.
And when FeatureGates change, we exit() and newly started operator fetches a fresh value. See an example and docs
There was a problem hiding this comment.
I can see CSO already has a featureGateAccessor. So just remove the informer here and it should be OK. The accessor will exit() when the gates change.
| // ConflictsPresent reports whether the selinux-conflicts ConfigMap indicates conflicts. | ||
| // It is exported for unit tests. | ||
| func ConflictsPresent(cm *corev1.ConfigMap) (present bool, found bool) { | ||
| if cm == nil { | ||
| return false, false | ||
| } | ||
| value, ok := cm.Data[selinuxConflictsDataKey] | ||
| if !ok { | ||
| return false, true | ||
| } | ||
| return metav1.ConditionStatus(value) == metav1.ConditionTrue, true | ||
| } |
There was a problem hiding this comment.
Is this function used anywhere?
There was a problem hiding this comment.
Fixed, renamed to conflictsPresentInConfigMap and used in conflictPresent() and also in unit tests.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/selinuxmountreadiness/controller.go (1)
132-148: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRedundant key re-check after
conflictsPresentInConfigMap.conflictsPresentInConfigMapalready encodes the missing-key case (returnsfound=true,present=false); the inlinecm.Data[key]lookup here only duplicates that logic for logging. Consider folding the log into the helper or dropping it. Low priority.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/selinuxmountreadiness/controller.go` around lines 132 - 148, `conflictsPresent` is re-checking the ConfigMap key after `conflictsPresentInConfigMap` has already determined the missing-key case, so the logic is duplicated. Update `conflictsPresent` to rely on the `present`/`found` result from `conflictsPresentInConfigMap` and remove the extra `cm.Data[selinuxConflictsDataKey]` lookup, or move the logging into `conflictsPresentInConfigMap` if you still want that diagnostic message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/selinuxmountreadiness/controller_test.go`:
- Around line 203-210: The test setup in the selinuxmountreadiness controller
should register cleanup for stopCh immediately after it is created so it always
runs, even if cache sync fails. Move the defer close(stopCh) in the test around
RunConfigMapInformer, StartInformers, and WaitForSync so it is set before any
t.Fatal path, ensuring informer goroutines are stopped reliably.
---
Nitpick comments:
In `@pkg/operator/selinuxmountreadiness/controller.go`:
- Around line 132-148: `conflictsPresent` is re-checking the ConfigMap key after
`conflictsPresentInConfigMap` has already determined the missing-key case, so
the logic is duplicated. Update `conflictsPresent` to rely on the
`present`/`found` result from `conflictsPresentInConfigMap` and remove the extra
`cm.Data[selinuxConflictsDataKey]` lookup, or move the logging into
`conflictsPresentInConfigMap` if you still want that diagnostic message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 49c39526-924b-445e-8d9d-98930e2de6c9
📒 Files selected for processing (4)
manifests/09_selinux-mount-readiness-configmap-reader-role.yamlpkg/operator/operator_starter.gopkg/operator/selinuxmountreadiness/controller.gopkg/operator/selinuxmountreadiness/controller_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/operator/operator_starter.go
- manifests/09_selinux-mount-readiness-configmap-reader-role.yaml
|
/test all |
| # Number of conflicts reported by the SELinux warning controller, i.e. pods that may conflict, if they land on the same node. | ||
| # The controller does not emit metrics with other properties than "SELinuxLabel" or "SELinuxChangePolicy". Still, let's filter them to be future proof and keep the cardinality low. | ||
| - expr: sum by(property) (selinux_warning_controller_selinux_volume_conflict{property =~ "SELinuxLabel|SELinuxChangePolicy"}) | ||
| record: cluster:selinux_warning_controller_selinux_volume_conflict:count |
There was a problem hiding this comment.
Please don't remove this rule, it's used to sum the failures and sent via telemetry.
There was a problem hiding this comment.
Ok, the recording rule for telemetry was restored.
|
/test all |
|
/test e2e-azure-file-csi |
|
@rvagner78: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@rvagner78: This PR was included in a payload test run from #715
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7b96e9d0-7156-11f1-9c62-ad103feffb0c-0 |
| return err | ||
| } | ||
|
|
||
| if selinuxmountreadiness.FeatureGateEnabled(ssr.featureGates) { |
There was a problem hiding this comment.
| if selinuxmountreadiness.FeatureGateEnabled(ssr.featureGates) { | |
| if ssr.featureGates.Enabled(the feature gate from openshift/api) { |
And selinuxmountreadiness.FeatureGateEnabled and its unit test becomes useless.
(i would expect both cursor and coderabbit to spot this)
| return err | ||
| } | ||
|
|
||
| if selinuxmountreadiness.FeatureGateEnabled(hsr.featureGates) { |
There was a problem hiding this comment.
same as above, call hsrr.featureGates.Enabled
Vendor openshift/api to pick up FeatureGateSELinuxMountGAReadiness from openshift/api#2882 and gate the SELinux mount readiness controller with featureGates.Enabled() directly, removing the local FeatureGateEnabled wrapper and its unit test. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@rvagner78: This PR was included in a payload test run from #715
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ecea6200-73be-11f1-80a5-d953eec5d313-0 |
|
@rvagner78: This PR was included in a payload test run from #715
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ff343440-7490-11f1-893c-a6a79a083dc4-0 |
|
@jsafrane: This PR was included in a payload test run from #715
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c7f7fb70-752a-11f1-8353-8747cb10a2fe-0 |
|
@rvagner78: This PR was included in a payload test run from #715
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5b0f6ec0-7544-11f1-86d8-78f0905bdb7b-0 |
Summary by CodeRabbit
selinux-conflictsConfigMap.SELinuxMountGAReadinessWorkloadsDetectedalert to warn when readiness failures are detected.