LOG-9424: Make legacy SCC role cleanup non-blocking #3313
…nt (openshift#3295)

* fix(auth): Change SCC Role naming to prevent conflicts when CLFs share service accounts

When multiple ClusterLogForwarders in the same namespace share the same spec.serviceAccount, they were creating a shared Role named {sa}-scc. This caused both CLFs to continuously update the Role's ownerReferences, alternating between pointing to CLF #1 and CLF #2, flooding the API server with thousands of update requests.

Changed the Role naming from {sa}-scc to {clfName}-{sa}-scc, making each CLF's Role unique. The RoleBinding now correctly references this new Role name format. Both resources keep their owner references for proper cleanup when a CLF is deleted.

Also removed owner reference from the metadata-reader ClusterRoleBinding because cluster-scoped resources should not be owned by namespaced resources, following Kubernetes best practices.

Resolves: LOG-9424

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat(reconcile): Add cleanup logic for old SCC Role resources

Add DeleteRole function to enable cleanup of Resources with the old naming scheme ({sa}-scc) during reconciliation. This ensures smooth migration from the previous implementation where multiple CLFs sharing a service account would conflict.

The ReconcileRBAC function now deletes any old Role using the legacy naming scheme after creating the new one, preventing orphaned resources during upgrades.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* test(auth): Update tests for new SCC RBAC resource naming scheme

Update existing tests to reflect the new naming scheme where SCC Roles are named {clfName}-{sa}-scc instead of {sa}-scc. This ensures each CLF gets a unique Role when sharing a service account.

Add test to verify that metadata-reader ClusterRoleBinding does not have owner references, ensuring compliance with Kubernetes best practices (cluster-scoped resources should not be owned by namespaced resources).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* check role references before deleting

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
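For readers following the naming change, here is a minimal sketch of the two name formats described in the commit message. The helper names sccRoleName and legacySCCRoleName are hypothetical and are not taken from the operator source; only the {clfName}-{sa}-scc and {sa}-scc formats come from the text above.

```go
package main

import "fmt"

// sccRoleName builds the per-CLF Role name in the {clfName}-{sa}-scc format.
// Hypothetical helper: the real operator code may build the name differently.
func sccRoleName(clfName, serviceAccount string) string {
	return fmt.Sprintf("%s-%s-scc", clfName, serviceAccount)
}

// legacySCCRoleName builds the old shared Role name in the {sa}-scc format,
// which is the name the cleanup logic looks for during upgrades.
func legacySCCRoleName(serviceAccount string) string {
	return serviceAccount + "-scc"
}
```

With this scheme, two CLFs that share a service account resolve to different Role names, so neither overwrites the other's ownerReferences.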
The cleanup of old {sa}-scc Role resources introduced in openshift#3308 uses
k8sClient.List which goes through the namespace-scoped cache. When a
CLF is created in a namespace the cache has not yet indexed, the List
fails with "unknown namespace for the cache", blocking the entire
reconciliation and preventing collector workloads from being created.
Extract cleanup into a best-effort helper that logs errors instead of
returning them. The old role will be cleaned up on the next successful
reconciliation cycle.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vparfonov: This pull request references LOG-9424 which is a valid jira issue.
/test e2e-target
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jcantrill, vparfonov
/test e2e-target
The AtLeastOnce delivery mode uses disk-backed buffers which can delay initial log delivery. The 3-minute poll timeout was too tight for slow CI nodes, causing flaky failures. Align with DefaultWaitForLogsTimeout (5 minutes) used by other e2e tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/test e2e-target
/retest
/retest
/override "Spell check"
/lgtm
@jcantrill: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.
/retest
/retest
/retest-required
@vparfonov: all tests passed!
/lgtm
2db215f into openshift:release-6.5
Summary
- k8sClient.List call for legacy role cleanup goes through the namespace-scoped cache, which doesn't know about namespaces the operator hasn't indexed yet, causing "unknown namespace for the cache" errors that block the entire reconciliation

Root Cause
In #3308, ReconcileRBAC was extended with cleanup logic that lists RoleBindings in the namespace before deleting the old {sa}-scc Role. This List goes through the controller-runtime namespace-scoped cache. When a CLF is created in a test namespace (e.g., clo-test-XXXXX) that the cache hasn't indexed yet, the List fails with an "unknown namespace for the cache" error.

This error was returned from ReconcileRBAC, preventing the operator from creating collector DaemonSets/Deployments for the CLF. The operator retried with exponential backoff but never recovered because the cache was never updated.

Test plan
- …go test ./internal/auth/)
- …openshift-logging namespace (cache is warm)

Depends on: #3308
🤖 Generated with Claude Code