Skip to content

LOG-9424: Make legacy SCC role cleanup non-blocking#3313

Merged
openshift-merge-bot[bot] merged 4 commits into
openshift:release-6.5from
vparfonov:fix-rbac-cache-namespace
Jun 23, 2026
Merged

LOG-9424: Make legacy SCC role cleanup non-blocking#3313
openshift-merge-bot[bot] merged 4 commits into
openshift:release-6.5from
vparfonov:fix-rbac-cache-namespace

Conversation

@vparfonov

Copy link
Copy Markdown
Contributor

Summary

  • Fixes e2e test failures introduced by LOG-9424: Fix API flooding when MultiCLF shares the same serviceAccou… #3308 where the operator fails to reconcile CLFs in newly created namespaces
  • The k8sClient.List call for legacy role cleanup goes through the namespace-scoped cache, which doesn't know about namespaces the operator hasn't indexed yet, causing "unknown namespace for the cache" errors that block the entire reconciliation
  • Extracts cleanup into a best-effort helper that logs errors instead of returning them — the old role will be cleaned up on the next successful cycle

Root Cause

In #3308, ReconcileRBAC was extended with cleanup logic that lists RoleBindings in the namespace before deleting the old {sa}-scc Role. This List goes through the controller-runtime namespace-scoped cache. When a CLF is created in a test namespace (e.g., clo-test-XXXXX) that the cache hasn't indexed yet, the List fails with:

unable to list: clo-test-30319 because of unknown namespace for the cache

This error was returned from ReconcileRBAC, preventing the operator from creating collector DaemonSets/Deployments for the CLF. The operator retried with exponential backoff but never recovered because the cache was never updated.

Test plan

  • Verify auth unit tests pass (go test ./internal/auth/)
  • e2e tests that create CLFs in dynamic namespaces (deployment, security) should pass
  • Legacy role cleanup still works in openshift-logging namespace (cache is warm)

Depends on: #3308

🤖 Generated with Claude Code

jcantrill and others added 2 commits June 9, 2026 10:57
…nt (openshift#3295)

* fix(auth): Change SCC Role naming to prevent conflicts when CLFs share service accounts

When multiple ClusterLogForwarders in the same namespace share the same
spec.serviceAccount, they were creating a shared Role named {sa}-scc.
This caused both CLFs to continuously update the Role's ownerReferences,
alternating between pointing to CLF #1 and CLF #2, flooding the API
server with thousands of update requests.

Changed the Role naming from {sa}-scc to {clfName}-{sa}-scc, making
each CLF's Role unique. The RoleBinding now correctly references this
new Role name format. Both resources keep their owner references for
proper cleanup when a CLF is deleted.

Also removed owner reference from the metadata-reader ClusterRoleBinding
because cluster-scoped resources should not be owned by namespaced
resources, following Kubernetes best practices.

Resolves: LOG-9424

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* feat(reconcile): Add cleanup logic for old SCC Role resources

Add DeleteRole function to enable cleanup of Resources with the old
naming scheme ({sa}-scc) during reconciliation. This ensures smooth
migration from the previous implementation where multiple CLFs sharing
a service account would conflict.

The ReconcileRBAC function now deletes any old Role using the legacy
naming scheme after creating the new one, preventing orphaned resources
during upgrades.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* test(auth): Update tests for new SCC RBAC resource naming scheme

Update existing tests to reflect the new naming scheme where SCC Roles
are named {clfName}-{sa}-scc instead of {sa}-scc. This ensures each
CLF gets a unique Role when sharing a service account.

Add test to verify that metadata-reader ClusterRoleBinding does not
have owner references, ensuring compliance with Kubernetes best
practices (cluster-scoped resources should not be owned by namespaced
resources).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* check role references before deleting

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
The cleanup of old {sa}-scc Role resources introduced in openshift#3308 uses
k8sClient.List which goes through the namespace-scoped cache. When a
CLF is created in a namespace the cache has not yet indexed, the List
fails with "unknown namespace for the cache", blocking the entire
reconciliation and preventing collector workloads from being created.

Extract cleanup into a best-effort helper that logs errors instead of
returning them. The old role will be cleaned up on the next successful
reconciliation cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 18, 2026
@openshift-ci-robot

openshift-ci-robot commented Jun 18, 2026

Copy link
Copy Markdown

@vparfonov: This pull request references LOG-9424 which is a valid jira issue.

Details

In response to this:

Summary

  • Fixes e2e test failures introduced by LOG-9424: Fix API flooding when MultiCLF shares the same serviceAccou… #3308 where the operator fails to reconcile CLFs in newly created namespaces
  • The k8sClient.List call for legacy role cleanup goes through the namespace-scoped cache, which doesn't know about namespaces the operator hasn't indexed yet, causing "unknown namespace for the cache" errors that block the entire reconciliation
  • Extracts cleanup into a best-effort helper that logs errors instead of returning them — the old role will be cleaned up on the next successful cycle

Root Cause

In #3308, ReconcileRBAC was extended with cleanup logic that lists RoleBindings in the namespace before deleting the old {sa}-scc Role. This List goes through the controller-runtime namespace-scoped cache. When a CLF is created in a test namespace (e.g., clo-test-XXXXX) that the cache hasn't indexed yet, the List fails with:

unable to list: clo-test-30319 because of unknown namespace for the cache

This error was returned from ReconcileRBAC, preventing the operator from creating collector DaemonSets/Deployments for the CLF. The operator retried with exponential backoff but never recovered because the cache was never updated.

Test plan

  • Verify auth unit tests pass (go test ./internal/auth/)
  • e2e tests that create CLFs in dynamic namespaces (deployment, security) should pass
  • Legacy role cleanup still works in openshift-logging namespace (cache is warm)

Depends on: #3308

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 834a6f00-5c9b-451c-bc9a-6a396415492a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@vparfonov vparfonov marked this pull request as draft June 18, 2026 14:41
@openshift-ci openshift-ci Bot requested review from Clee2691 and alanconway June 18, 2026 14:41
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 18, 2026
@vparfonov

Copy link
Copy Markdown
Contributor Author

/test e2e-target

@jcantrill

Copy link
Copy Markdown
Contributor

/approve

@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jcantrill, vparfonov

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2026
@vparfonov

Copy link
Copy Markdown
Contributor Author

/test e2e-target

The AtLeastOnce delivery mode uses disk-backed buffers which can delay
initial log delivery. The 3-minute poll timeout was too tight for slow
CI nodes, causing flaky failures. Align with DefaultWaitForLogsTimeout
(5 minutes) used by other e2e tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vparfonov

Copy link
Copy Markdown
Contributor Author

/test e2e-target

@vparfonov vparfonov marked this pull request as ready for review June 19, 2026 06:22
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2026
@openshift-ci openshift-ci Bot requested a review from cahartma June 19, 2026 06:23
@jcantrill

Copy link
Copy Markdown
Contributor

/retest

1 similar comment
@jcantrill

Copy link
Copy Markdown
Contributor

/retest

@jcantrill

Copy link
Copy Markdown
Contributor

/override "Spell check"

@jcantrill

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

@jcantrill: /override requires failed status contexts, check run or a prowjob name to operate on.
The following unknown contexts/checkruns were given:

  • Spell check

Only the following failed contexts/checkruns were expected:

  • CodeRabbit
  • Spell checking
  • ci/prow/ci-index-cluster-logging-operator-bundle
  • ci/prow/e2e-target
  • ci/prow/functional-target
  • ci/prow/images
  • ci/prow/lint
  • ci/prow/unit
  • pull-ci-openshift-cluster-logging-operator-master-ci-index-cluster-logging-operator-bundle
  • pull-ci-openshift-cluster-logging-operator-master-e2e-target
  • pull-ci-openshift-cluster-logging-operator-master-functional-target
  • pull-ci-openshift-cluster-logging-operator-master-images
  • pull-ci-openshift-cluster-logging-operator-master-lint
  • pull-ci-openshift-cluster-logging-operator-master-unit
  • tide

If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.

Details

In response to this:

/override "Spell check"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2026
@vparfonov

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@jcantrill

Copy link
Copy Markdown
Contributor

/retest

@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 22, 2026
@vparfonov

Copy link
Copy Markdown
Contributor Author

/retest-required

@openshift-ci

openshift-ci Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

@vparfonov: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@jcantrill

Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 23, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 2db215f into openshift:release-6.5 Jun 23, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. release/6.5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants