OCPBUGS-84534: fix concurrent map race in project authorization cache by sanchezl · Pull Request #642 · openshift/openshift-apiserver

sanchezl · 2026-05-07T14:26:56Z

Summary

Fix a fatal concurrent map iteration and map write panic in AuthorizationCache that intermittently crashes openshift-apiserver pods. The race has existed for years and was never successfully fixed.

Dual-path copy-on-write with atomic store swap:

During full cache invalidation (every 15s or on RBAC changes), new stores are built privately by the writer goroutine — mutations happen in place with zero copy overhead. During incremental updates between invalidations, stores are shared with concurrent List() readers — addSubjectsToNamespace and deleteNamespaceFromSubjects use copy-on-write to create new subjectRecord objects with copied namespaces sets, so readers iterate immutable snapshots. All three stores are grouped behind atomic.Pointer[authorizationCacheStores] so the full-rebuild swap is a single atomic operation — readers always see a consistent view.

No-op guards skip the COW copy entirely when the namespace is already present/absent from the subject's set — this eliminates unnecessary allocations when subjects are re-processed with unchanged access (common in incremental updates, especially with duplicate subjects from upstream).

This avoids the lock contention that caused the previous mutex fix (PR #267) to be reverted (PR #326), while also avoiding the O(n²) allocation overhead of unconditional copy-on-write that was identified during code review.

Root Cause

List() (called from HTTP request goroutines via proxy.(*REST).List) reads subjectRecord.namespaces (sets.String = map[string]Empty). Meanwhile, synchronize() (background goroutine) mutates the same maps in place:

addSubjectsToNamespace(): item.namespaces.Insert(namespace)
deleteNamespaceFromSubjects(): delete(subjectRecord.namespaces, namespace)

Go's runtime detects the concurrent map read and write and aborts the process with an unrecoverable fatal error (fatal error: concurrent map iteration and map write) that recover() cannot catch. Additionally, three store pointers (userSubjectRecordStore, groupSubjectRecordStore, reviewRecordStore) were swapped non-atomically during full cache rebuilds, so a concurrent List() could read stores from different generations and silently return wrong results.
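The mixed-generation read described above goes away once all stores sit behind a single pointer. The following is a minimal sketch of that idea, not the code in this PR: the `stores` and `cacheSketch` types are hypothetical, and plain maps stand in for `cache.Store` so the example has no dependencies.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stores is a hypothetical stand-in for authorizationCacheStores. Plain maps
// replace cache.Store here; they are never mutated after being published.
type stores struct {
	users   map[string][]string
	groups  map[string][]string
	reviews map[string]string
}

// cacheSketch keeps every store behind one pointer, so a reader can never
// observe stores that come from two different rebuilds.
type cacheSketch struct {
	current atomic.Pointer[stores]
}

// rebuild populates private stores, then publishes them in one atomic step.
func (c *cacheSketch) rebuild(generation string) {
	next := &stores{
		users:   map[string][]string{"alice": {generation}},
		groups:  map[string][]string{"devs": {generation}},
		reviews: map[string]string{"ns": generation},
	}
	c.current.Store(next)
}

// list loads the pointer once and reads only from that snapshot.
func (c *cacheSketch) list() (userGen, groupGen string) {
	snap := c.current.Load()
	return snap.users["alice"][0], snap.groups["devs"][0]
}

func main() {
	c := &cacheSketch{}
	c.rebuild("gen-1")
	u, g := c.list()
	fmt.Println(u, g)

	c.rebuild("gen-2")
	u, g = c.list()
	fmt.Println(u, g)
}
```

Because `list` dereferences the pointer exactly once, both values it returns always belong to the same rebuild, whichever rebuild that happens to be.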

Fix Strategy

Two races and one performance issue are fixed:

Map race → dual-path COW: When stores are shared with readers (incremental updates), create new subjectRecord with a copied sets.String instead of mutating in place. When stores are private to the writer (full invalidation rebuild), mutate in place for performance. The copyOnWrite bool parameter threads through the call chain to select the appropriate path.
Non-atomic store swap → atomic.Pointer: Group all three stores in authorizationCacheStores behind atomic.Pointer. Full rebuilds populate private stores, then swap atomically. List() snapshots the pointer once at entry.
No-op guards: In the COW path, check Has(namespace) / !Has(namespace) before copying. When a subject already has (or already lacks) access to a namespace, skip the copy entirely. This is critical because an upstream kube bug in AllowedSubjects() returns duplicate subjects (line 124 returns subjects instead of dedupedSubjects), causing each duplicate to trigger a redundant COW copy.
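The interaction between the copyOnWrite flag and the no-op guard can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `record` and `addNamespace` are hypothetical names, and a plain `map[string]struct{}` replaces `sets.String`. The delete path would mirror it, with the guard checking that the namespace is absent.

```go
package main

import "fmt"

// record is a hypothetical stand-in for subjectRecord.
type record struct {
	subject    string
	namespaces map[string]struct{}
}

// addNamespace returns the record to keep in the store and whether a copy was
// made. With copyOnWrite false the caller owns the record exclusively and the
// set is mutated in place.
func addNamespace(r *record, ns string, copyOnWrite bool) (*record, bool) {
	if !copyOnWrite {
		r.namespaces[ns] = struct{}{}
		return r, false
	}
	// No-op guard: the namespace is already present, so nothing is copied.
	if _, ok := r.namespaces[ns]; ok {
		return r, false
	}
	next := &record{
		subject:    r.subject,
		namespaces: make(map[string]struct{}, len(r.namespaces)+1),
	}
	for k := range r.namespaces {
		next.namespaces[k] = struct{}{}
	}
	next.namespaces[ns] = struct{}{}
	return next, true
}

func main() {
	shared := &record{subject: "alice", namespaces: map[string]struct{}{"ns-a": {}}}

	// Shared path: the first add copies and leaves the original untouched.
	updated, copied := addNamespace(shared, "ns-b", true)
	fmt.Println(copied, len(shared.namespaces), len(updated.namespaces))

	// A duplicate subject hits the guard and triggers no second copy.
	again, copiedAgain := addNamespace(updated, "ns-b", true)
	fmt.Println(copiedAgain, again == updated)

	// Private path: in-place mutation, no copy.
	private, copiedPrivate := addNamespace(shared, "ns-c", false)
	fmt.Println(copiedPrivate, private == shared, len(shared.namespaces))
}
```

A reader that still holds `shared` after the first call keeps iterating a set that the shared path never modifies, which is the property the copy-on-write path relies on.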

Alternatives Considered

Three approaches were prototyped and benchmarked (see comment below for full comparison):

Approach	Time / memory (1000 namespaces × 100 users)	Allocs	Branch
Dual-path COW (this PR)	~9ms / 15MB	16K	`bugfix/project-auth-cache-race`
sync.Map	~12ms / 12MB	27K	`bugfix/project-auth-cache-race-syncmap`
Fine-grained RWMutex	~12ms / 11MB	27K	`bugfix/project-auth-cache-race-rwmutex`

All three are viable. Dual-path COW was chosen because it is the fastest, fully lock-free on the read path, and carries no risk of reintroducing the lock contention that caused the PR #326 revert.

Fix History

PR #267 (projects: add rw mutex to auth cache, Jan 2022): Added sync.RWMutex to synchronize access
PR #326 (OCPBUGS-2803: Revert "projects: add rw mutex to auth cache", Oct 2022): Reverted the mutex because clusters with high namespace/RBAC counts had multi-minute sync times, blocking all List() requests (goroutine dumps showed 3-4 minute waits on RLock)
PR #547 (OCPBUGS-57474: ensure cache invalidation after a time, Sep 2025): Added timer-based cache invalidation every 15s, but no locking, so the race remained
PR #530 (WIP: OCPBUGS-57474: Authorization Cache V2): Full rewrite, abandoned
This PR: Dual-path COW + atomic pointer + no-op guards; List() never blocks, no O(n²) regression

Related Issues

OCPBUGS-65926: Same crash on 4.14.26, incorrectly closed as a duplicate of OCPBUGS-56594 (a different bug in kube-apiserver audit log serialization)
OCPBUGS-58029: Clone of OCPBUGS-65926
OCPBUGS-56594: Different bug (kube-apiserver audit race), fixed by kubernetes/kubernetes#129472 (Fix API server crash on concurrent map iteration and write)
Upstream kube bug: subject_locator.go:124 returns subjects instead of dedupedSubjects, causing duplicate subjects to flow into addSubjectsToNamespace. Mitigated by no-op guards in this PR; upstream fix tracked separately.

QA Validation

Test 1: Race condition is fixed

Deploy a build with the fix
Run the cluster under load with concurrent project list requests
Verify no panics in openshift-apiserver logs and no pod restarts with exitCode 2

Test 2: No regression on large clusters (the PR #267 revert scenario)

Provision a cluster with 2000+ namespaces and substantial RBAC
Measure oc get projects latency before and after — should not regress
Key property: List() should never block waiting on the sync goroutine
Monitor openshift-apiserver memory usage for unexpected growth

Test 3: Cache freshness

Grant/revoke a user's access to a namespace
Verify the change is reflected in oc projects output within ~15 seconds

Verification

/verified by "TestAuthorizationCacheRace"

Summary by CodeRabbit

Refactor
- Redesigned the authorization cache to provide consistent snapshots for readers, use atomic swaps for store replacements, and adopt copy-on-write updates so incremental syncs safely support concurrent reads.
Tests
- Added a concurrency race test exercising concurrent readers/writers against the cache.
- Added benchmarks for full cache invalidation, incremental copy-on-write behavior, and subject-update performance across namespace counts.
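For readers unfamiliar with this style of test, the general shape of a reader/writer stress run is sketched below. It is not `TestAuthorizationCacheRace` itself: `snapshot` and `stress` are hypothetical names, and the cache is reduced to a single copy-on-write set behind an `atomic.Pointer`. Running a program like this with `go run -race` is what would surface an unsynchronized map access.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// snapshot is a hypothetical immutable namespace set published by the writer.
type snapshot struct {
	namespaces map[string]struct{}
}

// stress runs one writer that publishes copied snapshots and several readers
// that iterate whatever snapshot is current. It returns the final set size.
func stress(readers, writes int) int {
	var current atomic.Pointer[snapshot]
	current.Store(&snapshot{namespaces: map[string]struct{}{}})

	done := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-done:
					return
				default:
				}
				// Iterating a published snapshot is safe: it is never mutated.
				for range current.Load().namespaces {
				}
			}
		}()
	}

	// Writer: copy the current set, add one entry, publish the copy.
	for i := 0; i < writes; i++ {
		old := current.Load()
		next := &snapshot{namespaces: make(map[string]struct{}, len(old.namespaces)+1)}
		for k := range old.namespaces {
			next.namespaces[k] = struct{}{}
		}
		next.namespaces[fmt.Sprintf("ns-%d", i)] = struct{}{}
		current.Store(next)
	}

	close(done)
	wg.Wait()
	return len(current.Load().namespaces)
}

func main() {
	fmt.Println(stress(4, 500))
}
```

If the writer instead inserted into the map that readers are iterating, the same harness would be expected to trip the race detector or the runtime's concurrent map check.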

openshift-ci-robot · 2026-05-07T14:27:06Z

@sanchezl: This pull request references Jira Issue OCPBUGS-84534, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

Fix a fatal concurrent map iteration and map write panic in AuthorizationCache.List() that intermittently crashes openshift-apiserver pods. The race has existed for years and was never successfully fixed.

Commit 1 — Copy-on-write subjectRecords: addSubjectsToNamespace and deleteNamespaceFromSubjects now create new subjectRecord objects with copied namespaces sets instead of mutating the underlying map in place. Any concurrent List() holding the old record iterates an immutable snapshot. cache.Store is internally thread-safe, so the replacement is safe without locking — List() never blocks.

Commit 2 — Atomic store pointer swap: During full cache invalidation, three store pointers were swapped non-atomically. A concurrent List() could read stores from different points in time. Wraps all three stores in a struct behind atomic.Pointer so they swap as a single unit.

Root Cause

List() (called from HTTP request goroutines via proxy.(*REST).List) reads subjectRecord.namespaces (sets.String = map[string]Empty). Meanwhile, synchronize() (background goroutine) mutates the same maps in place:

addSubjectsToNamespace(): item.namespaces.Insert(namespace)

deleteNamespaceFromSubjects(): delete(subjectRecord.namespaces, namespace)

Go's runtime detects the concurrent map read+write and kills the process.

Fix History

PR projects: add rw mutex to auth cache #267 (Jan 2022): Added sync.RWMutex to synchronize access

PR OCPBUGS-2803: Revert "projects: add rw mutex to auth cache" #326 (Oct 2022): Reverted the mutex — clusters with high namespace/RBAC counts had multi-minute sync times, blocking all List() requests (goroutine dumps showed 3-4 minute waits on RLock)

PR OCPBUGS-57474: ensure cache invalidation after a time #547 (Sep 2025): Timer-based cache invalidation every 15s for OCPBUGS-57474, but no locking — the race remained

PR WIP: OCPBUGS-57474: Authorization Cache V2 #530: "Authorization Cache V2" full rewrite — abandoned

This fix avoids locks entirely via copy-on-write. List() never blocks, regardless of how long synchronize() takes.

Related Issues

OCPBUGS-65926 — Same crash on 4.14.26, incorrectly closed as duplicate of OCPBUGS-56594 (a different bug in kube-apiserver audit log serialization)

OCPBUGS-58029 — Clone of 65926

OCPBUGS-56594 — Different bug (kube-apiserver audit race), fixed by Fix API server crash on concurrent map iteration and write kubernetes/kubernetes#129472

QA Validation

Test 1: Race condition is fixed

Deploy a build with the fix

Run the cluster under load with concurrent project list requests

Verify no panics in openshift-apiserver logs and no pod restarts with exitCode 2

Test 2: No regression on large clusters

Provision a cluster with 2000+ namespaces and substantial RBAC

Measure oc get projects latency before and after — should not regress

Monitor openshift-apiserver memory usage for unexpected growth

Test 3: Cache freshness

Grant/revoke a user's access to a namespace

Verify reflected in oc projects output within ~15 seconds

Verification

/verified by "TestAuthorizationCacheRace"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-05-07T14:27:21Z

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 7d02e79 and 2de9d35.

📒 Files selected for processing (2)

pkg/project/auth/cache.go
pkg/project/auth/cache_test.go

Walkthrough

AuthorizationCache now stores per-type caches in a single atomically-swapped authorizationCacheStores pointer, uses copy-on-write for subject updates during incremental syncs, atomically replaces stores on full invalidation, snapshots stores for consistent List reads, and adds a concurrency race test plus benchmarks.

Changes

Cache Atomicity & Copy-on-Write Sync

Layer / File(s)	Summary
Grouped stores + struct change `pkg/project/auth/cache.go` (`authorizationCacheStores`, `AuthorizationCache` struct)	Replaced three separate `cache.Store` fields with one `stores atomic.Pointer[authorizationCacheStores]` field and moved the store instances into the `authorizationCacheStores` aggregate.
Constructor wiring `pkg/project/auth/cache.go` (`NewAuthorizationCache`)	Constructor builds the three underlying stores, packs them into an `authorizationCacheStores` instance, and stores the pointer via `ac.stores.Store(...)`.
synchronize control flow and snapshotting `pkg/project/auth/cache.go` (`synchronize`, `syncHandler`, related sync functions)	`synchronize()` loads a consistent snapshot from `ac.stores.Load()`; incremental syncs run with `copyOnWrite = true`; full invalidation constructs fresh stores and atomically swaps the pointer at the end. The `copyOnWrite` flag is threaded through the sync pipeline.
Namespace/subject mutation: copy-on-write `pkg/project/auth/cache.go` (mutation helpers: `deleteNamespaceFromSubjects`, `addSubjectsToNamespace`, subjectRecord handling)	Subject mutation helpers now clone `subjectRecord` values and their `sets.String` namespace sets when `copyOnWrite` is true instead of mutating/deleting namespace entries in-place.
Read path snapshotting `pkg/project/auth/cache.go` (`List`)	`List()` snapshots `ac.stores.Load()` once per call and reads from that snapshot for a consistent view during the call.
Sync flow updates and purge handling `pkg/project/auth/cache.go` (namespace, policy, role-binding, purge sync functions)	Namespace, policy, role-binding, and purge synchronization functions accept and propagate the `copyOnWrite` boolean into subject mutation helpers.
Tests & Benchmarks `pkg/project/auth/cache_test.go` (imports, `TestAuthorizationCacheRace`, `BenchmarkFullCacheInvalidation`, `BenchmarkAddSubjectsToNamespace`, `fakeVersioner`)	Added `TestAuthorizationCacheRace` (concurrent writer/readers exercising `synchronize()` and `List()`), `BenchmarkFullCacheInvalidation`, `BenchmarkIncrementalSyncDuplicateSubjects`, and `BenchmarkAddSubjectsToNamespace`; added `sync` import and supporting test helpers.

sequenceDiagram
    participant Sync as Synchronizer
    participant IDX as Namespace/RBAC Indexers
    participant Stores as authorizationCacheStores (snapshot)
    participant ac as AuthorizationCache (stores pointer)
    participant Reader as Reader (List)

    rect rgba(100,149,237,0.5)
    Sync->>Stores: Load snapshot via ac.stores.Load()
    end

    rect rgba(34,139,34,0.5)
    Sync->>IDX: Read namespaces/policies/rolebindings
    IDX-->>Sync: Items
    Sync->>Stores: Clone subjectRecords (copy-on-write) and update copies
    Sync->>Stores: Or build fresh stores for full invalidation
    Sync->>ac: Atomic swap of new authorizationCacheStores pointer
    end

    rect rgba(255,165,0,0.5)
    Reader->>ac: Load snapshot once via ac.stores.Load()
    Reader->>Stores: Read user/group subject records from snapshot
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately identifies the critical issue being fixed: a concurrent map race in the project authorization cache that causes openshift-apiserver crashes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	No Ginkgo tests in modified code; PR contains only standard Go tests (TestXXX/BenchmarkXXX functions) with stable, deterministic names.
Test Structure And Quality	✅ Passed	Tests added use standard Go testing package, not Ginkgo. Custom check is for Ginkgo tests only, making it inapplicable to this PR.
Microshift Test Compatibility	✅ Passed	PR adds standard Go unit tests and benchmarks, not Ginkgo e2e tests. MicroShift check only applies to Ginkgo e2e tests with It(), Describe(), Context(), etc.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No Ginkgo e2e tests were added. All test additions are standard Go unit tests and benchmarks in pkg/project/auth/cache_test.go, not SNO-applicable e2e tests.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies only internal authorization cache logic (pkg/project/auth/cache.go/cache_test.go) with no deployment manifests, scheduling constraints, or topology-dependent configurations.
Ote Binary Stdout Contract	✅ Passed	PR changes pkg/project/auth/cache.go, which is not part of OTE binary infrastructure. OTE binary doesn't import or use this auth cache code, so the stdout contract check doesn't apply.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests added. PR adds only standard Go unit tests and benchmarks to pkg/project/auth/cache_test.go using testing.T/testing.B, not Ginkgo test framework. Check not applicable.

sanchezl · 2026-05-07T14:27:32Z

/verified by "TestAuthorizationCacheRace"

openshift-ci · 2026-05-07T14:27:32Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/project/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-05-07T14:27:46Z

@sanchezl: This PR has been marked as verified by "TestAuthorizationCacheRace".

openshift-ci-robot · 2026-05-07T14:32:43Z

@sanchezl: This pull request references Jira Issue OCPBUGS-84534, which is invalid:

expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/project/auth/cache.go (1)

451-471: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move lastCacheInvalidation to after the atomic store swap.

On the full-rebuild path, Line 453 resets the expiry timer before the rebuilt stores are visible. List() keeps serving the old snapshot until Line 467, so a slow rebuild can leave stale data live longer than maxCacheLifespan, and the next expiry window starts too early.

Suggested fix

 	invalidateCache := ac.invalidateCache(expired)
 	if invalidateCache {
-		ac.lastCacheInvalidation = ac.clock.Now()
 		userSubjectRecordStore = cache.NewStore(subjectRecordKeyFn)
 		groupSubjectRecordStore = cache.NewStore(subjectRecordKeyFn)
 		reviewRecordStore = cache.NewStore(reviewRecordKeyFn)
 	}
@@
 	if invalidateCache {
 		ac.stores.Store(&authorizationCacheStores{
 			userSubjectRecordStore:  userSubjectRecordStore,
 			groupSubjectRecordStore: groupSubjectRecordStore,
 			reviewRecordStore:       reviewRecordStore,
 		})
+		ac.lastCacheInvalidation = ac.clock.Now()
 	}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/project/auth/cache.go` around lines 451 - 471, The cache expiry timestamp
ac.lastCacheInvalidation is being set when invalidateCache is true before
swapping in the rebuilt stores, which can extend stale-serving time; move the
assignment of ac.lastCacheInvalidation to after the atomic swap (the
ac.stores.Store call that installs the new authorizationCacheStores with
userSubjectRecordStore, groupSubjectRecordStore, reviewRecordStore) so the
expiry timer starts only once the new stores are visible to List()/readers.

📥 Commits

Reviewing files that changed from the base of the PR and between 999dd5a and 10ef6dd.

📒 Files selected for processing (2)

pkg/project/auth/cache.go
pkg/project/auth/cache_test.go

sanchezl · 2026-05-08T16:59:26Z

/retest-required

sanchezl · 2026-05-11T17:04:51Z

/jira refresh

openshift-ci-robot · 2026-05-11T17:04:59Z

@sanchezl: This pull request references Jira Issue OCPBUGS-84534, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

benluddy · 2026-05-12T17:06:07Z

-			if len(subjectRecord.namespaces) == 0 {
-				subjectRecordStore.Delete(subjectRecord)
+			old := obj.(*subjectRecord)
+			newNamespaces := sets.NewString(old.namespaces.UnsortedList()...)


I'm concerned about this from a performance perspective -- especially allocations -- because this will run something like O(N*U) times on each invalidation (N=namespaces and U=users).

See #642 (comment)
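To get a rough feel for the scale of this concern, the sketch below counts how many set entries would be copied if every insert first cloned the subject's existing namespace set. The function name `copiedElements` is hypothetical, and the model assumes, purely for illustration, that every user has access to every namespace; it is a worst-case count, not a measurement of this PR.

```go
package main

import "fmt"

// copiedElements counts map entries copied when every insert first clones the
// subject's existing namespace set (unconditional copy-on-write). It assumes
// every user can access every namespace.
func copiedElements(namespaces, users int) int {
	total := 0
	for u := 0; u < users; u++ {
		size := 0 // current size of this user's namespace set
		for n := 0; n < namespaces; n++ {
			total += size // cloning copies the entries already present
			size++
		}
	}
	return total
}

func main() {
	// 1000 namespaces and 100 users: 100,000 clone operations in total.
	fmt.Println(copiedElements(1000, 100)) // prints 49950000
}
```

Under that assumption the number of clone operations is N×U, while the number of entries copied grows as U×N²/2, so the work grows much faster than the namespace count.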

openshift-ci-robot · 2026-05-12T18:52:04Z

@sanchezl: This pull request references Jira Issue OCPBUGS-84534, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

sanchezl · 2026-05-12T19:34:22Z

Alternative Approaches Evaluated

Three approaches were prototyped independently and benchmarked using BenchmarkFullCacheInvalidation (added by @benluddy in commit 2). Each approach fixes both the concurrent map iteration and map write panic and the non-atomic store pointer swap.

1. Dual-path COW (this PR)

Branch: bugfix/project-auth-cache-race

During full cache invalidation, new stores are private to the writer — mutations happen in place with zero copy overhead. During incremental updates, stores are shared with readers — copy-on-write creates new subjectRecord objects with copied sets.String. A copyOnWrite bool parameter threads through the call chain. All three stores are grouped behind atomic.Pointer[authorizationCacheStores] for atomic swap.

2. sync.Map

Branch: bugfix/project-auth-cache-race-syncmap

Replace sets.String (backed by map[string]sets.Empty) with sync.Map for the subjectRecord.namespaces field. sync.Map provides built-in concurrent read/write safety without external locking. List() uses Range() to iterate, addSubjectsToNamespace uses Store(), deleteNamespaceFromSubjects uses Delete().
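A minimal sketch of what this alternative could look like, assuming a hypothetical `syncRecord` type rather than the code on the branch:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// syncRecord is a hypothetical subjectRecord whose namespace set is a sync.Map.
type syncRecord struct {
	namespaces sync.Map // key: namespace name, value: struct{}
}

func (r *syncRecord) add(ns string)    { r.namespaces.Store(ns, struct{}{}) }
func (r *syncRecord) remove(ns string) { r.namespaces.Delete(ns) }

// list collects the namespaces with Range, which is safe to call while other
// goroutines call Store and Delete.
func (r *syncRecord) list() []string {
	var out []string
	r.namespaces.Range(func(key, _ any) bool {
		out = append(out, key.(string))
		return true
	})
	sort.Strings(out)
	return out
}

func main() {
	r := &syncRecord{}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			r.add(fmt.Sprintf("ns-%d", i))
			_ = r.list() // concurrent iteration does not crash
		}(i)
	}
	wg.Wait()
	r.remove("ns-0")
	fmt.Println(len(r.list()))
}
```

One tradeoff worth noting: `Range` does not promise a consistent snapshot of the map, so a reader may or may not see an entry that is stored or deleted while it iterates.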

3. Fine-grained RWMutex

Branch: bugfix/project-auth-cache-race-rwmutex

Add a per-record sync.RWMutex to subjectRecord. Lock only individual map mutations (nanosecond-scale), not the entire synchronize() call (which can take minutes on large clusters). This is fundamentally different from the reverted PR #267, which held a single coarse-grained write lock across all of synchronize().
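As an illustration of the per-record locking idea, the sketch below guards only the individual map operations. The method names are hypothetical and the record type is simplified; it is meant to show the lock scope, not the branch's actual code.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// subjectRecord is a simplified stand-in with its own lock, so a writer
// holds the lock for one map operation rather than a whole sync pass.
type subjectRecord struct {
	mu         sync.RWMutex
	subject    string
	namespaces map[string]struct{}
}

func (r *subjectRecord) addNamespace(ns string) {
	r.mu.Lock()
	r.namespaces[ns] = struct{}{}
	r.mu.Unlock()
}

func (r *subjectRecord) deleteNamespace(ns string) {
	r.mu.Lock()
	delete(r.namespaces, ns)
	r.mu.Unlock()
}

// list copies the keys under a read lock and sorts outside the lock,
// keeping the critical section as short as possible.
func (r *subjectRecord) list() []string {
	r.mu.RLock()
	out := make([]string, 0, len(r.namespaces))
	for ns := range r.namespaces {
		out = append(out, ns)
	}
	r.mu.RUnlock()
	sort.Strings(out)
	return out
}

func main() {
	rec := &subjectRecord{subject: "alice", namespaces: map[string]struct{}{}}
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			rec.addNamespace(fmt.Sprintf("ns-%d", i))
		}
	}()
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				_ = rec.list()
			}
		}()
	}

	wg.Wait()
	fmt.Println(len(rec.list()))
}
```

Readers can still wait briefly on a writer here, but only for the duration of a single insert or delete on one record.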

Benchmark Results

BenchmarkFullCacheInvalidation at 1000 namespaces × 100 users (the critical scale point):

Approach	Time	Memory	Allocs
Dual-path COW	~9 ms	~15 MB	~16K
sync.Map	~12 ms	~11 MB	~27K
Fine-grained RWMutex	~12 ms	~11 MB	~27K
Original COW-only (before optimization)	~941 ms	~3 GB	~2M

The three optimized approaches are within roughly 30% of each other on time (~9 ms vs ~12 ms), and all are about two orders of magnitude faster than the original unconditional COW approach, whose O(N×U) copying cost @benluddy identified as catastrophic.

Why Dual-path COW?

Fastest at the critical scale point (~9ms vs ~12ms)
Fully lock-free on the read path: List() never blocks and never contends. This eliminates any risk of reintroducing the lock contention that led to PR #267 being reverted by PR #326 (OCPBUGS-2803: Revert "projects: add rw mutex to auth cache"), where RLock waits reached 3-4 minutes
No new concurrency primitives in hot paths — the sync.Map approach introduces Range() iteration overhead per List() call; the RWMutex approach introduces lock/unlock per record mutation
Atomic pointer swap ensures readers see a consistent snapshot of all three stores

The sync.Map and fine-grained RWMutex approaches are both viable alternatives if the team prefers a different tradeoff. The branches are available for review.

sanchezl · 2026-05-12T21:21:41Z

/retest-required

sanchezl · 2026-05-12T21:51:51Z

/retest ci/prow/e2e-aws-ovn ci/prow/e2e-aws-ovn-serial-1of2 ci/prow/e2e-aws-ovn-serial-2of2

sanchezl · 2026-05-12T21:51:56Z

/test e2e-aws-ovn

sanchezl · 2026-05-12T21:52:00Z

/test e2e-aws-ovn-serial-1of2

sanchezl · 2026-05-12T21:52:02Z

/test e2e-aws-ovn-serial-2of2

sanchezl · 2026-05-12T23:11:44Z

/retest-required

sanchezl · 2026-05-13T00:12:13Z

/retest ci/prow/e2e-aws-ovn

sanchezl · 2026-05-13T00:12:14Z

/retest ci/prow/e2e-aws-ovn-upgrade

sanchezl · 2026-05-13T00:21:49Z

/test e2e-aws-ovn

sanchezl · 2026-05-13T00:21:50Z

/test e2e-aws-ovn-upgrade

sanchezl · 2026-05-13T13:41:46Z

/test e2e-aws-ovn

sanchezl · 2026-05-13T16:30:39Z

@coderabbitai resume

coderabbitai · 2026-05-13T16:30:47Z

✅ Actions performed

Reviews resumed.

sanchezl · 2026-05-13T18:31:51Z

/verified

openshift-ci-robot · 2026-05-13T18:32:39Z

@sanchezl: The /verified command must be used with one of the following actions: by, later, remove, or bypass. See https://docs.ci.openshift.org/docs/architecture/jira/#premerge-verification for more information.

Details

In response to this:

/verified


sanchezl · 2026-05-13T18:41:44Z

/verified by "TestAuthorizationCacheRace"

openshift-ci-robot · 2026-05-13T18:41:56Z

@sanchezl: This PR has been marked as verified by "TestAuthorizationCacheRace".

Details

In response to this:

/verified by "TestAuthorizationCacheRace"


addSubjectsToNamespace and deleteNamespaceFromSubjects mutate subjectRecord.namespaces (a sets.String / map) in place while List() iterates the same map from HTTP request goroutines. This causes a fatal "concurrent map iteration and map write" panic that crashes openshift-apiserver pods intermittently.

Use a dual-path strategy: during full cache invalidation, where new stores are private to the writer goroutine, mutate in place for zero-copy performance. During incremental updates, where stores are shared with concurrent readers, use copy-on-write to create new subjectRecord objects with copied namespaces sets.

Group all three cache stores behind an atomic.Pointer so they swap as a single unit during full invalidation, ensuring readers see a consistent view.

This avoids the lock contention that caused the previous mutex fix (PR openshift#267) to be reverted (PR openshift#326), while also avoiding the O(n²) allocation overhead of unconditional copy-on-write.

Benchmark synchronize() at varying namespace × user scales (10/10 through 1000/1000) with the cache always expired, forcing full invalidation on every iteration.

Benchmark the incremental COW path with varying levels of subject duplication (D=1 for no duplicates, D=10 for 10x duplicates per user). This exercises the scenario where broken upstream dedup in AllowedSubjects causes redundant COW copies in addSubjectsToNamespace during incremental cache updates.

openshift-ci-robot · 2026-05-21T14:36:29Z

@sanchezl: This pull request references Jira Issue OCPBUGS-84534, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary

Fix a fatal concurrent map iteration and map write panic in AuthorizationCache that intermittently crashes openshift-apiserver pods. The race has existed for years and was never successfully fixed.

Dual-path copy-on-write with atomic store swap:

During full cache invalidation (every 15s or on RBAC changes), new stores are built privately by the writer goroutine — mutations happen in place with zero copy overhead. During incremental updates between invalidations, stores are shared with concurrent List() readers — addSubjectsToNamespace and deleteNamespaceFromSubjects use copy-on-write to create new subjectRecord objects with copied namespaces sets, so readers iterate immutable snapshots. All three stores are grouped behind atomic.Pointer[authorizationCacheStores] so the full-rebuild swap is a single atomic operation — readers always see a consistent view.

No-op guards skip the COW copy entirely when the namespace is already present/absent from the subject's set — this eliminates unnecessary allocations when subjects are re-processed with unchanged access (common in incremental updates, especially with duplicate subjects from upstream).

This avoids the lock contention that caused the previous mutex fix (PR #267) to be reverted (PR #326), while also avoiding the O(n²) allocation overhead of unconditional copy-on-write that was identified during code review.

Root Cause

List() (called from HTTP request goroutines via proxy.(*REST).List) reads subjectRecord.namespaces (sets.String = map[string]Empty). Meanwhile, synchronize() (background goroutine) mutates the same maps in place:

addSubjectsToNamespace(): item.namespaces.Insert(namespace)

deleteNamespaceFromSubjects(): delete(subjectRecord.namespaces, namespace)

Go's runtime detects the concurrent map read+write and kills the process with a fatal panic. Additionally, three store pointers (userSubjectRecordStore, groupSubjectRecordStore, reviewRecordStore) were swapped non-atomically during full cache rebuilds — a concurrent List() could read stores from different generations, returning silently wrong results.

Fix Strategy

Two races and one performance issue are fixed:

Map race → dual-path COW: When stores are shared with readers (incremental updates), create new subjectRecord with a copied sets.String instead of mutating in place. When stores are private to the writer (full invalidation rebuild), mutate in place for performance. The copyOnWrite bool parameter threads through the call chain to select the appropriate path.

Non-atomic store swap → atomic.Pointer: Group all three stores in authorizationCacheStores behind atomic.Pointer. Full rebuilds populate private stores, then swap atomically. List() snapshots the pointer once at entry.

No-op guards: In the COW path, check Has(namespace) / !Has(namespace) before copying. When a subject already has (or already lacks) access to a namespace, skip the copy entirely. This is critical because an upstream kube bug in AllowedSubjects() returns duplicate subjects (line 124 returns subjects instead of dedupedSubjects), causing each duplicate to trigger a redundant COW copy.
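To illustrate why grouping the stores behind one pointer gives readers a consistent view, here is a small sketch. The `generation` field and the `rebuild` and `consistent` helpers are hypothetical and exist only to make a mixed-generation read observable; the real stores hold subject and review records.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// store is a placeholder for one cache store; generation records which
// rebuild produced it (an illustration aid, not a real field).
type store struct{ generation int }

// cacheStores groups the three stores so they are replaced as one unit.
type cacheStores struct {
	users, groups, reviews *store
}

type authCache struct {
	stores atomic.Pointer[cacheStores]
}

// rebuild fills a private cacheStores value and publishes it with one
// atomic store, so no reader can see a mix of old and new stores.
func (c *authCache) rebuild(gen int) {
	next := &cacheStores{
		users:   &store{generation: gen},
		groups:  &store{generation: gen},
		reviews: &store{generation: gen},
	}
	c.stores.Store(next)
}

// consistent loads the pointer once, as a reader would at entry, and
// checks that all three stores come from the same rebuild.
func (c *authCache) consistent() bool {
	s := c.stores.Load()
	return s.users.generation == s.groups.generation &&
		s.groups.generation == s.reviews.generation
}

func main() {
	c := &authCache{}
	c.rebuild(0)

	var torn atomic.Int64
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		for gen := 1; gen <= 1000; gen++ {
			c.rebuild(gen)
		}
	}()
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				if !c.consistent() {
					torn.Add(1)
				}
			}
		}()
	}

	wg.Wait()
	fmt.Println("mixed-generation reads:", torn.Load())
}
```

With three separately assigned fields, a reader could load one store before a rebuild and another after it; loading a single pointer rules that out.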

Alternatives Considered

Three approaches were prototyped and benchmarked (see comment below for full comparison):

Approach	Time / Memory (1000 namespaces × 100 users)	Allocs	Branch
Dual-path COW (this PR)	~9 ms / 15 MB	16K	bugfix/project-auth-cache-race
sync.Map	~12 ms / 12 MB	27K	bugfix/project-auth-cache-race-syncmap
Fine-grained RWMutex	~12 ms / 11 MB	27K	bugfix/project-auth-cache-race-rwmutex

All three are viable. Dual-path COW was chosen because it is the fastest, is fully lock-free on the read path, and carries no risk of reintroducing the lock contention that led to PR #267 being reverted by PR #326.

Fix History

PR #267 (projects: add rw mutex to auth cache, Jan 2022): Added sync.RWMutex to synchronize access

PR #326 (OCPBUGS-2803: Revert "projects: add rw mutex to auth cache", Oct 2022): Reverted the mutex because clusters with high namespace/RBAC counts had multi-minute sync times that blocked all List() requests (goroutine dumps showed 3-4 minute waits on RLock)

PR #547 (OCPBUGS-57474: ensure cache invalidation after a time, Sep 2025): Added timer-based cache invalidation every 15s for OCPBUGS-57474, but no locking, so the race remained

PR #530 (WIP: OCPBUGS-57474: Authorization Cache V2): Full rewrite of the cache, abandoned

This PR: Dual-path COW + atomic pointer + no-op guards. List() never blocks, and there is no O(n²) regression

Related Issues

OCPBUGS-65926: Same crash on 4.14.26, incorrectly closed as a duplicate of OCPBUGS-56594 (a different bug in kube-apiserver audit log serialization)

OCPBUGS-58029: Clone of OCPBUGS-65926

OCPBUGS-56594: Different bug (kube-apiserver audit race), fixed by kubernetes/kubernetes#129472 ("Fix API server crash on concurrent map iteration and write")

Upstream kube bug: subject_locator.go:124 returns subjects instead of dedupedSubjects, causing duplicate subjects to flow into addSubjectsToNamespace. Mitigated by no-op guards in this PR; upstream fix tracked separately.

QA Validation

Test 1: Race condition is fixed

Deploy a build with the fix

Run the cluster under load with concurrent project list requests

Verify no panics in openshift-apiserver logs and no pod restarts with exitCode 2

Test 2: No regression on large clusters (the PR #267 revert scenario)

Provision a cluster with 2000+ namespaces and substantial RBAC

Measure oc get projects latency before and after — should not regress

Key property: List() should never block waiting on the sync goroutine

Monitor openshift-apiserver memory usage for unexpected growth

Test 3: Cache freshness

Grant/revoke a user's access to a namespace

Verify reflected in oc projects output within ~15 seconds

Verification

/verified by "TestAuthorizationCacheRace"


sanchezl · 2026-05-21T14:36:42Z

No-op Guard Optimization + Upstream Kube Bug

Following @benluddy's review, this push adds two improvements:

1. No-op guards in COW path

addSubjectsToNamespace and deleteNamespaceFromSubjects now check Has(namespace) / !Has(namespace) before copying. When a subject's access hasn't changed (the common case in incremental updates), the COW copy is skipped entirely.

2. Upstream kube bug identified

subject_locator.go:124 builds a dedupedSubjects slice but returns the original subjects — discarding the dedup work. Duplicate subjects flow through RBACSubjectsToUsersAndGroups (no dedup) into addSubjectsToNamespace, causing redundant COW copies. The no-op guards mitigate this; an upstream fix is tracked separately.

Benchmark: `BenchmarkIncrementalSyncDuplicateSubjects`

Exercises the incremental COW path with duplicate subjects (D=1 = no dupes, D=10 = 10x dupes per user):

Before (without no-op guards):

Scale	Time	Memory	Allocs
N=1000, U=100, D=1	2.05s	7.1 GB	824K
N=1000, U=100, D=10	20.5s	71.1 GB	8.0M

After (with no-op guards):

Scale	Time	Memory	Allocs
N=1000, U=100, D=1	130ms	209 KB	1K
N=1000, U=100, D=10	135ms	215 KB	1K

The no-op guards eliminate ~99.9% of allocations in the incremental path at D=1 (824K to 1K) and ~99.99% at D=10 (8.0M to 1K). Duplicates no longer matter because the guard catches the no-op before any copying occurs.
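A minimal sketch of the guard idea, assuming a simplified record type and hypothetical helpers `withNamespace` and `withoutNamespace` that return the record to publish plus a flag saying whether a copy was made. The PR's functions operate on stores and subject lists rather than single records.

```go
package main

import "fmt"

// subjectRecord is a simplified stand-in for the cache's record type.
type subjectRecord struct {
	subject    string
	namespaces map[string]struct{}
}

// withNamespace returns a record that contains namespace. If the namespace
// is already present, the guard returns the input record and copies nothing.
func withNamespace(rec *subjectRecord, namespace string) (*subjectRecord, bool) {
	if _, ok := rec.namespaces[namespace]; ok {
		return rec, false // no-op guard: nothing to add
	}
	next := make(map[string]struct{}, len(rec.namespaces)+1)
	for ns := range rec.namespaces {
		next[ns] = struct{}{}
	}
	next[namespace] = struct{}{}
	return &subjectRecord{subject: rec.subject, namespaces: next}, true
}

// withoutNamespace returns a record that lacks namespace. If the namespace
// is already absent, the guard returns the input record and copies nothing.
func withoutNamespace(rec *subjectRecord, namespace string) (*subjectRecord, bool) {
	if _, ok := rec.namespaces[namespace]; !ok {
		return rec, false // no-op guard: nothing to remove
	}
	next := make(map[string]struct{}, len(rec.namespaces))
	for ns := range rec.namespaces {
		if ns != namespace {
			next[ns] = struct{}{}
		}
	}
	return &subjectRecord{subject: rec.subject, namespaces: next}, true
}

func main() {
	rec := &subjectRecord{subject: "alice", namespaces: map[string]struct{}{}}
	copies := 0

	// The same subject arrives ten times for one namespace, as it would
	// when upstream returns duplicate subjects.
	for i := 0; i < 10; i++ {
		var copied bool
		rec, copied = withNamespace(rec, "ns-a")
		if copied {
			copies++
		}
	}
	fmt.Println("copies made:", copies)
}
```

Only the first of the ten calls copies the set; the remaining nine hit the guard, which is why the D=1 and D=10 rows in the "after" table are nearly identical.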

sanchezl · 2026-05-21T14:37:00Z

/verified by "TestAuthorizationCacheRace"

openshift-ci-robot · 2026-05-21T14:37:16Z

@sanchezl: This PR has been marked as verified by "TestAuthorizationCacheRace".

Details

In response to this:

/verified by "TestAuthorizationCacheRace"


openshift-ci-robot · 2026-05-21T14:39:40Z

@sanchezl: This pull request references Jira Issue OCPBUGS-84534, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (5.0.0) matches configured target version for branch (5.0.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:


Summary by CodeRabbit

Refactor

Redesigned the authorization cache to provide consistent snapshots for readers, use atomic swaps for store replacements, and adopt copy-on-write updates so incremental syncs safely support concurrent reads.

Tests

Added a concurrency race test exercising concurrent readers/writers against the cache.

Added benchmarks for full cache invalidation, incremental copy-on-write behavior, and subject-update performance across namespace counts.


openshift-ci · 2026-05-21T18:04:50Z

@sanchezl: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-upgrade	`2de9d35`	link	true	`/test e2e-aws-ovn-upgrade`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot requested review from deads2k and derekwaynecarr May 7, 2026 14:27

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 7, 2026

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 11, 2026

benluddy reviewed May 12, 2026

View reviewed changes

sanchezl force-pushed the bugfix/project-auth-cache-race branch from 10ef6dd to 84c0d39 Compare May 12, 2026 18:50

openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 12, 2026

sanchezl force-pushed the bugfix/project-auth-cache-race branch from 84c0d39 to 1cf6ada Compare May 12, 2026 19:12

sanchezl force-pushed the bugfix/project-auth-cache-race branch from 1cf6ada to 044c077 Compare May 12, 2026 23:51

sanchezl force-pushed the bugfix/project-auth-cache-race branch 4 times, most recently from e61a4e8 to f17e018 Compare May 13, 2026 06:31

sanchezl force-pushed the bugfix/project-auth-cache-race branch from f17e018 to 7d02e79 Compare May 13, 2026 14:51

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 13, 2026

sanchezl and others added 3 commits May 21, 2026 10:31

project: add full cache invalidation benchmark

888a918

Benchmark synchronize() at varying namespace × user scales (10/10 through 1000/1000) with the cache always expired, forcing full invalidation on every iteration.

sanchezl force-pushed the bugfix/project-auth-cache-race branch from 7d02e79 to 2de9d35 Compare May 21, 2026 14:35

openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 21, 2026

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 21, 2026
