Summary (product impact)
A single unresolvable inheritedRoles reference in a Milo IAM Role currently disables every permission that role grants — not just the broken one. The Role controller computes effective permissions all-or-nothing: if one inherited role can't be resolved, it fails the whole Role (EffectivePermissionsError) and writes zero OpenFGA tuples. For a broadly-inherited role like a project owner (≈16 inherited roles), one small/transient config mistake causes a broad, silent access outage across unrelated capabilities.
This is a reliability + operability problem for anyone managing access via GitOps, and it's hard to diagnose from the symptom (users see scattered Forbidden errors with no obvious cause).
What customers/operators experience
- A user who should have access gets Forbidden on resources that have nothing to do with the misconfigured grant.
- The failure is non-obvious: the symptom (denied request) is far from the cause (a dangling reference in one inherited role on a different role).
- Recovery is slow: even after the referenced role finally exists, the controller is still in exponential backoff (up to ~16 min) and doesn't re-list, so the role stays broken well after the fix is in place.
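To make the ~16 min figure concrete, here is a rough sketch of per-item exponential backoff. The 5 ms base delay and 1000 s cap are assumptions (they match what I understand to be the client-go workqueue defaults); the controller's actual rate-limiter settings are not confirmed in this issue.

```python
# Assumed values, not confirmed for this controller.
BASE_DELAY_S = 0.005   # delay after the first failure
MAX_DELAY_S = 1000.0   # cap, roughly 16.7 minutes


def backoff_delay(failures: int) -> float:
    """Delay in seconds before the retry that follows `failures` consecutive failures."""
    if failures < 1:
        return 0.0
    # The delay doubles with every consecutive failure until it hits the cap.
    return min(BASE_DELAY_S * (2 ** (failures - 1)), MAX_DELAY_S)


def failures_until_cap() -> int:
    """Number of consecutive failures after which every retry waits the full cap."""
    failures = 1
    while backoff_delay(failures) < MAX_DELAY_S:
        failures += 1
    return failures
```

Under these assumptions a Role that has been failing for a while waits the full cap between attempts, so a fix that lands just after a retry is not picked up until the next one.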
Real incident (repro)
While enabling project owners to read LocationBindings, the project owner role inherited services.miloapis.com-entitlement-admin, which had not yet been created. That one unresolved reference wedged the entire owner role — so project owners couldn't read LocationBindings either (a completely unrelated, valid grant). It took a multi-step investigation to trace Forbidden on locationbindings → owner role EffectivePermissionsError → a missing, unrelated role. Once the missing role deployed, the owner role only recovered minutes later on the natural backoff retry.
Controller: auth-provider-openfga-controller-manager, controller role. Error: Failed to compute effective permissions: inherited role '<name>' not found in namespace '<ns>'.
Why this is also GitOps-hostile
In GitOps, references are expected to resolve at reconcile time (eventually consistent), not at write time. A role/binding legitimately arriving before the role it inherits is a normal transient — especially across independently-published bundles deployed by separate Flux sources/intervals. All-or-nothing turns that normal ordering gap into a whole-role outage until convergence.
Desired behavior
- Graceful degradation, not whole-role failure. Apply the permissions from the inherited roles that do resolve; a single unresolved reference must not zero out the rest. (Stay fail-closed for the unresolved reference itself — never over-grant.)
- Loud, specific visibility. Surface a Degraded / Ready=False condition + event on the Role that names the exact unresolved reference, so the misconfiguration is immediately diagnosable instead of manifesting as mystery Forbiddens elsewhere.
- Fast convergence. When the referenced role appears, recompute promptly (cap/shorten the backoff or watch for the dependency) rather than waiting out a long backoff.
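A minimal sketch of the graceful-degradation behavior described above, not the controller's actual code. The `Role` and `Resolution` types, the permission strings, and the assumption that inheritance is resolved transitively by name (namespaces omitted for brevity) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    permissions: set = field(default_factory=set)
    inherited_roles: list = field(default_factory=list)


@dataclass
class Resolution:
    permissions: set   # union of everything that did resolve
    unresolved: list   # "<referrer> -> <missing role>" entries for the condition/event

    @property
    def ready(self) -> bool:
        return not self.unresolved


def resolve_effective_permissions(role: Role, roles: dict) -> Resolution:
    """Collect permissions from every resolvable inherited role.

    A missing inherited role contributes nothing (fail-closed for that
    reference) and is recorded, instead of failing the whole Role.
    """
    permissions = set(role.permissions)
    unresolved = []
    seen = {role.name}  # also guards against inheritance cycles
    stack = [(role.name, ref) for ref in role.inherited_roles]
    while stack:
        referrer, name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        inherited = roles.get(name)
        if inherited is None:
            unresolved.append(f"{referrer} -> {name}")
            continue
        permissions |= inherited.permissions
        stack.extend((name, ref) for ref in inherited.inherited_roles)
    return Resolution(permissions, sorted(unresolved))
```

With an `owner` role that inherits one existing role and one missing role, the result carries the existing role's permissions, `ready` is false, and `unresolved` names the missing reference. A reconciler could write tuples for `permissions` and put `unresolved` into the Degraded / Ready=False condition message.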
Explicitly NOT this
Do not add admission-time existence validation that denies a Role/PolicyBinding referencing a not-yet-existing role. That fights GitOps eventual consistency: it would make Flux apply fail every reconcile until ordering happens to line up, turn Kustomizations red during normal rollouts, force explicit dependsOn coupling, and break clean-cluster bootstrap. Reference-existence checking belongs left of runtime — e.g. a CI lint across the role bundles at PR time (catches typos like entitlement-admin vs entitlement-viewer before merge), or at most an admission warning (never a deny).
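The CI lint mentioned above could look roughly like the following sketch. It takes manifests that have already been parsed into dictionaries (for example by a YAML loader in the CI job). The manifest shape is an assumption: `kind: Role`, with `spec.inheritedRoles` as a list of entries that carry a `name` and an optional `namespace` defaulting to the referencing role's namespace.

```python
def find_dangling_references(manifests: list) -> list:
    """Return one message per inheritedRoles entry that no bundle defines.

    `manifests` is the combined list of parsed documents from all role
    bundles; the field layout used here is assumed, not confirmed.
    """
    roles = [m for m in manifests if m.get("kind") == "Role"]
    defined = {
        (m["metadata"].get("namespace", ""), m["metadata"]["name"]) for m in roles
    }
    problems = []
    for m in roles:
        namespace = m["metadata"].get("namespace", "")
        name = m["metadata"]["name"]
        for ref in m.get("spec", {}).get("inheritedRoles", []):
            target = (ref.get("namespace", namespace), ref["name"])
            if target not in defined:
                problems.append(
                    f"{namespace}/{name}: inheritedRoles entry "
                    f"{target[0]}/{target[1]} is not defined in any bundle"
                )
    return problems
```

A PR check would fail when the returned list is non-empty. Because it runs across all bundles at once, it does not care about apply order, which keeps it compatible with the eventual-consistency argument above.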
Acceptance criteria
- A Role with one unresolvable inheritedRoles entry still grants the permissions from all resolvable inherited roles (verified via OpenFGA tuples / a real access check).
- The Role reports a clear Degraded/non-Ready condition + event naming the unresolved reference.
- When the missing inherited role is created, the Role converges to fully-Ready without a long delay.
- No admission-time deny is introduced for not-yet-existing references; GitOps apply ordering remains order-independent.
- (Optional/related) A CI lint exists to flag inheritedRole references that don't resolve to any defined role across the platform's role bundles.
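For the convergence criterion, one option is a reverse index from each referenced role to the roles that inherit it, so that the appearance of a role triggers recomputation of its dependents instead of waiting for backoff. The sketch below is illustrative only; `DependencyIndex` and its methods are hypothetical names, and namespaces are omitted.

```python
from collections import defaultdict


class DependencyIndex:
    """Tracks which roles reference which, so dependents can be requeued."""

    def __init__(self):
        self._dependents = defaultdict(set)  # referenced role -> roles that inherit it
        self._references = {}                # role -> set of roles it currently inherits

    def update(self, role_name: str, inherited_roles: list) -> None:
        """Record the current inheritedRoles of a role, replacing older entries."""
        for old in self._references.get(role_name, set()):
            self._dependents[old].discard(role_name)
        self._references[role_name] = set(inherited_roles)
        for ref in inherited_roles:
            self._dependents[ref].add(role_name)

    def roles_to_requeue(self, appeared: str) -> list:
        """Roles to recompute when `appeared` is created or changed.

        Walks dependents transitively, since a role's effective permissions
        can depend on roles several inheritance levels down.
        """
        pending = [appeared]
        found = set()
        while pending:
            current = pending.pop()
            for dependent in self._dependents.get(current, ()):
                if dependent not in found:
                    found.add(dependent)
                    pending.append(dependent)
        return sorted(found)
```

The index is populated even for references that do not resolve yet, which is the case that matters: when `entitlement-admin` is finally created, the roles waiting on it can be enqueued immediately.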
🤖 Generated with Claude Code