IAM: one unresolved inheritedRole disables an entire Role's permissions — make it degrade gracefully #628

Description

@scotwells

Summary (product impact)

A single unresolvable inheritedRoles reference in a Milo IAM Role currently disables every permission that role grants — not just the broken one. The Role controller computes effective permissions all-or-nothing: if one inherited role can't be resolved, it fails the whole Role (EffectivePermissionsError) and writes zero OpenFGA tuples. For a broadly-inherited role like a project owner (≈16 inherited roles), one small/transient config mistake causes a broad, silent access outage across unrelated capabilities.

This is a reliability + operability problem for anyone managing access via GitOps, and it's hard to diagnose from the symptom (users see scattered Forbidden errors with no obvious cause).

What customers/operators experience

  • A user who should have access gets Forbidden on resources that have nothing to do with the misconfigured grant.
  • The failure is non-obvious: the symptom (denied request) is far from the cause (a dangling reference in one inherited role on a different role).
  • Recovery is slow: after the referenced role finally exists, the controller is still in exponential backoff (delays of up to roughly 16 min) and doesn't re-list, so the role stays broken well after the fix is in place.
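To make the backoff figure concrete, here is a minimal sketch of per-item exponential backoff. It assumes the controller uses the common client-go workqueue defaults (5 ms base delay, 1000 s cap); this issue does not confirm which rate limiter the Role controller actually uses, so treat the constants as assumptions.

```python
# Assumed values: client-go's default per-item exponential limiter
# (5 ms base delay, 1000 s cap). Not confirmed for this controller.
BASE_DELAY_S = 0.005
MAX_DELAY_S = 1000.0


def backoff_delay(failures: int) -> float:
    """Delay applied after the given number of consecutive failures (1-based)."""
    return min(BASE_DELAY_S * (2 ** (failures - 1)), MAX_DELAY_S)


def failures_until_cap() -> int:
    """Smallest consecutive-failure count whose delay reaches the cap."""
    n = 1
    while backoff_delay(n) < MAX_DELAY_S:
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 10, 15, 18, failures_until_cap()):
        print(f"failure {n:2d}: wait {backoff_delay(n):9.3f} s")
```

Under these assumed values the delay doubles on every failed reconcile and hits the 1000 s cap (about 16.7 min) on the 19th consecutive failure, which is consistent with a role staying broken for many minutes after the missing dependency has appeared.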

Real incident (repro)

While enabling project owners to read LocationBindings, the project owner role inherited services.miloapis.com-entitlement-admin, which had not yet been created. That one unresolved reference wedged the entire owner role, so project owners couldn't read LocationBindings either (a completely unrelated, valid grant). It took a multi-step investigation to trace the failure: Forbidden on locationbindings → owner role EffectivePermissionsError → a missing, unrelated role. Once the missing role was deployed, the owner role recovered only minutes later, on the natural backoff retry.

Controller: auth-provider-openfga-controller-manager, controller role. Error: Failed to compute effective permissions: inherited role '<name>' not found in namespace '<ns>'.

Why this is also GitOps-hostile

In GitOps, references are expected to resolve at reconcile time (eventually consistent), not at write time. A role/binding legitimately arriving before the role it inherits is a normal transient — especially across independently-published bundles deployed by separate Flux sources/intervals. All-or-nothing turns that normal ordering gap into a whole-role outage until convergence.

Desired behavior

  1. Graceful degradation, not whole-role failure. Apply the permissions from the inherited roles that do resolve; a single unresolved reference must not zero out the rest. (Stay fail-closed for the unresolved reference itself — never over-grant.)
  2. Loud, specific visibility. Surface a Degraded / Ready=False condition + event on the Role that names the exact unresolved reference, so the misconfiguration is immediately diagnosable instead of manifesting as mystery Forbiddens elsewhere.
  3. Fast convergence. When the referenced role appears, recompute promptly (cap/shorten the backoff or watch for the dependency) rather than waiting out a long backoff.
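The first two behaviors above can be sketched as follows. This is an illustrative model, not the controller's actual code: the `Role` and `Resolution` types, the condition reason `InheritedRoleNotFound`, and the permission strings are all hypothetical. It collects permissions from every inherited role that resolves, records the ones that do not, and derives a condition that names them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Role:
    # Hypothetical, simplified stand-in for an IAM Role object.
    name: str
    permissions: set[str] = field(default_factory=set)
    inherited_roles: list[str] = field(default_factory=list)


@dataclass
class Resolution:
    permissions: set[str]
    unresolved: list[str]

    @property
    def condition(self) -> dict:
        """Status condition naming every unresolved reference (hypothetical shape)."""
        if not self.unresolved:
            return {"type": "Ready", "status": "True",
                    "reason": "AllInheritedRolesResolved", "message": ""}
        return {"type": "Ready", "status": "False",
                "reason": "InheritedRoleNotFound",
                "message": "unresolved inheritedRoles: " + ", ".join(self.unresolved)}


def resolve(role_name: str, roles: dict[str, Role]) -> Resolution:
    """Union permissions over all resolvable inherited roles.

    A missing role contributes nothing (fail-closed for that reference)
    but does not discard permissions gathered from the other roles.
    """
    permissions: set[str] = set()
    unresolved: list[str] = []
    seen: set[str] = set()

    def visit(name: str) -> None:
        if name in seen:  # guards against cycles and repeated inheritance
            return
        seen.add(name)
        role = roles.get(name)
        if role is None:
            unresolved.append(name)
            return
        permissions.update(role.permissions)
        for parent in role.inherited_roles:
            visit(parent)

    visit(role_name)
    return Resolution(permissions, unresolved)
```

With an owner role that inherits one existing role and one missing role, `resolve` returns the owner's own permissions plus those of the existing role, and the condition message names only the missing one. Nothing is granted on behalf of the unresolved reference.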

Explicitly NOT this

Do not add admission-time existence validation that denies a Role/PolicyBinding referencing a not-yet-existing role. That fights GitOps eventual consistency: it would make Flux apply fail every reconcile until ordering happens to line up, turn Kustomizations red during normal rollouts, force explicit dependsOn coupling, and break clean-cluster bootstrap. Reference-existence checking belongs left of runtime — e.g. a CI lint across the role bundles at PR time (catches typos like entitlement-admin vs entitlement-viewer before merge), or at most an admission warning (never a deny).
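A PR-time lint along these lines could look like the sketch below. It assumes the role manifests have already been parsed into dictionaries and that each `inheritedRoles` entry carries a `name` key; both the manifest shape and the function name `find_dangling_refs` are assumptions for illustration, and namespaces are ignored for brevity.

```python
def find_dangling_refs(manifests: list[dict]) -> list[tuple[str, str]]:
    """Return (role, missing_reference) pairs across all role bundles.

    Assumed manifest shape (hypothetical):
      {"metadata": {"name": ...}, "spec": {"inheritedRoles": [{"name": ...}]}}
    """
    defined = {m["metadata"]["name"] for m in manifests}
    dangling = []
    for manifest in manifests:
        role_name = manifest["metadata"]["name"]
        for ref in manifest.get("spec", {}).get("inheritedRoles", []):
            if ref["name"] not in defined:
                dangling.append((role_name, ref["name"]))
    return sorted(dangling)


if __name__ == "__main__":
    import sys

    # Example input; a real lint would load every bundle's manifests from disk.
    bundles = [
        {"metadata": {"name": "project-owner"},
         "spec": {"inheritedRoles": [{"name": "entitlement-admin"}]}},
        {"metadata": {"name": "entitlement-viewer"}, "spec": {}},
    ]
    problems = find_dangling_refs(bundles)
    for role, missing in problems:
        print(f"{role}: inheritedRoles entry '{missing}' is not defined in any bundle")
    sys.exit(1 if problems else 0)
```

Because the check runs over the union of all bundles rather than per bundle, it stays independent of apply order and never blocks a cluster-side reconcile.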

Acceptance criteria

  • A Role with one unresolvable inheritedRoles entry still grants the permissions from all resolvable inherited roles (verified via OpenFGA tuples / a real access check).
  • The Role reports a clear Degraded/non-Ready condition + event naming the unresolved reference.
  • When the missing inherited role is created, the Role converges to fully Ready without a long delay.
  • No admission-time deny is introduced for not-yet-existing references; GitOps apply ordering remains order-independent.
  • (Optional/related) A CI lint exists to flag inheritedRoles references that don't resolve to any defined role across the platform's role bundles.
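One way to meet the convergence criterion is to watch for the dependency instead of waiting out the backoff: keep a reverse index from each referenced role to the roles that inherit it, and requeue the dependents whenever a role is created or updated. The sketch below models only that lookup; the function names are hypothetical and the wiring into the controller's watch machinery is left out.

```python
from __future__ import annotations


def build_dependents_index(inherits: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each referenced role name to the roles that directly inherit it.

    `inherits` maps a role name to its inheritedRoles names. Referenced
    names need not exist yet, so dangling references are indexed too.
    """
    index: dict[str, set[str]] = {}
    for role, parents in inherits.items():
        for parent in parents:
            index.setdefault(parent, set()).add(role)
    return index


def roles_to_requeue(changed_role: str, index: dict[str, set[str]]) -> set[str]:
    """All roles that directly or transitively inherit from changed_role."""
    affected: set[str] = set()
    pending = [changed_role]
    while pending:
        current = pending.pop()
        for dependent in index.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                pending.append(dependent)
    return affected
```

When the previously missing role appears, `roles_to_requeue` yields exactly the roles whose effective permissions may have changed, so they can be reconciled immediately regardless of where they sit in their backoff schedule.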

🤖 Generated with Claude Code
