Skip to content

Validate ActivityPolicy CEL for unguarded responseObject derefs (DLQ-leak guard) #214

Description

@ecv

Problem

ActivityPolicy summary/match CEL can dereference audit.responseObject.<deep.path>
(e.g. responseObject.metadata.name, responseObject.spec.x) without guarding
it. On a rejected create/update (409/422/admission-deny) the apiserver
returns a Status object as responseObject — no spec, no metadata.name.
CEL raises no such key, the event goes to the DLQ, retries fail identically →
slow steady leak. This caused the prod DLQSlowLeak on
gateway.networking.k8s.io-gateway (see #212).

Six other repos shipped the same latent pattern (NSO, milo, billing — fixed in
datum-cloud/network-services-operator#230, milo-os/milo#672, milo-os/billing#69).

Why existing validation does not catch it

audit is declared cel.MapType(StringType, cel.DynType)
(internal/cel/environment.go), so it is fully dynamic. env.Compile() accepts
audit.responseObject.metadata.name with no error, and ValidatePolicyExpression
(internal/cel/policy.go:50) is compile-based — so it cannot flag this. The
failure is a runtime "no such key" on a specific event shape, not a syntax/type
error. Detection must be structural (AST), not compile.

Top-level fields are already defaulted to empty maps
(environment.go:96-110) so has(audit.responseObject) is always true — which
is exactly why guarding the root is insufficient and deep-path guards are
required.

Ask

Add server-side validation to the ActivityPolicy path
(internal/controller/activitypolicy_controller.go, ideally a validating
admission webhook so bad specs are rejected at apply, not just reconciled with a
status condition) that flags unguarded deep dereferences of audit.responseObject
(and deep audit.requestObject) in auditRules[].summary and .match.

This is the primary gate: it is the only layer that catches ad-hoc / orphan
CRs applied directly to a control plane. The gateway.networking.k8s.io-gateway
policy that caused #212 was an orphan — it never flowed through any repo's CI, so
a CI-only lint would not have stopped it.

Detection rule

For each auditRule, a deref audit.responseObject.<a.b...> (depth ≥ the
defaulted top level) is UNSAFE unless either:

  • the same rule's match contains has(audit.responseObject.<same path or an ancestor that guarantees it>), or
  • the deref itself is guarded inline: has(audit.responseObject.<path>) ? ... : ....

Must evaluate match + summary togetherdelete rules legitimately deref
responseObject.spec.* because their match guards has(audit.responseObject.spec).
Apply the same logic to deep requestObject derefs (cf. milo-os/billing#69:
requestObject.spec.projectRef.name guarded only to projectRef).

Implementation note: walk the compiled CEL AST (env.Compile returns the AST)
for select chains rooted at responseObject/requestObject; check enclosing
has() calls and the rule's match AST. Reuse summaryTemplateRegex
(internal/cel/policy.go:34) to extract {{ }} expressions from summaries.

Reuse / second entry point

Factor the checker so it is callable from both admission and a CLI. Add an
activity policy lint subcommand (next to pkg/cmd/policy/preview.go) that runs
the same check over policy YAML files. A shared datum-cloud/actions reusable
workflow (sibling to validate-kustomize.yaml) will invoke that binary for fast
pre-merge feedback in milo/billing/NSO — tracked separately; this issue is the
admission/controller gate only.

Acceptance

  • Shared checker function: given a rule (match + summary), returns
    violations for unguarded deep responseObject/requestObject derefs
  • Wired into ActivityPolicy validation (admission webhook preferred;
    otherwise a hard status.conditions failure that blocks reconcile)
  • Rejecting an unguarded create-summary is covered by a test; the six
    already-fixed real policies pass
  • delete rules guarded via match still pass (no false positives)
  • Clear error: name the rule, the unguarded path, and suggest the
    has(...) ? ... : ... fix

Refs #212. Related fixes: datum-cloud/network-services-operator#230,
milo-os/milo#672, milo-os/billing#69, #213.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions