Problem
ActivityPolicy summary/match CEL can dereference audit.responseObject.<deep.path>
(e.g. responseObject.metadata.name, responseObject.spec.x) without guarding
it. On a rejected create/update (409/422/admission-deny) the apiserver
returns a Status object as responseObject — no spec, no metadata.name.
CEL raises no such key, the event goes to the DLQ, retries fail identically →
slow steady leak. This caused the prod DLQSlowLeak on
gateway.networking.k8s.io-gateway (see #212).
Six other repos shipped the same latent pattern (NSO, milo, billing — fixed in
datum-cloud/network-services-operator#230, milo-os/milo#672, milo-os/billing#69).
Why existing validation does not catch it
audit is declared cel.MapType(StringType, cel.DynType)
(internal/cel/environment.go), so it is fully dynamic. env.Compile() accepts
audit.responseObject.metadata.name with no error, and ValidatePolicyExpression
(internal/cel/policy.go:50) is compile-based — so it cannot flag this. The
failure is a runtime "no such key" on a specific event shape, not a syntax/type
error. Detection must be structural (AST), not compile.
Top-level fields are already defaulted to empty maps
(environment.go:96-110) so has(audit.responseObject) is always true — which
is exactly why guarding the root is insufficient and deep-path guards are
required.
Ask
Add server-side validation to the ActivityPolicy path
(internal/controller/activitypolicy_controller.go, ideally a validating
admission webhook so bad specs are rejected at apply, not just reconciled with a
status condition) that flags unguarded deep dereferences of audit.responseObject
(and deep audit.requestObject) in auditRules[].summary and .match.
This is the primary gate: it is the only layer that catches ad-hoc / orphan
CRs applied directly to a control plane. The gateway.networking.k8s.io-gateway
policy that caused #212 was an orphan — it never flowed through any repo's CI, so
a CI-only lint would not have stopped it.
Detection rule
For each auditRule, a deref audit.responseObject.<a.b...> (depth ≥ the
defaulted top level) is UNSAFE unless either:
- the same rule's
match contains has(audit.responseObject.<same path or an ancestor that guarantees it>), or
- the deref itself is guarded inline:
has(audit.responseObject.<path>) ? ... : ....
Must evaluate match + summary together — delete rules legitimately deref
responseObject.spec.* because their match guards has(audit.responseObject.spec).
Apply the same logic to deep requestObject derefs (cf. milo-os/billing#69:
requestObject.spec.projectRef.name guarded only to projectRef).
Implementation note: walk the compiled CEL AST (env.Compile returns the AST)
for select chains rooted at responseObject/requestObject; check enclosing
has() calls and the rule's match AST. Reuse summaryTemplateRegex
(internal/cel/policy.go:34) to extract {{ }} expressions from summaries.
Reuse / second entry point
Factor the checker so it is callable from both admission and a CLI. Add an
activity policy lint subcommand (next to pkg/cmd/policy/preview.go) that runs
the same check over policy YAML files. A shared datum-cloud/actions reusable
workflow (sibling to validate-kustomize.yaml) will invoke that binary for fast
pre-merge feedback in milo/billing/NSO — tracked separately; this issue is the
admission/controller gate only.
Acceptance
Refs #212. Related fixes: datum-cloud/network-services-operator#230,
milo-os/milo#672, milo-os/billing#69, #213.
Problem
ActivityPolicy summary/match CEL can dereference
audit.responseObject.<deep.path>(e.g.
responseObject.metadata.name,responseObject.spec.x) without guardingit. On a rejected create/update (409/422/admission-deny) the apiserver
returns a
Statusobject asresponseObject— nospec, nometadata.name.CEL raises
no such key, the event goes to the DLQ, retries fail identically →slow steady leak. This caused the prod
DLQSlowLeakongateway.networking.k8s.io-gateway(see #212).Six other repos shipped the same latent pattern (NSO, milo, billing — fixed in
datum-cloud/network-services-operator#230, milo-os/milo#672, milo-os/billing#69).
Why existing validation does not catch it
auditis declaredcel.MapType(StringType, cel.DynType)(
internal/cel/environment.go), so it is fully dynamic.env.Compile()acceptsaudit.responseObject.metadata.namewith no error, andValidatePolicyExpression(
internal/cel/policy.go:50) is compile-based — so it cannot flag this. Thefailure is a runtime "no such key" on a specific event shape, not a syntax/type
error. Detection must be structural (AST), not compile.
Top-level fields are already defaulted to empty maps
(
environment.go:96-110) sohas(audit.responseObject)is always true — whichis exactly why guarding the root is insufficient and deep-path guards are
required.
Ask
Add server-side validation to the ActivityPolicy path
(
internal/controller/activitypolicy_controller.go, ideally a validatingadmission webhook so bad specs are rejected at apply, not just reconciled with a
status condition) that flags unguarded deep dereferences of
audit.responseObject(and deep
audit.requestObject) inauditRules[].summaryand.match.This is the primary gate: it is the only layer that catches ad-hoc / orphan
CRs applied directly to a control plane. The
gateway.networking.k8s.io-gatewaypolicy that caused #212 was an orphan — it never flowed through any repo's CI, so
a CI-only lint would not have stopped it.
Detection rule
For each
auditRule, a derefaudit.responseObject.<a.b...>(depth ≥ thedefaulted top level) is UNSAFE unless either:
matchcontainshas(audit.responseObject.<same path or an ancestor that guarantees it>), orhas(audit.responseObject.<path>) ? ... : ....Must evaluate match + summary together —
deleterules legitimately derefresponseObject.spec.*because their match guardshas(audit.responseObject.spec).Apply the same logic to deep
requestObjectderefs (cf. milo-os/billing#69:requestObject.spec.projectRef.nameguarded only toprojectRef).Implementation note: walk the compiled CEL AST (
env.Compilereturns the AST)for
selectchains rooted atresponseObject/requestObject; check enclosinghas()calls and the rule's match AST. ReusesummaryTemplateRegex(
internal/cel/policy.go:34) to extract{{ }}expressions from summaries.Reuse / second entry point
Factor the checker so it is callable from both admission and a CLI. Add an
activity policy lintsubcommand (next topkg/cmd/policy/preview.go) that runsthe same check over policy YAML files. A shared
datum-cloud/actionsreusableworkflow (sibling to
validate-kustomize.yaml) will invoke that binary for fastpre-merge feedback in milo/billing/NSO — tracked separately; this issue is the
admission/controller gate only.
Acceptance
violations for unguarded deep
responseObject/requestObjectderefsotherwise a hard
status.conditionsfailure that blocks reconcile)already-fixed real policies pass
deleterules guarded via match still pass (no false positives)has(...) ? ... : ...fixRefs #212. Related fixes: datum-cloud/network-services-operator#230,
milo-os/milo#672, milo-os/billing#69, #213.