Summary
DLQSlowLeak firing in prod (prod-infrastructure-control-plane). The
gateway.networking.k8s.io-gateway ActivityPolicy sends Gateway audit events to
the DLQ at a slow, steady rate. ~120 events / 6h and climbing (started as
51 + 69 across the two processor pods). Gateway activity timeline entries are
silently dropped for the affected event shapes.
- Alert: DLQSlowLeak (dlq-health group), severity warning
- Cluster: prod-infrastructure-control-plane, namespace activity-system
- Runbook: docs/runbooks/dlq/dlq-growth.md → docs/runbooks/dlq/policy-dlq-errors.md
Root cause
Processor logs (activity-processor, dlq_retry.go:513), repeating, retries
exhausted at retryCount: 5:
DLQ event re-failed evaluation
eventType=audit
policy=gateway.networking.k8s.io-gateway
errorType=cel_summary
retryCount=5
err=rule 0 summary: summary template evaluation failed:
eval "link(audit.responseObject.metadata.name, audit.objectRef)": no such key: name
Rule 0's summary template dereferences audit.responseObject.metadata.name.
For some Gateway audit events responseObject is present but its metadata
has no name key (DELETE, status subresource, and error/forbidden
responses where the apiserver does not echo a named object). CEL raises
no such key: name, evaluation fails, the event goes to the DLQ, and the
retries fail identically — hence the steady leak rather than a spike.
Only this policy leaks. resourcemanager.miloapis.com-project (the other
audit policy) reports 0:
sum(increase(activity_processor_dlq_events_published_total[6h])) by (policy_name, error_type, kind)
{error_type="cel_summary", kind="Gateway", policy_name="gateway.networking.k8s.io-gateway"} = 120
{error_type="cel_summary", kind="Project", policy_name="resourcemanager.miloapis.com-project"} = 0
Where the policy config lives
The failing config is a live ActivityPolicy custom resource named
gateway.networking.k8s.io-gateway on the prod-infrastructure-control-plane
control plane. It is not version-controlled:
- Not shipped as a default/seed in this repo: grep for the policy name and kind: ActivityPolicy finds only e2e fixtures under test/e2e/activitypolicy/policies/. No bootstrap/seeder creates it.
- Not present in the datum-cloud/infra GitOps repo either.
So its source of truth is external / applied ad-hoc. Gap to fix: this
policy should be checked into a GitOps-managed location so a fix survives and
is reviewable.
Relevant code paths (for reference, not the bug):
internal/controller/activitypolicy_controller.go — reconciles ActivityPolicy
internal/cel/environment.go — defines the link(displayText, resourceRef) builtin
internal/cel/policy.go, internal/processor/policy.go — summary evaluation
internal/activityprocessor/dlq_retry.go — DLQ retry path (emits the error above)
This repo's own example already uses the correct null-safe form —
examples/basic-kubernetes/networking.yaml:19:
{{ has(audit.objectRef.name) ? link(audit.objectRef.name, audit.objectRef) : '' }}
Fix
Make the summary CEL null-safe. audit.objectRef is always present on audit
events and carries the name, so use it as the source (or fall back to it):
link(
  has(audit.responseObject.metadata.name)
    ? audit.responseObject.metadata.name
    : audit.objectRef.name,
  audit.objectRef
)
Apply to the live CR:
kubectl edit activitypolicy gateway.networking.k8s.io-gateway
# patch rule 0 summary; save → processor immediately retries the DLQ backlog
Secondary bug — the runbook's own fix is wrong
docs/runbooks/dlq/policy-dlq-errors.md documents this remediation:
# Before: audit.responseObject.metadata.name
# After: has(audit.responseObject) ? audit.responseObject.metadata.name : audit.objectRef.name
This does not fix the present error: when responseObject exists but
metadata.name is absent, has(audit.responseObject) is true and the
expression still evaluates audit.responseObject.metadata.name → same
no such key: name. The guard must be on the full leaf path
(has(audit.responseObject.metadata.name)), not on the root object. Branch
fixes the runbook guidance.
Acceptance
- gateway.networking.k8s.io-gateway CR patched; cel_summary DLQ rate for this policy returns to 0
- Runbook guidance corrected (has() on the leaf path)