fix: gateway ActivityPolicy summary CEL leaks DLQ on events missing responseObject.metadata.name #212

Description

@ecv

Summary

DLQSlowLeak is firing in prod (prod-infrastructure-control-plane). The
gateway.networking.k8s.io-gateway ActivityPolicy sends Gateway audit events to
the DLQ at a slow, steady rate: ~120 events / 6h and climbing (started as
51 + 69 across the two processor pods). Gateway activity timeline entries are
silently dropped for the affected event shapes.

  • Alert: DLQSlowLeak (dlq-health group), severity warning
  • Cluster: prod-infrastructure-control-plane, namespace activity-system
  • Runbook: docs/runbooks/dlq/dlq-growth.md → docs/runbooks/dlq/policy-dlq-errors.md

Root cause

Processor logs (activity-processor, dlq_retry.go:513), repeating, retries
exhausted at retryCount: 5:

DLQ event re-failed evaluation
  eventType=audit
  policy=gateway.networking.k8s.io-gateway
  errorType=cel_summary
  retryCount=5
  err=rule 0 summary: summary template evaluation failed:
      eval "link(audit.responseObject.metadata.name, audit.objectRef)": no such key: name

Rule 0's summary template dereferences audit.responseObject.metadata.name.
For some Gateway audit events responseObject is present but its metadata
has no name key (DELETE, status subresource, and error/forbidden
responses where the apiserver does not echo a named object). CEL raises
no such key: name, evaluation fails, and the event goes to the DLQ. Each
retry re-evaluates the same event against the same template, so it fails
identically, hence the steady leak rather than a spike.
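
To make the failure mode concrete, here is a minimal Python sketch of it. It is a simplified model of CEL map field selection, not the cel-go implementation or the processor's code. The event shapes and the `evaluate_with_retries` helper are hypothetical, and treating `retryCount=5` as five attempts is an assumption about how the processor counts.

```python
# Simplified model of CEL map field selection; NOT the cel-go implementation.
class NoSuchKey(Exception):
    pass


def select(obj, *path):
    """Walk nested dicts; a missing key raises, as CEL does for map lookups."""
    for key in path:
        if key not in obj:
            raise NoSuchKey(f"no such key: {key}")
        obj = obj[key]
    return obj


def evaluate_with_retries(event, max_retries=5):
    """Re-run the same unguarded lookup; returns (ok, attempts, last_error)."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            select(event, "responseObject", "metadata", "name")
            return True, attempt, None
        except NoSuchKey as exc:
            # Nothing about the event or the expression changes between
            # attempts, so every retry hits the same error.
            last_error = str(exc)
    return False, max_retries, last_error


# Hypothetical event shapes (values are made up for illustration).
named = {
    "objectRef": {"name": "gw-a"},
    "responseObject": {"metadata": {"name": "gw-a"}},
}
unnamed = {
    "objectRef": {"name": "gw-a"},
    "responseObject": {"metadata": {}},  # e.g. a response without a named object
}

named_result = evaluate_with_retries(named)
unnamed_result = evaluate_with_retries(unnamed)
print(named_result)
print(unnamed_result)
```

In this model the named event succeeds on the first attempt, while the unnamed one exhausts every attempt with the same error, which is consistent with a leak that tracks the arrival rate of the affected event shapes.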

Only this policy leaks. resourcemanager.miloapis.com-project (the other
audit policy) reports 0:

sum(increase(activity_processor_dlq_events_published_total[6h])) by (policy_name, error_type, kind)
  {error_type="cel_summary", kind="Gateway", policy_name="gateway.networking.k8s.io-gateway"} = 120
  {error_type="cel_summary", kind="Project", policy_name="resourcemanager.miloapis.com-project"} = 0

Where the policy config lives

The failing config is a live ActivityPolicy custom resource named
gateway.networking.k8s.io-gateway on the prod-infrastructure-control-plane
control plane. It is not version-controlled:

  • Not shipped as a default/seed in this repo — grep for the policy name and
    kind: ActivityPolicy finds only e2e fixtures under
    test/e2e/activitypolicy/policies/. No bootstrap/seeder creates it.
  • Not present in the datum-cloud/infra GitOps repo either.

So its source of truth is external / applied ad-hoc. Gap to fix: this
policy should be checked into a GitOps-managed location so a fix survives and
is reviewable.

Relevant code paths (for reference, not the bug):

  • internal/controller/activitypolicy_controller.go — reconciles ActivityPolicy
  • internal/cel/environment.go — defines the link(displayText, resourceRef) builtin
  • internal/cel/policy.go, internal/processor/policy.go — summary evaluation
  • internal/activityprocessor/dlq_retry.go — DLQ retry path (emits the error above)

This repo's own example already uses the correct null-safe form —
examples/basic-kubernetes/networking.yaml:19:

{{ has(audit.objectRef.name) ? link(audit.objectRef.name, audit.objectRef) : '' }}

Fix

Make the summary CEL null-safe. audit.objectRef is always present on audit
events and carries the name, so use it as the source (or fall back to it):

link(
  has(audit.responseObject.metadata.name)
    ? audit.responseObject.metadata.name
    : audit.objectRef.name,
  audit.objectRef
)
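
As a rough check of the ternary above, the following Python sketch models it on plain dicts. This is an illustration under assumptions, not the processor's evaluator: `has` and `display_name` are hypothetical helpers that approximate CEL's presence test and the first argument to link(), and the sample events are invented.

```python
# Simplified model of CEL's has() and ternary on map data; NOT cel-go itself.
def has(obj, *path):
    """Presence test on the last selector; earlier selectors must exist."""
    for key in path[:-1]:
        obj = obj[key]  # a missing intermediate key raises KeyError
    return path[-1] in obj


def display_name(audit):
    """Mirror the proposed ternary: prefer the response name, else objectRef.name."""
    if has(audit, "responseObject", "metadata", "name"):
        return audit["responseObject"]["metadata"]["name"]
    return audit["objectRef"]["name"]


# Hypothetical events; names and shapes are illustrative only.
create = {
    "objectRef": {"name": "gw-a"},
    "responseObject": {"metadata": {"name": "gw-a"}},
}
delete = {
    "objectRef": {"name": "gw-b"},
    "responseObject": {"metadata": {}},  # metadata present, name absent
}

names = [display_name(event) for event in (create, delete)]
print(names)
```

Both shapes yield a display name here, so the summary would render instead of failing. The sketch relies on audit.objectRef.name being present, as stated above.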

Apply to the live CR:

kubectl edit activitypolicy gateway.networking.k8s.io-gateway
# patch rule 0 summary; save → processor immediately retries the DLQ backlog

Secondary bug — the runbook's own fix is wrong

docs/runbooks/dlq/policy-dlq-errors.md documents this remediation:

# Before: audit.responseObject.metadata.name
# After:  has(audit.responseObject) ? audit.responseObject.metadata.name : audit.objectRef.name

This does not fix the present error: when responseObject exists but
metadata.name is absent, has(audit.responseObject) is true and the
expression still evaluates audit.responseObject.metadata.name → same
no such key: name. The guard must be on the full leaf path
(has(audit.responseObject.metadata.name)), not on the root object. The
branch for this issue fixes the runbook guidance.
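
The difference between the two guards can be sketched in Python as below. This is a simplified model, assuming the processor's CEL environment follows the CEL language definition, where has() tests only the final selector and errors from earlier selectors propagate. The `no_response` shape is not reported in this issue; it is included on the assumption that events without a responseObject can occur. `chained_guard` is a hypothetical stricter form, not something the issue or runbook proposes.

```python
# Simplified model of CEL select/has() on maps; NOT the cel-go implementation.
class NoSuchKey(Exception):
    pass


def select(obj, *path):
    """Walk nested dicts; a missing key raises, as CEL does for map lookups."""
    for key in path:
        if key not in obj:
            raise NoSuchKey(f"no such key: {key}")
        obj = obj[key]
    return obj


def has(obj, *path):
    """Test only the last selector; errors on earlier selectors propagate."""
    return path[-1] in select(obj, *path[:-1])


LEAF = ("responseObject", "metadata", "name")


def root_guard(a):  # the runbook's form
    return select(a, *LEAF) if has(a, "responseObject") else select(a, "objectRef", "name")


def leaf_guard(a):  # the form proposed in this issue
    return select(a, *LEAF) if has(a, *LEAF) else select(a, "objectRef", "name")


def chained_guard(a):  # hypothetical: guard every level that might be absent
    ok = (
        has(a, "responseObject")
        and has(a, "responseObject", "metadata")
        and has(a, *LEAF)
    )
    return select(a, *LEAF) if ok else select(a, "objectRef", "name")


def outcome(guard, event):
    try:
        return guard(event)
    except NoSuchKey as exc:
        return f"error: {exc}"


# Hypothetical event shapes (values are made up for illustration).
events = {
    "named": {"objectRef": {"name": "gw-a"}, "responseObject": {"metadata": {"name": "gw-a"}}},
    "unnamed": {"objectRef": {"name": "gw-b"}, "responseObject": {"metadata": {}}},
    "no_response": {"objectRef": {"name": "gw-c"}},
}

results = {
    guard.__name__: {shape: outcome(guard, event) for shape, event in events.items()}
    for guard in (root_guard, leaf_guard, chained_guard)
}
for guard_name, by_shape in results.items():
    print(guard_name, by_shape)
```

In this model the root guard fails on the unnamed shape (the error seen in prod), and the leaf guard handles it. If events with no responseObject at all reach this policy, the leaf guard alone would fail on them under these assumptions, so it may be worth confirming against the real evaluator whether a chained guard is needed.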

Acceptance

  • Live gateway.networking.k8s.io-gateway CR patched; cel_summary DLQ
    rate for this policy returns to 0
  • Runbook CEL guidance corrected (has() on the leaf path)
  • Gateway policy CR moved into a GitOps-managed source of truth (follow-up)
