milo-os · scotwells · Jun 25, 2026
diff --git a/docs/runbooks/dlq/dlq-growth.md b/docs/runbooks/dlq/dlq-growth.md
@@ -29,14 +29,22 @@ kubectl exec -n activity-system deploy/prometheus -- \
   promtool query instant 'topk(10, sum by (policy_name) (rate(activity_processor_dlq_events_published_total[5m])))'
 ```
 
-### 2. Check processor logs
+### 2. Check processor logs (Loki)
 
-```bash
-# Look for DLQ-related errors
-kubectl logs -n activity-system -l app=activity-processor --tail=100 | grep -i "dlq\|dead.letter\|error"
+DLQ failures often predate the current pods, so query Loki rather than
+`kubectl logs`. Full query guide: [querying-logs-with-loki.md](./querying-logs-with-loki.md).
+
+```logql
+# what's failing, grouped by policy + error class (last 15m)
+sum by (policy, errorType) (
+  count_over_time({namespace="activity-system", container="processor"} | json | errorType != "" [15m])
+)
+```
+
+Read `err` on a sample line to see the failing rule/field, then drill into one policy:
 
-# Check for policy evaluation errors
-kubectl logs -n activity-system -l app=activity-processor --tail=100 | grep -i "failed to evaluate"
+```logql
+{namespace="activity-system", container="processor"} | json | policy="<policy-name>" |~ "(?i)evaluat|dlq"
 ```
 
 ### 3. Common error patterns

diff --git a/docs/runbooks/dlq/dlq-high-retry-count.md b/docs/runbooks/dlq/dlq-high-retry-count.md
@@ -37,13 +37,19 @@ kubectl get activitypolicy <policy-name> -o yaml
 kubectl get activitypolicy <policy-name> -o jsonpath='{.status.conditions}'
 ```
 
-### 3. Examine processor logs
+### 3. Examine processor logs (Loki)
 
-```bash
-# Look for repeated errors for this policy
-kubectl logs -n activity-system -l app=activity-processor --tail=500 | grep "<policy-name>"
+Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md).
+High retry counts usually mean one policy is broken and a small set of events is
+looping. Confirm by checking how many *distinct* `auditID`s are re-failing:
+
+```logql
+{namespace="activity-system", container="processor"} | json | policy="<policy-name>" | line_format "{{.auditID}}"
 ```
 
+A few IDs across many lines ⇒ those events are stuck on an unfixed policy;
+read `err` on a sample line for the exact failing rule/field.
+
 ### 4. Common causes
 
 **Policy never fixed**:

diff --git a/docs/runbooks/dlq/dlq-publish-errors.md b/docs/runbooks/dlq/dlq-publish-errors.md
@@ -27,13 +27,18 @@ kubectl exec -n nats-system deploy/nats-box -- nats stream info ACTIVITY_DEAD_LE
 kubectl logs -n nats-system -l app.kubernetes.io/name=nats --tail=100 | grep -i error
 ```
 
-### 2. Check processor logs
+### 2. Check processor logs (Loki)
 
-```bash
-# Look for DLQ publish errors
-kubectl logs -n activity-system -l app=activity-processor --tail=200 | grep -i "failed to publish to DLQ"
+Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md):
+
+```logql
+{namespace="activity-system", container="processor"} |~ "(?i)failed to publish.*dlq|NAK"
 ```
 
+These lines mean the processor could not even write to the DLQ — events are
+being lost. This is almost always a NATS/JetStream problem (see below), not a
+policy problem.
+
 ### 3. Check NATS connectivity
 
 ```bash

diff --git a/docs/runbooks/dlq/dlq-retry-failing.md b/docs/runbooks/dlq/dlq-retry-failing.md
@@ -29,11 +29,21 @@ kubectl exec -n activity-system deploy/prometheus -- \
   promtool query instant 'sum by (api_group, kind) (rate(activity_processor_dlq_retry_attempts_total{result="failed"}[15m]))'
 ```
 
-### 2. Check processor logs for retry errors
+### 2. Check processor logs for retry errors (Loki)
 
-```bash
-# Look for retry failures
-kubectl logs -n activity-system -l app=activity-processor --tail=300 | grep -i "failed to republish\|failed to retry"
+Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md).
+Compare failed vs. succeeded across retry runs:
+
+```logql
+sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalFailed [15m]))
+sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalSucceeded [15m]))
+```
+
+If the same `auditID` keeps re-failing, the underlying policy is still broken
+(retry can't fix a bad policy) — pull the IDs to confirm a loop:
+
+```logql
+{namespace="activity-system", container="processor"} | json | errorType != "" | line_format "{{.auditID}}"
 ```
 
 ### 3. Identify root cause

diff --git a/docs/runbooks/dlq/policy-dlq-errors.md b/docs/runbooks/dlq/policy-dlq-errors.md
@@ -26,12 +26,19 @@ Alert includes these labels:
 kubectl get activitypolicy <policy-name> -o yaml
 ```
 
-### 2. Check processor logs for this policy
+### 2. Check processor logs for this policy (Loki)
 
-```bash
-kubectl logs -n activity-system -l app=activity-processor --tail=200 | grep "<policy-name>"
+Query Loki (failures may predate the current pods) — see
+[querying-logs-with-loki.md](./querying-logs-with-loki.md):
+
+```logql
+{namespace="activity-system", container="processor"} | json | policy="<policy-name>" |~ "(?i)evaluat|dlq"
 ```
 
+Read the `err` field — for `errorType="cel_summary"` it names the failing rule
+and the missing key (e.g. `no such key: name`), which tells you exactly which
+field the summary/match dereferences unsafely.
+
 ### 3. Test policy with sample event
 
 ```bash

diff --git a/docs/runbooks/dlq/querying-logs-with-loki.md b/docs/runbooks/dlq/querying-logs-with-loki.md
@@ -0,0 +1,113 @@
+# Querying activity-processor logs with Loki
+
+Every DLQ runbook in this directory eventually needs you to read the
+processor's logs. **Use Loki, not `kubectl logs`, for DLQ triage.**
+Dead-lettered events are frequently older than the current pods — they survive
+rollouts and can sit unprocessed for days — and triage usually means
+*aggregating* or *counting* across time, which `kubectl logs --tail` cannot do.
+
+## Running these queries
+
+Open **Grafana → Explore → select the Loki datasource**, then paste any LogQL
+query below into the query box and run it. Expand a log line to see all of its
+parsed fields. For the aggregating queries (the `sum ... count_over_time` and
+`unwrap` examples), switch the Explore panel to **Table** or **Graph** to read
+the result.
+
+## Base selector & log shape
+
+Start every query from:
+
+```logql
+{namespace="activity-system", container="processor"}
+```
+
+The processor logs **structured JSON** — pipe through `| json` to filter and
+extract fields. The useful ones:
+
+| Field | Meaning |
+|---|---|
+| `msg` | log message (wording varies across releases — see note below) |
+| `policy` | ActivityPolicy name, e.g. `networking.datumapis.com-connector` |
+| `errorType` | DLQ failure class: `cel_summary`, `cel_match`, `unmarshal`, `kind_resolve` — **the primary triage dimension** |
+| `err` | underlying error string; for CEL errors it names the failing rule and missing key |
+| `auditID` | source audit event ID — use it to follow one event across retries |
+| `subject` | NATS subject the event was dead-lettered to (`activity.dlq.<group>.<Kind>`) |
+| `retryCount`, `totalProcessed`, `totalSucceeded`, `totalFailed` | on retry-run lines |
+
+> Prefer filtering on the **stable structured fields** (`errorType`, `policy`,
+> `err`) and broad keywords (`|~ "(?i)dlq|dead.?letter|evaluat"`) over full
+> `msg` strings — message wording is refactored from time to time, the fields
+> and the `errorType` taxonomy are not.
+
+## Triage recipes (by alert)
+
+### Which policies / fields are failing — `DLQQueueGrowing`, `ActivityPolicyDLQErrors`
+
+```logql
+# everything dead-lettered, grouped by policy + error class, last hour
+sum by (policy, errorType) (
+  count_over_time({namespace="activity-system", container="processor"} | json | errorType != "" [1h])
+)
+```
+
+Then read `err` on a sample line. For `errorType="cel_summary"`, `err` looks
+like:
+
+```
+rule N summary: ... eval "link(audit.objectRef.name, audit.objectRef)": no such key: name
+```
+
+i.e. the summary dereferences a field that isn't present on this event shape.
+The classic case: `objectRef.name` is empty for `generateName` creates (the
+assigned name is in `responseObject.metadata.name`). See
+[policy-dlq-errors.md](./policy-dlq-errors.md) for the fix pattern.
+
+### Pin the exact failing expression for one policy
+
+```logql
+{namespace="activity-system", container="processor"} | json | policy="<policy-name>" |~ "(?i)evaluat|dlq"
+```
+
+### Real backlog vs. a churn loop — `DLQHighRetryCount`, `DLQRetryIneffective`
+
+A single event re-failing over and over (same `auditID`) is a stuck loop, not
+progress. Pull the auditIDs and look for repeats:
+
+```logql
+{namespace="activity-system", container="processor"} | json | errorType="cel_summary" | line_format "{{.auditID}}"
+```
+
+Few distinct IDs across many lines ⇒ a handful of events looping (fix the
+policy). Many distinct IDs ⇒ broad breakage (likely a shared field/shape bug).
+
+### Retry effectiveness — `DLQRetryIneffective`
+
+```logql
+# failed vs succeeded across retry runs, last hour
+sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalFailed [1h]))
+sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalSucceeded [1h]))
+```
+
+### DLQ publish failures (events being lost) — `DLQPublishErrors`
+
+```logql
+{namespace="activity-system", container="processor"} |~ "(?i)failed to publish.*dlq|NAK"
+```
+
+## Gotchas
+
+- **Lines are truncated (~4096 chars)** by the log collector. `auditID`,
+  `policy`, `errorType`, and `err` sit early and survive, but the full
+  `eventJSON` payload is usually cut off. To inspect a complete failing event,
+  look it up live by name/auditID rather than relying on the logged payload.
+- **Mind the query window.** Loki enforces `max_query_length` (commonly ~30d) —
+  very wide ranges error out — and retains logs only for a limited period.
+  **Failures older than retention are not in Loki**; fall back to metrics or the
+  NATS stream.
+- **DLQ "depth" is a metric, not a log, and can mislead.** Real depth is
+  `nats_stream_total_messages{stream_name="ACTIVITY_DEAD_LETTER"}` in
+  VictoriaMetrics. It can read **0 even while events are actively failing**,
+  because the stream is WorkQueue with a dedup window — failing events are
+  acked/dropped, not retained. Never infer "healthy" from depth alone;
+  corroborate with the eval-failure logs above.