diff --git a/docs/runbooks/dlq/dlq-growth.md b/docs/runbooks/dlq/dlq-growth.md index 0b8d5fab..afa3a919 100644 --- a/docs/runbooks/dlq/dlq-growth.md +++ b/docs/runbooks/dlq/dlq-growth.md @@ -29,14 +29,22 @@ kubectl exec -n activity-system deploy/prometheus -- \ promtool query instant 'topk(10, sum by (policy_name) (rate(activity_processor_dlq_events_published_total[5m])))' ``` -### 2. Check processor logs +### 2. Check processor logs (Loki) -```bash -# Look for DLQ-related errors -kubectl logs -n activity-system -l app=activity-processor --tail=100 | grep -i "dlq\|dead.letter\|error" +DLQ failures often predate the current pods, so query Loki rather than +`kubectl logs`. Full query guide: [querying-logs-with-loki.md](./querying-logs-with-loki.md). + +```logql +# what's failing, grouped by policy + error class (last 15m) +sum by (policy, errorType) ( + count_over_time({namespace="activity-system", container="processor"} | json | errorType != "" [15m]) +) +``` + +Read `err` on a sample line to see the failing rule/field, then drill into one policy: -# Check for policy evaluation errors -kubectl logs -n activity-system -l app=activity-processor --tail=100 | grep -i "failed to evaluate" +```logql +{namespace="activity-system", container="processor"} | json | policy="" |~ "(?i)evaluat|dlq" ``` ### 3. Common error patterns diff --git a/docs/runbooks/dlq/dlq-high-retry-count.md b/docs/runbooks/dlq/dlq-high-retry-count.md index 8a4522fe..ed7a02ec 100644 --- a/docs/runbooks/dlq/dlq-high-retry-count.md +++ b/docs/runbooks/dlq/dlq-high-retry-count.md @@ -37,13 +37,19 @@ kubectl get activitypolicy -o yaml kubectl get activitypolicy -o jsonpath='{.status.conditions}' ``` -### 3. Examine processor logs +### 3. Examine processor logs (Loki) -```bash -# Look for repeated errors for this policy -kubectl logs -n activity-system -l app=activity-processor --tail=500 | grep "" +Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md). +High retry counts usually mean one policy is broken and a small set of events is +looping. Confirm by checking how many *distinct* `auditID`s are re-failing: + +```logql +{namespace="activity-system", container="processor"} | json | policy="" | line_format "{{.auditID}}" ``` +A few IDs across many lines ⇒ those events are stuck on an unfixed policy; +read `err` on a sample line for the exact failing rule/field. + ### 4. Common causes **Policy never fixed**: diff --git a/docs/runbooks/dlq/dlq-publish-errors.md b/docs/runbooks/dlq/dlq-publish-errors.md index f7ca13ca..ce02c9a6 100644 --- a/docs/runbooks/dlq/dlq-publish-errors.md +++ b/docs/runbooks/dlq/dlq-publish-errors.md @@ -27,13 +27,18 @@ kubectl exec -n nats-system deploy/nats-box -- nats stream info ACTIVITY_DEAD_LE kubectl logs -n nats-system -l app.kubernetes.io/name=nats --tail=100 | grep -i error ``` -### 2. Check processor logs +### 2. Check processor logs (Loki) -```bash -# Look for DLQ publish errors -kubectl logs -n activity-system -l app=activity-processor --tail=200 | grep -i "failed to publish to DLQ" +Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md): + +```logql +{namespace="activity-system", container="processor"} |~ "(?i)failed to publish.*dlq|NAK" ``` +These lines mean the processor could not even write to the DLQ — events are +being lost. This is almost always a NATS/JetStream problem (see below), not a +policy problem. + ### 3. Check NATS connectivity ```bash diff --git a/docs/runbooks/dlq/dlq-retry-failing.md b/docs/runbooks/dlq/dlq-retry-failing.md index ad07a3aa..2ad396fd 100644 --- a/docs/runbooks/dlq/dlq-retry-failing.md +++ b/docs/runbooks/dlq/dlq-retry-failing.md @@ -29,11 +29,21 @@ kubectl exec -n activity-system deploy/prometheus -- \ promtool query instant 'sum by (api_group, kind) (rate(activity_processor_dlq_retry_attempts_total{result="failed"}[15m]))' ``` -### 2. Check processor logs for retry errors +### 2. Check processor logs for retry errors (Loki) -```bash -# Look for retry failures -kubectl logs -n activity-system -l app=activity-processor --tail=300 | grep -i "failed to republish\|failed to retry" +Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md). +Compare failed vs. succeeded across retry runs: + +```logql +sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalFailed [15m])) +sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalSucceeded [15m])) +``` + +If the same `auditID` keeps re-failing, the underlying policy is still broken +(retry can't fix a bad policy) — pull the IDs to confirm a loop: + +```logql +{namespace="activity-system", container="processor"} | json | errorType != "" | line_format "{{.auditID}}" ``` ### 3. Identify root cause diff --git a/docs/runbooks/dlq/policy-dlq-errors.md b/docs/runbooks/dlq/policy-dlq-errors.md index f4126987..c5c48edf 100644 --- a/docs/runbooks/dlq/policy-dlq-errors.md +++ b/docs/runbooks/dlq/policy-dlq-errors.md @@ -26,12 +26,19 @@ Alert includes these labels: kubectl get activitypolicy -o yaml ``` -### 2. Check processor logs for this policy +### 2. Check processor logs for this policy (Loki) -```bash -kubectl logs -n activity-system -l app=activity-processor --tail=200 | grep "" +Query Loki (failures may predate the current pods) — see +[querying-logs-with-loki.md](./querying-logs-with-loki.md): + +```logql +{namespace="activity-system", container="processor"} | json | policy="" |~ "(?i)evaluat|dlq" ``` +Read the `err` field — for `errorType="cel_summary"` it names the failing rule +and the missing key (e.g. `no such key: name`), which tells you exactly which +field the summary/match dereferences unsafely. + ### 3. Test policy with sample event ```bash diff --git a/docs/runbooks/dlq/querying-logs-with-loki.md b/docs/runbooks/dlq/querying-logs-with-loki.md new file mode 100644 index 00000000..480ae772 --- /dev/null +++ b/docs/runbooks/dlq/querying-logs-with-loki.md @@ -0,0 +1,113 @@ +# Querying activity-processor logs with Loki + +Every DLQ runbook in this directory eventually needs you to read the +processor's logs. **Use Loki, not `kubectl logs`, for DLQ triage.** +Dead-lettered events are frequently older than the current pods — they survive +rollouts and can sit unprocessed for days — and triage usually means +*aggregating* or *counting* across time, which `kubectl logs --tail` cannot do. + +## Running these queries + +Open **Grafana → Explore → select the Loki datasource**, then paste any LogQL +query below into the query box and run it. Expand a log line to see all of its +parsed fields. For the aggregating queries (the `sum ... count_over_time` and +`unwrap` examples), switch the Explore panel to **Table** or **Graph** to read +the result. + +## Base selector & log shape + +Start every query from: + +```logql +{namespace="activity-system", container="processor"} +``` + +The processor logs **structured JSON** — pipe through `| json` to filter and +extract fields. The useful ones: + +| Field | Meaning | +|---|---| +| `msg` | log message (wording varies across releases — see note below) | +| `policy` | ActivityPolicy name, e.g. `networking.datumapis.com-connector` | +| `errorType` | DLQ failure class: `cel_summary`, `cel_match`, `unmarshal`, `kind_resolve` — **the primary triage dimension** | +| `err` | underlying error string; for CEL errors it names the failing rule and missing key | +| `auditID` | source audit event ID — use it to follow one event across retries | +| `subject` | NATS subject the event was dead-lettered to (`activity.dlq..`) | +| `retryCount`, `totalProcessed`, `totalSucceeded`, `totalFailed` | on retry-run lines | + +> Prefer filtering on the **stable structured fields** (`errorType`, `policy`, +> `err`) and broad keywords (`|~ "(?i)dlq|dead.?letter|evaluat"`) over full +> `msg` strings — message wording is refactored from time to time, the fields +> and the `errorType` taxonomy are not. + +## Triage recipes (by alert) + +### Which policies / fields are failing — `DLQQueueGrowing`, `ActivityPolicyDLQErrors` + +```logql +# everything dead-lettered, grouped by policy + error class, last hour +sum by (policy, errorType) ( + count_over_time({namespace="activity-system", container="processor"} | json | errorType != "" [1h]) +) +``` + +Then read `err` on a sample line. For `errorType="cel_summary"`, `err` looks +like: + +``` +rule N summary: ... eval "link(audit.objectRef.name, audit.objectRef)": no such key: name +``` + +i.e. the summary dereferences a field that isn't present on this event shape. +The classic case: `objectRef.name` is empty for `generateName` creates (the +assigned name is in `responseObject.metadata.name`). See +[policy-dlq-errors.md](./policy-dlq-errors.md) for the fix pattern. + +### Pin the exact failing expression for one policy + +```logql +{namespace="activity-system", container="processor"} | json | policy="" |~ "(?i)evaluat|dlq" +``` + +### Real backlog vs. a churn loop — `DLQHighRetryCount`, `DLQRetryIneffective` + +A single event re-failing over and over (same `auditID`) is a stuck loop, not +progress. Pull the auditIDs and look for repeats: + +```logql +{namespace="activity-system", container="processor"} | json | errorType="cel_summary" | line_format "{{.auditID}}" +``` + +Few distinct IDs across many lines ⇒ a handful of events looping (fix the +policy). Many distinct IDs ⇒ broad breakage (likely a shared field/shape bug). + +### Retry effectiveness — `DLQRetryIneffective` + +```logql +# failed vs succeeded across retry runs, last hour +sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalFailed [1h])) +sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalSucceeded [1h])) +``` + +### DLQ publish failures (events being lost) — `DLQPublishErrors` + +```logql +{namespace="activity-system", container="processor"} |~ "(?i)failed to publish.*dlq|NAK" +``` + +## Gotchas + +- **Lines are truncated (~4096 chars)** by the log collector. `auditID`, + `policy`, `errorType`, and `err` sit early and survive, but the full + `eventJSON` payload is usually cut off. To inspect a complete failing event, + look it up live by name/auditID rather than relying on the logged payload. +- **Mind the query window.** Loki enforces `max_query_length` (commonly ~30d) — + very wide ranges error out — and retains logs only for a limited period. + **Failures older than retention are not in Loki**; fall back to metrics or the + NATS stream. +- **DLQ "depth" is a metric, not a log, and can mislead.** Real depth is + `nats_stream_total_messages{stream_name="ACTIVITY_DEAD_LETTER"}` in + VictoriaMetrics. It can read **0 even while events are actively failing**, + because the stream is WorkQueue with a dedup window — failing events are + acked/dropped, not retained. Never infer "healthy" from depth alone; + corroborate with the eval-failure logs above.