Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 14 additions & 6 deletions docs/runbooks/dlq/dlq-growth.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,14 +29,22 @@ kubectl exec -n activity-system deploy/prometheus -- \
promtool query instant 'topk(10, sum by (policy_name) (rate(activity_processor_dlq_events_published_total[5m])))'
```

### 2. Check processor logs
### 2. Check processor logs (Loki)

```bash
# Look for DLQ-related errors
kubectl logs -n activity-system -l app=activity-processor --tail=100 | grep -i "dlq\|dead.letter\|error"
DLQ failures often predate the current pods, so query Loki rather than
`kubectl logs`. Full query guide: [querying-logs-with-loki.md](./querying-logs-with-loki.md).

```logql
# what's failing, grouped by policy + error class (last 15m)
sum by (policy, errorType) (
count_over_time({namespace="activity-system", container="processor"} | json | errorType != "" [15m])
)
```

Read `err` on a sample line to see the failing rule/field, then drill into one policy:

# Check for policy evaluation errors
kubectl logs -n activity-system -l app=activity-processor --tail=100 | grep -i "failed to evaluate"
```logql
{namespace="activity-system", container="processor"} | json | policy="<policy-name>" |~ "(?i)evaluat|dlq"
```

### 3. Common error patterns
Expand Down
14 changes: 10 additions & 4 deletions docs/runbooks/dlq/dlq-high-retry-count.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,13 +37,19 @@ kubectl get activitypolicy <policy-name> -o yaml
kubectl get activitypolicy <policy-name> -o jsonpath='{.status.conditions}'
```

### 3. Examine processor logs
### 3. Examine processor logs (Loki)

```bash
# Look for repeated errors for this policy
kubectl logs -n activity-system -l app=activity-processor --tail=500 | grep "<policy-name>"
Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md).
High retry counts usually mean one policy is broken and a small set of events is
looping. Confirm by checking how many *distinct* `auditID`s are re-failing:

```logql
{namespace="activity-system", container="processor"} | json | policy="<policy-name>" | line_format "{{.auditID}}"
```

A few IDs across many lines ⇒ those events are stuck on an unfixed policy;
read `err` on a sample line for the exact failing rule/field.

### 4. Common causes

**Policy never fixed**:
Expand Down
13 changes: 9 additions & 4 deletions docs/runbooks/dlq/dlq-publish-errors.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,13 +27,18 @@ kubectl exec -n nats-system deploy/nats-box -- nats stream info ACTIVITY_DEAD_LE
kubectl logs -n nats-system -l app.kubernetes.io/name=nats --tail=100 | grep -i error
```

### 2. Check processor logs
### 2. Check processor logs (Loki)

```bash
# Look for DLQ publish errors
kubectl logs -n activity-system -l app=activity-processor --tail=200 | grep -i "failed to publish to DLQ"
Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md):

```logql
{namespace="activity-system", container="processor"} |~ "(?i)failed to publish.*dlq|NAK"
```

These lines mean the processor could not even write to the DLQ — events are
being lost. This is almost always a NATS/JetStream problem (see below), not a
policy problem.

### 3. Check NATS connectivity

```bash
Expand Down
18 changes: 14 additions & 4 deletions docs/runbooks/dlq/dlq-retry-failing.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,11 +29,21 @@ kubectl exec -n activity-system deploy/prometheus -- \
promtool query instant 'sum by (api_group, kind) (rate(activity_processor_dlq_retry_attempts_total{result="failed"}[15m]))'
```

### 2. Check processor logs for retry errors
### 2. Check processor logs for retry errors (Loki)

```bash
# Look for retry failures
kubectl logs -n activity-system -l app=activity-processor --tail=300 | grep -i "failed to republish\|failed to retry"
Query Loki — see [querying-logs-with-loki.md](./querying-logs-with-loki.md).
Compare failed vs. succeeded across retry runs:

```logql
sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalFailed [15m]))
sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalSucceeded [15m]))
```

If the same `auditID` keeps re-failing, the underlying policy is still broken
(retry can't fix a bad policy) — pull the IDs to confirm a loop:

```logql
{namespace="activity-system", container="processor"} | json | errorType != "" | line_format "{{.auditID}}"
```

### 3. Identify root cause
Expand Down
13 changes: 10 additions & 3 deletions docs/runbooks/dlq/policy-dlq-errors.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,12 +26,19 @@ Alert includes these labels:
kubectl get activitypolicy <policy-name> -o yaml
```

### 2. Check processor logs for this policy
### 2. Check processor logs for this policy (Loki)

```bash
kubectl logs -n activity-system -l app=activity-processor --tail=200 | grep "<policy-name>"
Query Loki (failures may predate the current pods) — see
[querying-logs-with-loki.md](./querying-logs-with-loki.md):

```logql
{namespace="activity-system", container="processor"} | json | policy="<policy-name>" |~ "(?i)evaluat|dlq"
```

Read the `err` field — for `errorType="cel_summary"` it names the failing rule
and the missing key (e.g. `no such key: name`), which tells you exactly which
field the summary/match dereferences unsafely.

### 3. Test policy with sample event

```bash
Expand Down
113 changes: 113 additions & 0 deletions docs/runbooks/dlq/querying-logs-with-loki.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
# Querying activity-processor logs with Loki

Every DLQ runbook in this directory eventually needs you to read the
processor's logs. **Use Loki, not `kubectl logs`, for DLQ triage.**
Dead-lettered events are frequently older than the current pods — they survive
rollouts and can sit unprocessed for days — and triage usually means
*aggregating* or *counting* across time, which `kubectl logs --tail` cannot do.

## Running these queries

Open **Grafana → Explore → select the Loki datasource**, then paste any LogQL
query below into the query box and run it. Expand a log line to see all of its
parsed fields. For the aggregating queries (the `sum ... count_over_time` and
`unwrap` examples), switch the Explore panel to **Table** or **Graph** to read
the result.

## Base selector & log shape

Start every query from:

```logql
{namespace="activity-system", container="processor"}
```

The processor logs **structured JSON** — pipe through `| json` to filter and
extract fields. The useful ones:

| Field | Meaning |
|---|---|
| `msg` | log message (wording varies across releases — see note below) |
| `policy` | ActivityPolicy name, e.g. `networking.datumapis.com-connector` |
| `errorType` | DLQ failure class: `cel_summary`, `cel_match`, `unmarshal`, `kind_resolve` — **the primary triage dimension** |
| `err` | underlying error string; for CEL errors it names the failing rule and missing key |
| `auditID` | source audit event ID — use it to follow one event across retries |
| `subject` | NATS subject the event was dead-lettered to (`activity.dlq.<group>.<Kind>`) |
| `retryCount`, `totalProcessed`, `totalSucceeded`, `totalFailed` | on retry-run lines |

> Prefer filtering on the **stable structured fields** (`errorType`, `policy`,
> `err`) and broad keywords (`|~ "(?i)dlq|dead.?letter|evaluat"`) over full
> `msg` strings — message wording is refactored from time to time, the fields
> and the `errorType` taxonomy are not.

## Triage recipes (by alert)

### Which policies / fields are failing — `DLQQueueGrowing`, `ActivityPolicyDLQErrors`

```logql
# everything dead-lettered, grouped by policy + error class, last hour
sum by (policy, errorType) (
count_over_time({namespace="activity-system", container="processor"} | json | errorType != "" [1h])
)
```

Then read `err` on a sample line. For `errorType="cel_summary"`, `err` looks
like:

```
rule N summary: ... eval "link(audit.objectRef.name, audit.objectRef)": no such key: name
```

i.e. the summary dereferences a field that isn't present on this event shape.
The classic case: `objectRef.name` is empty for `generateName` creates (the
assigned name is in `responseObject.metadata.name`). See
[policy-dlq-errors.md](./policy-dlq-errors.md) for the fix pattern.

### Pin the exact failing expression for one policy

```logql
{namespace="activity-system", container="processor"} | json | policy="<policy-name>" |~ "(?i)evaluat|dlq"
```

### Real backlog vs. a churn loop — `DLQHighRetryCount`, `DLQRetryIneffective`

A single event re-failing over and over (same `auditID`) is a stuck loop, not
progress. Pull the auditIDs and look for repeats:

```logql
{namespace="activity-system", container="processor"} | json | errorType="cel_summary" | line_format "{{.auditID}}"
```

Few distinct IDs across many lines ⇒ a handful of events looping (fix the
policy). Many distinct IDs ⇒ broad breakage (likely a shared field/shape bug).

### Retry effectiveness — `DLQRetryIneffective`

```logql
# failed vs succeeded across retry runs, last hour
sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalFailed [1h]))
sum(sum_over_time({namespace="activity-system", container="processor"} | json |~ "retry run" | unwrap totalSucceeded [1h]))
```

### DLQ publish failures (events being lost) — `DLQPublishErrors`

```logql
{namespace="activity-system", container="processor"} |~ "(?i)failed to publish.*dlq|NAK"
```

## Gotchas

- **Lines are truncated (~4096 chars)** by the log collector. `auditID`,
`policy`, `errorType`, and `err` sit early and survive, but the full
`eventJSON` payload is usually cut off. To inspect a complete failing event,
look it up live by name/auditID rather than relying on the logged payload.
- **Mind the query window.** Loki enforces `max_query_length` (commonly ~30d) —
very wide ranges error out — and retains logs only for a limited period.
**Failures older than retention are not in Loki**; fall back to metrics or the
NATS stream.
- **DLQ "depth" is a metric, not a log, and can mislead.** Real depth is
`nats_stream_total_messages{stream_name="ACTIVITY_DEAD_LETTER"}` in
VictoriaMetrics. It can read **0 even while events are actively failing**,
because the stream is WorkQueue with a dedup window — failing events are
acked/dropped, not retained. Never infer "healthy" from depth alone;
corroborate with the eval-failure logs above.