feat: add DLQ & Policy Health dashboard by scotwells · Pull Request #211 · milo-os/activity

scotwells · 2026-06-26T12:10:19Z

What this does

Adds a new Grafana dashboard — Activity — DLQ & Policy Health — giving on-call engineers a single pane to answer the three questions that matter during a DLQ incident:

Is the backlog growing or draining? The stat row shows DLQ backlog size, publish rate, retry resolve rate, and net drain at a glance — positive net drain means the queue is growing, negative means retries are winning.
Which policy is broken and why? The top-N table (Row 2) surfaces the exact (policy_name, api_group, kind, error_type) tuple driving the DLQ — the same combination the ActivityPolicyDLQErrors alert fires on. Clicking a row jumps straight to Loki logs filtered to that policy.
Are retries actually recovering events? The retry & recovery row (Row 4) shows retry outcomes over time, which policies are still failing after re-eval, and which events are stuck past the high-retry threshold.
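
The sign convention for net drain is easy to misread during an incident, so here is a minimal sketch of it. It assumes net drain is computed as publish rate minus retry resolve rate; the PR states only the sign convention, not the formula, and the function names below are hypothetical.

```python
def net_drain(publish_rate: float, resolve_rate: float) -> float:
    """Assumed definition (not confirmed by the PR): DLQ inflow minus outflow, in events/sec."""
    return publish_rate - resolve_rate


def backlog_trend(publish_rate: float, resolve_rate: float) -> str:
    """Map the sign of net drain to the wording used in the stat row."""
    delta = net_drain(publish_rate, resolve_rate)
    if delta > 0:
        return "growing"   # more events enter the DLQ than retries resolve
    if delta < 0:
        return "draining"  # retries resolve events faster than new ones arrive
    return "steady"


if __name__ == "__main__":
    print(backlog_trend(4.0, 1.5))  # growing
    print(backlog_trend(0.5, 2.0))  # draining
```

Under this assumption a positive value means the backlog is growing, which matches the description above.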

The dashboard covers all six active DLQ alerts: DLQQueueGrowing, DLQSlowLeak, ActivityPolicyDLQErrors, DLQRetryIneffective, DLQHighRetryCount, and DLQPublishErrors.

How it ships

No infrastructure change needed. The dashboard is authored in Grafonnet (same as the other six dashboards in this repo), compiled to JSON, and picked up automatically by the existing Flux observability OCI sync in datum-cloud/infra. The infra overlay patches every resource matching kind: GrafanaDashboard to target vm-grafana, so the new GrafanaDashboard CR lands in the Platform / Activity folder on both staging and production after the next OCI sync post-merge.

Related

Runbooks shipped by docs(dlq): help on-call find what's failing when a DLQ alert fires #210 (DLQ triage guides, Loki querying guide)
Policy-side fixes that stop the bleeding: NSO#223, fix: show newly created projects and resources in the activity feed milo#668, fix: show new billing accounts in the activity feed billing#67, dns-operator#50

Single-pane triage dashboard for on-call engineers responding to DLQ and ActivityPolicy alerts. Surfaces the failing policy, error type, and resource kind in one view, distinguishes growing backlog from retry churn, and links directly to Loki logs filtered to the clicked policy.

Key features:
- Row 1 stat row: DLQ backlog, publish rate, net drain (growing vs draining), retry success rate, publish errors; all with or vector(0) for zero-series safety
- Row 2 top-N table: topk(25) by (policy_name, api_group, kind, error_type) matching the ActivityPolicyDLQErrors alert tuple; each row links to Loki Explore pre-filtered to that policy
- Row 3 trend series: DLQ rate by error_type, policy, and kind
- Row 4 retry recovery: outcomes by result, still-failing re-eval table (dlq_retry_failed_total), poison events (high_retry_total), batch duration p99
- Row 5 publish-path: errors by phase, latency p99/p95/p50
- Row 6 logs: JSON-parsed DLQ lines filtered by $policy_name and $error_type vars; catch-all regex fallback panel

All metric label constraints respected per §6: retry_attempts and publish_errors not filtered by policy_name/error_type (those labels don't exist on those metrics).

Dashboard ships via existing Flux observability OCI sync; no infra change required.
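
The two PromQL patterns named in the feature list can be sketched as small string builders. This is an illustration only: the metric name `activity_dlq_events_total`, the 5m rate window, and the helper names are assumptions, not taken from the dashboard source. Only the label tuple, `topk(25)`, and `or vector(0)` come from the commit message.

```python
# Label tuple the top-N table groups by (from the commit message).
ALERT_TUPLE = ("policy_name", "api_group", "kind", "error_type")


def zero_safe(expr: str) -> str:
    """Append `or vector(0)` so a label-less aggregate yields 0 instead of no data."""
    return f"({expr}) or vector(0)"


def top_n_by_tuple(metric: str, n: int = 25, window: str = "5m") -> str:
    """Build a topk query grouped by the alert tuple. `metric` is a hypothetical counter name."""
    labels = ", ".join(ALERT_TUPLE)
    return f"topk({n}, sum by ({labels}) (rate({metric}[{window}])))"


if __name__ == "__main__":
    # Hypothetical metric name, used only for illustration.
    metric = "activity_dlq_events_total"
    print(zero_safe(f"sum(rate({metric}[5m]))"))
    print(top_n_by_tuple(metric))
```

The `or vector(0)` fallback suits stat panels whose left-hand side is aggregated down to a single unlabeled series. Applied to a labeled result such as the top-N table, it would add an extra unlabeled zero row, because `or` matches on label sets and `vector(0)` carries no labels.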

scotwells wants to merge 1 commit into main from feat/dlq-policy-health-dashboard
