feat: add DLQ & Policy Health dashboard#211

Draft
scotwells wants to merge 1 commit into main from feat/dlq-policy-health-dashboard

What this does

Adds a new Grafana dashboard — Activity — DLQ & Policy Health — giving on-call engineers a single pane to answer the three questions that matter during a DLQ incident:

  1. Is the backlog growing or draining? The stat row shows DLQ backlog size, publish rate, retry resolve rate, and net drain at a glance — positive net drain means the queue is growing, negative means retries are winning.

  2. Which policy is broken and why? The top-N table (Row 2) surfaces the exact (policy_name, api_group, kind, error_type) tuple driving the DLQ — the same combination the ActivityPolicyDLQErrors alert fires on. Clicking a row jumps straight to Loki logs filtered to that policy.

  3. Are retries actually recovering events? The retry & recovery row (Row 4) shows retry outcomes over time, which policies are still failing after re-eval, and which events are stuck past the high-retry threshold.
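The sign convention for net drain in question 1 can be sketched as below. This is only an illustration: it assumes net drain is the publish rate minus the retry resolve rate, and the function names and tolerance are hypothetical, not taken from the dashboard source.

```python
# Hypothetical sketch of the net drain sign convention.
# Assumption: net drain = publish rate - retry resolve rate (events/sec).

def net_drain(publish_rate: float, resolve_rate: float) -> float:
    """Events/sec entering the DLQ minus events/sec resolved by retries."""
    return publish_rate - resolve_rate


def backlog_trend(publish_rate: float, resolve_rate: float,
                  tolerance: float = 1e-9) -> str:
    """Classify the backlog: positive drain is growing, negative is draining."""
    drain = net_drain(publish_rate, resolve_rate)
    if drain > tolerance:
        return "growing"
    if drain < -tolerance:
        return "draining"
    return "steady"


if __name__ == "__main__":
    print(backlog_trend(5.0, 2.0))  # publishes outpace retries
    print(backlog_trend(1.0, 3.0))  # retries are winning
```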

The dashboard covers all six active DLQ alerts: DLQQueueGrowing, DLQSlowLeak, ActivityPolicyDLQErrors, DLQRetryIneffective, DLQHighRetryCount, and DLQPublishErrors.

How it ships

No infrastructure change is needed. The dashboard is authored in Grafonnet (like the other six dashboards in this repo), compiled to JSON, and picked up automatically by the existing Flux observability OCI sync in datum-cloud/infra. The new GrafanaDashboard CR is matched by the infra overlay's kind: GrafanaDashboard patch, which points it at vm-grafana, and the dashboard lands in the Platform / Activity folder on both staging and production after the first OCI sync following the merge.

Related

Single-pane triage dashboard for on-call engineers responding to DLQ
and ActivityPolicy alerts. Surfaces the failing policy, error type, and
resource kind in one view, distinguishes growing backlog from retry
churn, and links directly to Loki logs filtered to the clicked policy.

Key features:
- Row 1 stat row: DLQ backlog, publish rate, net drain (growing vs
  draining), retry success rate, publish errors — all with or vector(0)
  for zero-series safety
- Row 2 top-N table: topk(25) by (policy_name, api_group, kind,
  error_type) matching the ActivityPolicyDLQErrors alert tuple; each
  row links to Loki Explore pre-filtered to that policy
- Row 3 trend series: DLQ rate by error_type, policy, and kind
- Row 4 retry recovery: outcomes by result, still-failing re-eval table
  (dlq_retry_failed_total), poison events (high_retry_total), batch
  duration p99
- Row 5 publish-path: errors by phase, latency p99/p95/p50
- Row 6 logs: JSON-parsed DLQ lines filtered by $policy_name and
  $error_type vars; catch-all regex fallback panel
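The query shapes behind Rows 1 and 2 might look roughly like the strings built below. This is a sketch, not the dashboard's actual Grafonnet: the metric name `activity_dlq_events_total` and the 5m window are placeholders, while the label tuple, `topk(25)`, and `or vector(0)` come from the list above.

```python
# Hypothetical PromQL string builders; metric name and window are assumptions.

ALERT_TUPLE = ("policy_name", "api_group", "kind", "error_type")


def zero_safe(expr: str) -> str:
    """Fall back to 0 when the expression returns no series.

    Suited to label-less aggregates such as sum(...), because vector(0)
    carries no labels.
    """
    return f"({expr}) or vector(0)"


def top_policies(metric: str, window: str = "5m", n: int = 25) -> str:
    """Top-N rate grouped by the ActivityPolicyDLQErrors alert tuple."""
    by = ", ".join(ALERT_TUPLE)
    return f"topk({n}, sum by ({by}) (rate({metric}[{window}])))"


if __name__ == "__main__":
    metric = "activity_dlq_events_total"  # placeholder, not the real metric
    print(zero_safe(f"sum(rate({metric}[5m]))"))
    print(top_policies(metric))
```

The fallback is applied only to the single-value stat query here; on a grouped query, `or vector(0)` would append an extra unlabeled series next to the real ones.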

All metric label constraints respected per §6: retry_attempts and
publish_errors not filtered by policy_name/error_type (those labels
don't exist on those metrics). Dashboard ships via existing Flux
observability OCI sync — no infra change required.
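One way to enforce the label constraint is to drop any filter a metric cannot satisfy before building its selector, as in this sketch. The label sets below are illustrative assumptions: `dlq_events` is a made-up name, and `retry_attempts` and `publish_errors` are the shorthand used above rather than confirmed full metric names.

```python
# Hypothetical guard that keeps only the matchers a metric actually carries.
# Assumption: the label sets listed here are illustrative, not authoritative.

METRIC_LABELS = {
    "dlq_events": frozenset({"policy_name", "error_type"}),
    "retry_attempts": frozenset(),   # no policy_name / error_type labels
    "publish_errors": frozenset(),   # no policy_name / error_type labels
}


def selector(metric: str, filters: dict) -> str:
    """Build a PromQL selector, skipping filters on labels the metric lacks."""
    allowed = METRIC_LABELS.get(metric, frozenset())
    matchers = [
        f'{label}=~"{value}"'
        for label, value in sorted(filters.items())
        if label in allowed
    ]
    if not matchers:
        return metric
    return metric + "{" + ", ".join(matchers) + "}"


if __name__ == "__main__":
    dashboard_vars = {"policy_name": "$policy_name", "error_type": "$error_type"}
    print(selector("dlq_events", dashboard_vars))
    print(selector("retry_attempts", dashboard_vars))
```

Filtering on a label that a metric does not carry would match no series, so the affected panels would render empty whenever a dashboard variable is set.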