feat: add DLQ & Policy Health dashboard #211
Draft
scotwells wants to merge 1 commit into
Single-pane triage dashboard for on-call engineers responding to DLQ and ActivityPolicy alerts. Surfaces the failing policy, error type, and resource kind in one view, distinguishes growing backlog from retry churn, and links directly to Loki logs filtered to the clicked policy.

Key features:
- Row 1 stat row: DLQ backlog, publish rate, net drain (growing vs draining), retry success rate, publish errors; all with `or vector(0)` for zero-series safety
- Row 2 top-N table: `topk(25)` by `(policy_name, api_group, kind, error_type)`, matching the ActivityPolicyDLQErrors alert tuple; each row links to Loki Explore pre-filtered to that policy
- Row 3 trend series: DLQ rate by error_type, policy, and kind
- Row 4 retry recovery: outcomes by result, still-failing re-eval table (dlq_retry_failed_total), poison events (high_retry_total), batch duration p99
- Row 5 publish-path: errors by phase, latency p99/p95/p50
- Row 6 logs: JSON-parsed DLQ lines filtered by $policy_name and $error_type vars; catch-all regex fallback panel

All metric label constraints respected per §6: retry_attempts and publish_errors not filtered by policy_name/error_type (those labels don't exist on those metrics).

Dashboard ships via existing Flux observability OCI sync; no infra change required.
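For reviewers less familiar with the two PromQL idioms named in the feature list, here is a rough Python sketch of how such query strings could be assembled. It is not the dashboard's Grafonnet source. The helper names `zero_safe` and `top_n` and the metric name `activity_dlq_events_total` are hypothetical, chosen only for illustration. The `or vector(0)` fallback only behaves as intended when the left-hand side is aggregated down to a single label-less series, which is why the stat query is wrapped in `sum(...)`.

```python
def zero_safe(expr: str) -> str:
    """Wrap a label-less PromQL expression so it returns 0 instead of no data."""
    return f"({expr}) or vector(0)"


def top_n(expr: str, labels: list[str], n: int = 25) -> str:
    """Aggregate expr by the given labels and keep the n largest series."""
    by = ", ".join(labels)
    return f"topk({n}, sum by ({by}) ({expr}))"


# Hypothetical metric name; the real DLQ metric may be named differently.
DLQ_RATE = "rate(activity_dlq_events_total[5m])"

# Stat panel: a single number that falls back to 0 when no series exist.
stat_query = zero_safe(f"sum({DLQ_RATE})")

# Top-N table: grouped by the same tuple the ActivityPolicyDLQErrors alert uses.
table_query = top_n(DLQ_RATE, ["policy_name", "api_group", "kind", "error_type"])

print(stat_query)
print(table_query)
```

The table query is deliberately left without the `or vector(0)` fallback: an empty top-N table is a meaningful "nothing is failing" signal, whereas a stat panel showing "No data" is easy to misread during an incident.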
What this does
Adds a new Grafana dashboard — Activity — DLQ & Policy Health — giving on-call engineers a single pane to answer the three questions that matter during a DLQ incident:
Is the backlog growing or draining? The stat row shows DLQ backlog size, publish rate, retry resolve rate, and net drain at a glance — positive net drain means the queue is growing, negative means retries are winning.
Which policy is broken and why? The top-N table (Row 2) surfaces the exact `(policy_name, api_group, kind, error_type)` tuple driving the DLQ, the same combination the `ActivityPolicyDLQErrors` alert fires on. Clicking a row jumps straight to Loki logs filtered to that policy.
Are retries actually recovering events? The retry & recovery row (Row 4) shows retry outcomes over time, which policies are still failing after re-eval, and which events are stuck past the high-retry threshold.
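The sign convention for net drain in the first question is easy to flip under pressure, so here is a minimal Python sketch of it. It assumes net drain is computed as publish rate minus retry resolve rate; the description implies this but does not state it. The names `DrainStatus` and `classify_net_drain`, and the `tolerance` band with its "steady" state, are hypothetical additions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DrainStatus:
    net_drain: float  # events/s; positive means the backlog is growing
    state: str        # "growing", "draining", or "steady"


def classify_net_drain(publish_rate: float, resolve_rate: float,
                       tolerance: float = 0.0) -> DrainStatus:
    """Classify the DLQ trend from the two rates shown in the stat row.

    Assumption: net drain = publish rate - retry resolve rate.
    """
    net = publish_rate - resolve_rate
    if net > tolerance:
        state = "growing"    # events enter the DLQ faster than retries clear them
    elif net < -tolerance:
        state = "draining"   # retries are winning
    else:
        state = "steady"
    return DrainStatus(net_drain=net, state=state)


print(classify_net_drain(12.0, 4.0))
print(classify_net_drain(2.0, 5.0))
```

With a publish rate of 12 events/s and a resolve rate of 4 events/s the result is a positive net drain and a growing queue; swapping the balance so that retries outpace publishes yields a negative value and a draining queue.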
The dashboard covers all six active DLQ alerts: `DLQQueueGrowing`, `DLQSlowLeak`, `ActivityPolicyDLQErrors`, `DLQRetryIneffective`, `DLQHighRetryCount`, and `DLQPublishErrors`.
How it ships
No infrastructure change needed. The dashboard is authored in Grafonnet (same as the other six dashboards in this repo), compiled to JSON, and picked up automatically by the existing Flux observability OCI sync in `datum-cloud/infra`. A new `GrafanaDashboard` CR matching `kind: GrafanaDashboard` is patched to `vm-grafana` by the infra overlay and lands in the `Platform / Activity` folder on both staging and production after the next OCI sync post-merge.
Related