feat(grafana-startup-logs): add diagnostic action for wait-for-grafana failures #233

Draft
darrenjaneczek wants to merge 1 commit into fix/wait-for-grafana-default-startup-timeout from feat/grafana-startup-logs

@darrenjaneczek
Contributor

Summary

Adds a sibling action — grafana-startup-logs — that pairs with wait-for-grafana on the failure path. wait-for-grafana can only observe Current status: 000 (ECONNREFUSED); it can't distinguish "Grafana is still booting" from "Grafana crashed before binding :3000." This action recovers the missing signal.

Stacks on top of #227. That PR lowers wait-for-grafana's startupTimeout default from 300s → 60s precisely because waiting longer at the polling layer doesn't help when the process is dead. The right place for the operator's signal is the diagnostic layer — i.e. this action.

What the action does

On failure (if: failure() && steps.wait.outcome == 'failure'):

  1. Discovers the Grafana container. Heuristic chain: explicit container/service inputs → first compose service whose name contains grafana → port-based docker ps --filter publish=<port> → literal grafana container.
  2. Pulls full container logs to $RUNNER_TEMP and applies a configurable regex redaction pipeline. Default patterns: Bearer …, glsa_…, glc_…, eyJ… (JWT prefix), ?token=… / ?api_key=… URL params, inline password=… / token: … assignments.
  3. Prints three workflow log groups:
    • Grafana errors and warnings (filtered) — multi-line aware so stack frames after a lvl=error line stay attached
    • Container state: output of docker compose ps -a or docker ps -a
    • Recent Grafana logs — bounded tail (default 500 lines), redacted, all levels
  4. Uploads the full redacted logs as an artifact with 7-day retention (vs. GitHub's 90-day default) and writes the artifact URL to the job summary as a markdown link.
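The discovery chain in step 1 could look roughly like the sketch below. This is a minimal illustration, not the action's actual script: the function name `discover_grafana_container`, the positional arguments, and the exact `docker` invocations are assumptions chosen to mirror the order described above.

```shell
#!/usr/bin/env bash
# Sketch of the discovery chain. The function name and the exact docker calls
# are assumptions for illustration, not the action's real implementation.
discover_grafana_container() {
  local container="$1" service="$2" port="${3:-3000}" id

  # 1. An explicit container input wins outright.
  if [ -n "$container" ]; then
    printf '%s\n' "$container"
    return 0
  fi

  # 2. Explicit service input, else the first compose service containing "grafana".
  if [ -z "$service" ]; then
    service="$(docker compose config --services 2>/dev/null | grep -m1 grafana || true)"
  fi
  if [ -n "$service" ]; then
    id="$(docker compose ps -a -q "$service" 2>/dev/null | head -n1)"
    if [ -n "$id" ]; then
      printf '%s\n' "$id"
      return 0
    fi
  fi

  # 3. Any container that publishes the Grafana port.
  id="$(docker ps -a -q --filter "publish=$port" 2>/dev/null | head -n1)"
  if [ -n "$id" ]; then
    printf '%s\n' "$id"
    return 0
  fi

  # 4. Last resort: a container literally named "grafana".
  if docker container inspect grafana >/dev/null 2>&1; then
    printf 'grafana\n'
    return 0
  fi

  return 1
}
```

Each stage returns as soon as it finds a candidate, so an explicit input is never overridden by a heuristic. A non-zero return lets the caller report that no container was found instead of dumping empty logs.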

Example

- name: Wait for Grafana to start
  id: wait
  uses: grafana/plugin-actions/wait-for-grafana@wait-for-grafana/v1.0.5

- name: Dump Grafana startup logs on failure
  if: ${{ failure() && steps.wait.outcome == 'failure' }}
  uses: grafana/plugin-actions/grafana-startup-logs@grafana-startup-logs/v1.0.0
  with:
    additional-secrets: |
      ${{ env.HL_TOKEN }}
      ${{ env.GRAFANACLOUD_USAGE_TOKEN }}

PII trust model

The README has the long version. Summary:

  • Artifact retention defaults to 7 days (shorter than GitHub's 90-day default to limit the download window)
  • Artifact upload is opt-out (upload-artifact: false)
  • Regex redactor strips common token / JWT / URL-param shapes from both inline log groups and the artifact
  • additional-secrets registers plugin-specific literal values with ::add-mask:: so GitHub's masker covers them on the log stream
  • The redactor is best-effort; treat the artifact as semi-sensitive

This is a net reduction of PII surface compared to common alternatives: operators today either re-run with ACTIONS_STEP_DEBUG=true (which dumps everything with less filtering) or attach tmate sessions, and both leak more.
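To make the additional-secrets bullet above concrete, here is a minimal sketch of how a multi-line input could be registered with GitHub's masker. `::add-mask::` is the standard workflow command; the function name `mask_additional_secrets` and the whitespace trimming are assumptions, not necessarily what the action does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit one ::add-mask:: workflow command per non-empty
# line of a multi-line input. The name and trimming rules are illustrative.
mask_additional_secrets() {
  local secrets="$1" line
  while IFS= read -r line; do
    # Strip leading and trailing whitespace left over from YAML block scalars.
    line="${line#"${line%%[![:space:]]*}"}"
    line="${line%"${line##*[![:space:]]}"}"
    # Skip blank lines so an unset secret never registers an empty mask.
    [ -n "$line" ] || continue
    printf '::add-mask::%s\n' "$line"
  done <<< "$secrets"
}
```

Skipping empty lines matters because an unset `${{ env.X }}` expands to an empty string. As the PR notes, the masker only covers the log stream, so the regex redactor is still needed for the uploaded artifact.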

Verification

A standalone smoke test (scripts/test-gsl-pipeline.sh in the AgentLand workspace, not part of this PR) runs the same sed redactor + awk level filter against a synthetic Grafana log with mixed levels, an embedded glsa_* token, a Bearer … header, a ?token=… URL, and a multi-line goroutine stack trace following a database is locked error. All 11 assertions pass:

  • ✅ Tokens redacted to [REDACTED] in the full log
  • ✅ Filter captures warn, error, eror lines (Grafana's truncated spelling included)
  • ✅ Stack-trace continuation lines stay attached to their lvl=error parent
  • ✅ info lines, including the one between two error events, are correctly dropped
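For readers without access to the smoke test, the sed redactor and awk level filter it exercises can be sketched as follows. The patterns, the function names `redact` and `filter_errors_and_warnings`, and the acceptance of `level=` alongside `lvl=` are assumptions for illustration; the action's real defaults may differ.

```shell
#!/usr/bin/env bash
# Illustrative redactor: the patterns approximate the defaults listed in this
# PR and are NOT copied from the action. '#' is used as the sed delimiter so
# that '/' can appear inside bracket expressions.
redact() {
  sed -E \
    -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g' \
    -e 's#gl(sa|c)_[A-Za-z0-9_=-]+#[REDACTED]#g' \
    -e 's#eyJ[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*#[REDACTED]#g' \
    -e 's#([?&](token|api_key)=)[^&[:space:]"]+#\1[REDACTED]#g' \
    -e 's#((password|secret|token)[=:][[:space:]]*"?)[^[:space:]&"]+#\1[REDACTED]#g'
}

# Illustrative level filter: a line carrying a level field switches "keep" on
# or off; a line without one (e.g. a stack frame) inherits the previous state,
# which keeps continuations attached to their error or warning parent.
filter_errors_and_warnings() {
  awk '
    /(^|[ \t])(lvl|level)=[a-z]+/ {
      keep = ($0 ~ /(^|[ \t])(lvl|level)=(warn|error|eror)([ \t]|$)/)
    }
    keep { print }
  '
}
```

Piping a log through `redact | filter_errors_and_warnings` would produce the filtered group, while `redact` alone corresponds to the full artifact. Because the filter is stateful, an info line between two error events resets `keep` and is dropped, as the last assertion above describes.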

Why not bake this into wait-for-grafana?

Single responsibility per action keeps the polling contract narrow. wait-for-grafana takes a URL; this takes a container. Composing them as two steps means consumers can use them independently (e.g. dump logs after a different failure mode, or wait without dumping).

Test plan

  • bash -n syntax check on grafana-startup-logs.sh
  • Standalone pipeline smoke test (scripts/test-gsl-pipeline.sh, 11 assertions)
  • CI on this PR (will exercise the action's lint/schema validation)
  • After merge: smoke test on grafana-adaptivelogs-app by intentionally failing wait-for-grafana to confirm artifact upload + redaction in the real GHA environment

Stacking

Branched off fix/wait-for-grafana-default-startup-timeout (PR #227). Will rebase onto main once #227 merges.

…a failures

Introduces a sibling action that pairs with wait-for-grafana on the failure
path. wait-for-grafana can only observe "Current status: 000" (connection
refused) and cannot distinguish a slow-but-alive Grafana from one that
crashed before binding :3000. This action recovers the missing signal.

On failure, the action:

1. Discovers the Grafana container via a heuristic chain (explicit
   container/service inputs → first compose service matching /grafana/
   → docker ps publish-port lookup → literal "grafana" container).
2. Pulls full container logs to $RUNNER_TEMP and applies a configurable
   regex redaction pipeline (default catches Bearer tokens, glsa_/glc_
   service-account tokens, JWT prefixes, ?token= URL parameters, and
   inline password/secret/token assignments).
3. Prints three workflow log groups:
   - "Grafana errors and warnings (filtered)" — level-aware, multi-line
     so stack-trace continuations after an error line stay attached
   - "Container state" — docker compose ps -a / docker ps -a
   - "Recent Grafana logs (last N lines, all levels, redacted)"
4. Optionally uploads the full redacted logs as an actions artifact
   with a 7-day default retention and surfaces the artifact URL in
   the job summary.

PII trust model is documented in the README. Defaults treat container
logs as semi-sensitive (shorter retention than GitHub's 90-day default,
opt-out artifact upload, regex redaction layered on top of GHA's
secret masker). The masker only protects log streams; redact-patterns
covers the artifact path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolzhabayev moved this from 📬 Triage to 🔬 In review in Grafana Catalog Team on May 28, 2026