feat(grafana-startup-logs): add diagnostic action for wait-for-grafana failures by darrenjaneczek · Pull Request #233 · grafana/plugin-actions

darrenjaneczek · 2026-05-27T18:39:46Z

Summary

Adds a sibling action — grafana-startup-logs — that pairs with wait-for-grafana on the failure path. wait-for-grafana can only observe Current status: 000 (ECONNREFUSED); it can't distinguish "Grafana is still booting" from "Grafana crashed before binding :3000." This action recovers the missing signal.

Stacks on top of #227. That PR lowers wait-for-grafana's startupTimeout default from 300s → 60s precisely because waiting longer at the polling layer doesn't help when the process is dead. The right place for the operator's signal is the diagnostic layer — i.e. this action.

What the action does

On failure (if: failure() && steps.wait.outcome == 'failure'):

1. Discovers the Grafana container. Heuristic chain: explicit container/service inputs → first compose service whose name contains grafana → port-based docker ps --filter publish=<port> → literal grafana container.
2. Pulls full container logs to $RUNNER_TEMP and applies a configurable regex redaction pipeline. Default patterns: Bearer …, glsa_…, glc_…, eyJ… (JWT prefix), ?token=… / ?api_key=… URL params, inline password=… / token: … assignments.
3. Prints three workflow log groups:
   - Grafana errors and warnings (filtered): multi-line aware, so stack frames after a lvl=error line stay attached
   - Container state: docker compose ps -a or docker ps -a
   - Recent Grafana logs: bounded tail (default 500 lines), redacted, all levels
4. Uploads the full redacted logs as an artifact with 7-day retention (vs. GitHub's 90-day default) and writes the artifact URL to the job summary as a markdown link.
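
For reviewers who want to reason about the discovery order, the heuristic chain could look roughly like the sketch below. This is not the script from this PR: the function name, its argument order, and the exact docker invocations are assumptions chosen for illustration.

```shell
# Hypothetical sketch of the discovery chain; not the action's actual script.
# Args (illustrative names): $1 = explicit container, $2 = explicit compose
# service, $3 = published port (assumed default 3000).
discover_grafana_container() {
  local container="${1:-}" service="${2:-}" port="${3:-3000}" found=""

  # 1. An explicit container input always wins.
  if [ -n "$container" ]; then
    printf '%s\n' "$container"
    return 0
  fi

  # 2. Explicit service input, else the first compose service whose name
  #    contains "grafana".
  if [ -z "$service" ]; then
    service="$(docker compose config --services 2>/dev/null | grep -m1 grafana || true)"
  fi
  if [ -n "$service" ]; then
    found="$(docker compose ps -a -q "$service" 2>/dev/null | head -n1 || true)"
    if [ -n "$found" ]; then
      printf '%s\n' "$found"
      return 0
    fi
  fi

  # 3. Any running container that publishes the Grafana port.
  found="$(docker ps --filter "publish=${port}" --format '{{.Names}}' 2>/dev/null | head -n1 || true)"
  if [ -n "$found" ]; then
    printf '%s\n' "$found"
    return 0
  fi

  # 4. Last resort: a container literally named "grafana".
  if docker container inspect grafana >/dev/null 2>&1; then
    printf 'grafana\n'
    return 0
  fi

  return 1
}

# Example call (commented out so the sketch has no side effects):
# target="$(discover_grafana_container "" "" 3000)" && docker logs "$target"
```

Ordering the checks from most to least specific means an explicit input is never overridden by a guess, and the literal-name fallback only fires when nothing else matched.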

Example

- name: Wait for Grafana to start
  id: wait
  uses: grafana/plugin-actions/wait-for-grafana@wait-for-grafana/v1.0.5

- name: Dump Grafana startup logs on failure
  if: ${{ failure() && steps.wait.outcome == 'failure' }}
  uses: grafana/plugin-actions/grafana-startup-logs@grafana-startup-logs/v1.0.0
  with:
    additional-secrets: |
      ${{ env.HL_TOKEN }}
      ${{ env.GRAFANACLOUD_USAGE_TOKEN }}
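
The additional-secrets input in the example is a newline-separated list of literal values. A minimal sketch of how such a list might be handed to GitHub's masker with the ::add-mask:: workflow command is shown here; the function name is hypothetical, and the action's real handling may differ.

```shell
# Hypothetical helper: emit one ::add-mask:: workflow command per non-empty
# line of the newline-separated list passed as $1.
register_additional_secrets() {
  printf '%s\n' "$1" | while IFS= read -r secret; do
    # Blank lines (common in YAML block scalars) are skipped.
    if [ -n "$secret" ]; then
      printf '::add-mask::%s\n' "$secret"
    fi
  done
}

# Example call (commented out; ADDITIONAL_SECRETS is an assumed variable name):
# register_additional_secrets "$ADDITIONAL_SECRETS"
```

Masking only affects the workflow log stream, so this step complements rather than replaces redaction of the uploaded artifact.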

PII trust model

The README has the long version. Summary:

- Artifact retention defaults to 7 days (shorter than GitHub's 90-day default to limit the download window)
- Artifact upload is opt-out (upload-artifact: false)
- Regex redactor strips common token / JWT / URL-param shapes from both inline log groups and the artifact
- additional-secrets registers plugin-specific literal values with ::add-mask:: so GitHub's masker covers them on the log stream
- The redactor is best-effort; treat the artifact as semi-sensitive
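
To make the "regex redactor" concrete, a sed-based filter covering the default pattern shapes listed in this PR might look like the sketch below. The exact expressions are assumptions written for illustration, not the patterns shipped in the action, and like any regex redactor they are best-effort.

```shell
# Hypothetical redaction filter: reads logs on stdin, writes redacted logs to
# stdout. The patterns approximate the default shapes named in this PR.
redact_logs() {
  sed -E \
    -e 's#Bearer [A-Za-z0-9._~+/=-]+#Bearer [REDACTED]#g' \
    -e 's#gl(sa|c)_[A-Za-z0-9_=+/-]+#[REDACTED]#g' \
    -e 's#eyJ[A-Za-z0-9._-]+#[REDACTED]#g' \
    -e 's#([?&](token|api_key)=)[^&[:space:]"]+#\1[REDACTED]#g' \
    -e 's#((password|secret|token)[=:][[:space:]]*)[^&[:space:]"]+#\1[REDACTED]#g'
}

# Example call (commented out; file names are illustrative):
# redact_logs < "$RUNNER_TEMP/grafana-full.log" > "$RUNNER_TEMP/grafana-redacted.log"
```

Running the same filter over both the inline log groups and the artifact file keeps the two outputs consistent, which matters because GitHub's masker does not touch uploaded artifacts.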

This is a net reduction of PII surface compared to common alternatives — operators today either re-run with ACTIONS_STEP_DEBUG=true (which dumps everything, less filtered) or attach tmate sessions, both of which leak more.

Verification

A standalone smoke test (scripts/test-gsl-pipeline.sh in the AgentLand workspace, not part of this PR) runs the same sed redactor + awk level filter against a synthetic Grafana log with mixed levels, an embedded glsa_* token, a Bearer … header, a ?token=… URL, and a multi-line goroutine stack trace following a database is locked error. All 11 assertions pass:

✅ Tokens redacted to [REDACTED] in the full log
✅ Filter captures warn, error, eror lines (Grafana's truncated spelling included)
✅ Stack-trace continuation lines stay attached to their lvl=error parent
✅ info lines, including the one between two error events, are correctly dropped
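
The multi-line behaviour these assertions check can be illustrated with a small awk filter. This is a sketch under assumptions (a logfmt-style lvl= or level= key, and stack frames that carry no level key of their own); the function name is hypothetical and the action's actual awk program may differ.

```shell
# Hypothetical level filter: keeps warn/error/eror lines plus any following
# lines that carry no level key (treated as continuations of the last event).
filter_errors_and_warnings() {
  awk '
    # A line with a level key starts a new event and decides whether to keep it.
    /(lvl|level)=[a-z]+/ {
      keep = ($0 ~ /(lvl|level)=(warn|error|eror)/)
    }
    # Lines without a level key inherit the decision of the previous event.
    keep { print }
  '
}

# Example call (commented out; the file name is illustrative):
# filter_errors_and_warnings < "$RUNNER_TEMP/grafana-redacted.log"
```

Because the keep flag is only recomputed on lines that carry a level key, a goroutine dump following an error stays attached, while an info line between two errors resets the flag and is dropped along with its own continuations.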

Why not bake this into `wait-for-grafana`?

Single responsibility per action keeps the polling contract narrow. wait-for-grafana takes a URL; this takes a container. Composing them as two steps means consumers can use them independently (e.g. dump logs after a different failure mode, or wait without dumping).

Test plan

- bash -n syntax check on grafana-startup-logs.sh
- Standalone pipeline smoke test (scripts/test-gsl-pipeline.sh, 11 assertions)
- CI on this PR (will exercise the action's lint/schema validation)
- After merge: smoke test on grafana-adaptivelogs-app by intentionally failing wait-for-grafana to confirm artifact upload + redaction in the real GHA environment

Stacking

Branched off fix/wait-for-grafana-default-startup-timeout (PR #227). Will rebase onto main once #227 merges.

…a failures

Introduces a sibling action that pairs with wait-for-grafana on the failure path. wait-for-grafana can only observe "Current status: 000" (connection refused) and cannot distinguish a slow-but-alive Grafana from one that crashed before binding :3000. This action recovers the missing signal.

On failure, the action:

1. Discovers the Grafana container via a heuristic chain (explicit container/service inputs → first compose service matching /grafana/ → docker ps publish-port lookup → literal "grafana" container).
2. Pulls full container logs to $RUNNER_TEMP and applies a configurable regex redaction pipeline (default catches Bearer tokens, glsa_/glc_ service-account tokens, JWT prefixes, ?token= URL parameters, and inline password/secret/token assignments).
3. Prints three workflow log groups:
   - "Grafana errors and warnings (filtered)": level-aware, multi-line so stack-trace continuations after an error line stay attached
   - "Container state": docker compose ps -a / docker ps -a
   - "Recent Grafana logs (last N lines, all levels, redacted)"
4. Optionally uploads the full redacted logs as an actions artifact with a 7-day default retention and surfaces the artifact URL in the job summary.

PII trust model is documented in the README. Defaults treat container logs as semi-sensitive (shorter retention than GitHub's 90-day default, opt-out artifact upload, regex redaction layered on top of GHA's secret masker). The masker only protects log streams; redact-patterns covers the artifact path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-project-automation Bot added this to Grafana Catalog Team May 27, 2026

github-project-automation Bot moved this to 📬 Triage in Grafana Catalog Team May 27, 2026

tolzhabayev moved this from 📬 Triage to 🔬 In review in Grafana Catalog Team May 28, 2026

darrenjaneczek wants to merge 1 commit into fix/wait-for-grafana-default-startup-timeout from feat/grafana-startup-logs
