feat(grafana-startup-logs): add diagnostic action for wait-for-grafana failures #233
Draft
darrenjaneczek wants to merge 1 commit into
Introduces a sibling action that pairs with wait-for-grafana on the failure
path. wait-for-grafana can only observe "Current status: 000" (connection
refused) and cannot distinguish a slow-but-alive Grafana from one that
crashed before binding :3000. This action recovers the missing signal.
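For illustration, 000 is what curl's %{http_code} write-out prints when no HTTP response arrived at all, so a status-code poller has nothing else to go on. The sketch below assumes a curl-based poll, which the source does not confirm; probe_status and describe_status are hypothetical names, not part of wait-for-grafana.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not taken from wait-for-grafana.

# Print the HTTP status for a URL, or 000 when no HTTP response was received.
probe_status() {
  local url="$1" code
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  # On a failed connection curl still prints 000 but exits non-zero.
  code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url" 2>/dev/null)" || true
  printf '%s\n' "${code:-000}"
}

# Map a status code to what a poller can actually conclude from it.
describe_status() {
  case "$1" in
    000) echo "no HTTP response: still booting, crashed, or unreachable" ;;
    2??) echo "healthy" ;;
    *)   echo "responding but not ready" ;;
  esac
}
```

The 000 branch is the ambiguous one: a slow start and a dead process both land there, and only the container's own logs can separate them.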
On failure, the action:
1. Discovers the Grafana container via a heuristic chain (explicit
container/service inputs → first compose service matching /grafana/
→ docker ps publish-port lookup → literal "grafana" container).
2. Pulls full container logs to $RUNNER_TEMP and applies a configurable
regex redaction pipeline (default catches Bearer tokens, glsa_/glc_
service-account tokens, JWT prefixes, ?token= URL parameters, and
inline password/secret/token assignments).
3. Prints three workflow log groups:
- "Grafana errors and warnings (filtered)" — level-aware, multi-line
so stack-trace continuations after an error line stay attached
- "Container state" — docker compose ps -a / docker ps -a
- "Recent Grafana logs (last N lines, all levels, redacted)"
4. Optionally uploads the full redacted logs as an actions artifact
with a 7-day default retention and surfaces the artifact URL in
the job summary.
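The level-aware, multi-line filtering in step 3 could be sketched as below. This is an illustration, not the action's actual script: filter_grafana_levels is a hypothetical name, and the sketch assumes logfmt-style lines carrying a lvl= or level= key, with any line lacking such a key treated as a continuation of the event before it.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a level-aware, multi-line log filter.
# Reads log lines on stdin and prints warn/error events plus their continuations.
filter_grafana_levels() {
  awk '
    # A line with a level key starts a new log event; decide whether to keep it.
    /(^|[ \t])(lvl|level)=/ {
      keep = ($0 ~ /(^|[ \t])(lvl|level)=(warn|error|eror)([ \t]|$)/)
    }
    # Lines without a level key (e.g. stack-trace frames) inherit the decision
    # made for the event they follow.
    keep { print }
  '
}
```

It would typically sit at the end of a pipe, e.g. `docker logs "$container" 2>&1 | filter_grafana_levels`. Carrying the keep flag across lines is what drops an info line between two error events while keeping each stack trace attached to its error.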
PII trust model is documented in the README. Defaults treat container
logs as semi-sensitive (shorter retention than GitHub's 90-day default,
opt-out artifact upload, regex redaction layered on top of GHA's
secret masker). The masker only protects log streams; redact-patterns
covers the artifact path.
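A minimal sketch of a regex redaction pass of this kind, applied to the log text itself so the uploaded file is covered as well as the console output. The patterns only approximate the defaults listed above and are not the action's actual redact-patterns value; redact_stream is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a sed-based redaction pass (stdin -> stdout).
# The patterns approximate the defaults described above; they are assumptions.
redact_stream() {
  sed -E \
    -e 's/(Bearer[[:space:]]+)[^[:space:]"]+/\1[REDACTED]/g' \
    -e 's/gl(sa|c)_[A-Za-z0-9_]+/[REDACTED]/g' \
    -e 's/eyJ[A-Za-z0-9_-]+(\.[A-Za-z0-9_-]+)*/[REDACTED]/g' \
    -e 's/([?&](token|api_key)=)[^&[:space:]"]+/\1[REDACTED]/g' \
    -e 's/((password|secret|token)[[:space:]]*[=:][[:space:]]*"?)[^[:space:]"&]+/\1[REDACTED]/g'
}
```

A caller would redirect the raw log file into redact_stream and write the result to a second file before printing or uploading anything. Each rule keeps the key or prefix and replaces only the value, so the redacted log still shows where a credential appeared.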
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a sibling action, grafana-startup-logs, that pairs with wait-for-grafana on the failure path. wait-for-grafana can only observe Current status: 000 (ECONNREFUSED); it can't distinguish "Grafana is still booting" from "Grafana crashed before binding :3000." This action recovers the missing signal.
Stacks on top of #227. That PR lowers wait-for-grafana's startupTimeout default from 300s to 60s precisely because waiting longer at the polling layer doesn't help when the process is dead. The right place for the operator's signal is the diagnostic layer, i.e. this action.
What the action does
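On failure (if: failure() && steps.wait.outcome == 'failure'):
1. Discovers the Grafana container via a heuristic chain: explicit container/service inputs → first compose service whose name contains grafana → port-based docker ps --filter publish=<port> → literal grafana container.
2. Pulls full container logs to $RUNNER_TEMP and applies a configurable regex redaction pipeline. Default patterns: Bearer …, glsa_…, glc_…, eyJ… (JWT prefix), ?token=… / ?api_key=… URL params, inline password=… / token: … assignments.
3. Prints workflow log groups, including:
   - Grafana errors and warnings (filtered): level-aware and multi-line, so stack-trace continuations after a lvl=error line stay attached
   - Container state: docker compose ps -a or docker ps -a
Example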
PII trust model
The README has the long version. Summary:
- Artifact upload is opt-out (upload-artifact: false)
- additional-secrets registers plugin-specific literal values with ::add-mask:: so GitHub's masker covers them on the log stream
This is a net reduction of PII surface compared to common alternatives: operators today either re-run with ACTIONS_STEP_DEBUG=true (which dumps everything, less filtered) or attach tmate sessions, both of which leak more.
Verification
A standalone smoke test (scripts/test-gsl-pipeline.sh in the AgentLand workspace, not part of this PR) runs the same sed redactor + awk level filter against a synthetic Grafana log with mixed levels, an embedded glsa_* token, a Bearer … header, a ?token=… URL, and a multi-line goroutine stack trace following a database is locked error. All 11 assertions pass:
- Secrets are replaced with [REDACTED] in the full log
- The level filter keeps warn, error, and eror lines (Grafana's truncated spelling included)
- Stack-trace continuation lines stay attached to their lvl=error parent
- info lines, including the one between two error events, are correctly dropped
Why not bake this into wait-for-grafana?
Single responsibility per action keeps the polling contract narrow. wait-for-grafana takes a URL; this takes a container. Composing them as two steps means consumers can use them independently (e.g. dump logs after a different failure mode, or wait without dumping).
Test plan
- bash -n syntax check on grafana-startup-logs.sh
- Pipeline smoke test (scripts/test-gsl-pipeline.sh, 11 assertions)
- End-to-end run in grafana-adaptivelogs-app by intentionally failing wait-for-grafana to confirm artifact upload + redaction in the real GHA environment
Stacking
Branched off fix/wait-for-grafana-default-startup-timeout (PR #227). Will rebase onto main once #227 merges.