
fix(wait-for-grafana): lower startupTimeout default from 300s to 60s #227

Open
darrenjaneczek wants to merge 1 commit into main from fix/wait-for-grafana-default-startup-timeout

Conversation

Contributor

@darrenjaneczek darrenjaneczek commented May 26, 2026

Summary

Lowers the startupTimeout default in wait-for-grafana from 300 seconds back to 60, restoring v1.0.2's fail-fast behaviour for startup crash scenarios while keeping the two-phase split available for repos with a genuine slow-start need.

The two-phase polling introduced in v1.0.3 (#213) was based on the hypothesis that Playwright matrix flakes are slow-but-alive Grafana startups on contested runners. That diagnosis was incomplete. The class of failure that's actually been hitting plugin repos in the matrix is a Grafana startup crash — the SQLite write lock is held by legacy provisioning while the Advisor's checktyperegisterer runner tries to POST to the unified resource server, causing a self-deadlock that prevents the HTTP listener from ever binding.

In that scenario the v1.0.3 default makes the failure 5× slower to diagnose than v1.0.2 was: the action waits 5 minutes observing only Current status: 000 before timing out, during which the operator has no signal at all. wait-for-grafana is structurally unable to distinguish "still booting" from "dead before booting" — both produce the same ECONNREFUSED.

Why not revert the two-phase logic entirely?

The Phase 1 / Phase 2 split (TCP-bind vs health endpoint) is still useful in the slow-but-alive case it was designed for; only the default was too generous. Repos that hit genuine slow-start situations can still opt into a higher value explicitly:

- uses: grafana/plugin-actions/wait-for-grafana@wait-for-grafana/v1.0.5
  with:
    startupTimeout: 180  # explicit opt-in for known-slow environments
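
The Phase 1 / Phase 2 split described above can be illustrated with a minimal Python sketch. This is not the action's actual script, which this PR does not show: the function names, the `health_timeout` parameter, and the polling interval are hypothetical, and `/api/health` is assumed to be the health endpoint being polled. Only `startup_timeout` corresponds to the `startupTimeout` input, with the 60-second default proposed here.

```python
import http.client
import socket
import time
import urllib.error
import urllib.request


def wait_for_port(host, port, startup_timeout, interval=1.0):
    """Phase 1: return True once a TCP connection succeeds, False on timeout."""
    deadline = time.monotonic() + startup_timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # A refused connection looks the same whether the process is
            # still booting or already dead, so all we can do is retry.
            time.sleep(interval)
    return False


def wait_for_health(url, health_timeout, interval=1.0):
    """Phase 2: return True once the health endpoint answers HTTP 200."""
    deadline = time.monotonic() + health_timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # listener is bound but not healthy yet
        time.sleep(interval)
    return False


def wait_for_grafana(host, port, startup_timeout=60, health_timeout=60):
    """Run both phases and report which one, if any, timed out."""
    if not wait_for_port(host, port, startup_timeout):
        return "startup-timeout"
    url = f"http://{host}:{port}/api/health"  # assumed health endpoint
    if not wait_for_health(url, health_timeout):
        return "health-timeout"
    return "ready"
```

In this sketch a startup crash and a slow boot both end in `"startup-timeout"`, so a smaller `startup_timeout` only shortens the wait before failing; it does not add any diagnostic signal.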

Follow-up

A separate PR will propose a sibling grafana-startup-logs action that dumps docker compose ps + a filtered, secret-masked tail of Grafana's own logs when wait-for-grafana fails. That is the right place to surface the real signal — at the diagnostic layer, not by waiting longer at the polling layer.
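
As a rough illustration of what a "filtered, secret-masked tail" could mean, here is a minimal Python sketch. The sibling action does not exist yet, so everything here is an assumption: the function names, the redaction patterns, the default of 50 lines, and the choice to filter on logfmt-style `level=error` / `level=warn` fields are all hypothetical.

```python
import re

# Hypothetical redaction patterns; a real action would likely make these configurable.
KEY_VALUE_SECRET = re.compile(
    r"(?i)\b(password|secret|token|api[_-]?key)\b(\s*[=:]\s*)(\S+)"
)
BEARER_TOKEN = re.compile(r"(?i)(Bearer\s+)([A-Za-z0-9._~+/=-]+)")


def mask_secrets(line):
    """Replace values that look like credentials with ***."""
    line = KEY_VALUE_SECRET.sub(r"\1\2***", line)
    line = BEARER_TOKEN.sub(r"\1***", line)
    return line


def filtered_tail(log_text, max_lines=50, levels=("error", "warn")):
    """Keep only lines at the given levels, take the last max_lines, mask secrets."""
    kept = [
        line
        for line in log_text.splitlines()
        if any(f"level={level}" in line for level in levels)
    ]
    return [mask_secrets(line) for line in kept[-max_lines:]]
```

Filtering before taking the tail keeps the output focused on the lines most likely to explain a crash, rather than on whatever the process logged last.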

Test plan

  • action.yml, script, and README all updated consistently
  • Rebased onto current main (post-1.0.4 release) with no functional conflicts
  • CI: the existing wait-for-grafana matrix test (if any) still passes with the new default
  • Consumers that previously relied on the implicit 300s window can override explicitly via startupTimeout: input

@tolzhabayev tolzhabayev moved this from 📬 Triage to 🧑‍💻 In development in Grafana Catalog Team May 26, 2026
The two-phase polling introduced in v1.0.3 was based on the hypothesis
that Grafana startups on contested CI runners are slow-but-alive. Recent
evidence (see grafana/grafana#122993, mitigated in grafana/grafana#123034
and grafana/logs-drilldown#1886) shows that an important class of
Playwright matrix failures is actually a *crash* during provisioning —
specifically, a SQLITE_BUSY self-deadlock between the Grafana Advisor
checktyperegisterer and legacy provisioning, which prevents the HTTP
listener from ever binding.

In that scenario the v1.0.3 default of 300 seconds makes the failure
5x slower to diagnose than v1.0.2 was: the action waits 5 minutes
observing only "Current status: 000" before timing out, during which
the operator has no signal at all. wait-for-grafana cannot distinguish
"still booting" from "dead before booting" — both produce the same
ECONNREFUSED.

Lowering the default to 60 seconds restores the v1.0.2 fail-fast
behaviour for crash scenarios while keeping the two-phase split (Phase 1
TCP-bind vs Phase 2 health endpoint) available for repos that have a
genuine slow-start need and want to opt into a higher value explicitly.

A follow-up will propose a separate `grafana-startup-logs` sibling
action that dumps `docker compose ps` and a filtered tail of Grafana's
own logs (with secret re-masking and configurable redaction) when
wait-for-grafana fails — that is the right place to surface the real
signal, rather than waiting longer at this layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@darrenjaneczek darrenjaneczek force-pushed the fix/wait-for-grafana-default-startup-timeout branch from a2f682c to 368e270 Compare May 27, 2026 18:19
@darrenjaneczek darrenjaneczek marked this pull request as ready for review May 27, 2026 18:30
@darrenjaneczek darrenjaneczek requested a review from a team as a code owner May 27, 2026 18:30
