fix(wait-for-grafana): lower startupTimeout default from 300s to 60s by darrenjaneczek · Pull Request #227 · grafana/plugin-actions

darrenjaneczek · 2026-05-26T12:30:13Z

Summary

Lowers the startupTimeout default in wait-for-grafana from 300 seconds back to 60, restoring v1.0.2's fail-fast behaviour for startup crash scenarios while keeping the two-phase split available for repos with a genuine slow-start need.

The two-phase polling introduced in v1.0.3 (#213) was based on the hypothesis that Playwright matrix flakes are slow-but-alive Grafana startups on contested runners. That diagnosis was incomplete. The class of failure that's actually been hitting plugin repos in the matrix is a Grafana startup crash — the SQLite write lock is held by legacy provisioning while the Advisor's checktyperegisterer runner tries to POST to the unified resource server, causing a self-deadlock that prevents the HTTP listener from ever binding.

Upstream Grafana bug: grafana/grafana#122993
Upstream fix (lands in 13.0.2): grafana/grafana#123034
Workaround applied in the matrix: grafana/logs-drilldown#1886, grafana/grafana-adaptivelogs-app#1534

In that scenario the v1.0.3 default makes the failure 5× slower to diagnose than v1.0.2 was: the action waits 5 minutes observing only Current status: 000 before timing out, during which the operator has no signal at all. wait-for-grafana is structurally unable to distinguish "still booting" from "dead before booting" — both produce the same ECONNREFUSED.
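
The ambiguity can be illustrated with a bare TCP probe. This is a minimal Python sketch, not the action's own script (which is not shown in this PR); `tcp_probe` is a hypothetical helper name. A connection attempt to a port with no listener is refused in the same way whether the server has not bound yet or never will.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> str:
    """Return "listening" if a TCP connect succeeds, else "refused" or "unreachable".

    Hypothetical helper for illustration; not part of wait-for-grafana.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"
    except ConnectionRefusedError:
        # Same outcome for "still booting" and "crashed before binding":
        # in both cases nothing is listening on the port.
        return "refused"
    except OSError:
        # Timeouts, unreachable hosts, name-resolution failures, and so on.
        return "unreachable"
```

A caller that only sees `"refused"` has no basis for choosing between waiting longer and giving up, which is why a long default mostly delays the failure report in the crash case.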

Why not revert the two-phase logic entirely?

The Phase 1 / Phase 2 split (TCP-bind vs health endpoint) is still useful in the slow-but-alive case it was designed for; only the default was too generous. Repos that hit genuine slow-start situations can still opt into a higher value explicitly:

- uses: grafana/plugin-actions/wait-for-grafana@wait-for-grafana/v1.0.5
  with:
    startupTimeout: 180  # explicit opt-in for known-slow environments
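
For readers unfamiliar with the split, the two phases can be sketched roughly as follows. This is an illustrative Python model under stated assumptions, not the action's implementation: `wait_two_phase`, `health_timeout`, `interval`, and the injected `port_open` / `healthy` probes are hypothetical names, and only the 60 and 300 second startup values come from this PR.

```python
import time
from typing import Callable


def wait_two_phase(
    port_open: Callable[[], bool],
    healthy: Callable[[], bool],
    startup_timeout: float = 60.0,   # default proposed by this PR
    health_timeout: float = 60.0,    # assumed value, not taken from the PR
    interval: float = 1.0,           # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll in two phases and report which phase, if any, timed out."""
    # Phase 1: wait for the TCP listener to bind.
    deadline = clock() + startup_timeout
    while not port_open():
        if clock() >= deadline:
            return "startup-timeout"
        sleep(interval)

    # Phase 2: the port is open; wait for the health endpoint to report OK.
    deadline = clock() + health_timeout
    while not healthy():
        if clock() >= deadline:
            return "health-timeout"
        sleep(interval)
    return "ready"


def simulated_time_to_fail(startup_timeout: float) -> float:
    """Simulate a Grafana that crashed before binding; return seconds waited."""
    now = [0.0]  # fake clock so the simulation does not really sleep

    def clock() -> float:
        return now[0]

    def sleep(seconds: float) -> None:
        now[0] += seconds

    result = wait_two_phase(
        port_open=lambda: False,  # the listener never binds
        healthy=lambda: False,    # never reached in this scenario
        startup_timeout=startup_timeout,
        clock=clock,
        sleep=sleep,
    )
    assert result == "startup-timeout"
    return now[0]
```

Under these assumptions, `simulated_time_to_fail(60)` returns `60.0` and `simulated_time_to_fail(300)` returns `300.0`, which is the 5× gap in time-to-diagnosis for the crash case. A slow-but-alive startup is unaffected as long as the listener binds within the Phase 1 window.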

Follow-up

A separate PR will propose a sibling grafana-startup-logs action that dumps docker compose ps + a filtered, secret-masked tail of Grafana's own logs when wait-for-grafana fails. That is the right place to surface the real signal — at the diagnostic layer, not by waiting longer at the polling layer.
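
The "filtered, secret-masked tail" idea can be sketched in a few lines of Python. This is only an assumption about the general shape, not the design of the proposed action; `masked_tail`, its parameters, and the `***` placeholder are hypothetical.

```python
from typing import Callable, Iterable, List


def masked_tail(
    lines: Iterable[str],
    secrets: Iterable[str],
    keep: int = 50,
    keep_line: Callable[[str], bool] = lambda line: True,
) -> List[str]:
    """Filter log lines, keep the last `keep` of them, and mask known secrets."""
    # Ignore empty secret values: replacing "" would insert the mask everywhere.
    secret_values = [s for s in secrets if s]
    filtered = [line for line in lines if keep_line(line)]
    tail = filtered[-keep:] if keep > 0 else []

    masked = []
    for line in tail:
        for secret in secret_values:
            line = line.replace(secret, "***")
        masked.append(line)
    return masked
```

Filtering before taking the tail keeps the output focused on the lines that matter (for example, error-level entries) instead of whatever happened to be logged last.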

Test plan

action.yml, script, and README all updated consistently
Rebased onto current main (post-1.0.4 release) with no functional conflicts
CI: the existing wait-for-grafana matrix test (if any) still passes with the new default
Consumers that previously relied on the implicit 300s window can override explicitly via startupTimeout: input

The two-phase polling introduced in v1.0.3 was based on the hypothesis that Grafana startups on contested CI runners are slow-but-alive. Recent evidence (see grafana/grafana#122993, mitigated in grafana/grafana#123034 and grafana/logs-drilldown#1886) shows that an important class of Playwright matrix failures is actually a *crash* during provisioning: specifically, a SQLITE_BUSY self-deadlock between the Grafana Advisor checktyperegisterer and legacy provisioning, which prevents the HTTP listener from ever binding.

In that scenario the v1.0.3 default of 300 seconds makes the failure 5x slower to diagnose than v1.0.2 was: the action waits 5 minutes observing only "Current status: 000" before timing out, during which the operator has no signal at all. wait-for-grafana cannot distinguish "still booting" from "dead before booting": both produce the same ECONNREFUSED.

Lowering the default to 60 seconds restores the v1.0.2 fail-fast behaviour for crash scenarios while keeping the two-phase split (Phase 1 TCP-bind vs Phase 2 health endpoint) available for repos that have a genuine slow-start need and want to opt into a higher value explicitly.

A follow-up will propose a separate `grafana-startup-logs` sibling action that dumps `docker compose ps` and a filtered tail of Grafana's own logs (with secret re-masking and configurable redaction) when wait-for-grafana fails. That is the right place to surface the real signal, rather than waiting longer at this layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-project-automation Bot added this to Grafana Catalog Team May 26, 2026

github-project-automation Bot moved this to 📬 Triage in Grafana Catalog Team May 26, 2026

tolzhabayev moved this from 📬 Triage to 🧑‍💻 In development in Grafana Catalog Team May 26, 2026

darrenjaneczek force-pushed the fix/wait-for-grafana-default-startup-timeout branch from a2f682c to 368e270 Compare May 27, 2026 18:19

darrenjaneczek marked this pull request as ready for review May 27, 2026 18:30

darrenjaneczek requested a review from a team as a code owner May 27, 2026 18:30

darrenjaneczek requested review from toddtreece, wbrowne and xnyo May 27, 2026 18:30

darrenjaneczek mentioned this pull request May 27, 2026

feat(grafana-startup-logs): add diagnostic action for wait-for-grafana failures #233

Draft

4 tasks

darrenjaneczek wants to merge 1 commit into main from fix/wait-for-grafana-default-startup-timeout
