fix(wait-for-grafana): lower startupTimeout default from 300s to 60s#227
Open
darrenjaneczek wants to merge 1 commit into
The two-phase polling introduced in v1.0.3 was based on the hypothesis that Grafana startups on contested CI runners are slow-but-alive. Recent evidence (see grafana/grafana#122993, mitigated in grafana/grafana#123034 and grafana/logs-drilldown#1886) shows that an important class of Playwright matrix failures is actually a *crash* during provisioning: specifically, a SQLITE_BUSY self-deadlock between the Grafana Advisor checktyperegisterer and legacy provisioning, which prevents the HTTP listener from ever binding.

In that scenario the v1.0.3 default of 300 seconds makes the failure 5x slower to diagnose than v1.0.2 was: the action waits 5 minutes observing only "Current status: 000" before timing out, during which the operator has no signal at all. wait-for-grafana cannot distinguish "still booting" from "dead before booting"; both produce the same ECONNREFUSED.

Lowering the default to 60 seconds restores the v1.0.2 fail-fast behaviour for crash scenarios while keeping the two-phase split (Phase 1 TCP-bind vs Phase 2 health endpoint) available for repos that have a genuine slow-start need and want to opt into a higher value explicitly.

A follow-up will propose a separate `grafana-startup-logs` sibling action that dumps `docker compose ps` and a filtered tail of Grafana's own logs (with secret re-masking and configurable redaction) when wait-for-grafana fails. That is the right place to surface the real signal, rather than waiting longer at this layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a2f682c to 368e270
Summary

Lowers the `startupTimeout` default in `wait-for-grafana` from 300 seconds back to 60, restoring v1.0.2's fail-fast behaviour for startup crash scenarios while keeping the two-phase split available for repos with a genuine slow-start need.

The two-phase polling introduced in v1.0.3 (#213) was based on the hypothesis that Playwright matrix flakes are slow-but-alive Grafana startups on contested runners. That diagnosis was incomplete. The class of failure that has actually been hitting plugin repos in the matrix is a Grafana startup crash: the SQLite write lock is held by legacy provisioning while the Advisor's `checktyperegisterer` runner tries to POST to the unified resource server, causing a self-deadlock that prevents the HTTP listener from ever binding.

In that scenario the v1.0.3 default makes the failure 5× slower to diagnose than v1.0.2 was: the action waits 5 minutes observing only `Current status: 000` before timing out, during which the operator has no signal at all. `wait-for-grafana` is structurally unable to distinguish "still booting" from "dead before booting"; both produce the same ECONNREFUSED.

Why not revert the two-phase logic entirely?

The Phase 1 / Phase 2 split (TCP-bind vs health endpoint) is still useful in the slow-but-alive case it was designed for; only the default was too generous. Repos that hit genuine slow-start situations can still opt into a higher value explicitly by setting the `startupTimeout` input.
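As a rough illustration of that opt-in, a workflow step might look like the sketch below. Only the `startupTimeout` input name comes from this PR. The `uses:` path, the `<ref>` placeholder, and the assumption that the value is a number of seconds (suggested by the 300s/60s defaults) are not confirmed here and should be checked against the action's own README.

```yaml
steps:
  - name: Wait for Grafana to start
    # Assumed action path and placeholder ref; substitute the ones your workflow already uses.
    uses: grafana/plugin-actions/wait-for-grafana@<ref>
    with:
      # Opt into a longer startup wait than the new 60-second default
      # (value assumed to be in seconds).
      startupTimeout: 300
```

Repos that do not set the input would pick up the lower default and fail fast when Grafana never binds its listener.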
Follow-up

A separate PR will propose a sibling `grafana-startup-logs` action that dumps `docker compose ps` plus a filtered, secret-masked tail of Grafana's own logs when `wait-for-grafana` fails. That is the right place to surface the real signal: at the diagnostic layer, not by waiting longer at the polling layer.

Test plan

- `main` (post-1.0.4 release) with no functional conflicts
- `startupTimeout:` input