fix+test: [alb] surface healthy_floor misconfigurations and degraded dispatch #1012
crandles wants to merge 8 commits into
Conversation
5ecba7e to ffabe11
Down to what I believe is an unrelated failure in `TestDeltaProxyCacheRequestRangeMissChunks`. Going to take another look at that on a new branch/PR; I believe we have a timing flaw.
ffabe11 to b8dea17
Signed-off-by: Chris Randles <randles.chris@gmail.com>
Signed-off-by: Chris Randles <randles.chris@gmail.com>
Signed-off-by: Chris Randles <randles.chris@gmail.com>
…lclock gate Signed-off-by: Chris Randles <randles.chris@gmail.com>
b8dea17 to 4c61fd5
… origin counter Signed-off-by: Chris Randles <randles.chris@gmail.com>
Signed-off-by: Chris Randles <randles.chris@gmail.com>
…ling gauge Signed-off-by: Chris Randles <randles.chris@gmail.com>
Having second thoughts on the auto-probe feature; is no probe a valid use case? Leaning towards reverting it, or at least making it skippable. If I revert it, I think we might then want an additional warning around the healthy floor config (if you have no configured health check, you'll want to use 0, I think).
LGTM
Follow-up: a secondary silent-fallback case that survives this fix
Confirmed once more that this PR resolves the 502 from #1015. Thanks @crandles. While validating end-to-end, I hit a second failure mode that this PR's auto-probe alone does not surface, and I wanted to share it in case it's worth a small additional change or a docs note.
Symptom
After deploying the A direct call to the same pool member via path-routing (
Root cause
For a pool of
So this PR's
Reproducer (minimal)
backends:
  alb:
    provider: alb
    alb:
      mechanism: tsm
      healthy_floor: 1
      pool: [backend-a, backend-b]
  backend-a:
    provider: prometheus
    origin_url: 'https://example.com/prom-a' # answers query=up normally
  backend-b:
    provider: prometheus
    origin_url: 'https://example.com/prom-b' # rejects query=up with 400
After startup,
Workaround (config-only, no patch)
Override the auto-probe query on each pool member so the probe stays inside whatever the backend will actually accept:
backend-b:
  provider: prometheus
  origin_url: 'https://example.com/prom-b'
  healthcheck:
    query: 'query=vector(1)' # trivially valid PromQL, no series scan
    interval: 5s
    recovery_threshold: 1
    failure_threshold: 3
With this, the probe transitions
Possible follow-ups (for maintainer judgement)
This isn't a regression introduced by this PR; it's an existing edge of
Happy to open a separate issue for any of these and/or send a PR if helpful.
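For anyone following along, here is a minimal Go sketch of how a probe result and `healthy_floor` interact when a pool is filtered. It is not Trickster's actual code: the `Status` values (-1, 0, 1), the `Member` type, and the `admit` function are all assumptions made for illustration, based only on the behavior described in this thread.

```go
package main

import "fmt"

// Status models the three probe outcomes discussed in this thread.
// The numeric values are an assumption made for this sketch.
type Status int

const (
	Unavailable Status = -1 // probe ran and failed
	Unchecked   Status = 0  // no health check configured, or no result yet
	Available   Status = 1  // probe ran and passed
)

// Member is a hypothetical pool member.
type Member struct {
	Name   string
	Status Status
}

// admit returns the names of members whose status is at or above the floor.
func admit(pool []Member, floor int) []string {
	var out []string
	for _, m := range pool {
		if int(m.Status) >= floor {
			out = append(out, m.Name)
		}
	}
	return out
}

func main() {
	// backend-b's default probe was rejected, so it is marked unavailable.
	pool := []Member{
		{"backend-a", Available},
		{"backend-b", Unavailable},
	}
	for _, floor := range []int{-1, 0, 1} {
		fmt.Printf("floor=%d admitted=%v\n", floor, admit(pool, floor))
	}
}
```

Under these assumptions, a member whose probe is rejected drops out of the pool at floor 0 or 1 without any request-level error, which is why the case is easy to miss.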
Good info. At the moment we favor standard Prometheus over the alternative engines, so perhaps that is adequate for the current revision, but it's worth a documentation callout.
Per #1012 (comment), I'm still not entirely sold on the auto-probe feature. The alternative would be stricter config validations and warnings. For example, if you supply a healthy floor value != 0, then you must have supplied an appropriate health check, else Trickster should fail to start, OR it will start but heavily warn and ignore the invalid `healthy_floor` value. I would defer to operators on an appropriate health check rather than baking in a default one (that can be wrong, or flawed).
(The feedback about the proposed health check not working across different engines makes me think this is the better short-term option. With true support for different Prometheus engines, we'd need some way to inspect their differences, or have this defined per config, or at least some integration test to verify a default query actually works. While
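A rough Go sketch of the two validation options floated above (fail to start, or warn and ignore the floor). Everything here is hypothetical: `PoolConfig`, `Unprobed`, `validateFloor`, and the `strict` switch are names invented for this illustration and do not come from the Trickster codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// PoolConfig is a hypothetical, simplified view of an ALB pool's settings.
type PoolConfig struct {
	HealthyFloor int
	Unprobed     []string // pool members with no health check configured
}

// errFloorNeedsProbe is returned in strict mode for an unsatisfiable floor.
var errFloorNeedsProbe = errors.New("healthy_floor != 0 requires a health check on every pool member")

// validateFloor returns the floor to use at runtime.
// strict=true models "fail to start"; strict=false models
// "start, warn loudly, and ignore the configured floor".
func validateFloor(c PoolConfig, strict bool) (floor int, warning string, err error) {
	if c.HealthyFloor == 0 || len(c.Unprobed) == 0 {
		return c.HealthyFloor, "", nil // nothing to complain about
	}
	if strict {
		return 0, "", errFloorNeedsProbe
	}
	warning = fmt.Sprintf("healthy_floor %d ignored: members without a health check: %v",
		c.HealthyFloor, c.Unprobed)
	return 0, warning, nil
}

func main() {
	cfg := PoolConfig{HealthyFloor: 1, Unprobed: []string{"backend-b"}}

	if _, _, err := validateFloor(cfg, true); err != nil {
		fmt.Println("strict:", err)
	}
	floor, warning, _ := validateFloor(cfg, false)
	fmt.Println("lenient: effective floor =", floor)
	fmt.Println("lenient: warning =", warning)
}
```

Either mode makes the misconfiguration visible at startup; the lenient path trades a hard failure for a floor the operator did not ask for, so the warning needs to be loud.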
…dispatch Signed-off-by: Chris Randles <randles.chris@gmail.com>
Description
Re: #982 and #1015. Both are `healthy_floor` misconfigurations that fail silently. This surfaces them rather than guessing a default probe.
- `healthy_floor < 0` admits members the probe has confirmed `unavailable` (Unavailable member failed healthcheck still being queried and end up producing 502 response #982). Startup warning + `trickster_alb_pool_admits_failing` gauge.
- `healthy_floor: 1` with members that have no health check can never fill the pool, so every request 502s ([ALB/TSM] All requests return 502 Bad Gateway - single + multi-backend pool, even from main branch #1015). Trickster resets the effective floor to 0, warns, and sets `trickster_alb_pool_floor_reset`.
- `warnings` entry to the response and logs once per healthy->degraded transition.
- `docs/alb.md` covers `recovery_threshold` as the cold-start knob rather than a negative floor; `docs/health.md` notes the prometheus `query=up` default-probe caveat for multi-tenant gateways.
Type of Change
AI Disclosure