fix+test: [alb] surface healthy_floor misconfigurations and degraded dispatch by crandles · Pull Request #1012 · trickstercache/trickster

crandles · 2026-05-27T14:08:40Z

Description

Re: #982 and #1015 -- both are healthy_floor misconfigurations that fail silently. This surfaces them rather than guessing a default probe.

healthy_floor < 0 admits members the probe has confirmed unavailable (Unavailable member failed healthcheck still being queried and end up producing 502 response #982). Startup warning + trickster_alb_pool_admits_failing gauge.
healthy_floor: 1 with members that have no health check can never fill the pool, so every request 502s ([ALB/TSM] All requests return 502 Bad Gateway — single + multi-backend pool, even from main branch #1015). Trickster resets the effective floor to 0, warns, and sets trickster_alb_pool_floor_reset.
A multi-member TSM pool serving from a single live member silently returned only that member's data. It now adds a warnings entry to the response and logs once per healthy->degraded transition.

docs/alb.md covers recovery_threshold as the cold-start knob rather than a negative floor; docs/health.md notes the prometheus query=up default-probe caveat for multi-tenant gateways.

Type of Change

- Bug fix
- Test coverage
- Documentation

AI Disclosure

- This contribution DOES include AI-generated changes, and I have reviewed the relevant contributing guidelines.

crandles · 2026-05-27T14:35:32Z

down to what I believe is an unrelated failure in TestDeltaProxyCacheRequestRangeMissChunks -- going to take a look at that again on a new branch/PR. Believe we have a timing flaw.

Signed-off-by: Chris Randles <randles.chris@gmail.com>

…lclock gate Signed-off-by: Chris Randles <randles.chris@gmail.com>

… origin counter Signed-off-by: Chris Randles <randles.chris@gmail.com>

Signed-off-by: Chris Randles <randles.chris@gmail.com>

…ling gauge Signed-off-by: Chris Randles <randles.chris@gmail.com>

crandles · 2026-05-27T17:53:01Z

having second thoughts on the auto probe feature; is no probe a valid use case? Leaning towards reverting it, or making it skippable at the least

if i revert it, i think we might then want an additional warning around the healthy floor config. (if you have no configured health check, you'll want to use 0, i think)

thinker0 · 2026-05-28T01:19:15Z

LGTM

thinker0 · 2026-05-28T01:46:19Z

Follow-up: a secondary silent-fallback case that survives this fix

Confirmed once more that this PR resolves the 502 from #1015 — thanks @crandles. While validating end-to-end, I hit a second failure mode that this PR's auto-probe alone does not surface, and I wanted to share it in case it's worth a small additional change or a docs note.

Symptom

After deploying the auto-probe-pool-members branch, the ALB endpoint returns HTTP 200 and a well-formed response — but the response only ever contains data from pool[0]. The other pool members are silently absent from the merge. analysis.children in the response is null, and no backendName=<other-member> access-log entry is emitted for ALB-routed traffic — confirming the per-member fanout goroutines are never launched.

A direct call to the same pool member via path-routing (/<member-name>/api/v1/query) returns the expected data, so the member itself is reachable and the prometheus client works.

Root cause

For a pool of [backend-a, backend-b] with healthy_floor: 1, two things conspire:

prometheus.DefaultHealthCheckConfig hardcodes the probe to /api/v1/query?query=up (pkg/backends/prometheus/health.go:30).
The auto-probe added by this PR runs that exact query against every pool member that doesn't have an explicit healthcheck: block. Some Prometheus-compatible backends reject query=up outright — e.g. multi-tenant frontends that refuse unbounded label-cardinality queries return 400 bad_data: "too many series found". The probe then never succeeds, and the target's status stays at StatusUnchecked (0) (or transitions to StatusFailing (-1) after the configured threshold).
Pool.Targets() filters with int(hcStatus.Get()) < healthyFloor (pkg/backends/alb/pool/pool.go:113). With healthy_floor: 1 (= StatusPassing), an Unchecked or Failing target is filtered out, so Targets() returns only the one healthy member.
time_series_merge.go:332 then takes the single-live-member fast path:
```
if l == 1 && len(stripKeys) == 0 && !needsDualQuery {
    defaultHandler.ServeHTTP(w, r)
    return
}
```
defaultHandler = hl[0].Handler() — i.e. pool[0]'s prometheus handler. Fanout to the other members never happens, but the request returns 200 with that member's body, so from the caller's perspective everything "looks fine."

So this PR's StatusUnchecked initial state is correctly admitted by the brief admission window noted in backends.go:97, but as soon as the auto-probe runs and consistently fails against a member that rejects query=up, that member exits the pool permanently and the merge silently degrades into a single-member proxy.

Reproducer (minimal)

backends:
  alb:
    provider: alb
    alb:
      mechanism: tsm
      healthy_floor: 1
      pool: [backend-a, backend-b]
  backend-a:
    provider: prometheus
    origin_url: 'https://example.com/prom-a'   # answers query=up normally
  backend-b:
    provider: prometheus
    origin_url: 'https://example.com/prom-b'   # rejects query=up with 400

After startup, /alb/api/v1/query?query=<anything> returns 200 with backend-a-only data, no merge, no children in analysis, no error visible to the caller.

Workaround (config-only, no patch)

Override the auto-probe query on each pool member so the probe stays inside whatever the backend will actually accept:

backend-b:
  provider: prometheus
  origin_url: 'https://example.com/prom-b'
  healthcheck:
    query: 'query=vector(1)'    # trivially valid PromQL, no series scan
    interval: 5s
    recovery_threshold: 1
    failure_threshold: 3

With this, the probe transitions backend-b to StatusPassing within one interval, Pool.Targets() returns both members, the single-member fast path no longer triggers, and analysis.children shows both shards merging as expected. Verified 10/10 over a fresh cache window after this change.

Possible follow-ups (for maintainer judgement)

This isn't a regression introduced by this PR — it's an existing edge of DefaultHealthCheckConfig's probe choice that this PR's auto-probe now exposes more often, because more pool members run that probe than before. A few ideas, in case any are worth doing in a follow-up:

Make the prometheus default probe inert. query=vector(1) is a 0-series no-op that's accepted by any spec-compliant Prometheus, including multi-tenant gateways that refuse cardinality-unbounded queries. Swapping the query=up default would remove a class of silent failures without changing observability semantics for normal backends.
Surface single-member silent fallback. The l == 1 branch in time_series_merge.go:332 is invisible to the caller and to logs. A level=warn emission when a multi-member pool dispatches via the single-live-member fast path — once per poolVersion to avoid spam — would let operators notice this state without having to diff response bodies against a direct backend call. Adding the elided shard count to the analysis block would also help.
Docs note in the healthcheck section: "If a pool member rejects query=up (e.g. multi-tenant Prometheus gateways with cardinality guards), set an explicit healthcheck.query such as query=vector(1). Otherwise the auto-probe will keep the member out of the pool and the ALB merge will silently degrade to single-member dispatch."

Happy to open a separate issue for any of these and/or send a PR if helpful.

crandles · 2026-05-28T13:49:43Z

Some Prometheus-compatible backends reject query=up outright

Good info.

At the moment we favor standard prometheus vs the alternative engines, so perhaps that is adequate for the current revision, but worth a documentation callout.

Per #1012 (comment) I'm still not entirely sold on the auto probe feature - the alternative would be stricter config validations and warnings.

For example, if you supply a healthy floor value != 0, then you must have supplied an appropriate health check, else trickster should fail to start -- OR it will start but heavily warn + ignore the invalid healthy_floor value. Would defer to operators on an appropriate health check vs baking in a default one (that can be wrong, or flawed).

(the feedback about the proposed health check not working across different engines makes me think this is the better short term option. With true support for different prometheus engines, we'd need to have some way to inspect their differences, or have this defined per config, or at least some integration test to verify a default query actually works. While vertor(1) might work, I'm not sure we can say what is and isn't spec-compliant Prometheus without a conformance test. )

…dispatch Signed-off-by: Chris Randles <randles.chris@gmail.com>

crandles force-pushed the auto-probe-pool-members branch 3 times, most recently from 5ecba7e to ffabe11 Compare May 27, 2026 14:27

crandles force-pushed the auto-probe-pool-members branch from ffabe11 to b8dea17 Compare May 27, 2026 15:11

crandles added 4 commits May 27, 2026 11:11

fix+test: [healthcheck,alb] auto-probe unconfigured pool members

09eab16

Signed-off-by: Chris Randles <randles.chris@gmail.com>

test: demonstrate dpc chunk test wallclock regression

18f777e

Signed-off-by: Chris Randles <randles.chris@gmail.com>

test: [proxy/engines] anchor chunk-test reference time to wallclock

dc0c4c5

Signed-off-by: Chris Randles <randles.chris@gmail.com>

test: [proxy/engines] pin retention so fixed chunk fixtures clear wal…

4c61fd5

…lclock gate Signed-off-by: Chris Randles <randles.chris@gmail.com>

crandles force-pushed the auto-probe-pool-members branch from b8dea17 to 4c61fd5 Compare May 27, 2026 15:12

crandles mentioned this pull request May 27, 2026

fix: [proxy/engines] dpc chunk test wallclock dependency #1013

Closed

test: [integration/engines] ignore auto-applied healthcheck probes in…

baba35b

… origin counter Signed-off-by: Chris Randles <randles.chris@gmail.com>

crandles marked this pull request as ready for review May 27, 2026 15:34

crandles requested a review from a team as a code owner May 27, 2026 15:34

crandles mentioned this pull request May 27, 2026

feat+perf: [alb/fanout,proxy/capture] loser-drain histogram + capture body presize #1014

Open

crandles added 2 commits May 27, 2026 13:03

fix: [alb] warn + metric when healthy_floor admits failing members

1ef6e30

Signed-off-by: Chris Randles <randles.chris@gmail.com>

docs: [alb,metrics] document healthy_floor pitfall and new admits-fai…

6c24f39

…ling gauge Signed-off-by: Chris Randles <randles.chris@gmail.com>

thinker0 mentioned this pull request May 28, 2026

[ALB/TSM] All requests return 502 Bad Gateway — single + multi-backend pool, even from main branch #1015

Open

thinker0 mentioned this pull request May 28, 2026

fix+test: [timeseries] panic on ExtentList.Splice when shard count exceeds len(el)*4 hint #1016

Open

fix: [alb] validate healthy_floor and warn on degraded single-member …

c7780a8

…dispatch Signed-off-by: Chris Randles <randles.chris@gmail.com>

crandles changed the title ~~fix+test: [healthcheck,alb] auto-probe unconfigured pool members~~ fix+test: [alb] surface healthy_floor misconfigurations and degraded dispatch May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix+test: [alb] surface healthy_floor misconfigurations and degraded dispatch#1012

fix+test: [alb] surface healthy_floor misconfigurations and degraded dispatch#1012
crandles wants to merge 8 commits into
trickstercache:mainfrom
crandles:auto-probe-pool-members

crandles commented May 27, 2026 •

edited

Loading

Uh oh!

crandles commented May 27, 2026

Uh oh!

crandles commented May 27, 2026 •

edited

Loading

Uh oh!

thinker0 commented May 28, 2026

Uh oh!

thinker0 commented May 28, 2026

Uh oh!

crandles commented May 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

crandles commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

AI Disclosure

Uh oh!

crandles commented May 27, 2026

Uh oh!

crandles commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thinker0 commented May 28, 2026

Uh oh!

thinker0 commented May 28, 2026

Follow-up: a secondary silent-fallback case that survives this fix

Symptom

Root cause

Reproducer (minimal)

Workaround (config-only, no patch)

Possible follow-ups (for maintainer judgement)

Uh oh!

crandles commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

crandles commented May 27, 2026 •

edited

Loading

crandles commented May 27, 2026 •

edited

Loading

crandles commented May 28, 2026 •

edited

Loading