Skip to content

fix+test: [alb] surface healthy_floor misconfigurations and degraded dispatch#1012

Open
crandles wants to merge 8 commits into
trickstercache:mainfrom
crandles:auto-probe-pool-members
Open

fix+test: [alb] surface healthy_floor misconfigurations and degraded dispatch#1012
crandles wants to merge 8 commits into
trickstercache:mainfrom
crandles:auto-probe-pool-members

Conversation

@crandles
Copy link
Copy Markdown
Contributor

@crandles crandles commented May 27, 2026

Description

Re: #982 and #1015 -- both are healthy_floor misconfigurations that fail silently. This surfaces them rather than guessing a default probe.

docs/alb.md covers recovery_threshold as the cold-start knob rather than a negative floor; docs/health.md notes the prometheus query=up default-probe caveat for multi-tenant gateways.

Type of Change

    • Bug fix
    • Test coverage
    • Documentation

AI Disclosure

    • This contribution DOES include AI-generated changes, and I have reviewed the relevant contributing guidelines.

@crandles crandles force-pushed the auto-probe-pool-members branch 3 times, most recently from 5ecba7e to ffabe11 Compare May 27, 2026 14:27
@crandles
Copy link
Copy Markdown
Contributor Author

down to what I believe is an unrelated failure in TestDeltaProxyCacheRequestRangeMissChunks -- going to take a look at that again on a new branch/PR. Believe we have a timing flaw.

@crandles crandles force-pushed the auto-probe-pool-members branch from ffabe11 to b8dea17 Compare May 27, 2026 15:11
crandles added 4 commits May 27, 2026 11:11
Signed-off-by: Chris Randles <randles.chris@gmail.com>
Signed-off-by: Chris Randles <randles.chris@gmail.com>
Signed-off-by: Chris Randles <randles.chris@gmail.com>
…lclock gate

Signed-off-by: Chris Randles <randles.chris@gmail.com>
… origin counter

Signed-off-by: Chris Randles <randles.chris@gmail.com>
@crandles crandles marked this pull request as ready for review May 27, 2026 15:34
@crandles crandles requested a review from a team as a code owner May 27, 2026 15:34
crandles added 2 commits May 27, 2026 13:03
Signed-off-by: Chris Randles <randles.chris@gmail.com>
…ling gauge

Signed-off-by: Chris Randles <randles.chris@gmail.com>
@crandles
Copy link
Copy Markdown
Contributor Author

crandles commented May 27, 2026

having second thoughts on the auto probe feature; is no probe a valid use case? Leaning towards reverting it, or making it skippable at the least

if i revert it, i think we might then want an additional warning around the healthy floor config. (if you have no configured health check, you'll want to use 0, i think)

@thinker0
Copy link
Copy Markdown

LGTM

@thinker0
Copy link
Copy Markdown

Follow-up: a secondary silent-fallback case that survives this fix

Confirmed once more that this PR resolves the 502 from #1015 — thanks @crandles. While validating end-to-end, I hit a second failure mode that this PR's auto-probe alone does not surface, and I wanted to share it in case it's worth a small additional change or a docs note.

Symptom

After deploying the auto-probe-pool-members branch, the ALB endpoint returns HTTP 200 and a well-formed response — but the response only ever contains data from pool[0]. The other pool members are silently absent from the merge. analysis.children in the response is null, and no backendName=<other-member> access-log entry is emitted for ALB-routed traffic — confirming the per-member fanout goroutines are never launched.

A direct call to the same pool member via path-routing (/<member-name>/api/v1/query) returns the expected data, so the member itself is reachable and the prometheus client works.

Root cause

For a pool of [backend-a, backend-b] with healthy_floor: 1, two things conspire:

  1. prometheus.DefaultHealthCheckConfig hardcodes the probe to /api/v1/query?query=up (pkg/backends/prometheus/health.go:30).

  2. The auto-probe added by this PR runs that exact query against every pool member that doesn't have an explicit healthcheck: block. Some Prometheus-compatible backends reject query=up outright — e.g. multi-tenant frontends that refuse unbounded label-cardinality queries return 400 bad_data: "too many series found". The probe then never succeeds, and the target's status stays at StatusUnchecked (0) (or transitions to StatusFailing (-1) after the configured threshold).

  3. Pool.Targets() filters with int(hcStatus.Get()) < healthyFloor (pkg/backends/alb/pool/pool.go:113). With healthy_floor: 1 (= StatusPassing), an Unchecked or Failing target is filtered out, so Targets() returns only the one healthy member.

  4. time_series_merge.go:332 then takes the single-live-member fast path:

    if l == 1 && len(stripKeys) == 0 && !needsDualQuery {
        defaultHandler.ServeHTTP(w, r)
        return
    }

    defaultHandler = hl[0].Handler() — i.e. pool[0]'s prometheus handler. Fanout to the other members never happens, but the request returns 200 with that member's body, so from the caller's perspective everything "looks fine."

So this PR's StatusUnchecked initial state is correctly admitted by the brief admission window noted in backends.go:97, but as soon as the auto-probe runs and consistently fails against a member that rejects query=up, that member exits the pool permanently and the merge silently degrades into a single-member proxy.

Reproducer (minimal)

backends:
  alb:
    provider: alb
    alb:
      mechanism: tsm
      healthy_floor: 1
      pool: [backend-a, backend-b]
  backend-a:
    provider: prometheus
    origin_url: 'https://example.com/prom-a'   # answers query=up normally
  backend-b:
    provider: prometheus
    origin_url: 'https://example.com/prom-b'   # rejects query=up with 400

After startup, /alb/api/v1/query?query=<anything> returns 200 with backend-a-only data, no merge, no children in analysis, no error visible to the caller.

Workaround (config-only, no patch)

Override the auto-probe query on each pool member so the probe stays inside whatever the backend will actually accept:

backend-b:
  provider: prometheus
  origin_url: 'https://example.com/prom-b'
  healthcheck:
    query: 'query=vector(1)'    # trivially valid PromQL, no series scan
    interval: 5s
    recovery_threshold: 1
    failure_threshold: 3

With this, the probe transitions backend-b to StatusPassing within one interval, Pool.Targets() returns both members, the single-member fast path no longer triggers, and analysis.children shows both shards merging as expected. Verified 10/10 over a fresh cache window after this change.

Possible follow-ups (for maintainer judgement)

This isn't a regression introduced by this PR — it's an existing edge of DefaultHealthCheckConfig's probe choice that this PR's auto-probe now exposes more often, because more pool members run that probe than before. A few ideas, in case any are worth doing in a follow-up:

  1. Make the prometheus default probe inert. query=vector(1) is a 0-series no-op that's accepted by any spec-compliant Prometheus, including multi-tenant gateways that refuse cardinality-unbounded queries. Swapping the query=up default would remove a class of silent failures without changing observability semantics for normal backends.

  2. Surface single-member silent fallback. The l == 1 branch in time_series_merge.go:332 is invisible to the caller and to logs. A level=warn emission when a multi-member pool dispatches via the single-live-member fast path — once per poolVersion to avoid spam — would let operators notice this state without having to diff response bodies against a direct backend call. Adding the elided shard count to the analysis block would also help.

  3. Docs note in the healthcheck section: "If a pool member rejects query=up (e.g. multi-tenant Prometheus gateways with cardinality guards), set an explicit healthcheck.query such as query=vector(1). Otherwise the auto-probe will keep the member out of the pool and the ALB merge will silently degrade to single-member dispatch."

Happy to open a separate issue for any of these and/or send a PR if helpful.

@crandles
Copy link
Copy Markdown
Contributor Author

crandles commented May 28, 2026

Some Prometheus-compatible backends reject query=up outright

Good info.

At the moment we favor standard prometheus vs the alternative engines, so perhaps that is adequate for the current revision, but worth a documentation callout.

Per #1012 (comment) I'm still not entirely sold on the auto probe feature - the alternative would be stricter config validations and warnings.

For example, if you supply a healthy floor value != 0, then you must have supplied an appropriate health check, else trickster should fail to start -- OR it will start but heavily warn + ignore the invalid healthy_floor value. Would defer to operators on an appropriate health check vs baking in a default one (that can be wrong, or flawed).

(the feedback about the proposed health check not working across different engines makes me think this is the better short term option. With true support for different prometheus engines, we'd need to have some way to inspect their differences, or have this defined per config, or at least some integration test to verify a default query actually works. While vertor(1) might work, I'm not sure we can say what is and isn't spec-compliant Prometheus without a conformance test. )

…dispatch

Signed-off-by: Chris Randles <randles.chris@gmail.com>
@crandles crandles changed the title fix+test: [healthcheck,alb] auto-probe unconfigured pool members fix+test: [alb] surface healthy_floor misconfigurations and degraded dispatch May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants