feat(runtime-capabilities): capability retry endpoint + unavailable-capability UI by jamby77 · Pull Request #213 · BetterDB-inc/monitor

jamby77 · 2026-05-20T09:18:54Z

Summary

Adds a POST /connections/:id/capabilities/retry endpoint that re-probes a previously-disabled runtime capability (e.g. SLOWLOG on Upstash) and clears the cached "unavailable" state on success.
Extends RuntimeCapabilityTracker with per-capability disabled-reason + timestamp metadata, surfaced through the health endpoint and shared health.ts types.
Adds front-end CapabilityUnavailableBanner + UnavailableOverlay driven by useCapabilities, with a retry action wired into Slow Log and other capability-gated pages.
Recognizes Upstash-style "command is not available" / "not supported" error strings so capabilities get auto-disabled instead of repeatedly erroring.

Why draft

Pushed to keep the work alive after master picked up the related MonitorSupportProbe work (#207). This branch is the other half — capability retry / UI affordances — and hasn't been reviewed yet.

Test plan

pnpm test --filter @betterdb/api (new runtime-capability-tracker.service.spec.ts)
Manual: connect Upstash, hit Slow Log page, see banner, click retry, confirm capability re-evaluates
pnpm lint clean
Verify health endpoint exposes disabledReasons shape matching packages/shared/src/types/health.ts

Note

Medium Risk
Touches live connection command execution and health/capability state used across polling and the UI; behavior is well-tested but misclassified errors could flip feature availability.

Overview
Adds manual runtime capability retry and clearer operator UI when Redis/Valkey commands are blocked (e.g. Upstash, ElastiCache).

API: New POST /connections/:id/capabilities/:capability/retry runs a cheap probe from CAPABILITY_TEST_COMMAND, dedupes in-flight probes, and applies a 5s timeout so hung calls return available: 'unknown' without changing prior state. Definitive server rejections re-record failure and return available: false with the error text; success calls resetCapability. RuntimeCapabilityTracker now stores per-capability disable reasons + timestamps, recognizes more blocked-error patterns (Upstash “not available”, cluster disabled, etc.), and exposes getDisabledReasons / resetCapability. Health responses include runtimeCapabilityReasons (shared types in packages/shared).

Web: App loads reasons from health and exposes retryCapability via context. Slow Log replaces the full-page unavailable overlay with CapabilityStatusBanner (verbatim ERR, retry, transient vs blocked styling). UnavailableMessage / UnavailableOverlay gain optional reason + retry for other gated pages.

Tests: Jest specs for the controller retry path and tracker; Vitest for the banner.

^{Reviewed by Cursor Bugbot for commit fb509f2. Bugbot is set up for automated code reviews on this repo. Configure here.}

Extended backend and frontend to support retrying runtime capabilities (e.g., `SLOWLOG`) when disabled due to server restrictions. Added `/connections/:id/capabilities/:capability/retry` endpoint, React component `CapabilityUnavailableBanner` for inline warnings, and enhanced state handling in `useCapabilities`. Includes tests and better surfacing of disabled reasons in the UI.

jamby77 · 2026-05-21T07:48:19Z

@BugBot review

cursor · 2026-05-21T07:48:22Z

Bugbot couldn't run

Bugbot is not enabled for your user on this team.

Ask your team administrator to increase your team's hard limit for Bugbot seats or add you to the allowlist in the Cursor dashboard.

Retry now actually re-runs the capability's command instead of just clearing the disabled flag, and the SlowLog page surfaces the verdict in stacked banners over always-rendered content. - POST /connections/:id/capabilities/:capability/retry now executes the capability's cheapest test command (SLOWLOG LEN, LATENCY LATEST, etc.) against the live adapter and returns { available, reason }. Success re-enables the capability; a blocked-pattern failure re-records the rejection with the fresh reason; transient errors leave previous state. - Per-capability test-command map exported as CAPABILITY_TEST_COMMAND from RuntimeCapabilityTracker for the controller to consume. - New CapabilityStatusBanner: warn banner (description) + error banner (verbatim ERR) + left-aligned Retry button. During retry the error banner shows "Force-retrying X…" for a 3 s window so the action is perceptible before the (typically sub-100 ms) probe resolves. - SlowLog page now always renders the table; the unavailable state is expressed as banners stacked above it instead of a centered overlay card. Both the full-block-unavailable and per-capability partial cases use the same banner. - Extract UnavailableMessage from UnavailableOverlay so other pages (Clients, AuditTrail, VectorAi, ClientAnalytics) keep the overlay visual unchanged via composition. - Remove the now-unused CapabilityUnavailableBanner. - Fix a stray missing-paren JSX expression in start-session-modal.tsx that broke type-check.

CLUSTER SLOT-STATS without ORDERBY/LIMIT errors with "wrong number of arguments" — that's not a blocked-command pattern, so the probe was returning available='unknown' on capable servers (false negative). Match the shape used by the live poller (ORDERBY key-count LIMIT 1) so the probe and the polled call exercise the same server path. CLIENT LIST as a probe is O(N) over the connection table. CLIENT GETNAME exercises the same ACL gate at O(1). Also recognize "cluster support disabled" as a definitive rejection so standalone servers correctly report canClusterSlotStats=false instead of 'unknown' on the new probe.

Covers all five outcomes of POST /connections/:id/capabilities/:capability/retry: - 400 on unknown capability key, 404 on missing connection. - Success path: returns { available: true } and re-enables a previously- disabled capability via the tracker. - Blocked-pattern rejection (Upstash "Command is not available", cluster support disabled): returns { available: false, reason } and records the failure with the fresh reason. - Transient/non-blocked rejection (ECONNRESET, Connection timeout): returns { available: 'unknown', reason } and leaves prior tracker state intact — including not re-enabling a previously-disabled capability. - String (non-Error) rejection values are surfaced through the reason field. Also asserts the probe call actually sends the command + args from CAPABILITY_TEST_COMMAND so the map stays in sync with the controller.

…tone The probe can return three verdicts (true / false / 'unknown') but the banner previously rendered the same red "blocked" error regardless. An operator on a flaky connection couldn't tell "server refused MONITOR" from "network blip swallowed the probe" — the former requires a server config change, the latter just another click. - Add CapabilityRetryVerdict to the shared health types and wire it through controller → fetchApi → useCapabilities.retryCapability so callers get the structured verdict back, not just void. - CapabilityStatusBanner tracks the most recent verdict locally. When it was 'unknown' the error slot becomes an amber "Couldn't verify — transient error. Try again." block (HelpCircle icon) with the FRESH reason from the verdict (an ECONNRESET, timeout, etc. — not the stale prop-level disabled-because reason). Definitive 'false' verdicts and initial-render state keep the red error tone. - Extract the 3s perceptible delay into PERCEPTIBLE_RETRY_DELAY_MS. - During retry the banner now shows a dedicated amber spinner row instead of overloading the error banner with the "Force-retrying X…" string. 6 vitest cases cover initial render, in-progress state, the three verdict paths (true unmounts via parent, false stays red, unknown turns amber), and the verdict-reason-vs-prop-reason fallback. All 62 backend tests (tracker + controller + registry) still green.

These changes were IDE auto-format noise picked up earlier in the branch's history; they're not part of the runtime-capability work this PR ships. Restoring the master version keeps the PR diff focused.

…t pin iovalkey's commandTimeout normally bounds a Redis call, but adapters that lack one — or that hang BEFORE a command is even queued (TLS handshake on a dead server, etc.) — would otherwise keep the retry endpoint open forever. The HTTP client times out, the user retries, and the orphaned probe is still using up a connection slot. - New CAPABILITY_PROBE_TIMEOUT_MS = 5000. - Wrap adapter.call(...) in a small Promise-race helper. On timeout the rejection ("Probe timed out after 5000ms") flows through the existing catch branch: pattern doesn't match a blocked-command rule, so the tracker stays untouched and the verdict is { available: 'unknown', reason }. Operator gets the amber "couldn't verify" UI added in the previous commit and can try again. - New controller test (jest fake timers) covers the hung-probe path and asserts the prior disabled state is preserved across timeout.

KIvanow · 2026-05-21T12:33:37Z

@BugBot review

KIvanow · 2026-05-21T12:34:01Z

Should be fixed now. Bugbot's vibecoded API had a new issue

- Derive RUNTIME_CAPABILITY_KEYS from CAPABILITY_TEST_COMMAND so new capabilities are automatically retryable instead of silently missing from the controller's allowlist. - Render a success state in CapabilityStatusBanner when retry yields available: true, so the red error branch no longer briefly flashes the stale disabled-reason while parent refreshCapabilities is in flight.

jamby77 · 2026-05-25T14:09:11Z

@BugBot review

When both canSlowLog and canCommandLog were unavailable the page rendered a single SLOWLOG/COMMANDLOG banner whose retry only re-probed canSlowLog, so operators who fixed COMMANDLOG on the server could not re-enable that half from the UI. Split into two stacked banners so each retry targets the correct capability.

jamby77 · 2026-05-25T14:58:55Z

@BugBot review

Reset capabilities/reasons synchronously when the active connection changes so SlowLog and other capability banners cannot briefly show another database's disable reason while /health is still in flight.

jamby77 · 2026-05-25T15:11:34Z

@BugBot review

cursor

✅ Bugbot reviewed your changes and found no new issues!

2 issues from previous reviews remain unresolved.

Comment @cursor review or bugbot run to trigger another review on this PR

^{Reviewed by Cursor Bugbot for commit 16f2821. Configure here.}

- canSlowLog probe now runs SLOWLOG GET 1 (the same subcommand the live poller/UI uses) so a server permitting LEN but blocking GET cannot pass the retry and then fail on the next poll. - retryCapability awaits refreshCapabilities so context flags clear before the verdict resolves, preventing CapabilityStatusBanner from showing success while old disabled flags still gate the page.

jamby77 · 2026-05-25T15:18:22Z

@BugBot review

- canCommandLog probe now runs COMMANDLOG GET 1 slow (the poller path) instead of LEN, so a server permitting LEN but blocking GET cannot pass retry and then fail on the next poll. - When both SLOWLOG and COMMANDLOG are unavailable but neither has a probe reason, surface an informational banner (no retry) so the Slow Log page never silently renders empty tables. Single-capability absences without a reason stay silent so Redis-only servers don't see noisy "COMMANDLOG not exposed" banners on every visit.

jamby77 · 2026-05-25T15:25:59Z

@BugBot review

Key each CapabilityStatusBanner by the active connection ID so React discards the local lastVerdict state when the user switches connections. Without the key, a banner could carry a prior connection's success or transient retry message into the new connection's UI.

jamby77 · 2026-05-25T15:30:48Z

@BugBot review

CapabilityStatusBanner now records the upstream reason alongside each retry verdict and ignores the verdict once /health delivers a different reason. Without this, an amber 'Couldn't verify' state could persist even after health refreshed the definitive disabled reason — masking the fresh detail with the prior transient error.

jamby77 · 2026-05-25T15:37:31Z

@BugBot review

…ility) The 5s timeout on withTimeout cannot cancel the underlying adapter.call; without deduplication, repeated retry requests stack new commands on the same Redis connection while earlier probes are still hung. Track each in-flight probe in a per-(connection, capability) map and reuse it for overlapping requests so only one outstanding command exists at a time. Entries clear when the underlying promise settles, regardless of whether the caller already returned 'unknown' on timeout.

jamby77 · 2026-05-25T15:44:09Z

@BugBot review

cursor

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

^{Reviewed by Cursor Bugbot for commit fb509f2. Configure here.}

KIvanow

@jamby77 How similar are CapabilityStatus

jamby77 requested a review from KIvanow May 21, 2026 06:25

jamby77 marked this pull request as ready for review May 21, 2026 07:48

jamby77 added 6 commits May 21, 2026 10:48

chore: revert unrelated formatting in start-session-modal.tsx

392c49d

These changes were IDE auto-format noise picked up earlier in the branch's history; they're not part of the runtime-capability work this PR ships. Restoring the master version keeps the PR diff focused.

cursor Bot reviewed May 21, 2026

View reviewed changes

Comment thread apps/web/src/components/CapabilityStatusBanner.tsx Outdated

Comment thread apps/api/src/connections/connections.controller.ts Outdated