Skip to content

feat(runtime-capabilities): capability retry endpoint + unavailable-capability UI#213

Merged
jamby77 merged 15 commits into
masterfrom
bugfix/detect-upstash-slowlog-unavailable
May 28, 2026
Merged

feat(runtime-capabilities): capability retry endpoint + unavailable-capability UI#213
jamby77 merged 15 commits into
masterfrom
bugfix/detect-upstash-slowlog-unavailable

Conversation

@jamby77

@jamby77 jamby77 commented May 20, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds a POST /connections/:id/capabilities/retry endpoint that re-probes a previously-disabled runtime capability (e.g. SLOWLOG on Upstash) and clears the cached "unavailable" state on success.
  • Extends RuntimeCapabilityTracker with per-capability disabled-reason + timestamp metadata, surfaced through the health endpoint and shared health.ts types.
  • Adds front-end CapabilityUnavailableBanner + UnavailableOverlay driven by useCapabilities, with a retry action wired into Slow Log and other capability-gated pages.
  • Recognizes Upstash-style "command is not available" / "not supported" error strings so capabilities get auto-disabled instead of repeatedly erroring.

Why draft

Pushed to keep the work alive after master picked up the related MonitorSupportProbe work (#207). This branch is the other half — capability retry / UI affordances — and hasn't been reviewed yet.

Test plan

  • pnpm test --filter @betterdb/api (new runtime-capability-tracker.service.spec.ts)
  • Manual: connect Upstash, hit Slow Log page, see banner, click retry, confirm capability re-evaluates
  • pnpm lint clean
  • Verify health endpoint exposes disabledReasons shape matching packages/shared/src/types/health.ts

Note

Medium Risk
Touches live connection command execution and health/capability state used across polling and the UI; behavior is well-tested but misclassified errors could flip feature availability.

Overview
Adds manual runtime capability retry and clearer operator UI when Redis/Valkey commands are blocked (e.g. Upstash, ElastiCache).

API: New POST /connections/:id/capabilities/:capability/retry runs a cheap probe from CAPABILITY_TEST_COMMAND, dedupes in-flight probes, and applies a 5s timeout so hung calls return available: 'unknown' without changing prior state. Definitive server rejections re-record failure and return available: false with the error text; success calls resetCapability. RuntimeCapabilityTracker now stores per-capability disable reasons + timestamps, recognizes more blocked-error patterns (Upstash “not available”, cluster disabled, etc.), and exposes getDisabledReasons / resetCapability. Health responses include runtimeCapabilityReasons (shared types in packages/shared).

Web: App loads reasons from health and exposes retryCapability via context. Slow Log replaces the full-page unavailable overlay with CapabilityStatusBanner (verbatim ERR, retry, transient vs blocked styling). UnavailableMessage / UnavailableOverlay gain optional reason + retry for other gated pages.

Tests: Jest specs for the controller retry path and tracker; Vitest for the banner.

Reviewed by Cursor Bugbot for commit fb509f2. Bugbot is set up for automated code reviews on this repo. Configure here.

Extended backend and frontend to support retrying runtime capabilities (e.g., `SLOWLOG`) when disabled due to server restrictions. Added `/connections/:id/capabilities/:capability/retry` endpoint, React component `CapabilityUnavailableBanner` for inline warnings, and enhanced state handling in `useCapabilities`. Includes tests and better surfacing of disabled reasons in the UI.
@jamby77 jamby77 requested a review from KIvanow May 21, 2026 06:25
@jamby77 jamby77 marked this pull request as ready for review May 21, 2026 07:48
@jamby77

jamby77 commented May 21, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

@cursor

cursor Bot commented May 21, 2026

Copy link
Copy Markdown

Bugbot couldn't run

Bugbot is not enabled for your user on this team.

Ask your team administrator to increase your team's hard limit for Bugbot seats or add you to the allowlist in the Cursor dashboard.

jamby77 added 6 commits May 21, 2026 10:48
Retry now actually re-runs the capability's command instead of just clearing
the disabled flag, and the SlowLog page surfaces the verdict in stacked
banners over always-rendered content.

- POST /connections/:id/capabilities/:capability/retry now executes the
  capability's cheapest test command (SLOWLOG LEN, LATENCY LATEST, etc.)
  against the live adapter and returns { available, reason }. Success
  re-enables the capability; a blocked-pattern failure re-records the
  rejection with the fresh reason; transient errors leave previous state.
- Per-capability test-command map exported as CAPABILITY_TEST_COMMAND from
  RuntimeCapabilityTracker for the controller to consume.
- New CapabilityStatusBanner: warn banner (description) + error banner
  (verbatim ERR) + left-aligned Retry button. During retry the error
  banner shows "Force-retrying X…" for a 3 s window so the action is
  perceptible before the (typically sub-100 ms) probe resolves.
- SlowLog page now always renders the table; the unavailable state is
  expressed as banners stacked above it instead of a centered overlay
  card. Both the full-block-unavailable and per-capability partial cases
  use the same banner.
- Extract UnavailableMessage from UnavailableOverlay so other pages
  (Clients, AuditTrail, VectorAi, ClientAnalytics) keep the overlay
  visual unchanged via composition.
- Remove the now-unused CapabilityUnavailableBanner.
- Fix a stray missing-paren JSX expression in start-session-modal.tsx
  that broke type-check.
CLUSTER SLOT-STATS without ORDERBY/LIMIT errors with
"wrong number of arguments" — that's not a blocked-command pattern, so the
probe was returning available='unknown' on capable servers (false negative).
Match the shape used by the live poller (ORDERBY key-count LIMIT 1) so the
probe and the polled call exercise the same server path.

CLIENT LIST as a probe is O(N) over the connection table. CLIENT GETNAME
exercises the same ACL gate at O(1).

Also recognize "cluster support disabled" as a definitive rejection so
standalone servers correctly report canClusterSlotStats=false instead of
'unknown' on the new probe.
Covers all five outcomes of POST /connections/:id/capabilities/:capability/retry:

- 400 on unknown capability key, 404 on missing connection.
- Success path: returns { available: true } and re-enables a previously-
  disabled capability via the tracker.
- Blocked-pattern rejection (Upstash "Command is not available", cluster
  support disabled): returns { available: false, reason } and records the
  failure with the fresh reason.
- Transient/non-blocked rejection (ECONNRESET, Connection timeout):
  returns { available: 'unknown', reason } and leaves prior tracker state
  intact — including not re-enabling a previously-disabled capability.
- String (non-Error) rejection values are surfaced through the reason field.

Also asserts the probe call actually sends the command + args from
CAPABILITY_TEST_COMMAND so the map stays in sync with the controller.
…tone

The probe can return three verdicts (true / false / 'unknown') but the
banner previously rendered the same red "blocked" error regardless. An
operator on a flaky connection couldn't tell "server refused MONITOR" from
"network blip swallowed the probe" — the former requires a server config
change, the latter just another click.

- Add CapabilityRetryVerdict to the shared health types and wire it
  through controller → fetchApi → useCapabilities.retryCapability so
  callers get the structured verdict back, not just void.
- CapabilityStatusBanner tracks the most recent verdict locally. When it
  was 'unknown' the error slot becomes an amber "Couldn't verify —
  transient error. Try again." block (HelpCircle icon) with the FRESH
  reason from the verdict (an ECONNRESET, timeout, etc. — not the stale
  prop-level disabled-because reason). Definitive 'false' verdicts and
  initial-render state keep the red error tone.
- Extract the 3s perceptible delay into PERCEPTIBLE_RETRY_DELAY_MS.
- During retry the banner now shows a dedicated amber spinner row instead
  of overloading the error banner with the "Force-retrying X…" string.

6 vitest cases cover initial render, in-progress state, the three verdict
paths (true unmounts via parent, false stays red, unknown turns amber),
and the verdict-reason-vs-prop-reason fallback. All 62 backend tests
(tracker + controller + registry) still green.
These changes were IDE auto-format noise picked up earlier in the branch's
history; they're not part of the runtime-capability work this PR ships.
Restoring the master version keeps the PR diff focused.
…t pin

iovalkey's commandTimeout normally bounds a Redis call, but adapters that
lack one — or that hang BEFORE a command is even queued (TLS handshake on
a dead server, etc.) — would otherwise keep the retry endpoint open
forever. The HTTP client times out, the user retries, and the orphaned
probe is still using up a connection slot.

- New CAPABILITY_PROBE_TIMEOUT_MS = 5000.
- Wrap adapter.call(...) in a small Promise-race helper. On timeout the
  rejection ("Probe timed out after 5000ms") flows through the existing
  catch branch: pattern doesn't match a blocked-command rule, so the
  tracker stays untouched and the verdict is { available: 'unknown',
  reason }. Operator gets the amber "couldn't verify" UI added in the
  previous commit and can try again.
- New controller test (jest fake timers) covers the hung-probe path and
  asserts the prior disabled state is preserved across timeout.
@KIvanow

KIvanow commented May 21, 2026

Copy link
Copy Markdown
Member

@BugBot review

@KIvanow

KIvanow commented May 21, 2026

Copy link
Copy Markdown
Member

Should be fixed now. Bugbot's vibecoded API had a new issue

Comment thread apps/web/src/components/CapabilityStatusBanner.tsx Outdated
Comment thread apps/api/src/connections/connections.controller.ts Outdated
- Derive RUNTIME_CAPABILITY_KEYS from CAPABILITY_TEST_COMMAND so new
  capabilities are automatically retryable instead of silently missing
  from the controller's allowlist.
- Render a success state in CapabilityStatusBanner when retry yields
  available: true, so the red error branch no longer briefly flashes
  the stale disabled-reason while parent refreshCapabilities is in
  flight.
@jamby77

jamby77 commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

Comment thread apps/web/src/pages/SlowLog.tsx Outdated
When both canSlowLog and canCommandLog were unavailable the page rendered
a single SLOWLOG/COMMANDLOG banner whose retry only re-probed canSlowLog,
so operators who fixed COMMANDLOG on the server could not re-enable that
half from the UI. Split into two stacked banners so each retry targets
the correct capability.
Comment thread apps/web/src/App.tsx
@jamby77

jamby77 commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

Comment thread apps/web/src/App.tsx
Comment thread apps/api/src/connections/runtime-capability-tracker.service.ts Outdated
Reset capabilities/reasons synchronously when the active connection
changes so SlowLog and other capability banners cannot briefly show
another database's disable reason while /health is still in flight.
@jamby77

jamby77 commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

2 issues from previous reviews remain unresolved.

Fix All in Cursor

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 16f2821. Configure here.

- canSlowLog probe now runs SLOWLOG GET 1 (the same subcommand the live
  poller/UI uses) so a server permitting LEN but blocking GET cannot
  pass the retry and then fail on the next poll.
- retryCapability awaits refreshCapabilities so context flags clear
  before the verdict resolves, preventing CapabilityStatusBanner from
  showing success while old disabled flags still gate the page.
@jamby77

jamby77 commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

Comment thread apps/api/src/connections/runtime-capability-tracker.service.ts Outdated
Comment thread apps/web/src/pages/SlowLog.tsx Outdated
- canCommandLog probe now runs COMMANDLOG GET 1 slow (the poller path)
  instead of LEN, so a server permitting LEN but blocking GET cannot
  pass retry and then fail on the next poll.
- When both SLOWLOG and COMMANDLOG are unavailable but neither has a
  probe reason, surface an informational banner (no retry) so the
  Slow Log page never silently renders empty tables. Single-capability
  absences without a reason stay silent so Redis-only servers don't
  see noisy "COMMANDLOG not exposed" banners on every visit.
@jamby77

jamby77 commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

Comment thread apps/web/src/pages/SlowLog.tsx
Key each CapabilityStatusBanner by the active connection ID so React
discards the local lastVerdict state when the user switches connections.
Without the key, a banner could carry a prior connection's success or
transient retry message into the new connection's UI.
@jamby77

jamby77 commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

Comment thread apps/web/src/components/CapabilityStatusBanner.tsx Outdated
CapabilityStatusBanner now records the upstream reason alongside each
retry verdict and ignores the verdict once /health delivers a different
reason. Without this, an amber 'Couldn't verify' state could persist
even after health refreshed the definitive disabled reason — masking
the fresh detail with the prior transient error.
@jamby77

jamby77 commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

Comment thread apps/api/src/connections/connections.controller.ts
…ility)

The 5s timeout on withTimeout cannot cancel the underlying adapter.call;
without deduplication, repeated retry requests stack new commands on the
same Redis connection while earlier probes are still hung. Track each
in-flight probe in a per-(connection, capability) map and reuse it for
overlapping requests so only one outstanding command exists at a time.
Entries clear when the underlying promise settles, regardless of
whether the caller already returned 'unknown' on timeout.
@jamby77

jamby77 commented May 25, 2026

Copy link
Copy Markdown
Collaborator Author

@BugBot review

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit fb509f2. Configure here.

@KIvanow KIvanow left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamby77 How similar are CapabilityStatus

@jamby77 jamby77 merged commit 8e5a397 into master May 28, 2026
3 checks passed
@jamby77 jamby77 deleted the bugfix/detect-upstash-slowlog-unavailable branch May 28, 2026 06:47
@github-actions github-actions Bot locked and limited conversation to collaborators May 28, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants