Skip to content

feat(platform): adaptive chat reliability and blue-green deploy tiers#1914

Merged
yannickmonney merged 13 commits into
mainfrom
feat/chat-reliability-and-deploy-tiers
Jun 22, 2026
Merged

feat(platform): adaptive chat reliability and blue-green deploy tiers#1914
yannickmonney merged 13 commits into
mainfrom
feat/chat-reliability-and-deploy-tiers

Conversation

@yannickmonney

@yannickmonney yannickmonney commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

A consolidated batch that had accumulated on this branch, opened as one PR per request. It spans
two headline pillars plus supporting governance / model-catalog / accessibility work. Grouped by theme
below so it can be reviewed pillar-by-pillar.

Note: this intentionally bundles several features into one PR (against the usual atomic-PR norm) at
the requester's direction. Each pillar is independently described. The branch also carries a separate
fix: issues commit — an icon-button accessibility migration (a shared IconButton primitive with a
built-in tooltip, adopted across ~28 call sites).

Pillar 1 — Adaptive chat reliability

  • Shared chat-error contract (lib/shared/chat-errors.ts): one pure module classifies and encodes
    generation errors, used by both the Convex backend and the chat UI. message.error now carries a
    structured, localizable TALE_ERR1 envelope instead of a raw string.
  • Resource-scoped provider failover (convex/providers/, failure_scope.ts): deterministic
    failures retire a single credential/endpoint rather than a whole provider, with a structured
    fallback envelope surfaced to the UI.
  • Structured model-fallback notice + auto-switch in the composer, with new modelFallback* strings.
  • Adaptive reasoning router: the Auto router emits a coarse reasoning seed (effort/creativity)
    blended into the governor's difficulty prior. The per-agent manual Response Tuning override and
    its settings tab/route are removed; the governor gains anti-oscillation (lastTier) + cold-quality
    self-correction.
  • Composer rework: pasted images become inline [N] reference tokens; legacy screenshot capture
    removed; the new-chat shortcut/button moves from the chat header to the nav rail.

Pillar 2 — Blue-green deploy tiers + CLI version model

  • 3-tier tale deploy: -a/--all--stop; convex always rolls; db/proxy stop-gated;
    platform blue-green; sandbox/sandbox-egress flip via a new zero-gap path.
  • Zero-gap sandbox blue-green flip (CLI flip-sandbox.ts + spawner SANDBOX_COLOR, /v1/drain,
    colour-aware cancel via sandboxExecutions.spawnerColor).
  • CLI version lifecycle: tale upgradetale update (CLI self-update + file-sync only;
    rolls the binary back on sync failure). Automatic CLI↔instance version alignment (lib/version/align.ts).
  • Rollback confirmation gate: tale rollback requires -y/--yes.

Pillar 3 — Governance config, model catalog, UI a11y

  • Governance retention consolidation: standalone retention-policy.json folded into a policy
    block of retention.json with a legacy read fallback; six policy editors refactored onto a shared
    Zod config-parser + an optimistic instant-save toggle hook.
  • Model catalog sync: expanded frontier-vendor set; the weekly updater regenerates the docs tables.
  • Accessibility / UX pass: the shared Button enforces labels on icon-only buttons (with a
    working built-in tooltip); the zoom-pan viewer toolbar is rebuilt on shared primitives.
  • Docs + i18n swept across en/de/fr for every user-visible change; dev-engine auto-starts a
    stopped Docker engine.

Schema / migrations

All four Convex schema changes are purely additive optional fields with legacy-row coalescing
(autoRouteCache.seed, sandboxExecutions.spawnerColor, threads/reasoning_profiles bucket
lastTier) — no data migration required. Knowledge-DB (Postgres) schema unchanged → N/A.

Self-review + CodeRabbit — fixes applied

This PR was reviewed by an adversarial multi-agent self-review (11 confirmed findings) and CodeRabbit
(local CLI, 25 findings; 1 was a verified false-positive). Fixes pushed in the two fix(platform)
commits on top of the feature commit:

  • Windows self-update rollbackreplaceBinaryWindows returned ${installPath}.old but created
    tale.old.exe, so commit/rollback hit a nonexistent path; now one consistent, name-agnostic backup.
  • Sandbox runtime image on default deploy — the runtime/sandbox images stopped being pulled +
    retagged once the sandbox tier moved off statefulToUpdate, breaking the first /v1/execute after a
    bump; now derived from the actual flipSandboxTier roll.
  • Button tooltip never opened — the Radix Trigger wrapped SkeletonBox (a plain span that drops
    the injected ref/handlers); it now wraps the real ButtonBase.
  • failure_scope crash — tolerate a missing apiKey/baseUrl (keyless providers).
  • Adaptive lastTier fallback, signal-killed update exit code, drain-status response
    validation
    , SSE header precedence, paste image double-insert, video-URL unhandled
    rejection
    , a literal NUL byte in auto_route_helpers, Knip dead exports, and several
    silent catches → all fixed. Tests updated to the new error-envelope / [MODEL_FALLBACK] contracts.

Deferred (tracked) — intentionally not in this PR

  • /v1/drain is unauthenticated — a deliberate trust assumption (the spawner port is never
    host-exposed). Documented; HMAC-gating it is a follow-up if that assumption changes.
  • mergeRetentionPolicy on a fresh org without a bounds catalog — an edge-case in the
    actively-evolving governance file-config flow; left for the owner of that refactor to avoid a wrong
    fix to the bounds-seeding path.
  • A few LOW nits (a no-op setSpawnerColor round-trip in single-colour mode, a dropped {model} i18n
    param, two test-only helper exports) — noted, non-blocking.

Definition of Done

  • bun run check — typecheck (13/13) + Knip + targeted tests run locally; full lint/test suite green in CI.
  • bun run lint:sast (Opengrep) — clean (pre-commit gate + CI Opengrep/Opengrep-OSS green).
  • Data-model change ships its migration — N/A: additive optional fields with reader coalescing.
  • bun run test:e2e — Playwright runs in CI; the automation* spec failures are pre-existing
    infra flakiness (WebServer ECONNRESET / convex-push / TALE_CONFIG_DIR), unrelated to this diff.
  • Loading uses <Skeletonize> — no new hand-rolled skeletons.
  • Updated messages/{en,de,fr}.json — 5812 leaf keys verified in parity.
  • Updated /docs/{en,de,fr}/ for user-visible changes (CLI verbs, deploy tiers, models, retention).
  • Tests carry the change — unit tests for the new units; failing pre-existing tests updated to the new contracts.
  • Updated README.md (CLI command surface).
  • Verified the real outcome — gate run locally + CI; review fixes re-validated (typecheck + affected suites).
  • Instructions current — AGENTS.md comment rule + docs prose-lint updated.

@yannickmonney yannickmonney force-pushed the feat/chat-reliability-and-deploy-tiers branch from 2056d60 to 58e9e81 Compare June 22, 2026 04:40
Adaptive chat reliability, blue-green deploy tiers, CLI version alignment, and sandbox blue-green re-architecture.
Self-review pass on the rebased PR:
- flip-sandbox: a drained colour whose /drain-status stayed unknown through the
  timeout was torn down (killing in-flight turns); treat unknown as 'may have
  sessions' and linger, with the spawner max-linger self-reap as backstop.
  Added a regression test + injectable poll/timeout seams.
- docs(retention): describe the standalone retention-policy.json layout (flat
  fields) instead of the dropped folded-policy structure, in all 3 locales.
- docs(models): fix stale examples/default/providers path -> builtin-configs.
- ui(button): add tooltip/aria-name tests + Storybook story for the icon-button
  title->accessible-name feature.
- chore: fix a stale doc-comment path in convex-error.ts.
… crash the E2E server

The Vite dev/preview proxy (used by the prod-build E2E serving path) routes
/ws_api, /http_api and /api/* to the Convex backend with no http-proxy error
handler. A transient upstream reset (ECONNRESET) or client disconnect (EPIPE)
under CI load therefore rethrows and tears down the port-3000 server, after
which every in-flight navigation fails at once — the whole Playwright shard
dies with 0ms test failures rather than one test flaking. Attach an error
handler to each proxied route: log and respond 502 (HTTP) or destroy the socket
(WS upgrade) so one blip can't take the server down. Pre-existing flake (also
seen on main); surfaced while greening this PR's E2E.
…et can't crash the E2E server"

This reverts the vite.config.ts proxy error-handler change (d95adef).

It did not fix the E2E failure and was based on a symptom-level reading. The
real cause is a pre-existing, main-wide E2E infra issue: a Convex "Hit an error
while pushing" during the dev-stack env-sync leaves the backend incompletely
deployed, so the app can't reach Convex and the worker `org` fixture dies (every
spec fails at 0ms). The vite ECONNRESET/EPIPE floods are a downstream symptom.
Proven pre-existing: main's own E2E run at the merge-base (bbaab2f) shows the
identical "Hit an error while pushing" + vite ECONNRESET pattern, before this
branch diverged. The proxy hardening also tried to write to already-ended
sockets (writeAfterFIN), so it is net-negative here. Keeping this PR focused;
the dev-stack push error belongs in a separate infra fix.
@yannickmonney yannickmonney force-pushed the feat/chat-reliability-and-deploy-tiers branch from 42ab2f3 to 26547cf Compare June 22, 2026 12:29
The worker-scoped workerOrg fixture waits for the seeded 'E2E Assistant'
on /dashboard/{org}/agents/all, but that table shows only INSTALLED +
enabled agents. assistant.json had no metadata.autoInstall, so the
provisioner (syncDefaultAgentInstallations) skipped it — no
agentInstallations row, agent filtered out, waitForSeededOrg timed out,
and every seeded-org spec cascade-failed (~120 tests in 1-3ms each).

Add metadata.autoInstall:true (mirrors the already-working seeded
prompt). Also add empty apps/skills/branding/governance fixture dirs so
org scaffold stops logging non-fatal '[scaffold] ... does not exist'
ERRORs for the config domains added in #1911 (copyTree skips dotfiles,
so the .gitkeep stubs change no seeded behavior).
The seed fix unblocked the suite; the residual platform failures were
automation-editor.spec / automation.spec asserting on the canvas toolbar
(Add step / Test automation) via page.getByTitle(). The shared @tale/ui
Button now suppresses the native title attribute and routes title into
aria-label + a Radix tooltip, so getByTitle() matches nothing even though
the buttons render with those accessible names. Switch the 4 lookups to
getByRole('button', { name, exact }) — the stable role+label locator the
specs already use elsewhere.
@yannickmonney yannickmonney merged commit 6c8cde6 into main Jun 22, 2026
58 checks passed
@yannickmonney yannickmonney deleted the feat/chat-reliability-and-deploy-tiers branch June 22, 2026 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant