Bug: tale deploy hard-fails on transient Convex module-analyze timeout instead of auto-retrying

## Summary

`tale deploy` hard-fails (exit 1, no retry) when the Convex push hits a **module-analyze timeout** — even though the condition is transient and a plain re-run succeeds with zero code changes. The deploy entrypoint classifies this error as `unclassified` and skips the auto-retry path that every other transient failure already uses.

## Observed error

During a normal `tale deploy`, the `platform-blue` color aborted on the Convex push step:

```
platform-blue  ✖ Error: Unable to start push to http://convex:3210
platform-blue  ✖ Error fetching POST http://convex:3210/api/deploy2/evaluate_push 400 Bad Request:
               InvalidModules: Hit an error while pushing:
               Loading the pushed modules encountered the following error:
               Function execution timed out (maximum duration: 2s)
platform-blue  ERROR Convex deploy failed (exit code: 1)
platform-blue  ━━━ Error diagnosis ━━━
platform-blue  Reason: unclassified. See full deploy log above.
               fix: try RUST_LOG=debug on the convex service and re-run deploy.
```

Failure occurred **before** the blue-green traffic switch, so the live color kept serving (no rollback needed). **Re-running `tale deploy` immediately succeeded with no code change.**

## Root cause

- `…/api/deploy2/evaluate_push … Function execution timed out (maximum duration: 2s)` is Convex's **analyze step**: on push, the backend imports each function module once in a V8 isolate to discover its exports, under a hard ~2s per-module evaluation budget.
- This is **not** the runtime/action timeout (`ACTIONS_USER_TIMEOUT_SECS=1800`) nor the deploy wall-clock (`CONVEX_DEPLOY_TIMEOUT=300`). The 2s analyze budget is baked into the pinned convex-backend image and is not configurable from this repo.
- Investigation found **no top-level `await`** and **no genuinely 2s-heavy module-scope work** anywhere under `services/platform/convex/` — the heaviest top-level costs (a single `renderPrompt`, a few compiled regexes, auth role construction) are all sub-millisecond. None of them is a deterministic 2s offender, and none changed in the deployed version.
- The convex container has **no CPU limit** (`compose.yml`, `mem_limit: 12g` only). During a deploy it competes for CPU with concurrent image pulls and the in-flight chat-generation drain. Under that contention a module that normally analyzes in ~1.5s tips over 2s — which is exactly why it is non-deterministic and why an immediate re-run (images already local, less contention) succeeds.

## Why this is a bug

In `services/platform/docker-entrypoint.sh`, the deploy error classifier routes genuinely-transient failures (`RaceDetected`, `SearchIndexesUnavailable`, `fetch failed`/`ECONNREFUSED`) to `retry=true`, which then runs a 3× exponential-backoff retry loop (`services/platform/docker-entrypoint.sh:485-508`).

The analyze timeout instead falls through to the final `else` branch (`services/platform/docker-entrypoint.sh:477-480`):

```bash
else
  log_error "Reason: unclassified. See full deploy log above."
  echo "  fix: try RUST_LOG=debug on the convex service and re-run deploy."
fi
```

`retry` stays `false`, so the deploy **exits 1 immediately** and forces a manual re-run — for a condition we've shown is transient and self-healing. This makes deploys flaky for operators with no upside.

## Proposed fix

Classify the analyze timeout as retryable so it reuses the existing backoff loop, mirroring the `SearchIndexesUnavailable` precedent (`services/platform/docker-entrypoint.sh:453-456`). Add a branch **before** the final `else` in `deploy_convex_functions`:

```bash
elif grep -q "InvalidModules" "$deploy_log" && grep -q "Function execution timed out" "$deploy_log"; then
  log_error "Reason: module analyze timed out (2s isolate budget) — usually transient deploy-time CPU contention"
  echo "  → A module's import-time evaluation exceeded the 2s analyze budget,"
  echo "    typically because convex was CPU-starved during image pulls / chat drain."
  echo "  fix: auto-retrying; if persistent, set RUST_LOG=debug to find the slow module and defer its top-level work."
  retry=true
```

Matching `InvalidModules` **and** `Function execution timed out` together is specific to the push-analyze timeout and won't catch unrelated runtime timeouts.

## Secondary hardening (optional, not required)

- Give the convex container some CPU priority during the deploy window (it currently has no CPU reservation), or stagger image pulls vs. the push, to reduce how often the 2s budget is tipped in the first place.
- The auto-retry above is the primary fix; this is belt-and-suspenders.

## Acceptance criteria

- A Convex push that fails with `InvalidModules` + `Function execution timed out (maximum duration: 2s)` is logged as a transient/retryable reason and auto-retried up to 3× with backoff, instead of exiting 1.
- A deploy that previously needed a manual re-run now self-heals on retry.
- Non-transient push failures (`InvalidSchema`, `ModulesTooLarge`, `AuthConfigMissingEnvironmentVariable`, etc.) still fail fast — no change to their behavior.

## References

- `services/platform/docker-entrypoint.sh:411-512` — push error classifier + retry loop
- `services/platform/docker-entrypoint.sh:477-480` — current `unclassified` fallthrough (no retry)
- `services/platform/docker-entrypoint.sh:453-456` — `SearchIndexesUnavailable` retryable precedent
- `compose.yml` (convex service) — `mem_limit: 12g`, no CPU limit, `env_file: - .env`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: tale deploy hard-fails on transient Convex module-analyze timeout instead of auto-retrying #2117

Summary

Observed error

Root cause

Why this is a bug

Proposed fix

Secondary hardening (optional, not required)

Acceptance criteria

References

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Bug: tale deploy hard-fails on transient Convex module-analyze timeout instead of auto-retrying #2117

Description

Summary

Observed error

Root cause

Why this is a bug

Proposed fix

Secondary hardening (optional, not required)

Acceptance criteria

References

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions