Bug: tale deploy hard-fails on transient Convex module-analyze timeout instead of auto-retrying #2117

@larryro

Summary

tale deploy hard-fails (exit 1, no retry) when the Convex push hits a module-analyze timeout, even though the condition is transient and a plain re-run succeeds with zero code changes. The deploy entrypoint's error classifier does not recognize this error, so it reports it as unclassified and skips the auto-retry path that other transient failures already use.

Observed error

During a normal tale deploy, the platform-blue color aborted on the Convex push step:

platform-blue  ✖ Error: Unable to start push to http://convex:3210
platform-blue  ✖ Error fetching POST http://convex:3210/api/deploy2/evaluate_push 400 Bad Request:
               InvalidModules: Hit an error while pushing:
               Loading the pushed modules encountered the following error:
               Function execution timed out (maximum duration: 2s)
platform-blue  ERROR Convex deploy failed (exit code: 1)
platform-blue  ━━━ Error diagnosis ━━━
platform-blue  Reason: unclassified. See full deploy log above.
               fix: try RUST_LOG=debug on the convex service and re-run deploy.

The failure occurred before the blue-green traffic switch, so the live color kept serving and no rollback was needed. Re-running tale deploy immediately succeeded with no code change.

Root cause

  • …/api/deploy2/evaluate_push … Function execution timed out (maximum duration: 2s) comes from Convex's analyze step: on push, the backend imports each function module once in a V8 isolate to discover its exports, under a hard 2s per-module evaluation budget.
  • This is not the runtime/action timeout (ACTIONS_USER_TIMEOUT_SECS=1800) nor the deploy wall-clock (CONVEX_DEPLOY_TIMEOUT=300). The 2s analyze budget is baked into the pinned convex-backend image and is not configurable from this repo.
  • Investigation found no top-level await and no genuinely 2s-heavy module-scope work anywhere under services/platform/convex/ — the heaviest top-level costs (a single renderPrompt, a few compiled regexes, auth role construction) are all sub-millisecond. None of them is a deterministic 2s offender, and none changed in the deployed version.
  • The convex container has no CPU limit (compose.yml, mem_limit: 12g only). During a deploy it competes for CPU with concurrent image pulls and the in-flight chat-generation drain. Under that contention a module that normally analyzes in ~1.5s tips over 2s — which is exactly why it is non-deterministic and why an immediate re-run (images already local, less contention) succeeds.

Why this is a bug

In services/platform/docker-entrypoint.sh, the deploy error classifier routes genuinely-transient failures (RaceDetected, SearchIndexesUnavailable, fetch failed/ECONNREFUSED) to retry=true, which then runs a 3× exponential-backoff retry loop (services/platform/docker-entrypoint.sh:485-508).

The analyze timeout instead falls through to the final else branch (services/platform/docker-entrypoint.sh:477-480):

else
  log_error "Reason: unclassified. See full deploy log above."
  echo "  fix: try RUST_LOG=debug on the convex service and re-run deploy."
fi

retry stays false, so the deploy exits 1 immediately and forces a manual re-run — for a condition we've shown is transient and self-healing. This makes deploys flaky for operators with no upside.
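
For readers without the entrypoint open, the control flow that retry=true feeds into can be sketched roughly as below. This is not the code from docker-entrypoint.sh: retry_with_backoff, MAX_ATTEMPTS, and BACKOFF_BASE_SECS are hypothetical names, the delays are made up, and the sketch counts three attempts in total, which may or may not match how the real loop counts its 3×.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a bounded exponential-backoff retry loop.
# Names and delays are illustrative, not taken from docker-entrypoint.sh.

MAX_ATTEMPTS="${MAX_ATTEMPTS:-3}"
BACKOFF_BASE_SECS="${BACKOFF_BASE_SECS:-5}"

# retry_with_backoff CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS attempts have been made.
# Sleeps base, 2*base, 4*base, ... seconds between attempts.
retry_with_backoff() {
  local attempt=1
  local delay="$BACKOFF_BASE_SECS"
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

With a loop of this shape, an analyze timeout caused by a short burst of contention gets a second and third chance after the contention has likely passed, which is the same thing the manual re-run did.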

Proposed fix

Classify the analyze timeout as retryable so it reuses the existing backoff loop, mirroring the SearchIndexesUnavailable precedent (services/platform/docker-entrypoint.sh:453-456). Add a branch before the final else in deploy_convex_functions:

elif grep -q "InvalidModules" "$deploy_log" && grep -q "Function execution timed out" "$deploy_log"; then
  log_error "Reason: module analyze timed out (2s isolate budget) — usually transient deploy-time CPU contention"
  echo "  → A module's import-time evaluation exceeded the 2s analyze budget,"
  echo "    typically because convex was CPU-starved during image pulls / chat drain."
  echo "  fix: auto-retrying; if persistent, set RUST_LOG=debug to find the slow module and defer its top-level work."
  retry=true

Matching InvalidModules and Function execution timed out together is specific to the push-analyze timeout and won't catch unrelated runtime timeouts.
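
The two-marker condition can be pulled out into a small predicate and tried against sample logs. The helper name is_analyze_timeout is hypothetical; the real branch would stay inline in deploy_convex_functions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when the deploy log contains BOTH
# markers of the push-analyze timeout.
is_analyze_timeout() {
  local deploy_log="$1"
  grep -q "InvalidModules" "$deploy_log" &&
    grep -q "Function execution timed out" "$deploy_log"
}
```

One caveat: both greps scan the whole file, so they would also match if the two strings appeared in unrelated parts of the same log. That seems acceptable if the log only holds the output of a single push attempt, which is an assumption worth checking in the entrypoint.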

Secondary hardening (optional, not required)

  • Give the convex container some CPU priority during the deploy window (it currently has no CPU reservation), or stagger image pulls vs. the push, to reduce how often the 2s budget is tipped in the first place.
  • The auto-retry above is the primary fix; this is belt-and-suspenders.
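
One possible way to do the CPU-priority part is to raise the container's relative CPU weight only for the duration of the push, using docker update --cpu-shares. A rough sketch, under several assumptions: the container is literally named convex, the wrapper runs somewhere the docker CLI can reach the daemon (probably the host, not the platform container), and the names with_cpu_priority, DOCKER_BIN, BOOSTED_CPU_SHARES, and DEFAULT_CPU_SHARES are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: temporarily raise the convex container's CPU weight
# while a command runs, then restore Docker's default weight (1024).
# Container name and share values are assumptions, not from the repo.

DOCKER_BIN="${DOCKER_BIN:-docker}"
CONVEX_CONTAINER="${CONVEX_CONTAINER:-convex}"
BOOSTED_CPU_SHARES="${BOOSTED_CPU_SHARES:-2048}"
DEFAULT_CPU_SHARES="${DEFAULT_CPU_SHARES:-1024}"

# with_cpu_priority CMD [ARGS...]
# Best-effort: a failed docker update only warns, the command still runs.
with_cpu_priority() {
  local rc=0
  "$DOCKER_BIN" update --cpu-shares "$BOOSTED_CPU_SHARES" "$CONVEX_CONTAINER" >/dev/null ||
    echo "warn: could not raise CPU shares; continuing" >&2
  "$@" || rc=$?
  # Restore the default weight even if the wrapped command failed.
  "$DOCKER_BIN" update --cpu-shares "$DEFAULT_CPU_SHARES" "$CONVEX_CONTAINER" >/dev/null ||
    echo "warn: could not restore CPU shares" >&2
  return "$rc"
}
```

CPU shares are a relative weight that only takes effect when containers are competing for CPU, so this should not cap anything during normal operation. It also does not guarantee the analyze step stays under 2s; it just tilts the scheduler in convex's favor during the window where the timeout was observed.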

Acceptance criteria

  • A Convex push that fails with InvalidModules + Function execution timed out (maximum duration: 2s) is logged as a transient/retryable reason and auto-retried up to 3× with backoff, instead of exiting 1.
  • A deploy that previously needed a manual re-run now self-heals on retry.
  • Non-transient push failures (InvalidSchema, ModulesTooLarge, AuthConfigMissingEnvironmentVariable, etc.) still fail fast — no change to their behavior.
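
These criteria could be turned into a small fixture-driven check. The sketch below uses a simplified stand-in for the classifier (should_retry and expect_decision are hypothetical names, and the fixtures are bare error names rather than full Convex output); a real regression test would exercise deploy_convex_functions itself.

```shell
#!/usr/bin/env bash
# Simplified stand-in for the entrypoint's classifier (hypothetical).
# Returns 0 when the log should be auto-retried, 1 when it should fail fast.
should_retry() {
  local deploy_log="$1"
  # Failure classes the issue lists as already retryable.
  if grep -qE "RaceDetected|SearchIndexesUnavailable|fetch failed|ECONNREFUSED" "$deploy_log"; then
    return 0
  fi
  # Proposed addition: push-analyze timeout (both markers required).
  if grep -q "InvalidModules" "$deploy_log" &&
     grep -q "Function execution timed out" "$deploy_log"; then
    return 0
  fi
  return 1
}

# expect_decision EXPECTED LINE...
# Writes the lines to a temporary log and succeeds only if should_retry's
# decision equals EXPECTED ("retry" or "fail").
expect_decision() {
  local expected="$1"
  shift
  local log
  log="$(mktemp)"
  printf '%s\n' "$@" > "$log"
  local actual="fail"
  if should_retry "$log"; then
    actual="retry"
  fi
  rm -f "$log"
  [ "$actual" = "$expected" ]
}
```

Example calls covering the first and third criteria:

```shell
expect_decision retry "InvalidModules: Hit an error while pushing:" \
  "Function execution timed out (maximum duration: 2s)"
expect_decision fail "InvalidSchema"
expect_decision fail "ModulesTooLarge"
expect_decision fail "AuthConfigMissingEnvironmentVariable"
```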

References

  • services/platform/docker-entrypoint.sh:411-512 — push error classifier + retry loop
  • services/platform/docker-entrypoint.sh:477-480 — current unclassified fallthrough (no retry)
  • services/platform/docker-entrypoint.sh:453-456 — SearchIndexesUnavailable retryable precedent
  • compose.yml (convex service) — mem_limit: 12g, no CPU limit, env_file: - .env
