Summary
tale deploy hard-fails (exit 1, no retry) when the Convex push hits a module-analyze timeout — even though the condition is transient and a plain re-run succeeds with zero code changes. The deploy entrypoint classifies this error as unclassified and skips the auto-retry path that every other transient failure already uses.
Observed error
During a normal tale deploy, the platform-blue color aborted on the Convex push step:
platform-blue ✖ Error: Unable to start push to http://convex:3210
platform-blue ✖ Error fetching POST http://convex:3210/api/deploy2/evaluate_push 400 Bad Request:
InvalidModules: Hit an error while pushing:
Loading the pushed modules encountered the following error:
Function execution timed out (maximum duration: 2s)
platform-blue ERROR Convex deploy failed (exit code: 1)
platform-blue ━━━ Error diagnosis ━━━
platform-blue Reason: unclassified. See full deploy log above.
fix: try RUST_LOG=debug on the convex service and re-run deploy.
Failure occurred before the blue-green traffic switch, so the live color kept serving (no rollback needed). Re-running tale deploy immediately succeeded with no code change.
Root cause
…/api/deploy2/evaluate_push … Function execution timed out (maximum duration: 2s) is Convex's analyze step: on push, the backend imports each function module once in a V8 isolate to discover its exports, under a hard ~2s per-module evaluation budget.
- This is not the runtime/action timeout (ACTIONS_USER_TIMEOUT_SECS=1800) nor the deploy wall-clock (CONVEX_DEPLOY_TIMEOUT=300). The 2s analyze budget is baked into the pinned convex-backend image and is not configurable from this repo.
- Investigation found no top-level await and no genuinely 2s-heavy module-scope work anywhere under services/platform/convex/: the heaviest top-level costs (a single renderPrompt, a few compiled regexes, auth role construction) are all sub-millisecond. None of them is a deterministic 2s offender, and none changed in the deployed version.
- The convex container has no CPU limit (compose.yml, mem_limit: 12g only). During a deploy it competes for CPU with concurrent image pulls and the in-flight chat-generation drain. Under that contention a module that normally analyzes in ~1.5s tips over 2s, which is exactly why it is non-deterministic and why an immediate re-run (images already local, less contention) succeeds.
Why this is a bug
In services/platform/docker-entrypoint.sh, the deploy error classifier routes genuinely-transient failures (RaceDetected, SearchIndexesUnavailable, fetch failed/ECONNREFUSED) to retry=true, which then runs a 3× exponential-backoff retry loop (services/platform/docker-entrypoint.sh:485-508).
The analyze timeout instead falls through to the final else branch (services/platform/docker-entrypoint.sh:477-480):
else
log_error "Reason: unclassified. See full deploy log above."
echo " fix: try RUST_LOG=debug on the convex service and re-run deploy."
fi
retry stays false, so the deploy exits 1 immediately and forces a manual re-run — for a condition we've shown is transient and self-healing. This makes deploys flaky for operators with no upside.
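For readers without the script open, the retry path that the analyze timeout currently bypasses has roughly the following shape. This is a minimal sketch, not the code at docker-entrypoint.sh:485-508: the function name retry_with_backoff, its arguments, and the doubling delay are assumptions made for illustration.

```shell
# Hypothetical helper (not from docker-entrypoint.sh): run a command up to
# max_attempts times, doubling the sleep between attempts.
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"
  shift 2
  local attempt=1 delay="$base_delay"
  while true; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

A caller would wrap the push command, for example retry_with_backoff 3 2 followed by the push command, so that a transient failure costs a few seconds of waiting rather than a manual re-run.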
Proposed fix
Classify the analyze timeout as retryable so it reuses the existing backoff loop, mirroring the SearchIndexesUnavailable precedent (services/platform/docker-entrypoint.sh:453-456). Add a branch before the final else in deploy_convex_functions:
elif grep -q "InvalidModules" "$deploy_log" && grep -q "Function execution timed out" "$deploy_log"; then
log_error "Reason: module analyze timed out (2s isolate budget) — usually transient deploy-time CPU contention"
echo " → A module's import-time evaluation exceeded the 2s analyze budget,"
echo " typically because convex was CPU-starved during image pulls / chat drain."
echo " fix: auto-retrying; if persistent, set RUST_LOG=debug to find the slow module and defer its top-level work."
retry=true
Matching InvalidModules and Function execution timed out together is specific to the push-analyze timeout and won't catch unrelated runtime timeouts.
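The two-pattern match can be checked in isolation against a captured log. The sketch below wraps the same pair of grep calls in a helper; the name is_analyze_timeout is hypothetical and does not exist in the entrypoint.

```shell
# Hypothetical helper: succeeds only when the log contains BOTH markers
# of the push-analyze timeout.
is_analyze_timeout() {
  local deploy_log="$1"
  grep -q "InvalidModules" "$deploy_log" &&
    grep -q "Function execution timed out" "$deploy_log"
}
```

Note that both patterns are matched anywhere in the file rather than within the same error block, so the specificity claim assumes the log holds only the output of a single push attempt.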
Secondary hardening (optional, not required)
- Give the convex container some CPU priority during the deploy window (it currently has no CPU reservation), or stagger image pulls vs. the push, to reduce how often the 2s budget is tipped in the first place.
- The auto-retry above is the primary fix; this is belt-and-suspenders.
Acceptance criteria
- A Convex push that fails with InvalidModules + Function execution timed out (maximum duration: 2s) is logged as a transient/retryable reason and auto-retried up to 3× with backoff, instead of exiting 1.
- A deploy that previously needed a manual re-run now self-heals on retry.
- Non-transient push failures (InvalidSchema, ModulesTooLarge, AuthConfigMissingEnvironmentVariable, etc.) still fail fast, with no change to their behavior.
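One way to exercise these criteria without a live Convex backend is to feed canned log files through the retry decision. The sketch below is an assumption-heavy stand-in for the real classifier: push_retry_decision is a hypothetical name, and the grep patterns are taken from the error names listed in this issue rather than from the script itself.

```shell
# Hypothetical stand-in for the classifier's retry decision.
# Prints "retry" for the transient families named in this issue,
# "fail-fast" for everything else.
push_retry_decision() {
  local deploy_log="$1"
  if grep -qE "RaceDetected|SearchIndexesUnavailable|fetch failed|ECONNREFUSED" "$deploy_log"; then
    echo "retry"
  elif grep -q "InvalidModules" "$deploy_log" &&
       grep -q "Function execution timed out" "$deploy_log"; then
    # New branch proposed in this issue: push-analyze timeout.
    echo "retry"
  else
    echo "fail-fast"
  fi
}
```

Running it against a log containing the observed error should print retry, while a log containing only InvalidSchema or ModulesTooLarge should print fail-fast.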
References
services/platform/docker-entrypoint.sh:411-512 — push error classifier + retry loop
services/platform/docker-entrypoint.sh:477-480 — current unclassified fallthrough (no retry)
services/platform/docker-entrypoint.sh:453-456 — SearchIndexesUnavailable retryable precedent
compose.yml (convex service) — mem_limit: 12g, no CPU limit, env_file: - .env