Sub-issue of #55 (DSP-1). Found in smoke test of PR #61.
Observation
Smoke test of npm run swarm (n=3) had this stage timeline:
07:52:32 decompose done
07:52:33 all 3 workers started (parallel)
07:53:17 worker[1] ✓ (~44s)
07:53:34 worker[2] ✓ (~61s)
07:58:19 worker[3] ▸ Update Todos ← woke up after FIVE MINUTES of silence
07:59:08 worker[3] ✓ (~6m35s)
Total swarm: 447s (~7.5 min) — almost all of it waiting on a single stalled worker. Workers 1 & 2 finished in ~60s.
Why it broke
Workers are awaited via Promise.all(subtasks.map(...)) with no per-worker timeout. If one worker's underlying LLM call hangs (rate-limit backoff, model thinking-loop, network glitch), the entire swarm pays the slowest-worker wall clock.
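The effect can be reproduced in isolation. The sketch below is not the project's code: `simulateWorker` and `runSwarm` are hypothetical stand-ins that replace the real LLM call with a timer, and it assumes only that `Promise.all` resolves once every input promise has resolved.

```typescript
// Hypothetical stand-in for a worker: resolves after `ms` milliseconds.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function simulateWorker(id: number, ms: number): Promise<string> {
  await sleep(ms);
  return `worker[${id}] done`;
}

// Awaits all workers with Promise.all and reports the elapsed wall clock.
async function runSwarm(
  durationsMs: number[],
): Promise<{ results: string[]; elapsedMs: number }> {
  const start = Date.now();
  const results = await Promise.all(
    durationsMs.map((ms, i) => simulateWorker(i + 1, ms)),
  );
  return { results, elapsedMs: Date.now() - start };
}
```

Calling `runSwarm([10, 20, 80])` takes roughly as long as the 80 ms worker, even though the other two finish much sooner, which mirrors the timeline above.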
Fix direction
- Wrap runWorker(...) in a Promise.race([worker, timeout]). Timeout default 120s, configurable via WORKER_TIMEOUT_MS env.
- On timeout, return a WorkerOutput with result: "[TIMED_OUT after Nms]" and let synthesizer handle it.
- Update the synthesizer prompt to recognize TIMED_OUT results and treat them as missing data rather than a contradiction.
Cancellation of the in-flight graff.chat() stream is best-effort for this PR — the worker function returns the timeout marker even if the underlying request continues in the background. Real cancellation is a deeper SDK question.
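A minimal sketch of the fix direction, under stated assumptions: the `WorkerOutput` shape is reduced to the single `result` field named in this issue, and `resolveTimeoutMs`, `timedOutOutput`, `isTimedOut`, and `withTimeout` are hypothetical helper names, not existing code. As described above, the wrapper does not cancel the underlying request; it only stops waiting for it.

```typescript
// Assumed shape: the issue only names `WorkerOutput` and its `result` field.
interface WorkerOutput {
  result: string;
}

const DEFAULT_WORKER_TIMEOUT_MS = 120_000;

// Reads WORKER_TIMEOUT_MS from an env-like map; falls back to the default
// when the value is missing, non-numeric, or not positive.
function resolveTimeoutMs(env: Record<string, string | undefined>): number {
  const parsed = Number(env.WORKER_TIMEOUT_MS);
  return Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_WORKER_TIMEOUT_MS;
}

// Builds the marker output returned in place of a stalled worker's result.
function timedOutOutput(ms: number): WorkerOutput {
  return { result: `[TIMED_OUT after ${ms}ms]` };
}

// Lets downstream code (e.g. the synthesizer) detect the marker.
function isTimedOut(output: WorkerOutput): boolean {
  return output.result.startsWith("[TIMED_OUT");
}

// Races the worker against a timer. The worker promise is NOT cancelled;
// it may keep running in the background after the marker is returned.
async function withTimeout(
  worker: Promise<WorkerOutput>,
  ms: number,
): Promise<WorkerOutput> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<WorkerOutput>((resolve) => {
    timer = setTimeout(() => resolve(timedOutOutput(ms)), ms);
  });
  try {
    return await Promise.race([worker, timeout]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer when the worker wins
  }
}
```

In the swarm this might be used as `Promise.all(subtasks.map((s) => withTimeout(runWorker(s), resolveTimeoutMs(process.env))))`, assuming `runWorker` returns a `Promise<WorkerOutput>`. A worker that rejects still rejects the race, so error handling is unchanged by this sketch.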
Acceptance criteria
- WORKER_TIMEOUT_MS env var honored, default 120000.
- Total swarm wall clock bounded by decompose + max(worker_timeout) + synthesize.
Related