[DSP-1.2] swarm: workers can stall indefinitely (5min stall observed) — need per-worker timeout #63

@justrach

Description

Sub-issue of #55 (DSP-1). Found in smoke test of PR #61.

Observation

Smoke test of npm run swarm (n=3) had this stage timeline:

07:52:32  decompose done
07:52:33  all 3 workers started (parallel)
07:53:17  worker[1] ✓  (~44s)
07:53:34  worker[2] ✓  (~61s)
07:58:19  worker[3] ▸ Update Todos  ← woke up after FIVE MINUTES of silence
07:59:08  worker[3] ✓  (~6m35s)

Total swarm: 447s (~7.5 min) — almost all of it waiting on a single stalled worker. Workers 1 & 2 finished in ~60s.

Why it broke

Workers are awaited via Promise.all(subtasks.map(...)) with no per-worker timeout. If one worker's underlying LLM call hangs (rate-limit backoff, model thinking-loop, network glitch), the entire swarm pays the slowest-worker wall clock.

Fix direction

  1. Wrap runWorker(...) in a Promise.race([worker, timeout]). Timeout default 120s, configurable via WORKER_TIMEOUT_MS env.
  2. On timeout, return a WorkerOutput with result: "[TIMED_OUT after Nms]" and let synthesizer handle it.
  3. Update the synthesizer prompt to recognize TIMED_OUT results and treat them as missing data rather than a contradiction.

Cancellation of the in-flight graff.chat() stream is best-effort for this PR — the worker function returns the timeout marker even if the underlying request continues in the background. Real cancellation is a deeper SDK question.
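
A minimal sketch of fix steps 1 and 2, under stated assumptions: the `WorkerOutput` shape (only `result` is named in this issue; `subtask` is hypothetical), the `runWorker` signature, and the helper name `runWorkerWithTimeout` are illustrative and may not match the real code. As described above, the underlying request is not cancelled; the wrapper only stops waiting for it.

```typescript
// Hypothetical shape: the issue only confirms that WorkerOutput has a `result` field.
interface WorkerOutput {
  subtask: string;
  result: string;
}

// Races a worker against a timer. On timeout, resolves with a marker result
// instead of rejecting, so Promise.all over the wrapped workers still resolves.
async function runWorkerWithTimeout(
  subtask: string,
  runWorker: (subtask: string) => Promise<WorkerOutput>, // assumed signature
  timeoutMs: number,
): Promise<WorkerOutput> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<WorkerOutput>((resolve) => {
    timer = setTimeout(
      () => resolve({ subtask, result: `[TIMED_OUT after ${timeoutMs}ms]` }),
      timeoutMs,
    );
  });
  try {
    // Best-effort only: the losing worker promise keeps running in the background.
    return await Promise.race([runWorker(subtask), timeout]);
  } finally {
    // Clear the timer so a fast worker does not leave a pending timer behind.
    clearTimeout(timer);
  }
}
```

The existing call site would then become something like `Promise.all(subtasks.map((s) => runWorkerWithTimeout(s, runWorker, timeoutMs)))`. Resolving with a marker rather than rejecting is a design choice: a rejection would make `Promise.all` fail fast and discard the results of the workers that did finish.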

Acceptance criteria

  • WORKER_TIMEOUT_MS env var honored, default 120000.
  • A worker that exceeds the timeout doesn't block synthesis.
  • Synthesizer output explicitly notes any TIMED_OUT workers rather than ignoring them.
  • Total swarm wall-clock is bounded by decompose + WORKER_TIMEOUT_MS + synthesize (workers run in parallel, so one timeout bounds the whole worker stage).
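
The first and third criteria could be checked with two small helpers, sketched here. Both function names are hypothetical, and falling back to the default on an invalid value is an assumption (failing fast at startup would be equally defensible). The marker prefix matches the `[TIMED_OUT after Nms]` format proposed in the fix direction.

```typescript
const DEFAULT_WORKER_TIMEOUT_MS = 120_000;

// Reads WORKER_TIMEOUT_MS from an env-like object, falling back to the default
// when the variable is unset, empty, non-numeric, or not positive (assumed policy).
function readWorkerTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["WORKER_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_WORKER_TIMEOUT_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_WORKER_TIMEOUT_MS;
}

// Returns the 1-based indices of workers whose result carries the timeout marker,
// so the synthesizer prompt (or a log line) can name them explicitly.
function timedOutWorkers(outputs: { result: string }[]): number[] {
  return outputs.flatMap((o, i) => (o.result.startsWith("[TIMED_OUT") ? [i + 1] : []));
}
```

In Node this would be called as `readWorkerTimeoutMs(process.env)`. Passing the environment in as a parameter keeps the function testable without mutating global state.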

Labels

severity: high (Significant impact; core functionality is impaired.)
type: bug (Something isn't working.)