[DSP-1.2] swarm: workers can stall indefinitely (5min stall observed); need per-worker timeout #63

Sub-issue of #55 (DSP-1). Found in smoke test of PR #61.

## Observation

Smoke test of `npm run swarm` (n=3) had this stage timeline:

```
07:52:32  decompose done
07:52:33  all 3 workers started (parallel)
07:53:17  worker[1] ✓  (~44s)
07:53:34  worker[2] ✓  (~61s)
07:58:19  worker[3] ▸ Update Todos  ← woke up after FIVE MINUTES of silence
07:59:08  worker[3] ✓  (~6m35s)
```

Total swarm: 447s (~7.5 min) — almost all of it waiting on a single stalled worker. Workers 1 & 2 finished in ~60s.

## Why it broke

Workers are awaited via `Promise.all(subtasks.map(...))` with no per-worker timeout. `Promise.all` fulfills only after every worker promise has fulfilled, so if one worker's underlying LLM call hangs (rate-limit backoff, model thinking loop, network glitch), the entire swarm waits for the slowest worker.

## Fix direction

1. Wrap `runWorker(...)` in a `Promise.race([worker, timeout])`. Timeout default 120s, configurable via `WORKER_TIMEOUT_MS` env.
2. On timeout, return a `WorkerOutput` with `result: "[TIMED_OUT after Nms]"` and let the synthesizer handle it.
3. Update the synthesizer prompt to recognize TIMED_OUT results and treat them as missing data rather than a contradiction.

Cancellation of the in-flight `graff.chat()` stream is best-effort for this PR — the worker function returns the timeout marker even if the underlying request continues in the background. Real cancellation is a deeper SDK question.
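A minimal sketch of steps 1 and 2, to make the proposal concrete. The `WorkerOutput` fields, the helper names (`parseWorkerTimeoutMs`, `timedOutMarker`, `runWorkerWithTimeout`), and the `runWorker` signature are assumptions for illustration; the real types in the repo may differ. As noted above, the losing worker promise is not cancelled, it is only ignored.

```typescript
// Assumed shape: the issue does not show the real WorkerOutput type.
interface WorkerOutput {
  subtask: string;
  result: string;
}

const DEFAULT_WORKER_TIMEOUT_MS = 120000;

// Parse the raw WORKER_TIMEOUT_MS value, falling back to the default when it
// is missing, non-numeric, or not a positive number.
function parseWorkerTimeoutMs(raw: string | undefined): number {
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_WORKER_TIMEOUT_MS;
}

// Marker string the synthesizer can recognize.
function timedOutMarker(timeoutMs: number): string {
  return `[TIMED_OUT after ${timeoutMs}ms]`;
}

// Race the worker against a timer. If the timer wins, resolve with a marker
// output instead of rejecting, so Promise.all over the workers still fulfills.
async function runWorkerWithTimeout(
  subtask: string,
  runWorker: (subtask: string) => Promise<WorkerOutput>, // hypothetical signature
  timeoutMs: number,
): Promise<WorkerOutput> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<WorkerOutput>((resolve) => {
    timer = setTimeout(
      () => resolve({ subtask, result: timedOutMarker(timeoutMs) }),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([runWorker(subtask), timeout]);
  } finally {
    // Clear the timer so a fast worker does not keep the process alive.
    clearTimeout(timer);
  }
}
```

At the call site this would presumably look like `Promise.all(subtasks.map((s) => runWorkerWithTimeout(s, runWorker, parseWorkerTimeoutMs(process.env.WORKER_TIMEOUT_MS))))`. Resolving with a marker rather than rejecting keeps the results of the workers that did finish.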

## Acceptance criteria

- [ ] `WORKER_TIMEOUT_MS` env var honored, default 120000.
- [ ] A worker that exceeds the timeout doesn't block synthesis.
- [ ] Synthesizer output explicitly notes any TIMED_OUT workers rather than ignoring them.
- [ ] Total swarm wall-clock is bounded above by `decompose + WORKER_TIMEOUT_MS + synthesize` (workers run in parallel, so the worker phase costs at most one timeout).
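
One way the third criterion might be supported on the code side is to separate timed-out outputs before building the synthesizer prompt, so the prompt can name them explicitly. This is a sketch only: the `WorkerOutput` fields, the helper names, and the note wording are all assumptions, and the marker format is taken from the `[TIMED_OUT after Nms]` string proposed above.

```typescript
// Assumed shape: the issue does not show the real WorkerOutput type.
interface WorkerOutput {
  subtask: string;
  result: string;
}

// Matches the marker proposed in "Fix direction", e.g. "[TIMED_OUT after 120000ms]".
const TIMED_OUT_PATTERN = /^\[TIMED_OUT after \d+ms\]$/;

function isTimedOut(output: WorkerOutput): boolean {
  return TIMED_OUT_PATTERN.test(output.result);
}

// Split outputs so the synthesizer only reasons over real results.
function partitionOutputs(outputs: WorkerOutput[]): {
  completed: WorkerOutput[];
  timedOut: WorkerOutput[];
} {
  return {
    completed: outputs.filter((o) => !isTimedOut(o)),
    timedOut: outputs.filter(isTimedOut),
  };
}

// Hypothetical prompt fragment telling the synthesizer which data is missing.
function timedOutNote(timedOut: WorkerOutput[]): string {
  if (timedOut.length === 0) return "";
  const names = timedOut.map((o) => o.subtask).join(", ");
  return (
    `Note: ${timedOut.length} worker(s) timed out and produced no data: ${names}. ` +
    `Treat these as missing data, not as contradictions.`
  );
}
```

Matching the whole marker string, rather than searching for `TIMED_OUT` anywhere in the result, avoids misclassifying a worker whose legitimate answer happens to mention timeouts.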

## Related

- Observability: #28 trajectory_events SQLite table, #33 subagent runs bypass TrajectoryRecorder — once those land, we can see *why* a worker stalled (rate-limit retry vs thinking loop vs network).
