fix(design): detect all-dead personas and cancel immediately#76
fix(design): detect all-dead personas and cancel immediately#76kwliang1 wants to merge 1 commit into
Conversation
When all design personas crash, the system previously waited 5–20 minutes for timeouts before cancelling. Now it detects the all-dead state on each disconnect and cancels immediately with a clear retry message. Also fixes the "Continuing with N remaining personas" count — it was always showing total-1 because it only excluded the currently disconnecting persona, not previously dead ones. Now counts only personas with a live transport connection. Removes the `> 0` guard on questionsExpected/proposalsExpected/ refinementExpected checks — since the all-dead early return handles the zero case, these guards were preventing legitimate "last persona disconnected but some had already submitted" from advancing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| process.stderr.write(`daemon: design: all personas dead — cancelling\n`) | ||
| void gateway.send(threadId, `All personas crashed. Design cancelled.\nUse \`design: ${state.topic}\` to retry.`).catch(() => {}) | ||
| void cancelDesign(threadId) | ||
| return |
There was a problem hiding this comment.
Blocker — cancelDesign is not idempotent; double-call race on near-simultaneous deaths
When two personas disconnect near-simultaneously, both compute aliveCount === 0 (both transports already removed) and both call cancelDesign(threadId). Looking at cancelDesign (line 327): it checks designs.get(threadId) and returns if !state, but designs.delete happens after await cleanupDesignSessions. So the second call enters before the first finishes, finds state still present (with phase === 'cancelled' but no guard for that), and runs cleanup + sends duplicate "Design session cancelled" messages.
Fix: Add an idempotency guard at the top of cancelDesign:
if (state.phase === 'cancelled' || state.phase === 'complete') returnOr set state.phase = 'cancelled' synchronously here before calling cancelDesign.
(flagged by both reviewers)
| process.stderr.write(`daemon: design: all personas dead — cancelling\n`) | ||
| void gateway.send(threadId, `All personas crashed. Design cancelled.\nUse \`design: ${state.topic}\` to retry.`).catch(() => {}) | ||
| void cancelDesign(threadId) | ||
| return |
There was a problem hiding this comment.
Should-fix — state machine bypass
Every other phase-ending path in this function routes through designMachine.transition(). The all-dead path calls cancelDesign directly, bypassing the state machine. If the machine enforces invariants on exit (guard conditions, terminal state marking), they're skipped.
Consider routing through the machine with an explicit cancel event, or at minimum verify cancelDesign covers all the machine's exit invariants.
| process.stderr.write(`daemon: design: all personas dead — cancelling\n`) | ||
| void gateway.send(threadId, `All personas crashed. Design cancelled.\nUse \`design: ${state.topic}\` to retry.`).catch(() => {}) | ||
| void cancelDesign(threadId) | ||
| return |
There was a problem hiding this comment.
Should-fix — missing .catch() on fire-and-forget cancelDesign
cancelDesign is async and gateway.send inside it (line 340) is awaited without .catch(). The void operator here discards the promise, producing an unhandled rejection if it throws.
The singleton-role disconnect path (line ~688) already uses the correct pattern:
void cancelDesign(threadId).catch(e => process.stderr.write(`daemon: cancelDesign failed: ${e}\n`))This should match.
| } else if (state.phase === 'independent' && !persona.proposed) { | ||
| state.proposalsExpected-- | ||
| if (state.proposalsExpected > 0 && state.proposalsReceived >= state.proposalsExpected) { | ||
| if (state.proposalsReceived >= state.proposalsExpected) { |
There was a problem hiding this comment.
Should-fix (pre-existing bug, good catch) — old code was a tautology
The old line was state.proposalsExpected > 0 && state.proposalsExpected >= state.proposalsExpected — comparing proposalsExpected to itself, always true when > 0. This meant the independent phase was transitioning on the first disconnect regardless of proposalsReceived.
The fix to state.proposalsReceived >= state.proposalsExpected is correct. Worth a quick check that no downstream behavior was accidentally depending on the premature transition.
| if (persona) { | ||
| process.stderr.write(`daemon: design: ${label} disconnected/died\n`) | ||
| void gateway.send(threadId, `_${label} disconnected. Continuing with ${state.personas.filter(p => p.sessionId !== sessionId).length} remaining personas._`).catch(() => {}) | ||
| const aliveCount = state.personas.filter(p => p.sessionId !== sessionId && transport.has(p.sessionId)).length |
There was a problem hiding this comment.
Nit — transport layer coupling deepens
transport.has(p.sessionId) reaches across the module boundary into the transport layer to make liveness decisions. Not a new pattern in this file, but this deepens the coupling. A session.isAlive(sessionId) abstraction would let the session lifecycle layer own the liveness definition. Non-urgent.
| void gateway.send(threadId, `_${label} disconnected. Continuing with ${state.personas.filter(p => p.sessionId !== sessionId).length} remaining personas._`).catch(() => {}) | ||
| const aliveCount = state.personas.filter(p => p.sessionId !== sessionId && transport.has(p.sessionId)).length | ||
| process.stderr.write(`daemon: design: ${label} disconnected/died (${aliveCount} alive)\n`) | ||
| void gateway.send(threadId, `_${label} disconnected. ${aliveCount > 0 ? `Continuing with ${aliveCount} remaining persona${aliveCount !== 1 ? 's' : ''}.` : 'All personas dead.'}_`).catch(() => {}) |
There was a problem hiding this comment.
Nit — italic markdown may not close cleanly
When aliveCount === 0, the message becomes _... All personas dead._ — the closing _ is jammed against the period. Some markdown renderers may not close the italic. Consider a space or restructure:
`_${label} disconnected._` // always close italic here
// then send the all-dead message separately (which you already do on lines 649-650)
Summary
total - 1> 0guard on expected-count checks — the all-dead early return handles that case, and the guard was preventing legitimate "last persona disconnected but others already submitted" from advancingTest plan
bun test(268 pass, 0 fail)🤖 Generated with Claude Code