feat(design): auto-retry crashed persona once before giving up#77
feat(design): auto-retry crashed persona once before giving up#77kwliang1 wants to merge 2 commits into
Conversation
When all design personas crash, the system previously waited 5–20 minutes for timeouts before cancelling. Now it detects the all-dead state on each disconnect and cancels immediately with a clear retry message. Also fixes the "Continuing with N remaining personas" count — it was always showing total-1 because it only excluded the currently disconnecting persona, not previously dead ones. Now counts only personas with a live transport connection. Removes the `> 0` guard on questionsExpected/proposalsExpected/ refinementExpected checks — since the all-dead early return handles the zero case, these guards were preventing legitimate "last persona disconnected but some had already submitted" from advancing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a design persona crashes, the system now automatically respawns it once before decrementing expected counts. Tracks retried personas in a Set to ensure at most one retry per persona name. The retry spawns a fresh session with the same persona prompt, waits for its bridge to connect (health check), and updates the design state with the new session ID. If the retry fails (spawn error or bridge timeout), the disconnect handler fires again and falls through to the existing decrement/cancel logic. If the persona respawns into the independent or answering phase, it receives a nudge notification to read the thread and post its proposal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sf8193
left a comment
There was a problem hiding this comment.
Reviewed with dual-agent process (sp-reviewer + typescript-reviewer). 5 should-fix, 3 nits.
| } | ||
| } catch (err) { | ||
| process.stderr.write(`daemon: design: ${name} auto-retry failed: ${err}\n`) | ||
| onDesignParticipantDisconnect(deadPersona.sessionId) |
There was a problem hiding this comment.
Should-fix (flagged by both reviewers): catch block can double-decrement expectation counters.
If doSpawnSession throws before line 642 mutates deadPersona.sessionId, this calls onDesignParticipantDisconnect(oldSessionId). The handler finds the same persona again (sessionId still matches), sees it's in retriedPersonas, and decrements questionsExpected/proposalsExpected/refinementExpected a second time for the same dead persona. This can push counters negative.
Conversely, if spawn succeeds (sessionId was mutated), this calls disconnect with the new sessionId — which works correctly (falls through to retry-exhausted path). But the asymmetry is subtle and the failure path is buggy.
Suggestion: capture the old sessionId before the try block and use it to guard against re-entry, or restructure so the catch doesn't re-enter the disconnect handler.
| }) | ||
|
|
||
| // Update session ID immediately so disconnect handler can find it if the new session crashes fast | ||
| deadPersona.sessionId = result.sessionId |
There was a problem hiding this comment.
Should-fix: in-place sessionId mutation loses the old session ID.
Two consequences:
- If the old session's process is still lingering (delayed teardown),
cleanupDesignSessionsiteratesstate.personasand only sees the new sessionId — the old process leaks. - Late-arriving disconnect events for the old sessionId won't match any persona in
state.personas, so they're silently ignored. This is benign but makes debugging harder.
Consider keeping the old ID (e.g., a previousSessionIds set on the persona) or treating the replacement as a distinct entry.
| if (state.phase === 'questioning') { | ||
| state.questionsExpected-- | ||
| if (state.questionsExpected > 0 && state.questionsReceived >= state.questionsExpected) { | ||
| if (state.questionsReceived >= state.questionsExpected) { |
There was a problem hiding this comment.
Should-fix (flagged by both reviewers): removing the > 0 guards enables premature phase transitions.
The old code had state.questionsExpected > 0 && (and similarly for proposals/refinement). With the guard removed, if the last persona dies and retry is exhausted, questionsExpected decrements to 0, and 0 >= 0 triggers the transition. The aliveCount === 0 cancellation check should catch this first — but it won't if another persona is still alive (aliveCount > 0) while this counter hits 0. That advances the phase with zero expectations from this counter.
Suggestion: restore the > 0 guards, or move the aliveCount === 0 cancellation to before the per-phase counter adjustment.
| }) | ||
| } else if (state.phase === 'refinement' && state.activeDivergence) { | ||
| if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) { | ||
| state.refinementExpected++ |
There was a problem hiding this comment.
Should-fix: refinementExpected++ without checking if the dead persona already responded.
When the first crash triggers the retry path (line 718-719), the early return skips the refinementExpected-- that normally runs. So the count is still inflated by the dead persona's slot. Then here, refinementExpected++ adds another slot for the replacement — net +1 over what it should be. If the dead persona had already responded before crashing, the design now waits for an extra response that may never come (timeout).
Suggestion: check state.refinementRespondedIds.has(deadPersona.sessionId) or equivalent before incrementing.
| if (persona) { | ||
| process.stderr.write(`daemon: design: ${label} disconnected/died\n`) | ||
| void gateway.send(threadId, `_${label} disconnected. Continuing with ${state.personas.filter(p => p.sessionId !== sessionId).length} remaining personas._`).catch(() => {}) | ||
| const aliveCount = state.personas.filter(p => p.sessionId !== sessionId && transport.has(p.sessionId)).length |
There was a problem hiding this comment.
Should-fix: aliveCount can be stale under concurrent persona crashes.
If two personas crash near-simultaneously and both retries are exhausted, each computes aliveCount === 1 (seeing the other as alive before its transport is torn down). Neither triggers the aliveCount === 0 cancellation, and the design hangs until timeout.
This is a narrow race but possible under load. A mitigation: recheck aliveCount just before the cancel decision, or move the cancel check into a shared function that recalculates.
| refinementExpected: number | ||
| refinementResponses: number | ||
| refinementRespondedIds: Set<string> | ||
| activeDivergence?: { description: string; personas: string[]; impact: string } |
There was a problem hiding this comment.
Nit: activeDivergence is typed inline. Since refinementQueue presumably holds the same shape, consider referencing a shared Divergence type to avoid drift.
| meta: { chat_id: threadId, message_id: '', user: 'system', user_id: 'system', ts: cutoffTs }, | ||
| }) | ||
| } else if (state.phase === 'refinement' && state.activeDivergence) { | ||
| if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) { |
There was a problem hiding this comment.
Nit (flagged by both reviewers): The name.replace('_', ' ') normalization duplicates logic from processNextDivergence. If the rule changes, both sites need updating. Consider extracting a helper or normalizing at the source (where activeDivergence.personas is populated).
| } | ||
|
|
||
| const divergence = state.refinementQueue.shift()! | ||
| state.activeDivergence = divergence |
There was a problem hiding this comment.
Nit: activeDivergence is set here but never cleared (e.g., when refinement completes or the queue is exhausted). Currently safe because retryPersona checks state.phase === 'refinement' before using it, but clearing it explicitly would prevent future bugs.
sf8193
left a comment
There was a problem hiding this comment.
Dual-agent review (sp-reviewer + typescript-reviewer). 2 blockers, 3 should-fix, 2 nits.
| }) | ||
|
|
||
| // Update session ID immediately so disconnect handler can find it if the new session crashes fast | ||
| deadPersona.sessionId = result.sessionId |
There was a problem hiding this comment.
Blocker (flagged by both reviewers): deadPersona.sessionId is mutated to the new session ID before the health check. If waitForBridge returns false, killSession runs, then the catch block calls onDesignParticipantDisconnect(deadPersona.sessionId) with the new (already-killed) session ID. The disconnect handler lookup may silently no-op if the kill already cleaned it from the registry, leaving the dead persona unaccounted for — expectations never decrement.
Fix: move this assignment after the waitForBridge success path (after line 649), or stash the original ID and pass it to the recursive onDesignParticipantDisconnect call.
| if (state.phase === 'questioning') { | ||
| state.questionsExpected-- | ||
| if (state.questionsExpected > 0 && state.questionsReceived >= state.questionsExpected) { | ||
| if (state.questionsReceived >= state.questionsExpected) { |
There was a problem hiding this comment.
Blocker (flagged by both reviewers): Removing the questionsExpected > 0 guard (and equivalently at lines 744, 755) allows 0 >= 0 to trigger a phase transition with zero contributions.
Scenario: persona A is alive and working, persona B (retry exhausted) disconnects. aliveCount > 0 so the cancellation path doesn't fire. questionsExpected decrements to 0, questionsReceived is 0, 0 >= 0 is true → phase advances with zero questions. Persona A's eventual submission arrives in the wrong phase and is silently dropped.
Fix: restore a > 0 guard (e.g. state.questionsExpected > 0 && state.questionsReceived >= state.questionsExpected), or ensure the all-dead check accounts for this edge.
| } | ||
|
|
||
| const divergence = state.refinementQueue.shift()! | ||
| state.activeDivergence = divergence |
There was a problem hiding this comment.
Should-fix (flagged by both reviewers): activeDivergence is set here but never cleared when refinement for this divergence completes or when the phase changes. If a persona crashes after refinement finishes but before the next divergence (or during synthesis), retryPersona will see the stale divergence, send a refinement prompt, and increment refinementExpected — corrupting the counter.
Fix: set state.activeDivergence = undefined at the end of processNextDivergence (when all responses are in) or on phase transitions out of refinement.
| if (!state.retriedPersonas.has(persona.name)) { | ||
| state.retriedPersonas.add(persona.name) | ||
| void gateway.send(threadId, `_${label} crashed — auto-retrying..._`).catch(() => {}) | ||
| void retryPersona(state, persona, threadId) |
There was a problem hiding this comment.
Should-fix: retryPersona is fire-and-forget (void) and takes 30+ seconds (spawn + health check). During that time, timeouts can fire and advance the phase. The phase-specific prompt selection (lines 661-683) uses the phase at completion time, not at crash time — so a persona that crashed during questioning could receive an independent or refinement prompt if the phase advanced while retry was in-flight.
Also, if cancelDesign runs concurrently (user cancel or all-dead from another disconnect), the retry can finish spawning a session that cancel already iterated past, leaving an orphan.
Consider: (1) capture the phase at crash time and use it for prompt selection, (2) check phase before spawn in addition to after.
| }) | ||
| } else if (state.phase === 'refinement' && state.activeDivergence) { | ||
| if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) { | ||
| state.refinementExpected++ |
There was a problem hiding this comment.
Should-fix: If between the persona's death (which decremented refinementExpected) and the retry completing here, refinementResponses >= refinementExpected became true and processNextDivergence already advanced, then this increment makes refinementExpected permanently higher than the number of responses that will arrive — the divergence counter is off.
| refinementExpected: number | ||
| refinementResponses: number | ||
| refinementRespondedIds: Set<string> | ||
| activeDivergence?: { description: string; personas: string[]; impact: string } |
There was a problem hiding this comment.
Nit: This inline type { description: string; personas: string[]; impact: string } is presumably the same shape as divergence objects in refinementQueue. Extract a named type to keep them in sync.
| meta: { chat_id: threadId, message_id: '', user: 'system', user_id: 'system', ts: cutoffTs }, | ||
| }) | ||
| } else if (state.phase === 'refinement' && state.activeDivergence) { | ||
| if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) { |
There was a problem hiding this comment.
Nit: String.replace with a string pattern only replaces the first occurrence. If persona names can contain multiple underscores (e.g. ux_design_lead), this won't match. Consider replaceAll('_', ' ') or a regex /_ /g. (Pre-existing at line 584 too.)
Summary
Set<string>to ensure at most one auto-retry per personadeadPersona.sessionIdimmediately after spawn (beforewaitForBridge) to prevent fast-crash race conditionactiveDivergencefield toDesignStateto track the current divergence being discussedindependent, divergence context duringrefinementStack: merge #76 first.
Test plan
bun test(268 pass, 0 fail)🤖 Generated with Claude Code