feat(design): add retry design command for batch persona recovery#78
feat(design): add retry design command for batch persona recovery#78kwliang1 wants to merge 3 commits into
Conversation
When all design personas crash, the system previously waited 5–20 minutes for timeouts before cancelling. Now it detects the all-dead state on each disconnect and cancels immediately with a clear retry message. Also fixes the "Continuing with N remaining personas" count — it was always showing total-1 because it only excluded the currently disconnecting persona, not previously dead ones. Now counts only personas with a live transport connection. Removes the `> 0` guard on questionsExpected/proposalsExpected/ refinementExpected checks — since the all-dead early return handles the zero case, these guards were preventing legitimate "last persona disconnected but some had already submitted" from advancing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a design persona crashes, the system now automatically respawns it once before decrementing expected counts. Tracks retried personas in a Set to ensure at most one retry per persona name. The retry spawns a fresh session with the same persona prompt, waits for its bridge to connect (health check), and updates the design state with the new session ID. If the retry fails (spawn error or bridge timeout), the disconnect handler fires again and falls through to the existing decrement/cancel logic. If the persona respawns into the independent or answering phase, it receives a nudge notification to read the thread and post its proposal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds `retry design` command that respawns all dead personas in a single action. Typing `retry design` in a design thread will: - Kill lingering dead persona sessions - Respawn all personas lacking a live transport connection - Re-register them with the design state machine - Nudge into the correct phase (independent gets proposal nudge, refinement uses activeDivergence for correct context + count) - Re-adjust expected counts and auto-advance if already satisfied Uses the new `activeDivergence` field (set in processNextDivergence) to reference the correct divergence — refinementQueue[0] would be wrong since .shift() removes the current one. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sf8193
left a comment
There was a problem hiding this comment.
Review: PR #78 — retry design command
Reviewed by sp-reviewer (architecture) and typescript-reviewer (correctness/concurrency) in parallel.
Summary: The feature is well-structured and the phase-aware retry logic is solid in concept. Main concerns are around race conditions between the manual (retryDesign) and auto (retryPersona) paths, a session ID mutation that can corrupt disconnect handling, and a vacuous phase-advance edge case from removing the > 0 guards.
- Blockers: 3
- Should-fix: 5
- Nits: 2
| }) | ||
|
|
||
| // Update session ID immediately so disconnect handler can find it if the new session crashes fast | ||
| deadPersona.sessionId = result.sessionId |
There was a problem hiding this comment.
Blocker (flagged by both reviewers): deadPersona.sessionId is mutated before waitForBridge succeeds. If the health check fails, the catch block calls onDesignParticipantDisconnect(deadPersona.sessionId) using the new session ID — but the original session ID is lost, so the disconnect handler may fail to find the persona in its designs iteration.
Additionally, if the new session crashes quickly and triggers a transport-layer disconnect before waitForBridge returns, the catch block also calls onDesignParticipantDisconnect, double-firing the disconnect handler for the same persona — potentially cancelling the entire design prematurely.
Fix: Move deadPersona.sessionId = result.sessionId to after the waitForBridge check passes.
| let alreadyAlive = 0 | ||
|
|
||
| for (const persona of state.personas) { | ||
| if (transport.has(persona.sessionId)) { |
There was a problem hiding this comment.
Blocker: retryDesign (manual command) and retryPersona (auto-retry from disconnect) can race on the same persona. If a persona crashed and retryPersona is in-flight, the user can invoke retry design. The transport.has(persona.sessionId) check here uses the session ID that retryPersona may have already mutated, so transport.has returns false (bridge not connected yet), causing a second spawn. Two sessions run for one persona slot; only one ID is stored. The other becomes an orphaned tmux+Claude process.
Fix: Add a per-persona retryInProgress guard (e.g. a retryingPersonas: Set<string>) that both paths check before spawning.
| // Re-adjust expected counts and check if phase should advance | ||
| const aliveNow = state.personas.filter(p => transport.has(p.sessionId)).length | ||
| if (state.phase === 'questioning') { | ||
| state.questionsExpected = aliveNow |
There was a problem hiding this comment.
Blocker: Missing clearTimeout(state.timeout) before transitioning. The existing phase timeout (from startDesign or startProposalPhase) is still live. If retry-triggered count adjustment advances the phase here, and then the old timeout fires, aggregateAndPostQuestions (or the proposal aggregator) runs twice.
The disconnect handler correctly clears the timeout before transitioning — this path should do the same. Same issue applies to the independent phase adjustment at line 433.
|
|
||
| persona.sessionId = result.sessionId | ||
| persona.proposed = false | ||
| state.retriedPersonas.delete(name) |
There was a problem hiding this comment.
Should-fix: Clearing retriedPersonas on manual retry undermines the auto-retry's once-only guard. If the manually-retried persona crashes again, onDesignParticipantDisconnect will auto-retry it (name no longer in retriedPersonas), creating an unbounded retry loop: crash → auto-retry → crash → user retries → crash → auto-retry → ...
Fix: Don't delete from retriedPersonas here, or use a separate counter for the auto-retry budget.
| if (state.phase === 'questioning') { | ||
| state.questionsExpected-- | ||
| if (state.questionsExpected > 0 && state.questionsReceived >= state.questionsExpected) { | ||
| if (state.questionsReceived >= state.questionsExpected) { |
There was a problem hiding this comment.
Should-fix (flagged by both reviewers): The > 0 guard on questionsExpected was removed here and in the independent/refinement branches below. With the new auto-retry path, there's a window where retryPersona is in-flight (async, not yet resolved) and another persona's disconnect comes in, decrementing questionsExpected to 0. The check 0 >= 0 is true, so the phase advances vacuously with zero questions.
The aliveCount === 0 cancellation check above doesn't fully cover this because it uses transport.has which doesn't account for in-flight retries.
Fix: Restore state.questionsExpected > 0 && (and the equivalent guards for proposalsExpected and refinementExpected).
| }) | ||
| } else if (state.phase === 'refinement' && state.activeDivergence) { | ||
| if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) { | ||
| state.refinementExpected++ |
There was a problem hiding this comment.
Should-fix: The disconnect handler returns early when auto-retrying (line 812), skipping state.refinementExpected--. Then retryPersona here does state.refinementExpected++. Net effect: refinementExpected is +1 too high (the dead persona's count was never decremented). The phase will stall waiting for a response that was already counted.
Same issue exists for questioning and independent phases — the disconnect handler's decrement is skipped when it returns early for auto-retry.
| // Retry a design — respawn dead personas and resume from current phase | ||
| // --------------------------------------------------------------------------- | ||
|
|
||
| export async function retryDesign(threadId: string): Promise<{ respawned: number; alreadyAlive: number }> { |
There was a problem hiding this comment.
Should-fix: retryDesign spawns personas and immediately sends transport.sendOrQueue messages without calling waitForBridge. By contrast, retryPersona correctly calls waitForBridge(result.sessionId, HEALTH_TIMEOUT_MS) and cleans up on failure. This means manual retry may silently queue messages to sessions that never connect, with no health-check fallback and no cleanup.
| } | ||
|
|
||
| const divergence = state.refinementQueue.shift()! | ||
| state.activeDivergence = divergence |
There was a problem hiding this comment.
Should-fix (flagged by both reviewers): activeDivergence is set here but never cleared when refinement finishes (when refinementQueue empties and autoAdvanceAfterRefinement runs). If a persona crashes during the subsequent synthesis phase, the stale activeDivergence from the last round could cause retryPersona to send an incorrect refinement prompt.
Fix: Clear state.activeDivergence = undefined when refinement completes.
| }) | ||
|
|
||
| persona.sessionId = result.sessionId | ||
| persona.proposed = false |
There was a problem hiding this comment.
Nit: persona.proposed = false is reset unconditionally on respawn. During refinement phase this has no effect but is misleading — proposed is only meaningful during independent. Additionally, this feeds into the proposalsExpected formula at line 433 which counts !p.proposed personas, potentially inflating expected counts for phases where proposals aren't relevant.
| // Auto-retry a crashed persona once | ||
| // --------------------------------------------------------------------------- | ||
|
|
||
| async function retryPersona( |
There was a problem hiding this comment.
Nit: The spawn + phase-notification logic for independent and refinement is copy-pasted nearly verbatim between retryDesign (lines 393-415) and retryPersona (lines 748-773). If the notification format changes, both must be updated in lockstep. Consider extracting a shared helper.
Summary
retry designcommand that respawns all dead personas in a single actionindependentgets proposal nudge,refinementgets divergence context viaactiveDivergenceretry designand/retry design(case-insensitive, thread-only)Stack: merge #76 and #77 first.
Test plan
retry design— verify all 5 respawn with correct messageretry design— verify "2 personas respawned, 3 already alive"retry designwith no active design — verify "No design session to retry"retry designduring synthesis phase — verify "Cannot retry personas during synthesis phase"bun test(268 pass, 0 fail)🤖 Generated with Claude Code