Skip to content

feat(design): add retry design command for batch persona recovery#78

Open
kwliang1 wants to merge 3 commits into
sf8193:mainfrom
kwliang1:kevinliang/design-retry-command
Open

feat(design): add retry design command for batch persona recovery#78
kwliang1 wants to merge 3 commits into
sf8193:mainfrom
kwliang1:kevinliang/design-retry-command

Conversation

@kwliang1

@kwliang1 kwliang1 commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds retry design command that respawns all dead personas in a single action
  • Kills lingering dead sessions, respawns missing personas, re-registers with design state machine
  • Phase-appropriate nudges: independent gets proposal nudge, refinement gets divergence context via activeDivergence
  • Re-adjusts expected counts after retry and auto-advances if threshold already met (prevents questioning stall)
  • Wired into router: matches retry design and /retry design (case-insensitive, thread-only)

Stack: merge #76 and #77 first.

Test plan

  • Kill all 5 personas, type retry design — verify all 5 respawn with correct message
  • Kill 2 of 5, type retry design — verify "2 personas respawned, 3 already alive"
  • Type retry design with no active design — verify "No design session to retry"
  • Type retry design during synthesis phase — verify "Cannot retry personas during synthesis phase"
  • Existing tests pass: bun test (268 pass, 0 fail)

🤖 Generated with Claude Code

kwliang1 and others added 3 commits July 3, 2026 01:33
When all design personas crash, the system previously waited 5–20 minutes
for timeouts before cancelling. Now it detects the all-dead state on each
disconnect and cancels immediately with a clear retry message.

Also fixes the "Continuing with N remaining personas" count — it was
always showing total-1 because it only excluded the currently disconnecting
persona, not previously dead ones. Now counts only personas with a live
transport connection.

Removes the `> 0` guard on questionsExpected/proposalsExpected/
refinementExpected checks — since the all-dead early return handles the
zero case, these guards were preventing legitimate "last persona
disconnected but some had already submitted" from advancing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a design persona crashes, the system now automatically respawns it
once before decrementing expected counts. Tracks retried personas in a
Set to ensure at most one retry per persona name.

The retry spawns a fresh session with the same persona prompt, waits for
its bridge to connect (health check), and updates the design state with
the new session ID. If the retry fails (spawn error or bridge timeout),
the disconnect handler fires again and falls through to the existing
decrement/cancel logic.

If the persona respawns into the independent or answering phase, it
receives a nudge notification to read the thread and post its proposal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds `retry design` command that respawns all dead personas in a single
action. Typing `retry design` in a design thread will:

- Kill lingering dead persona sessions
- Respawn all personas lacking a live transport connection
- Re-register them with the design state machine
- Nudge into the correct phase (independent gets proposal nudge,
  refinement uses activeDivergence for correct context + count)
- Re-adjust expected counts and auto-advance if already satisfied

Uses the new `activeDivergence` field (set in processNextDivergence)
to reference the correct divergence — refinementQueue[0] would be
wrong since .shift() removes the current one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@sf8193 sf8193 left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #78 — retry design command

Reviewed by sp-reviewer (architecture) and typescript-reviewer (correctness/concurrency) in parallel.

Summary: The feature is well-structured and the phase-aware retry logic is solid in concept. Main concerns are around race conditions between the manual (retryDesign) and auto (retryPersona) paths, a session ID mutation that can corrupt disconnect handling, and a vacuous phase-advance edge case from removing the > 0 guards.

  • Blockers: 3
  • Should-fix: 5
  • Nits: 2

Comment thread daemon/design.ts
})

// Update session ID immediately so disconnect handler can find it if the new session crashes fast
deadPersona.sessionId = result.sessionId

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker (flagged by both reviewers): deadPersona.sessionId is mutated before waitForBridge succeeds. If the health check fails, the catch block calls onDesignParticipantDisconnect(deadPersona.sessionId) using the new session ID — but the original session ID is lost, so the disconnect handler may fail to find the persona in its designs iteration.

Additionally, if the new session crashes quickly and triggers a transport-layer disconnect before waitForBridge returns, the catch block also calls onDesignParticipantDisconnect, double-firing the disconnect handler for the same persona — potentially cancelling the entire design prematurely.

Fix: Move deadPersona.sessionId = result.sessionId to after the waitForBridge check passes.

Comment thread daemon/design.ts
let alreadyAlive = 0

for (const persona of state.personas) {
if (transport.has(persona.sessionId)) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: retryDesign (manual command) and retryPersona (auto-retry from disconnect) can race on the same persona. If a persona crashed and retryPersona is in-flight, the user can invoke retry design. The transport.has(persona.sessionId) check here uses the session ID that retryPersona may have already mutated, so transport.has returns false (bridge not connected yet), causing a second spawn. Two sessions run for one persona slot; only one ID is stored. The other becomes an orphaned tmux+Claude process.

Fix: Add a per-persona retryInProgress guard (e.g. a retryingPersonas: Set<string>) that both paths check before spawning.

Comment thread daemon/design.ts
// Re-adjust expected counts and check if phase should advance
const aliveNow = state.personas.filter(p => transport.has(p.sessionId)).length
if (state.phase === 'questioning') {
state.questionsExpected = aliveNow

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker: Missing clearTimeout(state.timeout) before transitioning. The existing phase timeout (from startDesign or startProposalPhase) is still live. If retry-triggered count adjustment advances the phase here, and then the old timeout fires, aggregateAndPostQuestions (or the proposal aggregator) runs twice.

The disconnect handler correctly clears the timeout before transitioning — this path should do the same. Same issue applies to the independent phase adjustment at line 433.

Comment thread daemon/design.ts

persona.sessionId = result.sessionId
persona.proposed = false
state.retriedPersonas.delete(name)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix: Clearing retriedPersonas on manual retry undermines the auto-retry's once-only guard. If the manually-retried persona crashes again, onDesignParticipantDisconnect will auto-retry it (name no longer in retriedPersonas), creating an unbounded retry loop: crash → auto-retry → crash → user retries → crash → auto-retry → ...

Fix: Don't delete from retriedPersonas here, or use a separate counter for the auto-retry budget.

Comment thread daemon/design.ts
if (state.phase === 'questioning') {
state.questionsExpected--
if (state.questionsExpected > 0 && state.questionsReceived >= state.questionsExpected) {
if (state.questionsReceived >= state.questionsExpected) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix (flagged by both reviewers): The > 0 guard on questionsExpected was removed here and in the independent/refinement branches below. With the new auto-retry path, there's a window where retryPersona is in-flight (async, not yet resolved) and another persona's disconnect comes in, decrementing questionsExpected to 0. The check 0 >= 0 is true, so the phase advances vacuously with zero questions.

The aliveCount === 0 cancellation check above doesn't fully cover this because it uses transport.has which doesn't account for in-flight retries.

Fix: Restore state.questionsExpected > 0 && (and the equivalent guards for proposalsExpected and refinementExpected).

Comment thread daemon/design.ts
})
} else if (state.phase === 'refinement' && state.activeDivergence) {
if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) {
state.refinementExpected++

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix: The disconnect handler returns early when auto-retrying (line 812), skipping state.refinementExpected--. Then retryPersona here does state.refinementExpected++. Net effect: refinementExpected is +1 too high (the dead persona's count was never decremented). The phase will stall waiting for a response that was already counted.

Same issue exists for questioning and independent phases — the disconnect handler's decrement is skipped when it returns early for auto-retry.

Comment thread daemon/design.ts
// Retry a design — respawn dead personas and resume from current phase
// ---------------------------------------------------------------------------

export async function retryDesign(threadId: string): Promise<{ respawned: number; alreadyAlive: number }> {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix: retryDesign spawns personas and immediately sends transport.sendOrQueue messages without calling waitForBridge. By contrast, retryPersona correctly calls waitForBridge(result.sessionId, HEALTH_TIMEOUT_MS) and cleans up on failure. This means manual retry may silently queue messages to sessions that never connect, with no health-check fallback and no cleanup.

Comment thread daemon/design.ts
}

const divergence = state.refinementQueue.shift()!
state.activeDivergence = divergence

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix (flagged by both reviewers): activeDivergence is set here but never cleared when refinement finishes (when refinementQueue empties and autoAdvanceAfterRefinement runs). If a persona crashes during the subsequent synthesis phase, the stale activeDivergence from the last round could cause retryPersona to send an incorrect refinement prompt.

Fix: Clear state.activeDivergence = undefined when refinement completes.

Comment thread daemon/design.ts
})

persona.sessionId = result.sessionId
persona.proposed = false

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: persona.proposed = false is reset unconditionally on respawn. During refinement phase this has no effect but is misleading — proposed is only meaningful during independent. Additionally, this feeds into the proposalsExpected formula at line 433 which counts !p.proposed personas, potentially inflating expected counts for phases where proposals aren't relevant.

Comment thread daemon/design.ts
// Auto-retry a crashed persona once
// ---------------------------------------------------------------------------

async function retryPersona(

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The spawn + phase-notification logic for independent and refinement is copy-pasted nearly verbatim between retryDesign (lines 393-415) and retryPersona (lines 748-773). If the notification format changes, both must be updated in lockstep. Consider extracting a shared helper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants