Skip to content

feat(design): auto-retry crashed persona once before giving up#77

Open
kwliang1 wants to merge 2 commits into
sf8193:mainfrom
kwliang1:kevinliang/design-auto-retry-persona
Open

feat(design): auto-retry crashed persona once before giving up#77
kwliang1 wants to merge 2 commits into
sf8193:mainfrom
kwliang1:kevinliang/design-auto-retry-persona

Conversation

@kwliang1

@kwliang1 kwliang1 commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • When a design persona crashes, automatically respawns it once before decrementing expected counts
  • Tracks retried personas in a Set<string> to ensure at most one auto-retry per persona
  • Updates deadPersona.sessionId immediately after spawn (before waitForBridge) to prevent fast-crash race condition
  • Adds post-spawn cancellation check — if design was cancelled during async retry, cleans up the new session
  • Adds activeDivergence field to DesignState to track the current divergence being discussed
  • Retried personas get phase-appropriate nudges: proposal nudge during independent, divergence context during refinement

Stack: merge #76 first.

Test plan

  • Start a design, kill one persona during questioning — verify it auto-retries and respawns
  • Kill a persona twice — verify first retry spawns, second crash falls through to decrement
  • Kill a persona during refinement — verify retried persona gets the correct divergence context
  • Cancel design while retry is in-flight — verify no orphaned sessions
  • Existing tests pass: bun test (268 pass, 0 fail)

🤖 Generated with Claude Code

kwliang1 and others added 2 commits July 3, 2026 01:33
When all design personas crash, the system previously waited 5–20 minutes
for timeouts before cancelling. Now it detects the all-dead state on each
disconnect and cancels immediately with a clear retry message.

Also fixes the "Continuing with N remaining personas" count — it was
always showing total-1 because it only excluded the currently disconnecting
persona, not previously dead ones. Now counts only personas with a live
transport connection.

Removes the `> 0` guard on questionsExpected/proposalsExpected/
refinementExpected checks — since the all-dead early return handles the
zero case, these guards were preventing legitimate "last persona
disconnected but some had already submitted" from advancing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a design persona crashes, the system now automatically respawns it
once before decrementing expected counts. Tracks retried personas in a
Set to ensure at most one retry per persona name.

The retry spawns a fresh session with the same persona prompt, waits for
its bridge to connect (health check), and updates the design state with
the new session ID. If the retry fails (spawn error or bridge timeout),
the disconnect handler fires again and falls through to the existing
decrement/cancel logic.

If the persona respawns into the independent or answering phase, it
receives a nudge notification to read the thread and post its proposal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@sf8193 sf8193 left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with dual-agent process (sp-reviewer + typescript-reviewer). 5 should-fix, 3 nits.

Comment thread daemon/design.ts
}
} catch (err) {
process.stderr.write(`daemon: design: ${name} auto-retry failed: ${err}\n`)
onDesignParticipantDisconnect(deadPersona.sessionId)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix (flagged by both reviewers): catch block can double-decrement expectation counters.

If doSpawnSession throws before line 642 mutates deadPersona.sessionId, this calls onDesignParticipantDisconnect(oldSessionId). The handler finds the same persona again (sessionId still matches), sees it's in retriedPersonas, and decrements questionsExpected/proposalsExpected/refinementExpected a second time for the same dead persona. This can push counters negative.

Conversely, if spawn succeeds (sessionId was mutated), this calls disconnect with the new sessionId — which works correctly (falls through to retry-exhausted path). But the asymmetry is subtle and the failure path is buggy.

Suggestion: capture the old sessionId before the try block and use it to guard against re-entry, or restructure so the catch doesn't re-enter the disconnect handler.

Comment thread daemon/design.ts
})

// Update session ID immediately so disconnect handler can find it if the new session crashes fast
deadPersona.sessionId = result.sessionId

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix: in-place sessionId mutation loses the old session ID.

Two consequences:

  1. If the old session's process is still lingering (delayed teardown), cleanupDesignSessions iterates state.personas and only sees the new sessionId — the old process leaks.
  2. Late-arriving disconnect events for the old sessionId won't match any persona in state.personas, so they're silently ignored. This is benign but makes debugging harder.

Consider keeping the old ID (e.g., a previousSessionIds set on the persona) or treating the replacement as a distinct entry.

Comment thread daemon/design.ts
if (state.phase === 'questioning') {
state.questionsExpected--
if (state.questionsExpected > 0 && state.questionsReceived >= state.questionsExpected) {
if (state.questionsReceived >= state.questionsExpected) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix (flagged by both reviewers): removing the > 0 guards enables premature phase transitions.

The old code had state.questionsExpected > 0 && (and similarly for proposals/refinement). With the guard removed, if the last persona dies and retry is exhausted, questionsExpected decrements to 0, and 0 >= 0 triggers the transition. The aliveCount === 0 cancellation check should catch this first — but it won't if another persona is still alive (aliveCount > 0) while this counter hits 0. That advances the phase with zero expectations from this counter.

Suggestion: restore the > 0 guards, or move the aliveCount === 0 cancellation to before the per-phase counter adjustment.

Comment thread daemon/design.ts
})
} else if (state.phase === 'refinement' && state.activeDivergence) {
if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) {
state.refinementExpected++

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix: refinementExpected++ without checking if the dead persona already responded.

When the first crash triggers the retry path (line 718-719), the early return skips the refinementExpected-- that normally runs. So the count is still inflated by the dead persona's slot. Then here, refinementExpected++ adds another slot for the replacement — net +1 over what it should be. If the dead persona had already responded before crashing, the design now waits for an extra response that may never come (timeout).

Suggestion: check state.refinementRespondedIds.has(deadPersona.sessionId) or equivalent before incrementing.

Comment thread daemon/design.ts
if (persona) {
process.stderr.write(`daemon: design: ${label} disconnected/died\n`)
void gateway.send(threadId, `_${label} disconnected. Continuing with ${state.personas.filter(p => p.sessionId !== sessionId).length} remaining personas._`).catch(() => {})
const aliveCount = state.personas.filter(p => p.sessionId !== sessionId && transport.has(p.sessionId)).length

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix: aliveCount can be stale under concurrent persona crashes.

If two personas crash near-simultaneously and both retries are exhausted, each computes aliveCount === 1 (seeing the other as alive before its transport is torn down). Neither triggers the aliveCount === 0 cancellation, and the design hangs until timeout.

This is a narrow race but possible under load. A mitigation: recheck aliveCount just before the cancel decision, or move the cancel check into a shared function that recalculates.

Comment thread daemon/design.ts
refinementExpected: number
refinementResponses: number
refinementRespondedIds: Set<string>
activeDivergence?: { description: string; personas: string[]; impact: string }

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: activeDivergence is typed inline. Since refinementQueue presumably holds the same shape, consider referencing a shared Divergence type to avoid drift.

Comment thread daemon/design.ts
meta: { chat_id: threadId, message_id: '', user: 'system', user_id: 'system', ts: cutoffTs },
})
} else if (state.phase === 'refinement' && state.activeDivergence) {
if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (flagged by both reviewers): The name.replace('_', ' ') normalization duplicates logic from processNextDivergence. If the rule changes, both sites need updating. Consider extracting a helper or normalizing at the source (where activeDivergence.personas is populated).

Comment thread daemon/design.ts
}

const divergence = state.refinementQueue.shift()!
state.activeDivergence = divergence

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: activeDivergence is set here but never cleared (e.g., when refinement completes or the queue is exhausted). Currently safe because retryPersona checks state.phase === 'refinement' before using it, but clearing it explicitly would prevent future bugs.

@sf8193 sf8193 left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dual-agent review (sp-reviewer + typescript-reviewer). 2 blockers, 3 should-fix, 2 nits.

Comment thread daemon/design.ts
})

// Update session ID immediately so disconnect handler can find it if the new session crashes fast
deadPersona.sessionId = result.sessionId

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker (flagged by both reviewers): deadPersona.sessionId is mutated to the new session ID before the health check. If waitForBridge returns false, killSession runs, then the catch block calls onDesignParticipantDisconnect(deadPersona.sessionId) with the new (already-killed) session ID. The disconnect handler lookup may silently no-op if the kill already cleaned it from the registry, leaving the dead persona unaccounted for — expectations never decrement.

Fix: move this assignment after the waitForBridge success path (after line 649), or stash the original ID and pass it to the recursive onDesignParticipantDisconnect call.

Comment thread daemon/design.ts
if (state.phase === 'questioning') {
state.questionsExpected--
if (state.questionsExpected > 0 && state.questionsReceived >= state.questionsExpected) {
if (state.questionsReceived >= state.questionsExpected) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker (flagged by both reviewers): Removing the questionsExpected > 0 guard (and equivalently at lines 744, 755) allows 0 >= 0 to trigger a phase transition with zero contributions.

Scenario: persona A is alive and working, persona B (retry exhausted) disconnects. aliveCount > 0 so the cancellation path doesn't fire. questionsExpected decrements to 0, questionsReceived is 0, 0 >= 0 is true → phase advances with zero questions. Persona A's eventual submission arrives in the wrong phase and is silently dropped.

Fix: restore a > 0 guard (e.g. state.questionsExpected > 0 && state.questionsReceived >= state.questionsExpected), or ensure the all-dead check accounts for this edge.

Comment thread daemon/design.ts
}

const divergence = state.refinementQueue.shift()!
state.activeDivergence = divergence

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix (flagged by both reviewers): activeDivergence is set here but never cleared when refinement for this divergence completes or when the phase changes. If a persona crashes after refinement finishes but before the next divergence (or during synthesis), retryPersona will see the stale divergence, send a refinement prompt, and increment refinementExpected — corrupting the counter.

Fix: set state.activeDivergence = undefined at the end of processNextDivergence (when all responses are in) or on phase transitions out of refinement.

Comment thread daemon/design.ts
if (!state.retriedPersonas.has(persona.name)) {
state.retriedPersonas.add(persona.name)
void gateway.send(threadId, `_${label} crashed — auto-retrying..._`).catch(() => {})
void retryPersona(state, persona, threadId)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix: retryPersona is fire-and-forget (void) and takes 30+ seconds (spawn + health check). During that time, timeouts can fire and advance the phase. The phase-specific prompt selection (lines 661-683) uses the phase at completion time, not at crash time — so a persona that crashed during questioning could receive an independent or refinement prompt if the phase advanced while retry was in-flight.

Also, if cancelDesign runs concurrently (user cancel or all-dead from another disconnect), the retry can finish spawning a session that cancel already iterated past, leaving an orphan.

Consider: (1) capture the phase at crash time and use it for prompt selection, (2) check phase before spawn in addition to after.

Comment thread daemon/design.ts
})
} else if (state.phase === 'refinement' && state.activeDivergence) {
if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) {
state.refinementExpected++

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should-fix: If between the persona's death (which decremented refinementExpected) and the retry completing here, refinementResponses >= refinementExpected became true and processNextDivergence already advanced, then this increment makes refinementExpected permanently higher than the number of responses that will arrive — the divergence counter is off.

Comment thread daemon/design.ts
refinementExpected: number
refinementResponses: number
refinementRespondedIds: Set<string>
activeDivergence?: { description: string; personas: string[]; impact: string }

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This inline type { description: string; personas: string[]; impact: string } is presumably the same shape as divergence objects in refinementQueue. Extract a named type to keep them in sync.

Comment thread daemon/design.ts
meta: { chat_id: threadId, message_id: '', user: 'system', user_id: 'system', ts: cutoffTs },
})
} else if (state.phase === 'refinement' && state.activeDivergence) {
if (state.activeDivergence.personas.some(n => n === name || n === name.replace('_', ' '))) {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: String.replace with a string pattern only replaces the first occurrence. If persona names can contain multiple underscores (e.g. ux_design_lead), this won't match. Consider replaceAll('_', ' ') or a regex /_ /g. (Pre-existing at line 584 too.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants