fix(design): detect all-dead personas and cancel immediately #76

Open
kwliang1 wants to merge 1 commit into sf8193:main from kwliang1:kevinliang/design-detect-all-dead

@kwliang1 kwliang1 commented Jul 3, 2026

Collaborator

Summary

  • Fixes the "Continuing with 4 remaining personas" count bug — now shows actual alive count based on transport connections, not total - 1
  • Detects when all design personas are dead and cancels immediately instead of waiting 5–20 minutes for timeouts
  • Removes the > 0 guard on expected-count checks: the all-dead early return handles the zero case, and the guard was preventing the legitimate "last persona disconnected but others already submitted" case from advancing

Test plan

  • Start a design session, kill all 5 personas manually — verify design cancels immediately with retry message
  • Kill 3 of 5 personas during questioning — verify remaining count shows 2, not 4
  • Kill 1 persona after it submitted a question — verify remaining personas still advance correctly
  • Existing tests pass: bun test (268 pass, 0 fail)

🤖 Generated with Claude Code

When all design personas crash, the system previously waited 5–20 minutes
for timeouts before cancelling. Now it detects the all-dead state on each
disconnect and cancels immediately with a clear retry message.

Also fixes the "Continuing with N remaining personas" count — it was
always showing total-1 because it only excluded the currently disconnecting
persona, not previously dead ones. Now counts only personas with a live
transport connection.
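
The difference between the two counts can be sketched as follows. This is an illustration only: it assumes `transport` exposes a `has(sessionId)` lookup (the only call visible in the diff), and the `Persona` shape and helper names are hypothetical.

```typescript
// Hypothetical persona shape; only sessionId is taken from the diff.
interface Persona {
  sessionId: string
}

// Anything with a has() lookup works here, e.g. a Set or Map of live connections.
interface TransportLike {
  has(sessionId: string): boolean
}

// Old count: excludes only the persona that is disconnecting right now.
function remainingByTotal(personas: Persona[], disconnectingId: string): number {
  return personas.filter(p => p.sessionId !== disconnectingId).length
}

// New count: additionally requires a live transport connection.
function countAlive(personas: Persona[], disconnectingId: string, transport: TransportLike): number {
  return personas.filter(p => p.sessionId !== disconnectingId && transport.has(p.sessionId)).length
}

// All-dead check that would trigger the immediate cancel.
function allDead(personas: Persona[], disconnectingId: string, transport: TransportLike): boolean {
  return countAlive(personas, disconnectingId, transport) === 0
}
```

With five personas of which two died earlier and a third is disconnecting, `remainingByTotal` reports 4 while `countAlive` reports 2.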

Removes the `> 0` guard on the questionsExpected/proposalsExpected/
refinementExpected checks. Since the all-dead early return handles the
zero case, these guards were preventing the legitimate "last persona
disconnected but some had already submitted" case from advancing.
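
A minimal sketch of the unguarded check for the independent phase, reduced to the two counters. The state shape and function name are hypothetical, and the sketch assumes the all-dead case has already returned before this point.

```typescript
// Hypothetical slice of the design state; field names follow the diff.
interface IndependentCounters {
  proposalsExpected: number
  proposalsReceived: number
}

// Called when a persona disconnects during the independent phase.
// Returns true when the phase should advance.
function onDisconnectDuringIndependent(state: IndependentCounters, personaProposed: boolean): boolean {
  // A persona that already proposed does not change the expected count.
  if (personaProposed) return false
  state.proposalsExpected--
  // No `> 0` guard: the zero case is assumed to be handled by the all-dead return.
  return state.proposalsReceived >= state.proposalsExpected
}
```

For example, with three proposals expected and two received, the disconnect of the one persona that had not proposed lowers the expectation to two and the phase advances.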

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@sf8193 sf8193 left a comment

Owner

PR #76 Review — fix(design): detect all-dead personas and cancel immediately

Reviewed by sp-reviewer (architecture) and typescript-reviewer (correctness) in parallel.

1 blocker, 3 should-fix, 2 nits. See inline comments.

Comment thread daemon/design.ts
process.stderr.write(`daemon: design: all personas dead — cancelling\n`)
void gateway.send(threadId, `All personas crashed. Design cancelled.\nUse \`design: ${state.topic}\` to retry.`).catch(() => {})
void cancelDesign(threadId)
return

Blocker — cancelDesign is not idempotent; double-call race on near-simultaneous deaths

When two personas disconnect near-simultaneously, both compute aliveCount === 0 (both transports already removed) and both call cancelDesign(threadId). Looking at cancelDesign (line 327): it checks designs.get(threadId) and returns if !state, but designs.delete happens after await cleanupDesignSessions. So the second call enters before the first finishes, finds state still present (with phase === 'cancelled' but no guard for that), and runs cleanup + sends duplicate "Design session cancelled" messages.

Fix: Add an idempotency guard at the top of cancelDesign:

if (state.phase === 'cancelled' || state.phase === 'complete') return

Or set state.phase = 'cancelled' synchronously here before calling cancelDesign.

(flagged by both reviewers)
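
A sketch of how the guard could make the second call a no-op. It is not the actual `cancelDesign`: the state shape, the phase names other than 'cancelled', 'complete' and 'independent', and the stubbed `cleanupDesignSessions` are assumptions made for illustration.

```typescript
// Phase names partly assumed; only some of them appear in this PR.
type DesignPhase = 'questioning' | 'independent' | 'refinement' | 'cancelled' | 'complete'

interface DesignState {
  topic: string
  phase: DesignPhase
}

const designs = new Map<string, DesignState>()
let cleanupRuns = 0

// Stub standing in for the real async cleanup.
async function cleanupDesignSessions(_threadId: string): Promise<void> {
  cleanupRuns++
  await Promise.resolve()
}

async function cancelDesign(threadId: string): Promise<void> {
  const state = designs.get(threadId)
  if (!state) return
  // Idempotency guard: a terminal phase means another call already got here.
  if (state.phase === 'cancelled' || state.phase === 'complete') return
  // Mark the state synchronously, before the first await yields control.
  state.phase = 'cancelled'
  await cleanupDesignSessions(threadId)
  designs.delete(threadId)
}
```

Because the phase is set before the first `await`, a second call that arrives while cleanup is still running sees the terminal phase and returns without repeating cleanup.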

Comment thread daemon/design.ts
process.stderr.write(`daemon: design: all personas dead — cancelling\n`)
void gateway.send(threadId, `All personas crashed. Design cancelled.\nUse \`design: ${state.topic}\` to retry.`).catch(() => {})
void cancelDesign(threadId)
return

Should-fix — state machine bypass

Every other phase-ending path in this function routes through designMachine.transition(). The all-dead path calls cancelDesign directly, bypassing the state machine. If the machine enforces invariants on exit (guard conditions, terminal state marking), they're skipped.

Consider routing through the machine with an explicit cancel event, or at minimum verify cancelDesign covers all the machine's exit invariants.
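
One way an explicit cancel event could look, as a sketch. The transition table, the event names, and the phase order are all hypothetical; the real `designMachine.transition()` signature is not shown in this PR.

```typescript
// Hypothetical phases and events, for illustration only.
type MachinePhase = 'questioning' | 'independent' | 'refinement' | 'cancelled' | 'complete'
type DesignEvent = 'advance' | 'cancel'

// Every live phase accepts 'cancel'; terminal phases accept nothing.
const transitions: Record<MachinePhase, Partial<Record<DesignEvent, MachinePhase>>> = {
  questioning: { advance: 'independent', cancel: 'cancelled' },
  independent: { advance: 'refinement', cancel: 'cancelled' },
  refinement: { advance: 'complete', cancel: 'cancelled' },
  cancelled: {},
  complete: {},
}

// Single choke point where exit invariants could be enforced.
function transition(phase: MachinePhase, event: DesignEvent): MachinePhase {
  const next = transitions[phase][event]
  if (next === undefined) {
    throw new Error(`invalid transition: ${event} from ${phase}`)
  }
  return next
}
```

Routing the all-dead path through something like `transition(state.phase, 'cancel')` would also reject a second cancel on an already-terminal design.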

Comment thread daemon/design.ts
process.stderr.write(`daemon: design: all personas dead — cancelling\n`)
void gateway.send(threadId, `All personas crashed. Design cancelled.\nUse \`design: ${state.topic}\` to retry.`).catch(() => {})
void cancelDesign(threadId)
return

Should-fix — missing .catch() on fire-and-forget cancelDesign

cancelDesign is async and gateway.send inside it (line 340) is awaited without .catch(). The void operator here discards the promise, producing an unhandled rejection if it throws.

The singleton-role disconnect path (line ~688) already uses the correct pattern:

void cancelDesign(threadId).catch(e => process.stderr.write(`daemon: cancelDesign failed: ${e}\n`))

This should match.
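
The pattern can be isolated in a small helper, sketched below. The helper name and the injectable `onError` sink are hypothetical (the daemon writes to `process.stderr` directly); the point is only that the rejection is consumed before the promise is discarded.

```typescript
// Starts an async task without awaiting its result, but never lets a
// rejection escape. The returned promise always resolves.
function cancelInBackground(
  task: () => Promise<void>,
  onError: (message: string) => void,
): Promise<void> {
  return task().catch(e => onError(`daemon: cancelDesign failed: ${e}\n`))
}

// Stand-in for a cancelDesign whose inner gateway.send rejects.
async function failingCancel(): Promise<void> {
  throw new Error('gateway unavailable')
}
```

A call site would then read `void cancelInBackground(() => cancelDesign(threadId), m => process.stderr.write(m))`, which keeps the fire-and-forget shape while logging the failure.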

Comment thread daemon/design.ts
  } else if (state.phase === 'independent' && !persona.proposed) {
  state.proposalsExpected--
- if (state.proposalsExpected > 0 && state.proposalsReceived >= state.proposalsExpected) {
+ if (state.proposalsReceived >= state.proposalsExpected) {

Should-fix (pre-existing bug, good catch) — old code was a tautology

The old line was state.proposalsExpected > 0 && state.proposalsExpected >= state.proposalsExpected — comparing proposalsExpected to itself, always true when > 0. This meant the independent phase was transitioning on the first disconnect regardless of proposalsReceived.

The fix to state.proposalsReceived >= state.proposalsExpected is correct. Worth a quick check that no downstream behavior was accidentally depending on the premature transition.

Comment thread daemon/design.ts
  if (persona) {
- process.stderr.write(`daemon: design: ${label} disconnected/died\n`)
- void gateway.send(threadId, `_${label} disconnected. Continuing with ${state.personas.filter(p => p.sessionId !== sessionId).length} remaining personas._`).catch(() => {})
+ const aliveCount = state.personas.filter(p => p.sessionId !== sessionId && transport.has(p.sessionId)).length

Nit — transport layer coupling deepens

transport.has(p.sessionId) reaches across the module boundary into the transport layer to make liveness decisions. Not a new pattern in this file, but this deepens the coupling. A session.isAlive(sessionId) abstraction would let the session lifecycle layer own the liveness definition. Non-urgent.
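
A rough sketch of what such an abstraction might look like. The factory name and the `LivenessSource` interface are hypothetical; the only thing taken from the code is that the transport offers a `has(sessionId)` lookup.

```typescript
// Minimal view of the transport that the session layer depends on.
interface LivenessSource {
  has(sessionId: string): boolean
}

// The session layer owns the definition of "alive"; callers such as the
// design module ask it instead of reaching into the transport directly.
function createSessionLiveness(transport: LivenessSource) {
  return {
    isAlive(sessionId: string): boolean {
      return transport.has(sessionId)
    },
  }
}
```

If the liveness definition later grows (heartbeats, grace periods), only `isAlive` changes and call sites stay the same.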

Comment thread daemon/design.ts
- void gateway.send(threadId, `_${label} disconnected. Continuing with ${state.personas.filter(p => p.sessionId !== sessionId).length} remaining personas._`).catch(() => {})
+ const aliveCount = state.personas.filter(p => p.sessionId !== sessionId && transport.has(p.sessionId)).length
+ process.stderr.write(`daemon: design: ${label} disconnected/died (${aliveCount} alive)\n`)
+ void gateway.send(threadId, `_${label} disconnected. ${aliveCount > 0 ? `Continuing with ${aliveCount} remaining persona${aliveCount !== 1 ? 's' : ''}.` : 'All personas dead.'}_`).catch(() => {})

Nit — italic markdown may not close cleanly

When aliveCount === 0, the message becomes _... All personas dead._ — the closing _ is jammed against the period. Some markdown renderers may not close the italic. Consider a space or restructure:

`_${label} disconnected._` // always close italic here
// then send the all-dead message separately (which you already do on lines 649-650)
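
One possible reading of that restructuring, as a sketch. The helper name is hypothetical; it closes the italic span right after the disconnect notice and leaves the all-dead wording to the separate cancel message.

```typescript
// Builds the per-disconnect notice. The italic span always closes after
// "disconnected."; the remaining-count sentence is plain text.
function disconnectNotice(label: string, aliveCount: number): string {
  const base = `_${label} disconnected._`
  if (aliveCount === 0) {
    // The all-dead message is assumed to be sent separately by the caller.
    return base
  }
  const noun = aliveCount !== 1 ? 'personas' : 'persona'
  return `${base} Continuing with ${aliveCount} remaining ${noun}.`
}
```

The pluralisation matches the ternary already used in the diff.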
