WorkerSupervisor: drain dead worker queues before close_queues() in recreate_if_dead() #571

@coderabbitai

Summary

When WorkerSupervisor.recreate_if_dead() replaces a dead worker, it calls _reap_dead() which immediately closes the dead worker's queues. Any pop_completed / pop_failed items or queued exception payloads emitted by the worker just before it crashed become unreachable at that point, so finished extract/validate work can be silently dropped.

Context

Deferred from PR #570 (release v1.0.0) — see review comment.

The impact is narrow and self-healing for the current release:

  1. Crash cause is not lost: propagate_exceptions() calls propagate_exception() and reports the fault at ERROR before recreate_if_dead() runs, so the exception is surfaced before any queue is closed.
  2. Narrow at-risk window: only completions emitted between the last ModelUpdater.update() drain cycle and the crash are at risk.
  3. Self-healing: the recovery contract from #535 (Recreate dead extract/validate worker processes, follow-up to #511) accepts that in-flight work is redone rather than guaranteed preserved; a lost last-moment completion triggers a rescan and re-queue on the next cycle.

Proposed improvement

Before calling _reap_dead(dead) / close_queues() in recreate_if_dead(), drain the dead worker's result queues (e.g. pop_completed, pop_failed, pop_latest_statuses) and buffer/surface the results so no finished work is silently dropped.

File: src/python/controller/worker_supervisor.py, recreate_if_dead() method (lines ~120–148).
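
A minimal sketch of the drain-then-close ordering, not the actual WorkerSupervisor code. FakeDeadWorker, DrainedResults, drain_queue, drain_then_close, and the completed_queue / failed_queue attributes are hypothetical names used for illustration. queue.Queue stands in for whatever queue type the real workers use.

```python
import queue
from dataclasses import dataclass, field
from typing import Any, List


def drain_queue(q) -> List[Any]:
    """Pop everything currently readable from q without blocking."""
    items: List[Any] = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


@dataclass
class DrainedResults:
    """Buffered results salvaged from a dead worker (hypothetical container)."""
    completed: List[Any] = field(default_factory=list)
    failed: List[Any] = field(default_factory=list)


class FakeDeadWorker:
    """Hypothetical stand-in for a crashed worker and its result queues."""

    def __init__(self) -> None:
        self.completed_queue: queue.Queue = queue.Queue()
        self.failed_queue: queue.Queue = queue.Queue()
        self.closed = False

    def close_queues(self) -> None:
        # In the real code this is the point after which queued items
        # become unreachable.
        self.closed = True


def drain_then_close(worker) -> DrainedResults:
    """Salvage queued results first, then close the worker's queues."""
    results = DrainedResults(
        completed=drain_queue(worker.completed_queue),
        failed=drain_queue(worker.failed_queue),
    )
    worker.close_queues()
    return results
```

The caller would then hand the returned results to whatever normally consumes pop_completed / pop_failed output. One caveat if the real queues are multiprocessing.Queue instances: get_nowait() can raise queue.Empty while an item is still in transit, so a short-timeout get() may be needed for a reliable drain.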

Requested by

@nitrobass24 (follow-up from PR #570 review)
