Summary
When WorkerSupervisor.recreate_if_dead() replaces a dead worker, it calls _reap_dead(), which immediately closes the dead worker's queues. Any pop_completed / pop_failed items or queued exception payloads that the worker emitted just before it crashed become unreachable at that point. As a result, finished extract/validate work can be silently dropped.
Context
Deferred from PR #570 (release v1.0.0); see review comment.
The impact is narrow and self-healing for the current release:
Crash cause is not lost: propagate_exceptions() calls propagate_exception() and reports the fault at ERROR before recreate_if_dead() runs, so the exception is surfaced before any queues are closed.
Narrow at-risk window: only completions emitted between the last ModelUpdater.update() drain cycle and the crash are at risk.
Proposed improvement
Before calling _reap_dead(dead) / close_queues() in recreate_if_dead(), drain the dead worker's result queues (e.g. pop_completed, pop_failed, pop_latest_statuses) and buffer or surface the results so that no finished work is silently dropped.
File: src/python/controller/worker_supervisor.py, recreate_if_dead() method (lines ~120–148).
Requested by
@nitrobass24 (follow-up from PR #570 review)
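The following is a minimal sketch of the proposed drain-before-reap ordering in recreate_if_dead(). It is not the actual worker_supervisor.py code. It assumes that pop_completed(), pop_failed() and pop_latest_statuses() each return a list of items and return nothing once close_queues() has run. FakeWorker, WorkerSupervisorSketch, Salvaged and pop_salvaged() are hypothetical names. queue.Queue stands in for whatever queues the real workers use.

```python
from __future__ import annotations

import queue
from dataclasses import dataclass, field
from typing import Callable


class FakeWorker:
    """Hypothetical stand-in for a worker; the real worker API may differ."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self._closed = False
        # queue.Queue stands in for the worker's real result queues (assumption).
        self.completed_q: queue.Queue[str] = queue.Queue()
        self.failed_q: queue.Queue[str] = queue.Queue()
        self.status_q: queue.Queue[str] = queue.Queue()

    def _drain(self, q: queue.Queue[str]) -> list[str]:
        # Once the queues are closed, anything still buffered is unreachable.
        if self._closed:
            return []
        items: list[str] = []
        while True:
            try:
                items.append(q.get_nowait())
            except queue.Empty:
                return items

    def pop_completed(self) -> list[str]:
        return self._drain(self.completed_q)

    def pop_failed(self) -> list[str]:
        return self._drain(self.failed_q)

    def pop_latest_statuses(self) -> list[str]:
        return self._drain(self.status_q)

    def close_queues(self) -> None:
        self._closed = True


@dataclass
class Salvaged:
    """Results recovered from dead workers, waiting for the next consumer."""
    completed: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)
    statuses: list[str] = field(default_factory=list)


class WorkerSupervisorSketch:
    def __init__(self, factory: Callable[[str], FakeWorker]) -> None:
        self._factory = factory
        self.workers: dict[str, FakeWorker] = {}
        self._salvaged = Salvaged()

    def _reap_dead(self, dead: FakeWorker) -> None:
        dead.close_queues()

    def recreate_if_dead(self, name: str) -> FakeWorker:
        current = self.workers[name]
        if current.alive:
            return current
        # Proposed change: drain the dead worker's result queues BEFORE reaping it.
        self._salvaged.completed.extend(current.pop_completed())
        self._salvaged.failed.extend(current.pop_failed())
        self._salvaged.statuses.extend(current.pop_latest_statuses())
        self._reap_dead(current)
        replacement = self._factory(name)
        self.workers[name] = replacement
        return replacement

    def pop_salvaged(self) -> Salvaged:
        # Hand buffered results to the caller (e.g. the next update cycle) and reset.
        out, self._salvaged = self._salvaged, Salvaged()
        return out


sup = WorkerSupervisorSketch(FakeWorker)
worker = FakeWorker("extract")
sup.workers["extract"] = worker
worker.completed_q.put("file-a")  # finished just before the crash
worker.failed_q.put("file-b")
worker.alive = False

new_worker = sup.recreate_if_dead("extract")
salvaged = sup.pop_salvaged()
print(salvaged.completed, salvaged.failed)  # ['file-a'] ['file-b']
```

In this sketch, the salvaged results are buffered on the supervisor and handed off through pop_salvaged(). This keeps recreate_if_dead() free of consumer logic. An equally valid option would be to return the drained items directly or to log them at the point of recovery. The issue leaves the choice between buffering and surfacing open.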