Make the Tamanu task queue resilient to failing tasks #157

Open
xispa wants to merge 3 commits into master from tamanu-queue-resilient

xispa (Collaborator) commented Jun 10, 2026

Description

Linked issue: #139

A single failing task could block the entire outbound queue. This adds per-task isolation, retry-with-back-off, and a dead-letter store so a bad record is parked instead of blocking everything behind it, and successful sends can no longer be rolled back into duplicates.

Changes:

  • tamanu/tasks/queue.py:
    • get() returns a (task_id, task) tuple so the worker can act on a failed task. (None, None) = nothing ready; (task_id, None) = head popped but unresolvable (invalid id / context gone) and should be dropped.
    • add fail(task_id, error): increments the attempt count and reschedules the task with a growing back-off for the first MAX_ATTEMPTS failures, then moves the task to a dead-letter store (TAMANU_TASKS_DEADLETTER) with the error and attempt count.
    • queue values are now {"when", "attempts"} dicts, but reads tolerate the legacy bare-int values, so no migration is required.
  • scripts/exec_tamanu_tasks.py:
    • commit per task, so a successful send durably removes the task and a later failure cannot roll back earlier sends.
    • wrap task.process(): on ConflictError, abort and retry next run without penalising the attempt count; on any other error, log it, abort the partial writes, and retry-with-back-off or dead-letter the task in a fresh transaction.
    • drop unresolvable tasks instead of stopping the batch.
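
The retry and dead-letter rule described for fail() can be sketched roughly as below. This is a minimal illustration, not the code in this PR: plain dicts stand in for the persistent queue and the TAMANU_TASKS_DEADLETTER store, the values of MAX_ATTEMPTS and RETRY_BACKOFF are assumed, and the exact cut-off (reschedule while attempts <= MAX_ATTEMPTS) is one reading of the description.

```python
import time

MAX_ATTEMPTS = 5     # assumed value; the PR does not state it
RETRY_BACKOFF = 300  # assumed seconds per attempt; the PR does not state it


def fail(queue, deadletter, task_id, error, now=None):
    """Reschedule a failed task with a growing delay, or park it.

    `queue` and `deadletter` are plain dicts here (hypothetical stand-ins
    for the persistent mappings used by the real queue).
    """
    now = int(time.time()) if now is None else now
    entry = queue.get(task_id, {"when": now, "attempts": 0})
    attempts = entry["attempts"] + 1

    if attempts <= MAX_ATTEMPTS:
        # Linear back-off: the delay grows with every failed attempt.
        queue[task_id] = {
            "when": now + attempts * RETRY_BACKOFF,
            "attempts": attempts,
        }
        return "retry"

    # Too many failures: remove from the queue, keep it visible elsewhere.
    queue.pop(task_id, None)
    deadletter[task_id] = {"error": str(error), "attempts": attempts}
    return "deadletter"
```

With these assumed values, a task that keeps failing is pushed back by 300 s, 600 s, and so on up to 1500 s, and lands in the dead-letter dict on the sixth failure.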

Current behavior

exec_tamanu_tasks pops a task (deletes it from the queue), then runs task.process() with no exception handling and only a savepoint per task, committing once at the end. When Tamanu rejects a record (e.g. an empty result, raise_for_status), the exception aborts the transaction, which rolls back the dequeue — so the same task re-heads the queue on every run and blocks everything behind it. A late failure also rolls back already-completed dequeues whose Bundle POSTs already reached Tamanu, producing duplicate reports.

Desired behavior

A failing task is retried with back-off and, if it keeps failing, moved to a dead-letter store where it stays visible for inspection/re-injection — it never blocks the queue head. Successful tasks are committed individually, so no report is ever re-sent because of an unrelated later failure. Transient ConflictErrors are retried on the next run without counting against the task.
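
A rough sketch of a worker loop with this behavior follows. It is illustrative only: `queue` and `txn` are hypothetical objects exposing get()/fail() and commit()/abort(), the local ConflictError class stands in for ZODB's, and stopping the run after a conflict is this sketch's choice (the description only says the task is retried on the next run).

```python
import logging

logger = logging.getLogger("tamanu.tasks")  # hypothetical logger name


class ConflictError(Exception):
    """Stand-in for ZODB.POSException.ConflictError in this sketch."""


def run_batch(queue, txn, max_tasks=100):
    """Process up to max_tasks tasks, committing after each one."""
    stats = {"sent": 0, "dropped": 0, "failed": 0, "conflicts": 0}
    for _ in range(max_tasks):
        task_id, task = queue.get()
        if task_id is None:
            break  # (None, None): nothing is ready
        if task is None:
            txn.commit()  # (task_id, None): persist the drop and move on
            stats["dropped"] += 1
            continue
        try:
            task.process()
            txn.commit()  # the dequeue is now durable; no later rollback
            stats["sent"] += 1
        except ConflictError:
            txn.abort()  # dequeue is undone; attempt count is untouched
            stats["conflicts"] += 1
            break  # assumption: leave the task for the next run
        except Exception as exc:
            logger.exception("Task %s failed", task_id)
            txn.abort()  # discard partial writes from process()
            queue.fail(task_id, exc)  # back-off or dead-letter
            txn.commit()  # record the failure in a fresh transaction
            stats["failed"] += 1
    return stats
```

Committing inside the try block means a conflict raised at commit time is handled the same way as one raised during process().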

--
I confirm I have tested this PR thoroughly and coded it according to PEP8
and Plone's Python styleguide standards.

xispa added 3 commits June 10, 2026 23:01
A failing task could block the whole outbound queue. exec_tamanu_tasks popped
a task (del from the OOBTree), then ran task.process() with no exception
handling and only a savepoint per task, committing once at the end. When Tamanu
rejected a record (e.g. an empty result, raise_for_status), the exception
aborted the transaction, which rolled back the dequeue — so the same task
re-headed the queue every run and blocked everything behind it. A late failure
also rolled back already-completed dequeues whose Bundle POSTs had already
reached Tamanu, causing duplicate reports.

queue.py:

  * get() returns a (task_id, task) tuple so the worker can act on a failed
    task. (None, None) means nothing is ready; (task_id, None) means the head
    was popped but cannot be resolved (invalid id or its context no longer
    exists) and should be dropped — previously this made the worker stop the
    whole batch.
  * Add fail(task_id, error): increments the attempt count and reschedules the
    task with a growing back-off (attempt * RETRY_BACKOFF) for the first
    MAX_ATTEMPTS, then moves it to a dead-letter store (TAMANU_TASKS_DEADLETTER)
    with the error and attempt count. Transient outages recover within the
    retries; a persistently failing record is parked instead of blocking the
    head, while staying visible for inspection/re-injection.
  * Queue values are now {"when", "attempts"} dicts, but reads tolerate the
    legacy bare-int values, so no migration is required.
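
For illustration, the three-way return contract of get() and the tolerant read of legacy values might look like the sketch below. It is not the code in this commit: a plain dict stands in for the OOBTree, `resolve` is a hypothetical callable that returns None for an unresolvable id, and treating a legacy bare int as the "when" timestamp is an assumption.

```python
import time


def normalize(value):
    """Return a {"when", "attempts"} dict for new and legacy values."""
    if isinstance(value, dict):
        return {
            "when": int(value.get("when", 0)),
            "attempts": int(value.get("attempts", 0)),
        }
    # Legacy entry: assumed to be a bare timestamp with no attempts yet.
    return {"when": int(value), "attempts": 0}


def get(queue, resolve, now=None):
    """Pop the earliest ready task and return (task_id, task).

    (None, None)    -> nothing is ready
    (task_id, None) -> head was popped but could not be resolved
    """
    now = int(time.time()) if now is None else now
    ready = []
    for task_id, value in queue.items():
        when = normalize(value)["when"]
        if when <= now:
            ready.append((when, task_id))
    if not ready:
        return None, None
    _, task_id = min(ready)  # earliest scheduled entry first
    del queue[task_id]
    return task_id, resolve(task_id)
```

Because normalization happens on read, old and new entries can sit side by side in the same mapping, which is why no migration step is needed.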

exec_tamanu_tasks.py:

  * Commit per task so a successful send durably removes the task and a later
    failure can no longer roll back earlier sends (no more duplicate reports).
  * Wrap task.process(): on ConflictError, abort and retry next run without
    penalising the attempt count; on any other error, log it, abort the partial
    writes, and retry-with-back-off or dead-letter the task in a fresh
    transaction. A failing task no longer aborts the batch.
  * Drop unresolvable tasks instead of stopping the batch.