Prevent the Tamanu sync from hanging indefinitely by xispa · Pull Request #156 · beyondessential/bes.lims

xispa · 2026-06-10T21:07:41Z

Description

Linked issue: #139

The Tamanu sync could wedge and stop syncing in both directions, with no visible failure, until the process was restarted by hand. This addresses the two mechanisms behind that: HTTP calls with no timeout, and a sync wrapper that turned a single hung run into a standing outage while reporting success.

Changes:

tamanu/session.py: add default (connect, read) = (10, 60) timeouts, applied to every get()/post() via setdefault (still overridable per call), plus a tighter (10, 30) on login, and an optional timeout argument on TamanuSession.
scripts/sync_tamanu.py: a requests Timeout is not a ConnectionError, so the existing handler would miss it — catch (ConnectionError, Timeout) around the sync loop and around login(), exiting cleanly through connection_error() instead of an unhandled traceback.
templates/sync_tamanu.in: take one lock for the whole run; surface a held lock as a non-zero exit + log line instead of swallowing it; cap each step with timeout --signal=TERM --kill-after=30s so a stuck step is reaped and the next cron run self-recovers; add set -o pipefail so the step's real exit code propagates through tee to the cron monitor.

Current behavior

TamanuSession.post()/get() call requests with no timeout, so a stalled socket (overloaded server, half-open connection, no RST) blocks forever. A hung login — the first call of every run — hangs the whole sync.
The wrapper guards each step with flock -n: a hung run never releases /tmp/sync_tamanu.lock, so every subsequent cron run fails to acquire it and silently does nothing until someone restarts the process.
Each wrapper step is ... |& tee, whose exit code is tee's 0, and a failed flock is swallowed, so the cron monitor records every run as a success — including the broken ones.

Desired behavior

Any Tamanu HTTP call aborts within its read timeout and the script exits non-zero with a clear ConnectionError/Timeout message.
A hung step is reaped by timeout, the lock is released, and the next cron cycle resumes on its own.
A run that fails or is skipped (lock held) exits non-zero, so the cron monitor reflects the real state instead of false greens.

--
I confirm I have tested this PR thoroughly and coded it according to PEP8
and Plone's Python styleguide standards.

The sync wrapper could wedge indefinitely and stop syncing without any visible failure. Four issues, each observed in production: * No wall-clock cap. A sync step blocking on a stalled HTTP request (e.g. login with no socket timeout) ran forever. * `flock -n` turned a hang into an outage. The hung run never released /tmp/sync_tamanu.lock, so every subsequent cron run failed to acquire it and silently did nothing until a human restarted the process. * A bare `... |& tee` pipeline always exits 0 (tee's code), and a failed flock was swallowed, so the cron monitor saw every run as a success — including the broken ones. * The lock was taken/released per step, leaving an interleave window between the three steps. Changes: * Acquire one lock (fd 9) for the whole run. A lock held by a prior run is surfaced as exit 75 plus a logged [SKIP] line instead of a silent success. * Wrap each step in `timeout --signal=TERM --kill-after=30s` so a stuck step is reaped (SIGTERM, then SIGKILL), the lock is freed, and the next cron cycle self-recovers. Per-step budgets sum under the cron cadence. * `set -uo pipefail` so the step's real exit code propagates through tee and the cron monitor reflects failures. This is a safety net at the wrapper level. The underlying fix is to add HTTP timeouts in bes.lims/tamanu/session.py so the script exits gracefully through its own ConnectionError handler instead of being killed mid-run.

TamanuSession built every request with no timeout, so `requests` blocked forever on a stalled socket (overloaded server, half-open connection, no RST). A single stalled call — most often login, the first call of every run — hung the whole sync process indefinitely until it was restarted by hand. session.py: * Add default (connect, read) = (10, 60) timeouts applied to every get() and post() via setdefault, so a caller can still override per call. * Use a tighter (10, 30) timeout for login, since it gates the whole run. * Accept an optional `timeout` in TamanuSession.__init__ so the default can be overridden per session. sync_tamanu.py: * A requests Timeout is not a subclass of ConnectionError, so the existing handler would have missed it. Catch (ConnectionError, Timeout) around the sync loop, and also wrap the login() call, so an unreachable or stalled server exits cleanly through connection_error() instead of raising an unhandled traceback. A stalled Tamanu call now aborts within the read timeout and exits non-zero with a clear message, instead of hanging the run.

xispa added 4 commits June 10, 2026 22:22

Changelog

d9b83a7

Merge branch 'master' into fix-sync-hangs

90fc890

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prevent the Tamanu sync from hanging indefinitely#156

Prevent the Tamanu sync from hanging indefinitely#156
xispa wants to merge 4 commits into
masterfrom
fix-sync-hangs

xispa commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

xispa commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Current behavior

Desired behavior

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

xispa commented Jun 10, 2026 •

edited

Loading