Prevent the Tamanu sync from hanging indefinitely#156
Open
xispa wants to merge 4 commits into
Open
Conversation
The sync wrapper could wedge indefinitely and stop syncing without any visible
failure. Four issues, each observed in production:
* No wall-clock cap. A sync step blocking on a stalled HTTP request (e.g.
login with no socket timeout) ran forever.
* `flock -n` turned a hang into an outage. The hung run never released
/tmp/sync_tamanu.lock, so every subsequent cron run failed to acquire it
and silently did nothing until a human restarted the process.
* A bare `... |& tee` pipeline always exits 0 (tee's code), and a failed
flock was swallowed, so the cron monitor saw every run as a success —
including the broken ones.
* The lock was taken/released per step, leaving an interleave window
between the three steps.
Changes:
* Acquire one lock (fd 9) for the whole run. A lock held by a prior run is
surfaced as exit 75 plus a logged [SKIP] line instead of a silent success.
* Wrap each step in `timeout --signal=TERM --kill-after=30s` so a stuck
step is reaped (SIGTERM, then SIGKILL), the lock is freed, and the next
cron cycle self-recovers. Per-step budgets sum under the cron cadence.
* `set -uo pipefail` so the step's real exit code propagates through tee
and the cron monitor reflects failures.
This is a safety net at the wrapper level. The underlying fix is to add
HTTP timeouts in bes.lims/tamanu/session.py so the script exits gracefully
through its own ConnectionError handler instead of being killed mid-run.
TamanuSession built every request with no timeout, so `requests` blocked
forever on a stalled socket (overloaded server, half-open connection, no RST).
A single stalled call — most often login, the first call of every run — hung
the whole sync process indefinitely until it was restarted by hand.
session.py:
* Add default (connect, read) = (10, 60) timeouts applied to every get()
and post() via setdefault, so a caller can still override per call.
* Use a tighter (10, 30) timeout for login, since it gates the whole run.
* Accept an optional `timeout` in TamanuSession.__init__ so the default can
be overridden per session.
sync_tamanu.py:
* A requests Timeout is not a subclass of ConnectionError, so the existing
handler would have missed it. Catch (ConnectionError, Timeout) around the
sync loop, and also wrap the login() call, so an unreachable or stalled
server exits cleanly through connection_error() instead of raising an
unhandled traceback.
A stalled Tamanu call now aborts within the read timeout and exits non-zero
with a clear message, instead of hanging the run.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Linked issue: #139
The Tamanu sync could wedge and stop syncing in both directions, with no visible failure, until the process was restarted by hand. This addresses the two mechanisms behind that: HTTP calls with no timeout, and a sync wrapper that turned a single hung run into a standing outage while reporting success.
Changes:
tamanu/session.py: add default(connect, read) = (10, 60)timeouts, applied to everyget()/post()viasetdefault(still overridable per call), plus a tighter(10, 30)onlogin, and an optionaltimeoutargument onTamanuSession.scripts/sync_tamanu.py: arequestsTimeoutis not aConnectionError, so the existing handler would miss it — catch(ConnectionError, Timeout)around the sync loop and aroundlogin(), exiting cleanly throughconnection_error()instead of an unhandled traceback.templates/sync_tamanu.in: take one lock for the whole run; surface a held lock as a non-zero exit + log line instead of swallowing it; cap each step withtimeout --signal=TERM --kill-after=30sso a stuck step is reaped and the next cron run self-recovers; addset -o pipefailso the step's real exit code propagates throughteeto the cron monitor.Current behavior
TamanuSession.post()/get()callrequestswith no timeout, so a stalled socket (overloaded server, half-open connection, no RST) blocks forever. A hunglogin— the first call of every run — hangs the whole sync.flock -n: a hung run never releases/tmp/sync_tamanu.lock, so every subsequent cron run fails to acquire it and silently does nothing until someone restarts the process.... |& tee, whose exit code istee's0, and a failedflockis swallowed, so the cron monitor records every run as a success — including the broken ones.Desired behavior
ConnectionError/Timeoutmessage.timeout, the lock is released, and the next cron cycle resumes on its own.--
I confirm I have tested this PR thoroughly and coded it according to PEP8
and Plone's Python styleguide standards.