Skip to content

Prevent the Tamanu sync from hanging indefinitely#156

Open
xispa wants to merge 4 commits into
masterfrom
fix-sync-hangs
Open

Prevent the Tamanu sync from hanging indefinitely#156
xispa wants to merge 4 commits into
masterfrom
fix-sync-hangs

Conversation

@xispa

@xispa xispa commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Description

Linked issue: #139

The Tamanu sync could wedge and stop syncing in both directions, with no visible failure, until the process was restarted by hand. This addresses the two mechanisms behind that: HTTP calls with no timeout, and a sync wrapper that turned a single hung run into a standing outage while reporting success.

Changes:

  • tamanu/session.py: add default (connect, read) = (10, 60) timeouts, applied to every get()/post() via setdefault (still overridable per call), plus a tighter (10, 30) on login, and an optional timeout argument on TamanuSession.
  • scripts/sync_tamanu.py: a requests Timeout is not a ConnectionError, so the existing handler would miss it — catch (ConnectionError, Timeout) around the sync loop and around login(), exiting cleanly through connection_error() instead of an unhandled traceback.
  • templates/sync_tamanu.in: take one lock for the whole run; surface a held lock as a non-zero exit + log line instead of swallowing it; cap each step with timeout --signal=TERM --kill-after=30s so a stuck step is reaped and the next cron run self-recovers; add set -o pipefail so the step's real exit code propagates through tee to the cron monitor.

Current behavior

  • TamanuSession.post()/get() call requests with no timeout, so a stalled socket (overloaded server, half-open connection, no RST) blocks forever. A hung login — the first call of every run — hangs the whole sync.
  • The wrapper guards each step with flock -n: a hung run never releases /tmp/sync_tamanu.lock, so every subsequent cron run fails to acquire it and silently does nothing until someone restarts the process.
  • Each wrapper step is ... |& tee, whose exit code is tee's 0, and a failed flock is swallowed, so the cron monitor records every run as a success — including the broken ones.

Desired behavior

  • Any Tamanu HTTP call aborts within its read timeout and the script exits non-zero with a clear ConnectionError/Timeout message.
  • A hung step is reaped by timeout, the lock is released, and the next cron cycle resumes on its own.
  • A run that fails or is skipped (lock held) exits non-zero, so the cron monitor reflects the real state instead of false greens.

--
I confirm I have tested this PR thoroughly and coded it according to PEP8
and Plone's Python styleguide standards.

xispa added 4 commits June 10, 2026 22:22
The sync wrapper could wedge indefinitely and stop syncing without any visible
failure. Four issues, each observed in production:

  * No wall-clock cap. A sync step blocking on a stalled HTTP request (e.g.
    login with no socket timeout) ran forever.
  * `flock -n` turned a hang into an outage. The hung run never released
    /tmp/sync_tamanu.lock, so every subsequent cron run failed to acquire it
    and silently did nothing until a human restarted the process.
  * A bare `... |& tee` pipeline always exits 0 (tee's code), and a failed
    flock was swallowed, so the cron monitor saw every run as a success —
    including the broken ones.
  * The lock was taken/released per step, leaving an interleave window
    between the three steps.

Changes:

  * Acquire one lock (fd 9) for the whole run. A lock held by a prior run is
    surfaced as exit 75 plus a logged [SKIP] line instead of a silent success.
  * Wrap each step in `timeout --signal=TERM --kill-after=30s` so a stuck
    step is reaped (SIGTERM, then SIGKILL), the lock is freed, and the next
    cron cycle self-recovers. Per-step budgets sum under the cron cadence.
  * `set -uo pipefail` so the step's real exit code propagates through tee
    and the cron monitor reflects failures.

This is a safety net at the wrapper level. The underlying fix is to add
HTTP timeouts in bes.lims/tamanu/session.py so the script exits gracefully
through its own ConnectionError handler instead of being killed mid-run.
TamanuSession built every request with no timeout, so `requests` blocked
forever on a stalled socket (overloaded server, half-open connection, no RST).
A single stalled call — most often login, the first call of every run — hung
the whole sync process indefinitely until it was restarted by hand.

session.py:

  * Add default (connect, read) = (10, 60) timeouts applied to every get()
    and post() via setdefault, so a caller can still override per call.
  * Use a tighter (10, 30) timeout for login, since it gates the whole run.
  * Accept an optional `timeout` in TamanuSession.__init__ so the default can
    be overridden per session.

sync_tamanu.py:

  * A requests Timeout is not a subclass of ConnectionError, so the existing
    handler would have missed it. Catch (ConnectionError, Timeout) around the
    sync loop, and also wrap the login() call, so an unreachable or stalled
    server exits cleanly through connection_error() instead of raising an
    unhandled traceback.

A stalled Tamanu call now aborts within the read timeout and exits non-zero
with a clear message, instead of hanging the run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant