fix(fhir): no-issue: stop FHIR resolver job stalling forever on lock waits by passcod · Pull Request #10088 · beyondessential/tamanu

passcod · 2026-06-18T11:54:25Z

Changes

🤖 The fhir.resolver job could sit in Started forever when its rematerialisation blocked on a lock held by another transaction (e.g. a long sync session or a bulk update). Because the owning worker kept heartbeating, the dead-worker reclaim path never fired and the job never errored — wedging the singleton resolver queue (the job uses a fixed discriminant, so no new resolver can be enqueued while one is stuck) and tripping the long-running-job healthcheck.

This bounds each record's lock waits with a configurable lock_timeout (new fhir.worker.resolverLockTimeout setting, default 2 minutes), set via set_config inside each resolve transaction, so a blocked resolve errors and retries on the same worker rather than hanging — keeping the "one at a time" invariant intact (no cross-worker stealing). It also moves the transaction boundary down from one transaction per run to one per record, so lock holds stay short and progress is preserved if a later record fails.

Added an integration test that holds a row lock from a separate transaction and asserts the resolve fails fast instead of stalling.

Auto-Deploy

Deploy

Options

Tests

Run E2E tests
Run DAST scan

Review Hero

Run Review Hero
Auto-fix review suggestions Wait for Review Hero to finish, resolve any comments you disagree with or want to fix manually, then check this to auto-fix the rest.
Auto-fix CI failures Check this to auto-fix lint errors, test failures, and other CI issues.
Auto-merge upstream Check this to merge the base branch into this PR, with AI conflict resolution if needed.
Save suppressions Check this to capture 👎 reactions on Review Hero comments as suppression rules in .github/review-hero/suppressions.yml. Also runs automatically at the end of any auto-fix run.

Remember to...

...write or update tests
...add UI screenshots and testing notes to the Linear issue
...add any manual upgrade steps to the Linear issue
...update the config reference, settings reference, or any relevant runbook(s)
...call out additions or changes to config files for the deployment team to take note of

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

^{Reviewed by Cursor Bugbot for commit b5142f2. Configure here.}

review-hero · 2026-06-18T12:01:08Z

🦸 Review Hero Summary
9 agents reviewed this PR | 0 critical | 4 suggestions | 0 nitpicks | Filtering: consensus 3 voters, 7 below threshold, 1 suppressed

Below consensus threshold (7 unique issues not confirmed by majority)

Location	Agent	Severity	Comment
`packages/central-server/app/tasks/fhir/resolver.js:54`	Design & Architecture	`suggestion`	The `ms()` conversion (human-readable setting string → integer milliseconds) is done here in resolver.js, but the only consumer of the resulting number is the PostgreSQL `set_config` call deep insi...
`packages/central-server/app/tasks/fhir/resolver.js:63`	Design & Architecture	`suggestion`	Removing the outer `sequelize.transaction()` that previously wrapped all resource types is a meaningful atomicity change: previously the entire resolve run was all-or-nothing; now per-record transa...
`packages/database/src/models/fhir/Resource.ts:226`	Design & Architecture	`suggestion`	The unresolved-resource snapshot (findAll) is taken outside any transaction, then each row is resolved in its own separate transaction. In a concurrent deployment (multiple resolver workers) two wo...
`packages/database/src/models/fhir/Resource.ts:226`	Performance	`suggestion`	`this.findAll({ where: { resolved: false } })` loads every unresolved FHIR resource for this type into memory with no LIMIT. During a large backlog (e.g. after a bulk upstream update) this can mate...
`packages/database/src/models/fhir/Resource.ts:238`	Performance	`suggestion`	The shift from one wrapping transaction for all resources to one transaction per unresolved record multiplies transaction begin/commit round-trips by N (the number of unresolved records). For a bac...
`packages/database/src/models/fhir/Resource.ts:240`	Design & Architecture	`nitpick`	`QueryTypes.SELECT` on a `set_config(...)` call is unconventional — the result is discarded and it is not a SELECT in the relational sense. `QueryTypes.RAW` more accurately describes the intent and...
`packages/database/src/models/fhir/Resource.ts:242`	Bugs & Correctness	`suggestion`	`String(lockTimeoutMs)` produces a bare integer string like `'120000'`. PostgreSQL does accept plain integers as milliseconds for timeout GUCs, but this is implicit. Passing `${lockTimeoutMs}ms` (e...

Local fix prompt (copy to your coding agent)

Fix these issues identified on the pull request. One commit per issue fixed.

packages/central-server/app/tasks/fhir/resolver.js:54: If the setting value is unparseable or missing, ms() returns undefined, making lockTimeoutMs silently undefined. The resolver then falls back to no lock timeout (the stalling behaviour this PR is fixing) without any log warning. Consider adding a guard: if (!lockTimeoutMs) { log.warn('fhir.worker.resolverLockTimeout is missing or unparseable; resolver has no lock timeout'); }

packages/database/src/models/fhir/Resource.ts:247: The first record that hits a lock timeout rethrows and aborts all remaining unresolved resources for this resource type. Because the outer transaction is gone, earlier records are already committed, but later records in the same batch are silently skipped for this run. The error message says which resource failed, but there is no indication of how many records were left unprocessed. Logging the count of skipped records before rethrowing, or catching the lock-timeout error specifically and continuing to the next record (letting the job complete), would give better operational visibility and avoid starving records that appear after the blocked one in the batch.

packages/central-server/app/tasks/fhir/resolver.js:54: models.Setting.get('fhir.worker.resolverLockTimeout') issues a DB query on every invocation of the resolver task. Since this setting rarely changes at runtime, the overhead accumulates across frequent resolver runs. If Setting.get doesn't already cache, consider reading the value once at task startup/initialisation rather than per-run.

packages/database/src/models/fhir/Resource.ts:239: The set_config('lock_timeout', ..., true) call is issued once per record inside its own transaction. Since the lockTimeoutMs value is constant for the entire resolver run, this adds an extra round-trip to every record's transaction (BEGIN + set_config SELECT + materialise + COMMIT instead of BEGIN + materialise + COMMIT). With a large backlog of unresolved resources this doubles the per-record DB overhead. A cheaper alternative: set the lock_timeout once at session level before the loop using set_config('lock_timeout', value, false) (or a raw SET LOCAL-equivalent outside any transaction), then let all per-record transactions inherit it. This avoids the per-record cost while still bounding every lock acquisition in the loop.

…waits The fhir.resolver job could sit in 'Started' indefinitely when its rematerialisation blocked on a lock held elsewhere (e.g. a long sync session or a bulk update). The worker kept heartbeating, so the dead-worker reclaim path never fired and the job never errored, which wedged the singleton resolver queue and tripped the long-running-job healthcheck. Bound each record's lock waits with a configurable lock_timeout (fhir.worker.resolverLockTimeout, default 2 minutes) so a blocked resolve errors and retries on the same worker instead of hanging. Also move the transaction boundary down to per-record, so lock holds stay short and progress is preserved if a later record fails. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread packages/central-server/app/tasks/fhir/resolver.js

review-hero Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread packages/central-server/app/tasks/fhir/resolver.js Outdated

Comment thread packages/database/src/models/fhir/Resource.ts

Comment thread packages/central-server/app/tasks/fhir/resolver.js Outdated

Comment thread packages/database/src/models/fhir/Resource.ts

passcod force-pushed the fix/fhir-resolver-lock-timeout branch from b5142f2 to add69fe Compare June 18, 2026 13:34

passcod force-pushed the fix/fhir-resolver-lock-timeout branch from add69fe to 90a3700 Compare June 18, 2026 13:43

Merge branch 'main' into fix/fhir-resolver-lock-timeout

49b1e38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(fhir): no-issue: stop FHIR resolver job stalling forever on lock waits#10088

fix(fhir): no-issue: stop FHIR resolver job stalling forever on lock waits#10088
passcod wants to merge 2 commits into
mainfrom
fix/fhir-resolver-lock-timeout

passcod commented Jun 18, 2026 •

edited by github-actions Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

review-hero Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

passcod commented Jun 18, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Auto-Deploy

Tests

Review Hero

Remember to...

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

review-hero Bot commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

passcod commented Jun 18, 2026 •

edited by github-actions Bot

Loading