Skip to content

fix(fhir): no-issue: stop FHIR resolver job stalling forever on lock waits#10088

Open
passcod wants to merge 2 commits into
mainfrom
fix/fhir-resolver-lock-timeout
Open

fix(fhir): no-issue: stop FHIR resolver job stalling forever on lock waits#10088
passcod wants to merge 2 commits into
mainfrom
fix/fhir-resolver-lock-timeout

Conversation

@passcod

@passcod passcod commented Jun 18, 2026

Copy link
Copy Markdown
Member

Changes

🤖 The fhir.resolver job could sit in Started forever when its rematerialisation blocked on a lock held by another transaction (e.g. a long sync session or a bulk update). Because the owning worker kept heartbeating, the dead-worker reclaim path never fired and the job never errored — wedging the singleton resolver queue (the job uses a fixed discriminant, so no new resolver can be enqueued while one is stuck) and tripping the long-running-job healthcheck.

This bounds each record's lock waits with a configurable lock_timeout (new fhir.worker.resolverLockTimeout setting, default 2 minutes), set via set_config inside each resolve transaction, so a blocked resolve errors and retries on the same worker rather than hanging — keeping the "one at a time" invariant intact (no cross-worker stealing). It also moves the transaction boundary down from one transaction per run to one per record, so lock holds stay short and progress is preserved if a later record fails.

Added an integration test that holds a row lock from a separate transaction and asserts the resolve fails fast instead of stalling.

Auto-Deploy

  • Deploy
Options
  • Synthetic test
  • Generate fake data
  • More data (20Gi)
  • No facility servers (central-only)
  • No sync (facility tasks scaled to zero)
  • AMD64 architecture (default is arm64)
  • Skip mobile build
  • Always build mobile
  • Stay up for 8 hours
  • Stay up for 24 hours
  • Stay up (no TTL)
  • Build images only (don't deploy)
  • Pause this deploy

Tests

  • Run E2E tests
  • Run DAST scan

Review Hero

  • Run Review Hero
  • Auto-fix review suggestions Wait for Review Hero to finish, resolve any comments you disagree with or want to fix manually, then check this to auto-fix the rest.
  • Auto-fix CI failures Check this to auto-fix lint errors, test failures, and other CI issues.
  • Auto-merge upstream Check this to merge the base branch into this PR, with AI conflict resolution if needed.
  • Save suppressions Check this to capture 👎 reactions on Review Hero comments as suppression rules in .github/review-hero/suppressions.yml. Also runs automatically at the end of any auto-fix run.

Remember to...

  • ...write or update tests
  • ...add UI screenshots and testing notes to the Linear issue
  • ...add any manual upgrade steps to the Linear issue
  • ...update the config reference, settings reference, or any relevant runbook(s)
  • ...call out additions or changes to config files for the deployment team to take note of

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit b5142f2. Configure here.

Comment thread packages/central-server/app/tasks/fhir/resolver.js
Comment thread packages/central-server/app/tasks/fhir/resolver.js Outdated
Comment thread packages/database/src/models/fhir/Resource.ts
Comment thread packages/central-server/app/tasks/fhir/resolver.js Outdated
Comment thread packages/database/src/models/fhir/Resource.ts
@review-hero

review-hero Bot commented Jun 18, 2026

Copy link
Copy Markdown

🦸 Review Hero Summary
9 agents reviewed this PR | 0 critical | 4 suggestions | 0 nitpicks | Filtering: consensus 3 voters, 7 below threshold, 1 suppressed

Below consensus threshold (7 unique issues not confirmed by majority)
Location Agent Severity Comment
packages/central-server/app/tasks/fhir/resolver.js:54 Design & Architecture suggestion The ms() conversion (human-readable setting string → integer milliseconds) is done here in resolver.js, but the only consumer of the resulting number is the PostgreSQL set_config call deep insi...
packages/central-server/app/tasks/fhir/resolver.js:63 Design & Architecture suggestion Removing the outer sequelize.transaction() that previously wrapped all resource types is a meaningful atomicity change: previously the entire resolve run was all-or-nothing; now per-record transa...
packages/database/src/models/fhir/Resource.ts:226 Design & Architecture suggestion The unresolved-resource snapshot (findAll) is taken outside any transaction, then each row is resolved in its own separate transaction. In a concurrent deployment (multiple resolver workers) two wo...
packages/database/src/models/fhir/Resource.ts:226 Performance suggestion this.findAll({ where: { resolved: false } }) loads every unresolved FHIR resource for this type into memory with no LIMIT. During a large backlog (e.g. after a bulk upstream update) this can mate...
packages/database/src/models/fhir/Resource.ts:238 Performance suggestion The shift from one wrapping transaction for all resources to one transaction per unresolved record multiplies transaction begin/commit round-trips by N (the number of unresolved records). For a bac...
packages/database/src/models/fhir/Resource.ts:240 Design & Architecture nitpick QueryTypes.SELECT on a set_config(...) call is unconventional — the result is discarded and it is not a SELECT in the relational sense. QueryTypes.RAW more accurately describes the intent and...
packages/database/src/models/fhir/Resource.ts:242 Bugs & Correctness suggestion String(lockTimeoutMs) produces a bare integer string like '120000'. PostgreSQL does accept plain integers as milliseconds for timeout GUCs, but this is implicit. Passing ${lockTimeoutMs}ms (e...
Local fix prompt (copy to your coding agent)

Fix these issues identified on the pull request. One commit per issue fixed.


packages/central-server/app/tasks/fhir/resolver.js:54: If the setting value is unparseable or missing, ms() returns undefined, making lockTimeoutMs silently undefined. The resolver then falls back to no lock timeout (the stalling behaviour this PR is fixing) without any log warning. Consider adding a guard: if (!lockTimeoutMs) { log.warn('fhir.worker.resolverLockTimeout is missing or unparseable; resolver has no lock timeout'); }


packages/database/src/models/fhir/Resource.ts:247: The first record that hits a lock timeout rethrows and aborts all remaining unresolved resources for this resource type. Because the outer transaction is gone, earlier records are already committed, but later records in the same batch are silently skipped for this run. The error message says which resource failed, but there is no indication of how many records were left unprocessed. Logging the count of skipped records before rethrowing, or catching the lock-timeout error specifically and continuing to the next record (letting the job complete), would give better operational visibility and avoid starving records that appear after the blocked one in the batch.


packages/central-server/app/tasks/fhir/resolver.js:54: models.Setting.get('fhir.worker.resolverLockTimeout') issues a DB query on every invocation of the resolver task. Since this setting rarely changes at runtime, the overhead accumulates across frequent resolver runs. If Setting.get doesn't already cache, consider reading the value once at task startup/initialisation rather than per-run.


packages/database/src/models/fhir/Resource.ts:239: The set_config('lock_timeout', ..., true) call is issued once per record inside its own transaction. Since the lockTimeoutMs value is constant for the entire resolver run, this adds an extra round-trip to every record's transaction (BEGIN + set_config SELECT + materialise + COMMIT instead of BEGIN + materialise + COMMIT). With a large backlog of unresolved resources this doubles the per-record DB overhead. A cheaper alternative: set the lock_timeout once at session level before the loop using set_config('lock_timeout', value, false) (or a raw SET LOCAL-equivalent outside any transaction), then let all per-record transactions inherit it. This avoids the per-record cost while still bounding every lock acquisition in the loop.

@passcod passcod force-pushed the fix/fhir-resolver-lock-timeout branch from b5142f2 to add69fe Compare June 18, 2026 13:34
…waits

The fhir.resolver job could sit in 'Started' indefinitely when its
rematerialisation blocked on a lock held elsewhere (e.g. a long sync
session or a bulk update). The worker kept heartbeating, so the
dead-worker reclaim path never fired and the job never errored, which
wedged the singleton resolver queue and tripped the long-running-job
healthcheck.

Bound each record's lock waits with a configurable lock_timeout
(fhir.worker.resolverLockTimeout, default 2 minutes) so a blocked
resolve errors and retries on the same worker instead of hanging.
Also move the transaction boundary down to per-record, so lock holds
stay short and progress is preserved if a later record fails.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@passcod passcod force-pushed the fix/fhir-resolver-lock-timeout branch from add69fe to 90a3700 Compare June 18, 2026 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant