Skip to content

Fix AB-BA context/TestAccess lock deadlock in reduceHierarchy#29

Merged
mansbernhardt merged 1 commit into
mainfrom
claude/hierarchy-lock-deadlock
Jun 24, 2026
Merged

Fix AB-BA context/TestAccess lock deadlock in reduceHierarchy#29
mansbernhardt merged 1 commit into
mainfrom
claude/hierarchy-lock-deadlock

Conversation

@mansbernhardt

Copy link
Copy Markdown
Collaborator

Summary

AnyContext.reduce — the core of reduceHierarchy/mapHierarchy, and the path every @EnvironmentStorage read walks to resolve a value up the ancestor chain — acquired the TestAccess lock while already holding the context (hierarchy) lock, inverting the TestAccess → context lock order that Context._modify / Context.transaction enforce. That AB-BA inversion deadlocks a concurrent writer against a parent-observation traversal, so settle() / expect() hang (never reach a fixpoint).

The registration ran inside lock(observedParents):

parents = lock(observedParents)   // observedParents = { willAccessParents(); weakParents.compactMap(\.parent) }

willAccessParents()TestAccess.willAccessregisterReadOnlyPathWake takes the TestAccess lock. The fix hoists it out of the lock:

willAccessParents()
parents = lock { weakParents.compactMap(\.parent) }

How it was found

Surfaced as a 100% hang in a downstream suite (ParallelEditorTests) after bumping to 1.0.5, whose concurrent swift-model.test-drain.shared queue runs two models' activation in parallel and exposed the latent inversion. Root-caused with lldb (process attach + thread backtrace all) on the wedged process:

Thread Holds Wants Path
writer TestAccess.lock (acquireWriteLock) context.lock Context._modify of a property
traversal context.lock (lock(observedParents)) TestAccess.lock environmentValuereduceHierarchy(observeParents:)observedParentswillAccessParents()TestAccess.willAccessregisterReadOnlyPathWake

registerReadOnlyPathWake's own comment documents the opposite ("context → TestAccess") order — the two subsystems disagreed on lock order.

Audit for sibling inversions

Checked every willAccess* synthetic-path site that takes the TestAccess lock (via registerReadOnlyPathWake):

  • parents (reduce) — the inversion, fixed here.
  • storage (AnyContext[storage] getter + environmentValue traversal) — already registers observation before/outside the lock.
  • preference (AnyContext[preference] getter + preferenceValue traversal) — already outside the lock, and already carries explicit deadlock-avoidance comments for this exact "TestAccess lock held on another thread" hazard (isApplyingSnapshot guard, post-traversal willAccessPreferenceValue).

So the parents-in-reduce path was the unique remaining inversion.

Verification

  • Downstream ParallelEditorTests: 504/504 pass, no hang (was a 100% wedge under the 1.0.5 bump).
  • Full swift-model suite: 819 tests, no regression — the only failure is a pre-existing load-induced ClockTests.testImmediateClock flake that passes in 0.045s in isolation on main as well.
  • Added HierarchyLockOrderDeadlockTests — a concurrency smoke test that drives the exact code path (concurrent deep environment-read traversals + Context._modify writes). Its file header documents honestly why a deterministic minimal repro isn't feasible (the window is too narrow synthetically, and TestAccess's locks can't be hooked to force the interleaving without adding production test seams); the authoritative reproduction is the downstream editor suite.

Locking-discipline change only — no behavior change, and a no-op outside .modelTesting.

🤖 Generated with Claude Code

AnyContext.reduce acquired the TestAccess lock (via willAccessParents →
TestAccess.willAccess → registerReadOnlyPathWake) while already holding the
context (hierarchy) lock, because the registration ran inside
lock(observedParents). That inverts the TestAccess.lock → context.lock order
Context._modify / Context.transaction enforce (acquireWriteLock() before
lock.lock()). A concurrent writer holding the TestAccess lock and waiting on the
context lock, racing a parent-observation traversal holding the context lock and
waiting on the TestAccess lock, deadlocked — observed as settle()/expect()
hanging when one model activates and reads an environment value while another
writes a property on the concurrent drive.

Hoist willAccessParents() out of the context lock so the registration runs on
the same TestAccess → context order as every writer; the parents list is still
read under the lock. The sibling storage/preference traversals already register
observation outside the lock.

Verified: downstream ParallelEditorTests 504/504 (was a 100% hang under the
swift-model 1.0.5 bump); full swift-model suite clean (the only failure is a
pre-existing load-induced ClockTests flake that passes in isolation on main).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mansbernhardt mansbernhardt merged commit 45613bd into main Jun 24, 2026
6 checks passed
@mansbernhardt mansbernhardt deleted the claude/hierarchy-lock-deadlock branch June 24, 2026 08:46
mansbernhardt added a commit that referenced this pull request Jun 27, 2026
* Fix AB-BA memoize/TestAccess lock-order deadlock (sibling of #29)

`memoize`'s first-access setup took the context (hierarchy) lock B
(`context.lock { … }`) and only then, via the nested `Context.transaction`,
the `TestAccess` write lock A (`acquireWriteLock()`) — i.e. B→A. Every writer
(`Context._modify` / `Context.transaction`) takes A→B (`acquireWriteLock()`
before `lock.lock()`). Two threads running the opposite orders concurrently
deadlock: the memoize thread holds B and waits for A; the writer holds A and
waits for B.

This is the exact lock-pair / AB-BA family as the 1.0.6 `reduceHierarchy` fix
(#29), on the path that fix did not cover. It is reachable only under
`.modelTesting` — the base `ModelAccess.acquireWriteLock()` is a no-op in
production, so A does not exist there; it never wedges a shipping app, only the
test plan.

Captured live with `sample` on a hung xctest in a downstream consumer's macOS
package test plan: a model activation whose `onActivate` read a memoized child
value (holding B, awaiting A) while drain-executor model tasks ran
`Context._modify` writes (holding A, awaiting B). The victim test varied
run-to-run (whichever model was mid-activation when the race landed), which is
why load-aimed mitigations (serial execution, build governors, timeout scaling)
never cured it — a deadlock is immune to all of them.

Fix: acquire the `TestAccess` write lock before the context lock in `memoize`'s
first-access block, matching the canonical A→B order. Both locks are recursive,
so the nested `Context.transaction` re-enters harmlessly. Locking-discipline
change only — no behavior change, no-op outside `.modelTesting`.

Adds `MemoizeLockOrderDeadlockTests` (a concurrency smoke test mirroring
`HierarchyLockOrderDeadlockTests`); the deterministic guard is the downstream
plan, which hung ~every run before and passes after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Stabilize testForEachCancelPreviousInheritsContext (Linux-parallel flake)

Pre-existing load-sensitive flake (history of "Test stability fixes"),
surfaced reliably on the Linux (parallel) CI job once this branch added a
suite — it failed `(processedCount → 1) == 0` ~every run.

The test sent a value, waited for the per-element body to START, cancelled,
then after a fixed 100 ms wall-clock wait asserted the 500 ms work had been
interrupted (`processedCount == 0`). Under heavy parallel CI load, >500 ms could
elapse between work-start and the assert, so the work's `Task.sleep(500ms)`
completed and `+= value` ran before the check — a wall-clock race, not a real
regression.

Made deterministic following the existing `waitUntil` convention: the per-element
work now sleeps long enough (30 s) that it can ONLY finish via cancellation, and
the body records (in a `@Locked` flag, from the sleep's catch) that cancellation
interrupted it before the write. The test `waitUntil`s that flag (resolves in ms
once cancellation propagates; generous 10 s bound for a saturated box), then
asserts the write never ran. If cancellation ever fails to interrupt the body —
the actual regression this guards — the wait times out and the test fails.

Verified: passes 5/5 locally (~0.017 s each, vs the old ~600 ms timing dance).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant