Fix AB-BA context/TestAccess lock deadlock in reduceHierarchy by mansbernhardt · Pull Request #29 · bitofmind/swift-model

mansbernhardt · 2026-06-24T08:28:23Z

Summary

AnyContext.reduce — the core of reduceHierarchy/mapHierarchy, and the path every @EnvironmentStorage read walks to resolve a value up the ancestor chain — acquired the TestAccess lock while already holding the context (hierarchy) lock, inverting the TestAccess → context lock order that Context._modify / Context.transaction enforce. That AB-BA inversion deadlocks a concurrent writer against a parent-observation traversal, so settle() / expect() hang (never reach a fixpoint).

The registration ran inside lock(observedParents):

parents = lock(observedParents)   // observedParents = { willAccessParents(); weakParents.compactMap(\.parent) }

willAccessParents() → TestAccess.willAccess → registerReadOnlyPathWake takes the TestAccess lock. The fix hoists it out of the lock:

willAccessParents()
parents = lock { weakParents.compactMap(\.parent) }

How it was found

Surfaced as a 100% hang in a downstream suite (ParallelEditorTests) after bumping to 1.0.5, whose concurrent swift-model.test-drain.shared queue runs two models' activation in parallel and exposed the latent inversion. Root-caused with lldb (process attach + thread backtrace all) on the wedged process:

Thread	Holds	Wants	Path
writer	TestAccess.lock (`acquireWriteLock`)	context.lock	`Context._modify` of a property
traversal	context.lock (`lock(observedParents)`)	TestAccess.lock	`environmentValue` → `reduceHierarchy(observeParents:)` → `observedParents` → `willAccessParents()` → `TestAccess.willAccess` → `registerReadOnlyPathWake`

registerReadOnlyPathWake's own comment documents the opposite ("context → TestAccess") order — the two subsystems disagreed on lock order.

Audit for sibling inversions

Checked every willAccess* synthetic-path site that takes the TestAccess lock (via registerReadOnlyPathWake):

parents (reduce) — the inversion, fixed here.
storage (AnyContext[storage] getter + environmentValue traversal) — already registers observation before/outside the lock.
preference (AnyContext[preference] getter + preferenceValue traversal) — already outside the lock, and already carries explicit deadlock-avoidance comments for this exact "TestAccess lock held on another thread" hazard (isApplyingSnapshot guard, post-traversal willAccessPreferenceValue).

So the parents-in-reduce path was the unique remaining inversion.

Verification

Downstream ParallelEditorTests: 504/504 pass, no hang (was a 100% wedge under the 1.0.5 bump).
Full swift-model suite: 819 tests, no regression — the only failure is a pre-existing load-induced ClockTests.testImmediateClock flake that passes in 0.045s in isolation on main as well.
Added HierarchyLockOrderDeadlockTests — a concurrency smoke test that drives the exact code path (concurrent deep environment-read traversals + Context._modify writes). Its file header documents honestly why a deterministic minimal repro isn't feasible (the window is too narrow synthetically, and TestAccess's locks can't be hooked to force the interleaving without adding production test seams); the authoritative reproduction is the downstream editor suite.

Locking-discipline change only — no behavior change, and a no-op outside .modelTesting.

🤖 Generated with Claude Code

AnyContext.reduce acquired the TestAccess lock (via willAccessParents → TestAccess.willAccess → registerReadOnlyPathWake) while already holding the context (hierarchy) lock, because the registration ran inside lock(observedParents). That inverts the TestAccess.lock → context.lock order Context._modify / Context.transaction enforce (acquireWriteLock() before lock.lock()). A concurrent writer holding the TestAccess lock and waiting on the context lock, racing a parent-observation traversal holding the context lock and waiting on the TestAccess lock, deadlocked — observed as settle()/expect() hanging when one model activates and reads an environment value while another writes a property on the concurrent drive. Hoist willAccessParents() out of the context lock so the registration runs on the same TestAccess → context order as every writer; the parents list is still read under the lock. The sibling storage/preference traversals already register observation outside the lock. Verified: downstream ParallelEditorTests 504/504 (was a 100% hang under the swift-model 1.0.5 bump); full swift-model suite clean (the only failure is a pre-existing load-induced ClockTests flake that passes in isolation on main). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Fix AB-BA memoize/TestAccess lock-order deadlock (sibling of #29) `memoize`'s first-access setup took the context (hierarchy) lock B (`context.lock { … }`) and only then, via the nested `Context.transaction`, the `TestAccess` write lock A (`acquireWriteLock()`) — i.e. B→A. Every writer (`Context._modify` / `Context.transaction`) takes A→B (`acquireWriteLock()` before `lock.lock()`). Two threads running the opposite orders concurrently deadlock: the memoize thread holds B and waits for A; the writer holds A and waits for B. This is the exact lock-pair / AB-BA family as the 1.0.6 `reduceHierarchy` fix (#29), on the path that fix did not cover. It is reachable only under `.modelTesting` — the base `ModelAccess.acquireWriteLock()` is a no-op in production, so A does not exist there; it never wedges a shipping app, only the test plan. Captured live with `sample` on a hung xctest in a downstream consumer's macOS package test plan: a model activation whose `onActivate` read a memoized child value (holding B, awaiting A) while drain-executor model tasks ran `Context._modify` writes (holding A, awaiting B). The victim test varied run-to-run (whichever model was mid-activation when the race landed), which is why load-aimed mitigations (serial execution, build governors, timeout scaling) never cured it — a deadlock is immune to all of them. Fix: acquire the `TestAccess` write lock before the context lock in `memoize`'s first-access block, matching the canonical A→B order. Both locks are recursive, so the nested `Context.transaction` re-enters harmlessly. Locking-discipline change only — no behavior change, no-op outside `.modelTesting`. Adds `MemoizeLockOrderDeadlockTests` (a concurrency smoke test mirroring `HierarchyLockOrderDeadlockTests`); the deterministic guard is the downstream plan, which hung ~every run before and passes after. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Stabilize testForEachCancelPreviousInheritsContext (Linux-parallel flake) Pre-existing load-sensitive flake (history of "Test stability fixes"), surfaced reliably on the Linux (parallel) CI job once this branch added a suite — it failed `(processedCount → 1) == 0` ~every run. The test sent a value, waited for the per-element body to START, cancelled, then after a fixed 100 ms wall-clock wait asserted the 500 ms work had been interrupted (`processedCount == 0`). Under heavy parallel CI load, >500 ms could elapse between work-start and the assert, so the work's `Task.sleep(500ms)` completed and `+= value` ran before the check — a wall-clock race, not a real regression. Made deterministic following the existing `waitUntil` convention: the per-element work now sleeps long enough (30 s) that it can ONLY finish via cancellation, and the body records (in a `@Locked` flag, from the sleep's catch) that cancellation interrupted it before the write. The test `waitUntil`s that flag (resolves in ms once cancellation propagates; generous 10 s bound for a saturated box), then asserts the write never ran. If cancellation ever fails to interrupt the body — the actual regression this guards — the wait times out and the test fails. Verified: passes 5/5 locally (~0.017 s each, vs the old ~600 ms timing dance). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mansbernhardt merged commit 45613bd into main Jun 24, 2026
6 checks passed

mansbernhardt deleted the claude/hierarchy-lock-deadlock branch June 24, 2026 08:46

mansbernhardt mentioned this pull request Jun 27, 2026

Fix AB-BA memoize/TestAccess lock-order deadlock (sibling of #29) #30

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix AB-BA context/TestAccess lock deadlock in reduceHierarchy#29

Fix AB-BA context/TestAccess lock deadlock in reduceHierarchy#29
mansbernhardt merged 1 commit into
mainfrom
claude/hierarchy-lock-deadlock

mansbernhardt commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mansbernhardt commented Jun 24, 2026

Summary

How it was found

Audit for sibling inversions

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant