Fix AB-BA context/TestAccess lock deadlock in reduceHierarchy#29
Merged
Conversation
AnyContext.reduce acquired the TestAccess lock (via willAccessParents → TestAccess.willAccess → registerReadOnlyPathWake) while already holding the context (hierarchy) lock, because the registration ran inside lock(observedParents). That inverts the TestAccess.lock → context.lock order Context._modify / Context.transaction enforce (acquireWriteLock() before lock.lock()). A concurrent writer holding the TestAccess lock and waiting on the context lock, racing a parent-observation traversal holding the context lock and waiting on the TestAccess lock, deadlocked — observed as settle()/expect() hanging when one model activates and reads an environment value while another writes a property on the concurrent drive. Hoist willAccessParents() out of the context lock so the registration runs on the same TestAccess → context order as every writer; the parents list is still read under the lock. The sibling storage/preference traversals already register observation outside the lock. Verified: downstream ParallelEditorTests 504/504 (was a 100% hang under the swift-model 1.0.5 bump); full swift-model suite clean (the only failure is a pre-existing load-induced ClockTests flake that passes in isolation on main). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mansbernhardt
added a commit
that referenced
this pull request
Jun 27, 2026
* Fix AB-BA memoize/TestAccess lock-order deadlock (sibling of #29) `memoize`'s first-access setup took the context (hierarchy) lock B (`context.lock { … }`) and only then, via the nested `Context.transaction`, the `TestAccess` write lock A (`acquireWriteLock()`) — i.e. B→A. Every writer (`Context._modify` / `Context.transaction`) takes A→B (`acquireWriteLock()` before `lock.lock()`). Two threads running the opposite orders concurrently deadlock: the memoize thread holds B and waits for A; the writer holds A and waits for B. This is the exact lock-pair / AB-BA family as the 1.0.6 `reduceHierarchy` fix (#29), on the path that fix did not cover. It is reachable only under `.modelTesting` — the base `ModelAccess.acquireWriteLock()` is a no-op in production, so A does not exist there; it never wedges a shipping app, only the test plan. Captured live with `sample` on a hung xctest in a downstream consumer's macOS package test plan: a model activation whose `onActivate` read a memoized child value (holding B, awaiting A) while drain-executor model tasks ran `Context._modify` writes (holding A, awaiting B). The victim test varied run-to-run (whichever model was mid-activation when the race landed), which is why load-aimed mitigations (serial execution, build governors, timeout scaling) never cured it — a deadlock is immune to all of them. Fix: acquire the `TestAccess` write lock before the context lock in `memoize`'s first-access block, matching the canonical A→B order. Both locks are recursive, so the nested `Context.transaction` re-enters harmlessly. Locking-discipline change only — no behavior change, no-op outside `.modelTesting`. Adds `MemoizeLockOrderDeadlockTests` (a concurrency smoke test mirroring `HierarchyLockOrderDeadlockTests`); the deterministic guard is the downstream plan, which hung ~every run before and passes after. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Stabilize testForEachCancelPreviousInheritsContext (Linux-parallel flake) Pre-existing load-sensitive flake (history of "Test stability fixes"), surfaced reliably on the Linux (parallel) CI job once this branch added a suite — it failed `(processedCount → 1) == 0` ~every run. The test sent a value, waited for the per-element body to START, cancelled, then after a fixed 100 ms wall-clock wait asserted the 500 ms work had been interrupted (`processedCount == 0`). Under heavy parallel CI load, >500 ms could elapse between work-start and the assert, so the work's `Task.sleep(500ms)` completed and `+= value` ran before the check — a wall-clock race, not a real regression. Made deterministic following the existing `waitUntil` convention: the per-element work now sleeps long enough (30 s) that it can ONLY finish via cancellation, and the body records (in a `@Locked` flag, from the sleep's catch) that cancellation interrupted it before the write. The test `waitUntil`s that flag (resolves in ms once cancellation propagates; generous 10 s bound for a saturated box), then asserts the write never ran. If cancellation ever fails to interrupt the body — the actual regression this guards — the wait times out and the test fails. Verified: passes 5/5 locally (~0.017 s each, vs the old ~600 ms timing dance). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
AnyContext.reduce— the core ofreduceHierarchy/mapHierarchy, and the path every@EnvironmentStorageread walks to resolve a value up the ancestor chain — acquired the TestAccess lock while already holding the context (hierarchy) lock, inverting theTestAccess → contextlock order thatContext._modify/Context.transactionenforce. That AB-BA inversion deadlocks a concurrent writer against a parent-observation traversal, sosettle()/expect()hang (never reach a fixpoint).The registration ran inside
lock(observedParents):willAccessParents()→TestAccess.willAccess→registerReadOnlyPathWaketakes the TestAccess lock. The fix hoists it out of the lock:How it was found
Surfaced as a 100% hang in a downstream suite (
ParallelEditorTests) after bumping to 1.0.5, whose concurrentswift-model.test-drain.sharedqueue runs two models' activation in parallel and exposed the latent inversion. Root-caused withlldb(process attach+thread backtrace all) on the wedged process:acquireWriteLock)Context._modifyof a propertylock(observedParents))environmentValue→reduceHierarchy(observeParents:)→observedParents→willAccessParents()→TestAccess.willAccess→registerReadOnlyPathWakeregisterReadOnlyPathWake's own comment documents the opposite ("context → TestAccess") order — the two subsystems disagreed on lock order.Audit for sibling inversions
Checked every
willAccess*synthetic-path site that takes the TestAccess lock (viaregisterReadOnlyPathWake):reduce) — the inversion, fixed here.AnyContext[storage]getter +environmentValuetraversal) — already registers observation before/outside the lock.AnyContext[preference]getter +preferenceValuetraversal) — already outside the lock, and already carries explicit deadlock-avoidance comments for this exact "TestAccess lock held on another thread" hazard (isApplyingSnapshotguard, post-traversalwillAccessPreferenceValue).So the parents-in-
reducepath was the unique remaining inversion.Verification
ParallelEditorTests: 504/504 pass, no hang (was a 100% wedge under the 1.0.5 bump).ClockTests.testImmediateClockflake that passes in 0.045s in isolation onmainas well.HierarchyLockOrderDeadlockTests— a concurrency smoke test that drives the exact code path (concurrent deep environment-read traversals +Context._modifywrites). Its file header documents honestly why a deterministic minimal repro isn't feasible (the window is too narrow synthetically, andTestAccess's locks can't be hooked to force the interleaving without adding production test seams); the authoritative reproduction is the downstream editor suite.Locking-discipline change only — no behavior change, and a no-op outside
.modelTesting.🤖 Generated with Claude Code