Fix AB-BA memoize/TestAccess lock-order deadlock (sibling of #29)#30
Merged
Conversation
`memoize`'s first-access setup took the context (hierarchy) lock B
(`context.lock { … }`) and only then, via the nested `Context.transaction`,
the `TestAccess` write lock A (`acquireWriteLock()`) — i.e. B→A. Every writer
(`Context._modify` / `Context.transaction`) takes A→B (`acquireWriteLock()`
before `lock.lock()`). Two threads running the opposite orders concurrently
deadlock: the memoize thread holds B and waits for A; the writer holds A and
waits for B.
This is the exact lock-pair / AB-BA family as the 1.0.6 `reduceHierarchy` fix
(#29), on the path that fix did not cover. It is reachable only under
`.modelTesting` — the base `ModelAccess.acquireWriteLock()` is a no-op in
production, so A does not exist there; it never wedges a shipping app, only the
test plan.
Captured live with `sample` on a hung xctest in a downstream consumer's macOS
package test plan: a model activation whose `onActivate` read a memoized child
value (holding B, awaiting A) while drain-executor model tasks ran
`Context._modify` writes (holding A, awaiting B). The victim test varied
run-to-run (whichever model was mid-activation when the race landed), which is
why load-aimed mitigations (serial execution, build governors, timeout scaling)
never cured it — a deadlock is immune to all of them.
Fix: acquire the `TestAccess` write lock before the context lock in `memoize`'s
first-access block, matching the canonical A→B order. Both locks are recursive,
so the nested `Context.transaction` re-enters harmlessly. Locking-discipline
change only — no behavior change, no-op outside `.modelTesting`.
Adds `MemoizeLockOrderDeadlockTests` (a concurrency smoke test mirroring
`HierarchyLockOrderDeadlockTests`); the deterministic guard is the downstream
plan, which hung ~every run before and passes after.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ake) Pre-existing load-sensitive flake (history of "Test stability fixes"), surfaced reliably on the Linux (parallel) CI job once this branch added a suite — it failed `(processedCount → 1) == 0` ~every run. The test sent a value, waited for the per-element body to START, cancelled, then after a fixed 100 ms wall-clock wait asserted the 500 ms work had been interrupted (`processedCount == 0`). Under heavy parallel CI load, >500 ms could elapse between work-start and the assert, so the work's `Task.sleep(500ms)` completed and `+= value` ran before the check — a wall-clock race, not a real regression. Made deterministic following the existing `waitUntil` convention: the per-element work now sleeps long enough (30 s) that it can ONLY finish via cancellation, and the body records (in a `@Locked` flag, from the sleep's catch) that cancellation interrupted it before the write. The test `waitUntil`s that flag (resolves in ms once cancellation propagates; generous 10 s bound for a saturated box), then asserts the write never ran. If cancellation ever fails to interrupt the body — the actual regression this guards — the wait times out and the test fails. Verified: passes 5/5 locally (~0.017 s each, vs the old ~600 ms timing dance). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes an AB-BA deadlock between the context (hierarchy) lock and the
TestAccesswrite lock on thememoizefirst-access path — the direct sibling of the 1.0.6reduceHierarchyfix (#29), on the path #29 did not cover.memoize's first-access setup takes the context lock B (context.lock { … }) and only then, through the nestedContext.transaction, theTestAccesswrite lock A (acquireWriteLock()) — i.e. B→A. Every writer (Context._modify/Context.transaction) takes A→B (acquireWriteLock()beforelock.lock()). Run concurrently, the two opposite orders deadlock:Reachable only under
.modelTesting: the baseModelAccess.acquireWriteLock()is a no-op in production, so lock A doesn't exist there. It never wedges a shipping app — only the test plan.How it was found
Captured live with
sampleon a hungxctestin a downstream consumer's macOS package test plan (122 suites, serial). The hang victim varied run-to-run — whichever model was mid-activation when the race landed — which is exactly why load-aimed mitigations (serial execution, build governors, timeout scaling) never cured it: a deadlock is immune to all of them. Three independent stack samples all showed:onActivate()→ memoized child read →memoizefirst-access →context.lock(B held) →Context.transaction→TestAccess.acquireWriteLock()→ blocked on AContext.transaction(A held) →lock→ blocked on BAll threads
__psynch_mutexwait, 100% of samples — a hard deadlock, not starvation.Fix
Acquire the
TestAccesswrite lock before the context lock inmemoize's first-access block, matching the canonical A→B order thatContext.transactionestablishes. Both locks are recursive, so the nestedContext.transaction's ownacquireWriteLock/lockre-enter harmlessly. Locking-discipline change only — no behavior change, no-op outside.modelTesting.Tests
MemoizeLockOrderDeadlockTests— a concurrency smoke test (mirroringHierarchyLockOrderDeadlockTests): many leaves activate, each running a first-accessmemoizeconcurrently with siblingContext._modifywrites; asserts each iteration settles. The narrow timing window means the deterministic guard is the downstream plan below.MemoizeThrashTests,MemoizeDirtyObservationTests,HierarchyLockOrderDeadlockTests(Fix AB-BA context/TestAccess lock deadlock in reduceHierarchy #29) all pass.Also in this PR — stabilize a pre-existing Linux-parallel flake
Adding the new suite increased Linux (parallel) load and reliably surfaced a pre-existing flake in the unrelated
testForEachCancelPreviousInheritsContext(a test with a history of "Test stability fixes"). It asserted a cancelledforEachbody's+= valuenever runs, but did so by racing a fixed 100 ms wall-clock wait against a 500 ms work sleep — under load, >500 ms elapsed and the work completed first (processedCount → 1). Made deterministic following the repo'swaitUntilconvention: the work can now only finish via cancellation, the body records (in a@Lockedflag) that it was interrupted before the write, and the testwaitUntils that flag. Passes 5/5 locally (~0.017 s each).Stability verified: CI green across 3 consecutive runs on this branch (
run_attempt=3, all jobs incl. Linux parallel/serial, macOS parallel/serial, WASM, Android).