Fix AB-BA memoize/TestAccess lock-order deadlock (sibling of #29) by mansbernhardt · Pull Request #30 · bitofmind/swift-model

mansbernhardt · 2026-06-27T11:38:48Z

Summary

Fixes an AB-BA deadlock between the context (hierarchy) lock and the TestAccess write lock on the memoize first-access path — the direct sibling of the 1.0.6 reduceHierarchy fix (#29), on the path #29 did not cover.

memoize's first-access setup takes the context lock B (context.lock { … }) and only then, through the nested Context.transaction, the TestAccess write lock A (acquireWriteLock()) — i.e. B→A. Every writer (Context._modify / Context.transaction) takes A→B (acquireWriteLock() before lock.lock()). Run concurrently, the two opposite orders deadlock:

memoize thread — holds B (context lock), awaits A (TestAccess lock)
writer thread — holds A (TestAccess lock), awaits B (context lock)

Reachable only under .modelTesting: the base ModelAccess.acquireWriteLock() is a no-op in production, so lock A doesn't exist there. It never wedges a shipping app — only the test plan.
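
To illustrate why lock A exists only under testing, here is a minimal sketch. The type names `ModelAccessSketch` and `TestAccessSketch`, the `releaseWriteLock()` method, and the `acquisitions` counter are hypothetical stand-ins for illustration, not the library's actual declarations; only the idea that the base `acquireWriteLock()` is a no-op comes from the description above.

```swift
import Foundation

// Hypothetical base type: in production the write-lock hook does nothing,
// so there is no lock A to participate in an ordering cycle.
class ModelAccessSketch {
    func acquireWriteLock() {}
    func releaseWriteLock() {}
}

// Hypothetical test-mode subclass: the hook takes a real recursive lock.
final class TestAccessSketch: ModelAccessSketch {
    private let writeLock = NSRecursiveLock()
    private(set) var acquisitions = 0

    override func acquireWriteLock() {
        writeLock.lock()
        acquisitions += 1 // counted while the lock is held
    }

    override func releaseWriteLock() {
        writeLock.unlock()
    }
}

let production = ModelAccessSketch()
production.acquireWriteLock() // no-op: nothing is held
production.releaseWriteLock()

let testing = TestAccessSketch()
testing.acquireWriteLock()    // takes the real lock
testing.releaseWriteLock()
print(testing.acquisitions)
```

With only the context lock in play, no second lock can be taken in the opposite order, which is consistent with the deadlock being confined to the test plan.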

How it was found

Captured live with sample on a hung xctest in a downstream consumer's macOS package test plan (122 suites, serial). The hang victim varied run-to-run — whichever model was mid-activation when the race landed — which is exactly why load-aimed mitigations (serial execution, build governors, timeout scaling) never cured it: a deadlock is immune to all of them. Three independent stack samples all showed:

main thread: onActivate() → memoized child read → memoize first-access → context.lock (B held) → Context.transaction → TestAccess.acquireWriteLock() → blocked on A
drain-executor model tasks: Context.transaction (A held) → lock → blocked on B

All threads __psynch_mutexwait, 100% of samples — a hard deadlock, not starvation.

Fix

Acquire the TestAccess write lock before the context lock in memoize's first-access block, matching the canonical A→B order that Context.transaction establishes. Both locks are recursive, so the nested Context.transaction's own acquireWriteLock/lock re-enter harmlessly. Locking-discipline change only — no behavior change, no-op outside .modelTesting.
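
The ordering rule can be sketched with two recursive locks. This is an illustrative model under stated assumptions, not the library's code: `ContextSketch`, `memoizeFirstAccess()`, `write()`, and `runDemo()` are hypothetical names, and `NSRecursiveLock` stands in for whatever recursive lock types the library actually uses.

```swift
import Foundation

// Hypothetical model of the two locks: A = TestAccess write lock, B = context lock.
final class ContextSketch: @unchecked Sendable {
    private let testAccessLock = NSRecursiveLock() // lock A
    private let contextLock = NSRecursiveLock()    // lock B
    private var writes = 0

    // Writer path: the canonical A→B order.
    @discardableResult
    func transaction<T>(_ body: () -> T) -> T {
        testAccessLock.lock(); defer { testAccessLock.unlock() }
        contextLock.lock(); defer { contextLock.unlock() }
        return body()
    }

    // First-access path after the fix: take A before B, then run the nested
    // transaction, which re-enters both recursive locks on the same thread.
    @discardableResult
    func memoizeFirstAccess() -> Int {
        testAccessLock.lock(); defer { testAccessLock.unlock() }
        contextLock.lock(); defer { contextLock.unlock() }
        return transaction { writes += 1; return writes }
    }

    func write() { transaction { writes += 1 } }
    var total: Int { transaction { writes } }
}

// Runs both paths concurrently; every thread uses A→B, so no cycle can form.
func runDemo() -> Int {
    let context = ContextSketch()
    let group = DispatchGroup()
    for worker in 0..<8 {
        DispatchQueue.global().async(group: group) {
            for _ in 0..<1_000 {
                if worker % 2 == 0 { context.memoizeFirstAccess() } else { context.write() }
            }
        }
    }
    group.wait()
    return context.total
}

print(runDemo()) // 8000: 8 workers, 1000 locked increments each
```

Swapping the first two lines of `memoizeFirstAccess()` would recreate the B→A order described in the Summary and can hang the demo, which is why the sketch only shows the corrected order.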

Tests

New: MemoizeLockOrderDeadlockTests, a concurrency smoke test (mirroring HierarchyLockOrderDeadlockTests): many leaves activate, each running a first-access memoize concurrently with sibling Context._modify writes; asserts each iteration settles. Because the timing window is narrow, this test cannot reliably reproduce the deadlock; the reliable guard is the downstream plan below.
No regression: MemoizeThrashTests, MemoizeDirtyObservationTests, and HierarchyLockOrderDeadlockTests (added in #29, "Fix AB-BA context/TestAccess lock deadlock in reduceHierarchy") all pass.
Authoritative reproduction: the downstream macOS package plan (503 tests / 122 suites) hung ~every run before this fix (ran 3.5h until killed) and passes in ~5 min after — zero stalls.

Also in this PR — stabilize a pre-existing Linux-parallel flake

Adding the new suite increased Linux (parallel) load and reliably surfaced a pre-existing flake in the unrelated testForEachCancelPreviousInheritsContext (a test with a history of "Test stability fixes"). It asserted a cancelled forEach body's += value never runs, but did so by racing a fixed 100 ms wall-clock wait against a 500 ms work sleep — under load, >500 ms elapsed and the work completed first (processedCount → 1). Made deterministic following the repo's waitUntil convention: the work can now only finish via cancellation, the body records (in a @Locked flag) that it was interrupted before the write, and the test waitUntils that flag. Passes 5/5 locally (~0.017 s each).
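
The deterministic shape of that test can be sketched as follows. This is a minimal sketch under stated assumptions: `Flag` is a hypothetical stand-in for the repo's `@Locked` flag, `waitUntil` here is a simple polling helper written for this example rather than the repo's own helper, and `runCancellationCheck()` is an invented name.

```swift
import Foundation

// Hypothetical lock-protected flag, standing in for a `@Locked` Bool.
final class Flag: @unchecked Sendable {
    private let lock = NSLock()
    private var isSet = false
    var value: Bool {
        lock.lock(); defer { lock.unlock() }
        return isSet
    }
    func set() {
        lock.lock(); defer { lock.unlock() }
        isSet = true
    }
}

// Hypothetical polling helper: returns true once `condition` holds,
// false if the timeout elapses first.
func waitUntil(timeout: TimeInterval = 10, _ condition: () -> Bool) async -> Bool {
    let deadline = Date().addingTimeInterval(timeout)
    while !condition() {
        if Date() >= deadline { return false }
        try? await Task.sleep(nanoseconds: 1_000_000) // poll every 1 ms
    }
    return true
}

func runCancellationCheck() async -> (interrupted: Bool, wrote: Bool) {
    let started = Flag(), interrupted = Flag(), wrote = Flag()
    let work = Task<Void, Never> {
        started.set()
        do {
            // Long enough (30 s) that the work can only end via cancellation.
            try await Task.sleep(nanoseconds: 30_000_000_000)
        } catch {
            interrupted.set() // cancellation interrupted the body before the write
            return
        }
        wrote.set() // the write that must never run
    }
    _ = await waitUntil { started.value }   // wait for the body to start
    work.cancel()
    let sawInterrupt = await waitUntil { interrupted.value }
    return (sawInterrupt, wrote.value)
}

let outcome = await runCancellationCheck()
print(outcome.interrupted, outcome.wrote)
```

The check waits on an event (the flag) instead of a fixed delay, so a slow machine only makes it take longer; it should not change the result.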

Stability verified: CI green across 3 consecutive runs on this branch (run_attempt=3, all jobs incl. Linux parallel/serial, macOS parallel/serial, WASM, Android).

`memoize`'s first-access setup took the context (hierarchy) lock B (`context.lock { … }`) and only then, via the nested `Context.transaction`, the `TestAccess` write lock A (`acquireWriteLock()`), i.e. B→A. Every writer (`Context._modify` / `Context.transaction`) takes A→B (`acquireWriteLock()` before `lock.lock()`). Two threads running the opposite orders concurrently deadlock: the memoize thread holds B and waits for A; the writer holds A and waits for B.

This is the same lock pair and AB-BA family as the 1.0.6 `reduceHierarchy` fix (#29), on the path that fix did not cover. It is reachable only under `.modelTesting`: the base `ModelAccess.acquireWriteLock()` is a no-op in production, so A does not exist there; it never wedges a shipping app, only the test plan.

Captured live with `sample` on a hung xctest in a downstream consumer's macOS package test plan: a model activation whose `onActivate` read a memoized child value (holding B, awaiting A) while drain-executor model tasks ran `Context._modify` writes (holding A, awaiting B). The victim test varied run-to-run (whichever model was mid-activation when the race landed), which is why load-aimed mitigations (serial execution, build governors, timeout scaling) never cured it: a deadlock is immune to all of them.

Fix: acquire the `TestAccess` write lock before the context lock in `memoize`'s first-access block, matching the canonical A→B order. Both locks are recursive, so the nested `Context.transaction` re-enters harmlessly. Locking-discipline change only: no behavior change, no-op outside `.modelTesting`.

Adds `MemoizeLockOrderDeadlockTests` (a concurrency smoke test mirroring `HierarchyLockOrderDeadlockTests`); the deterministic guard is the downstream plan, which hung ~every run before and passes after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ake)

Pre-existing load-sensitive flake (history of "Test stability fixes"), surfaced reliably on the Linux (parallel) CI job once this branch added a suite: it failed `(processedCount → 1) == 0` ~every run. The test sent a value, waited for the per-element body to START, cancelled, then after a fixed 100 ms wall-clock wait asserted the 500 ms work had been interrupted (`processedCount == 0`). Under heavy parallel CI load, >500 ms could elapse between work-start and the assert, so the work's `Task.sleep(500ms)` completed and `+= value` ran before the check. This was a wall-clock race, not a real regression.

Made deterministic following the existing `waitUntil` convention: the per-element work now sleeps long enough (30 s) that it can ONLY finish via cancellation, and the body records (in a `@Locked` flag, from the sleep's catch) that cancellation interrupted it before the write. The test `waitUntil`s that flag (resolves in ms once cancellation propagates; generous 10 s bound for a saturated box), then asserts the write never ran. If cancellation ever fails to interrupt the body, which is the actual regression this guards, the wait times out and the test fails.

Verified: passes 5/5 locally (~0.017 s each, vs the old ~600 ms timing dance).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mansbernhardt and others added 2 commits June 27, 2026 13:37

mansbernhardt merged commit ccc9deb into main Jun 27, 2026
18 checks passed

mansbernhardt merged 2 commits into main from fix/memoize-testaccess-lock-order
