Fix AB-BA memoize/TestAccess lock-order deadlock (sibling of #29) #30

Merged: mansbernhardt merged 2 commits into main from fix/memoize-testaccess-lock-order on Jun 27, 2026

mansbernhardt (Collaborator) commented Jun 27, 2026

Summary

Fixes an AB-BA deadlock between the context (hierarchy) lock and the TestAccess write lock on the memoize first-access path — the direct sibling of the 1.0.6 reduceHierarchy fix (#29), on the path #29 did not cover.

memoize's first-access setup takes the context lock B (context.lock { … }) and only then, through the nested Context.transaction, the TestAccess write lock A (acquireWriteLock()) — i.e. B→A. Every writer (Context._modify / Context.transaction) takes A→B (acquireWriteLock() before lock.lock()). Run concurrently, the two opposite orders deadlock:

  • memoize thread — holds B (context lock), awaits A (TestAccess lock)
  • writer thread — holds A (TestAccess lock), awaits B (context lock)
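The inversion can be reproduced without actually hanging by using non-blocking lock attempts. The following is a minimal sketch, not the library's code: it assumes two plain `NSRecursiveLock` instances stand in for the TestAccess write lock (A) and the context lock (B), and `Attempt` and `demonstrateInversion` are hypothetical names introduced for illustration.

```swift
import Foundation

/// Records whether each thread managed to take its second lock.
final class Attempt: @unchecked Sendable {
    var memoizeGotA = false
    var writerGotB = false
}

func demonstrateInversion() -> Attempt {
    let a = NSRecursiveLock()   // assumption: stand-in for the TestAccess write lock (A)
    let b = NSRecursiveLock()   // assumption: stand-in for the context lock (B)
    let result = Attempt()

    let holdingFirst = DispatchGroup()  // rendezvous: both threads hold their first lock
    let attempted = DispatchGroup()     // rendezvous: both threads have tried their second lock
    let finished = DispatchGroup()
    for group in [holdingFirst, attempted] { group.enter(); group.enter() }

    // memoize path before the fix: B first, then A.
    DispatchQueue.global().async(group: finished) {
        b.lock()
        holdingFirst.leave(); holdingFirst.wait()
        result.memoizeGotA = a.try()    // non-blocking, so the demo cannot hang
        if result.memoizeGotA { a.unlock() }
        attempted.leave(); attempted.wait()
        b.unlock()
    }

    // writer path: A first, then B.
    DispatchQueue.global().async(group: finished) {
        a.lock()
        holdingFirst.leave(); holdingFirst.wait()
        result.writerGotB = b.try()
        if result.writerGotB { b.unlock() }
        attempted.leave(); attempted.wait()
        a.unlock()
    }

    finished.wait()
    return result
}
```

With blocking `lock()` calls in place of `try()`, both threads would wait on each other forever, which is the state the stack samples captured.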

Reachable only under .modelTesting: the base ModelAccess.acquireWriteLock() is a no-op in production, so lock A doesn't exist there. It never wedges a shipping app — only the test plan.

How it was found

Captured live with sample on a hung xctest in a downstream consumer's macOS package test plan (122 suites, serial). The hang victim varied run-to-run — whichever model was mid-activation when the race landed — which is exactly why load-aimed mitigations (serial execution, build governors, timeout scaling) never cured it: a deadlock is immune to all of them. Three independent stack samples all showed:

  • main thread: onActivate() → memoized child read → memoize first-access → context.lock (B held) → Context.transaction → TestAccess.acquireWriteLock() → blocked on A
  • drain-executor model tasks: Context.transaction (A held) → lock → blocked on B

All threads __psynch_mutexwait, 100% of samples — a hard deadlock, not starvation.

Fix

Acquire the TestAccess write lock before the context lock in memoize's first-access block, matching the canonical A→B order that Context.transaction establishes. Both locks are recursive, so the nested Context.transaction's own acquireWriteLock/lock re-enter harmlessly. Locking-discipline change only — no behavior change, no-op outside .modelTesting.
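The shape of the reordering can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: `AccessSketch`, `TestAccessSketch`, `ContextSketch`, `releaseWriteLock()`, and `memoizeFirstAccess` are hypothetical names, and only the ordering (A before B, with the nested transaction re-entering both recursive locks) is taken from the description above.

```swift
import Foundation

/// Hypothetical stand-in for the base access type: the write lock is a no-op,
/// so lock A effectively does not exist outside testing.
class AccessSketch {
    func acquireWriteLock() {}
    func releaseWriteLock() {}
}

/// Hypothetical stand-in for the testing access type: a real recursive lock (A).
final class TestAccessSketch: AccessSketch {
    private let writeLock = NSRecursiveLock()
    override func acquireWriteLock() { writeLock.lock() }
    override func releaseWriteLock() { writeLock.unlock() }
}

final class ContextSketch {
    let access: AccessSketch
    private let lock = NSRecursiveLock()   // lock B
    private(set) var order: [String] = []  // trace of which path ran, for the usage check

    init(access: AccessSketch) { self.access = access }

    /// Writer path: A, then B.
    func transaction<T>(_ body: () -> T) -> T {
        access.acquireWriteLock(); defer { access.releaseWriteLock() }
        lock.lock(); defer { lock.unlock() }
        order.append("transaction")
        return body()
    }

    /// First-access setup after the fix: A is taken before B, and the nested
    /// transaction re-enters both recursive locks.
    func memoizeFirstAccess<T>(_ compute: () -> T) -> T {
        access.acquireWriteLock(); defer { access.releaseWriteLock() }
        lock.lock(); defer { lock.unlock() }
        order.append("memoize")
        return transaction(compute)
    }
}
```

Because both paths now take A before B, no thread can hold B while waiting for A. With the base `AccessSketch`, the extra acquire is a no-op, mirroring the claim that nothing changes outside `.modelTesting`.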

Tests

  • New: MemoizeLockOrderDeadlockTests — a concurrency smoke test (mirroring HierarchyLockOrderDeadlockTests): many leaves activate, each running a first-access memoize concurrently with sibling Context._modify writes; asserts each iteration settles. The narrow timing window means the deterministic guard is the downstream plan below.
  • No regression: MemoizeThrashTests, MemoizeDirtyObservationTests, HierarchyLockOrderDeadlockTests (#29) all pass.
  • Authoritative reproduction: the downstream macOS package plan (503 tests / 122 suites) hung ~every run before this fix (ran 3.5h until killed) and passes in ~5 min after — zero stalls.
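A concurrency smoke test of this kind might look roughly like the sketch below. The structure of the real MemoizeLockOrderDeadlockTests is not shown in this PR text, so everything here is an assumption: `smokeTest` is a hypothetical helper, and the two `NSRecursiveLock` instances stand in for locks A and B with the fixed A→B order on both paths.

```swift
import Foundation

/// Runs `iterations` rounds of a memoize-style path racing a writer-style path,
/// both taking A before B. Returns true if every round settles before `timeout`.
func smokeTest(iterations: Int, timeout: TimeInterval) -> Bool {
    let a = NSRecursiveLock()   // assumption: stand-in for the TestAccess write lock
    let b = NSRecursiveLock()   // assumption: stand-in for the context lock
    let group = DispatchGroup()

    for _ in 0..<iterations {
        DispatchQueue.global().async(group: group) {
            a.lock(); b.lock()        // first-access setup, fixed order
            a.lock(); b.lock()        // nested transaction re-enters both locks
            b.unlock(); a.unlock()
            b.unlock(); a.unlock()
        }
        DispatchQueue.global().async(group: group) {
            a.lock(); b.lock()        // sibling write
            b.unlock(); a.unlock()
        }
    }
    // A hang shows up as a timeout rather than wedging the test run.
    return group.wait(timeout: .now() + timeout) == .success
}
```

A test like this passing does not prove the absence of a deadlock, since the timing window is narrow; it only fails loudly if one occurs.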

Also in this PR — stabilize a pre-existing Linux-parallel flake

Adding the new suite increased Linux (parallel) load and reliably surfaced a pre-existing flake in the unrelated testForEachCancelPreviousInheritsContext (a test with a history of "Test stability fixes"). It asserted a cancelled forEach body's += value never runs, but did so by racing a fixed 100 ms wall-clock wait against a 500 ms work sleep — under load, >500 ms elapsed and the work completed first (processedCount → 1). Made deterministic following the repo's waitUntil convention: the work can now only finish via cancellation, the body records (in a @Locked flag) that it was interrupted before the write, and the test waitUntils that flag. Passes 5/5 locally (~0.017 s each).
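The deterministic pattern can be sketched as follows. This is an illustration, not the test itself: `LockedBox` is a hypothetical stand-in for the repo's `@Locked` storage, the `waitUntil` here is a simplified polling helper whose signature is assumed, and `runCancellationCheck` is an invented name. The 30 s sleep and 10 s bound follow the values given in the commit message.

```swift
import Foundation

/// Hypothetical stand-in for `@Locked`: a value guarded by a lock.
final class LockedBox<Value>: @unchecked Sendable {
    private let lock = NSLock()
    private var stored: Value
    init(_ value: Value) { stored = value }
    var value: Value {
        get { lock.lock(); defer { lock.unlock() }; return stored }
        set { lock.lock(); defer { lock.unlock() }; stored = newValue }
    }
}

/// Simplified polling helper (assumed shape): true once `condition` holds, false on timeout.
func waitUntil(timeout: TimeInterval, _ condition: @Sendable () -> Bool) async -> Bool {
    let deadline = Date().addingTimeInterval(timeout)
    while Date() < deadline {
        if condition() { return true }
        try? await Task.sleep(nanoseconds: 1_000_000)
    }
    return condition()
}

func runCancellationCheck() async -> (interrupted: Bool, processed: Int) {
    let started = LockedBox(false)
    let interrupted = LockedBox(false)
    let processed = LockedBox(0)

    let work = Task {
        started.value = true
        do {
            // Long enough that the body can only finish via cancellation.
            try await Task.sleep(nanoseconds: 30_000_000_000)
            processed.value += 1       // the write that must never run
        } catch {
            interrupted.value = true   // cancellation cut the sleep short
        }
    }

    _ = await waitUntil(timeout: 10) { started.value }
    work.cancel()
    // Wait on the flag itself instead of a fixed wall-clock delay.
    let sawInterrupt = await waitUntil(timeout: 10) { interrupted.value }
    return (sawInterrupt, processed.value)
}
```

If cancellation failed to interrupt the body, the second wait would time out and `interrupted` would come back false, so the check fails for the regression it guards rather than for scheduler load.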

Stability verified: CI green across 3 consecutive runs on this branch (run_attempt=3, all jobs incl. Linux parallel/serial, macOS parallel/serial, WASM, Android).

mansbernhardt and others added 2 commits June 27, 2026 13:37
`memoize`'s first-access setup took the context (hierarchy) lock B
(`context.lock { … }`) and only then, via the nested `Context.transaction`,
the `TestAccess` write lock A (`acquireWriteLock()`) — i.e. B→A. Every writer
(`Context._modify` / `Context.transaction`) takes A→B (`acquireWriteLock()`
before `lock.lock()`). Two threads running the opposite orders concurrently
deadlock: the memoize thread holds B and waits for A; the writer holds A and
waits for B.

This is the exact lock-pair / AB-BA family as the 1.0.6 `reduceHierarchy` fix
(#29), on the path that fix did not cover. It is reachable only under
`.modelTesting` — the base `ModelAccess.acquireWriteLock()` is a no-op in
production, so A does not exist there; it never wedges a shipping app, only the
test plan.

Captured live with `sample` on a hung xctest in a downstream consumer's macOS
package test plan: a model activation whose `onActivate` read a memoized child
value (holding B, awaiting A) while drain-executor model tasks ran
`Context._modify` writes (holding A, awaiting B). The victim test varied
run-to-run (whichever model was mid-activation when the race landed), which is
why load-aimed mitigations (serial execution, build governors, timeout scaling)
never cured it — a deadlock is immune to all of them.

Fix: acquire the `TestAccess` write lock before the context lock in `memoize`'s
first-access block, matching the canonical A→B order. Both locks are recursive,
so the nested `Context.transaction` re-enters harmlessly. Locking-discipline
change only — no behavior change, no-op outside `.modelTesting`.

Adds `MemoizeLockOrderDeadlockTests` (a concurrency smoke test mirroring
`HierarchyLockOrderDeadlockTests`); the deterministic guard is the downstream
plan, which hung ~every run before and passes after.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ake)

Pre-existing load-sensitive flake (history of "Test stability fixes"),
surfaced reliably on the Linux (parallel) CI job once this branch added a
suite — it failed `(processedCount → 1) == 0` ~every run.

The test sent a value, waited for the per-element body to START, cancelled,
then after a fixed 100 ms wall-clock wait asserted the 500 ms work had been
interrupted (`processedCount == 0`). Under heavy parallel CI load, >500 ms could
elapse between work-start and the assert, so the work's `Task.sleep(500ms)`
completed and `+= value` ran before the check — a wall-clock race, not a real
regression.

Made deterministic following the existing `waitUntil` convention: the per-element
work now sleeps long enough (30 s) that it can ONLY finish via cancellation, and
the body records (in a `@Locked` flag, from the sleep's catch) that cancellation
interrupted it before the write. The test `waitUntil`s that flag (resolves in ms
once cancellation propagates; generous 10 s bound for a saturated box), then
asserts the write never ran. If cancellation ever fails to interrupt the body —
the actual regression this guards — the wait times out and the test fails.

Verified: passes 5/5 locally (~0.017 s each, vs the old ~600 ms timing dance).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mansbernhardt mansbernhardt merged commit ccc9deb into main Jun 27, 2026
18 checks passed