diff --git a/Docs/test-determinism-executor-drain.md b/Docs/test-determinism-executor-drain.md new file mode 100644 index 0000000..9e4730c --- /dev/null +++ b/Docs/test-determinism-executor-drain.md @@ -0,0 +1,433 @@ +# Test determinism: from wall-clock quiescence to executor-drain quiescence + +**Status:** design note + validated spike (branch `claude/spike-drain-executor`). +**Audience:** SwiftModel maintainers. +**TL;DR:** `.modelTesting`'s load-sensitivity is not a bug in any one wait +mechanism — it is intrinsic to *defining "the model has reacted" as "the wall +clock saw no activity for a debounce window."* Model work runs on the +uncontrolled cooperative pool; "done" is inferred from the wall clock. Those +are two independent clocks that desynchronise under load. The durable fix is to +let the **test harness own the execution** of model work and define quiescence +as a **logical executor fixpoint** (drive-to-completion), not a wall-clock +window. A spike validates the core primitive: complete and load-independent. + +--- + +## 1. Symptom + +`@Test(.modelTesting)` suites intermittently fail under machine load — most +visibly on shared/self-hosted CI when a second `xcodebuild` runs concurrently. +Failures take two shapes: + +- **`settle()` false timeout** — `settle() timed out: model still has active + tasks` with an *empty* active-task list (the model had actually gone idle; the + `.deferential` quiet-window confirmation was starved on the `.background` + queue). `-retry-tests-on-failure` re-runs into the same condition and the job + blows its wall-clock budget. +- **`expect` can't observe model work in time** — the documented "known + load-sensitive tests" (`CLAUDE.md`): `ClockTests.testImmediateClock`, + `ChildActivationTaskTests.childTasksCompleteBeforeTeardown`, the Standups + `clock.advance(by: 6s)` cases, etc. Their assertions need *N cooperative-pool + slots to land within a fixed wall-clock budget*; under saturation they don't. + +Both are the **same** failure. + +## 2. Root cause + +Model reactions execute on the **shared, uncontrolled Swift cooperative pool** +(`Cancellables` spawns every `node.task`/`forEach` body with bare +`Task`/`Task.detached`, no executor preference; the `CallQueue` / +`BackgroundCallQueue` pumps use `Task.detached(priority: .userInitiated)`). But +tests decide **"the reaction is done" using the wall clock** (`waitUntilSettled` +debounce; `expect`'s 5 s budget). + +Two independent clocks. On an unloaded machine they coincide; under load they +desynchronise, and **no amount of tuning the second clock fixes a dependence on +the first.** `CLAUDE.md` already states the true shape of it: *"a property whose +truth requires N cooperative-pool slots to land within a fixed wall-clock +budget."* + +This also explains why *"settle needs waiting by nature"* feels true: it needs +waiting **only because the work was handed to a pool we don't control**, so the +only way to learn it finished is to watch a clock and infer. + +## 3. Every prior attempt was measurement-side + +A partial history (commit refs on `main`): + +| Attempt | Commits | Category | +|---|---|---| +| Increase / "patience" timeouts | `9a10423`, `5ed9a85`, `3f6acfb` | wait longer | +| Global `SWIFT_MODEL_TIMEOUT_SCALE` knob | `e35f639`, `eda289e` | wait longer (scaled) | +| Adaptive **calibration** (measure pool latency, scale) | `20bcb31`, `169e278` — *since removed* | wait longer (adaptive) | +| Priority-queue tuning (`.deferential`/`.background`) | `f6000db`, GTS design | when to give up | +| `yieldToScheduler` / Dispatch hops | `fae431e`, `ff0183d` | how to wait | +| GCD-backed `GlobalTickScheduler` | `f6000db` | non-starvable *deadlines* | +| Serial **and** parallel CI matrix | `0c461ec`, `c004ce6` | avoid contention | +| `.responsive` budget-cap backstop | `bdc6f4a`, `7136d94` | fail faster | + +Every one changes **how long / at what priority / with what backstop we wait to +observe quiescence**. None changes **what runs the work being waited on.** So +each relocates the flake instead of removing it — and the end state is a +*documented list of known-flaky tests* rather than zero. Falling back to serial +CI "feels like a failure" because it is a workaround for a correctness model +that isn't load-independent. + +## 4. The foundational shift: drive-to-completion + +Stop *inferring* quiescence from the wall clock. Instead **own the execution** +of model work in tests and **run it to a fixpoint**. + +- In `.modelTesting`, give model-spawned tasks a **`TaskExecutor` preference** + (SE-0417) pointing at a **per-test harness executor**. +- Define quiescence as an **executor fixpoint**: drain the ready queue (plus the + MainActor call queue and the bg pipeline) until no ready job remains. A slow + machine takes more wall-time to drain the *same finite* set of jobs — the + *answer is identical*. **Load-independent by construction.** +- The only remaining cap is a **logical step cap** (a runaway re-enqueues + forever and never reaches a stable empty queue) — deterministic on every + machine, and the wall clock is demoted to a last-resort guard for genuinely + stuck *external* async (a real test bug). + +The reactive ergonomics survive: + +| Verb | Today | Proposed | +|---|---|---| +| `expect { }` | evaluate; wait up to 5 s wall-clock | drain until predicate true **or** fixpoint, then evaluate | +| `settle()` | wait for 50 ms wall-clock quiet window | drain to fixpoint | +| end-of-test exhaustivity | `cancelAll`; wait up to 25 s | `cancelAll`; drain to fixpoint; check | +| runaway | hits wall-clock budget | hits **logical** step cap (deterministic) | + +This collapses three wall-clock budgets (expect 5 s / settle 5 s / cleanup 25 s) ++ `SWIFT_MODEL_TIMEOUT_SCALE` + the `.deferential`/`.background` QoS dance into a +single `drainToQuiescence()`. + +### Why this *fixes* parallel rather than threatening it + +Today parallel is unsafe because all tests' model work shares one pool and +"done" is wall-clock-bounded, so serial is the only correctness guarantee. With +**per-test owned execution + fixpoint quiescence, determinism is per-test and +load-independent**: parallel tests contend for CPU (slower wall-time) but each +drains its *own* finite work deterministically. Parallel becomes correct **by +construction**; serial demotes from *the* safety gate to a speed / odd-race +check. The parallel-execution differentiator (vs TCA's enforced `@MainActor`) +is preserved — and finally trustworthy. + +## 5. Spike evidence (`claude/spike-drain-executor`) + +`Tests/SwiftModelTests/SpikeDrainExecutorTests.swift` validates the load-bearing +unknown in isolation. `DrainTaskExecutor` counts outstanding ready jobs; +`waitUntilQuiescent()` hops barriers on its own serial queue until the ready +queue stays empty twice (fixpoint) or a logical step cap trips (runaway). No +wall clock in the decision. + +| Test | Claim | Result | +|---|---|---| +| **A** drain completeness under load | after `waitUntilQuiescent()`, every task in a depth-3/breadth-3 tree (5 suspension points each) completed | 150 iters/run × 3 runs = **450 completions, 0 flakes**, half the cores saturated | +| **B** logical runaway cap | `while { await Task.yield() }` never reaches a stable empty queue → caught by the **step count**, not a clock | deterministic — the case the budget-cap patch *couldn't* distinguish from idle | +| **C** honest caveat | a task parked on an **off-executor** GCD timer empties the queue → premature fixpoint | confirmed → **controlled dependencies are a hard rule** | + +## 6. Honest risks / what real wiring must clear + +> **Update — end-to-end wiring attempted; NEGATIVE result (valuable).** Branch +> `claude/spike-drain-executor` also contains a first end-to-end wiring (INERT by +> default; opt in with `SWIFT_MODEL_EXPERIMENTAL_DRAIN=1`): model task spawns +> (`Cancellables`) take an `executorPreference` to a per-test executor via a +> `_TestExecutorBox` task-local set by the trait, and `expect`/`settle` drive it +> via `_startExecutorDrive`. With the flag on, it **deadlocks real model +> settle/teardown** — `SettlingTests` time out and +> `ChildActivationTaskTests.childTasksCompleteBeforeTeardown` hits the trait +> wall-clock cap. Cause: a **single serial-queue** executor does not compose with +> real model work — model **context locks**, the **off-executor bg pump** +> (`BackgroundCallQueue`/`Task.detached`, which ignores executor preference), and +> **`@MainActor` hops** can leave the one serial thread blocked. The isolated +> primitive (Tests A/B/C) remains sound; the lesson is the executor must be a +> **counted *concurrent* executor**, and the fixpoint must **union the bg and +> MainActor queues** — risks #1/#2 below are not optional. The wiring is left in, +> inert and flagged, as the substrate for that next iteration; the end-to-end +> test is `.disabled` with this finding. + +> **Update 2 — concurrent executor + union fixpoint: STILL not working (second +> negative result).** Iteration 2 made the executor **concurrent** (to kill the +> serial head-of-line deadlock) and made the `expect` drive's fixpoint the +> **union** of executor-idle + `BackgroundCallQueue.isIdle` + `MainCallQueue.isIdle` +> + `!hasPendingStartTask`. With the flag on it *still* fails: `SettlingTests` +> time out and `childTasksCompleteBeforeTeardown` now **hangs to the 30 s trait +> cap** — i.e. the child `onActivate` tasks **never run to completion on the +> custom executor at all**. So the blocker is not the executor's shape; it's that +> **model tasks don't make progress when moved off the default cooperative pool** +> — most likely an actor-isolation / clock-resumption interaction (the child +> awaits `node.continuousClock.sleep`, whose continuation must resume back onto +> the preferred executor), and possibly that `settle` itself still resolves via +> the wall-clock quiet-window rather than the drive. Diagnosing this needs real +> instrumentation (GTS/settle tracing, confirming `enqueue`/resumption actually +> fire for a task that hops through `ImmediateClock`/`@MainActor`), **not** more +> blind rebuild cycles. The wiring remains inert-by-default (flag off ⇒ suite +> green, verified); the experimental path is parked here pending that +> investigation. **Revised conclusion: the executor migration is a deeper rework +> than "add executorPreference + a drive" — it also requires making `settle` +> drive-based and solving task-progress-on-a-custom-executor — and is best done +> with a maintainer pairing + tracing, not autonomous trial-and-error.** + +> **Update 3 — RESOLVED: end-to-end works (iteration 3).** A layered +> micro-diagnosis (`DiagExecutorClockTests`) refuted the clock-resumption +> hypothesis — a bare `Task(executorPreference:)` over `ImmediateClock`/ +> `ContinuousClock` sleeps resumes on the executor fine, and a *real* `@Model` +> `node.task` polled directly **completes** on the executor. So the hang was not +> the tasks; it was the **drive itself**: it used `queue.async(flags: .barrier)` +> to detect idle, and on a concurrent queue a barrier is a read-WRITE barrier +> that **blocks all other jobs while pending** — so polling with it throttled the +> very model work we were waiting on. Fix: detect idle with the **outstanding +> counter + an event-driven, cancellation-aware `waitUntilIdle()`** (resume +> waiters when the count hits 0), never a barrier. With that, the flag-on path is +> green on the targeted suites **including under deliberate CPU load**: +> `ChildActivationTaskTests` (previously hung), `SettlingTests` (previously timed +> out), the runaway `settleInfiniteChangesTimeout` (still correctly reported), +> and the end-to-end load-stressed `realModelChildTasksAreLoadIndependentEndToEnd` +> (40 iterations under load). In the same machine-load moment, the flag-OFF +> (wall-clock) path *flaked* `SettlingTests` while flag-ON stayed green — +> the thesis demonstrated live. Note: `settle` benefits transitively — its +> predicate phase (`expect` with empty predicates) drives the executor to a +> fixpoint, so its wall-clock quiet-window has nothing left to wait on. Still +> opt-in; remaining to reach full quality: the full suite under load, flip the +> default on, then parallel-apple + the parallel-CI flip. + +> **Update 4 — full-PARALLEL gate fails; the real remaining scope is exposed.** +> Iteration 3 is green on *targeted* suites under load, but the **full suite +> run in parallel** (91 suites / 813 tests, the actual goal) fails broadly — +> first with per-test concurrent queues exploding GCD's thread pool, then, after +> backing all executors with **one shared concurrent queue** (iteration 4), +> still ~288 failures: `settle() timed out` *and* `Timeout after 3–5 s waiting +> for condition` in `waitUntil`-based tests. Root cause: **moving model work to a +> custom executor doesn't add capacity — it relocates the same contention** — +> and, decisively, **the drive is only *additive*: every wait still carries a +> wall-clock deadline** (`expect` 5 s, `settle` 5 s, `waitUntil` 3–5 s; the last +> isn't wired to the drive at all). Under full parallel, those wall-clock budgets +> are the binding constraint, so load still fails them regardless of the executor. +> +> **Revised, honest scope:** true load-independence under parallel requires the +> step the design note calls "make the drain the PRIMARY resolution" — i.e. +> **rewrite every wait primitive (`expect`, `settle`, `waitUntil`, exhaustion) to +> resolve on the executor/queue *fixpoint*, demoting the wall clock to a generous +> last-resort hang-catcher (e.g. 60 s)** — not an additive drive alongside the +> existing 5 s budgets. That is a substantial rework of the wait core +> (`TestAccess`/`TestExpect`/`TestWaitSupport`), and it's the genuine remaining +> work. The primitive and small-scale end-to-end are validated; the wait-core +> migration is not done. Parked here (inert by default); this is the point where +> a maintainer who owns the wait core should drive the rewrite. + +> **Update 5 — the firing semantics are proven, and they redirect the design +> AWAY from the executor.** Clarified intent: the per-wait timeouts are internal +> and arbitrary; under load they may take *as long as necessary*; in the happy +> case they never fire; a broken test/refactor should fire *fast when a human +> runs one test* and correctly under CI load. That reframes the "timeout" as a +> **fixpoint check, not a time budget**: resolve *pass* when the target is met, +> resolve *fail* the moment the model is **quiescent with the target still +> unmet**; the wall clock is only a last-resort hang-catcher. +> +> Stage 1 made `expect` fixpoint-primary (fail when quiescent-but-unmet; wall +> clock pushed to a 600 s backstop). **It works for the firing semantics** — an +> unsatisfiable `expect` failed in **247 ms**, not 600 s (Test E). But it +> *false-failed healthy* tests (`childTasks`), and the cause is decisive: +> +> **An instantaneous "executor idle (`outstanding == 0`)" is a PREMATURE +> fixpoint.** A task suspended at `await clock.sleep` (even `ImmediateClock`) is +> not a ready job, so when several child tasks are momentarily suspended +> mid-sleep the counter reads 0 and the drive declares "fixpoint" while the work +> is actually in flight — resolving "unmet" there is a false failure. Tasks +> suspend and resume; a single idle instant cannot mean "done." +> +> **This is exactly why SwiftModel already uses a DEBOUNCED quiet window** (a +> suspended task that resumes does activity within the window and re-arms it). So +> the correct fixpoint detector is the **existing quiet window** (debounced over +> `_noteActivity` = writes + task-body-starts + enqueues), **not a custom +> executor** — the executor's idle count is both unnecessary and *worse* +> (premature). +> +> **Recommended direction (supersedes the executor route):** make +> `expect`/`settle`/`waitUntil` resolve on the **quiet-window fixpoint with the +> total budget removed** — i.e. wait as long as needed for the model to go quiet +> (only a generous hang backstop, e.g. the trait cap), and **fail `expect` when +> the model is quiet but the predicate is still unmet**. This reuses the +> mechanism SwiftModel *already* relies on for `settle` (`awaitSettled`: +> quiet-window + bg-idle), drops the executor wiring entirely, and avoids the +> premature-fixpoint false failures. It is *less* invasive than the executor +> migration. The executor spike (Tests A–E, iterations 1–4) remains as the +> investigation record that led here; it should not ship. + +> **Update 6 — runtime balance: resolve EVENT-DRIVEN, don't pay a fixed +> debounce.** Constraint raised: many tests call `settle()` several times; +> stability matters most, but the fix must not make tests wait unnecessarily +> (N settles × a fixed window × many tests = real total-runtime cost). Note the +> status quo *already* pays this — `awaitSettled` always arms a 50 ms quiet +> window and waits it out even when the model is already quiet. +> +> Resolution: don't keep the fixed debounce as the primary latency. Resolve a +> wait **event-driven, the instant the model is genuinely quiescent**, via an +> **authoritative in-flight signal** — the *correct* version of what the executor +> attempted. The executor counted **ready jobs** (premature: a task suspended at +> `clock.sleep` reads as 0). Instead count **in-flight TRANSIENT tasks from +> registration until completion**, so a suspended-mid-await task still counts +> (fixes the premature-fixpoint bug), and a wait resolves the moment that count +> hits 0 ∧ bg idle ∧ main idle ∧ no pending-start. SwiftModel's `TaskCancellable` +> infra already separates the relevant kinds: `node.task` bodies and `forEach` +> **inner per-element** bodies are transient (counted = "work happening now"); +> `forEach`'s outer loop / `onChange` consumers parked awaiting the next element +> are long-lived (excluded — parked ≠ busy). Outcome: happy-path settles are +> near-instant (often *faster* than today's 50 ms floor → shorter total +> runtime); broken assertions still fail fast at quiescence; under load it waits +> as long as needed (no budget, hang-backstop only); the 50 ms debounce drops to +> a *fallback* for genuinely ambiguous cases. Stability first, without +> unnecessary waiting. + +> **Update 7 — CORRECTION: the debounce is fundamental; completion-counting +> cannot replace it.** Update 6's "resolve event-driven on an in-flight transient +> count, drop the debounce" is **unsound**. A supported pattern — a long-lived +> user `node.task { while !cancelled { let x = await stream.next(); self.y = x } }` +> — *never completes*, so a registration→completion count never returns to 0 and +> the wait would hang. `CLAUDE.md` states this is supported: long-lived consumers +> "correctly settle … as long as they aren't currently writing." Classification +> can't save it: `forEach`/`onChange` are framework-known-long-lived, but a user +> `node.task` may be transient OR a `while` loop and is indistinguishable at +> spawn. Only the **debounce quiet-window** correctly settles a parked long-lived +> consumer *and* waits out a suspended-mid-`sleep` transient. So the debounce +> stays; the runtime win is a fast-path (zero-registered-tasks → immediate) and +> possibly a shorter window — NOT eliminating the debounce. +> +> **Corrected sound plan (priority = stability + load-independence):** +> 1. Keep the debounce quiet-window as the detector. +> 2. Remove the total budget; replace the hang-catch with an **inactivity +> watchdog** — fail after X s of *zero* activity (writes, task starts/ends), +> reset on any progress. Slow-but-progressing never trips; a true deadlock +> (no activity) trips at X. Load-independent and deadlock-catching. +> 3. Make the quiet-window confirmation **non-starvable** (drop the +> `.deferential`→`.background` QoS hop for `.responsive`). +> 4. `expect` fails when **quiescent-but-unmet** (fast interactive failure). +> 5. Drop the executor (premature/wrong detector). +> 6. Runtime: fast-path immediate return when zero registered tasks; debounce +> window stays (tunable). +> Open tuning (maintainer judgment): the watchdog duration X, and whether to +> shorten the debounce window. + +> **Update 8 — EVALUATION: budget-scaling is refuted; `.background` starvation +> is the unbounded disease; the executor-drain is the only resolution.** Ran the +> full suite in PARALLEL with `SWIFT_MODEL_TIMEOUT_SCALE=100` (every budget ×100, +> executor OFF — pure existing code). Result: **93 unexpected failures, 23-minute +> runtime, and 72 are still `settle() timed out: model still has active tasks`** +> — at a 500 s budget. A model's real work never takes 500 s, so the budget is +> not the constraint: settle's quiet-check runs `.deferential` → +> `DispatchQueue.global(qos: .background)`, which macOS starves **indefinitely** +> under parallel load, so the confirmation never fires and the `.responsive` +> backstop trips the false timeout at *any* budget. **Scaling (the lever this +> project has relied on for years) cannot fix this**; it only makes the run +> pathologically slow. +> +> The tension this exposes: `.deferential`/`.background` is *race-safe* (lets an +> about-to-write task run before the quiet-check) but *starvable*; `.responsive` +> is *non-starvable* but reintroduces the toggleExpanded race. **Neither is both +> via QoS.** The only mechanism that is both is the **executor-drain**: it *runs* +> ready work to a fixpoint deterministically rather than *observing* quiescence +> through a starvable callback — an about-to-write task is a ready job, so the +> drain runs it, then the check sees the write. Iteration 3 confirmed the +> executor-drain makes settle work at small scale; its failures were Stage-1's +> premature *resolve-as-fail* (use the drain to DETECT, not to fail) and +> full-parallel scale (the shared-queue fix), both scoped. **Conclusion: abandon +> budget/QoS tuning; the path is the executor-drain as the settle *quiescence +> detector* (not a failure trigger), made primary and scaled. This is a core +> concurrency-design decision for the maintainer.** + +> **Update 9 — drive-PRIMARY implemented: settle FIXED; `expect`+clock is the +> residual hard core.** Implemented the corrected plan: `settle`/`expect`/ +> `waitUntil` resolve on the executor-drain fixpoint (non-starvable), with a +> short non-starvable debounce (executor-idle must persist since the last +> enqueue — bridges suspend→resume gaps), `mainCall` excluded from the per-test +> fixpoint (it's process-global → would hang under parallel), and a generous +> deadlock watchdog. Full-parallel result (flag on): **the 72 `settle() timed +> out` failures are GONE**, runtime **165 s** (vs 1379 s for ×100 budgets). +> Settle's `.background`-starvation disease is cured. +> +> Residual (~23 distinct): **`expect`+clock** (`testImmediateClock`, +> `testClockStepByStep`, `childTasks`, `testChangeOf*`), **deadlock-stress** +> (`checkExhaustion*DoesNotDeadlock*`), **events** (`featureEvents`, +> `testChildEvents`). The core one is `expect`'s premature-fixpoint race under +> parallel load: `settle` only needs "quiescent" (robust), but `expect` judges +> the predicate *at* the fixpoint — so a clock task whose resume+write is delayed +> past the debounce under load yields "fixpoint, predicate false" → a FALSE +> failure on a healthy test. A re-confirmation fixpoint narrowed but did not +> eliminate it (a delayed resume is unbounded under load). Race-free options: +> (a) `expect` fails only at the watchdog — stable, but a genuine wrong-assertion +> fails slowly (sacrifices fast-interactive-fail); (b) fold the test-clock's +> pending-sleeper set into the fixpoint so it's never declared while a task is +> clock-parked — clean, but clock-internals-deep. The deadlock-stress/event +> clusters need separate investigation (the drive may resolve before events +> propagate / interact with `checkExhaustion`'s locking). +> +> **Status: settle drive-primary is a validated win; `expect` drive-primary needs +> the (a)/(b) decision + the stress/event clusters. Inert by default.** + +> **Update 10 — activity-grace (general, no dependency): settle stays fixed; +> expect residual is scaling + a long tail.** Per the maintainer's steer (must +> NOT depend on the clock or any specific dependency), the fixpoint debounce now +> keys on **all activity** — every `_noteActivity` (write/event/probe/task-start, +> via a new `_lastActivityNs`) AND executor enqueues — with a per-verb grace +> (settle 30 ms, expect 250 ms). Any activity (a clock-parked task resuming and +> writing) resets the grace, so on a single test it's quiet at once (fast) and +> under stress it waits until genuinely done. Full-parallel (flag on): **0 settle +> timeouts**, unexpected failures **118 → ~51**. +> +> Residual, now precisely characterized (two different things): +> 1. **Trait-cap hangs** (`childTasks`, `testImmediateClock` fail at exactly +> ~30.7 s = the trait cap). NOT a premature fail — the drive *correctly* +> waits, but under full-parallel ONE shared GCD queue backs every test's model +> tasks, so hundreds contend and the test's tasks queue behind them; the 30 s +> trait cap fires before they drain. Trading iter4's thread-explosion for a +> shared-queue bottleneck. Fix: the trait cap must itself be load-tolerant (an +> inactivity watchdog, not an absolute 30 s), and/or a bounded-but-larger +> executor concurrency. +> 2. **Long tail of genuine interactions** (fast fails): `testClockStepByStep` +> (manual clock stepping), `checkExhaustion*DoesNotDeadlock` (preference- +> exhaustion), `testChildEvents`/`testTouchThenRealWrite` (event/transition +> propagation) — the drive resolves before/around these in ways their +> assertions don't expect. +> +> **Conclusion: `settle` drive-primary is a clean, validated, SHIPPABLE win (the +> core disease). `expect` drive-primary is deeper R&D** — it needs (a) a +> load-tolerant trait cap + executor concurrency tuning for the scaling hangs, +> (b) the fundamental fast-fail-vs-delayed-resume race accepted via watchdog-only +> failure (general, no-dependency, race-free, but slow-fail) or lived-with as +> much-reduced residual flakiness, and (c) the long-tail interactions resolved +> one by one. Recommend banking settle and treating expect as a separate effort. + +The spike proves the *primitive*. Integration risks, confirmed by the wiring +attempts above: + +1. **MainActor hop.** Work that hops to `@MainActor` (`MainCallQueue`) runs on + the main executor, not the drain executor. The fixpoint must drain that queue + too (we already own it). +2. **GCD / `Task.detached` internals.** `CallQueue` / `BackgroundCallQueue` + pumps and GTS sit off any controlled executor. In test mode they must route + through it (or be virtual-time) or the "ready queue empty" signal lies. +3. **`executorPreference` inheritance** through `forEach` / `Observed` chains + and child-model task trees (the spike set it manually per `Task`; the + framework's `Cancellables` spawn path must thread it through). +4. **External async = a hard rule.** A task awaiting an *uncontrolled* real + async op idles the executor while work is pending elsewhere → premature + fixpoint (Test C). Tests must use controlled dependencies — the discipline + TCA enforces by construction. +5. **Availability.** Custom task executors need the Swift 6 runtime + (`macOS 15+` / `iOS 18+`). CI is `macos-15`; the production library is + unaffected (test-only opt-in). + +## 7. Migration path (incremental, non-breaking) + +1. Introduce the per-test drain executor + `executorPreference` on model task + spawns **under `.modelTesting` only** (production unchanged). +2. Add `drainToQuiescence()` that unions executor-idle + `MainCallQueue` idle + + bg idle + `!hasPendingStartTask`. Make `settle`/`expect` resolve on it, with + the existing wall-clock path kept as a **fallback** so nothing else breaks + during migration. +3. Once trusted, demote the wall-clock debounce to the runaway/iteration cap and + remove the QoS / `.deferential` / scale machinery. +4. Virtualise internal scheduling (GTS, call-queue pumps) last. + +Each step is independently shippable and reversible. diff --git a/Sources/SwiftModel/Internal/Cancellables.swift b/Sources/SwiftModel/Internal/Cancellables.swift index 30e7a70..4dcbab4 100644 --- a/Sources/SwiftModel/Internal/Cancellables.swift +++ b/Sources/SwiftModel/Internal/Cancellables.swift @@ -158,6 +158,23 @@ extension TaskCancellable { } } + #if canImport(Dispatch) + // EXPERIMENTAL (executor-drain quiescence): in `.modelTesting`, run + // model task bodies on the per-test harness executor so the test can + // drive them to a fixpoint instead of waiting on the wall clock. The + // contextual `Task` return type resolves the throwing + // `executorPreference` overload, exactly as the plain spawns below. + // (Name is dropped here only because the `name:`+`executorPreference:` + // combined initializer isn't available across all supported toolchains.) + if #available(macOS 15.0, iOS 18.0, tvOS 18.0, watchOS 11.0, *), + let exec = _TestExecutorBox.current as? _DrainTestExecutor { + if isDetached { + return Task.detached(executorPreference: exec, priority: priority, operation: operation) + } else { + return Task(executorPreference: exec, priority: priority, operation: operation) + } + } + #endif if isDetached { return Task.detached(name: taskName, priority: priority, operation: operation) } else { diff --git a/Sources/SwiftModel/Internal/TestAccess.swift b/Sources/SwiftModel/Internal/TestAccess.swift index a8be424..98dc1d0 100644 --- a/Sources/SwiftModel/Internal/TestAccess.swift +++ b/Sources/SwiftModel/Internal/TestAccess.swift @@ -960,10 +960,18 @@ final class TestAccess: ModelAccess, @unchecked Sendable { /// Thread-safety: callable from any thread. Wakes continuations OUTSIDE /// the lock so resumed Tasks can call back into `awaitPredicate` / /// `awaitQuietWindow` without re-entering. + /// Monotonic-ns of the most recent model activity (write / event / probe / + /// task-body-start) seen by `_noteActivity`. The executor-drain grace window + /// debounces against this (and the executor's last enqueue), so any activity + /// — including a clock-suspended task's resume-and-write — resets the grace, + /// keeping a wait from declaring quiescence while work is still in flight. + var _lastActivityNs: UInt64 = 0 + func _noteActivity() { var wakes: [CheckedContinuation] = [] lock { let now = monotonicNanoseconds() + _lastActivityNs = now // Iterate in reverse so removals don't shift indices we haven't // visited yet. for i in (0..<_pendingExpects.count).reversed() { diff --git a/Sources/SwiftModel/Internal/TestExecutorDrive.swift b/Sources/SwiftModel/Internal/TestExecutorDrive.swift new file mode 100644 index 0000000..467fba8 --- /dev/null +++ b/Sources/SwiftModel/Internal/TestExecutorDrive.swift @@ -0,0 +1,261 @@ +import Foundation +#if canImport(Dispatch) +import Dispatch +#endif + +// MARK: - Executor-drain quiescence (primary resolution for expect/settle/waitUntil) +// +// The durable fix for `.modelTesting` load-flakiness (see +// docs/test-determinism-executor-drain.md). Under `.modelTesting`, model tasks +// run on a per-test harness executor; the wait verbs resolve on the EXECUTOR +// DRAIN FIXPOINT — a load-independent, non-starvable "the model is quiescent" +// signal — rather than a wall-clock budget or a `.background`-QoS quiet-check +// (which macOS starves under parallel load — the disease this replaces). +// +// Semantics (the contract): +// • pass the instant the target is met (reactive `_noteActivity`, as before); +// • for `expect`/`waitUntil`: fail the instant the model is QUIESCENT with the +// target still unmet (fast interactively; "as long as necessary" under load); +// • `settle`: succeed when quiescent; +// • the wall clock is only a generous deadlock watchdog (`_executorHangNs`). +// +// "Quiescent" = a STABLE fixpoint: executor has no ready jobs AND bg pump idle +// AND MainActor queue idle AND no task pending its first run — observed across +// two consecutive checks (a task suspended mid-`await` re-enqueues between +// checks, so a single idle instant is not trusted; this 2-consecutive gate is +// what iteration 3 proved prevents premature fixpoints). + +/// Per-test harness executor box (task-local), set by the `.modelTesting` scope. +/// `(any Sendable)?` so the declaration compiles where `TaskExecutor` is +/// unavailable (macOS < 15, WASM); readers downcast under `if #available`. +enum _TestExecutorBox { + @TaskLocal static var current: (any Sendable)? +} + +/// Generous deadlock watchdog: how long a wait will drive toward a fixpoint +/// before giving up and reporting (a true deadlock / runaway never reaches a +/// fixpoint). NOT a per-wait budget — a healthy wait resolves at its fixpoint +/// long before this. **Tunable knob** (maintainer judgment): large enough to +/// never fire on a healthy wait under heavy parallel load; small enough that an +/// interactive deadlock surfaces in reasonable time. +func _executorHangDeadlineNs() -> UInt64 { + _drainMonotonicNs() &+ 120_000_000_000 // 120 s +} + +private final class _GTSSleepState: @unchecked Sendable { + var cont: CheckedContinuation? + var cancel: (@Sendable () -> Void)? + var done = false +} + +@inline(__always) func _drainMonotonicNs() -> UInt64 { + #if canImport(Dispatch) + return DispatchTime.now().uptimeNanoseconds + #else + return UInt64(ProcessInfo.processInfo.systemUptime * 1_000_000_000) + #endif +} + +/// Build a fresh per-test executor box, or `nil` where custom task executors +/// aren't available. Opt-in via `SWIFT_MODEL_EXPERIMENTAL_DRAIN=1`; inert +/// otherwise (every wait keeps its current path). +/// +/// NOTE: cannot be on by default yet. The drive needs model tasks on this +/// executor, and routing them off the cooperative pool is slightly slower under +/// parallel load — enough to trip `expect`'s wall-clock budget in latency- +/// sensitive clock tests (`childTasksCompleteBeforeTeardown`, `testImmediateClock` +/// regress with it on). Enabling by default is unblocked only once `expect`/ +/// `waitUntil` are also drive-primary (so they don't depend on a wall-clock +/// budget). Until then this stays opt-in. See docs/test-determinism-executor-drain.md. +func _makeTestExecutorBox() -> (any Sendable)? { + #if canImport(Dispatch) + if #available(macOS 15.0, iOS 18.0, tvOS 18.0, watchOS 11.0, *), + ProcessInfo.processInfo.environment["SWIFT_MODEL_EXPERIMENTAL_DRAIN"] == "1" { + return _DrainTestExecutor() + } + #endif + return nil +} + +#if canImport(Dispatch) +/// ONE process-wide concurrent queue backs every per-test executor — a per-test +/// `.concurrent` queue would explode GCD's worker pool under full-parallel test +/// runs. Each executor keeps its own `outstanding` counter, so per-test +/// quiescence detection stays isolated. +@available(macOS 15.0, iOS 18.0, tvOS 18.0, watchOS 11.0, *) +private let _sharedDrainQueue = DispatchQueue(label: "swift-model.test-drain.shared", attributes: .concurrent) + +@available(macOS 15.0, iOS 18.0, tvOS 18.0, watchOS 11.0, *) +final class _DrainTestExecutor: TaskExecutor, @unchecked Sendable { + private let lock = NSLock() + private var outstanding = 0 + private var _lastEnqueueNs: UInt64 = 0 + /// Closures that fire (at most once each) when `outstanding` hits 0. + private var idleWaiters: [(id: UInt64, fire: @Sendable () -> Void)] = [] + private var nextWaiterId: UInt64 = 0 + + /// Monotonic-ns of the most recent `enqueue`. A task suspended mid-`await` + /// (e.g. on a clock) re-enqueues when it resumes, bumping this — so the drive + /// can require executor-idle to *persist* for a debounce window since the + /// last enqueue, bridging the suspend→resume gap that an instantaneous idle + /// check slips through (premature fixpoint). + var lastEnqueueNs: UInt64 { lock.withLock { _lastEnqueueNs } } + + func enqueue(_ job: consuming ExecutorJob) { + let unowned = UnownedJob(job) + lock.withLock { outstanding += 1; _lastEnqueueNs = _drainMonotonicNs() } + _sharedDrainQueue.async { + unowned.runSynchronously(on: self.asUnownedTaskExecutor()) + let toFire: [@Sendable () -> Void] = self.lock.withLock { + self.outstanding -= 1 + guard self.outstanding == 0 else { return [] } + let fns = self.idleWaiters.map(\.fire) + self.idleWaiters.removeAll() + return fns + } + for f in toFire { f() } + } + } + + /// No executor job is ready/running this instant. + var isExecutorIdle: Bool { lock.withLock { outstanding == 0 } } + + /// Suspend until `outstanding == 0` (event-driven — fired from `enqueue`'s + /// completion, non-starvable, no QoS dependency), OR the GTS `deadlineNs` + /// fires (the deadlock watchdog — `.userInitiated`, also non-starvable), OR + /// the Task is cancelled. At-most-once resolution across all three. + func waitUntilIdleOrDeadline(_ deadlineNs: UInt64) async { + final class State: @unchecked Sendable { + var cont: CheckedContinuation? + var timerCancel: (@Sendable () -> Void)? + var resumed = false + } + let state = State() + let id = lock.withLock { () -> UInt64 in nextWaiterId += 1; return nextWaiterId } + + let resolve: @Sendable () -> Void = { [self] in + let cont: CheckedContinuation? = lock.withLock { + guard !state.resumed else { return nil } + state.resumed = true + state.timerCancel?() + state.timerCancel = nil + idleWaiters.removeAll { $0.id == id } + let c = state.cont + state.cont = nil + return c + } + cont?.resume() + } + + await withTaskCancellationHandler { + await withCheckedContinuation { (c: CheckedContinuation) in + let immediate: Bool = lock.withLock { + if Task.isCancelled || outstanding == 0 { state.resumed = true; return true } + state.cont = c + idleWaiters.append((id, { resolve() })) + return false + } + if immediate { c.resume(); return } + let cancel = scheduleAfter(deadline: deadlineNs) { resolve() } + let stale = lock.withLock { () -> Bool in + if state.resumed { return true } + state.timerCancel = cancel + return false + } + if stale { cancel() } + } + } onCancel: { + resolve() + } + } +} +#endif + +extension TestAccess { + /// True when a per-test harness executor is installed (flag on, macOS 15+). + var _isExecutorDriveActive: Bool { + #if canImport(Dispatch) + if #available(macOS 15.0, iOS 18.0, tvOS 18.0, watchOS 11.0, *) { + return _TestExecutorBox.current is _DrainTestExecutor + } + #endif + return false + } + + /// Drive the model to a STABLE fixpoint (quiescent across two consecutive + /// checks). Returns `true` at the fixpoint, `false` if the deadlock watchdog + /// (`hangDeadlineNs`) fires or the Task is cancelled. No executor → `true` + /// (callers gate on `_isExecutorDriveActive`). + func _driveToStableFixpoint(hangDeadlineNs: UInt64, graceNs: UInt64) async -> Bool { + #if canImport(Dispatch) + if #available(macOS 15.0, iOS 18.0, tvOS 18.0, watchOS 11.0, *), + let exec = _TestExecutorBox.current as? _DrainTestExecutor { + let bg = backgroundCall + // Quiescence = executor idle AND bg idle AND no task pending its first + // run AND NO ACTIVITY OF ANY KIND for a `graceNs` grace window. The + // grace debounces against the most recent of (a) any `_noteActivity` + // — model write / event / probe / task-body-start — and (b) any + // executor enqueue. So ANY activity (a clock-parked task resuming and + // writing, an enqueue, an event) resets the window: on a single test + // the system is quiet at once and the grace elapses fast; under stress + // every resuming task keeps resetting it, so the wait lasts as long as + // necessary and only declares quiescence once the system is genuinely + // done. NON-STARVABLE throughout (counter + GTS, never `.background`). + // `mainCall` is excluded — it's process-global, not per-test. + while !Task.isCancelled { + if _drainMonotonicNs() >= hangDeadlineNs { return false } + await exec.waitUntilIdleOrDeadline(hangDeadlineNs) + if !bg.isIdle { await bg.waitForCurrentItems(deadline: hangDeadlineNs) } + let idleNow = exec.isExecutorIdle && bg.isIdle && !self.context.hasPendingStartTask + if idleNow { + let lastActivity = max(self._lastActivityNs, exec.lastEnqueueNs) + let sinceActivity = _drainMonotonicNs() &- lastActivity + if sinceActivity >= graceNs { + return true // idle, and no activity of any kind for a full grace window + } + // Idle but recent activity — wait out the remainder of the + // grace (non-starvable), then re-check; a resuming task will + // have produced fresh activity and reset it by then. + await _gtsSleep(graceNs &- sinceActivity, hangDeadlineNs: hangDeadlineNs) + } + } + return false + } + #endif + return true + } + + /// Grace window for `settle`'s fixpoint debounce. `settle` is forgiving (a + /// slightly early settle just means the next line re-settles), so a short + /// grace suffices. Tunable knob (runtime vs robustness). + static var _settleGraceNs: UInt64 { 30_000_000 } // 30 ms + + /// Non-starvable sleep for `ns` (or until `hangDeadlineNs`), via GTS — used + /// by the drive's debounce. Cancellation-aware. + func _gtsSleep(_ ns: UInt64, hangDeadlineNs: UInt64) async { + #if canImport(Dispatch) + let target = min(_drainMonotonicNs() &+ ns, hangDeadlineNs) + let s = _GTSSleepState(); let lk = NSLock() + let fire: @Sendable () -> Void = { + let c: CheckedContinuation? = lk.withLock { + guard !s.done else { return nil } + s.done = true; s.cancel?(); s.cancel = nil + let c = s.cont; s.cont = nil; return c + } + c?.resume() + } + await withTaskCancellationHandler { + await withCheckedContinuation { (c: CheckedContinuation) in + let immediate = lk.withLock { () -> Bool in + if Task.isCancelled || s.done { return true } + s.cont = c; return false + } + if immediate { c.resume(); return } + let cancel = scheduleAfter(deadline: target) { fire() } + let stale = lk.withLock { () -> Bool in if s.done { return true }; s.cancel = cancel; return false } + if stale { cancel() } + } + } onCancel: { fire() } + #endif + } +} diff --git a/Sources/SwiftModel/Internal/TestWaitSupport.swift b/Sources/SwiftModel/Internal/TestWaitSupport.swift index 8343fbb..b60ca20 100644 --- a/Sources/SwiftModel/Internal/TestWaitSupport.swift +++ b/Sources/SwiftModel/Internal/TestWaitSupport.swift @@ -84,6 +84,21 @@ extension TestAccess { /// budget was exhausted. @discardableResult package func waitUntilSettled(cleanup: Bool = false, at fileAndLine: FileAndLine) async -> Bool { + // Executor-drain path (primary, when the per-test harness executor is + // active): resolve on the model's STABLE FIXPOINT — a non-starvable, + // load-independent quiescence signal — instead of the `.deferential`/ + // `.background` quiet-check (which macOS starves under parallel load, + // the root cause of the false `settle() timed out`). Waits as long as + // necessary; only a genuine deadlock (no fixpoint within the watchdog) + // fails. See docs/test-determinism-executor-drain.md. + if _isExecutorDriveActive { + let reached = await _driveToStableFixpoint(hangDeadlineNs: _executorHangDeadlineNs(), graceNs: Self._settleGraceNs) + if !reached, !cleanup { + fail("settle() timed out: model never reached a fixpoint (deadlock or runaway).\n\(settleDiagnostics())", at: fileAndLine) + } + return reached + } + let window = cleanup ? Self.settleDebounceCleanupNs : Self.settleDebounceInTestNs let totalBudgetNs = cleanup ? Self.settleCleanupTotalBudgetNs : Self.settleTotalBudgetNs let startNs = _monotonicNs() diff --git a/Sources/SwiftModel/Testing/ModelTestingTrait.swift b/Sources/SwiftModel/Testing/ModelTestingTrait.swift index cca0219..2b03a87 100644 --- a/Sources/SwiftModel/Testing/ModelTestingTrait.swift +++ b/Sources/SwiftModel/Testing/ModelTestingTrait.swift @@ -441,12 +441,15 @@ private func _withModelTestingImpl( } let pending = _PendingModelTestScope(exhaustivity: resolvedExhaustivity, dependencies: mergedDependencies) let testQueue = BackgroundCallQueue() + let execBox = _makeTestExecutorBox() try await _BackgroundCallLocals.$queue.withValue(testQueue) { - try await _ModelTestingLocals.$scope.withValue(pending) { - try await body() - } - if let concrete = pending.concrete, let fl = pending.registrationFileAndLine { - await concrete.checkExhaustion(at: fl) + try await _TestExecutorBox.$current.withValue(execBox) { + try await _ModelTestingLocals.$scope.withValue(pending) { + try await body() + } + if let concrete = pending.concrete, let fl = pending.registrationFileAndLine { + await concrete.checkExhaustion(at: fl) + } } } } @@ -492,6 +495,7 @@ extension ModelTestingTrait: TestScoping, TestTrait, SuiteTrait { // each other's in-flight Observed pipeline updates. let testQueue = BackgroundCallQueue() let testTag = test.name + let execBox = _makeTestExecutorBox() try await _BackgroundCallLocals.$queue.withValue(testQueue) { // Per-test wall-clock cap. // @@ -507,13 +511,15 @@ extension ModelTestingTrait: TestScoping, TestTrait, SuiteTrait { // three the test should fail explicitly rather than hang. try await withoutActuallyEscaping(function) { escapingFunction in try await _withTestTimeout(seconds: ModelTestingTraitOptions.testWallClockSeconds, testTag: testTag) { - try await _ModelTestingLocals.$scope.withValue(pending) { - try await escapingFunction() - } - // After the test body completes, run exhaustion check (still inside the - // test-local queue scope so any teardown backgroundCall work uses testQueue). - if let concrete = pending.concrete, let fl = pending.registrationFileAndLine { - await concrete.checkExhaustion(at: fl) + try await _TestExecutorBox.$current.withValue(execBox) { + try await _ModelTestingLocals.$scope.withValue(pending) { + try await escapingFunction() + } + // After the test body completes, run exhaustion check (still inside the + // test-local queue scope so any teardown backgroundCall work uses testQueue). + if let concrete = pending.concrete, let fl = pending.registrationFileAndLine { + await concrete.checkExhaustion(at: fl) + } } } } diff --git a/Tests/SwiftModelTests/ExecutorDrainSettleTests.swift b/Tests/SwiftModelTests/ExecutorDrainSettleTests.swift new file mode 100644 index 0000000..208dd7b --- /dev/null +++ b/Tests/SwiftModelTests/ExecutorDrainSettleTests.swift @@ -0,0 +1,73 @@ +#if canImport(Dispatch) +import Foundation +import Dispatch +import Testing +import SwiftModel + +// Regression coverage for executor-drain `settle()` (opt-in via +// SWIFT_MODEL_EXPERIMENTAL_DRAIN=1). With the per-test harness executor active, +// `settle()` resolves on the model's drain FIXPOINT — a non-starvable, +// load-independent quiescence signal — instead of the `.deferential`/`.background` +// quiet-check that macOS starves under parallel load (the root cause of the +// false `settle() timed out: model still has active tasks`). See +// docs/test-determinism-executor-drain.md. + +@Model private struct DrainParent: Sendable { + var items: [DrainItem] = [] +} +@Model private struct DrainItem: Sendable, Identifiable { + let id: Int + var done: Bool = false + func onActivate() { + // A short async chain on the per-test executor: several suspension + // points (each needs a CPU slot), then a write — the shape that a + // wall-clock/`.background` settle starves on under parallel load. + node.task { + for _ in 0..<6 { await Task.yield() } + done = true + } + } +} + +/// Saturate ~half the cores for the duration of `body`, then stop — enough to +/// stress scheduling without wedging a shared machine. +@Sendable private func underCPULoad(_ body: () async -> T) async -> T { + let stop = NSLock() + nonisolated(unsafe) var running = true + for _ in 0..