Executor-drain test determinism: settle + expect/waitUntil resolve on a load-independent fixpoint (default-on, required gate) by mansbernhardt · Pull Request #24 · bitofmind/swift-model

mansbernhardt · 2026-06-17T09:11:50Z

Executor-drain test determinism — `.modelTesting` wait verbs resolve on a load-independent fixpoint (now the default)

Combines the settle-drive (#23) and the expect/waitUntil drive-primary work into one PR, flips the executor-drive on by default, and promotes it to the required CI gate.

What changed

.modelTesting's wait verbs (settle / expect / waitUntil) now resolve on a load-independent executor-drain fixpoint instead of a starvable wall-clock budget. Model tasks run on a per-test TaskExecutor; a wait resolves when that executor (plus the per-test background + main-registrar observation queues) is quiescent — a non-starvable signal that doesn't depend on the cooperative pool getting CPU under parallel/CI load.

- Cures the false `settle() timed out: model still has active tasks` failures (reported with an empty active-task list) and the broad --parallel flake population: on a contended dev machine the per-100-run distinct-flaky-test count dropped from ~63 to a small clock-bound remnant, which is now resolved.
- Faster: the serial suite runs ~2.6× faster (fixpoint resolution vs. waiting out debounce windows).
- Default on for macOS 15+ / iOS 18+ / Linux-Swift-6 (custom task executors need the Swift 6 runtime). Older OSes and WASM keep the wall-clock path. Opt out with SWIFT_MODEL_EXPERIMENTAL_DRAIN=0.
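
The drain signal described above can be pictured as a task executor that tracks its outstanding jobs. The following is a minimal sketch under stated assumptions, not the library's `_DrainTestExecutor`: `CountingDrainExecutor`, `waitUntilIdle(timeout:)`, and `runOneJobAndDrain()` are hypothetical names, and a `DispatchGroup` stands in for the real job count. It relies on the Swift 6 `TaskExecutor` protocol, so it needs the macOS 15+ / iOS 18+ runtime or Swift 6 on Linux.

```swift
import Dispatch

// Hypothetical sketch; not the library's _DrainTestExecutor.
@available(macOS 15.0, iOS 18.0, tvOS 18.0, watchOS 11.0, visionOS 2.0, *)
final class CountingDrainExecutor: TaskExecutor, @unchecked Sendable {
    // A concurrent queue runs the jobs; the group tracks jobs that were
    // enqueued but have not finished running yet.
    private let queue = DispatchQueue(label: "sketch.drain", attributes: .concurrent)
    private let outstanding = DispatchGroup()

    func enqueue(_ job: consuming ExecutorJob) {
        let unownedJob = UnownedJob(job)
        let executor = asUnownedTaskExecutor()
        outstanding.enter()
        queue.async {
            unownedJob.runSynchronously(on: executor)
            self.outstanding.leave()
        }
    }

    /// Blocks the calling thread until no job is outstanding or the timeout passes.
    /// Returns true when the executor became idle in time.
    func waitUntilIdle(timeout seconds: Double) -> Bool {
        outstanding.wait(timeout: .now() + seconds) == .success
    }
}

// Starts one task that prefers the sketch executor, then waits for the drain.
@available(macOS 15.0, iOS 18.0, tvOS 18.0, watchOS 11.0, visionOS 2.0, *)
func runOneJobAndDrain() -> Bool {
    let executor = CountingDrainExecutor()
    Task.detached(executorPreference: executor) {
        // Stand-in for a model task body.
    }
    return executor.waitUntilIdle(timeout: 5)
}
```

A task suspended in a sleep holds no job, so a count like this can read idle while work is still parked. That is the premature-fixpoint case the clock-test section of this description deals with.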

The clock-test tail — resolved by splitting it into its two real classes

The premature-fixpoint clock tests are fixed at the correct layer for each:

- Work routed through the executor (testImmediateClock, childTasksCompleteBeforeTeardown): countable, a genuine framework gap. A global-quiescence fail-gate fails a still-unmet expect only when the whole process is executor-quiescent (not merely when one test looks idle), so a child parked mid-clock.sleep while the run is busy no longer trips a false fixpoint. The process-wide counter is a lock-free Swift 6 Atomic (negligible hot-path cost).
- Work parked off the executor (testClockStepByStep, testOnChangeCancelPreviousDiscardsStalework): a TestClock advance-vs-subscribe registration race, invisible to any executor accounting. Fixed test-side with the documented settle()-before-advance pattern, which parks the timer first (registering its deadline) and replaces the Task.yield() ordering gamble.

Discriminator for future flakes: is the pending work routed through our executor? Yes → framework owns it (gate); parked on an external clock → test owns the ordering (settle).
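
The advance-vs-subscribe race in the second class can be reproduced with a toy manual clock. This is a sketch under stated assumptions: `ToyManualClock`, `park(for:wake:)`, and `wokenCount(parkBeforeAdvance:)` are hypothetical and are not the library's `TestClock` API. The point it illustrates is only the ordering: a sleeper that registers its deadline after the advance is never woken by it.

```swift
// Hypothetical toy clock; not the library's TestClock.
final class ToyManualClock {
    private(set) var now = 0
    private var sleepers: [(deadline: Int, wake: () -> Void)] = []

    /// A sleeper parks by registering a deadline relative to the current time.
    func park(for duration: Int, wake: @escaping () -> Void) {
        sleepers.append((deadline: now + duration, wake: wake))
    }

    /// Moves time forward and wakes every sleeper whose deadline has been reached.
    func advance(by duration: Int) {
        now += duration
        let due = sleepers.filter { $0.deadline <= now }
        sleepers.removeAll { $0.deadline <= now }
        due.forEach { $0.wake() }
    }
}

/// Counts wake-ups for the two orderings of "park" and "advance".
func wokenCount(parkBeforeAdvance: Bool) -> Int {
    let clock = ToyManualClock()
    var woken = 0
    if parkBeforeAdvance { clock.park(for: 10) { woken += 1 } }
    clock.advance(by: 10)
    // Registering after the advance computes the deadline from the new time,
    // so this sleeper is still waiting when the function returns.
    if !parkBeforeAdvance { clock.park(for: 10) { woken += 1 } }
    return woken
}
```

In the real tests the park happens inside an asynchronous task, so the test cannot know it has happened without waiting for it. Calling settle() before advance plays the role of `parkBeforeAdvance: true` here.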

CI

- drain=1 (drive, default) is now the required gate, serial + parallel, macOS + Linux.
- drain=0 serial (vanilla wall-clock) stays required: it is the deterministic regression gate that caught the OR-path race (497c2ab).
- drain=0 parallel is informational: it is the documented-flaky legacy wall-clock path.

Validation

- 80 full-suite --parallel iterations at scale=1 (the harshest local condition), 0 failures.
- Serial 3/3 clean; flag-off inert at CI's scale=3.
- Targeted stress: 40× OnChange, 30× Clock, 25× clock+child, all clean.
- Real-consumer (parallel-apple): cured the "CI runs forever" disease (833→962 passing, 13→1 failing suites; the lone residual was a missing-clock-override test bug, fixed there).

See Docs/test-determinism-executor-drain.md (Updates 1–21) for the full design history.

🤖 Generated with Claude Code

`settle()` (and the settle phase of `expect { … }`) resolves on the model's executor-drain FIXPOINT instead of a `.deferential`/`.background`-QoS quiet-check. Under heavy parallel load macOS starves `.background` indefinitely, so the quiet-check never fired and settle reported a false `settle() timed out: model still has active tasks` (empty task list) at ANY budget: the years-old flake the serial-CI fallback and SWIFT_MODEL_TIMEOUT_SCALE were working around. The drain signal is non-starvable (a job-count + GTS, never `.background`) and dependency-free, so settle waits as long as necessary under load and resolves the instant the model is genuinely quiescent.

How it works:
- `_DrainTestExecutor`: one shared concurrent GCD queue backs every per-test executor (avoids per-test thread-pool explosion); each keeps its own outstanding-job count + event-driven `waitUntilIdleOrDeadline`.
- Model task bodies (`node.task`/`forEach`) adopt it via `executorPreference` under `.modelTesting`; the trait installs a per-test executor box.
- `_driveToStableFixpoint`: quiescent = executor idle + per-test bg-idle + no pending-start task, persisted for a short NON-STARVABLE grace that debounces against ALL activity (every `_noteActivity` + executor enqueue) so a clock-parked task's resume resets it. `mainCall` excluded (process-global).
- `waitUntilSettled` resolves on the fixpoint; a generous watchdog only catches a true deadlock.

OPT-IN via SWIFT_MODEL_EXPERIMENTAL_DRAIN=1: inert by default (executor box is nil → every wait keeps its current path), so the suite is unchanged unless enabled.

Validated: flag ON, settle suites green incl. a 60-iteration load-stressed child-task settle test; flag OFF, broad regression green (unchanged).

Custom task executors need Swift 6 runtime (macOS 15+); older OS/WASM stay on the existing path.

`expect`/`waitUntil` drive-primary migration is deliberately NOT in this PR (the fixpoint-as-fail judgment has open scaling/race work); it is a follow-up.
Full design arc in docs/test-determinism-executor-drain.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Under the executor drive (flag on, macOS 15+), the per-test wall-clock cap becomes an inactivity watchdog keyed on the test's own _DrainTestExecutor activity instead of an absolute 30s deadline. Under full-parallel a healthy test's jobs queue behind hundreds on the shared drain queue, so wall-clock-to-drain legitimately exceeds 30s though the test is still progressing; the watchdog resets on every executor enqueue/completion and trips only after a full window of genuine inactivity (a real stall). Per-test isolated, so a single hung test is still caught while the rest of the suite is busy.

Flag off => activityProbe is nil => original absolute cap, unchanged (inert).

_birthNs floors activityNs so a test that hasn't yet enqueued work reads 'active as of now' rather than the epoch (else target = 0 + window is already past and the watchdog trips instantly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
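
The watchdog arithmetic in this commit message, including the birth floor, can be written out as a small value type. This is a sketch under stated assumptions: `InactivityWatchdog` and its members are hypothetical names chosen for illustration, and only the `max(activity, birth) + window` rule is taken from the text above.

```swift
// Hypothetical sketch of an inactivity watchdog with a birth-time floor.
struct InactivityWatchdog {
    let birthNs: UInt64              // when the test started
    let windowNs: UInt64             // how long the test may stay inactive
    var lastActivityNs: UInt64 = 0   // 0 means no activity has been recorded yet

    /// Called on every executor enqueue or completion.
    mutating func noteActivity(atNs now: UInt64) {
        lastActivityNs = now
    }

    /// Flooring by the birth time keeps a test that has not enqueued anything
    /// from being measured against the epoch.
    func deadlineNs() -> UInt64 {
        max(lastActivityNs, birthNs) + windowNs
    }

    func hasTripped(atNs now: UInt64) -> Bool {
        now >= deadlineNs()
    }
}
```

Without the floor, a watchdog born at time 1000 with a window of 30 would compute a deadline of 30 and trip on its first check. With the floor the deadline is 1030, and each recorded activity pushes it further out.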

…tail collapses into the item-2 fixpoint-fail race Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…oint + Update 12

The off-main main-registrar observation path (withObservationTracking/Observed/onChange) delivers via context.mainCallQueue, which is PER-CONTEXT (per-test), not the process-global `mainCall`. The drive can therefore safely wait on it (watchdog-bounded, no inter-test hang). Fixpoint now unions executor + bg + per-test main-observation idle.

Finding (Update 12): this is correct coverage (doc risk #1) but does NOT close the residual. Full-parallel shows ~15 flaky failures with and without it, and events don't use the registrar at all. The residual is a single fine-grained fixpoint-SAMPLING race (Update 8's theorem: no finite grace is safe under unbounded load with a dependency-free signal), which (B) cannot beat by covering more queues. Recommend pivoting to (A): expect never self-fails at a fixpoint; a still-unmet predicate fails only at the per-test inactivity watchdog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ike/diag investigation tests; Update 13 expect never self-fails at a single fixpoint sample: it resolves PASS reactively the instant the predicate is true (unchanged), and FAILs only after a sustained window of genuine inactivity (_expectGraceNs = 2s × timeoutScale), which any activity/enqueue/queue-drain resets — so under load the fail defers until truly quiet (no false fail). settle keeps its 30ms grace. Removed SpikeDrainExecutorTests/DiagExecutorClockTests (investigation records; some encode the now-abandoned fast-fail semantics, e.g. brokenExpectFailsFastAtFixpoint). Finding (Update 13): A reduces the targeted deferred-resume flakes (testChildEvents 1/3 -> 0/8 focused) but the full-parallel aggregate is ~neutral - dominated by a broad observation/transition/event propagation race that neither A (grace length) nor B (mainCall coverage) cracks. Next step is tracing, not further blind tuning. Flag remains OFF by default. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…een; drive 2.6x faster serially + halves parallel flake; ~5-6 flaky serial residual is the bar to flip default) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fast-fail (not the delayed-resume race); revert expect window to 2s Decisive experiment: with the expect inactivity-fail window set to 30s, the Observed/transition residual (testChangeOf, testRecursive*, testCapturedObserved*, testTaskId*) STILL fails at 0.05-0.1s — so it does not go through the drive's inactivity-fail path and (A) cannot address it. The failure is an accumulated update sequence missing its last element (counts==[5] vs [5,8]). New hypothesis: the executor changes the interleaving between the test thread's mutations and the Observed consumer's appends. (A) kept at 2s (helps the event sub-class + chosen semantics) but is not the lever for this dominant sub-class. Next: trace one testChangeOf failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rive residual); remove diagnostic traces; Update 16 Root cause: expect()'s per-call executor driver runs _resolveUnmetPredicatesAtFixpoint on every loop iteration, including when _driveToStableFixpoint returns false due to the Task being CANCELLED (which expect() does as soon as its awaitPredicate resolves). Because _pendingExpects is shared across a test's sequential expects, a just-cancelled driver from expect A would fail expect B's freshly-registered predicate before B's satisfying write/event arrived — a fast, window-independent false fail leaving an accumulated Observed/onChange/event sequence missing its last element. Fix: the driver bails on Task.isCancelled before touching the shared pending entries. Results (same machine): flag-on SERIAL (CI gate) goes from ~5-6 flaky/run to GREEN (2/2, 42s, 2.6x faster than flag-off's 112s); flag-on --parallel ~12-18 -> 1-3; the OnChange+UpdateStream combo 0/6 (was ~1/run). The whole observation/transition/event residual class is gone. Diagnostic obs/drive traces removed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
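
The shape of this fix can be shown with a toy model of the shared pending list. This is a sketch under stated assumptions: `PendingExpect` and `resolveUnmetAtFixpoint(_:driverCancelled:)` are hypothetical stand-ins, the real driver checks `Task.isCancelled` inside an async loop, and the real `_pendingExpects` structure is not shown in this PR text.

```swift
// Hypothetical model of an entry in the shared list of unmet expects.
struct PendingExpect {
    let name: String
    var failed = false
}

/// Marks every still-unmet expect as failed, unless the calling driver
/// has already been cancelled.
func resolveUnmetAtFixpoint(_ pending: inout [PendingExpect], driverCancelled: Bool) {
    // The guard is the fix: a cancelled driver belongs to an expect that has
    // already resolved, so it must not touch entries registered by a later one.
    guard !driverCancelled else { return }
    for index in pending.indices {
        pending[index].failed = true
    }
}
```

In this model, the bug corresponds to running the loop without the guard: the stale driver from expect A fails expect B's entry before B's satisfying write arrives.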

… Linux Adds a drain={0,1} dimension to the macOS and Linux test matrices. drain=1 runs the suite with SWIFT_MODEL_EXPERIMENTAL_DRAIN=1 (the executor-drive active), serial + parallel, and is continue-on-error (informational) while the drive is opt-in. This is the production-readiness gate the design note calls for: it exercises the drive's load-tolerance on the real small/contended runners — the one signal a dev machine can't provide — without blocking merges on the experimental path. drain=0 rows are the existing required gates, unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…llel tail; Update 17

(b) tail hardening:
- ReactiveWaitInfrastructureTests.awaitQuietWindow_firesAfterQuietWindow: drop the wall-clock UPPER-bound assertion (< 4.5s × scale). It asserted timing on a .deferential/.background callback macOS starves under --parallel, so it flaked on BOTH flag states; lower-bound + .timeout checks keep the real coverage.
- Document testClockStepByStep, testOnChangeCancelPreviousDiscardsStalework, testRaceVariant in CLAUDE.md's known load-sensitive list. They flake flag-off too; the executor-drive REDUCES the dev-machine --parallel flake population ~5-10x (flag-off 11-23/run vs flag-on 1-4/run on this machine), so they're a small remnant of a shared tail, not drive regressions.

Update 17 records the same-machine flag-off-vs-on --parallel comparison and the CI flag-on gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…+Linux, serial+parallel) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… (63->6 tests) but regresses clock-parked tests at scale=1 (premature-fixpoint hard core)

The wait verbs (settle/expect/waitUntil) now resolve on the load-independent executor-drain fixpoint by default on macOS 15+/iOS 18+/Linux-Swift-6, instead of the starvable wall-clock quiet-window. _makeTestExecutorBox returns the executor unless SWIFT_MODEL_EXPERIMENTAL_DRAIN=0 (opt-out → legacy wall-clock path). Why: cures the false settle()-timeouts under .background starvation; cuts the dev-machine --parallel flaky-test population ~10x (63->~6 per 100 runs); ~2.6x faster serial. Validated against a real consumer (parallel-apple): cures its 'CI runs forever' disease (833->962 passing, 13->1 failing suites). Known limitation (documented, Update 20 + CHANGELOG): a few clock-driven tests (value produced after await clock.sleep) can be MORE flaky than the wall-clock path under --parallel at scale=1 — a task parked mid-sleep is not a ready executor job, so the drive can read 'quiescent' before it resumes (premature fixpoint). No clean dependency-free fix: a hard transient-task gate hangs on intentional long-lived node.tasks (Update 7's edge, confirmed); a bounded-wait neither fixes the clock tests nor avoids slowdown. Run those serially or bump TIMEOUT_SCALE (CI uses 3). CI: drain=0 (wall-clock opt-out) stays the REQUIRED deterministic gate; drain=1 (the drive, = default) runs informationally until the clock tail is gate-able. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ed CI gate

The premature-fixpoint clock tail (Update 20's "unfixable" four tests) is resolved by splitting it into its two real classes and fixing each at the correct layer:

1. Work routed THROUGH the executor (testImmediateClock, childTasksCompleteBeforeTeardown): countable, a genuine framework gap. Kept the global-quiescence fail-gate (an unmet `expect` fails only when the WHOLE process is executor-quiescent, not just when one test looks idle, so a child parked mid-`clock.sleep` while the run is busy no longer trips a false fixpoint). Converted its hot-path lock to a lock-free Swift 6 `Atomic` so the per-enqueue/-completion cost is negligible.
2. Work parked OFF the executor (testClockStepByStep, testOnChangeCancelPreviousDiscardsStalework): a TestClock advance-vs-subscribe registration race, invisible to any executor accounting. Fixed test-side with the documented `settle()`-before-`advance` pattern (parks the timer, registering its deadline, first), replacing the `Task.yield()` ordering gamble.

Discriminator for future flakes: is the pending work routed through our executor? Yes -> framework owns it (gate); parked on an external clock -> test owns the ordering (settle). Not a license to blame scheduling for everything: class (1) was a real gap the gate closes, making async child-task completion testable without a manual settle.

CI: promote drain=1 (drive, default) to the REQUIRED gate on all rows; keep the vanilla serial wall-clock row required (deterministic regression gate, caught the OR-path race); demote only the parallel wall-clock row to informational (the documented-flaky legacy path).

Validation: 80 full-suite --parallel iterations at scale=1 clean (30+50), serial 3/3 clean, flag-off inert at scale=3; targeted 40x OnChange / 30x Clock / 25x clock+child clean. Docs (CLAUDE.md, CHANGELOG, design-note Update 21) updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aitUntil tail) Real-runner CI showed the drain=1 PARALLEL promotion was too aggressive: on the small GitHub runners the drain=1 parallel jobs failed intermittently (macOS flipped to pass on re-run; Linux is the consistent offender) on two waitUntil-based known-flaky tests — testObservedStreamWithModelAccessingObservable (unsupported @observable interop) and testSharedDependency (deinit-chain poll). Not the global gate: both are waitUntil tests, the gate lives in the expect driver, and the drain=1 SERIAL jobs passed cleanly both runs. Root cause is a pre-existing interaction — under the drive waitUntil extends its deadline to the 120s watchdog, which exceeds the 90s trait cap (30s x scale=3), so a never-true waitUntil on a saturated 2-vCPU container hits the trait cap (un-absorbable by withKnownIssue) before its own catchable timeout. Split required-vs-informational by MODE: serial rows (both drains) are the required deterministic gates (drive-serial primary, wall-clock-serial vanilla regression gate); parallel rows (both drains) informational (wall-clock starves; drive has the waitUntil tail). Core goal preserved: drive is now a required gate (serial) and the default. Design note Update 22 records the full analysis + the future waitUntil-cap fix to eventually promote parallel. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(absorbable) Fixes the Update 22 latent bug: under the executor drive, waitUntil extended its deadline to a fixed 120s watchdog, which EXCEEDS the per-test trait cap (testWallClockSeconds = 30s x scale = 90s on CI). So a never-true waitUntil on a saturated container hit the trait cap first, throwing _TestTimeoutError OUTSIDE the test body — un-absorbable by withKnownIssue (the testObservedStreamWithModelAccessingObservable hard failure). Now, under the drive, waitUntil fails when this test's executor has been INACTIVE for 60% of the trait window (18s x scale), using the same exec.activityNs signal the trait cap uses: load-tolerant (a progressing model resets it) but guaranteed to fire before the trait cap, via the catchable WaitUntilTimeoutError that withKnownIssue can absorb. A 120s absolute watchdog still backstops the continuous-activity-but-never-true edge. Flag-off path unchanged (no probe → the scaled absolute budget governs, exactly as before). Validation: wait-infra meta-tests 20/20, full suite 10/10 at scale=1, no regression. Design note Update 23. Parallel stays informational: testSharedDependency (a real deinit-chain assertion, not withKnownIssue) remains a genuine container load-flake this fix improves but doesn't eliminate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
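
The ordering of the three time limits in this commit message is easier to check when written out. This is a sketch under stated assumptions: `WaitBudget` and its property names are hypothetical, and the constants (30s trait cap, 18s inactivity window, 120s absolute watchdog) are the ones quoted above.

```swift
// Hypothetical helper that spells out the limits quoted in the commit message.
struct WaitBudget {
    let timeoutScale: Int                 // CI uses 3 according to this PR

    /// Per-test trait cap: 30s x scale.
    var traitCapSeconds: Int { 30 * timeoutScale }

    /// waitUntil inactivity-fail window under the drive: 60% of the trait cap.
    var inactivityFailSeconds: Int { 18 * timeoutScale }

    /// Absolute backstop for a model that stays active but never satisfies the predicate.
    let absoluteWatchdogSeconds = 120

    /// The catchable waitUntil timeout has to fire before the trait cap does.
    var inactivityFiresBeforeTraitCap: Bool {
        inactivityFailSeconds < traitCapSeconds
    }

    /// The old behaviour relied on the absolute watchdog alone, which is too late
    /// whenever it exceeds the trait cap.
    var watchdogAloneIsTooLate: Bool {
        absoluteWatchdogSeconds > traitCapSeconds
    }
}
```

At scale 3 this gives a 90s trait cap and a 54s inactivity window, so the catchable timeout fires first, while the 120s watchdog on its own would lose to the cap.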

…nditional default

The executor-drive is validated (80+ local --parallel iterations, multiple green CI runs, ~2.6x faster serial, flake population gone), so the opt-out flag and the drain=0 comparison machinery no longer earn their keep in the tree.

Key distinction: "what the library ships to" (iOS 14+) is not "what host runs the tests". `.modelTesting` runs on the dev/CI machine, which on any modern toolchain (macOS 15+ / Linux-Swift-6) can run the drive, so the drive is simply on there.

- _makeTestExecutorBox returns the executor whenever #available (macOS 15+/iOS 18+), with no env-var check. The wall-clock path survives ONLY as the automatic fallback for hosts that can't run the drive (pre-macOS-15 / pre-iOS-18 hosts and simulators, older Swift, WASM with no Dispatch), selected by availability, not a toggle.
- CI matrix drops the `drain` dimension: each OS runs just {parallel, serial} on the drive. Serial required; parallel informational (the waitUntil tail).
- ExecutorDrainSettleTests gates on #available instead of the env var (so it now actually runs on supported hosts rather than no-opping).
- CHANGELOG, CLAUDE.md, design-note Update 24 updated. Before/after comparison preserved in git history (one `git show` away).

Validation: full suite 5/5 --parallel + serial (770 tests) clean; drain-settle regression test now runs and passes 3/3; build warning-free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fore teardown The test shares one Dependency across two children, each observing its value via forEach(Observed). `sharedDep.value -= 2` (→5) delivers "(->5)" to EACH child reactively, while both are alive. The old test waited for only ONE "(->5)" then removed the children, letting the SECOND arrive entangled with child teardown — "a deinit chain whose timing is at the mercy of when the last strong reference is released." That teardown-timed delivery is the load-sensitive parallel flake (~2/1000; once exceeded even the 120s drive backstop on a saturated container). Fix: assert both children see 5 and wait for "(->5)(->5)" UP FRONT, while both children exist — so both observations are reactive, executor-driven, and deterministic, not dependent on teardown ordering. The trailing wait then only needs the "d" destruct log (imminent once lifetime == .destructed). No large timeout needed anymore. (Investigated a settle()-based approach first; it made things worse — 1/30 — because that second log entry can land AFTER executor quiescence via the ARC/cancellation-flush path, which settle doesn't drive. Reverted; the reorder is the real fix: make the delivery reactive instead of teardown-timed.) Validation: testSharedDependency 60/60 isolated, full suite 10/10 --parallel, all clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

With testObservedStream now absorbed by withKnownIssue (waitUntil sub-trait-cap fix, Update 23) and testSharedDependency made deterministic (reactive shared-dep observations awaited up front, Update 25), the parallel `waitUntil` tail that kept `--parallel` informational is gone. Verified green across 4 consecutive CI runs (push + 3 re-runs; parallel jobs green on macOS and Linux every time). Drop `continue-on-error` from both jobs: serial AND parallel are now required, merge-blocking gates on macOS and Linux. Completes the arc — the executor-drive is the unconditional default test-wait mechanism and the full CI matrix gates on it. CLAUDE.md + design-note Update 25 updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mansbernhardt and others added 9 commits June 16, 2026 17:25

Base expect drive-primary on settle-drain (overlay spike expect changes)

d803b9c

docs: Update 11 — item 1a (load-tolerant trait cap) done; event long-…

31ccbf8

…tail collapses into the item-2 fixpoint-fail race Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: Update 14 — corrected same-machine baseline (flag-off serial gr…

b6d0f80

…een; drive 2.6x faster serially + halves parallel flake; ~5-6 flaky serial residual is the bar to flip default) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mansbernhardt changed the title ~~Make expect/waitUntil drive-primary under the executor (opt-in); fix the cancelled-driver residual~~ Executor-drain test determinism: settle + expect/waitUntil drive-primary (opt-in) Jun 17, 2026

mansbernhardt changed the base branch from claude/settle-drain to main June 17, 2026 09:40

mansbernhardt mentioned this pull request Jun 17, 2026

[draft/foundation] Executor-drain settle(): non-starvable quiescence (coupled to expect follow-up) #23

Closed

mansbernhardt and others added 6 commits June 17, 2026 13:40

docs: Update 18 — CI flag-on gate passed clean on real runners (macOS…

b4d886f

…+Linux, serial+parallel) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: Update 19 — 100x stress: drive removes ~90% of flake population…

4599fbd

… (63->6 tests) but regresses clock-parked tests at scale=1 (premature-fixpoint hard core)

mansbernhardt changed the title ~~Executor-drain test determinism: settle + expect/waitUntil drive-primary (opt-in)~~ Executor-drain test determinism: settle + expect/waitUntil resolve on a load-independent fixpoint (default-on, required gate) Jun 18, 2026

mansbernhardt and others added 5 commits June 18, 2026 17:01

mansbernhardt merged commit 5a0c6f3 into main Jun 18, 2026
18 checks passed

mansbernhardt merged 20 commits into main from claude/expect-drain
