Executor-drain test determinism: settle + expect/waitUntil resolve on a load-independent fixpoint (default-on, required gate) #24
Merged
`settle()` (and the settle phase of `expect { … }`) resolves on the model's
executor-drain FIXPOINT instead of a `.deferential`/`.background`-QoS quiet-check.
Under heavy parallel load macOS starves `.background` indefinitely, so the
quiet-check never fired and settle reported a false `settle() timed out: model
still has active tasks` (empty task list) at ANY budget — the years-old flake the
serial-CI fallback and SWIFT_MODEL_TIMEOUT_SCALE were working around. The drain
signal is non-starvable (a job-count + GTS, never `.background`) and
dependency-free, so settle waits as long as necessary under load and resolves the
instant the model is genuinely quiescent.
How it works:
- `_DrainTestExecutor`: one shared concurrent GCD queue backs every per-test
executor (avoids per-test thread-pool explosion); each keeps its own
outstanding-job count + event-driven `waitUntilIdleOrDeadline`.
- Model task bodies (`node.task`/`forEach`) adopt it via `executorPreference`
under `.modelTesting`; the trait installs a per-test executor box.
- `_driveToStableFixpoint`: quiescent = executor idle + per-test bg-idle + no
pending-start task, persisted for a short NON-STARVABLE grace that debounces
against ALL activity (every `_noteActivity` + executor enqueue) so a clock-
parked task's resume resets it. `mainCall` excluded (process-global).
- `waitUntilSettled` resolves on the fixpoint; a generous watchdog only catches
a true deadlock.
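The job-count idea in the first bullet can be sketched as follows. This is a minimal illustration only, not the PR's `_DrainTestExecutor`: the `DrainCounter` type, its queue label, and the blocking `NSCondition` wait are assumptions made for the example (the real executor is described as a custom task executor with an event-driven wait).

```swift
import Foundation
import Dispatch

/// Hypothetical stand-in for a per-test drain executor: it only counts
/// outstanding jobs and lets a caller wait until that count reaches zero.
final class DrainCounter: @unchecked Sendable {
    /// One shared concurrent queue backs every instance (assumed label).
    private static let sharedQueue = DispatchQueue(label: "example.drain.shared",
                                                   attributes: .concurrent)
    private let condition = NSCondition()
    private var outstanding = 0   // guarded by `condition`

    /// Counts the job as outstanding, then runs it on the shared queue.
    func enqueue(_ job: @escaping @Sendable () -> Void) {
        condition.lock()
        outstanding += 1
        condition.unlock()
        Self.sharedQueue.async {
            job()
            self.condition.lock()
            self.outstanding -= 1
            if self.outstanding == 0 { self.condition.broadcast() }
            self.condition.unlock()
        }
    }

    /// Returns true if the counter reached zero before `deadline`.
    func waitUntilIdleOrDeadline(_ deadline: Date) -> Bool {
        condition.lock()
        defer { condition.unlock() }
        while outstanding > 0 {
            // `wait(until:)` returns false once the deadline has passed.
            if !condition.wait(until: deadline) { return outstanding == 0 }
        }
        return true
    }
}
```

The wait is woken by the completion that brings the count to zero rather than by a timer, which is why such a signal does not depend on a low-QoS callback getting scheduled.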
OPT-IN via SWIFT_MODEL_EXPERIMENTAL_DRAIN=1 — inert by default (executor box is
nil → every wait keeps its current path), so the suite is unchanged unless
enabled. Validated: flag ON, settle suites green incl. a 60-iteration
load-stressed child-task settle test; flag OFF, broad regression green
(unchanged). Custom task executors need Swift 6 runtime (macOS 15+); older
OS/WASM stay on the existing path.
`expect`/`waitUntil` drive-primary migration is deliberately NOT in this PR (the
fixpoint-as-fail judgment has open scaling/race work) — follow-up. Full design
arc in docs/test-determinism-executor-drain.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under the executor drive (flag on, macOS 15+), the per-test wall-clock cap becomes an inactivity watchdog keyed on the test's own _DrainTestExecutor activity instead of an absolute 30s deadline. Under full-parallel a healthy test's jobs queue behind hundreds on the shared drain queue, so wall-clock-to-drain legitimately exceeds 30s though the test is still progressing; the watchdog resets on every executor enqueue/completion and trips only after a full window of genuine inactivity (a real stall). Per-test isolated, so a single hung test is still caught while the rest of the suite is busy.

Flag off => activityProbe is nil => original absolute cap, unchanged (inert).

_birthNs floors activityNs so a test that hasn't yet enqueued work reads 'active as of now' rather than the epoch (else target = 0 + window is already past and the watchdog trips instantly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
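The `_birthNs` floor is easiest to see with numbers. The sketch below is a hypothetical model of the target computation only; the `InactivityWatchdog` type and its method names are assumptions for illustration, not the library's code.

```swift
/// Hypothetical model of an inactivity watchdog: it trips once a full
/// window has elapsed since the last recorded activity.
struct InactivityWatchdog {
    let birthNs: UInt64    // when the test started
    let windowNs: UInt64   // length of the inactivity window

    /// Flooring at `birthNs` means a test that has not enqueued anything
    /// yet (activityNs == 0) counts as active since its start.
    func targetNs(activityNs: UInt64) -> UInt64 {
        max(activityNs, birthNs) + windowNs
    }

    func hasTripped(nowNs: UInt64, activityNs: UInt64) -> Bool {
        nowNs >= targetNs(activityNs: activityNs)
    }
}
```

Without the `max`, a fresh test with `activityNs == 0` would get `target = 0 + window`, which is already in the past for any realistic clock reading.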
…tail collapses into the item-2 fixpoint-fail race Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oint + Update 12

The off-main main-registrar observation path (withObservationTracking/Observed/onChange) delivers via context.mainCallQueue, which is PER-CONTEXT (per-test), not the process-global `mainCall`. The drive can therefore safely wait on it (watchdog-bounded, no inter-test hang). Fixpoint now unions executor + bg + per-test main-observation idle.

Finding (Update 12): this is correct coverage (doc risk #1) but does NOT close the residual. Full-parallel shows ~15 flaky failures with and without it, and events don't use the registrar at all. The residual is a single fine-grained fixpoint-SAMPLING race (Update 8's theorem: no finite grace is safe under unbounded load with a dependency-free signal), which (B) cannot beat by covering more queues. Recommend pivoting to (A): expect never self-fails at a fixpoint; a still-unmet predicate fails only at the per-test inactivity watchdog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ike/diag investigation tests; Update 13

expect never self-fails at a single fixpoint sample: it resolves PASS reactively the instant the predicate is true (unchanged), and FAILs only after a sustained window of genuine inactivity (_expectGraceNs = 2s × timeoutScale), which any activity/enqueue/queue-drain resets, so under load the fail defers until truly quiet (no false fail). settle keeps its 30ms grace.

Removed SpikeDrainExecutorTests/DiagExecutorClockTests (investigation records; some encode the now-abandoned fast-fail semantics, e.g. brokenExpectFailsFastAtFixpoint).

Finding (Update 13): A reduces the targeted deferred-resume flakes (testChildEvents 1/3 -> 0/8 focused) but the full-parallel aggregate is ~neutral - dominated by a broad observation/transition/event propagation race that neither A (grace length) nor B (mainCall coverage) cracks. Next step is tracing, not further blind tuning. Flag remains OFF by default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…een; drive 2.6x faster serially + halves parallel flake; ~5-6 flaky serial residual is the bar to flip default) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fast-fail (not the delayed-resume race); revert expect window to 2s

Decisive experiment: with the expect inactivity-fail window set to 30s, the Observed/transition residual (testChangeOf, testRecursive*, testCapturedObserved*, testTaskId*) STILL fails at 0.05-0.1s, so it does not go through the drive's inactivity-fail path and (A) cannot address it. The failure is an accumulated update sequence missing its last element (counts==[5] vs [5,8]).

New hypothesis: the executor changes the interleaving between the test thread's mutations and the Observed consumer's appends. (A) kept at 2s (helps the event sub-class + chosen semantics) but is not the lever for this dominant sub-class. Next: trace one testChangeOf failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rive residual); remove diagnostic traces; Update 16

Root cause: expect()'s per-call executor driver runs _resolveUnmetPredicatesAtFixpoint on every loop iteration, including when _driveToStableFixpoint returns false due to the Task being CANCELLED (which expect() does as soon as its awaitPredicate resolves). Because _pendingExpects is shared across a test's sequential expects, a just-cancelled driver from expect A would fail expect B's freshly-registered predicate before B's satisfying write/event arrived: a fast, window-independent false fail leaving an accumulated Observed/onChange/event sequence missing its last element.

Fix: the driver bails on Task.isCancelled before touching the shared pending entries.

Results (same machine): flag-on SERIAL (CI gate) goes from ~5-6 flaky/run to GREEN (2/2, 42s, 2.6x faster than flag-off's 112s); flag-on --parallel ~12-18 -> 1-3; the OnChange+UpdateStream combo 0/6 (was ~1/run). The whole observation/transition/event residual class is gone. Diagnostic obs/drive traces removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
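The shape of that fix can be illustrated with a small model. Everything here is hypothetical: `PendingExpects` stands in for the shared `_pendingExpects` table, `driverStep` for one iteration of the driver loop, and the `isCancelled` parameter for what would be a `Task.isCancelled` check in real async code.

```swift
/// Hypothetical stand-in for the pending-predicate table that a test's
/// sequential expects share.
final class PendingExpects {
    enum Outcome { case pending, failed }
    private(set) var entries: [String: Outcome] = [:]

    func register(_ name: String) { entries[name] = .pending }

    /// Marks every still-pending predicate as failed.
    func failAllPending() {
        entries = entries.mapValues { $0 == .pending ? .failed : $0 }
    }
}

/// One iteration of a (hypothetical) expect driver. A cancelled driver
/// belongs to an expect that has already resolved, so it must return
/// before touching entries that a later expect may have registered.
func driverStep(pending: PendingExpects, isCancelled: Bool) {
    guard !isCancelled else { return }
    pending.failAllPending()
}
```

Without the guard, a stale driver from expect A would reach `failAllPending()` and fail expect B's fresh entry, which matches the fast, window-independent false fail described in the commit message.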
… Linux
Adds a drain={0,1} dimension to the macOS and Linux test matrices. drain=1 runs
the suite with SWIFT_MODEL_EXPERIMENTAL_DRAIN=1 (the executor-drive active),
serial + parallel, and is continue-on-error (informational) while the drive is
opt-in. This is the production-readiness gate the design note calls for: it
exercises the drive's load-tolerance on the real small/contended runners — the
one signal a dev machine can't provide — without blocking merges on the
experimental path. drain=0 rows are the existing required gates, unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…llel tail; Update 17

(b) tail hardening:
- ReactiveWaitInfrastructureTests.awaitQuietWindow_firesAfterQuietWindow: drop the wall-clock UPPER-bound assertion (< 4.5s × scale). It asserted timing on a .deferential/.background callback macOS starves under --parallel, so it flaked on BOTH flag states; lower-bound + .timeout checks keep the real coverage.
- Document testClockStepByStep, testOnChangeCancelPreviousDiscardsStalework, testRaceVariant in CLAUDE.md's known load-sensitive list. They flake flag-off too; the executor-drive REDUCES the dev-machine --parallel flake population ~5-10x (flag-off 11-23/run vs flag-on 1-4/run on this machine). They're a small remnant of a shared tail, not drive regressions.

Update 17 records the same-machine flag-off-vs-on --parallel comparison and the CI flag-on gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…+Linux, serial+parallel) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (63->6 tests) but regresses clock-parked tests at scale=1 (premature-fixpoint hard core)
The wait verbs (settle/expect/waitUntil) now resolve on the load-independent executor-drain fixpoint by default on macOS 15+/iOS 18+/Linux-Swift-6, instead of the starvable wall-clock quiet-window. _makeTestExecutorBox returns the executor unless SWIFT_MODEL_EXPERIMENTAL_DRAIN=0 (opt-out → legacy wall-clock path).

Why: cures the false settle()-timeouts under .background starvation; cuts the dev-machine --parallel flaky-test population ~10x (63->~6 per 100 runs); ~2.6x faster serial. Validated against a real consumer (parallel-apple): cures its 'CI runs forever' disease (833->962 passing, 13->1 failing suites).

Known limitation (documented, Update 20 + CHANGELOG): a few clock-driven tests (value produced after await clock.sleep) can be MORE flaky than the wall-clock path under --parallel at scale=1. A task parked mid-sleep is not a ready executor job, so the drive can read 'quiescent' before it resumes (premature fixpoint). No clean dependency-free fix: a hard transient-task gate hangs on intentional long-lived node.tasks (Update 7's edge, confirmed); a bounded-wait neither fixes the clock tests nor avoids slowdown. Run those serially or bump TIMEOUT_SCALE (CI uses 3).

CI: drain=0 (wall-clock opt-out) stays the REQUIRED deterministic gate; drain=1 (the drive, = default) runs informationally until the clock tail is gate-able.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed CI gate

The premature-fixpoint clock tail (Update 20's "unfixable" four tests) is resolved by splitting it into its two real classes and fixing each at the correct layer:

1. Work routed THROUGH the executor (testImmediateClock, childTasksCompleteBeforeTeardown): countable, a genuine framework gap. Kept the global-quiescence fail-gate (an unmet `expect` fails only when the WHOLE process is executor-quiescent, not just when one test looks idle, so a child parked mid-`clock.sleep` while the run is busy no longer trips a false fixpoint). Converted its hot-path lock to a lock-free Swift 6 `Atomic` so the per-enqueue/-completion cost is negligible.
2. Work parked OFF the executor (testClockStepByStep, testOnChangeCancelPreviousDiscardsStalework): a TestClock advance-vs-subscribe registration race, invisible to any executor accounting. Fixed test-side with the documented `settle()`-before-`advance` pattern (parks the timer, registering its deadline, first), replacing the `Task.yield()` ordering gamble.

Discriminator for future flakes: is the pending work routed through our executor? Yes -> framework owns it (gate); parked on an external clock -> test owns the ordering (settle). Not a license to blame scheduling for everything: class (1) was a real gap the gate closes, making async child-task completion testable without a manual settle.

CI: promote drain=1 (drive, default) to the REQUIRED gate on all rows; keep the vanilla serial wall-clock row required (deterministic regression gate, caught the OR-path race); demote only the parallel wall-clock row to informational (the documented-flaky legacy path).

Validation: 80 full-suite --parallel iterations at scale=1 clean (30+50), serial 3/3 clean, flag-off inert at scale=3; targeted 40x OnChange / 30x Clock / 25x clock+child clean. Docs (CLAUDE.md, CHANGELOG, design-note Update 21) updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
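Why the order of "register the timer" and "advance the clock" matters for class (2) can be shown with a toy manual clock. This is not the library's `TestClock`; `ManualClock`, `schedule(after:_:)` and the two helper functions are invented for the illustration, and the real pattern uses `settle()` to make sure the sleeping task has parked before `advance` is called.

```swift
/// Toy manual clock: timers fire only when `advance` moves time past
/// a deadline that was registered beforehand.
final class ManualClock {
    private(set) var now = 0
    private var timers: [(deadline: Int, fire: () -> Void)] = []

    /// Registers a timer whose deadline is relative to the CURRENT time.
    func schedule(after delay: Int, _ fire: @escaping () -> Void) {
        timers.append((deadline: now + delay, fire: fire))
    }

    /// Moves time forward and fires every timer that is now due.
    func advance(by delta: Int) {
        now += delta
        let current = now
        let due = timers.filter { $0.deadline <= current }
        timers.removeAll { $0.deadline <= current }
        for timer in due { timer.fire() }
    }
}

/// Racy order: the clock is advanced before the timer is registered,
/// so the deadline lands in the future and nothing fires.
func runAdvanceFirst() -> [String] {
    let clock = ManualClock()
    var log: [String] = []
    clock.advance(by: 5)
    clock.schedule(after: 5) { log.append("fired") }
    return log
}

/// Deterministic order: register first, then advance.
func runRegisterFirst() -> [String] {
    let clock = ManualClock()
    var log: [String] = []
    clock.schedule(after: 5) { log.append("fired") }
    clock.advance(by: 5)
    return log
}
```

No executor job count can see the difference between the two orders, which is consistent with the commit's point that this class has to be fixed in the test rather than in the framework.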
…aitUntil tail)

Real-runner CI showed the drain=1 PARALLEL promotion was too aggressive: on the small GitHub runners the drain=1 parallel jobs failed intermittently (macOS flipped to pass on re-run; Linux is the consistent offender) on two waitUntil-based known-flaky tests: testObservedStreamWithModelAccessingObservable (unsupported @observable interop) and testSharedDependency (deinit-chain poll). Not the global gate: both are waitUntil tests, the gate lives in the expect driver, and the drain=1 SERIAL jobs passed cleanly both runs.

Root cause is a pre-existing interaction: under the drive waitUntil extends its deadline to the 120s watchdog, which exceeds the 90s trait cap (30s x scale=3), so a never-true waitUntil on a saturated 2-vCPU container hits the trait cap (un-absorbable by withKnownIssue) before its own catchable timeout.

Split required-vs-informational by MODE: serial rows (both drains) are the required deterministic gates (drive-serial primary, wall-clock-serial vanilla regression gate); parallel rows (both drains) informational (wall-clock starves; drive has the waitUntil tail). Core goal preserved: drive is now a required gate (serial) and the default. Design note Update 22 records the full analysis + the future waitUntil-cap fix to eventually promote parallel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(absorbable)

Fixes the Update 22 latent bug: under the executor drive, waitUntil extended its deadline to a fixed 120s watchdog, which EXCEEDS the per-test trait cap (testWallClockSeconds = 30s x scale = 90s on CI). So a never-true waitUntil on a saturated container hit the trait cap first, throwing _TestTimeoutError OUTSIDE the test body, un-absorbable by withKnownIssue (the testObservedStreamWithModelAccessingObservable hard failure).

Now, under the drive, waitUntil fails when this test's executor has been INACTIVE for 60% of the trait window (18s x scale), using the same exec.activityNs signal the trait cap uses: load-tolerant (a progressing model resets it) but guaranteed to fire before the trait cap, via the catchable WaitUntilTimeoutError that withKnownIssue can absorb. A 120s absolute watchdog still backstops the continuous-activity-but-never-true edge. Flag-off path unchanged (no probe → the scaled absolute budget governs, exactly as before).

Validation: wait-infra meta-tests 20/20, full suite 10/10 at scale=1, no regression. Design note Update 23. Parallel stays informational: testSharedDependency (a real deinit-chain assertion, not withKnownIssue) remains a genuine container load-flake this fix improves but doesn't eliminate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
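The arithmetic behind "guaranteed to fire before the trait cap" can be written out. The `WaitBudgets` type and its property names are assumptions for this sketch; only the constants (30s trait cap, 18s inactivity window, scale=3 on CI) come from the commit message.

```swift
/// Hypothetical summary of the two budgets, in whole seconds.
struct WaitBudgets {
    let scale: Int

    /// Per-test trait cap: 30s x scale.
    var traitCapSeconds: Int { 30 * scale }

    /// waitUntil inactivity window under the drive: 60% of the trait
    /// window, i.e. 18s x scale.
    var waitUntilInactivitySeconds: Int { 18 * scale }

    /// The inactivity window is shorter than the trait cap for every
    /// positive scale, because 18 < 30.
    var inactivityFiresFirst: Bool {
        waitUntilInactivitySeconds < traitCapSeconds
    }
}
```

At the CI setting of scale=3 this gives a 54s inactivity window against a 90s trait cap, whereas the earlier fixed 120s deadline sat above the cap.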
…nditional default
The executor-drive is validated (80+ local --parallel iterations, multiple green
CI runs, ~2.6x faster serial, flake population gone), so the opt-out flag and the
drain=0 comparison machinery no longer earn their keep in the tree.
Key distinction: "what the library ships to" (iOS 14+) is not "what host runs the
tests". `.modelTesting` runs on the dev/CI machine, which on any modern toolchain
(macOS 15+ / Linux-Swift-6) can run the drive — so the drive is simply on there.
- _makeTestExecutorBox returns the executor whenever #available (macOS 15+/iOS
18+), with no env-var check. The wall-clock path survives ONLY as the automatic
fallback for hosts that can't run the drive (pre-macOS-15 / pre-iOS-18 hosts and
simulators, older Swift, WASM — no Dispatch), selected by availability, not a
toggle.
- CI matrix drops the `drain` dimension: each OS runs just {parallel, serial} on
the drive. Serial required; parallel informational (the waitUntil tail).
- ExecutorDrainSettleTests gates on #available instead of the env var (so it now
actually runs on supported hosts rather than no-opping).
- CHANGELOG, CLAUDE.md, design-note Update 24 updated. Before/after comparison
preserved in git history (one `git show` away).
Validation: full suite 5/5 --parallel + serial (770 tests) clean; drain-settle
regression test now runs and passes 3/3; build warning-free.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fore teardown

The test shares one Dependency across two children, each observing its value via forEach(Observed). `sharedDep.value -= 2` (→5) delivers "(->5)" to EACH child reactively, while both are alive. The old test waited for only ONE "(->5)" then removed the children, letting the SECOND arrive entangled with child teardown: "a deinit chain whose timing is at the mercy of when the last strong reference is released." That teardown-timed delivery is the load-sensitive parallel flake (~2/1000; once exceeded even the 120s drive backstop on a saturated container).

Fix: assert both children see 5 and wait for "(->5)(->5)" UP FRONT, while both children exist, so both observations are reactive, executor-driven, and deterministic, not dependent on teardown ordering. The trailing wait then only needs the "d" destruct log (imminent once lifetime == .destructed). No large timeout needed anymore.

(Investigated a settle()-based approach first; it made things worse (1/30) because that second log entry can land AFTER executor quiescence via the ARC/cancellation-flush path, which settle doesn't drive. Reverted; the reorder is the real fix: make the delivery reactive instead of teardown-timed.)

Validation: testSharedDependency 60/60 isolated, full suite 10/10 --parallel, all clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With testObservedStream now absorbed by withKnownIssue (waitUntil sub-trait-cap fix, Update 23) and testSharedDependency made deterministic (reactive shared-dep observations awaited up front, Update 25), the parallel `waitUntil` tail that kept `--parallel` informational is gone. Verified green across 4 consecutive CI runs (push + 3 re-runs; parallel jobs green on macOS and Linux every time).

Drop `continue-on-error` from both jobs: serial AND parallel are now required, merge-blocking gates on macOS and Linux. Completes the arc: the executor-drive is the unconditional default test-wait mechanism and the full CI matrix gates on it. CLAUDE.md + design-note Update 25 updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Executor-drain test determinism: `.modelTesting` wait verbs resolve on a load-independent fixpoint (now the default)

Combines the `settle`-drive (#23) and the `expect`/`waitUntil` drive-primary work into one PR, flips the executor-drive on by default, and promotes it to the required CI gate.

What changed

- `.modelTesting`'s wait verbs (`settle`/`expect`/`waitUntil`) now resolve on a load-independent executor-drain fixpoint instead of a starvable wall-clock budget. Model tasks run on a per-test `TaskExecutor`; a wait resolves when that executor (plus the per-test background + main-registrar observation queues) is quiescent: a non-starvable signal that doesn't depend on the cooperative pool getting CPU under parallel/CI load.
- Cures the false `settle() timed out: model still has active tasks` failures (empty active-task list) and the broad `--parallel` flake population (on a contended dev machine the per-100-run distinct-flaky-test count dropped ~63 → a small clock-bound remnant, now resolved).
- Opt out (legacy wall-clock path) with `SWIFT_MODEL_EXPERIMENTAL_DRAIN=0`.

The clock-test tail: resolved by splitting it into its two real classes

The premature-fixpoint clock tests are fixed at the correct layer for each:

1. Work routed through the executor (`testImmediateClock`, `childTasksCompleteBeforeTeardown`): countable, a genuine framework gap. A global-quiescence fail-gate fails a still-unmet `expect` only when the whole process is executor-quiescent (not merely when one test looks idle), so a child parked mid-`clock.sleep` while the run is busy no longer trips a false fixpoint. The process-wide counter is a lock-free Swift 6 `Atomic` (negligible hot-path cost).
2. Work parked off the executor (`testClockStepByStep`, `testOnChangeCancelPreviousDiscardsStalework`): a `TestClock` advance-vs-subscribe registration race, invisible to any executor accounting. Fixed test-side with the documented `settle()`-before-`advance` pattern, which parks the timer (registering its deadline) first, replacing the `Task.yield()` ordering gamble.

Discriminator for future flakes: is the pending work routed through our executor? Yes → framework owns it (gate); parked on an external clock → test owns the ordering (`settle`).

CI

- `drain=1` (drive, default) is now the required gate, serial + parallel, macOS + Linux.
- `drain=0 serial` (vanilla wall-clock) stays required: the deterministic regression gate that caught the OR-path race (497c2ab).
- `drain=0 parallel` is informational: the documented-flaky legacy wall-clock path.

Validation

- 80 full-suite `--parallel` iterations at `scale=1` (the harshest local condition), 0 failures.
- Flag-off inert at `scale=3`.

See `Docs/test-determinism-executor-drain.md` (Updates 1–21) for the full design history.

🤖 Generated with Claude Code