Executor-drain test determinism: settle + expect/waitUntil resolve on a load-independent fixpoint (default-on, required gate)#24

Merged: mansbernhardt merged 20 commits into main from claude/expect-drain on Jun 18, 2026

@mansbernhardt mansbernhardt commented Jun 17, 2026

Executor-drain test determinism — .modelTesting wait verbs resolve on a load-independent fixpoint (now the default)

Combines the settle-drive (#23) and the expect/waitUntil drive-primary work into one PR, flips the executor-drive on by default, and promotes it to the required CI gate.

What changed

.modelTesting's wait verbs (settle / expect / waitUntil) now resolve on a load-independent executor-drain fixpoint instead of a starvable wall-clock budget. Model tasks run on a per-test TaskExecutor; a wait resolves when that executor (plus the per-test background + main-registrar observation queues) is quiescent — a non-starvable signal that doesn't depend on the cooperative pool getting CPU under parallel/CI load.

  • Cures the false `settle() timed out: model still has active tasks` failures (reported with an empty active-task list) and the broad --parallel flake population: on a contended dev machine, the number of distinct flaky tests per 100 runs dropped from ~63 to a small clock-bound remnant, which is now resolved.
  • Faster: the serial suite runs ~2.6× faster (fixpoint resolution instead of waiting out debounce windows).
  • Default on for macOS 15+ / iOS 18+ / Linux-Swift-6 (custom task executors need the Swift 6 runtime). Older OSes and WASM keep the wall-clock path. Opt out with SWIFT_MODEL_EXPERIMENTAL_DRAIN=0.

The clock-test tail — resolved by splitting it into its two real classes

The premature-fixpoint clock tests are fixed at the correct layer for each:

  1. Work routed through the executor (testImmediateClock, childTasksCompleteBeforeTeardown) — countable, a genuine framework gap. A global-quiescence fail-gate fails a still-unmet expect only when the whole process is executor-quiescent (not merely when one test looks idle), so a child parked mid-clock.sleep while the run is busy no longer trips a false fixpoint. The process-wide counter is a lock-free Swift 6 Atomic (negligible hot-path cost).
  2. Work parked off the executor (testClockStepByStep, testOnChangeCancelPreviousDiscardsStalework) — a TestClock advance-vs-subscribe registration race, invisible to any executor accounting. Fixed test-side with the documented settle()-before-advance pattern, which parks the timer (registering its deadline) first — replacing the Task.yield() ordering gamble.

Discriminator for future flakes: is the pending work routed through our executor? Yes → framework owns it (gate); parked on an external clock → test owns the ordering (settle).
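
The class-2 registration race can be illustrated with a toy manual clock. This is a minimal sketch, not the library's `TestClock`: `ToyClock`, `sleep(for:onWake:)`, and `advance(by:)` are hypothetical names, and the sketch assumes a sleeper's deadline is computed from the clock's current time at the moment it registers.

```swift
/// Hypothetical manual clock: time moves only when `advance(by:)` is called.
final class ToyClock {
    private(set) var now = 0
    private var sleepers: [(deadline: Int, onWake: () -> Void)] = []

    /// Registers a sleeper; its deadline is relative to the time at registration.
    func sleep(for duration: Int, onWake: @escaping () -> Void) {
        sleepers.append((deadline: now + duration, onWake: onWake))
    }

    /// Moves time forward and wakes every sleeper whose deadline has passed.
    func advance(by duration: Int) {
        now += duration
        let due = sleepers.filter { $0.deadline <= now }
        sleepers.removeAll { $0.deadline <= now }
        for sleeper in due { sleeper.onWake() }
    }
}

// Subscribe first, then advance: the sleeper is woken.
let ordered = ToyClock()
var orderedWoke = false
ordered.sleep(for: 1) { orderedWoke = true }
ordered.advance(by: 1)

// Advance first, then subscribe: the deadline lands at t = 2,
// so the sleeper stays parked and no executor job is ever pending.
let raced = ToyClock()
var racedWoke = false
raced.advance(by: 1)
raced.sleep(for: 1) { racedWoke = true }
```

In the second ordering nothing is queued on any executor, which is why the fix has to live in the test: calling `settle()` before `advance` lets the timer register its deadline first.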

CI

  • drain=1 (drive, default) is now the required gate, serial + parallel, macOS + Linux.
  • drain=0 serial (vanilla wall-clock) stays required — the deterministic regression gate that caught the OR-path race (497c2ab).
  • drain=0 parallel is informational — the documented-flaky legacy wall-clock path.

Validation

  • 80 full-suite --parallel iterations at scale=1 (the harshest local condition), 0 failures.
  • Serial 3/3 clean; flag-off inert at CI's scale=3.
  • Targeted stress: 40× OnChange, 30× Clock, 25× clock+child — all clean.
  • Real-consumer (parallel-apple): cured the "CI runs forever" disease (833→962 passing, 13→1 failing suites; the lone residual was a missing-clock-override test bug, fixed there).

See Docs/test-determinism-executor-drain.md (Updates 1–21) for the full design history.

🤖 Generated with Claude Code

mansbernhardt and others added 9 commits June 16, 2026 17:25
`settle()` (and the settle phase of `expect { … }`) resolves on the model's
executor-drain FIXPOINT instead of a `.deferential`/`.background`-QoS quiet-check.
Under heavy parallel load macOS starves `.background` indefinitely, so the
quiet-check never fired and settle reported a false `settle() timed out: model
still has active tasks` (empty task list) at ANY budget — the years-old flake the
serial-CI fallback and SWIFT_MODEL_TIMEOUT_SCALE were working around. The drain
signal is non-starvable (a job-count + GTS, never `.background`) and
dependency-free, so settle waits as long as necessary under load and resolves the
instant the model is genuinely quiescent.

How it works:
- `_DrainTestExecutor`: one shared concurrent GCD queue backs every per-test
  executor (avoids per-test thread-pool explosion); each keeps its own
  outstanding-job count + event-driven `waitUntilIdleOrDeadline`.
- Model task bodies (`node.task`/`forEach`) adopt it via `executorPreference`
  under `.modelTesting`; the trait installs a per-test executor box.
- `_driveToStableFixpoint`: quiescent = executor idle + per-test bg-idle + no
  pending-start task, persisted for a short NON-STARVABLE grace that debounces
  against ALL activity (every `_noteActivity` + executor enqueue) so a clock-
  parked task's resume resets it. `mainCall` excluded (process-global).
- `waitUntilSettled` resolves on the fixpoint; a generous watchdog only catches
  a true deadlock.

OPT-IN via SWIFT_MODEL_EXPERIMENTAL_DRAIN=1 — inert by default (executor box is
nil → every wait keeps its current path), so the suite is unchanged unless
enabled. Validated: flag ON, settle suites green incl. a 60-iteration
load-stressed child-task settle test; flag OFF, broad regression green
(unchanged). Custom task executors need Swift 6 runtime (macOS 15+); older
OS/WASM stay on the existing path.

`expect`/`waitUntil` drive-primary migration is deliberately NOT in this PR (the
fixpoint-as-fail judgment has open scaling/race work) — follow-up. Full design
arc in docs/test-determinism-executor-drain.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under the executor drive (flag on, macOS 15+), the per-test wall-clock cap
becomes an inactivity watchdog keyed on the test's own _DrainTestExecutor
activity instead of an absolute 30s deadline. Under full-parallel a healthy
test's jobs queue behind hundreds on the shared drain queue, so wall-clock-to-
drain legitimately exceeds 30s though the test is still progressing; the
watchdog resets on every executor enqueue/completion and trips only after a
full window of genuine inactivity (a real stall). Per-test isolated, so a
single hung test is still caught while the rest of the suite is busy.

Flag off => activityProbe is nil => original absolute cap, unchanged (inert).

_birthNs floors activityNs so a test that hasn't yet enqueued work reads
'active as of now' rather than the epoch (else target = 0 + window is already
past and the watchdog trips instantly).
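
The flooring step can be sketched as follows. `InactivityWatchdog` and its member names are hypothetical stand-ins for illustration; only the `max(activity, birth) + window` arithmetic is taken from the description above.

```swift
/// Hypothetical model of the per-test inactivity watchdog.
struct InactivityWatchdog {
    /// When the test started, in nanoseconds.
    let birthNs: UInt64
    /// How long the executor must be inactive before the watchdog trips.
    let windowNs: UInt64
    /// Timestamp of the last executor enqueue/completion; 0 means "none yet".
    var activityNs: UInt64 = 0

    /// Last activity, floored at the birth time so an idle-so-far test
    /// reads as "active as of birth" rather than "active at the epoch".
    var effectiveActivityNs: UInt64 { max(activityNs, birthNs) }

    func hasTripped(nowNs: UInt64) -> Bool {
        nowNs >= effectiveActivityNs + windowNs
    }
}
```

Without the floor, a test born at t = 100s with a 30s window would compute a target of 0 + 30s and trip on its first check.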

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tail collapses into the item-2 fixpoint-fail race

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oint + Update 12

The off-main main-registrar observation path (withObservationTracking/Observed/
onChange) delivers via context.mainCallQueue, which is PER-CONTEXT (per-test),
not the process-global `mainCall`. The drive can therefore safely wait on it
(watchdog-bounded, no inter-test hang). Fixpoint now unions executor + bg +
per-test main-observation idle.

Finding (Update 12): this is correct coverage (doc risk #1) but does NOT close
the residual. Full-parallel shows ~15 flaky failures with and without it, and
events don't use the registrar at all. The residual is a single fine-grained
fixpoint-SAMPLING race (Update 8's theorem: no finite grace is safe under
unbounded load with a dependency-free signal), which (B) cannot beat by covering
more queues. Recommend pivoting to (A): expect never self-fails at a fixpoint;
a still-unmet predicate fails only at the per-test inactivity watchdog.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ike/diag investigation tests; Update 13

expect never self-fails at a single fixpoint sample: it resolves PASS reactively
the instant the predicate is true (unchanged), and FAILs only after a sustained
window of genuine inactivity (_expectGraceNs = 2s × timeoutScale), which any
activity/enqueue/queue-drain resets — so under load the fail defers until truly
quiet (no false fail). settle keeps its 30ms grace.
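
The pass/fail rule described here can be summarised as a small decision function. This is a sketch under assumptions: `ExpectVerdict`, `expectVerdict`, and the integer `timeoutScale` are hypothetical; only the 2s × scale window comes from the text above.

```swift
enum ExpectVerdict { case pass, fail, keepWaiting }

/// Inactivity window before an unmet expect may fail: 2s × timeoutScale.
func expectGraceNs(timeoutScale: UInt64) -> UInt64 {
    2_000_000_000 * timeoutScale
}

/// `quietForNs` is the time since the last activity, enqueue, or queue drain.
func expectVerdict(predicateMet: Bool, quietForNs: UInt64, timeoutScale: UInt64) -> ExpectVerdict {
    // A true predicate passes immediately, regardless of activity.
    if predicateMet { return .pass }
    // An unmet predicate fails only after a full window of genuine quiet.
    return quietForNs >= expectGraceNs(timeoutScale: timeoutScale) ? .fail : .keepWaiting
}
```

Because any activity resets `quietForNs`, a loaded run keeps returning `.keepWaiting` instead of failing at a single quiescent sample.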

Removed SpikeDrainExecutorTests/DiagExecutorClockTests (investigation records;
some encode the now-abandoned fast-fail semantics, e.g. brokenExpectFailsFastAtFixpoint).

Finding (Update 13): A reduces the targeted deferred-resume flakes
(testChildEvents 1/3 -> 0/8 focused) but the full-parallel aggregate is ~neutral
- dominated by a broad observation/transition/event propagation race that
neither A (grace length) nor B (mainCall coverage) cracks. Next step is tracing,
not further blind tuning. Flag remains OFF by default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…een; drive 2.6x faster serially + halves parallel flake; ~5-6 flaky serial residual is the bar to flip default)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fast-fail (not the delayed-resume race); revert expect window to 2s

Decisive experiment: with the expect inactivity-fail window set to 30s, the
Observed/transition residual (testChangeOf, testRecursive*, testCapturedObserved*,
testTaskId*) STILL fails at 0.05-0.1s — so it does not go through the drive's
inactivity-fail path and (A) cannot address it. The failure is an accumulated
update sequence missing its last element (counts==[5] vs [5,8]). New hypothesis:
the executor changes the interleaving between the test thread's mutations and the
Observed consumer's appends. (A) kept at 2s (helps the event sub-class +
chosen semantics) but is not the lever for this dominant sub-class. Next: trace
one testChangeOf failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rive residual); remove diagnostic traces; Update 16

Root cause: expect()'s per-call executor driver runs _resolveUnmetPredicatesAtFixpoint
on every loop iteration, including when _driveToStableFixpoint returns false due to
the Task being CANCELLED (which expect() does as soon as its awaitPredicate resolves).
Because _pendingExpects is shared across a test's sequential expects, a just-cancelled
driver from expect A would fail expect B's freshly-registered predicate before B's
satisfying write/event arrived — a fast, window-independent false fail leaving an
accumulated Observed/onChange/event sequence missing its last element.

Fix: the driver bails on Task.isCancelled before touching the shared pending entries.
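
A reduced model of the fix, with hypothetical names (`PendingExpects`, `driverIteration`). The real driver reads `Task.isCancelled`; here cancellation is passed in as a parameter so the sketch stays synchronous.

```swift
/// Hypothetical stand-in for the registry shared by a test's sequential expects.
final class PendingExpects {
    var unmet: [String] = []
    private(set) var failed: [String] = []

    /// Marks every still-unmet predicate as failed.
    func failAllUnmet() {
        failed.append(contentsOf: unmet)
        unmet.removeAll()
    }
}

/// One iteration of a per-expect driver after it has stopped driving.
func driverIteration(isCancelled: Bool, pending: PendingExpects) {
    // A cancelled driver belongs to an expect that already resolved, so it
    // must not judge entries registered by a later expect.
    if isCancelled { return }
    pending.failAllUnmet()
}
```

With the early return, expect A's cancelled driver leaves expect B's freshly registered predicate alone; only a live driver may fail it.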

Results (same machine): flag-on SERIAL (CI gate) goes from ~5-6 flaky/run to GREEN
(2/2, 42s, 2.6x faster than flag-off's 112s); flag-on --parallel ~12-18 -> 1-3; the
OnChange+UpdateStream combo 0/6 (was ~1/run). The whole observation/transition/event
residual class is gone. Diagnostic obs/drive traces removed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mansbernhardt mansbernhardt changed the title from "Make expect/waitUntil drive-primary under the executor (opt-in); fix the cancelled-driver residual" to "Executor-drain test determinism: settle + expect/waitUntil drive-primary (opt-in)" Jun 17, 2026
@mansbernhardt mansbernhardt changed the base branch from claude/settle-drain to main June 17, 2026 09:40
mansbernhardt and others added 6 commits June 17, 2026 13:40
… Linux

Adds a drain={0,1} dimension to the macOS and Linux test matrices. drain=1 runs
the suite with SWIFT_MODEL_EXPERIMENTAL_DRAIN=1 (the executor-drive active),
serial + parallel, and is continue-on-error (informational) while the drive is
opt-in. This is the production-readiness gate the design note calls for: it
exercises the drive's load-tolerance on the real small/contended runners — the
one signal a dev machine can't provide — without blocking merges on the
experimental path. drain=0 rows are the existing required gates, unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…llel tail; Update 17

(b) tail hardening:
- ReactiveWaitInfrastructureTests.awaitQuietWindow_firesAfterQuietWindow: drop
  the wall-clock UPPER-bound assertion (< 4.5s × scale). It asserted timing on a
  .deferential/.background callback macOS starves under --parallel, so it flaked
  on BOTH flag states; lower-bound + .timeout checks keep the real coverage.
- Document testClockStepByStep, testOnChangeCancelPreviousDiscardsStalework,
  testRaceVariant in CLAUDE.md's known load-sensitive list. They flake flag-off
  too; the executor-drive REDUCES the dev-machine --parallel flake population
  ~5-10x (flag-off 11-23/run vs flag-on 1-4/run on this machine) — they're a
  small remnant of a shared tail, not drive regressions.

Update 17 records the same-machine flag-off-vs-on --parallel comparison and the
CI flag-on gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…+Linux, serial+parallel)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (63->6 tests) but regresses clock-parked tests at scale=1 (premature-fixpoint hard core)
The wait verbs (settle/expect/waitUntil) now resolve on the load-independent
executor-drain fixpoint by default on macOS 15+/iOS 18+/Linux-Swift-6, instead of
the starvable wall-clock quiet-window. _makeTestExecutorBox returns the executor
unless SWIFT_MODEL_EXPERIMENTAL_DRAIN=0 (opt-out → legacy wall-clock path).

Why: cures the false settle()-timeouts under .background starvation; cuts the
dev-machine --parallel flaky-test population ~10x (63->~6 per 100 runs); ~2.6x
faster serial. Validated against a real consumer (parallel-apple): cures its
'CI runs forever' disease (833->962 passing, 13->1 failing suites).

Known limitation (documented, Update 20 + CHANGELOG): a few clock-driven tests
(value produced after await clock.sleep) can be MORE flaky than the wall-clock
path under --parallel at scale=1 — a task parked mid-sleep is not a ready
executor job, so the drive can read 'quiescent' before it resumes (premature
fixpoint). No clean dependency-free fix: a hard transient-task gate hangs on
intentional long-lived node.tasks (Update 7's edge, confirmed); a bounded-wait
neither fixes the clock tests nor avoids slowdown. Run those serially or bump
TIMEOUT_SCALE (CI uses 3).

CI: drain=0 (wall-clock opt-out) stays the REQUIRED deterministic gate; drain=1
(the drive, = default) runs informationally until the clock tail is gate-able.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ed CI gate

The premature-fixpoint clock tail (Update 20's "unfixable" four tests) is
resolved by splitting it into its two real classes and fixing each at the
correct layer:

1. Work routed THROUGH the executor (testImmediateClock,
   childTasksCompleteBeforeTeardown) — countable, a genuine framework gap.
   Kept the global-quiescence fail-gate (an unmet `expect` fails only when the
   WHOLE process is executor-quiescent, not just when one test looks idle, so a
   child parked mid-`clock.sleep` while the run is busy no longer trips a false
   fixpoint). Converted its hot-path lock to a lock-free Swift 6 `Atomic` so the
   per-enqueue/-completion cost is negligible.

2. Work parked OFF the executor (testClockStepByStep,
   testOnChangeCancelPreviousDiscardsStalework) — a TestClock advance-vs-subscribe
   registration race, invisible to any executor accounting. Fixed test-side with
   the documented `settle()`-before-`advance` pattern (parks the timer, registering
   its deadline, first), replacing the `Task.yield()` ordering gamble.

Discriminator for future flakes: is the pending work routed through our executor?
Yes -> framework owns it (gate); parked on an external clock -> test owns the
ordering (settle). Not a license to blame scheduling for everything — class (1)
was a real gap the gate closes, making async child-task completion testable
without a manual settle.
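
A minimal sketch of a process-wide gate built on the Swift 6 `Atomic` type from the `Synchronization` module. `GlobalQuiescenceGate` and its members are hypothetical names, and the sketch uses sequentially consistent ordering for simplicity; the orderings in the actual implementation are not stated here.

```swift
import Synchronization

/// Hypothetical process-wide counter of executor jobs across all tests.
@available(macOS 15.0, iOS 18.0, *)
final class GlobalQuiescenceGate: Sendable {
    private let outstandingJobs = Atomic<Int>(0)

    /// Called on every executor enqueue.
    func jobEnqueued() {
        outstandingJobs.add(1, ordering: .sequentiallyConsistent)
    }

    /// Called on every executor job completion.
    func jobCompleted() {
        outstandingJobs.subtract(1, ordering: .sequentiallyConsistent)
    }

    var isProcessQuiescent: Bool {
        outstandingJobs.load(ordering: .sequentiallyConsistent) == 0
    }

    /// An unmet expect may fail only when the test looks idle AND no executor
    /// job is outstanding anywhere in the process.
    func mayFailUnmetExpect(testLooksIdle: Bool) -> Bool {
        testLooksIdle && isProcessQuiescent
    }
}
```

A test that looks idle while another test's job is still outstanding keeps waiting, which is the case the gate exists for.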

CI: promote drain=1 (drive, default) to the REQUIRED gate on all rows; keep the
vanilla serial wall-clock row required (deterministic regression gate, caught the
OR-path race); demote only the parallel wall-clock row to informational (the
documented-flaky legacy path).

Validation: 80 full-suite --parallel iterations at scale=1 clean (30+50), serial
3/3 clean, flag-off inert at scale=3; targeted 40x OnChange / 30x Clock / 25x
clock+child clean. Docs (CLAUDE.md, CHANGELOG, design-note Update 21) updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mansbernhardt mansbernhardt changed the title from "Executor-drain test determinism: settle + expect/waitUntil drive-primary (opt-in)" to "Executor-drain test determinism: settle + expect/waitUntil resolve on a load-independent fixpoint (default-on, required gate)" Jun 18, 2026
mansbernhardt and others added 5 commits June 18, 2026 17:01
…aitUntil tail)

Real-runner CI showed the drain=1 PARALLEL promotion was too aggressive: on the
small GitHub runners the drain=1 parallel jobs failed intermittently (macOS
flipped to pass on re-run; Linux is the consistent offender) on two
waitUntil-based known-flaky tests — testObservedStreamWithModelAccessingObservable
(unsupported `@Observable` interop) and testSharedDependency (deinit-chain poll).

Not the global gate: both are waitUntil tests, the gate lives in the expect
driver, and the drain=1 SERIAL jobs passed cleanly both runs. Root cause is a
pre-existing interaction — under the drive waitUntil extends its deadline to the
120s watchdog, which exceeds the 90s trait cap (30s x scale=3), so a never-true
waitUntil on a saturated 2-vCPU container hits the trait cap (un-absorbable by
withKnownIssue) before its own catchable timeout.

Split required-vs-informational by MODE: serial rows (both drains) are the
required deterministic gates (drive-serial primary, wall-clock-serial vanilla
regression gate); parallel rows (both drains) informational (wall-clock starves;
drive has the waitUntil tail). Core goal preserved: drive is now a required gate
(serial) and the default. Design note Update 22 records the full analysis +
the future waitUntil-cap fix to eventually promote parallel.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(absorbable)

Fixes the Update 22 latent bug: under the executor drive, waitUntil extended its
deadline to a fixed 120s watchdog, which EXCEEDS the per-test trait cap
(testWallClockSeconds = 30s x scale = 90s on CI). So a never-true waitUntil on a
saturated container hit the trait cap first, throwing _TestTimeoutError OUTSIDE
the test body — un-absorbable by withKnownIssue (the
testObservedStreamWithModelAccessingObservable hard failure).

Now, under the drive, waitUntil fails when this test's executor has been INACTIVE
for 60% of the trait window (18s x scale), using the same exec.activityNs signal
the trait cap uses: load-tolerant (a progressing model resets it) but guaranteed
to fire before the trait cap, via the catchable WaitUntilTimeoutError that
withKnownIssue can absorb. A 120s absolute watchdog still backstops the
continuous-activity-but-never-true edge. Flag-off path unchanged (no probe → the
scaled absolute budget governs, exactly as before).
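
The budget arithmetic can be checked with a short sketch. The constant and function names are hypothetical; the 30s base cap, the 18s inactivity base (60% of 30s), and the 120s backstop are the figures quoted above.

```swift
let baseTraitCapSeconds = 30          // per-test trait cap before scaling
let baseInactivitySeconds = 18        // 60% of the 30s trait cap
let driveBackstopSeconds = 120        // fixed absolute watchdog under the drive

/// Per-test trait cap at a given timeout scale.
func traitCapSeconds(scale: Int) -> Int { baseTraitCapSeconds * scale }

/// Inactivity window after which waitUntil throws its catchable timeout.
func waitUntilInactivitySeconds(scale: Int) -> Int { baseInactivitySeconds * scale }

// At CI's scale of 3 the old 120s deadline overshoots the 90s trait cap,
// while the 54s inactivity window fires well inside it.
let ciCap = traitCapSeconds(scale: 3)
let ciWindow = waitUntilInactivitySeconds(scale: 3)
let oldDeadlineOvershootsCap = driveBackstopSeconds > ciCap
let newWindowFitsCap = ciWindow < ciCap
```

Because both quantities scale by the same factor, the inactivity window stays below the trait cap at every scale.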

Validation: wait-infra meta-tests 20/20, full suite 10/10 at scale=1, no
regression. Design note Update 23. Parallel stays informational: testSharedDependency
(a real deinit-chain assertion, not withKnownIssue) remains a genuine container
load-flake this fix improves but doesn't eliminate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nditional default

The executor-drive is validated (80+ local --parallel iterations, multiple green
CI runs, ~2.6x faster serial, flake population gone), so the opt-out flag and the
drain=0 comparison machinery no longer earn their keep in the tree.

Key distinction: "what the library ships to" (iOS 14+) is not "what host runs the
tests". `.modelTesting` runs on the dev/CI machine, which on any modern toolchain
(macOS 15+ / Linux-Swift-6) can run the drive — so the drive is simply on there.

- _makeTestExecutorBox returns the executor whenever #available (macOS 15+/iOS
  18+), with no env-var check. The wall-clock path survives ONLY as the automatic
  fallback for hosts that can't run the drive (pre-macOS-15 / pre-iOS-18 hosts and
  simulators, older Swift, WASM — no Dispatch), selected by availability, not a
  toggle.
- CI matrix drops the `drain` dimension: each OS runs just {parallel, serial} on
  the drive. Serial required; parallel informational (the waitUntil tail).
- ExecutorDrainSettleTests gates on #available instead of the env var (so it now
  actually runs on supported hosts rather than no-opping).
- CHANGELOG, CLAUDE.md, design-note Update 24 updated. Before/after comparison
  preserved in git history (one `git show` away).

Validation: full suite 5/5 --parallel + serial (770 tests) clean; drain-settle
regression test now runs and passes 3/3; build warning-free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fore teardown

The test shares one Dependency across two children, each observing its value via
forEach(Observed). `sharedDep.value -= 2` (→5) delivers "(->5)" to EACH child
reactively, while both are alive. The old test waited for only ONE "(->5)" then
removed the children, letting the SECOND arrive entangled with child teardown —
"a deinit chain whose timing is at the mercy of when the last strong reference is
released." That teardown-timed delivery is the load-sensitive parallel flake
(~2/1000; once exceeded even the 120s drive backstop on a saturated container).

Fix: assert both children see 5 and wait for "(->5)(->5)" UP FRONT, while both
children exist — so both observations are reactive, executor-driven, and
deterministic, not dependent on teardown ordering. The trailing wait then only
needs the "d" destruct log (imminent once lifetime == .destructed). No large
timeout needed anymore.

(Investigated a settle()-based approach first; it made things worse — 1/30 —
because that second log entry can land AFTER executor quiescence via the
ARC/cancellation-flush path, which settle doesn't drive. Reverted; the reorder is
the real fix: make the delivery reactive instead of teardown-timed.)

Validation: testSharedDependency 60/60 isolated, full suite 10/10 --parallel, all
clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With testObservedStream now absorbed by withKnownIssue (waitUntil sub-trait-cap
fix, Update 23) and testSharedDependency made deterministic (reactive shared-dep
observations awaited up front, Update 25), the parallel `waitUntil` tail that kept
`--parallel` informational is gone. Verified green across 4 consecutive CI runs
(push + 3 re-runs; parallel jobs green on macOS and Linux every time).

Drop `continue-on-error` from both jobs: serial AND parallel are now required,
merge-blocking gates on macOS and Linux. Completes the arc — the executor-drive
is the unconditional default test-wait mechanism and the full CI matrix gates on
it. CLAUDE.md + design-note Update 25 updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mansbernhardt mansbernhardt merged commit 5a0c6f3 into main Jun 18, 2026
18 checks passed