fix(e2e): mobile + guardian iOS E2E green on dedicated xlarge runners; guardian auth read over sync atom; transient-failure hardening #302
Merged
…s fast instead of hanging the whole test
…nner + WASM-lock contention)
90s still times out (proven in CI run 28294249613): the guardian auth read is WASM-lock-starved by useSyncTrigger's 3s-cadence sync, not merely slow, so no fixed eval budget fixes it. Reverting to keep this PR to verified-working fixes.
…k WASM-lock starvation
…read (livelock fix)
…align signing-loop (root cause)
…rage parse) to avoid the OZ multisig load signing-loop
…ount, 60s eval budget) + widen iOS wallet-create timeouts
… so the lone getAccount isn't queued behind a slow sync
…red stash (no WASM call in the test eval path)
…un so a degraded CoreSimulator fails fast and the retry gets a fresh daemon
…+ --retries=2 for non-guardian mobile
…tashes before the auth step reads it (was racing fire-and-forget on iOS)
…re when the exact address key differs (stash was populated but keyed differently)
…_script callback arrives as boolean true, hangs every evalAsync
…min, test timeout 15->25min) instead of killing runs that would pass
… (shared macos-26 pool degraded for hours, _simPair setup couldn't finish)
Makes the mobile + blockchain E2E suites robust to two distinct transient CI failures on the macOS runners that were causing the mobile E2E jobs to fail on every run. Both are root-caused below with CI evidence; the third historical layer (the iOS build itself) was fixed separately in #299.
Why the mobile E2E jobs were always red (layered root cause)

1. (fixed in #299, already on main) The iOS app didn't build. Every mobile job runs test:e2e:mobile:build first; on the macos-26 runner (Xcode 26.5) two as? SecKey downcasts in HotKeyPlugin.swift were rejected as a hard error → BUILD FAILED (exit 65), no tests ran, no artifacts. That's why all four mobile jobs were red and why it "passed locally" (older local Xcode).

2. (this PR) create_wallets hung the full 15-minute timeout. Once #299 let the suite run, the devnet mobile job timed out in create_wallets (the wallet reached its home screen, but the readiness poll never returned). CdpBridge.eval / evaluate awaited the WebKit executeAtom call with no timeout (unlike evalAsync), and pollForCondition only checks its deadline between iterations, so one wedged eval (flaky RWI socket, or the WebView main thread briefly blocked by mobile main-thread WASM on the slower runner) hung the test until Playwright's global kill, and the rest of the serial suite skipped.
   - eval / evaluate race executeAtom against a 30s hard timeout. A wedge becomes a fast throw → pollForCondition enforces its own budget → --retries restarts on a fresh app + CDP.
   - create_wallets / wasTimeout: true / hung 903s → the test now passes create_wallets and runs ~9.4 min deeper into the flow. The hang is gone.

3. (this PR) The miden-client harness CLI failed the mint on a transient prover-connection error. With the hang gone, the devnet job then failed at mint_tokens_to_wallet_b: the CLI's delegated-prover TLS/gRPC handshake flaked (failed to connect to the remote prover → transport error → no native certs found), and only intermittently, since a sibling mint in the same test connected fine. The CLI deploy/mint/sync retry loop only classified node-RPC + nonce-lag errors as transient, so this connection error was treated as fatal and failed immediately.
   - … transient classifiers into one isTransientCliError helper.

Verification

- … playwright test --list loads both modified helpers).
- … src-scoped, so these playwright/ changes aren't lint-gated; style matches the surrounding files.
- E2E Blockchain (devnet) run is triggered on this branch to confirm the mint now retries through the transient prover error and the mobile-devnet job goes green. Result posted once it completes.

Not addressed (separate, and the gate already tolerates a single-network failure): testnet's slower commit cadence and its own persistent-state nonce conflicts.
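For reference, the layer-2 change (racing a bridge call against a hard timeout, then letting the poll loop enforce its own budget) can be sketched roughly as follows. This is a minimal illustration, not the actual CdpBridge code: withTimeout and pollUntil are hypothetical names, and the real helpers may differ in signature and defaults.

```typescript
// Hypothetical helper names; the PR's actual CdpBridge code is not shown here.

/** Reject if `work` has not settled within `ms` milliseconds. */
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, timeout]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

/** Poll `check` until it returns true or `budgetMs` elapses. Each call is bounded on its own. */
async function pollUntil(
  check: () => Promise<boolean>,
  budgetMs: number,
  perCallMs: number,
  intervalMs: number,
): Promise<boolean> {
  const deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    // Never let a single call outlive the overall budget.
    const remaining = Math.max(1, deadline - Date.now());
    try {
      if (await withTimeout(check(), Math.min(perCallMs, remaining), "check")) return true;
    } catch {
      // A wedged or failed call counts as "not ready yet"; the loop deadline still applies.
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

The point of bounding each call separately is that a loop which only checks its deadline between iterations cannot recover from one call that never returns; with the race in place, the worst case per iteration is perCallMs rather than the test's global timeout.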
Update: guardian gate (layer 4). The Mobile Guardian gate had a separate, recent regression (from the #227 guardian merge that's now main's tip): the iOS verify_guardian_auth_structure read deadlocked on WASM-lock starvation. useSyncTrigger's 3s in-process sync kept re-grabbing the single-threaded WASM lock faster than the read could progress. A 90s eval budget still timed out, proving contention, not slowness (that attempt was reverted).

Fixed by having __TEST_GUARDIAN_AUTH__ set a test-only __TEST_SYNC_PAUSED__ flag that suspends useSyncTrigger for the duration of the read (always cleared in finally), paired with a bounded 90s budget to wait out any single in-flight sync. The flag is gated on MIDEN_E2E_TEST and tree-shaken out of production. Added unit tests for the pause branches (extension + mobile + the production-ignores-flag case); full coverage stays ≥95%. Verifying on CI.

Final resolution (supersedes the layer-4 note above; the record is kept intact deliberately).

Guardian iOS auth read: the real root cause. The __TEST_SYNC_PAUSED__ / WASM-starvation theory above was wrong. The actual bug: CdpBridge.evalAsync (appium execute_async_script) is broken on the iOS RWI bridge. Its completion callback arrives in the arguments[arguments.length - 1] slot as the boolean true, not a function, so cb(result) throws TypeError, the promise rejects unhandled, the callback never fires, and every evalAsync hangs to its timeout regardless of how fast the script ran. (Signature in the CI timeline: Unhandled Promise Rejection: TypeError: d is not a function ... 'd' is true the instant the read ran, then a 60s hang, even with the stash already populated.) getGuardianAuthInfo was its only caller, so it had effectively never worked on iOS. Fix: read the auth structure (captured into a global by the wallet's own balance poll via a pure AccountInspector.fromAccount parse: no WASM, no signing, no client load) over the reliable synchronous eval atom, polled. Verified on macos-26-xlarge: guardian-devnet passes; the sync read returns {threshold, signerCommitments:[2 entries], procedureThresholds} instantly.

The mobile suite couldn't run at all: degraded shared macos-26 runners. Independently, the shared macos-26 runner pool degraded (noisy-neighbour IO): every simctl op crawled (97 CI samples: per-wallet _simPair setup p90 267s / max 401s vs. <5s healthy), so two-sim setup couldn't finish even in 15 min. Confirmed infra, not code: failure history shows mobile-devnet green at 06-27 19:41, then degraded from ~21:00 for 9h+, and the non-guardian mint test failed _simPair setup before any cap code existed. Fix: moved all four mobile E2E jobs to dedicated macos-26-xlarge runners (Apple Silicon, 2× vCPU/RAM, no noisy neighbours); setup is back to ~2-3 min and the full suite is green. Belt-and-suspenders for any residual slowness: _simPair cap 13 min, per-test timeout 25 min, --retries=2.

Result: Mobile E2E Gate + Mobile Guardian E2E Gate both green on macos-26-xlarge (devnet run 28316908352); both-network validation in progress. The xlarge runner is a 1-line change to revert to standard macos-26 once GitHub's shared pool recovers.
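To illustrate the shape of the guardian auth fix (polling a value the app has already stashed, over a synchronous eval, instead of waiting on an async-script callback), here is a rough sketch. Everything in it is an assumption for illustration: SyncEval, readStashedAuth and the __HYPOTHETICAL_AUTH_STASH__ global are made-up names, and the field types are guessed from the result shape quoted above.

```typescript
// All names below are hypothetical stand-ins, not the wallet's or CdpBridge's real API.

interface GuardianAuthInfo {
  threshold: number;
  signerCommitments: string[];
  procedureThresholds: unknown; // exact shape is not given in the PR text
}

/** A synchronous-style evaluator: runs one expression in the page and returns its value. */
type SyncEval = (expression: string) => Promise<unknown>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Poll a global that the app fills in on its own schedule.
 * No completion callback is involved, so a bridge that mangles the
 * async-script callback argument cannot hang this read.
 */
async function readStashedAuth(
  evalSync: SyncEval,
  budgetMs: number,
  intervalMs = 250,
): Promise<GuardianAuthInfo> {
  const deadline = Date.now() + budgetMs;
  while (Date.now() < deadline) {
    // Serialise in-page so only a plain string crosses the bridge.
    const raw = await evalSync(
      "JSON.stringify(globalThis.__HYPOTHETICAL_AUTH_STASH__ ?? null)",
    );
    if (typeof raw === "string") {
      const parsed = JSON.parse(raw) as GuardianAuthInfo | null;
      if (parsed !== null) return parsed;
    }
    await sleep(intervalMs);
  }
  throw new Error(`auth stash not populated within ${budgetMs}ms`);
}
```

In a real test the evalSync argument would be the bridge's synchronous eval, ideally itself bounded by a hard timeout so a wedged call cannot outlive the polling budget.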