ci: shard slow pytest into 8 parallel jobs #18
Merged
PaulHax added a commit that referenced this pull request on May 8, 2026
Three follow-ups on PR #18's first run, where shards 2/3 finished in 3min while 1/4 ran for an hour+:

- Cache .test_durations via actions/cache instead of committing it. Floats churn every run, so committing would thrash git history. Each shard restores the most recent cached durations (any group, any run) for balanced splitting, then saves its own updated copy under a unique key. Over runs, fresh entries from every group accumulate.
- Bump shards 4 → 8. Round-robin by test count (the fallback when no durations file exists) clusters heavy parametrized tests, so going wider than strictly needed buys headroom for the first run.
- Bump timeout 90 → 240min. 90 was sized for the steady-state balanced case; first-run imbalance needs more slack. Still under GitHub's hard 360min cancel.
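To illustrate why a count-based split can leave one shard running for over an hour while another finishes in minutes, here is a minimal sketch. It assumes the count-based fallback hands each shard a contiguous run of collected tests (an assumption; the commit describes the fallback as round-robin), and it uses a simple greedy balancer as a stand-in for duration-aware splitting. The test names and timings are made up, and neither function is pytest-split's actual implementation.

```python
import heapq


def split_by_count(durations, shards):
    """Equal-count contiguous chunks in collection order; durations are ignored."""
    names = list(durations)
    chunk = -(-len(names) // shards)  # ceiling division
    return [names[i * chunk:(i + 1) * chunk] for i in range(shards)]


def split_by_duration(durations, shards):
    """Greedy balancing: longest test first, always onto the least-loaded shard."""
    heap = [(0.0, i) for i in range(shards)]  # (current load in seconds, shard index)
    groups = [[] for _ in range(shards)]
    for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        groups[i].append(name)
        heapq.heappush(heap, (load + secs, i))
    return groups


def shard_loads(durations, groups):
    """Total seconds of test time assigned to each shard."""
    return [sum(durations[name] for name in group) for group in groups]


# Hypothetical suite: four heavy parametrized cases collected back to back,
# followed by twelve light tests.
durations = {f"test_heavy[{i}]": 600.0 for i in range(4)}
durations.update({f"test_light[{i}]": 5.0 for i in range(12)})

count_loads = shard_loads(durations, split_by_count(durations, 4))
duration_loads = shard_loads(durations, split_by_duration(durations, 4))
```

With these made-up numbers the count-based split puts all four heavy cases on the first shard, while the duration-aware split gives every shard the same load. Going from 4 to 8 shards does not remove that imbalance on a cold start; it only shrinks the worst chunk.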
32099d5 to 7b005d4
The 4h→6h→10h timeout bumps were chasing a non-existent ceiling: GitHub hard-cancels any single job at 360 min regardless of timeout-minutes, so timeout: 600 on main is a no-op. The slow gate needs to scale below that line, not above it.

Switch test-slow to an 8-way matrix using pytest-split. 545 slow items split 69×7 + 62; each shard runs well under an hour, wall-clock for the gate drops from ~3.5h to ~1h once durations are populated.

.test_durations is held in the GH Actions cache (actions/cache@v4), not git: committing it would thrash history every run since the floats churn. Each shard restores the most recent cached durations, runs with --store-durations, and saves its own updated copy under a unique key. Over runs, fresh entries from every group accumulate.

Sized for the cold-start case where no durations exist yet:

- 8 shards (vs strictly-needed 4) buys headroom against round-robin clustering of heavy parametrized tests
- timeout-minutes: 240 is a first-run safety net only; steady-state is well under an hour. Still under the 360 hard cap.

Adds pytest-split to the dev dep group; uv.lock regenerated.
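The shard sizes quoted above (69×7 + 62 for 8 shards, and 137/137/137/134 for the earlier 4-way split) can be reproduced with ceiling-sized chunks. The sketch below is only an arithmetic check; `chunk_sizes` is a hypothetical helper, and whether pytest-split computes its count-based fallback exactly this way is an assumption.

```python
def chunk_sizes(total, shards):
    """Sizes of `shards` chunks when each takes ceil(total / shards) items until none remain."""
    if total == 0:
        return [0] * shards
    chunk = -(-total // shards)       # ceiling division
    full, rest = divmod(total, chunk)
    sizes = [chunk] * full
    if rest:
        sizes.append(rest)
    sizes += [0] * (shards - len(sizes))  # pad if the items ran out early
    return sizes


eight_way = chunk_sizes(545, 8)
four_way = chunk_sizes(545, 4)
```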
The shard 8/8 hang was a JAX-array-as-int hot loop in _global_host_to_relative_slot / _abs_subnet_to_relative: pure-Python host-side helpers that were doing int(jax_array[i, j]) per element of small const lookup tables. Each access round-tripped through full JAX primitive dispatch (slice_p.bind → apply_primitive); on cold-cache CI runners the JAX persistent-cache lock contention from the also-running xdist workers turned that into an unbounded stall. Switched to one np.asarray + np.flatnonzero per call. Module already declared itself JAX-free at the top; this aligns with that intent.

Workflow:

- drop addopts via -o "addopts=" so -n auto doesn't stack on top of sharding (xdist + pytest-split contend on the JAX cache lock)
- fix the durations cache: previous strategy had each shard overwrite .test_durations with only its 1/N slice, so prefix-restore picked up partial files that never accumulated. Now each shard writes .test_durations_cache/group-N.json, and a merge step rebuilds the full .test_durations before pytest runs
- timeout 240 → 60 (hang root cause is fixed; no first-run cliff to budget for)

Verified locally: test_native_generic_exploit_respects_blocked_scan_source_route_matches_cyborg now finishes in 3m4s (was unbounded).
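The np.asarray + np.flatnonzero change described above can be sketched as follows. The bodies of the real helpers are not shown in this PR, so the table layout and both function names are hypothetical; the table here is a plain NumPy array so the example runs without JAX, whereas in the original code it would have been a JAX array.

```python
import numpy as np

# Hypothetical constant lookup table: one row per subnet, one column per host
# slot, with a nonzero entry marking membership.
TABLE = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
])


def slots_per_element(table, row):
    """Slow shape: one scalar conversion per element.

    If `table` were a JAX array, each `table[row, j]` would go through JAX's
    indexing dispatch before `int()` pulled the value back to the host.
    """
    return [j for j in range(table.shape[1]) if int(table[row, j]) != 0]


def slots_vectorized(table, row):
    """Fast shape: convert to a host array once, then a single NumPy call."""
    host = np.asarray(table)
    return np.flatnonzero(host[row]).tolist()


row0 = slots_vectorized(TABLE, 0)
```

Both functions return the same indices; the difference is that the second one touches the array twice per call instead of once per element.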
First run after the cache schema change has empty .test_durations_cache, so pytest-split round-robins by test count and heavy parametrized tests cluster into 1-2 shards that exceed 60 min serially. Subsequent runs (with partials cached) all fit under 30 min.
Per-shard cache keys could never warm up: actions/cache restores at most one cache per step, so each shard merged from a single shard's 2.4 KB partial instead of all eight. pytest-split fell back to round-robin and shard 2/8 hit the 120 min timeout while shard 3/8 finished in 59 s.

Replace per-shard caches with a single aggregate refreshed by a new aggregate-durations job that merges every shard's uploaded .test_durations (plus the prior aggregate so timed-out shards keep their last-known durations) and saves under slow-durations-aggregated-<run_id>. Each shard restores the most recent aggregate via the slow-durations-aggregated- prefix.

Bump test-slow timeout 120 -> 180 min for cold-cache headroom on the first run before the aggregate exists.
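The merge performed by the aggregate-durations job could look roughly like the sketch below. It assumes each .test_durations file is a JSON object mapping test IDs to seconds; `merge_durations` and `load_durations` are hypothetical names, and the real job's script is not shown in this PR.

```python
import json
from pathlib import Path


def load_durations(path):
    """Read one durations file; a missing file counts as empty."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text())


def merge_durations(prior, shard_results):
    """Start from the prior aggregate, then let each shard's fresh entries win.

    Tests owned by a shard that timed out (and so uploaded nothing) keep the
    duration recorded in the prior aggregate.
    """
    merged = dict(prior)
    for shard in shard_results:
        merged.update(shard)
    return merged


# Made-up example: shard 2 timed out, so only shards 1 and 3 report.
prior = {"test_a": 10.0, "test_b": 900.0, "test_c": 4.0}
shard_results = [{"test_a": 12.5}, {"test_c": 3.5, "test_d": 1.0}]
aggregate = merge_durations(prior, shard_results)
```

Keeping the stale entry for `test_b` matters: dropping it would make the next run treat that test as unknown, which is the condition that caused the imbalance in the first place.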
Two fixes for the slow-test sharding bootstrap:

1. upload-artifact@v4 excludes hidden files by default, so `.test_durations` was silently dropped. The aggregate-durations job consequently found 0 artifacts and never populated the cache. Add `include-hidden-files: true`.
2. Even with #1 fixed, the very first run on a new cache scope (this PR, main after merge, future PRs) still cold-starts with round-robin sharding and hits the per-shard timeout. Add a conditional "seed from URL" step that pulls a known-good `.test_durations` from `vars.SLOW_DURATIONS_SEED_URL` when the cache is empty. Subsequent runs hit the cache and skip the download, and the aggregate job keeps the cache refreshed.
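The conditional seeding in item 2 amounts to "download only when nothing was restored". The workflow step itself is not shown here, so the following is a sketch of the logic only: `seed_if_empty` is a hypothetical helper, the URL is a placeholder, and the fetch function is injectable so the example runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def seed_if_empty(path, seed_url, fetch=None):
    """Write seed durations to `path` only if it is missing or empty.

    Returns True when a seed was downloaded, False when the step was skipped.
    """
    if path.exists() and path.stat().st_size > 0:
        return False  # cache restore already supplied durations
    if not seed_url:
        return False  # no seed configured; fall back to a cold start
    if fetch is None:
        fetch = lambda url: urlopen(url, timeout=30).read()
    path.write_bytes(fetch(seed_url))
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".test_durations"
    fake_fetch = lambda url: b'{"tests/test_a.py::test_x": 1.5}'
    # First call simulates a cold cache; the second simulates a warm one.
    first = seed_if_empty(target, "https://example.invalid/seed.json", fake_fetch)
    second = seed_if_empty(target, "https://example.invalid/seed.json", fake_fetch)
```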
e024987 to 8c74c2e
Summary

- GitHub hard-cancels any single job at 360 min regardless of timeout-minutes, so the recent 4h→6h→10h bumps couldn't take effect past 6h.
- Split test-slow into a 4-way matrix with pytest-split. 545 slow items split 137/137/137/134; wall-clock drops ~3.5h → ~1h.
- timeout-minutes: 600 → 90.

Test plan

- pytest (slow) N/4 jobs pass in parallel
- pytest (fast) + ruff still pass
- […] .test_durations after a green run so splits balance by duration instead of round-robin