ci: shard slow pytest into 8 parallel jobs by PaulHax · Pull Request #18 · ITM-Kitware/jaxborg

PaulHax · 2026-05-08T18:52:43Z

Summary

GitHub-hosted runners hard-cancel jobs at 360 min regardless of timeout-minutes, so the recent 4h→6h→10h bumps couldn't take effect past 6h.
Shard test-slow into a 4-way matrix with pytest-split. 545 slow items split 137/137/137/134; wall-clock drops ~3.5h → ~1h.
timeout-minutes: 600 → 90.

Test plan

All 4 pytest (slow) N/4 jobs pass in parallel
pytest (fast) + ruff still pass
Follow-up: commit .test_durations after a green run so splits balance by duration instead of round-robin

Three follow-ups on PR #18's first run, where shards 2/3 finished in 3 min while 1/4 ran for an hour+:

- Cache .test_durations via actions/cache instead of committing it. Floats churn every run, so committing would thrash git history. Each shard restores the most recent cached durations (any group, any run) for balanced splitting, then saves its own updated copy under a unique key. Over runs, fresh entries from every group accumulate.
- Bump shards 4 → 8. Round-robin by test count (the fallback when no durations file exists) clusters heavy parametrized tests, so going wider than strictly needed buys headroom for the first run.
- Bump timeout 90 → 240 min. 90 was sized for the steady-state balanced case; first-run imbalance needs more slack. Still under GitHub's hard 360 min cancel.

The 4h→6h→10h timeout bumps were chasing a non-existent ceiling: GitHub hard-cancels any single job at 360 min regardless of timeout-minutes, so timeout: 600 on main is a no-op. The slow gate needs to scale below that line, not above it.

Switch test-slow to an 8-way matrix using pytest-split. 545 slow items split 69×7 + 62; each shard runs well under an hour, and wall-clock for the gate drops from ~3.5h to ~1h once durations are populated.

.test_durations is held in the GH Actions cache (actions/cache@v4), not git; committing it would thrash history every run since the floats churn. Each shard restores the most recent cached durations, runs with --store-durations, and saves its own updated copy under a unique key. Over runs, fresh entries from every group accumulate.

Sized for the cold-start case where no durations exist yet:

- 8 shards (vs strictly-needed 4) buys headroom against round-robin clustering of heavy parametrized tests
- timeout-minutes: 240 is a first-run safety net only; steady-state is well under an hour. Still under the 360 hard cap.

Adds pytest-split to the dev dep group; uv.lock regenerated.
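The shard sizes quoted for 545 items (137/137/137/134 for four shards, 69×7 + 62 for eight) can be reproduced by filling each shard up to ceil(545 / N) items in collection order. The sketch below only checks that arithmetic; `chunk_counts` is a hypothetical helper, and it is an assumption that the reported counts arise this way, not a description of pytest-split's implementation.

```python
import math


def chunk_counts(n_items: int, n_shards: int) -> list:
    """Fill each shard with up to ceil(n_items / n_shards) items, in order."""
    size = math.ceil(n_items / n_shards)
    counts = []
    remaining = n_items
    for _ in range(n_shards):
        take = min(size, remaining)  # the last shard gets whatever is left
        counts.append(take)
        remaining -= take
    return counts


print(chunk_counts(545, 4))  # [137, 137, 137, 134]
print(chunk_counts(545, 8))  # [69, 69, 69, 69, 69, 69, 69, 62]
```

Because the items are taken in order, neighbouring tests (for example, the cases of one parametrized test) tend to land in the same shard, which is consistent with the clustering described in this PR.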

The shard 8/8 hang was a JAX-array-as-int hot loop in _global_host_to_relative_slot / _abs_subnet_to_relative: pure-Python host-side helpers that were doing int(jax_array[i, j]) per element of small const lookup tables. Each access round-tripped through full JAX primitive dispatch (slice_p.bind → apply_primitive); on cold-cache CI runners the JAX persistent-cache lock contention from the also-running xdist workers turned that into an unbounded stall.

Switched to one np.asarray + np.flatnonzero per call. Module already declared itself JAX-free at the top; this aligns with that intent.

Workflow:

- drop addopts via -o "addopts=" so -n auto doesn't stack on top of sharding (xdist + pytest-split contend on the JAX cache lock)
- fix the durations cache: previous strategy had each shard overwrite .test_durations with only its 1/N slice, so prefix-restore picked up partial files that never accumulated. Now each shard writes .test_durations_cache/group-N.json, and a merge step rebuilds the full .test_durations before pytest runs
- timeout 240 → 60 (hang root cause is fixed; no first-run cliff to budget for)

Verified locally: test_native_generic_exploit_respects_blocked_scan_source_route_matches_cyborg now finishes in 3m4s (was unbounded).
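To illustrate the helper change described in this commit, here is a minimal sketch of the two lookup styles. The function names, the table layout, and the `-1` sentinel are hypothetical; the real signatures of `_global_host_to_relative_slot` and `_abs_subnet_to_relative` are not shown in this PR. The example uses a NumPy table so it runs without JAX; `np.asarray` also accepts a JAX array, in which case it performs a single device-to-host conversion.

```python
import numpy as np


def relative_slot_per_element(table, row, target):
    """Slow pattern: one int() conversion per element.

    With a JAX array, each table[row, j] is a separate indexing
    operation that goes through JAX dispatch.
    """
    for j in range(table.shape[1]):
        if int(table[row, j]) == target:
            return j
    return -1  # hypothetical "not found" sentinel


def relative_slot_vectorized(table, row, target):
    """Fast pattern: convert once, then search on the host with NumPy."""
    host_row = np.asarray(table)[row]          # single conversion per call
    hits = np.flatnonzero(host_row == target)  # indices where the row matches
    return int(hits[0]) if hits.size else -1


# Hypothetical lookup table: rows are subnets, values are global host ids.
table = np.array([[3, 7, 9], [4, 5, 6]])
print(relative_slot_vectorized(table, 0, 9))  # 2
```

Both functions return the same slot; the second does a fixed number of array operations per call regardless of table width.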

First run after the cache schema change has empty .test_durations_cache, so pytest-split round-robins by test count and heavy parametrized tests cluster into 1-2 shards that exceed 60 min serially. Subsequent runs (with partials cached) all fit under 30 min.
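The clustering effect can be shown with made-up numbers. The sketch below compares equal-count contiguous chunks with a greedy least-loaded assignment that knows the durations. The durations are hypothetical, and the greedy rule is a stand-in chosen for illustration, not a claim about which algorithm pytest-split uses.

```python
def contiguous_by_count(durations, n_shards):
    """Per-shard total time when tests are cut into equal-count chunks."""
    size = -(-len(durations) // n_shards)  # ceiling division
    return [sum(durations[i:i + size]) for i in range(0, len(durations), size)]


def greedy_by_duration(durations, n_shards):
    """Per-shard total time when each test goes to the least-loaded shard."""
    loads = [0.0] * n_shards
    for d in sorted(durations, reverse=True):  # place the longest tests first
        loads[loads.index(min(loads))] += d
    return loads


# Hypothetical suite: four adjacent heavy parametrized cases, then twelve fast tests.
durations = [600.0] * 4 + [5.0] * 12

print(contiguous_by_count(durations, 4))  # [2400.0, 20.0, 20.0, 20.0]
print(greedy_by_duration(durations, 4))   # [615.0, 615.0, 615.0, 615.0]
```

With no duration data the slowest shard carries all four heavy cases; with durations available the same work spreads evenly.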

Per-shard cache keys could never warm up: actions/cache restores at most one cache per step, so each shard merged from a single shard's 2.4 KB partial instead of all eight. pytest-split fell back to round-robin and shard 2/8 hit the 120 min timeout while shard 3/8 finished in 59 s.

Replace per-shard caches with a single aggregate refreshed by a new aggregate-durations job that merges every shard's uploaded .test_durations (plus the prior aggregate so timed-out shards keep their last-known durations) and saves under slow-durations-aggregated-<run_id>. Each shard restores the most recent aggregate via the slow-durations-aggregated- prefix.

Bump test-slow timeout 120 → 180 min for cold-cache headroom on the first run before the aggregate exists.
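The merge performed by the aggregate-durations job can be sketched as a dictionary union in which fresh shard results override the prior aggregate. This assumes each durations file is a JSON object mapping test ids to seconds; `merge_durations` and the sample test ids are hypothetical, and the actual job in this PR may be implemented differently.

```python
import json


def merge_durations(prior, partials):
    """Start from the prior aggregate, then let each shard's results override it.

    Tests from a shard that timed out (and uploaded nothing) keep their
    last-known durations from `prior`.
    """
    merged = dict(prior)
    for partial in partials:
        merged.update(partial)
    return merged


# Hypothetical data: shard 2 timed out, so nothing new arrives for test_b.
prior = {"tests/test_x.py::test_a": 12.0, "tests/test_x.py::test_b": 900.0}
shard_1 = {"tests/test_x.py::test_a": 11.5, "tests/test_y.py::test_c": 3.0}

aggregate = merge_durations(prior, [shard_1])
print(json.dumps(aggregate, indent=2, sort_keys=True))
```

Seeding the merge with the prior aggregate is what keeps a slow test's estimate alive across a run in which its shard never finished.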

Two fixes for the slow-test sharding bootstrap:

1. upload-artifact@v4 excludes hidden files by default, so `.test_durations` was silently dropped. The aggregate-durations job consequently found 0 artifacts and never populated the cache. Add `include-hidden-files: true`.
2. Even with #1 fixed, the very first run on a new cache scope (this PR, main after merge, future PRs) still cold-starts with round-robin sharding and hits the per-shard timeout. Add a conditional "seed from URL" step that pulls a known-good `.test_durations` from `vars.SLOW_DURATIONS_SEED_URL` when the cache is empty. Subsequent runs hit the cache and skip the download, and the aggregate job keeps the cache refreshed.
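The conditional seeding logic in the second fix might look like the following. In the PR this is a workflow step, so the Python below is only an illustration of the decision flow; `seed_durations`, `fetch_url`, and the returned status strings are hypothetical names, and the seed URL would come from `vars.SLOW_DURATIONS_SEED_URL`.

```python
import urllib.request
from pathlib import Path
from typing import Callable, Optional


def fetch_url(url: str) -> bytes:
    """Download the seed file; kept separate so it can be swapped out in tests."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def seed_durations(
    path: Path,
    seed_url: Optional[str],
    fetch: Callable[[str], bytes] = fetch_url,
) -> str:
    """Seed `path` from `seed_url` only when no cached durations were restored."""
    if path.exists() and path.stat().st_size > 0:
        return "cache-hit"  # restored from cache: skip the download
    if not seed_url:
        return "no-seed"    # variable unset: fall back to count-based splitting
    path.write_bytes(fetch(seed_url))
    return "seeded"
```

A shard would call `seed_durations(Path(".test_durations"), seed_url)` after the cache-restore step and before invoking pytest.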

PaulHax force-pushed the ci/shard-slow-tests branch from 32099d5 to 7b005d4 Compare May 8, 2026 20:11

PaulHax added 5 commits May 15, 2026 16:11

PaulHax force-pushed the ci/shard-slow-tests branch from e024987 to 8c74c2e Compare May 15, 2026 20:11

PaulHax changed the title ~~ci: shard slow pytest into 4 parallel jobs~~ ci: shard slow pytest into 8 parallel jobs May 15, 2026

PaulHax merged commit b85bc32 into main May 15, 2026
11 checks passed

PaulHax merged 5 commits into main from ci/shard-slow-tests
