ci: shard slow pytest into 8 parallel jobs #18

Merged
PaulHax merged 5 commits into main from ci/shard-slow-tests
May 15, 2026

@PaulHax PaulHax commented May 8, 2026

Collaborator

Summary

  • GitHub-hosted runners hard-cancel jobs at 360 min regardless of timeout-minutes, so the recent 4h→6h→10h bumps couldn't take effect past 6h.
  • Shard test-slow into a 4-way matrix with pytest-split. 545 slow items split 137/137/137/134; wall-clock drops ~3.5h → ~1h.
  • timeout-minutes: 600 → 90.

Test plan

  • All 4 pytest (slow) N/4 jobs pass in parallel
  • pytest (fast) + ruff still pass
  • Follow-up: commit .test_durations after a green run so splits balance by duration instead of round-robin

PaulHax added a commit that referenced this pull request May 8, 2026
Three follow-ups on PR #18's first run, where shards 2/3 finished in 3min
while 1/4 ran for an hour+:

- Cache .test_durations via actions/cache instead of committing it.
  Floats churn every run, so committing would thrash git history. Each
  shard restores the most recent cached durations (any group, any run)
  for balanced splitting, then saves its own updated copy under a
  unique key. Over runs, fresh entries from every group accumulate.

- Bump shards 4 → 8. Round-robin by test count (the fallback when no
  durations file exists) clusters heavy parametrized tests, so going
  wider than strictly needed buys headroom for the first run.

- Bump timeout 90 → 240min. 90 was sized for the steady-state balanced
  case; first-run imbalance needs more slack. Still under GitHub's
  hard 360min cancel.
@PaulHax PaulHax force-pushed the ci/shard-slow-tests branch from 32099d5 to 7b005d4 on May 8, 2026 20:11
PaulHax added 5 commits May 15, 2026 16:11
The 4h→6h→10h timeout bumps were chasing a non-existent ceiling: GitHub
hard-cancels any single job at 360 min regardless of timeout-minutes,
so timeout: 600 on main is a no-op. The slow gate needs to scale below
that line, not above it.

Switch test-slow to an 8-way matrix using pytest-split. 545 slow items
split 69×7 + 62; each shard runs well under an hour, wall-clock for
the gate drops from ~3.5h to ~1h once durations are populated.
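
The shard counts quoted above (69×7 + 62 for eight shards, and 137/137/137/134 for the earlier four-way attempt) are what contiguous chunking produces when every test carries the same weight: a group closes as soon as its running total reaches total/groups. The sketch below reproduces that arithmetic. `chunk_sizes` is a hypothetical helper written for illustration, not pytest-split's actual code, and the real plugin may differ in detail.

```python
def chunk_sizes(durations, n_groups):
    """Split tests into contiguous groups of roughly equal total duration.

    Hypothetical illustration only: a group is closed once its running
    total reaches total / n_groups; the last group takes whatever is left.
    """
    target = sum(durations) / n_groups
    sizes = [0] * n_groups
    group, elapsed = 0, 0.0
    for d in durations:
        sizes[group] += 1
        elapsed += d
        # Move on once this group has met its share (never past the last group).
        if elapsed >= target and group < n_groups - 1:
            group += 1
            elapsed = 0.0
    return sizes


# Cold start: no durations file, so every test is weighted equally.
equal_weights = [1.0] * 545
four_way = chunk_sizes(equal_weights, 4)
eight_way = chunk_sizes(equal_weights, 8)
```

With real durations the same loop closes a group after a few heavy tests rather than after a fixed count, which is why adjacent heavy parametrized cases should stop piling into one shard once `.test_durations` is populated.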

.test_durations is held in the GH Actions cache (actions/cache@v4),
not git — committing it would thrash history every run since the
floats churn. Each shard restores the most recent cached durations,
runs with --store-durations, and saves its own updated copy under a
unique key. Over runs, fresh entries from every group accumulate.

Sized for the cold-start case where no durations exist yet:
- 8 shards (vs strictly-needed 4) buys headroom against round-robin
  clustering of heavy parametrized tests
- timeout-minutes: 240 — first-run safety net only; steady-state is
  well under an hour. Still under the 360 hard cap.

Adds pytest-split to the dev dep group; uv.lock regenerated.

The shard 8/8 hang was a JAX-array-as-int hot loop in
_global_host_to_relative_slot / _abs_subnet_to_relative — pure-Python
host-side helpers that were doing int(jax_array[i, j]) per element of
small const lookup tables. Each access round-tripped through full JAX
primitive dispatch (slice_p.bind → apply_primitive); on cold-cache CI
runners the JAX persistent-cache lock contention from the also-running
xdist workers turned that into an unbounded stall. Switched to one
np.asarray + np.flatnonzero per call. Module already declared itself
JAX-free at the top — this aligns with that intent.
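
As a rough sketch of the change described above, the two functions below contrast per-element `int(table[i, j])` access with a single `np.asarray` conversion followed by `np.flatnonzero`. The function names, the table layout, and the "position among non-zero entries" semantics are all assumptions made for illustration; the real `_global_host_to_relative_slot` and `_abs_subnet_to_relative` helpers are not shown in this PR text. The example uses a NumPy array so it runs without JAX installed.

```python
import numpy as np


def relative_slot_slow(table, row, global_idx):
    """Hypothetical 'before' shape: one indexing operation per element.

    With a JAX array, every table[row, j] goes through primitive dispatch.
    """
    slot = 0
    for j in range(table.shape[1]):
        if int(table[row, j]) != 0:
            if j == global_idx:
                return slot
            slot += 1
    return -1


def relative_slot_fast(table, row, global_idx):
    """Hypothetical 'after' shape: convert once, then stay in NumPy."""
    host_table = np.asarray(table)  # single host copy for a JAX array; no-op for ndarray
    active = np.flatnonzero(host_table[row])  # column indices of non-zero entries
    hits = np.flatnonzero(active == global_idx)
    return int(hits[0]) if hits.size else -1


# Small constant lookup table (made-up values).
table = np.array([[0, 1, 1, 0, 1],
                  [1, 0, 0, 1, 0]])
```

Both versions return the same slot for the same inputs; the second one touches the array a constant number of times per call instead of once per element.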

Workflow:
- drop addopts via -o "addopts=" so -n auto doesn't stack on top of
  sharding (xdist + pytest-split contend on the JAX cache lock)
- fix the durations cache: previous strategy had each shard overwrite
  .test_durations with only its 1/N slice, so prefix-restore picked up
  partial files that never accumulated. Now each shard writes
  .test_durations_cache/group-N.json, and a merge step rebuilds the
  full .test_durations before pytest runs
- timeout 240 → 60 (hang root cause is fixed; no first-run cliff to
  budget for)

Verified locally:
test_native_generic_exploit_respects_blocked_scan_source_route_matches_cyborg
now finishes in 3m4s (was unbounded).

First run after the cache schema change has empty
.test_durations_cache, so pytest-split round-robins by test count and
heavy parametrized tests cluster into 1-2 shards that exceed 60 min
serially. Subsequent runs (with partials cached) all fit under 30 min.

Per-shard cache keys could never warm up: actions/cache restores at
most one cache per step, so each shard merged from a single shard's
2.4 KB partial instead of all eight. pytest-split fell back to
round-robin and shard 2/8 hit the 120 min timeout while shard 3/8
finished in 59 s.

Replace per-shard caches with a single aggregate refreshed by a new
aggregate-durations job that merges every shard's uploaded
.test_durations (plus the prior aggregate so timed-out shards keep
their last-known durations) and saves under
slow-durations-aggregated-<run_id>. Each shard restores the most
recent aggregate via the slow-durations-aggregated- prefix.
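
The merge described above can be sketched as a layered dictionary update: start from the prior aggregate, then let each shard's fresh timings overwrite matching entries. This assumes `.test_durations` is a JSON object mapping test IDs to seconds; `load_durations` and `merge_durations` are hypothetical names, and the actual aggregate-durations job may be implemented differently (for example as an inline shell or Python step).

```python
import json
from pathlib import Path


def load_durations(path: Path) -> dict:
    """Read a durations file, treating a missing file as empty."""
    if not path.is_file():
        return {}
    return json.loads(path.read_text())


def merge_durations(prior: dict, shard_results: list) -> dict:
    """Layer fresh shard timings over the previous aggregate.

    Tests from a shard that timed out (and so uploaded nothing) keep
    their last-known duration from `prior`.
    """
    merged = dict(prior)
    for shard in shard_results:
        merged.update(shard)
    return merged


# Made-up example: shard 2 timed out, so only shard 1 reported.
prior = {"tests/test_a.py::test_x": 10.0, "tests/test_b.py::test_y": 300.0}
shard_1 = {"tests/test_a.py::test_x": 12.5, "tests/test_c.py::test_z": 4.0}
aggregate = merge_durations(prior, [shard_1])
```

Because the prior aggregate is the base layer, a timed-out shard does not erase what was already known about its tests; it only fails to refresh it.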

Bump test-slow timeout 120 -> 180 min for cold-cache headroom on the
first run before the aggregate exists.

Two fixes for the slow-test sharding bootstrap:

1. upload-artifact@v4 excludes hidden files by default, so
   `.test_durations` was silently dropped. The aggregate-durations
   job consequently found 0 artifacts and never populated the cache.
   Add `include-hidden-files: true`.

2. Even with #1 fixed, the very first run on a new cache scope
   (this PR, main after merge, future PRs) still cold-starts with
   round-robin sharding and hits the per-shard timeout. Add a
   conditional "seed from URL" step that pulls a known-good
   `.test_durations` from `vars.SLOW_DURATIONS_SEED_URL` when the
   cache is empty. Subsequent runs hit the cache and skip the
   download, and the aggregate job keeps the cache refreshed.
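
The "seed from URL" logic in item 2 amounts to a guarded download: do nothing when the cache already supplied a durations file, otherwise fetch the seed. The sketch below is a hypothetical Python rendering of that condition; the workflow step itself is not shown here and may well be a shell command. `seed_durations` and its return strings are invented for illustration.

```python
import json
import urllib.request
from pathlib import Path
from typing import Optional


def seed_durations(dest: Path, seed_url: Optional[str]) -> str:
    """Populate `dest` from `seed_url` only when no cached copy exists."""
    if dest.is_file() and dest.stat().st_size > 0:
        return "cache-hit"  # cache restore already provided the file
    if not seed_url:
        return "no-seed"  # variable unset: fall back to cold-start splitting
    with urllib.request.urlopen(seed_url, timeout=30) as resp:
        data = resp.read()
    json.loads(data)  # raises ValueError if the seed is not valid JSON
    dest.write_bytes(data)
    return "seeded"
```

In a workflow this would be called with the value of `vars.SLOW_DURATIONS_SEED_URL`, assuming the step exposes that variable to the script (for instance through an environment variable).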
@PaulHax PaulHax force-pushed the ci/shard-slow-tests branch from e024987 to 8c74c2e on May 15, 2026 20:11
@PaulHax PaulHax changed the title from "ci: shard slow pytest into 4 parallel jobs" to "ci: shard slow pytest into 8 parallel jobs" May 15, 2026
@PaulHax PaulHax merged commit b85bc32 into main May 15, 2026
11 checks passed