
fix(runtime): repair cross-Python ABI in-process for wrapper launches (torchrun) #132

Merged: christophergeyer merged 2 commits into main from cg/runtime-repair-in-process on May 30, 2026



@christophergeyer christophergeyer commented May 27, 2026

Problem

When roar run targets a Python whose ABI differs from roar-cli's bundled deps (e.g. roar installed under 3.13, workload venv on 3.10), it lazy-installs a matching runtime tree only when argv[0] is a python binary (abi_probe.probe_python_abi). Wrapper launches (torchrun, uv run, accelerate, shell scripts) bypass that launch-time probe, so the per-worker interpreters get roar's wrong-ABI pydantic_core and the run dies with:

ModuleNotFoundError: No module named 'pydantic_core._pydantic_core'

This is the failure we hit while bringing up roar on a nanochat job (torchrun, py3.10) under a py3.13 roar.
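
To make the gap concrete, here is a minimal sketch of why a wrapper launch slips past a launch-time probe keyed on argv[0]. It is an illustration under assumptions, not roar's code: `looks_like_python` is a hypothetical helper, and the real check in abi_probe.probe_python_abi may be implemented differently.

```python
import os


def looks_like_python(argv):
    """Hypothetical launch-time check: is argv[0] a python binary?

    Assumption for illustration only; the real probe may differ.
    """
    if not argv:
        return False
    name = os.path.basename(argv[0]).lower()
    return name.startswith("python")


# A direct launch is recognised, so a launch-time prewarm can run.
direct = ["/venv/bin/python3.10", "train.py"]

# A wrapper launch is not: the interpreter that will actually run the
# workload is only known later, inside the worker process.
wrapped = ["torchrun", "--nproc_per_node=8", "train.py"]

print(looks_like_python(direct), looks_like_python(wrapped))
```

Under this assumption only the first command would trigger the launch-time install, which is why the repair has to be decided inside the worker.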

Fix

Move the authoritative repair into the in-process sitecustomize gate, which runs inside the real worker where the ABI is known for certain regardless of how Python was launched. The launch-time install stays as a best-effort prewarm for the direct-python case. On an ABI mismatch the gate now:

  1. disables backend dispatch first — so the repair's own imports can't trigger a backend load (Ray/OSMO plugin → wrong-ABI pydantic_core → the very crash we're fixing);
  2. lazy-installs + prepends an ABI-matched runtime tree (tracking suppressed so the repair's file I/O stays out of workload lineage);
  3. re-enables dispatch only once matched deps are reachable.

The decision logic lives in support.apply_runtime_gate(...) so the disable-before-repair / enable-only-on-success ordering is unit-tested independently of the module body.
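
The ordering can be sketched roughly as below. Only the keyword names (`matched`, `repair`, `on_degrade`) come from this PR's commit message. The controller methods, the return value, and the no-op matched path are assumptions made for illustration, so treat this as a sketch of the ordering rather than the actual support.apply_runtime_gate.

```python
from typing import Callable, List


class RecordingController:
    """Hypothetical stand-in for the dispatch controller; records call order."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def disable_dispatch(self) -> None:
        self.calls.append("disable")

    def enable_dispatch(self) -> None:
        self.calls.append("enable")


def apply_runtime_gate(
    controller,
    *,
    matched: bool,
    repair: Callable[[], bool],
    on_degrade: Callable[[], None],
) -> bool:
    """Sketch of the gate ordering; returns True if dispatch may be used."""
    if matched:
        # Assumption: on the matched path there is nothing to repair.
        return True
    # 1. Disable dispatch BEFORE repair, so the repair's own imports
    #    cannot trigger a wrong-ABI backend load.
    controller.disable_dispatch()
    try:
        repaired = bool(repair())
    except Exception:
        repaired = False
    if repaired:
        # 2. Re-enable only once matched deps are reachable.
        controller.enable_dispatch()
        return True
    # 3. Otherwise degrade and leave dispatch disabled.
    on_degrade()
    return False
```

A test of the ordering then only needs to assert on the recorded call sequence, for example `["disable", "repair", "enable"]` on success and `["disable", "degrade"]` on failure.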

Supporting changes

  • lazy_install: per-ABI cross-process install lock (flock) with double-checked caching + timeout. Moving the repair in-process means N torchrun workers hit a cold cache at once; the lock collapses that thundering herd to a single install, and the remaining workers wait and reuse it. Also adds an env scrub so the installer subprocess can't re-inject roar into itself.
  • framework/registry: resilient backend discovery — a builtin/entry-point plugin whose compiled deps can't import (wrong-ABI wheel) is skipped (recorded for diagnostics) instead of crashing discovery and the workload. Necessary because re-enabling dispatch otherwise pulls each backend's full compiled-dep closure (e.g. Ray → cryptography's _rust.abi3.so).
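
The lock pattern from the lazy_install item can be sketched as follows. This is a POSIX-only illustration that assumes a `.complete` marker file, a per-ABI lock file next to the cache, and a caller-supplied `install` callable. All of those names are hypothetical and are not taken from roar's lazy_install module.

```python
import fcntl
import time
from pathlib import Path
from typing import Callable


def ensure_runtime_tree(
    cache_root: Path,
    abi_tag: str,
    install: Callable[[Path], None],
    timeout: float = 300.0,
    poll: float = 0.1,
) -> Path:
    """Install the tree for abi_tag at most once across processes."""
    tree = cache_root / abi_tag
    marker = tree / ".complete"  # hypothetical "install finished" marker
    if marker.exists():  # first check: lock-free fast path
        return tree

    cache_root.mkdir(parents=True, exist_ok=True)
    lock_path = cache_root / f"{abi_tag}.lock"  # one lock file per ABI
    deadline = time.monotonic() + timeout
    with open(lock_path, "w") as lock_file:
        # Poll a non-blocking flock so the wait can time out.
        while True:
            try:
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"timed out waiting for {lock_path}")
                time.sleep(poll)
        try:
            # Second check: another worker may have installed the tree
            # while this one was waiting for the lock.
            if not marker.exists():
                tree.mkdir(parents=True, exist_ok=True)
                install(tree)
                marker.touch()
            return tree
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

With this shape, the first worker to take the lock performs the install and every later caller returns from one of the two marker checks without installing again.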

Verification

  • ruff check ., ruff format --check ., mypy roar — clean.
  • Full unit suite: 1692 passed, 3 skipped (1 pre-existing telemetry failure is a local-env artifact, unrelated; passes in CI's clean env).
  • sitecustomize perf guard — unchanged (repair short-circuits on the matched path).
  • End-to-end (a bash launcher standing in for torchrun, spawning 8 concurrent cp310 workers under a cp313 roar): single install across all 8, every worker imports the ABI-matched pydantic_core from the cache, Ray skipped gracefully, no crash.

Regression tests added:

  • test_apply_runtime_gate_* — the gate ordering (disable-before-repair, enable-only-on-success, degrade-on-failure). Runs in every CI job and guards the core behavior change.
  • test_cross_python_runtime_repair — opt-in integration test: a bash launcher (torchrun/uv-run stand-in) spawns N workers on a different CPython, asserts each repairs in-process (imports the ABI-matched pydantic_core from cache) and that N concurrent workers collapse to a single install. Skipped unless uv + a second CPython minor are present.
  • Plus component tests: concurrent-install serialization (the herd), installer env scrub, dispatch re-enable, resilient discovery.
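
The resilient-discovery behaviour covered by the component tests can be illustrated with a small sketch. It assumes backends are plain importable modules, and the function and variable names are hypothetical. The real framework/registry code also handles entry points and may record diagnostics differently.

```python
import importlib
from typing import Dict, Iterable, Tuple


def discover_backends(module_names: Iterable[str]) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Import each backend module; skip (and record) any that fail to import."""
    loaded: Dict[str, object] = {}
    skipped: Dict[str, str] = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as exc:
            # A wrong-ABI compiled dependency typically surfaces as an
            # ImportError; keep the reason for diagnostics and move on
            # instead of crashing discovery and the workload.
            skipped[name] = repr(exc)
    return loaded, skipped


loaded, skipped = discover_backends(["json", "no_such_backend_module_xyz"])
print(sorted(loaded), sorted(skipped))
```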

Follow-ups (out of scope)

  • For Ray/OSMO to fully work cross-ABI (not just be skipped), the runtime tree needs the complete backend dep closure (cryptography, …) — expand _RUNTIME_DEPS or install the full closure. (Alternatively, co-installing roar in the workload venv via uv pip install sidesteps the whole ABI problem and makes backends work natively.)
  • The gate's free-threaded (cp313t) ABI check uses a substring match that conflates cpython-313 and cpython-313t, and the runtime cache key is cache_tag-only (no arch/libc). Both issues are pre-existing and worth a separate hardening pass.
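
The substring conflation in the last item is easy to reproduce in isolation. The helpers below are hypothetical and only illustrate the failure mode with the two tags named above. They are not the gate's actual check.

```python
def substring_match(expected_tag: str, candidate: str) -> bool:
    """Naive check: true whenever expected_tag appears anywhere in candidate."""
    return expected_tag in candidate


def exact_match(expected_tag: str, candidate: str) -> bool:
    """Stricter check: the two tags must be identical."""
    return expected_tag == candidate


# The regular tag is a prefix of the free-threaded one, so a substring
# test cannot tell them apart, while an equality test can.
print(substring_match("cpython-313", "cpython-313t"))
print(exact_match("cpython-313", "cpython-313t"))
```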

🤖 Generated with Claude Code

chrisgeyertreqs and others added 2 commits May 27, 2026 22:35
`roar run` against a Python whose ABI differs from roar-cli's bundled deps
(e.g. roar installed under 3.13, workload venv on 3.10) lazy-installs a
matching runtime tree only when argv[0] is a `python` binary. Wrapper launches
(torchrun, `uv run`, shell scripts) bypass that probe, so the per-worker
interpreters got roar's wrong-ABI pydantic_core and crashed with
`ModuleNotFoundError: No module named 'pydantic_core._pydantic_core'`.

Move the authoritative repair into the in-process sitecustomize gate, which
runs in the real worker where the ABI is known for certain no matter how Python
was launched. On a mismatch it now:
  - disables backend dispatch first, so the repair's own imports can't trigger
    a wrong-ABI backend load (the original crash path),
  - lazy-installs + prepends an ABI-matched runtime tree (tracking suppressed so
    the repair's file I/O stays out of workload lineage),
  - re-enables dispatch only once matched deps are reachable.

Supporting changes:
  - lazy_install: per-ABI cross-process install lock with double-checked caching
    and a timeout, so N torchrun workers collapse to a single install instead of
    a thundering herd; plus an env scrub so the installer subprocess can't
    re-inject roar into itself.
  - framework registry: backend discovery skips a builtin/entry-point plugin
    whose compiled deps can't import (wrong-ABI wheel) rather than crashing
    discovery and the traced workload.

Verified end-to-end: a bash launcher (torchrun stand-in) spawning 8 concurrent
cp310 workers under a cp313 roar -> single install, all workers import the
ABI-matched pydantic_core from the cache, Ray skipped gracefully, no crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The components of the in-process repair were unit-tested, but the central
behavior change — the gate's disable-before-repair / enable-only-on-success
ordering — lived as module-level code in sitecustomize and was unguarded: a
reordering would pass every component test yet reintroduce the original crash.

- Extract the gate decision into `support.apply_runtime_gate(controller, *,
  matched, repair, on_degrade)` (a pure refactor of the sitecustomize body) and
  unit-test the ordering: dispatch disabled before repair runs, re-enabled +
  initialized only on success, degrade only on failure. Runs in every CI job.

- Add an opt-in integration test: a bash launcher (torchrun/uv-run stand-in)
  spawns N workers on a different CPython than roar's bundled deps; asserts each
  worker repairs in-process (imports the ABI-matched pydantic_core from the
  cache) and that N concurrent workers collapse to a single install. Skipped
  unless `uv` and a second CPython minor are present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@christophergeyer christophergeyer marked this pull request as ready for review May 30, 2026 01:42
@christophergeyer christophergeyer merged commit 143e703 into main May 30, 2026
12 checks passed