fix(runtime): repair cross-Python ABI in-process for wrapper launches (torchrun) #132
Merged
`roar run` against a Python whose ABI differs from roar-cli's bundled deps
(e.g. roar installed under 3.13, workload venv on 3.10) lazy-installs a
matching runtime tree only when argv[0] is a `python` binary. Wrapper launches
(torchrun, `uv run`, shell scripts) bypass that probe, so the per-worker
interpreters got roar's wrong-ABI pydantic_core and crashed with
`ModuleNotFoundError: pydantic_core._pydantic_core`.
Move the authoritative repair into the in-process sitecustomize gate, which
runs in the real worker where the ABI is known for certain no matter how Python
was launched. On a mismatch it now:
- disables backend dispatch first, so the repair's own imports can't trigger
a wrong-ABI backend load (the original crash path),
- lazy-installs + prepends an ABI-matched runtime tree (tracking suppressed so
the repair's file I/O stays out of workload lineage),
- re-enables dispatch only once matched deps are reachable.
Supporting changes:
- lazy_install: per-ABI cross-process install lock with double-checked caching
and a timeout, so N torchrun workers collapse to a single install instead of
a thundering herd; plus an env scrub so the installer subprocess can't
re-inject roar into itself.
- framework registry: backend discovery skips a builtin/entry-point plugin
whose compiled deps can't import (wrong-ABI wheel) rather than crashing
discovery and the traced workload.
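The skip-instead-of-crash behavior in the last bullet could look roughly like the sketch below. It is an illustration only, not roar's registry code: `discover_backends`, the `skipped` mapping, and the backend module name are hypothetical, and it assumes a wrong-ABI compiled dependency surfaces as `ImportError` (or its subclass `ModuleNotFoundError`), as in the crash described above.

```python
import importlib
from types import ModuleType


def discover_backends(
    module_names: list[str],
) -> tuple[dict[str, ModuleType], dict[str, str]]:
    """Import each candidate backend; skip and record any that cannot load."""
    loaded: dict[str, ModuleType] = {}
    skipped: dict[str, str] = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError as exc:
            # Assumption: a wrong-ABI wheel fails at import time with ImportError
            # (ModuleNotFoundError is a subclass). Keep the reason for diagnostics
            # instead of letting it abort discovery.
            skipped[name] = f"{type(exc).__name__}: {exc}"
    return loaded, skipped


# "json" stands in for a healthy backend; the second name is hypothetical and
# does not exist, so it is skipped rather than raising.
loaded, skipped = discover_backends(["json", "hypothetical_backend_with_broken_deps"])
```

The workload only sees the backends in `loaded`; `skipped` keeps the failure reason available for diagnostics.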
Verified end-to-end: a bash launcher (torchrun stand-in) spawning 8 concurrent
cp310 workers under a cp313 roar -> single install, all workers import the
ABI-matched pydantic_core from the cache, Ray skipped gracefully, no crash.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
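For readers unfamiliar with the pattern, the per-ABI install lock with double-checked caching and a timeout described above can be sketched as follows. This is a minimal POSIX-only illustration, not the `lazy_install` implementation: `ensure_installed`, the `.complete` marker file, the lock-file naming, and the polling interval are all assumptions made for the example.

```python
import fcntl
import tempfile
import time
from pathlib import Path
from typing import Callable


def ensure_installed(
    cache_dir: Path,
    abi_tag: str,
    install: Callable[[Path], None],
    timeout: float = 60.0,
) -> Path:
    """Install the runtime tree for abi_tag at most once across processes."""
    target = cache_dir / abi_tag
    marker = target / ".complete"  # hypothetical "install finished" marker
    if marker.exists():  # first check: lock-free fast path for a warm cache
        return target

    cache_dir.mkdir(parents=True, exist_ok=True)
    lock_path = cache_dir / f"{abi_tag}.lock"
    deadline = time.monotonic() + timeout
    with open(lock_path, "w") as lock_file:
        while True:
            try:
                # Non-blocking exclusive lock so the wait can be bounded.
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise TimeoutError(f"timed out waiting for {lock_path}")
                time.sleep(0.05)
        try:
            if not marker.exists():  # second check: another worker may have won
                target.mkdir(parents=True, exist_ok=True)
                install(target)
                marker.touch()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return target


# Usage: three callers against the same cache trigger a single install.
calls: list[Path] = []
with tempfile.TemporaryDirectory() as tmp:
    for _ in range(3):
        ensure_installed(Path(tmp), "cp310", calls.append)
```

The second check under the lock is what lets the waiting workers reuse the winner's install instead of repeating it.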
The components of the in-process repair were unit-tested, but the central behavior change (the gate's disable-before-repair / enable-only-on-success ordering) lived as module-level code in sitecustomize and was unguarded: a reordering would pass every component test yet reintroduce the original crash.
- Extract the gate decision into `support.apply_runtime_gate(controller, *, matched, repair, on_degrade)` (a pure refactor of the sitecustomize body) and unit-test the ordering: dispatch disabled before repair runs, re-enabled + initialized only on success, degrade only on failure. Runs in every CI job.
- Add an opt-in integration test: a bash launcher (torchrun/uv-run stand-in) spawns N workers on a different CPython than roar's bundled deps; asserts each worker repairs in-process (imports the ABI-matched pydantic_core from the cache) and that N concurrent workers collapse to a single install. Skipped unless `uv` and a second CPython minor are present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
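The ordering contract that the new unit tests guard can be illustrated with a small sketch. Only the signature `apply_runtime_gate(controller, *, matched, repair, on_degrade)` comes from this PR; the body below, the controller method names (`disable_dispatch`, `enable_dispatch`, `initialize`), the boolean return of `repair`, and the handling of the matched case are assumptions made for illustration.

```python
from typing import Callable


class RecordingController:
    """Hypothetical stand-in for the dispatch controller; records call order."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def disable_dispatch(self) -> None:
        self.events.append("disable")

    def enable_dispatch(self) -> None:
        self.events.append("enable")

    def initialize(self) -> None:
        self.events.append("initialize")


def apply_runtime_gate(
    controller,
    *,
    matched: bool,
    repair: Callable[[], bool],
    on_degrade: Callable[[], None],
) -> bool:
    """Sketch of the gate: disable before repair, enable only on success."""
    if matched:
        # Assumption: a matching ABI needs no repair.
        controller.initialize()
        return True
    controller.disable_dispatch()  # must happen before repair() imports anything
    if repair():  # assumption: repair reports success as a bool
        controller.enable_dispatch()
        controller.initialize()
        return True
    on_degrade()  # dispatch stays disabled on failure
    return False


def run_case(repair_ok: bool) -> list[str]:
    controller = RecordingController()

    def repair() -> bool:
        controller.events.append("repair")
        return repair_ok

    apply_runtime_gate(
        controller,
        matched=False,
        repair=repair,
        on_degrade=lambda: controller.events.append("degrade"),
    )
    return controller.events


success_events = run_case(True)
failure_events = run_case(False)
```

A test in this style fails as soon as `repair` is reordered ahead of `disable_dispatch`, which is the regression the component tests alone could not catch.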
TrevorBasinger
approved these changes
May 28, 2026
Problem
`roar run` against a Python whose ABI differs from roar-cli's bundled deps (roar installed under 3.13, workload venv on 3.10, etc.) lazy-installs a matching runtime tree only when `argv[0]` is a `python` binary (`abi_probe.probe_python_abi`). Wrapper launches (`torchrun`, `uv run`, `accelerate`, shell scripts) bypass that launch-time probe, so the per-worker interpreters get roar's wrong-ABI `pydantic_core` and the run dies with `ModuleNotFoundError: pydantic_core._pydantic_core`.
This is the failure hit while bringing up roar on a nanochat (`torchrun`, py3.10) job under a py3.13 roar.
Fix
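Move the authoritative repair into the in-process `sitecustomize` gate, which runs inside the real worker where the ABI is known for certain regardless of how Python was launched. The launch-time install stays as a best-effort prewarm for the direct-`python` case. On an ABI mismatch the gate now:
- disables backend dispatch first, so the repair's own imports can't trigger a wrong-ABI backend load of `pydantic_core` (the very crash we're fixing);
- lazy-installs and prepends an ABI-matched runtime tree, with tracking suppressed so the repair's file I/O stays out of workload lineage;
- re-enables dispatch only once matched deps are reachable.
The decision logic lives in `support.apply_runtime_gate(...)` so the disable-before-repair / enable-only-on-success ordering is unit-tested independently of the module body.
Supporting changes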
- `lazy_install`: per-ABI cross-process install lock (`flock`) with double-checked caching + timeout. Moving repair in-process means N `torchrun` workers hit a cold cache at once; the lock collapses that thundering herd to a single install, and the rest wait and reuse it. Plus an env scrub so the installer subprocess can't re-inject roar into itself.
- `framework/registry`: resilient backend discovery. A builtin/entry-point plugin whose compiled deps can't import (wrong-ABI wheel) is skipped (recorded for diagnostics) instead of crashing discovery and the workload. Necessary because re-enabling dispatch otherwise pulls each backend's full compiled-dep closure (e.g. Ray → `cryptography`'s `_rust.abi3.so`).
Verification
- `ruff check .`, `ruff format --check .`, `mypy roar`: clean.
- End-to-end: a bash launcher (stand-in for `torchrun`, spawning 8 concurrent cp310 workers under a cp313 roar): single install across all 8, every worker imports the ABI-matched `pydantic_core` from the cache, Ray skipped gracefully, no crash.
Regression tests added:
- `test_apply_runtime_gate_*`: the gate ordering (disable-before-repair, enable-only-on-success, degrade-on-failure). Runs in every CI job and guards the core behavior change.
- `test_cross_python_runtime_repair`: opt-in integration test. A bash launcher (torchrun/uv-run stand-in) spawns N workers on a different CPython, asserts each repairs in-process (imports the ABI-matched `pydantic_core` from cache) and that N concurrent workers collapse to a single install. Skipped unless `uv` + a second CPython minor are present.
Follow-ups (out of scope)
- … `cryptography`, …): expand `_RUNTIME_DEPS` or install the full closure. (Alternatively, co-installing roar in the workload venv via `uv pip install` sidesteps the whole ABI problem and makes backends work natively.)
- … `cp313t`) ABI check uses a substring match that conflates `cpython-313`/`cpython-313t`; and the runtime cache key is `cache_tag`-only (no arch/libc). Pre-existing; worth a separate hardening pass.
🤖 Generated with Claude Code
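The substring conflation noted in the follow-ups is easy to reproduce at the string level. The snippet below is only a demonstration of that pitfall using the two tags quoted above; `substring_match` and `exact_match` are hypothetical helpers, not roar's ABI check, and the exact comparison is one possible hardening rather than a decided fix.

```python
def substring_match(wanted: str, found: str) -> bool:
    """Substring test: also accepts any tag that merely starts with `wanted`."""
    return wanted in found


def exact_match(wanted: str, found: str) -> bool:
    """Exact comparison keeps the trailing `t` tag distinct."""
    return wanted == found


# The two tags quoted in the follow-up item.
conflated = substring_match("cpython-313", "cpython-313t")
distinguished = exact_match("cpython-313", "cpython-313t")
```

Here `conflated` is `True` while `distinguished` is `False`, which is the difference a hardening pass would need to preserve.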