Pre-populate worker python env on host VM #207
Open
JacobZuliani wants to merge 3 commits into
Worker[0] used to download uv from GitHub and `uv pip install burla` over
PyPI inside its container during boot. On a fresh VM that consistently
ran 15-25s and was tripping the 20s WORKER_BOOT_TIMEOUT_SECONDS, causing
~50% of node boots to fail in busy projects. Move both steps into the
host startup-script (which already templates CURRENT_BURLA_VERSION) so
worker_server.py short-circuits both `shutil.which("uv")` and the
`importlib.metadata.version("burla")` checks. Bumps the timeout to 60s
as a safety margin for the fallback paths.
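A minimal sketch of the two checks this commit relies on, assuming they behave roughly as described above. The function names `uv_missing`, `package_needs_install`, and `can_skip_install` are hypothetical and are not taken from worker_server.py; only `shutil.which` and `importlib.metadata.version` come from the commit message.

```python
import importlib.metadata
import shutil
from typing import Optional


def uv_missing(search_path: Optional[str] = None) -> bool:
    """Return True when no `uv` executable is found (PATH is used if search_path is None)."""
    return shutil.which("uv", path=search_path) is None


def package_needs_install(package: str, target_version: str) -> bool:
    """Return True when the package is absent or installed at a different version."""
    try:
        installed = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return True
    return installed != target_version


def can_skip_install(target_burla_version: str) -> bool:
    """Hypothetical combined check: both fallback paths can be skipped."""
    return not uv_missing() and not package_needs_install("burla", target_burla_version)
```

When the host has already placed `uv` and the matching `burla` version in the shared env, both checks come back negative and no network access is needed during boot.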
The pre-populate path was hard-coding cp312 wheels, but the user can run any python image. Pull the first image in CONTAINERS during the startup script and ask its python interpreter for `sys.version_info`, then pass that to `uv pip install --python-version` so cp311 / cp312 / cp313 (etc.) images all get matching ABI wheels in the shared env. Wrap the whole block in a subshell + `|| true` so an unusual image (no python on PATH, unauthenticated registry, etc.) doesn't trigger the outer ERR trap and delete the VM — worker_server.py will fall back to its own install path in that case.
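The detect-then-install step described above can be sketched in Python. This is an illustration only: the actual change lives in the bash startup script, and the helper names (`parse_major_minor`, `detect_python_version`, `uv_install_command`) and the 120-second timeout are assumptions. The `docker` and `uv` arguments mirror the ones named in this PR.

```python
import re
import subprocess
from typing import List, Optional

# Printed inside the user image to report e.g. "3.13".
PROBE = "import sys; print(f'{sys.version_info.major}.{sys.version_info.minor}')"
TARGET_DIR = "/worker_service_python_env"


def parse_major_minor(text: str) -> Optional[str]:
    """Accept only output that looks like MAJOR.MINOR; anything else is treated as failure."""
    candidate = text.strip()
    return candidate if re.fullmatch(r"\d+\.\d+", candidate) else None


def detect_python_version(image: str) -> Optional[str]:
    """Ask the image's own interpreter for its version; return None on any failure."""
    cmd = ["docker", "run", "--rm", "--entrypoint", "python", image, "-c", PROBE]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=120)
    except (OSError, subprocess.SubprocessError):
        # No docker, no python in the image, registry error, timeout, etc.
        return None
    return parse_major_minor(result.stdout)


def uv_install_command(py_version: str, burla_version: str) -> List[str]:
    """Build the uv invocation that targets the detected interpreter version."""
    return [
        "uv", "pip", "install",
        "--python-version", py_version,
        "--python-platform", "x86_64-manylinux2014",
        "--target", TARGET_DIR,
        f"burla=={burla_version}",
    ]
```

Returning `None` instead of raising plays the same role as the subshell + `|| true` wrapper: a failed probe leaves the env unpopulated and the worker's own install path takes over.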
Summary

Move the `uv` download + `uv pip install burla` out of the worker container's first-run hot path and into the host VM's startup script. Detect the user image's python version dynamically (so cp311 / cp312 / cp313 / etc. all get matching ABI wheels).

The host startup script already has `uv` installed and is templated with `CURRENT_BURLA_VERSION`. New flow:

1. Read `CONTAINERS` (env var the startup-script already exports) and grab the first image.
2. `docker pull` it (node_service's later pull then becomes a no-op, so no time is wasted).
3. `docker run --rm --entrypoint python "$IMAGE" -c '...sys.version_info...'` to get e.g. `3.13`.
4. `uv pip install --python-version "$PY_VERSION" --python-platform x86_64-manylinux2014 --target /worker_service_python_env burla==<CURRENT_BURLA_VERSION>`.
5. `cp $(command -v uv) /worker_service_python_env/bin/uv` so worker_server.py's `shutil.which("uv")` short-circuits.

The whole block is wrapped in a subshell + `|| true`. If the image has no python on PATH, the registry isn't accessible, or anything else goes wrong, the env is just left empty and worker_server.py falls through to its existing GitHub uv download / `uv pip install burla` path on first boot. We never let a pre-populate failure trigger the outer `set -Eeuo pipefail` trap and delete the VM.

Multi-image cluster configs use the first image's python version. That's the same constraint the existing per-worker install has: a single `/worker_service_python_env` is mounted into every worker regardless of image, so it has to pick one ABI tag. Not addressed in this PR.

`worker_server.py` short-circuits cleanly because it already checks both `shutil.which("uv")` and `importlib.metadata.version("burla") == target_burla_version`.

Why

Worker[0] was consistently exceeding the 20s boot budget on fresh VMs because it had to download uv from GitHub plus install ~20 PyPI deps inside the container before opening its TCP socket. In burla-3 that produced 36 `Worker boot timed out after 20 seconds` failures on 2026-05-06 alone, every single one tracing back to `lifecycle_endpoints.py:374 await workers[0].boot()` with the buffered worker logs containing only `3.12` (the python-version line printed before the install steps).

Test plan

- […] `burla-agent-04`), ran `make -f makefile remote-dev`.
- […] `cluster_config.Nodes[0].containers[0].image = "python:3.13"` (intentionally different from the previous default 3.12) and triggered `/v1/cluster/restart`.
- […] `burla-node-2da1c4dc`):
  - `find /worker_service_python_env -name "*cpython-313*" | wc -l` -> 15 (cp313 .so files for `frozenlist`, `propcache`, `charset_normalizer`, etc.).
  - `find /worker_service_python_env -name "*cpython-312*"` -> 0.
  - `cryptography-48.0.0.dist-info/WHEEL` -> `Tag: cp311-abi3-manylinux_2_17_x86_64` (abi3-stable, runs on 3.13).
- […] `3.13` -> worker_server.py skipped both download paths.
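The `find` checks in the test plan can be approximated with a small Python helper, sketched here as an assumption rather than as part of this PR. The name `count_abi_tags` is hypothetical. Unlike the `find` patterns, it only looks at `.so` files, so it would not count other files that happen to have `cpython-313` in their names.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the interpreter tag in names like "_foo.cpython-313-x86_64-linux-gnu.so".
CPYTHON_TAG = re.compile(r"\.cpython-(\d+)")


def count_abi_tags(env_dir: str) -> Counter:
    """Count compiled extension modules under env_dir, grouped by ABI tag."""
    counts: Counter = Counter()
    for path in Path(env_dir).rglob("*.so"):
        match = CPYTHON_TAG.search(path.name)
        if match:
            counts[match.group(1)] += 1   # e.g. "313" or "312"
        elif ".abi3." in path.name:
            counts["abi3"] += 1           # stable-ABI extension, not tied to one minor version
        else:
            counts["untagged"] += 1
    return counts
```

Under this sketch, an env populated for a `python:3.13` image would be expected to show entries under `"313"` and possibly `"abi3"`, and none under `"312"`.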