Pre-populate worker python env on host VM #207

Open
JacobZuliani wants to merge 3 commits into main from prepopulate-worker-env

JacobZuliani (Contributor) commented May 6, 2026

Summary

Move the uv download and `uv pip install burla` out of the worker container's first-run hot path and into the host VM's startup script. Detect the user image's python version dynamically, so that cp311 / cp312 / cp313 / etc. images all get matching ABI wheels.

The host startup script already has uv installed and is templated with CURRENT_BURLA_VERSION. New flow:

  1. Parse `CONTAINERS` (an env var the startup script already exports) and grab the first image.
  2. `docker pull` it, so the later pull by node_service is a no-op (saves time).
  3. `docker run --rm --entrypoint python "$IMAGE" -c '...sys.version_info...'` to get e.g. `3.13`.
  4. `uv pip install --python-version "$PY_VERSION" --python-platform x86_64-manylinux2014 --target /worker_service_python_env burla==<CURRENT_BURLA_VERSION>`.
  5. `cp $(command -v uv) /worker_service_python_env/bin/uv` so worker_server.py's `shutil.which("uv")` short-circuits.
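
As a rough illustration of steps 3 and 4, the Python sketch below shows the kind of `major.minor` string the in-container probe is expected to print and how the install arguments fit together. The helper names `python_minor_version` and `build_uv_install_command` and the version `1.2.3` are hypothetical placeholders, not part of this PR; the actual startup script does this in shell.

```python
import sys


def python_minor_version(version_info=sys.version_info) -> str:
    """Format an interpreter version as "major.minor", e.g. "3.13"."""
    return f"{version_info[0]}.{version_info[1]}"


def build_uv_install_command(py_version, burla_version,
                             target="/worker_service_python_env"):
    """Assemble the argv for the host-side install described in step 4.

    Hypothetical helper: it only builds the argument list, it does not run uv.
    """
    return [
        "uv", "pip", "install",
        "--python-version", py_version,
        "--python-platform", "x86_64-manylinux2014",
        "--target", target,
        f"burla=={burla_version}",
    ]


if __name__ == "__main__":
    # Prints the version of whichever interpreter runs this file.
    print(python_minor_version())
    # "1.2.3" stands in for the templated CURRENT_BURLA_VERSION.
    print(" ".join(build_uv_install_command("3.13", "1.2.3")))
```

Keeping the probe output down to `major.minor` means it can be passed straight to `--python-version` without further parsing.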

The whole block is wrapped in a subshell + `|| true`. If the image has no python on PATH, the registry isn't accessible, or anything else goes wrong, the env is simply left empty and worker_server.py falls through to its existing GitHub uv download / `uv pip install burla` path on first boot. A pre-populate failure is never allowed to trigger the outer `set -Eeuo pipefail` trap and delete the VM.

Multi-image cluster configs use the first image's python version. That's the same constraint the existing per-worker install has: a single /worker_service_python_env is mounted into every worker regardless of image, so it has to pick one ABI tag. Not addressed in this PR.

worker_server.py short-circuits cleanly because it already checks both shutil.which("uv") and importlib.metadata.version("burla") == target_burla_version.
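
A minimal sketch of what such a combined check might look like, assuming it boils down to the two calls named above. The function name `env_is_prepopulated` and its parameters are hypothetical; the real logic in worker_server.py may be structured differently.

```python
import importlib.metadata
import shutil


def env_is_prepopulated(target_version: str,
                        package: str = "burla",
                        tool: str = "uv") -> bool:
    """Return True only if `tool` is on PATH and `package` is installed
    at exactly `target_version`; otherwise the caller should fall back
    to its own download/install path."""
    if shutil.which(tool) is None:
        return False
    try:
        installed = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return False
    return installed == target_version
```

Requiring both conditions means a partially populated env (for example uv copied but burla missing or at the wrong version) still falls through to the existing install path.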

Why

Worker[0] was consistently exceeding the 20s boot budget on fresh VMs because it had to download uv from GitHub and install ~20 PyPI deps inside the container before opening its TCP socket. In burla-3 that produced 36 "Worker boot timed out after 20 seconds" failures on 2026-05-06 alone, every one of them tracing back to `await workers[0].boot()` at lifecycle_endpoints.py:374, with the buffered worker logs containing only 3.12 (the python-version line printed before the install steps).

Test plan

  • Implemented in worktree, no lint errors.
  • Synced to dev VM slot 04 (burla-agent-04), ran make -f makefile remote-dev.
  • Set cluster_config.Nodes[0].containers[0].image = "python:3.13" (intentionally different from the previous default 3.12) and triggered /v1/cluster/restart.
  • Verified on the booted node (burla-node-2da1c4dc):
    • find /worker_service_python_env -name "*cpython-313*" | wc -l -> 15 (cp313 .so files for frozenlist, propcache, charset_normalizer, etc.).
    • find /worker_service_python_env -name "*cpython-312*" -> 0.
    • cryptography-48.0.0.dist-info/WHEEL -> Tag: cp311-abi3-manylinux_2_17_x86_64 (abi3-stable, runs on 3.13).
  • Verified all 4 worker container logs contain only 3.13 -> worker_server.py skipped both download paths.
  • Boot timing on the test node: image pulled (cached from pre-populate, <1s) at 14:58:17, all 4 workers ready at 14:58:20 (~3s end-to-end vs ~9s previously).
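
The ABI-tag checks above can also be scripted. The sketch below approximates `find <dir> -name "*<tag>*" | wc -l` in Python; the helper name `count_paths_with_tag` is hypothetical and not part of the PR.

```python
from pathlib import Path


def count_paths_with_tag(env_dir, abi_tag: str) -> int:
    """Count paths under env_dir whose name contains abi_tag.

    Like `find -name`, this matches directories as well as files.
    """
    return sum(1 for _ in Path(env_dir).rglob(f"*{abi_tag}*"))


if __name__ == "__main__":
    env = "/worker_service_python_env"
    print("cpython-313:", count_paths_with_tag(env, "cpython-313"))
    print("cpython-312:", count_paths_with_tag(env, "cpython-312"))
```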

Worker[0] used to download uv from GitHub and `uv pip install burla` over
PyPI inside its container during boot. On a fresh VM that consistently
ran 15-25s and was tripping the 20s WORKER_BOOT_TIMEOUT_SECONDS, causing
~50% of node boots to fail in busy projects. Move both steps into the
host startup-script (which already templates CURRENT_BURLA_VERSION) so
worker_server.py short-circuits both `shutil.which("uv")` and the
`importlib.metadata.version("burla")` checks. Bumps the timeout to 60s
as a safety margin for the fallback paths.

The pre-populate path was hard-coding cp312 wheels, but the user can run
any python image. Pull the first image in CONTAINERS during the startup
script and ask its python interpreter for `sys.version_info`, then pass
that to `uv pip install --python-version` so cp311 / cp312 / cp313 (etc.)
images all get matching ABI wheels in the shared env. Wrap the whole
block in a subshell + `|| true` so an unusual image (no python on PATH,
unauthenticated registry, etc.) doesn't trigger the outer ERR trap and
delete the VM — worker_server.py will fall back to its own install path
in that case.