Migrate to OSDC runners and containerize the GPU workflows #179
Open
huydhn wants to merge 4 commits into
Switch all linux.aws.h100* and linux.aws.a100 runner labels to their OSDC/ARC equivalents. Labels follow the mapping in pytorch/pytorch .github/arc.yaml, with the mt- (Meta multi-tenant) prefix that OSDC production runners use:

  linux.aws.a100   -> mt-l-x86iavx512-11-125-a100 (1 GPU)
  linux.aws.h100   -> mt-l-x86iamx-22-225-h100 (1 GPU)
  linux.aws.h100.4 -> mt-l-x86iamx-88-900-h100-4 (4 GPU)
  linux.aws.h100.8 -> mt-l-bx86iamx-176-1800-h100-8 (8 GPU)

Files:
- generate_vllm_benchmark_matrix.py: TP_TO_RUNNER_MAPPING and RUNNER_TO_PLATFORM_MAPPING get the full label rename. In PLATFORM_SKIPS the skip tokens become the bare GPU types 'h100'/'a100', so they remain substrings of the OSDC names. This preserves the 'skip the whole family' behavior the substring matcher relies on, which matters for h100 because it has 1/4/8-GPU variants.
- vllm-ci-test.yml, vllm-profiling.yml, pytorch-bisect.yaml: runs-on / runner choice updated.
- test fixture: expected runner values updated to the OSDC names.

The matrix output is unchanged except for the runner label strings (verified: every model<->runner pairing is identical after the rename).
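To make the PLATFORM_SKIPS change concrete, here is a minimal Python sketch of a substring-based skip check. The label mapping is copied from the commit message; the function name `is_skipped` and the flat list of skip tokens are hypothetical and are not the actual code in generate_vllm_benchmark_matrix.py.

```python
from typing import Iterable

# Old AWS label -> OSDC label, as listed in the commit message.
RUNNER_RENAMES = {
    "linux.aws.a100": "mt-l-x86iavx512-11-125-a100",
    "linux.aws.h100": "mt-l-x86iamx-22-225-h100",
    "linux.aws.h100.4": "mt-l-x86iamx-88-900-h100-4",
    "linux.aws.h100.8": "mt-l-bx86iamx-176-1800-h100-8",
}


def is_skipped(runner: str, skip_tokens: Iterable[str]) -> bool:
    """Hypothetical matcher: skip if any token is a substring of the runner label."""
    return any(token in runner for token in skip_tokens)


# A bare GPU-type token covers the whole family (1/4/8-GPU variants).
h100_runners = [new for old, new in RUNNER_RENAMES.items() if "h100" in old]
print(all(is_skipped(r, ["h100"]) for r in h100_runners))  # True
# The old full label is no longer a substring of any OSDC name.
print(any(is_skipped(r, ["linux.aws.h100"]) for r in RUNNER_RENAMES.values()))  # False
```

With these labels the bare token 'h100' matches all three H100 variants and not the A100 label, while a leftover 'linux.aws.h100' token would silently stop matching anything.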
8d498d9 to a7a54a0
OSDC/ARC runners are ephemeral pods with no Docker daemon, so the old 'docker run --gpus all + docker exec' pattern cannot work on them. Instead, run the vLLM CI image via the job-level 'container:' key with options '--gpus all' (the GPU is injected by the runner pod), matching pytorch/pytorch _linux-test.yml (test-osdc) and pytorch/helion.

- vllm-ci-test.yml / vllm-profiling.yml: add an ubuntu-latest 'resolve-image' pre-job that runs 'docker manifest inspect' (which needs a daemon the pod lacks) to pick the latest available vLLM CI image and pass it down as the container image. Drop the GPU_FLAG/docker run/docker exec wrapper and the /tmp/workspace bind mount; run the scripts directly in the container.
- vllm-profiling.yml: assume the upload IAM role via OIDC before the S3 upload (ephemeral pods have no host instance role); pass the resolved vLLM commit through as S3_HEAD_SHA.
- run_vllm_profiling.sh: use $GITHUB_WORKSPACE instead of the hardcoded /tmp/workspace bind-mount path.
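The resolve-image walk-back can be sketched in Python as follows. This is an illustration, not the workflow code: the commit list is passed in (HEAD first) instead of calling git rev-parse HEAD~i, and the existence check is injected so it can be the real `docker manifest inspect` call, which per the commit message needs a Docker daemon and therefore runs on ubuntu-latest. The names `resolve_image` and `manifest_exists` are hypothetical.

```python
import subprocess
from typing import Callable, Optional, Sequence


def manifest_exists(image: str) -> bool:
    """Real-world check: exit code of `docker manifest inspect` (needs the docker CLI)."""
    result = subprocess.run(
        ["docker", "manifest", "inspect", image], capture_output=True
    )
    return result.returncode == 0


def resolve_image(
    commits: Sequence[str],
    prefix: str,
    suffix: str = "",
    image_exists: Callable[[str], bool] = manifest_exists,
    max_lookback: int = 100,
) -> Optional[str]:
    """Return the newest prefix:sha+suffix image that exists, or None.

    commits[0] is HEAD, commits[1] is HEAD~1, and so on; only the first
    max_lookback commits are tried, like the {0..99} loop in the workflow.
    """
    for sha in commits[:max_lookback]:
        image = f"{prefix}:{sha}{suffix}"
        if image_exists(image):
            return image
    return None


# The newest commit has no image yet, so the walk-back settles on HEAD~1.
available = {"repo:bbb"}
print(resolve_image(["ccc", "bbb", "aaa"], "repo", image_exists=available.__contains__))  # repo:bbb
```

One deliberate difference: when no image is found, the shell loop falls through with the oldest SHA it tried, whereas this sketch returns None so a caller can fail loudly.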
Comment on lines 33 to +82
@@ -52,11 +44,12 @@ jobs:
ref: ${{ inputs.vllm_branch || 'main' }}
fetch-depth: 0

- name: Set Docker registry
shell: bash
- name: Resolve the latest available vLLM CI image
id: resolve
working-directory: vllm
env:
HEAD_BRANCH: ${{ inputs.vllm_branch || 'main' }}
DEVICE_NAME: ${{ matrix.device-name }}
HEAD_SHA: ${{ inputs.vllm_commit || '' }}
run: |
set -eux

@@ -67,67 +60,59 @@ jobs:
DOCKER_IMAGE_PREFIX=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo
fi

DOCKER_IMAGE_SUFFIX=""
if [[ "${DEVICE_NAME}" == "rocm" ]]; then
DOCKER_IMAGE_PREFIX=docker.io/rocm/vllm-ci
elif [[ "${DEVICE_NAME}" == "cpu" ]]; then
DOCKER_IMAGE_SUFFIX=-cpu
fi
echo "DOCKER_IMAGE_PREFIX=$DOCKER_IMAGE_PREFIX" >> $GITHUB_ENV
echo "DOCKER_IMAGE_SUFFIX=$DOCKER_IMAGE_SUFFIX" >> $GITHUB_ENV

- name: Check for available Docker image
working-directory: vllm
env:
HEAD_BRANCH: ${{ inputs.vllm_branch || 'main' }}
HEAD_SHA: ${{ inputs.vllm_commit || '' }}
run: |
set -eux

if [[ -z "${HEAD_SHA}" ]]; then
# Looking back the latest 100 commits is enough
for i in {0..99}
do
for i in {0..99}; do
# Check if the image is there, if it doesn't then check an older one
# because the commit is too recent
HEAD_SHA=$(git rev-parse --verify HEAD~${i})
DOCKER_IMAGE="${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}${DOCKER_IMAGE_SUFFIX}"

# No Docker image available yet because the commit is too recent
DOCKER_IMAGE="${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}"
if docker manifest inspect "${DOCKER_IMAGE}"; then
break
fi
done
fi

echo "HEAD_SHA=$HEAD_SHA" >> $GITHUB_ENV
echo "docker-image=${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}" >> "${GITHUB_OUTPUT}"

- name: Setup CUDA GPU_FLAG for docker run
if: matrix.device-name == 'cuda'
test:
name: Run vLLM tests
needs: resolve-image
if: ${{ !github.event.pull_request.head.repo.fork && github.repository_owner == 'pytorch' }}
strategy:
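The prefix/suffix selection in the hunk above reduces to a small pure function. A minimal Python sketch, assuming the default (CUDA) prefix is chosen earlier in the step; that branch is outside the visible hunk, so it is a parameter here. `image_prefix_and_suffix` and `image_ref` are hypothetical helper names.

```python
from typing import Tuple

ROCM_PREFIX = "docker.io/rocm/vllm-ci"


def image_prefix_and_suffix(device_name: str, default_prefix: str) -> Tuple[str, str]:
    """Mirror the rocm/cpu branches from the workflow hunk."""
    prefix, suffix = default_prefix, ""
    if device_name == "rocm":
        prefix = ROCM_PREFIX  # ROCm uses a different image prefix
    elif device_name == "cpu":
        suffix = "-cpu"  # CPU keeps the prefix but adds a tag suffix
    return prefix, suffix


def image_ref(device_name: str, default_prefix: str, sha: str) -> str:
    """Hypothetical helper: full image reference for one commit."""
    prefix, suffix = image_prefix_and_suffix(device_name, default_prefix)
    return f"{prefix}:{sha}{suffix}"


# Example prefix taken from the hunk.
ECR = "public.ecr.aws/q9t5s3a7/vllm-ci-test-repo"
print(image_ref("cpu", ECR, "abc123"))  # public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:abc123-cpu
print(image_ref("rocm", ECR, "abc123"))  # docker.io/rocm/vllm-ci:abc123
```

Within the hunk, one DOCKER_IMAGE assignment appends DOCKER_IMAGE_SUFFIX, while the docker-image output line is written as prefix:HEAD_SHA only, so for cpu entries it may be worth confirming which tag the job-level container actually receives.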
Comment on lines 33 to +80
@@ -124,98 +60,86 @@ jobs:
DOCKER_IMAGE_PREFIX=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo
fi

DOCKER_IMAGE_SUFFIX=""
if [[ "${DEVICE_NAME}" == "rocm" ]]; then
DOCKER_IMAGE_PREFIX=docker.io/rocm/vllm-ci
elif [[ "${DEVICE_NAME}" == "cpu" ]]; then
DOCKER_IMAGE_SUFFIX=-cpu
fi
echo "DOCKER_IMAGE_PREFIX=$DOCKER_IMAGE_PREFIX" >> $GITHUB_ENV
echo "DOCKER_IMAGE_SUFFIX=$DOCKER_IMAGE_SUFFIX" >> $GITHUB_ENV

- name: Check for last commit
working-directory: vllm-profiling/vllm
env:
HEAD_BRANCH: ${{ inputs.vllm_branch || 'main' }}
HEAD_SHA: ${{ inputs.vllm_commit || '' }}
run: |
set -eux

if [[ -z "${HEAD_SHA}" ]]; then
for i in {0..99}
do
for i in {0..99}; do
HEAD_SHA=$(git rev-parse --verify HEAD~${i})
DOCKER_IMAGE="${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}${DOCKER_IMAGE_SUFFIX}"

DOCKER_IMAGE="${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}"
# Docker image available for this commit, then exit
if docker manifest inspect "${DOCKER_IMAGE}"; then
break
fi
done
fi

echo "HEAD_SHA=$HEAD_SHA" >> $GITHUB_ENV
echo "docker-image=${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}" >> "${GITHUB_OUTPUT}"
echo "head-sha=${HEAD_SHA}" >> "${GITHUB_OUTPUT}"
echo "### Run profiling on [${HEAD_SHA}](https://github.com/vllm-project/vllm/commit/${HEAD_SHA})" >> "${GITHUB_STEP_SUMMARY}"

- name: Setup CUDA GPU_FLAG for docker run
if: env.DEVICE_NAME == 'cuda'
profiling:
name: Run vLLM profiling
needs: resolve-image
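The resolve step hands its results to later jobs through step outputs: each `echo "name=value" >> "${GITHUB_OUTPUT}"` appends one line to a file whose path GitHub Actions provides in that variable. A minimal Python equivalent, for illustration only (`append_outputs` is a hypothetical name). For the profiling job to read these values through `needs.resolve-image.outputs`, the resolve-image job also needs a job-level `outputs:` mapping, which is not visible in these hunks.

```python
import os
from typing import Mapping, Optional


def append_outputs(outputs: Mapping[str, str], path: Optional[str] = None) -> None:
    """Append name=value lines to the step-output file, like `echo ... >> "$GITHUB_OUTPUT"`."""
    path = path or os.environ["GITHUB_OUTPUT"]
    with open(path, "a", encoding="utf-8") as fh:
        for name, value in outputs.items():
            if "\n" in value:
                # Multi-line values need the name<<DELIMITER syntax instead.
                raise ValueError(f"output {name!r} must be a single line")
            fh.write(f"{name}={value}\n")
```

Inside a step this would be called as `append_outputs({"docker-image": image, "head-sha": sha})`; opening in append mode matters because earlier steps may already have written to the same file.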
…ainer
The mt- runner is an ephemeral OSDC pod with no host CUDA toolchain, so build
PyTorch inside pytorch/pytorch:2.12.0-cuda13.0-cudnn9-devel (--gpus all) when
the mt- runner is selected; linux.dgx.b200 keeps the existing bare-host path
(conditional container via fromJSON('null')). CUDA_HOME points at the image's
/usr/local/cuda on the container path (run.sh requires it to be non-empty). Add a
git safe.directory step for the root-owned in-container checkout.
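The conditional container can be modeled as "a container spec for mt- runners, null otherwise". A Python sketch under that assumption: `container_for_runner`, `container_json`, and `cuda_home` are hypothetical helpers, the bare-host CUDA_HOME is passed in because the commit does not state it, and serializing None to JSON yields the same `null` the workflow produces with `fromJSON('null')`.

```python
import json
from typing import Dict, Optional

BUILD_IMAGE = "pytorch/pytorch:2.12.0-cuda13.0-cudnn9-devel"


def container_for_runner(runner: str) -> Optional[Dict[str, str]]:
    """Container spec for mt- (OSDC) runners; None means run on the bare host."""
    if runner.startswith("mt-"):
        return {"image": BUILD_IMAGE, "options": "--gpus all"}
    return None  # e.g. linux.dgx.b200 keeps the bare-host path


def container_json(runner: str) -> str:
    """What a workflow expression would hand to fromJSON(): a spec or 'null'."""
    return json.dumps(container_for_runner(runner))


def cuda_home(runner: str, host_cuda_home: str) -> str:
    """Inside the container, point at the image's CUDA; otherwise keep the host value."""
    return "/usr/local/cuda" if container_for_runner(runner) else host_cuda_home


print(container_json("linux.dgx.b200"))  # null
```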

Run every matrix device inside a job-level 'container:' instead of the old
'docker run + docker exec' pattern, since OSDC/ARC pods have no Docker daemon.
- generate_vllm_benchmark_matrix.py: emit a per-entry 'device-name' so the
workflow can resolve the container image up front (regenerated the test
fixture, which also clears pre-existing config drift).
- set-parameters: resolve the upstream image on ubuntu-latest (which has a
daemon) via 'docker manifest inspect', then enrich every matrix entry with
container-image + device-appropriate container-options. sglang resolves per
image suffix (cuda / -cu128-b200 / -rocm630-mi30x) and skips non-cuda/rocm
devices instead of failing the whole matrix.
- benchmarks job: add container: { image, options }, drop the device probe
(device-name comes from the matrix) while keeping the runtime DEVICE_TYPE
detection, run the benchmark script natively, and assume the upload IAM
role via OIDC for all devices (no host instance role inside a pod).
chown is made sudo-optional for the in-container root user.
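The set-parameters enrichment can be sketched as a pure function over matrix entries. All of the following are hypothetical: entries carry an `engine` key, image resolution is a callback, and the per-device options map is illustrative. Only '--gpus all' for CUDA and the ROCm device nodes /dev/kfd and /dev/dri appear in the PR, and the real ROCm/HPU options are flagged there as best-effort.

```python
from typing import Callable, Dict, List

# Illustrative docker options per device; other devices get no extra options here.
CONTAINER_OPTIONS: Dict[str, str] = {
    "cuda": "--gpus all",
    "rocm": "--device=/dev/kfd --device=/dev/dri",
}
SGLANG_DEVICES = {"cuda", "rocm"}


def enrich_matrix(
    entries: List[Dict[str, str]],
    resolve_image: Callable[[Dict[str, str]], str],
) -> List[Dict[str, str]]:
    """Return new entries with container-image/container-options added."""
    enriched = []
    for entry in entries:
        device = entry["device-name"]
        if entry.get("engine") == "sglang" and device not in SGLANG_DEVICES:
            continue  # skip the entry instead of failing the whole matrix
        enriched.append(
            {
                **entry,  # copy, so the input matrix is left untouched
                "container-image": resolve_image(entry),
                "container-options": CONTAINER_OPTIONS.get(device, ""),
            }
        )
    return enriched
```

Keeping the enrichment pure (a callback for image lookup, no in-place mutation) makes it easy to exercise in a unit test without Docker or network access.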
Flagged for CI validation: the per-model S3 'already benchmarked' dedup is
dropped (needs the runtime device-type before the container exists); the
rocm/hpu container options are best-effort; and --shm-size may be capped on
ARC pods.
Migrates this repo off the AWS H100/A100 runners onto OSDC (ARC) runners, and rewrites the GPU workflows to run their workloads inside a job-level `container:`. OSDC/ARC runners are ephemeral pods with no Docker daemon, so the old `docker run … docker exec` pattern cannot work on them. The container pattern matches pytorch/pytorch `_linux-test.yml` (test-osdc) and pytorch/helion.

1. Runner label migration (arc.yaml mapping + `mt-` prefix)

| Old label | OSDC label |
| --- | --- |
| `linux.aws.a100` | `mt-l-x86iavx512-11-125-a100` |
| `linux.aws.h100` | `mt-l-x86iamx-22-225-h100` |
| `linux.aws.h100.4` | `mt-l-x86iamx-88-900-h100-4` |
| `linux.aws.h100.8` | `mt-l-bx86iamx-176-1800-h100-8` |

2. Containerization
- An `ubuntu-latest` `resolve-image` pre-job (`docker manifest inspect` needs a daemon the pod lacks) picks the latest available vLLM CI image and passes it to a `container:` (`--gpus all`). Drop the `docker run`/`docker exec` wrapper and the `/tmp/workspace` bind mount; run scripts natively. `run_vllm_profiling.sh` now uses `$GITHUB_WORKSPACE`. Profiling assumes the upload IAM role via OIDC.
- The `mt-` path builds PyTorch inside `pytorch/pytorch:2.12.0-cuda13.0-cudnn9-devel` (conditional `container:`); `linux.dgx.b200` keeps the bare-host path. `CUDA_HOME` points at the image's CUDA on the container path.
- `generate_vllm_benchmark_matrix.py` now emits a per-entry `device-name`; `set-parameters` resolves the per-device image up front and enriches the matrix with `container-image` + `container-options`; every device runs in a `container:` with native execution; all devices assume the upload role via OIDC.

These were verified for YAML validity and for the matrix/enrichment pipeline end-to-end, but the runtime behavior must be confirmed on real runners:
- ROCm/HPU container options (`/dev/kfd` + `/dev/dri`, Habana runtime): there is no verified evidence they work as job-level `container.options` on these pods.
- `--shm-size` (4g / 32g) may be silently capped on ARC pods; large multi-GPU NCCL runs may need attention.
- … (`pytorch/pytorch:2.12.0-cuda13.0-cudnn9-devel`) and the tritonparse build scripts working in-container are unverified.

Out of scope (not OSDC)

- `flash_attention.yml` (b200 DGX), `inductor.yml` / `tritonbench*.yml` (AWS g5 / b200, reusable workflow): left untouched.

🤖 Generated with Claude Code