Migrate to OSDC runners and containerize the GPU workflows by huydhn · Pull Request #179 · pytorch/pytorch-integration-testing

huydhn · 2026-05-29T17:35:55Z

Migrates this repo off the AWS H100/A100 runners onto OSDC (ARC) runners, and rewrites the GPU workflows to run their workloads inside a job-level `container:`. OSDC/ARC runners are ephemeral pods with no Docker daemon, so the old `docker run` … `docker exec` pattern cannot work on them. The container pattern matches pytorch/pytorch `_linux-test.yml` (test-osdc) and pytorch/helion.

1. Runner label migration (arc.yaml mapping + `mt-` prefix)

| Old | OSDC label |
| --- | --- |
| `linux.aws.a100` | `mt-l-x86iavx512-11-125-a100` |
| `linux.aws.h100` | `mt-l-x86iamx-22-225-h100` |
| `linux.aws.h100.4` | `mt-l-x86iamx-88-900-h100-4` |
| `linux.aws.h100.8` | `mt-l-bx86iamx-176-1800-h100-8` |

2. Containerization

- vllm-ci-test.yml / vllm-profiling.yml: add an ubuntu-latest `resolve-image` pre-job (`docker manifest inspect` needs a daemon the pod lacks) that picks the latest available vLLM CI image and passes it to a job-level `container:` (`--gpus all`). Drop the `docker run`/`docker exec` wrapper and the /tmp/workspace bind mount; run scripts natively. run_vllm_profiling.sh now uses `$GITHUB_WORKSPACE`. Profiling assumes the upload IAM role via OIDC.
- pytorch-bisect.yaml: the mt- path builds PyTorch inside pytorch/pytorch:2.12.0-cuda13.0-cudnn9-devel (conditional `container:`); linux.dgx.b200 keeps the bare-host path. On the container path, CUDA_HOME points at the image's CUDA install.
- vllm-benchmark.yml / sglang-benchmark.yml: generate_vllm_benchmark_matrix.py now emits a per-entry `device-name`; set-parameters resolves the per-device image up front and enriches the matrix with `container-image` + `container-options`; every device runs in a `container:` with native execution; all devices assume the upload role via OIDC.

⚠️ Needs CI validation (no OSDC/host runner available locally)

The changes were checked for YAML validity, and the matrix/enrichment pipeline was exercised end-to-end, but the runtime behavior must still be confirmed on real runners:

- rocm / hpu container options are best-effort (/dev/kfd + /dev/dri, Habana runtime); there is no verified evidence that they work as job-level `container.options` on these pods.
- `--shm-size` (4g / 32g) may be silently capped on ARC pods; large multi-GPU NCCL runs may need attention.
- The per-model S3 "already benchmarked" dedup in the benchmarks was dropped because it needs the runtime GPU device-type, which is unknown before the container starts. The resolved commit is now benchmarked unconditionally.
- The bisect CUDA-devel image tag (2.12.0-cuda13.0-cudnn9-devel) is unverified, as is whether the tritonparse build scripts work in-container.
- OIDC token retrieval from inside the container, which the S3 upload role depends on, is unverified.

Out of scope (not OSDC)

flash_attention.yml (b200 DGX) and inductor.yml / tritonbench*.yml (AWS g5 / b200, reusable workflow) are left untouched.

🤖 Generated with Claude Code

Switch all linux.aws.h100* and linux.aws.a100 runner labels to their OSDC/ARC equivalents. Labels follow the mapping in pytorch/pytorch .github/arc.yaml, with the mt- (Meta multi-tenant) prefix that OSDC production runners use:

- linux.aws.a100 -> mt-l-x86iavx512-11-125-a100 (1 GPU)
- linux.aws.h100 -> mt-l-x86iamx-22-225-h100 (1 GPU)
- linux.aws.h100.4 -> mt-l-x86iamx-88-900-h100-4 (4 GPU)
- linux.aws.h100.8 -> mt-l-bx86iamx-176-1800-h100-8 (8 GPU)

Files:

- generate_vllm_benchmark_matrix.py: TP_TO_RUNNER_MAPPING and RUNNER_TO_PLATFORM_MAPPING get the full label rename. In PLATFORM_SKIPS the skip tokens become the bare GPU-type 'h100'/'a100' so they remain a substring of the OSDC names, preserving the 'skip the whole family' behavior the substring matcher relies on (matters for h100, which has 1/4/8-GPU variants).
- vllm-ci-test.yml, vllm-profiling.yml, pytorch-bisect.yaml: runs-on / runner choice updated.
- test fixture: expected runner values updated to the OSDC names.

The matrix output is unchanged except for the runner label strings (verified: every model<->runner pairing is identical after the rename).
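The reason the skip tokens had to become bare GPU types can be shown with a minimal sketch. The `is_skipped` and `skipped_runners` helpers are hypothetical names used only for illustration and are not the actual contents of `PLATFORM_SKIPS` or the matcher in generate_vllm_benchmark_matrix.py; only the runner labels are taken from this commit message.

```python
from typing import Iterable, List

# Old AWS label -> OSDC label, copied from the mapping in this commit message.
OLD_TO_OSDC = {
    "linux.aws.a100": "mt-l-x86iavx512-11-125-a100",
    "linux.aws.h100": "mt-l-x86iamx-22-225-h100",
    "linux.aws.h100.4": "mt-l-x86iamx-88-900-h100-4",
    "linux.aws.h100.8": "mt-l-bx86iamx-176-1800-h100-8",
}


def is_skipped(runner: str, skip_tokens: Iterable[str]) -> bool:
    """Hypothetical substring matcher: skip a runner if any token occurs in its label."""
    return any(token in runner for token in skip_tokens)


def skipped_runners(skip_tokens: Iterable[str]) -> List[str]:
    """Return the OSDC labels that the given skip tokens would exclude."""
    tokens = list(skip_tokens)
    return sorted(r for r in OLD_TO_OSDC.values() if is_skipped(r, tokens))


if __name__ == "__main__":
    # The bare token still covers the 1-, 4- and 8-GPU H100 variants.
    print(skipped_runners(["h100"]))
    # The old full label is no longer a substring of any OSDC label.
    print(skipped_runners(["linux.aws.h100"]))
```

Under this assumed matcher, a token such as `linux.aws.h100` would silently stop matching after the rename, which is the failure mode the bare `h100`/`a100` tokens avoid.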

OSDC/ARC runners are ephemeral pods with no Docker daemon, so the old 'docker run --gpus all + docker exec' pattern cannot work on them. Run the vLLM CI image via the job-level container: key with options '--gpus all' instead (the GPU is injected by the runner pod), matching pytorch/pytorch _linux-test.yml (test-osdc) and pytorch/helion.

- vllm-ci-test.yml / vllm-profiling.yml: add an ubuntu-latest 'resolve-image' pre-job that runs 'docker manifest inspect' (needs a daemon the pod lacks) to pick the latest available vLLM CI image and pass it down as the container image. Drop the GPU_FLAG/docker run/docker exec wrapper and the /tmp/workspace bind-mount; run the scripts directly in the container.
- vllm-profiling.yml: assume the upload IAM role via OIDC before the S3 upload (ephemeral pods have no host instance role); pass the resolved vLLM commit through as S3_HEAD_SHA.
- run_vllm_profiling.sh: use $GITHUB_WORKSPACE instead of the hardcoded /tmp/workspace bind-mount path.
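The 'resolve-image' logic (walk back from the newest commit until an image exists for it) can be sketched as follows. This is an illustration under stated assumptions, not the workflow's code, which is a bash step: `resolve_image` and `docker_manifest_exists` are hypothetical names, and the commit list is passed in rather than read from git. One deliberate difference: the sketch returns `None` when no image is found, whereas the bash loop shown in the diff emits the last commit it checked.

```python
import subprocess
from typing import Callable, Optional, Sequence


def docker_manifest_exists(image: str) -> bool:
    """Return True if `docker manifest inspect` succeeds (needs a Docker CLI and registry access)."""
    result = subprocess.run(
        ["docker", "manifest", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def resolve_image(
    prefix: str,
    commits: Sequence[str],
    image_exists: Callable[[str], bool] = docker_manifest_exists,
) -> Optional[str]:
    """Return the first `prefix:sha` that exists, scanning commits newest first."""
    for sha in commits:
        image = f"{prefix}:{sha}"
        if image_exists(image):
            return image
    # Assumption of this sketch: report "nothing found" explicitly.
    return None


if __name__ == "__main__":
    # Fake registry for demonstration: only the third-newest commit has an image.
    published = {"public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:ccc"}
    print(resolve_image(
        "public.ecr.aws/q9t5s3a7/vllm-ci-test-repo",
        ["aaa", "bbb", "ccc", "ddd"],
        image_exists=lambda image: image in published,
    ))
```

Injecting `image_exists` keeps the search testable without a daemon, which mirrors why the real lookup has to run on ubuntu-latest rather than on the pod.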

@@ -52,11 +44,12 @@ jobs:
          ref: ${{ inputs.vllm_branch || 'main' }}
          fetch-depth: 0

-      - name: Set Docker registry
-        shell: bash
+      - name: Resolve the latest available vLLM CI image
+        id: resolve
+        working-directory: vllm
        env:
          HEAD_BRANCH: ${{ inputs.vllm_branch || 'main' }}
-          DEVICE_NAME: ${{ matrix.device-name }}
+          HEAD_SHA: ${{ inputs.vllm_commit || '' }}
        run: |
          set -eux

@@ -67,67 +60,59 @@ jobs:
            DOCKER_IMAGE_PREFIX=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo
          fi

-          DOCKER_IMAGE_SUFFIX=""
-          if [[ "${DEVICE_NAME}" == "rocm" ]]; then
-            DOCKER_IMAGE_PREFIX=docker.io/rocm/vllm-ci
-          elif [[ "${DEVICE_NAME}" == "cpu" ]]; then
-            DOCKER_IMAGE_SUFFIX=-cpu
-          fi
-          echo "DOCKER_IMAGE_PREFIX=$DOCKER_IMAGE_PREFIX" >> $GITHUB_ENV
-          echo "DOCKER_IMAGE_SUFFIX=$DOCKER_IMAGE_SUFFIX" >> $GITHUB_ENV
-
-      - name: Check for available Docker image
-        working-directory: vllm
-        env:
-          HEAD_BRANCH: ${{ inputs.vllm_branch || 'main' }}
-          HEAD_SHA: ${{ inputs.vllm_commit || '' }}
-        run: |
-          set -eux
-
          if [[ -z "${HEAD_SHA}" ]]; then
            # Looking back the latest 100 commits is enough
-            for i in {0..99}
-            do
+            for i in {0..99}; do
              # Check if the image is there, if it doesn't then check an older one
              # because the commit is too recent
              HEAD_SHA=$(git rev-parse --verify HEAD~${i})
-              DOCKER_IMAGE="${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}${DOCKER_IMAGE_SUFFIX}"
-
-              # No Docker image available yet because the commit is too recent
+              DOCKER_IMAGE="${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}"
              if docker manifest inspect "${DOCKER_IMAGE}"; then
                break
              fi
            done
          fi

-          echo "HEAD_SHA=$HEAD_SHA" >> $GITHUB_ENV
+          echo "docker-image=${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}" >> "${GITHUB_OUTPUT}"

-      - name: Setup CUDA GPU_FLAG for docker run
-        if: matrix.device-name == 'cuda'
+  test:
+    name: Run vLLM tests
+    needs: resolve-image
+    if: ${{ !github.event.pull_request.head.repo.fork && github.repository_owner == 'pytorch' }}
+    strategy:


@@ -124,98 +60,86 @@ jobs:
            DOCKER_IMAGE_PREFIX=public.ecr.aws/q9t5s3a7/vllm-ci-test-repo
          fi

-          DOCKER_IMAGE_SUFFIX=""
-          if [[ "${DEVICE_NAME}" == "rocm" ]]; then
-            DOCKER_IMAGE_PREFIX=docker.io/rocm/vllm-ci
-          elif [[ "${DEVICE_NAME}" == "cpu" ]]; then
-            DOCKER_IMAGE_SUFFIX=-cpu
-          fi
-          echo "DOCKER_IMAGE_PREFIX=$DOCKER_IMAGE_PREFIX" >> $GITHUB_ENV
-          echo "DOCKER_IMAGE_SUFFIX=$DOCKER_IMAGE_SUFFIX" >> $GITHUB_ENV
-
-      - name: Check for last commit
-        working-directory: vllm-profiling/vllm
-        env:
-          HEAD_BRANCH: ${{ inputs.vllm_branch || 'main' }}
-          HEAD_SHA: ${{ inputs.vllm_commit || '' }}
-        run: |
-          set -eux
-
          if [[ -z "${HEAD_SHA}" ]]; then
-            for i in {0..99}
-            do
+            for i in {0..99}; do
              HEAD_SHA=$(git rev-parse --verify HEAD~${i})
-              DOCKER_IMAGE="${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}${DOCKER_IMAGE_SUFFIX}"
-
+              DOCKER_IMAGE="${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}"
              # Docker image available for this commit, then exit
              if docker manifest inspect "${DOCKER_IMAGE}"; then
                break
              fi
            done
          fi

-          echo "HEAD_SHA=$HEAD_SHA" >> $GITHUB_ENV
+          echo "docker-image=${DOCKER_IMAGE_PREFIX}:${HEAD_SHA}" >> "${GITHUB_OUTPUT}"
+          echo "head-sha=${HEAD_SHA}" >> "${GITHUB_OUTPUT}"
          echo "### Run profiling on [${HEAD_SHA}](https://github.com/vllm-project/vllm/commit/${HEAD_SHA})" >> "${GITHUB_STEP_SUMMARY}"

-      - name: Setup CUDA GPU_FLAG for docker run
-        if: env.DEVICE_NAME == 'cuda'
+  profiling:
+    name: Run vLLM profiling
+    needs: resolve-image


…ainer The mt- runner is an ephemeral OSDC pod with no host CUDA toolchain, so build PyTorch inside pytorch/pytorch:2.12.0-cuda13.0-cudnn9-devel (--gpus all) when the mt- runner is selected; linux.dgx.b200 keeps the existing bare-host path (conditional container via fromJSON('null')). CUDA_HOME points at the image's /usr/local/cuda on the container path (run.sh requires it non-empty). Add a git safe.directory step for the root-owned in-container checkout.

Run every matrix device inside a job-level container: instead of the old 'docker run + docker exec' pattern, since OSDC/ARC pods have no Docker daemon.

- generate_vllm_benchmark_matrix.py: emit a per-entry 'device-name' so the workflow can resolve the container image up front (regenerated the test fixture, which also clears pre-existing config drift).
- set-parameters: resolve the upstream image on ubuntu-latest (which has a daemon) via 'docker manifest inspect', then enrich every matrix entry with container-image + device-appropriate container-options. sglang resolves per image suffix (cuda / -cu128-b200 / -rocm630-mi30x) and skips non-cuda/rocm devices instead of failing the whole matrix.
- benchmarks job: add container: { image, options }, drop the device probe (device-name comes from the matrix) while keeping the runtime DEVICE_TYPE detection, run the benchmark script natively, and assume the upload IAM role via OIDC for all devices (no host instance role inside a pod). chown is made sudo-optional for the in-container root user.

Flagged for CI validation: the per-model S3 'already benchmarked' dedup is dropped (needs the runtime device-type before the container exists); the rocm/hpu container options are best-effort; and --shm-size may be capped on ARC pods.
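The enrichment step described for set-parameters can be sketched roughly as below. Everything here is an assumption made for illustration: `enrich_matrix` and `CONTAINER_OPTIONS` are hypothetical names, and the option strings are guesses based on the devices named in this PR (`--gpus all`, /dev/kfd + /dev/dri, Habana runtime), not the values the workflow actually emits. Only the keys `device-name`, `container-image`, and `container-options` come from the commit message.

```python
from typing import Callable, Dict, List, Optional

# Assumed per-device container options; the real workflow may use different strings.
CONTAINER_OPTIONS: Dict[str, str] = {
    "cuda": "--gpus all",
    "rocm": "--device=/dev/kfd --device=/dev/dri",
    "hpu": "--runtime=habana",
}


def enrich_matrix(
    entries: List[Dict[str, str]],
    resolve_image: Callable[[str], Optional[str]],
) -> List[Dict[str, str]]:
    """Add container-image/container-options to each entry, skipping unresolvable devices."""
    enriched = []
    for entry in entries:
        device = entry["device-name"]
        image = resolve_image(device)
        if image is None or device not in CONTAINER_OPTIONS:
            # Skip this entry instead of failing the whole matrix.
            continue
        enriched.append({
            **entry,
            "container-image": image,
            "container-options": CONTAINER_OPTIONS[device],
        })
    return enriched


if __name__ == "__main__":
    matrix = [
        {"model": "example-model", "device-name": "cuda"},
        {"model": "example-model", "device-name": "xpu"},
    ]
    # Hypothetical resolver: only cuda has an image in this demo.
    print(enrich_matrix(matrix, lambda d: "example/image:tag" if d == "cuda" else None))
```

Because the entries are copied rather than mutated, the generated matrix stays reusable, and a device without a resolvable image simply drops out of the job list.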

huydhn had a problem deploying to pytorch-x-vllm May 29, 2026 17:36 — with GitHub Actions Error

meta-cla Bot added the cla signed label May 29, 2026

huydhn force-pushed the migrate-h100-to-osdc-runners branch from 8d498d9 to a7a54a0 Compare May 29, 2026 19:18

huydhn had a problem deploying to pytorch-x-vllm May 29, 2026 19:19 — with GitHub Actions Failure

huydhn changed the title ~~Migrate linux.aws.h100 runners to OSDC (ARC) runners~~ Migrate linux.aws.h100/a100 runners to OSDC (ARC) runners May 29, 2026

huydhn had a problem deploying to pytorch-x-vllm May 29, 2026 19:25 — with GitHub Actions Failure

huydhn had a problem deploying to pytorch-x-vllm May 29, 2026 19:50 — with GitHub Actions Failure

huydhn temporarily deployed to pytorch-x-vllm May 29, 2026 19:50 — with GitHub Actions Inactive

github-advanced-security AI found potential problems May 29, 2026


huydhn had a problem deploying to pytorch-x-vllm May 29, 2026 20:05 — with GitHub Actions Failure

huydhn temporarily deployed to pytorch-x-vllm May 29, 2026 20:06 — with GitHub Actions Inactive

huydhn changed the title ~~Migrate linux.aws.h100/a100 runners to OSDC (ARC) runners~~ Migrate to OSDC runners and containerize the GPU workflows May 29, 2026

huydhn temporarily deployed to pytorch-x-vllm May 29, 2026 20:33 — with GitHub Actions Inactive

huydhn had a problem deploying to pytorch-x-vllm May 29, 2026 20:33 — with GitHub Actions Failure

huydhn wants to merge 4 commits into main from migrate-h100-to-osdc-runners


2 participants
