[CI] Move H100/A100 benchmark & tutorial jobs to OSDC (ARC) #4458

Open
huydhn wants to merge 1 commit into main from huydo/osdc-benchmark-perf-jobs

huydhn commented May 30, 2026

What

Moves the jobs that use a direct runs-on: H100/A100 runner onto OSDC (ARC). These jobs don't use linux_job, so per the agreed approach each gets a job-level container: block added in place rather than being converted to the reusable workflow.

| Workflow | Runner: EC2 → ARC | Upload |
| --- | --- | --- |
| run_microbenchmarks.yml | linux.aws.h100 → mt-l-x86iamx-22-225-h100 | yes |
| dashboard_perf_test.yml | linux.aws.a100 → mt-l-x86iavx512-11-125-a100 | yes |
| run_tutorials.yml | linux.aws.a100 → mt-l-x86iavx512-11-125-a100 | no |

Each job now runs inside pytorch/almalinux-builder:cuda13.0 with options: --gpus all (the container CUDA version is independent of the pip-installed cu126 torch; cuda13.0 matches the ARC nodes' driver). torch's pure-python deps are pre-installed from the in-cluster pypi-cache, and PIP_EXTRA_INDEX_URL is cleared on the torch install so the default cpu index can't supply a +cpu torch (same pattern as the H100 PRs).
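
As a rough illustration of the pattern described above, a job with a job-level container might look like the sketch below. It is not copied from the actual workflow files. The job name, step names, the `PYPI_CACHE_URL` placeholder, and the dependency list are assumptions for illustration; the runner label, image, and `--gpus all` option come from this PR, and the index URL is the public PyTorch nightly cu126 index, assumed to be what the workflows install from.

```yaml
# Hypothetical sketch of one migrated job; names and the cache URL are placeholders.
jobs:
  benchmark:  # hypothetical job name
    runs-on: mt-l-x86iamx-22-225-h100  # ARC runner label from this PR
    container:
      image: pytorch/almalinux-builder:cuda13.0
      options: --gpus all
    steps:
      - uses: actions/checkout@v4
      - name: Install torch (cu126 nightly)
        shell: bash
        env:
          # Placeholder: the real in-cluster pypi-cache URL is not given in this PR.
          PYPI_CACHE_URL: https://pypi-cache.example.invalid/simple
        run: |
          # Pre-install torch's pure-python deps from the cache (illustrative list).
          pip install --index-url "${PYPI_CACHE_URL}" \
            filelock typing-extensions sympy networkx fsspec
          # Clear PIP_EXTRA_INDEX_URL for this command only, so no extra index
          # can supply a +cpu torch build.
          PIP_EXTRA_INDEX_URL= pip install --pre torch \
            --index-url https://download.pytorch.org/whl/nightly/cu126
```

Clearing the variable inline, rather than unsetting it for the whole job, keeps any other pip invocations in the job unaffected.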

The two upload jobs get id-token: write + a configure-aws-credentials step using role/gha_workflow_upload-benchmark-results (ARC pods have no EC2 instance profile).
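
A minimal sketch of what the credentials setup for an upload job could look like, assuming the `aws-actions/configure-aws-credentials` action at `v4`. The job name, the `<ACCOUNT_ID>` placeholder, and the region are assumptions; only the role name and the `id-token: write` permission come from this PR.

```yaml
# Hypothetical sketch; the account ID and region are not stated in this PR.
jobs:
  benchmark:  # hypothetical job name
    permissions:
      id-token: write  # lets the job request a GitHub OIDC token
      contents: read   # commonly set alongside so checkout keeps read access
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          # <ACCOUNT_ID> is a placeholder to be filled in by the workflow owner.
          role-to-assume: arn:aws:iam::<ACCOUNT_ID>:role/gha_workflow_upload-benchmark-results
          aws-region: us-east-1  # assumed region
```

Because ARC pods have no EC2 instance profile, the OIDC token is the only identity the job can present, so the role's trust policy must accept tokens issued for this repository.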

Scope: H100/A100 only — regular GPU (g5/g6/L4) and CPU jobs are intentionally left on EC2. Companion PRs: #4456 (1x/4xH100), #4457 (release_model). Depends on linux_job_v3-era infra (now on pytorch/test-infra@main).

Needs validation

  • setup-miniconda + benchmark uploads inside the ARC container (esp. that role/gha_workflow_upload-benchmark-results is assumable from these pods).
  • actions/checkout/JS actions inside almalinux-builder (git present; node provided by the ARC container hook).
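
One way to check the items above is a short smoke-test step, sketched below as a hypothetical addition that is not part of this PR. It assumes the AWS CLI is available in the container and that the step runs after the credentials step; JS actions are exercised implicitly by `actions/checkout` itself.

```yaml
# Hypothetical validation step; assumes the aws CLI is installed in the image.
steps:
  - name: Smoke-check the ARC container
    shell: bash
    run: |
      set -euxo pipefail
      git --version                # git must be present for actions/checkout
      nvidia-smi                   # confirms --gpus all exposed the GPUs
      aws sts get-caller-identity  # confirms the upload role was assumed
```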

Switch the direct runs-on H100/A100 jobs (run_microbenchmarks, dashboard_perf_test,
run_tutorials) onto OSDC/ARC by adding a job-level CUDA container with --gpus all
and the ARC runner labels:
- run_microbenchmarks: linux.aws.h100 -> mt-l-x86iamx-22-225-h100
- dashboard_perf_test / run_tutorials: linux.aws.a100 -> mt-l-x86iavx512-11-125-a100

Pre-install torch's pure-python deps from the in-cluster pypi-cache so the nightly
cu126 install doesn't reach the cache-enforcer-blocked files.pythonhosted.org, and
clear PIP_EXTRA_INDEX_URL so the default cpu index can't supply a +cpu torch. The
two benchmark-upload jobs get id-token + the gha_workflow_upload-benchmark-results
role for OSS dashboard uploads (no EC2 instance profile on ARC).

pytorch-bot Bot commented May 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4458

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d695f80 with merge base 9c010ae (image):

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label May 30, 2026
huydhn marked this pull request as ready for review May 30, 2026 01:37
huydhn requested review from jerryzh168 and vkuzo as code owners May 30, 2026 01:37
huydhn changed the title from "[Draft][CI] Move H100/A100 benchmark & tutorial jobs to OSDC (ARC)" to "[CI] Move H100/A100 benchmark & tutorial jobs to OSDC (ARC)" May 30, 2026
huydhn added the module: not user facing label May 30, 2026