[CI] Switch release_model H100 job to linux_job_v3 (OSDC/ARC)#4457

Open: huydhn wants to merge 1 commit into main from huydo/osdc-release-model

huydhn (Contributor) commented May 30, 2026

What

Migrates the release_model.yml H100 job from EC2 linux_job_v2 to the OSDC (ARC) linux_job_v3 reusable workflow.

  • Runner: linux.aws.h100 → mt-l-x86iamx-22-225-h100 (from .github/arc.yaml).
  • uses: …/linux_job_v3.yml@main.
  • Pre-install torch's pure-python deps from the in-cluster pypi-cache, then install torch from the literal nightly cu126 index. On ARC, cache-enforcer iptables-blocks files.pythonhosted.org, so the deps (which the nightly index references there) must come from the reachable cache; mslk is a +cuXXX wheel on the pytorch index and stays in torch-spec.
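
To see which hosts an index page actually links to, something like the following can help. This is only a sketch: `list_link_hosts` is a hypothetical helper, the sample HTML is made up for illustration, and it assumes the page is a plain simple-index listing with absolute `href` targets (relative links are not reported).

```shell
#!/bin/sh
# list_link_hosts (hypothetical helper): read a simple-index HTML page on
# stdin and print the distinct hosts that its absolute href targets point to.
list_link_hosts() {
  grep -oE 'href="https?://[^/"]+' | sed -E 's|^href="https?://||' | sort -u
}

# Made-up sample page, for illustration only; not real index content.
sample='<a href="https://files.pythonhosted.org/packages/aa/filelock-0.0.0-py3-none-any.whl">x</a>
<a href="https://download.pytorch.org/whl/nightly/example-0.0.0-py3-none-any.whl">y</a>'

hosts=$(printf '%s\n' "$sample" | list_link_hosts)
echo "$hosts"
```

On a machine with unrestricted network access, piping a real page into the helper, for example `curl -fsSL https://download.pytorch.org/whl/nightly/filelock | list_link_hosts`, should show whether any links leave download.pytorch.org.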

Part of the torchao H100/A100 → OSDC migration. Companion PRs: #4456 (1x/4xH100). Depends on linux_job_v3 (now on pytorch/test-infra@main).

Migrate the release_model H100 job from EC2 linux_job_v2 to the OSDC
linux_job_v3 reusable workflow on the ARC runner
(linux.aws.h100 -> mt-l-x86iamx-22-225-h100), and pre-install torch's
pure-python deps from the in-cluster pypi-cache so the nightly cu126
install doesn't reach the cache-enforcer-blocked files.pythonhosted.org.
pytorch-bot Bot commented May 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4457

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 89e7458 with merge base 9c010ae:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label May 30, 2026
huydhn marked this pull request as ready for review May 30, 2026 01:37
huydhn requested review from jerryzh168 and vkuzo as code owners May 30, 2026 01:37
huydhn changed the title from "[Draft][CI] Switch release_model H100 job to linux_job_v3 (OSDC/ARC)" to "[CI] Switch release_model H100 job to linux_job_v3 (OSDC/ARC)" May 30, 2026
huydhn added the module: not user facing label May 30, 2026
pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy pillow
# Clear PIP_EXTRA_INDEX_URL (the runner's default cpu /whl/cpu/) so it can't supply a
# +cpu torch; the torch-spec's --index-url makes the literal nightly cu126 index the only source.
PIP_EXTRA_INDEX_URL= pip install ${{ matrix.torch-spec }}
Contributor

Err, is it documented anywhere?

huydhn (Contributor Author) commented May 30, 2026

First I want to set some context here:

These are two new pypi-cache compatibility issues that I found today with the setup, one per line of this change. I have created two tracking issues to follow up on next week:

  1. The first issue is that the download.pytorch.org index points to files.pythonhosted.org for some common Python packages, for example https://download.pytorch.org/whl/nightly/filelock. I observed that a command like `pip install torch --index-url https://download.pytorch.org/whl/nightly/cu130`, which pins the index to download.pytorch.org, can fetch every package hosted on download.pytorch.org but not those that link to files.pythonhosted.org. Running `pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy pillow` first is a quick way to fetch these packages separately, without setting the download.pytorch.org index. It works because that command resolves them from the local cache and never consults download.pytorch.org. Tracking issue: ci-infra#660, "[pypi-cache] download.pytorch.org index links pure-python deps to files.pythonhosted.org (unreachable on OSDC)".

  2. The second issue is that the runners wrongly set `PIP_EXTRA_INDEX_URL=http://pypi-cache-cpu.pypi-cache.svc.cluster.local:8080/whl/cpu/`. With this variable set, the CPU build of torch is somehow preferred over the correct CUDA build. Unsetting it is a quick way to tell pip to look at https://download.pytorch.org/whl/nightly/cu130 instead. Tracking issue: ci-infra#661, "[pypi-cache] Runners default PIP_EXTRA_INDEX_URL to the cpu slug on all archs → pip installs +cpu torch on GPU jobs".

I don't have the full context on these behaviors yet and need more time to look into them. (1) is a gotcha while (2) is a bug. So these changes are here to unblock torchao CI if needed. I agree that we shouldn't need them once (1) and (2) are fixed.
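
For readers unfamiliar with the `VAR= command` form used in the workaround for (2), here is a minimal sketch of what the shell does with it. It does not run pip; it only shows that the prefix assignment hands an empty value to that one command and leaves the surrounding shell untouched. `is_cpu_build` is a hypothetical helper and the version strings passed to it are placeholders, included only to illustrate a `+cpu` suffix check.

```shell
#!/bin/sh
# Value quoted in this thread as the runner default.
export PIP_EXTRA_INDEX_URL="http://pypi-cache-cpu.pypi-cache.svc.cluster.local:8080/whl/cpu/"

# `VAR= cmd` passes an empty VAR to cmd only; the parent shell keeps its value.
inner=$(PIP_EXTRA_INDEX_URL= sh -c 'printf "%s" "${PIP_EXTRA_INDEX_URL:-EMPTY}"')
outer="${PIP_EXTRA_INDEX_URL:-EMPTY}"

echo "inside the prefixed command: $inner"
echo "after the prefixed command:  $outer"

# is_cpu_build (hypothetical helper): succeed if a version string ends in
# the "+cpu" local-version suffix.
is_cpu_build() {
  case "$1" in
    *+cpu) return 0 ;;
    *)     return 1 ;;
  esac
}

# Placeholder version strings, for illustration only.
if is_cpu_build "0.0.0+cpu"; then echo "0.0.0+cpu is a CPU build"; fi
if ! is_cpu_build "0.0.0+cu126"; then echo "0.0.0+cu126 is not a CPU build"; fi
```

In a job, a similar suffix check could be applied to whatever `python -c "import torch; print(torch.__version__)"` prints, as a guard against silently picking up the CPU wheel.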

Comment on lines +41 to +45
# Pre-install torch/vision's pure-python deps from the in-cluster pypi-cache for speed.
pip install filelock typing-extensions "setuptools<82" sympy networkx jinja2 fsspec numpy pillow
# Clear PIP_EXTRA_INDEX_URL (the runner's default cpu /whl/cpu/) so it can't supply a
# +cpu torch; the torch-spec's --index-url makes the literal nightly cu126 index the only source.
PIP_EXTRA_INDEX_URL= pip install ${{ matrix.torch-spec }}
Contributor

Why can't this section be left as is?
