Skip to content

[NV] H100 (Agg): migrate model path#1537

Open
Ankur-singh wants to merge 2 commits into
mainfrom
fix-h100-model-path
Open

[NV] H100 (Agg): migrate model path#1537
Ankur-singh wants to merge 2 commits into
mainfrom
fix-h100-model-path

Conversation

@Ankur-singh
Copy link
Copy Markdown
Collaborator

Summary

  • Launcher (runners/launch_h100-dgxc-slurm.sh agg block): switch HF_HUB_CACHE_MOUNT to /mnt/numa1/shared/models/ (audit-recommended root), export HF_HUB_CACHE to container's ~/.cache/huggingface for eval dataset downloads, and rewrite MODEL from HF id (org/name) to absolute local path under the mount. Bind-mount target aligned to host path so the rewritten MODEL resolves inside the container.
  • Bench scripts (4 H100 fixed-seq-len scripts): drop hf download "$MODEL" line. Contract is now staged-or-fail.
  • Configs (4 H100 keys): reduce search-space to a single (isl=1024, osl=1024, conc=4) point to verify model-path wiring end-to-end. To revert to full sweep after wiring is confirmed.

In scope

  • gptoss-fp4-h100-vllm, minimaxm2.5-fp8-h100-vllm, qwen3.5-fp8-h100-sglang, qwen3.5-fp8-h100-sglang-mtp

Out of scope

  • Disagg/multi-node block in the launcher
  • Agentic configs and agentic/*.sh bench scripts
  • kimik2.5-int4-h100-vllm (agentic-only)
  • Other clusters (B200/B300/H200/GB200/GB300/MI3xx)

…ownload

- launcher: switch HF_HUB_CACHE_MOUNT to /mnt/numa1/shared/models/, export HF_HUB_CACHE to container's ~/.cache/huggingface for eval datasets, and rewrite MODEL from HF id (org/name) to absolute local path under the mount; align the bind-mount target with the host path so the rewritten MODEL resolves inside the container.
- bench scripts: drop "hf download $MODEL" from the four in-scope H100 scripts (gptoss, minimaxm2.5, qwen3.5, qwen3.5-mtp); contract is now staged-or-fail.
- configs: reduce the same four H100 configs to a single (isl=1024, osl=1024, conc=4) test point to verify model-path wiring end-to-end without burning a full sweep.

Out of scope: disagg/multi-node block, agentic configs/bench scripts, and kimik2.5-int4-h100-vllm (agentic-only).
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Ankur-singh Ankur-singh changed the title h100(agg): migrate model path to /mnt/numa1/shared/models/, drop hf download [NV] H100 (Agg): migrate model path May 20, 2026
Comment thread perf-changelog.yaml Outdated
- "Migrate H100 agg model path to /mnt/numa1/shared/models/ (audit-recommended root); set HF_HUB_CACHE to container's ~/.cache/huggingface for eval dataset downloads"
- "Rewrite MODEL in launcher from HF id (org/name) to absolute local path under HF_HUB_CACHE_MOUNT; remove hf download from bench scripts so contract is staged-or-fail"
- "Reduce search-space to single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new perf-changelog entry at line 3063 has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX — the literal placeholder XXXX was never replaced with the real PR number (1537). Every other recent entry in the file uses a real PR id, and per AGENTS.md this file is the audit trail linking config changes back to the PR that triggered them. Please replace XXXX with 1537 before merging.

Extended reasoning...

What the bug is. perf-changelog.yaml line 3063 reads:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

XXXX is a forgotten-substitution placeholder, not a real PR id. The author clearly intended to fill in a real number — the rest of the new entry (config-keys list and three-line description) is complete and substantive — but the pr-link field was left with its template value.

Why this matters. Every other recent entry in the same file uses a real PR id, e.g.:

  • line 3037 → pull/1411
  • line 3044 → pull/1528
  • line 3052 → pull/1475

AGENTS.md describes perf-changelog.yaml as the chronological audit trail linking benchmark config changes back to the PR that triggered them. An unfilled XXXX permanently breaks that linkage for the four H100 agg configs touched by this change (gptoss-fp4-h100-vllm, minimaxm2.5-fp8-h100-vllm, qwen3.5-fp8-h100-sglang, qwen3.5-fp8-h100-sglang-mtp) once merged — anyone later trying to understand why those four configs were narrowed to a single (isl=1024, osl=1024, conc=4) point will hit a dead 404 link instead of this PR.

Why nothing prevents it. There is no pre-commit or CI check that validates perf-changelog.yaml pr-link URLs resolve to a real PR; it relies on author discipline. The file is structured YAML but XXXX is still a syntactically valid string, so no schema validator will flag it either.

Impact. Once merged, the link https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX will 404 forever (GitHub does not have a PR numbered XXXX). The audit trail for these four H100 configs is severed. It is also non-trivial to fix post-merge because (a) perf-changelog.yaml is append-only by convention, and (b) reviewers in 6 months won't know to look for XXXX specifically.

Step-by-step proof.

  1. Look at the diff hunk for perf-changelog.yaml: the new entry ends at line 3063 with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.
  2. The PR number is 1537 (from the PR metadata at the top of this review).
  3. Visit https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX — GitHub will return 404 because no PR is numbered XXXX.
  4. Compare with line 3052: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1475 — that one resolves to a real PR. So the convention in this file is clearly to use a real PR id.

How to fix. One-character (well, four-character) substitution at perf-changelog.yaml:3063:

-  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1537

@github-actions
Copy link
Copy Markdown
Contributor

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant