fix(pipeline): harden eval runtime plumbing #1484
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
📝 Walkthrough
The PR extends SLURM port allocation to support deterministic job-ID-based ports and concurrent-safe random allocation with file-lock TTL reservations, propagates the resulting
Changes
Inference Robustness
SLURM Port Allocation and Pipeline Defaults
Sequence Diagram(s)
sequenceDiagram
participant Caller as Pipeline Caller
participant GPF as get_free_port
participant FSLock as Per-user File Lock
participant ResvFile as Reservation File
participant GSC as get_server_command
Caller->>GPF: get_free_port(strategy="random")
alt NEMO_SKILLS_USE_SLURM_JOB_ID_PORTS=1
GPF-->>Caller: "$((SLURM_JOB_ID + offset))" (str)
else random allocation
GPF->>FSLock: acquire lock
GPF->>ResvFile: read TTL reservations
loop until port found or max attempts
GPF->>ResvFile: check port, evict stale, write reservation
end
GPF->>FSLock: release lock
GPF-->>Caller: port (int)
end
Caller->>GSC: get_server_command(..., server_port=port)
GSC-->>Caller: command string with port embedded
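The allocation flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `get_free_port_sketch`, the `offset` and `max_attempts` parameters, the reservation file layout, the TTL, and the port range are all assumptions made for the example. It relies on `fcntl`, so it is Unix-only.

```python
import fcntl
import json
import os
import random
import socket
import tempfile
import time
from pathlib import Path

# Hypothetical locations and constants, chosen only for this sketch.
STATE_DIR = Path(tempfile.gettempdir()) / f"port-reservations-{os.getuid()}"
LOCK_FILE = STATE_DIR / "ports.lock"
RESERVATION_FILE = STATE_DIR / "ports.json"
RESERVATION_TTL_S = 300
PORT_RANGE = (20000, 40000)


def _is_bindable(port: int) -> bool:
    """Return True if nothing on this host is currently bound to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind(("", port))
        except OSError:
            return False
    return True


def get_free_port_sketch(offset: int = 0, max_attempts: int = 100):
    # Deterministic branch: return a shell expression (str) that is only
    # evaluated inside the job, where SLURM_JOB_ID is defined.
    if os.environ.get("NEMO_SKILLS_USE_SLURM_JOB_ID_PORTS") == "1":
        return f"$((SLURM_JOB_ID + {offset}))"

    STATE_DIR.mkdir(parents=True, exist_ok=True)
    with open(LOCK_FILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until this process owns the lock
        try:
            now = time.time()
            try:
                reservations = json.loads(RESERVATION_FILE.read_text())
            except (FileNotFoundError, json.JSONDecodeError):
                reservations = {}
            # Evict reservations older than the TTL.
            reservations = {
                port: ts for port, ts in reservations.items()
                if now - ts < RESERVATION_TTL_S
            }
            for _ in range(max_attempts):
                port = random.randint(*PORT_RANGE)
                if str(port) in reservations or not _is_bindable(port):
                    continue
                # Record the reservation before releasing the lock.
                reservations[str(port)] = now
                RESERVATION_FILE.write_text(json.dumps(reservations))
                return port
            raise RuntimeError(f"no free port found in {max_attempts} attempts")
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

The reservation file matters because a bind check alone cannot see a port that another submission has picked but not yet started listening on. The TTL keeps abandoned reservations from accumulating.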
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/pipeline/utils/cluster.py`:
- Around line 262-263: The conditional check at line 262 only bypasses expansion
for "NEMO_SKILLS_SANDBOX_PORT" when the value contains "SLURM_JOB_ID", but
"LISTEN_PORT" and "NGINX_PORT" also contain SLURM expressions from the same
injection path and need the same bypass to avoid triggering the resolver and
raising ValueError. Broaden the condition to check if the key is any of these
three environment variables (NEMO_SKILLS_SANDBOX_PORT, LISTEN_PORT, or
NGINX_PORT) and the value contains "SLURM_JOB_ID", then skip expansion for all
of them.
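A minimal sketch of the broadened check, assuming the expansion step walks a dict of environment variables and calls a resolver on each value. The names `SLURM_PORT_ENV_KEYS`, `should_skip_expansion`, and `expand_env` are hypothetical and are not taken from `cluster.py`.

```python
from typing import Callable, Dict

# The three keys named in the review comment.
SLURM_PORT_ENV_KEYS = frozenset(
    {"NEMO_SKILLS_SANDBOX_PORT", "LISTEN_PORT", "NGINX_PORT"}
)


def should_skip_expansion(key: str, value: str) -> bool:
    """True when the value is a SLURM port expression that must reach the job unexpanded."""
    return key in SLURM_PORT_ENV_KEYS and "SLURM_JOB_ID" in value


def expand_env(env: Dict[str, str], resolve: Callable[[str], str]) -> Dict[str, str]:
    """Apply `resolve` to every value except deferred SLURM port expressions."""
    return {
        key: value if should_skip_expansion(key, value) else resolve(value)
        for key, value in env.items()
    }
```

Keeping the keys in one set means a fourth port variable, should one be added later, needs a change in only one place.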
In `@nemo_skills/utils.py`:
- Around line 635-636: The readiness loop curl commands are using only -sS
flags, which exit successfully even on HTTP 4xx/5xx error responses, causing the
loop to mark the server as ready prematurely. Add the --fail flag to both curl
commands for models_url and health_url to ensure curl exits with a non-zero
status on HTTP error responses, and optionally add a short timeout using
--max-time to prevent hanging on unresponsive endpoints.
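For illustration, the suggested curl invocation could be built as below. The helper names, the default timeout, and the `-o /dev/null` redirect are assumptions for this sketch; how the two probes are combined in the readiness loop is left to the caller, since the review comment does not show that code.

```python
import subprocess
from typing import List


def build_probe_command(url: str, max_time_s: int = 5) -> List[str]:
    """Build a curl probe that fails on HTTP errors and cannot hang indefinitely."""
    return [
        "curl",
        "-sS",                         # quiet, but still print errors
        "--fail",                      # non-zero exit on HTTP 4xx/5xx responses
        "--max-time", str(max_time_s), # cap the total time spent on the request
        "-o", "/dev/null",             # discard the response body
        url,
    ]


def probe(url: str, max_time_s: int = 5) -> bool:
    """Return True only if curl exits with status 0 (requires curl on PATH)."""
    result = subprocess.run(
        build_probe_command(url, max_time_s),
        capture_output=True,
        check=False,
    )
    return result.returncode == 0
```

Without `--fail`, a server that answers 503 while it is still loading would make curl exit 0, which is the premature readiness described in the comment.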
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d7bc07c8-c4ef-4264-9a75-8f47025a4b53
📒 Files selected for processing (9)
nemo_skills/inference/generate.py
nemo_skills/inference/model/vllm_multimodal.py
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/cluster.py
nemo_skills/pipeline/utils/exp.py
nemo_skills/pipeline/utils/scripts/eval.py
nemo_skills/pipeline/utils/scripts/server.py
nemo_skills/pipeline/utils/server.py
nemo_skills/utils.py
Migration note
Supersedes #1474. Recreated from the `NVIDIA-NeMo/Skills` branch `codex/pr1443-runtime-fixes` so repository CI can run.
Summary
Split out from #1443.
This PR contains general runtime and pipeline hardening that is useful independently of the AppTek benchmark.
What Changed
- `None` text to empty content
- `/v1/models` and `/health`
- `SLURM_JOB_ID`-derived ports
Relationship To #1443
This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.
Validation
`python -m py_compile` on changed Python files.
Summary by CodeRabbit
Release Notes
Bug Fixes
New Features
Improvements