Warn on risky hosted server allocations by tamohannes · Pull Request #1465 · NVIDIA-NeMo/Skills

tamohannes · 2026-05-28T21:21:00Z

Summary

Adds hosted-server guardrails for common wasted-compute configurations:

- Warn when a partial-node hosted server uses a fixed port
- Warn when exclusive=True may reserve more GPUs than the server can use
- Make random-port behavior aware of cluster gpus_per_node
- Keep submission allowed; this is warning-only

Why

Partial-node serving jobs can collide on fixed localhost ports, and exclusive allocations can reserve idle GPUs when server_gpus is smaller than the node size. Both patterns can produce idle-GPU waste while the job appears submitted successfully.
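
The two patterns above can be sketched as a small function. This is a minimal illustration, not the PR's implementation: the name `allocation_warnings`, its parameters, the message wording, and the default of 8 GPUs per node are all assumptions. It also assumes the collision warning is skipped for exclusive allocations, since an exclusively held node has no co-located job to collide with.

```python
import logging

LOG = logging.getLogger(__name__)


def allocation_warnings(server_gpus, gpus_per_node=8, exclusive=False, fixed_port=True):
    """Return warning messages for a hosted-server request (hypothetical helper)."""
    messages = []
    is_partial_node = server_gpus < gpus_per_node

    # A partial-node job may share its node with another job, so a fixed
    # localhost port can collide unless the node is reserved exclusively.
    if is_partial_node and fixed_port and not exclusive:
        messages.append(
            f"server_gpus={server_gpus} uses part of a {gpus_per_node}-GPU node "
            "with a fixed server port; co-located jobs may collide on the port"
        )

    # An exclusive allocation reserves the whole node, so any GPUs beyond
    # server_gpus sit idle.
    if is_partial_node and exclusive:
        idle = gpus_per_node - server_gpus
        messages.append(
            f"exclusive=True with server_gpus={server_gpus} may leave {idle} GPU(s) idle"
        )

    for message in messages:
        LOG.warning(message)  # warning-only: nothing here blocks submission
    return messages
```

Returning the messages as well as logging them keeps the sketch easy to test without a log-capture fixture.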

Tests

python -m pytest tests/test_pipeline_utils.py::test_should_get_random_port_respects_cluster_gpu_count tests/test_pipeline_utils.py::test_get_cluster_gpus_per_node_known_clusters_and_override tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_partial_fixed_port tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_partial_exclusive tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_random_partial_has_no_port_warning -q
python -m ruff check nemo_skills/pipeline/generate.py nemo_skills/pipeline/start_server.py nemo_skills/pipeline/utils/server.py nemo_skills/pipeline/nemo_rl/grpo.py nemo_skills/pipeline/verl/ppo.py tests/test_pipeline_utils.py

Summary by CodeRabbit

New Features
- New warnings for hosted-server GPU allocation that may waste resources or risk localhost port conflicts
- Start-server CLI option now supports an “unset” mode that auto-resolves port behavior from cluster GPU layout
Improvements
- Port-allocation logic now considers cluster GPU counts and exclusive-allocation settings for smarter defaults and fewer port collisions
Tests
- Added tests covering cluster-aware port selection and allocation-warning behavior

coderabbitai · 2026-05-28T21:26:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6463db7c-4f48-42a2-8369-f54af33e900e

📥 Commits

Reviewing files that changed from the base of the PR and between a38fb66 and 0390f19.

📒 Files selected for processing (6)

nemo_skills/pipeline/generate.py
nemo_skills/pipeline/nemo_rl/grpo.py
nemo_skills/pipeline/start_server.py
nemo_skills/pipeline/utils/server.py
nemo_skills/pipeline/verl/ppo.py
tests/test_pipeline_utils.py

💤 Files with no reviewable changes (3)

nemo_skills/pipeline/nemo_rl/grpo.py
nemo_skills/pipeline/generate.py
nemo_skills/pipeline/verl/ppo.py

🚧 Files skipped from review as they are similar to previous changes (3)

nemo_skills/pipeline/start_server.py
tests/test_pipeline_utils.py
nemo_skills/pipeline/utils/server.py

📝 Walkthrough

Adds cluster-GPU-aware server allocation utilities, integrates them into server startup and multiple job commands to decide random vs fixed ports, re-exports a warning helper, and adds tests exercising GPU-count resolution and warning emission for partial-node/exclusive/fixed-port scenarios.

Changes

GPU-aware hosted-server allocation

| Layer / File(s) | Summary |
| --- | --- |
| Core server GPU allocation utilities `nemo_skills/pipeline/utils/server.py` | Adds `get_cluster_gpus_per_node()` to infer GPUs-per-node, extends `should_get_random_port()` with `gpus_per_node` and `server_port` to decide random-port usage, and adds `warn_hosted_server_allocation()` for partial-node/exclusive/fixed-port warnings. |
| Utility re-export `nemo_skills/pipeline/utils/__init__.py` | Re-exports `warn_hosted_server_allocation` from `nemo_skills.pipeline.utils.server`. |
| Server launcher command integration `nemo_skills/pipeline/start_server.py` | Changes `launch_server()` and the `start_server` CLI to default `get_random_port=None`; when unset they resolve `gpus_per_node` and compute `resolved_random_port` via `should_get_random_port()` before selecting ports. |
| Job command integrations `nemo_skills/pipeline/generate.py`, `nemo_skills/pipeline/nemo_rl/grpo.py`, `nemo_skills/pipeline/verl/ppo.py` | Updates per-model and judge-server setup paths to compute `gpus_per_node` from `cluster_config` and pass it into `should_get_random_port(server_gpus, exclusive, gpus_per_node)`. |
| Tests for server allocation helpers `tests/test_pipeline_utils.py` | Adds tests for `should_get_random_port` behavior relative to cluster GPU counts, `get_cluster_gpus_per_node` config/default resolution, and `warn_hosted_server_allocation` logging across parameter combinations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.89% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: adding warnings for risky hosted server allocations, which matches the core functionality across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

tests/test_pipeline_utils.py (1)
72-82: 💤 Low value

Consider clarifying multi-warning test coverage.

This test correctly verifies the exclusive warning, but with get_random_port=False, the fixed-port warning should also be emitted. The test doesn't assert its presence or absence, which could cause confusion if the implementation changes. Consider either:

- Adding an assertion that both warnings appear, or
- Adding a comment noting that this test focuses only on the exclusive warning
📝 Example: Verify both warnings or add clarifying comment

Option 1 - Assert both warnings:
 assert "exclusive=True" in caplog.text
 assert "server_gpus=8" in caplog.text
+assert "fixed server port" in caplog.text
Option 2 - Add clarifying comment:
+# Note: This also triggers the fixed-port warning, but we only verify the exclusive warning here.
 assert "exclusive=True" in caplog.text
 assert "server_gpus=8" in caplog.text
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_pipeline_utils.py` around lines 72 - 82, The test
test_warn_hosted_server_allocation_partial_exclusive currently checks only the
exclusive and server_gpus warnings but omits verifying the fixed-port warning
emitted when get_random_port=False; update the test to also assert that the
fixed-port warning is present (e.g., assert "get_random_port=False" in
caplog.text or the literal fixed-port warning message emitted by
warn_hosted_server_allocation) so the test unambiguously covers both warnings
emitted by warn_hosted_server_allocation.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 124-125: The collision warning logic incorrectly treats an
explicit server_port as non-fixed when get_random_port is True; update the
computation used where fixed_port is defined so that fixed_port is True whenever
server_port is not None regardless of get_random_port (e.g., compute fixed_port
as True if server_port is provided OR get_random_port is explicitly False), then
use that fixed_port in the existing is_partial_node check (referencing
fixed_port, get_random_port, server_port, and is_partial_node) so partial-node
collision warnings fire whenever a specific port was requested.
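
The suggested rule can be written as a one-line predicate. A minimal sketch, assuming `get_random_port` is tri-state (True, False, or None); `is_fixed_port` is a hypothetical name, and the actual code in `server.py` may differ.

```python
def is_fixed_port(get_random_port, server_port):
    """True when the server will bind a specific, predictable port (hypothetical helper)."""
    # An explicitly requested port is always fixed, even if random ports
    # were also requested; otherwise only an explicit False means fixed.
    return server_port is not None or get_random_port is False
```

Under this sketch an unset `get_random_port` with no pinned port is not treated as fixed, which leaves that case to whatever default resolution the caller applies.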



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 16d1f1e4-7e61-4246-a8d3-fca656aad35f

📥 Commits

Reviewing files that changed from the base of the PR and between b620e79 and 98f726b.

📒 Files selected for processing (7)

nemo_skills/pipeline/generate.py
nemo_skills/pipeline/nemo_rl/grpo.py
nemo_skills/pipeline/start_server.py
nemo_skills/pipeline/utils/__init__.py
nemo_skills/pipeline/utils/server.py
nemo_skills/pipeline/verl/ppo.py
tests/test_pipeline_utils.py

Kipok · 2026-06-09T17:51:24Z

+            return int(value)
+
+    cluster_name = str(cluster_config.get("_cluster_yaml_name") or cluster_config.get("name") or "").lower()
+    if any(token in cluster_name for token in ("aws-cmh", "aws-dfw")):


Let's not reference any internal infrastructure. We can just add this as an explicit argument in the cluster config, or default to 8 otherwise.
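
One way to follow this suggestion, sketched under assumptions: the key names `gpus_per_node` and `num_gpus_per_node` are the ones named in the follow-up commit message, while the function name `resolve_gpus_per_node` and the handling of missing values are hypothetical.

```python
DEFAULT_GPUS_PER_NODE = 8  # fallback suggested in the review


def resolve_gpus_per_node(cluster_config):
    """Read GPUs-per-node from an explicit config key, else use the default."""
    for key in ("gpus_per_node", "num_gpus_per_node"):
        value = cluster_config.get(key)
        if value is not None:
            return int(value)  # accept values parsed as strings, e.g. from YAML
    return DEFAULT_GPUS_PER_NODE
```

Keeping the lookup to plain config keys avoids any dependency on cluster names, which is the point of the review comment.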

Kipok · 2026-06-09T17:53:01Z

-        get_random_port = pipeline_utils.should_get_random_port(server_gpus, exclusive)
+        gpus_per_node = pipeline_utils.get_cluster_gpus_per_node(cluster_config)
+        get_random_port = pipeline_utils.should_get_random_port(server_gpus, exclusive, gpus_per_node)
+        pipeline_utils.warn_hosted_server_allocation(


this should be done for all pipelines, not just the ones you added it for? E.g. eval, sft, etc.

Kipok · 2026-06-09T17:53:19Z

+            gpus_per_node=gpus_per_node,
+            get_random_port=get_random_port,
+            server_port=None,
+            context="ns verl ppo",


I think this might get stale very fast and isn't super useful; I'd probably not use context at all.

Kipok · 2026-06-09T17:56:18Z

+    if get_random_port is None:
+        get_random_port = should_get_random_port(
+            server_gpus=server_gpus,
+            exclusive=(sbatch_kwargs or {}).get("exclusive") if isinstance(sbatch_kwargs, dict) else None,


sbatch_kwargs should already be resolved to always be a dict or None at this point. And we can probably check for exclusive ones.

Kipok · 2026-06-09T17:57:14Z

+    if get_random_port is None:
+        get_random_port = should_get_random_port(
+            server_gpus=server_gpus,
+            exclusive=exclusive,


I think exclusive can also be passed through sbatch kwargs, so it's better to check for that one (and it should have explicit exclusive fused already).
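
A minimal sketch of that check, assuming `sbatch_kwargs` has already been normalized to a dict or `None` and that exclusivity appears there as a truthy `exclusive` entry. `resolve_exclusive` is a hypothetical helper, not a function from the repository.

```python
def resolve_exclusive(exclusive, sbatch_kwargs):
    """Treat the job as exclusive if either the flag or sbatch_kwargs says so."""
    # Assumption: a truthy "exclusive" key in sbatch_kwargs requests the whole node.
    requested_via_sbatch = bool((sbatch_kwargs or {}).get("exclusive"))
    return bool(exclusive) or requested_via_sbatch
```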

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>

- Drop hardcoded internal cluster names from get_cluster_gpus_per_node; read gpus_per_node / num_gpus_per_node from the cluster config, else default to 8.
- Centralize the guardrail in should_get_random_port so every server-hosting pipeline (generate/eval, grpo, ppo, start_server) warns without per-call-site duplication; remove the scattered warn_hosted_server_allocation calls and the now-stale context argument.
- Treat an explicitly pinned server_port as a fixed port and suppress the collision warning under exclusive allocations; update/extend tests.

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/start_server.py`:
- Line 126: The change made get_random_port's default to None in start_server
changes behavior for callers; revert the default to the original False to
preserve fixed-port behavior for existing automation (i.e., set get_random_port
back to False in the start_server signature) and update the start_server
docstring to document the behavior; if GPU-aware/random-port behavior is
desired, introduce a new explicit parameter (e.g., gpu_aware_get_random_port)
instead of overloading get_random_port so callers must opt in.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 63584624-5f15-4c56-b860-32e2ef7ea842

📥 Commits

Reviewing files that changed from the base of the PR and between 98f726b and a38fb66.

📒 Files selected for processing (7)

nemo_skills/pipeline/generate.py
nemo_skills/pipeline/nemo_rl/grpo.py
nemo_skills/pipeline/start_server.py
nemo_skills/pipeline/utils/__init__.py
nemo_skills/pipeline/utils/server.py
nemo_skills/pipeline/verl/ppo.py
tests/test_pipeline_utils.py

✅ Files skipped from review due to trivial changes (1)

nemo_skills/pipeline/generate.py

🚧 Files skipped from review as they are similar to previous changes (3)

nemo_skills/pipeline/utils/__init__.py
nemo_skills/pipeline/utils/server.py
tests/test_pipeline_utils.py

coderabbitai · 2026-06-10T23:03:53Z

    log_dir=None,
    mount_paths=None,
-    get_random_port=False,
+    get_random_port=None,


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Breaking change: default port behavior now GPU-aware instead of fixed.

Changing get_random_port default from False to None alters the behavior for all callers that don't explicitly set this parameter. Previously they received fixed ports (5000/6000); now partial-node allocations will receive random ports. While this improves collision-safety, existing automation that relies on fixed port numbers may break.
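
The behavior change comes from the tri-state default. A minimal sketch, assuming `None` means "decide from the GPU layout" and explicit booleans are honored as given; `resolve_random_port` and its partial-node rule are illustrative and do not reproduce the signature of `should_get_random_port`.

```python
def resolve_random_port(get_random_port, server_gpus, gpus_per_node=8, exclusive=False):
    """Resolve a tri-state get_random_port value to a concrete bool (illustrative)."""
    if get_random_port is not None:
        return get_random_port  # explicit True/False from the caller wins
    # Unset: a job that owns the whole node (full node or exclusive) can
    # keep a fixed port; a partial-node job gets a random one.
    return server_gpus < gpus_per_node and not exclusive
```

Under this sketch, callers that depend on the previous fixed ports can keep that behavior by passing `get_random_port=False` explicitly.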


tamohannes · 2026-06-10T23:04:17Z

Updated per review:

- get_cluster_gpus_per_node no longer hard-codes cluster names; it reads gpus_per_node/num_gpus_per_node from the cluster config, defaulting to 8.
- Folded the guardrail into should_get_random_port so every server-hosting pipeline (generate/eval, grpo, ppo, start_server) gets it; removed the per-call-site warn_hosted_server_allocation calls and the context arg.
- Simplified the sbatch_kwargs/exclusive handling; an explicitly pinned server_port is now treated as a fixed port for the collision warning (+ tests).

coderabbitai Bot reviewed May 28, 2026

Comment thread nemo_skills/pipeline/utils/server.py Outdated

Kipok requested changes Jun 9, 2026

Warn on risky hosted server allocations

7b3cef8

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>

tamohannes force-pushed the tamohannes/waste-port-guardrails branch from 98f726b to a38fb66 on June 10, 2026 22:56

tamohannes force-pushed the tamohannes/waste-port-guardrails branch from a38fb66 to 0390f19 on June 10, 2026 23:02

coderabbitai Bot reviewed Jun 10, 2026


2 participants
