Warn on risky hosted server allocations #1465

Open
tamohannes wants to merge 2 commits into main from tamohannes/waste-port-guardrails

tamohannes commented May 28, 2026

Collaborator

Summary

Adds hosted-server guardrails for common wasted-compute configurations:

  • warn when a partial-node hosted server uses a fixed port
  • warn when exclusive=True may reserve more GPUs than the server can use
  • make random-port behavior aware of cluster gpus_per_node
  • keep submission allowed; this is warning-only

Why

Partial-node serving jobs can collide on fixed localhost ports, and exclusive allocations can reserve idle GPUs when server_gpus is smaller than the node size. Both patterns can produce idle-GPU waste while the job appears submitted successfully.
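
The waste described above comes down to simple arithmetic. The following is a minimal sketch, not code from this PR: the helper names `is_partial_node` and `idle_gpus_if_exclusive` are hypothetical, and the default of 8 GPUs per node is an assumption for illustration.

```python
def is_partial_node(server_gpus: int, gpus_per_node: int = 8) -> bool:
    """True when the server asks for fewer GPUs than one node provides.

    Hypothetical helper; gpus_per_node=8 is an assumed default.
    """
    return 0 < server_gpus < gpus_per_node


def idle_gpus_if_exclusive(server_gpus: int, gpus_per_node: int = 8) -> int:
    """GPUs reserved but unused when a job takes the whole node exclusively."""
    return max(gpus_per_node - server_gpus, 0)


# Print the partial-node flag and the idle-GPU count for a few server sizes.
for gpus in (2, 4, 8):
    print(gpus, is_partial_node(gpus), idle_gpus_if_exclusive(gpus))
```

A partial-node job without `exclusive` may share its node with another server, which is where a fixed localhost port can collide; the same job with `exclusive` avoids the neighbor but leaves the remaining GPUs idle.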

Tests

  • python -m pytest tests/test_pipeline_utils.py::test_should_get_random_port_respects_cluster_gpu_count tests/test_pipeline_utils.py::test_get_cluster_gpus_per_node_known_clusters_and_override tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_partial_fixed_port tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_partial_exclusive tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_random_partial_has_no_port_warning -q
  • python -m ruff check nemo_skills/pipeline/generate.py nemo_skills/pipeline/start_server.py nemo_skills/pipeline/utils/server.py nemo_skills/pipeline/nemo_rl/grpo.py nemo_skills/pipeline/verl/ppo.py tests/test_pipeline_utils.py

Summary by CodeRabbit

  • New Features

    • New warnings for hosted-server GPU allocation that may waste resources or risk localhost port conflicts
    • Start-server CLI option now supports an “unset” mode that auto-resolves port behavior from cluster GPU layout
  • Improvements

    • Port-allocation logic now considers cluster GPU counts and exclusive-allocation settings for smarter defaults and fewer port collisions
  • Tests

    • Added tests covering cluster-aware port selection and allocation-warning behavior

coderabbitai Bot commented May 28, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6463db7c-4f48-42a2-8369-f54af33e900e

📥 Commits

Reviewing files that changed from the base of the PR and between a38fb66 and 0390f19.

📒 Files selected for processing (6)
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/nemo_rl/grpo.py
  • nemo_skills/pipeline/start_server.py
  • nemo_skills/pipeline/utils/server.py
  • nemo_skills/pipeline/verl/ppo.py
  • tests/test_pipeline_utils.py
💤 Files with no reviewable changes (3)
  • nemo_skills/pipeline/nemo_rl/grpo.py
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/verl/ppo.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemo_skills/pipeline/start_server.py
  • tests/test_pipeline_utils.py
  • nemo_skills/pipeline/utils/server.py

📝 Walkthrough

Adds cluster-GPU-aware server allocation utilities, integrates them into server startup and multiple job commands to decide random vs fixed ports, re-exports a warning helper, and adds tests exercising GPU-count resolution and warning emission for partial-node/exclusive/fixed-port scenarios.

Changes

GPU-aware hosted-server allocation

| Layer / File(s) | Summary |
| --- | --- |
| Core server GPU allocation utilities: nemo_skills/pipeline/utils/server.py | Adds get_cluster_gpus_per_node() to infer GPUs-per-node, extends should_get_random_port() with gpus_per_node and server_port to decide random-port usage, and adds warn_hosted_server_allocation() for partial-node/exclusive/fixed-port warnings. |
| Utility re-export: nemo_skills/pipeline/utils/__init__.py | Re-exports warn_hosted_server_allocation from nemo_skills.pipeline.utils.server. |
| Server launcher command integration: nemo_skills/pipeline/start_server.py | Changes launch_server() and the start_server CLI to default get_random_port=None; when unset they resolve gpus_per_node and compute resolved_random_port via should_get_random_port() before selecting ports. |
| Job command integrations: nemo_skills/pipeline/generate.py, nemo_skills/pipeline/nemo_rl/grpo.py, nemo_skills/pipeline/verl/ppo.py | Updates per-model and judge-server setup paths to compute gpus_per_node from cluster_config and pass it into should_get_random_port(server_gpus, exclusive, gpus_per_node). |
| Tests for server allocation helpers: tests/test_pipeline_utils.py | Adds tests for should_get_random_port behavior relative to cluster GPU counts, get_cluster_gpus_per_node config/default resolution, and warn_hosted_server_allocation logging across parameter combinations. |
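
As a rough illustration of the extended decision described in the walkthrough, here is a minimal sketch. It is not the PR's implementation: the name `should_get_random_port_sketch`, the parameter defaults, and the exact precedence of the three checks are assumptions.

```python
from typing import Optional


def should_get_random_port_sketch(
    server_gpus: int,
    exclusive: bool,
    gpus_per_node: int = 8,
    server_port: Optional[int] = None,
) -> bool:
    """Hypothetical stand-in for the extended should_get_random_port()."""
    # Assumption: an explicitly pinned port is never replaced by a random one.
    if server_port is not None:
        return False
    # Assumption: an exclusive job owns the node, so no neighbor shares the port space.
    if exclusive:
        return False
    # Only a partial-node job can land next to another server on the same node.
    return server_gpus < gpus_per_node


print(should_get_random_port_sketch(server_gpus=2, exclusive=False))
print(should_get_random_port_sketch(server_gpus=4, exclusive=False, gpus_per_node=4))
```

Passing `gpus_per_node` explicitly is what lets a 4-GPU server on a 4-GPU node count as a full-node job rather than a partial one.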

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.89% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: adding warnings for risky hosted server allocations, which matches the core functionality across all modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_pipeline_utils.py (1)

72-82: 💤 Low value

Consider clarifying multi-warning test coverage.

This test correctly verifies the exclusive warning, but with get_random_port=False, the fixed-port warning should also be emitted. The test doesn't assert its presence or absence, which could cause confusion if the implementation changes. Consider either:

  • Adding an assertion that both warnings appear, or
  • Adding a comment noting that this test focuses only on the exclusive warning
📝 Example: Verify both warnings or add clarifying comment

Option 1 - Assert both warnings:

 assert "exclusive=True" in caplog.text
 assert "server_gpus=8" in caplog.text
+assert "fixed server port" in caplog.text

Option 2 - Add clarifying comment:

+# Note: This also triggers the fixed-port warning, but we only verify the exclusive warning here.
 assert "exclusive=True" in caplog.text
 assert "server_gpus=8" in caplog.text
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_pipeline_utils.py` around lines 72 - 82, The test
test_warn_hosted_server_allocation_partial_exclusive currently checks only the
exclusive and server_gpus warnings but omits verifying the fixed-port warning
emitted when get_random_port=False; update the test to also assert that the
fixed-port warning is present (e.g., assert "get_random_port=False" in
caplog.text or the literal fixed-port warning message emitted by
warn_hosted_server_allocation) so the test unambiguously covers both warnings
emitted by warn_hosted_server_allocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 124-125: The collision warning logic incorrectly treats an
explicit server_port as non-fixed when get_random_port is True; update the
computation used where fixed_port is defined so that fixed_port is True whenever
server_port is not None regardless of get_random_port (e.g., compute fixed_port
as True if server_port is provided OR get_random_port is explicitly False), then
use that fixed_port in the existing is_partial_node check (referencing
fixed_port, get_random_port, server_port, and is_partial_node) so partial-node
collision warnings fire whenever a specific port was requested.
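
The suggested fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `server.py`: the function names are hypothetical, and only the rule "fixed when `server_port` is provided or `get_random_port` is explicitly False" comes from the comment above.

```python
from typing import Optional


def is_fixed_port(get_random_port: Optional[bool], server_port: Optional[int]) -> bool:
    """A port is fixed if one was pinned, or random ports were explicitly disabled."""
    # `is False` keeps an unset (None) get_random_port from counting as fixed.
    return server_port is not None or get_random_port is False


def should_warn_port_collision(
    server_gpus: int,
    gpus_per_node: int,
    get_random_port: Optional[bool],
    server_port: Optional[int],
) -> bool:
    """Hypothetical check: warn when a partial-node server uses a fixed port."""
    partial_node = server_gpus < gpus_per_node
    return partial_node and is_fixed_port(get_random_port, server_port)


# get_random_port=True with a pinned port still counts as fixed.
print(should_warn_port_collision(2, 8, get_random_port=True, server_port=5000))
```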

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 16d1f1e4-7e61-4246-a8d3-fca656aad35f

📥 Commits

Reviewing files that changed from the base of the PR and between b620e79 and 98f726b.

📒 Files selected for processing (7)
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/nemo_rl/grpo.py
  • nemo_skills/pipeline/start_server.py
  • nemo_skills/pipeline/utils/__init__.py
  • nemo_skills/pipeline/utils/server.py
  • nemo_skills/pipeline/verl/ppo.py
  • tests/test_pipeline_utils.py

Comment thread nemo_skills/pipeline/utils/server.py Outdated
return int(value)

cluster_name = str(cluster_config.get("_cluster_yaml_name") or cluster_config.get("name") or "").lower()
if any(token in cluster_name for token in ("aws-cmh", "aws-dfw")):

Collaborator

Let's not reference any internal infrastructure. We can just add this as an explicit argument in the cluster config, or default to 8 otherwise.

Comment thread nemo_skills/pipeline/verl/ppo.py Outdated
-get_random_port = pipeline_utils.should_get_random_port(server_gpus, exclusive)
+gpus_per_node = pipeline_utils.get_cluster_gpus_per_node(cluster_config)
+get_random_port = pipeline_utils.should_get_random_port(server_gpus, exclusive, gpus_per_node)
+pipeline_utils.warn_hosted_server_allocation(

Collaborator

Shouldn't this be done for all pipelines, not just the ones you added it for? E.g. eval, sft, etc.

Comment thread nemo_skills/pipeline/verl/ppo.py Outdated
gpus_per_node=gpus_per_node,
get_random_port=get_random_port,
server_port=None,
context="ns verl ppo",

Collaborator

I think this might get stale very fast and isn't super useful. I'd probably not use context at all.

Comment thread nemo_skills/pipeline/start_server.py Outdated
if get_random_port is None:
    get_random_port = should_get_random_port(
        server_gpus=server_gpus,
        exclusive=(sbatch_kwargs or {}).get("exclusive") if isinstance(sbatch_kwargs, dict) else None,

Collaborator

sbatch_kwargs should already be resolved to be either a dict or None at this point. And we can probably check for exclusive ones.

Comment thread nemo_skills/pipeline/start_server.py Outdated
if get_random_port is None:
    get_random_port = should_get_random_port(
        server_gpus=server_gpus,
        exclusive=exclusive,

Collaborator

I think exclusive can also be passed through sbatch_kwargs, so it's better to check that one (and it should have the explicit exclusive fused in already).

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
tamohannes force-pushed the tamohannes/waste-port-guardrails branch from 98f726b to a38fb66 on June 10, 2026 22:56
- Drop hardcoded internal cluster names from get_cluster_gpus_per_node; read gpus_per_node / num_gpus_per_node from the cluster config, else default to 8.
- Centralize the guardrail in should_get_random_port so every server-hosting pipeline (generate/eval, grpo, ppo, start_server) warns without per-call-site duplication; remove the scattered warn_hosted_server_allocation calls and the now-stale context argument.
- Treat an explicitly pinned server_port as a fixed port and suppress the collision warning under exclusive allocations; update/extend tests.

Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
tamohannes force-pushed the tamohannes/waste-port-guardrails branch from a38fb66 to 0390f19 on June 10, 2026 23:02
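
The config lookup described in the commit message above might look roughly like this. It is a sketch only: the key names `gpus_per_node` / `num_gpus_per_node` and the default of 8 come from the commit message, while the function name suffix, the lookup order between the two keys, and the `int()` coercion of string values are assumptions.

```python
from typing import Any, Mapping

DEFAULT_GPUS_PER_NODE = 8  # fallback stated in the commit message


def get_cluster_gpus_per_node_sketch(cluster_config: Mapping[str, Any]) -> int:
    """Hypothetical stand-in: read GPUs-per-node from the cluster config."""
    # Assumption: gpus_per_node takes precedence when both keys are present.
    for key in ("gpus_per_node", "num_gpus_per_node"):
        value = cluster_config.get(key)
        if value is not None:
            return int(value)
    return DEFAULT_GPUS_PER_NODE


print(get_cluster_gpus_per_node_sketch({"gpus_per_node": 4}))
print(get_cluster_gpus_per_node_sketch({"executor": "slurm"}))
```

Keeping the value in the cluster config means no cluster names need to be hard-coded, which is the point raised in review.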

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 63584624-5f15-4c56-b860-32e2ef7ea842

📥 Commits

Reviewing files that changed from the base of the PR and between 98f726b and a38fb66.

📒 Files selected for processing (7)
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/nemo_rl/grpo.py
  • nemo_skills/pipeline/start_server.py
  • nemo_skills/pipeline/utils/__init__.py
  • nemo_skills/pipeline/utils/server.py
  • nemo_skills/pipeline/verl/ppo.py
  • tests/test_pipeline_utils.py
✅ Files skipped from review due to trivial changes (1)
  • nemo_skills/pipeline/generate.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemo_skills/pipeline/utils/__init__.py
  • nemo_skills/pipeline/utils/server.py
  • tests/test_pipeline_utils.py

 log_dir=None,
 mount_paths=None,
-get_random_port=False,
+get_random_port=None,

Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Breaking change: default port behavior now GPU-aware instead of fixed.

Changing get_random_port default from False to None alters the behavior for all callers that don't explicitly set this parameter. Previously they received fixed ports (5000/6000); now partial-node allocations will receive random ports. While this improves collision-safety, existing automation that relies on fixed port numbers may break.
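
The behavior change can be pictured as a tri-state flag. The sketch below is illustrative and not taken from `start_server.py`: `resolve_random_port` and its `auto_decision` callback are hypothetical names, and the callback stands in for whatever GPU-aware decision the pipeline makes.

```python
from typing import Callable, Optional


def resolve_random_port(
    get_random_port: Optional[bool],
    auto_decision: Callable[[], bool],
) -> bool:
    """Hypothetical resolver for a tri-state get_random_port flag."""
    # None means "unset": defer to the GPU-aware decision.
    if get_random_port is None:
        return auto_decision()
    # An explicit True or False from the caller is always honored.
    return get_random_port


def partial_node_decision() -> bool:
    # Assumed outcome for a partial-node, non-exclusive job.
    return True


print(resolve_random_port(None, partial_node_decision))
print(resolve_random_port(False, partial_node_decision))
```

Under this reading, callers that depend on fixed ports can keep the old behavior by passing `get_random_port=False` explicitly; only callers that leave the flag unset see the new default.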

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/pipeline/start_server.py` at line 126, The change made
get_random_port's default to None in start_server changes behavior for callers;
revert the default to the original False to preserve fixed-port behavior for
existing automation (i.e., set get_random_port back to False in the start_server
signature) and update the start_server docstring to document the behavior; if
GPU-aware/random-port behavior is desired, introduce a new explicit parameter
(e.g., gpu_aware_get_random_port) instead of overloading get_random_port so
callers must opt in.

@tamohannes

Collaborator Author

Updated per review:

  • get_cluster_gpus_per_node no longer hard-codes cluster names — it reads gpus_per_node/num_gpus_per_node from the cluster config, defaulting to 8.
  • Folded the guardrail into should_get_random_port so every server-hosting pipeline (generate/eval, grpo, ppo, start_server) gets it — removed the per-call-site warn_hosted_server_allocation calls and the context arg.
  • Simplified the sbatch_kwargs/exclusive handling; an explicitly pinned server_port is now treated as a fixed port for the collision warning (+ tests).
