Warn on risky hosted server allocations#1465
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (6)
💤 Files with no reviewable changes (3)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughAdds cluster-GPU-aware server allocation utilities, integrates them into server startup and multiple job commands to decide random vs fixed ports, re-exports a warning helper, and adds tests exercising GPU-count resolution and warning emission for partial-node/exclusive/fixed-port scenarios. ChangesGPU-aware hosted-server allocation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_pipeline_utils.py (1)
72-82: 💤 Low valueConsider clarifying multi-warning test coverage.
This test correctly verifies the exclusive warning, but with
get_random_port=False, the fixed-port warning should also be emitted. The test doesn't assert its presence or absence, which could cause confusion if the implementation changes. Consider either:
- Adding an assertion that both warnings appear, or
- Adding a comment noting that this test focuses only on the exclusive warning
📝 Example: Verify both warnings or add clarifying comment
Option 1 - Assert both warnings:
assert "exclusive=True" in caplog.text assert "server_gpus=8" in caplog.text +assert "fixed server port" in caplog.textOption 2 - Add clarifying comment:
+# Note: This also triggers the fixed-port warning, but we only verify the exclusive warning here. assert "exclusive=True" in caplog.text assert "server_gpus=8" in caplog.text🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_pipeline_utils.py` around lines 72 - 82, The test test_warn_hosted_server_allocation_partial_exclusive currently checks only the exclusive and server_gpus warnings but omits verifying the fixed-port warning emitted when get_random_port=False; update the test to also assert that the fixed-port warning is present (e.g., assert "get_random_port=False" in caplog.text or the literal fixed-port warning message emitted by warn_hosted_server_allocation) so the test unambiguously covers both warnings emitted by warn_hosted_server_allocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 124-125: The collision warning logic incorrectly treats an
explicit server_port as non-fixed when get_random_port is True; update the
computation used where fixed_port is defined so that fixed_port is True whenever
server_port is not None regardless of get_random_port (e.g., compute fixed_port
as True if server_port is provided OR get_random_port is explicitly False), then
use that fixed_port in the existing is_partial_node check (referencing
fixed_port, get_random_port, server_port, and is_partial_node) so partial-node
collision warnings fire whenever a specific port was requested.
---
Nitpick comments:
In `@tests/test_pipeline_utils.py`:
- Around line 72-82: The test
test_warn_hosted_server_allocation_partial_exclusive currently checks only the
exclusive and server_gpus warnings but omits verifying the fixed-port warning
emitted when get_random_port=False; update the test to also assert that the
fixed-port warning is present (e.g., assert "get_random_port=False" in
caplog.text or the literal fixed-port warning message emitted by
warn_hosted_server_allocation) so the test unambiguously covers both warnings
emitted by warn_hosted_server_allocation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 16d1f1e4-7e61-4246-a8d3-fca656aad35f
📒 Files selected for processing (7)
nemo_skills/pipeline/generate.pynemo_skills/pipeline/nemo_rl/grpo.pynemo_skills/pipeline/start_server.pynemo_skills/pipeline/utils/__init__.pynemo_skills/pipeline/utils/server.pynemo_skills/pipeline/verl/ppo.pytests/test_pipeline_utils.py
| return int(value) | ||
|
|
||
| cluster_name = str(cluster_config.get("_cluster_yaml_name") or cluster_config.get("name") or "").lower() | ||
| if any(token in cluster_name for token in ("aws-cmh", "aws-dfw")): |
There was a problem hiding this comment.
let's not reference any internal infrastructure. We can just add this as explicit argument in cluster config or default to 8 otherwise
| get_random_port = pipeline_utils.should_get_random_port(server_gpus, exclusive) | ||
| gpus_per_node = pipeline_utils.get_cluster_gpus_per_node(cluster_config) | ||
| get_random_port = pipeline_utils.should_get_random_port(server_gpus, exclusive, gpus_per_node) | ||
| pipeline_utils.warn_hosted_server_allocation( |
There was a problem hiding this comment.
this should be done for all pipelines, not just the ones you added it for? E.g. eval, sft, etc.
| gpus_per_node=gpus_per_node, | ||
| get_random_port=get_random_port, | ||
| server_port=None, | ||
| context="ns verl ppo", |
There was a problem hiding this comment.
I think this might get stale very fast and not super useful, I'd probably not use context at all
| if get_random_port is None: | ||
| get_random_port = should_get_random_port( | ||
| server_gpus=server_gpus, | ||
| exclusive=(sbatch_kwargs or {}).get("exclusive") if isinstance(sbatch_kwargs, dict) else None, |
There was a problem hiding this comment.
sbatch_kwargs should already be resolved to always be dict or none at this point. And we can probably check for exclusive ones
| if get_random_port is None: | ||
| get_random_port = should_get_random_port( | ||
| server_gpus=server_gpus, | ||
| exclusive=exclusive, |
There was a problem hiding this comment.
I think exclusive can also be passed through sbatch kwargs, so better to check for that one (and it should have explicit exclusive fused already
Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
98f726b to
a38fb66
Compare
- Drop hardcoded internal cluster names from get_cluster_gpus_per_node; read gpus_per_node / num_gpus_per_node from the cluster config, else default to 8. - Centralize the guardrail in should_get_random_port so every server-hosting pipeline (generate/eval, grpo, ppo, start_server) warns without per-call-site duplication; remove the scattered warn_hosted_server_allocation calls and the now-stale context argument. - Treat an explicitly pinned server_port as a fixed port and suppress the collision warning under exclusive allocations; update/extend tests. Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com>
a38fb66 to
0390f19
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/pipeline/start_server.py`:
- Line 126: The change made get_random_port's default to None in start_server
changes behavior for callers; revert the default to the original False to
preserve fixed-port behavior for existing automation (i.e., set get_random_port
back to False in the start_server signature) and update the start_server
docstring to document the behavior; if GPU-aware/random-port behavior is
desired, introduce a new explicit parameter (e.g., gpu_aware_get_random_port)
instead of overloading get_random_port so callers must opt in.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 63584624-5f15-4c56-b860-32e2ef7ea842
📒 Files selected for processing (7)
nemo_skills/pipeline/generate.pynemo_skills/pipeline/nemo_rl/grpo.pynemo_skills/pipeline/start_server.pynemo_skills/pipeline/utils/__init__.pynemo_skills/pipeline/utils/server.pynemo_skills/pipeline/verl/ppo.pytests/test_pipeline_utils.py
✅ Files skipped from review due to trivial changes (1)
- nemo_skills/pipeline/generate.py
🚧 Files skipped from review as they are similar to previous changes (3)
- nemo_skills/pipeline/utils/init.py
- nemo_skills/pipeline/utils/server.py
- tests/test_pipeline_utils.py
| log_dir=None, | ||
| mount_paths=None, | ||
| get_random_port=False, | ||
| get_random_port=None, |
There was a problem hiding this comment.
Breaking change: default port behavior now GPU-aware instead of fixed.
Changing get_random_port default from False to None alters the behavior for all callers that don't explicitly set this parameter. Previously they received fixed ports (5000/6000); now partial-node allocations will receive random ports. While this improves collision-safety, existing automation that relies on fixed port numbers may break.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/pipeline/start_server.py` at line 126, The change made
get_random_port's default to None in start_server changes behavior for callers;
revert the default to the original False to preserve fixed-port behavior for
existing automation (i.e., set get_random_port back to False in the start_server
signature) and update the start_server docstring to document the behavior; if
GPU-aware/random-port behavior is desired, introduce a new explicit parameter
(e.g., gpu_aware_get_random_port) instead of overloading get_random_port so
callers must opt in.
|
Updated per review:
|
Summary
Adds hosted-server guardrails for common wasted-compute configurations:
exclusive=Truemay reserve more GPUs than the server can usegpus_per_nodeWhy
Partial-node serving jobs can collide on fixed localhost ports, and exclusive allocations can reserve idle GPUs when
server_gpusis smaller than the node size. Both patterns can produce idle-GPU waste while the job appears submitted successfully.Tests
python -m pytest tests/test_pipeline_utils.py::test_should_get_random_port_respects_cluster_gpu_count tests/test_pipeline_utils.py::test_get_cluster_gpus_per_node_known_clusters_and_override tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_partial_fixed_port tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_partial_exclusive tests/test_pipeline_utils.py::test_warn_hosted_server_allocation_random_partial_has_no_port_warning -qpython -m ruff check nemo_skills/pipeline/generate.py nemo_skills/pipeline/start_server.py nemo_skills/pipeline/utils/server.py nemo_skills/pipeline/nemo_rl/grpo.py nemo_skills/pipeline/verl/ppo.py tests/test_pipeline_utils.pySummary by CodeRabbit
New Features
Improvements
Tests