
fix(pipeline): harden eval runtime plumbing #1474

Closed
pzelasko wants to merge 1 commit into NVIDIA-NeMo:main from pzelasko:codex/pr1443-runtime-fixes

@pzelasko pzelasko commented Jun 4, 2026

Collaborator

Summary

Split out from #1443.

This PR contains general runtime and pipeline hardening that is useful independently of the AppTek benchmark.

What Changed

  • Hardened async generation merge behavior:
    • preserves async output order
    • rejects negative or duplicate positions
    • reports missing indices instead of writing null rows
    • handles empty async output cleanly
  • Improved multimodal message content normalization:
    • accepts typed text dictionaries
    • normalizes None text to empty content
    • rejects malformed dicts and unsupported content types with clear errors
  • Improved self-hosted server readiness checks:
    • normalizes server addresses
    • probes /v1/models and /health
    • avoids relying on unsupported raw-address checks
  • Reduced random port collisions for concurrent SLURM submissions:
    • supports deferred SLURM_JOB_ID-derived ports
    • stores timestamped random-port reservations
    • prunes stale reservations with a TTL
  • Plumbed integer or string-valued ports through server and sandbox scripts.
  • Refined judge/summarize sbatch override handling and exclusivity defaults.
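
The deferred `SLURM_JOB_ID`-derived port idea can be illustrated with a minimal sketch. The function names, the base port, and the span below are hypothetical and are not taken from this PR; the point is only that the port is emitted as a shell arithmetic expression (a string) and evaluated inside the job, which is why port fields need to accept `int | str`.

```python
def deferred_slurm_port(base: int = 20000, span: int = 20000) -> str:
    """Build a shell arithmetic expression that resolves to a port at job runtime.

    Hypothetical helper: the real expression used by the PR is not shown in this thread.
    """
    if base < 1024 or base + span > 65536:
        raise ValueError("port range must stay within 1024..65535")
    # Inside $(( ... )) the shell expands SLURM_JOB_ID without a leading '$'.
    return f"$(({base} + SLURM_JOB_ID % {span}))"


def port_for_job(job_id: int, base: int = 20000, span: int = 20000) -> int:
    """Compute in Python the same value the shell expression would produce."""
    return base + job_id % span


print(deferred_slurm_port())   # $((20000 + SLURM_JOB_ID % 20000))
print(port_for_job(123456))    # 23456
```

Two jobs collide only if their IDs are congruent modulo `span`, which is the trade-off of a deterministic scheme compared with random reservations.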

Relationship To #1443

This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.

Validation

  • python -m py_compile on changed Python files.
  • Isolated smoke test for timestamped port reservation pruning.
  • CI lint/DCO/copyright checks are passing.

coderabbitai Bot commented Jun 4, 2026

Contributor

📝 Walkthrough

Harden async generation merging, normalize multimodal text dicts into list-form content blocks, widen port types to accept shell expressions, implement SLURM-aware deterministic or collision-resistant random port allocation, strengthen server readiness checks, and refine SLURM scheduling/defaults and env placeholder handling.

Changes

Inference & Generation

| Layer / File(s) | Summary |
| --- | --- |
| Async generation order validation (`nemo_skills/inference/generate.py`) | `restore_async_order()` now sizes the merged output by maximum async position, validates positions (non-negative, unique), returns an empty merged list for empty `-async` files, and raises `RuntimeError` listing missing indices instead of writing null rows. |
| Multimodal content text dict normalization (`nemo_skills/inference/model/vllm_multimodal.py`) | `content_text_to_list()` and chunked generation handling now accept typed text dicts (`{"type":"text","text":...}`) and convert them into list-form text blocks (empty when text is `None`); other dict shapes raise `TypeError`. |
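
The merge rules described for `restore_async_order()` can be sketched roughly as follows. This is not the code from `generate.py`: the position key name, the function name, and the use of `ValueError` for negative or duplicate positions are assumptions made for illustration; only the `RuntimeError` for missing indices is stated in the summary above.

```python
POSITION_KEY = "_async_position"  # hypothetical field name, not confirmed by the PR


def merge_async_rows(rows):
    """Return rows sorted into their original order, validating positions."""
    if not rows:
        # An empty async file yields an empty merged output.
        return []

    by_position = {}
    for row in rows:
        position = row[POSITION_KEY]
        if position < 0:
            raise ValueError(f"negative async position: {position}")
        if position in by_position:
            raise ValueError(f"duplicate async position: {position}")
        by_position[position] = row

    # Size the output by the largest position seen, then look for gaps.
    size = max(by_position) + 1
    missing = [i for i in range(size) if i not in by_position]
    if missing:
        # Fail loudly instead of writing null rows for the gaps.
        raise RuntimeError(f"missing async positions: {missing}")

    return [by_position[i] for i in range(size)]
```

Raising on gaps means a partially failed run is detected at merge time rather than showing up later as null rows in the output.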

Server Port Allocation & SLURM Integration

| Layer / File(s) | Summary |
| --- | --- |
| Port type widening (`nemo_skills/pipeline/utils/scripts/eval.py`, `nemo_skills/pipeline/utils/scripts/server.py`) | Port parameters and fields widened from `Optional[int]` to `Optional[int \| str]` (`_inject_single_server_overrides`, `ServerScript.port`, `SandboxScript.port`). |
| Port allocation with SLURM support (`nemo_skills/pipeline/utils/server.py`) | `get_free_port()` now returns `int \| str` and either a SLURM-job-ID-derived shell expression when enabled or a filesystem-locked random port otherwise. Module-level in-process reservations and per-UID lock/state are introduced. `get_server_command()` accepts `int \| str`. |
| Server wait command robustness (`nemo_skills/utils.py`) | `get_server_wait_cmd()` normalizes scheme/trailing slash, derives `/models` and `/health` endpoint URLs (handles `/v1` suffix), and waits by curling those endpoints rather than PUT-ing the raw address. |
| SLURM scheduling and env placeholder handling (`nemo_skills/pipeline/eval.py`, `nemo_skills/pipeline/utils/cluster.py`, `nemo_skills/pipeline/utils/exp.py`) | Refactors summarize-results sbatch kwargs parsing to preserve raw option and normalize dict, sets/normalizes defaults, conditionally overwrites or backfills time fields, uses `setdefault` for CPU partition exclusivity, removes explicit `exclusive=False` to allow executor defaults, and skips placeholder expansion for `NEMO_SKILLS_SANDBOX_PORT` when its value contains `SLURM_JOB_ID`. |
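
As a rough illustration of the readiness-check change, the sketch below normalizes an address and derives the two probe URLs. The function names, the default `http://` scheme, the polling interval, and the choice to accept either endpoint are assumptions; the actual `get_server_wait_cmd()` may combine the probes differently.

```python
import shlex


def readiness_urls(address: str) -> tuple[str, str]:
    """Derive (/v1/models URL, /health URL) from a loosely formatted server address."""
    address = address.strip().rstrip("/")
    if "://" not in address:
        # Assumption: default to plain HTTP when no scheme is given.
        address = f"http://{address}"
    # Drop a trailing /v1 so both endpoints are built from the server root.
    root = address[: -len("/v1")] if address.endswith("/v1") else address
    return f"{root}/v1/models", f"{root}/health"


def wait_command(address: str, interval: int = 3) -> str:
    """Build a shell loop that polls until either endpoint answers successfully."""
    models_url, health_url = readiness_urls(address)
    probe = (
        f"curl -sf {shlex.quote(models_url)} >/dev/null"
        f" || curl -sf {shlex.quote(health_url)} >/dev/null"
    )
    return f"until {probe}; do sleep {interval}; done"


print(readiness_urls("localhost:5000/v1/"))
# ('http://localhost:5000/v1/models', 'http://localhost:5000/health')
```

Probing concrete endpoints avoids depending on how a given server responds to a request against its bare address.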

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • AlexGrinch
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(pipeline): harden eval runtime plumbing' directly and accurately summarizes the main objective of the changeset, which involves multiple runtime and pipeline fixes across inference, evaluation, and server utilities to improve robustness. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/inference/model/vllm_multimodal.py`:
- Around line 324-333: The dict branches currently accept any dict that merely
contains "text" (even if "type" is wrong or extra keys exist); change the
validation to require an exact shape for text dicts: require content.get("type")
== "text" and that set(content.keys()) <= {"type","text"} (or equals
{"type","text"} if "type" must be present) and only then set result["content"] =
[] if text is None else [{"type":"text","text": str(text)}]; otherwise raise a
TypeError listing the unexpected keys. Apply the same tightened check to the
equivalent branch later in the file (the other content dict handling block).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 003669a8-f65d-42fb-87cc-7a6e9c19fcc4

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and a9159cf.

📒 Files selected for processing (9)
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/model/vllm_multimodal.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/cluster.py
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/pipeline/utils/scripts/eval.py
  • nemo_skills/pipeline/utils/scripts/server.py
  • nemo_skills/pipeline/utils/server.py
  • nemo_skills/utils.py

pzelasko commented Jun 4, 2026

Collaborator Author

@coderabbitai review

Pushed the requested multimodal dict validation fix on the current head. Please re-review.

coderabbitai Bot commented Jun 4, 2026

Contributor

@pzelasko Sure! I'll kick off a full re-review of PR #1474, including the multimodal dict validation fix you pushed.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🧹 Nitpick comments (1)
nemo_skills/inference/model/vllm_multimodal.py (1)

325-325: ⚡ Quick win

Use direct key access for required dict keys

Both branches validate a required typed shape, but still read type via .get(). Switch to direct key access after the key-set check to align with repo rules and keep failure mode explicit.

Suggested diff
-            if content.get("type") == "text" and set(content) == {"type", "text"}:
+            if set(content) == {"type", "text"} and content["type"] == "text":
                 text = content["text"]
                 result["content"] = [] if text is None else [{"type": "text", "text": str(text)}]
@@
-                        if content.get("type") == "text" and set(content) == {"type", "text"}:
+                        if set(content) == {"type", "text"} and content["type"] == "text":
                             text = content["text"]
                             text_content = [] if text is None else [{"type": "text", "text": str(text)}]

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 443-443
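
A standalone sketch of the exact-shape check, assuming a simplified helper rather than the real `content_text_to_list()` (the function name and error messages here are illustrative only):

```python
def normalize_text_content(content):
    """Convert message content into list-form blocks, rejecting unknown shapes."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    if isinstance(content, list):
        # Already in list form; pass through unchanged.
        return content
    if isinstance(content, dict):
        # Check the key set first so the direct lookups below cannot raise KeyError.
        if set(content) == {"type", "text"} and content["type"] == "text":
            text = content["text"]
            return [] if text is None else [{"type": "text", "text": str(text)}]
        raise TypeError(f"unsupported content dict with keys {list(content)}")
    raise TypeError(f"unsupported content type: {type(content).__name__}")


print(normalize_text_content({"type": "text", "text": None}))  # []
print(normalize_text_content("hi"))  # [{'type': 'text', 'text': 'hi'}]
```

Ordering the key-set comparison before `content["type"]` keeps direct key access safe while still failing explicitly on any dict that is not exactly `{"type", "text"}`.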

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/inference/model/vllm_multimodal.py` at line 325, The conditional
currently uses content.get("type") even after verifying the key set; change to
direct key access (content["type"]) in the vllm_multimodal.py branches that
validate a typed shape so a missing key fails loudly—specifically update the
check in the block referencing content.get("type") == "text" (around the if in
the multimodal parsing code) and the similar occurrence noted at the later
branch (around the second occurrence at the other branch) to use content["type"]
instead of .get().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/inference/model/vllm_multimodal.py`:
- Around line 442-449: The current branch lets arbitrary non-str/non-dict
`content` fall through into `text_content` and later list concatenation, causing
unclear runtime errors; update the chunking path to mirror
`content_text_to_list` semantics by explicitly accepting lists (assign to
`text_content`) and rejecting any other non-list/non-str/non-dict types with a
clear TypeError that references the offending value (use the same error style as
`content_text_to_list`), while keeping the existing dict handling for
`{"type","text"}` and string handling intact.

In `@nemo_skills/pipeline/utils/server.py`:
- Around line 76-106: The reservation file currently accumulates ports forever
causing exhaustion; change the state file to store timestamps and prune entries
older than a TTL before selecting/writing a port. While holding the existing
lock (lock_path), when reading state_path parse lines as "port[,ts]" (treat
legacy numeric-only lines as ts=0), drop entries with ts < now -
PORT_RESERVATION_TTL (define DEFAULT_PORT_RESERVATION_TTL, e.g. 3600), rebuild
the reserved set from remaining entries plus _ALLOCATED_RANDOM_PORTS, pick and
add the new port to reserved and _ALLOCATED_RANDOM_PORTS, and write back only
current entries as "port,ts" (using time.time()) so stale ports get recycled;
keep all I/O and mutation inside the same fcntl lock to avoid races and limit
total stored entries if desired.
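
The timestamped-reservation scheme described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in `server.py`: the function name, port range, and file layout are hypothetical, it omits the in-process `_ALLOCATED_RANDOM_PORTS` set, and it does not check that the chosen port is actually bindable. It relies on `fcntl`, so it is Unix-only.

```python
import fcntl
import os
import random
import time

DEFAULT_PORT_RESERVATION_TTL = 3600  # seconds; value suggested in the review comment


def reserve_random_port(state_path, lock_path, ttl=DEFAULT_PORT_RESERVATION_TTL,
                        low=10000, high=60000, now=None):
    """Pick an unreserved port in [low, high) and record it as 'port,timestamp'."""
    now = time.time() if now is None else now
    with open(lock_path, "a") as lock_file:
        # Hold an exclusive lock for the whole read-prune-write cycle.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            live = {}
            if os.path.exists(state_path):
                with open(state_path) as state_file:
                    for line in state_file:
                        line = line.strip()
                        if not line:
                            continue
                        port_text, _, ts_text = line.partition(",")
                        # Legacy numeric-only lines count as timestamp 0 (always stale).
                        timestamp = float(ts_text) if ts_text else 0.0
                        if timestamp >= now - ttl:
                            live[int(port_text)] = timestamp

            candidates = [p for p in range(low, high) if p not in live]
            if not candidates:
                raise RuntimeError("no free ports left in the configured range")
            port = random.choice(candidates)
            live[port] = now

            # Write back only the entries that are still within the TTL.
            with open(state_path, "w") as state_file:
                for reserved_port, timestamp in sorted(live.items()):
                    state_file.write(f"{reserved_port},{timestamp}\n")
            return port
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because stale entries are dropped on every call, the state file stays bounded by the number of reservations made within one TTL window.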

---

Nitpick comments:
In `@nemo_skills/inference/model/vllm_multimodal.py`:
- Line 325: The conditional currently uses content.get("type") even after
verifying the key set; change to direct key access (content["type"]) in the
vllm_multimodal.py branches that validate a typed shape so a missing key fails
loudly—specifically update the check in the block referencing
content.get("type") == "text" (around the if in the multimodal parsing code) and
the similar occurrence noted at the later branch (around the second occurrence
at the other branch) to use content["type"] instead of .get().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7e852756-004f-4adf-9af0-e4aec02667ec

📥 Commits

Reviewing files that changed from the base of the PR and between a9159cf and b48e96d.

📒 Files selected for processing (9)
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/model/vllm_multimodal.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/cluster.py
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/pipeline/utils/scripts/eval.py
  • nemo_skills/pipeline/utils/scripts/server.py
  • nemo_skills/pipeline/utils/server.py
  • nemo_skills/utils.py
✅ Files skipped from review due to trivial changes (1)
  • nemo_skills/pipeline/utils/scripts/eval.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • nemo_skills/inference/generate.py
  • nemo_skills/pipeline/utils/cluster.py
  • nemo_skills/pipeline/utils/scripts/server.py
  • nemo_skills/utils.py

Comment thread nemo_skills/inference/model/vllm_multimodal.py
Comment thread nemo_skills/pipeline/utils/server.py
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
@pzelasko pzelasko force-pushed the codex/pr1443-runtime-fixes branch from b48e96d to 5775e9d Compare June 4, 2026 17:43

pzelasko commented Jun 4, 2026

Collaborator Author

@coderabbitai review

Pushed fixes for the latest runtime review: direct typed-dict access, explicit list-only chunk content handling, and timestamped random-port reservations with TTL pruning. Please re-review the current head.

coderabbitai Bot commented Jun 4, 2026

Contributor

@pzelasko Sure! I'll kick off a full re-review of PR #1474 with the latest fixes — direct typed-dict access, explicit list-only chunk content handling, and timestamped random-port reservations with TTL pruning.

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/utils/cluster.py`:
- Around line 262-263: The current check only skips resolving SLURM placeholders
for NEMO_SKILLS_SANDBOX_PORT (if key == "NEMO_SKILLS_SANDBOX_PORT" and
"SLURM_JOB_ID" in value: continue) but LISTEN_PORT and NGINX_PORT are set from
the same SLURM expression and must be deferred too; update that conditional to
skip expansion when "SLURM_JOB_ID" appears in value for all sandbox-related port
keys (e.g., key in {"NEMO_SKILLS_SANDBOX_PORT","LISTEN_PORT","NGINX_PORT"}) or
alternately for any key that endswith "_PORT" used by the sandbox, so the
resolver later (the code around the existing continue and the resolver at the
subsequent lines) does not attempt to expand and fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 988a7506-257b-4e56-95f2-f3234d157fe9

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and 5775e9d.

📒 Files selected for processing (9)
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/model/vllm_multimodal.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/cluster.py
  • nemo_skills/pipeline/utils/exp.py
  • nemo_skills/pipeline/utils/scripts/eval.py
  • nemo_skills/pipeline/utils/scripts/server.py
  • nemo_skills/pipeline/utils/server.py
  • nemo_skills/utils.py

Comment on lines +262 to +263
if key == "NEMO_SKILLS_SANDBOX_PORT" and "SLURM_JOB_ID" in value:
    continue

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle deferred SLURM port placeholders for all sandbox port env keys.

Line 262 only skips expansion for NEMO_SKILLS_SANDBOX_PORT, but sandbox startup also sets LISTEN_PORT/NGINX_PORT from the same SLURM expression. Those still hit the resolver and can fail at Line 270.

💡 Proposed fix
-            if key == "NEMO_SKILLS_SANDBOX_PORT" and "SLURM_JOB_ID" in value:
+            if key in {"NEMO_SKILLS_SANDBOX_PORT", "LISTEN_PORT", "NGINX_PORT"} and "SLURM_JOB_ID" in value:
                 continue
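
To make the effect of the proposed condition concrete, here is a self-contained sketch. The surrounding loop in `cluster.py` is not shown in this thread, so `resolve_env` and the `resolver` callback are hypothetical stand-ins for whatever expands placeholders there; the sketch also assumes deferred values are carried through unchanged.

```python
# Keys named in the proposed fix; whether other sandbox keys exist is not known here.
DEFERRED_PORT_KEYS = frozenset({"NEMO_SKILLS_SANDBOX_PORT", "LISTEN_PORT", "NGINX_PORT"})


def should_defer_expansion(key, value):
    """True when a sandbox port value holds a SLURM expression to evaluate at job runtime."""
    return key in DEFERRED_PORT_KEYS and isinstance(value, str) and "SLURM_JOB_ID" in value


def resolve_env(env_vars, resolver):
    """Apply `resolver` to every value except deferred SLURM port expressions."""
    resolved = {}
    for key, value in env_vars.items():
        if should_defer_expansion(key, value):
            # Leave the shell expression untouched so the job evaluates it itself.
            resolved[key] = value
            continue
        resolved[key] = resolver(value)
    return resolved
```

With this shape, a resolver that fails on unknown variables never sees the `SLURM_JOB_ID` expression for any of the three port keys.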

Collaborator Author

Recreated as #1484 from the same-repo branch codex/pr1443-runtime-fixes so repository CI can run. Closing this fork-based PR to keep the review queue focused on the compliant replacement.

@pzelasko pzelasko closed this Jun 15, 2026