fix(pipeline): harden eval runtime plumbing by pzelasko · Pull Request #1474 · NVIDIA-NeMo/Skills

pzelasko · 2026-06-04T17:12:21Z

Summary

Split out from #1443.

This PR contains general runtime and pipeline hardening that is useful independently of the AppTek benchmark.

What Changed

Hardened async generation merge behavior:
- preserves async output order
- rejects negative or duplicate positions
- reports missing indices instead of writing null rows
- handles empty async output cleanly
Improved multimodal message content normalization:
- accepts typed text dictionaries
- normalizes None text to empty content
- rejects malformed dicts and unsupported content types with clear errors
Improved self-hosted server readiness checks:
- normalizes server addresses
- probes /v1/models and /health
- avoids relying on unsupported raw-address checks
Reduced random port collisions for concurrent SLURM submissions:
- supports deferred SLURM_JOB_ID-derived ports
- stores timestamped random-port reservations
- prunes stale reservations with a TTL
Plumbed integer or string-valued ports through server and sandbox scripts.
Refined judge/summarize sbatch override handling and exclusivity defaults.
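
The async merge rules listed above can be illustrated with a minimal sketch. It is not the code from `nemo_skills/inference/generate.py`: the function name `merge_async_rows` and the `position` key are hypothetical, chosen only to show the validation order (negative and duplicate positions first, then missing indices).

```python
def merge_async_rows(rows, position_key="position"):
    """Return rows ordered by recorded position, validating along the way.

    Hypothetical helper: the name and the `position` key are assumptions.
    """
    if not rows:
        return []  # empty async output merges to an empty list

    by_position = {}
    for row in rows:
        pos = row[position_key]
        if pos < 0:
            raise ValueError(f"negative position: {pos}")
        if pos in by_position:
            raise ValueError(f"duplicate position: {pos}")
        by_position[pos] = row

    # Size the merged output by the largest position seen.
    size = max(by_position) + 1
    missing = [i for i in range(size) if i not in by_position]
    if missing:
        # Fail loudly instead of emitting null rows for the gaps.
        raise RuntimeError(f"missing async indices: {missing}")
    return [by_position[i] for i in range(size)]
```

Raising on gaps keeps a partially completed run from silently producing a file whose row count no longer matches the input.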

Relationship To #1443

This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.

Validation

python -m py_compile on changed Python files.
Isolated smoke test for timestamped port reservation pruning.
CI lint/DCO/copyright checks are passing.

coderabbitai · 2026-06-04T17:18:10Z

📝 Walkthrough

Harden async generation merging, normalize multimodal text dicts into list-form content blocks, widen port types to accept shell expressions, implement SLURM-aware deterministic or collision-resistant random port allocation, strengthen server readiness checks, and refine SLURM scheduling/defaults and env placeholder handling.

Changes

Inference & Generation

| Layer / File(s) | Summary |
|---|---|
| Async generation order validation `nemo_skills/inference/generate.py` | `restore_async_order()` now sizes the merged output by maximum async position, validates positions (non-negative, unique), returns an empty merged list for empty `-async` files, and raises `RuntimeError` listing missing indices instead of writing null rows. |
| Multimodal content text dict normalization `nemo_skills/inference/model/vllm_multimodal.py` | `content_text_to_list()` and chunked generation handling now accept typed text dicts (`{"type":"text","text":...}`) and convert them into list-form text blocks (empty when `text` is `None`); other dict shapes raise `TypeError`. |

Server Port Allocation & SLURM Integration

| Layer / File(s) | Summary |
|---|---|
| Port type widening `nemo_skills/pipeline/utils/scripts/eval.py`, `nemo_skills/pipeline/utils/scripts/server.py` | Port parameters and fields widened from `Optional[int]` to `Optional[int \| str]` (`_inject_single_server_overrides`, `ServerScript.port`, `SandboxScript.port`). |
| Port allocation with SLURM support `nemo_skills/pipeline/utils/server.py` | `get_free_port()` now returns `int \| str`: a SLURM-job-ID-derived shell expression when enabled, or a filesystem-locked random port otherwise. Module-level in-process reservations and per-UID lock/state are introduced. `get_server_command()` accepts `int \| str`. |
| Server wait command robustness `nemo_skills/utils.py` | `get_server_wait_cmd()` normalizes scheme/trailing slash, derives `/models`/`/health` endpoint URLs (handles `/v1` suffix), and waits by curling those endpoints rather than PUT-ing the raw address. |
| SLURM scheduling and env placeholder handling `nemo_skills/pipeline/eval.py`, `nemo_skills/pipeline/utils/cluster.py`, `nemo_skills/pipeline/utils/exp.py` | Refactors summarize-results sbatch kwargs parsing to preserve the raw option and normalize it to a dict, sets/normalizes defaults, conditionally overwrites or backfills `time` fields, uses `setdefault` for CPU partition exclusivity, removes explicit `exclusive=False` to allow executor defaults, and skips placeholder expansion for `NEMO_SKILLS_SANDBOX_PORT` when its value contains `SLURM_JOB_ID`. |
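
The readiness-check normalization described for `get_server_wait_cmd()` might look roughly like the sketch below. This is an illustration under stated assumptions, not the repository's implementation: `readiness_urls` and `wait_command` are hypothetical names, defaulting to `http://` when no scheme is given is an assumption, and the 10-second sleep is arbitrary.

```python
import shlex


def readiness_urls(address):
    """Derive candidate readiness endpoints from a server address (hypothetical helper)."""
    base = address.strip().rstrip("/")
    if "://" not in base:
        base = f"http://{base}"  # assumed default scheme
    if base.endswith("/v1"):
        # Address already points at the OpenAI-style prefix.
        root = base[: -len("/v1")]
        models = f"{base}/models"
    else:
        root = base
        models = f"{base}/v1/models"
    return [models, f"{root}/health"]


def wait_command(address, sleep_seconds=10):
    """Build a shell loop that succeeds once any endpoint answers with a 2xx status."""
    checks = " || ".join(
        f"curl -sf {shlex.quote(url)} > /dev/null" for url in readiness_urls(address)
    )
    return f"until {checks}; do sleep {sleep_seconds}; done"
```

Probing concrete endpoints with `curl -f` makes the loop depend on an actual successful HTTP response rather than on how a server happens to react to a request against its bare address.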

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

AlexGrinch

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'fix(pipeline): harden eval runtime plumbing' directly and accurately summarizes the main objective of the changeset, which involves multiple runtime and pipeline fixes across inference, evaluation, and server utilities to improve robustness. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/inference/model/vllm_multimodal.py`:
- Around line 324-333: The dict branches currently accept any dict that merely
contains "text" (even if "type" is wrong or extra keys exist); change the
validation to require an exact shape for text dicts: require content.get("type")
== "text" and that set(content.keys()) <= {"type","text"} (or equals
{"type","text"} if "type" must be present) and only then set result["content"] =
[] if text is None else [{"type":"text","text": str(text)}]; otherwise raise a
TypeError listing the unexpected keys. Apply the same tightened check to the
equivalent branch later in the file (the other content dict handling block).
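
A minimal sketch of the exact-shape check requested above, assuming that plain strings become a single text block and that lists pass through unchanged (both are assumptions about the surrounding code). The function name `normalize_text_content` is hypothetical and is not the repository's `content_text_to_list()`.

```python
def normalize_text_content(content):
    """Normalize message content into list-form text blocks (hypothetical helper)."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]  # assumed string handling
    if isinstance(content, list):
        return content  # assumed: already in list form
    if isinstance(content, dict):
        # Require exactly the keys {"type", "text"} before touching the values.
        if set(content) == {"type", "text"} and content["type"] == "text":
            text = content["text"]
            return [] if text is None else [{"type": "text", "text": str(text)}]
        keys = sorted(map(str, content))
        unexpected = sorted(set(keys) - {"type", "text"})
        raise TypeError(
            f"expected a text dict with keys 'type' and 'text'; "
            f"got keys {keys} (unexpected: {unexpected})"
        )
    raise TypeError(f"unsupported content type: {type(content).__name__}")
```

Checking the key set before reading `content["type"]` means direct key access can never raise `KeyError` in the accepted branch, and every rejected shape produces one `TypeError` naming the offending keys.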

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 003669a8-f65d-42fb-87cc-7a6e9c19fcc4

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and a9159cf.

📒 Files selected for processing (9)

nemo_skills/inference/generate.py
nemo_skills/inference/model/vllm_multimodal.py
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/cluster.py
nemo_skills/pipeline/utils/exp.py
nemo_skills/pipeline/utils/scripts/eval.py
nemo_skills/pipeline/utils/scripts/server.py
nemo_skills/pipeline/utils/server.py
nemo_skills/utils.py

pzelasko · 2026-06-04T17:31:32Z

@coderabbitai review

Pushed the requested multimodal dict validation fix on the current head. Please re-review.

coderabbitai · 2026-06-04T17:31:44Z

@pzelasko Sure! I'll kick off a full re-review of PR #1474, including the multimodal dict validation fix you pushed.

✅ Action performed

Full review finished.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

nemo_skills/inference/model/vllm_multimodal.py (1)

325-325: ⚡ Quick win

Use direct key access for required dict keys

Both branches validate a required typed shape, but still read type via .get(). Switch to direct key access after the key-set check to align with repo rules and keep failure mode explicit.

Suggested diff

-            if content.get("type") == "text" and set(content) == {"type", "text"}:
+            if set(content) == {"type", "text"} and content["type"] == "text":
                 text = content["text"]
                 result["content"] = [] if text is None else [{"type": "text", "text": str(text)}]
@@
-                        if content.get("type") == "text" and set(content) == {"type", "text"}:
+                        if set(content) == {"type", "text"} and content["type"] == "text":
                             text = content["text"]
                             text_content = [] if text is None else [{"type": "text", "text": str(text)}]

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

Also applies to: 443-443

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@nemo_skills/inference/model/vllm_multimodal.py` at line 325, The conditional
currently uses content.get("type") even after verifying the key set; change to
direct key access (content["type"]) in the vllm_multimodal.py branches that
validate a typed shape so a missing key fails loudly—specifically update the
check in the block referencing content.get("type") == "text" (around the if in
the multimodal parsing code) and the similar occurrence noted at the later
branch (around the second occurrence at the other branch) to use content["type"]
instead of .get().

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/inference/model/vllm_multimodal.py`:
- Around line 442-449: The current branch lets arbitrary non-str/non-dict
`content` fall through into `text_content` and later list concatenation, causing
unclear runtime errors; update the chunking path to mirror
`content_text_to_list` semantics by explicitly accepting lists (assign to
`text_content`) and rejecting any other non-list/non-str/non-dict types with a
clear TypeError that references the offending value (use the same error style as
`content_text_to_list`), while keeping the existing dict handling for
`{"type","text"}` and string handling intact.

In `@nemo_skills/pipeline/utils/server.py`:
- Around line 76-106: The reservation file currently accumulates ports forever
causing exhaustion; change the state file to store timestamps and prune entries
older than a TTL before selecting/writing a port. While holding the existing
lock (lock_path), when reading state_path parse lines as "port[,ts]" (treat
legacy numeric-only lines as ts=0), drop entries with ts < now -
PORT_RESERVATION_TTL (define DEFAULT_PORT_RESERVATION_TTL, e.g. 3600), rebuild
the reserved set from remaining entries plus _ALLOCATED_RANDOM_PORTS, pick and
add the new port to reserved and _ALLOCATED_RANDOM_PORTS, and write back only
current entries as "port,ts" (using time.time()) so stale ports get recycled;
keep all I/O and mutation inside the same fcntl lock to avoid races and limit
total stored entries if desired.
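
The reservation scheme suggested in this comment can be sketched as follows. This is a simplified illustration, not the code in `nemo_skills/pipeline/utils/server.py`: `reserve_port`, the port range, and the `now` parameter (added so the behavior can be exercised deterministically) are hypothetical, the in-process `_ALLOCATED_RANDOM_PORTS` set is left out, and `fcntl` limits it to Unix-like systems.

```python
import fcntl
import random
import time

PORT_RESERVATION_TTL = 3600  # seconds; the example value from the review comment


def reserve_port(state_path, lock_path, low=5000, high=30000, now=None):
    """Reserve a random port, recycling reservations older than the TTL."""
    now = time.time() if now is None else now
    with open(lock_path, "w") as lock:
        # Exclusive lock held for all reads and writes; released on close.
        fcntl.flock(lock, fcntl.LOCK_EX)

        entries = {}
        try:
            with open(state_path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    port, _, ts = line.partition(",")
                    # Legacy numeric-only lines are treated as ts=0 (always stale).
                    entries[int(port)] = float(ts) if ts else 0.0
        except FileNotFoundError:
            pass

        # Prune stale reservations before choosing a port.
        cutoff = now - PORT_RESERVATION_TTL
        entries = {p: ts for p, ts in entries.items() if ts >= cutoff}

        free = [p for p in range(low, high) if p not in entries]
        if not free:
            raise RuntimeError("no free ports left in the configured range")
        port = random.choice(free)
        entries[port] = now

        # Write back only current entries in "port,ts" form.
        with open(state_path, "w") as f:
            for p, ts in sorted(entries.items()):
                f.write(f"{p},{ts}\n")
        return port
```

Because pruning, selection, and write-back all happen under the same lock, two concurrent submitters cannot both observe a port as free.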

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7e852756-004f-4adf-9af0-e4aec02667ec

📥 Commits

Reviewing files that changed from the base of the PR and between a9159cf and b48e96d.

📒 Files selected for processing (9)

nemo_skills/inference/generate.py
nemo_skills/inference/model/vllm_multimodal.py
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/cluster.py
nemo_skills/pipeline/utils/exp.py
nemo_skills/pipeline/utils/scripts/eval.py
nemo_skills/pipeline/utils/scripts/server.py
nemo_skills/pipeline/utils/server.py
nemo_skills/utils.py

✅ Files skipped from review due to trivial changes (1)

nemo_skills/pipeline/utils/scripts/eval.py

🚧 Files skipped from review as they are similar to previous changes (4)

nemo_skills/inference/generate.py
nemo_skills/pipeline/utils/cluster.py
nemo_skills/pipeline/utils/scripts/server.py
nemo_skills/utils.py

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko · 2026-06-04T17:46:33Z

@coderabbitai review

Pushed fixes for the latest runtime review: direct typed-dict access, explicit list-only chunk content handling, and timestamped random-port reservations with TTL pruning. Please re-review the current head.

coderabbitai · 2026-06-04T17:46:44Z

@pzelasko Sure! I'll kick off a full re-review of PR #1474 with the latest fixes — direct typed-dict access, explicit list-only chunk content handling, and timestamped random-port reservations with TTL pruning.

✅ Action performed

Full review finished.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@nemo_skills/pipeline/utils/cluster.py`:
- Around line 262-263: The current check only skips resolving SLURM placeholders
for NEMO_SKILLS_SANDBOX_PORT (if key == "NEMO_SKILLS_SANDBOX_PORT" and
"SLURM_JOB_ID" in value: continue) but LISTEN_PORT and NGINX_PORT are set from
the same SLURM expression and must be deferred too; update that conditional to
skip expansion when "SLURM_JOB_ID" appears in value for all sandbox-related port
keys (e.g., key in {"NEMO_SKILLS_SANDBOX_PORT","LISTEN_PORT","NGINX_PORT"}) or
alternately for any key that endswith "_PORT" used by the sandbox, so the
resolver later (the code around the existing continue and the resolver at the
subsequent lines) does not attempt to expand and fail.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 988a7506-257b-4e56-95f2-f3234d157fe9

📥 Commits

Reviewing files that changed from the base of the PR and between 9c34cd1 and 5775e9d.

📒 Files selected for processing (9)

nemo_skills/inference/generate.py
nemo_skills/inference/model/vllm_multimodal.py
nemo_skills/pipeline/eval.py
nemo_skills/pipeline/utils/cluster.py
nemo_skills/pipeline/utils/exp.py
nemo_skills/pipeline/utils/scripts/eval.py
nemo_skills/pipeline/utils/scripts/server.py
nemo_skills/pipeline/utils/server.py
nemo_skills/utils.py

coderabbitai · 2026-06-04T17:55:30Z

+            if key == "NEMO_SKILLS_SANDBOX_PORT" and "SLURM_JOB_ID" in value:
+                continue


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle deferred SLURM port placeholders for all sandbox port env keys.

Line 262 only skips expansion for NEMO_SKILLS_SANDBOX_PORT, but sandbox startup also sets LISTEN_PORT/NGINX_PORT from the same SLURM expression. Those still hit the resolver and can fail at Line 270.

💡 Proposed fix

-            if key == "NEMO_SKILLS_SANDBOX_PORT" and "SLURM_JOB_ID" in value:
+            if key in {"NEMO_SKILLS_SANDBOX_PORT", "LISTEN_PORT", "NGINX_PORT"} and "SLURM_JOB_ID" in value:
                 continue
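
To make the failure mode concrete, here is a small self-contained sketch of a resolver that defers SLURM-derived port expressions. Everything in it is hypothetical: the `resolve_env` function, the `$VAR` placeholder syntax it expands, and the example port expression are assumptions for illustration and are not taken from `nemo_skills/pipeline/utils/cluster.py`.

```python
import re

# Keys whose values may hold a shell expression evaluated inside the job.
DEFERRED_PORT_KEYS = {"NEMO_SKILLS_SANDBOX_PORT", "LISTEN_PORT", "NGINX_PORT"}
_PLACEHOLDER = re.compile(r"\$\{?(\w+)\}?")


def resolve_env(env_vars, known):
    """Expand $VAR placeholders, leaving deferred SLURM port expressions untouched."""
    resolved = {}
    for key, value in env_vars.items():
        if key in DEFERRED_PORT_KEYS and "SLURM_JOB_ID" in value:
            # SLURM_JOB_ID only exists once the job runs, so keep the raw value.
            resolved[key] = value
            continue

        def substitute(match):
            name = match.group(1)
            if name not in known:
                raise ValueError(f"cannot resolve ${name} in {key}")
            return known[name]

        resolved[key] = _PLACEHOLDER.sub(substitute, value)
    return resolved


env = {
    "NGINX_PORT": "$((20000 + SLURM_JOB_ID % 10000))",  # illustrative expression
    "HF_HOME": "$DATA_DIR/hf",
}
print(resolve_env(env, {"DATA_DIR": "/data"}))
```

With only `NEMO_SKILLS_SANDBOX_PORT` in the deferred set, the `NGINX_PORT` entry above would reach the substitution step at submission time and raise, which is the situation the review comment warns about.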

pzelasko · 2026-06-15T15:23:39Z

Recreated as #1484 from the same-repo branch codex/pr1443-runtime-fixes so repository CI can run. Closing this fork-based PR to keep the review queue focused on the compliant replacement.

coderabbitai Bot reviewed Jun 4, 2026

Comment thread nemo_skills/inference/model/vllm_multimodal.py

pzelasko mentioned this pull request Jun 4, 2026

feat(speech): add AppTek Call-Center Dialogues ASR benchmark #1443

Closed

pzelasko force-pushed the codex/pr1443-runtime-fixes branch from a9159cf to b48e96d Compare June 4, 2026 17:25

coderabbitai Bot reviewed Jun 4, 2026

Comment thread nemo_skills/inference/model/vllm_multimodal.py

Comment thread nemo_skills/pipeline/utils/server.py

fix(pipeline): harden eval runtime plumbing

5775e9d

Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>

pzelasko force-pushed the codex/pr1443-runtime-fixes branch from b48e96d to 5775e9d Compare June 4, 2026 17:43

coderabbitai Bot reviewed Jun 4, 2026

pzelasko mentioned this pull request Jun 15, 2026

fix(pipeline): harden eval runtime plumbing #1484

Open

pzelasko closed this Jun 15, 2026
