fix(pipeline): harden eval runtime plumbing#1474
Conversation
📝 WalkthroughWalkthroughHarden async generation merging, normalize multimodal text dicts into list-form content blocks, widen port types to accept shell expressions, implement SLURM-aware deterministic or collision-resistant random port allocation, strengthen server readiness checks, and refine SLURM scheduling/defaults and env placeholder handling. ChangesInference & Generation
Server Port Allocation & SLURM Integration
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/inference/model/vllm_multimodal.py`:
- Around line 324-333: The dict branches currently accept any dict that merely
contains "text" (even if "type" is wrong or extra keys exist); change the
validation to require an exact shape for text dicts: require content.get("type")
== "text" and that set(content.keys()) <= {"type","text"} (or equals
{"type","text"} if "type" must be present) and only then set result["content"] =
[] if text is None else [{"type":"text","text": str(text)}]; otherwise raise a
TypeError listing the unexpected keys. Apply the same tightened check to the
equivalent branch later in the file (the other content dict handling block).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 003669a8-f65d-42fb-87cc-7a6e9c19fcc4
📒 Files selected for processing (9)
nemo_skills/inference/generate.pynemo_skills/inference/model/vllm_multimodal.pynemo_skills/pipeline/eval.pynemo_skills/pipeline/utils/cluster.pynemo_skills/pipeline/utils/exp.pynemo_skills/pipeline/utils/scripts/eval.pynemo_skills/pipeline/utils/scripts/server.pynemo_skills/pipeline/utils/server.pynemo_skills/utils.py
a9159cf to
b48e96d
Compare
|
@coderabbitai review\n\nPushed the requested multimodal dict validation fix on the current head. Please re-review. |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemo_skills/inference/model/vllm_multimodal.py (1)
325-325: ⚡ Quick winUse direct key access for required dict keys
Both branches validate a required typed shape, but still read
typevia.get(). Switch to direct key access after the key-set check to align with repo rules and keep failure mode explicit.Suggested diff
- if content.get("type") == "text" and set(content) == {"type", "text"}: + if set(content) == {"type", "text"} and content["type"] == "text": text = content["text"] result["content"] = [] if text is None else [{"type": "text", "text": str(text)}] @@ - if content.get("type") == "text" and set(content) == {"type", "text"}: + if set(content) == {"type", "text"} and content["type"] == "text": text = content["text"] text_content = [] if text is None else [{"type": "text", "text": str(text)}]As per coding guidelines: "Don't use
.get()for accessing dictionary keys if the code expects them to be present; use direct accessdata[key_name]to fail with a clear error instead of silently corrupting data".Also applies to: 443-443
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@nemo_skills/inference/model/vllm_multimodal.py` at line 325, The conditional currently uses content.get("type") even after verifying the key set; change to direct key access (content["type"]) in the vllm_multimodal.py branches that validate a typed shape so a missing key fails loudly—specifically update the check in the block referencing content.get("type") == "text" (around the if in the multimodal parsing code) and the similar occurrence noted at the later branch (around the second occurrence at the other branch) to use content["type"] instead of .get().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/inference/model/vllm_multimodal.py`:
- Around line 442-449: The current branch lets arbitrary non-str/non-dict
`content` fall through into `text_content` and later list concatenation, causing
unclear runtime errors; update the chunking path to mirror
`content_text_to_list` semantics by explicitly accepting lists (assign to
`text_content`) and rejecting any other non-list/non-str/non-dict types with a
clear TypeError that references the offending value (use the same error style as
`content_text_to_list`), while keeping the existing dict handling for
`{"type","text"}` and string handling intact.
In `@nemo_skills/pipeline/utils/server.py`:
- Around line 76-106: The reservation file currently accumulates ports forever
causing exhaustion; change the state file to store timestamps and prune entries
older than a TTL before selecting/writing a port. While holding the existing
lock (lock_path), when reading state_path parse lines as "port[,ts]" (treat
legacy numeric-only lines as ts=0), drop entries with ts < now -
PORT_RESERVATION_TTL (define DEFAULT_PORT_RESERVATION_TTL, e.g. 3600), rebuild
the reserved set from remaining entries plus _ALLOCATED_RANDOM_PORTS, pick and
add the new port to reserved and _ALLOCATED_RANDOM_PORTS, and write back only
current entries as "port,ts" (using time.time()) so stale ports get recycled;
keep all I/O and mutation inside the same fcntl lock to avoid races and limit
total stored entries if desired.
---
Nitpick comments:
In `@nemo_skills/inference/model/vllm_multimodal.py`:
- Line 325: The conditional currently uses content.get("type") even after
verifying the key set; change to direct key access (content["type"]) in the
vllm_multimodal.py branches that validate a typed shape so a missing key fails
loudly—specifically update the check in the block referencing
content.get("type") == "text" (around the if in the multimodal parsing code) and
the similar occurrence noted at the later branch (around the second occurrence
at the other branch) to use content["type"] instead of .get().
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7e852756-004f-4adf-9af0-e4aec02667ec
📒 Files selected for processing (9)
nemo_skills/inference/generate.pynemo_skills/inference/model/vllm_multimodal.pynemo_skills/pipeline/eval.pynemo_skills/pipeline/utils/cluster.pynemo_skills/pipeline/utils/exp.pynemo_skills/pipeline/utils/scripts/eval.pynemo_skills/pipeline/utils/scripts/server.pynemo_skills/pipeline/utils/server.pynemo_skills/utils.py
✅ Files skipped from review due to trivial changes (1)
- nemo_skills/pipeline/utils/scripts/eval.py
🚧 Files skipped from review as they are similar to previous changes (4)
- nemo_skills/inference/generate.py
- nemo_skills/pipeline/utils/cluster.py
- nemo_skills/pipeline/utils/scripts/server.py
- nemo_skills/utils.py
Signed-off-by: Piotr Żelasko <pzelasko@nvidia.com>
b48e96d to
5775e9d
Compare
|
@coderabbitai review Pushed fixes for the latest runtime review: direct typed-dict access, explicit list-only chunk content handling, and timestamped random-port reservations with TTL pruning. Please re-review the current head. |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@nemo_skills/pipeline/utils/cluster.py`:
- Around line 262-263: The current check only skips resolving SLURM placeholders
for NEMO_SKILLS_SANDBOX_PORT (if key == "NEMO_SKILLS_SANDBOX_PORT" and
"SLURM_JOB_ID" in value: continue) but LISTEN_PORT and NGINX_PORT are set from
the same SLURM expression and must be deferred too; update that conditional to
skip expansion when "SLURM_JOB_ID" appears in value for all sandbox-related port
keys (e.g., key in {"NEMO_SKILLS_SANDBOX_PORT","LISTEN_PORT","NGINX_PORT"}) or
alternately for any key that endswith "_PORT" used by the sandbox, so the
resolver later (the code around the existing continue and the resolver at the
subsequent lines) does not attempt to expand and fail.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 988a7506-257b-4e56-95f2-f3234d157fe9
📒 Files selected for processing (9)
nemo_skills/inference/generate.pynemo_skills/inference/model/vllm_multimodal.pynemo_skills/pipeline/eval.pynemo_skills/pipeline/utils/cluster.pynemo_skills/pipeline/utils/exp.pynemo_skills/pipeline/utils/scripts/eval.pynemo_skills/pipeline/utils/scripts/server.pynemo_skills/pipeline/utils/server.pynemo_skills/utils.py
| if key == "NEMO_SKILLS_SANDBOX_PORT" and "SLURM_JOB_ID" in value: | ||
| continue |
There was a problem hiding this comment.
Handle deferred SLURM port placeholders for all sandbox port env keys.
Line 262 only skips expansion for NEMO_SKILLS_SANDBOX_PORT, but sandbox startup also sets LISTEN_PORT/NGINX_PORT from the same SLURM expression. Those still hit the resolver and can fail at Line 270.
💡 Proposed fix
- if key == "NEMO_SKILLS_SANDBOX_PORT" and "SLURM_JOB_ID" in value:
+ if key in {"NEMO_SKILLS_SANDBOX_PORT", "LISTEN_PORT", "NGINX_PORT"} and "SLURM_JOB_ID" in value:
continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@nemo_skills/pipeline/utils/cluster.py` around lines 262 - 263, The current
check only skips resolving SLURM placeholders for NEMO_SKILLS_SANDBOX_PORT (if
key == "NEMO_SKILLS_SANDBOX_PORT" and "SLURM_JOB_ID" in value: continue) but
LISTEN_PORT and NGINX_PORT are set from the same SLURM expression and must be
deferred too; update that conditional to skip expansion when "SLURM_JOB_ID"
appears in value for all sandbox-related port keys (e.g., key in
{"NEMO_SKILLS_SANDBOX_PORT","LISTEN_PORT","NGINX_PORT"}) or alternately for any
key that endswith "_PORT" used by the sandbox, so the resolver later (the code
around the existing continue and the resolver at the subsequent lines) does not
attempt to expand and fail.
|
Recreated as #1484 from the same-repo branch |
Summary
Split out from #1443.
This PR contains general runtime and pipeline hardening that is useful independently of the AppTek benchmark.
What Changed
Nonetext to empty content/v1/modelsand/healthSLURM_JOB_ID-derived portsRelationship To #1443
This PR is a prerequisite/adjacent cleanup split out of the original AppTek PR. It does not add the AppTek benchmark itself.
Validation
python -m py_compileon changed Python files.