DRAFT fix: Nano-v3 recipe run fix. #2867
Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>
| if isinstance(e, ValueError) and not ( | ||
| "max_model_len" in message or "maximum context length" in message | ||
| ): | ||
| raise |
VLLMValidationError is a subclass of ValueError in vLLM 0.20 (vllm/exceptions.py: class VLLMValidationError(ValueError)). Because of that, isinstance(e, ValueError) here is True for every VLLMValidationError, so the guard re-raises any VLLMValidationError whose message lacks "max_model_len"/"maximum context length" — e.g. the sampling-param validation errors raised throughout vllm/sampling_params.py. Before this change (except VLLMValidationError) all of those returned HTTP 400; now they propagate instead, narrowing behavior for the exact type the handler was built for.
Consider gating the substring filter so it only applies to plain ValueError, never to VLLMValidationError:
| raise | |
| if not isinstance(e, VLLMValidationError) and not ( | |
| "max_model_len" in message or "maximum context length" in message | |
| ): | |
| raise |
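To make the intended behavior concrete, here is a minimal standalone sketch of the gated filter. The `VLLMValidationError` class below is a local stand-in (assumed to subclass `ValueError`, as described above), and `should_return_400` is a hypothetical helper name, not something in the PR or in vLLM.

```python
# Stand-in for vLLM's exception; assumed to subclass ValueError as noted above.
class VLLMValidationError(ValueError):
    pass


# Substrings the PR uses to recognize context-length overflows.
CONTEXT_LENGTH_MARKERS = ("max_model_len", "maximum context length")


def should_return_400(e: Exception) -> bool:
    """Hypothetical helper: True if the error should map to HTTP 400."""
    # Every validation error keeps its original 400 behavior, whatever the message.
    if isinstance(e, VLLMValidationError):
        return True
    # Plain ValueErrors only qualify when they describe a context-length overflow.
    if isinstance(e, ValueError):
        message = str(e)
        return any(marker in message for marker in CONTEXT_LENGTH_MARKERS)
    # Anything else should propagate (re-raise in the real handler).
    return False
```

Checking the subclass first keeps the substring filter from ever touching `VLLMValidationError`, which is the regression described above.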
Also, do you have examples for when the current behavior is a problem?
| DEFAULT_THINKING_TAGS = ["<think>", "</think>"] | ||
| def _ensure_nemo_gym_package_precedence() -> None: |
These fixes currently ship without tests. Two are pure, CPU-only logic and would be cheap to cover:
1. _ensure_nemo_gym_package_precedence: monkeypatch sys.path/sys.modules and point at a temp gym_init; assert the namespace-package shadow is purged, a real package (with __file__/PARENT_DIR) is preserved, a missing submodule is a no-op, and the call is idempotent. Home: tests/unit/environments/test_nemo_gym.py.
2. The vLLM ValueError→400 filter: a context-length ValueError (message containing max_model_len) → 400, a generic ValueError → re-raise, and a VLLMValidationError without those substrings → 400 (this last case guards the regression flagged in the other comment). tests/unit/models/generation/test_vllm_generation.py:240 already stubs VLLMValidationError (note: the stub there subclasses Exception, not ValueError; it should subclass ValueError to mirror the real class).
Since the sys.path/sys.modules manipulation is the subtlest change here, a unit test for (1) would be especially valuable before merge.
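As a rough illustration of the kind of logic such a test would exercise, here is a simplified sketch of purging a namespace-package shadow from `sys.modules`. It is not the PR's implementation: `purge_namespace_shadow` is a hypothetical helper, and it assumes the shadow can be recognized by a missing or `None` `__file__`, which is how namespace packages present themselves.

```python
import sys


def purge_namespace_shadow(package_name: str) -> bool:
    """Hypothetical helper: drop a namespace-package shadow from sys.modules.

    Returns True if something was purged, False if there was nothing to do.
    """
    module = sys.modules.get(package_name)
    if module is None:
        # Nothing imported under this name yet: no-op.
        return False
    if getattr(module, "__file__", None) is not None:
        # A regular package (it has an __init__ file): leave it alone.
        return False
    # Remove the shadow and any submodules cached beneath it so that the
    # next import re-resolves the name against the current sys.path.
    prefix = package_name + "."
    stale = [k for k in sys.modules if k == package_name or k.startswith(prefix)]
    for key in stale:
        del sys.modules[key]
    return True
```

A test along the lines suggested above could register a `types.ModuleType` without `__file__` as the shadow, call the helper twice to check idempotence, and confirm that a module with `__file__` set is preserved.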
| srun --no-container-mount-home --overlap --container-name=ray-head --container-workdir=$CONTAINER_CWD --nodes=1 --ntasks=1 -w "$head_node" -o $LOG_DIR/ray-driver.log bash -c "$COMMAND" | ||
| driver_exit_code=$? | ||
| set -e | ||
| touch "$LOG_DIR/ENDED" |
Let's separate this one out to another PR. It seems pretty unrelated to the rest
| _gym_port_high = self.cfg.get("port_range_high", DEFAULT_GYM_PORT_RANGE_HIGH) | ||
| self.head_server_port = _get_free_port_local(_gym_port_low, _gym_port_high) | ||
| _ensure_nemo_gym_package_precedence() |
The PR fixes the old NeMo-Gym PARENT_DIR crash in NeMo-RL.
When does this happen?
When NeMo-RL starts a NeMo-Gym actor, Python could accidentally grab the wrong folder named nemo_gym:
The wrong one is just an examples folder, so it does not contain PARENT_DIR. That caused:
ImportError: cannot import name 'PARENT_DIR' from 'nemo_gym'
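The PR makes NeMo-RL call _ensure_nemo_gym_package_precedence() before starting NeMo-Gym, so that the real nemo_gym package (the one that defines PARENT_DIR) takes precedence over the shadowing folder.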
It also fixes a smaller environment bug: NeMo-RL was setting VIRTUAL_ENV to the Python executable path like:
.../bin/python
but it should point to the venv folder:
.../
So the PR makes that correct too.
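For illustration, the venv folder can be derived from the interpreter path by going up two levels. This is a minimal sketch assuming the usual POSIX layout `<venv>/bin/python`; `venv_root_from_python` is a hypothetical helper and the path in the usage note is made up.

```python
from pathlib import PurePosixPath


def venv_root_from_python(python_executable: str) -> str:
    """Hypothetical helper: map <venv>/bin/python to <venv>.

    Assumes a POSIX venv layout. The path is handled purely as text and is
    deliberately not resolved, because bin/python inside a venv is often a
    symlink to the base interpreter and resolving it would leave the venv.
    """
    return str(PurePosixPath(python_executable).parent.parent)
```

For example, `venv_root_from_python("/opt/venvs/demo/bin/python")` gives `/opt/venvs/demo`, which is the kind of value VIRTUAL_ENV is expected to hold.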
Slurm Ray job cleanup
After the driver command finished, ray.sub did not signal the Ray sidecars to stop, so the Slurm allocation could remain alive even after training completed.
The PR touches LOG_DIR/ENDED after a non-empty COMMAND exits and preserves the driver exit code.
Context-length overflow handling
In nemo_rl/models/generation/vllm/vllm_worker_async.py:716, I changed the async chat endpoint so that our local ValueError for overlong prompts is converted to HTTP 400, the same as vLLM validation errors. A related Gym PR makes the corresponding change on the Gym side: fix: handle vllm context length errors in nano v3 recipe (Gym#1752).