DRAFT fix: Nano-v3 recipe run fix. by snowmanwwg · Pull Request #2867 · NVIDIA-NeMo/RL

snowmanwwg · 2026-06-18T04:03:17Z

The PR fixes the old NeMo-Gym PARENT_DIR crash in NeMo-RL.

When NeMo-RL starts a NeMo-Gym actor, Python could accidentally grab the wrong folder named nemo_gym:

Wrong one: examples/nemo_gym
Correct one: 3rdparty/Gym-workspace/Gym/nemo_gym

The wrong one is just an examples folder, so it does not contain PARENT_DIR. That caused:

ImportError: cannot import name 'PARENT_DIR' from 'nemo_gym'

The PR makes NeMo-RL do this before starting NeMo-Gym:

Put the real NeMo-Gym package path first.
If Python already loaded the wrong examples/nemo_gym, clear it out.
Then import NeMo-Gym normally.

It also fixes a smaller environment bug: NeMo-RL was setting VIRTUAL_ENV to the Python executable path like:

.../bin/python

but it should point to the venv folder:

.../

So the PR makes that correct too.

Slurm Ray job cleanup
After the driver command finished, ray.sub did not signal the Ray sidecars to stop, so the Slurm allocation could remain alive even after training completed.
The PR touches LOG_DIR/ENDED after a non-empty COMMAND exits and preserves the driver exit code.
Context-length overflow handling
In nemo_rl/models/generation/vllm/vllm_worker_async.py:716, I changed the async chat endpoint so our local ValueError for overlong prompts is converted to HTTP 400, same as vLLM
validation errors. Also a related Gym PR that fix it in the corresponding gym side fix: handle vllm context length errors in nano v3 recipe Gym#1752

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>

copy-pr-bot · 2026-06-18T04:03:20Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>

yfw · 2026-06-26T23:34:50Z

+                if isinstance(e, ValueError) and not (
+                    "max_model_len" in message or "maximum context length" in message
+                ):
+                    raise


vllm_worker_async.py:733

VLLMValidationError is a subclass of ValueError in vLLM 0.20 (vllm/exceptions.py: class VLLMValidationError(ValueError)). Because of that, isinstance(e, ValueError) here is True for every VLLMValidationError, so the guard re-raises any VLLMValidationError whose message lacks "max_model_len"/"maximum context length" — e.g. the sampling-param validation errors raised throughout vllm/sampling_params.py. Before this change (except VLLMValidationError) all of those returned HTTP 400; now they propagate instead, narrowing behavior for the exact type the handler was built for.

Consider gating the substring filter so it only applies to plain ValueError, never to VLLMValidationError:

Suggested change

raise

if not isinstance(e, VLLMValidationError) and not (

"max_model_len" in message or "maximum context length" in message

):

raise

Also, do you have examples for when the current behavior is a problem?

yfw · 2026-06-26T23:34:51Z

 DEFAULT_THINKING_TAGS = ["<think>", "</think>"]


+def _ensure_nemo_gym_package_precedence() -> None:


nemo_gym.py:42

These fixes currently ship without tests. Two are pure, CPU-only logic and would be cheap to cover:

_ensure_nemo_gym_package_precedence — monkeypatch sys.path/sys.modules and point at a temp gym_init; assert the namespace-package shadow is purged, a real package (with __file__/PARENT_DIR) is preserved, a missing submodule is a no-op, and the call is idempotent. Home: tests/unit/environments/test_nemo_gym.py.

The vLLM ValueError→400 filter — a context-length ValueError (message containing max_model_len) → 400, a generic ValueError → re-raise, and a VLLMValidationError without those substrings → 400 (this last case guards the regression flagged in the other comment). tests/unit/models/generation/test_vllm_generation.py:240 already stubs VLLMValidationError (note: the stub there subclasses Exception, not ValueError — it should subclass ValueError to mirror the real class).

Since the sys.path/sys.modules manipulation is the subtlest change here, a unit test for (1) would be especially valuable before merge.

yfw · 2026-06-26T23:34:51Z

  srun --no-container-mount-home --overlap --container-name=ray-head --container-workdir=$CONTAINER_CWD --nodes=1 --ntasks=1 -w "$head_node" -o $LOG_DIR/ray-driver.log bash -c "$COMMAND"
+  driver_exit_code=$?
+  set -e
+  touch "$LOG_DIR/ENDED"


Let's separate this one out to another PR. It seems pretty unrelated to the rest

yfw · 2026-06-27T00:13:09Z

        _gym_port_high = self.cfg.get("port_range_high", DEFAULT_GYM_PORT_RANGE_HIGH)
        self.head_server_port = _get_free_port_local(_gym_port_low, _gym_port_high)

+        _ensure_nemo_gym_package_precedence()


The PR fixes the old NeMo-Gym PARENT_DIR crash in NeMo-RL.

When does this happen?

fix: prefer real NeMo-Gym package in actor

9ff5657

Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>

snowmanwwg requested review from a team as code owners June 18, 2026 04:03

snowmanwwg changed the title ~~DRAFT fix: prefer real NeMo-Gym package in actor~~ DRAFT fix: Nano-v3 recipe run fix. Jun 25, 2026

Wenwen Gao added 2 commits June 24, 2026 22:36

fix(slurm): stop Ray sidecars after driver exits

af874c4

Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>

fix(nemo-gym): map overlong vllm prompts to 400

8262980

Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>

snowmanwwg requested review from a team as code owners June 25, 2026 05:36

snowmanwwg added the r0.7.0 label Jun 26, 2026

snowmanwwg requested a review from yfw June 26, 2026 00:11

Wenwen Gao added 3 commits June 26, 2026 12:31

chore: remove test-only changes from nano-v3 PR

cfc93f4

chore: drop test changes from nano-v3 PR diff

875e702

chore: drop prefetch venv change from nano-v3 PR

d35e2d5

yfw reviewed Jun 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DRAFT fix: Nano-v3 recipe run fix. #2867

DRAFT fix: Nano-v3 recipe run fix. #2867
snowmanwwg wants to merge 6 commits into
mainfrom
fix/nemo-gym-package-precedence

snowmanwwg commented Jun 18, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 18, 2026

Uh oh!

yfw Jun 26, 2026

Uh oh!

yfw Jun 27, 2026

Uh oh!

yfw Jun 26, 2026

Uh oh!

yfw Jun 26, 2026

Uh oh!

yfw Jun 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		DEFAULT_THINKING_TAGS = ["<think>", "</think>"]


		def _ensure_nemo_gym_package_precedence() -> None:

Uh oh!

Conversation

snowmanwwg commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot Bot commented Jun 18, 2026

Uh oh!

yfw Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

yfw Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

yfw Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

yfw Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

yfw Jun 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

snowmanwwg commented Jun 18, 2026 •

edited

Loading