Skip to content

DRAFT fix: Nano-v3 recipe run fix. #2867

Open
snowmanwwg wants to merge 6 commits into
mainfrom
fix/nemo-gym-package-precedence
Open

DRAFT fix: Nano-v3 recipe run fix. #2867
snowmanwwg wants to merge 6 commits into
mainfrom
fix/nemo-gym-package-precedence

Conversation

@snowmanwwg

@snowmanwwg snowmanwwg commented Jun 18, 2026

Copy link
Copy Markdown
Contributor
  1. The PR fixes the old NeMo-Gym PARENT_DIR crash in NeMo-RL.

When NeMo-RL starts a NeMo-Gym actor, Python could accidentally grab the wrong folder named nemo_gym:

  • Wrong one: examples/nemo_gym
  • Correct one: 3rdparty/Gym-workspace/Gym/nemo_gym

The wrong one is just an examples folder, so it does not contain PARENT_DIR. That caused:

ImportError: cannot import name 'PARENT_DIR' from 'nemo_gym'

The PR makes NeMo-RL do this before starting NeMo-Gym:

  1. Put the real NeMo-Gym package path first.
  2. If Python already loaded the wrong examples/nemo_gym, clear it out.
  3. Then import NeMo-Gym normally.

It also fixes a smaller environment bug: NeMo-RL was setting VIRTUAL_ENV to the Python executable path like:

.../bin/python

but it should point to the venv folder:

.../

So the PR makes that correct too.

  1. Slurm Ray job cleanup
    After the driver command finished, ray.sub did not signal the Ray sidecars to stop, so the Slurm allocation could remain alive even after training completed.
    The PR touches LOG_DIR/ENDED after a non-empty COMMAND exits and preserves the driver exit code.

  2. Context-length overflow handling
    In nemo_rl/models/generation/vllm/vllm_worker_async.py:716, I changed the async chat endpoint so our local ValueError for overlong prompts is converted to HTTP 400, same as vLLM
    validation errors. Also a related Gym PR that fix it in the corresponding gym side fix: handle vllm context length errors in nano v3 recipe Gym#1752

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>
@snowmanwwg snowmanwwg requested review from a team as code owners June 18, 2026 04:03
@copy-pr-bot

copy-pr-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@snowmanwwg snowmanwwg changed the title DRAFT fix: prefer real NeMo-Gym package in actor DRAFT fix: Nano-v3 recipe run fix. Jun 25, 2026
Wenwen Gao added 2 commits June 24, 2026 22:36
Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>
Signed-off-by: Wenwen Gao <wenweng@cw-dfw-cs-001-vscode-01.cm.cluster>
@snowmanwwg snowmanwwg requested review from a team as code owners June 25, 2026 05:36
@snowmanwwg snowmanwwg requested a review from yfw June 26, 2026 00:11
if isinstance(e, ValueError) and not (
"max_model_len" in message or "maximum context length" in message
):
raise

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vllm_worker_async.py:733

VLLMValidationError is a subclass of ValueError in vLLM 0.20 (vllm/exceptions.py: class VLLMValidationError(ValueError)). Because of that, isinstance(e, ValueError) here is True for every VLLMValidationError, so the guard re-raises any VLLMValidationError whose message lacks "max_model_len"/"maximum context length" — e.g. the sampling-param validation errors raised throughout vllm/sampling_params.py. Before this change (except VLLMValidationError) all of those returned HTTP 400; now they propagate instead, narrowing behavior for the exact type the handler was built for.

Consider gating the substring filter so it only applies to plain ValueError, never to VLLMValidationError:

Suggested change
raise
if not isinstance(e, VLLMValidationError) and not (
"max_model_len" in message or "maximum context length" in message
):
raise

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do you have examples for when the current behavior is a problem?

DEFAULT_THINKING_TAGS = ["<think>", "</think>"]


def _ensure_nemo_gym_package_precedence() -> None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nemo_gym.py:42

These fixes currently ship without tests. Two are pure, CPU-only logic and would be cheap to cover:

  1. _ensure_nemo_gym_package_precedence — monkeypatch sys.path/sys.modules and point at a temp gym_init; assert the namespace-package shadow is purged, a real package (with __file__/PARENT_DIR) is preserved, a missing submodule is a no-op, and the call is idempotent. Home: tests/unit/environments/test_nemo_gym.py.
  2. The vLLM ValueError→400 filter — a context-length ValueError (message containing max_model_len) → 400, a generic ValueError → re-raise, and a VLLMValidationError without those substrings → 400 (this last case guards the regression flagged in the other comment). tests/unit/models/generation/test_vllm_generation.py:240 already stubs VLLMValidationError (note: the stub there subclasses Exception, not ValueError — it should subclass ValueError to mirror the real class).

Since the sys.path/sys.modules manipulation is the subtlest change here, a unit test for (1) would be especially valuable before merge.

Comment thread ray.sub
srun --no-container-mount-home --overlap --container-name=ray-head --container-workdir=$CONTAINER_CWD --nodes=1 --ntasks=1 -w "$head_node" -o $LOG_DIR/ray-driver.log bash -c "$COMMAND"
driver_exit_code=$?
set -e
touch "$LOG_DIR/ENDED"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's separate this one out to another PR. It seems pretty unrelated to the rest

_gym_port_high = self.cfg.get("port_range_high", DEFAULT_GYM_PORT_RANGE_HIGH)
self.head_server_port = _get_free_port_local(_gym_port_low, _gym_port_high)

_ensure_nemo_gym_package_precedence()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR fixes the old NeMo-Gym PARENT_DIR crash in NeMo-RL.

When does this happen?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants