fix: make rollout /run failures recoverable #1788

Open
yuchenwang3 wants to merge 2 commits into NVIDIA-NeMo:main from yuchenwang3:recoverable-rollout-errors

Conversation

@yuchenwang3

Two issues that together turn one resource-server error into a fatal crash:

  1. raise_for_status re-raises aiohttp's ClientResponseError unchanged. The exception holds request_info/history/headers, which are or contain multidict.CIMultiDictProxy objects that don't pickle. When rollout collection runs under Ray, the exception is pickled to cross actors and the run dies with "can't pickle CIMultiDictProxy", so any resource-server 5xx takes down the whole job instead of failing a single rollout. Fix: strip the unpicklable fields; keep status/message/response_content.

  2. _post_subroutine never retries /run. A transient 5xx (e.g. a momentarily overloaded code-exec server) aborts the batch. Retry a few times on 5xx/connection errors; leave 4xx alone.
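
For illustration, the retry policy in (2) could look roughly like the sketch below. It is not the code in this PR: `post_with_retry`, `HTTPStatusError`, the `send` callable, and the attempt/backoff defaults are all hypothetical names and values chosen for the example, and it uses only the standard library rather than aiohttp.

```python
import asyncio


class HTTPStatusError(Exception):
    """Hypothetical error carrying only the status code and body text."""

    def __init__(self, status, body):
        super().__init__(status, body)
        self.status = status
        self.body = body


async def post_with_retry(send, max_attempts=3, base_delay=0.5):
    """Call `send()` and retry only on 5xx responses and connection errors.

    `send` is an assumed async callable that returns (status, body) or
    raises ConnectionError. 4xx responses are raised immediately.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            status, body = await send()
        except ConnectionError as exc:
            last_error = exc  # transient network failure: retry
        else:
            if status < 400:
                return body
            error = HTTPStatusError(status, body)
            if status < 500:
                raise error  # 4xx: the request itself is wrong, do not retry
            last_error = error  # 5xx: possibly transient, retry
        if attempt < max_attempts:
            # Exponential backoff between attempts: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

With `max_attempts=3`, a server that returns 503 twice and then 200 yields the 200 body, while a 422 is raised after a single call. Retrying a POST like this is only appropriate if repeating the /run request is acceptable for the resource server.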

How we hit it: under GRPO, a vLLM worker 500s on a NaN logprob (fixed separately in NVIDIA-NeMo/RL#2962). That 500 reaches raise_for_status, and (1) then crashes the whole rollout collection on the unpicklable exception rather than surfacing the underlying error.
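
The failure mode in (1) can be reproduced without aiohttp or Ray. In the sketch below, `FakeResponseError` is a stand-in for `ClientResponseError`, and `types.MappingProxyType` plays the role of `CIMultiDictProxy` (both refuse to pickle). `PicklableHTTPError` and `to_picklable` are hypothetical names, not the ones used in this PR; they only show the idea of keeping status/message/response_content as plain values.

```python
import pickle
from types import MappingProxyType


class FakeResponseError(Exception):
    """Stand-in for aiohttp.ClientResponseError (assumption, stdlib only)."""

    def __init__(self, status, message, headers=None, request_info=None, history=()):
        super().__init__(status, message)
        self.status = status
        self.message = message
        self.headers = headers  # unpicklable proxy object in the real case
        self.request_info = request_info
        self.history = history


class PicklableHTTPError(Exception):
    """Hypothetical replacement that keeps only plain, picklable fields."""

    def __init__(self, status, message, response_content):
        # Passing all fields to super() keeps them in `args`, so the default
        # exception pickling can rebuild the object with the same arguments.
        super().__init__(status, message, response_content)
        self.status = status
        self.message = message
        self.response_content = response_content


def to_picklable(err, response_content):
    """Copy the plain fields off `err` and drop everything else."""
    return PicklableHTTPError(int(err.status), str(err.message), str(response_content))


def can_pickle(obj):
    """Return True if `obj` survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False
```

Here `can_pickle` returns `False` for a `FakeResponseError` whose `headers` is a `MappingProxyType`, and `True` for the result of `to_picklable`, which round-trips through `pickle` with its three fields intact.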

@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nemo-automation-bot nemo-automation-bot Bot added the community-request Issue reported or requested by someone from the community label Jun 26, 2026
raise_for_status re-raises aiohttp's ClientResponseError unchanged. It carries
request_info/history/headers, which are multidict.CIMultiDictProxy objects that
don't pickle. When rollout collection runs under Ray the exception is pickled to
cross actors and the run dies with "can't pickle CIMultiDictProxy" - so any
resource-server 5xx takes down the whole job instead of failing one rollout.

Strip the unpicklable fields; keep status/message/response_content.

Signed-off-by: yuchenwang3 <yuchenwang3@users.noreply.github.com>
_post_subroutine doesn't retry /run, so a transient 5xx (e.g. a momentarily
overloaded code-exec server) aborts the batch. Retry a few times on 5xx and
connection errors; leave 4xx alone.

Signed-off-by: yuchenwang3 <yuchenwang3@users.noreply.github.com>
@yuchenwang3 yuchenwang3 changed the title from "Make rollout /run failures recoverable" to "fix: make rollout /run failures recoverable" Jun 26, 2026
@yuchenwang3 yuchenwang3 force-pushed the recoverable-rollout-errors branch from 03e29e7 to 22ed1af June 26, 2026 22:09

Labels

community-request Issue reported or requested by someone from the community

1 participant