
[rollout, trainer, cfg] feat: per-request abort hooks and AbortableLLMServerClient #6865

Open
cr-gao wants to merge 2 commits into verl-project:main from cr-gao:feat/llm-client-abort-hooks

cr-gao commented Jun 27, 2026

What does this PR do?

Adds selective (per-request) abort support to the rollout client path. Today,
LLMServerClient has no way to interrupt a single in-flight request; the only option is the
broadcast abort_all_requests on the rollout replica. This PR introduces
lightweight extension hooks plus a concrete AbortableLLMServerClient that tracks
in-flight requests, so a custom AgentLoopWorker/AgentLoopManager can abort one specific
request (e.g. a timed-out rollout, or a custom trajectory-collection policy that decides to
drop a particular sequence).

AI assistance (Claude) was used to write this change; the author has reviewed every line.

Checklist Before Starting

Resolves #6866. Feature discussion / motivation: #6866.

  • Format the PR title as [{modules}] {type}: {description}

Test

Added tests/workers/rollout/test_abortable_llm_client_on_cpu.py (CPU-only, mocks the Ray
load balancer + server handles — no GPU/vLLM needed):

  • test_record_abort_and_cleanup — dispatch records the request; abort() targets the
    inner vLLM request id (not the outer sticky-session id); completion clears the entry.
  • test_inflight_cleared_on_normal_completion — normal finish also clears the table.
  • test_abort_unknown_request_is_noop — aborting an unknown/finished id neither raises nor
    hits the server.
  • test_cancellation_aborts_server_and_clears_inflight — cancelling generate() (e.g.
    asyncio.wait_for/timeout) fires _on_cancel, which aborts the server-side request so it
    does not leak, and _on_complete still clears the in-flight entry.
pytest tests/workers/rollout/test_abortable_llm_client_on_cpu.py -v

tests/workers/rollout/test_abortable_llm_client_on_cpu.py::test_record_abort_and_cleanup PASSED                  [ 25%]
tests/workers/rollout/test_abortable_llm_client_on_cpu.py::test_inflight_cleared_on_normal_completion PASSED     [ 50%]
tests/workers/rollout/test_abortable_llm_client_on_cpu.py::test_abort_unknown_request_is_noop PASSED             [ 75%]
tests/workers/rollout/test_abortable_llm_client_on_cpu.py::test_cancellation_aborts_server_and_clears_inflight PASSED [100%]

API and Usage Example

New config field actor_rollout_ref.rollout.agent.llm_client_class (FQN, dynamically
imported — mirrors the existing agent_loop_manager_class), wired in ray_trainer.py:

actor_rollout_ref:
  rollout:
    agent:
      llm_client_class: verl.workers.rollout.llm_server.AbortableLLMServerClient

# Inside a custom AgentLoopWorker/Manager holding the client:
# Explicit per-request abort:
await llm_client.abort(request_id)

# Cancellation (e.g. timeout) is handled automatically — the _on_cancel hook aborts the
# server-side request, so this does NOT leak server compute:
await asyncio.wait_for(llm_client.generate(request_id=request_id, ...), timeout=T)
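Because llm_client_class is a fully qualified name that is imported dynamically, the loading step amounts to splitting the string and importing the module. A minimal sketch of that resolution, assuming an unset value falls back to the default client; load_class_by_fqn and resolve_client_cls are hypothetical names for illustration, not verl's actual helpers, and a stdlib class stands in for AbortableLLMServerClient so the sketch runs anywhere.

```python
import importlib


def load_class_by_fqn(fqn):
    """Import 'package.module.ClassName' and return the class (hypothetical helper)."""
    module_path, _, class_name = fqn.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a fully qualified class name, got {fqn!r}")
    module = importlib.import_module(module_path)
    cls = getattr(module, class_name)  # AttributeError if the class is missing
    if not isinstance(cls, type):
        raise TypeError(f"{fqn!r} does not name a class")
    return cls


def resolve_client_cls(llm_client_class, default_cls):
    """Return the configured client class, or the default when the field is unset."""
    if not llm_client_class:
        return default_cls
    return load_class_by_fqn(llm_client_class)


if __name__ == "__main__":
    # A stdlib class stands in for AbortableLLMServerClient so this runs anywhere.
    print(resolve_client_cls("collections.OrderedDict", dict))  # <class 'collections.OrderedDict'>
    print(resolve_client_cls(None, dict))  # <class 'dict'>
```

Per the PR description, the resolved class is then passed as client_cls to get_client().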

Design & Code Changes

  • llm_server.py:
    • Hoist the inner per-turn vLLM request_id into a named inner_request_id so subclasses
      can record it.
    • Add no-op hooks _on_dispatch(request_id, inner_request_id, server) (before awaiting
      generation), _on_complete(request_id) (in finally), and _on_cancel(request_id)
      (in except asyncio.CancelledError, before the finally, so the in-flight entry still
      exists).
    • Add AbortableLLMServerClient(LLMServerClient): tracks request_id -> (inner_request_id, server) in _inflight; _on_complete is the sole cleanup point; abort(request_id)
      and _on_cancel share a _send_abort helper that sends a precise abort_request to the
      right server (abort awaits it; _on_cancel is fire-and-forget, since awaiting during
      cancellation is unsafe). Inherits the base (non-resuming) generate, i.e. hard-abort
      semantics.
  • workers/config/rollout.py: add llm_client_class field to AgentLoopConfig.
  • trainer/ppo/ray_trainer.py: load llm_client_class (if set) and pass it as client_cls
    to get_client().
  • agent_loop.py: fix typo chunkes -> chunks.
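The hook design above can be sketched in plain asyncio, with no Ray or vLLM. FakeServer, HookedClient and AbortableClient are hypothetical stand-ins: real server handles are Ray actors invoked via .remote(), and the real generate() takes different arguments. The sketch only shows the shape: dispatch records (inner_request_id, server), abort() targets the inner id, and _on_complete is the single place entries are removed.

```python
import asyncio
import uuid


class FakeServer:
    """Hypothetical stand-in for a rollout server handle (real handles are Ray actors)."""

    def __init__(self):
        self.aborted = []
        self._abort_events = {}

    async def generate(self, inner_request_id, prompt):
        event = self._abort_events.setdefault(inner_request_id, asyncio.Event())
        try:
            # Finish after 50 ms unless an abort for this inner id arrives first.
            await asyncio.wait_for(event.wait(), timeout=0.05)
            return "ABORTED"
        except asyncio.TimeoutError:
            return f"out:{prompt}"
        finally:
            self._abort_events.pop(inner_request_id, None)

    async def abort_request(self, inner_request_id):
        self.aborted.append(inner_request_id)
        event = self._abort_events.get(inner_request_id)
        if event is not None:
            event.set()


class HookedClient:
    """Base client: generate() fires no-op hooks around the server call."""

    def __init__(self, server):
        self.server = server

    async def generate(self, request_id, prompt):
        # Inner per-turn id, distinct from the outer (sticky-session) request_id.
        inner_request_id = f"{request_id}-{uuid.uuid4().hex[:8]}"
        self._on_dispatch(request_id, inner_request_id, self.server)
        try:
            return await self.server.generate(inner_request_id, prompt)
        finally:
            self._on_complete(request_id)

    def _on_dispatch(self, request_id, inner_request_id, server):
        """Default no-op."""

    def _on_complete(self, request_id):
        """Default no-op."""


class AbortableClient(HookedClient):
    """Tracks request_id -> (inner_request_id, server) so one request can be aborted."""

    def __init__(self, server):
        super().__init__(server)
        self._inflight = {}

    def _on_dispatch(self, request_id, inner_request_id, server):
        self._inflight[request_id] = (inner_request_id, server)

    def _on_complete(self, request_id):
        self._inflight.pop(request_id, None)  # sole cleanup point

    async def abort(self, request_id):
        entry = self._inflight.get(request_id)
        if entry is None:
            return  # unknown or already finished: no-op, server untouched
        inner_request_id, server = entry
        await server.abort_request(inner_request_id)  # target the inner id


async def demo():
    server = FakeServer()
    client = AbortableClient(server)
    task = asyncio.create_task(client.generate("req-1", "hi"))
    await asyncio.sleep(0)  # let generate() run up to the server call
    inner_id = client._inflight["req-1"][0]
    await client.abort("req-1")
    result = await task
    return result, server.aborted == [inner_id], dict(client._inflight)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # ('ABORTED', True, {})
```

Keeping removal in _on_complete alone means the table is cleaned on success, error, abort, and cancellation without abort() having to touch it.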

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks. (All hooks pass; the local check-naming-conventions false
    positive only triggers on vendored files under .venv, which is gitignored and absent in CI.)
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to cover the code.
  • Send a message in the ci-request channel once ready for CI.

🤖 Generated with Claude Code

…MServerClient

Add selective per-request abort to the rollout client path: extension hooks on
LLMServerClient plus a concrete AbortableLLMServerClient, selectable via the new
actor_rollout_ref.rollout.agent.llm_client_class config. Also fix a typo
(chunkes -> chunks) in agent_loop.py.

Co-authored-by: Claude
Signed-off-by: cr-gao <gaochenrui@sjtu.edu.cn>

CLAassistant commented Jun 27, 2026

CLA assistant check
All committers have signed the CLA.

gemini-code-assist bot left a comment

Contributor

Code Review

This pull request introduces AbortableLLMServerClient to support per-request aborts and in-flight request tracking, adds corresponding CPU unit tests, and allows configuring a custom LLMServerClient class. It also fixes a minor typo in the agent loop. The review feedback highlights a critical issue where client-side task cancellation (such as timeouts) would silently leak GPU resources on the vLLM server because the server-side request is not aborted; the reviewer suggests introducing and implementing an _on_cancel hook to safely abort the request upon cancellation.

Comment on lines 250 to +259
        finally:
            self._on_complete(request_id)
            self._release_server(server_id)

    def _on_dispatch(self, request_id, inner_request_id, server):
        """Hook fired after the inner vLLM request_id is assigned, before awaiting generation.
        Default no-op; subclasses may record (request_id/inner_request_id, server) to enable abort."""

    def _on_complete(self, request_id):
        """Hook fired when generation finishes or raises. Default no-op."""

Contributor

critical

If the client-side generate task is cancelled (e.g., due to a timeout via asyncio.timeout or asyncio.wait_for), the finally block runs immediately and removes the request from _inflight via _on_complete. However, the server-side request is never aborted, leading to silent GPU resource leaks on the vLLM server.

To fix this, we should catch asyncio.CancelledError in generate, trigger a new _on_cancel hook to abort the request on the server, and then re-raise the cancellation.
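The fix relies on asyncio's ordering: when asyncio.wait_for times out, it cancels the inner coroutine, which sees asyncio.CancelledError; an except clause for it runs before the finally block, so a cancel hook still finds the in-flight entry that _on_complete later removes. A small plain-asyncio sketch that records that order; generate here is a hypothetical stand-in, not verl's method.

```python
import asyncio


async def generate(request_id, events):
    """Hypothetical stand-in for LLMServerClient.generate(); records hook order."""
    events.append(("dispatch", request_id))
    try:
        await asyncio.sleep(10)  # a long-running server-side generation
    except asyncio.CancelledError:
        # Runs first on cancellation, while the in-flight entry still exists.
        events.append(("on_cancel", request_id))
        raise  # re-raise so the caller still observes the cancellation
    finally:
        # Runs afterwards on success, error, or cancellation.
        events.append(("on_complete", request_id))


async def main():
    events = []
    try:
        await asyncio.wait_for(generate("req-1", events), timeout=0.01)
    except asyncio.TimeoutError:
        events.append(("timeout", "req-1"))
    return events


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Re-raising matters: swallowing the CancelledError would make the timed-out call look like it returned normally.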

Suggested change
+        except asyncio.CancelledError:
+            self._on_cancel(request_id)
+            raise
         finally:
             self._on_complete(request_id)
             self._release_server(server_id)

     def _on_dispatch(self, request_id, inner_request_id, server):
         """Hook fired after the inner vLLM request_id is assigned, before awaiting generation.
         Default no-op; subclasses may record (request_id/inner_request_id, server) to enable abort."""

     def _on_complete(self, request_id):
         """Hook fired when generation finishes or raises. Default no-op."""
+
+    def _on_cancel(self, request_id):
+        """Hook fired when generation is cancelled. Default no-op."""

cr-gao Jun 27, 2026

Author

Good catch. asyncio.wait_for/timeout cancellation indeed leaked the server-side request, and that's exactly one of the motivating use cases. Added an _on_cancel(request_id) hook fired on CancelledError (before the finally so the in-flight entry still exists), implemented as a fire-and-forget abort in AbortableLLMServerClient (we can't safely await during cancellation), with abort() and _on_cancel sharing a _send_abort helper. Also added a regression test for the cancellation path. Done in aeadf2f.

Comment on lines +386 to +388
    def _on_complete(self, request_id):
        # Sole owner of removal: fires from generate()'s finally on success, abort, or error.
        self._inflight.pop(request_id, None)

Contributor

critical

Implement the _on_cancel hook in AbortableLLMServerClient to automatically abort the request on the server when the client-side task is cancelled. This prevents GPU resource leaks on the vLLM server during timeouts or cancellations.

    def _on_complete(self, request_id):
        # Sole owner of removal: fires from generate()'s finally on success, abort, or error.
        self._inflight.pop(request_id, None)

    def _on_cancel(self, request_id):
        entry = self._inflight.get(request_id)
        if entry is not None:
            inner_request_id, server = entry
            server.abort_request.remote(inner_request_id)

Author

Implemented _on_cancel in AbortableLLMServerClient as fire-and-forget (_send_abort, no await during cancellation); removal stays owned by _on_complete. Done in aeadf2f.
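With Ray, an actor call such as server.abort_request.remote(...) returns an ObjectRef immediately, so awaiting it is optional. The same await-versus-fire-and-forget split can be sketched in plain asyncio, assuming abort_request is a coroutine on a hypothetical FakeServer. AbortSender is illustrative, not the PR's class, and removal from _inflight (owned by _on_complete) is omitted. The background-task set is there because the asyncio event loop only keeps weak references to tasks, so a fire-and-forget task needs a strong reference to avoid being collected mid-execution.

```python
import asyncio


class FakeServer:
    """Hypothetical stand-in for a server handle; records abort calls."""

    def __init__(self):
        self.aborted = []

    async def abort_request(self, inner_request_id):
        await asyncio.sleep(0)  # simulate an RPC round trip
        self.aborted.append(inner_request_id)


class AbortSender:
    """Sketch: abort() and _on_cancel() share one _send_abort helper."""

    def __init__(self):
        self._inflight = {}  # request_id -> (inner_request_id, server)
        self._background = set()  # strong refs so fire-and-forget tasks are not GC'd

    def _send_abort(self, request_id):
        entry = self._inflight.get(request_id)
        if entry is None:
            return None  # unknown or finished request: nothing to abort
        inner_request_id, server = entry
        return asyncio.create_task(server.abort_request(inner_request_id))

    async def abort(self, request_id):
        task = self._send_abort(request_id)
        if task is not None:
            await task  # explicit abort: wait until the server has handled it

    def _on_cancel(self, request_id):
        task = self._send_abort(request_id)
        if task is not None:
            # Fire-and-forget: schedule the abort without awaiting it.
            self._background.add(task)
            task.add_done_callback(self._background.discard)


async def demo():
    server = FakeServer()
    sender = AbortSender()
    sender._inflight["req-1"] = ("inner-1", server)
    sender._inflight["req-2"] = ("inner-2", server)
    await sender.abort("req-1")  # completed when this returns
    sender._on_cancel("req-2")  # only scheduled at this point
    pending = len(sender._background)
    await asyncio.sleep(0.01)  # give the background task time to run
    return list(server.aborted), pending, len(sender._background)


if __name__ == "__main__":
    print(asyncio.run(demo()))  # (['inner-1', 'inner-2'], 1, 0)
```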

Add an _on_cancel hook fired on asyncio.CancelledError so AbortableLLMServerClient
fire-and-forget aborts the server-side request when generate() is cancelled (e.g.
asyncio.wait_for/timeout), preventing silent GPU resource leaks. Add a regression
test for the cancellation path.

Addresses gemini-code-assist review feedback on verl-project#6865.

Co-authored-by: Claude
Signed-off-by: cr-gao <gaochenrui@sjtu.edu.cn>

Development

Successfully merging this pull request may close these issues.

[Feature request] Per-request abort for rollout LLMServerClient
