Skip to content

fix(rollout): abort vLLM rollout via delete-type /abort_requests#296

Open
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:feat/vllm-delete-abort
Open

fix(rollout): abort vLLM rollout via delete-type /abort_requests#296
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:feat/vllm-delete-abort

Conversation

@aoshen02

Copy link
Copy Markdown
Collaborator

What

Replace the pause/resume-based rollout abort with a delete-type abort, so that under --partial-rollout the long tail is truncated to partial (and resumable next step) instead of either deadlocking or running to completion.

Why — and why this is not a duplicate of #294

abort() previously did pause?mode=abort -> drain -> resume. Under --partial-rollout, a /generate that races in after the pause parks in vLLM's PAUSED_NEW waiting queue and never returns until /resume, which ran after the drain → deadlock.

#294 reordered to pause -> resume -> drain, which fixes the hang. But resume reopens the whole waiting queue, so the raced-in long-tail requests run to completion and return full samples — defeating partial rollout's "truncate the tail, continue it next step" purpose.

This PR supersedes #294 with a delete-type abort that truncates the tail correctly and has no pause/resume (so no deadlock by construction). It keeps #294's partial-sample-collection test and replaces its pause/resume-ordering test. Credit to @Josephasafg (PR #294 author), retained as the primary commit author.

How

  • vLLM (docker/patch/latest/vllm.patch): add POST /abort_requests to the RLHF control router → EngineClient.abort(), which removes queued requests from the waiting queue and finish-aborts running ones (their partial output returns on the original /generate stream). It does not pause the scheduler, so there is no /resume and no deadlock window. Reuses the engine primitive already used on client-disconnect and mirrors the existing /abort_requests in the disagg router.
  • vime: vllm_utils/server_control.abort_inflight_requests() (the vLLM counterpart of slime's sglang_utils/server_control); abort() re-issues the sweep across drain waves and converges on state.pendings (vLLM has no /v1/loads to poll), with a small timeout bounding how long a late multi-turn straggler can run before being truncated.

Tests

  • tests/test_vllm_rollout.py:
  • Verified locally: python -m py_compile on all changed files; git apply --check of the appended vllm.patch hunk against a pristine vLLM tree (applies cleanly).
  • Full unit suite + image build run in CI (no GPU/build env available locally).

AI assistance

This change was developed with AI assistance (Claude) and reviewed by a human before submission.

Under --partial-rollout, abort() (pause -> drain -> resume) could deadlock:
/pause?mode=abort puts the scheduler in PAUSED_NEW, and a /generate that races
in after the pause parks in the waiting queue and never returns until /resume --
which ran after the drain. PR vllm-project#294 reordered to pause -> resume -> drain to avoid
the hang, but resume reopens the whole queue so the long tail runs to
COMPLETION, which breaks partial rollout's "truncate the tail, resume it next
step" semantics.

Switch to a delete-type abort instead:

- vLLM: add a POST /abort_requests control endpoint to the RLHF api_router that
  calls EngineClient.abort() (removes queued requests from the waiting queue and
  finish-aborts running ones, whose partial output returns on the original
  /generate stream). It does not pause the scheduler, so there is no /resume and
  no deadlock window. Shipped as a build-time patch in
  docker/patch/latest/vllm.patch.

- vime: add vllm_utils/server_control.abort_inflight_requests() (the vLLM
  counterpart of slime's sglang_utils/server_control) and call it from abort().
  The sweep is re-issued on each drain wave (state.pendings is the idle signal;
  a timeout bounds how long a late multi-turn straggler can run) so the tail is
  truncated to partial rather than completed.

Supersedes vllm-project#294: keeps its partial-sample collection test and replaces the
pause/resume ordering test with one asserting the delete-type behavior.

AI-assisted change; reviewed by a human before submission.

Co-authored-by: aoshen <aoshen@inferact.ai>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
@read-the-docs-community

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the abort mechanism for the vLLM rollout backend to drop in-flight requests directly via a new /abort_requests endpoint on the workers, eliminating the need to pause and resume the scheduler. The changes include adding the endpoint to the vLLM patch, implementing the control-plane helper, updating the rollout logic to periodically re-sweep during draining, and adding corresponding unit tests. Feedback on the changes highlights critical robustness issues in the vLLM patch's endpoint, specifically regarding JSON type safety when parsing request bodies, potential string iteration bugs if a single string is passed as request_ids, and import safety when catching JSON decoding errors.

Comment on lines +40 to +47
+ try:
+ body = await raw_request.json()
+ except json.JSONDecodeError:
+ body = {}
+
+ request_ids = body.get("request_ids")
+ if not request_ids:
+ request_ids = list(engine.output_processor.request_states.keys())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are a few robustness and correctness issues in this section of the patch:

  1. JSON Type Safety: raw_request.json() can return any valid JSON type (such as a list, string, number, or boolean). If a client sends a non-dictionary JSON payload, calling body.get(...) will raise an AttributeError: 'list' object has no attribute 'get'. We should check isinstance(body, dict) to prevent this.
  2. String Iteration Bug: If request_ids is passed as a single string (e.g., {"request_ids": "req_1"}), iterating over it or passing it directly to engine.abort() might treat it as an iterable of characters (['r', 'e', 'q', '_', '1']), attempting to abort non-existent single-character request IDs. Wrapping a single string in a list prevents this.
  3. Import Safety: Catching ValueError instead of json.JSONDecodeError is safer because json.JSONDecodeError inherits from ValueError, and this avoids any potential NameError if json is not imported in api_router.py.
+    try:
+        body = await raw_request.json()
+        if not isinstance(body, dict):
+            body = {}
+    except ValueError:
+        body = {}
+
+    request_ids = body.get("request_ids")
+    if isinstance(request_ids, str):
+        request_ids = [request_ids]
+    elif not request_ids:
+        request_ids = list(engine.output_processor.request_states.keys())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants