fix(rollout): abort vLLM rollout via delete-type /abort_requests by aoshen02 · Pull Request #296 · vllm-project/vime

aoshen02 · 2026-06-28T15:05:52Z

What

Replace the pause/resume-based rollout abort with a delete-type abort, so that under --partial-rollout the long tail is truncated to partial (and resumable next step) instead of either deadlocking or running to completion.

Why — and why this is not a duplicate of #294

abort() previously did pause?mode=abort -> drain -> resume. Under --partial-rollout, a /generate that races in after the pause parks in vLLM's PAUSED_NEW waiting queue and never returns until /resume, which ran after the drain → deadlock.

#294 reordered to pause -> resume -> drain, which fixes the hang. But resume reopens the whole waiting queue, so the raced-in long-tail requests run to completion and return full samples — defeating partial rollout's "truncate the tail, continue it next step" purpose.

This PR supersedes #294 with a delete-type abort that truncates the tail correctly and has no pause/resume (so no deadlock by construction). It keeps #294's partial-sample-collection test and replaces its pause/resume-ordering test. Credit to @Josephasafg (PR #294 author), retained as the primary commit author.

How

vLLM (docker/patch/latest/vllm.patch): add POST /abort_requests to the RLHF control router → EngineClient.abort(), which removes queued requests from the waiting queue and finish-aborts running ones (their partial output returns on the original /generate stream). It does not pause the scheduler, so there is no /resume and no deadlock window. Reuses the engine primitive already used on client-disconnect and mirrors the existing /abort_requests in the disagg router.
vime: vllm_utils/server_control.abort_inflight_requests() (the vLLM counterpart of slime's sglang_utils/server_control); abort() re-issues the sweep across drain waves and converges on state.pendings (vLLM has no /v1/loads to poll), with a small timeout bounding how long a late multi-turn straggler can run before being truncated.

Tests

tests/test_vllm_rollout.py:
- test_abort_deletes_inflight_without_pause_resume — asserts only /abort_requests is posted (never /pause or /resume) and state.pendings is drained.
- test_abort_collects_partial_samples_when_partial_rollout — kept from fix(rollout): Resume vLLM workers before draining in abort #294.
Verified locally: python -m py_compile on all changed files; git apply --check of the appended vllm.patch hunk against a pristine vLLM tree (applies cleanly).
Full unit suite + image build run in CI (no GPU/build env available locally).

AI assistance

This change was developed with AI assistance (Claude) and reviewed by a human before submission.

Under --partial-rollout, abort() (pause -> drain -> resume) could deadlock: /pause?mode=abort puts the scheduler in PAUSED_NEW, and a /generate that races in after the pause parks in the waiting queue and never returns until /resume -- which ran after the drain. PR vllm-project#294 reordered to pause -> resume -> drain to avoid the hang, but resume reopens the whole queue so the long tail runs to COMPLETION, which breaks partial rollout's "truncate the tail, resume it next step" semantics. Switch to a delete-type abort instead: - vLLM: add a POST /abort_requests control endpoint to the RLHF api_router that calls EngineClient.abort() (removes queued requests from the waiting queue and finish-aborts running ones, whose partial output returns on the original /generate stream). It does not pause the scheduler, so there is no /resume and no deadlock window. Shipped as a build-time patch in docker/patch/latest/vllm.patch. - vime: add vllm_utils/server_control.abort_inflight_requests() (the vLLM counterpart of slime's sglang_utils/server_control) and call it from abort(). The sweep is re-issued on each drain wave (state.pendings is the idle signal; a timeout bounds how long a late multi-turn straggler can run) so the tail is truncated to partial rather than completed. Supersedes vllm-project#294: keeps its partial-sample collection test and replaces the pause/resume ordering test with one asserting the delete-type behavior. AI-assisted change; reviewed by a human before submission. Co-authored-by: aoshen <aoshen@inferact.ai> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>

read-the-docs-community · 2026-06-28T15:06:53Z

Documentation build overview

📚 vime | 🛠️ Build #33344145 | 📁 Comparing 817a336 against latest (429b3d2)

🔍 Preview build

34 files changed · ± 34 modified

± Modified

gemini-code-assist

Code Review

This pull request refactors the abort mechanism for the vLLM rollout backend to drop in-flight requests directly via a new /abort_requests endpoint on the workers, eliminating the need to pause and resume the scheduler. The changes include adding the endpoint to the vLLM patch, implementing the control-plane helper, updating the rollout logic to periodically re-sweep during draining, and adding corresponding unit tests. Feedback on the changes highlights critical robustness issues in the vLLM patch's endpoint, specifically regarding JSON type safety when parsing request bodies, potential string iteration bugs if a single string is passed as request_ids, and import safety when catching JSON decoding errors.

gemini-code-assist · 2026-06-28T15:07:19Z

+    try:
+        body = await raw_request.json()
+    except json.JSONDecodeError:
+        body = {}
+
+    request_ids = body.get("request_ids")
+    if not request_ids:
+        request_ids = list(engine.output_processor.request_states.keys())


There are a few robustness and correctness issues in this section of the patch:

JSON Type Safety: raw_request.json() can return any valid JSON type (such as a list, string, number, or boolean). If a client sends a non-dictionary JSON payload, calling body.get(...) will raise an AttributeError: 'list' object has no attribute 'get'. We should check isinstance(body, dict) to prevent this.

String Iteration Bug: If request_ids is passed as a single string (e.g., {"request_ids": "req_1"}), iterating over it or passing it directly to engine.abort() might treat it as an iterable of characters (['r', 'e', 'q', '_', '1']), attempting to abort non-existent single-character request IDs. Wrapping a single string in a list prevents this.

Import Safety: Catching ValueError instead of json.JSONDecodeError is safer because json.JSONDecodeError inherits from ValueError, and this avoids any potential NameError if json is not imported in api_router.py.

+ try: + body = await raw_request.json() + if not isinstance(body, dict): + body = {} + except ValueError: + body = {} + + request_ids = body.get("request_ids") + if isinstance(request_ids, str): + request_ids = [request_ids] + elif not request_ids: + request_ids = list(engine.output_processor.request_states.keys())

gemini-code-assist Bot reviewed Jun 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(rollout): abort vLLM rollout via delete-type /abort_requests#296

fix(rollout): abort vLLM rollout via delete-type /abort_requests#296
aoshen02 wants to merge 1 commit into
vllm-project:mainfrom
aoshen02:feat/vllm-delete-abort

aoshen02 commented Jun 28, 2026

Uh oh!

read-the-docs-community Bot commented Jun 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

aoshen02 commented Jun 28, 2026

What

Why — and why this is not a duplicate of #294

How

Tests

AI assistance

Uh oh!

read-the-docs-community Bot commented Jun 28, 2026

Documentation build overview

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants