fix(rollout): abort vLLM rollout via delete-type /abort_requests#296
fix(rollout): abort vLLM rollout via delete-type /abort_requests#296aoshen02 wants to merge 1 commit into
Conversation
Under --partial-rollout, abort() (pause -> drain -> resume) could deadlock: /pause?mode=abort puts the scheduler in PAUSED_NEW, and a /generate that races in after the pause parks in the waiting queue and never returns until /resume -- which ran after the drain. PR vllm-project#294 reordered to pause -> resume -> drain to avoid the hang, but resume reopens the whole queue so the long tail runs to COMPLETION, which breaks partial rollout's "truncate the tail, resume it next step" semantics. Switch to a delete-type abort instead: - vLLM: add a POST /abort_requests control endpoint to the RLHF api_router that calls EngineClient.abort() (removes queued requests from the waiting queue and finish-aborts running ones, whose partial output returns on the original /generate stream). It does not pause the scheduler, so there is no /resume and no deadlock window. Shipped as a build-time patch in docker/patch/latest/vllm.patch. - vime: add vllm_utils/server_control.abort_inflight_requests() (the vLLM counterpart of slime's sglang_utils/server_control) and call it from abort(). The sweep is re-issued on each drain wave (state.pendings is the idle signal; a timeout bounds how long a late multi-turn straggler can run) so the tail is truncated to partial rather than completed. Supersedes vllm-project#294: keeps its partial-sample collection test and replaces the pause/resume ordering test with one asserting the delete-type behavior. AI-assisted change; reviewed by a human before submission. Co-authored-by: aoshen <aoshen@inferact.ai> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
There was a problem hiding this comment.
Code Review
This pull request refactors the abort mechanism for the vLLM rollout backend to drop in-flight requests directly via a new /abort_requests endpoint on the workers, eliminating the need to pause and resume the scheduler. The changes include adding the endpoint to the vLLM patch, implementing the control-plane helper, updating the rollout logic to periodically re-sweep during draining, and adding corresponding unit tests. Feedback on the changes highlights critical robustness issues in the vLLM patch's endpoint, specifically regarding JSON type safety when parsing request bodies, potential string iteration bugs if a single string is passed as request_ids, and import safety when catching JSON decoding errors.
| + try: | ||
| + body = await raw_request.json() | ||
| + except json.JSONDecodeError: | ||
| + body = {} | ||
| + | ||
| + request_ids = body.get("request_ids") | ||
| + if not request_ids: | ||
| + request_ids = list(engine.output_processor.request_states.keys()) |
There was a problem hiding this comment.
There are a few robustness and correctness issues in this section of the patch:
- JSON Type Safety:
raw_request.json()can return any valid JSON type (such as a list, string, number, or boolean). If a client sends a non-dictionary JSON payload, callingbody.get(...)will raise anAttributeError: 'list' object has no attribute 'get'. We should checkisinstance(body, dict)to prevent this. - String Iteration Bug: If
request_idsis passed as a single string (e.g.,{"request_ids": "req_1"}), iterating over it or passing it directly toengine.abort()might treat it as an iterable of characters (['r', 'e', 'q', '_', '1']), attempting to abort non-existent single-character request IDs. Wrapping a single string in a list prevents this. - Import Safety: Catching
ValueErrorinstead ofjson.JSONDecodeErroris safer becausejson.JSONDecodeErrorinherits fromValueError, and this avoids any potentialNameErrorifjsonis not imported inapi_router.py.
+ try:
+ body = await raw_request.json()
+ if not isinstance(body, dict):
+ body = {}
+ except ValueError:
+ body = {}
+
+ request_ids = body.get("request_ids")
+ if isinstance(request_ids, str):
+ request_ids = [request_ids]
+ elif not request_ids:
+ request_ids = list(engine.output_processor.request_states.keys())
What
Replace the pause/resume-based rollout abort with a delete-type abort, so that under
--partial-rolloutthe long tail is truncated to partial (and resumable next step) instead of either deadlocking or running to completion.Why — and why this is not a duplicate of #294
abort()previously didpause?mode=abort -> drain -> resume. Under--partial-rollout, a/generatethat races in after the pause parks in vLLM'sPAUSED_NEWwaiting queue and never returns until/resume, which ran after the drain → deadlock.#294 reordered to
pause -> resume -> drain, which fixes the hang. Butresumereopens the whole waiting queue, so the raced-in long-tail requests run to completion and return full samples — defeating partial rollout's "truncate the tail, continue it next step" purpose.This PR supersedes #294 with a delete-type abort that truncates the tail correctly and has no pause/resume (so no deadlock by construction). It keeps #294's partial-sample-collection test and replaces its pause/resume-ordering test. Credit to @Josephasafg (PR #294 author), retained as the primary commit author.
How
docker/patch/latest/vllm.patch): addPOST /abort_requeststo the RLHF control router →EngineClient.abort(), which removes queued requests from the waiting queue and finish-aborts running ones (their partial output returns on the original/generatestream). It does not pause the scheduler, so there is no/resumeand no deadlock window. Reuses the engine primitive already used on client-disconnect and mirrors the existing/abort_requestsin the disagg router.vllm_utils/server_control.abort_inflight_requests()(the vLLM counterpart of slime'ssglang_utils/server_control);abort()re-issues the sweep across drain waves and converges onstate.pendings(vLLM has no/v1/loadsto poll), with a small timeout bounding how long a late multi-turn straggler can run before being truncated.Tests
tests/test_vllm_rollout.py:test_abort_deletes_inflight_without_pause_resume— asserts only/abort_requestsis posted (never/pauseor/resume) andstate.pendingsis drained.test_abort_collects_partial_samples_when_partial_rollout— kept from fix(rollout): Resume vLLM workers before draining in abort #294.python -m py_compileon all changed files;git apply --checkof the appendedvllm.patchhunk against a pristine vLLM tree (applies cleanly).AI assistance
This change was developed with AI assistance (Claude) and reviewed by a human before submission.