Skip to content

sync(slime #2014..#2125): 3-way merge [WIP — 45 conflict files need review]#286

Open
aoshen02 wants to merge 30 commits into
mainfrom
sync/slime-2118
Open

sync(slime #2014..#2125): 3-way merge [WIP — 45 conflict files need review]#286
aoshen02 wants to merge 30 commits into
mainfrom
sync/slime-2118

Conversation

@aoshen02

@aoshen02 aoshen02 commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

slime → vime sync: window #2014..#2125

Syncs all 48 PRs merged in slime after cutoff 44d29ee5 (slime #2013) up to slime #2125.
Follows SOP: vllm-agent-infra/knowledge/rl/slime-to-vime-sync-sop.md

Scope: full mirror, 153 changed files. Window extended #2118→#2125 to fold in the
coding-agent harness follow-ups (#2124/#2125) to #2005 + non-float reward fix (#2121).


Structure: one PR, two commits

Commit 1 (current): 3-way merge, conflict markers preserved

  • ours = vime@main · base = translate(slime@cutoff #2013) · theirs = translate(slime@tip #2125)
  • 41 clean merges · 45 files with <<<<<<< conflict markers · 62 new-to-vime · 5 deletions

Commit 2 (pending): resolve all conflict blocks — the only commit that needs review.


Status: DRAFT — waiting on conflict resolution

Conflict markers are intentionally preserved as the review queue. The agent maintains
agent_run/reports/QUESTIONS.md (43 questions) + an interactive decision tracker.

Decisions needed (3 genuine + 6 confirm):

Tier Items
Decide README new chapters (Q26) · coding_agent_rl env-var + harness (Q37) · tau-bench user sim (Q43)
Confirm Dockerfile base (keep ours) · vllm.patch (keep ours 22L) · top_p.patch (delete) · GHA workflows (delete) · review skill (delete) · version.txt (accept)
Settled agent adapters = slime #2005 TrajectoryManager refactor (accept + fix vLLM endpoints) · vllm.srt.constants not in vLLM (delete imports) · flag/API diffs keep ours

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request migrates the codebase from the vime framework (using vLLM) to the slime framework (using SGLang as the default rollout backend), updating all documentation, Docker configurations, example scripts, and tests accordingly. It also introduces advanced features such as delta weight sync, external rollout engines, and new model recipes (e.g., GLM-5.2 744B-A40B). The review feedback highlights several critical bugs and improvement opportunities across the codebase, including a falsy check in the placement group layout calculation when rollout_num_gpus is zero, a JSON parsing issue in the retool example caused by unsafe newline replacement, a Dockerfile anti-pattern of split apt update and apt install commands, a logic error in the retool reward calculation that could penalize correct tool usage, a missing abort status handler in the multi-agent system, a potential TypeError when accessing non-dictionary labels in the search-r1 example, and potential data loss and status finalization issues in the multi-turn VLM rollout.

pytest.param({"colocate": True, "rollout_num_gpus": 8}, (16, 0), id="colocate_rollout_less_than_actor"),
pytest.param({"colocate": True, "rollout_num_gpus": 16}, (16, 0), id="colocate_rollout_equals_actor"),
pytest.param({"colocate": True, "rollout_num_gpus": 32}, (32, 0), id="colocate_rollout_more_than_actor"),
pytest.param({"rollout_num_gpus": 0}, (16, 16), id="zero_rollout_gpus"),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

When rollout_num_gpus is explicitly set to 0, the placement group should not allocate any GPUs for rollout (i.e., the expected layout should be (16, 0)). The current expectation of (16, 16) suggests a bug in _get_placement_group_layout where args.rollout_num_gpus is checked using a falsy check (e.g., args.rollout_num_gpus or default) instead of an explicit is None check, causing 0 to incorrectly fall back to the default value.

Suggested change
pytest.param({"rollout_num_gpus": 0}, (16, 16), id="zero_rollout_gpus"),
pytest.param({"rollout_num_gpus": 0}, (16, 0), id="zero_rollout_gpus"),

Comment thread examples/retool/generate_with_retool.py Outdated
Comment on lines +117 to +119
# Replace newlines in string values with \n
json_str = json_str.replace("\n", "\\n")
tool_call_data = json.loads(json_str)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Replacing all newlines with \\n via json_str.replace("\n", "\\n") is highly problematic because it replaces formatting newlines outside of string literals (e.g., between keys and values in pretty-printed JSON), which makes the JSON invalid and causes json.loads to fail. To safely allow unescaped newlines and other control characters inside string values, use json.loads(json_str, strict=False) instead.

Suggested change
# Replace newlines in string values with \n
json_str = json_str.replace("\n", "\\n")
tool_call_data = json.loads(json_str)
tool_call_data = json.loads(json_str, strict=False)

Comment thread docker/Dockerfile Outdated
Comment on lines +18 to +19
RUN apt update
RUN apt install -y nvtop rsync dnsutils prometheus

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Running apt update and apt install in separate RUN instructions is an anti-pattern in Dockerfiles because of how layer caching works. If the apt update layer is cached, a subsequent build may not fetch the latest package lists, potentially causing the apt install step to fail or install outdated packages. They should be combined into a single RUN instruction, and the package lists should be cleaned up afterwards to keep the image size small.

RUN apt-get update && apt-get install -y \
    nvtop \
    rsync \
    dnsutils \
    prometheus \
 && rm -rf /var/lib/apt/lists/*

Comment thread examples/retool/generate_with_retool.py Outdated
Comment on lines +425 to +427
if result["score"] < 0:
tool_call_reward = (num_turns - 2) / 2 * 0.1
result["score"] = min(-0.6, result["score"] + tool_call_reward)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a logic error in the reward calculation when tool_call_reward is positive but the original result["score"] is already better than -0.6 (e.g., -0.5). In this case, min(-0.6, result["score"] + tool_call_reward) will actually reduce the score to -0.6, penalizing the model for calling tools. The reward should only be capped at -0.6 if it is an improvement, or we should ensure we don't decrease a score that is already better than -0.6.

Suggested change
if result["score"] < 0:
tool_call_reward = (num_turns - 2) / 2 * 0.1
result["score"] = min(-0.6, result["score"] + tool_call_reward)
if result["score"] < 0:
tool_call_reward = (num_turns - 2) / 2 * 0.1
if tool_call_reward > 0:
result["score"] = min(-0.6, result["score"] + tool_call_reward)
else:
result["score"] = result["score"] + tool_call_reward

Comment thread examples/multi_agent/agent_system.py Outdated
Comment on lines +65 to +66
# case "abort":
# sample.status = Sample.Status.ABORTED

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The case "abort" is currently commented out. If the inference engine finishes with an "abort" reason, the sample's status will not be updated to ABORTED, which could cause the training pipeline to process incomplete or empty trajectories. This case should be uncommented.

Suggested change
# case "abort":
# sample.status = Sample.Status.ABORTED
case "abort":
sample.status = Sample.Status.ABORTED


score = compute_score_em(
solution_str=sample.prompt + sample.response,
ground_truth=sample.label["ground_truth"],

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Accessing sample.label["ground_truth"] directly assumes that sample.label is always a dictionary. However, in many datasets (and as seen in other rollout files), sample.label can be a raw string. To prevent a TypeError at runtime, use a defensive check to handle both dictionary and string labels.

Suggested change
ground_truth=sample.label["ground_truth"],
ground_truth=sample.label.get("ground_truth", sample.label) if isinstance(sample.label, dict) else sample.label,

Comment on lines +274 to +279
if (
key in sample.multimodal_inputs
and isinstance(sample.multimodal_inputs[key], list)
and isinstance(val, list)
):
sample.multimodal_inputs[key].extend(val)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In _update_multimodal_state, if a key in obs_multimodal_inputs is not already present in sample.multimodal_inputs, it is silently ignored. This can lead to losing new multimodal input keys (such as "videos" or other custom keys) introduced in subsequent turns. We should initialize the key in sample.multimodal_inputs if it is missing.

Suggested change
if (
key in sample.multimodal_inputs
and isinstance(sample.multimodal_inputs[key], list)
and isinstance(val, list)
):
sample.multimodal_inputs[key].extend(val)
if (
key in sample.multimodal_inputs
and isinstance(sample.multimodal_inputs[key], list)
and isinstance(val, list)
):
sample.multimodal_inputs[key].extend(val)
elif key not in sample.multimodal_inputs:
sample.multimodal_inputs[key] = list(val) if isinstance(val, list) else val

sample.multimodal_train_inputs = _merge_multimodal_train_inputs(multimodal_train_inputs_buffer)
sample.response = tokenizer.decode(response_tokens, skip_special_tokens=False)
sample.response_length = len(response_tokens)
if sample.status is None:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Checking if sample.status is None: to finalize the sample status to COMPLETED might miss cases where sample.status is initialized to Sample.Status.PENDING (which is the default for a new Sample). To be safe and ensure the sample is correctly marked as completed, check for both None and PENDING.

Suggested change
if sample.status is None:
if sample.status in (None, Sample.Status.PENDING):

@aoshen02 aoshen02 changed the title sync(slime #2014..#2118): full-mirror sync [WIP — mechanical phase, translation pending] sync(slime #2014..#2125): 3-way merge [WIP — 45 conflict files need review] Jun 25, 2026
aoshen02 added a commit that referenced this pull request Jun 25, 2026
Rebased onto upstream/main 7198547 (now includes #260 scripts port, #280
tau-bench agent_strategy, #283 docker image rename, #257 data fix).

Effect vs prior base: the 12 script files #260 ported moved from new-to-vime
to in-scope; 11 auto-merged clean, only scripts/run-deepseek-r1.sh newly
conflicts (theirs adds /sgl-workspace dead path -> resolve by dropping).

3-way merge results on new base:
  52 clean / 46 conflict (110 blocks) / 50 new-to-vime / 5 del

Resolves PR #286 conflict-with-main. Conflict markers preserved for commit 2.
SOP: knowledge/rl/slime-to-vime-sync-sop.md
@aoshen02 aoshen02 force-pushed the sync/slime-2118 branch 2 times, most recently from f06efc2 to e544a01 Compare June 25, 2026 15:35
aoshen02 and others added 10 commits June 26, 2026 09:12
Mechanical commit 1 of 2 (per knowledge/rl/slime-to-vime-sync-sop.md §2).
diff3 translated 3-way merge on upstream/main (incl #260/#280/#283/#257):
  ours = vime@main, base = translate(slime@#2013), theirs = translate(slime@#2125)

Translation fixes vs prior attempt:
  - casing: SGLang->vLLM (prose) / SGLang<X>->VLLM<X> (identifiers); killed VLlm artifact (was 35 files)
  - dotted module refs slime.X->vime.X now translated (was leaking 'from slime.backends')
These resolved 9 spurious conflicts (46->37 files).

Results: 61 clean / 37 conflict (diff3 markers preserved) / 50 new-to-vime / 5 del.
Conflict markers use readable -L labels (ours/base/theirs). Resolve in commit 2.

Non-conflict provenance fix: vimerl/vime -> vllm/vime in 2 example docs.
Engine patch handling (docker/patch/) deferred to commit 2 per SOP §4.5.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Resolved all 37 conflict files / 84 diff3 blocks per agent_run RESOLUTION_POLICY.
Principle: keep vime vLLM impl (ours) + incorporate slime's new features (theirs).

Highlights:
- vLLM API form kept everywhere: /inference/v1/generate, choices parsing,
  AsyncEngineArgs, vLLM flag names (--vllm-gpu-memory-utilization etc).
- Dropped all vllm.srt.* imports (non-existent in real vLLM).
- Accepted new slime features: delta-weight-sync CLI args, append_response_tokens
  (Sample), get_server_info/start_external_rollout_servers/get_rollout_num_engines,
  old-router(<=0.2.1) compat, TrajectoryManager adapter design (vime already adopted it).
- Kept vime-only: --rollout-external, add_router_arguments, _get_metrics_router_addr,
  reinit_wandb_primary_with_open_metrics, update_tracking_open_metrics, modal sandbox,
  VIME_AGENT_* env names, local-vLLM tau-bench user sim.
- Engine patches (docker/patch/): kept ours vllm.patch (22-line MoE fix), dropped
  theirs sglang 2674-line content; deleted sglang-only vllm-top_p.patch (per SOP 4.5).
- Dockerfile kept ours (vllm/vllm-openai base); version.txt accepted theirs nightly.
- run-deepseek-r1.sh: dropped /sgl-workspace dead-path env.

Deviations from policy (documented): vllm_rollout.py abort path kept ours
pause/drain (abort_servers_until_idle would break partial-rollout drain + leave
paused_workers unbound). README ecosystem section left empty (ours) pending
de-translation of provenance.

All changed .py py_compile clean; zero conflict markers; no sglang/slime leakage.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
- re-add dropped `from vllm_router.launch_router import RouterArgs` import in
  vime/utils/arguments.py (add_router_arguments uses it; F821 from conflict resolution)
- drop unused base_top_p_token_ids/offsets in vllm_streaming_rollout.py (F841; came
  from theirs but ours's choices-parsing path doesn't use them)
- black/isort autoformat (anthropic.py, test_agent/*, arguments.py)
- pipeline.yml: agent tests moved to tests/test_agent/; wire new CPU tests

pre-commit: all hooks pass (ruff/autoflake/isort/black/yaml).
Signed-off-by: aoshen02 <aoshen@inferact.ai>
…s tests)

Local CPU CI (pre-commit + plugin + agent + utils) all green on 8xH200 host
in python:3.11 containers. Fixes found while running it:

- vllm_engine.py: make 'import vllm_router'/'packaging.parse' lazy (inside
  _register_to_router); top-level import broke CPU import (vllm_router absent in
  CPU CI). The old-router(<=0.2.1) compat branch is theirs-accepted.
- docker/Dockerfile: TMS_CUDA_MAJOR=12 for torch_memory_saver pin (its build
  backend now requires it for CUDA wheels; base is cu129). Unblocks image build.
- agent/adapters/common.py: _run_turn called parse_model_output() without the
  required tokenizer= kwarg -> 500s in adapter tests. Pass tokenizer=tok.
- tests/test_agent/_fakes.py: FakeVLLMServer served sglang /generate + meta_info;
  retarget to vime /inference/v1/generate + choices shape + x-session-id header.
- tests/test_agent/test_adapters.py: parse_model_output(tokenizer=...) + assert
  vime body keys (token_ids/max_tokens).
- tests/utils/test_vllm_config.py: vLLMConfig->VllmConfig (4 sites); fake router
  returns 3-tuple (ip,port,prom) matching _start_router; drop spurious resolve().
- tests/test_megatron_argument_validation.py: add num_gpus_per_node=8 to the
  vime_validate_args fixture (vime colocate override needs it).
- .buildkite/pipeline.yml: agent tests -> tests/test_agent/*; +cispo_loss,
  +logprob_response_spans (CPU-safe); test_rollout_metrics stays GPU-only (imports vllm).

Engine patch verdict (PATCH_ASSESSMENT.md): P1-P7 vLLM doesn't need (NIXL/Mooncake
native); kept ours vllm.patch, dropped sglang content + top_p.patch.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
…ate)

Found running GPU CI on 8xH200 with vllm/vime:latest:
- docker/Dockerfile: pin scipy<1.14 next to numpy<2 (scipy drifted to 1.18 which
  needs numpy>=2 and uses removed np.long -> 'import vllm' crash).
- vllm_utils/arguments.py: drop sync-added RouterArgs.add_cli_args + its import in
  add_vllm_router_arguments. main exposes the full router surface only in
  utils.add_router_arguments; the duplicate re-registered --router-request-timeout-secs
  -> argparse conflict at train startup.
- megatron_utils/loss.py: get_rollout_top_p_logprob_kwargs falls back to full-vocab
  logprob when top-p nucleus token ids are absent instead of raising. slime's
  top-p-replay needs engine-returned top-p tokens; vime's vLLM /inference/v1/generate
  does not expose them (sglang-only). Matches vime pre-sync behavior; flagged in
  OVERNIGHT_REPORT for review.

Image import smoke + Megatron ckpt load + 4x VLLMEngine bringup + NCCL weight
transfer all confirmed working in-image before this.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
…ort suite

slime deleted tests/test_qwen2.5_0.5B_ppo_critic_only_short.py this window (#2014..#2125);
gpu_suites.py still listed it -> 'no such file' exit 2. Critic-only path is still
covered by test_qwen3_4B_ppo_train_critic_only (megatron suite). Other 3 short tests
(gsm8k_async, gsm8k, fully_async) pass on 8xH200.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
…t crash)

GPU test_qwen3_4B_streaming_partial_rollout hit
'TypeError: unsupported operand +: NoneType and list' at vllm_streaming_rollout.py:234.
The conflict resolution changed base_log_probs from main's
`list(sample.rollout_log_probs or [])` to a None-able form; a fresh sample
(rollout_log_probs=None) then did None + call_log_probs. Restored main's form.
Real sync-resolution regression caught by GPU CI.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
… update_from_meta_info)

slime #2110 renamed Sample.update_from_meta_info -> append_response_tokens; vllm_rollout
was updated but vllm_streaming_rollout still called the old name (AttributeError at
generate_streaming). Streaming already accumulates tokens incrementally for partial-rollout,
so call append_response_tokens(meta_info=meta) with tokens omitted -> metadata-only finalize
(no double-append). Caught by GPU test_qwen3_4B_streaming_partial_rollout.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
…rive

3-way compare (slime / vime-main / PR) showed the merge made a broken hybrid:
it KEPT vime's num_gpus_per_node colocate override (which assumes rollout_num_gpus
is forced to actor_num_gpus_per_node*actor_num_nodes) but REPLACED vime's
unconditional re-derive (`!= -> re-derive`) with slime's `is None`-only form.
When a colocate test's rollout_num_gpus is non-None but mismatches, it was left
mis-sized -> engine/GPU misplacement -> mixed_offload IPC-UUID mismatch + ckpt
'Free memory < util'. slime passes (no override, self-consistent is-None); vime
main passes (override + unconditional re-derive, coupled). Restore vime's
re-derive; keep slime's new rollout_num_gpus==0 branch (checked first).

Signed-off-by: aoshen02 <aoshen@inferact.ai>
…s, comments)

Docs (EN+zh parity):
- fault-tolerance: revert junk 'server'->'engine' mistranslation (keep correct /health endpoint, verified vs vllm_engine.py)
- vllm-config: 'ServerArgs'->'EngineArgs' (sglang class -> vLLM AsyncEngineArgs u FrontendArgs); fix duplicated 'vllm-router (vllm-router)' alias
- customization: restore over-deleted 'custom_generate -> list[Sample]' section + signature (dropped only the vime-absent search-r1 example link)

Rollout:
- routed_experts now flows through the slime-identical Sample._apply_meta_info (single assignment site, torch.int32 tensor matching downstream) instead of an inline numpy assign; vLLM .npy-on-choice decode stays (engine wire-format delta). Both vllm_rollout and vllm_streaming_rollout.

Comments for future syncers:
- --opd-teacher-model + on_policy_distillation: engine-driven divergence (vLLM model field; sglang /generate has none)
- overrides / _vllm_server_field_names: AsyncEngineArgs u FrontendArgs == slime's sglang ServerArgs

Examples/docker/etc: drop vime-absent npu/retool/search-r1/tau-bench files; restore eval_multi_task; docker alignment with slime.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
aoshen02 and others added 2 commits June 26, 2026 09:26
… + isort/black on test_agent

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
…-derive; parse_model_output tokenizer)

- test_..._preserves_larger_rollout_gpus_under_colocate asserted slime behavior (==12); vime re-derives to actor*nodes=8 under colocate (commit 9701304). Renamed + assert ==8 + divergence note. vime-main never had the test; slime does.

- test_parse_model_output_plain_text_no_parsers called parse_model_output without the required tokenizer kwarg (#198 made it required for vLLM parsers). Pass tokenizer=None (unused on the no-parser path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
@aoshen02 aoshen02 marked this pull request as ready for review June 26, 2026 12:32
aoshen02 and others added 4 commits June 26, 2026 12:52
…ang /generate

The slime diff3 merge took slime's sglang endpoint (/generate) for the streaming rollout POST instead of keeping vime's vLLM endpoint (/inference/v1/generate). main (7198547) had the correct URL; the sync regressed it (and dropped the base var). Result: 404 Not Found at vllm_streaming_rollout.py:182 -> test_qwen3_4B_streaming_partial_rollout fails. Caught on a clean h200 node. The file's own docstrings already say /inference/v1/generate throughout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
…ines

The diff3 merge adopted slime's deferred rollout-engine init (start_rollout_servers
now returns pending_init_handles awaited by the caller instead of ray.get-ing each
model's engines before the next). That broke an implicit invariant the per-model
port_cursors reset relied on: in main, model 0's engines were fully bound before
model 1 allocated ports, so the free-port bind-test in
_allocate_rollout_engine_addr_and_ports_normal skipped model 0's ports. With
deferred init, model 1 allocates while model 0 is unbound, the bind-test sees the
base ports free, and a second model (e.g. mixed_offload's frozen "ref") lands on
the same 15000-15003 as the actor. The actor's POST /update_weights to :15002 then
hits the never-started ref engine -> vLLM 500 "start_weight_update must be called
before update_weights" (test_vllm_config_mixed_offload[_ft]).

Fix: initialize port_cursors once before the model loop so the per-node next-free
cursor is monotonic across all models, keeping every engine's ports disjoint
regardless of bind timing. Single-model behaviour is unchanged; the per-model
reset only existed to scope cursors that are already node-keyed.

Caught on h200 GPU CI (new nightly-dev-20260618a image, which added the
start_weight_update-before-update_weights enforcement that exposed the collision).

Signed-off-by: aoshen02 <aoshen@inferact.ai>
…y actors

PR #260 ("complete slime-exact port") changed execute_train's pre-launch
cleanup from

    pkill -9 -f '[v]llm serve|VLL[M]::'

to

    pkill -9 vllm

as an over-literal sglang->vllm translation. But the vLLM rollout engine runs
as Ray actor processes whose process *name* is python/ray, with "VLLM::" only in
the command line — so `pkill -9 vllm` (name match, no -f) does not kill them.
Leftover engine processes from the ckpt test's save phase survive into the load
phase, holding ~115 GiB, so the load-phase engine starts with ~24/139 GiB free
and dies with "Free memory ... less than desired GPU memory utilization (0.8,
111.84 GiB)" (test_qwen3_4B_ckpt.py, both --async-save and not).

Restore the cmdline-match pattern `-f '[v]llm serve|VLL[M]::'`.

Bisected on h200: ckpt PASSES at 289ee6d / e62d44f (old pattern, 4/4 runs) and
FAILS at 7198547/main + PR (new pattern, 0/3), same old image -> code regression
in #260. Verified: pkill-fixed PR code + new pr286 image -> ckpt PASS (579s).

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Same root cause as 764e1e1 (command_utils.py): `pkill -9 vllm` matches by
process *name*, but vLLM rollout engines run as Ray actor processes (python/ray
named, "VLLM::" only in the command line), so the name match never kills them.
Apply the robust cmdline pattern `pkill -9 -f '[v]llm serve|VLL[M]::'` everywhere
the bare `pkill -9 vllm` cleanup was used across run/example scripts, so leftover
engines don't squat GPUs across runs. No logic change beyond the kill pattern.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
aoshen02 added 2 commits June 27, 2026 01:40
Commit d41f0aa removed the core.py sleep-guard hunk from
docker/patch/latest/vllm.patch on the assumption it was "already in
v0.23.0". It is not: stock v0.23.0 `vllm/v1/engine/core.py` calls
`resume_scheduler()` and `execute_dummy_batch()` even during a partial
(weights-only) wake. So a colocate DP+EP pd_mooncake rollout, right after
`POST /wake_up?tags=weights` (KV cache still released under level-2 sleep),
has its DP busy-loop fire a decode-shaped dummy batch that touches freed
KV -> the scheduler_metadata write in flashattn_mla.py:234 (MLA, glm4.7)
and flash_attn.py:547 (FA3, qwen3.6) raises `CUDA error: invalid argument`.

Restore the guard (`if not self.model_executor.is_sleeping` around
resume_scheduler; `if not self.is_sleeping()` around execute_dummy_batch),
keeping the all2all_utils weight-reload fix. This is the #173 sleep-guard
patch re-expressed against v0.23.0 line numbers.

Verified: git-apply --check clean against stock v0.23.0; both guards land;
glm4.7/qwen3.6 pd_mooncake reproduced the crash without it (the 8-day-old
image that still carried the guard passes both).

Signed-off-by: aoshen02 <aoshen@inferact.ai>
The scipy<1.14 pin (added in 0cda11c while chasing the pd_mooncake crash)
was a red herring: the real cause was the dropped core.py partial-wake
sleep-guard, now restored. slime-2125-as-vime pins only `numpy<2`; this
restores that exact line. numpy 1.26.4 + scipy resolved naturally matches
the working baseline.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
aoshen02 and others added 10 commits June 27, 2026 01:49
slime installs FlashQLA unconditionally, but it is sm90/Hopper-only.
Restore vime's original gated form (default off; --qwen-gdn-backend fla
elsewhere). CI build passes --build-arg INSTALL_FLASHQLA=1 to include it.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
…process

The mechanical mirror of slime's answer steered users to vLLM's standalone
`vllm serve` (slime's `launch_server` analog) for inference-only. That is
misleading for vime: like slime, vime launches the vLLM engines in-process
from `--vllm-config` (same in-process path as training), so a rollout-only
run serves directly with no separate server process. Point standalone users
to `--rollout-external-engine-addrs` instead. EN + ZH.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
vime's debug.md was missing slime's "INT4 / Compressed-Tensors Quantization
Checkpoint Issues" section (slime #1642) — dropped in an earlier sync, not
present on main. Restore it (EN + ZH), translated sglang→vLLM / Megatron→vLLM.
Covers the quantization_config.ignore list, all-zero MoE router weights
(mlp.gate.weight) when mis-quantized, missing safetensors shards, and
diagnosis via --check-weight-update-equal / --debug-rollout-only.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Keep slime's wording; the only engine-coupled fix is the standalone launcher
name. slime's `launch_server` is its in-process engine entry; vime's analog
is `_run_vllm_server` (vllm_engine.py, launched via multiprocessing.Process),
not the standalone `vllm serve` CLI. EN + ZH.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Reverts the scipy-pin removal in b359b24, which was wrong: vime's
vllm/vllm-openai base ships no scipy, so unpinned the build pulls scipy>=1.18,
which hard-requires numpy>=2 and uses np.long (removed numpy>=1.24) -> crashes
against the numpy<2 reinstall (Megatron needs numpy 1.x). slime's sglang base
resolves scipy 1.17.1 natively (numpy-1.x compatible), so slime needs no pin;
this is a base-image divergence, not a red herring. Pin boundary is 1.18
(slime runs 1.17.1), not the earlier 1.14 over-estimate.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
…355B-delta

These two were prefix-swapped (--sglang-X -> --vllm-X) without semantic mapping,
leaving ~20 args that aren't vLLM AsyncEngineArgs (argparse would reject). Apply the
knowledge/rl/sglang-to-vllm-translation.md §5.5 mappings:
- dp-size->data-parallel-size, ep-size->enable-expert-parallel, max-running-requests->
  max-num-seqs, cuda-graph-max-bs->max-cudagraph-capture-size
- 5x/4x --speculative-* -> one --vllm-speculative-config JSON (§5.2)
- DeepEP: per-group deepep_mode auto/low_latency -> all2all_backend deepep_high_throughput/
  low_latency in the --vllm-config overrides (vLLM has no 'auto'; PD encodes it per-role)
- watchdog-timeout -> env VLLM_ENGINE_ITERATION_TIMEOUT_S
- drop sglang-only: dp-attention / dp-lm-head / moe-dense-tp / disable-overlap-schedule / NSA
  backends (vLLM selects DeepSeek sparse attn per model) / engine delta-receiver knobs
- flag PD mooncake transport (-> --vllm-kv-transfer-config) as fabric-specific TODO

These are 744B/355B scripts not runnable in CI — translations are SOP-mapped but
hardware-unvalidated (flagged inline).

Signed-off-by: aoshen02 <aoshen@inferact.ai>
--update-weight-mode=delta (PR #278 lineage) is not yet validated on
vime+vLLM (vLLM exposes dense/sparse_flat only, not slime's gap-delta/
zstd encoding). Raise NotImplementedError at arg-validation so it fails
fast at startup instead of crashing mid weight-sync. Downstream delta
code is kept untouched; remove this raise once a real delta-load run
passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
The agent rollout engine is reached at vime's vLLM ``/inference/v1/generate``
(see vllm_rollout.get_model_url default + common.call_vllm_generate), not the
bare ``/generate`` of sglang. Fix the imprecise path in adapter/test docstrings
and comments, and rewrite the vllm-config.md custom-rollout examples (en+zh):
they were still sglang-shaped (``/generate`` path + ``{"text":..., "return_logprob":
True}`` body). Use vime's real request schema instead -- ``{"model","token_ids",
"sampling_params"}`` with ``max_tokens``/``logprobs``, ``prompt_logprobs`` for
fixed-sequence scoring, and the ``choices[0]`` response shape.

No code/logic change: comments, docstrings, and doc examples only. The Megatron
training server's own ``/generate`` endpoint and the sglang citation in
arguments.py are correct and left untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
The delta-guard NotImplementedError message was split across two adjacent
string literals; black on the CI (line-length 119) collapses/normalizes it.
Make it a single clean literal so pre-commit is green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
The hard delta guard (7bb19e6) raises NotImplementedError at the top of the
delta branch, making the downstream colocate / unknown-transport rejections
unreachable. Replace test_update_weight_delta_rejects_colocate and
test_update_weight_delta_rejects_unknown_transport (whose ValueError paths no
longer fire) with a single test_update_weight_delta_disabled that asserts the
guard raises for any delta config. Breadcrumb left to restore the per-condition
tests when delta is verified and the guard is lifted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
@aoshen02

Copy link
Copy Markdown
Collaborator Author
image

aoshen02 and others added 2 commits June 28, 2026 14:27
… (slime parity)

A mirror cross-check of the delta-weight-sync surface (slime #1806/#1991) found
the #2014..#2125 sync had silently dropped several slime-faithful pieces:

* extra_metrics logging chain — slime threads weight-update metrics from the
  actor through log_perf_data -> log_perf_data_raw. vime dropped the param at
  all three layers, so weight-update metrics were never logged. Restored:
  train_metric_utils.log_perf_data_raw(extra_metrics=...), data.log_perf_data
  passthrough, and actor passing self.weight_updater.pop_metrics().

* pop_metrics on UpdateWeightFromDistributed — the default (non-colocate, nccl)
  weight_updater. slime gives all three updaters a pop_metrics() stub so the
  actor can call it uniformly; vime kept it on tensor/disk but dropped it on
  distributed, which would AttributeError once the actor calls it. Restored the
  ~5-line stub (delta-specific plumbing stays dropped — vime+vLLM has no DeltaSpec).

* actor delta-mode dispatch branch — restores the elif selecting
  UpdateWeightFromDistributedDelta. Dead code behind the validation guard that
  rejects --update-weight-mode=delta, so vime mirrors slime with the guard as
  the single divergence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
Comment-only pass; no behavior change. Aligns vime comments to what a faithful
slime->vime translation would carry:

* Strip vime-divergence rationale markers (delta guard, opd-teacher-model,
  top-p fallback, colocate/delta tests). The rationale belongs in the
  divergence manifest, not inline; the guarded code + self-explanatory
  NotImplementedError messages stand on their own.
* De-verbose vLLM-specific comments to mirror scale: the router-args block,
  the AsyncEngineArgs u FrontendArgs docstring, and the MoE-replay /
  streaming-rollout blocks that slime does not carry at that length.
* Restore slime-original comments the sync had dropped or naively translated,
  with judgment translation of sglang-specific terms: session_id routing
  ("vLLM router", not the mechanical "Model Gateway"), "Prepare payload for
  vLLM server", the unique-session_id loop, and the pending-tasks wait.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant