sync(slime #2014..#2125): 3-way merge [WIP — 45 conflict files need review]#286
sync(slime #2014..#2125): 3-way merge [WIP — 45 conflict files need review]#286aoshen02 wants to merge 30 commits into
Conversation
Documentation build overview
40 files changed ·
|
There was a problem hiding this comment.
Code Review
This pull request migrates the codebase from the vime framework (using vLLM) to the slime framework (using SGLang as the default rollout backend), updating all documentation, Docker configurations, example scripts, and tests accordingly. It also introduces advanced features such as delta weight sync, external rollout engines, and new model recipes (e.g., GLM-5.2 744B-A40B). The review feedback highlights several critical bugs and improvement opportunities across the codebase, including a falsy check in the placement group layout calculation when rollout_num_gpus is zero, a JSON parsing issue in the retool example caused by unsafe newline replacement, a Dockerfile anti-pattern of split apt update and apt install commands, a logic error in the retool reward calculation that could penalize correct tool usage, a missing abort status handler in the multi-agent system, a potential TypeError when accessing non-dictionary labels in the search-r1 example, and potential data loss and status finalization issues in the multi-turn VLM rollout.
| pytest.param({"colocate": True, "rollout_num_gpus": 8}, (16, 0), id="colocate_rollout_less_than_actor"), | ||
| pytest.param({"colocate": True, "rollout_num_gpus": 16}, (16, 0), id="colocate_rollout_equals_actor"), | ||
| pytest.param({"colocate": True, "rollout_num_gpus": 32}, (32, 0), id="colocate_rollout_more_than_actor"), | ||
| pytest.param({"rollout_num_gpus": 0}, (16, 16), id="zero_rollout_gpus"), |
There was a problem hiding this comment.
When rollout_num_gpus is explicitly set to 0, the placement group should not allocate any GPUs for rollout (i.e., the expected layout should be (16, 0)). The current expectation of (16, 16) suggests a bug in _get_placement_group_layout where args.rollout_num_gpus is checked using a falsy check (e.g., args.rollout_num_gpus or default) instead of an explicit is None check, causing 0 to incorrectly fall back to the default value.
| pytest.param({"rollout_num_gpus": 0}, (16, 16), id="zero_rollout_gpus"), | |
| pytest.param({"rollout_num_gpus": 0}, (16, 0), id="zero_rollout_gpus"), |
| # Replace newlines in string values with \n | ||
| json_str = json_str.replace("\n", "\\n") | ||
| tool_call_data = json.loads(json_str) |
There was a problem hiding this comment.
Replacing all newlines with \\n via json_str.replace("\n", "\\n") is highly problematic because it replaces formatting newlines outside of string literals (e.g., between keys and values in pretty-printed JSON), which makes the JSON invalid and causes json.loads to fail. To safely allow unescaped newlines and other control characters inside string values, use json.loads(json_str, strict=False) instead.
| # Replace newlines in string values with \n | |
| json_str = json_str.replace("\n", "\\n") | |
| tool_call_data = json.loads(json_str) | |
| tool_call_data = json.loads(json_str, strict=False) |
| RUN apt update | ||
| RUN apt install -y nvtop rsync dnsutils prometheus |
There was a problem hiding this comment.
Running apt update and apt install in separate RUN instructions is an anti-pattern in Dockerfiles because of how layer caching works. If the apt update layer is cached, a subsequent build may not fetch the latest package lists, potentially causing the apt install step to fail or install outdated packages. They should be combined into a single RUN instruction, and the package lists should be cleaned up afterwards to keep the image size small.
RUN apt-get update && apt-get install -y \
nvtop \
rsync \
dnsutils \
prometheus \
&& rm -rf /var/lib/apt/lists/*
| if result["score"] < 0: | ||
| tool_call_reward = (num_turns - 2) / 2 * 0.1 | ||
| result["score"] = min(-0.6, result["score"] + tool_call_reward) |
There was a problem hiding this comment.
There is a logic error in the reward calculation when tool_call_reward is positive but the original result["score"] is already better than -0.6 (e.g., -0.5). In this case, min(-0.6, result["score"] + tool_call_reward) will actually reduce the score to -0.6, penalizing the model for calling tools. The reward should only be capped at -0.6 if it is an improvement, or we should ensure we don't decrease a score that is already better than -0.6.
| if result["score"] < 0: | |
| tool_call_reward = (num_turns - 2) / 2 * 0.1 | |
| result["score"] = min(-0.6, result["score"] + tool_call_reward) | |
| if result["score"] < 0: | |
| tool_call_reward = (num_turns - 2) / 2 * 0.1 | |
| if tool_call_reward > 0: | |
| result["score"] = min(-0.6, result["score"] + tool_call_reward) | |
| else: | |
| result["score"] = result["score"] + tool_call_reward |
| # case "abort": | ||
| # sample.status = Sample.Status.ABORTED |
There was a problem hiding this comment.
The case "abort" is currently commented out. If the inference engine finishes with an "abort" reason, the sample's status will not be updated to ABORTED, which could cause the training pipeline to process incomplete or empty trajectories. This case should be uncommented.
| # case "abort": | |
| # sample.status = Sample.Status.ABORTED | |
| case "abort": | |
| sample.status = Sample.Status.ABORTED |
|
|
||
| score = compute_score_em( | ||
| solution_str=sample.prompt + sample.response, | ||
| ground_truth=sample.label["ground_truth"], |
There was a problem hiding this comment.
Accessing sample.label["ground_truth"] directly assumes that sample.label is always a dictionary. However, in many datasets (and as seen in other rollout files), sample.label can be a raw string. To prevent a TypeError at runtime, use a defensive check to handle both dictionary and string labels.
| ground_truth=sample.label["ground_truth"], | |
| ground_truth=sample.label.get("ground_truth", sample.label) if isinstance(sample.label, dict) else sample.label, |
| if ( | ||
| key in sample.multimodal_inputs | ||
| and isinstance(sample.multimodal_inputs[key], list) | ||
| and isinstance(val, list) | ||
| ): | ||
| sample.multimodal_inputs[key].extend(val) |
There was a problem hiding this comment.
In _update_multimodal_state, if a key in obs_multimodal_inputs is not already present in sample.multimodal_inputs, it is silently ignored. This can lead to losing new multimodal input keys (such as "videos" or other custom keys) introduced in subsequent turns. We should initialize the key in sample.multimodal_inputs if it is missing.
| if ( | |
| key in sample.multimodal_inputs | |
| and isinstance(sample.multimodal_inputs[key], list) | |
| and isinstance(val, list) | |
| ): | |
| sample.multimodal_inputs[key].extend(val) | |
| if ( | |
| key in sample.multimodal_inputs | |
| and isinstance(sample.multimodal_inputs[key], list) | |
| and isinstance(val, list) | |
| ): | |
| sample.multimodal_inputs[key].extend(val) | |
| elif key not in sample.multimodal_inputs: | |
| sample.multimodal_inputs[key] = list(val) if isinstance(val, list) else val |
| sample.multimodal_train_inputs = _merge_multimodal_train_inputs(multimodal_train_inputs_buffer) | ||
| sample.response = tokenizer.decode(response_tokens, skip_special_tokens=False) | ||
| sample.response_length = len(response_tokens) | ||
| if sample.status is None: |
There was a problem hiding this comment.
Checking if sample.status is None: to finalize the sample status to COMPLETED might miss cases where sample.status is initialized to Sample.Status.PENDING (which is the default for a new Sample). To be safe and ensure the sample is correctly marked as completed, check for both None and PENDING.
| if sample.status is None: | |
| if sample.status in (None, Sample.Status.PENDING): |
88b7ee5 to
e596925
Compare
e596925 to
cad2658
Compare
Rebased onto upstream/main 7198547 (now includes #260 scripts port, #280 tau-bench agent_strategy, #283 docker image rename, #257 data fix). Effect vs prior base: the 12 script files #260 ported moved from new-to-vime to in-scope; 11 auto-merged clean, only scripts/run-deepseek-r1.sh newly conflicts (theirs adds /sgl-workspace dead path -> resolve by dropping). 3-way merge results on new base: 52 clean / 46 conflict (110 blocks) / 50 new-to-vime / 5 del Resolves PR #286 conflict-with-main. Conflict markers preserved for commit 2. SOP: knowledge/rl/slime-to-vime-sync-sop.md
f06efc2 to
e544a01
Compare
Mechanical commit 1 of 2 (per knowledge/rl/slime-to-vime-sync-sop.md §2). diff3 translated 3-way merge on upstream/main (incl #260/#280/#283/#257): ours = vime@main, base = translate(slime@#2013), theirs = translate(slime@#2125) Translation fixes vs prior attempt: - casing: SGLang->vLLM (prose) / SGLang<X>->VLLM<X> (identifiers); killed VLlm artifact (was 35 files) - dotted module refs slime.X->vime.X now translated (was leaking 'from slime.backends') These resolved 9 spurious conflicts (46->37 files). Results: 61 clean / 37 conflict (diff3 markers preserved) / 50 new-to-vime / 5 del. Conflict markers use readable -L labels (ours/base/theirs). Resolve in commit 2. Non-conflict provenance fix: vimerl/vime -> vllm/vime in 2 example docs. Engine patch handling (docker/patch/) deferred to commit 2 per SOP §4.5. Signed-off-by: aoshen02 <aoshen@inferact.ai>
Resolved all 37 conflict files / 84 diff3 blocks per agent_run RESOLUTION_POLICY. Principle: keep vime vLLM impl (ours) + incorporate slime's new features (theirs). Highlights: - vLLM API form kept everywhere: /inference/v1/generate, choices parsing, AsyncEngineArgs, vLLM flag names (--vllm-gpu-memory-utilization etc). - Dropped all vllm.srt.* imports (non-existent in real vLLM). - Accepted new slime features: delta-weight-sync CLI args, append_response_tokens (Sample), get_server_info/start_external_rollout_servers/get_rollout_num_engines, old-router(<=0.2.1) compat, TrajectoryManager adapter design (vime already adopted it). - Kept vime-only: --rollout-external, add_router_arguments, _get_metrics_router_addr, reinit_wandb_primary_with_open_metrics, update_tracking_open_metrics, modal sandbox, VIME_AGENT_* env names, local-vLLM tau-bench user sim. - Engine patches (docker/patch/): kept ours vllm.patch (22-line MoE fix), dropped theirs sglang 2674-line content; deleted sglang-only vllm-top_p.patch (per SOP 4.5). - Dockerfile kept ours (vllm/vllm-openai base); version.txt accepted theirs nightly. - run-deepseek-r1.sh: dropped /sgl-workspace dead-path env. Deviations from policy (documented): vllm_rollout.py abort path kept ours pause/drain (abort_servers_until_idle would break partial-rollout drain + leave paused_workers unbound). README ecosystem section left empty (ours) pending de-translation of provenance. All changed .py py_compile clean; zero conflict markers; no sglang/slime leakage. Signed-off-by: aoshen02 <aoshen@inferact.ai>
- re-add dropped `from vllm_router.launch_router import RouterArgs` import in vime/utils/arguments.py (add_router_arguments uses it; F821 from conflict resolution) - drop unused base_top_p_token_ids/offsets in vllm_streaming_rollout.py (F841; came from theirs but ours's choices-parsing path doesn't use them) - black/isort autoformat (anthropic.py, test_agent/*, arguments.py) - pipeline.yml: agent tests moved to tests/test_agent/; wire new CPU tests pre-commit: all hooks pass (ruff/autoflake/isort/black/yaml). Signed-off-by: aoshen02 <aoshen@inferact.ai>
…s tests) Local CPU CI (pre-commit + plugin + agent + utils) all green on 8xH200 host in python:3.11 containers. Fixes found while running it: - vllm_engine.py: make 'import vllm_router'/'packaging.parse' lazy (inside _register_to_router); top-level import broke CPU import (vllm_router absent in CPU CI). The old-router(<=0.2.1) compat branch is theirs-accepted. - docker/Dockerfile: TMS_CUDA_MAJOR=12 for torch_memory_saver pin (its build backend now requires it for CUDA wheels; base is cu129). Unblocks image build. - agent/adapters/common.py: _run_turn called parse_model_output() without the required tokenizer= kwarg -> 500s in adapter tests. Pass tokenizer=tok. - tests/test_agent/_fakes.py: FakeVLLMServer served sglang /generate + meta_info; retarget to vime /inference/v1/generate + choices shape + x-session-id header. - tests/test_agent/test_adapters.py: parse_model_output(tokenizer=...) + assert vime body keys (token_ids/max_tokens). - tests/utils/test_vllm_config.py: vLLMConfig->VllmConfig (4 sites); fake router returns 3-tuple (ip,port,prom) matching _start_router; drop spurious resolve(). - tests/test_megatron_argument_validation.py: add num_gpus_per_node=8 to the vime_validate_args fixture (vime colocate override needs it). - .buildkite/pipeline.yml: agent tests -> tests/test_agent/*; +cispo_loss, +logprob_response_spans (CPU-safe); test_rollout_metrics stays GPU-only (imports vllm). Engine patch verdict (PATCH_ASSESSMENT.md): P1-P7 vLLM doesn't need (NIXL/Mooncake native); kept ours vllm.patch, dropped sglang content + top_p.patch. Signed-off-by: aoshen02 <aoshen@inferact.ai>
…ate) Found running GPU CI on 8xH200 with vllm/vime:latest: - docker/Dockerfile: pin scipy<1.14 next to numpy<2 (scipy drifted to 1.18 which needs numpy>=2 and uses removed np.long -> 'import vllm' crash). - vllm_utils/arguments.py: drop sync-added RouterArgs.add_cli_args + its import in add_vllm_router_arguments. main exposes the full router surface only in utils.add_router_arguments; the duplicate re-registered --router-request-timeout-secs -> argparse conflict at train startup. - megatron_utils/loss.py: get_rollout_top_p_logprob_kwargs falls back to full-vocab logprob when top-p nucleus token ids are absent instead of raising. slime's top-p-replay needs engine-returned top-p tokens; vime's vLLM /inference/v1/generate does not expose them (sglang-only). Matches vime pre-sync behavior; flagged in OVERNIGHT_REPORT for review. Image import smoke + Megatron ckpt load + 4x VLLMEngine bringup + NCCL weight transfer all confirmed working in-image before this. Signed-off-by: aoshen02 <aoshen@inferact.ai>
…ort suite slime deleted tests/test_qwen2.5_0.5B_ppo_critic_only_short.py this window (#2014..#2125); gpu_suites.py still listed it -> 'no such file' exit 2. Critic-only path is still covered by test_qwen3_4B_ppo_train_critic_only (megatron suite). Other 3 short tests (gsm8k_async, gsm8k, fully_async) pass on 8xH200. Signed-off-by: aoshen02 <aoshen@inferact.ai>
…t crash) GPU test_qwen3_4B_streaming_partial_rollout hit 'TypeError: unsupported operand +: NoneType and list' at vllm_streaming_rollout.py:234. The conflict resolution changed base_log_probs from main's `list(sample.rollout_log_probs or [])` to a None-able form; a fresh sample (rollout_log_probs=None) then did None + call_log_probs. Restored main's form. Real sync-resolution regression caught by GPU CI. Signed-off-by: aoshen02 <aoshen@inferact.ai>
… update_from_meta_info) slime #2110 renamed Sample.update_from_meta_info -> append_response_tokens; vllm_rollout was updated but vllm_streaming_rollout still called the old name (AttributeError at generate_streaming). Streaming already accumulates tokens incrementally for partial-rollout, so call append_response_tokens(meta_info=meta) with tokens omitted -> metadata-only finalize (no double-append). Caught by GPU test_qwen3_4B_streaming_partial_rollout. Signed-off-by: aoshen02 <aoshen@inferact.ai>
…rive 3-way compare (slime / vime-main / PR) showed the merge made a broken hybrid: it KEPT vime's num_gpus_per_node colocate override (which assumes rollout_num_gpus is forced to actor_num_gpus_per_node*actor_num_nodes) but REPLACED vime's unconditional re-derive (`!= -> re-derive`) with slime's `is None`-only form. When a colocate test's rollout_num_gpus is non-None but mismatches, it was left mis-sized -> engine/GPU misplacement -> mixed_offload IPC-UUID mismatch + ckpt 'Free memory < util'. slime passes (no override, self-consistent is-None); vime main passes (override + unconditional re-derive, coupled). Restore vime's re-derive; keep slime's new rollout_num_gpus==0 branch (checked first). Signed-off-by: aoshen02 <aoshen@inferact.ai>
…s, comments) Docs (EN+zh parity): - fault-tolerance: revert junk 'server'->'engine' mistranslation (keep correct /health endpoint, verified vs vllm_engine.py) - vllm-config: 'ServerArgs'->'EngineArgs' (sglang class -> vLLM AsyncEngineArgs u FrontendArgs); fix duplicated 'vllm-router (vllm-router)' alias - customization: restore over-deleted 'custom_generate -> list[Sample]' section + signature (dropped only the vime-absent search-r1 example link) Rollout: - routed_experts now flows through the slime-identical Sample._apply_meta_info (single assignment site, torch.int32 tensor matching downstream) instead of an inline numpy assign; vLLM .npy-on-choice decode stays (engine wire-format delta). Both vllm_rollout and vllm_streaming_rollout. Comments for future syncers: - --opd-teacher-model + on_policy_distillation: engine-driven divergence (vLLM model field; sglang /generate has none) - overrides / _vllm_server_field_names: AsyncEngineArgs u FrontendArgs == slime's sglang ServerArgs Examples/docker/etc: drop vime-absent npu/retool/search-r1/tau-bench files; restore eval_multi_task; docker alignment with slime. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
dfe6d47 to
f6cfc31
Compare
… + isort/black on test_agent Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
…-derive; parse_model_output tokenizer) - test_..._preserves_larger_rollout_gpus_under_colocate asserted slime behavior (==12); vime re-derives to actor*nodes=8 under colocate (commit 9701304). Renamed + assert ==8 + divergence note. vime-main never had the test; slime does. - test_parse_model_output_plain_text_no_parsers called parse_model_output without the required tokenizer kwarg (#198 made it required for vLLM parsers). Pass tokenizer=None (unused on the no-parser path). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
…ang /generate The slime diff3 merge took slime's sglang endpoint (/generate) for the streaming rollout POST instead of keeping vime's vLLM endpoint (/inference/v1/generate). main (7198547) had the correct URL; the sync regressed it (and dropped the base var). Result: 404 Not Found at vllm_streaming_rollout.py:182 -> test_qwen3_4B_streaming_partial_rollout fails. Caught on a clean h200 node. The file's own docstrings already say /inference/v1/generate throughout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
…ines The diff3 merge adopted slime's deferred rollout-engine init (start_rollout_servers now returns pending_init_handles awaited by the caller instead of ray.get-ing each model's engines before the next). That broke an implicit invariant the per-model port_cursors reset relied on: in main, model 0's engines were fully bound before model 1 allocated ports, so the free-port bind-test in _allocate_rollout_engine_addr_and_ports_normal skipped model 0's ports. With deferred init, model 1 allocates while model 0 is unbound, the bind-test sees the base ports free, and a second model (e.g. mixed_offload's frozen "ref") lands on the same 15000-15003 as the actor. The actor's POST /update_weights to :15002 then hits the never-started ref engine -> vLLM 500 "start_weight_update must be called before update_weights" (test_vllm_config_mixed_offload[_ft]). Fix: initialize port_cursors once before the model loop so the per-node next-free cursor is monotonic across all models, keeping every engine's ports disjoint regardless of bind timing. Single-model behaviour is unchanged; the per-model reset only existed to scope cursors that are already node-keyed. Caught on h200 GPU CI (new nightly-dev-20260618a image, which added the start_weight_update-before-update_weights enforcement that exposed the collision). Signed-off-by: aoshen02 <aoshen@inferact.ai>
…y actors PR #260 ("complete slime-exact port") changed execute_train's pre-launch cleanup from pkill -9 -f '[v]llm serve|VLL[M]::' to pkill -9 vllm as an over-literal sglang->vllm translation. But the vLLM rollout engine runs as Ray actor processes whose process *name* is python/ray, with "VLLM::" only in the command line — so `pkill -9 vllm` (name match, no -f) does not kill them. Leftover engine processes from the ckpt test's save phase survive into the load phase, holding ~115 GiB, so the load-phase engine starts with ~24/139 GiB free and dies with "Free memory ... less than desired GPU memory utilization (0.8, 111.84 GiB)" (test_qwen3_4B_ckpt.py, both --async-save and not). Restore the cmdline-match pattern `-f '[v]llm serve|VLL[M]::'`. Bisected on h200: ckpt PASSES at 289ee6d / e62d44f (old pattern, 4/4 runs) and FAILS at 7198547/main + PR (new pattern, 0/3), same old image -> code regression in #260. Verified: pkill-fixed PR code + new pr286 image -> ckpt PASS (579s). Signed-off-by: aoshen02 <aoshen@inferact.ai>
Same root cause as 764e1e1 (command_utils.py): `pkill -9 vllm` matches by process *name*, but vLLM rollout engines run as Ray actor processes (python/ray named, "VLLM::" only in the command line), so the name match never kills them. Apply the robust cmdline pattern `pkill -9 -f '[v]llm serve|VLL[M]::'` everywhere the bare `pkill -9 vllm` cleanup was used across run/example scripts, so leftover engines don't squat GPUs across runs. No logic change beyond the kill pattern. Signed-off-by: aoshen02 <aoshen@inferact.ai>
3f60d76 to
29f2490
Compare
Commit d41f0aa removed the core.py sleep-guard hunk from docker/patch/latest/vllm.patch on the assumption it was "already in v0.23.0". It is not: stock v0.23.0 `vllm/v1/engine/core.py` calls `resume_scheduler()` and `execute_dummy_batch()` even during a partial (weights-only) wake. So a colocate DP+EP pd_mooncake rollout, right after `POST /wake_up?tags=weights` (KV cache still released under level-2 sleep), has its DP busy-loop fire a decode-shaped dummy batch that touches freed KV -> the scheduler_metadata write in flashattn_mla.py:234 (MLA, glm4.7) and flash_attn.py:547 (FA3, qwen3.6) raises `CUDA error: invalid argument`. Restore the guard (`if not self.model_executor.is_sleeping` around resume_scheduler; `if not self.is_sleeping()` around execute_dummy_batch), keeping the all2all_utils weight-reload fix. This is the #173 sleep-guard patch re-expressed against v0.23.0 line numbers. Verified: git-apply --check clean against stock v0.23.0; both guards land; glm4.7/qwen3.6 pd_mooncake reproduced the crash without it (the 8-day-old image that still carried the guard passes both). Signed-off-by: aoshen02 <aoshen@inferact.ai>
The scipy<1.14 pin (added in 0cda11c while chasing the pd_mooncake crash) was a red herring: the real cause was the dropped core.py partial-wake sleep-guard, now restored. slime-2125-as-vime pins only `numpy<2`; this restores that exact line. numpy 1.26.4 + scipy resolved naturally matches the working baseline. Signed-off-by: aoshen02 <aoshen@inferact.ai>
slime installs FlashQLA unconditionally, but it is sm90/Hopper-only. Restore vime's original gated form (default off; --qwen-gdn-backend fla elsewhere). CI build passes --build-arg INSTALL_FLASHQLA=1 to include it. Signed-off-by: aoshen02 <aoshen@inferact.ai>
…process The mechanical mirror of slime's answer steered users to vLLM's standalone `vllm serve` (slime's `launch_server` analog) for inference-only. That is misleading for vime: like slime, vime launches the vLLM engines in-process from `--vllm-config` (same in-process path as training), so a rollout-only run serves directly with no separate server process. Point standalone users to `--rollout-external-engine-addrs` instead. EN + ZH. Signed-off-by: aoshen02 <aoshen@inferact.ai>
vime's debug.md was missing slime's "INT4 / Compressed-Tensors Quantization Checkpoint Issues" section (slime #1642) — dropped in an earlier sync, not present on main. Restore it (EN + ZH), translated sglang→vLLM / Megatron→vLLM. Covers the quantization_config.ignore list, all-zero MoE router weights (mlp.gate.weight) when mis-quantized, missing safetensors shards, and diagnosis via --check-weight-update-equal / --debug-rollout-only. Signed-off-by: aoshen02 <aoshen@inferact.ai>
Keep slime's wording; the only engine-coupled fix is the standalone launcher name. slime's `launch_server` is its in-process engine entry; vime's analog is `_run_vllm_server` (vllm_engine.py, launched via multiprocessing.Process), not the standalone `vllm serve` CLI. EN + ZH. Signed-off-by: aoshen02 <aoshen@inferact.ai>
Reverts the scipy-pin removal in b359b24, which was wrong: vime's vllm/vllm-openai base ships no scipy, so unpinned the build pulls scipy>=1.18, which hard-requires numpy>=2 and uses np.long (removed numpy>=1.24) -> crashes against the numpy<2 reinstall (Megatron needs numpy 1.x). slime's sglang base resolves scipy 1.17.1 natively (numpy-1.x compatible), so slime needs no pin; this is a base-image divergence, not a red herring. Pin boundary is 1.18 (slime runs 1.17.1), not the earlier 1.14 over-estimate. Signed-off-by: aoshen02 <aoshen@inferact.ai>
…355B-delta These two were prefix-swapped (--sglang-X -> --vllm-X) without semantic mapping, leaving ~20 args that aren't vLLM AsyncEngineArgs (argparse would reject). Apply the knowledge/rl/sglang-to-vllm-translation.md §5.5 mappings: - dp-size->data-parallel-size, ep-size->enable-expert-parallel, max-running-requests-> max-num-seqs, cuda-graph-max-bs->max-cudagraph-capture-size - 5x/4x --speculative-* -> one --vllm-speculative-config JSON (§5.2) - DeepEP: per-group deepep_mode auto/low_latency -> all2all_backend deepep_high_throughput/ low_latency in the --vllm-config overrides (vLLM has no 'auto'; PD encodes it per-role) - watchdog-timeout -> env VLLM_ENGINE_ITERATION_TIMEOUT_S - drop sglang-only: dp-attention / dp-lm-head / moe-dense-tp / disable-overlap-schedule / NSA backends (vLLM selects DeepSeek sparse attn per model) / engine delta-receiver knobs - flag PD mooncake transport (-> --vllm-kv-transfer-config) as fabric-specific TODO These are 744B/355B scripts not runnable in CI — translations are SOP-mapped but hardware-unvalidated (flagged inline). Signed-off-by: aoshen02 <aoshen@inferact.ai>
--update-weight-mode=delta (PR #278 lineage) is not yet validated on vime+vLLM (vLLM exposes dense/sparse_flat only, not slime's gap-delta/ zstd encoding). Raise NotImplementedError at arg-validation so it fails fast at startup instead of crashing mid weight-sync. Downstream delta code is kept untouched; remove this raise once a real delta-load run passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
The agent rollout engine is reached at vime's vLLM ``/inference/v1/generate``
(see vllm_rollout.get_model_url default + common.call_vllm_generate), not the
bare ``/generate`` of sglang. Fix the imprecise path in adapter/test docstrings
and comments, and rewrite the vllm-config.md custom-rollout examples (en+zh):
they were still sglang-shaped (``/generate`` path + ``{"text":..., "return_logprob":
True}`` body). Use vime's real request schema instead -- ``{"model","token_ids",
"sampling_params"}`` with ``max_tokens``/``logprobs``, ``prompt_logprobs`` for
fixed-sequence scoring, and the ``choices[0]`` response shape.
No code/logic change: comments, docstrings, and doc examples only. The Megatron
training server's own ``/generate`` endpoint and the sglang citation in
arguments.py are correct and left untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>
The delta-guard NotImplementedError message was split across two adjacent string literals; black on the CI (line-length 119) collapses/normalizes it. Make it a single clean literal so pre-commit is green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
The hard delta guard (7bb19e6) raises NotImplementedError at the top of the delta branch, making the downstream colocate / unknown-transport rejections unreachable. Replace test_update_weight_delta_rejects_colocate and test_update_weight_delta_rejects_unknown_transport (whose ValueError paths no longer fire) with a single test_update_weight_delta_disabled that asserts the guard raises for any delta config. Breadcrumb left to restore the per-condition tests when delta is verified and the guard is lifted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
… (slime parity) A mirror cross-check of the delta-weight-sync surface (slime #1806/#1991) found the #2014..#2125 sync had silently dropped several slime-faithful pieces: * extra_metrics logging chain — slime threads weight-update metrics from the actor through log_perf_data -> log_perf_data_raw. vime dropped the param at all three layers, so weight-update metrics were never logged. Restored: train_metric_utils.log_perf_data_raw(extra_metrics=...), data.log_perf_data passthrough, and actor passing self.weight_updater.pop_metrics(). * pop_metrics on UpdateWeightFromDistributed — the default (non-colocate, nccl) weight_updater. slime gives all three updaters a pop_metrics() stub so the actor can call it uniformly; vime kept it on tensor/disk but dropped it on distributed, which would AttributeError once the actor calls it. Restored the ~5-line stub (delta-specific plumbing stays dropped — vime+vLLM has no DeltaSpec). * actor delta-mode dispatch branch — restores the elif selecting UpdateWeightFromDistributedDelta. Dead code behind the validation guard that rejects --update-weight-mode=delta, so vime mirrors slime with the guard as the single divergence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: aoshen02 <aoshen@inferact.ai>
Comment-only pass; no behavior change. Aligns vime comments to what a faithful
slime->vime translation would carry:
* Strip vime-divergence rationale markers (delta guard, opd-teacher-model,
top-p fallback, colocate/delta tests). The rationale belongs in the
divergence manifest, not inline; the guarded code + self-explanatory
NotImplementedError messages stand on their own.
* De-verbose vLLM-specific comments to mirror scale: the router-args block,
the AsyncEngineArgs u FrontendArgs docstring, and the MoE-replay /
streaming-rollout blocks that slime does not carry at that length.
* Restore slime-original comments the sync had dropped or naively translated,
with judgment translation of sglang-specific terms: session_id routing
("vLLM router", not the mechanical "Model Gateway"), "Prepare payload for
vLLM server", the unique-session_id loop, and the pending-tasks wait.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>

slime → vime sync: window #2014..#2125
Syncs all 48 PRs merged in slime after cutoff
44d29ee5(slime #2013) up to slime#2125.Follows SOP:
vllm-agent-infra/knowledge/rl/slime-to-vime-sync-sop.mdScope: full mirror, 153 changed files. Window extended #2118→#2125 to fold in the
coding-agent harness follow-ups (#2124/#2125) to #2005 + non-float reward fix (#2121).
Structure: one PR, two commits
Commit 1 (current): 3-way merge, conflict markers preserved
ours= vime@main ·base= translate(slime@cutoff #2013) ·theirs= translate(slime@tip #2125)<<<<<<<conflict markers · 62 new-to-vime · 5 deletionsCommit 2 (pending): resolve all conflict blocks — the only commit that needs review.
Status: DRAFT — waiting on conflict resolution
Conflict markers are intentionally preserved as the review queue. The agent maintains
agent_run/reports/QUESTIONS.md(43 questions) + an interactive decision tracker.Decisions needed (3 genuine + 6 confirm):
🤖 Generated with Claude Code