fix(delta-sync): surface failed engine apply results instead of silently discarding them by tanishkasinghhh · Pull Request #2119 · THUDM/slime

tanishkasinghhh · 2026-06-22T20:15:58Z

Summary

_finalize_sync() in update_weight_from_distributed_delta.py calls
ray.get(object_refs) but discards the (success: bool, msg: str) tuples
returned by each SGLang receiver engine (via update_weights_from_disk /
update_weights_from_distributed with load_format="delta").

A failed apply — from a checksum mismatch, I/O error, or decode error — is
therefore completely silent.

Why this matters

The sender advances its pinned-CPU snapshot before flushing to the
receiver (update_snapshot_async runs ahead of the bucket flush in
_enqueue_chunk). A silently-failed apply therefore leaves the sender
snapshot permanently ahead of what the receiver actually holds. All
subsequent diffs are taken against the (already-advanced) snapshot, so the
missed weight updates are never re-sent. The drift persists until a
full broadcast reseed (process restart).

The SGLang receiver already catches all exceptions internally and returns
(False, error_msg) — the only gap was that slime never checked the return
value.

Fix

Extracts check_apply_results() into a lightweight, dependency-free module
(_apply_result_check.py) and calls it immediately after ray.get():

# Before
ray.get(object_refs)             # (False, msg) silently dropped

# After
results = ray.get(object_refs)
check_apply_results(results)     # logs per-engine errors + raises

Separating the helper into its own module keeps it unit-testable without
torch / ray / megatron / GPU.

Changes

File	What
`slime/backends/megatron_utils/update_weight/_apply_result_check.py`	New: the `check_apply_results()` helper with full docstring
`slime/backends/megatron_utils/update_weight/update_weight_from_distributed_delta.py`	Import + call the helper after `ray.get()`
`tests/test_delta_sync_check_apply.py`	6 CPU unit tests (no GPU/Ray needed)

Tests

python3 -m pytest tests/test_delta_sync_check_apply.py -v
# 6 passed

Closes #2104

Prepared with AI assistance (Claude Code). Every change reviewed and understood line-by-line.

…rding them _finalize_sync() called ray.get(object_refs) and discarded the (success: bool, msg: str) tuples returned by each SGLang receiver engine. A failed delta-weight apply (checksum mismatch, I/O error, decode error) was therefore completely silent: ray.get(object_refs) # return values dropped — issue THUDM#2104 self._pending_publishes.clear() Because the sender advances its pinned-CPU snapshot *before* flushing to the receiver, a silently-failed apply leaves the sender snapshot ahead of the receiver. All subsequent diffs are computed against the (already- advanced) snapshot, so the missed updates are never re-sent. The drift persists for the rest of the run. Fix: extract check_apply_results() into a dependency-free module (_apply_result_check.py) and call it immediately after ray.get(): results = ray.get(object_refs) check_apply_results(results) # logs + raises on any (False, msg) The helper is in its own module so it can be unit-tested without torch, ray, megatron, or any GPU runtime. The SGLang receiver already catches every exception and returns (False, error_msg) rather than propagating; the only gap was that slime never checked the return value. Closes THUDM#2104

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(delta-sync): surface failed engine apply results instead of silently discarding them#2119

fix(delta-sync): surface failed engine apply results instead of silently discarding them#2119
tanishkasinghhh wants to merge 1 commit into
THUDM:mainfrom
tanishkasinghhh:fix/delta-sync-inspect-results-2104

tanishkasinghhh commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

tanishkasinghhh commented Jun 22, 2026

Summary

Why this matters

Fix

Changes

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant