Skip to content

fix(delta-sync): surface failed engine apply results instead of silently discarding them#2119

Open
tanishkasinghhh wants to merge 1 commit into
THUDM:mainfrom
tanishkasinghhh:fix/delta-sync-inspect-results-2104
Open

fix(delta-sync): surface failed engine apply results instead of silently discarding them#2119
tanishkasinghhh wants to merge 1 commit into
THUDM:mainfrom
tanishkasinghhh:fix/delta-sync-inspect-results-2104

Conversation

@tanishkasinghhh

Copy link
Copy Markdown

Summary

_finalize_sync() in update_weight_from_distributed_delta.py calls
ray.get(object_refs) but discards the (success: bool, msg: str) tuples
returned by each SGLang receiver engine (via update_weights_from_disk /
update_weights_from_distributed with load_format="delta").

A failed apply — from a checksum mismatch, I/O error, or decode error — is
therefore completely silent.

Why this matters

The sender advances its pinned-CPU snapshot before flushing to the
receiver (update_snapshot_async runs ahead of the bucket flush in
_enqueue_chunk). A silently-failed apply therefore leaves the sender
snapshot permanently ahead of what the receiver actually holds. All
subsequent diffs are taken against the (already-advanced) snapshot, so the
missed weight updates are never re-sent. The drift persists until a
full broadcast reseed (process restart).

The SGLang receiver already catches all exceptions internally and returns
(False, error_msg) — the only gap was that slime never checked the return
value.

Fix

Extracts check_apply_results() into a lightweight, dependency-free module
(_apply_result_check.py) and calls it immediately after ray.get():

# Before
ray.get(object_refs)             # (False, msg) silently dropped

# After
results = ray.get(object_refs)
check_apply_results(results)     # logs per-engine errors + raises

Separating the helper into its own module keeps it unit-testable without
torch / ray / megatron / GPU.

Changes

File What
slime/backends/megatron_utils/update_weight/_apply_result_check.py New: the check_apply_results() helper with full docstring
slime/backends/megatron_utils/update_weight/update_weight_from_distributed_delta.py Import + call the helper after ray.get()
tests/test_delta_sync_check_apply.py 6 CPU unit tests (no GPU/Ray needed)

Tests

python3 -m pytest tests/test_delta_sync_check_apply.py -v
# 6 passed

Closes #2104

Prepared with AI assistance (Claude Code). Every change reviewed and understood line-by-line.

…rding them

_finalize_sync() called ray.get(object_refs) and discarded the
(success: bool, msg: str) tuples returned by each SGLang receiver engine.
A failed delta-weight apply (checksum mismatch, I/O error, decode error)
was therefore completely silent:

  ray.get(object_refs)   # return values dropped — issue THUDM#2104
  self._pending_publishes.clear()

Because the sender advances its pinned-CPU snapshot *before* flushing to
the receiver, a silently-failed apply leaves the sender snapshot ahead of
the receiver. All subsequent diffs are computed against the (already-
advanced) snapshot, so the missed updates are never re-sent. The drift
persists for the rest of the run.

Fix: extract check_apply_results() into a dependency-free module
(_apply_result_check.py) and call it immediately after ray.get():

  results = ray.get(object_refs)
  check_apply_results(results)    # logs + raises on any (False, msg)

The helper is in its own module so it can be unit-tested without torch,
ray, megatron, or any GPU runtime.

The SGLang receiver already catches every exception and returns (False,
error_msg) rather than propagating; the only gap was that slime never
checked the return value.

Closes THUDM#2104
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Delta weight sync: failed engine apply is silently swallowed, leaving sender snapshot ahead of receiver state

1 participant