Delta weight sync: failed engine apply is silently swallowed, leaving sender snapshot ahead of receiver state #2104

Description

@ChangyiYang

Summary

In delta weight sync, the receiver-side apply (in SGLang's update_weights_from_disk / update_weights_from_distributed with load_format="delta") is wrapped in try/except and returns a (success, msg) tuple to slime over Ray. Slime's _finalize_sync then calls ray.get(...) on the resulting object refs but discards the return values — the success flag and the error message are never inspected.

Because the sender has already advanced its pinned-CPU snapshot before the flush leaves the trainer, a silently failed apply leaves the sender's model of "what the receiver currently has" out of sync with the receiver's actual weights. Subsequent diffs are taken against the already-advanced snapshot, so the positions that the failed flush was carrying are never re-sent. The drift persists until the next snapshot reseed, which only happens on process restart, so in practice it lasts for the rest of the run.
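
To make the drift concrete, here is a toy simulation. It is only a sketch: weights are plain dicts of integers, and `compute_diff` and `simulate` are hypothetical names for illustration, not slime or SGLang code. It shows that once one apply fails and the result is ignored, later diffs never carry the missed entry again.

```python
def compute_diff(snapshot, current):
    """Return only the entries of `current` that differ from `snapshot`."""
    return {k: v for k, v in current.items() if snapshot.get(k) != v}


def simulate(fail_on_step):
    """Run three sync steps; the apply at `fail_on_step` fails silently."""
    trainer = {"w0": 0, "w1": 0}
    snapshot = dict(trainer)   # sender's belief about the receiver's state
    receiver = dict(trainer)   # receiver's actual weights
    updates = [{"w0": 1}, {"w1": 2}, {"w1": 3}]  # one training step each

    for step, update in enumerate(updates):
        trainer.update(update)
        delta = compute_diff(snapshot, trainer)
        # The snapshot advances before the apply is confirmed.
        snapshot.update(delta)
        if step != fail_on_step:
            receiver.update(delta)  # apply succeeded
        # Otherwise the apply failed: the receiver is untouched and the
        # sender never looks at the result.
    return trainer, snapshot, receiver


trainer, snapshot, receiver = simulate(fail_on_step=0)
print("trainer :", trainer)
print("snapshot:", snapshot)
print("receiver:", receiver)
```

With the first apply failing, the snapshot matches the trainer at the end, but the receiver still holds the stale value for `w0`, because steps 1 and 2 only diff `w1`.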

Where this is in the code

(commit 8f5e215, branch main)

  • Sender snapshot is updated before the bucket is dispatched:
    slime/backends/megatron_utils/update_weight/update_weight_from_distributed_delta.py, around _enqueue_chunk, where self.delta_state.update_snapshot_async(hf_chunk) runs immediately after compute_diffs and before the encoded chunk is bucketed / flushed.

  • Receiver return values are discarded at _finalize_sync:

    # update_weight_from_distributed_delta.py:811
    ray.get(object_refs)
    self._pending_publishes.clear()
    # ... no inspection of (success, msg) returned by
    # engine.update_weights_from_disk.remote(...)

    followed by ray.get([engine.continue_generation.remote(...) for ...]) — i.e. rollout generation is resumed regardless of whether the apply succeeded.

The SGLang receiver (sgl-project/sglang#26519) does compute and verify a torch.hash_tensor-based checksum and raises RuntimeError on mismatch, but the outer _apply_delta_from_distributed / _apply_delta catch the exception, log it, and return (False, error_msg) — which slime then drops.

Failure modes this can mask

  • Bit corruption between encode and apply (covered by the checksum), e.g. transient NCCL transport errors or shared-FS corruption on the disk path.
  • I/O failures during safetensors read on the disk transport (_apply_delta returns (False, ...) and continues).
  • Decoding errors (_decode_delta_one_param / _decode_and_apply_delta_blob exceptions caught and swallowed).

In each case, the receiver model is correctly left untouched at that flush (so no partial-write corruption), but the sender snapshot has moved on, so the missed updates are never retried.

Why this hasn't been observed in practice (hypothesis)

NCCL collectives in an intra-cluster RDMA setup tend to either succeed or hang the entire job, so checksum mismatches are vanishingly rare. The disk transport on a shared filesystem in cross-DC deployments is where a silent failure becomes more likely.

Minimal proposed fix (open to alternatives)

The smallest change that closes the loop:

  1. In _finalize_sync, unpack the (success, msg) tuples returned by engine.update_weights_from_disk.remote(...) / engine.update_weights_from_distributed.remote(...).
  2. If any apply returns success=False:
    • Log a structured error including the engine identity and the receiver's error string.
    • Force the next sync to use full broadcast (or otherwise reseed DeltaState.snapshot) so the receiver and sender are realigned.
  3. Optional: expose a counter metric (update_weights_failures_total) so operators can monitor drift incidents.
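
The steps above could be sketched roughly as follows. This is a minimal illustration, not a patch: it assumes each object ref resolves to a `(success, msg)` tuple as described in the summary, and `check_apply_results`, `SyncState`, `force_full_sync`, and `failures_total` are hypothetical names that do not exist in slime today.

```python
import logging
from typing import List, Sequence, Tuple

logger = logging.getLogger("delta_sync")


def check_apply_results(
    engine_ids: Sequence[str],
    results: Sequence[Tuple[bool, str]],
) -> List[Tuple[str, str]]:
    """Return (engine_id, msg) for every apply that reported success=False."""
    failures = []
    for engine_id, (success, msg) in zip(engine_ids, results):
        if not success:
            logger.error("delta apply failed: engine=%s error=%s", engine_id, msg)
            failures.append((engine_id, msg))
    return failures


class SyncState:
    """Hypothetical stand-in for the sender-side bookkeeping."""

    def __init__(self) -> None:
        self.force_full_sync = False  # next sync should be a full broadcast
        self.failures_total = 0       # would back a counter metric

    def record(self, failures: List[Tuple[str, str]]) -> None:
        if failures:
            self.failures_total += len(failures)
            self.force_full_sync = True


# In _finalize_sync this would be fed with: results = ray.get(object_refs)
state = SyncState()
failures = check_apply_results(
    ["engine-0", "engine-1"],
    [(True, ""), (False, "checksum mismatch")],
)
state.record(failures)
```

Whether a failure should set a reseed flag like this, fail the sync, or raise upward is exactly the recovery-semantics question left open below; the flag is just the least invasive option to illustrate.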

Happy to send a PR if the maintainers think this direction is reasonable — open to suggestions on the recovery semantics (e.g. "fail the sync vs auto-reseed vs raise upward").

Reproduction

A non-trivial real-world reproduction would require deliberately injecting transport corruption (e.g. flipping a byte in a published safetensors file before the engine reads it) and observing that training continues to step but inference outputs gradually diverge from a known-good checkpoint. Happy to put together a targeted test if useful — would land alongside the fix.
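
As a starting point for such a test, the byte flip itself is easy to inject. The helper below is a sketch using only the standard library; `flip_byte` and `demo` are hypothetical names, and the temporary file stands in for a published safetensors file rather than being one.

```python
import os
import tempfile


def flip_byte(path: str, offset: int) -> None:
    """XOR the byte at `offset` with 0xFF, modifying the file in place."""
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError(f"offset {offset} is past the end of {path}")
        f.seek(offset)
        f.write(bytes([original[0] ^ 0xFF]))


def demo() -> bytes:
    """Write four bytes to a temp file, corrupt one, and return the contents."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\x00\x01\x02\x03")
        flip_byte(path, 2)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


corrupted = demo()
```

A real test would call something like this between the trainer publishing the delta file and the engine reading it, then assert that the sync reports the failure instead of continuing silently.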
