Summary
In delta weight sync, the receiver-side apply (in SGLang's update_weights_from_disk / update_weights_from_distributed with load_format="delta") is wrapped in try/except and returns a (success, msg) tuple to slime over Ray. Slime's _finalize_sync then calls ray.get(...) on the resulting object refs but discards the return values — the success flag and the error message are never inspected.
Because the sender already advanced its pinned-CPU snapshot before the flush left the trainer, a silently failed apply leaves the sender's model of "what the receiver currently has" out of sync with the receiver's actual weights. Subsequent diffs are taken against the already-advanced snapshot, so the positions that the failed flush was carrying are never re-sent unless the trainer happens to change them again. The drift persists until the next snapshot reseed, which only happens on process restart, so in practice it lasts for the rest of the run.
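To make the drift concrete, here is a toy model of the sequence described above. Plain dicts stand in for weight tensors, and `compute_diff` / `run_sync` are hypothetical helpers for illustration only, not slime's actual code.

```python
# Toy model of the drift: plain dicts stand in for weight tensors.
def compute_diff(snapshot, current):
    """Return only the positions whose value changed since the snapshot."""
    return {k: v for k, v in current.items() if snapshot.get(k) != v}


def run_sync(trainer, snapshot, receiver, apply_ok):
    """One delta sync; returns the diff that was sent."""
    diff = compute_diff(snapshot, trainer)
    snapshot.update(diff)      # sender advances its snapshot first
    if apply_ok:               # receiver applies all-or-nothing
        receiver.update(diff)
    return diff                # note: apply_ok is never fed back to the sender


trainer = {"w0": 1.0, "w1": 1.0}
snapshot = dict(trainer)
receiver = dict(trainer)

trainer["w0"] = 2.0
sent_1 = run_sync(trainer, snapshot, receiver, apply_ok=False)  # silent failure

trainer["w1"] = 3.0
sent_2 = run_sync(trainer, snapshot, receiver, apply_ok=True)

# The second diff carries only w1, so the receiver keeps the stale w0.
drifted = {k for k in trainer if trainer[k] != receiver[k]}
```

After the second sync the sender believes the receiver is current, while `drifted` still contains the position that the failed flush was carrying.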
Where this is in the code
(commit 8f5e215, branch main)
- Sender snapshot is updated before the bucket is dispatched:
  slime/backends/megatron_utils/update_weight/update_weight_from_distributed_delta.py, around _enqueue_chunk, where self.delta_state.update_snapshot_async(hf_chunk) runs immediately after compute_diffs and before the encoded chunk is bucketed / flushed.
- Receiver return values are discarded at _finalize_sync:
    # update_weight_from_distributed_delta.py:811
    ray.get(object_refs)
    self._pending_publishes.clear()
    # ... no inspection of (success, msg) returned by
    # engine.update_weights_from_disk.remote(...)
  followed by ray.get([engine.continue_generation.remote(...) for ...]), i.e. rollout generation is resumed regardless of whether the apply succeeded.
The SGLang receiver (sgl-project/sglang#26519) does compute and verify a torch.hash_tensor-based checksum and raises RuntimeError on mismatch, but the outer _apply_delta_from_distributed / _apply_delta catch the exception, log it, and return (False, error_msg) — which slime then drops.
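The catch-and-return pattern on the receiver side can be sketched roughly as follows. This is a simplified illustration, not SGLang's code: `hashlib.sha256` stands in for the torch.hash_tensor-based checksum, a `bytearray` stands in for the model weights, and all function names are hypothetical.

```python
import hashlib


def checksum(payload: bytes) -> str:
    # Stand-in for the torch.hash_tensor-based checksum.
    return hashlib.sha256(payload).hexdigest()


def verify_and_apply(weights: bytearray, payload: bytes, expected: str) -> None:
    """Raise on checksum mismatch; write only after the payload verified."""
    if checksum(payload) != expected:
        raise RuntimeError("delta checksum mismatch")
    weights[:] = payload


def apply_delta(weights: bytearray, payload: bytes, expected: str):
    """Outer wrapper: converts any exception into a (success, msg) tuple."""
    try:
        verify_and_apply(weights, payload, expected)
    except Exception as exc:
        return False, str(exc)
    return True, "ok"


weights = bytearray(b"old!")
good = b"new!"
corrupted = b"neW!"  # one byte differs from the payload that was checksummed

result_bad = apply_delta(weights, corrupted, checksum(good))
weights_after_bad = bytes(weights)  # unchanged: no partial write happened
result_ok = apply_delta(weights, good, checksum(good))
```

The failure is reported only through the returned tuple, so a caller that ignores the tuple cannot distinguish `result_bad` from `result_ok`.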
Failure modes this can mask
- Bit corruption between encode and apply (covered by the checksum), e.g. transient NCCL transport errors or shared-FS corruption on the disk path.
- I/O failures during safetensors read on the disk transport (_apply_delta returns (False, ...) and continues).
- Decoding errors (_decode_delta_one_param / _decode_and_apply_delta_blob exceptions caught and swallowed).
In each case, the receiver model is correctly left untouched at that flush (so no partial-write corruption), but the sender snapshot has moved on, so the missed updates are never retried.
Why this hasn't been observed in practice (hypothesis)
NCCL collectives in an intra-cluster RDMA setup tend to either succeed or hang the entire job, so checksum mismatches on that path should be vanishingly rare. The disk transport on a shared filesystem in cross-DC deployments is where a silent apply failure becomes more likely.
Minimal proposed fix (open to alternatives)
The smallest change that closes the loop:
- In _finalize_sync, unpack the (success, msg) tuples returned by engine.update_weights_from_disk.remote(...) / engine.update_weights_from_distributed.remote(...).
- If any apply returns success=False:
  - Log a structured error including the engine identity and the receiver's error string.
  - Force the next sync to use full broadcast (or otherwise reseed DeltaState.snapshot) so the receiver and sender are realigned.
- Optional: expose a counter metric (update_weights_failures_total) so operators can monitor drift incidents.
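A minimal sketch of the check outlined in the steps above, written without Ray so it runs standalone. `SyncHealth`, `check_apply_results`, and the `force_full_sync` flag are hypothetical names for illustration; the `results` list is assumed to be what ray.get(object_refs) returns, one (success, msg) tuple per engine.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("delta_sync")


@dataclass
class SyncHealth:
    """Hypothetical holder for the recovery flag and the failure counter."""
    force_full_sync: bool = False
    update_weights_failures_total: int = 0
    last_errors: list = field(default_factory=list)


def check_apply_results(engine_ids, results, health):
    """Inspect per-engine (success, msg) tuples.

    Returns True only if every engine reported success.
    """
    failures = [
        (eid, msg) for eid, (ok, msg) in zip(engine_ids, results) if not ok
    ]
    for eid, msg in failures:
        logger.error("delta apply failed: engine=%s error=%s", eid, msg)
    if failures:
        health.update_weights_failures_total += len(failures)
        health.last_errors = failures
        health.force_full_sync = True  # next sync should reseed the snapshot
    return not failures


health = SyncHealth()
all_ok = check_apply_results(
    ["engine-0", "engine-1"],
    [(True, "ok"), (False, "delta checksum mismatch")],
    health,
)
```

In _finalize_sync this would presumably be called on the value of ray.get(object_refs) before generation is resumed. Whether a failure should set a reseed flag as shown, fail the sync, or raise upward is the open recovery-semantics question.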
Happy to send a PR if the maintainers think this direction is reasonable — open to suggestions on the recovery semantics (e.g. "fail the sync vs auto-reseed vs raise upward").
Reproduction
A non-trivial real-world reproduction would require deliberately injecting transport corruption (e.g. flipping a byte in a published safetensors file before the engine reads it) and observing that training continues to step but inference outputs gradually diverge from a known-good checkpoint. Happy to put together a targeted test if useful — would land alongside the fix.
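As a starting point for such a test, the byte-flip injection could look something like this. It is a sketch only: `flip_byte` is a hypothetical helper, and a small temporary file stands in for a published safetensors file.

```python
import os
import tempfile


def flip_byte(path, offset):
    """Invert one byte in place to simulate on-disk corruption."""
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError("offset is past end of file")
        f.seek(offset)
        f.write(bytes([original[0] ^ 0xFF]))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a published safetensors file.
    path = os.path.join(tmp, "delta.bin")
    payload = bytes(range(16))
    with open(path, "wb") as f:
        f.write(payload)

    flip_byte(path, offset=8)

    with open(path, "rb") as f:
        corrupted = f.read()
```

For a real safetensors file the offset would need to fall in the tensor data region, after the 8-byte length prefix and the JSON header. A flip inside the header is more likely to surface as a parse error than as a checksum mismatch, which would exercise the I/O failure path instead.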