Delta weight sync: failed engine apply is silently swallowed, leaving sender snapshot ahead of receiver state

## Summary

In delta weight sync, the receiver-side apply (in SGLang's `update_weights_from_disk` / `update_weights_from_distributed` with `load_format="delta"`) is wrapped in `try/except` and returns a `(success, msg)` tuple to slime over Ray. Slime's `_finalize_sync` then calls `ray.get(...)` on the resulting object refs but discards the return values — the success flag and the error message are never inspected.

Because the sender advances its pinned-CPU snapshot before the flush leaves the trainer, a silently failed apply leaves the sender's model of "what the receiver currently has" out of sync with the receiver's actual weights. Subsequent diffs are taken against the already-advanced snapshot, so the positions carried by the failed flush are never re-sent unless they happen to change again. The drift persists until the snapshot is next reseeded, which currently happens only on process restart, so in practice it lasts for the rest of the run.
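To make the drift concrete, here is a toy model of the sequence described above. Plain Python lists stand in for weight tensors, and every name (`compute_diff`, `apply_diff`, `snapshot`, `receiver`) is illustrative rather than taken from slime. It is a sketch of the failure pattern only, not of the real encoding.

```python
# Toy model of the drift; plain lists stand in for weight tensors.
# All names are illustrative and do not correspond to slime's code.

def compute_diff(snapshot, current):
    """Return {index: new_value} for positions that differ from the snapshot."""
    return {i: v for i, (s, v) in enumerate(zip(snapshot, current)) if s != v}


def apply_diff(weights, diff):
    """Write the changed positions into the receiver's weights."""
    for i, v in diff.items():
        weights[i] = v


trainer = [0.0, 0.0, 0.0, 0.0]
snapshot = list(trainer)  # sender's belief about the receiver's state
receiver = list(trainer)  # receiver's actual weights

# Sync 1: positions 0 and 1 change, but the receiver-side apply fails.
trainer[0], trainer[1] = 1.0, 2.0
diff_1 = compute_diff(snapshot, trainer)
snapshot = list(trainer)  # snapshot advances before dispatch
# (apply fails here; the receiver is left untouched and nobody notices)

# Sync 2: only position 2 changes, and the apply succeeds.
trainer[2] = 3.0
diff_2 = compute_diff(snapshot, trainer)
snapshot = list(trainer)
apply_diff(receiver, diff_2)

# Positions from the failed flush are never re-sent.
stale = [i for i, (r, t) in enumerate(zip(receiver, trainer)) if r != t]
print(diff_2)  # {2: 3.0}
print(stale)   # [0, 1]
```

The second diff contains only position 2, so positions 0 and 1 stay stale on the receiver even though every later sync reports nothing unusual.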

## Where this is in the code

(commit `8f5e215`, branch `main`)

- Sender snapshot is updated *before* the bucket is dispatched:
  `slime/backends/megatron_utils/update_weight/update_weight_from_distributed_delta.py`, around `_enqueue_chunk`, where `self.delta_state.update_snapshot_async(hf_chunk)` runs immediately after `compute_diffs` and before the encoded chunk is bucketed / flushed.

- Receiver return values are discarded at `_finalize_sync`:
  ```python
  # update_weight_from_distributed_delta.py:811
  ray.get(object_refs)
  self._pending_publishes.clear()
  # ... no inspection of (success, msg) returned by
  # engine.update_weights_from_disk.remote(...)
  ```
  followed by `ray.get([engine.continue_generation.remote(...) for ...])` — i.e. rollout generation is resumed regardless of whether the apply succeeded.

The SGLang receiver (sgl-project/sglang#26519) does compute and verify a `torch.hash_tensor`-based checksum and raises `RuntimeError` on mismatch, but the outer `_apply_delta_from_distributed` / `_apply_delta` catch the exception, log it, and return `(False, error_msg)` — which slime then drops.
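The shape of that interaction can be sketched without SGLang or Ray. In the sketch below, `hashlib.sha256` is a stand-in for the `torch.hash_tensor`-based checksum, and `verify_and_apply` / `apply_delta` are hypothetical names that only mirror the raise-then-catch structure described above; this is not the receiver's actual code.

```python
import hashlib


def verify_and_apply(weights: bytearray, blob: bytes, expected_digest: str) -> None:
    """Raise on checksum mismatch; mutate weights only after verification."""
    # sha256 is a stand-in here, not the checksum SGLang actually uses.
    if hashlib.sha256(blob).hexdigest() != expected_digest:
        raise RuntimeError("delta checksum mismatch")
    weights[: len(blob)] = blob


def apply_delta(weights: bytearray, blob: bytes, expected_digest: str):
    """Outer wrapper: converts any exception into a (success, msg) tuple."""
    try:
        verify_and_apply(weights, blob, expected_digest)
        return True, "ok"
    except Exception as exc:
        return False, str(exc)


weights = bytearray(4)
good = b"\x01\x02\x03\x04"
digest = hashlib.sha256(good).hexdigest()
corrupted = b"\x01\x02\x03\x05"  # last byte differs from what was encoded

# The failure is only visible if the caller looks at the returned tuple.
result = apply_delta(weights, corrupted, digest)
print(result)  # (False, 'delta checksum mismatch')
```

The weights are untouched after the failed call, which matches the "no partial write" behaviour, but a caller that drops `result` has no way to know the apply did not happen.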

## Failure modes this can mask

- Bit corruption between encode and apply (covered by the checksum), e.g. transient NCCL transport errors or shared-FS corruption on the disk path.
- I/O failures during safetensors read on the disk transport (`_apply_delta` returns `(False, ...)` and continues).
- Decoding errors (`_decode_delta_one_param` / `_decode_and_apply_delta_blob` exceptions caught and swallowed).

In each case, the receiver model is correctly left untouched at that flush (so no partial-write corruption), but the sender snapshot has moved on, so the missed updates are never retried.

## Why this hasn't been observed in practice (hypothesis)

NCCL collectives in an intra-cluster RDMA setup typically either succeed or hang the entire job, so checksum mismatches on the distributed path should be vanishingly rare. The disk transport on a shared filesystem, especially in cross-datacenter deployments, is where a silent failure becomes more likely.

## Minimal proposed fix (open to alternatives)

The smallest change that closes the loop:

1. In `_finalize_sync`, unpack the `(success, msg)` tuples returned by `engine.update_weights_from_disk.remote(...)` / `engine.update_weights_from_distributed.remote(...)`.
2. If any apply returns `success=False`:
   - Log a structured error including the engine identity and the receiver's error string.
   - Force the next sync to use full broadcast (or otherwise reseed `DeltaState.snapshot`) so the receiver and sender are realigned.
3. Optional: expose a counter metric (`update_weights_failures_total`) so operators can monitor drift incidents.
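A minimal sketch of steps 1 and 2, assuming `ray.get(object_refs)` yields one `(success, msg)` tuple per engine as described in the summary. `SyncOutcome`, `check_apply_results`, and the engine ID strings are hypothetical names introduced for illustration; the function takes the already-fetched results so it can be exercised without Ray.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class SyncOutcome:
    """Hypothetical container for the result of one delta sync."""
    failures: list = field(default_factory=list)  # [(engine_id, msg), ...]

    @property
    def needs_full_resync(self) -> bool:
        # Any failed apply means the sender snapshot can no longer be trusted.
        return bool(self.failures)


def check_apply_results(engine_ids, results) -> SyncOutcome:
    """Inspect (success, msg) tuples, e.g. the list returned by ray.get(object_refs)."""
    outcome = SyncOutcome()
    for engine_id, (success, msg) in zip(engine_ids, results):
        if not success:
            logger.error("delta apply failed: engine=%s error=%s", engine_id, msg)
            outcome.failures.append((engine_id, msg))
    return outcome


outcome = check_apply_results(
    ["engine-0", "engine-1"],
    [(True, "ok"), (False, "delta checksum mismatch")],
)
print(outcome.needs_full_resync)  # True
print(outcome.failures)           # [('engine-1', 'delta checksum mismatch')]
```

In `_finalize_sync`, a caller could use `needs_full_resync` to schedule a full broadcast (or snapshot reseed) for the next sync, and add `len(outcome.failures)` to the optional counter. Whether to resume generation, retry, or raise at that point is the recovery-semantics question left open below.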

Happy to send a PR if the maintainers think this direction is reasonable — open to suggestions on the recovery semantics (e.g. "fail the sync vs auto-reseed vs raise upward").

## Reproduction

A realistic reproduction requires deliberately injecting transport corruption (e.g. flipping a byte in a published safetensors file before the engine reads it) and then observing that training continues to step while inference outputs gradually diverge from a known-good checkpoint. Happy to put together a targeted test if useful; it would land alongside the fix.
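The corruption-injection step could look roughly like the helper below. `flip_byte` is a hypothetical test utility, and the temporary `delta.bin` file is a stand-in for a published safetensors file; hooking this into the actual publish path is not shown.

```python
import os
import tempfile


def flip_byte(path: str, offset: int) -> None:
    """XOR one byte in place with 0xFF to simulate transport corruption."""
    with open(path, "r+b") as f:
        f.seek(offset)
        original = f.read(1)
        if not original:
            raise ValueError(f"offset {offset} is past the end of {path}")
        f.seek(offset)
        f.write(bytes([original[0] ^ 0xFF]))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "delta.bin")  # stand-in for a published file
    payload = bytes(range(16))
    with open(path, "wb") as f:
        f.write(payload)

    flip_byte(path, 5)

    with open(path, "rb") as f:
        corrupted = f.read()

# Exactly one byte differs from the original payload.
changed = [i for i, (a, b) in enumerate(zip(payload, corrupted)) if a != b]
print(changed)  # [5]
```

For a real safetensors file, the offset should probably land in the tensor data region (after the 8-byte length prefix and the JSON header) so the test exercises the checksum path rather than a header parse error.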
