Opaque HTTP rollout endpoint with publish-only delta sync #5

Open
nanjiangwill wants to merge 14 commits into disk-delta-weight-sync from disaggregated-rollout


@nanjiangwill nanjiangwill commented Jun 16, 2026


Stacked on the disk-level delta weight sync branch. Adds --rollout-endpoint-url to train against an elastic rollout fleet behind a single opaque HTTP endpoint — no per-engine handles, no router worker APIs.

Three things follow from that one flag:

  • Generation routes to the URL (get_model_url); the rollout server holds no engines (reuses ExternalRolloutServer with empty engines); placement allocates 0 rollout GPUs.
  • Weights are published, not pushed: the disk-delta updater writes each version to --update-weight-disk-dir and advances a latest pointer (via the existing pre-push commit hook), skipping the per-engine update_weights_from_disk/pause/resume RPCs. The fleet pulls and hot-loads on its own.
  • Abort has no router worker list to query, since an opaque endpoint exposes none. When surplus samples are discarded, it cancels slime's local pending requests, and the resulting client disconnect aborts the corresponding fleet-side requests; with --partial-rollout, the streaming generation tasks break out on their own and return the partial trajectories for resumption (each tagged with the weight version it stopped at).
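The publish-only flow in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the per-version subdirectory, the `delta.bin` file name, and the representation of `latest` as a small text file are all assumptions made for the example, as are the function names `publish_version` and `read_latest`.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def publish_version(disk_dir: str, version: int, payload: bytes) -> Path:
    """Write one weight version, then advance the `latest` pointer.

    Hypothetical layout (not taken from the PR):
        <disk_dir>/v<version>/delta.bin
        <disk_dir>/latest        (text file holding the version number)
    """
    root = Path(disk_dir)
    version_dir = root / f"v{version}"
    version_dir.mkdir(parents=True, exist_ok=True)

    # 1. Write the payload completely before anything points at it.
    data_path = version_dir / "delta.bin"
    with open(data_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    # 2. Advance the pointer with write-to-temp + rename, so a reader on a
    #    local POSIX filesystem never sees a half-written pointer. Shared or
    #    non-POSIX filesystems may need extra coherence steps.
    fd, tmp_path = tempfile.mkstemp(dir=root, prefix=".latest.")
    with os.fdopen(fd, "w") as f:
        f.write(str(version))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, root / "latest")
    return data_path


def read_latest(disk_dir: str) -> Optional[int]:
    """What a pulling rollout host might do: read the pointer, if present."""
    pointer = Path(disk_dir) / "latest"
    if not pointer.exists():
        return None
    return int(pointer.read_text().strip())
```

The ordering is the point of the sketch: the trainer never calls into an engine, and a fleet host that polls `latest` only ever sees versions whose payload is already fully written.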

The weight version each trajectory was generated with already flows back via sglang's meta_info["weight_version"] into Sample.weight_versions, so staleness handling stays the algorithm's concern — unchanged here.
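As an illustration of what "the algorithm's concern" could look like, here is a small sketch that derives a staleness value from the recorded versions. The `Sample` class below is a stand-in that models only the `weight_versions` field; treating versions as integer-like strings and measuring staleness against the oldest recorded version are assumptions for this example, not behavior defined by the PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sample:
    """Stand-in for slime's Sample; only the field discussed above is modeled."""
    weight_versions: List[str] = field(default_factory=list)


def staleness(sample: Sample, trainer_version: int) -> Optional[int]:
    """Versions elapsed since the oldest weights that contributed to the sample.

    Assumes versions are integer-like strings; returns None if none were recorded.
    """
    if not sample.weight_versions:
        return None
    oldest = min(int(v) for v in sample.weight_versions)
    return trainer_version - oldest
```

A training loop could use such a value to drop or down-weight trajectories, but that policy is outside the scope of this change.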

Add-only: managed and external-addressed paths are byte-for-byte unchanged. Requires --update-weight-mode delta --update-weight-transport disk (disk is the only cross-cluster channel; full checkpoints are too large, hence deltas) and is non-colocate; --partial-rollout requires the streaming generation path.
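The constraints above amount to a handful of argument checks. The sketch below is illustrative only: the attribute names for the documented flags follow the usual argparse dash-to-underscore mapping, while `colocate` and `streaming_rollout` are hypothetical names chosen for the example, and the real checks live inside slime's own argument validation rather than in a standalone function like this one.

```python
from argparse import Namespace


def validate_rollout_endpoint_args(args: Namespace) -> None:
    """Reject configurations that the opaque-endpoint mode cannot support."""
    if not getattr(args, "rollout_endpoint_url", None):
        return  # endpoint mode is off; nothing to check

    if args.update_weight_mode != "delta":
        raise ValueError("--rollout-endpoint-url requires --update-weight-mode delta")
    if args.update_weight_transport != "disk":
        raise ValueError("--rollout-endpoint-url requires --update-weight-transport disk")
    # `colocate` is a hypothetical attribute name for the colocated setup.
    if getattr(args, "colocate", False):
        raise ValueError("--rollout-endpoint-url cannot be combined with colocation")
    # `streaming_rollout` is a hypothetical attribute name for the streaming path.
    if getattr(args, "partial_rollout", False) and not getattr(args, "streaming_rollout", False):
        raise ValueError("--partial-rollout with an endpoint requires the streaming generation path")
```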

Follow-ups (not in this PR): CPU-mockable unit tests for the new path, docs (external-rollout-engines / delta-weight-sync), and an e2e test against a mock pulling endpoint.

@nanjiangwill nanjiangwill force-pushed the disaggregated-rollout branch 2 times, most recently from f506c33 to 94ac929 Compare June 17, 2026 04:21
@nanjiangwill nanjiangwill force-pushed the disk-delta-weight-sync branch from fd7c00d to a0b4b09 Compare June 17, 2026 18:17
@nanjiangwill nanjiangwill force-pushed the disaggregated-rollout branch 2 times, most recently from ec8b0b2 to 4536d5d Compare June 17, 2026 19:30
zhuzilin and others added 2 commits June 18, 2026 12:11

Co-authored-by: EazyReal <8047065+EazyReal@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@jvmncs jvmncs force-pushed the disaggregated-rollout branch 2 times, most recently from b4ccec2 to 570cd0b Compare June 18, 2026 23:43
zhuzilin and others added 11 commits June 19, 2026 09:56
Ship only the changed bytes between weight syncs as a canonical HF delta
checkpoint; rollout hosts apply it into a host-local checkpoint and reload via
the vanilla update_weights_from_disk path. Replaces the NCCL delta transport
from THUDM#1806 with a disk-only path that needs no engine-side delta support.

sync_local_checkpoint (was sync_weights) materializes the base lazily via the
idempotent init_local_checkpoint instead of a background thread; record per-sync
update time in update_weight_metrics; state the pre-read/pre-push hooks' purpose
(non-POSIX filesystem coherence).

The actor's update_weights is already @timer-wrapped (perf/update_weights_time),
so the per-sync total/publish/reload breakdown was duplicate instrumentation.
Keep only the delta-specific metrics (density, wire bytes).

The delta scaffold reworked the update-weight args: delta requires
--update-weight-transport=disk (was nccl-or-disk), needs
--update-weight-local-checkpoint-dir, and the --update-weight-delta-dir
compatibility alias is gone (the directory belongs to the transport, not the
encoding). Drop the alias resolve/backfill/conflict tests, point the transport
and colocate tests at the disk path, and cover the local-checkpoint requirement.

With the delta-dir alias gone, _resolve_update_weight_disk_dir no longer
normalizes anything; it's a single transport-level check, so fold it into
_validate_update_weight_args.

slime_validate_args validates everything else inline; the extracted
_validate_update_weight_args was the lone exception. Fold it in and test it
the same way as the other slime_validate_args checks (make_slime_validate_args).

Materialize the host-local checkpoint in a daemon thread at engine init so the
one-time base copy overlaps sglang launch and the first rollout (which serves
from init-loaded weights) instead of blocking the first delta reload. The first
sync_local_checkpoint's init_local_checkpoint is idempotent and flock-guarded,
so it either finds the copy done or blocks on the same lock, and no join is needed.
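
The "idempotent and flock-guarded" pattern in the last commit message can be sketched as follows. Only the name `init_local_checkpoint` comes from the commit text; its signature, the `.lock` and `.complete` file names, the `start_background_init` helper, and the choice to discard a partial copy are assumptions made for illustration. The sketch relies on `fcntl`, so it is Unix-only.

```python
import fcntl
import shutil
import threading
from pathlib import Path


def init_local_checkpoint(base_dir: str, local_dir: str) -> None:
    """Copy the base checkpoint to a host-local dir at most once.

    Safe to call from several threads or processes: callers serialize on an
    exclusive file lock, and a marker file makes later calls a no-op.
    """
    local = Path(local_dir)
    local.parent.mkdir(parents=True, exist_ok=True)
    done_marker = local / ".complete"          # hypothetical marker name
    with open(str(local) + ".lock", "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another caller copies
        try:
            if done_marker.exists():
                return                         # copy already finished
            if local.exists():
                shutil.rmtree(local)           # discard an interrupted partial copy
            shutil.copytree(base_dir, local)
            done_marker.touch()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def start_background_init(base_dir: str, local_dir: str) -> threading.Thread:
    """Kick off the one-time copy in a daemon thread at engine init."""
    thread = threading.Thread(
        target=init_local_checkpoint, args=(base_dir, local_dir), daemon=True
    )
    thread.start()
    return thread
```

A later foreground call to `init_local_checkpoint` either sees the marker and returns immediately or waits on the lock until the background copy finishes, which is why the thread never has to be joined explicitly.
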
Add --rollout-endpoint-url to train against an elastic rollout fleet behind a single
opaque HTTP endpoint (no per-engine handles). Generation routes to the URL; the
disk-delta updater publishes each version to the shared disk dir and advances a
`latest` pointer for the fleet to pull, instead of pushing via per-engine RPCs.

Add-only: managed and external-addressed paths are unchanged. Requires delta mode +
disk transport (the only cross-cluster channel) and is non-colocate.

- abort: the endpoint cancels surplus when discarding; with --partial-rollout it
  drains so streaming tasks return their partials (closing each stream disconnects,
  which aborts the fleet-side request). Endpoint + --partial-rollout requires the
  streaming rollout, the only path that captures partials client-side.
- streaming: record the weight version on an aborted partial so off-policy correction
  can weight it (update_from_meta_info is skipped without a finish_reason).

Add --custom-rollout-request-hook-path: a hook that can mutate each outgoing
generate request (payload/headers/max_retries/retry_sleep) before it is sent,
with rollout_id and evaluation context threaded in via rollout_request_context.

Weight-version gating is left to the hook rather than a core arg: the right
target version is loop-specific (it only equals rollout_id in the synchronous
loop) and the useful knob — a staleness offset — isn't expressible as an enum.
Thread retry_sleep through http_utils.post so a gating hook can back off while
waiting for the fleet to load a version. A documented example hook shows
min/exact weight-version gating.
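
For orientation, a gating hook along these lines might look like the sketch below. The actual hook signature and the way an endpoint expects a version constraint are not shown in this conversation, so everything here is hypothetical: the function name, the dict-shaped `request` and `context` arguments, the `X-Min-Weight-Version` header, and the `STALENESS_OFFSET` constant. Only the mutable fields (payload, headers, max_retries, retry_sleep) and the context contents (rollout_id, evaluation) are taken from the commit message.

```python
from typing import Any, Dict

STALENESS_OFFSET = 1  # hypothetical knob: how many versions behind is acceptable


def gate_on_weight_version(request: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Ask the fleet for weights no older than rollout_id - STALENESS_OFFSET.

    `request` is assumed to carry payload/headers/max_retries/retry_sleep and
    `context` to carry rollout_id and an evaluation flag.
    """
    if context.get("evaluation"):
        return request  # leave evaluation traffic ungated in this sketch

    min_version = max(0, context["rollout_id"] - STALENESS_OFFSET)
    headers = dict(request.get("headers") or {})
    headers["X-Min-Weight-Version"] = str(min_version)  # hypothetical header

    # Back off between retries while the fleet is still loading the version.
    return {**request, "headers": headers, "max_retries": 30, "retry_sleep": 2.0}
```

Expressing the target as `rollout_id` minus an offset mirrors the rationale above: the right version depends on the training loop, so it is computed in the hook rather than fixed by a core argument.
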
@jvmncs jvmncs force-pushed the disaggregated-rollout branch 2 times, most recently from 367ec82 to e2c6806 Compare June 20, 2026 07:30
needed this for moonlight to be used w/ update_weight_from_disk_delta
@jvmncs jvmncs force-pushed the disaggregated-rollout branch from e2c6806 to ebfe153 Compare June 20, 2026 23:24
4 participants