diff --git a/examples/README.md b/examples/README.md
index 128b1562d4..c0357cecba 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -8,11 +8,11 @@ These examples provide concrete examples to leverage slime in your own RL workfl
 - **[fully_async](./fully_async)**: Demonstrates fully asynchronous rollout generation for higher efficiency.
 - **[geo3k_vlm](./geo3k_vlm)**: Training VLMs on a single-turn reasoning task using GRPO on the GEO3K dataset.
 - **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)**: VLM multi-turn training on Geo3k dataset.
-- **[low_precision](./low_precision)**: Examples of FP8 training and inference for improved throughput and stability.
+- **[low_precision](../docs/en/advanced/low-precision.md)**: Examples of FP8 training and inference for improved throughput and stability.
 - **[multi_agent](./multi_agent)**: Example of running multi-agent RL with `slime`.
 - **[on_policy_distillation](./on_policy_distillation)**: Example implementation for on-policy distillation, extending the reinforcement learning pipeline to support teacher–student distillation directly within on-policy training.
 - **[delta_weight_sync](./delta_weight_sync)**: Non-colocated weight sync that ships only changed positions + values over disk (training/inference disaggregation) or NCCL.
-- **[reproducibility](./reproducibility)**: Guides on achieving bitwise experiment reproduction using deterministic modes.
+- **[reproducibility](../docs/en/advanced/reproducibility.md)**: Guides on achieving bitwise experiment reproduction using deterministic modes.
 - **[retool](./retool)**: Demonstrates the retool functionality for tool-enabled language model generation.
 - **[search-r1](./search-r1)**: A minimal reproduction of Search-R1, featuring multi-turn conversation and tool-calling.
 - **[strands_sglang](./strands_sglang)**: Integration example with the Strands-Agents scaffolding framework.
diff --git a/examples/delta_weight_sync/README.md b/examples/delta_weight_sync/README.md
index b2c7521578..92e519a97e 100644
--- a/examples/delta_weight_sync/README.md
+++ b/examples/delta_weight_sync/README.md
@@ -48,20 +48,15 @@ See [docs/en/advanced/delta-weight-sync.md](../../docs/en/advanced/delta-weight-
 
 ## Results
 
-W&B traces comparing delta sync against the full-sync baseline on GLM-4.7-355B-A32B / DAPO-Math-17k.
+W&B traces comparing delta sync against the full-sync baseline on GLM-4.7-355B-A32B / DAPO-Math-17k track:
 
-![Raw reward](./raw_reward.png)
-
-![Train/rollout logprob abs diff](./train_rollout_logprob_abs_diff.png)
-
-![Update weights time](./update_weights_time.png)
+- `raw_reward` — training reward curve vs full-sync baseline
+- `train/train_rollout_logprob_abs_diff` — token-level logprob mismatch between train and rollout
+- `perf/update_weights_time` — wall time per weight sync
+- `perf/update_weights_density` — fraction of weight positions that moved between consecutive syncs (sync 0 omitted: snapshot-seeding pass with density = 1.0)
 
 > **Note on the small curve-to-curve gap.** RL training is inherently non-deterministic (cuBLAS reductions, FlashAttention split-K, NCCL all-reduce ordering, dynamic-batch token assignment). Two identically-configured *full*-sync runs would diverge the same way. Delta sync's selective overwrite is bit-exact with full sync per step (no arithmetic, no drift); the trajectory matches, the bits don't.
 
-![Update weights density](./update_weights_density.png)
-
-*Per-sync change density (`perf/update_weights_density`) — fraction of weight positions that moved between consecutive syncs. Sync 0 is omitted: it's the snapshot-seeding pass with density = 1.0, which would compress the y-axis.*
-
 ## Why these encoding defaults
 
 Per-sync change density during RL fine-tuning at conservative LRs sits around **2-3%** ([arXiv:2602.03839](https://arxiv.org/pdf/2602.03839) reports ~1% on a related setup; we measured ~2-3% on this run). Below the 3.125% break-even point, gap-encoded positions are smaller than absolute indices — the disk default `deltas_zstd` adds zstd L1 on top to squeeze the gap byte stream further (~35-40%), which is the right tradeoff when shared-FS bandwidth is ≤ 300 MB/s. Intra-datacenter NCCL has no bandwidth pressure, so `indices` (lowest compute, biggest payload) is the cleaner default there.
diff --git a/slime_plugins/rollout_buffer/README.md b/slime_plugins/rollout_buffer/README.md
index e85e68d89c..899ee5f09d 100644
--- a/slime_plugins/rollout_buffer/README.md
+++ b/slime_plugins/rollout_buffer/README.md
@@ -40,7 +40,7 @@ In addition, Rollout Buffer also provides some customizable functions to meet sp
 
 ### Example Script
 
-First, you need to follow [Example: Qwen3-4B Model](../../docs/en/models/qwen3-4B.md) to configure the environment, download data and convert model checkpoints. And then run the following scripts:
+First, you need to follow [Example: Qwen3-4B Model](../../docs/en/examples/qwen3-4B.md) to configure the environment, download data and convert model checkpoints. And then run the following scripts:
 ```bash
 cd slime_plugins/rollout_buffer
 bash rollout_buffer_example.sh
diff --git a/slime_plugins/rollout_buffer/README_zh.md b/slime_plugins/rollout_buffer/README_zh.md
index cfa689fe71..e6e9d2141d 100644
--- a/slime_plugins/rollout_buffer/README_zh.md
+++ b/slime_plugins/rollout_buffer/README_zh.md
@@ -40,7 +40,7 @@ generator/
 
 ### 示例脚本
 
-请仿照 [示例:Qwen3-4B 模型](../../docs/zh/models/qwen3-4B.md) 文档中配置好 slime 的运行环境,下载数据,并转换模型 ckpt。之后分别运行
+请仿照 [示例:Qwen3-4B 模型](../../docs/zh/examples/qwen3-4B.md) 文档中配置好 slime 的运行环境,下载数据,并转换模型 ckpt。之后分别运行
 ```bash
 cd slime_plugins/rollout_buffer