This issue scopes the P0.3 roadmap item for an end-to-end log-prob cross-benchmark tool that compares rollout engines with training engines.
Target identity:
same model + same prompt/completion tokens + same policy state
=> rollout-engine selected logprobs match training-engine selected logprobs
This is the executable counterpart to the train-inference consistency roadmap (#83) and the batch-invariant consistency RFC (#101). #101 defines the broad invariance contract; this issue should produce the tool that contributors and maintainers run to measure whether real rollout and training stacks obey that contract.
Problem
Rollout and training paths rarely compute logprobs through the same execution shape:
rollout engines such as vLLM or sglang use dynamic batching, chunked prefill, paged KV cache, prefix cache, and decode-time scheduling;
training engines such as Megatron, DeepSpeed, FSDP, or a torch/HF reference path use teacher-forced forward passes, micro-batches, packing, padding masks, and different distributed layouts;
model weights may be synchronized correctly while token shifting, position ids, masks, dtype casts, or reduction order still drift;
selected logprobs may agree on small examples but diverge under long completions, ragged batches, TP>1, fp16/bf16, or prefix-cache reuse.
GRPO/PPO consume logprob deltas directly through ratios and KL terms, so even small rollout-vs-training drift can change the policy update. We need one reproducible cross-benchmark that captures the same logical sequence and reports exactly where the two engines disagree.
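To make the sensitivity concrete, here is a minimal sketch of how a logprob delta turns into an importance ratio. The function names and the numbers (a 0.01 nat per-token drift, a 512-token completion) are illustrative assumptions, not measurements from any engine.

```python
import math

def importance_ratio(train_logprob: float, rollout_logprob: float) -> float:
    """Per-token ratio exp(logp_train - logp_rollout)."""
    return math.exp(train_logprob - rollout_logprob)

def sequence_ratio(train_logprobs, rollout_logprobs) -> float:
    """Sequence-level ratio: exp of the summed per-token logprob deltas."""
    delta = sum(t - r for t, r in zip(train_logprobs, rollout_logprobs))
    return math.exp(delta)

# Same policy, same tokens, no drift: the ratio is exactly 1.0.
no_drift = importance_ratio(-1.25, -1.25)

# Assumed drift of 0.01 nat on every token of a 512-token completion.
rollout = [-1.25] * 512
train = [lp + 0.01 for lp in rollout]

per_token = importance_ratio(train[0], rollout[0])  # close to exp(0.01)
per_sequence = sequence_ratio(train, rollout)       # close to exp(5.12)
print(no_drift, per_token, per_sequence)
```

A per-token ratio near exp(0.01) looks harmless, but wherever deltas are accumulated over a sequence the same drift compounds to roughly exp(5.12), which is why the benchmark reports both per-token and per-sequence error.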
Scope
Build a benchmark harness that can:
run or ingest rollout outputs from one or more rollout engines, starting with a minimal local path and optional vLLM/sglang adapters;
replay the exact same prompt/completion token ids through one or more training engines, starting with a torch/HF reference and optional Megatron/DeepSpeed/FSDP adapters;
compare selected token logprobs for current policy, old policy, and reference policy when available;
preserve and report token ids, position ids, attention masks, completion masks, prompt/completion boundaries, dtype, engine config, model revision, and policy/weight-sync metadata;
measure per-token, per-sequence, and batch-level drift: max abs error, relative error, mean error, sequence sum/mean error, and KL-input drift.
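The drift metrics named in the list above could be computed for a single sequence roughly as follows. This is a sketch under assumptions: the function name, the dictionary keys, and the choice to normalize relative error by the rollout logprob are all hypothetical, and KL-input drift is left out because the issue does not define it.

```python
def drift_metrics(rollout_logprobs, train_logprobs, eps=1e-12):
    """Compare selected logprobs from two engines for one sequence."""
    if len(rollout_logprobs) != len(train_logprobs):
        raise ValueError("both engines must score the same tokens")
    if not rollout_logprobs:
        raise ValueError("need at least one scored token")

    diffs = [t - r for r, t in zip(rollout_logprobs, train_logprobs)]
    abs_diffs = [abs(d) for d in diffs]
    # Index of the token with the largest absolute disagreement.
    worst = max(range(len(diffs)), key=abs_diffs.__getitem__)
    n = len(diffs)
    return {
        "max_abs_error": abs_diffs[worst],
        "worst_position": worst,
        "mean_error": sum(diffs) / n,  # signed, so it exposes systematic bias
        "mean_abs_error": sum(abs_diffs) / n,
        # Assumption: relative error is measured against the rollout value.
        "max_rel_error": max(
            a / max(abs(r), eps) for a, r in zip(abs_diffs, rollout_logprobs)
        ),
        "sequence_sum_error": sum(train_logprobs) - sum(rollout_logprobs),
    }

metrics = drift_metrics([-0.5, -2.0, -1.0], [-0.5, -2.25, -1.0])
print(metrics)
```

For a single sequence the sequence-mean error equals `mean_error`; batch-level numbers would aggregate these dictionaries across sequences.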
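Proposed CLI Shape
The final interface can evolve, but the first version should support a single documented command for the full comparison.
A smaller deterministic smoke command should also exist for CI, using a tiny model or synthetic fixture where optional engine dependencies can be skipped.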
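The issue does not fix the command's name or flags. Purely as an illustration of what such an entry point could look like, the sketch below uses `argparse`; the program name `logprob-xbench`, every flag, every default, and the model name `tiny-fixture` are hypothetical.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # All names and defaults here are hypothetical placeholders.
    parser = argparse.ArgumentParser(
        prog="logprob-xbench",
        description="Compare rollout-engine and training-engine selected logprobs.",
    )
    parser.add_argument("--model", required=True, help="model path or revision")
    parser.add_argument(
        "--rollout-engine", choices=["local", "vllm", "sglang"], default="local"
    )
    parser.add_argument(
        "--train-engine", choices=["hf", "megatron", "deepspeed", "fsdp"], default="hf"
    )
    parser.add_argument("--dtype", choices=["fp32", "bf16", "fp16"], default="bf16")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument(
        "--max-abs-tol", type=float, default=1e-3,
        help="fail the run if any token drifts by more than this",
    )
    parser.add_argument(
        "--smoke", action="store_true",
        help="fast deterministic mode that skips optional engines",
    )
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--model", "tiny-fixture", "--smoke"])
print(args)
```

Defaulting both engines to local or reference paths keeps the smoke invocation free of optional dependencies, in line with the CI requirement.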
Proposed Deliverables
A reusable fixture schema for captured rollout sequences, including token ids, prompt/completion boundaries, masks, logprobs, engine metadata, and policy-state metadata.
A reference training-side replay path that computes selected logprobs for the captured tokens.
At least one rollout-side adapter or ingest path, with a documented path for vLLM and sglang adapters.
A comparator that reports drift at token, sequence, and aggregate levels.
A benchmark summary format that records model, tokenizer, dtype, backend, batch shape, cache settings, distributed settings, tolerance, and command line.
A CI-friendly smoke mode and a GPU/nightly mode for real engine comparisons.
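One possible shape for the fixture schema in the first deliverable is a plain dataclass that round-trips through JSON. The class name, field names, and metadata keys below are assumptions made for illustration; the issue only lists what the fixture must carry.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional

@dataclass
class CapturedSequence:
    """Hypothetical fixture record for one captured rollout sequence."""
    sequence_id: str
    token_ids: List[int]
    prompt_length: int                       # prompt/completion boundary
    position_ids: List[int]
    attention_mask: List[int]
    completion_mask: List[int]
    rollout_logprobs: List[Optional[float]]  # None where no logprob was captured
    engine: Dict[str, str] = field(default_factory=dict)
    policy_state: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        n = len(self.token_ids)
        for name in ("position_ids", "attention_mask",
                     "completion_mask", "rollout_logprobs"):
            if len(getattr(self, name)) != n:
                raise ValueError(f"{name} must have one entry per token")
        if not 0 <= self.prompt_length <= n:
            raise ValueError("prompt_length out of range")
        if any(self.completion_mask[: self.prompt_length]):
            raise ValueError("completion_mask must be 0 over the prompt")

seq = CapturedSequence(
    sequence_id="seq-0",
    token_ids=[101, 7, 8, 9, 10],
    prompt_length=2,
    position_ids=[0, 1, 2, 3, 4],
    attention_mask=[1, 1, 1, 1, 1],
    completion_mask=[0, 0, 1, 1, 1],
    rollout_logprobs=[None, None, -0.5, -1.25, -0.75],
    engine={"name": "local", "dtype": "bf16"},
    policy_state={"weight_version": "step-0"},
)
seq.validate()

# Round-trip through JSON, as a fixture file on disk would.
restored = CapturedSequence(**json.loads(json.dumps(asdict(seq))))
restored.validate()
```

Keeping the record JSON-serializable means the same fixture can be written by a rollout adapter and replayed later by any training-side adapter.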
Acceptance Criteria
A contributor can run one documented command that compares rollout-engine logprobs against a training/reference engine on fixed prompts, model, dtype, and seed.
The tool can run a fast deterministic smoke benchmark without requiring a full distributed cluster.
The output identifies the worst drift by sequence id, token position, target token id, prompt/completion region, engine pair, dtype, and config.
The benchmark records enough metadata to reproduce the run, including model revision/path, tokenizer, engine versions, seed, batch/cache/layout settings, and tolerance policy.
The comparator supports explicit pass/fail thresholds for CI or nightly regression lanes.
At least one test fixture intentionally injects a mismatch in token shift, mask, or dtype path and proves the report points to the offending token/sequence.
The design leaves room for vLLM, sglang, Megatron, DeepSpeed, FSDP, and torch/HF reference adapters without making all of them mandatory for the first PR.
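The injected-mismatch criterion can be illustrated with a toy replay in pure Python. The sketch computes selected logprobs with the usual teacher-forced alignment (logits at position t - 1 score the token at position t), repeats the computation with a deliberately wrong shift, and reports the worst completion-region token. The logits, token ids, tolerance, and all function names are made up for the example.

```python
import math

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def selected_logprobs(logits, token_ids, shift=1):
    """Logprob of token_ids[t], read from logits at position t - shift.

    shift=1 is the standard next-token alignment; shift=0 is the injected bug.
    Entry i of the result scores the token at position i + 1.
    """
    return [
        log_softmax(logits[t - shift])[token_ids[t]]
        for t in range(1, len(token_ids))
    ]

def worst_drift(reference, candidate, token_ids, prompt_length, tol):
    """Locate the largest disagreement inside the completion region."""
    worst = None
    for i, (ref, cand) in enumerate(zip(reference, candidate)):
        position = i + 1
        if position < prompt_length:
            continue  # prompt tokens are not part of the comparison
        err = abs(ref - cand)
        if worst is None or err > worst["abs_error"]:
            worst = {
                "token_position": position,
                "target_token_id": token_ids[position],
                "abs_error": err,
            }
    if worst is None:
        raise ValueError("no completion tokens to compare")
    worst["passed"] = worst["abs_error"] <= tol
    return worst

# Toy fixture: 4 positions, vocabulary of 4, first 2 tokens are the prompt.
token_ids = [0, 2, 1, 3]
logits = [
    [2.0, 0.0, 1.0, -1.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.5, 0.5, 0.5, 2.5],
    [1.0, 1.0, 1.0, 1.0],
]

reference = selected_logprobs(logits, token_ids, shift=1)
buggy = selected_logprobs(logits, token_ids, shift=0)

clean_report = worst_drift(reference, reference, token_ids, prompt_length=2, tol=1e-3)
bug_report = worst_drift(reference, buggy, token_ids, prompt_length=2, tol=1e-3)
print(clean_report)
print(bug_report)
```

The clean comparison passes with zero error, while the shifted replay fails and names the offending token position and target token id, which is the behavior the mismatch fixture is meant to prove.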