[FEAT] End-to-end log-prob cross-benchmark tool for rollout vs training engines #106

This issue scopes the P0.3 roadmap item for an end-to-end log-prob cross-benchmark tool that compares rollout engines with training engines.

Target identity:

```text
same model + same prompt/completion tokens + same policy state
=> rollout-engine selected logprobs match training-engine selected logprobs
```
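As a minimal sketch of what "selected logprobs" means on the training side, the snippet below applies a log-softmax to each logits row and picks the entry for the next token (teacher forcing). It is pure Python for illustration only; the helper names (`log_softmax`, `selected_logprobs`, `max_abs_diff`) and the `1e-6` tolerance are assumptions, not part of any existing `rl_kernel` API.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def selected_logprobs(logits_rows, token_ids):
    """Teacher-forced selection: row t scores the token at position t + 1."""
    return [
        log_softmax(logits_rows[t])[token_ids[t + 1]]
        for t in range(len(token_ids) - 1)
    ]

def max_abs_diff(a, b):
    """Largest per-token absolute difference between two logprob lists."""
    return max(abs(x - y) for x, y in zip(a, b))

# Toy sequence over a 3-token vocabulary (illustrative values only).
token_ids = [2, 0, 1]
logits_rows = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]

training_lp = selected_logprobs(logits_rows, token_ids)
rollout_lp = list(training_lp)  # stand-in for logprobs captured from a rollout engine

# The target identity, checked against a hypothetical tolerance.
identity_holds = max_abs_diff(rollout_lp, training_lp) <= 1e-6
```

In a real run, `rollout_lp` would come from the rollout engine and `training_lp` from the replay path, with both derived from the same token ids and the same weights.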

This is the executable counterpart to the train-inference consistency roadmap (#83) and the batch-invariant consistency RFC (#101). #101 defines the broad invariance contract; this issue should produce the tool that contributors and maintainers run to measure whether real rollout and training stacks obey that contract.

## Problem

Rollout and training paths rarely compute logprobs through the same execution shape:

- rollout engines such as vLLM or sglang use dynamic batching, chunked prefill, paged KV cache, prefix cache, and decode-time scheduling;
- training engines such as Megatron, DeepSpeed, FSDP, or a torch/HF reference path use teacher-forced forward passes, micro-batches, packing, padding masks, and different distributed layouts;
- model weights may be synchronized correctly while token shifting, position ids, masks, dtype casts, or reduction order still drift;
- selected logprobs may agree on small examples but diverge under long completions, ragged batches, TP>1, fp16/bf16, or prefix-cache reuse.

GRPO/PPO consume logprob deltas directly through ratios and KL terms, so even small rollout-vs-training drift can change the policy update. We need one reproducible cross-benchmark that captures the same logical sequence and reports exactly where the two engines disagree.
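To make the sensitivity concrete, here is a small sketch of how a logprob gap turns into a ratio. The numbers are illustrative, not measurements, and the sequence-level figure assumes the same gap on every one of 128 tokens, which is a worst-case simplification.

```python
import math

def importance_ratio(new_logprob, old_logprob):
    """PPO/GRPO-style ratio computed from two selected logprobs."""
    return math.exp(new_logprob - old_logprob)

# Same token, same weights, but the two engines disagree by 0.01 nats (illustrative).
training_logprob = -1.250
rollout_logprob = -1.240

# With no policy change at all, the ratio should be exactly 1.0.
token_ratio = importance_ratio(training_logprob, rollout_logprob)  # about 0.99

# If a quantity sums logprobs over a 128-token completion, the same
# per-token gap compounds in the exponent.
num_tokens = 128
sequence_ratio = math.exp(num_tokens * (training_logprob - rollout_logprob))  # about 0.28
```

A per-token ratio of about 0.99 looks harmless, but any sequence-level sum or product of the same drift moves far from 1.0, which is why the tool should report both per-token and per-sequence error.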

## Scope

Build a benchmark harness that can:

- run or ingest rollout outputs from one or more rollout engines, starting with a minimal local path and optional vLLM/sglang adapters;
- replay the exact same prompt/completion token ids through one or more training engines, starting with a torch/HF reference and optional Megatron/DeepSpeed/FSDP adapters;
- compare selected token logprobs for current policy, old policy, and reference policy when available;
- preserve and report token ids, position ids, attention masks, completion masks, prompt/completion boundaries, dtype, engine config, model revision, and policy/weight-sync metadata;
- measure per-token, per-sequence, and batch-level drift: max abs error, relative error, mean error, sequence sum/mean error, and KL-input drift;
- optionally sweep batch size, chunked prefill on/off, prefix cache on/off, padding layout, micro-batch size, and TP/FSDP configuration;
- emit machine-readable JSON/JSONL plus a concise markdown summary suitable for CI, nightly benchmarks, and issue comments.
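The per-token and per-sequence metrics listed above could be computed along these lines. This is a sketch under assumptions: the function name, the dictionary keys, and the choice to take relative error against the rollout value are all hypothetical, and KL-input drift is left out because its exact definition is not fixed in this issue.

```python
def drift_metrics(rollout_lp, training_lp, completion_mask):
    """Compare selected logprobs on completion tokens only.

    All three inputs are per-position lists of equal length; positions
    where completion_mask is falsy (prompt or padding) are ignored.
    """
    rows = [
        (pos, t - r, r)
        for pos, (r, t, keep) in enumerate(
            zip(rollout_lp, training_lp, completion_mask)
        )
        if keep
    ]
    if not rows:
        raise ValueError("completion_mask selects no tokens")

    worst_pos, worst_diff, _ = max(rows, key=lambda row: abs(row[1]))
    signed_sum = sum(diff for _, diff, _ in rows)
    return {
        "max_abs_error": abs(worst_diff),
        "worst_position": worst_pos,
        # Relative to the rollout value, guarded against division by zero.
        "max_rel_error": max(abs(d) / max(abs(r), 1e-12) for _, d, r in rows),
        "sequence_sum_error": signed_sum,
        "sequence_mean_error": signed_sum / len(rows),
    }

# Illustrative values: position 0 is a prompt token, position 1 drifts.
metrics = drift_metrics(
    rollout_lp=[-0.5, -1.0, -2.0, -0.25],
    training_lp=[-0.5, -1.125, -2.0, -0.25],
    completion_mask=[0, 1, 1, 1],
)
```

Batch-level numbers would then be aggregates (max, mean) of these per-sequence dictionaries.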

## Non-Goals

- Do not re-define the full batch/cache/layout invariance contract; that belongs to #101.
- Do not solve TP-invariant reductions inside this tool; the reduction contract belongs to #102, though this tool should be able to surface TP-related drift.
- Do not require every rollout/training framework to be integrated before the first version lands.
- Do not make the benchmark depend on production-scale clusters for the smoke path.

## Proposed CLI Shape

The final interface can evolve, but the first version should support a command in this spirit:

```bash
python -m rl_kernel.benchmarks.logprob_cross_engine \
  --model <model-or-local-path> \
  --tokenizer <tokenizer-or-local-path> \
  --prompts fixtures/prompts.jsonl \
  --rollout-engine vllm \
  --training-engine torch \
  --dtype bf16 \
  --max-new-tokens 128 \
  --seed 1234 \
  --output-dir artifacts/logprob_cross_engine/vllm_vs_torch
```

A smaller deterministic smoke command should also exist for CI. It should use a tiny model or a synthetic fixture so that optional engine dependencies can be skipped.
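One way the entry point could parse these flags is sketched below with `argparse`. The flag names mirror the command above; the `choices` lists, the defaults, and the `--smoke` switch are assumptions for illustration and are not an agreed interface.

```python
import argparse

def build_parser():
    """Argument parser mirroring the proposed CLI shape (hypothetical details)."""
    parser = argparse.ArgumentParser(prog="logprob_cross_engine")
    parser.add_argument("--model", required=True)
    parser.add_argument("--tokenizer", default=None)  # assumption: falls back to --model
    parser.add_argument("--prompts", required=True)
    parser.add_argument("--rollout-engine", choices=["local", "vllm", "sglang"],
                        default="local")
    parser.add_argument("--training-engine",
                        choices=["torch", "megatron", "deepspeed", "fsdp"],
                        default="torch")
    parser.add_argument("--dtype", choices=["bf16", "fp16", "fp32"], default="bf16")
    parser.add_argument("--max-new-tokens", type=int, default=128)
    parser.add_argument("--seed", type=int, default=1234)
    parser.add_argument("--output-dir", required=True)
    # Hypothetical switch for the CI smoke path described above.
    parser.add_argument("--smoke", action="store_true")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args([
    "--model", "tiny-model",
    "--prompts", "fixtures/prompts.jsonl",
    "--rollout-engine", "vllm",
    "--training-engine", "torch",
    "--output-dir", "artifacts/logprob_cross_engine/vllm_vs_torch",
])
```

Restricting engine names through `choices` gives an early, readable error when an adapter is not installed or not yet implemented.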

## Proposed Deliverables

- A reusable fixture schema for captured rollout sequences, including token ids, prompt/completion boundaries, masks, logprobs, engine metadata, and policy-state metadata.
- A reference training-side replay path that computes selected logprobs for the captured tokens.
- At least one rollout-side adapter or ingest path, with a documented path for vLLM and sglang adapters.
- A comparator that reports drift at token, sequence, and aggregate levels.
- A benchmark summary format that records model, tokenizer, dtype, backend, batch shape, cache settings, distributed settings, tolerance, and command line.
- A CI-friendly smoke mode and a GPU/nightly mode for real engine comparisons.
- Clear links from failure reports to likely follow-up areas: batch/cache/layout invariance (#101), TP reductions (#102), GRPO loss semantics (#64), or fused logprob kernels (#96).
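A possible starting point for the fixture schema is a plain dataclass serialized one record per JSONL line. Every field name here is hypothetical, and the convention that `rollout_logprobs` holds one value per completion token is an assumption that the real schema would need to pin down.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class RolloutFixture:
    """One captured rollout sequence (hypothetical field names)."""
    sequence_id: str
    token_ids: list
    prompt_length: int          # prompt/completion boundary
    completion_mask: list       # 1 for completion tokens, 0 otherwise
    rollout_logprobs: list      # one entry per completion token (assumption)
    engine: dict = field(default_factory=dict)        # name, version, dtype, cache flags
    policy_state: dict = field(default_factory=dict)  # model revision, weight-sync step

    def problems(self):
        """Return a list of consistency problems; empty means the record is usable."""
        found = []
        if len(self.completion_mask) != len(self.token_ids):
            found.append("completion_mask length differs from token_ids")
        if sum(self.completion_mask) != len(self.token_ids) - self.prompt_length:
            found.append("completion_mask disagrees with prompt_length")
        if len(self.rollout_logprobs) != sum(self.completion_mask):
            found.append("rollout_logprobs count differs from completion tokens")
        return found

def to_jsonl_line(fixture):
    return json.dumps(asdict(fixture), sort_keys=True)

def from_jsonl_line(line):
    return RolloutFixture(**json.loads(line))

fixture = RolloutFixture(
    sequence_id="seq-0",
    token_ids=[11, 12, 13, 14],
    prompt_length=2,
    completion_mask=[0, 0, 1, 1],
    rollout_logprobs=[-0.5, -1.25],
    engine={"name": "vllm", "dtype": "bf16"},
    policy_state={"weight_sync_step": 0},
)
restored = from_jsonl_line(to_jsonl_line(fixture))
```

Keeping the record flat and JSON-native means the same file can feed the comparator, the CI smoke path, and ad hoc inspection in an issue comment.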

## Acceptance Criteria

- A contributor can run one documented command that compares rollout-engine logprobs against a training/reference engine on fixed prompts, model, dtype, and seed.
- The tool can run a fast deterministic smoke benchmark without requiring a full distributed cluster.
- The output identifies the worst drift by sequence id, token position, target token id, prompt/completion region, engine pair, dtype, and config.
- The benchmark records enough metadata to reproduce the run, including model revision/path, tokenizer, engine versions, seed, batch/cache/layout settings, and tolerance policy.
- The comparator supports explicit pass/fail thresholds for CI or nightly regression lanes.
- At least one test fixture intentionally injects a mismatch in token shift, mask, or dtype path and proves the report points to the offending token/sequence.
- The design leaves room for vLLM, sglang, Megatron, DeepSpeed, FSDP, and torch/HF reference adapters without making all of them mandatory for the first PR.
- The issue links cleanly into #83 and #101 as the concrete P0.3 cross-engine benchmark tool.
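The injected-mismatch criterion could be exercised with a toy test like the one below. It simulates an off-by-one token shift in one path and checks that the comparator names the offending sequence and token. The logprob table, the `shift` parameter, the report keys, and the `1e-3` tolerance are all illustrative assumptions.

```python
import math

def select(logprob_rows, token_ids, shift):
    """Pick one logprob per row. shift=1 is correct teacher forcing
    (row t scores token t + 1); shift=0 injects an off-by-one bug."""
    return [
        logprob_rows[t][token_ids[t + shift]]
        for t in range(len(token_ids) - 1)
    ]

def worst_drift(pairs, token_ids_by_seq, tolerance):
    """Find the single worst token across sequences and apply a pass/fail threshold."""
    worst = None
    for seq_id, (reference, candidate) in pairs.items():
        for pos, (ref_lp, cand_lp) in enumerate(zip(reference, candidate)):
            err = abs(ref_lp - cand_lp)
            if worst is None or err > worst["abs_error"]:
                worst = {
                    "sequence_id": seq_id,
                    "position": pos,
                    "target_token_id": token_ids_by_seq[seq_id][pos + 1],
                    "abs_error": err,
                }
    worst["passed"] = worst["abs_error"] <= tolerance
    return worst

# Toy per-position distributions over a 3-token vocabulary.
rows = [
    [math.log(p) for p in probs]
    for probs in ([0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.125, 0.125, 0.75])
]
token_ids = [0, 1, 1, 2]

reference = select(rows, token_ids, shift=1)
pairs = {
    "seq-clean": (reference, select(rows, token_ids, shift=1)),
    "seq-shifted": (reference, select(rows, token_ids, shift=0)),  # injected bug
}
report = worst_drift(
    pairs,
    {"seq-clean": token_ids, "seq-shifted": token_ids},
    tolerance=1e-3,
)
```

Here the report should point at `seq-shifted`, and `passed` should be false, which is the behavior a CI regression lane would assert on.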
