[FEAT] End-to-end log-prob cross-benchmark tool for rollout vs training engines #106

Description

@inaniloquentee

This issue scopes the P0.3 roadmap item for an end-to-end log-prob cross-benchmark tool that compares rollout engines with training engines.

Target identity:

same model + same prompt/completion tokens + same policy state
=> rollout-engine selected logprobs match training-engine selected logprobs
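
To make "selected logprobs" concrete, here is a minimal pure-Python sketch of teacher-forced scoring. It assumes that `logits_per_position[t]` holds the next-token logits after the model has consumed tokens `0..t`, so the logits that score token `t` sit at index `t - 1`. The function names are hypothetical and not part of any existing API in this repository.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def selected_logprobs(logits_per_position, token_ids, prompt_len):
    """Log-probability of each completion token given its prefix.

    Assumption: logits_per_position[t] is the distribution over the token
    at position t + 1. The first token has no prefix and is never scored.
    """
    out = []
    for t in range(max(prompt_len, 1), len(token_ids)):
        row = log_softmax(logits_per_position[t - 1])  # shift by one
        out.append(row[token_ids[t]])
    return out
```

Both engines would need to agree on this indexing convention before their numbers can be compared; an off-by-one here is one of the mismatches the tool is meant to surface.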

This is the executable counterpart to the train-inference consistency roadmap (#83) and the batch-invariant consistency RFC (#101). #101 defines the broad invariance contract; this issue should produce the tool that contributors and maintainers run to measure whether real rollout and training stacks obey that contract.

Problem

Rollout and training paths rarely compute logprobs through the same execution shape:

  • rollout engines such as vLLM or SGLang use dynamic batching, chunked prefill, paged KV cache, prefix cache, and decode-time scheduling;
  • training engines such as Megatron, DeepSpeed, FSDP, or a torch/HF reference path use teacher-forced forward passes, micro-batches, packing, padding masks, and different distributed layouts;
  • model weights may be synchronized correctly while token shifting, position ids, masks, dtype casts, or reduction order still drift;
  • selected logprobs may agree on small examples but diverge under long completions, ragged batches, TP>1, fp16/bf16, or prefix-cache reuse.

GRPO/PPO consume logprob deltas directly through ratios and KL terms, so even small rollout-vs-training drift can change the policy update. We need one reproducible cross-benchmark that captures the same logical sequence and reports exactly where the two engines disagree.
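
As a rough illustration of why small drift matters, the sketch below assumes a PPO-style token-level importance ratio `exp(logp_train - logp_rollout)`. The clip range of 0.2 is a commonly used default chosen here for illustration, not a value taken from this project. With identical policy state the ratio should be exactly 1, so any deviation is pure engine drift.

```python
import math


def importance_ratio(logp_train, logp_rollout):
    """Token-level ratio; equals 1.0 when both engines agree exactly."""
    return math.exp(logp_train - logp_rollout)


def is_clipped(ratio, clip_eps=0.2):
    """True when the ratio falls outside [1 - clip_eps, 1 + clip_eps]."""
    return ratio < 1.0 - clip_eps or ratio > 1.0 + clip_eps


if __name__ == "__main__":
    # Same token, same weights, increasing amounts of engine drift.
    for drift in (0.0, 0.01, 0.1, 0.3):
        r = importance_ratio(-1.0 + drift, -1.0)
        print(f"drift={drift:.2f} ratio={r:.4f} clipped={is_clipped(r)}")
```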

Scope

Build a benchmark harness that can:

  • run or ingest rollout outputs from one or more rollout engines, starting with a minimal local path and optional vLLM/SGLang adapters;
  • replay the exact same prompt/completion token ids through one or more training engines, starting with a torch/HF reference and optional Megatron/DeepSpeed/FSDP adapters;
  • compare selected token logprobs for current policy, old policy, and reference policy when available;
  • preserve and report token ids, position ids, attention masks, completion masks, prompt/completion boundaries, dtype, engine config, model revision, and policy/weight-sync metadata;
  • measure per-token, per-sequence, and batch-level drift: max abs error, relative error, mean error, sequence sum/mean error, and KL-input drift;
  • optionally sweep batch size, chunked prefill on/off, prefix cache on/off, padding layout, micro-batch size, and TP/FSDP configuration;
  • emit machine-readable JSON/JSONL plus a concise markdown summary suitable for CI, nightly benchmarks, and issue comments.
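
The per-sequence drift metrics listed above could be computed along these lines. This is a sketch only: the field names, the choice to normalize relative error by the rollout value, and the JSONL layout are assumptions, not a settled schema.

```python
import json


def drift_report(sequence_id, rollout_lp, train_lp, eps=1e-12):
    """Compare two equal-length lists of selected logprobs for one sequence."""
    if len(rollout_lp) != len(train_lp) or not rollout_lp:
        raise ValueError("logprob lists must be non-empty and equal length")
    diffs = [t - r for r, t in zip(rollout_lp, train_lp)]
    abs_diffs = [abs(d) for d in diffs]
    worst = max(range(len(abs_diffs)), key=abs_diffs.__getitem__)
    n = len(diffs)
    return {
        "sequence_id": sequence_id,
        "num_tokens": n,
        "max_abs_error": abs_diffs[worst],
        "worst_token_index": worst,  # index within the completion
        "mean_abs_error": sum(abs_diffs) / n,
        "mean_error": sum(diffs) / n,  # signed, exposes systematic bias
        "max_rel_error": max(
            a / max(abs(r), eps) for a, r in zip(abs_diffs, rollout_lp)
        ),
        "sequence_sum_error": sum(train_lp) - sum(rollout_lp),
    }


def to_jsonl_line(report):
    """Serialize one report as a single JSONL record."""
    return json.dumps(report, sort_keys=True)
```

Batch-level numbers would then be reductions over these per-sequence records, and a sweep would add one record per configuration.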

Non-Goals

Proposed CLI Shape

The final interface can evolve, but the first version should support a command in this spirit:

python -m rl_kernel.benchmarks.logprob_cross_engine \
  --model <model-or-local-path> \
  --tokenizer <tokenizer-or-local-path> \
  --prompts fixtures/prompts.jsonl \
  --rollout-engine vllm \
  --training-engine torch \
  --dtype bf16 \
  --max-new-tokens 128 \
  --seed 1234 \
  --output-dir artifacts/logprob_cross_engine/vllm_vs_torch

A smaller deterministic smoke command should also exist for CI, using a tiny model or synthetic fixture where optional engine dependencies can be skipped.
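
A minimal `argparse` skeleton for the flags in the command above might look like this. The flag names come from the proposed command; the engine and dtype choice lists, the defaults, and which flags are required are placeholders for illustration and would be decided in the first PR.

```python
import argparse

# Hypothetical identifiers: the issue names these engines but does not
# fix how they are spelled on the command line.
ROLLOUT_ENGINES = ("local", "vllm", "sglang")
TRAINING_ENGINES = ("torch", "megatron", "deepspeed", "fsdp")


def build_parser():
    p = argparse.ArgumentParser(prog="logprob_cross_engine")
    p.add_argument("--model", required=True)
    p.add_argument("--tokenizer")  # assumed to fall back to --model
    p.add_argument("--prompts", required=True)
    p.add_argument("--rollout-engine", choices=ROLLOUT_ENGINES, required=True)
    p.add_argument("--training-engine", choices=TRAINING_ENGINES, default="torch")
    p.add_argument("--dtype", choices=("fp32", "fp16", "bf16"), default="bf16")
    p.add_argument("--max-new-tokens", type=int, default=128)
    p.add_argument("--seed", type=int, default=1234)
    p.add_argument("--output-dir", required=True)
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```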

Proposed Deliverables

Acceptance Criteria

  • A contributor can run one documented command that compares rollout-engine logprobs against a training/reference engine on fixed prompts, model, dtype, and seed.
  • The tool can run a fast deterministic smoke benchmark without requiring a full distributed cluster.
  • The output identifies the worst drift by sequence id, token position, target token id, prompt/completion region, engine pair, dtype, and config.
  • The benchmark records enough metadata to reproduce the run, including model revision/path, tokenizer, engine versions, seed, batch/cache/layout settings, and tolerance policy.
  • The comparator supports explicit pass/fail thresholds for CI or nightly regression lanes.
  • At least one test fixture intentionally injects a mismatch in token shift, mask, or dtype path and proves the report points to the offending token/sequence.
  • The design leaves room for vLLM, SGLang, Megatron, DeepSpeed, FSDP, and torch/HF reference adapters without making all of them mandatory for the first PR.
  • The issue links cleanly into [Roadmap] RL-Kernel Roadmap Q3 - Q4 #83 and [RFC] Batch-Invariant RL Kernel Suite for Train-Inference Consistency #101 as the concrete P0.3 cross-engine benchmark tool.
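
The injected-mismatch and pass/fail criteria above could be exercised with a fixture in this spirit. It is a sketch with hypothetical helper names: `inject_shift` simulates an off-by-one token shift by making each position from `start` onward read its left neighbor's logprob, and `first_violation` reports the first token whose absolute error exceeds the tolerance.

```python
def inject_shift(logprobs, start):
    """Simulate an off-by-one bug beginning at index `start` (start >= 1)."""
    if start < 1:
        raise ValueError("start must be >= 1")
    shifted = [logprobs[i - 1] for i in range(start, len(logprobs))]
    return logprobs[:start] + shifted


def first_violation(reference, candidate, atol):
    """Index of the first token with |reference - candidate| > atol, else None."""
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if abs(a - b) > atol:
            return i
    return None


def passes(reference, candidate, atol):
    """Explicit pass/fail gate suitable for a CI assertion."""
    return first_violation(reference, candidate, atol) is None
```

A test would assert that the reported index equals the injection point, which shows that the report localizes the fault rather than only detecting it.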

Metadata

Labels

  • component: alignment (Tasks involving RL loss functions such as DPO and GRPO, and mathematical alignment logic)
  • component: executors (Tasks involving the interaction of vLLM inference and DeepSpeed training endpoints.)
  • component: testing (Add test cases and benchmark-related tasks)
  • feature
  • priority: high (Severe congestion issues require the highest priority for resolution.)
  • type: design (Issues requiring in-depth discussion of architecture design)
