This issue scopes the P0.3 roadmap item for tensor-parallel-invariant reductions across rollout and training engines.
Target identity:
same model + same sequence + same policy state
=> selected logprobs / loss reductions are invariant to tensor-parallel degree
This is a focused follow-up to the train-inference consistency roadmap (#83) and the batch-invariant consistency RFC (#101). #101 covers batch/cache/layout invariance broadly; this issue covers the distributed reduction contract needed when training uses FSDP with TP=1 but rollout or scoring uses TP>1.
Concrete Example
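Consider one token position with vocab size 6 and target token id 4. FSDP(TP=1) computes the selected logprob from the full-vocab logits. With TP=2 the vocab is sharded across two ranks, and the TP-invariant reduction must reproduce the TP=1 value from those shards; bad reductions drift immediately.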
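A minimal sketch of this example in pure Python. The logit values, the contiguous [0, 3) / [3, 6) shard layout, and all variable names are assumptions for illustration; the built-in max and sum stand in for all-reduce(max) and all-reduce(sum) across ranks. The last step shows one possible bad reduction: normalizing only inside the shard that owns the target.

```python
import math

# Hypothetical logits for one token position (vocab size 6); target id 4.
logits = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
target = 4


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


# TP=1: selected logprob over the full vocab.
full_lp = logits[target] - logsumexp(logits)

# TP=2 (assumed layout): rank 0 owns ids [0, 3), rank 1 owns ids [3, 6).
shards = [logits[0:3], logits[3:6]]
offsets = [0, 3]

# Step 1: global max, standing in for all-reduce(max) of per-shard maxima.
global_max = max(max(s) for s in shards)

# Step 2: global sum of exp(x - global_max), standing in for all-reduce(sum).
global_sum = sum(sum(math.exp(x - global_max) for x in s) for s in shards)

# Step 3: the owning rank contributes the target logit, other ranks add 0.0.
target_logit = sum(
    s[target - off] if off <= target < off + len(s) else 0.0
    for s, off in zip(shards, offsets)
)
tp_lp = target_logit - (global_max + math.log(global_sum))

# Bad reduction: softmax normalized only within the owning shard.
bad_lp = shards[1][target - offsets[1]] - logsumexp(shards[1])

print(f"TP=1 full vocab : {full_lp:.6f}")
print(f"TP=2 invariant  : {tp_lp:.6f}")
print(f"TP=2 shard-local: {bad_lp:.6f}")
```

Here tp_lp agrees with full_lp to within floating-point rounding, while bad_lp is far off because it ignores the probability mass held by the other rank. Whether the first two agree bitwise depends on summation order, which is what the contract needs to pin down.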
That kind of drift is not a harmless implementation detail: GRPO/PPO consume logprob deltas through ratios and KL terms, so a TP-dependent selected-logprob can change the policy update even when the model weights and sampled sequence are identical.
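To make the sensitivity concrete: the importance ratio is exp(new_logprob - old_logprob), so an additive logprob drift d turns into a multiplicative factor exp(d) on the ratio. The sketch below uses a hypothetical reference logprob and hypothetical drift sizes; policy_ratio is an illustrative helper, not an API from any framework.

```python
import math


def policy_ratio(new_logprob, old_logprob):
    """Importance ratio as used by PPO/GRPO-style objectives."""
    return math.exp(new_logprob - old_logprob)


# Same weights and same sequence: the ideal ratio is exactly 1.0.
reference_lp = -0.5  # hypothetical selected logprob

for drift in (1e-6, 1e-3, 1e-1):  # hypothetical TP-induced drift, in nats
    r = policy_ratio(reference_lp + drift, reference_lp)
    print(f"drift={drift:g}  ratio={r:.6f}  deviation={r - 1.0:.2e}")
```

For small d the deviation from 1.0 is approximately d itself, so any drift in the selected logprob shows up at nearly full size in the ratio.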
Problem
Rollout/scoring stacks often run TP>1, while the training/reference scorer may run FSDP with TP=1. The same logical operation can then travel through different reduction paths:
full-vocab selected logprob in FSDP(TP=1) vs vocab-sharded selected logprob in TP>1;
lm_head output projection with full logits vs sharded vocab logits;
masked token/sentence loss denominators under different shard or micro-batch layouts;
group reward/advantage statistics when group members are partitioned differently;
dtype and accumulation differences such as fp16/bf16 inputs with fp32 reduction state;
reduction topology differences across CUDA / ROCm / Triton / torch.distributed paths.
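The reduction-topology item comes down to floating-point addition not being associative. A small sketch with hypothetical per-rank partial sums shows two association orders giving different results, and a fixed ascending-rank fold as one possible canonical order (an assumption, not a statement about what any collective library does).

```python
from functools import reduce

# Hypothetical per-rank partial sums that must be combined into one value.
partials = [0.1, 0.2, 0.3]

# Left-to-right association, e.g. a chain that starts at rank 0.
chain = (partials[0] + partials[1]) + partials[2]

# A different association, e.g. a tree that combines ranks 1 and 2 first.
tree = partials[0] + (partials[1] + partials[2])


def canonical_sum(values):
    """Fold in ascending rank order so every run associates the same way."""
    return reduce(lambda acc, v: acc + v, values, 0.0)


print(chain == tree)                     # the two topologies disagree
print(canonical_sum(partials) == chain)  # the fixed order is reproducible
```

The disagreement is only about one unit in the last place here, but it is enough to break bitwise parity checks.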
Without an explicit TP-invariant reduction contract, rollout-vs-training parity tests can fail even when weight sync, tokenization, causal shifting, and masks are correct.
Scope
Define and test canonical TP-invariant reductions for:
selected-token logprob / cross entropy over vocab-sharded logits;
masked token reductions used by logprob summaries and loss denominators;
GRPO/PPO-style policy-ratio, KL, and loss aggregation inputs where TP or micro-batch partitioning can alter denominator semantics;
optional simulated TP partitions on CPU/CUDA so the contract can be tested without launching a full distributed engine;
real distributed TP smoke tests when multi-GPU CI or nightly hardware is available.
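Simulated TP partitions can be as simple as slicing the vocab axis on one device. The helper below is a sketch under an assumed layout (contiguous shards, with the remainder given to the leading ranks); real engines may instead pad the vocabulary to a multiple of the TP degree, and the function names are hypothetical.

```python
def shard_bounds(vocab_size, tp_size):
    """Return one half-open (start, end) vocab range per simulated rank.

    Assumed layout: contiguous shards, leading ranks absorb the remainder.
    """
    base, extra = divmod(vocab_size, tp_size)
    bounds, start = [], 0
    for rank in range(tp_size):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def owner_rank(target_id, bounds):
    """Return the simulated rank whose vocab range contains target_id."""
    for rank, (start, end) in enumerate(bounds):
        if start <= target_id < end:
            return rank
    raise ValueError(f"target id {target_id} is outside the vocab")


print(shard_bounds(6, 2))
print(shard_bounds(10, 4))  # uneven shards: sizes 3, 3, 2, 2
print(owner_rank(4, shard_bounds(6, 2)))
```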
The contract should state:
reduction order/topology where bitwise identity is expected;
dtype policy for local partials, global reductions, and final casts;
mask and padding denominator semantics;
behavior for uneven vocab shards and target tokens owned by different ranks;
tolerance policy when bitwise identity is not realistic across backends.
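The mask and padding denominator item is easy to get wrong silently. The sketch below uses hypothetical per-token values split across two micro-batches and compares a global token mean with a mean of per-micro-batch means; the names are illustrative only.

```python
def masked_token_mean(values, mask):
    """Token mean: sum of unmasked values divided by the unmasked count."""
    total = sum(v for v, m in zip(values, mask) if m)
    count = sum(1 for m in mask if m)
    return total / count


# Hypothetical per-token losses; mask 0 marks padding.
mb0_vals, mb0_mask = [1.0, 1.0, 1.0, 1.0], [1, 1, 1, 1]
mb1_vals, mb1_mask = [4.0, 0.0, 0.0, 0.0], [1, 0, 0, 0]

# Global semantics: one denominator over all unmasked tokens.
global_mean = masked_token_mean(mb0_vals + mb1_vals, mb0_mask + mb1_mask)

# Layout-dependent semantics: average the per-micro-batch means.
mean_of_means = (
    masked_token_mean(mb0_vals, mb0_mask) + masked_token_mean(mb1_vals, mb1_mask)
) / 2

# Layout-invariant alternative: reduce numerator and denominator separately,
# then divide once.
num = sum(v for vals, msk in ((mb0_vals, mb0_mask), (mb1_vals, mb1_mask))
          for v, m in zip(vals, msk) if m)
den = sum(sum(msk) for msk in (mb0_mask, mb1_mask))
combined = num / den

print(global_mean, mean_of_means, combined)
```

The mean of means weights the single unmasked token in the second micro-batch as heavily as the four tokens in the first, so it changes whenever the partitioning changes; reducing the numerator and denominator separately does not.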
Non-Goals
Do not implement a full vLLM, Megatron, DeepSpeed, or FSDP integration in this issue.
Do not require bitwise equality for every hardware/backend combination unless the RFC explicitly proves it is realistic.
Proposed Deliverables
A short RFC/design note describing TP-invariant reduction semantics.
A reference implementation or test helper that computes selected logprob from full logits and from simulated TP shards using the same contract.
A parity test matrix for TP=1 vs TP=2/4 simulated shards, covering target-token ownership on each rank, padding masks, variable completion lengths, and uneven vocab shard tails.
A drift report format with max abs/relative error, token position, target id, owner rank, TP size, dtype, backend, and reduction name.
A path to plug the same checks into the end-to-end rollout/training logprob cross-benchmark.
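One possible shape for the drift report, sketched as a small record type. The field names follow the deliverable above, but the class, the helper, and the relative-error floor of 1e-12 are assumptions rather than an agreed format.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DriftReport:
    reduction: str        # e.g. "selected_logprob"
    max_abs_err: float
    max_rel_err: float
    token_position: int   # position of the largest absolute error
    target_id: int
    owner_rank: int
    tp_size: int
    dtype: str
    backend: str


def build_report(reduction, ref, got, target_ids, owner_ranks,
                 tp_size, dtype, backend):
    """Compare per-token reference and candidate values, report the worst."""
    abs_errs = [abs(g - r) for r, g in zip(ref, got)]
    rel_errs = [e / max(abs(r), 1e-12) for e, r in zip(abs_errs, ref)]
    worst = max(range(len(abs_errs)), key=abs_errs.__getitem__)
    return DriftReport(
        reduction=reduction,
        max_abs_err=abs_errs[worst],
        max_rel_err=max(rel_errs),
        token_position=worst,
        target_id=target_ids[worst],
        owner_rank=owner_ranks[worst],
        tp_size=tp_size,
        dtype=dtype,
        backend=backend,
    )


# Hypothetical values for two token positions.
report = build_report(
    "selected_logprob",
    ref=[-0.5, -1.0], got=[-0.5, -1.25],
    target_ids=[4, 1], owner_ranks=[1, 0],
    tp_size=2, dtype="float32", backend="simulated",
)
print(asdict(report))
```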
Acceptance Criteria
TP=1 full-vocab reference and simulated TP>1 sharded reference produce matching selected logprobs within the declared tolerance.
Tests include at least TP=2 and TP=4, with target tokens on every shard.
Tests include prompt/completion/padding masks and verify denominator semantics do not change with TP degree.
Tests include uneven vocab shard sizes.
The design explicitly names the dtype/accumulation policy, including fp16/bf16 input and fp32 reduction state if chosen.
Failure output is actionable enough to identify whether drift came from vocab logsumexp, selected-token ownership, mask denominator, group/loss aggregation, or backend dtype behavior.
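A sketch of how the first, second, and fourth criteria could be exercised on simulated shards. Everything here is an assumption for illustration: the logits are made up, the shard layout gives the remainder to leading ranks, the arithmetic is float64 rather than bf16/fp16, and the 1e-12 tolerance is a placeholder for whatever the contract declares.

```python
import math


def shard_bounds(vocab_size, tp_size):
    """Contiguous half-open vocab ranges; leading ranks take the remainder."""
    base, extra = divmod(vocab_size, tp_size)
    bounds, start = [], 0
    for rank in range(tp_size):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def full_logprob(logits, target):
    """TP=1 reference: target logit minus logsumexp over the full vocab."""
    m = max(logits)
    return logits[target] - (m + math.log(sum(math.exp(x - m) for x in logits)))


def sharded_logprob(logits, target, tp_size):
    """Simulated TP>1: max and sum stand in for the two all-reduces."""
    bounds = shard_bounds(len(logits), tp_size)
    shards = [logits[s:e] for s, e in bounds]  # assumes no empty shard
    global_max = max(max(s) for s in shards)
    global_sum = sum(sum(math.exp(x - global_max) for x in s) for s in shards)
    target_logit = sum(logits[target] if s <= target < e else 0.0
                       for s, e in bounds)
    return target_logit - (global_max + math.log(global_sum))


def check_parity(logits, tp_sizes=(2, 4), tol=1e-12):
    """Return (tp, target, ref, got) for every case outside the tolerance."""
    failures = []
    for tp in tp_sizes:
        for target in range(len(logits)):  # puts the target on every shard
            ref = full_logprob(logits, target)
            got = sharded_logprob(logits, target, tp)
            if abs(got - ref) > tol:
                failures.append((tp, target, ref, got))
    return failures


# Vocab size 10 gives uneven shards (3, 3, 2, 2) at TP=4.
logits = [0.3, -1.2, 2.5, 0.0, 1.7, -0.4, 0.9, 3.1, -2.0, 0.6]
print(check_parity(logits))
```

Sweeping every target id covers ownership on each rank without enumerating ranks explicitly; mask and dtype cases would need their own sweeps.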