Fix DPO completion log-prob loss for unequal target lengths by TengJiao33 · Pull Request #708 · zjunlp/EasyEdit

TengJiao33 · 2026-07-04T11:23:15Z

Summary

compute DPO preference scores from the sequence log-probabilities of the chosen and rejected completions instead of from full-vocab log-prob tensors
mask prompt and padding tokens so the DPO objective only compares completion tokens
support chosen and rejected completions with different token lengths by reducing each example to a scalar sequence log-probability
restore adapter/training state with a finally block after reference-model scoring
keep public APIs, hparams, and editor call patterns unchanged
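
For readers who want to see why reducing each example to a scalar removes the shape mismatch, here is a minimal, framework-free sketch of the idea described above. It is not the code in dpo_main.py: the function names `completion_logprob` and `dpo_loss`, the plain-list inputs, and the default `beta=0.1` are all assumptions made for illustration, and a real implementation would presumably operate on batched tensors.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of vocabulary logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def completion_logprob(logits, token_ids, completion_mask):
    """Sum log p(token_t | tokens_<t) over completion tokens only.

    logits[t - 1] scores the token at position t, so logits are shifted
    by one against the labels. Prompt and padding positions carry a 0 in
    completion_mask (hypothetical layout) and are skipped.
    """
    total = 0.0
    for t in range(1, len(token_ids)):
        if completion_mask[t]:
            total += log_softmax(logits[t - 1])[token_ids[t]]
    return total


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on scalar sequence log-probabilities."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a stable softplus of -margin.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Chosen and rejected completions of different lengths (2 vs 3 tokens).
vocab = 4
uniform = [[0.0] * vocab for _ in range(6)]
chosen_lp = completion_logprob(uniform, [1, 2, 3, 0, 0, 0], [0, 0, 1, 1, 0, 0])
rejected_lp = completion_logprob(uniform, [1, 2, 3, 1, 2, 0], [0, 0, 1, 1, 1, 0])
loss = dpo_loss(chosen_lp, rejected_lp, chosen_lp, rejected_lp)
```

Because each completion collapses to a single number before the preference margin is formed, the chosen and rejected sequences never need to share a token length.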

python -m py_compile easyeditor/models/dpo/dpo_main.py
temporary regression checks for completion-only masking, shifted token log-prob gathering, and unequal chosen/rejected completion lengths
smoke edit run confirmed the previous shape mismatch no longer occurs when chosen and rejected targets have different lengths

Fix DPO completion log-prob loss

c0fe2d5