Fix DPO completion log-prob loss for unequal target lengths #708

Open
TengJiao33 wants to merge 1 commit into zjunlp:main from TengJiao33:codex/dpo-sequence-logprob-fix

@TengJiao33

Contributor

Summary

  • compute DPO preference scores from chosen/rejected completion sequence log-probabilities instead of full-vocab log-prob tensors
  • mask prompt and padding tokens so the DPO objective only compares completion tokens
  • support chosen and rejected completions with different token lengths by reducing each example to a scalar sequence log-probability
  • restore adapter/training state with a finally block after reference-model scoring
  • keep public APIs, hparams, and editor call patterns unchanged
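
The reduction described in the list above can be sketched as follows. This is a minimal, framework-free illustration and not the code in `easyeditor/models/dpo/dpo_main.py`: the helper names (`sequence_logprob`, `log_sigmoid`, `dpo_loss`), the example numbers, and `beta=0.1` are all assumptions made for the example. In PyTorch the same steps would typically rely on `log_softmax` and `gather`.

```python
import math

def sequence_logprob(token_logprobs, completion_mask):
    """Sum per-token log-probs over completion positions only.

    Prompt and padding positions carry mask 0 and contribute nothing,
    so each example collapses to a single scalar regardless of length.
    """
    if len(token_logprobs) != len(completion_mask):
        raise ValueError("log-probs and mask must have the same length")
    return sum(lp for lp, keep in zip(token_logprobs, completion_mask) if keep)

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss on scalar sequence log-probs (beta is an assumed value)."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

# Hypothetical example: the chosen sequence has 5 tokens, the rejected one 4.
# The first two positions are prompt tokens; the last rejected position is padding.
chosen_mask = [0, 0, 1, 1, 1]
rejected_mask = [0, 0, 1, 0]

pol_c = sequence_logprob([-9.0, -9.0, -0.5, -0.5, -0.5], chosen_mask)
pol_r = sequence_logprob([-9.0, -9.0, -2.0, -7.0], rejected_mask)
ref_c = sequence_logprob([-9.0, -9.0, -1.0, -1.0, -1.0], chosen_mask)
ref_r = sequence_logprob([-9.0, -9.0, -1.0, -7.0], rejected_mask)

loss = dpo_loss(pol_c, pol_r, ref_c, ref_r)
```

Because each sequence is reduced to a scalar before the chosen/rejected comparison, the two completions never need to share a shape, which is the property the unequal-length fix depends on.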

Validation

  • python -m py_compile easyeditor/models/dpo/dpo_main.py
  • temporary regression checks for completion-only masking, shifted token log-prob gathering, and unequal chosen/rejected completion lengths
  • smoke edit run confirmed the previous shape mismatch no longer occurs when chosen and rejected targets have different lengths
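
A regression check along the lines listed above might look like the sketch below. It is not the temporary test code from this PR: the function names, the toy vocabulary, and the use of `pad_id` to detect padding are assumptions for illustration (real code would more likely use the attention mask or label mask, since a pad id can coincide with EOS).

```python
import math

def log_softmax(row):
    """Log-softmax over one row of logits, using the max-shift trick."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def completion_token_logprobs(logits, input_ids, prompt_len, pad_id):
    """Gather shifted token log-probs and build a completion-only mask.

    logits[t] scores the token at position t + 1, so targets are
    input_ids[1:]. A target counts as completion if it lies at or after
    prompt_len and is not padding (assumed detectable via pad_id).
    """
    logprobs, mask = [], []
    for t in range(len(input_ids) - 1):
        target = input_ids[t + 1]
        logprobs.append(log_softmax(logits[t])[target])
        mask.append((t + 1) >= prompt_len and target != pad_id)
    return logprobs, mask

def masked_sum(logprobs, mask):
    """Reduce one example to a scalar sequence log-probability."""
    return sum(lp for lp, keep in zip(logprobs, mask) if keep)

# Toy setup: vocabulary of 3 tokens, uniform logits, pad_id = 0.
PAD = 0
uniform = [0.0, 0.0, 0.0]

# Chosen: 2 prompt tokens, 2 completion tokens, 1 padding token.
chosen_ids = [1, 2, 1, 2, PAD]
chosen_lp, chosen_mask = completion_token_logprobs(
    [uniform] * len(chosen_ids), chosen_ids, prompt_len=2, pad_id=PAD
)

# Rejected: same prompt, but only 1 completion token and no padding.
rejected_ids = [1, 2, 1]
rejected_lp, rejected_mask = completion_token_logprobs(
    [uniform] * len(rejected_ids), rejected_ids, prompt_len=2, pad_id=PAD
)

chosen_score = masked_sum(chosen_lp, chosen_mask)
rejected_score = masked_sum(rejected_lp, rejected_mask)
```

With uniform logits every token has log-probability `-log(3)`, so the scores depend only on how many completion tokens survive the mask, which makes the masking and the one-position shift easy to assert on.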
