Offical code for the paper FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings (ACL 2025 long paper).
Chen et al., 2024 empirically finds that DPO training rarely improves these misranked preference, despite its gradient emphasizing on these cases. We add a simple factor to DPO loss to make DPO focus on "more correct" (see gradient curve) samples. With the introduced hyperparameter fixed (we do not want to over-rely on hyperparameter tuning), it consistently outperforms DPO on Arena-hard and Alpaca Eval.
We release the following model that are built on top of Mistral-Base SFT (7B) model by training FocalPO on UltraFeedback dataset.
| models | Alpaca Eval 2.0 LC | AH WR |
|---|---|---|
| tongliuphysics/Mistral-7B-Base-SFT-FocalPO | 23.9 | 17.1 |
We release the following model that are built on top of Llama-3-Instruct (8B) model by training FocalPO on the on-policy Llama3-ultrafeedbackarmorm dataset.
| models | Alpaca Eval 2.0 LC | AH WR |
|---|---|---|
| tongliuphysics/Llama-3-8B-Instruct-FocalPO | 54.7 | 34.6 |
@article{liu2025focalpo,
title={FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings},
author={Liu, Tong and Yu, Xiao and Zhou, Wenxuan and Gu, Jindong and Tresp, Volker},
journal={arXiv preprint arXiv:2501.06645},
year={2025}
}