Official repository for "Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness" (Accepted to ACL 2026 Main Conference 🎉).
- [2026.04.06] 🎉 Our paper has been accepted to ACL 2026 Main Conference!
- [2026.03.17] 💻 We have officially released the training and evaluation code!
- [2026.03.17] 🔥 Our paper is now available on arXiv.
- [2026.06.10] Release SDiaReward-3B and 7B model weights on HuggingFace.
- [2026.06.10] Release ESDR-Bench and SDiaReward-Dataset.
SDiaReward is an end-to-end multi-turn reward model for evaluating spoken dialogue quality. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality-awareness (prosody, emotion, acoustic naturalness) and colloquialness (conversational spontaneity vs. scripted style) in a single evaluator.
Built on Qwen2.5-Omni, SDiaReward extends the multimodal LLM backbone with a pooling layer and linear score head for scalar reward prediction. It is trained on SDiaReward-Dataset, a collection of ~13k episode-level preference pairs (~200 hours of paired speech), and evaluated on ESDR-Bench, a stratified benchmark for robust episode-level evaluation.
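SDiaReward 7B achieves state-of-the-art overall pairwise preference accuracy on ESDR-Bench (96.70 micro, 96.06 macro) and leads every general-purpose audio LLM by a wide margin on the modality subset. On the colloquialness subset, GPT-4o Audio (98.00) and Gemini 2.5 Pro (98.80) score slightly higher than SDiaReward 7B (97.20). All values are pairwise preference accuracy (%):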
| Model | Modality Micro | Modality Macro | Colloquialness | Overall Micro | Overall Macro |
|---|---|---|---|---|---|
| GPT-4o Audio | 51.12 | 50.47 | 98.00 | 57.91 | 74.23 |
| Gemini 2.5 Pro | 72.63 | 70.50 | 98.80 | 76.42 | 84.65 |
| Qwen 2.5 Omni 7B | 51.85 | 51.82 | 49.20 | 51.47 | 50.51 |
| Kimi-Audio | 65.30 | 63.38 | 66.00 | 65.40 | 64.69 |
| SDiaReward 3B | 88.62 | 79.20 | 92.00 | 89.11 | 85.60 |
| SDiaReward 7B | 96.61 | 94.91 | 97.20 | 96.70 | 96.06 |
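
The table reports both micro and macro averages. As a rough illustration, assuming the usual definitions (micro pools all preference pairs, macro averages the per-category accuracies), the two numbers could be computed as sketched below. The category names, field names, and scores are made up for the example and are not ESDR-Bench data or the repository's evaluation code.

```python
from collections import defaultdict

def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen episode outscores the rejected one."""
    correct = sum(1 for p in pairs if p["chosen_score"] > p["rejected_score"])
    return correct / len(pairs)

def micro_macro_accuracy(pairs):
    """Return (micro, macro) accuracy over pairs tagged with a 'category' key."""
    by_category = defaultdict(list)
    for p in pairs:
        by_category[p["category"]].append(p)
    micro = pairwise_accuracy(pairs)  # every pair counts equally
    macro = sum(pairwise_accuracy(g) for g in by_category.values()) / len(by_category)
    return micro, macro

# Toy, made-up scores (hypothetical categories): 4 "emotion" pairs, 1 "prosody" pair.
toy_pairs = [
    {"category": "emotion", "chosen_score": 1.2, "rejected_score": 0.3},
    {"category": "emotion", "chosen_score": 0.8, "rejected_score": -0.1},
    {"category": "emotion", "chosen_score": 0.5, "rejected_score": 0.1},
    {"category": "emotion", "chosen_score": 2.0, "rejected_score": 1.0},
    {"category": "prosody", "chosen_score": -0.4, "rejected_score": 0.6},
]
micro, macro = micro_macro_accuracy(toy_pairs)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

In this toy case the large category dominates the micro average, while the macro average weights both categories equally, which is why the two columns can diverge on an imbalanced benchmark.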
SDiaReward processes interleaved speech-text multi-turn dialogues through a multimodal LLM backbone and computes a scalar reward via:

$$
r = \mathrm{Linear}\big(\mathrm{Pool}(H)\big)
$$

where $H$ is the hidden representation from the final transformer layer, Pool(·) is a sequence-level pooling operator, and Linear(·) is the linear score head. The model supports three pooling strategies:
- Mean Pooling (default, best stability and accuracy)
- Attention Pooling (learnable, higher variance)
- Last-Token Pooling
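
To make the three pooling options concrete, here is a minimal, framework-free sketch of each one followed by a linear score head. It is not the code in `modeling_qwen_omni_thinker_reward.py`: the function names, the `query` vector standing in for the learnable attention parameters, and the toy weights are all assumptions made for illustration, and a real implementation would operate on batched tensors.

```python
import math

def mean_pool(hidden, mask):
    """Average the hidden vectors of non-padding tokens."""
    kept = [h for h, m in zip(hidden, mask) if m]
    dim = len(hidden[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def last_token_pool(hidden, mask):
    """Take the hidden vector of the last non-padding token."""
    last = max(i for i, m in enumerate(mask) if m)
    return list(hidden[last])

def attention_pool(hidden, mask, query):
    """Softmax-weighted sum; `query` stands in for a learnable vector."""
    logits = [sum(q * x for q, x in zip(query, h)) if m else -math.inf
              for h, m in zip(hidden, mask)]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]  # padding gets weight 0
    total = sum(weights)
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) / total
            for d in range(dim)]

def score_head(pooled, weight, bias=0.0):
    """Linear head mapping the pooled vector to a scalar reward."""
    return sum(w * x for w, x in zip(weight, pooled)) + bias

# Toy sequence: three real tokens plus one padding position, hidden size 2.
H = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
w = [0.5, -0.5]  # hypothetical head weights

print(score_head(mean_pool(H, mask), w))
print(score_head(last_token_pool(H, mask), w))
print(score_head(attention_pool(H, mask, query=[0.0, 0.0]), w))
```

With an all-zero `query`, every real token receives the same attention weight, so attention pooling reduces to mean pooling. The extra freedom of a learned query is one plausible source of the higher variance noted above.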
The model is trained with a Bradley-Terry pairwise preference loss with center loss regularization to prevent reward score drift.
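
The sketch below illustrates why a centering term helps. The Bradley-Terry loss depends only on the difference between the chosen and rejected rewards, so adding the same constant to both leaves it unchanged and the absolute scale is free to drift. The exact regularizer used in this repository is not spelled out here; the example assumes a squared penalty on the sum of the two rewards, and the coefficient value is hypothetical.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed in a stable form."""
    margin = r_chosen - r_rejected
    # softplus(-margin) rewritten to avoid overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def center_penalty(r_chosen, r_rejected):
    """Assumed regularizer: penalize pairs whose rewards are not centered at 0."""
    return (r_chosen + r_rejected) ** 2

def reward_loss(pairs, center_coef=0.01):
    """Mean pairwise loss plus the centering term (coefficient is hypothetical)."""
    total = 0.0
    for r_c, r_r in pairs:
        total += bradley_terry_loss(r_c, r_r) + center_coef * center_penalty(r_c, r_r)
    return total / len(pairs)

# Same margin, different offsets: the preference loss cannot tell them apart,
# but the centering term penalizes the drifted pair.
centered = [(0.5, -0.5)]
drifted = [(100.5, 99.5)]
print(bradley_terry_loss(*centered[0]), bradley_terry_loss(*drifted[0]))
print(reward_loss(centered), reward_loss(drifted))
```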
```bash
pip install -r requirements.txt
```

Core dependencies: PyTorch >= 2.4.0, Transformers >= 4.46.0, DeepSpeed >= 0.14.0, TRL >= 0.21.0, qwen_omni_utils >= 0.0.8.
Score a single conversation:

```bash
python inference.py \
    --ckpt_dir <path_to_reward_model_checkpoint> \
    --base_ckpt <path_to_base_qwen_omni_model> \
    --conversation_json <path_to_conversation.json>
```

The conversation JSON should contain a list of message dicts with interleaved text and audio:
```json
[
  {"role": "user", "content": [
    {"type": "text", "text": "Hello, how are you?"},
    {"type": "audio", "audio": "path/to/user_audio.wav"}
  ]},
  {"role": "assistant", "content": [
    {"type": "text", "text": "I'm doing great!"},
    {"type": "audio", "audio": "path/to/assistant_audio.wav"}
  ]}
]
```

3B model:
```bash
bash scripts/train_3b.sh
```

7B model:

```bash
bash scripts/train_7b.sh
```

Before running, edit the script to set:

- `MODEL_NAME_OR_PATH`: path to the base Qwen2.5-Omni checkpoint
- `DATASET_NAME`: path to the preference dataset
- `CUDA_VISIBLE_DEVICES` / `NUM_GPUS`: adjust for your hardware
```bash
python eval_model.py \
    --ckpt_dir <path_to_checkpoint> \
    --base_ckpt <path_to_base_qwen_omni_model> \
    --dataset_path <path_to_eval_dataset> \
    --output_dir eval_outputs/
```

```
SDiaReward/
├── model/
│   ├── modeling_qwen_omni_thinker_reward.py    # Reward model (pooling + score head)
│   └── processing_qwen_omni_thinker_reward.py  # Multimodal processor
├── trainer/
│   ├── multimodal_reward_trainer.py            # Custom reward trainer
│   └── collator.py                             # Data collator & dataset wrapper
├── utils/
│   ├── fast_whisper_feature_extractor.py       # Optimized audio feature extractor
│   └── load_utils.py                           # Dataset loading utilities
├── deepspeed_configs/                          # ZeRO-2/3/3-offload configs
├── scripts/
│   ├── train_3b.sh                             # 3B training launch script
│   └── train_7b.sh                             # 7B training launch script
├── train.py                                    # Training entry point
├── eval_model.py                               # Evaluation script
├── inference.py                                # Inference example
└── requirements.txt
```
The training data should be a HuggingFace Dataset (or JSON) with chosen and rejected fields, each containing a multi-turn conversation in the chat format. Each turn can include text, audio, image, or video content.
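
As an illustration of this layout, the snippet below assembles one preference record and runs a basic structural check on it. The helper names, the audio paths, and the validation rules (including the assumption that each content item stores its payload under a key named after its `type`) are hypothetical and are not part of the repository's loading utilities.

```python
import json

def make_turn(role, text, audio_path):
    """Build one chat turn with interleaved text and audio content."""
    return {"role": role, "content": [
        {"type": "text", "text": text},
        {"type": "audio", "audio": audio_path},
    ]}

def validate_conversation(conversation):
    """Check the basic chat structure; raise ValueError on problems."""
    allowed_types = {"text", "audio", "image", "video"}
    allowed_roles = {"user", "assistant", "system"}  # "system" is an assumption
    for turn in conversation:
        if turn.get("role") not in allowed_roles:
            raise ValueError(f"unexpected role: {turn.get('role')!r}")
        for item in turn["content"]:
            # Assumed convention: the payload key matches the item type.
            if item.get("type") not in allowed_types or item["type"] not in item:
                raise ValueError(f"malformed content item: {item!r}")

# Placeholder paths: both conversations share the user turn and differ
# only in the assistant's spoken reply.
user_turn = make_turn("user", "Hello, how are you?", "path/to/user_audio.wav")
record = {
    "chosen": [user_turn,
               make_turn("assistant", "I'm doing great!", "path/to/chosen_reply.wav")],
    "rejected": [user_turn,
                 make_turn("assistant", "I'm doing great!", "path/to/rejected_reply.wav")],
}

for key in ("chosen", "rejected"):
    validate_conversation(record[key])

print(json.dumps(record, indent=2))
```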
If you find this project useful for your research, please consider citing our paper:
```bibtex
@misc{lu2026modelingbenchmarkingspokendialogue,
  title={Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness},
  author={Jingyu Lu and Yuhan Wang and Fan Zhuo and Xize Cheng and Changhao Pan and Xueyi Pu and Yifu Chen and Chenyuhao Wen and Tianle Liang and Zhou Zhao},
  year={2026},
  eprint={2603.14889},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2603.14889},
}
```

This project is intended for research purposes only. The code is licensed under the Apache 2.0 License.
Due to copyright and privacy considerations, the public repository only contains derived artifacts for the SDiaReward-Dataset and ESDR-Bench. To request access to the original audio data, please contact us via email at lujingyu@zju.edu.cn.
