This repository accompanies the paper "Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation" (NAACL 2025, Oral) by Dongryeol Lee*, Yerin Hwang*, Yongil Kim, Joonsuk Park, and Kyomin Jung.
- Read the paper: https://arxiv.org/abs/2410.20774
- Download the dataset: https://huggingface.co/datasets/Dongryeol/EMBER
If you find our task or EMBER useful, please cite our paper:
@misc{lee2024llmjudgesrobustexpressionsuncertainty,
title={Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation},
author={Dongryeol Lee and Yerin Hwang and Yongil Kim and Joonsuk Park and Kyomin Jung},
year={2024},
eprint={2410.20774},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.20774},
}
Please also credit and cite the creators of EVOUNA and MIXINSTRUCT, the datasets on which EMBER is built:
@article{wang2023evaluating,
title={Evaluating open-qa evaluation},
author={Wang, Cunxiang and Cheng, Sirui and Guo, Qipeng and Yue, Yuanhao and Ding, Bowen and Xu, Zhikun and Wang, Yidong and Hu, Xiangkun and Zhang, Zheng and Zhang, Yue},
journal={Advances in Neural Information Processing Systems},
volume={36},
pages={77013--77042},
year={2023}
}
@inproceedings{jiang2023llm,
title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
author={Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen},
booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={14165--14178},
year={2023}
}
We provide our new dataset, EMBER, as three files:
- ember_if.json (1.42M)
- ember_qa_gpt4.json (0.87M)
- ember_qa_newbing.json (1.6M)
ember_if.json contains a list of dictionaries, each representing a single data point, with the following keys:
- id: Original data ID from the MIXINSTRUCT dataset
- input: Input instruction
- reference: Reference answer
- output_1: Output candidate 1
- output_2: Output candidate 2
- output_1_str: Output candidate 1 with a Strengthener
- output_1_weak: Output candidate 1 with a Weakener
- output_2_str: Output candidate 2 with a Strengthener
- output_2_weak: Output candidate 2 with a Weakener
- str: Applied Strengthener
- weak: Applied Weakener
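The key layout above can be read with a few lines of Python. The following is a minimal sketch rather than the repository's own loader: the helper names (`load_records`, `missing_keys`, `candidate_pair`) and the variant labels `plain`, `str`, and `weak` are hypothetical and introduced only for illustration.

```python
import json
from pathlib import Path

# Keys listed above for each record in ember_if.json.
IF_KEYS = {
    "id", "input", "reference",
    "output_1", "output_2",
    "output_1_str", "output_1_weak",
    "output_2_str", "output_2_weak",
    "str", "weak",
}

# Hypothetical variant labels mapped to the key suffix they select.
SUFFIX = {"plain": "", "str": "_str", "weak": "_weak"}


def load_records(path):
    """Load a JSON file that holds a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def missing_keys(record):
    """Return the documented keys that a record lacks, sorted."""
    return sorted(IF_KEYS - set(record))


def candidate_pair(record, variant_1="plain", variant_2="plain"):
    """Pick the requested variant of each output candidate."""
    return (
        record["output_1" + SUFFIX[variant_1]],
        record["output_2" + SUFFIX[variant_2]],
    )
```

For example, assuming ember_if.json has been downloaded to the working directory, `candidate_pair(load_records("ember_if.json")[0], "str", "weak")` would return candidate 1 with a Strengthener and candidate 2 with a Weakener.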
ember_qa_gpt4.json and ember_qa_newbing.json each contain a list of dictionaries, each representing a single data point, with the following keys:
- question: Input question
- golden_answer: Reference answer set
- answer_[gpt4/newbing]: Answer generated by GPT-4/New Bing reader
- judge_[gpt4/newbing]: Human judgment of the answer generated by GPT-4/New Bing reader
- answer_[gpt4/newbing]_str: Answer from GPT-4/New Bing reader with a Strengthener
- answer_[gpt4/newbing]_weak: Answer from GPT-4/New Bing reader with a Weakener
- answer_[gpt4/newbing]_plain: Original answer from GPT-4/New Bing reader (without modifications)
- str: Applied Strengthener
- weak: Applied Weakener
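The bracketed `[gpt4/newbing]` part of the keys above varies with the reader. The sketch below shows one way to resolve the concrete key names, assuming the placeholder expands to the literal strings `gpt4` and `newbing` (as in the file names). The helper names are hypothetical and not part of the repository.

```python
# Assumption: the [gpt4/newbing] placeholder expands to these literal strings.
READERS = ("gpt4", "newbing")


def qa_field_names(reader):
    """Map short labels to the reader-specific keys of a QA record."""
    if reader not in READERS:
        raise ValueError(f"unknown reader: {reader!r}")
    return {
        "answer": f"answer_{reader}",
        "judge": f"judge_{reader}",
        "str": f"answer_{reader}_str",
        "weak": f"answer_{reader}_weak",
        "plain": f"answer_{reader}_plain",
    }


def answer_variants(record, reader):
    """Collect the plain, Strengthener, and Weakener answers of one record."""
    names = qa_field_names(reader)
    return {label: record[names[label]] for label in ("plain", "str", "weak")}
```

Each variant can then be paired with the record's `question` and `golden_answer` when building a judge prompt.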
conda create -n ember && conda activate ember
pip install -r requirements.txt
To get the main evaluation results on the instruction-following task, edit the parameters in run_ifeval.sh and run:
bash run_ifeval.sh

Similarly, to get the main evaluation results on the question-answering task, edit the parameters in run_qaeval.sh and run:
bash run_qaeval.sh

