EMBER README

This repository accompanies the paper "Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation" (NAACL 2025, Oral) by Dongryeol Lee*, Yerin Hwang*, Yongil Kim, Joonsuk Park, and Kyomin Jung.

Read the paper: https://arxiv.org/abs/2410.20774
Download the dataset: https://huggingface.co/datasets/Dongryeol/EMBER

Content

Citation
Dataset Contents
- EMBER_IF Format
- EMBER_QA Format
Create conda environment and install requirements
Evaluation Codes

Citation

If you find our task or EMBER useful, please cite our paper:

@misc{lee2024llmjudgesrobustexpressionsuncertainty,
      title={Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation}, 
      author={Dongryeol Lee and Yerin Hwang and Yongil Kim and Joonsuk Park and Kyomin Jung},
      year={2024},
      eprint={2410.20774},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.20774}, 
}

Please also credit and cite the creators of EVOUNA and MIXINSTRUCT, the datasets on which EMBER is built:

@article{wang2023evaluating,
  title={Evaluating open-qa evaluation},
  author={Wang, Cunxiang and Cheng, Sirui and Guo, Qipeng and Yue, Yuanhao and Ding, Bowen and Xu, Zhikun and Wang, Yidong and Hu, Xiangkun and Zhang, Zheng and Zhang, Yue},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={77013--77042},
  year={2023}
}

@inproceedings{jiang2023llm,
  title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
  author={Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={14165--14178},
  year={2023}
}

Dataset Contents

We provide our new dataset EMBER:

ember_if.json (1.42M)
ember_qa_gpt4.json (0.87M)
ember_qa_newbing.json (1.6M)

EMBER_IF Format

This file contains a list of dictionaries, each representing a single data point with the following keys:

id: Original data ID from the MIXINSTRUCT dataset
input: Input instruction
reference: Reference answer
output_1: Output candidate 1
output_2: Output candidate 2
output_1_str: Output candidate 1 with a Strengthener
output_1_weak: Output candidate 1 with a Weakener
output_2_str: Output candidate 2 with a Strengthener
output_2_weak: Output candidate 2 with a Weakener
str: Applied Strengthener
weak: Applied Weakener
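
The key layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the released code: the helper names load_records and pair_variants are hypothetical, the file path is left to the caller, and enumerating all nine variant pairings is an illustrative choice rather than a description of what run_ifeval.py does.

```python
import json
from itertools import product

# Suffixes taken from the key list above; "plain" means the unmodified output.
SUFFIXES = {"plain": "", "str": "_str", "weak": "_weak"}


def load_records(path):
    """Load a JSON file that holds a list of dictionaries (one per data point)."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list")
    return records


def pair_variants(record):
    """Map (variant_of_output_1, variant_of_output_2) to the two candidate texts."""
    pairs = {}
    for (name_1, suffix_1), (name_2, suffix_2) in product(SUFFIXES.items(), repeat=2):
        pairs[(name_1, name_2)] = (
            record["output_1" + suffix_1],
            record["output_2" + suffix_2],
        )
    return pairs
```

For example, pair_variants(record)[("str", "weak")] returns output_1_str together with output_2_weak for that data point.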

EMBER_QA Format

These files contain a list of dictionaries, each representing a single data point with the following keys:

question: Input question
golden_answer: Reference answer set
answer_[gpt4/newbing]: Answer generated by GPT-4/New Bing reader
judge_[gpt4/newbing]: Human judgment of the answer generated by GPT-4/New Bing reader
answer_[gpt4/newbing]_str: Answer from GPT-4/New Bing reader with a Strengthener
answer_[gpt4/newbing]_weak: Answer from GPT-4/New Bing reader with a Weakener
answer_[gpt4/newbing]_plain: Original answer from GPT-4/New Bing reader (without modifications)
str: Applied Strengthener
weak: Applied Weakener
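
Because the QA key names depend on the reader, a small accessor can keep the lookups in one place. This is a sketch under one assumption: that the bracket notation above expands to concrete keys such as answer_gpt4_str or judge_newbing. The function name qa_variants and the shape of the returned dictionary are hypothetical.

```python
READERS = ("gpt4", "newbing")


def qa_variants(record, reader):
    """Collect the question, references, human judgment, and answer variants."""
    if reader not in READERS:
        raise ValueError(f"reader must be one of {READERS}, got {reader!r}")
    base = f"answer_{reader}"  # assumed expansion of answer_[gpt4/newbing]
    return {
        "question": record["question"],
        "golden_answer": record["golden_answer"],
        "human_judgment": record[f"judge_{reader}"],
        "answers": {
            "plain": record[f"{base}_plain"],
            "str": record[f"{base}_str"],
            "weak": record[f"{base}_weak"],
        },
    }
```

Records from ember_qa_gpt4.json would be read with reader="gpt4", and those from ember_qa_newbing.json with reader="newbing".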

Create conda environment and install requirements

conda create -n ember && conda activate ember
pip install -r requirements.txt

Evaluation Codes

Edit the parameters in "run_ifeval.sh", then run

bash run_ifeval.sh

to obtain the main evaluation results on the Instruction Following task.

Similarly, edit the parameters in "run_qaeval.sh", then run

bash run_qaeval.sh

to obtain the main evaluation results on the Question Answering task.
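
To illustrate what a robustness check over the QA files could look like, the sketch below scores a judge against the human judgments separately for the plain, Strengthener, and Weakener variants of each answer. It is an assumption-laden illustration, not the logic of run_qaeval.py: judge_fn is a hypothetical callable standing in for an LLM judge, the concrete key names follow the assumed expansion of answer_[gpt4/newbing], and the judge is assumed to return verdicts in the same form as the stored human judgment.

```python
from collections import defaultdict

VARIANTS = ("plain", "str", "weak")


def accuracy_by_variant(records, reader, judge_fn):
    """Return the judge's agreement rate with human judgments per answer variant.

    judge_fn(question, golden_answer, answer) is a hypothetical callable whose
    return value is compared for equality with the human judgment.
    """
    correct = defaultdict(int)
    for rec in records:
        human = rec[f"judge_{reader}"]
        for variant in VARIANTS:
            answer = rec[f"answer_{reader}_{variant}"]
            verdict = judge_fn(rec["question"], rec["golden_answer"], answer)
            correct[variant] += int(verdict == human)
    total = len(records)
    if total == 0:
        return {}
    return {variant: correct[variant] / total for variant in VARIANTS}
```

A gap between the "plain" score and the "str" or "weak" scores would suggest that the judge's verdicts shift when an epistemic marker is added to an otherwise identical answer.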
