EMBER README

This repository accompanies the paper "Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation" (NAACL 2025, Oral) by Dongryeol Lee*, Yerin Hwang*, Yongil Kim, Joonsuk Park, and Kyomin Jung.

Main Figure

Content

  1. Citation
  2. Dataset Contents
  3. Evaluation Codes

Citation

If you find our task or EMBER useful, please cite our paper:

@misc{lee2024llmjudgesrobustexpressionsuncertainty,
      title={Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation}, 
      author={Dongryeol Lee and Yerin Hwang and Yongil Kim and Joonsuk Park and Kyomin Jung},
      year={2024},
      eprint={2410.20774},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.20774}, 
}

Please also credit and cite the creators of EVOUNA and MIXINSTRUCT, the datasets on which EMBER is built:

@article{wang2023evaluating,
  title={Evaluating open-qa evaluation},
  author={Wang, Cunxiang and Cheng, Sirui and Guo, Qipeng and Yue, Yuanhao and Ding, Bowen and Xu, Zhikun and Wang, Yidong and Hu, Xiangkun and Zhang, Zheng and Zhang, Yue},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={77013--77042},
  year={2023}
}

@inproceedings{jiang2023llm,
  title={LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion},
  author={Jiang, Dongfu and Ren, Xiang and Lin, Bill Yuchen},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={14165--14178},
  year={2023}
}

Dataset Contents

We provide our new dataset EMBER:

  • ember_if.json (1.42M)
  • ember_qa_gpt4.json (0.87M)
  • ember_qa_newbing.json (1.6M)

EMBER_IF Format

This file contains a list of dictionaries, each representing a single data point with the following keys:

  • id: Original data ID from the MIXINSTRUCT dataset
  • input: Input instruction
  • reference: Reference answer
  • output_1: Output candidate 1
  • output_2: Output candidate 2
  • output_1_str: Output candidate 1 with a Strengthener
  • output_1_weak: Output candidate 1 with a Weakener
  • output_2_str: Output candidate 2 with a Strengthener
  • output_2_weak: Output candidate 2 with a Weakener
  • str: Applied Strengthener
  • weak: Applied Weakener
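As a minimal sketch of how the EMBER_IF keys listed above might be read in Python: the helper names `load_ember_if` and `output_pairs` are hypothetical and are not part of this repository's evaluation code. The sketch only checks that each record carries the documented keys and enumerates every combination of plain, strengthened, and weakened candidates; which combinations the evaluation scripts actually use is not specified here.

```python
import json
from pathlib import Path

# Keys documented in the EMBER_IF format description.
IF_KEYS = {
    "id", "input", "reference",
    "output_1", "output_2",
    "output_1_str", "output_1_weak",
    "output_2_str", "output_2_weak",
    "str", "weak",
}


def load_ember_if(path):
    """Load an EMBER_IF JSON file and verify each record has the documented keys."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    for index, record in enumerate(records):
        missing = IF_KEYS - record.keys()
        if missing:
            raise KeyError(f"record {index} is missing keys: {sorted(missing)}")
    return records


def output_pairs(record):
    """Map (marker on output 1, marker on output 2) to the two candidate texts.

    "plain" means the unmodified candidate; "str" and "weak" select the
    versions with the Strengthener or Weakener applied.
    """
    suffixes = {"plain": "", "str": "_str", "weak": "_weak"}
    return {
        (marker_1, marker_2): (
            record["output_1" + suffix_1],
            record["output_2" + suffix_2],
        )
        for marker_1, suffix_1 in suffixes.items()
        for marker_2, suffix_2 in suffixes.items()
    }
```

For example, `output_pairs(load_ember_if("ember_if.json")[0])[("str", "weak")]` would return the first record's strengthened candidate 1 together with its weakened candidate 2, assuming the file sits in the current working directory.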

EMBER_QA Format

Each of these files contains a list of dictionaries, each representing a single data point with the following keys:

  • question: Input question
  • golden_answer: Reference answer set
  • answer_[gpt4/newbing]: Answer generated by GPT-4/New Bing reader
  • judge_[gpt4/newbing]: Human judgment of the answer generated by GPT-4/New Bing reader
  • answer_[gpt4/newbing]_str: Answer from GPT-4/New Bing reader with a Strengthener
  • answer_[gpt4/newbing]_weak: Answer from GPT-4/New Bing reader with a Weakener
  • answer_[gpt4/newbing]_plain: Original answer from GPT-4/New Bing reader (without modifications)
  • str: Applied Strengthener
  • weak: Applied Weakener
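The bracketed key names above expand to one concrete name per reader, for example answer_gpt4_str in ember_qa_gpt4.json. The following sketch shows one way to resolve those names in Python; `qa_keys`, `load_ember_qa`, and `answer_variants` are hypothetical helpers, not functions from this repository, and the value type of the judge field is not assumed.

```python
import json
from pathlib import Path

# Reader names used in the key suffixes and file names.
READERS = ("gpt4", "newbing")


def qa_keys(reader):
    """Return the concrete key names for one reader ("gpt4" or "newbing")."""
    if reader not in READERS:
        raise ValueError(f"unknown reader: {reader!r}")
    return {
        "question": "question",
        "golden_answer": "golden_answer",
        "answer": f"answer_{reader}",
        "judge": f"judge_{reader}",
        "answer_str": f"answer_{reader}_str",
        "answer_weak": f"answer_{reader}_weak",
        "answer_plain": f"answer_{reader}_plain",
        "str": "str",
        "weak": "weak",
    }


def load_ember_qa(path, reader):
    """Load an EMBER_QA JSON file and verify the documented keys are present."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    required = set(qa_keys(reader).values())
    for index, record in enumerate(records):
        missing = required - record.keys()
        if missing:
            raise KeyError(f"record {index} is missing keys: {sorted(missing)}")
    return records


def answer_variants(record, reader):
    """Return the plain, strengthened, and weakened answers for one record."""
    keys = qa_keys(reader)
    return {
        "plain": record[keys["answer_plain"]],
        "str": record[keys["answer_str"]],
        "weak": record[keys["answer_weak"]],
    }
```

A call such as `load_ember_qa("ember_qa_newbing.json", "newbing")` assumes the file is in the current working directory; each returned record can then be passed to `answer_variants` with the same reader name.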

Create conda environment and install requirements

conda create -n ember && conda activate ember
pip install -r requirements.txt

Evaluation Codes

Edit the parameters in "run_ifeval.sh" and run

bash run_ifeval.sh

to get the main evaluation results on the instruction-following task.

Main Result IF

Similarly, edit the parameters in "run_qaeval.sh" and run

bash run_qaeval.sh

to get the main evaluation results on the question-answering task.

Main Result QA

About

The official implementation of NAACL 2025, "Are LLM-Judges Robust to Expressions of Uncertainty? Investigating the effect of Epistemic Markers on LLM-based Evaluation"
