Name	Name	Last commit message	Last commit date
parent directory ..
ARC	ARC
C-Eval	C-Eval
CMMLU	CMMLU
MMLU	MMLU
PIQA	PIQA
bbh	bbh
lm-evaluation-harness	lm-evaluation-harness
README.md	README.md
requirements.txt	requirements.txt

Name

Last commit message

Last commit date

C-Eval

lm-evaluation-harness

README.md

requirements.txt

Evaluation

Overview

We provide test code and scripts for different evaluation tasks to facilitate the reproduction of results, including implementations for specific datasets and adaptation based on the lm-evaluation-harness framework. Evaluation tasks included:

World knowledge and math knowledge tasks: MMLU,BBH,GSM8K,MATH
Chinese tasks: C-EVAL,CMMLU
Commonsense & reading comprehension tasks: SciQ,PIQA,ARC-E,ARC-C,LogiQA,BoolQ

Usage

Implementations for specific datasets

prepare environment:

pip install -r requirements.txt

Evaluation single dataset: ARC for example

cd ARC
bash eval.sh

lm-evaluation-harness

Install

Install the lm-evaluation-harness package that is adapted for the OpenBA evaluation.:

pip install transformers==4.31.0
cd lm-evaluation-harness
pip install -e .

Evaluation all datasets：

 bash script/eval_harness.sh

Evaluation single dataset such as 'sciq':

lm_eval --model hf \
	--model_args pretrained=<model_name>,backend=seq2seq,trust_remote_code=True,max_length=1040 \
	--tasks   'sciq' \
	--batch_size 8 \
	--use_cache <cache_path> \
	--write_out --output_path <write_out_path> \
	--log_samples \

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

README.md

Evaluation

Overview

Usage

Implementations for specific datasets

lm-evaluation-harness

Install

Uh oh!

FilesExpand file tree

evaluation

Directory actions

More options

Directory actions

More options

Latest commit

History

evaluation

Folders and files

parent directory

README.md

Evaluation

Overview

Usage

Implementations for specific datasets

lm-evaluation-harness

Install