diff --git a/.trae/documents/improve_custom_config_docs.md b/.trae/documents/improve_custom_config_docs.md new file mode 100644 index 00000000..fe818116 --- /dev/null +++ b/.trae/documents/improve_custom_config_docs.md @@ -0,0 +1,222 @@ +# 计划:提升自定义配置文件方式的文档可见性与完善度 + +## 目标 + +让用户充分感知并理解"自定义配置文件"这一强大能力,它本质上可以覆盖 AISBench 所有主要评测场景,并且支持完整的 Python 语法灵活性。 + +--- + +## 一、完善 `run_custom_config.md` 文档 + +### 1.1 新增"为什么使用自定义配置文件"章节 +- 对比 CLI 参数方式(`--models` + `--datasets`)与自定义配置文件方式的优劣 +- 强调自定义配置文件的优势: + - 可复用,一次编写多次执行 + - 支持 Python 全部语法(循环、条件、函数、列表推导等) + - 精确控制模型-数据集组合(`model_dataset_combinations`) + - 可在一个文件中组合任意数量的模型和数据集 + - 支持自定义 `infer`/`eval` 分区器和运行器配置 + - 方便版本管理和团队共享 + +### 1.2 新增"配置文件即 Python 脚本"章节 +- 明确说明:配置文件本质上是 Python 脚本,支持所有 Python 语法 +- 提供实用示例: + - 使用 for 循环批量生成模型配置(如不同端口、不同 IP) + - 使用列表推导式批量修改数据集 abbr + - 使用条件判断动态选择配置 + - 使用 `.copy()` 复用配置并修改特定字段 + - 从外部文件读取配置参数 + +### 1.3 新增"配置文件完整变量参考"章节 +- `models`:模型配置列表,每个元素为 dict +- `datasets`:数据集配置列表,每个元素为 dict +- `summarizer`:汇总器配置 dict +- `model_dataset_combinations`:可选,精确控制模型-数据集配对 +- `work_dir`:输出目录 +- `infer`:可选,推理阶段的分区器和运行器配置 +- `eval`:可选,评估阶段的分区器和运行器配置 +- 说明每种变量的类型、必填字段、可选字段 + +### 1.4 新增"各场景自定义配置文件示例"章节 +为每个主要场景提供完整的自定义配置文件示例: + +| 场景 | 说明 | +|------|------| +| 服务化精度测评 | API 模型 + 开源数据集精度测评 | +| 纯模型精度测评 | HuggingFace 本地模型精度测评 | +| 服务化性能测评 | API 流式模型 + 数据集性能测评 | +| 合成数据集性能测评 | API 流式模型 + synthetic 数据集性能测评 | +| 多模型多数据集组合 | 多个模型 × 多个数据集的笛卡尔积测评 | +| 自定义模型-数据集配对 | 使用 `model_dataset_combinations` 精确配对 | +| 裁判模型测评 | 被测模型 + 裁判模型的精度测评 | +| 稳态性能测评 | 使用 `stable_stage` summarizer 的性能测评 | +| 多轮对话性能测评 | ShareGPT/MTBench 多轮对话性能测评 | +| 自定义数据集测评 | 使用自定义 CSV/JSONL 数据集的测评 | +| 多模态测评 | 多模态数据集 + 多模态模型测评 | + +### 1.5 完善现有内容 +- 更新"预设自定义配置文件样例列表"表格,补充新增的样例文件 +- 修正文档中可能存在的过时或不准确描述 + +--- + +## 二、在整个文档体系中接入自定义配置文件引用 + +### 2.1 修改 `index.rst`(文档首页) +- 在"推荐上手路径"中,将自定义配置文件的推荐提前或加重 +- 在 toctree 中可以考虑将 `run_custom_config` 提升到更显眼的位置,或添加一个独立的引导章节 + +### 2.2 修改 `get_started/quick_start.md`(快速入门) +- 在执行命令章节末尾添加"💡 进阶提示",引导用户了解自定义配置文件方式 +- 
说明:当需要重复执行或复杂组合时,推荐使用自定义配置文件 + +### 2.3 修改各场景文档,添加"通过自定义配置文件实现"小节 + +在以下文档的适当位置(建议在"主要功能场景"之后或每个功能场景末尾)添加引用: + +| 文档 | 添加位置 | 引用内容 | +|------|---------|---------| +| `base_tutorials/scenes_intro/accuracy_benchmark.md` | 多任务测评章节后 | 说明如何用自定义配置文件实现多任务精度测评 | +| `base_tutorials/scenes_intro/accuracy_benchmark_local.md` | 主要功能章节后 | 说明如何用自定义配置文件实现纯模型精度测评 | +| `base_tutorials/scenes_intro/performance_benchmark.md` | 多任务测评章节后 | 说明如何用自定义配置文件实现多任务性能测评 | +| `advanced_tutorials/multiturn_benchmark.md` | 快速入门章节后 | 说明如何用自定义配置文件实现多轮对话测评 | +| `advanced_tutorials/synthetic_dataset.md` | 使用指导章节后 | 说明如何用自定义配置文件实现合成数据集测评 | +| `advanced_tutorials/custom_dataset.md` | 配置文件方式章节后 | 说明如何用自定义配置文件实现自定义数据集测评 | +| `advanced_tutorials/judge_model_evaluate.md` | 快速上手章节后 | 说明如何用自定义配置文件实现裁判模型测评 | +| `advanced_tutorials/stable_stage.md` | 快速入门章节后 | 说明如何用自定义配置文件实现稳态性能测评 | +| `advanced_tutorials/rps_distribution.md` | 文件配置章节后 | 说明 RPS 分布控制参数在自定义配置文件中同样适用 | +| `extended_benchmark/agent/swe_bench.md` | 适当位置 | 说明已有预设配置文件样例 | +| `extended_benchmark/agent/tau2_bench.md` | 适当位置 | 说明已有预设配置文件样例 | +| `extended_benchmark/agent/harbor_bench.md` | 适当位置 | 说明已有预设配置文件样例 | +| `extended_benchmark/lmm_generate/vbench.md` | 适当位置 | 说明已有预设配置文件样例 | +| `extended_benchmark/lmm_generate/gedit_bench.md` | 适当位置 | 说明已有预设配置文件样例 | +| `best_practices/practice_nvidia.md` | 适当位置 | 说明推荐使用自定义配置文件 | +| `best_practices/practice_ascend.md` | 适当位置 | 说明推荐使用自定义配置文件 | + +每个引用小节的统一格式: +```markdown +### 通过自定义配置文件实现 + +> 💡 上述场景同样可以通过 [自定义配置文件](../advanced_tutorials/run_custom_config.md) 实现,将模型、数据集、summarizer 等配置写入一个 Python 文件,一次编写、多次复用。详见 [自定义配置文件运行AISBench](../advanced_tutorials/run_custom_config.md#各场景自定义配置文件示例) 中对应场景的示例。 +``` + +--- + +## 三、强调配置文件支持 Python 全部语法 + +### 3.1 在 `run_custom_config.md` 中重点强调 +- 新增独立章节"配置文件即 Python 脚本"(如 1.2 所述) +- 在文档开头添加醒目的提示框 + +### 3.2 在 `index.rst` 中提及 +- 在推荐上手路径中增加一句描述,强调配置文件的灵活性 + +### 3.3 在各场景文档的引用小节中提及 +- 在"通过自定义配置文件实现"小节中简要提及 Python 语法的灵活性 + +--- + +## 四、在代码仓库中预置不同场景的自定义配置文件样例 + +在 `ais_bench/configs/` 
下新增/完善以下样例文件: + +### 4.1 精度测评场景(`ais_bench/configs/api_examples/`) + +| 文件名 | 说明 | +|--------|------| +| `demo_infer_vllm_api.py` | ✅ 已有,需检查完善 | +| `infer_vllm_api_general.py` | ✅ 已有 | +| `infer_vllm_api_general_chat.py` | ✅ 已有 | +| `infer_vllm_api_stream_chat.py` | ✅ 已有 | +| `infer_mindie_stream_api_general.py` | ✅ 已有 | +| `infer_vllm_api_multi_model_multi_dataset.py` | 🆕 多模型多数据集精度测评示例 | +| `infer_vllm_api_with_model_dataset_combinations.py` | 🆕 自定义模型-数据集配对示例 | +| `infer_vllm_api_with_judge_model.py` | 🆕 裁判模型测评示例 | + +### 4.2 性能测评场景(`ais_bench/configs/api_examples/`) + +| 文件名 | 说明 | +|--------|------| +| `demo_infer_vllm_api_perf.py` | ✅ 已有,需检查完善 | +| `perf_vllm_api_synthetic.py` | 🆕 合成数据集性能测评示例 | +| `perf_vllm_api_stable_stage.py` | 🆕 稳态性能测评示例 | +| `perf_vllm_api_multiturn.py` | 🆕 多轮对话性能测评示例 | +| `perf_vllm_api_custom_dataset.py` | 🆕 自定义数据集性能测评示例 | +| `perf_vllm_api_rps_distribution.py` | 🆕 RPS 分布控制性能测评示例 | + +### 4.3 纯模型测评场景(`ais_bench/configs/hf_example/`) + +| 文件名 | 说明 | +|--------|------| +| `infer_hf_chat_model.py` | ✅ 已有 | +| `infer_hf_base_model.py` | ✅ 已有 | +| `infer_hf_multi_model_multi_dataset.py` | 🆕 多模型多数据集纯模型测评示例 | + +### 4.4 多模态场景(`ais_bench/configs/lmm_example/`) + +| 文件名 | 说明 | +|--------|------| +| `multi_device_run_qwen_image_edit.py` | ✅ 已有 | +| `infer_lmm_multi_dataset.py` | 🆕 多模态多数据集测评示例 | + +### 4.5 Agent 场景(已有,需检查) + +| 文件名 | 说明 | +|--------|------| +| `agent_example/harbor_terminal_bench_2_task.py` | ✅ 已有 | +| `agent_example/tau2_bench_task.py` | ✅ 已有 | +| `swe_bench_examples/*.py` | ✅ 已有 | +| `swe_bench_pro_examples/*.py` | ✅ 已有 | + +### 4.6 VBench 场景(已有,需检查) + +| 文件名 | 说明 | +|--------|------| +| `vbench_examples/eval_vbench_standard.py` | ✅ 已有 | +| `vbench_examples/eval_vbench_custom.py` | ✅ 已有 | + +### 4.7 更新 `all_dataset_configs.py` +- 确保所有常用数据集的导入都已包含 +- 添加注释说明每个导入对应的数据集 + +--- + +## 五、实施步骤 + +### 步骤 1:完善 `run_custom_config.md` +- 新增"为什么使用自定义配置文件"章节 +- 新增"配置文件即 Python 脚本"章节(含 Python 语法示例) +- 新增"配置文件完整变量参考"章节 +- 
新增"各场景自定义配置文件示例"章节(含所有主要场景的完整示例代码) +- 更新"预设自定义配置文件样例列表" + +### 步骤 2:创建新的预设配置文件样例 +- 在 `ais_bench/configs/api_examples/` 下创建 4.1 和 4.2 列出的新文件 +- 在 `ais_bench/configs/hf_example/` 下创建 4.3 列出的新文件 +- 在 `ais_bench/configs/lmm_example/` 下创建 4.4 列出的新文件 +- 更新 `all_dataset_configs.py` + +### 步骤 3:在文档体系中添加交叉引用 +- 修改 `index.rst` +- 修改 `get_started/quick_start.md` +- 修改 `base_tutorials/scenes_intro/accuracy_benchmark.md` +- 修改 `base_tutorials/scenes_intro/accuracy_benchmark_local.md` +- 修改 `base_tutorials/scenes_intro/performance_benchmark.md` +- 修改 `advanced_tutorials/multiturn_benchmark.md` +- 修改 `advanced_tutorials/synthetic_dataset.md` +- 修改 `advanced_tutorials/custom_dataset.md` +- 修改 `advanced_tutorials/judge_model_evaluate.md` +- 修改 `advanced_tutorials/stable_stage.md` +- 修改 `advanced_tutorials/rps_distribution.md` +- 修改 `extended_benchmark/agent/swe_bench.md` +- 修改 `extended_benchmark/agent/tau2_bench.md` +- 修改 `extended_benchmark/agent/harbor_bench.md` +- 修改 `extended_benchmark/lmm_generate/vbench.md` +- 修改 `extended_benchmark/lmm_generate/gedit_bench.md` +- 修改 `best_practices/practice_nvidia.md` +- 修改 `best_practices/practice_ascend.md` + +### 步骤 4:验证 +- 检查所有新增/修改的 Markdown 文件中的内部链接是否正确 +- 确保新增的配置文件样例语法正确、可执行 +- 确保文档中的代码示例与实际代码逻辑一致 diff --git a/README.md b/README.md index 8c8f4ab9..87645758 100644 --- a/README.md +++ b/README.md @@ -138,16 +138,19 @@ pip3 install -r requirements/datasets/ocrbench_v2.txt ### 📦 安装方式-一键安装(备选) AISBench 也提供了一键安装方式,适用于基于预置配置文件的快速体验和评估场景,请确保安装环境联网。 + - 基本功能的安装命令如下: + ```shell pip3 install ais_bench_benchmark ``` + - 全量功能的安装命令如下: + ```shell pip3 install ais_bench_benchmark[full] ``` - 如需进一步配置、使用 CLI 或 Python 脚本发起评测任务,请参考[快速入门指南](#快速入门)。 ## ❌ 工具卸载 @@ -160,9 +163,85 @@ pip3 uninstall ais_bench_benchmark ## 🚀 快速入门 -### 命令含义 +### 运行命令前置准备 + +- 需要准备支持`v1/chat/completions`子服务的推理服务,可以参考🔗 [VLLM启动OpenAI 兼容服务器](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server)启动推理服务 +- 需要准备gsm8k数据集,可以从🔗 [opencompass + 
提供的gsm8k数据集压缩包](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip)下载。将解压后的`gsm8k/`文件夹部署到AISBench评测工具根路径下的`ais_bench/datasets`文件夹下。 + +### 启动测评(两种方式任选其一) + +| ⭐ 推荐:使用自定义配置文件 | 备选:使用命令行参数(原快速入门方式) | +| :------------------ | :------------------------------ | +| 修改一个文件,集中管理所有配置,在任意路径写配置 | 通过 `--models` `--datasets` 参数指定 | +| 一次编写,多次复用 | 每次运行需输入完整命令 | +| 支持 Python 全部语法,灵活扩展 | 仅支持笛卡尔积组合 | + +**⭐ 推荐:使用自定义配置文件** + +AISBench 提供了预置的自定义配置文件 [model\_api\_test\_zh\_cn.py](ais_bench/configs/model_api_test_zh_cn.py),将常见的推理服务化测试配置(模型选择、服务地址、端口、生成参数等)集中在一个文件中,无需分别查找和修改多个配置文件。该文件本质上是 Python 脚本,支持所有 Python 语法,你可以自由扩展。 + +打开 `ais_bench/configs/model_api_test_zh_cn.py`,根据实际情况修改以下配置(如果是`pip3 install ais_bench_benchmark`方式直接安装工具,可以在任意路径自行创建`model_api_test_zh_cn.py`,将以下配置内容写入该文件): + +```python +from mmengine.config import read_base + +with read_base(): +# 模型任务,选择其中一个,其他模型任务参考:https://ais-bench-benchmark-rf.readthedocs.io/zh-cn/latest/base_tutorials/all_params/models.html 获取更多模型任务 + # vllm_api_general 是基础模型,仅支持文本生成 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + # vllm_api_general_chat 是对话模型,支持对话 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + # vllm_api_stream_chat 是流式对话模型,支持流式对话 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + # vllm_api_general_stream 是流式模型,支持流式生成 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + +# 数据集任务,参考:https://ais-bench-benchmark-rf.readthedocs.io/zh-cn/latest/get_started/datasets.html 获取更多数据集任务 + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = vllm_api_general_chat + +models[0]["path"] = "" # 指定模型序列化词表文件的绝对路径(精度测试场景一般不需要配置) +models[0]["model"] = "" # 指定服务端加载的模型名称,根据 VLLM 推理服务实际拉取的模型名称配置(配置为空字符串则自动获取) 
+models[0]["request_rate"] = 0 # 请求发送频率:每 1/request_rate 秒向服务端发送 1 条请求;小于 0.001 时一次性发送所有请求 +models[0]["api_key"] = "" # 自定义 API key,默认为空字符串 +models[0]["host_ip"] = "localhost" # 指定推理服务的 IP +models[0]["host_port"] = 8080 # 指定推理服务的端口 +models[0]["url"] = "" # 自定义访问推理服务的 URL 路径(当基础 URL 不是 http://host_ip:host_port 的组合时需要配置;配置后 host_ip 和 host_port 将被忽略) +models[0]["max_out_len"] = 512 # 推理服务输出的最大 token 数 +models[0]["batch_size"] = 1 # 发送请求的最大并发数 +models[0]["trust_remote_code"] = False # tokenizer 是否信任远程代码,默认为 False +models[0]["generation_kwargs"] = dict( # 模型推理参数,参考 VLLM 文档配置;AISBench 评测工具不做处理,直接附加到发送的请求中 + temperature=0.01, + ignore_eos=False, +) + +# datasets[0]["path"] = "ais_bench/datasets/gsm8k" # 指定数据集目录的绝对路径(精度测试场景需要配置) + +work_dir = 'outputs/default/' # 指定任务结果和日志的保存工作目录(默认为 outputs/default/) + +``` +> 💡 配置文件中已预置了常用模型类型的导入(`vllm_api_general`、`vllm_api_general_chat`、`vllm_api_stream_chat`、`vllm_api_general_stream`),只需修改 `models = ...` 的赋值即可切换。更多自定义配置文件的用法请参考 📚 [自定义配置文件运行AISBench](./docs/source_zh_cn/advanced_tutorials/run_custom_config.md)。 -AISBench命令执行的单个或多个评测任务是由模型任务(单个或多个)、数据集任务(单个或多个)和结果呈现任务(单个)的组合定义的,AISBench的其他命令行则规定了评测任务的场景(精度评测场景、性能评测场景等)。以如下AISBench命令为例: +数据集任务的选取、准备和使用参考如下步骤: +1. 在📚 [开源数据集](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/get_started/datasets.html#id3)内选取数据集任务 +2. 进入数据集的 📚 [详细介绍/数据集部署](ais_bench/benchmark/configs/datasets/demo/README.md#数据集部署)准备数据集 +3. 
参考📚 [详细介绍/可用数据集任务](ais_bench/benchmark/configs/datasets/demo/README.md#可用数据集任务)选取可用数据集任务,并将对应的任务导入方式(例如`from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets`)复制到自定义配置文件中 + +修改好配置文件后,执行如下命令启动服务化精度评测: + +```bash +ais_bench ais_bench/configs/model_api_test_zh_cn.py +``` + +*** + +**备选:使用命令行参数** + +如果你更习惯使用命令行参数方式,AISBench 同样支持通过 `--models`、`--datasets`、`--summarizer` 参数直接指定任务。以下是与上述自定义配置文件方式**执行效果完全相同**的命令行方式。 + +AISBench命令执行的单个或多个评测任务是由模型任务(单个或多个)、数据集任务(单个或多个)和结果呈现任务(单个)的组合定义的。以如下AISBench命令为例: ```shell ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example @@ -172,28 +251,18 @@ ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_ch - `--models`指定了模型任务,即`vllm_api_general_chat`模型任务。 - `--datasets`指定了数据集任务,即`demo_gsm8k_gen_4_shot_cot_chat_prompt`数据集任务。 -- `--summarizer`指定了结果呈现任务,即`example`结果呈现任务(不指定`--summarizer`精度评测场景默认使用`example`任务),一般使用默认,不需要在命令行中指定,后续命令不指定。 +- `--summarizer`指定了结果呈现任务,即`example`结果呈现任务(不指定`--summarizer`精度评测场景默认使用`example`任务),一般使用默认,不需要在命令行中指定。 多任务测评请参考:📚 精度场景的[多任务测评](./docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark.md#多任务测评) 和 性能场景的[多任务测评](./docs/source_zh_cn/base_tutorials/scenes_intro/performance_benchmark.md#多任务测评)。 如需自行组合测评任务,实现更灵活的测评方式,可参考:📚 [自定义配置文件运行AISBench](./docs/source_zh_cn/advanced_tutorials/run_custom_config.md#自定义配置文件运行AISBench)。 -### 任务含义查询(可选) - 所选模型任务`vllm_api_general_chat`、数据集任务`demo_gsm8k_gen_4_shot_cot_chat_prompt`和结果呈现任务`example`的具体信息(简介,使用约束等)可以分别从如下链接中查询含义: - `--models`: 📚 [服务化推理后端](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/base_tutorials/all_params/models.html#id2) - `--datasets`: 📚 [开源数据集](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/get_started/datasets.html#id3) → 📚 [详细介绍](ais_bench/benchmark/configs/datasets/demo/README.md) - `--summarizer`: 📚 
[结果汇总任务](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/base_tutorials/all_params/summarizer.html) -### 运行命令前置准备 - -- `--models`: 使用`vllm_api_general_chat`模型任务,需要准备支持`v1/chat/completions`子服务的推理服务,可以参考🔗 [VLLM启动OpenAI 兼容服务器](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server)启动推理服务 -- `--datasets`: 使用`demo_gsm8k_gen_4_shot_cot_chat_prompt`数据集任务,需要准备gsm8k数据集,可以从🔗 [opencompass - 提供的gsm8k数据集压缩包](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip)下载。将解压后的`gsm8k/`文件夹部署到AISBench评测工具根路径下的`ais_bench/datasets`文件夹下。 - -### 任务对应配置文件修改 - 每个模型任务、数据集任务和结果呈现任务都对应一个配置文件,运行命令前需要修改这些配置文件的内容。这些配置文件路径可以通过在原有AISBench命令基础上加上`--search`来查询,例如: ```shell @@ -248,15 +317,13 @@ models = [ ] ``` -### 执行命令 - 修改好配置文件后,执行命令启动服务化精度评测: ```bash ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt ``` -#### 查看任务执行细节 +### 查看任务执行细节 执行AISBench命令后,任务管理界面会在命令行实时刷新显示任务执行状态(键盘按"P"键可以暂停/恢复刷新,用于复制看板信息,再按"P"键可以继续刷新)。任务管理界面支持同时监控多个任务的详细执行状态,包括任务名称、进度、时间成本、状态、日志路径、扩展参数等信息,例如: @@ -311,7 +378,7 @@ outputs/default/20250628_151326/logs/infer/vllm-api-general-chat/demo_gsm8k.out > ⚠️ **注意**: 不同评测场景落盘任务执行细节内容不同,具体请参考具体评测场景的指南。 -#### 输出结果 +### 输出结果 因为只有8条数据,会很快跑出结果,结果显示的示例如下 diff --git a/README_en.md b/README_en.md index 80f05536..609fee86 100644 --- a/README_en.md +++ b/README_en.md @@ -159,16 +159,98 @@ pip3 uninstall ais_bench_benchmark ## 🚀 Quick Start -### Command Meaning -A single or multiple evaluation tasks executed by an AISBench command are defined by a combination of model tasks (single or multiple), dataset tasks (single or multiple), and result presentation tasks (single). Other command-line options of AISBench specify the scenario of the evaluation task (accuracy evaluation scenario, performance evaluation scenario, etc.). Take the following AISBench command as an example: +### Pre-execution Preparation + +- You need to prepare an inference service that supports the `v1/chat/completions` sub-service. 
Refer to 🔗 [VLLM Start OpenAI-Compatible Server](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server) to start the inference service. +- You need to prepare the GSM8K dataset. You can download it from the 🔗 [GSM8K dataset archive provided by OpenCompass](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip). After decompression, place the `gsm8k/` folder under the `ais_bench/datasets` directory of the AISBench evaluation tool root path. + +### Start Evaluation (Choose one of two methods) + +| ⭐ Recommended: Use Custom Configuration File | Alternative: Use Command-Line Parameters (Original Quick Start) | +| :------------------ | :------------------------------ | +| Modify one file to centrally manage all configurations, write configurations at any path | Specify via `--models` and `--datasets` parameters | +| Write once, reuse many times | Each run requires entering the full command | +| Supports full Python syntax for flexible extension | Only supports Cartesian product combinations | + +**⭐ Recommended: Use Custom Configuration File** + +AISBench provides a preset custom configuration file [model_api_test_en.py](ais_bench/configs/model_api_test_en.py), which centralizes common inference service-deployed test configurations (model selection, service address, port, generation parameters, etc.) in one file, eliminating the need to look up and modify multiple configuration files. The file is essentially a Python script that supports all Python syntax, allowing you to freely extend it. + +Open `ais_bench/configs/model_api_test_en.py` and modify the following configuration according to your actual situation (if you installed the tool via `pip3 install ais_bench_benchmark`, you can create `model_api_test_en.py` at any path and write the following configuration content into that file): + +```python +from mmengine.config import read_base + +with read_base(): +# Model task, select one. 
For other model tasks, see: https://ais-bench-benchmark-rf.readthedocs.io/en/latest/base_tutorials/all_params/models.html for more model tasks + # vllm_api_general is a base model that only supports text generation + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + # vllm_api_general_chat is a chat model that supports dialogue + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + # vllm_api_stream_chat is a streaming chat model that supports streaming dialogue + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + # vllm_api_general_stream is a streaming model that supports streaming generation + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + +# Dataset task, see: https://ais-bench-benchmark-rf.readthedocs.io/en/latest/get_started/datasets.html for more dataset tasks + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = vllm_api_general_chat + +models[0]["path"] = "" # Specify the absolute path of the model serialized vocabulary file (generally not required for accuracy testing scenarios) +models[0]["model"] = "" # Specify the model name loaded on the server, configured according to the actual model name pulled by the VLLM inference service (configure as an empty string to get it automatically) +models[0]["request_rate"] = 0 # Request sending frequency: send 1 request to the server every 1/request_rate seconds; if less than 0.001, all requests are sent at once +models[0]["api_key"] = "" # Custom API key, default is an empty string +models[0]["host_ip"] = "localhost" # Specify the IP of the inference service +models[0]["host_port"] = 8080 # Specify the port of the inference service +models[0]["url"] = "" # Custom URL path for accessing the inference 
service (needs to be configured when the base URL is not a combination of http://host_ip:host_port; after configuration, host_ip and host_port will be ignored) +models[0]["max_out_len"] = 512 # Maximum number of tokens output by the inference service +models[0]["batch_size"] = 1 # Maximum concurrency for sending requests +models[0]["trust_remote_code"] = False # Whether the tokenizer trusts remote code, default is False +models[0]["generation_kwargs"] = dict( # Model inference parameters, configured with reference to the VLLM documentation; the AISBench evaluation tool does not process them and attaches them directly to the sent request + temperature=0.01, + ignore_eos=False, +) + +# datasets[0]["path"] = "ais_bench/datasets/gsm8k" # Specify the absolute path of the dataset directory (required for accuracy testing scenarios) + +work_dir = 'outputs/default/' # Specify the working directory for saving task results and logs (default is outputs/default/) + +``` +> 💡 The configuration file already imports the common model types (`vllm_api_general`, `vllm_api_general_chat`, `vllm_api_stream_chat`, `vllm_api_general_stream`). To switch, change the `models = ...` assignment. For more on custom configuration files, see 📚 [Running AISBench with Custom Configuration File](./docs/source_en/advanced_tutorials/run_custom_config.md). + +To select, prepare, and use a dataset task, follow these steps: +1. Select a dataset task in 📚 [Open-Source Datasets](https://ais-bench-benchmark.readthedocs.io/en/latest/get_started/datasets.html#id3) +2. Open the dataset's 📚 [Detailed Introduction/Dataset Deployment](ais_bench/benchmark/configs/datasets/demo/README_en.md#dataset-deployment) to prepare the dataset +3. 
Refer to 📚 [Detailed Introduction/Available Dataset Tasks](ais_bench/benchmark/configs/datasets/demo/README_en.md#available-dataset-tasks) to select an available dataset task, and copy the corresponding task import method (e.g., `from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets`) to the custom configuration file + +After modifying the configuration file, execute the following command to start the service-deployed accuracy evaluation: + +```bash +ais_bench ais_bench/configs/model_api_test_en.py +``` + +*** + +**Alternative: Use Command-Line Parameters** + +If you are more familiar with using command-line parameters, AISBench also supports directly specifying tasks via the `--models`, `--datasets`, and `--summarizer` parameters. The following is the command-line approach that has **exactly the same execution effect** as the above custom configuration file approach. + +A single or multiple evaluation tasks executed by an AISBench command are defined by a combination of model tasks (single or multiple), dataset tasks (single or multiple), and result presentation tasks (single). Take the following AISBench command as an example: + ```shell ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example ``` + This command does not specify other command-line options, so it defaults to an accuracy evaluation task, where: - `--models` specifies the model task: the `vllm_api_general_chat` model task. - `--datasets` specifies the dataset task: the `demo_gsm8k_gen_4_shot_cot_chat_prompt` dataset task. - `--summarizer` specifies the result presentation task: the `example` result presentation task (if `--summarizer` is not specified, the `example` task is used by default for accuracy evaluation scenarios). It is generally used as default and does not need to be specified in the command line (subsequent commands will omit this option). 
+For multi-task evaluation, refer to: 📚 [Multi-Task Evaluation in Accuracy Scenarios](./docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark.md#multi-task-evaluation) and [Multi-Task Evaluation in Performance Scenarios](./docs/source_en/base_tutorials/scenes_intro/performance_benchmark.md#multi-task-evaluation). + +If you need to combine evaluation tasks on your own for more flexible evaluation methods, refer to: 📚 [Running AISBench with Custom Configuration File](./docs/source_en/advanced_tutorials/run_custom_config.md#running-aisbench-with-a-custom-configuration-file). + ### Task Meaning Query (Optional) Detailed information (introduction, usage constraints, etc.) about the selected model task (`vllm_api_general_chat`), dataset task (`demo_gsm8k_gen_4_shot_cot_chat_prompt`), and result presentation task (`example`) can be queried from the following links: diff --git a/ais_bench/benchmark/cli/config_manager.py b/ais_bench/benchmark/cli/config_manager.py index a0425fcc..545a3435 100644 --- a/ais_bench/benchmark/cli/config_manager.py +++ b/ais_bench/benchmark/cli/config_manager.py @@ -23,7 +23,7 @@ def __init__(self, config, file_path): def check(self): self._check_models_config() self._check_datasets_config() - self._check_summarizer_config() + # self._check_summarizer_config() # summarizer has default value, so it is not required def _check_models_config(self): models = self.config.get('models', []) diff --git a/ais_bench/benchmark/configs/datasets/ARC_c/README.md b/ais_bench/benchmark/configs/datasets/ARC_c/README.md index dcff1b00..34b67a71 100644 --- a/ais_bench/benchmark/configs/datasets/ARC_c/README.md +++ b/ais_bench/benchmark/configs/datasets/ARC_c/README.md @@ -28,8 +28,8 @@ rm -r OpenCompassData-core-20240207.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|ARC_c_gen_0_shot_chat_prompt|ARC Challenge 
Set数据集生成式任务|accuracy|0-shot|对话格式|[ARC_c_gen_0_shot_chat_prompt.py](ARC_c_gen_0_shot_chat_prompt.py)| -|ARC_c_gen_25_shot_chat_prompt|ARC Challenge Set数据集生成式任务|accuracy|25-shot|对话格式|[ARC_c_gen_25_shot_chat_prompt.py](ARC_c_gen_25_shot_chat_prompt.py)| -|ARC_c_ppl_0_shot_str|ARC Challenge Set数据集PPL任务|accuracy|0-shot|字符串格式|[ARC_c_ppl_0_shot_str.py](ARC_c_ppl_0_shot_str.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|ARC_c_gen_0_shot_chat_prompt|ARC Challenge Set数据集生成式任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.ARC_c.ARC_c_gen_0_shot_chat_prompt import ARC_c_datasets as datasets`|[ARC_c_gen_0_shot_chat_prompt.py](ARC_c_gen_0_shot_chat_prompt.py)| +|ARC_c_gen_25_shot_chat_prompt|ARC Challenge Set数据集生成式任务|accuracy|25-shot|对话格式|`from ais_bench.benchmark.configs.datasets.ARC_c.ARC_c_gen_25_shot_chat_prompt import ARC_c_datasets as datasets`|[ARC_c_gen_25_shot_chat_prompt.py](ARC_c_gen_25_shot_chat_prompt.py)| +|ARC_c_ppl_0_shot_str|ARC Challenge Set数据集PPL任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.ARC_c.ARC_c_ppl_0_shot_str import ARC_c_datasets as datasets`|[ARC_c_ppl_0_shot_str.py](ARC_c_ppl_0_shot_str.py)| diff --git a/ais_bench/benchmark/configs/datasets/ARC_c/README_en.md b/ais_bench/benchmark/configs/datasets/ARC_c/README_en.md index 988c27ce..ad93483e 100644 --- a/ais_bench/benchmark/configs/datasets/ARC_c/README_en.md +++ b/ais_bench/benchmark/configs/datasets/ARC_c/README_en.md @@ -28,8 +28,8 @@ rm -r OpenCompassData-core-20240207.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| ARC_c_gen_0_shot_chat_prompt | Generative task for the ARC Challenge Set dataset | Accuracy | 0-shot | Chat format | [ARC_c_gen_0_shot_chat_prompt.py](ARC_c_gen_0_shot_chat_prompt.py) | -| ARC_c_gen_25_shot_chat_prompt | 
Generative task for the ARC Challenge Set dataset | Accuracy | 25-shot | Chat format | [ARC_c_gen_25_shot_chat_prompt.py](ARC_c_gen_25_shot_chat_prompt.py) | -| ARC_c_ppl_0_shot_str | PPL task for ARC Challenge Set dataset | Accuracy | 0-shot | String format | [ARC_c_ppl_0_shot_str.py](ARC_c_ppl_0_shot_str.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| ARC_c_gen_0_shot_chat_prompt | Generative task for the ARC Challenge Set dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.ARC_c.ARC_c_gen_0_shot_chat_prompt import ARC_c_datasets as datasets`| [ARC_c_gen_0_shot_chat_prompt.py](ARC_c_gen_0_shot_chat_prompt.py) | +| ARC_c_gen_25_shot_chat_prompt | Generative task for the ARC Challenge Set dataset | Accuracy | 25-shot | Chat format |`from ais_bench.benchmark.configs.datasets.ARC_c.ARC_c_gen_25_shot_chat_prompt import ARC_c_datasets as datasets`| [ARC_c_gen_25_shot_chat_prompt.py](ARC_c_gen_25_shot_chat_prompt.py) | +| ARC_c_ppl_0_shot_str | PPL task for ARC Challenge Set dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.ARC_c.ARC_c_ppl_0_shot_str import ARC_c_datasets as datasets`| [ARC_c_ppl_0_shot_str.py](ARC_c_ppl_0_shot_str.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/ARC_e/README.md b/ais_bench/benchmark/configs/datasets/ARC_e/README.md index 4f0867ad..7b2eb231 100644 --- a/ais_bench/benchmark/configs/datasets/ARC_e/README.md +++ b/ais_bench/benchmark/configs/datasets/ARC_e/README.md @@ -27,8 +27,8 @@ rm -r OpenCompassData-core-20240207.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|ARC_e_gen_0_shot_chat_prompt|ARC Easy 
Set数据集生成式任务|accuracy|0-shot|对话格式|[ARC_e_gen_0_shot_chat_prompt.py](ARC_e_gen_0_shot_chat_prompt.py)| -|ARC_e_gen_25_shot_chat_prompt|ARC Easy Set数据集生成式任务|accuracy|25-shot|对话格式|[ARC_e_gen_25_shot_chat_prompt.py](ARC_e_gen_25_shot_chat_prompt.py)| -|ARC_e_ppl_0_shot_str|ARC Easy Set数据集PPL任务|accuracy|0-shot|字符串模式|[ARC_e_ppl_0_shot_str.py](ARC_e_ppl_0_shot_str.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|ARC_e_gen_0_shot_chat_prompt|ARC Easy Set数据集生成式任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.ARC_e.ARC_e_gen_0_shot_chat_prompt import ARC_e_datasets as datasets`|[ARC_e_gen_0_shot_chat_prompt.py](ARC_e_gen_0_shot_chat_prompt.py)| +|ARC_e_gen_25_shot_chat_prompt|ARC Easy Set数据集生成式任务|accuracy|25-shot|对话格式|`from ais_bench.benchmark.configs.datasets.ARC_e.ARC_e_gen_25_shot_chat_prompt import ARC_e_datasets as datasets`|[ARC_e_gen_25_shot_chat_prompt.py](ARC_e_gen_25_shot_chat_prompt.py)| +|ARC_e_ppl_0_shot_str|ARC Easy Set数据集PPL任务|accuracy|0-shot|字符串模式|`from ais_bench.benchmark.configs.datasets.ARC_e.ARC_e_ppl_0_shot_str import ARC_e_datasets as datasets`|[ARC_e_ppl_0_shot_str.py](ARC_e_ppl_0_shot_str.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/ARC_e/README_en.md b/ais_bench/benchmark/configs/datasets/ARC_e/README_en.md index 1b23f2cf..a922b456 100644 --- a/ais_bench/benchmark/configs/datasets/ARC_e/README_en.md +++ b/ais_bench/benchmark/configs/datasets/ARC_e/README_en.md @@ -27,8 +27,8 @@ rm -r OpenCompassData-core-20240207.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| ARC_e_gen_0_shot_chat_prompt | Generative task for the ARC Easy Set dataset | Accuracy | 0-shot | Chat format | [ARC_e_gen_0_shot_chat_prompt.py](ARC_e_gen_0_shot_chat_prompt.py) | -| 
ARC_e_gen_25_shot_chat_prompt | Generative task for the ARC Easy Set dataset | Accuracy | 25-shot | Chat format | [ARC_e_gen_25_shot_chat_prompt.py](ARC_e_gen_25_shot_chat_prompt.py) | -| ARC_e_ppl_0_shot_str | PPL task for the ARC Easy Set dataset | Accuracy | 0-shot | String Format | [ARC_e_ppl_0_shot_str.py](ARC_e_ppl_0_shot_str.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| ARC_e_gen_0_shot_chat_prompt | Generative task for the ARC Easy Set dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.ARC_e.ARC_e_gen_0_shot_chat_prompt import ARC_e_datasets as datasets`| [ARC_e_gen_0_shot_chat_prompt.py](ARC_e_gen_0_shot_chat_prompt.py) | +| ARC_e_gen_25_shot_chat_prompt | Generative task for the ARC Easy Set dataset | Accuracy | 25-shot | Chat format |`from ais_bench.benchmark.configs.datasets.ARC_e.ARC_e_gen_25_shot_chat_prompt import ARC_e_datasets as datasets`| [ARC_e_gen_25_shot_chat_prompt.py](ARC_e_gen_25_shot_chat_prompt.py) | +| ARC_e_ppl_0_shot_str | PPL task for the ARC Easy Set dataset | Accuracy | 0-shot | String Format |`from ais_bench.benchmark.configs.datasets.ARC_e.ARC_e_ppl_0_shot_str import ARC_e_datasets as datasets`| [ARC_e_ppl_0_shot_str.py](ARC_e_ppl_0_shot_str.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_bustm/README.md b/ais_bench/benchmark/configs/datasets/FewCLUE_bustm/README.md index f9b442f4..578d3ebc 100644 --- a/ais_bench/benchmark/configs/datasets/FewCLUE_bustm/README.md +++ b/ais_bench/benchmark/configs/datasets/FewCLUE_bustm/README.md @@ -39,6 +39,6 @@ rm -r OpenCompassData-core-20240207.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | 
-|FewCLUE_bustm_ppl_0_shot_chat|FewCLUE_bustm数据集PPL任务|accuracy|0-shot|对话格式|[FewCLUE_bustm_ppl_0_shot_chat.py](FewCLUE_bustm_ppl_0_shot_chat.py)|
\ No newline at end of file
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||FewCLUE_bustm_ppl_0_shot_chat|FewCLUE_bustm数据集PPL任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.FewCLUE_bustm.FewCLUE_bustm_ppl_0_shot_chat import bustm_datasets as datasets`|[FewCLUE_bustm_ppl_0_shot_chat.py](FewCLUE_bustm_ppl_0_shot_chat.py)|
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_bustm/README_en.md b/ais_bench/benchmark/configs/datasets/FewCLUE_bustm/README_en.md
index 32a8821f..e5cb26fd 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_bustm/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_bustm/README_en.md
@@ -39,6 +39,6 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
-| FewCLUE_bustm_ppl_0_shot_chat | PPL task for the FewCLUE_bustm dataset | Accuracy | 0-shot | Chat format | [FewCLUE_bustm_ppl_0_shot_chat.py](FewCLUE_bustm_ppl_0_shot_chat.py) |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|| FewCLUE_bustm_ppl_0_shot_chat | PPL task for the FewCLUE_bustm dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.FewCLUE_bustm.FewCLUE_bustm_ppl_0_shot_chat import bustm_datasets as datasets`| [FewCLUE_bustm_ppl_0_shot_chat.py](FewCLUE_bustm_ppl_0_shot_chat.py) |
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_chid/README.md b/ais_bench/benchmark/configs/datasets/FewCLUE_chid/README.md
index 08ac377b..c35b26d2 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_chid/README.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_chid/README.md
@@ -39,6 +39,6 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|FewCLUE_chid_ppl_0_shot_str|FewCLUE_chid数据集PPL任务|accuracy|0-shot|字符串格式|[FewCLUE_chid_ppl_0_shot_str.py](FewCLUE_chid_ppl_0_shot_str.py)|
\ No newline at end of file
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||FewCLUE_chid_ppl_0_shot_str|FewCLUE_chid数据集PPL任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.FewCLUE_chid.FewCLUE_chid_ppl_0_shot_str import chid_datasets as datasets`|[FewCLUE_chid_ppl_0_shot_str.py](FewCLUE_chid_ppl_0_shot_str.py)|
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_chid/README_en.md b/ais_bench/benchmark/configs/datasets/FewCLUE_chid/README_en.md
index 975b2aee..89ce4831 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_chid/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_chid/README_en.md
@@ -39,6 +39,6 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
-| FewCLUE_chid_ppl_0_shot_str | PPL task for the FewCLUE_chid dataset | Accuracy | 0-shot | String format | [FewCLUE_chid_ppl_0_shot_str.py](FewCLUE_chid_ppl_0_shot_str.py) |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|| FewCLUE_chid_ppl_0_shot_str | PPL task for the FewCLUE_chid dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.FewCLUE_chid.FewCLUE_chid_ppl_0_shot_str import
chid_datasets as datasets`| [FewCLUE_chid_ppl_0_shot_str.py](FewCLUE_chid_ppl_0_shot_str.py) |
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_cluewsc/README.md b/ais_bench/benchmark/configs/datasets/FewCLUE_cluewsc/README.md
index 32c8eb02..7d8fba2d 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_cluewsc/README.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_cluewsc/README.md
@@ -41,6 +41,6 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|FewCLUE_cluewsc_ppl_0_shot_chat|FewCLUE_cluewsc数据集PPL任务|accuracy|0-shot|对话格式|[FewCLUE_cluewsc_ppl_0_shot_chat.py](FewCLUE_cluewsc_ppl_0_shot_chat.py)|
\ No newline at end of file
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||FewCLUE_cluewsc_ppl_0_shot_chat|FewCLUE_cluewsc数据集PPL任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.FewCLUE_cluewsc.FewCLUE_cluewsc_ppl_0_shot_chat import cluewsc_datasets as datasets`|[FewCLUE_cluewsc_ppl_0_shot_chat.py](FewCLUE_cluewsc_ppl_0_shot_chat.py)|
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_cluewsc/README_en.md b/ais_bench/benchmark/configs/datasets/FewCLUE_cluewsc/README_en.md
index 52f41c7a..cc900b02 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_cluewsc/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_cluewsc/README_en.md
@@ -41,6 +41,6 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
-| FewCLUE_cluewsc_ppl_0_shot_chat | PPL task for the FewCLUE_cluewsc dataset | Accuracy | 0-shot | Chat format | [FewCLUE_cluewsc_ppl_0_shot_chat.py](FewCLUE_cluewsc_ppl_0_shot_chat.py) |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt
Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|| FewCLUE_cluewsc_ppl_0_shot_chat | PPL task for the FewCLUE_cluewsc dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.FewCLUE_cluewsc.FewCLUE_cluewsc_ppl_0_shot_chat import cluewsc_datasets as datasets`| [FewCLUE_cluewsc_ppl_0_shot_chat.py](FewCLUE_cluewsc_ppl_0_shot_chat.py) |
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_csl/README.md b/ais_bench/benchmark/configs/datasets/FewCLUE_csl/README.md
index c1102098..f8a35819 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_csl/README.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_csl/README.md
@@ -39,7 +39,7 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|FewCLUE_csl_ppl_0_shot_str|FewCLUE_csl数据集PPL任务|accuracy|0-shot|字符串格式|[FewCLUE_csl_ppl_0_shot_str.py](FewCLUE_csl_ppl_0_shot_str.py)|
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||FewCLUE_csl_ppl_0_shot_str|FewCLUE_csl数据集PPL任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.FewCLUE_csl.FewCLUE_csl_ppl_0_shot_str import csl_datasets as datasets`|[FewCLUE_csl_ppl_0_shot_str.py](FewCLUE_csl_ppl_0_shot_str.py)|
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_csl/README_en.md b/ais_bench/benchmark/configs/datasets/FewCLUE_csl/README_en.md
index af1bb9cd..ec7652fb 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_csl/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_csl/README_en.md
@@ -39,7 +39,7 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
-| FewCLUE_csl_ppl_0_shot_str | PPL task
for the FewCLUE_csl dataset | Accuracy | 0-shot | String format | [FewCLUE_csl_ppl_0_shot_str.py](FewCLUE_csl_ppl_0_shot_str.py) |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|| FewCLUE_csl_ppl_0_shot_str | PPL task for the FewCLUE_csl dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.FewCLUE_csl.FewCLUE_csl_ppl_0_shot_str import csl_datasets as datasets`| [FewCLUE_csl_ppl_0_shot_str.py](FewCLUE_csl_ppl_0_shot_str.py) |
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_eprstmt/README.md b/ais_bench/benchmark/configs/datasets/FewCLUE_eprstmt/README.md
index dacc2c88..8ad577d9 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_eprstmt/README.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_eprstmt/README.md
@@ -39,7 +39,7 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|FewCLUE_eprstmt_ppl_0_shot_chat|FewCLUE_eprstmt数据集PPL任务|accuracy|0-shot|对话格式|[FewCLUE_eprstmt_ppl_0_shot_chat.py](FewCLUE_eprstmt_ppl_0_shot_chat.py)|
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||FewCLUE_eprstmt_ppl_0_shot_chat|FewCLUE_eprstmt数据集PPL任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.FewCLUE_eprstmt.FewCLUE_eprstmt_ppl_0_shot_chat import eprstmt_datasets as datasets`|[FewCLUE_eprstmt_ppl_0_shot_chat.py](FewCLUE_eprstmt_ppl_0_shot_chat.py)|
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_eprstmt/README_en.md b/ais_bench/benchmark/configs/datasets/FewCLUE_eprstmt/README_en.md
index 1405475c..6ad4e038 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_eprstmt/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_eprstmt/README_en.md
@@ -39,7 +39,7 @@ rm -r
OpenCompassData-core-20240207.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
-| FewCLUE_eprstmt_ppl_0_shot_chat | PPL task for the FewCLUE_eprstmt dataset | Accuracy | 0-shot | Chat format | [FewCLUE_eprstmt_ppl_0_shot_chat.py](FewCLUE_eprstmt_ppl_0_shot_chat.py) |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|| FewCLUE_eprstmt_ppl_0_shot_chat | PPL task for the FewCLUE_eprstmt dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.FewCLUE_eprstmt.FewCLUE_eprstmt_ppl_0_shot_chat import eprstmt_datasets as datasets`| [FewCLUE_eprstmt_ppl_0_shot_chat.py](FewCLUE_eprstmt_ppl_0_shot_chat.py) |
diff --git a/ais_bench/benchmark/configs/datasets/FewCLUE_tnews/README.md b/ais_bench/benchmark/configs/datasets/FewCLUE_tnews/README.md
index cc3280f6..e825b262 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_tnews/README.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_tnews/README.md
@@ -39,6 +39,6 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|FewCLUE_tnews_ppl_0_shot_chat|FewCLUE_tnews数据集PPL任务|accuracy|0-shot|对话格式|[FewCLUE_tnews_ppl_0_shot_chat.py](FewCLUE_tnews_ppl_0_shot_chat.py)|
\ No newline at end of file
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||FewCLUE_tnews_ppl_0_shot_chat|FewCLUE_tnews数据集PPL任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.FewCLUE_tnews.FewCLUE_tnews_ppl_0_shot_chat import tnews_datasets as datasets`|[FewCLUE_tnews_ppl_0_shot_chat.py](FewCLUE_tnews_ppl_0_shot_chat.py)|
\ No newline at end of file
diff --git
a/ais_bench/benchmark/configs/datasets/FewCLUE_tnews/README_en.md b/ais_bench/benchmark/configs/datasets/FewCLUE_tnews/README_en.md
index 8ada0c8b..3f57c089 100644
--- a/ais_bench/benchmark/configs/datasets/FewCLUE_tnews/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/FewCLUE_tnews/README_en.md
@@ -39,7 +39,7 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
-| FewCLUE_tnews_ppl_0_shot_chat | PPL task for the FewCLUE_tnews dataset | Accuracy | 0-shot | Chat format | [FewCLUE_tnews_ppl_0_shot_chat.py](FewCLUE_tnews_ppl_0_shot_chat.py) |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|| FewCLUE_tnews_ppl_0_shot_chat | PPL task for the FewCLUE_tnews dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.FewCLUE_tnews.FewCLUE_tnews_ppl_0_shot_chat import tnews_datasets as datasets`| [FewCLUE_tnews_ppl_0_shot_chat.py](FewCLUE_tnews_ppl_0_shot_chat.py) |
diff --git a/ais_bench/benchmark/configs/datasets/SuperGLUE_BoolQ/README.md b/ais_bench/benchmark/configs/datasets/SuperGLUE_BoolQ/README.md
index 25fb8f59..48804291 100644
--- a/ais_bench/benchmark/configs/datasets/SuperGLUE_BoolQ/README.md
+++ b/ais_bench/benchmark/configs/datasets/SuperGLUE_BoolQ/README.md
@@ -23,9 +23,9 @@ rm SuperGLUE.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|SuperGLUE_BoolQ_gen_883d50_str|BoolQ数据集生成式任务|accuracy(naive_average)|0-shot|string|[SuperGLUE_BoolQ_gen_883d50_str.py](SuperGLUE_BoolQ_gen_883d50_str.py)|
-|SuperGLUE_BoolQ_gen_0_shot_cot_str|BoolQ数据集生成式任务,prompt带逻辑链|accuracy(naive_average)|0-shot|string|[SuperGLUE_BoolQ_gen_0_shot_cot_str.py](SuperGLUE_BoolQ_gen_0_shot_cot_str.py)|
-|SuperGLUE_BoolQ_gen_5_shot_str|BoolQ数据集生成式任务,few-shot|accuracy(naive_average)|5-shot|string|[SuperGLUE_BoolQ_gen_5_shot_str.py](SuperGLUE_BoolQ_gen_5_shot_str.py)|
-|SuperGLUE_BoolQ_gen_0_shot_str|BoolQ数据集生成式任务,few-shot|accuracy(naive_average)|5-shot|string|[SuperGLUE_BoolQ_gen_0_shot_str.py](SuperGLUE_BoolQ_gen_0_shot_str.py)|
\ No newline at end of file
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|SuperGLUE_BoolQ_gen_883d50_str|BoolQ数据集生成式任务|accuracy(naive_average)|0-shot|string|`from ais_bench.benchmark.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_883d50_str import BoolQ_datasets as datasets`|[SuperGLUE_BoolQ_gen_883d50_str.py](SuperGLUE_BoolQ_gen_883d50_str.py)|
+||SuperGLUE_BoolQ_gen_0_shot_cot_str|BoolQ数据集生成式任务,prompt带逻辑链|accuracy(naive_average)|0-shot|string|`from ais_bench.benchmark.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_0_shot_cot_str import BoolQ_datasets as datasets`|[SuperGLUE_BoolQ_gen_0_shot_cot_str.py](SuperGLUE_BoolQ_gen_0_shot_cot_str.py)|
+||SuperGLUE_BoolQ_gen_5_shot_str|BoolQ数据集生成式任务,few-shot|accuracy(naive_average)|5-shot|string|`from ais_bench.benchmark.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_5_shot_str import BoolQ_datasets as datasets`|[SuperGLUE_BoolQ_gen_5_shot_str.py](SuperGLUE_BoolQ_gen_5_shot_str.py)|
+||SuperGLUE_BoolQ_gen_0_shot_str|BoolQ数据集生成式任务,few-shot|accuracy(naive_average)|5-shot|string|`from ais_bench.benchmark.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_0_shot_str import BoolQ_datasets as datasets`|[SuperGLUE_BoolQ_gen_0_shot_str.py](SuperGLUE_BoolQ_gen_0_shot_str.py)|
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/SuperGLUE_BoolQ/README_en.md b/ais_bench/benchmark/configs/datasets/SuperGLUE_BoolQ/README_en.md
index 9f0a9b0a..87b095d3 100644
--- a/ais_bench/benchmark/configs/datasets/SuperGLUE_BoolQ/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/SuperGLUE_BoolQ/README_en.md
@@ -23,12 +23,12 @@ rm SuperGLUE.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path |
-| --- | --- | --- | --- | --- | --- |
-| SuperGLUE_BoolQ_gen_883d50_str | Generative task for the BoolQ dataset | Accuracy (naive_average) | 0-shot | String | [SuperGLUE_BoolQ_gen_883d50_str.py](SuperGLUE_BoolQ_gen_883d50_str.py) |
-| SuperGLUE_BoolQ_gen_0_shot_cot_str | Generative task for the BoolQ dataset, with a chain-of-thought in the prompt | Accuracy (naive_average) | 0-shot | String | [SuperGLUE_BoolQ_gen_0_shot_cot_str.py](SuperGLUE_BoolQ_gen_0_shot_cot_str.py) |
-| SuperGLUE_BoolQ_gen_5_shot_str | Generative task for the BoolQ dataset (few-shot setting) | Accuracy (naive_average) | 5-shot | String | [SuperGLUE_BoolQ_gen_5_shot_str.py](SuperGLUE_BoolQ_gen_5_shot_str.py) |
-| SuperGLUE_BoolQ_gen_0_shot_str | Generative task for the BoolQ dataset (note: there is a possible inconsistency between the "Few-Shot" setting and the task name; the "Few-Shot" column shows 5-shot, while the task name indicates 0-shot) | Accuracy (naive_average) | 5-shot | String | [SuperGLUE_BoolQ_gen_0_shot_str.py](SuperGLUE_BoolQ_gen_0_shot_str.py) |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| SuperGLUE_BoolQ_gen_883d50_str | Generative task for the BoolQ dataset | Accuracy (naive_average) | 0-shot | String | `from ais_bench.benchmark.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_883d50_str import BoolQ_datasets as datasets` | [SuperGLUE_BoolQ_gen_883d50_str.py](SuperGLUE_BoolQ_gen_883d50_str.py) |
+|| SuperGLUE_BoolQ_gen_0_shot_cot_str | Generative task for the BoolQ dataset, with
a chain-of-thought in the prompt | Accuracy (naive_average) | 0-shot | String |`from ais_bench.benchmark.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_0_shot_cot_str import BoolQ_datasets as datasets`| [SuperGLUE_BoolQ_gen_0_shot_cot_str.py](SuperGLUE_BoolQ_gen_0_shot_cot_str.py) |
+|| SuperGLUE_BoolQ_gen_5_shot_str | Generative task for the BoolQ dataset (few-shot setting) | Accuracy (naive_average) | 5-shot | String |`from ais_bench.benchmark.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_5_shot_str import BoolQ_datasets as datasets`| [SuperGLUE_BoolQ_gen_5_shot_str.py](SuperGLUE_BoolQ_gen_5_shot_str.py) |
+|| SuperGLUE_BoolQ_gen_0_shot_str | Generative task for the BoolQ dataset (note: there is a possible inconsistency between the "Few-Shot" setting and the task name; the "Few-Shot" column shows 5-shot, while the task name indicates 0-shot) | Accuracy (naive_average) | 5-shot | String |`from ais_bench.benchmark.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_gen_0_shot_str import BoolQ_datasets as datasets`| [SuperGLUE_BoolQ_gen_0_shot_str.py](SuperGLUE_BoolQ_gen_0_shot_str.py) |
 ### Note
diff --git a/ais_bench/benchmark/configs/datasets/Xsum/README.md b/ais_bench/benchmark/configs/datasets/Xsum/README.md
index 475bca44..b38098e4 100644
--- a/ais_bench/benchmark/configs/datasets/Xsum/README.md
+++ b/ais_bench/benchmark/configs/datasets/Xsum/README.md
@@ -27,7 +27,7 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|Xsum_gen_0_shot_chat|Xsum数据集生成式任务|accuracy|0-shot|对话格式|[Xsum_gen_0_shot_chat.py](Xsum_gen_0_shot_chat.py)|
-|Xsum_gen_0_shot_str|Xsum数据集生成式任务|accuracy|0-shot|字符串格式|[Xsum_gen_0_shot_str.py](Xsum_gen_0_shot_str.py)|
\ No newline at end of file
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||Xsum_gen_0_shot_chat|Xsum数据集生成式任务|accuracy|0-shot|对话格式|`from
ais_bench.benchmark.configs.datasets.Xsum.Xsum_gen_0_shot_chat import Xsum_datasets as datasets`|[Xsum_gen_0_shot_chat.py](Xsum_gen_0_shot_chat.py)|
+||Xsum_gen_0_shot_str|Xsum数据集生成式任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.Xsum.Xsum_gen_0_shot_str import Xsum_datasets as datasets`|[Xsum_gen_0_shot_str.py](Xsum_gen_0_shot_str.py)|
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/Xsum/README_en.md b/ais_bench/benchmark/configs/datasets/Xsum/README_en.md
index 1963013f..f74edb96 100644
--- a/ais_bench/benchmark/configs/datasets/Xsum/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/Xsum/README_en.md
@@ -27,7 +27,7 @@ rm -r OpenCompassData-core-20240207.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path |
-| --- | --- | --- | --- | --- | --- |
-| Xsum_gen_0_shot_chat | Generative task for the XSum dataset | Accuracy | 0-shot | Chat Format | [Xsum_gen_0_shot_chat.py](Xsum_gen_0_shot_chat.py) |
-| Xsum_gen_0_shot_str | Generative task for the XSum dataset | Accuracy | 0-shot | String Format | [Xsum_gen_0_shot_str.py](Xsum_gen_0_shot_str.py) |
\ No newline at end of file
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|| Xsum_gen_0_shot_chat | Generative task for the XSum dataset | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.Xsum.Xsum_gen_0_shot_chat import Xsum_datasets as datasets`| [Xsum_gen_0_shot_chat.py](Xsum_gen_0_shot_chat.py) |
+|| Xsum_gen_0_shot_str | Generative task for the XSum dataset | Accuracy | 0-shot | String Format |`from ais_bench.benchmark.configs.datasets.Xsum.Xsum_gen_0_shot_str import Xsum_datasets as datasets`| [Xsum_gen_0_shot_str.py](Xsum_gen_0_shot_str.py) |
\ No newline at end of file
diff --git
a/ais_bench/benchmark/configs/datasets/agieval/README.md b/ais_bench/benchmark/configs/datasets/agieval/README.md
index 81c0966f..7a6afd60 100644
--- a/ais_bench/benchmark/configs/datasets/agieval/README.md
+++ b/ais_bench/benchmark/configs/datasets/agieval/README.md
@@ -45,6 +45,6 @@ rm -r OpenCompassData-core-20240207.zip
 └── sat-math.jsonl
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|agieval_gen_0_shot_chat_prompt|AGIEval数据集生成式任务,共包含21个子任务|accuracy|0-shot|对话格式|[agieval_gen_0_shot_chat_prompt.py](agieval_gen_0_shot_chat_prompt.py)|
\ No newline at end of file
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||agieval_gen_0_shot_chat_prompt|AGIEval数据集生成式任务,共包含21个子任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.agieval.agieval_gen_0_shot_chat_prompt import agieval_datasets as datasets`|[agieval_gen_0_shot_chat_prompt.py](agieval_gen_0_shot_chat_prompt.py)|
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/agieval/README_en.md b/ais_bench/benchmark/configs/datasets/agieval/README_en.md
index 0a6bc98e..beb49f2a 100644
--- a/ais_bench/benchmark/configs/datasets/agieval/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/agieval/README_en.md
@@ -45,7 +45,7 @@ rm -r OpenCompassData-core-20240207.zip
 └── sat-math.jsonl
 ```
 ## Available Dataset Tasks
-|Task Name|Description|Evaluation Metric|Few-shot|Prompt Format|Corresponding Source Code Configuration File Path|
-| --- | --- | --- | --- | --- | --- |
-|agieval_gen_0_shot_chat_prompt|AGIEval dataset generative task, containing a total of 21 subtasks|accuracy|0-shot|Chat format|[agieval_gen_0_shot_chat_prompt.py](agieval_gen_0_shot_chat_prompt.py)|
+|Task Name|Description|Evaluation Metric|Few-shot|Prompt Format|Import Statement|Corresponding Source Code Configuration File Path|
+| --- | --- | --- | --- | --- | --- | --- |
+|agieval_gen_0_shot_chat_prompt|AGIEval dataset generative task, containing a total of 21 subtasks|accuracy|0-shot|Chat format|`from ais_bench.benchmark.configs.datasets.agieval.agieval_gen_0_shot_chat_prompt import agieval_datasets as datasets`|[agieval_gen_0_shot_chat_prompt.py](agieval_gen_0_shot_chat_prompt.py)|
 ```
diff --git a/ais_bench/benchmark/configs/datasets/aime2024/README.md b/ais_bench/benchmark/configs/datasets/aime2024/README.md
index 736dbcfb..c7705502 100644
--- a/ais_bench/benchmark/configs/datasets/aime2024/README.md
+++ b/ais_bench/benchmark/configs/datasets/aime2024/README.md
@@ -24,7 +24,7 @@ rm aime.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|aime2024_gen_0_shot_str|aime2024数据集生成式任务|accuracy(pass@1)|0-shot|字符串格式|[aime2024_gen_0_shot_str.py](aime2024_gen_0_shot_str.py)|
-|aime2024_gen_0_shot_chat_prompt|aime2024数据集生成式任务(对齐DeepSeek R1精度测试)|accuracy(pass@1)|0-shot|对话格式|[aime2024_gen_0_shot_chat_prompt.py](aime2024_gen_0_shot_chat_prompt.py)|
\ No newline at end of file
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||aime2024_gen_0_shot_str|aime2024数据集生成式任务|accuracy(pass@1)|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_str import aime2024_datasets as datasets`|[aime2024_gen_0_shot_str.py](aime2024_gen_0_shot_str.py)|
+||aime2024_gen_0_shot_chat_prompt|aime2024数据集生成式任务(对齐DeepSeek R1精度测试)|accuracy(pass@1)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets as datasets`|[aime2024_gen_0_shot_chat_prompt.py](aime2024_gen_0_shot_chat_prompt.py)|
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/aime2024/README_en.md b/ais_bench/benchmark/configs/datasets/aime2024/README_en.md
index 2473c146..bda9be74 100644
--- a/ais_bench/benchmark/configs/datasets/aime2024/README_en.md
+++
b/ais_bench/benchmark/configs/datasets/aime2024/README_en.md
@@ -24,7 +24,7 @@ rm aime.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
-| aime2024_gen_0_shot_str | Generative task for the aime2024 dataset | accuracy (pass@1) | 0-shot | String format | [aime2024_gen_0_shot_str.py](aime2024_gen_0_shot_str.py) |
-| aime2024_gen_0_shot_chat_prompt | Generative task for the aime2024 dataset (aligned with DeepSeek R1 accuracy test) | accuracy (pass@1) | 0-shot | Chat format | [aime2024_gen_0_shot_chat_prompt.py](aime2024_gen_0_shot_chat_prompt.py) |
\ No newline at end of file
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+|| aime2024_gen_0_shot_str | Generative task for the aime2024 dataset | accuracy (pass@1) | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_str import aime2024_datasets as datasets`| [aime2024_gen_0_shot_str.py](aime2024_gen_0_shot_str.py) |
+|| aime2024_gen_0_shot_chat_prompt | Generative task for the aime2024 dataset (aligned with DeepSeek R1 accuracy test) | accuracy (pass@1) | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets as datasets`| [aime2024_gen_0_shot_chat_prompt.py](aime2024_gen_0_shot_chat_prompt.py) |
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/aime2025/README.md b/ais_bench/benchmark/configs/datasets/aime2025/README.md
index 86e7d442..8540d149 100644
--- a/ais_bench/benchmark/configs/datasets/aime2025/README.md
+++ b/ais_bench/benchmark/configs/datasets/aime2025/README.md
@@ -23,7 +23,7 @@ rm aime2025.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-|
--- | --- | --- | --- | --- | --- |
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
 |aime2025_gen|AIME2025|数据集生成式任务|准确率(accuracy)|0-shot|对话格式|aime2025_gen_0_shot_chat_prompt.py|
-|aime2025_gen_0_shot_llmjudge|AIME2025|数据集生成式任务|准确率(accuracy), 裁判模型评价的结果|0-shot|对话格式|aime2025_gen_0_shot_llmjudge.py|
\ No newline at end of file
+||aime2025_gen_0_shot_llmjudge|AIME2025|数据集生成式任务|准确率(accuracy), 裁判模型评价的结果|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.aime2025.aime2025_gen_0_shot_llmjudge import aime2025_datasets as datasets`|aime2025_gen_0_shot_llmjudge.py|
\ No newline at end of file
diff --git a/ais_bench/benchmark/configs/datasets/aime2025/README_en.md b/ais_bench/benchmark/configs/datasets/aime2025/README_en.md
index 273dc9fb..62141aaf 100644
--- a/ais_bench/benchmark/configs/datasets/aime2025/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/aime2025/README_en.md
@@ -23,7 +23,7 @@ rm aime2025.zip
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
 | aime2025_gen | Generative task for the AIME2025 dataset | Accuracy | 0-shot | Chat format | aime2025_gen_0_shot_chat_prompt.py |
-| aime2025_gen_0_shot_llmjudge | AIME2025 | Generative task for the AIME2025 dataset | Accuracy evaluated by judge model | 0-shot | Chat format | aime2025_gen_0_shot_llmjudge.py |
+|| aime2025_gen_0_shot_llmjudge | AIME2025 | Generative task for the AIME2025 dataset | Accuracy evaluated by judge model | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.aime2025.aime2025_gen_0_shot_llmjudge import aime2025_datasets as datasets`| aime2025_gen_0_shot_llmjudge.py |
diff --git
a/ais_bench/benchmark/configs/datasets/aime2026/README.md b/ais_bench/benchmark/configs/datasets/aime2026/README.md
index 5ec37f6f..f200e452 100644
--- a/ais_bench/benchmark/configs/datasets/aime2026/README.md
+++ b/ais_bench/benchmark/configs/datasets/aime2026/README.md
@@ -40,7 +40,7 @@ Remember to put your answer inside \boxed{}.
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
 |aime2026_gen|AIME2026 数据集生成式任务|准确率(accuracy)|0-shot|对话格式|aime2026_gen_0_shot_chat_prompt.py|
-|aime2026_gen_0_shot_str|AIME2026 数据集生成式任务|准确率(accuracy)|0-shot|字符串格式|aime2026_gen_0_shot_str.py|
+||aime2026_gen_0_shot_str|AIME2026 数据集生成式任务|准确率(accuracy)|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.aime2026.aime2026_gen_0_shot_str import aime2026_datasets as datasets`|aime2026_gen_0_shot_str.py|
diff --git a/ais_bench/benchmark/configs/datasets/aime2026/README_en.md b/ais_bench/benchmark/configs/datasets/aime2026/README_en.md
index ab928216..f56ae0e4 100644
--- a/ais_bench/benchmark/configs/datasets/aime2026/README_en.md
+++ b/ais_bench/benchmark/configs/datasets/aime2026/README_en.md
@@ -40,7 +40,7 @@ Remember to put your answer inside \boxed{}.
 ```
 ## Available Dataset Tasks
-| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path |
-| --- | --- | --- | --- | --- | --- |
+| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path |
+| --- | --- | --- | --- | --- | --- | --- | --- |
 | aime2026_gen | Generative task for the AIME2026 dataset | Accuracy | 0-shot | Chat format | aime2026_gen_0_shot_chat_prompt.py |
-| aime2026_gen_0_shot_str | Generative task for the AIME2026 dataset | Accuracy | 0-shot | String format | aime2026_gen_0_shot_str.py |
+|| aime2026_gen_0_shot_str | Generative task for the AIME2026 dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.aime2026.aime2026_gen_0_shot_str import aime2026_datasets as datasets`| aime2026_gen_0_shot_str.py |
diff --git a/ais_bench/benchmark/configs/datasets/bbh/README.md b/ais_bench/benchmark/configs/datasets/bbh/README.md
index aa1a5726..f8a073d0 100644
--- a/ais_bench/benchmark/configs/datasets/bbh/README.md
+++ b/ais_bench/benchmark/configs/datasets/bbh/README.md
@@ -78,6 +78,6 @@ rm BBH.zip
 ```
 ## 可用数据集任务
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径|
-| --- | --- | --- | --- | --- | --- |
-|bbh_gen_3_shot_cot_chat|BBH数据集生成式任务|score(accuracy)|3-shot|对话格式|[bbh_gen_3_shot_cot_chat.py](bbh_gen_3_shot_cot_chat.py)|
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径|
+| --- | --- | --- | --- | --- | --- | --- | --- |
+||bbh_gen_3_shot_cot_chat|BBH数据集生成式任务|score(accuracy)|3-shot|对话格式|`from ais_bench.benchmark.configs.datasets.bbh.bbh_gen_3_shot_cot_chat import bbh_datasets as datasets`|[bbh_gen_3_shot_cot_chat.py](bbh_gen_3_shot_cot_chat.py)|
diff --git a/ais_bench/benchmark/configs/datasets/bbh/README_en.md b/ais_bench/benchmark/configs/datasets/bbh/README_en.md
index c7cb4a5b..143250a6 100644
--- a/ais_bench/benchmark/configs/datasets/bbh/README_en.md
+++
b/ais_bench/benchmark/configs/datasets/bbh/README_en.md @@ -78,6 +78,6 @@ rm BBH.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| bbh_gen_3_shot_cot_chat | Generative task for the BBH dataset | Score (Accuracy) | 3-shot | Chat format | [bbh_gen_3_shot_cot_chat.py](bbh_gen_3_shot_cot_chat.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| bbh_gen_3_shot_cot_chat | Generative task for the BBH dataset | Score (Accuracy) | 3-shot | Chat format |`from ais_bench.benchmark.configs.datasets.bbh.bbh_gen_3_shot_cot_chat import bbh_datasets as datasets`| [bbh_gen_3_shot_cot_chat.py](bbh_gen_3_shot_cot_chat.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/ceval/README.md b/ais_bench/benchmark/configs/datasets/ceval/README.md index 18357b98..1e720bf5 100644 --- a/ais_bench/benchmark/configs/datasets/ceval/README.md +++ b/ais_bench/benchmark/configs/datasets/ceval/README.md @@ -184,9 +184,9 @@ rm ceval-exam.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|ceval_gen_0_shot_str|C-Eval数据集生成式任务|accuracy|0-shot|字符串格式|[ceval_gen_0_shot_str.py](ceval_gen_0_shot_str.py)| -|ceval_gen_5_shot_str|C-Eval数据集生成式任务|accuracy|5-shot|字符串格式|[ceval_gen_5_shot_str.py](ceval_gen_5_shot_str.py)| -|ceval_gen_0_shot_cot_chat_prompt|C-Eval数据集生成式任务,prompt带逻辑链(对齐DeepSeek R1精度测试)|accuracy|0-shot|对话格式|[ceval_gen_0_shot_cot_chat_prompt.py](ceval_gen_0_shot_cot_chat_prompt.py)| -|ceval_ppl_0_shot_str|C-Eval数据集PPL任务|accuracy|0-shot|字符串格式|[ceval_ppl_0_shot_str.py](ceval_ppl_0_shot_str.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | 
--- | --- | --- | --- | --- | +|ceval_gen_0_shot_str|C-Eval数据集生成式任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_0_shot_str import ceval_datasets as datasets`|[ceval_gen_0_shot_str.py](ceval_gen_0_shot_str.py)| +|ceval_gen_5_shot_str|C-Eval数据集生成式任务|accuracy|5-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_5_shot_str import ceval_datasets as datasets`|[ceval_gen_5_shot_str.py](ceval_gen_5_shot_str.py)| +|ceval_gen_0_shot_cot_chat_prompt|C-Eval数据集生成式任务,prompt带逻辑链(对齐DeepSeek R1精度测试)|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_0_shot_cot_chat_prompt import ceval_datasets as datasets`|[ceval_gen_0_shot_cot_chat_prompt.py](ceval_gen_0_shot_cot_chat_prompt.py)| +|ceval_ppl_0_shot_str|C-Eval数据集PPL任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.ceval.ceval_ppl_0_shot_str import ceval_datasets as datasets`|[ceval_ppl_0_shot_str.py](ceval_ppl_0_shot_str.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/ceval/README_en.md b/ais_bench/benchmark/configs/datasets/ceval/README_en.md index 181c0f65..ae9eb57e 100644 --- a/ais_bench/benchmark/configs/datasets/ceval/README_en.md +++ b/ais_bench/benchmark/configs/datasets/ceval/README_en.md @@ -184,9 +184,9 @@ rm ceval-exam.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| ceval_gen_0_shot_str | Generative task for the C-Eval dataset | Accuracy | 0-shot | String format | [ceval_gen_0_shot_str.py](ceval_gen_0_shot_str.py) | -| ceval_gen_5_shot_str | Generative task for the C-Eval dataset | Accuracy | 5-shot | String format | [ceval_gen_5_shot_str.py](ceval_gen_5_shot_str.py) | -| ceval_gen_0_shot_cot_chat_prompt | Generative task for the C-Eval dataset with logical chain in prompt (aligned with DeepSeek R1 accuracy test) | 
Accuracy | 0-shot | Chat format | [ceval_gen_0_shot_cot_chat_prompt.py](ceval_gen_0_shot_cot_chat_prompt.py) | -| ceval_ppl_0_shot_str | PPL task for the C-Eval dataset | Accuracy | 0-shot | String format | [ceval_ppl_0_shot_str.py](ceval_ppl_0_shot_str.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| ceval_gen_0_shot_str | Generative task for the C-Eval dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_0_shot_str import ceval_datasets as datasets`| [ceval_gen_0_shot_str.py](ceval_gen_0_shot_str.py) | +| ceval_gen_5_shot_str | Generative task for the C-Eval dataset | Accuracy | 5-shot | String format |`from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_5_shot_str import ceval_datasets as datasets`| [ceval_gen_5_shot_str.py](ceval_gen_5_shot_str.py) | +| ceval_gen_0_shot_cot_chat_prompt | Generative task for the C-Eval dataset with logical chain in prompt (aligned with DeepSeek R1 accuracy test) | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_0_shot_cot_chat_prompt import ceval_datasets as datasets`| [ceval_gen_0_shot_cot_chat_prompt.py](ceval_gen_0_shot_cot_chat_prompt.py) | +| ceval_ppl_0_shot_str | PPL task for the C-Eval dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.ceval.ceval_ppl_0_shot_str import ceval_datasets as datasets`| [ceval_ppl_0_shot_str.py](ceval_ppl_0_shot_str.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/cmmlu/README.md b/ais_bench/benchmark/configs/datasets/cmmlu/README.md index bb843067..b5ea3437 100644 --- a/ais_bench/benchmark/configs/datasets/cmmlu/README.md +++ b/ais_bench/benchmark/configs/datasets/cmmlu/README.md @@ -157,8 +157,8 @@ rm cmmlu.zip ``` ## 可用数据集任务 
-|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|cmmlu_gen_0_shot_cot_chat_prompt|CMMLU数据集生成式任务, prompt带逻辑链|accuracy|0-shot|对话格式|[cmmlu_gen_0_shot_cot_chat_prompt.py](cmmlu_gen_0_shot_cot_chat_prompt.py)| -|cmmlu_gen_5_shot_cot_chat_prompt|CMMLU数据集生成式任务, prompt带逻辑链|accuracy|5-shot|对话格式|[cmmlu_gen_5_shot_cot_chat_prompt.py](cmmlu_gen_5_shot_cot_chat_prompt.py)| -|cmmlu_ppl_0_shot_cot_chat_prompt|CMMLU数据集PPL任务,prompt带逻辑链|accuracy|0-shot|对话格式|[cmmlu_ppl_0_shot_cot_chat_prompt.py](cmmlu_ppl_0_shot_cot_chat_prompt.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|cmmlu_gen_0_shot_cot_chat_prompt|CMMLU数据集生成式任务, prompt带逻辑链|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.cmmlu.cmmlu_gen_0_shot_cot_chat_prompt import cmmlu_datasets as datasets`|[cmmlu_gen_0_shot_cot_chat_prompt.py](cmmlu_gen_0_shot_cot_chat_prompt.py)| +|cmmlu_gen_5_shot_cot_chat_prompt|CMMLU数据集生成式任务, prompt带逻辑链|accuracy|5-shot|对话格式|`from ais_bench.benchmark.configs.datasets.cmmlu.cmmlu_gen_5_shot_cot_chat_prompt import cmmlu_datasets as datasets`|[cmmlu_gen_5_shot_cot_chat_prompt.py](cmmlu_gen_5_shot_cot_chat_prompt.py)| +|cmmlu_ppl_0_shot_cot_chat_prompt|CMMLU数据集PPL任务,prompt带逻辑链|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.cmmlu.cmmlu_ppl_0_shot_cot_chat_prompt import cmmlu_datasets as datasets`|[cmmlu_ppl_0_shot_cot_chat_prompt.py](cmmlu_ppl_0_shot_cot_chat_prompt.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/cmmlu/README_en.md b/ais_bench/benchmark/configs/datasets/cmmlu/README_en.md index 4a55905a..a01e5f82 100644 --- a/ais_bench/benchmark/configs/datasets/cmmlu/README_en.md +++ b/ais_bench/benchmark/configs/datasets/cmmlu/README_en.md @@ -157,8 +157,8 @@ rm cmmlu.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code 
Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| cmmlu_gen_0_shot_cot_chat_prompt | Generative task for the CMMLU dataset with logical chain in prompt | Accuracy | 0-shot | Chat format | [cmmlu_gen_0_shot_cot_chat_prompt.py](cmmlu_gen_0_shot_cot_chat_prompt.py) | -| cmmlu_gen_5_shot_cot_chat_prompt | Generative task for the CMMLU dataset with logical chain in prompt | Accuracy | 5-shot | Chat format | [cmmlu_gen_5_shot_cot_chat_prompt.py](cmmlu_gen_5_shot_cot_chat_prompt.py) | -| cmmlu_ppl_0_shot_cot_chat_prompt | PPL task for the CMMLU dataset with logical chain in prompt | Accuracy | 0-shot | Chat format | [cmmlu_ppl_0_shot_cot_chat_prompt.py](cmmlu_ppl_0_shot_cot_chat_prompt.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| cmmlu_gen_0_shot_cot_chat_prompt | Generative task for the CMMLU dataset with logical chain in prompt | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.cmmlu.cmmlu_gen_0_shot_cot_chat_prompt import cmmlu_datasets as datasets`| [cmmlu_gen_0_shot_cot_chat_prompt.py](cmmlu_gen_0_shot_cot_chat_prompt.py) | +| cmmlu_gen_5_shot_cot_chat_prompt | Generative task for the CMMLU dataset with logical chain in prompt | Accuracy | 5-shot | Chat format |`from ais_bench.benchmark.configs.datasets.cmmlu.cmmlu_gen_5_shot_cot_chat_prompt import cmmlu_datasets as datasets`| [cmmlu_gen_5_shot_cot_chat_prompt.py](cmmlu_gen_5_shot_cot_chat_prompt.py) | +| cmmlu_ppl_0_shot_cot_chat_prompt | PPL task for the CMMLU dataset with logical chain in prompt | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.cmmlu.cmmlu_ppl_0_shot_cot_chat_prompt import cmmlu_datasets as datasets`| [cmmlu_ppl_0_shot_cot_chat_prompt.py](cmmlu_ppl_0_shot_cot_chat_prompt.py) | \ No newline at end of file diff --git 
a/ais_bench/benchmark/configs/datasets/dapo_math/README.md b/ais_bench/benchmark/configs/datasets/dapo_math/README.md index 891092c1..3085ad56 100644 --- a/ais_bench/benchmark/configs/datasets/dapo_math/README.md +++ b/ais_bench/benchmark/configs/datasets/dapo_math/README.md @@ -33,10 +33,10 @@ rm -rf dapo-math-17k/data ``` ## 可用数据集任务 -| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 对应源码配置文件路径 | -| --- | --- | --- | --- | --- | --- | -| dapo_math_gen_0_shot_str | DAPO-math-17k 数据集生成式任务,使用 Minerva 方式提取答案 | accuracy | 0-shot | 字符串格式 | [dapo_math_gen_0_shot_str.py](dapo_math_gen_0_shot_str.py) | -| dapo_math_gen_0_shot_cot_str | DAPO-math-17k 数据集生成式任务,使用严格 boxed 方式提取答案 | accuracy | 0-shot | 字符串格式 | [dapo_math_gen_0_shot_cot_str.py](dapo_math_gen_0_shot_cot_str.py) | +| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 配套文件导入方式 | 对应源码配置文件路径 | +| --- | --- | --- | --- | --- | --- | --- | +| dapo_math_gen_0_shot_str | DAPO-math-17k 数据集生成式任务,使用 Minerva 方式提取答案 | accuracy | 0-shot | 字符串格式 | `from ais_bench.benchmark.configs.datasets.dapo_math.dapo_math_gen_0_shot_str import dapo_math_datasets as datasets` | [dapo_math_gen_0_shot_str.py](dapo_math_gen_0_shot_str.py) | +| dapo_math_gen_0_shot_cot_str | DAPO-math-17k 数据集生成式任务,使用严格 boxed 方式提取答案 | accuracy | 0-shot | 字符串格式 | `from ais_bench.benchmark.configs.datasets.dapo_math.dapo_math_gen_0_shot_cot_str import dapo_math_datasets as datasets` | [dapo_math_gen_0_shot_cot_str.py](dapo_math_gen_0_shot_cot_str.py) | ## 评估方式说明 数据集支持两种答案提取和评估方式: diff --git a/ais_bench/benchmark/configs/datasets/dapo_math/README_en.md b/ais_bench/benchmark/configs/datasets/dapo_math/README_en.md index 13209768..76942430 100644 --- a/ais_bench/benchmark/configs/datasets/dapo_math/README_en.md +++ b/ais_bench/benchmark/configs/datasets/dapo_math/README_en.md @@ -33,10 +33,10 @@ rm -rf dapo-math-17k/data ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| 
--- | --- | --- | --- | --- | --- | -| dapo_math_gen_0_shot_str | Generative task for DAPO-math-17k dataset, using Minerva method to extract answers | accuracy | 0-shot | String Format | [dapo_math_gen_0_shot_str.py](dapo_math_gen_0_shot_str.py) | -| dapo_math_gen_0_shot_cot_str | Generative task for DAPO-math-17k dataset, using strict boxed method to extract answers | accuracy | 0-shot | String Format | [dapo_math_gen_0_shot_cot_str.py](dapo_math_gen_0_shot_cot_str.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| dapo_math_gen_0_shot_str | Generative task for DAPO-math-17k dataset, using Minerva method to extract answers | accuracy | 0-shot | String Format |`from ais_bench.benchmark.configs.datasets.dapo_math.dapo_math_gen_0_shot_str import dapo_math_datasets as datasets`| [dapo_math_gen_0_shot_str.py](dapo_math_gen_0_shot_str.py) | +| dapo_math_gen_0_shot_cot_str | Generative task for DAPO-math-17k dataset, using strict boxed method to extract answers | accuracy | 0-shot | String Format |`from ais_bench.benchmark.configs.datasets.dapo_math.dapo_math_gen_0_shot_cot_str import dapo_math_datasets as datasets`| [dapo_math_gen_0_shot_cot_str.py](dapo_math_gen_0_shot_cot_str.py) | ## Evaluation Method Description The dataset supports two answer extraction and evaluation methods: diff --git a/ais_bench/benchmark/configs/datasets/demo/README.md b/ais_bench/benchmark/configs/datasets/demo/README.md index 1b464b8f..ba351340 100644 --- a/ais_bench/benchmark/configs/datasets/demo/README.md +++ b/ais_bench/benchmark/configs/datasets/demo/README.md @@ -24,7 +24,7 @@ rm gsm8k.zip └── train_socratic.jsonl ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | 
-|demo_gsm8k_gen_4_shot_cot_chat_prompt|gsm8k数据集生成式任务(只取8条数据),带逻辑链|accuracy|4-shot|字符串格式|[demo_gsm8k_gen_4_shot_cot_chat_prompt.py](demo_gsm8k_gen_4_shot_cot_chat_prompt.py)| -|demo_gsm8k_gen_0_shot_cot_str_perf|gsm8k数据集生成式任务(只取8条数据),带逻辑链|性能评测|0-shot|字符串格式|[demo_gsm8k_gen_0_shot_cot_str_perf.py](demo_gsm8k_gen_0_shot_cot_str_perf.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|demo_gsm8k_gen_4_shot_cot_chat_prompt|gsm8k数据集生成式任务(只取8条数据),带逻辑链|accuracy|4-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets` |[demo_gsm8k_gen_4_shot_cot_chat_prompt.py](demo_gsm8k_gen_4_shot_cot_chat_prompt.py)| +|demo_gsm8k_gen_0_shot_cot_str_perf|gsm8k数据集生成式任务(只取8条数据),带逻辑链|性能评测|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_0_shot_cot_str_perf import gsm8k_datasets as datasets` |[demo_gsm8k_gen_0_shot_cot_str_perf.py](demo_gsm8k_gen_0_shot_cot_str_perf.py)| diff --git a/ais_bench/benchmark/configs/datasets/demo/README_en.md b/ais_bench/benchmark/configs/datasets/demo/README_en.md index 219d2dda..9d6d0e32 100644 --- a/ais_bench/benchmark/configs/datasets/demo/README_en.md +++ b/ais_bench/benchmark/configs/datasets/demo/README_en.md @@ -25,10 +25,10 @@ rm gsm8k.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| demo_gsm8k_gen_4_shot_cot_chat_prompt | Generative task for the GSM8K dataset (only 8 entries used) with logical chain | Accuracy | 4-shot | String format | [demo_gsm8k_gen_4_shot_cot_chat_prompt.py](demo_gsm8k_gen_0_shot_cot_str_perf.py) | -| demo_gsm8k_gen_0_shot_cot_str_perf | Generative task for the GSM8K dataset (only 8 entries used) with logical chain | Performance Evaluation | 0-shot | String format | 
[demo_gsm8k_gen_0_shot_cot_str_perf.py](demo_gsm8k_gen_0_shot_cot_str_perf.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| demo_gsm8k_gen_4_shot_cot_chat_prompt | Generative task for the GSM8K dataset (only 8 entries used) with logical chain | Accuracy | 4-shot | String format | `from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets` | [demo_gsm8k_gen_4_shot_cot_chat_prompt.py](demo_gsm8k_gen_4_shot_cot_chat_prompt.py) | +| demo_gsm8k_gen_0_shot_cot_str_perf | Generative task for the GSM8K dataset (only 8 entries used) with logical chain | Performance Evaluation | 0-shot | String format | `from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_0_shot_cot_str_perf import gsm8k_datasets as datasets` | [demo_gsm8k_gen_0_shot_cot_str_perf.py](demo_gsm8k_gen_0_shot_cot_str_perf.py) | ### Translation Notes diff --git a/ais_bench/benchmark/configs/datasets/docvqa/README.md b/ais_bench/benchmark/configs/datasets/docvqa/README.md index 57c6a99c..d9d7d6ba 100644 --- a/ais_bench/benchmark/configs/datasets/docvqa/README.md +++ b/ais_bench/benchmark/configs/datasets/docvqa/README.md @@ -24,6 +24,6 @@ wget https://opencompass.openxlab.space/utils/VLMEval/DocVQA_VAL.tsv ## 可用数据集任务 #### 基本信息 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|docvqa_gen|docvqa数据集生成式任务|anls|0-shot|字符串格式|[docvqa_gen.py](docvqa_gen.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|docvqa_gen|docvqa数据集生成式任务|anls|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.docvqa.docvqa_gen import docvqa_datasets as datasets`|[docvqa_gen.py](docvqa_gen.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/docvqa/README_en.md 
b/ais_bench/benchmark/configs/datasets/docvqa/README_en.md index bd76bfd1..7e67a0d9 100644 --- a/ais_bench/benchmark/configs/datasets/docvqa/README_en.md +++ b/ais_bench/benchmark/configs/datasets/docvqa/README_en.md @@ -24,6 +24,6 @@ wget https://opencompass.openxlab.space/utils/VLMEval/DocVQA_VAL.tsv ## Available Dataset Tasks #### Basic Information -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -|docvqa_gen|docvqa dataset generative task|anls|0-shot|String format|[docvqa_gen.py](docvqa_gen.py)| +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +|docvqa_gen|docvqa dataset generative task|anls|0-shot|String format|`from ais_bench.benchmark.configs.datasets.docvqa.docvqa_gen import docvqa_datasets as datasets`|[docvqa_gen.py](docvqa_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/drop/README.md b/ais_bench/benchmark/configs/datasets/drop/README.md index 30e69a87..d1dcbafe 100644 --- a/ais_bench/benchmark/configs/datasets/drop/README.md +++ b/ais_bench/benchmark/configs/datasets/drop/README.md @@ -22,7 +22,7 @@ rm drop_simple_eval.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|drop_gen_0_shot_str|drop数据集生成式任务|accuracy(pass@1)|0-shot|字符串格式|[drop_gen_0_shot_str.py](drop_gen_0_shot_str.py)| -|drop_gen_3_shot_str|drop数据集生成式任务|accuracy(pass@1)|3-shot|字符串格式|[drop_gen_3_shot_str.py](drop_gen_3_shot_str.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|drop_gen_0_shot_str|drop数据集生成式任务|accuracy(pass@1)|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.drop.drop_gen_0_shot_str import drop_datasets as datasets`|[drop_gen_0_shot_str.py](drop_gen_0_shot_str.py)| 
+|drop_gen_3_shot_str|drop数据集生成式任务|accuracy(pass@1)|3-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.drop.drop_gen_3_shot_str import drop_datasets as datasets`|[drop_gen_3_shot_str.py](drop_gen_3_shot_str.py)| diff --git a/ais_bench/benchmark/configs/datasets/drop/README_en.md b/ais_bench/benchmark/configs/datasets/drop/README_en.md index c5bf7d16..743de05d 100644 --- a/ais_bench/benchmark/configs/datasets/drop/README_en.md +++ b/ais_bench/benchmark/configs/datasets/drop/README_en.md @@ -22,7 +22,7 @@ rm drop_simple_eval.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| drop_gen_0_shot_str | Generative task for the DROP dataset | Accuracy (pass@1) | 0-shot | String format | [drop_gen_0_shot_str.py](drop_gen_0_shot_str.py) | -| drop_gen_3_shot_str | Generative task for the DROP dataset | Accuracy (pass@1) | 3-shot | String format | [drop_gen_3_shot_str.py](drop_gen_3_shot_str.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| drop_gen_0_shot_str | Generative task for the DROP dataset | Accuracy (pass@1) | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.drop.drop_gen_0_shot_str import drop_datasets as datasets`| [drop_gen_0_shot_str.py](drop_gen_0_shot_str.py) | +| drop_gen_3_shot_str | Generative task for the DROP dataset | Accuracy (pass@1) | 3-shot | String format |`from ais_bench.benchmark.configs.datasets.drop.drop_gen_3_shot_str import drop_datasets as datasets`| [drop_gen_3_shot_str.py](drop_gen_3_shot_str.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/gpqa/README.md b/ais_bench/benchmark/configs/datasets/gpqa/README.md index b5825611..347508e3 100644 --- 
a/ais_bench/benchmark/configs/datasets/gpqa/README.md +++ b/ais_bench/benchmark/configs/datasets/gpqa/README.md @@ -25,8 +25,8 @@ rm gpqa.zip └── license.txt ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|gpqa_gen_0_shot_str|gpqa数据集生成式任务|accuracy(pass@1)|0-shot|字符串格式|[gpqa_gen_0_shot_str.py](gpqa_gen_0_shot_str.py)| -|gpqa_gen_0_shot_cot_chat_prompt|gpqa数据集生成式任务(对齐DeepSeek R1精度测试)|accuracy(pass@1)|0-shot|对话格式|[gpqa_gen_0_shot_cot_chat_prompt.py](gpqa_gen_0_shot_cot_chat_prompt.py)| -|gpqa_ppl_0_shot_str|gpqa数据集PPL任务|accuracy(pass@1)|0-shot|字符串格式|[gpqa_ppl_0_shot_str.py](gpqa_ppl_0_shot_str.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|gpqa_gen_0_shot_str|gpqa数据集生成式任务|accuracy(pass@1)|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.gpqa.gpqa_gen_0_shot_str import gpqa_datasets as datasets`|[gpqa_gen_0_shot_str.py](gpqa_gen_0_shot_str.py)| +|gpqa_gen_0_shot_cot_chat_prompt|gpqa数据集生成式任务(对齐DeepSeek R1精度测试)|accuracy(pass@1)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.gpqa.gpqa_gen_0_shot_cot_chat_prompt import gpqa_datasets as datasets`|[gpqa_gen_0_shot_cot_chat_prompt.py](gpqa_gen_0_shot_cot_chat_prompt.py)| +|gpqa_ppl_0_shot_str|gpqa数据集PPL任务|accuracy(pass@1)|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.gpqa.gpqa_ppl_0_shot_str import gpqa_datasets as datasets`|[gpqa_ppl_0_shot_str.py](gpqa_ppl_0_shot_str.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/gpqa/README_en.md b/ais_bench/benchmark/configs/datasets/gpqa/README_en.md index 56180a89..f464abd2 100644 --- a/ais_bench/benchmark/configs/datasets/gpqa/README_en.md +++ b/ais_bench/benchmark/configs/datasets/gpqa/README_en.md @@ -26,11 +26,11 @@ rm gpqa.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration 
File Path | -| --- | --- | --- | --- | --- | --- | -| gpqa_gen_0_shot_str | Generative task for the GPQA dataset | Accuracy (pass@1) | 0-shot | String format | [gpqa_gen_0_shot_str.py](gpqa_gen_0_shot_str.py) | -| gpqa_gen_0_shot_cot_chat_prompt | Generative task for the GPQA dataset (aligned with DeepSeek R1 accuracy test) | Accuracy (pass@1) | 0-shot | Chat format | [gpqa_gen_0_shot_cot_chat_prompt.py](gpqa_gen_0_shot_cot_chat_prompt.py) | -| gpqa_ppl_0_shot_str | PPL task for the GPQA dataset | Accuracy (pass@1) | 0-shot | String format | [gpqa_ppl_0_shot_str.py](gpqa_ppl_0_shot_str.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| gpqa_gen_0_shot_str | Generative task for the GPQA dataset | Accuracy (pass@1) | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.gpqa.gpqa_gen_0_shot_str import gpqa_datasets as datasets`| [gpqa_gen_0_shot_str.py](gpqa_gen_0_shot_str.py) | +| gpqa_gen_0_shot_cot_chat_prompt | Generative task for the GPQA dataset (aligned with DeepSeek R1 accuracy test) | Accuracy (pass@1) | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.gpqa.gpqa_gen_0_shot_cot_chat_prompt import gpqa_datasets as datasets`| [gpqa_gen_0_shot_cot_chat_prompt.py](gpqa_gen_0_shot_cot_chat_prompt.py) | +| gpqa_ppl_0_shot_str | PPL task for the GPQA dataset | Accuracy (pass@1) | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.gpqa.gpqa_ppl_0_shot_str import gpqa_datasets as datasets`| [gpqa_ppl_0_shot_str.py](gpqa_ppl_0_shot_str.py) | ### Translation Notes 1. **Term Consistency**: Technical terms such as "生成式任务" (generative task), "评估指标" (evaluation metric), and "对齐DeepSeek R1精度测试" (aligned with DeepSeek R1 accuracy test) follow standard expressions in AI dataset documentation to ensure clarity for technical users. 
diff --git a/ais_bench/benchmark/configs/datasets/gsm8k/README.md b/ais_bench/benchmark/configs/datasets/gsm8k/README.md index c6cb9e69..f181c0e9 100644 --- a/ais_bench/benchmark/configs/datasets/gsm8k/README.md +++ b/ais_bench/benchmark/configs/datasets/gsm8k/README.md @@ -25,10 +25,10 @@ rm gsm8k.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|gsm8k_gen_4_shot_cot_str|gsm8k数据集生成式任务,带逻辑链|accuracy|4-shot|字符串格式|[gsm8k_gen_4_shot_cot_str.py](gsm8k_gen_4_shot_cot_str.py)| -|gsm8k_gen_4_shot_cot_chat_prompt|gsm8k数据集生成式任务,带逻辑链|accuracy|4-shot|对话格式|[gsm8k_gen_4_shot_cot_chat_prompt.py](gsm8k_gen_4_shot_cot_chat_prompt.py)| -|gsm8k_gen_0_shot_cot_str|gsm8k数据集生成式任务|accuracy|0-shot|字符串格式|[gsm8k_gen_0_shot_cot_str.py](gsm8k_gen_0_shot_cot_str.py)| -|gsm8k_gen_0_shot_cot_chat_prompt|gsm8k数据集生成式任务|accuracy|0-shot|对话格式|[gsm8k_gen_0_shot_cot_chat_prompt.py](gsm8k_gen_0_shot_cot_chat_prompt.py)| -|gsm8k_gen_0_shot_cot_str_perf|gsm8k数据集生成式任务(用于性能测评)|性能测评|0-shot|字符串格式|[gsm8k_gen_0_shot_cot_str_perf.py](gsm8k_gen_0_shot_cot_str_perf.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|gsm8k_gen_4_shot_cot_str|gsm8k数据集生成式任务,带逻辑链|accuracy|4-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets as datasets`|[gsm8k_gen_4_shot_cot_str.py](gsm8k_gen_4_shot_cot_str.py)| +|gsm8k_gen_4_shot_cot_chat_prompt|gsm8k数据集生成式任务,带逻辑链|accuracy|4-shot|对话格式|`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets`|[gsm8k_gen_4_shot_cot_chat_prompt.py](gsm8k_gen_4_shot_cot_chat_prompt.py)| +|gsm8k_gen_0_shot_cot_str|gsm8k数据集生成式任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as datasets`|[gsm8k_gen_0_shot_cot_str.py](gsm8k_gen_0_shot_cot_str.py)| 
+|gsm8k_gen_0_shot_cot_chat_prompt|gsm8k数据集生成式任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_chat_prompt import gsm8k_datasets as datasets`|[gsm8k_gen_0_shot_cot_chat_prompt.py](gsm8k_gen_0_shot_cot_chat_prompt.py)| +|gsm8k_gen_0_shot_cot_str_perf|gsm8k数据集生成式任务(用于性能测评)|性能测评|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str_perf import gsm8k_datasets as datasets`|[gsm8k_gen_0_shot_cot_str_perf.py](gsm8k_gen_0_shot_cot_str_perf.py)| diff --git a/ais_bench/benchmark/configs/datasets/gsm8k/README_en.md b/ais_bench/benchmark/configs/datasets/gsm8k/README_en.md index 4631d7c0..79662b6a 100644 --- a/ais_bench/benchmark/configs/datasets/gsm8k/README_en.md +++ b/ais_bench/benchmark/configs/datasets/gsm8k/README_en.md @@ -25,10 +25,10 @@ rm gsm8k.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| gsm8k_gen_4_shot_cot_str | Generative task for the GSM8K dataset with logical chain | Accuracy | 4-shot | String format | [gsm8k_gen_4_shot_cot_str.py](gsm8k_gen_4_shot_cot_str.py) | -| gsm8k_gen_4_shot_cot_chat_prompt | Generative task for the GSM8K dataset with logical chain | Accuracy | 4-shot | Chat format | [gsm8k_gen_4_shot_cot_chat_prompt.py](gsm8k_gen_4_shot_cot_chat_prompt.py) | -| gsm8k_gen_0_shot_cot_str | Generative task for the GSM8K dataset | Accuracy | 0-shot | String format | [gsm8k_gen_0_shot_cot_str.py](gsm8k_gen_0_shot_cot_str.py) | -| gsm8k_gen_0_shot_cot_chat_prompt | Generative task for the GSM8K dataset | Accuracy | 0-shot | Chat format | [gsm8k_gen_0_shot_cot_chat_prompt.py](gsm8k_gen_0_shot_cot_chat_prompt.py) | -| gsm8k_gen_0_shot_cot_str_perf | Generative task for the GSM8K dataset (for performance evaluation) | Performance Evaluation | 0-shot | String format | 
[gsm8k_gen_0_shot_cot_str_perf.py](gsm8k_gen_0_shot_cot_str_perf.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| gsm8k_gen_4_shot_cot_str | Generative task for the GSM8K dataset with logical chain | Accuracy | 4-shot | String format |`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets as datasets`| [gsm8k_gen_4_shot_cot_str.py](gsm8k_gen_4_shot_cot_str.py) | +| gsm8k_gen_4_shot_cot_chat_prompt | Generative task for the GSM8K dataset with logical chain | Accuracy | 4-shot | Chat format |`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets`| [gsm8k_gen_4_shot_cot_chat_prompt.py](gsm8k_gen_4_shot_cot_chat_prompt.py) | +| gsm8k_gen_0_shot_cot_str | Generative task for the GSM8K dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as datasets`| [gsm8k_gen_0_shot_cot_str.py](gsm8k_gen_0_shot_cot_str.py) | +| gsm8k_gen_0_shot_cot_chat_prompt | Generative task for the GSM8K dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_chat_prompt import gsm8k_datasets as datasets`| [gsm8k_gen_0_shot_cot_chat_prompt.py](gsm8k_gen_0_shot_cot_chat_prompt.py) | +| gsm8k_gen_0_shot_cot_str_perf | Generative task for the GSM8K dataset (for performance evaluation) | Performance Evaluation | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str_perf import gsm8k_datasets as datasets`| [gsm8k_gen_0_shot_cot_str_perf.py](gsm8k_gen_0_shot_cot_str_perf.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/hellaswag/README.md 
b/ais_bench/benchmark/configs/datasets/hellaswag/README.md index e5590adb..c5dcb24a 100644 --- a/ais_bench/benchmark/configs/datasets/hellaswag/README.md +++ b/ais_bench/benchmark/configs/datasets/hellaswag/README.md @@ -24,8 +24,8 @@ rm hellaswag.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|hellaswag_gen_0_shot_chat_prompt|hellaswag数据集生成式任务|accuracy|0-shot|对话格式|[hellaswag_gen_0_shot_chat_prompt.py](hellaswag_gen_0_shot_chat_prompt.py)| -|hellaswag_gen_10_shot_chat_prompt|hellaswag数据集生成式任务|accuracy|10-shot|对话格式|[hellaswag_gen_10_shot_chat_prompt.py](hellaswag_gen_10_shot_chat_prompt.py)| -|hellaswag_ppl_0_shot_chat_prompt|hellaswag数据集PPL任务|accuracy|0-shot|对话格式|[hellaswag_ppl_0_shot_chat_prompt.py](hellaswag_ppl_0_shot_chat_prompt.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|hellaswag_gen_0_shot_chat_prompt|hellaswag数据集生成式任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.hellaswag.hellaswag_gen_0_shot_chat_prompt import hellaswag_datasets as datasets`|[hellaswag_gen_0_shot_chat_prompt.py](hellaswag_gen_0_shot_chat_prompt.py)| +|hellaswag_gen_10_shot_chat_prompt|hellaswag数据集生成式任务|accuracy|10-shot|对话格式|`from ais_bench.benchmark.configs.datasets.hellaswag.hellaswag_gen_10_shot_chat_prompt import hellaswag_datasets as datasets`|[hellaswag_gen_10_shot_chat_prompt.py](hellaswag_gen_10_shot_chat_prompt.py)| +|hellaswag_ppl_0_shot_chat_prompt|hellaswag数据集PPL任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.hellaswag.hellaswag_ppl_0_shot_chat_prompt import hellaswag_datasets as datasets`|[hellaswag_ppl_0_shot_chat_prompt.py](hellaswag_ppl_0_shot_chat_prompt.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/hellaswag/README_en.md b/ais_bench/benchmark/configs/datasets/hellaswag/README_en.md index f840d06a..5044932a 100644 --- 
a/ais_bench/benchmark/configs/datasets/hellaswag/README_en.md +++ b/ais_bench/benchmark/configs/datasets/hellaswag/README_en.md @@ -24,8 +24,8 @@ rm hellaswag.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| hellaswag_gen_0_shot_chat_prompt | Generative task for the HellaSwag dataset | Accuracy | 0-shot | Chat format | [hellaswag_gen_0_shot_chat_prompt.py](hellaswag_gen_0_shot_chat_prompt.py) | -| hellaswag_gen_10_shot_chat_prompt | Generative task for the HellaSwag dataset | Accuracy | 10-shot | Chat format | [hellaswag_gen_10_shot_chat_prompt.py](hellaswag_gen_10_shot_chat_prompt.py) | -| hellaswag_ppl_0_shot_chat_prompt | PPL task for the hellaswag dataset | Accuracy | 0-shot | Chat format | [hellaswag_ppl_0_shot_chat_prompt.py](hellaswag_ppl_0_shot_chat_prompt.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| hellaswag_gen_0_shot_chat_prompt | Generative task for the HellaSwag dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.hellaswag.hellaswag_gen_0_shot_chat_prompt import hellaswag_datasets as datasets`| [hellaswag_gen_0_shot_chat_prompt.py](hellaswag_gen_0_shot_chat_prompt.py) | +| hellaswag_gen_10_shot_chat_prompt | Generative task for the HellaSwag dataset | Accuracy | 10-shot | Chat format |`from ais_bench.benchmark.configs.datasets.hellaswag.hellaswag_gen_10_shot_chat_prompt import hellaswag_datasets as datasets`| [hellaswag_gen_10_shot_chat_prompt.py](hellaswag_gen_10_shot_chat_prompt.py) | +| hellaswag_ppl_0_shot_chat_prompt | PPL task for the hellaswag dataset | Accuracy | 0-shot | Chat format |`from
ais_bench.benchmark.configs.datasets.hellaswag.hellaswag_ppl_0_shot_chat_prompt import hellaswag_datasets as datasets`| [hellaswag_ppl_0_shot_chat_prompt.py](hellaswag_ppl_0_shot_chat_prompt.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/hle/README.md b/ais_bench/benchmark/configs/datasets/hle/README.md index b0752eb9..033dc3ba 100644 --- a/ais_bench/benchmark/configs/datasets/hle/README.md +++ b/ais_bench/benchmark/configs/datasets/hle/README.md @@ -24,7 +24,7 @@ HLE(Humanity's Last Exam)是 Center for AI Safety 发布的前沿多模态 ## 可用数据集任务 -| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 对应源码配置文件路径 | -| --- | --- | --- | --- | --- | --- | -| hle | HLE 数据集 | 准确率 (accuracy)、置信度校准误差 (calibration_error) | 0-shot | 对话格式 | hle_llmjudge.py | +| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 配套文件导入方式 | 对应源码配置文件路径 | +| --- | --- | --- | --- | --- | --- | --- | +| hle | HLE 数据集 | 准确率 (accuracy)、置信度校准误差 (calibration_error) | 0-shot | 对话格式 | `from ais_bench.benchmark.configs.datasets.hle.hle_llmjudge import hle_datasets as datasets` | [hle_llmjudge.py](hle_llmjudge.py) | diff --git a/ais_bench/benchmark/configs/datasets/hle/README_en.md b/ais_bench/benchmark/configs/datasets/hle/README_en.md index aa9ef7c4..f03e3516 100644 --- a/ais_bench/benchmark/configs/datasets/hle/README_en.md +++ b/ais_bench/benchmark/configs/datasets/hle/README_en.md @@ -24,7 +24,7 @@ HLE (Humanity's Last Exam) is a frontier multimodal benchmark dataset released b ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| hle | HLE dataset | Accuracy, Calibration Error | 0-shot | Chat format | hle_llmjudge.py | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| hle | HLE dataset | Accuracy,
Calibration Error | 0-shot | Chat format | `from ais_bench.benchmark.configs.datasets.hle.hle_llmjudge import hle_datasets as datasets` | [hle_llmjudge.py](hle_llmjudge.py) | diff --git a/ais_bench/benchmark/configs/datasets/humaneval/README.md b/ais_bench/benchmark/configs/datasets/humaneval/README.md index f709d2be..2705b56a 100644 --- a/ais_bench/benchmark/configs/datasets/humaneval/README.md +++ b/ais_bench/benchmark/configs/datasets/humaneval/README.md @@ -28,6 +28,6 @@ rm humaneval.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|humaneval_gen_0_shot|humaneval数据集生成式任务|pass@1|0-shot|字符串格式|[humaneval_gen_0_shot.py](humaneval_gen_0_shot.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|humaneval_gen_0_shot|humaneval数据集生成式任务|pass@1|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.humaneval.humaneval_gen_0_shot import humaneval_datasets as datasets`|[humaneval_gen_0_shot.py](humaneval_gen_0_shot.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/humaneval/README_en.md b/ais_bench/benchmark/configs/datasets/humaneval/README_en.md index e6784e03..f1f933c7 100644 --- a/ais_bench/benchmark/configs/datasets/humaneval/README_en.md +++ b/ais_bench/benchmark/configs/datasets/humaneval/README_en.md @@ -28,6 +28,6 @@ rm humaneval.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| humaneval_gen_0_shot | Generative task for the HumanEval dataset | pass@1 | 0-shot | String format | [humaneval_gen_0_shot.py](humaneval_gen_0_shot.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | ---
| --- | +| humaneval_gen_0_shot | Generative task for the HumanEval dataset | pass@1 | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.humaneval.humaneval_gen_0_shot import humaneval_datasets as datasets`| [humaneval_gen_0_shot.py](humaneval_gen_0_shot.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/humanevalx/README.md b/ais_bench/benchmark/configs/datasets/humanevalx/README.md index 014f0df4..90ce9fa3 100644 --- a/ais_bench/benchmark/configs/datasets/humanevalx/README.md +++ b/ais_bench/benchmark/configs/datasets/humanevalx/README.md @@ -32,6 +32,6 @@ rm humanevalx.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|humanevalx_gen_0_shot|humanevalx数据集生成式任务|pass@1|0-shot|字符串格式|[humanevalx_gen_0_shot.py](humanevalx_gen_0_shot.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|humanevalx_gen_0_shot|humanevalx数据集生成式任务|pass@1|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.humanevalx.humanevalx_gen_0_shot import humanevalx_datasets as datasets`|[humanevalx_gen_0_shot.py](humanevalx_gen_0_shot.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/humanevalx/README_en.md b/ais_bench/benchmark/configs/datasets/humanevalx/README_en.md index 344c9f35..6fd502f1 100644 --- a/ais_bench/benchmark/configs/datasets/humanevalx/README_en.md +++ b/ais_bench/benchmark/configs/datasets/humanevalx/README_en.md @@ -32,6 +32,6 @@ rm humanevalx.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| humanevalx_gen_0_shot | Generative task for the HumanEvalX dataset | pass@1 | 0-shot | String format | [humanevalx_gen_0_shot.py](humanevalx_gen_0_shot.py) | \ No newline at end of file +| Task Name |
Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| humanevalx_gen_0_shot | Generative task for the HumanEvalX dataset | pass@1 | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.humanevalx.humanevalx_gen_0_shot import humanevalx_datasets as datasets`| [humanevalx_gen_0_shot.py](humanevalx_gen_0_shot.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/ifeval/README.md b/ais_bench/benchmark/configs/datasets/ifeval/README.md index 2bbeb9b0..952b21e0 100644 --- a/ais_bench/benchmark/configs/datasets/ifeval/README.md +++ b/ais_bench/benchmark/configs/datasets/ifeval/README.md @@ -29,6 +29,6 @@ rm ifeval.zip ## 可用数据集任务 ### ifeval_0_shot_gen_str -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|ifeval_0_shot_gen_str|ifeval数据集生成式任务|accuracy|0-shot|字符串格式|[ifeval_0_shot_gen_str.py](ifeval_0_shot_gen_str.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|ifeval_0_shot_gen_str|ifeval数据集生成式任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.ifeval.ifeval_0_shot_gen_str import ifeval_datasets as datasets`|[ifeval_0_shot_gen_str.py](ifeval_0_shot_gen_str.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/ifeval/README_en.md b/ais_bench/benchmark/configs/datasets/ifeval/README_en.md index e0e1590d..460d50e5 100644 --- a/ais_bench/benchmark/configs/datasets/ifeval/README_en.md +++ b/ais_bench/benchmark/configs/datasets/ifeval/README_en.md @@ -29,6 +29,6 @@ rm ifeval.zip ## Available Dataset Tasks ### ifeval_0_shot_gen_str -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| ifeval_0_shot_gen_str | Generative
task for the IFEval dataset | Accuracy | 0-shot | String format | [ifeval_0_shot_gen_str.py](ifeval_0_shot_gen_str.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| ifeval_0_shot_gen_str | Generative task for the IFEval dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.ifeval.ifeval_0_shot_gen_str import ifeval_datasets as datasets`| [ifeval_0_shot_gen_str.py](ifeval_0_shot_gen_str.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/infovqa/README.md b/ais_bench/benchmark/configs/datasets/infovqa/README.md index 323d0b08..12890b93 100644 --- a/ais_bench/benchmark/configs/datasets/infovqa/README.md +++ b/ais_bench/benchmark/configs/datasets/infovqa/README.md @@ -24,6 +24,6 @@ wget https://opencompass.openxlab.space/utils/VLMEval/InfoVQA_VAL.tsv ## 可用数据集任务 #### 基本信息 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|infovqa_gen|infovqa数据集生成式任务|anls|0-shot|字符串格式|[infovqa_gen.py](infovqa_gen.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|infovqa_gen|infovqa数据集生成式任务|anls|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.infovqa.infovqa_gen import infovqa_datasets as datasets`|[infovqa_gen.py](infovqa_gen.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/infovqa/README_en.md b/ais_bench/benchmark/configs/datasets/infovqa/README_en.md index 4574d31c..aa6bc94b 100644 --- a/ais_bench/benchmark/configs/datasets/infovqa/README_en.md +++ b/ais_bench/benchmark/configs/datasets/infovqa/README_en.md @@ -24,6 +24,6 @@ wget https://opencompass.openxlab.space/utils/VLMEval/InfoVQA_VAL.tsv ## Available Dataset Tasks #### Basic Information -| Task Name | Introduction
| Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -|infovqa_gen|infovqa dataset generative task|anls|0-shot|String format|[infovqa_gen.py](infovqa_gen.py)| +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +|infovqa_gen|infovqa dataset generative task|anls|0-shot|String format|`from ais_bench.benchmark.configs.datasets.infovqa.infovqa_gen import infovqa_datasets as datasets`|[infovqa_gen.py](infovqa_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/lambada/README.md b/ais_bench/benchmark/configs/datasets/lambada/README.md index e0cee1f2..c0ae73b6 100644 --- a/ais_bench/benchmark/configs/datasets/lambada/README.md +++ b/ais_bench/benchmark/configs/datasets/lambada/README.md @@ -24,7 +24,7 @@ rm -r OpenCompassData-core-20240207.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|lambada_gen_0_shot_chat|lambada数据集生成式任务|accuracy|0-shot|对话格式|[lambada_gen_0_shot_chat.py](lambada_gen_0_shot_chat.py)| -|lambada_gen_0_shot_str|lambada数据集生成式任务|accuracy|0-shot|字符串格式|[lambada_gen_0_shot_str.py](lambada_gen_0_shot_str.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|lambada_gen_0_shot_chat|lambada数据集生成式任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.lambada.lambada_gen_0_shot_chat import lambada_datasets as datasets`|[lambada_gen_0_shot_chat.py](lambada_gen_0_shot_chat.py)| +|lambada_gen_0_shot_str|lambada数据集生成式任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.lambada.lambada_gen_0_shot_str import lambada_datasets as datasets`|[lambada_gen_0_shot_str.py](lambada_gen_0_shot_str.py)| diff --git a/ais_bench/benchmark/configs/datasets/lambada/README_en.md
b/ais_bench/benchmark/configs/datasets/lambada/README_en.md index f6069a6e..cffe0c2c 100644 --- a/ais_bench/benchmark/configs/datasets/lambada/README_en.md +++ b/ais_bench/benchmark/configs/datasets/lambada/README_en.md @@ -25,10 +25,10 @@ rm -r OpenCompassData-core-20240207.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| lambada_gen_0_shot_chat | Generative task for the LAMBADA dataset | Accuracy | 0-shot | Chat format | [lambada_gen_0_shot_chat.py](lambada_gen_0_shot_chat.py) | -| lambada_gen_0_shot_str | Generative task for the LAMBADA dataset | Accuracy | 0-shot | String format | [lambada_gen_0_shot_str.py](lambada_gen_0_shot_str.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| lambada_gen_0_shot_chat | Generative task for the LAMBADA dataset | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.lambada.lambada_gen_0_shot_chat import lambada_datasets as datasets`| [lambada_gen_0_shot_chat.py](lambada_gen_0_shot_chat.py) | +| lambada_gen_0_shot_str | Generative task for the LAMBADA dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.lambada.lambada_gen_0_shot_str import lambada_datasets as datasets`| [lambada_gen_0_shot_str.py](lambada_gen_0_shot_str.py) | ### Translation Notes diff --git a/ais_bench/benchmark/configs/datasets/lcsts/README.md b/ais_bench/benchmark/configs/datasets/lcsts/README.md index a6b476f6..a93bb725 100644 --- a/ais_bench/benchmark/configs/datasets/lcsts/README.md +++ b/ais_bench/benchmark/configs/datasets/lcsts/README.md @@ -25,7 +25,7 @@ rm -r OpenCompassData-core-20240207.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | ---
| --- | -|lcsts_gen_0_shot_chat|lcsts数据集生成式任务|accuracy|0-shot|对话格式|[lcsts_gen_0_shot_chat.py](lcsts_gen_0_shot_chat.py)| -|lcsts_gen_0_shot_str|lcsts数据集生成式任务|accuracy|0-shot|字符串格式|[lcsts_gen_0_shot_str.py](lcsts_gen_0_shot_str.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|lcsts_gen_0_shot_chat|lcsts数据集生成式任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.lcsts.lcsts_gen_0_shot_chat import lcsts_datasets as datasets`|[lcsts_gen_0_shot_chat.py](lcsts_gen_0_shot_chat.py)| +|lcsts_gen_0_shot_str|lcsts数据集生成式任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.lcsts.lcsts_gen_0_shot_str import lcsts_datasets as datasets`|[lcsts_gen_0_shot_str.py](lcsts_gen_0_shot_str.py)| diff --git a/ais_bench/benchmark/configs/datasets/lcsts/README_en.md b/ais_bench/benchmark/configs/datasets/lcsts/README_en.md index 82bfa706..68f0adce 100644 --- a/ais_bench/benchmark/configs/datasets/lcsts/README_en.md +++ b/ais_bench/benchmark/configs/datasets/lcsts/README_en.md @@ -26,7 +26,7 @@ rm -r OpenCompassData-core-20240207.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| lcsts_gen_0_shot_chat | Generative task for the LCSTS dataset | Accuracy | 0-shot | Chat format | [lcsts_gen_0_shot_chat.py](lcsts_gen_0_shot_chat.py) | -| lcsts_gen_0_shot_str | Generative task for the LCSTS dataset | Accuracy | 0-shot | String format | [lcsts_gen_0_shot_str.py](lcsts_gen_0_shot_str.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| lcsts_gen_0_shot_chat | Generative task for the LCSTS dataset | Accuracy | 0-shot | Chat format |`from
ais_bench.benchmark.configs.datasets.lcsts.lcsts_gen_0_shot_chat import lcsts_datasets as datasets`| [lcsts_gen_0_shot_chat.py](lcsts_gen_0_shot_chat.py) | +| lcsts_gen_0_shot_str | Generative task for the LCSTS dataset | Accuracy | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.lcsts.lcsts_gen_0_shot_str import lcsts_datasets as datasets`| [lcsts_gen_0_shot_str.py](lcsts_gen_0_shot_str.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/livecodebench/README.md b/ais_bench/benchmark/configs/datasets/livecodebench/README.md index 9971dd2d..076ed326 100644 --- a/ais_bench/benchmark/configs/datasets/livecodebench/README.md +++ b/ais_bench/benchmark/configs/datasets/livecodebench/README.md @@ -31,8 +31,8 @@ git clone https://huggingface.co/datasets/livecodebench/code_generation_lite ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|livecodebench_0_shot_chat_v4_v5|code_generation_lite数据集的生成式任务,与DeepSeek-R1测评使用数据集一致:LiveCodeBench(2024-08 – 2025-01)|pass@1|0-shot|对话格式|[livecodebench_0_shot_chat_v4_v5.py](livecodebench_0_shot_chat_v4_v5.py)| -|livecodebench_0_shot_chat_v4_v5_v6|code_generation_lite数据集的生成式任务, 与DeepSeek-V3.1和DeepSeek-V3.2测评使用数据集一致:LiveCodeBench(2024-08 – 2025-05)|pass@1|0-shot|对话格式|[livecodebench_0_shot_chat_v4_v5_v6.py](livecodebench_0_shot_chat_v4_v5_v6.py)| -|livecodebench_0_shot_chat_v6|code_generation_lite数据集的生成式任务, 与Qwen3测评使用数据集一致:LiveCodeBench(2025-05)|pass@1|0-shot|对话格式|[livecodebench_0_shot_chat_v6.py](livecodebench_0_shot_chat_v6.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|livecodebench_0_shot_chat_v4_v5|code_generation_lite数据集的生成式任务,与DeepSeek-R1测评使用数据集一致:LiveCodeBench(2024-08 – 2025-01)|pass@1|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.livecodebench.livecodebench_0_shot_chat_v4_v5 import LCB_datasets as
datasets`|[livecodebench_0_shot_chat_v4_v5.py](livecodebench_0_shot_chat_v4_v5.py)| +|livecodebench_0_shot_chat_v4_v5_v6|code_generation_lite数据集的生成式任务, 与DeepSeek-V3.1和DeepSeek-V3.2测评使用数据集一致:LiveCodeBench(2024-08 – 2025-05)|pass@1|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.livecodebench.livecodebench_0_shot_chat_v4_v5_v6 import LCB_datasets as datasets`|[livecodebench_0_shot_chat_v4_v5_v6.py](livecodebench_0_shot_chat_v4_v5_v6.py)| +|livecodebench_0_shot_chat_v6|code_generation_lite数据集的生成式任务, 与Qwen3测评使用数据集一致:LiveCodeBench(2025-05)|pass@1|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.livecodebench.livecodebench_0_shot_chat_v6 import LCB_datasets as datasets`|[livecodebench_0_shot_chat_v6.py](livecodebench_0_shot_chat_v6.py)| diff --git a/ais_bench/benchmark/configs/datasets/livecodebench/README_en.md b/ais_bench/benchmark/configs/datasets/livecodebench/README_en.md index f3834e86..e1651f1d 100644 --- a/ais_bench/benchmark/configs/datasets/livecodebench/README_en.md +++ b/ais_bench/benchmark/configs/datasets/livecodebench/README_en.md @@ -31,8 +31,8 @@ git clone https://huggingface.co/datasets/livecodebench/code_generation_lite ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -|livecodebench_0_shot_chat_v4_v5|Generative task for the code_generation_lite dataset, same with DeepSeek-R1 Evaluation: LiveCodeBench(2024-08 – 2025-01)|pass@1|0-shot|Chat format|[livecodebench_0_shot_chat_v4_v5.py](livecodebench_0_shot_chat_v4_v5.py)| -|livecodebench_0_shot_chat_v4_v5_v6|Generative task for the code_generation_lite dataset, same with DeepSeek-V3.1 and DeepSeek-V3.2 Evaluation: LiveCodeBench(2024-08 – 2025-05)|pass@1|0-shot|Chat format|[livecodebench_0_shot_chat_v4_v5_v6.py](livecodebench_0_shot_chat_v4_v5_v6.py)| -|livecodebench_0_shot_chat_v6|Generative task for the code_generation_lite dataset, same with
Qwen3 Evaluation: LiveCodeBench(2025-05)|pass@1|0-shot|Chat format|[livecodebench_0_shot_chat_v6.py](livecodebench_0_shot_chat_v6.py)| +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +|livecodebench_0_shot_chat_v4_v5|Generative task for the code_generation_lite dataset, same with DeepSeek-R1 Evaluation: LiveCodeBench(2024-08 – 2025-01)|pass@1|0-shot|Chat format|`from ais_bench.benchmark.configs.datasets.livecodebench.livecodebench_0_shot_chat_v4_v5 import LCB_datasets as datasets`|[livecodebench_0_shot_chat_v4_v5.py](livecodebench_0_shot_chat_v4_v5.py)| +|livecodebench_0_shot_chat_v4_v5_v6|Generative task for the code_generation_lite dataset, same with DeepSeek-V3.1 and DeepSeek-V3.2 Evaluation: LiveCodeBench(2024-08 – 2025-05)|pass@1|0-shot|Chat format|`from ais_bench.benchmark.configs.datasets.livecodebench.livecodebench_0_shot_chat_v4_v5_v6 import LCB_datasets as datasets`|[livecodebench_0_shot_chat_v4_v5_v6.py](livecodebench_0_shot_chat_v4_v5_v6.py)| +|livecodebench_0_shot_chat_v6|Generative task for the code_generation_lite dataset, same with Qwen3 Evaluation: LiveCodeBench(2025-05)|pass@1|0-shot|Chat format|`from ais_bench.benchmark.configs.datasets.livecodebench.livecodebench_0_shot_chat_v6 import LCB_datasets as datasets`|[livecodebench_0_shot_chat_v6.py](livecodebench_0_shot_chat_v6.py)| diff --git a/ais_bench/benchmark/configs/datasets/longbench/README.md b/ais_bench/benchmark/configs/datasets/longbench/README.md index 86a27eab..3f48cc0e 100644 --- a/ais_bench/benchmark/configs/datasets/longbench/README.md +++ b/ais_bench/benchmark/configs/datasets/longbench/README.md @@ -48,9 +48,9 @@ LongBench包含14个英文任务、5个中文任务和2个代码任务,大部 └── LongBench.py ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- |
-|longbench|longbench|准确率(accuracy)|0-shot|对话格式|[longbench.py](longbench.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|longbench|longbench|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.longbench.longbench import longbench_datasets as datasets`|[longbench.py](longbench.py)| |longbench_2wikimqa_gen|longbench_2wikimqa_gen|准确率(accuracy)|0-shot|对话格式|[longbench_2wikimqa_gen.py](longbench2wikimqa/longbench_2wikimqa_gen.py)| |longbench_dureader_gen|longbench_dureader_gen|准确率(accuracy)|0-shot|对话格式|[longbench_dureader_gen.py](longbenchdureader/longbench_dureader_gen.py)| |longbench_gov_report_gen|longbench_gov_report_gen|准确率(accuracy)|0-shot|对话格式|[longbench_gov_report_gen.py](longbenchgov_report/longbench_gov_report_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/longbench/README_en.md b/ais_bench/benchmark/configs/datasets/longbench/README_en.md index a43f2ec2..a3d488c6 100644 --- a/ais_bench/benchmark/configs/datasets/longbench/README_en.md +++ b/ais_bench/benchmark/configs/datasets/longbench/README_en.md @@ -52,9 +52,9 @@ It is recommended to download the dataset from Hugging Face: [https://huggingfac ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| longbench | LongBench main task | Accuracy | 0-shot | Chat format | [longbench.py](longbench.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| longbench | LongBench main task | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.longbench.longbench import longbench_datasets as datasets`| [longbench.py](longbench.py) | | longbench_2wikimqa_gen | LongBench 2WikiMQA generative task | Accuracy | 0-shot
| Chat format | [longbench_2wikimqa_gen.py](longbench2wikimqa/longbench_2wikimqa_gen.py) | | longbench_dureader_gen | LongBench DuReader generative task | Accuracy | 0-shot | Chat format | [longbench_dureader_gen.py](longbenchdureader/longbench_dureader_gen.py) | | longbench_gov_report_gen | LongBench GovReport generative task | Accuracy | 0-shot | Chat format | [longbench_gov_report_gen.py](longbenchgov_report/longbench_gov_report_gen.py) | diff --git a/ais_bench/benchmark/configs/datasets/longbenchv2/README.md b/ais_bench/benchmark/configs/datasets/longbenchv2/README.md index c619ec8b..f862df94 100644 --- a/ais_bench/benchmark/configs/datasets/longbenchv2/README.md +++ b/ais_bench/benchmark/configs/datasets/longbenchv2/README.md @@ -17,6 +17,6 @@ LongBench v2包含503道富有挑战性的多项选择题,涵盖六大任务 └── data.json ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|longbenchv2_gen|longbenchv2|准确率(accuracy)|0-shot|对话格式|[longbenchv2_gen.py](longbenchv2_gen.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|longbenchv2_gen|longbenchv2|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.longbenchv2.longbenchv2_gen import LongBenchv2_datasets as datasets`|[longbenchv2_gen.py](longbenchv2_gen.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/longbenchv2/README_en.md b/ais_bench/benchmark/configs/datasets/longbenchv2/README_en.md index 7255db3d..f799046d 100644 --- a/ais_bench/benchmark/configs/datasets/longbenchv2/README_en.md +++ b/ais_bench/benchmark/configs/datasets/longbenchv2/README_en.md @@ -21,6 +21,6 @@ It is recommended to download the dataset from Hugging Face: [https://huggingfac ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -|
longbenchv2_gen | LongBench v2 task | Accuracy | 0-shot | Chat format | [longbenchv2_gen.py](longbenchv2_gen.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| longbenchv2_gen | LongBench v2 task | Accuracy | 0-shot | Chat format | `from ais_bench.benchmark.configs.datasets.longbenchv2.longbenchv2_gen import LongBenchv2_datasets as datasets` | [longbenchv2_gen.py](longbenchv2_gen.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/math/README.md b/ais_bench/benchmark/configs/datasets/math/README.md index df077d8a..ff62a511 100644 --- a/ais_bench/benchmark/configs/datasets/math/README.md +++ b/ais_bench/benchmark/configs/datasets/math/README.md @@ -34,8 +34,8 @@ rm math.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|math_prm800k_500_0shot_cot_gen|MATH500数据集生成式任务, 默认max out tokens长度取32768,prompt带逻辑链|accuracy(pass@1)|0-shot|字符串格式|[math_prm800k_500_0shot_cot_gen.py](math_prm800k_500_0shot_cot_gen.py)| -|math_prm800k_500_5shot_cot_gen|MATH500数据集生成式任务, 默认max out tokens长度取32768,prompt带逻辑链|accuracy(pass@1)|5-shot|字符串格式|[math_prm800k_500_5shot_cot_gen.py](math_prm800k_500_5shot_cot_gen.py)| -|math500_gen_0_shot_cot_chat_prompt|MATH500数据集生成式任务,prompt带逻辑链(对齐DeepSeek R1精度测试)|accuracy(pass@1)|0-shot|对话格式|[math500_gen_0_shot_cot_chat_prompt.py](math500_gen_0_shot_cot_chat_prompt.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|math_prm800k_500_0shot_cot_gen|MATH500数据集生成式任务, 默认max out tokens长度取32768,prompt带逻辑链|accuracy(pass@1)|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.math.math_prm800k_500_0shot_cot_gen import math_datasets as
datasets`|[math_prm800k_500_0shot_cot_gen.py](math_prm800k_500_0shot_cot_gen.py)| +|math_prm800k_500_5shot_cot_gen|MATH500数据集生成式任务, 默认max out tokens长度取32768,prompt带逻辑链|accuracy(pass@1)|5-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.math.math_prm800k_500_5shot_cot_gen import math_datasets as datasets`|[math_prm800k_500_5shot_cot_gen.py](math_prm800k_500_5shot_cot_gen.py)| +|math500_gen_0_shot_cot_chat_prompt|MATH500数据集生成式任务,prompt带逻辑链(对齐DeepSeek R1精度测试)|accuracy(pass@1)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as datasets`|[math500_gen_0_shot_cot_chat_prompt.py](math500_gen_0_shot_cot_chat_prompt.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/math/README_en.md b/ais_bench/benchmark/configs/datasets/math/README_en.md index 0c6c025c..56b868db 100644 --- a/ais_bench/benchmark/configs/datasets/math/README_en.md +++ b/ais_bench/benchmark/configs/datasets/math/README_en.md @@ -33,8 +33,8 @@ rm math.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| math_prm800k_500_0shot_cot_gen | Generative task for the MATH500 dataset. The default maximum output token length is 32768, with a logical chain in the prompt. | Accuracy (pass@1) | 0-shot | String format | [math_prm800k_500_0shot_cot_gen.py](math_prm800k_500_0shot_cot_gen.py) | -| math_prm800k_500_5shot_cot_gen | Generative task for the MATH500 dataset. The default maximum output token length is 32768, with a logical chain in the prompt.
| Accuracy (pass@1) | 5-shot | String format | [math_prm800k_500_5shot_cot_gen.py](math_prm800k_500_5shot_cot_gen.py) | -| math500_gen_0_shot_cot_chat_prompt | Generative task for the MATH500 dataset, with a logical chain in the prompt (aligned with DeepSeek R1 accuracy test) | Accuracy (pass@1) | 0-shot | Chat format | [math500_gen_0_shot_cot_chat_prompt.py](math500_gen_0_shot_cot_chat_prompt.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| math_prm800k_500_0shot_cot_gen | Generative task for the MATH500 dataset. The default maximum output token length is 32768, with a logical chain in the prompt. | Accuracy (pass@1) | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.math.math_prm800k_500_0shot_cot_gen import math_datasets as datasets`| [math_prm800k_500_0shot_cot_gen.py](math_prm800k_500_0shot_cot_gen.py) | +| math_prm800k_500_5shot_cot_gen | Generative task for the MATH500 dataset. The default maximum output token length is 32768, with a logical chain in the prompt.
| Accuracy (pass@1) | 5-shot | String format |`from ais_bench.benchmark.configs.datasets.math.math_prm800k_500_5shot_cot_gen import math_datasets as datasets`| [math_prm800k_500_5shot_cot_gen.py](math_prm800k_500_5shot_cot_gen.py) | +| math500_gen_0_shot_cot_chat_prompt | Generative task for the MATH500 dataset, with a logical chain in the prompt (aligned with DeepSeek R1 accuracy test) | Accuracy (pass@1) | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as datasets`| [math500_gen_0_shot_cot_chat_prompt.py](math500_gen_0_shot_cot_chat_prompt.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/mathvision/README.md b/ais_bench/benchmark/configs/datasets/mathvision/README.md index 446c775d..4b76f741 100644 --- a/ais_bench/benchmark/configs/datasets/mathvision/README.md +++ b/ais_bench/benchmark/configs/datasets/mathvision/README.md @@ -32,6 +32,6 @@ mathvision ``` ## 可用数据集任务 -| 任务名称 | 简介 | 评估指标 | few-shot | prompt 格式 | 对应源码配置文件路径 | -| --- | --- | --- | --- | --- | --- | -| mathvision_gen | MathVision 数据集生成式多模态数学推理任务,支持选择题和自由作答题;选择题要求最后一行输出 `ANSWER: [LETTER]`,自由作答题要求最终答案放在 `\boxed{}` 中 | Accuracy | 0-shot | 多模态对话格式(文本 + 图片) | [mathvision_gen.py](mathvision_gen.py) | +| 任务名称 | 简介 | 评估指标 | few-shot | prompt 格式 | 配套文件导入方式 | 对应源码配置文件路径 | +| --- | --- | --- | --- | --- | --- | --- | +| mathvision_gen | MathVision 数据集生成式多模态数学推理任务,支持选择题和自由作答题;选择题要求最后一行输出 `ANSWER: [LETTER]`,自由作答题要求最终答案放在 `\boxed{}` 中 | Accuracy | 0-shot | 多模态对话格式(文本 + 图片) | `from ais_bench.benchmark.configs.datasets.mathvision.mathvision_gen import mathvision_datasets as datasets` | [mathvision_gen.py](mathvision_gen.py) | diff --git a/ais_bench/benchmark/configs/datasets/mathvision/README_en.md b/ais_bench/benchmark/configs/datasets/mathvision/README_en.md index b0dc3be6..608ac975 100644 --- a/ais_bench/benchmark/configs/datasets/mathvision/README_en.md +++
b/ais_bench/benchmark/configs/datasets/mathvision/README_en.md @@ -32,6 +32,6 @@ mathvision ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| mathvision_gen | Generative multimodal mathematical reasoning task for MathVision. It supports both multiple-choice and open-answer questions. Multiple-choice questions require the final line to be `ANSWER: [LETTER]`, while open-answer questions require the final answer in `\boxed{}` | Accuracy | 0-shot | Multimodal chat format (text + image) | [mathvision_gen.py](mathvision_gen.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| mathvision_gen | Generative multimodal mathematical reasoning task for MathVision. It supports both multiple-choice and open-answer questions.
Multiple-choice questions require the final line to be `ANSWER: [LETTER]`, while open-answer questions require the final answer in `\boxed{}` | Accuracy | 0-shot | Multimodal chat format (text + image) |`from ais_bench.benchmark.configs.datasets.mathvision.mathvision_gen import mathvision_datasets as datasets`| [mathvision_gen.py](mathvision_gen.py) | diff --git a/ais_bench/benchmark/configs/datasets/mbpp/README.md b/ais_bench/benchmark/configs/datasets/mbpp/README.md index d66ae589..1641879e 100644 --- a/ais_bench/benchmark/configs/datasets/mbpp/README.md +++ b/ais_bench/benchmark/configs/datasets/mbpp/README.md @@ -25,7 +25,7 @@ rm mbpp.zip ## 可用数据集任务 ### mbpp_passk_gen_3_shot_chat_prompt #### 基本信息 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|mbpp_passk_gen_3_shot_chat_prompt|mbpp数据集生成式任务,支持测pass@k(默认pass@1)|pass@1|3-shot|对话格式|[mbpp_passk_gen_3_shot_chat_prompt.py](mbpp_passk_gen_3_shot_chat_prompt.py)| -|sanitized_mbpp_passk_gen_3_shot_chat_prompt|sanitized mbpp数据集生成式任务,支持测pass@k(默认pass@1)|pass@1|3-shot|对话格式|[sanitized_mbpp_passk_gen_3_shot_chat_prompt.py](sanitized_mbpp_passk_gen_3_shot_chat_prompt.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|mbpp_passk_gen_3_shot_chat_prompt|mbpp数据集生成式任务,支持测pass@k(默认pass@1)|pass@1|3-shot|对话格式|`from ais_bench.benchmark.configs.datasets.mbpp.mbpp_passk_gen_3_shot_chat_prompt import mbpp_datasets as datasets`|[mbpp_passk_gen_3_shot_chat_prompt.py](mbpp_passk_gen_3_shot_chat_prompt.py)| +|sanitized_mbpp_passk_gen_3_shot_chat_prompt|sanitized mbpp数据集生成式任务,支持测pass@k(默认pass@1)|pass@1|3-shot|对话格式|`from ais_bench.benchmark.configs.datasets.mbpp.sanitized_mbpp_passk_gen_3_shot_chat_prompt import sanitized_mbpp_datasets as datasets`|[sanitized_mbpp_passk_gen_3_shot_chat_prompt.py](sanitized_mbpp_passk_gen_3_shot_chat_prompt.py)| diff --git a/ais_bench/benchmark/configs/datasets/mbpp/README_en.md
b/ais_bench/benchmark/configs/datasets/mbpp/README_en.md index e87c3a92..7fed4c5e 100644 --- a/ais_bench/benchmark/configs/datasets/mbpp/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mbpp/README_en.md @@ -25,7 +25,7 @@ rm mbpp.zip ## Available Dataset Tasks ### mbpp_passk_gen_3_shot_chat_prompt #### Basic Information -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| mbpp_passk_gen_3_shot_chat_prompt | Generative task for the mbpp dataset, supporting pass@k evaluation (default: pass@1) | pass@1 | 3-shot | Chat format | [mbpp_passk_gen_3_shot_chat_prompt.py](mbpp_passk_gen_3_shot_chat_prompt.py) | -| sanitized_mbpp_passk_gen_3_shot_chat_prompt | Generative task for the sanitized mbpp dataset, supporting pass@k evaluation (default: pass@1) | pass@1 | 3-shot | Chat format | [sanitized_mbpp_passk_gen_3_shot_chat_prompt.py](sanitized_mbpp_passk_gen_3_shot_chat_prompt.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| mbpp_passk_gen_3_shot_chat_prompt | Generative task for the mbpp dataset, supporting pass@k evaluation (default: pass@1) | pass@1 | 3-shot | Chat format |`from ais_bench.benchmark.configs.datasets.mbpp.mbpp_passk_gen_3_shot_chat_prompt import mbpp_datasets as datasets`| [mbpp_passk_gen_3_shot_chat_prompt.py](mbpp_passk_gen_3_shot_chat_prompt.py) | +| sanitized_mbpp_passk_gen_3_shot_chat_prompt | Generative task for the sanitized mbpp dataset, supporting pass@k evaluation (default: pass@1) | pass@1 | 3-shot | Chat format |`from ais_bench.benchmark.configs.datasets.mbpp.sanitized_mbpp_passk_gen_3_shot_chat_prompt import sanitized_mbpp_datasets as datasets`| 
[sanitized_mbpp_passk_gen_3_shot_chat_prompt.py](sanitized_mbpp_passk_gen_3_shot_chat_prompt.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/mgsm/README.md b/ais_bench/benchmark/configs/datasets/mgsm/README.md index 1b797477..a7257724 100644 --- a/ais_bench/benchmark/configs/datasets/mgsm/README.md +++ b/ais_bench/benchmark/configs/datasets/mgsm/README.md @@ -34,7 +34,7 @@ git clone https://huggingface.co/datasets/juletxara/mgsm ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|mgsm_gen_0_shot_cot_chat_prompt|mgsm数据集生成式任务,prompt带逻辑链|accuracy|0-shot|对话格式|[mgsm_gen_0_shot_cot_chat_prompt.py](mgsm_gen_0_shot_cot_chat_prompt.py)| -|mgsm_gen_8_shot_cot_chat_prompt|mgsm数据集生成式任务,prompt带逻辑链|accuracy|8-shot|对话格式|[mgsm_gen_8_shot_cot_chat_prompt.py](mgsm_gen_8_shot_cot_chat_prompt.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|mgsm_gen_0_shot_cot_chat_prompt|mgsm数据集生成式任务,prompt带逻辑链|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.mgsm.mgsm_gen_0_shot_cot_chat_prompt import mgsm_datasets as datasets`|[mgsm_gen_0_shot_cot_chat_prompt.py](mgsm_gen_0_shot_cot_chat_prompt.py)| +|mgsm_gen_8_shot_cot_chat_prompt|mgsm数据集生成式任务,prompt带逻辑链|accuracy|8-shot|对话格式|`from ais_bench.benchmark.configs.datasets.mgsm.mgsm_gen_8_shot_cot_chat_prompt import mgsm_datasets as datasets`|[mgsm_gen_8_shot_cot_chat_prompt.py](mgsm_gen_8_shot_cot_chat_prompt.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/mgsm/README_en.md b/ais_bench/benchmark/configs/datasets/mgsm/README_en.md index 0b05143e..d31f4e10 100644 --- a/ais_bench/benchmark/configs/datasets/mgsm/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mgsm/README_en.md @@ -34,7 +34,7 @@ git clone https://huggingface.co/datasets/juletxara/mgsm ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation 
Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| mgsm_gen_0_shot_cot_chat_prompt | Generative task for the mgsm dataset, with a logical chain in the prompt | Accuracy | 0-shot | Chat format | [mgsm_gen_0_shot_cot_chat_prompt.py](mgsm_gen_0_shot_cot_chat_prompt.py) | -| mgsm_gen_8_shot_cot_chat_prompt | Generative task for the mgsm dataset, with a logical chain in the prompt | Accuracy | 8-shot | Chat format | [mgsm_gen_8_shot_cot_chat_prompt.py](mgsm_gen_8_shot_cot_chat_prompt.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| mgsm_gen_0_shot_cot_chat_prompt | Generative task for the mgsm dataset, with a logical chain in the prompt | Accuracy | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.mgsm.mgsm_gen_0_shot_cot_chat_prompt import mgsm_datasets as datasets`| [mgsm_gen_0_shot_cot_chat_prompt.py](mgsm_gen_0_shot_cot_chat_prompt.py) | +| mgsm_gen_8_shot_cot_chat_prompt | Generative task for the mgsm dataset, with a logical chain in the prompt | Accuracy | 8-shot | Chat format |`from ais_bench.benchmark.configs.datasets.mgsm.mgsm_gen_8_shot_cot_chat_prompt import mgsm_datasets as datasets`| [mgsm_gen_8_shot_cot_chat_prompt.py](mgsm_gen_8_shot_cot_chat_prompt.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/mmlu/README.md b/ais_bench/benchmark/configs/datasets/mmlu/README.md index 0ff33e51..b3db99e5 100644 --- a/ais_bench/benchmark/configs/datasets/mmlu/README.md +++ b/ais_bench/benchmark/configs/datasets/mmlu/README.md @@ -198,10 +198,10 @@ rm mmlu.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | 
-|mmlu_gen_5_shot_str|MMLU数据集生成式任务|accuracy(naive_average)|5-shot|字符串格式|[mmlu_gen_5_shot_str.py](mmlu_gen_5_shot_str.py)| -|mmlu_gen_5_shot_chat_prompt|MMLU数据集生成式任务|accuracy(naive_average)|5-shot|对话格式|[mmlu_gen_5_shot_chat_prompt.py](mmlu_gen_5_shot_chat_prompt.py)| -|mmlu_ppl_0_shot_str|MMLU数据集PPL任务|accuracy(naive_average)|0-shot|字符串格式|[mmlu_ppl_0_shot_str.py](mmlu_ppl_0_shot_str.py)| -|mmlu_gen_0_shot_cot_chat_prompt|MMLU数据集生成式任务,prompt带逻辑链(对齐DeepSeek R1精度测试)|accuracy(naive_average)|0-shot|对话格式|[mmlu_gen_0_shot_cot_chat_prompt.py](mmlu_gen_0_shot_cot_chat_prompt.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|mmlu_gen_5_shot_str|MMLU数据集生成式任务|accuracy(naive_average)|5-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_5_shot_str import mmlu_datasets as datasets`|[mmlu_gen_5_shot_str.py](mmlu_gen_5_shot_str.py)| +|mmlu_gen_5_shot_chat_prompt|MMLU数据集生成式任务|accuracy(naive_average)|5-shot|对话格式|`from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_5_shot_chat_prompt import mmlu_datasets as datasets`|[mmlu_gen_5_shot_chat_prompt.py](mmlu_gen_5_shot_chat_prompt.py)| +|mmlu_ppl_0_shot_str|MMLU数据集PPL任务|accuracy(naive_average)|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmlu.mmlu_ppl_0_shot_str import mmlu_datasets as datasets`|[mmlu_ppl_0_shot_str.py](mmlu_ppl_0_shot_str.py)| +|mmlu_gen_0_shot_cot_chat_prompt|MMLU数据集生成式任务,prompt带逻辑链(对齐DeepSeek R1精度测试)|accuracy(naive_average)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_0_shot_cot_chat_prompt import mmlu_datasets as datasets`|[mmlu_gen_0_shot_cot_chat_prompt.py](mmlu_gen_0_shot_cot_chat_prompt.py)| diff --git a/ais_bench/benchmark/configs/datasets/mmlu/README_en.md b/ais_bench/benchmark/configs/datasets/mmlu/README_en.md index ad962b03..fa968fec 100644 --- a/ais_bench/benchmark/configs/datasets/mmlu/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mmlu/README_en.md @@ -198,10 +198,10 @@ rm 
mmlu.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| mmlu_gen_5_shot_str | Generative task for the MMLU dataset | Accuracy (naive_average) | 5-shot | String format | [mmlu_gen_5_shot_str.py](mmlu_gen_5_shot_str.py) | -| mmlu_gen_5_shot_chat_prompt | Generative task for the MMLU dataset, with a logical chain in the prompt| Accuracy (naive_average) | 5-shot | Chat format | [mmlu_gen_5_shot_chat_prompt.py](mmlu_gen_5_shot_chat_prompt.py) | -| mmlu_ppl_0_shot_str | MMLU dataset PPL task | Accuracy (naive_average) | 0-shot | String format | [mmlu_ppl_0_shot_str.py](mmlu_ppl_0_shot_str.py) | -| mmlu_gen_0_shot_cot_chat_prompt | Generative task for the MMLU dataset, with a logical chain in the prompt (aligned with DeepSeek R1 accuracy test) | Accuracy (naive_average) | 0-shot | Chat format | [mmlu_gen_0_shot_cot_chat_prompt.py](mmlu_gen_0_shot_cot_chat_prompt.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| mmlu_gen_5_shot_str | Generative task for the MMLU dataset | Accuracy (naive_average) | 5-shot | String format |`from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_5_shot_str import mmlu_datasets as datasets`| [mmlu_gen_5_shot_str.py](mmlu_gen_5_shot_str.py) | +| mmlu_gen_5_shot_chat_prompt | Generative task for the MMLU dataset, with a logical chain in the prompt| Accuracy (naive_average) | 5-shot | Chat format |`from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_5_shot_chat_prompt import mmlu_datasets as datasets`| [mmlu_gen_5_shot_chat_prompt.py](mmlu_gen_5_shot_chat_prompt.py) | +| mmlu_ppl_0_shot_str | MMLU dataset PPL task | Accuracy (naive_average) | 0-shot | String format |`from 
ais_bench.benchmark.configs.datasets.mmlu.mmlu_ppl_0_shot_str import mmlu_datasets as datasets`| [mmlu_ppl_0_shot_str.py](mmlu_ppl_0_shot_str.py) | +| mmlu_gen_0_shot_cot_chat_prompt | Generative task for the MMLU dataset, with a logical chain in the prompt (aligned with DeepSeek R1 accuracy test) | Accuracy (naive_average) | 0-shot | Chat format |`from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_0_shot_cot_chat_prompt import mmlu_datasets as datasets`| [mmlu_gen_0_shot_cot_chat_prompt.py](mmlu_gen_0_shot_cot_chat_prompt.py) | diff --git a/ais_bench/benchmark/configs/datasets/mmlu_pro/README.md b/ais_bench/benchmark/configs/datasets/mmlu_pro/README.md index 1f18821d..e6359082 100644 --- a/ais_bench/benchmark/configs/datasets/mmlu_pro/README.md +++ b/ais_bench/benchmark/configs/datasets/mmlu_pro/README.md @@ -25,7 +25,7 @@ rm mmlu_pro.zip ## 可用数据集任务 ### mmlu_pro_gen_0_shot_str #### 基本信息 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|mmlu_pro_gen_0_shot_str|mmlu-pro数据集生成式任务|pass@1|0-shot|字符串格式|[mmlu_pro_gen_0_shot_str.py](mmlu_pro_gen_0_shot_str.py)| -|mmlu_pro_gen_5_shot_str|mmlu-pro数据集生成式任务|pass@1|0-shot|字符串格式|[mmlu_pro_gen_5_shot_str.py](mmlu_pro_gen_5_shot_str.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|mmlu_pro_gen_0_shot_str|mmlu-pro数据集生成式任务|pass@1|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmlu_pro.mmlu_pro_gen_0_shot_str import mmlu_pro_datasets as datasets`|[mmlu_pro_gen_0_shot_str.py](mmlu_pro_gen_0_shot_str.py)| +|mmlu_pro_gen_5_shot_str|mmlu-pro数据集生成式任务|pass@1|5-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmlu_pro.mmlu_pro_gen_5_shot_str import mmlu_pro_datasets as datasets`|[mmlu_pro_gen_5_shot_str.py](mmlu_pro_gen_5_shot_str.py)| diff --git a/ais_bench/benchmark/configs/datasets/mmlu_pro/README_en.md b/ais_bench/benchmark/configs/datasets/mmlu_pro/README_en.md index 857b90f0..e39b70cd 100644 --- 
a/ais_bench/benchmark/configs/datasets/mmlu_pro/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mmlu_pro/README_en.md @@ -25,10 +25,10 @@ rm mmlu_pro.zip ## Available Dataset Tasks ### mmlu_pro_gen_0_shot_str #### Basic Information -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| mmlu_pro_gen_0_shot_str | Generative task for the mmlu-pro dataset | pass@1 | 0-shot | String format | [mmlu_pro_gen_0_shot_str.py](mmlu_pro_gen_0_shot_str.py) | -| mmlu_pro_gen_5_shot_str | Generative task for the mmlu-pro dataset | pass@1 | 5-shot | String format | [mmlu_pro_gen_5_shot_str.py](mmlu_pro_gen_5_shot_str.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| mmlu_pro_gen_0_shot_str | Generative task for the mmlu-pro dataset | pass@1 | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.mmlu_pro.mmlu_pro_gen_0_shot_str import mmlu_pro_datasets as datasets`| [mmlu_pro_gen_0_shot_str.py](mmlu_pro_gen_0_shot_str.py) | +| mmlu_pro_gen_5_shot_str | Generative task for the mmlu-pro dataset | pass@1 | 5-shot | String format |`from ais_bench.benchmark.configs.datasets.mmlu_pro.mmlu_pro_gen_5_shot_str import mmlu_pro_datasets as datasets`| [mmlu_pro_gen_5_shot_str.py](mmlu_pro_gen_5_shot_str.py) | ### Note on Accuracy Correction diff --git a/ais_bench/benchmark/configs/datasets/mmmu/README.md b/ais_bench/benchmark/configs/datasets/mmmu/README.md index e9fb09dc..064763a9 100644 --- a/ais_bench/benchmark/configs/datasets/mmmu/README.md +++ b/ais_bench/benchmark/configs/datasets/mmmu/README.md @@ -29,6 +29,6 @@ git clone https://www.modelscope.cn/datasets/AI-ModelScope/MMMU.git mmmu ## 可用数据集任务 ### mmmu_gen #### 基本信息 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | 
--- | -|mmmu_gen|MMMU 数据集生成式任务:选择题使用CoT单选模板,开放题使用 `ANSWER: [ANSWER]` 模板|acc|0-shot|多模态对话格式|[mmmu_gen.py](mmmu_gen.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|mmmu_gen|MMMU 数据集生成式任务:选择题使用CoT单选模板,开放题使用 `ANSWER: [ANSWER]` 模板|acc|0-shot|多模态对话格式|`from ais_bench.benchmark.configs.datasets.mmmu.mmmu_gen import mmmu_datasets as datasets`|[mmmu_gen.py](mmmu_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/mmmu/README_en.md b/ais_bench/benchmark/configs/datasets/mmmu/README_en.md index 70f9bcbc..74ad3d67 100644 --- a/ais_bench/benchmark/configs/datasets/mmmu/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mmmu/README_en.md @@ -29,6 +29,6 @@ git clone https://www.modelscope.cn/datasets/AI-ModelScope/MMMU.git mmmu ## Available Dataset Tasks ### mmmu_gen #### Basic Information -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -|mmmu_gen|Generative MMMU task: multiple-choice questions use the CoT single-answer template, while open questions use the `ANSWER: [ANSWER]` template|acc|0-shot|Multimodal chat format|[mmmu_gen.py](mmmu_gen.py)| +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +|mmmu_gen|Generative MMMU task: multiple-choice questions use the CoT single-answer template, while open questions use the `ANSWER: [ANSWER]` template|acc|0-shot|Multimodal chat format|`from ais_bench.benchmark.configs.datasets.mmmu.mmmu_gen import mmmu_datasets as datasets`|[mmmu_gen.py](mmmu_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/mmmu_pro/README.md b/ais_bench/benchmark/configs/datasets/mmmu_pro/README.md index 3b10af12..da9da631 100644 --- a/ais_bench/benchmark/configs/datasets/mmmu_pro/README.md +++ 
b/ais_bench/benchmark/configs/datasets/mmmu_pro/README.md @@ -28,9 +28,9 @@ wget https://opencompass.openxlab.space/utils/VLMEval/MMMU_Pro_V.tsv ## 可用数据集任务 #### 基本信息 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|mmmu_pro_options10_cot_gen|mmmu_pro options10数据集思维链生成式任务|acc|0-shot|字符串格式|[mmmu_pro_options10_cot_gen.py](mmmu_pro_options10_cot_gen.py)| -|mmmu_pro_options10_gen|mmmu_pro options10数据集生成式任务|acc|0-shot|字符串格式|[mmmu_pro_options10_gen.py](mmmu_pro_options10_gen.py)| -|mmmu_pro_vision_cot_gen|mmmu_pro vision数据集思维链生成式任务|acc|0-shot|字符串格式|[mmmu_pro_vision_cot_gen.py](mmmu_pro_vision_cot_gen.py)| -|mmmu_pro_vision_gen|mmmu_pro vision数据集生成式任务|acc|0-shot|字符串格式|[mmmu_pro_vision_gen.py](mmmu_pro_vision_gen.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|mmmu_pro_options10_cot_gen|mmmu_pro options10数据集思维链生成式任务|acc|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmmu_pro.mmmu_pro_options10_cot_gen import mmmu_pro_datasets as datasets`|[mmmu_pro_options10_cot_gen.py](mmmu_pro_options10_cot_gen.py)| +|mmmu_pro_options10_gen|mmmu_pro options10数据集生成式任务|acc|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmmu_pro.mmmu_pro_options10_gen import mmmu_pro_datasets as datasets`|[mmmu_pro_options10_gen.py](mmmu_pro_options10_gen.py)| +|mmmu_pro_vision_cot_gen|mmmu_pro vision数据集思维链生成式任务|acc|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmmu_pro.mmmu_pro_vision_cot_gen import mmmu_pro_datasets as datasets`|[mmmu_pro_vision_cot_gen.py](mmmu_pro_vision_cot_gen.py)| +|mmmu_pro_vision_gen|mmmu_pro vision数据集生成式任务|acc|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmmu_pro.mmmu_pro_vision_gen import mmmu_pro_datasets as datasets`|[mmmu_pro_vision_gen.py](mmmu_pro_vision_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/mmmu_pro/README_en.md b/ais_bench/benchmark/configs/datasets/mmmu_pro/README_en.md index 3262a911..a7bf27a3 100644 --- 
a/ais_bench/benchmark/configs/datasets/mmmu_pro/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mmmu_pro/README_en.md @@ -28,9 +28,9 @@ wget https://opencompass.openxlab.space/utils/VLMEval/MMMU_Pro_V.tsv ## Available Dataset Tasks #### Basic Information -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -|mmmu_pro_options10_cot_gen|mmmu_pro options10 dataset thinking chain generative task|acc|0-shot|String format|[mmmu_pro_options10_cot_gen.py](mmmu_pro_options10_cot_gen.py)| -|mmmu_pro_options10_gen|mmmu_pro options10 dataset generative task|acc|0-shot|String format|[mmmu_pro_options10_gen.py](mmmu_pro_options10_gen.py)| -|mmmu_pro_vision_cot_gen|mmmu_pro vision dataset thinking chain generative task|acc|0-shot|String format|[mmmu_pro_vision_cot_gen.py](mmmu_pro_vision_cot_gen.py)| -|mmmu_pro_vision_gen|mmmu_pro vision dataset generative task|acc|0-shot|String format|[mmmu_pro_vision_gen.py](mmmu_pro_vision_gen.py)| +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +|mmmu_pro_options10_cot_gen|mmmu_pro options10 dataset thinking chain generative task|acc|0-shot|String format|`from ais_bench.benchmark.configs.datasets.mmmu_pro.mmmu_pro_options10_cot_gen import mmmu_pro_datasets as datasets`|[mmmu_pro_options10_cot_gen.py](mmmu_pro_options10_cot_gen.py)| +|mmmu_pro_options10_gen|mmmu_pro options10 dataset generative task|acc|0-shot|String format|`from ais_bench.benchmark.configs.datasets.mmmu_pro.mmmu_pro_options10_gen import mmmu_pro_datasets as datasets`|[mmmu_pro_options10_gen.py](mmmu_pro_options10_gen.py)| +|mmmu_pro_vision_cot_gen|mmmu_pro vision dataset thinking chain generative task|acc|0-shot|String format|`from ais_bench.benchmark.configs.datasets.mmmu_pro.mmmu_pro_vision_cot_gen 
import mmmu_pro_datasets as datasets`|[mmmu_pro_vision_cot_gen.py](mmmu_pro_vision_cot_gen.py)| +|mmmu_pro_vision_gen|mmmu_pro vision dataset generative task|acc|0-shot|String format|`from ais_bench.benchmark.configs.datasets.mmmu_pro.mmmu_pro_vision_gen import mmmu_pro_datasets as datasets`|[mmmu_pro_vision_gen.py](mmmu_pro_vision_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/mmstar/README.md b/ais_bench/benchmark/configs/datasets/mmstar/README.md index 5a00f29e..6cb84407 100644 --- a/ais_bench/benchmark/configs/datasets/mmstar/README.md +++ b/ais_bench/benchmark/configs/datasets/mmstar/README.md @@ -24,7 +24,7 @@ wget https://www.modelscope.cn/datasets/evalscope/MMStar/resolve/master/MMStar.t ## 可用数据集任务 ### mmstar_gen #### 基本信息 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|mmstar_gen|mmstar数据集生成式任务|acc|0-shot|字符串格式|[mmstar_gen.py](mmstar_gen.py)| -|mmstar_gen_cot|mmstar数据集思维链生成式任务|acc|0-shot|字符串格式|[mmstar_gen_cot.py](mmstar_gen_cot.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|mmstar_gen|mmstar数据集生成式任务|acc|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmstar.mmstar_gen import mmstar_datasets as datasets`|[mmstar_gen.py](mmstar_gen.py)| +|mmstar_gen_cot|mmstar数据集思维链生成式任务|acc|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.mmstar.mmstar_gen_cot import mmstar_datasets as datasets`|[mmstar_gen_cot.py](mmstar_gen_cot.py)| diff --git a/ais_bench/benchmark/configs/datasets/mmstar/README_en.md b/ais_bench/benchmark/configs/datasets/mmstar/README_en.md index 482c0019..757f91cb 100644 --- a/ais_bench/benchmark/configs/datasets/mmstar/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mmstar/README_en.md @@ -24,7 +24,7 @@ wget https://www.modelscope.cn/datasets/evalscope/MMStar/resolve/master/MMStar.t ## Available Dataset Tasks ### mmstar_gen #### Basic Information -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt 
Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -|mmstar_gen|Generative task for the mmstar dataset|acc|0-shot|String format|[mmstar_gen.py](mmstar_gen.py)| -|mmstar_gen_cot|COT Generative task for the mmstar dataset|acc|0-shot|String format|[mmstar_gen_cot.py](mmstar_gen_cot.py)| \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +|mmstar_gen|Generative task for the mmstar dataset|acc|0-shot|String format|`from ais_bench.benchmark.configs.datasets.mmstar.mmstar_gen import mmstar_datasets as datasets`|[mmstar_gen.py](mmstar_gen.py)| +|mmstar_gen_cot|COT Generative task for the mmstar dataset|acc|0-shot|String format|`from ais_bench.benchmark.configs.datasets.mmstar.mmstar_gen_cot import mmstar_datasets as datasets`|[mmstar_gen_cot.py](mmstar_gen_cot.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/mooncake_trace/README.md b/ais_bench/benchmark/configs/datasets/mooncake_trace/README.md index 91e58316..7b92589a 100644 --- a/ais_bench/benchmark/configs/datasets/mooncake_trace/README.md +++ b/ais_bench/benchmark/configs/datasets/mooncake_trace/README.md @@ -128,9 +128,9 @@ Mooncake Trace 数据集是一个用于性能评测的 trace 数据集,支持 ## 可用数据集任务 -| 任务名称 | 简介 | 评估指标 | few-shot | prompt格式 | 对应源码配置文件路径 | -| --- | --- | --- | --- | --- | --- | -| mooncake-trace | Mooncake trace 数据集生成式任务 | 性能测评 | 0-shot | 字符串格式 | [mooncake_trace_gen.py](mooncake_trace_gen.py) | +| 任务名称 | 简介 | 评估指标 | few-shot | prompt格式 | 配套文件导入方式 | 对应源码配置文件路径 | +| --- | --- | --- | --- | --- | --- | --- | +| mooncake-trace | Mooncake trace 数据集生成式任务 | 性能测评 | 0-shot | 字符串格式 | `from ais_bench.benchmark.configs.datasets.mooncake_trace.mooncake_trace_gen import mooncake_trace_datasets as datasets` | [mooncake_trace_gen.py](mooncake_trace_gen.py) | ## 使用示例 diff --git 
a/ais_bench/benchmark/configs/datasets/mooncake_trace/README_en.md b/ais_bench/benchmark/configs/datasets/mooncake_trace/README_en.md index 1c2d1751..3dc405dd 100644 --- a/ais_bench/benchmark/configs/datasets/mooncake_trace/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mooncake_trace/README_en.md @@ -126,9 +126,9 @@ When using `hash_ids`, `input_length` must satisfy: ## Available Dataset Tasks -| Task Name | Description | Metrics | few-shot | Prompt Format | Config Path | -| --- | --- | --- | --- | --- | --- | -| mooncake-trace | Mooncake trace generative task | Performance | 0-shot | String | [mooncake_trace_gen.py](mooncake_trace_gen.py) | +| Task Name | Description | Metrics | few-shot | Prompt Format | Import Statement | Config Path | +| --- | --- | --- | --- | --- | --- | --- | +| mooncake-trace | Mooncake trace generative task | Performance | 0-shot | String | `from ais_bench.benchmark.configs.datasets.mooncake_trace.mooncake_trace_gen import mooncake_trace_datasets as datasets` | [mooncake_trace_gen.py](mooncake_trace_gen.py) | ## Usage Examples diff --git a/ais_bench/benchmark/configs/datasets/mtbench/README.md b/ais_bench/benchmark/configs/datasets/mtbench/README.md index 0c9cc7e2..6252cd33 100644 --- a/ais_bench/benchmark/configs/datasets/mtbench/README.md +++ b/ais_bench/benchmark/configs/datasets/mtbench/README.md @@ -29,9 +29,9 @@ wget https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts/blob/main/ra ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|mtbench_gen|mtbench生成式任务|暂不支持精度评测|0-shot|列表格式|[mtbench_gen.py](mtbench_gen.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|mtbench_gen|mtbench生成式任务|暂不支持精度评测|0-shot|列表格式|`from ais_bench.benchmark.configs.datasets.mtbench.mtbench_gen import mtbench_datasets as datasets`|[mtbench_gen.py](mtbench_gen.py)| *注意:该多轮对话数据集的测评支持vLLM、SGLang、MindIE 
Service等服务化,使用时需指定--models为vllm_api_stream_chat_multiturn* \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/mtbench/README_en.md b/ais_bench/benchmark/configs/datasets/mtbench/README_en.md index f5f5b0ad..6b7056f2 100644 --- a/ais_bench/benchmark/configs/datasets/mtbench/README_en.md +++ b/ais_bench/benchmark/configs/datasets/mtbench/README_en.md @@ -29,9 +29,9 @@ wget https://huggingface.co/datasets/HuggingFaceH4/mt_bench_prompts/blob/main/ra ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| mtbench_gen | Generative task for MTBench | Accuracy evaluation not supported temporarily | 0-shot | List format | [mtbench_gen.py](mtbench_gen.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| mtbench_gen | Generative task for MTBench | Accuracy evaluation not supported temporarily | 0-shot | List format |`from ais_bench.benchmark.configs.datasets.mtbench.mtbench_gen import mtbench_datasets as datasets`| [mtbench_gen.py](mtbench_gen.py) | *Note: The evaluation of this multi-turn conversation dataset supports service deployment frameworks such as vLLM, SGLang, and MindIE Service. 
When using it, you need to specify `--models` as `vllm_api_stream_chat_multiturn`.* diff --git a/ais_bench/benchmark/configs/datasets/needlebench_v2/README.md b/ais_bench/benchmark/configs/datasets/needlebench_v2/README.md index 7d738cd6..de004022 100644 --- a/ais_bench/benchmark/configs/datasets/needlebench_v2/README.md +++ b/ais_bench/benchmark/configs/datasets/needlebench_v2/README.md @@ -47,35 +47,35 @@ NeedleBench V2引入了更平衡的评分系统。总体评分现在是通过三 └── zh_tech.jsonl ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|atc_0shot_nocot_2_power_en|atc_0shot_nocot_2_power_en|准确率(accuracy)|0-shot|对话格式|[atc_0shot_nocot_2_power_en.py](atc/atc_0shot_nocot_2_power_en.py)| -|needlebench_v2_4k|needlebench_v2_4k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_4k.py](needlebench_v2_4k/needlebench_v2_4k.py)| -|needlebench_v2_multi_reasoning_4k|needlebench_v2_multi_reasoning_4k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_reasoning_4k.py](needlebench_v2_4k/needlebench_v2_multi_reasoning_4k.py)| -|needlebench_v2_multi_retrieval_4k|needlebench_v2_multi_retrieval_4k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_retrieval_4k.py](needlebench_v2_4k/needlebench_v2_multi_retrieval_4k.py)| -|needlebench_v2_single_4k|needlebench_v2_single_4k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_single_4k.py](needlebench_v2_4k/needlebench_v2_single_4k.py)| -|needlebench_v2_8k|needlebench_v2_8k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_8k.py](needlebench_v2_8k/needlebench_v2_8k.py)| -|needlebench_v2_multi_reasoning_8k|needlebench_v2_multi_reasoning_8k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_reasoning_8k.py](needlebench_v2_8k/needlebench_v2_multi_reasoning_8k.py)| -|needlebench_v2_multi_retrieval_8k|needlebench_v2_multi_retrieval_8k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_retrieval_8k.py](needlebench_v2_8k/needlebench_v2_multi_retrieval_8k.py)| 
-|needlebench_v2_single_8k|needlebench_v2_single_8k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_single_8k.py](needlebench_v2_8k/needlebench_v2_single_8k.py)| -|needlebench_v2_multi_retrieval_compare_batch_8k|needlebench_v2_multi_retrieval_compare_batch_8k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_retrieval_compare_batch_8k.py](needlebench_v2_8k/needlebench_v2_multi_retrieval_compare_batch_8k.py)| -|needlebench_v2_32k|needlebench_v2_32k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_32k.py](needlebench_v2_32k/needlebench_v2_32k.py)| -|needlebench_v2_multi_reasoning_32k|needlebench_v2_multi_reasoning_32k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_reasoning_32k.py](needlebench_v2_32k/needlebench_v2_multi_reasoning_32k.py)| -|needlebench_v2_multi_retrieval_32k|needlebench_v2_multi_retrieval_32k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_retrieval_32k.py](needlebench_v2_32k/needlebench_v2_multi_retrieval_32k.py)| -|needlebench_v2_single_32k|needlebench_v2_single_32k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_single_32k.py](needlebench_v2_32k/needlebench_v2_single_32k.py)| -|needlebench_v2_128k|needlebench_v2_128k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_128k.py](needlebench_v2_128k/needlebench_v2_128k.py)| -|needlebench_v2_multi_reasoning_128k|needlebench_v2_multi_reasoning_128k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_reasoning_128k.py](needlebench_v2_128k/needlebench_v2_multi_reasoning_128k.py)| -|needlebench_v2_multi_retrieval_128k|needlebench_v2_multi_retrieval_128k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_retrieval_128k.py](needlebench_v2_128k/needlebench_v2_multi_retrieval_128k.py)| -|needlebench_v2_single_128k|needlebench_v2_single_128k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_single_128k.py](needlebench_v2_128k/needlebench_v2_single_128k.py)| -|needlebench_v2_200k|needlebench_v2_200k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_200k.py](needlebench_v2_200k/needlebench_v2_200k.py)| 
-|needlebench_v2_multi_reasoning_200k|needlebench_v2_multi_reasoning_200k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_reasoning_200k.py](needlebench_v2_200k/needlebench_v2_multi_reasoning_200k.py)| -|needlebench_v2_multi_retrieval_200k|needlebench_v2_multi_retrieval_200k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_retrieval_200k.py](needlebench_v2_200k/needlebench_v2_multi_retrieval_200k.py)| -|needlebench_v2_single_200k|needlebench_v2_single_200k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_single_200k.py](needlebench_v2_200k/needlebench_v2_single_200k.py)| -|needlebench_v2_256k|needlebench_v2_256k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_256k.py](needlebench_v2_256k/needlebench_v2_256k.py)| -|needlebench_v2_multi_reasoning_256k|needlebench_v2_multi_reasoning_256k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_reasoning_256k.py](needlebench_v2_256k/needlebench_v2_multi_reasoning_256k.py)| -|needlebench_v2_multi_retrieval_256k|needlebench_v2_multi_retrieval_256k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_retrieval_256k.py](needlebench_v2_256k/needlebench_v2_multi_retrieval_256k.py)| -|needlebench_v2_single_256k|needlebench_v2_single_256k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_single_256k.py](needlebench_v2_256k/needlebench_v2_single_256k.py)| -|needlebench_v2_1000k|needlebench_v2_1000k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_1000k.py](needlebench_v2_1000k/needlebench_v2_1000k.py)| -|needlebench_v2_multi_reasoning_1000k|needlebench_v2_multi_reasoning_1000k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_reasoning_1000k.py](needlebench_v2_1000k/needlebench_v2_multi_reasoning_1000k.py)| -|needlebench_v2_multi_retrieval_1000k|needlebench_v2_multi_retrieval_1000k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_multi_retrieval_1000k.py](needlebench_v2_1000k/needlebench_v2_multi_retrieval_1000k.py)| 
-|needlebench_v2_single_1000k|needlebench_v2_single_1000k|准确率(accuracy)|0-shot|对话格式|[needlebench_v2_single_1000k.py](needlebench_v2_1000k/needlebench_v2_single_1000k.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|atc_0shot_nocot_2_power_en|atc_0shot_nocot_2_power_en|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_datasets as datasets`|[atc_0shot_nocot_2_power_en.py](atc/atc_0shot_nocot_2_power_en.py)| +|needlebench_v2_4k|needlebench_v2_4k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_4k import needlebench_datasets as datasets`|[needlebench_v2_4k.py](needlebench_v2_4k/needlebench_v2_4k.py)| +|needlebench_v2_multi_reasoning_4k|needlebench_v2_multi_reasoning_4k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_2needle_en_datasets as datasets`|[needlebench_v2_multi_reasoning_4k.py](needlebench_v2_4k/needlebench_v2_multi_reasoning_4k.py)| +|needlebench_v2_multi_retrieval_4k|needlebench_v2_multi_retrieval_4k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_retrieval_4k import needlebench_en_datasets as datasets`|[needlebench_v2_multi_retrieval_4k.py](needlebench_v2_4k/needlebench_v2_multi_retrieval_4k.py)| +|needlebench_v2_single_4k|needlebench_v2_single_4k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_single_4k import needlebench_en_datasets as datasets`|[needlebench_v2_single_4k.py](needlebench_v2_4k/needlebench_v2_single_4k.py)| +|needlebench_v2_8k|needlebench_v2_8k|准确率(accuracy)|0-shot|对话格式|`from 
ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_8k import needlebench_datasets as datasets`|[needlebench_v2_8k.py](needlebench_v2_8k/needlebench_v2_8k.py)| +|needlebench_v2_multi_reasoning_8k|needlebench_v2_multi_reasoning_8k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_2needle_en_datasets as datasets`|[needlebench_v2_multi_reasoning_8k.py](needlebench_v2_8k/needlebench_v2_multi_reasoning_8k.py)| +|needlebench_v2_multi_retrieval_8k|needlebench_v2_multi_retrieval_8k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_retrieval_8k import needlebench_en_datasets as datasets`|[needlebench_v2_multi_retrieval_8k.py](needlebench_v2_8k/needlebench_v2_multi_retrieval_8k.py)| +|needlebench_v2_single_8k|needlebench_v2_single_8k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_single_8k import needlebench_en_datasets as datasets`|[needlebench_v2_single_8k.py](needlebench_v2_8k/needlebench_v2_single_8k.py)| +|needlebench_v2_multi_retrieval_compare_batch_8k|needlebench_v2_multi_retrieval_compare_batch_8k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_retrieval_compare_batch_8k import needlebench_en_datasets as datasets`|[needlebench_v2_multi_retrieval_compare_batch_8k.py](needlebench_v2_8k/needlebench_v2_multi_retrieval_compare_batch_8k.py)| +|needlebench_v2_32k|needlebench_v2_32k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_32k import needlebench_datasets as datasets`|[needlebench_v2_32k.py](needlebench_v2_32k/needlebench_v2_32k.py)| +|needlebench_v2_multi_reasoning_32k|needlebench_v2_multi_reasoning_32k|准确率(accuracy)|0-shot|对话格式|`from 
ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_2needle_en_datasets as datasets`|[needlebench_v2_multi_reasoning_32k.py](needlebench_v2_32k/needlebench_v2_multi_reasoning_32k.py)| +|needlebench_v2_multi_retrieval_32k|needlebench_v2_multi_retrieval_32k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_retrieval_32k import needlebench_en_datasets as datasets`|[needlebench_v2_multi_retrieval_32k.py](needlebench_v2_32k/needlebench_v2_multi_retrieval_32k.py)| +|needlebench_v2_single_32k|needlebench_v2_single_32k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_en_datasets as datasets`|[needlebench_v2_single_32k.py](needlebench_v2_32k/needlebench_v2_single_32k.py)| +|needlebench_v2_128k|needlebench_v2_128k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_128k import needlebench_datasets as datasets`|[needlebench_v2_128k.py](needlebench_v2_128k/needlebench_v2_128k.py)| +|needlebench_v2_multi_reasoning_128k|needlebench_v2_multi_reasoning_128k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_2needle_en_datasets as datasets`|[needlebench_v2_multi_reasoning_128k.py](needlebench_v2_128k/needlebench_v2_multi_reasoning_128k.py)| +|needlebench_v2_multi_retrieval_128k|needlebench_v2_multi_retrieval_128k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_retrieval_128k import needlebench_en_datasets as datasets`|[needlebench_v2_multi_retrieval_128k.py](needlebench_v2_128k/needlebench_v2_multi_retrieval_128k.py)| 
+|needlebench_v2_single_128k|needlebench_v2_single_128k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_single_128k import needlebench_en_datasets as datasets`|[needlebench_v2_single_128k.py](needlebench_v2_128k/needlebench_v2_single_128k.py)| +|needlebench_v2_200k|needlebench_v2_200k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_200k import needlebench_datasets as datasets`|[needlebench_v2_200k.py](needlebench_v2_200k/needlebench_v2_200k.py)| +|needlebench_v2_multi_reasoning_200k|needlebench_v2_multi_reasoning_200k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_2needle_en_datasets as datasets`|[needlebench_v2_multi_reasoning_200k.py](needlebench_v2_200k/needlebench_v2_multi_reasoning_200k.py)| +|needlebench_v2_multi_retrieval_200k|needlebench_v2_multi_retrieval_200k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_retrieval_200k import needlebench_en_datasets as datasets`|[needlebench_v2_multi_retrieval_200k.py](needlebench_v2_200k/needlebench_v2_multi_retrieval_200k.py)| +|needlebench_v2_single_200k|needlebench_v2_single_200k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_single_200k import needlebench_en_datasets as datasets`|[needlebench_v2_single_200k.py](needlebench_v2_200k/needlebench_v2_single_200k.py)| +|needlebench_v2_256k|needlebench_v2_256k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_256k import needlebench_datasets as datasets`|[needlebench_v2_256k.py](needlebench_v2_256k/needlebench_v2_256k.py)| 
+|needlebench_v2_multi_reasoning_256k|needlebench_v2_multi_reasoning_256k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_2needle_en_datasets as datasets`|[needlebench_v2_multi_reasoning_256k.py](needlebench_v2_256k/needlebench_v2_multi_reasoning_256k.py)| +|needlebench_v2_multi_retrieval_256k|needlebench_v2_multi_retrieval_256k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_retrieval_256k import needlebench_en_datasets as datasets`|[needlebench_v2_multi_retrieval_256k.py](needlebench_v2_256k/needlebench_v2_multi_retrieval_256k.py)| +|needlebench_v2_single_256k|needlebench_v2_single_256k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_single_256k import needlebench_en_datasets as datasets`|[needlebench_v2_single_256k.py](needlebench_v2_256k/needlebench_v2_single_256k.py)| +|needlebench_v2_1000k|needlebench_v2_1000k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_1000k import needlebench_datasets as datasets`|[needlebench_v2_1000k.py](needlebench_v2_1000k/needlebench_v2_1000k.py)| +|needlebench_v2_multi_reasoning_1000k|needlebench_v2_multi_reasoning_1000k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_2needle_en_datasets as datasets`|[needlebench_v2_multi_reasoning_1000k.py](needlebench_v2_1000k/needlebench_v2_multi_reasoning_1000k.py)| +|needlebench_v2_multi_retrieval_1000k|needlebench_v2_multi_retrieval_1000k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_retrieval_1000k import needlebench_en_datasets as 
datasets`|[needlebench_v2_multi_retrieval_1000k.py](needlebench_v2_1000k/needlebench_v2_multi_retrieval_1000k.py)| +|needlebench_v2_single_1000k|needlebench_v2_single_1000k|准确率(accuracy)|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_single_1000k import needlebench_en_datasets as datasets`|[needlebench_v2_single_1000k.py](needlebench_v2_1000k/needlebench_v2_single_1000k.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/needlebench_v2/README_en.md b/ais_bench/benchmark/configs/datasets/needlebench_v2/README_en.md index 6344d8d6..04128543 100644 --- a/ais_bench/benchmark/configs/datasets/needlebench_v2/README_en.md +++ b/ais_bench/benchmark/configs/datasets/needlebench_v2/README_en.md @@ -51,35 +51,35 @@ It is recommended to download the dataset from Hugging Face: [https://huggingfac ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -| atc_0shot_nocot_2_power_en | atc_0shot_nocot_2_power_en | Accuracy | 0-shot | Chat Format | [atc/atc_0shot_nocot_2_power_en.py]() | -| needlebench_v2_4k | needlebench_v2_4k | Accuracy | 0-shot | Chat Format | [needlebench_v2_4k.py](needlebench_v2_4k/needlebench_v2_4k.py) | -| needlebench_v2_multi_reasoning_4k | needlebench_v2_multi_reasoning_4k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_reasoning_4k.py](needlebench_v2_4k/needlebench_v2_multi_reasoning_4k.py) | -| needlebench_v2_multi_retrieval_4k | needlebench_v2_multi_retrieval_4k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_retrieval_4k.py](needlebench_v2_4k/needlebench_v2_multi_retrieval_4k.py) | -| needlebench_v2_single_4k | needlebench_v2_single_4k | Accuracy | 0-shot | Chat Format | [needlebench_v2_single_4k.py](needlebench_v2_4k/needlebench_v2_single_4k.py) | -| needlebench_v2_8k | needlebench_v2_8k | Accuracy | 0-shot | Chat 
Format | [needlebench_v2_8k.py](needlebench_v2_8k/needlebench_v2_8k.py) | -| needlebench_v2_multi_reasoning_8k | needlebench_v2_multi_reasoning_8k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_reasoning_8k.py](needlebench_v2_8k/needlebench_v2_multi_reasoning_8k.py) | -| needlebench_v2_multi_retrieval_8k | needlebench_v2_multi_retrieval_8k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_retrieval_8k.py](needlebench_v2_8k/needlebench_v2_multi_retrieval_8k.py) | -| needlebench_v2_single_8k | needlebench_v2_single_8k | Accuracy | 0-shot | Chat Format | [needlebench_v2_single_8k.py](needlebench_v2_8k/needlebench_v2_single_8k.py) | -| needlebench_v2_multi_retrieval_compare_batch_8k | needlebench_v2_multi_retrieval_compare_batch_8k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_retrieval_compare_batch_8k.py](needlebench_v2_8k/needlebench_v2_multi_retrieval_compare_batch_8k.py) | -| needlebench_v2_32k | needlebench_v2_32k | Accuracy | 0-shot | Chat Format | [needlebench_v2_32k.py](needlebench_v2_32k/needlebench_v2_32k.py) | -| needlebench_v2_multi_reasoning_32k | needlebench_v2_multi_reasoning_32k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_reasoning_32k.py](needlebench_v2_32k/needlebench_v2_multi_reasoning_32k.py) | -| needlebench_v2_multi_retrieval_32k | needlebench_v2_multi_retrieval_32k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_retrieval_32k.py](needlebench_v2_32k/needlebench_v2_multi_retrieval_32k.py) | -| needlebench_v2_single_32k | needlebench_v2_single_32k | Accuracy | 0-shot | Chat Format | [needlebench_v2_single_32k.py](needlebench_v2_32k/needlebench_v2_single_32k.py) | -| needlebench_v2_128k | needlebench_v2_128k | Accuracy | 0-shot | Chat Format | [needlebench_v2_128k.py](needlebench_v2_128k/needlebench_v2_128k.py) | -| needlebench_v2_multi_reasoning_128k | needlebench_v2_multi_reasoning_128k | Accuracy | 0-shot | Chat Format | 
[needlebench_v2_multi_reasoning_128k.py](needlebench_v2_128k/needlebench_v2_multi_reasoning_128k.py) | -| needlebench_v2_multi_retrieval_128k | needlebench_v2_multi_retrieval_128k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_retrieval_128k.py](needlebench_v2_128k/needlebench_v2_multi_retrieval_128k.py) | -| needlebench_v2_single_128k | needlebench_v2_single_128k | Accuracy | 0-shot | Chat Format | [needlebench_v2_single_128k.py](needlebench_v2_128k/needlebench_v2_single_128k.py) | -| needlebench_v2_200k | needlebench_v2_200k | Accuracy | 0-shot | Chat Format | [needlebench_v2_200k.py](needlebench_v2_200k/needlebench_v2_200k.py) | -| needlebench_v2_multi_reasoning_200k | needlebench_v2_multi_reasoning_200k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_reasoning_200k.py](needlebench_v2_200k/needlebench_v2_multi_reasoning_200k.py) | -| needlebench_v2_multi_retrieval_200k | needlebench_v2_multi_retrieval_200k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_retrieval_200k.py](needlebench_v2_200k/needlebench_v2_multi_retrieval_200k.py) | -| needlebench_v2_single_200k | needlebench_v2_single_200k | Accuracy | 0-shot | Chat Format | [needlebench_v2_single_200k.py](needlebench_v2_200k/needlebench_v2_single_200k.py) | -| needlebench_v2_256k | needlebench_v2_256k | Accuracy | 0-shot | Chat Format | [needlebench_v2_256k.py](needlebench_v2_256k/needlebench_v2_256k.py) | -| needlebench_v2_multi_reasoning_256k | needlebench_v2_multi_reasoning_256k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_reasoning_256k.py](needlebench_v2_256k/needlebench_v2_multi_reasoning_256k.py) | -| needlebench_v2_multi_retrieval_256k | needlebench_v2_multi_retrieval_256k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_retrieval_256k.py](needlebench_v2_256k/needlebench_v2_multi_retrieval_256k.py) | -| needlebench_v2_single_256k | needlebench_v2_single_256k | Accuracy | 0-shot | Chat Format | 
[needlebench_v2_single_256k.py](needlebench_v2_256k/needlebench_v2_single_256k.py) | -| needlebench_v2_1000k | needlebench_v2_1000k | Accuracy | 0-shot | Chat Format | [needlebench_v2_1000k.py](needlebench_v2_1000k/needlebench_v2_1000k.py) | -| needlebench_v2_multi_reasoning_1000k | needlebench_v2_multi_reasoning_1000k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_reasoning_1000k.py](needlebench_v2_1000k/needlebench_v2_multi_reasoning_1000k.py) | -| needlebench_v2_multi_retrieval_1000k | needlebench_v2_multi_retrieval_1000k | Accuracy | 0-shot | Chat Format | [needlebench_v2_multi_retrieval_1000k.py](needlebench_v2_1000k/needlebench_v2_multi_retrieval_1000k.py) | -| needlebench_v2_single_1000k | needlebench_v2_single_1000k | Accuracy | 0-shot | Chat Format | [needlebench_v2_single_1000k.py](needlebench_v2_1000k/needlebench_v2_single_1000k.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| atc_0shot_nocot_2_power_en | atc_0shot_nocot_2_power_en | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_datasets as datasets`| [atc/atc_0shot_nocot_2_power_en.py](atc/atc_0shot_nocot_2_power_en.py) | +| needlebench_v2_4k | needlebench_v2_4k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_4k import needlebench_datasets as datasets`| [needlebench_v2_4k.py](needlebench_v2_4k/needlebench_v2_4k.py) | +| needlebench_v2_multi_reasoning_4k | needlebench_v2_multi_reasoning_4k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_2needle_en_datasets as datasets`| [needlebench_v2_multi_reasoning_4k.py](needlebench_v2_4k/needlebench_v2_multi_reasoning_4k.py) | +| 
needlebench_v2_multi_retrieval_4k | needlebench_v2_multi_retrieval_4k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_retrieval_4k import needlebench_en_datasets as datasets`| [needlebench_v2_multi_retrieval_4k.py](needlebench_v2_4k/needlebench_v2_multi_retrieval_4k.py) | +| needlebench_v2_single_4k | needlebench_v2_single_4k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_single_4k import needlebench_en_datasets as datasets`| [needlebench_v2_single_4k.py](needlebench_v2_4k/needlebench_v2_single_4k.py) | +| needlebench_v2_8k | needlebench_v2_8k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_8k import needlebench_datasets as datasets`| [needlebench_v2_8k.py](needlebench_v2_8k/needlebench_v2_8k.py) | +| needlebench_v2_multi_reasoning_8k | needlebench_v2_multi_reasoning_8k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_2needle_en_datasets as datasets`| [needlebench_v2_multi_reasoning_8k.py](needlebench_v2_8k/needlebench_v2_multi_reasoning_8k.py) | +| needlebench_v2_multi_retrieval_8k | needlebench_v2_multi_retrieval_8k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_retrieval_8k import needlebench_en_datasets as datasets`| [needlebench_v2_multi_retrieval_8k.py](needlebench_v2_8k/needlebench_v2_multi_retrieval_8k.py) | +| needlebench_v2_single_8k | needlebench_v2_single_8k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_single_8k import needlebench_en_datasets as datasets`| [needlebench_v2_single_8k.py](needlebench_v2_8k/needlebench_v2_single_8k.py) | 
+| needlebench_v2_multi_retrieval_compare_batch_8k | needlebench_v2_multi_retrieval_compare_batch_8k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_retrieval_compare_batch_8k import needlebench_en_datasets as datasets`| [needlebench_v2_multi_retrieval_compare_batch_8k.py](needlebench_v2_8k/needlebench_v2_multi_retrieval_compare_batch_8k.py) | +| needlebench_v2_32k | needlebench_v2_32k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_32k import needlebench_datasets as datasets`| [needlebench_v2_32k.py](needlebench_v2_32k/needlebench_v2_32k.py) | +| needlebench_v2_multi_reasoning_32k | needlebench_v2_multi_reasoning_32k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_2needle_en_datasets as datasets`| [needlebench_v2_multi_reasoning_32k.py](needlebench_v2_32k/needlebench_v2_multi_reasoning_32k.py) | +| needlebench_v2_multi_retrieval_32k | needlebench_v2_multi_retrieval_32k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_retrieval_32k import needlebench_en_datasets as datasets`| [needlebench_v2_multi_retrieval_32k.py](needlebench_v2_32k/needlebench_v2_multi_retrieval_32k.py) | +| needlebench_v2_single_32k | needlebench_v2_single_32k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_en_datasets as datasets`| [needlebench_v2_single_32k.py](needlebench_v2_32k/needlebench_v2_single_32k.py) | +| needlebench_v2_128k | needlebench_v2_128k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_128k import needlebench_datasets as 
datasets`| [needlebench_v2_128k.py](needlebench_v2_128k/needlebench_v2_128k.py) | +| needlebench_v2_multi_reasoning_128k | needlebench_v2_multi_reasoning_128k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_2needle_en_datasets as datasets`| [needlebench_v2_multi_reasoning_128k.py](needlebench_v2_128k/needlebench_v2_multi_reasoning_128k.py) | +| needlebench_v2_multi_retrieval_128k | needlebench_v2_multi_retrieval_128k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_retrieval_128k import needlebench_en_datasets as datasets`| [needlebench_v2_multi_retrieval_128k.py](needlebench_v2_128k/needlebench_v2_multi_retrieval_128k.py) | +| needlebench_v2_single_128k | needlebench_v2_single_128k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_single_128k import needlebench_en_datasets as datasets`| [needlebench_v2_single_128k.py](needlebench_v2_128k/needlebench_v2_single_128k.py) | +| needlebench_v2_200k | needlebench_v2_200k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_200k import needlebench_datasets as datasets`| [needlebench_v2_200k.py](needlebench_v2_200k/needlebench_v2_200k.py) | +| needlebench_v2_multi_reasoning_200k | needlebench_v2_multi_reasoning_200k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_2needle_en_datasets as datasets`| [needlebench_v2_multi_reasoning_200k.py](needlebench_v2_200k/needlebench_v2_multi_reasoning_200k.py) | +| needlebench_v2_multi_retrieval_200k | needlebench_v2_multi_retrieval_200k | Accuracy | 0-shot | Chat Format |`from 
ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_retrieval_200k import needlebench_en_datasets as datasets`| [needlebench_v2_multi_retrieval_200k.py](needlebench_v2_200k/needlebench_v2_multi_retrieval_200k.py) | +| needlebench_v2_single_200k | needlebench_v2_single_200k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_single_200k import needlebench_en_datasets as datasets`| [needlebench_v2_single_200k.py](needlebench_v2_200k/needlebench_v2_single_200k.py) | +| needlebench_v2_256k | needlebench_v2_256k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_256k import needlebench_datasets as datasets`| [needlebench_v2_256k.py](needlebench_v2_256k/needlebench_v2_256k.py) | +| needlebench_v2_multi_reasoning_256k | needlebench_v2_multi_reasoning_256k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_2needle_en_datasets as datasets`| [needlebench_v2_multi_reasoning_256k.py](needlebench_v2_256k/needlebench_v2_multi_reasoning_256k.py) | +| needlebench_v2_multi_retrieval_256k | needlebench_v2_multi_retrieval_256k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_retrieval_256k import needlebench_en_datasets as datasets`| [needlebench_v2_multi_retrieval_256k.py](needlebench_v2_256k/needlebench_v2_multi_retrieval_256k.py) | +| needlebench_v2_single_256k | needlebench_v2_single_256k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_single_256k import needlebench_en_datasets as datasets`| [needlebench_v2_single_256k.py](needlebench_v2_256k/needlebench_v2_single_256k.py) | +| needlebench_v2_1000k | 
needlebench_v2_1000k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_1000k import needlebench_datasets as datasets`| [needlebench_v2_1000k.py](needlebench_v2_1000k/needlebench_v2_1000k.py) | +| needlebench_v2_multi_reasoning_1000k | needlebench_v2_multi_reasoning_1000k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_2needle_en_datasets as datasets`| [needlebench_v2_multi_reasoning_1000k.py](needlebench_v2_1000k/needlebench_v2_multi_reasoning_1000k.py) | +| needlebench_v2_multi_retrieval_1000k | needlebench_v2_multi_retrieval_1000k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_retrieval_1000k import needlebench_en_datasets as datasets`| [needlebench_v2_multi_retrieval_1000k.py](needlebench_v2_1000k/needlebench_v2_multi_retrieval_1000k.py) | +| needlebench_v2_single_1000k | needlebench_v2_single_1000k | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_single_1000k import needlebench_en_datasets as datasets`| [needlebench_v2_single_1000k.py](needlebench_v2_1000k/needlebench_v2_single_1000k.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/ocrbench_v2/README.md b/ais_bench/benchmark/configs/datasets/ocrbench_v2/README.md index 17fabcc9..bae99145 100644 --- a/ais_bench/benchmark/configs/datasets/ocrbench_v2/README.md +++ b/ais_bench/benchmark/configs/datasets/ocrbench_v2/README.md @@ -34,9 +34,9 @@ pip3 install -r requirements/datasets/ocrbench_v2.txt ``` ## 可用数据集任务 -| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 对应源码配置文件路径 | -| --- | --- | --- | --- | --- | --- | -| ocrbench_v2_gen_0_shot_chat | OCRBench_v2 数据集生成式任务,支持多模态输入(图像+文本) | 多种指标(根据任务类型) | 0-shot | 对话格式(多模态) | 
[ocrbench_v2_gen_0_shot_chat.py](ocrbench_v2_gen_0_shot_chat.py) | +| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 配套文件导入方式 | 对应源码配置文件路径 | +| --- | --- | --- | --- | --- | --- | --- | +| ocrbench_v2_gen_0_shot_chat | OCRBench_v2 数据集生成式任务,支持多模态输入(图像+文本) | 多种指标(根据任务类型) | 0-shot | 对话格式(多模态) | `from ais_bench.benchmark.configs.datasets.ocrbench_v2.ocrbench_v2_gen_0_shot_chat import ocrbench_v2_datasets as datasets` | [ocrbench_v2_gen_0_shot_chat.py](ocrbench_v2_gen_0_shot_chat.py) | ## 支持的任务类型 OCRBench_v2 数据集涵盖以下任务类型: diff --git a/ais_bench/benchmark/configs/datasets/ocrbench_v2/README_en.md b/ais_bench/benchmark/configs/datasets/ocrbench_v2/README_en.md index 736fbb0d..fd8ddecc 100644 --- a/ais_bench/benchmark/configs/datasets/ocrbench_v2/README_en.md +++ b/ais_bench/benchmark/configs/datasets/ocrbench_v2/README_en.md @@ -34,9 +34,9 @@ pip3 install -r requirements/datasets/ocrbench_v2.txt ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| ocrbench_v2_gen_0_shot_chat | Generative task for OCRBench_v2 dataset, supporting multimodal input (image + text) | Multiple metrics (depending on task type) | 0-shot | Chat format (multimodal) | [ocrbench_v2_gen_0_shot_chat.py](ocrbench_v2_gen_0_shot_chat.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| ocrbench_v2_gen_0_shot_chat | Generative task for OCRBench_v2 dataset, supporting multimodal input (image + text) | Multiple metrics (depending on task type) | 0-shot | Chat format (multimodal) |`from ais_bench.benchmark.configs.datasets.ocrbench_v2.ocrbench_v2_gen_0_shot_chat import ocrbench_v2_datasets as datasets`| [ocrbench_v2_gen_0_shot_chat.py](ocrbench_v2_gen_0_shot_chat.py) | ## Supported Task Types The OCRBench_v2 dataset
covers the following task types: diff --git a/ais_bench/benchmark/configs/datasets/omnidocbench/README.md b/ais_bench/benchmark/configs/datasets/omnidocbench/README.md index 9aba7058..2b2b69bf 100644 --- a/ais_bench/benchmark/configs/datasets/omnidocbench/README.md +++ b/ais_bench/benchmark/configs/datasets/omnidocbench/README.md @@ -29,9 +29,9 @@ git clone https://huggingface.co/datasets/opendatalab/OmniDocBench ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|omnidocbench_gen|OmniDocBench数据集生成式任务|accuracy (pass@1)|0-shot|字符串格式|[omnidocbench_gen.py](omnidocbench_gen.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|omnidocbench_gen|OmniDocBench数据集生成式任务|accuracy (pass@1)|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.omnidocbench.omnidocbench_gen import omnidocbench_datasets as datasets`|[omnidocbench_gen.py](omnidocbench_gen.py)| ## 使用约束 - 当前仅支持Edit_dist指标(用于测评DeepSeek-OCR模型),其他指标暂不支持,overall为各个维度的Edit_dist评分的均值 diff --git a/ais_bench/benchmark/configs/datasets/omnidocbench/README_en.md b/ais_bench/benchmark/configs/datasets/omnidocbench/README_en.md index dff6339c..3aa8efbf 100644 --- a/ais_bench/benchmark/configs/datasets/omnidocbench/README_en.md +++ b/ais_bench/benchmark/configs/datasets/omnidocbench/README_en.md @@ -29,9 +29,9 @@ git clone https://huggingface.co/datasets/opendatalab/OmniDocBench ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| omnidocbench_gen | Generative task for the OmniDocBench dataset | accuracy (pass@1) | 0-shot | String format | [omnidocbench_gen.py](omnidocbench_gen.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- |
+| omnidocbench_gen | Generative task for the OmniDocBench dataset | accuracy (pass@1) | 0-shot | String format |`from ais_bench.benchmark.configs.datasets.omnidocbench.omnidocbench_gen import omnidocbench_datasets as datasets`| [omnidocbench_gen.py](omnidocbench_gen.py) | ## Usage Constraints: - Currently, only the Edit_dist metric is supported (used to evaluate the DeepSeek-OCR model); other metrics are not supported yet. The "overall" score is the average of the Edit_dist scores across all dimensions. diff --git a/ais_bench/benchmark/configs/datasets/piqa/README.md b/ais_bench/benchmark/configs/datasets/piqa/README.md index 22697229..fd416409 100644 --- a/ais_bench/benchmark/configs/datasets/piqa/README.md +++ b/ais_bench/benchmark/configs/datasets/piqa/README.md @@ -37,8 +37,8 @@ rm physicaliqa-train-dev.zip ## 可用数据集任务 ### piqa_gen_0_shot_chat_prompt #### 基本信息 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|piqa_gen_0_shot_chat_prompt|piqa数据集生成式任务|accuracy|0-shot|对话格式|[piqa_gen_0_shot_chat_prompt.py](piqa_gen_0_shot_chat_prompt.py)| -|piqa_gen_0_shot_str|piqa数据集生成式任务|accuracy|0-shot|字符串格式|[piqa_gen_0_shot_str.py](piqa_gen_0_shot_str.py)| -|piqa_ppl_0_shot_str|piqa数据集PPL任务|accuracy|0-shot|字符串格式|[piqa_ppl_0_shot_str.py](piqa_ppl_0_shot_str.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|piqa_gen_0_shot_chat_prompt|piqa数据集生成式任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.piqa.piqa_gen_0_shot_chat_prompt import piqa_datasets as datasets`|[piqa_gen_0_shot_chat_prompt.py](piqa_gen_0_shot_chat_prompt.py)| +|piqa_gen_0_shot_str|piqa数据集生成式任务|accuracy|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.piqa.piqa_gen_0_shot_str import piqa_datasets as datasets`|[piqa_gen_0_shot_str.py](piqa_gen_0_shot_str.py)| +|piqa_ppl_0_shot_str|piqa数据集PPL任务|accuracy|0-shot|字符串格式|`from
ais_bench.benchmark.configs.datasets.piqa.piqa_ppl_0_shot_str import piqa_datasets as datasets`|[piqa_ppl_0_shot_str.py](piqa_ppl_0_shot_str.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/piqa/README_en.md b/ais_bench/benchmark/configs/datasets/piqa/README_en.md index 049bb73a..e325c940 100644 --- a/ais_bench/benchmark/configs/datasets/piqa/README_en.md +++ b/ais_bench/benchmark/configs/datasets/piqa/README_en.md @@ -29,8 +29,8 @@ rm physicaliqa-train-dev.zip ## Available Dataset Tasks ### piqa_gen_0_shot_chat_prompt #### Basic Information -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -| piqa_gen_0_shot_chat_prompt | Generative task for the piqa dataset | Accuracy | 0-shot | Chat Format | [piqa_gen_0_shot_chat_prompt.py](piqa_gen_0_shot_chat_prompt.py) | -| piqa_gen_0_shot_str | Generative task for the piqa dataset | Accuracy | 0-shot | String Format | [piqa_gen_0_shot_str.py](piqa_gen_0_shot_str.py) | -| piqa_ppl_0_shot_str | PPL task for the piqa dataset | Accuracy | 0-shot | String Format | [piqa_ppl_0_shot_str.py](piqa_ppl_0_shot_str.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +| piqa_gen_0_shot_chat_prompt | Generative task for the piqa dataset | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.piqa.piqa_gen_0_shot_chat_prompt import piqa_datasets as datasets`| [piqa_gen_0_shot_chat_prompt.py](piqa_gen_0_shot_chat_prompt.py) | +| piqa_gen_0_shot_str | Generative task for the piqa dataset | Accuracy | 0-shot | String Format |`from ais_bench.benchmark.configs.datasets.piqa.piqa_gen_0_shot_str import piqa_datasets as datasets`| [piqa_gen_0_shot_str.py](piqa_gen_0_shot_str.py) | +|
piqa_ppl_0_shot_str | PPL task for the piqa dataset | Accuracy | 0-shot | String Format |`from ais_bench.benchmark.configs.datasets.piqa.piqa_ppl_0_shot_str import piqa_datasets as datasets`| [piqa_ppl_0_shot_str.py](piqa_ppl_0_shot_str.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/race/README.md b/ais_bench/benchmark/configs/datasets/race/README.md index a6f015d2..60ed6962 100644 --- a/ais_bench/benchmark/configs/datasets/race/README.md +++ b/ais_bench/benchmark/configs/datasets/race/README.md @@ -30,10 +30,10 @@ rm -r OpenCompassData-core-20240207.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|race_middle_gen_5_shot_chat|race数据集生成式任务|accuracy|5-shot|对话格式|[race_middle_gen_5_shot_chat.py](race_middle_gen_5_shot_chat.py)| -|race_middle_gen_5_shot_cot_chat|race数据集生成式任务|accuracy|5-shot|对话格式|[race_middle_gen_5_shot_cot_chat.py](race_middle_gen_5_shot_cot_chat.py)| -|race_high_gen_5_shot_chat|race数据集生成式任务|accuracy|5-shot|对话格式|[race_high_gen_5_shot_chat.py](race_high_gen_5_shot_chat.py)| -|race_high_gen_5_shot_cot_chat|race数据集生成式任务|accuracy|5-shot|对话格式|[race_high_gen_5_shot_cot_chat.py](race_high_gen_5_shot_cot_chat.py)| -|race_ppl_0_shot_chat|race数据集PPL任务|accuracy|0-shot|对话格式|[race_ppl_0_shot_chat.py](race_ppl_0_shot_chat.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|race_middle_gen_5_shot_chat|race数据集生成式任务|accuracy|5-shot|对话格式|`from ais_bench.benchmark.configs.datasets.race.race_middle_gen_5_shot_chat import race_datasets as datasets`|[race_middle_gen_5_shot_chat.py](race_middle_gen_5_shot_chat.py)| +|race_middle_gen_5_shot_cot_chat|race数据集生成式任务|accuracy|5-shot|对话格式|`from ais_bench.benchmark.configs.datasets.race.race_middle_gen_5_shot_cot_chat import race_datasets as datasets`|[race_middle_gen_5_shot_cot_chat.py](race_middle_gen_5_shot_cot_chat.py)|
+|race_high_gen_5_shot_chat|race数据集生成式任务|accuracy|5-shot|对话格式|`from ais_bench.benchmark.configs.datasets.race.race_high_gen_5_shot_chat import race_datasets as datasets`|[race_high_gen_5_shot_chat.py](race_high_gen_5_shot_chat.py)| +|race_high_gen_5_shot_cot_chat|race数据集生成式任务|accuracy|5-shot|对话格式|`from ais_bench.benchmark.configs.datasets.race.race_high_gen_5_shot_cot_chat import race_datasets as datasets`|[race_high_gen_5_shot_cot_chat.py](race_high_gen_5_shot_cot_chat.py)| +|race_ppl_0_shot_chat|race数据集PPL任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.race.race_ppl_0_shot_chat import race_datasets as datasets`|[race_ppl_0_shot_chat.py](race_ppl_0_shot_chat.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/race/README_en.md b/ais_bench/benchmark/configs/datasets/race/README_en.md index 492877fd..e7ade525 100644 --- a/ais_bench/benchmark/configs/datasets/race/README_en.md +++ b/ais_bench/benchmark/configs/datasets/race/README_en.md @@ -30,10 +30,10 @@ rm -r OpenCompassData-core-20240207.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -| race_middle_gen_5_shot_chat | Generative task for the RACE dataset (middle school level) | Accuracy | 5-shot | Chat Format | [race_middle_gen_5_shot_chat.py](race_middle_gen_5_shot_chat.py) | -| race_middle_gen_5_shot_cot_chat | Generative task for the RACE dataset (middle school level) with chain-of-thought in prompt | Accuracy | 5-shot | Chat Format | [race_middle_gen_5_shot_cot_chat.py](race_middle_gen_5_shot_cot_chat.py) | -| race_high_gen_5_shot_chat | Generative task for the RACE dataset (senior high school level) | Accuracy | 5-shot | Chat Format | [race_high_gen_5_shot_chat.py](race_high_gen_5_shot_chat.py) | -| race_high_gen_5_shot_cot_chat | Generative task for the RACE dataset (senior high school level) with chain-of-thought in
prompt | Accuracy | 5-shot | Chat Format | [race_high_gen_5_shot_cot_chat.py](race_high_gen_5_shot_cot_chat.py) | -| race_ppl_0_shot_chat | PPL task for the RACE dataset | Accuracy | 0-shot | Chat Format | [race_ppl_0_shot_chat.py](race_ppl_0_shot_chat.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| race_middle_gen_5_shot_chat | Generative task for the RACE dataset (middle school level) | Accuracy | 5-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.race.race_middle_gen_5_shot_chat import race_datasets as datasets`| [race_middle_gen_5_shot_chat.py](race_middle_gen_5_shot_chat.py) | +| race_middle_gen_5_shot_cot_chat | Generative task for the RACE dataset (middle school level) with chain-of-thought in prompt | Accuracy | 5-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.race.race_middle_gen_5_shot_cot_chat import race_datasets as datasets`| [race_middle_gen_5_shot_cot_chat.py](race_middle_gen_5_shot_cot_chat.py) | +| race_high_gen_5_shot_chat | Generative task for the RACE dataset (senior high school level) | Accuracy | 5-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.race.race_high_gen_5_shot_chat import race_datasets as datasets`| [race_high_gen_5_shot_chat.py](race_high_gen_5_shot_chat.py) | +| race_high_gen_5_shot_cot_chat | Generative task for the RACE dataset (senior high school level) with chain-of-thought in prompt | Accuracy | 5-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.race.race_high_gen_5_shot_cot_chat import race_datasets as datasets`| [race_high_gen_5_shot_cot_chat.py](race_high_gen_5_shot_cot_chat.py) | +| race_ppl_0_shot_chat | PPL task for the RACE dataset | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.race.race_ppl_0_shot_chat import race_datasets as datasets`|
[race_ppl_0_shot_chat.py](race_ppl_0_shot_chat.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/realworldqa/README.md b/ais_bench/benchmark/configs/datasets/realworldqa/README.md index 2913cb44..a1d72e62 100644 --- a/ais_bench/benchmark/configs/datasets/realworldqa/README.md +++ b/ais_bench/benchmark/configs/datasets/realworldqa/README.md @@ -25,7 +25,7 @@ git clone https://huggingface.co/datasets/xai-community/realworldqa RealworldQA ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|realworldqa_gen|RealworldQA数据集生成式任务,⚠️该数据集任务下,会从Parquet文件中提取图片并保存到本地路径,然后将图片路径传入服务化,需确保服务化支持该格式输入并且有权限访问该路径图片。|accuracy|0-shot|列表格式(包含文本和图片两种数据)|[realworldqa_gen.py](realworldqa_gen.py)| -|realworldqa_gen_base64|RealworldQA数据集生成式任务,⚠️该数据集任务下,会将图片数据转化为base64格式再传入服务化,需确保服务化支持该输入格式数据。|accuracy|0-shot|列表格式(包含文本和图片两种数据)|[realworldqa_gen_base64.py](realworldqa_gen_base64.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|realworldqa_gen|RealworldQA数据集生成式任务,⚠️该数据集任务下,会从Parquet文件中提取图片并保存到本地路径,然后将图片路径传入服务化,需确保服务化支持该格式输入并且有权限访问该路径图片。|accuracy|0-shot|列表格式(包含文本和图片两种数据)|`from ais_bench.benchmark.configs.datasets.realworldqa.realworldqa_gen import realworldqa_datasets as datasets`|[realworldqa_gen.py](realworldqa_gen.py)| +|realworldqa_gen_base64|RealworldQA数据集生成式任务,⚠️该数据集任务下,会将图片数据转化为base64格式再传入服务化,需确保服务化支持该输入格式数据。|accuracy|0-shot|列表格式(包含文本和图片两种数据)|`from ais_bench.benchmark.configs.datasets.realworldqa.realworldqa_gen_base64 import realworldqa_datasets as datasets`|[realworldqa_gen_base64.py](realworldqa_gen_base64.py)| diff --git a/ais_bench/benchmark/configs/datasets/realworldqa/README_en.md b/ais_bench/benchmark/configs/datasets/realworldqa/README_en.md index 3a77c283..9ebe177c 100644 --- a/ais_bench/benchmark/configs/datasets/realworldqa/README_en.md +++ b/ais_bench/benchmark/configs/datasets/realworldqa/README_en.md @@ -25,7 +25,7 @@ git clone
https://huggingface.co/datasets/xai-community/realworldqa RealworldQA ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -| realworldqa_gen | Generative task for the RealworldQA dataset. ⚠️ For this dataset task, images will be extracted from Parquet files and saved to a local path, then the image paths will be passed to the service deployment. Ensure that the service deployment supports this input format and has permission to access the images at the specified path. | accuracy | 0-shot | List format (contains two types of data: text and image) | [realworldqa_gen.py](realworldqa_gen.py) | -| realworldqa_gen_base64 | Generative task for the RealworldQA dataset. ⚠️ For this dataset task, the image data will be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. | accuracy | 0-shot | List format (contains two types of data: text and image) | [realworldqa_gen_base64.py](realworldqa_gen_base64.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| realworldqa_gen | Generative task for the RealworldQA dataset. ⚠️ For this dataset task, images will be extracted from Parquet files and saved to a local path, then the image paths will be passed to the service deployment. Ensure that the service deployment supports this input format and has permission to access the images at the specified path. | accuracy | 0-shot | List format (contains two types of data: text and image) |`from ais_bench.benchmark.configs.datasets.realworldqa.realworldqa_gen import realworldqa_datasets as datasets`| [realworldqa_gen.py](realworldqa_gen.py) | +| realworldqa_gen_base64 | Generative task for the RealworldQA dataset.
⚠️ For this dataset task, the image data will be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. | accuracy | 0-shot | List format (contains two types of data: text and image) |`from ais_bench.benchmark.configs.datasets.realworldqa.realworldqa_gen_base64 import realworldqa_datasets as datasets`| [realworldqa_gen_base64.py](realworldqa_gen_base64.py) | diff --git a/ais_bench/benchmark/configs/datasets/refcoco/README.md b/ais_bench/benchmark/configs/datasets/refcoco/README.md index 4f2624f1..5059b4a3 100644 --- a/ais_bench/benchmark/configs/datasets/refcoco/README.md +++ b/ais_bench/benchmark/configs/datasets/refcoco/README.md @@ -49,10 +49,10 @@ RefCOCO/ ## 可用数据集任务 -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Config File | -| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | -------- | ---------------------------------- | ---------------------------------------------- | -| refcoco_gen | RefCOCO 生成式定位任务配置,使用文件路径图像输入(`file://{image}`),导出 `RefCOCO_val`、`RefCOCO_test`、`RefCOCO_testA`、`RefCOCO_testB` 四个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | [refcoco_gen.py](refcoco_gen.py) | -| refcoco_gen_base64 | RefCOCO 生成式定位任务配置,使用 base64 data URL 图像输入(`data:image/jpeg;base64,{image}`),导出 `RefCOCO_base64_val`、`RefCOCO_base64_test`、`RefCOCO_base64_testA`、`RefCOCO_base64_testB` 四个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | [refcoco_gen_base64.py](refcoco_gen_base64.py) | +| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 配套文件导入方式 | 对应源码配置文件路径 | +| --- | --- | --- | --- | --- | --- | --- | +| refcoco_gen | RefCOCO 生成式定位任务配置,使用文件路径图像输入(`file://{image}`),导出 `RefCOCO_val`、`RefCOCO_test`、`RefCOCO_testA`、`RefCOCO_testB` 四个 split 任务 | 
Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcoco.refcoco_gen import refcoco_datasets as datasets` | [refcoco_gen.py](refcoco_gen.py) | +| refcoco_gen_base64 | RefCOCO 生成式定位任务配置,使用 base64 data URL 图像输入(`data:image/jpeg;base64,{image}`),导出 `RefCOCO_base64_val`、`RefCOCO_base64_test`、`RefCOCO_base64_testA`、`RefCOCO_base64_testB` 四个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcoco.refcoco_gen_base64 import refcoco_datasets as datasets` | [refcoco_gen_base64.py](refcoco_gen_base64.py) | ## 数据集分类 diff --git a/ais_bench/benchmark/configs/datasets/refcoco/README_en.md b/ais_bench/benchmark/configs/datasets/refcoco/README_en.md index c18f6e9c..3b42a11b 100644 --- a/ais_bench/benchmark/configs/datasets/refcoco/README_en.md +++ b/ais_bench/benchmark/configs/datasets/refcoco/README_en.md @@ -51,10 +51,10 @@ RefCOCO/ ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Config File | -| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------- | -------- | ----------------------------------------- | ---------------------------------------------- | -| refcoco_gen | RefCOCO generative grounding config that uses file-path image input (`file://{image}`) and exports `RefCOCO_val`, `RefCOCO_test`, `RefCOCO_testA`, and `RefCOCO_testB` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | [refcoco_gen.py](refcoco_gen.py) | -| refcoco_gen_base64 | RefCOCO generative grounding config that uses base64 data-URL image input (`data:image/jpeg;base64,{image}`) and exports `RefCOCO_base64_val`, `RefCOCO_base64_test`, `RefCOCO_base64_testA`, and `RefCOCO_base64_testB` split tasks | Accuracy@0.5 
| 0-shot | Multimodal chat format (MMPromptTemplate) | [refcoco_gen_base64.py](refcoco_gen_base64.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Config File | +| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------- | -------- | ----------------------------------------- | --- | ---------------------------------------------- | +| refcoco_gen | RefCOCO generative grounding config that uses file-path image input (`file://{image}`) and exports `RefCOCO_val`, `RefCOCO_test`, `RefCOCO_testA`, and `RefCOCO_testB` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcoco.refcoco_gen import refcoco_datasets as datasets` | [refcoco_gen.py](refcoco_gen.py) | +| refcoco_gen_base64 | RefCOCO generative grounding config that uses base64 data-URL image input (`data:image/jpeg;base64,{image}`) and exports `RefCOCO_base64_val`, `RefCOCO_base64_test`, `RefCOCO_base64_testA`, and `RefCOCO_base64_testB` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcoco.refcoco_gen_base64 import refcoco_datasets as datasets` | [refcoco_gen_base64.py](refcoco_gen_base64.py) | ## Dataset Classification diff --git a/ais_bench/benchmark/configs/datasets/refcoco_plus/README.md b/ais_bench/benchmark/configs/datasets/refcoco_plus/README.md index 4e1cad17..8471d76a 100644 --- a/ais_bench/benchmark/configs/datasets/refcoco_plus/README.md +++ b/ais_bench/benchmark/configs/datasets/refcoco_plus/README.md @@ -44,10 +44,10 @@ RefCOCOplus/ ## 可用数据集任务 -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Config File | -| ----------------------- | 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | -------- | ---------------------------------- | -------------------------------------------------------- | -| refcoco_plus_gen | RefCOCO+ 生成式定位任务配置,使用文件路径图像输入(`file://{image}`),导出 `RefCOCOPlus_val`、`RefCOCOPlus_testA`、`RefCOCOPlus_testB` 三个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | [refcoco_plus_gen.py](refcoco_plus_gen.py) | -| refcoco_plus_gen_base64 | RefCOCO+ 生成式定位任务配置,使用 base64 data URL 图像输入(`data:image/jpeg;base64,{image}`),导出 `RefCOCOPlus_base64_val`、`RefCOCOPlus_base64_testA`、`RefCOCOPlus_base64_testB` 三个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | [refcoco_plus_gen_base64.py](refcoco_plus_gen_base64.py) | +| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 配套文件导入方式 | 对应源码配置文件路径 | +| --- | --- | --- | --- | --- | --- | --- | +| refcoco_plus_gen | RefCOCO+ 生成式定位任务配置,使用文件路径图像输入(`file://{image}`),导出 `RefCOCOPlus_val`、`RefCOCOPlus_testA`、`RefCOCOPlus_testB` 三个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcoco_plus.refcoco_plus_gen import refcoco_plus_datasets as datasets` | [refcoco_plus_gen.py](refcoco_plus_gen.py) | +| refcoco_plus_gen_base64 | RefCOCO+ 生成式定位任务配置,使用 base64 data URL 图像输入(`data:image/jpeg;base64,{image}`),导出 `RefCOCOPlus_base64_val`、`RefCOCOPlus_base64_testA`、`RefCOCOPlus_base64_testB` 三个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcoco_plus.refcoco_plus_gen_base64 import refcoco_plus_datasets as datasets` | [refcoco_plus_gen_base64.py](refcoco_plus_gen_base64.py) | ## 数据集分类 diff --git a/ais_bench/benchmark/configs/datasets/refcoco_plus/README_en.md b/ais_bench/benchmark/configs/datasets/refcoco_plus/README_en.md index 0a90c023..3a833e2b 100644 --- 
a/ais_bench/benchmark/configs/datasets/refcoco_plus/README_en.md +++ b/ais_bench/benchmark/configs/datasets/refcoco_plus/README_en.md @@ -44,10 +44,10 @@ RefCOCOplus/ ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Config File | -| ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | -------- | ----------------------------------------- | -------------------------------------------------------- | -| refcoco_plus_gen | RefCOCO+ generative grounding config that uses file-path image input (`file://{image}`) and exports `RefCOCOPlus_val`, `RefCOCOPlus_testA`, and `RefCOCOPlus_testB` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | [refcoco_plus_gen.py](refcoco_plus_gen.py) | -| refcoco_plus_gen_base64 | RefCOCO+ generative grounding config that uses base64 data-URL image input (`data:image/jpeg;base64,{image}`) and exports `RefCOCOPlus_base64_val`, `RefCOCOPlus_base64_testA`, and `RefCOCOPlus_base64_testB` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | [refcoco_plus_gen_base64.py](refcoco_plus_gen_base64.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Config File | +| ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | -------- | ----------------------------------------- | --- | -------------------------------------------------------- | +| refcoco_plus_gen | RefCOCO+ generative grounding config that uses file-path image input (`file://{image}`) and exports `RefCOCOPlus_val`, 
`RefCOCOPlus_testA`, and `RefCOCOPlus_testB` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcoco_plus.refcoco_plus_gen import refcoco_plus_datasets as datasets` | [refcoco_plus_gen.py](refcoco_plus_gen.py) | +| refcoco_plus_gen_base64 | RefCOCO+ generative grounding config that uses base64 data-URL image input (`data:image/jpeg;base64,{image}`) and exports `RefCOCOPlus_base64_val`, `RefCOCOPlus_base64_testA`, and `RefCOCOPlus_base64_testB` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcoco_plus.refcoco_plus_gen_base64 import refcoco_plus_datasets as datasets` | [refcoco_plus_gen_base64.py](refcoco_plus_gen_base64.py) | ## Dataset Classification diff --git a/ais_bench/benchmark/configs/datasets/refcocog/README.md b/ais_bench/benchmark/configs/datasets/refcocog/README.md index 865355b2..03294558 100644 --- a/ais_bench/benchmark/configs/datasets/refcocog/README.md +++ b/ais_bench/benchmark/configs/datasets/refcocog/README.md @@ -43,10 +43,10 @@ RefCOCOg/ ## 可用数据集任务 -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Config File | -| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | -------- | ---------------------------------- | ------------------------------------------------ | -| refcocog_gen | RefCOCOg 生成式定位任务配置,使用文件路径图像输入(`file://{image}`),导出 `RefCOCOg_val` 和 `RefCOCOg_test` 两个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | [refcocog_gen.py](refcocog_gen.py) | -| refcocog_gen_base64 | RefCOCOg 生成式定位任务配置,使用 base64 data URL 图像输入(`data:image/jpeg;base64,{image}`),导出 `RefCOCOg_base64_val` 和 `RefCOCOg_base64_test` 两个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | 
[refcocog_gen_base64.py](refcocog_gen_base64.py) | +| 任务名称 | 简介 | 评估指标 | Few-Shot | Prompt 格式 | 配套文件导入方式 | 对应源码配置文件路径 | +| --- | --- | --- | --- | --- | --- | --- | +| refcocog_gen | RefCOCOg 生成式定位任务配置,使用文件路径图像输入(`file://{image}`),导出 `RefCOCOg_val` 和 `RefCOCOg_test` 两个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcocog.refcocog_gen import refcocog_datasets as datasets` | [refcocog_gen.py](refcocog_gen.py) | +| refcocog_gen_base64 | RefCOCOg 生成式定位任务配置,使用 base64 data URL 图像输入(`data:image/jpeg;base64,{image}`),导出 `RefCOCOg_base64_val` 和 `RefCOCOg_base64_test` 两个 split 任务 | Accuracy@0.5 | 0-shot | 多模态对话格式(MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcocog.refcocog_gen_base64 import refcocog_datasets as datasets` | [refcocog_gen_base64.py](refcocog_gen_base64.py) | ## 数据集分类 diff --git a/ais_bench/benchmark/configs/datasets/refcocog/README_en.md b/ais_bench/benchmark/configs/datasets/refcocog/README_en.md index 8e803ed3..e4633b63 100644 --- a/ais_bench/benchmark/configs/datasets/refcocog/README_en.md +++ b/ais_bench/benchmark/configs/datasets/refcocog/README_en.md @@ -43,10 +43,10 @@ RefCOCOg/ ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Config File | -| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | -------- | ----------------------------------------- | ------------------------------------------------ | -| refcocog_gen | RefCOCOg generative grounding config that uses file-path image input (`file://{image}`) and exports `RefCOCOg_val` and `RefCOCOg_test` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | [refcocog_gen.py](refcocog_gen.py) | -| refcocog_gen_base64 | RefCOCOg generative grounding config that uses base64 
data-URL image input (`data:image/jpeg;base64,{image}`) and exports `RefCOCOg_base64_val` and `RefCOCOg_base64_test` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | [refcocog_gen_base64.py](refcocog_gen_base64.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Config File | +| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------- | -------- | ----------------------------------------- | --- | ------------------------------------------------ | +| refcocog_gen | RefCOCOg generative grounding config that uses file-path image input (`file://{image}`) and exports `RefCOCOg_val` and `RefCOCOg_test` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcocog.refcocog_gen import refcocog_datasets as datasets` | [refcocog_gen.py](refcocog_gen.py) | +| refcocog_gen_base64 | RefCOCOg generative grounding config that uses base64 data-URL image input (`data:image/jpeg;base64,{image}`) and exports `RefCOCOg_base64_val` and `RefCOCOg_base64_test` split tasks | Accuracy@0.5 | 0-shot | Multimodal chat format (MMPromptTemplate) | `from ais_bench.benchmark.configs.datasets.refcocog.refcocog_gen_base64 import refcocog_datasets as datasets` | [refcocog_gen_base64.py](refcocog_gen_base64.py) | ## Dataset Classification diff --git a/ais_bench/benchmark/configs/datasets/sharegpt/README.md b/ais_bench/benchmark/configs/datasets/sharegpt/README.md index 1e93d103..f775c829 100644 --- a/ais_bench/benchmark/configs/datasets/sharegpt/README.md +++ b/ais_bench/benchmark/configs/datasets/sharegpt/README.md @@ -54,9 +54,9 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/b ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | 
--- | --- | --- | --- | --- | -|sharegpt_gen|sharegpt生成式任务|暂不支持精度评测|0-shot|列表格式|[sharegpt_gen.py](sharegpt_gen.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|sharegpt_gen|sharegpt生成式任务|暂不支持精度评测|0-shot|列表格式|`from ais_bench.benchmark.configs.datasets.sharegpt.sharegpt_gen import sharegpt_datasets as datasets`|[sharegpt_gen.py](sharegpt_gen.py)| *注意:该多轮对话数据集的测评支持vLLM、SGLang、MindIE Service等服务化,使用时需指定--models为vllm_api_stream_chat_multiturn* \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/sharegpt/README_en.md b/ais_bench/benchmark/configs/datasets/sharegpt/README_en.md index 39fb7a16..ca96f00d 100644 --- a/ais_bench/benchmark/configs/datasets/sharegpt/README_en.md +++ b/ais_bench/benchmark/configs/datasets/sharegpt/README_en.md @@ -57,9 +57,9 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/b ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -| sharegpt_gen | Generative task for ShareGPT | Accuracy evaluation not supported temporarily | 0-shot | List Format | [sharegpt_gen.py](sharegpt_gen.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| sharegpt_gen | Generative task for ShareGPT | Accuracy evaluation not supported temporarily | 0-shot | List Format |`from ais_bench.benchmark.configs.datasets.sharegpt.sharegpt_gen import sharegpt_datasets as datasets`| [sharegpt_gen.py](sharegpt_gen.py) | *Note: The evaluation of this multi-turn conversation dataset supports service deployment frameworks such as vLLM, SGLang, and MindIE Service.
When using it, you need to specify `--models` as `vllm_api_stream_chat_multiturn`.* \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/siqa/README.md b/ais_bench/benchmark/configs/datasets/siqa/README.md index cca0bc6f..fca0bdbf 100644 --- a/ais_bench/benchmark/configs/datasets/siqa/README.md +++ b/ais_bench/benchmark/configs/datasets/siqa/README.md @@ -27,7 +27,7 @@ rm -r OpenCompassData-core-20240207.zip ├── train-labels.lst ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|siqa_gen_0_shot_chat|siqa数据集生成式任务;`EDAccEvaluator`精度评估方式会通过`Levenshtein距离算法`选取最接近的答案,可能会造成误判,导致精度得分结果偏高。|accuracy|0-shot|对话格式|[siqa_gen_0_shot_chat.py](siqa_gen_0_shot_chat.py)| -|siqa_ppl_0_shot_chat|siqa数据集PPL任务|accuracy|0-shot|对话格式|[siqa_ppl_0_shot_chat.py](siqa_ppl_0_shot_chat.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|siqa_gen_0_shot_chat|siqa数据集生成式任务;`EDAccEvaluator`精度评估方式会通过`Levenshtein距离算法`选取最接近的答案,可能会造成误判,导致精度得分结果偏高。|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.siqa.siqa_gen_0_shot_chat import siqa_datasets as datasets`|[siqa_gen_0_shot_chat.py](siqa_gen_0_shot_chat.py)| +|siqa_ppl_0_shot_chat|siqa数据集PPL任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.siqa.siqa_ppl_0_shot_chat import siqa_datasets as datasets`|[siqa_ppl_0_shot_chat.py](siqa_ppl_0_shot_chat.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/siqa/README_en.md b/ais_bench/benchmark/configs/datasets/siqa/README_en.md index e05ee751..34590400 100644 --- a/ais_bench/benchmark/configs/datasets/siqa/README_en.md +++ b/ais_bench/benchmark/configs/datasets/siqa/README_en.md @@ -28,7 +28,7 @@ rm -r OpenCompassData-core-20240207.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | ---
| --- | --- | --- | --- | -| siqa_gen_0_shot_chat | Generative task for the SIQA dataset; The `EDAccEvaluator` accuracy evaluation method selects the closest answer using the `Levenshtein distance algorithm`, which may cause misjudgment and result in an artificially high accuracy score. | Accuracy | 0-shot | Chat Format | [siqa_gen_0_shot_chat.py](siqa_gen_0_shot_chat.py) | -| siqa_ppl_0_shot_chat | PPL task for SIQA dataset | Accuracy | 0-shot | Chat Format | [siqa_ppl_0_shot_chat.py](siqa_ppl_0_shot_chat.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| siqa_gen_0_shot_chat | Generative task for the SIQA dataset; The `EDAccEvaluator` accuracy evaluation method selects the closest answer using the `Levenshtein distance algorithm`, which may cause misjudgment and result in an artificially high accuracy score. | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.siqa.siqa_gen_0_shot_chat import siqa_datasets as datasets`| [siqa_gen_0_shot_chat.py](siqa_gen_0_shot_chat.py) | +| siqa_ppl_0_shot_chat | PPL task for SIQA dataset | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.siqa.siqa_ppl_0_shot_chat import siqa_datasets as datasets`| [siqa_ppl_0_shot_chat.py](siqa_ppl_0_shot_chat.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/textvqa/README.md b/ais_bench/benchmark/configs/datasets/textvqa/README.md index a3d63f5e..9db00437 100644 --- a/ais_bench/benchmark/configs/datasets/textvqa/README.md +++ b/ais_bench/benchmark/configs/datasets/textvqa/README.md @@ -32,8 +32,8 @@ mv textvqa/*.jsonl textvqa/textvqa_json/ ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|textvqa_gen|TextVQA数据集生成式任务,
⚠️该数据集任务下,会直接将图片路径传入服务化,需确保服务化支持该格式输入并且有权限访问该路径图片。|VQA|0-shot|列表格式(包含文本和图片两种数据)|[textvqa_gen.py](textvqa_gen.py)| -|textvqa_gen_base64|TextVQA数据集生成式任务,⚠️该数据集任务下,会将图片数据转化为base64格式再传入服务化,需确保服务化支持该输入格式数据|VQA|0-shot|列表格式(包含文本和图片两种数据)|[textvqa_gen_base64.py](textvqa_gen_base64.py)| -|glm4v_textvqa_gen_base64|Glm4.1v-Thinking模型专用TextVQA数据集生成式任务,以适配该模型特殊的输出文本格式,⚠️该数据集任务下,会将图片数据转化为base64格式再传入服务化,需确保服务化支持该输入格式数据|VQA|0-shot|列表格式(包含文本和图片两种数据)|[glm4v_textvqa_gen_base64.py](glm4v_textvqa_gen_base64.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|textvqa_gen|TextVQA数据集生成式任务, ⚠️该数据集任务下,会直接将图片路径传入服务化,需确保服务化支持该格式输入并且有权限访问该路径图片。|VQA|0-shot|列表格式(包含文本和图片两种数据)|`from ais_bench.benchmark.configs.datasets.textvqa.textvqa_gen import textvqa_datasets as datasets`|[textvqa_gen.py](textvqa_gen.py)| +|textvqa_gen_base64|TextVQA数据集生成式任务,⚠️该数据集任务下,会将图片数据转化为base64格式再传入服务化,需确保服务化支持该输入格式数据|VQA|0-shot|列表格式(包含文本和图片两种数据)|`from ais_bench.benchmark.configs.datasets.textvqa.textvqa_gen_base64 import textvqa_datasets as datasets`|[textvqa_gen_base64.py](textvqa_gen_base64.py)| +|glm4v_textvqa_gen_base64|Glm4.1v-Thinking模型专用TextVQA数据集生成式任务,以适配该模型特殊的输出文本格式,⚠️该数据集任务下,会将图片数据转化为base64格式再传入服务化,需确保服务化支持该输入格式数据|VQA|0-shot|列表格式(包含文本和图片两种数据)|`from ais_bench.benchmark.configs.datasets.textvqa.glm4v_textvqa_gen_base64 import textvqa_datasets as datasets`|[glm4v_textvqa_gen_base64.py](glm4v_textvqa_gen_base64.py)| diff --git a/ais_bench/benchmark/configs/datasets/textvqa/README_en.md b/ais_bench/benchmark/configs/datasets/textvqa/README_en.md index fa505151..364aac4e 100644 --- a/ais_bench/benchmark/configs/datasets/textvqa/README_en.md +++ b/ais_bench/benchmark/configs/datasets/textvqa/README_en.md @@ -32,8 +32,8 @@ mv textvqa/*.jsonl textvqa/textvqa_json/ ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -|
textvqa_gen | Generative task for the TextVQA dataset. ⚠️ For this dataset task, the image path will be directly passed to the service deployment. Ensure that the service deployment supports this input format and has permission to access the images at the specified path. | VQA | 0-shot | List format (contains two types of data: text and image) | [textvqa_gen.py](textvqa_gen.py) | -| textvqa_gen_base64 | Generative task for the TextVQA dataset. ⚠️ For this dataset task, the image data will be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. | VQA | 0-shot | List format (contains two types of data: text and image) | [textvqa_gen_base64.py](textvqa_gen_base64.py) | -| glm4v_textvqa_gen_base64 | Generative task for the TextVQA dataset limited to Glm4.1v-Thinking because of special output layout. ⚠️ For this dataset task, the image data will be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. | VQA | 0-shot | List format (contains two types of data: text and image) | [glm4v_textvqa_gen_base64.py](glm4v_textvqa_gen_base64.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| textvqa_gen | Generative task for the TextVQA dataset. ⚠️ For this dataset task, the image path will be directly passed to the service deployment. Ensure that the service deployment supports this input format and has permission to access the images at the specified path. | VQA | 0-shot | List format (contains two types of data: text and image) |`from ais_bench.benchmark.configs.datasets.textvqa.textvqa_gen import textvqa_datasets as datasets`| [textvqa_gen.py](textvqa_gen.py) | +| textvqa_gen_base64 | Generative task for the TextVQA dataset.
⚠️ For this dataset task, the image data will be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. | VQA | 0-shot | List format (contains two types of data: text and image) |`from ais_bench.benchmark.configs.datasets.textvqa.textvqa_gen_base64 import textvqa_datasets as datasets`| [textvqa_gen_base64.py](textvqa_gen_base64.py) | +| glm4v_textvqa_gen_base64 | Generative task for the TextVQA dataset limited to Glm4.1v-Thinking because of special output layout. ⚠️ For this dataset task, the image data will be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. | VQA | 0-shot | List format (contains two types of data: text and image) |`from ais_bench.benchmark.configs.datasets.textvqa.glm4v_textvqa_gen_base64 import textvqa_datasets as datasets`| [glm4v_textvqa_gen_base64.py](glm4v_textvqa_gen_base64.py) | diff --git a/ais_bench/benchmark/configs/datasets/triviaqa/README.md b/ais_bench/benchmark/configs/datasets/triviaqa/README.md index b116a939..70a38f0a 100644 --- a/ais_bench/benchmark/configs/datasets/triviaqa/README.md +++ b/ais_bench/benchmark/configs/datasets/triviaqa/README.md @@ -25,7 +25,7 @@ rm triviaqa.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|triviaqa_gen_5_shot_chat_prompt|TriviaQA数据集生成式任务|accuracy|5-shot|对话格式|[triviaqa_gen_5_shot_chat_prompt.py](triviaqa_gen_5_shot_chat_prompt.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|triviaqa_gen_5_shot_chat_prompt|TriviaQA数据集生成式任务|accuracy|5-shot|对话格式|`from ais_bench.benchmark.configs.datasets.triviaqa.triviaqa_gen_5_shot_chat_prompt import triviaqa_datasets as datasets`|[triviaqa_gen_5_shot_chat_prompt.py](triviaqa_gen_5_shot_chat_prompt.py)| diff --git a/ais_bench/benchmark/configs/datasets/triviaqa/README_en.md
b/ais_bench/benchmark/configs/datasets/triviaqa/README_en.md index afc2741e..37d526af 100644 --- a/ais_bench/benchmark/configs/datasets/triviaqa/README_en.md +++ b/ais_bench/benchmark/configs/datasets/triviaqa/README_en.md @@ -25,6 +25,6 @@ rm triviaqa.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -| triviaqa_gen_5_shot_chat_prompt | Generative task for the TriviaQA dataset | Accuracy | 5-shot | Chat Format | [triviaqa_gen_5_shot_chat_prompt.py](triviaqa_gen_5_shot_chat_prompt.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| triviaqa_gen_5_shot_chat_prompt | Generative task for the TriviaQA dataset | Accuracy | 5-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.triviaqa.triviaqa_gen_5_shot_chat_prompt import triviaqa_datasets as datasets`| [triviaqa_gen_5_shot_chat_prompt.py](triviaqa_gen_5_shot_chat_prompt.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/videobench/README.md b/ais_bench/benchmark/configs/datasets/videobench/README.md index 374777c0..76097cb0 100644 --- a/ais_bench/benchmark/configs/datasets/videobench/README.md +++ b/ais_bench/benchmark/configs/datasets/videobench/README.md @@ -33,7 +33,7 @@ mv videobench_subset/ videobench/ ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|videobench_gen|VideoBench数据集生成式任务,⚠️该数据集任务下,会直接将视频路径传入服务化,需确保服务化支持该格式输入并且有权限访问该路径视频。|accuracy|0-shot|列表格式(包含文本和视频两种数据)|[videobench_gen.py](videobench_gen.py)| -|videobench_gen_base64|VideoBench数据集生成式任务,⚠️该数据集任务下,会先将视频进行抽帧再转化为base64格式传入服务化,需确保服务化支持该输入格式数据。其中num_frames表示视频抽帧数,默认为5|accuracy|0-shot|列表格式(包含文本和视频两种数据)|[videobench_gen_base64.py](videobench_gen_base64.py)|
+|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|videobench_gen|VideoBench数据集生成式任务,⚠️该数据集任务下,会直接将视频路径传入服务化,需确保服务化支持该格式输入并且有权限访问该路径视频。|accuracy|0-shot|列表格式(包含文本和视频两种数据)|`from ais_bench.benchmark.configs.datasets.videobench.videobench_gen import videobench_datasets as datasets`|[videobench_gen.py](videobench_gen.py)| +|videobench_gen_base64|VideoBench数据集生成式任务,⚠️该数据集任务下,会先将视频进行抽帧再转化为base64格式传入服务化,需确保服务化支持该输入格式数据。其中num_frames表示视频抽帧数,默认为5|accuracy|0-shot|列表格式(包含文本和视频两种数据)|`from ais_bench.benchmark.configs.datasets.videobench.videobench_gen_base64 import videobench_datasets as datasets`|[videobench_gen_base64.py](videobench_gen_base64.py)| diff --git a/ais_bench/benchmark/configs/datasets/videobench/README_en.md b/ais_bench/benchmark/configs/datasets/videobench/README_en.md index 9acdae07..17c55b13 100644 --- a/ais_bench/benchmark/configs/datasets/videobench/README_en.md +++ b/ais_bench/benchmark/configs/datasets/videobench/README_en.md @@ -36,7 +36,7 @@ mv videobench_subset/ videobench/ ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -| videobench_gen | Generative task for the VideoBench dataset. ⚠️ For this dataset task, the video path will be directly passed to the service deployment. Ensure that the service deployment supports this input format and has permission to access the videos at the specified path. | Accuracy | 0-shot | List format (contains two types of data: text and video) | [videobench_gen.py](videobench_gen.py) | -| videobench_gen_base64 | Generative task for the VideoBench dataset. ⚠️ For this dataset task, videos will first undergo frame extraction and then be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format.
Among the parameters, `num_frames` refers to the number of frames extracted from the video, with a default value of 5. | Accuracy | 0-shot | List format (contains two types of data: text and video) | [videobench_gen_base64.py](videobench_gen_base64.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| videobench_gen | Generative task for the VideoBench dataset. ⚠️ For this dataset task, the video path will be directly passed to the service deployment. Ensure that the service deployment supports this input format and has permission to access the videos at the specified path. | Accuracy | 0-shot | List format (contains two types of data: text and video) |`from ais_bench.benchmark.configs.datasets.videobench.videobench_gen import videobench_datasets as datasets`| [videobench_gen.py](videobench_gen.py) | +| videobench_gen_base64 | Generative task for the VideoBench dataset. ⚠️ For this dataset task, videos will first undergo frame extraction and then be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. Among the parameters, `num_frames` refers to the number of frames extracted from the video, with a default value of 5.
| Accuracy | 0-shot | List format (contains two types of data: text and video) |`from ais_bench.benchmark.configs.datasets.videobench.videobench_gen_base64 import videobench_datasets as datasets`| [videobench_gen_base64.py](videobench_gen_base64.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/videomme/README.md b/ais_bench/benchmark/configs/datasets/videomme/README.md index 0b308d0d..461549ad 100644 --- a/ais_bench/benchmark/configs/datasets/videomme/README.md +++ b/ais_bench/benchmark/configs/datasets/videomme/README.md @@ -30,6 +30,6 @@ Video-MME 是面向多模态大语言模型(MLLM)的视频理解评测基准 #### 基本信息 - 当前对于Video-MME数据集的测评暂不支持字幕数据的传入 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|videomme_gen|videomme数据集生成式任务|acc|0-shot|字符串格式|[videomme_gen.py](videomme_gen.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|videomme_gen|videomme数据集生成式任务|acc|0-shot|字符串格式|`from ais_bench.benchmark.configs.datasets.videomme.videomme_gen import videomme_datasets as datasets`|[videomme_gen.py](videomme_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/videomme/README_en.md b/ais_bench/benchmark/configs/datasets/videomme/README_en.md index 2bb588a1..c9ff77c1 100644 --- a/ais_bench/benchmark/configs/datasets/videomme/README_en.md +++ b/ais_bench/benchmark/configs/datasets/videomme/README_en.md @@ -30,6 +30,6 @@ Video-MME is a Video understanding evaluation benchmark for multimodal large lan #### Basic Information - Currently, the evaluation of the Video-MME dataset does not support the input of subtitle data for the time being -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | --- | -|videomme_gen|Generative task for the videomme dataset|acc|0-shot|String format|[videomme_gen.py](videomme_gen.py)| +| Task Name | Introduction | Evaluation Metric |
Few-Shot | Prompt Format | Import Statement | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | --- | +|videomme_gen|Generative task for the videomme dataset|acc|0-shot|String format|`from ais_bench.benchmark.configs.datasets.videomme.videomme_gen import videomme_datasets as datasets`|[videomme_gen.py](videomme_gen.py)| diff --git a/ais_bench/benchmark/configs/datasets/vocalsound/README.md b/ais_bench/benchmark/configs/datasets/vocalsound/README.md index c52b1fda..fecfd549 100644 --- a/ais_bench/benchmark/configs/datasets/vocalsound/README.md +++ b/ais_bench/benchmark/configs/datasets/vocalsound/README.md @@ -30,7 +30,7 @@ mv vocalsound/subset5/* vocalsound/ ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|vocalsound_gen|VocalSound数据集生成式任务,⚠️该数据集任务下会直接将音频路径传入服务化,需确保服务化支持该格式输入并且有权限访问该路径音频。|accuracy|0-shot|列表格式(包含文本和音频两种数据)|[vocalsound_gen.py](vocalsound_gen.py)| -|vocalsound_gen_base64|VocalSound数据集生成式任务,⚠️该数据集任务下,会将音频数据转化为base64格式再传入服务化,需确保服务化支持该输入格式数据。|accuracy|0-shot|列表格式(包含文本和音频两种数据)|[vocalsound_gen_base64.py](vocalsound_gen_base64.py)| \ No newline at end of file +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|vocalsound_gen|VocalSound数据集生成式任务,⚠️该数据集任务下会直接将音频路径传入服务化,需确保服务化支持该格式输入并且有权限访问该路径音频。|accuracy|0-shot|列表格式(包含文本和音频两种数据)|`from ais_bench.benchmark.configs.datasets.vocalsound.vocalsound_gen import vocalsound_datasets as datasets`|[vocalsound_gen.py](vocalsound_gen.py)| +|vocalsound_gen_base64|VocalSound数据集生成式任务,⚠️该数据集任务下,会将音频数据转化为base64格式再传入服务化,需确保服务化支持该输入格式数据。|accuracy|0-shot|列表格式(包含文本和音频两种数据)|`from ais_bench.benchmark.configs.datasets.vocalsound.vocalsound_gen_base64 import vocalsound_datasets as datasets`|[vocalsound_gen_base64.py](vocalsound_gen_base64.py)| \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/vocalsound/README_en.md
b/ais_bench/benchmark/configs/datasets/vocalsound/README_en.md index e76eded0..5653348c 100644 --- a/ais_bench/benchmark/configs/datasets/vocalsound/README_en.md +++ b/ais_bench/benchmark/configs/datasets/vocalsound/README_en.md @@ -30,7 +30,7 @@ mv vocalsound/subset5/* vocalsound/ ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -| vocalsound_gen | Generative task for the VocalSound dataset. ⚠️ For this dataset task, the audio path will be directly passed to the service deployment. Ensure that the service deployment supports this input format and has permission to access the audio at the specified path. | Accuracy | 0-shot | List format (contains two types of data: text and audio) | [vocalsound_gen.py](vocalsound_gen.py) | -| vocalsound_gen_base64 | Generative task for the VocalSound dataset. ⚠️ For this dataset task, the audio data will be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. | Accuracy | 0-shot | List format (contains two types of data: text and audio) | [vocalsound_gen_base64.py](vocalsound_gen_base64.py) | \ No newline at end of file +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| vocalsound_gen | Generative task for the VocalSound dataset. ⚠️ For this dataset task, the audio path will be directly passed to the service deployment. Ensure that the service deployment supports this input format and has permission to access the audio at the specified path.
| Accuracy | 0-shot | List format (contains two types of data: text and audio) |`from ais_bench.benchmark.configs.datasets.vocalsound.vocalsound_gen import vocalsound_datasets as datasets`| [vocalsound_gen.py](vocalsound_gen.py) | +| vocalsound_gen_base64 | Generative task for the VocalSound dataset. ⚠️ For this dataset task, the audio data will be converted to Base64 format before being passed to the service deployment. Ensure that the service deployment supports this input format. | Accuracy | 0-shot | List format (contains two types of data: text and audio) |`from ais_bench.benchmark.configs.datasets.vocalsound.vocalsound_gen_base64 import vocalsound_datasets as datasets`| [vocalsound_gen_base64.py](vocalsound_gen_base64.py) | \ No newline at end of file diff --git a/ais_bench/benchmark/configs/datasets/winogrande/README.md b/ais_bench/benchmark/configs/datasets/winogrande/README.md index c8330ef0..0aed137b 100644 --- a/ais_bench/benchmark/configs/datasets/winogrande/README.md +++ b/ais_bench/benchmark/configs/datasets/winogrande/README.md @@ -39,7 +39,7 @@ rm winogrande.zip ``` ## 可用数据集任务 -|任务名称|简介|评估指标|few-shot|prompt格式|对应源码配置文件路径| -| --- | --- | --- | --- | --- | --- | -|winogrande_gen_0_shot_chat_prompt|winogrande数据集生成式任务|accuracy|0-shot|对话格式|[winogrande_gen_0_shot_chat_prompt.py](winogrande_gen_0_shot_chat_prompt.py)| -|winogrande_gen_5_shot_chat_prompt|piqa数据集生成式任务|accuracy|5-shot|对话格式|[winogrande_gen_5_shot_chat_prompt.py](winogrande_gen_5_shot_chat_prompt.py)| +|任务名称|简介|评估指标|few-shot|prompt格式|配套文件导入方式|对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | --- | +|winogrande_gen_0_shot_chat_prompt|winogrande数据集生成式任务|accuracy|0-shot|对话格式|`from ais_bench.benchmark.configs.datasets.winogrande.winogrande_gen_0_shot_chat_prompt import winogrande_datasets as datasets`|[winogrande_gen_0_shot_chat_prompt.py](winogrande_gen_0_shot_chat_prompt.py)| +|winogrande_gen_5_shot_chat_prompt|winogrande数据集生成式任务|accuracy|5-shot|对话格式|`from
ais_bench.benchmark.configs.datasets.winogrande.winogrande_gen_5_shot_chat_prompt import winogrande_datasets as datasets`|[winogrande_gen_5_shot_chat_prompt.py](winogrande_gen_5_shot_chat_prompt.py)| diff --git a/ais_bench/benchmark/configs/datasets/winogrande/README_en.md b/ais_bench/benchmark/configs/datasets/winogrande/README_en.md index 773ca037..d279a9e6 100644 --- a/ais_bench/benchmark/configs/datasets/winogrande/README_en.md +++ b/ais_bench/benchmark/configs/datasets/winogrande/README_en.md @@ -39,10 +39,10 @@ rm winogrande.zip ``` ## Available Dataset Tasks -| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Corresponding Source Code File Path | -| --- | --- | --- | --- | --- | --- | -| winogrande_gen_0_shot_chat_prompt | Generative task for the WinoGrande dataset | Accuracy | 0-shot | Chat Format | [winogrande_gen_0_shot_chat_prompt.py](winogrande_gen_0_shot_chat_prompt.py) | -| winogrande_gen_5_shot_chat_prompt | Generative task for the WinoGrande dataset (Note: The original "piqa dataset" in the introduction is a typo, corrected to "WinoGrande dataset" for consistency) | Accuracy | 5-shot | Chat Format | [winogrande_gen_5_shot_chat_prompt.py](winogrande_gen_5_shot_chat_prompt.py) | +| Task Name | Introduction | Evaluation Metric | Few-Shot | Prompt Format | Import Statement | Corresponding Source Code File Path | +| --- | --- | --- | --- | --- | --- | --- | +| winogrande_gen_0_shot_chat_prompt | Generative task for the WinoGrande dataset | Accuracy | 0-shot | Chat Format |`from ais_bench.benchmark.configs.datasets.winogrande.winogrande_gen_0_shot_chat_prompt import winogrande_datasets as datasets`| [winogrande_gen_0_shot_chat_prompt.py](winogrande_gen_0_shot_chat_prompt.py) | +| winogrande_gen_5_shot_chat_prompt | Generative task for the WinoGrande dataset | Accuracy | 5-shot | Chat Format |`from
ais_bench.benchmark.configs.datasets.winogrande.winogrande_gen_5_shot_chat_prompt import winogrande_datasets as datasets`| [winogrande_gen_5_shot_chat_prompt.py](winogrande_gen_5_shot_chat_prompt.py) | ### Note
diff --git a/ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py b/ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py
new file mode 100644
index 00000000..d4a93804
--- /dev/null
+++ b/ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py
@@ -0,0 +1,23 @@
+from mmengine.config import read_base
+from ais_bench.benchmark.partitioners import NaivePartitioner
+from ais_bench.benchmark.runners.local_api import LocalAPIRunner
+from ais_bench.benchmark.tasks import OpenICLInferTask
+
+with read_base():
+    from ais_bench.benchmark.configs.summarizers.example import summarizer
+    from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_5_shot_str import ceval_datasets as datasets
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general
+
+models = vllm_api_general
+models[0]["host_ip"] = "localhost"
+models[0]["host_port"] = 8080
+
+work_dir = "outputs/default/"
+
+infer = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalAPIRunner,
+        task=dict(type=OpenICLInferTask),
+    ),
+)
diff --git a/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py b/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py
new file mode 100644
index 00000000..8a57a1b2
--- /dev/null
+++ b/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py
@@ -0,0 +1,23 @@
+from mmengine.config import read_base
+from ais_bench.benchmark.partitioners import NaivePartitioner
+from ais_bench.benchmark.runners.local_api import LocalAPIRunner
+from ais_bench.benchmark.tasks import OpenICLInferTask
+
+with read_base():
+    from ais_bench.benchmark.configs.summarizers.example import summarizer
+    from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat
+
+models = vllm_api_stream_chat
+models[0]["host_ip"] = "localhost"
+models[0]["host_port"] = 8080
+
+work_dir = "outputs/default/"
+
+infer = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalAPIRunner,
+        task=dict(type=OpenICLInferTask),
+    ),
+)
diff --git a/ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py b/ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py
new file mode 100644
index 00000000..89334b22
--- /dev/null
+++ b/ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py
@@ -0,0 +1,28 @@
+from mmengine.config import read_base
+from ais_bench.benchmark.partitioners import NaivePartitioner
+from ais_bench.benchmark.runners.local_api import LocalAPIRunner
+from ais_bench.benchmark.tasks import OpenICLInferTask
+from ais_bench.benchmark.datasets import gsm8k_postprocess, gsm8k_dataset_postprocess
+
+with read_base():
+    from ais_bench.benchmark.configs.summarizers.example import summarizer
+    from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat
+
+models = vllm_api_general_chat
+models[0]["host_ip"] = "localhost"
+models[0]["host_port"] = 8080
+
+# 替换或修改答案的提取函数实现
+datasets[0]['eval_cfg']['pred_postprocessor'] = dict(type=gsm8k_postprocess)
+datasets[0]['eval_cfg']['dataset_postprocessor'] = dict(type=gsm8k_dataset_postprocess)
+
+work_dir = "outputs/default/"
+
+infer = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalAPIRunner,
+        task=dict(type=OpenICLInferTask),
+    ),
+)
diff --git a/ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py b/ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py
new file mode 100644
index 00000000..27ad2e71
--- /dev/null
+++ b/ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py
@@ -0,0 +1,30 @@
+from mmengine.config import read_base
+from ais_bench.benchmark.partitioners import NaivePartitioner
+from ais_bench.benchmark.runners.local_api import LocalAPIRunner
+from ais_bench.benchmark.tasks import OpenICLInferTask
+
+with read_base():
+    from ais_bench.benchmark.configs.summarizers.example import summarizer
+    from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets as datasets
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat
+
+models = vllm_api_stream_chat
+models[0]["host_ip"] = "localhost"
+models[0]["host_port"] = 8080
+models[0]["max_out_len"] = 512
+models[0]["batch_size"] = 1
+models[0]["generation_kwargs"] = dict(
+    temperature=0.01,
+    ignore_eos=False,
+    num_return_sequences=5,  # 具体作用和约束请参考文档 accuracy_metric.md
+)
+
+work_dir = "outputs/default/"
+
+infer = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalAPIRunner,
+        task=dict(type=OpenICLInferTask),
+    ),
+)
diff --git a/ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py b/ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py
new file mode 100644
index 00000000..23886ff1
--- /dev/null
+++ b/ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py
@@ -0,0 +1,33 @@
+from mmengine.config import read_base
+from ais_bench.benchmark.partitioners import NaivePartitioner
+from ais_bench.benchmark.runners.local_api import LocalAPIRunner
+from ais_bench.benchmark.tasks import OpenICLInferTask
+
+with read_base():
+    from ais_bench.benchmark.configs.summarizers.example import summarizer
+    from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets
+    from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat
+
+datasets = gsm8k_datasets + aime2024_datasets
+
+models = vllm_api_general_chat + vllm_api_stream_chat
+models[0]["host_ip"] = "localhost"
+models[0]["host_port"] = 8080
+models[1]["host_ip"] = "localhost"
+models[1]["host_port"] = 8081
+models[0]["max_out_len"] = 512
+models[0]["batch_size"] = 1
+models[1]["max_out_len"] = 512
+models[1]["batch_size"] = 1
+
+work_dir = "outputs/default/"
+
+infer = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalAPIRunner,
+        task=dict(type=OpenICLInferTask),
+    ),
+)
diff --git a/ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py b/ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py
new file mode 100644
index 00000000..a0faf378
--- /dev/null
+++ b/ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py
@@ -0,0 +1,24 @@
+from mmengine.config import read_base
+from ais_bench.benchmark.partitioners import NaivePartitioner
+from ais_bench.benchmark.runners.local_api import LocalAPIRunner
+from ais_bench.benchmark.tasks import OpenICLInferTask
+
+with read_base():
+    from ais_bench.benchmark.configs.summarizers.example import summarizer
+    from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat
+
+datasets = gsm8k_datasets
+models = vllm_api_general_chat
+models[0]["host_ip"] = "localhost"
+models[0]["host_port"] = 8080
+
+work_dir = "outputs/default/"
+
+infer = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalAPIRunner,
+        task=dict(type=OpenICLInferTask),
+    ),
+)
diff --git a/ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py b/ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py
new file mode 100644
index 00000000..23886ff1
--- /dev/null
+++ b/ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py
@@ -0,0 +1,33 @@
+from mmengine.config import read_base
+from ais_bench.benchmark.partitioners import NaivePartitioner
+from ais_bench.benchmark.runners.local_api import LocalAPIRunner
+from ais_bench.benchmark.tasks import OpenICLInferTask
+
+with read_base():
+    from ais_bench.benchmark.configs.summarizers.example import summarizer
+    from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets
+    from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat
+    from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat
+
+datasets = gsm8k_datasets + aime2024_datasets
+
+models = vllm_api_general_chat + vllm_api_stream_chat
+models[0]["host_ip"] = "localhost"
+models[0]["host_port"] = 8080
+models[1]["host_ip"] = "localhost"
+models[1]["host_port"] = 8081
+models[0]["max_out_len"] = 512
+models[0]["batch_size"] = 1
+models[1]["max_out_len"] = 512
+models[1]["batch_size"] = 1
+
+work_dir = "outputs/default/"
+
+infer = dict(
+    partitioner=dict(type=NaivePartitioner),
+    runner=dict(
+        type=LocalAPIRunner,
+        task=dict(type=OpenICLInferTask),
+    ),
+)
diff --git a/ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py b/ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py
new file mode 100644
index 00000000..cee89ab9
--- /dev/null
+++ b/ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py
@@ -0,0 +1,26 @@
+from mmengine.config import read_base
+from ais_bench.benchmark.partitioners import NaivePartitioner
+from ais_bench.benchmark.runners.local_api import LocalAPIRunner
+from ais_bench.benchmark.tasks import OpenICLInferTask
+
+with read_base():
+ from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +models = vllm_api_general_chat +models[0]["host_ip"] = "localhost" +models[0]["host_port"] = 8080 +models[0]["max_out_len"] = 512 +models[0]["batch_size"] = 1 +models[0]["generation_kwargs"] = dict(temperature=0.01, ignore_eos=False) + +work_dir = "outputs/default/" + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py b/ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py new file mode 100644 index 00000000..1450621f --- /dev/null +++ b/ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py @@ -0,0 +1,38 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_5_shot_str import ceval_datasets as datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # Replace with the actual local model weights path + tokenizer_path='THUDM/chatglm-6b', + model_kwargs=dict(device_map='auto'), + tokenizer_kwargs=dict(padding_side='left'), + generation_kwargs=dict( + temperature=0.01, + do_sample=False, + ), + max_out_len=512, + batch_size=1, + max_seq_len=2048, + batch_padding=True, + ) +] + +work_dir = 'outputs/default/' + +infer = dict( + partitioner=dict(type=NaivePartitioner), + 
runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py b/ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py new file mode 100644 index 00000000..d8887d47 --- /dev/null +++ b/ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py @@ -0,0 +1,43 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.datasets import gsm8k_postprocess, gsm8k_dataset_postprocess + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # Replace with the actual local model weights path + tokenizer_path='THUDM/chatglm-6b', + model_kwargs=dict(device_map='auto'), + tokenizer_kwargs=dict(padding_side='left'), + generation_kwargs=dict( + temperature=0.01, + do_sample=False, + ), + max_out_len=512, + batch_size=1, + max_seq_len=2048, + batch_padding=True, + ) +] + +# Key: replace or modify the answer-extraction function implementation +datasets[0]['eval_cfg']['pred_postprocessor'] = dict(type=gsm8k_postprocess) +datasets[0]['eval_cfg']['dataset_postprocessor'] = dict(type=gsm8k_dataset_postprocess) + +work_dir = 'outputs/default/' + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py b/ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py new file mode 100644 index 00000000..a8ed8259 --- /dev/null +++ 
b/ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py @@ -0,0 +1,41 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + +datasets = gsm8k_datasets + aime2024_datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # Replace with the actual local model weights path + tokenizer_path='THUDM/chatglm-6b', + model_kwargs=dict(device_map='auto'), + tokenizer_kwargs=dict(padding_side='left'), + generation_kwargs=dict( + temperature=0.01, + do_sample=False, + ), + max_out_len=512, + batch_size=1, + max_seq_len=2048, + batch_padding=True, + ) +] + +work_dir = 'outputs/default/' + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py b/ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py new file mode 100644 index 00000000..3e0ee9dd --- /dev/null +++ b/ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py @@ -0,0 +1,38 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from 
ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # Replace with the actual local model weights path + tokenizer_path='THUDM/chatglm-6b', + model_kwargs=dict(device_map='auto'), + tokenizer_kwargs=dict(padding_side='left'), + generation_kwargs=dict( + temperature=0.01, + do_sample=False, + ), + max_out_len=512, + batch_size=1, + max_seq_len=2048, + batch_padding=True, + ) +] + +work_dir = 'outputs/default/' + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/api_examples/infer_vllm_api_multi_model_multi_dataset.py b/ais_bench/configs/api_examples/infer_vllm_api_multi_model_multi_dataset.py new file mode 100644 index 00000000..d56e9617 --- /dev/null +++ b/ais_bench/configs/api_examples/infer_vllm_api_multi_model_multi_dataset.py @@ -0,0 +1,15 @@ +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_5_shot_str import mmlu_datasets as mmlu_5_shot_str + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat + mmlu_5_shot_str +models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat + 
+work_dir = 'outputs/multi_model_multi_dataset/' diff --git a/ais_bench/configs/api_examples/infer_vllm_api_with_judge_model.py b/ais_bench/configs/api_examples/infer_vllm_api_with_judge_model.py new file mode 100644 index 00000000..743b058f --- /dev/null +++ b/ais_bench/configs/api_examples/infer_vllm_api_with_judge_model.py @@ -0,0 +1,45 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.aime2025.aime2025_gen_0_shot_llmjudge import aime2025_datasets + +datasets = aime2025_datasets + +datasets[0]['judge_infer_cfg']['judge_model']['host_ip'] = 'localhost' +datasets[0]['judge_infer_cfg']['judge_model']['host_port'] = 8081 + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-judge-eval', + path="", + model="", + stream=True, + request_rate=0, + retry=2, + host_ip="localhost", + host_port=8080, + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + pred_postprocessor=dict(type=extract_non_reasoning_content), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/judge_eval/' diff --git a/ais_bench/configs/api_examples/infer_vllm_api_with_model_dataset_combinations.py b/ais_bench/configs/api_examples/infer_vllm_api_with_model_dataset_combinations.py new file mode 100644 index 00000000..1a50985a --- /dev/null +++ b/ais_bench/configs/api_examples/infer_vllm_api_with_model_dataset_combinations.py @@ 
-0,0 +1,20 @@ +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat + +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), + dict(models=[models[2]], datasets=[datasets[0], datasets[1]]), +] + +work_dir = 'outputs/custom_combinations/' diff --git a/ais_bench/configs/api_examples/perf_vllm_api_custom_dataset.py b/ais_bench/configs/api_examples/perf_vllm_api_custom_dataset.py new file mode 100644 index 00000000..d240a5cc --- /dev/null +++ b/ais_bench/configs/api_examples/perf_vllm_api_custom_dataset.py @@ -0,0 +1,66 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate +from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever +from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer +from ais_bench.benchmark.datasets import GenericDataset, AccuracyEvaluator + +with read_base(): 
+ from ais_bench.benchmark.configs.summarizers.example import summarizer + +datasets = [ + dict( + abbr='my_custom_dataset', + type=GenericDataset, + path='/path/to/your/dataset.jsonl', + reader_cfg=dict( + input_columns=['question'], + output_column='answer', + ), + infer_cfg=dict( + prompt_template=dict( + type=PromptTemplate, + template=dict( + round=[ + dict(role='HUMAN', prompt='{question}'), + ], + ), + ), + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + eval_cfg=dict( + evaluator=dict(type=AccuracyEvaluator), + ), + ) +] + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-custom-dataset', + model="", + request_rate=0, + retry=2, + host_ip="localhost", + host_port=8080, + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.5, top_k=10, top_p=0.95), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/custom_dataset/' diff --git a/ais_bench/configs/api_examples/perf_vllm_api_multiturn.py b/ais_bench/configs/api_examples/perf_vllm_api_multiturn.py new file mode 100644 index 00000000..69890e69 --- /dev/null +++ b/ais_bench/configs/api_examples/perf_vllm_api_multiturn.py @@ -0,0 +1,45 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.sharegpt.sharegpt_gen import sharegpt_datasets + +datasets = sharegpt_datasets + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, 
+ abbr="vllm-multiturn-api-chat-stream", + path="", + model="", + stream=True, + request_rate=0, + retry=2, + api_key="", + host_ip="localhost", + host_port=8080, + url="", + max_out_len=512, + batch_size=1, + trust_remote_code=False, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + pred_postprocessor=dict(type=extract_non_reasoning_content), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/multi_turn_benchmark/' diff --git a/ais_bench/configs/api_examples/perf_vllm_api_rps_distribution.py b/ais_bench/configs/api_examples/perf_vllm_api_rps_distribution.py new file mode 100644 index 00000000..4c5e3db0 --- /dev/null +++ b/ais_bench/configs/api_examples/perf_vllm_api_rps_distribution.py @@ -0,0 +1,40 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPI + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import ( + synthetic_datasets, + ) + +datasets = synthetic_datasets + +models = [ + dict( + attr="service", + type=VLLMCustomAPI, + abbr='vllm-api-rps-distribution', + path="", + model="", + stream=True, + request_rate=100, + use_timestamp=False, + retry=2, + api_key="", + host_ip="localhost", + host_port=8080, + url="", + max_out_len=512, + batch_size=1, + trust_remote_code=False, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + traffic_cfg=dict( + burstiness=0.5, + ramp_up_strategy="linear", + ramp_up_start_rps=10, + ramp_up_end_rps=200, + ), + ) +] + +work_dir = 'outputs/rps_distribution_perf/' diff --git a/ais_bench/configs/api_examples/perf_vllm_api_stable_stage.py b/ais_bench/configs/api_examples/perf_vllm_api_stable_stage.py new file mode 100644 index 00000000..4701c701 --- /dev/null +++ 
b/ais_bench/configs/api_examples/perf_vllm_api_stable_stage.py @@ -0,0 +1,35 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPI + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import ( + synthetic_datasets, + ) + +datasets = synthetic_datasets + +models = [] +for rate in [0, 5, 10, 20]: + model_cfg = dict( + attr="service", + type=VLLMCustomAPI, + abbr=f'vllm-api-steady-rate-{rate}', + path="", + model="", + stream=True, + request_rate=rate, + use_timestamp=False, + retry=2, + api_key="", + host_ip="localhost", + host_port=8080, + url="", + max_out_len=512, + batch_size=1, + trust_remote_code=False, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + ) + models.append(model_cfg) + +work_dir = 'outputs/steady_state_perf/' diff --git a/ais_bench/configs/api_examples/perf_vllm_api_synthetic.py b/ais_bench/configs/api_examples/perf_vllm_api_synthetic.py new file mode 100644 index 00000000..1996719e --- /dev/null +++ b/ais_bench/configs/api_examples/perf_vllm_api_synthetic.py @@ -0,0 +1,48 @@ +from mmengine.config import read_base +from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate +from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever +from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer +from ais_bench.benchmark.datasets import SyntheticDataset, MATHEvaluator, math_postprocess_v2 + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import ( + models as vllm_api_general_stream, + ) + +synthetic_config = { + "Type": "string", + "RequestCount": 100, + "TrustRemoteCode": False, + "StringConfig": { + "Input": { + "Method": "uniform", + "Params": {"MinValue": 1, "MaxValue": 500} + }, + "Output": { + "Method": "gaussian", + "Params": 
{"Mean": 200, "Var": 100, "MinValue": 1, "MaxValue": 500} + } + }, +} + +datasets = [ + dict( + abbr='synthetic_custom', + type=SyntheticDataset, + config=synthetic_config, + reader_cfg=dict(input_columns=['question', 'max_out_len'], output_column='answer'), + infer_cfg=dict( + prompt_template=dict(type=PromptTemplate, template="{question}"), + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + eval_cfg=dict( + evaluator=dict(type=MATHEvaluator, version='v2'), + pred_postprocessor=dict(type=math_postprocess_v2), + ), + ) +] + +models = vllm_api_general_stream +work_dir = 'outputs/synthetic_perf_custom/' diff --git a/ais_bench/configs/hf_example/infer_hf_multi_model_multi_dataset.py b/ais_bench/configs/hf_example/infer_hf_multi_model_multi_dataset.py new file mode 100644 index 00000000..a5e115b3 --- /dev/null +++ b/ais_bench/configs/hf_example/infer_hf_multi_model_multi_dataset.py @@ -0,0 +1,46 @@ +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFaceBaseModel +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_chat_prompt import gsm8k_datasets as gsm8k_0_shot_cot_chat + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + +datasets = [*gsm8k_0_shot_cot_chat] + [*math500_gen_0_shot_cot_chat] + +models = [ + dict( + type=HuggingFaceBaseModel, + abbr='hf-base-model', + path='THUDM/chatglm-6b', + tokenizer_path='THUDM/chatglm-6b', + model_kwargs=dict(device_map='auto'), + tokenizer_kwargs=dict(padding_side='left'), + generation_kwargs=dict( + temperature=0.5, + top_k=10, + top_p=0.95, + do_sample=True, + seed=None, + 
repetition_penalty=1.03, + ), + max_out_len=100, + batch_size=1, + max_seq_len=2048, + batch_padding=True, + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/hf-multi-model-multi-dataset/' diff --git a/ais_bench/configs/lmm_example/infer_lmm_multi_dataset.py b/ais_bench/configs/lmm_example/infer_lmm_multi_dataset.py new file mode 100644 index 00000000..511416d1 --- /dev/null +++ b/ais_bench/configs/lmm_example/infer_lmm_multi_dataset.py @@ -0,0 +1,12 @@ +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.textvqa.textvqa_gen_0_shot_str import textvqa_datasets + from ais_bench.benchmark.configs.datasets.docvqa.docvqa_gen_0_shot_str import docvqa_datasets + from ais_bench.benchmark.configs.models.lmm_models.lmm_vllm_api_chat import models as lmm_vllm_api_chat + +datasets = textvqa_datasets + docvqa_datasets +models = lmm_vllm_api_chat + +work_dir = 'outputs/lmm_multi_dataset/' diff --git a/ais_bench/configs/model_api_test_en.py b/ais_bench/configs/model_api_test_en.py new file mode 100644 index 00000000..c660e546 --- /dev/null +++ b/ais_bench/configs/model_api_test_en.py @@ -0,0 +1,36 @@ +from mmengine.config import read_base + +with read_base(): +# Model tasks: choose one of them; for other model tasks, see https://ais-bench-benchmark-rf.readthedocs.io/en/latest/base_tutorials/all_params/models.html + # vllm_api_general is the base model; it only supports text generation + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + # vllm_api_general_chat is the chat model; it supports chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + # vllm_api_stream_chat is the stream chat model; it supports stream 
chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + # vllm_api_general_stream is the stream model; it supports stream generation + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + +# Dataset tasks: for more dataset tasks, see https://ais-bench-benchmark-rf.readthedocs.io/en/latest/get_started/datasets.html + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = vllm_api_general_chat + +models[0]["path"] = "" # Specify the absolute path of the model serialized vocabulary file (generally not required for accuracy testing scenarios) +models[0]["model"] = "" # Specify the name of the model loaded on the server, configured according to the actual model name pulled by the VLLM inference service (configure as an empty string to get it automatically) +models[0]["request_rate"] = 0 # Request sending frequency: send 1 request to the server every 1/request_rate seconds; if less than 0.001, all requests are sent at once +models[0]["api_key"] = "" # Custom API key, default is an empty string +models[0]["host_ip"] = "localhost" # Specify the IP of the inference service +models[0]["host_port"] = 8080 # Specify the port of the inference service +models[0]["url"] = "" # Custom URL path for accessing the inference service (needs to be configured when the base URL is not a combination of http://host_ip:host_port; after configuration, host_ip and host_port will be ignored) +models[0]["max_out_len"] = 512 # Maximum number of tokens output by the inference service +models[0]["batch_size"] = 1 # Maximum concurrency for sending requests +models[0]["trust_remote_code"] = False # Whether the tokenizer trusts remote code, default is False +models[0]["generation_kwargs"] = dict( # Model inference parameters, configured with reference to the VLLM documentation; the AISBench evaluation tool does not process 
them and attaches them to the sent request + temperature=0.01, + ignore_eos=False, +) + +# datasets[0]["path"] = ais_bench/datasets/gsm8k # Specify the absolute path of the dataset directory (required for accuracy testing scenarios) + +work_dir = 'outputs/default/' # Specify the working directory for saving task results and logs (default is outputs/default/) diff --git a/ais_bench/configs/model_api_test_zh_cn.py b/ais_bench/configs/model_api_test_zh_cn.py new file mode 100644 index 00000000..400c95c5 --- /dev/null +++ b/ais_bench/configs/model_api_test_zh_cn.py @@ -0,0 +1,36 @@ +from mmengine.config import read_base + +with read_base(): +# 模型任务,选择其中一个,其他模型任务参考:https://ais-bench-benchmark-rf.readthedocs.io/zh-cn/latest/base_tutorials/all_params/models.html 获取更多数据集任务 + # vllm_api_general 是基础模型,仅支持文本生成 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + # vllm_api_general_chat 是对话模型,支持对话 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + # vllm_api_stream_chat 是流式对话模型,支持流式对话 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + # vllm_api_general_stream 是流式模型,支持流式生成 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream +# 数据集任务,参考:https://ais-bench-benchmark-rf.readthedocs.io/zh-cn/latest/get_started/datasets.html 获取更多数据集任务 + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +# datasets = +models = vllm_api_general_chat + +models[0]["path"] = "" # 指定模型序列化词表文件的绝对路径(精度测试场景一般不需要配置) +models[0]["model"] = "" # 指定服务端加载的模型名称,根据 VLLM 推理服务实际拉取的模型名称配置(配置为空字符串则自动获取) +models[0]["request_rate"] = 0 # 请求发送频率:每 1/request_rate 秒向服务端发送 1 条请求;小于 0.001 时一次性发送所有请求 +models[0]["api_key"] = "" # 自定义 API key,默认为空字符串 +models[0]["host_ip"] = "localhost" # 指定推理服务的 IP +models[0]["host_port"] = 
8080 # 指定推理服务的端口 +models[0]["url"] = "" # 自定义访问推理服务的 URL 路径(当基础 URL 不是 http://host_ip:host_port 的组合时需要配置;配置后 host_ip 和 host_port 将被忽略) +models[0]["max_out_len"] = 512 # 推理服务输出的最大 token 数 +models[0]["batch_size"] = 1 # 发送请求的最大并发数 +models[0]["trust_remote_code"] = False # tokenizer 是否信任远程代码,默认为 False +models[0]["generation_kwargs"] = dict( # 模型推理参数,参考 VLLM 文档配置;AISBench 评测工具不做处理,直接附加到发送的请求中 + temperature=0.01, + ignore_eos=False, +) + +# datasets[0]["path"] = ais_bench/datasets/gsm8k # 指定数据集目录的绝对路径(精度测试场景需要配置) + +work_dir = 'outputs/default/' # 指定任务结果和日志的保存工作目录(默认为 outputs/default/) diff --git a/ais_bench/configs/performance_benchmark/fixed_prompts_zh_cn.py b/ais_bench/configs/performance_benchmark/fixed_prompts_zh_cn.py new file mode 100644 index 00000000..2522cc17 --- /dev/null +++ b/ais_bench/configs/performance_benchmark/fixed_prompts_zh_cn.py @@ -0,0 +1,26 @@ +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_stream_chat +models[0]["host_ip"] = "localhost" +models[0]["host_port"] = 8080 +models[0]["max_out_len"] = 512 +models[0]["batch_size"] = 1 +models[0]["generation_kwargs"] = dict(temperature=0.01, ignore_eos=True) + +work_dir = "outputs/default/" + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/performance_benchmark/multi_task_synthetic_zh_cn.py b/ais_bench/configs/performance_benchmark/multi_task_synthetic_zh_cn.py 
new file mode 100644 index 00000000..eb51c86d --- /dev/null +++ b/ais_bench/configs/performance_benchmark/multi_task_synthetic_zh_cn.py @@ -0,0 +1,79 @@ +import copy +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import synthetic_datasets as base_synthetic_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as base_vllm_api_stream_chat + +# Key: gather the five parameters batch_size / request_rate / request_count / input_range / output_range in one place +# Append one element to each list per task you want to configure (lists within the same group must have equal length) +# Note: the models and datasets lists must have equal length; they are paired one-to-one by index, not as a Cartesian product +tasks_params = { + "models": { + "batch_size": [1, 2, 4, 8, 16, 32], + "request_rate": [0, 0, 0, 0, 0, 0], + }, + "datasets": { + "request_count": [100, 100, 100, 100, 100, 100], + "input_range": [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64)], + "output_range": [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64)], + }, +} + +# Key: deepcopy the same base model config and override batch_size / request_rate in bulk from tasks_params["models"] +models = [] +for idx, (batch_size, request_rate) in enumerate(zip(tasks_params["models"]["batch_size"], + tasks_params["models"]["request_rate"])): + model_cfg = copy.deepcopy(base_vllm_api_stream_chat[0]) + model_cfg["abbr"] = f"vllm-api-stream-chat-bs{batch_size}-rr{request_rate}" + model_cfg["host_ip"] = "localhost" + model_cfg["host_port"] = 8080 + model_cfg["max_out_len"] = 512 + model_cfg["batch_size"] = batch_size + model_cfg["request_rate"] = request_rate + # Key: each model task gets its own generation_kwargs + model_cfg["generation_kwargs"] = dict(temperature=0.01, ignore_eos=True) + models.append(model_cfg) + +# Key: build the synthetic dataset tasks in bulk from tasks_params["datasets"]; names are generated from the index +datasets = [] +for idx, (request_count, input_range, output_range) in enumerate( + zip(tasks_params["datasets"]["request_count"], + tasks_params["datasets"]["input_range"], + tasks_params["datasets"]["output_range"]) +): + ds = dict(base_synthetic_datasets[0]) + ds["abbr"] = f"synthetic-string-{idx}" + ds["config"] = { + "Type": "string", + "RequestCount": request_count, + "StringConfig": { + "Input": { + "Method": "uniform", + "Params": {"MinValue": input_range[0], "MaxValue": input_range[1]}, + }, + "Output": { + "Method": "uniform", + "Params": {"MinValue": output_range[0], "MaxValue": output_range[1]}, + }, + }, + } + datasets.append(ds) + +# Key: pair models[i] with datasets[i] by index to avoid the Cartesian product +# For example, models[0] (batch_size=1) is paired only with datasets[0] (input_range=(1, 2)), not crossed with every dataset +model_dataset_combinations = [ + dict(models=[models[idx]], datasets=[datasets[idx]]) + for idx in range(min(len(models), len(datasets))) +] + +work_dir = "outputs/default/" + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict(type=LocalAPIRunner, task=dict(type=OpenICLInferTask)), +) \ No newline at end of file diff --git a/ais_bench/configs/performance_benchmark/multi_task_zh_cn.py b/ais_bench/configs/performance_benchmark/multi_task_zh_cn.py new file mode 100644 index 00000000..0d87718b --- /dev/null +++ b/ais_bench/configs/performance_benchmark/multi_task_zh_cn.py @@ -0,0 +1,33 @@ +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_str import aime2024_datasets + from 
ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets + +models = vllm_api_general_stream + vllm_api_stream_chat +models[0]["host_ip"] = "localhost" +models[0]["host_port"] = 8080 +models[1]["host_ip"] = "localhost" +models[1]["host_port"] = 8081 +models[0]["max_out_len"] = 512 +models[0]["batch_size"] = 1 +models[1]["max_out_len"] = 512 +models[1]["batch_size"] = 1 + +work_dir = "outputs/default/" + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/performance_benchmark/perf_recalculate_zh_cn.py b/ais_bench/configs/performance_benchmark/perf_recalculate_zh_cn.py new file mode 100644 index 00000000..7388a13a --- /dev/null +++ b/ais_bench/configs/performance_benchmark/perf_recalculate_zh_cn.py @@ -0,0 +1,37 @@ +from mmengine.config import read_base +from ais_bench.benchmark.summarizers import DefaultPerfSummarizer +from ais_bench.benchmark.calculators import DefaultPerfMetricCalculator +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +# Key: customize stats_list in the result summarizer to choose which performance statistics are reported +summarizer = dict( + attr="performance", + type=DefaultPerfSummarizer, + calculator=dict( + type=DefaultPerfMetricCalculator, + stats_list=["Average", "Min", "Max", "Median", "P75", "P90", "P95", "P99"], + ) +) + +models = vllm_api_stream_chat +models[0]["host_ip"] = "localhost" 
+models[0]["host_port"] = 8080 +models[0]["max_out_len"] = 512 +models[0]["batch_size"] = 1 +models[0]["generation_kwargs"] = dict(temperature=0.01, ignore_eos=True) + +work_dir = "outputs/default/" + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/performance_benchmark/single_task_zh_cn.py b/ais_bench/configs/performance_benchmark/single_task_zh_cn.py new file mode 100644 index 00000000..2522cc17 --- /dev/null +++ b/ais_bench/configs/performance_benchmark/single_task_zh_cn.py @@ -0,0 +1,26 @@ +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_stream_chat +models[0]["host_ip"] = "localhost" +models[0]["host_port"] = 8080 +models[0]["max_out_len"] = 512 +models[0]["batch_size"] = 1 +models[0]["generation_kwargs"] = dict(temperature=0.01, ignore_eos=True) + +work_dir = "outputs/default/" + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/ais_bench/configs/performance_benchmark/synthetic_gen_string_zh_cn.py b/ais_bench/configs/performance_benchmark/synthetic_gen_string_zh_cn.py new file mode 100644 index 00000000..8af962cb --- /dev/null +++ b/ais_bench/configs/performance_benchmark/synthetic_gen_string_zh_cn.py @@ -0,0 +1,49 @@ +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import 
NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import synthetic_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +# Key: customize the input/output length distribution (adjust it by editing synthetic_config) +synthetic_config = { + "Type": "string", + "RequestCount": 1000, + "StringConfig": { + "Input": { + "Method": "uniform", + "Params": {"MinValue": 50, "MaxValue": 500} + }, + "Output": { + "Method": "uniform", + "Params": {"MinValue": 20, "MaxValue": 200} + } + } +} + +datasets = [] +for ds in synthetic_datasets: + ds = dict(ds) + ds["config"] = synthetic_config + datasets.append(ds) + +models = vllm_api_stream_chat +# Key: for performance testing, set ignore_eos to True so that generation reaches the maximum output length +models[0]["host_ip"] = "localhost" +models[0]["host_port"] = 8080 +models[0]["max_out_len"] = 512 +models[0]["batch_size"] = 1 +models[0]["generation_kwargs"] = dict(temperature=0.01, ignore_eos=True) + +work_dir = "outputs/default/" + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + task=dict(type=OpenICLInferTask), + ), +) diff --git a/docs/requirements.txt b/docs/requirements.txt index 587d3c6c..b9ecd672 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -3,4 +3,5 @@ sphinx-rtd-theme sphinx-intl m2r2 linkify-it-py -myst_parser \ No newline at end of file +myst_parser +sphinx_design \ No newline at end of file diff --git a/docs/source_en/advanced_tutorials/custom_dataset.md b/docs/source_en/advanced_tutorials/custom_dataset.md index 3e420093..e1e84b27 100644 --- a/docs/source_en/advanced_tutorials/custom_dataset.md +++ b/docs/source_en/advanced_tutorials/custom_dataset.md @@ -133,6 +133,9 @@ datasets = [ ``` +> 💡 The above config file method is 
essentially a simplified application of the [Custom Config File Method](run_custom_config.md). For more complex scenarios (such as multi-model/multi-dataset combinations, custom model parameters, judge models, etc.), refer to the "Custom Dataset Evaluation" example in [Running AISBench with Custom Config Files](run_custom_config.md#custom-config-file-examples-for-various-scenarios). + + ### Guide to Using Dataset Supplementary Info (`.meta.json`) This feature currently only supports **performance evaluation scenarios**. The `ais_bench` system will automatically attempt to parse the input dataset file, so in most cases, a `.meta.json` file is **not required**. However, if the original dataset does not specify `max_tokens`, or if you need to configure data sampling, you must define these settings in a `.meta.json` file. diff --git a/docs/source_en/advanced_tutorials/judge_model_evaluate.md b/docs/source_en/advanced_tutorials/judge_model_evaluate.md index 9d383fab..9aef6df3 100644 --- a/docs/source_en/advanced_tutorials/judge_model_evaluate.md +++ b/docs/source_en/advanced_tutorials/judge_model_evaluate.md @@ -222,6 +222,10 @@ The result display example is as follows: From the quick start section of the judge model, you can see that except for the additional need to modify the judge model configuration in the data configuration file, the other evaluation execution methods are exactly the same as the conventional evaluation execution methods. Therefore, the execution methods for other accuracy evaluation function scenarios are also exactly the same. +## Implement via Custom Config Files + +> 💡 The above judge model evaluation scenario can also be implemented through the [Custom Config File Method](run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax such as loops, conditional judgments, list comprehensions, etc. 
You can write the tested model, judge model, dataset, summarizer, and other configurations into a single file, write once and reuse multiple times. See the "Judge Model Evaluation" example in [Running AISBench with Custom Config Files](run_custom_config.md#custom-config-file-examples-for-various-scenarios). + ### Multi-task Evaluation Refer to [Accuracy Evaluation Scenario Multi-task Evaluation](../base_tutorials/scenes_intro/accuracy_benchmark.md#multi-task-evaluation) diff --git a/docs/source_en/advanced_tutorials/multimodal_benchmark.md b/docs/source_en/advanced_tutorials/multimodal_benchmark.md index a4fca202..2a87980d 100644 --- a/docs/source_en/advanced_tutorials/multimodal_benchmark.md +++ b/docs/source_en/advanced_tutorials/multimodal_benchmark.md @@ -28,6 +28,9 @@ Supported Model backend ## Quick Start + +> 💡 The multimodal evaluation scenario can also be implemented through the [Custom Config File Method](run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax such as loops, conditional judgments, list comprehensions, etc. You can write multimodal models, multimodal datasets, summarizer, and other configurations into a single file, write once and reuse multiple times. See [Running AISBench with Custom Config Files](run_custom_config.md). + ### Multimodal input format There are various formats for service-oriented multimodal data input. 
Taking image + text input as an example, it is as follows: - Method 1: Local file format, default method diff --git a/docs/source_en/advanced_tutorials/multiturn_benchmark.md b/docs/source_en/advanced_tutorials/multiturn_benchmark.md index bbca822d..f7aa9c85 100644 --- a/docs/source_en/advanced_tutorials/multiturn_benchmark.md +++ b/docs/source_en/advanced_tutorials/multiturn_benchmark.md @@ -172,6 +172,11 @@ After executing the AISBench command, detailed task execution data is saved to a This log indicates that detailed task data is stored in `outputs/default/20250628_151326` (relative to the directory where the command was executed). +## Implement via Custom Config Files + +> 💡 The above multi-turn dialogue performance evaluation scenario can also be implemented through the [Custom Config File Method](run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax such as loops, conditional judgments, list comprehensions, etc. You can write models, datasets, summarizer, and other configurations into a single file, write once and reuse multiple times. See the "Multi-Turn Dialogue Performance Evaluation" example in [Running AISBench with Custom Config Files](run_custom_config.md#custom-config-file-examples-for-various-scenarios). 
+ +### Viewing Detailed Performance Data ```shell 20250628_151326 # Unique directory generated for each experiment based on timestamp ├── configs # Auto-saved dump of all configuration files @@ -181,7 +186,7 @@ This log indicates that detailed task data is stored in `outputs/default/2025062 └── vllm-api-chat-stream/ # Name of the "service-based model configuration" (corresponds to the `abbr` parameter in the model task configuration file) ├── sharegptdataset.csv # Per-request performance output (CSV), matching the "Performance Parameters" table in the printed results ├── sharegptdataset.json # End-to-end performance output (JSON), matching the "Common Metric" table in the printed results - ├── sharegptdataset_details.h5 # Full打点 ITL data (Inter-Token Latency) + ├── sharegptdataset_details.h5 # Full-granularity ITL data (Inter-Token Latency) ├── sharegptdataset_details.json # Full detailed metrics └── sharegptdataset_plot.html # Request concurrency visualization report (HTML) ``` diff --git a/docs/source_en/advanced_tutorials/rps_distribution.md b/docs/source_en/advanced_tutorials/rps_distribution.md index 51d5d474..83ee8506 100644 --- a/docs/source_en/advanced_tutorials/rps_distribution.md +++ b/docs/source_en/advanced_tutorials/rps_distribution.md @@ -324,6 +324,9 @@ $\lambda_i = \lambda_{\text{start}} \times \left(\frac{\lambda_{end}}{\lambda_{s 4. **In stress testing scenarios, the frequency of connection creation is controlled, but not the request sending rate (after each connection is created, requests are sent and responses are processed continuously without interruption)**. 5. **In multi-turn dialogue scenarios, only the request distribution of the first turn is valid**. +## Implement via Custom Config Files + +> 💡 The above RPS distribution control parameters (`traffic_cfg`) are also applicable in the [Custom Config File Method](run_custom_config.md). You only need to add the `traffic_cfg` field in the model configuration dict. 
The configuration file is essentially a Python script that supports all Python syntax such as loops, conditional judgments, list comprehensions, etc. You can write models, datasets, summarizer, and other configurations into a single file, write once and reuse multiple times. See [Running AISBench with Custom Config Files](run_custom_config.md). --- diff --git a/docs/source_en/advanced_tutorials/run_custom_config.md b/docs/source_en/advanced_tutorials/run_custom_config.md index be417747..e37efb64 100644 --- a/docs/source_en/advanced_tutorials/run_custom_config.md +++ b/docs/source_en/advanced_tutorials/run_custom_config.md @@ -1,6 +1,204 @@ # Running AISBench with a Custom Configuration File The standard command invocation method for AISBench specifies the model task via `--models`, the dataset task via `--datasets`, and the result presentation task via `--summarizer` to run an evaluation task. Additionally, AISBench supports specifying a **custom configuration file** that combines the configuration information of these three types of tasks, enabling the execution of custom task combinations. +## Why Use a Custom Configuration File +AISBench provides two ways to run tasks: **Command-Line Interface (CLI)** and **custom configuration file**. 
In actual use, it is recommended to prioritize the custom configuration file approach, for the following reasons: + +| Comparison Dimension | CLI Approach | Configuration File Approach | +| --- | --- | --- | +| **Reusability** | The complete command must be re-entered for each run | Configuration files can be saved, version-managed, and reused repeatedly | +| **Expressiveness** | Only model/dataset names can be specified via parameters | Allows precise control over all details including model parameters, dataset sampling range, and inference configuration | +| **Combination Flexibility** | Only Cartesian product combinations are supported | Supports `model_dataset_combinations` for arbitrary custom model-dataset pairings | +| **Parameter Override** | Internal parameters of preset models/datasets cannot be modified | Any field such as `abbr`, `test_range`, `host_ip`, `host_port` can be modified directly | +| **Batch Execution** | Requires running the command multiple times | A single configuration file can run multiple model and dataset combinations at once | +| **Team Collaboration** | Commands are hard to share and trace | Configuration files are code and can be committed to a repository for review and reuse | + +**Summary**: The CLI approach is suitable for quick validation, while the configuration file approach is suitable for formal, reproducible, and complex evaluation scenarios. + +## Configuration Files Are Python Scripts +The AISBench custom configuration file is essentially a Python script. This means you can use all Python syntax features in the configuration file to flexibly construct evaluation tasks. 
+ +### Using `for` Loop to Batch Build Model Configurations + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + +datasets = gsm8k_0_shot_cot_str + +models = [] +for port in [8080, 8081, 8082]: + models.append( + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr=f'vllm-api-chat-port-{port}', + path="", + model="", + request_rate=0, + retry=2, + host_ip="localhost", + host_port=port, + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.5, top_k=10, top_p=0.95), + ) + ) + +work_dir = 'outputs/multi_port_benchmark/' +``` + +### Using List Comprehension to Batch Add Datasets + +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat +datasets = [ + dict(d, abbr=f'my_{d["abbr"]}', reader_cfg=dict(d.get('reader_cfg', {}), test_range='[0:100]')) + for d in datasets +] + +models = vllm_api_general_chat +work_dir = 'outputs/my_benchmark/' +``` + +### Conditional Configuration: Switch Based on Environment Variables + +```python +import os +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat, VLLMCustomAPI + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from 
ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + +datasets = gsm8k_0_shot_cot_str + +use_stream = os.environ.get('USE_STREAM', 'false').lower() == 'true' +model_type = VLLMCustomAPIChat if use_stream else VLLMCustomAPI + +models = [ + dict( + attr="service", + type=model_type, + abbr='vllm-api-conditional', + path="", + model="", + stream=use_stream, + request_rate=0, + retry=2, + host_ip=os.environ.get('HOST_IP', 'localhost'), + host_port=int(os.environ.get('HOST_PORT', '8080')), + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.5, top_k=10, top_p=0.95), + ) +] + +work_dir = 'outputs/conditional_benchmark/' +``` + +### Using `.copy()` to Reuse and Modify Model Configurations + +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +datasets = gsm8k_0_shot_cot_str + +# dict.copy() is shallow: copy each model dict and rebuild generation_kwargs instead of mutating the shared one +model_high_temp = vllm_api_general_chat[0].copy() +model_high_temp['abbr'] = vllm_api_general_chat[0]['abbr'] + '-high-temp' +model_high_temp['generation_kwargs'] = dict(vllm_api_general_chat[0]['generation_kwargs'], temperature=0.9) + +model_low_temp = vllm_api_general_chat[0].copy() +model_low_temp['abbr'] = vllm_api_general_chat[0]['abbr'] + '-low-temp' +model_low_temp['generation_kwargs'] = dict(vllm_api_general_chat[0]['generation_kwargs'], temperature=0.1) + +models = [model_high_temp, model_low_temp] +work_dir = 'outputs/temperature_comparison/' +``` + +## Complete Configuration File Variable Reference +The following top-level variables can be defined in the custom configuration file. Most of them are optional, but `models` and `datasets` must be defined to run an inference task. 
+ +| Variable Name | Type | Required | Description | +| --- | --- | --- | --- | +| `models` | `list[dict]` | Yes (for inference) | List of model configurations. Each element is a dict that must at least include `type` (model class) and `abbr` (unique identifier) fields. Service-oriented models additionally require `attr="service"`, `host_ip`, `host_port`, etc.; local models additionally require `path`, `tokenizer_path`, etc. | +| `datasets` | `list[dict]` | Yes (for inference) | List of dataset configurations. Each element is a dict that must at least include `type` (dataset class), `abbr` (unique identifier), `reader_cfg`, `infer_cfg`, and `eval_cfg` fields | +| `summarizer` | `dict` | No | Result summarizer configuration. Usually imported from `ais_bench.benchmark.configs.summarizers.example`. Contains `attr` and `summary_groups` fields | +| `model_dataset_combinations` | `list[dict]` | No | List of custom model-dataset pairings. Each element is `dict(models=[...], datasets=[...])`. When not specified, the Cartesian product of `models` and `datasets` is used by default | +| `work_dir` | `str` | No | Working directory; inference results and logs will be output to this directory. Defaults to `outputs/default/` | +| `infer` | `dict` | No | Inference process configuration. Contains `partitioner` (partitioner) and `runner` (runner, with `max_num_workers` and `task` inside). Uses the default inference process when not specified | +| `eval` | `dict` | No | Evaluation process configuration. Same structure as `infer`. Only used when an independent evaluation phase is needed (e.g., SWE-Bench, VBench scenarios) | + +### Detailed `models` Field Description + +Common fields for each model configuration dict: + +| Field | Type | Description | +| --- | --- | --- | +| `type` | class | Model class, such as `VLLMCustomAPIChat`, `VLLMCustomAPI`, `HuggingFaceBaseModel`, `HuggingFacewithChatTemplate`, etc. 
| +| `abbr` | `str` | Unique identifier of the model, used as the column name in the result table. Model-dataset combinations with the same `abbr` in the same configuration file will be treated as duplicate tasks and skipped | +| `attr` | `str` | Model attribute; `"service"` for service-oriented models, `"local"` for local models | +| `path` | `str` | Model path (required for local models; can be an empty string for service-oriented models) | +| `model` | `str` | Model name specified for service-oriented inference | +| `host_ip` | `str` | IP address of the inference service (for service-oriented models) | +| `host_port` | `int` | Port of the inference service (for service-oriented models) | +| `stream` | `bool` | Whether to use streaming inference | +| `max_out_len` | `int` | Maximum output token count | +| `batch_size` | `int` | Inference batch size | +| `max_seq_len` | `int` | Maximum input sequence length | +| `request_rate` | `int` | Request rate limit; 0 means unlimited | +| `retry` | `int` | Number of retries for failed requests | +| `generation_kwargs` | `dict` | Generation parameters, such as `temperature`, `top_k`, `top_p`, `seed`, etc. | +| `tokenizer_path` | `str` | Tokenizer path (for local models) | +| `model_kwargs` | `dict` | Model loading parameters (for local models), such as `device_map` | +| `tokenizer_kwargs` | `dict` | Tokenizer parameters (for local models), such as `padding_side` | +| `run_cfg` | `dict` | Multi-GPU/multi-machine run configuration (for local models), such as `dict(num_gpus=1, num_procs=1)` | +| `pred_postprocessor` | `dict` | Model output post-processor, such as `dict(type=extract_non_reasoning_content)` | + +### Detailed `datasets` Field Description + +Common fields for each dataset configuration dict: + +| Field | Type | Description | +| --- | --- | --- | +| `type` | class | Dataset class, such as `GSM8KDataset`, `MATHDataset`, `SyntheticDataset`, etc. 
| +| `abbr` | `str` | Unique identifier of the dataset, used as the row name in the result table | +| `path` | `str` | Dataset file path | +| `reader_cfg` | `dict` | Reader configuration, containing `input_columns`, `output_column`, and optional `test_range` to control the sampling range (e.g., `'[0:100]'`) | +| `infer_cfg` | `dict` | Inference configuration, containing `prompt_template`, `retriever`, `inferencer` | +| `eval_cfg` | `dict` | Evaluation configuration, containing `evaluator` and optional `pred_postprocessor` | +| `judge_infer_cfg` | `dict` | Judge model inference configuration (for datasets requiring LLM Judge), containing `judge_model`, `judge_dataset_type`, `prompt_template`, `retriever`, `inferencer` | + +### Detailed `infer` Field Description + +```python +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) +``` ## Usage Instructions ```bash @@ -10,6 +208,460 @@ ais_bench ais_bench/configs/api_examples/infer_vllm_api_general.py ``` +## Custom Configuration File Examples for Each Scenario + +### 1. Service-Oriented Accuracy Evaluation +Access the inference service via API and perform accuracy evaluation using real datasets. Applicable to service-oriented deployment scenarios such as vLLM, MindIE, TGI, Triton, etc. 
+ +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_chat_prompt import gsm8k_datasets as gsm8k_0_shot_cot_chat + +datasets = [*gsm8k_0_shot_cot_chat] + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-general-chat', + model="", + request_rate=0, + retry=2, + host_ip="localhost", + host_port=8080, + max_out_len=512, + batch_size=1, + generation_kwargs=dict( + temperature=0.5, + top_k=10, + top_p=0.95, + seed=None, + repetition_penalty=1.03, + ) + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/api-vllm-general-chat/' +``` + +### 2. Pure Model Accuracy Evaluation +Use a HuggingFace local model for direct inference and evaluation without deploying a service. 
+ +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFaceBaseModel +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_chat_prompt import gsm8k_datasets as gsm8k_0_shot_cot_chat + +datasets = [*gsm8k_0_shot_cot_chat] + +models = [ + dict( + type=HuggingFaceBaseModel, + abbr='hf-base-model', + path='THUDM/chatglm-6b', + tokenizer_path='THUDM/chatglm-6b', + model_kwargs=dict(device_map='auto'), + tokenizer_kwargs=dict(padding_side='left'), + generation_kwargs=dict( + temperature=0.5, + top_k=10, + top_p=0.95, + do_sample=True, + seed=None, + repetition_penalty=1.03, + ), + max_out_len=100, + batch_size=1, + max_seq_len=2048, + batch_padding=True, + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/hf-base-model/' +``` + +### 3. Service-Oriented Performance Evaluation +Use a synthetic dataset to perform performance stress testing on the inference service, outputting metrics such as TTFT (Time To First Token), TPOT (Time Per Output Token), and E2EL (End-to-End Latency). 
+ +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import ( + synthetic_datasets, + ) + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import ( + models as vllm_api_general_stream, + ) + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import ( + models as vllm_api_stream_chat, + ) + +datasets = synthetic_datasets + +vllm_api_general_stream[0]["abbr"] = "demo-" + vllm_api_general_stream[0]["abbr"] +vllm_api_stream_chat[0]["abbr"] = "demo-" + vllm_api_stream_chat[0]["abbr"] + +models = vllm_api_general_stream + vllm_api_stream_chat + +work_dir = "outputs/demo_api-vllm-stream-perf/" +``` + +Run command: + +```bash +ais_bench ais_bench/configs/api_examples/demo_infer_vllm_api_perf.py -m perf +``` + +### 4. Synthetic Dataset Performance Evaluation +Customize the parameters of the synthetic dataset to control the number of requests and the input/output token length distribution. 
+ +```python +from mmengine.config import read_base +from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate +from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever +from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer +from ais_bench.benchmark.datasets import SyntheticDataset, MATHEvaluator, math_postprocess_v2 + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import ( + models as vllm_api_general_stream, + ) + +synthetic_config = { + "Type": "string", + "RequestCount": 100, + "TrustRemoteCode": False, + "StringConfig": { + "Input": { + "Method": "uniform", + "Params": {"MinValue": 1, "MaxValue": 500} + }, + "Output": { + "Method": "gaussian", + "Params": {"Mean": 200, "Var": 100, "MinValue": 1, "MaxValue": 500} + } + }, +} + +datasets = [ + dict( + abbr='synthetic_custom', + type=SyntheticDataset, + config=synthetic_config, + reader_cfg=dict(input_columns=['question', 'max_out_len'], output_column='answer'), + infer_cfg=dict( + prompt_template=dict(type=PromptTemplate, template="{question}"), + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + eval_cfg=dict( + evaluator=dict(type=MATHEvaluator, version='v2'), + pred_postprocessor=dict(type=math_postprocess_v2), + ), + ) +] + +models = vllm_api_general_stream +work_dir = 'outputs/synthetic_perf_custom/' +``` + +### 5. Multi-Model Multi-Dataset Combinations +Simultaneously evaluate the performance of multiple models on multiple datasets, automatically combined via the Cartesian product. 
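As a rough illustration of what "combined via the Cartesian product" means, the standalone sketch below enumerates the pairs for three models and three datasets. The `abbr` values are placeholders, and `itertools.product` only stands in for AISBench's internal task expansion, which may differ in detail.

```python
from itertools import product

# Placeholder configs; real entries carry many more fields than `abbr`.
models = [dict(abbr=name) for name in ("model-a", "model-b", "model-c")]
datasets = [dict(abbr=name) for name in ("gsm8k", "math500", "mmlu")]

# Every model is paired with every dataset: len(models) * len(datasets) pairs.
pairs = [(m["abbr"], d["abbr"]) for m, d in product(models, datasets)]

for model_abbr, dataset_abbr in pairs:
    print(f"{model_abbr} x {dataset_abbr}")
```

The number of tasks grows multiplicatively with each list, which is worth keeping in mind before adding another model or dataset to a shared configuration file.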
+ +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_5_shot_str import mmlu_datasets as mmlu_5_shot_str + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat + mmlu_5_shot_str +models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat + +work_dir = 'outputs/multi_model_multi_dataset/' +``` + +### 6. Custom Model-Dataset Pairings +Precisely control which models are paired with which datasets via `model_dataset_combinations` to avoid unnecessary Cartesian products. 
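A minimal sketch of how such a pairing list compares with the default, assuming each entry simply expands to the product of its own `models` and `datasets` lists. That semantics is an assumption rather than a description of the AISBench implementation; `expand_combinations` is a hypothetical helper and the `abbr` values are placeholders.

```python
from itertools import product


def expand_combinations(combinations):
    """Flatten [dict(models=[...], datasets=[...]), ...] into (model, dataset) abbr pairs."""
    pairs = []
    for combo in combinations:
        for model, dataset in product(combo["models"], combo["datasets"]):
            pairs.append((model["abbr"], dataset["abbr"]))
    return pairs


# Placeholder configs; real entries carry many more fields than `abbr`.
models = [dict(abbr=f"model-{i}") for i in range(3)]
datasets = [dict(abbr=name) for name in ("gsm8k", "math500")]

model_dataset_combinations = [
    dict(models=[models[0]], datasets=[datasets[0]]),
    dict(models=[models[1]], datasets=[datasets[1]]),
    dict(models=[models[2]], datasets=[datasets[0], datasets[1]]),
]

explicit_pairs = expand_combinations(model_dataset_combinations)
full_product = expand_combinations([dict(models=models, datasets=datasets)])

print(len(explicit_pairs), "explicit pairs vs", len(full_product), "in the full product")
```

Under this assumption the explicit list is always a subset of the full product, so it can only reduce the number of tasks that run.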
+ +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat + +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), + dict(models=[models[2]], datasets=[datasets[0], datasets[1]]), +] + +work_dir = 'outputs/custom_combinations/' +``` + +### 7. Judge Model Evaluation +For datasets that require LLM Judge evaluation (e.g., AIME 2025), configure the judge model in the dataset's `judge_infer_cfg`. 
+ +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.aime2025.aime2025_gen_0_shot_llmjudge import aime2025_datasets + +datasets = aime2025_datasets + +datasets[0]['judge_infer_cfg']['judge_model']['host_ip'] = 'localhost' +datasets[0]['judge_infer_cfg']['judge_model']['host_port'] = 8081 + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-judge-eval', + path="", + model="", + stream=True, + request_rate=0, + retry=2, + host_ip="localhost", + host_port=8080, + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + pred_postprocessor=dict(type=extract_non_reasoning_content), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/judge_eval/' +``` + +### 8. Steady-State Performance Evaluation +Simulate performance under steady-state load by controlling the `request_rate` parameter and `stream` parameter. 
+ +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPI + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import ( + synthetic_datasets, + ) + +datasets = synthetic_datasets + +models = [] +for rate in [0, 5, 10, 20]: + model_cfg = dict( + attr="service", + type=VLLMCustomAPI, + abbr=f'vllm-api-steady-rate-{rate}', + path="", + model="", + stream=True, + request_rate=rate, + use_timestamp=False, + retry=2, + api_key="", + host_ip="localhost", + host_port=8080, + url="", + max_out_len=512, + batch_size=1, + trust_remote_code=False, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + ) + models.append(model_cfg) + +work_dir = 'outputs/steady_state_perf/' +``` + +### 9. Multi-Turn Dialogue Performance Evaluation +Use the ShareGPT or MTBench multi-turn dialogue datasets for performance evaluation. + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.sharegpt.sharegpt_gen import sharegpt_datasets + +datasets = sharegpt_datasets + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr="vllm-multiturn-api-chat-stream", + path="", + model="", + stream=True, + request_rate=0, + retry=2, + api_key="", + host_ip="localhost", + host_port=8080, + url="", + max_out_len=512, + batch_size=1, + trust_remote_code=False, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + 
pred_postprocessor=dict(type=extract_non_reasoning_content), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/multi_turn_benchmark/' +``` + +### 10. Custom Dataset Evaluation +When you need to use your own dataset for evaluation, you can do so by customizing the dataset configuration. + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate +from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever +from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer +from ais_bench.benchmark.datasets import GenericDataset, AccuracyEvaluator + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + +datasets = [ + dict( + abbr='my_custom_dataset', + type=GenericDataset, + path='/path/to/your/dataset.jsonl', + reader_cfg=dict( + input_columns=['question'], + output_column='answer', + ), + infer_cfg=dict( + prompt_template=dict( + type=PromptTemplate, + template=dict( + round=[ + dict(role='HUMAN', prompt='{question}'), + ], + ), + ), + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + eval_cfg=dict( + evaluator=dict(type=AccuracyEvaluator), + ), + ) +] + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-custom-dataset', + model="", + request_rate=0, + retry=2, + host_ip="localhost", + host_port=8080, + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.5, top_k=10, top_p=0.95), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + 
max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/custom_dataset/' +``` + + ## Example of Using a Custom Configuration File for Accuracy Evaluation ### Editing the Example Content The following example demonstrates how to evaluate the performance of two service interfaces ([`v1/chat/completions`](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py) and [`v1/completions`](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general.py)) on the [GSM8K](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/README_en.md) and [MATH datasets](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/math/README_en.md). Refer to the sample file: [demo_infer_vllm_api.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/demo_infer_vllm_api.py): @@ -27,16 +679,13 @@ with read_base(): from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general -# Use only a subset of samples for demo testing gsm8k_0_shot_cot_str[0]['abbr'] = 'demo_' + gsm8k_0_shot_cot_str[0]['abbr'] gsm8k_0_shot_cot_str[0]['reader_cfg']['test_range'] = '[0:8]' math500_gen_0_shot_cot_chat[0]['abbr'] = 'demo_' + math500_gen_0_shot_cot_chat[0]['abbr'] math500_gen_0_shot_cot_chat[0]['reader_cfg']['test_range'] = '[0:8]' -# Specify the dataset list; add different dataset configurations by concatenation datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat -# Specify the model configuration list models = [ dict( attr="service", @@ -46,8 +695,8 @@ models = [ model="", request_rate = 0, retry = 2, - host_ip = "localhost", # Specify the IP address of the inference service - host_port = 8080, 
# Specify the port of the inference service + host_ip = "localhost", + host_port = 8080, max_out_len = 512, batch_size=1, generation_kwargs = dict( @@ -103,12 +752,12 @@ with read_base(): models as vllm_api_stream_chat, ) -datasets = synthetic_datasets # Specify the dataset list +datasets = synthetic_datasets vllm_api_general_stream[0]["abbr"] = "demo-" + vllm_api_general_stream[0]["abbr"] vllm_api_stream_chat[0]["abbr"] = "demo-" + vllm_api_stream_chat[0]["abbr"] -models = vllm_api_general_stream + vllm_api_stream_chat # Specify the model list +models = vllm_api_general_stream + vllm_api_stream_chat work_dir = "outputs/demo_api-vllm-stream-perf/" ``` @@ -126,7 +775,7 @@ ais_bench ais_bench/configs/api_examples/demo_infer_vllm_api_perf.py -m perf --m ### Output Results ```bash -[2025-12-05 12:10:44,147] [ais_bench] [INFO] Performance Results of task [demo-vllm-api-general-stream/syntheticdataset]: +[2025-12-05 12:10:44,147] [ais_bench] [INFO] Performance Results of task [demo-vllm-api-general-stream/syntheticdataset]: ╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════╡ @@ -135,7 +784,7 @@ ais_bench ais_bench/configs/api_examples/demo_infer_vllm_api_perf.py -m perf --m │ TTFT │ total │ 103.5 ms │ 102.4 ms │ 107.0 ms │ 103.1 ms │ 103.3 ms │ 104.2 ms │ 106.8 ms │ 10 │ ... [2025-12-05 12:10:44,149] [ais_bench] [INFO] Performance Result files located in outputs/demo_api-vllm-general-stream-chat-perf/20251205_121020/performances/demo-vllm-api-general-stream-chat. 
-[2025-12-05 12:10:44,149] [ais_bench] [INFO] Performance Results of task [demo-vllm-api-stream-chat/syntheticdataset]: +[2025-12-05 12:10:44,149] [ais_bench] [INFO] Performance Results of task [demo-vllm-api-stream-chat/syntheticdataset]: ╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤════════════════╤═════════════════╤═════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪════════════════╪═════════════════╪═════════════════╪═════╡ @@ -157,12 +806,12 @@ with read_base(): from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat -models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat +models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat model_dataset_combinations = [ - dict(models=[models[0]], datasets=[datasets[0]]), # Combination 1: Use model 0 (vllm_api_general) with dataset 0 (gsm8k_0_shot_cot_str) - dict(models=[models[1]], datasets=[datasets[1]]), # Combination 2: Use model 1 (vllm_api_general_chat) with dataset 1 (math500_gen_0_shot_cot_chat) - dict(models=[models[2]], datasets=[datasets[0], datasets[1]]), # Combination 3: Use model 2 (vllm_api_stream_chat) with dataset 0 (gsm8k_0_shot_cot_str) and dataset 1 (math500_gen_0_shot_cot_chat) + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), + dict(models=[models[2]], datasets=[datasets[0], datasets[1]]), ... 
] ``` @@ -180,8 +829,8 @@ vllm_api_general_copy[0]['port'] = 8081 models = vllm_api_general_copy + vllm_api_general datasets = math500_gen_0_shot_cot_chat model_dataset_combinations = [ - dict(models=[models[1]], datasets=datasets), # Combination 1: Use model 1 (vllm_api_general) with dataset (math500_gen_0_shot_cot_chat) - dict(models=[models[0]], datasets=datasets), # Combination 2: Use model 0 (vllm_api_general_copy) with dataset 0 (math500_gen_0_shot_cot_chat). Since vllm_api_general_copy and vllm_api_general have the same abbr, this will be considered the same task as combination 1 and will be skipped, even if the internal parameters differ + dict(models=[models[1]], datasets=datasets), + dict(models=[models[0]], datasets=datasets), ] ``` @@ -189,7 +838,7 @@ Correct approach: When reusing model or dataset configurations, modify the `abbr ```python vllm_api_general_copy = vllm_api_general.copy() -vllm_api_general_copy[0]['abbr'] = vllm_api_general[0]['abbr'] + '-copy' # Modify abbr to identify the model +vllm_api_general_copy[0]['abbr'] = vllm_api_general[0]['abbr'] + '-copy' ``` In this way, `vllm_api_general_copy[0]` and `vllm_api_general[0]` have different `abbr` values, so combination 2 and combination 1 are different tasks and will be executed normally. @@ -199,11 +848,14 @@ In this way, `vllm_api_general_copy[0]` and `vllm_api_general[0]` have different | Filename | Description | | --- | --- | | [infer_vllm_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_general.py) | Evaluates the `v1/completions` sub-service using vLLM API (version 0.6+) on the GSM8K dataset. The prompt format is a string, and the dataset path is customized. | -| [infer_mindie_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_mindie_stream_api_general.py) | Evaluates the `infer` sub-service using MindIE Stream API on the GSM8K dataset. 
The prompt format is a string, and the dataset path is customized. | -| [infer_vllm_api_old.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_old.py) | Evaluates the `generate` sub-service using vLLM API (version 0.2.6) on the GSM8K dataset. The prompt format is a string, and the dataset path is customized. | | [infer_vllm_api_general_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_general_chat.py) | Evaluates the `v1/chat/completions` sub-service using vLLM API (version 0.6+) on the GSM8K dataset. The prompt format is a conversation format, and the dataset path is customized. | | [infer_vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_stream_chat.py) | Evaluates the `v1/chat/completions` sub-service with streaming inference using vLLM API (version 0.6+) on the GSM8K dataset. The prompt format is a conversation format, and the dataset path is customized. | +| [infer_vllm_api_old.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_old.py) | Evaluates the `v1/completions` sub-service using older vLLM API on the GSM8K dataset. The prompt format is a string. | +| [infer_mindie_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_mindie_stream_api_general.py) | Evaluates the `infer` sub-service using MindIE Stream API on the GSM8K dataset. The prompt format is a string, and the dataset path is customized. | | [infer_hf_base_model.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/hf_example/infer_hf_base_model.py) | Evaluates using the inference interface of a Hugging Face base model on the GSM8K dataset. The prompt format is a string, and the dataset path is customized. 
| -| [infer_hf_chat_model.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/hf_example/infer_hf_chat_model.py) | Evaluates using the inference interface of a Hugging Face chat model on the GSM8K dataset. The prompt format is a string, and the dataset path is customized. | +| [infer_hf_chat_model.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/hf_example/infer_hf_chat_model.py) | Evaluates using the inference interface of a Hugging Face chat model on the GSM8K dataset. The prompt format is a conversation format, and the dataset path is customized. | +| [demo_infer_vllm_api.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/demo_infer_vllm_api.py) | Demo example: Evaluates the accuracy of two interfaces `v1/chat/completions` and `v1/completions` simultaneously on the GSM8K and MATH datasets. | +| [demo_infer_vllm_api_perf.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/demo_infer_vllm_api_perf.py) | Demo example: Evaluates the streaming performance of two interfaces `v1/chat/completions` and `v1/completions` simultaneously using synthetic datasets. | +| [all_dataset_configs.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/all_dataset_configs.py) | A consolidated import of all supported dataset configurations; can be used directly via `from ... import` in custom configuration files. | -**Note**: To evaluate other datasets using the above custom configuration files, import additional datasets from [ais_bench/configs/api_examples/all_dataset_configs.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/all_dataset_configs.py). 
+**Note**: To evaluate other datasets using the above custom configuration files, import additional datasets from [ais_bench/configs/api_examples/all_dataset_configs.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/all_dataset_configs.py). \ No newline at end of file diff --git a/docs/source_en/advanced_tutorials/stable_stage.md b/docs/source_en/advanced_tutorials/stable_stage.md index 8bad7831..77c407fa 100644 --- a/docs/source_en/advanced_tutorials/stable_stage.md +++ b/docs/source_en/advanced_tutorials/stable_stage.md @@ -208,6 +208,11 @@ After the command execution is completed, the task execution details in `outputs For instructions on how to view the charts in this HTML file, please refer to 📚 [Instructions for Using Performance Test Visualization Concurrency Charts](../base_tutorials/results_intro/performance_visualization.md) +## Implement via Custom Config Files + +> 💡 The above steady-state performance evaluation scenario can also be implemented through the [Custom Config File Method](run_custom_config.md). The configuration file is essentially a Python script, so it supports all Python syntax, such as loops, conditionals, and list comprehensions. You can put the models, datasets, summarizer, and other configurations in a single file that is written once and reused many times. See the "Steady-State Performance Evaluation" example in [Running AISBench with Custom Config Files](run_custom_config.md#custom-config-file-examples-for-various-scenarios). 
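As a small illustration of the "configuration file is a Python script" point, the sketch below derives one model entry per request rate from a shared base dict using a list comprehension. The field names (`abbr`, `stream`, `request_rate`, `generation_kwargs`) are taken from the model examples in this change; the `with_rate` helper and the base `abbr` are hypothetical, and the dict is deliberately incomplete (a real entry would also need fields such as `type`, `host_ip`, and `host_port`).

```python
import copy

# Shared settings; incomplete on purpose, see the note above.
base_model = dict(
    abbr="vllm-api-steady",
    stream=True,
    request_rate=0,
    generation_kwargs=dict(temperature=0.01, ignore_eos=False),
)


def with_rate(base, rate):
    """Return an independent copy of `base` with its own abbr and request rate."""
    cfg = copy.deepcopy(base)  # deep copy so nested dicts are not shared
    cfg["abbr"] = f"{base['abbr']}-rate-{rate}"
    cfg["request_rate"] = rate
    return cfg


models = [with_rate(base_model, rate) for rate in (0, 5, 10, 20)]

for model in models:
    print(model["abbr"], model["request_rate"])
```

A deep copy is used here because a shallow `.copy()` of a list or dict still shares the nested objects, so editing a nested field in one entry would silently change the others. Giving every entry a distinct `abbr` keeps the generated configurations distinguishable as separate tasks.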
+ + ## Other Functional Scenarios ### Recalculating Performance Results Refer to 📚 [Recalculation of Performance Results](../base_tutorials/scenes_intro/performance_benchmark.md#recalculation-of-performance-results) diff --git a/docs/source_en/advanced_tutorials/synthetic_dataset.md b/docs/source_en/advanced_tutorials/synthetic_dataset.md index 231b052b..6472abaa 100644 --- a/docs/source_en/advanced_tutorials/synthetic_dataset.md +++ b/docs/source_en/advanced_tutorials/synthetic_dataset.md @@ -215,6 +215,7 @@ synthetic_config = { } ``` +------ ### 4.2 TokenId Type Examples @@ -225,7 +226,8 @@ synthetic_config = { "Type": "tokenid", "RequestCount": 1000, "TokenIdConfig": { - "RequestSize": 2048 # 2048 tokens per request + "RequestSize": 2048, # 2048 tokens per request + "PrefixLen": 0 } } ``` @@ -237,12 +239,13 @@ synthetic_config = { "Type": "tokenid", "RequestCount": 5000, "TokenIdConfig": { - "RequestSize": 128 # Short text processing scenario + "RequestSize": 128, # Short text processing scenario + "PrefixLen": 0 } } ``` -#### prefix Cache Performance Testing +#### Prefix Cache Performance Testing ```python synthetic_config = { @@ -255,6 +258,8 @@ synthetic_config = { } ``` +------ + ## V. Frequently Asked Questions ### Q1: How to choose a distribution type? @@ -266,6 +271,7 @@ synthetic_config = { - **Stress testing**: Use zipf distribution for Input and uniform distribution for Output. - **Stability testing**: Use gaussian distribution for both Input and Output. +------ ### Q2: Why does the performance evaluation result matrix show unexpected values even after specifying the *input length*? @@ -273,13 +279,19 @@ synthetic_config = { - **`string` mode**: The input length here refers to the length of the input string, not the number of tokens. - **Preprocessing stage**: Additional string concatenation may be performed before/after using chat-related APIs. 
+------ ### Q3: Why does the performance evaluation result matrix show unexpected values even after specifying the *output length* in String mode? - **Significant discrepancy**: Check if the `ignore_eos` parameter in `generation_kwargs` of the model API configuration file is correctly set to `True` (this ensures the service ignores the end-of-sequence token until the preset output length is reached). +------ ## VI. Notes 1. **`tokenid` mode**: The value range of `tokenid` depends on the vocabulary range of the model specified in the model configuration file. -2. **`string` mode**: A fixed-length sequence is generated when MinValue=MaxValue. \ No newline at end of file +2. **`string` mode**: A fixed-length sequence is generated when MinValue=MaxValue. + +## VII. Implement via Custom Config Files + +> 💡 The above synthetic dataset evaluation scenario can also be implemented through the [Custom Config File Method](run_custom_config.md). The configuration file is essentially a Python script, so it supports all Python syntax, such as loops, conditionals, and list comprehensions. You can put the models, datasets, summarizer, and other configurations in a single file that is written once and reused many times. See the "Synthetic Dataset Performance Evaluation" example in [Running AISBench with Custom Config Files](run_custom_config.md#custom-config-file-examples-for-various-scenarios). \ No newline at end of file diff --git a/docs/source_en/base_tutorials/all_params/cli_args.md b/docs/source_en/base_tutorials/all_params/cli_args.md index 09f21bb1..6f6e5a82 100644 --- a/docs/source_en/base_tutorials/all_params/cli_args.md +++ b/docs/source_en/base_tutorials/all_params/cli_args.md @@ -17,14 +17,15 @@ Based on the execution scenario, command line parameters are divided into three `Accuracy Evaluation Parameters` take effect only when the `--mode` parameter is specified as `"all", "infer", "eval", "viz"`. 
`Performance Evaluation Parameters` take effect only when the `--mode` parameter is specified as `"perf", "perf_viz"`. `Common Parameters` are not restricted by the task execution mode and can be specified in all modes. -# ### Common Parameters +### Common Parameters Applicable to all modes and can be used in combination with accuracy or performance parameters. | Parameter | Description | Example | | ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------- | -| `--models` | Specifies the name of the model inference backend task (corresponding to a pre-implemented default model configuration file under the path `ais_bench/benchmark/configs/models`). Multiple task names are supported. For details, refer to 📚 [Supported Models](./models.md) | `--models vllm_api_general` | -| `--datasets` | Specifies the name of the dataset task (corresponding to a pre-implemented default dataset configuration file under the path `ais_bench/benchmark/configs/datasets`). Multiple dataset names are supported. For details, refer to 📚 [Supported Dataset Types](./datasets.md) | `--datasets gsm8k_gen` | -| `--summarizer` | Specifies the name of the result summary task (corresponding to a pre-implemented default configuration file under the path `ais_bench/benchmark/configs/summarizers`). For details, refer to 📚 [Supported Result Summary Tasks](./summarizer.md) | `--summarizer medium`| +| `config` | Specifies the path to a custom configuration file. | `ais_bench /path/to/custom_config.py {other optional arguments}` | +| `--models` | Specifies the name of the model inference backend task (corresponding to a pre-implemented default model configuration file under the path `ais_bench/benchmark/configs/models`). Multiple task names are supported. 
For details, refer to 📚 [Supported Models](./models.md).
⚠️ **Note**: This parameter is invalid when a custom configuration file path is specified. | `--models vllm_api_general` | +| `--datasets` | Specifies the name of the dataset task (corresponding to a pre-implemented default dataset configuration file under the path `ais_bench/benchmark/configs/datasets`). Multiple dataset names are supported. For details, refer to 📚 [Supported Dataset Types](./datasets.md).
⚠️ **Note**: This parameter is invalid when a custom configuration file path is specified. | `--datasets gsm8k_gen` | +| `--summarizer` | Specifies the name of the result summary task (corresponding to a pre-implemented default configuration file under the path `ais_bench/benchmark/configs/summarizers`). For details, refer to 📚 [Supported Result Summary Tasks](./summarizer.md).
⚠️ **Note**: This parameter is invalid when a custom configuration file path is specified. | `--summarizer medium`| | `--mode` or `-m` | Running mode, optional values: `all`, `infer`, `eval`, `viz`, `perf`, `perf_viz`; default value is `all`.
For details, refer to 📚 [Running Mode Description](./mode.md). | `--mode infer`
`-m all`| | `--reuse` or `-r` | Specifies the timestamp in an existing working directory to continue execution and overwrite original results. Used in conjunction with the `--mode` parameter, it can resume interrupted inference, or perform accuracy calculation/visualization result printing based on existing inference results. If no parameter is added, the latest timestamp in the `--work-dir` is automatically selected. | `--reuse 20250126_144254`
`-r 20250126_144254` | | `--work-dir` or `-w` | Specifies the evaluation working directory for saving output results. Default path: `outputs/default`. | `--work-dir /path/to/work`
`-w /path/to/work` | @@ -34,11 +35,11 @@ Applicable to all modes and can be used in combination with accuracy or performa | `--max-workers-per-gpu` | Reserved parameter; not currently supported. | `--max-workers-per-gpu 1` | | `--merge-ds` | Enables merged inference for datasets of the same type (runs multiple datasets for the same task together). | `--merge-ds` | | `--num-prompts` | Specifies the number of test cases for the dataset (selected in dataset order). A positive integer must be passed. If the number exceeds the total number of cases in the dataset or no value is specified, the entire dataset is used for testing. | `--num-prompts 500` | -| `--max-num-workers` | Number of parallel tasks, range: `[1, number of CPU cores]`; default value: `1`. Invalid when `--debug` is specified; all tasks are executed serially. | `--max-num-workers 2` | +| `--max-num-workers` | Number of parallel tasks, range: `[1, number of CPU cores]`; default value: `1`. Invalid when `--debug` is specified; all tasks are executed serially. Note: In performance evaluation scenarios, an excessively high concurrency may cause resource contention among different processes, leading to inaccurate test results. | `--max-num-workers 2` | | `--num-warmups` | Number of warm-up runs before sending requests. Data is selected in dataset order for testing. When `num-warmups` exceeds the number of dataset entries, data from the dataset will be sent in a loop. Default value: `1`; set to `0` to disable warm-up. If all requests fail during the warmup phase, subsequent inference tasks will not be executed. | `--num-warmups 10` | -# ### Accuracy Evaluation Parameters +### Accuracy Evaluation Parameters Valid only when the mode is `all`, `infer`, `eval`, or `viz`. | Parameter | Description | Example | @@ -47,7 +48,7 @@ Valid only when the mode is `all`, `infer`, `eval`, or `viz`. | `--dump-extract-rate` | Toggle to dump evaluation speed data. Enabled if configured, disabled if not; disabled by default. 
| `--dump-extract-rate`| -# ### Performance Evaluation Parameters +### Performance Evaluation Parameters Valid only when the mode is `perf` or `perf_viz`. | Parameter | Description | Example | @@ -66,4 +67,6 @@ The currently supported parameter configurations are as follows: | `WORKERS_NUM` | Number of processes used for sending requests. The default value is 0, which means automatic allocation based on the maximum number of concurrent requests configured by the user. (Invalid when the command-line parameter `--debug` is specified; single-core execution is used for sending requests, which limits concurrency capabilities.) | [0, number of CPU cores] | | `MAX_CHUNK_SIZE` | Maximum cache size for a single chunk returned by the streaming inference model backend. The default value is 65535 bytes (64KB). | `(0, 16777216]` (Unit: Byte) | | `REQUEST_TIME_OUT` | Timeout period for the client to wait for a response after sending a request. The default value is None, meaning infinite waiting (always waiting for the model to return results). | `None` or `>0` (Unit: seconds) | -| `LOG_LEVEL` | Log level, optional values: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Default value: `INFO`. | `[DEBUG, INFO, WARNING, ERROR, CRITICAL]` | \ No newline at end of file +| `LOG_LEVEL` | Log level, optional values: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Default value: `INFO`. | `[DEBUG, INFO, WARNING, ERROR, CRITICAL]` | +| `PRESSURE_TIME` | Duration of pressure testing. Only takes effect when `--pressure` mode is specified. Unit: seconds. (This parameter will be deprecated in future versions; please use the `--pressure-time` parameter instead.) | `[1, 86400]` (i.e., 1 second to 24 hours) | +| `CONNECTION_ADD_RATE` | Rate at which concurrent threads are created. Represents the number of new concurrent threads per second until the maximum concurrency limit is reached. Only takes effect when `--pressure` mode is specified. 
(This parameter will be deprecated in future versions; please modify the `request_rate` parameter in the model configuration file instead.) | `> 0.1` (Unit: threads / second) | \ No newline at end of file diff --git a/docs/source_en/base_tutorials/all_params/models.md b/docs/source_en/base_tutorials/all_params/models.md index f59b7ee8..5066685e 100644 --- a/docs/source_en/base_tutorials/all_params/models.md +++ b/docs/source_en/base_tutorials/all_params/models.md @@ -13,20 +13,20 @@ Taking the vLLM inference service deployed on GPU as an example, you can refer t The model configurations corresponding to different service-oriented backends are as follows: -| Model Configuration Name | Description | Prerequisites for Use | Supported Evaluation Modes | Interface Type | Supported Dataset Prompt Formats | Configuration File Path | -| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | -| `vllm_api_general` | Access the inference service via vLLM's OpenAI-compatible API, with the interface `v1/completions` | The vLLM version used supports the `v1/completions` sub-service | Generative Evaluation, PPL Mode Evaluation | Text Interface | String Format | [vllm_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general.py) | -| `vllm_api_general_stream` | Access the vLLM inference service in streaming mode, with the interface `v1/completions` | The vLLM version used supports the `v1/completions` sub-service | Generative Evaluation | Streaming Interface | String Format | [vllm_api_general_stream.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py) | -| `vllm_api_general_chat` | Access the inference service via vLLM's OpenAI-compatible API, with the interface `v1/chat/completions` | The vLLM version used supports the `v1/chat/completions` sub-service | Generative Evaluation, PPL Mode Evaluation | Text 
Interface | String Format, Dialogue Format, Multimodal Format | [vllm_api_general_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py) | -| `vllm_api_stream_chat` | Access the vLLM inference service in streaming mode, with the interface `v1/chat/completions` | The vLLM version used supports the `v1/chat/completions` sub-service | Generative Evaluation | Streaming Interface | String Format, Dialogue Format, Multimodal Format | [vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py) | -| `vllm_api_stream_chat_multiturn` | Access the vLLM inference service in streaming mode for multi-turn dialogue scenarios, with the interface `v1/chat/completions` | The vLLM version used supports the `v1/chat/completions` sub-service | Generative Evaluation | Streaming Interface | Dialogue Format | [vllm_api_stream_chat_multiturn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat_multiturn.py) | -| `vllm_api_function_call_chat` | API for accessing the vLLM inference service in function call accuracy evaluation scenarios, with the interface `v1/chat/completions` (only applicable to the [BFCL](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/BFCL/README_en.md) evaluation scenario) | The vLLM version used supports the `v1/chat/completions` sub-service | Generative Evaluation | Text Interface | Dialogue Format | [vllm_api_function_call_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_function_call_chat.py) | -| `vllm_api_old` | Access the inference service via vLLM-compatible API, with the interface `generate` | The vLLM version used supports the `generate` sub-service | Generative Evaluation | Text Interface | String Format, Multimodal Format | 
[vllm_api_old.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_old.py) | -| `mindie_stream_api_general` | Access the inference service via MindIE streaming API, with the interface `infer` | The MindIE version used supports the `infer` sub-service | Generative Evaluation | Streaming Interface | String Format, Multimodal Format | [mindie_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/mindie_api/mindie_stream_api_general.py) | -| `triton_api_general` | Access the inference service via Triton API, with the interface `v2/models/{model name}/generate` | Start an inference service that supports Triton API | Generative Evaluation | Text Interface | String Format, Multimodal Format | [triton_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/triton_api/triton_api_general.py) | -| `triton_stream_api_general` | Access the inference service via Triton streaming API, with the interface `v2/models/{model name}/generate_stream` | Start an inference service that supports Triton API | Generative Evaluation | Streaming Interface | String Format, Multimodal Format | [triton_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/triton_api/triton_stream_api_general.py) | -| `tgi_api_general` | Access the inference service via TGI API, with the interface `generate` | Start an inference service that supports TGI API | Generative Evaluation | Text Interface | String Format, Multimodal Format | [tgi_api_general](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/tgi_api/tgi_api_general.py) | -| `tgi_stream_api_general` | Access the inference service via TGI streaming API, with the interface `generate_stream` | Start an inference service that supports TGI API | Generative Evaluation | Streaming Interface | String Format, Multimodal Format 
| [tgi_stream_api_general](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/tgi_api/tgi_stream_api_general.py) | +| Model Configuration Name | Description | Prerequisites for Use | Supported Evaluation Modes | Interface Type | Supported Dataset Prompt Formats | Configuration File Import Method | Configuration File Path | +| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | +| `vllm_api_general` | Access the inference service via vLLM's OpenAI-compatible API, with the interface `v1/completions` | The vLLM version used supports the `v1/completions` sub-service | Generative Evaluation, PPL Mode Evaluation | Text Interface | String Format | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general` | [vllm_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general.py) | +| `vllm_api_general_stream` | Access the vLLM inference service in streaming mode, with the interface `v1/completions` | The vLLM version used supports the `v1/completions` sub-service | Generative Evaluation | Streaming Interface | String Format | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream` | [vllm_api_general_stream.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py) | +| `vllm_api_general_chat` | Access the inference service via vLLM's OpenAI-compatible API, with the interface `v1/chat/completions` | The vLLM version used supports the `v1/chat/completions` sub-service | Generative Evaluation, PPL Mode Evaluation | Text Interface | String Format, Dialogue Format, Multimodal Format | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat` | 
[vllm_api_general_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py) | +| `vllm_api_stream_chat` | Access the vLLM inference service in streaming mode, with the interface `v1/chat/completions` | The vLLM version used supports the `v1/chat/completions` sub-service | Generative Evaluation | Streaming Interface | String Format, Dialogue Format, Multimodal Format | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat` | [vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py) | +| `vllm_api_stream_chat_multiturn` | Access the vLLM inference service in streaming mode for multi-turn dialogue scenarios, with the interface `v1/chat/completions` | The vLLM version used supports the `v1/chat/completions` sub-service | Generative Evaluation | Streaming Interface | Dialogue Format | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat_multiturn import models as vllm_api_stream_chat_multiturn` | [vllm_api_stream_chat_multiturn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat_multiturn.py) | +| `vllm_api_function_call_chat` | API for accessing the vLLM inference service in function call accuracy evaluation scenarios, with the interface `v1/chat/completions` (only applicable to the [BFCL](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/BFCL/README_en.md) evaluation scenario) | The vLLM version used supports the `v1/chat/completions` sub-service | Generative Evaluation | Text Interface | Dialogue Format | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_function_call_chat import models as vllm_api_function_call_chat` | 
[vllm_api_function_call_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_function_call_chat.py) | +| `vllm_api_old` | Access the inference service via vLLM-compatible API, with the interface `generate` | The vLLM version used supports the `generate` sub-service | Generative Evaluation | Text Interface | String Format, Multimodal Format | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_old import models as vllm_api_old` | [vllm_api_old.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_old.py) | +| `mindie_stream_api_general` | Access the inference service via MindIE streaming API, with the interface `infer` | The MindIE version used supports the `infer` sub-service | Generative Evaluation | Streaming Interface | String Format, Multimodal Format | `from ais_bench.benchmark.configs.models.mindie_api.mindie_stream_api_general import models as mindie_stream_api_general` | [mindie_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/mindie_api/mindie_stream_api_general.py) | +| `triton_api_general` | Access the inference service via Triton API, with the interface `v2/models/{model name}/generate` | Start an inference service that supports Triton API | Generative Evaluation | Text Interface | String Format, Multimodal Format | `from ais_bench.benchmark.configs.models.triton_api.triton_api_general import models as triton_api_general` | [triton_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/triton_api/triton_api_general.py) | +| `triton_stream_api_general` | Access the inference service via Triton streaming API, with the interface `v2/models/{model name}/generate_stream` | Start an inference service that supports Triton API | Generative Evaluation | Streaming Interface | String Format, Multimodal Format | `from 
ais_bench.benchmark.configs.models.triton_api.triton_stream_api_general import models as triton_stream_api_general` | [triton_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/triton_api/triton_stream_api_general.py) | +| `tgi_api_general` | Access the inference service via TGI API, with the interface `generate` | Start an inference service that supports TGI API | Generative Evaluation | Text Interface | String Format, Multimodal Format | `from ais_bench.benchmark.configs.models.tgi_api.tgi_api_general import models as tgi_api_general` | [tgi_api_general](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/tgi_api/tgi_api_general.py) | +| `tgi_stream_api_general` | Access the inference service via TGI streaming API, with the interface `generate_stream` | Start an inference service that supports TGI API | Generative Evaluation | Streaming Interface | String Format, Multimodal Format | `from ais_bench.benchmark.configs.models.tgi_api.tgi_stream_api_general import models as tgi_stream_api_general` | [tgi_stream_api_general](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/tgi_api/tgi_stream_api_general.py) | ### Parameter Description for Service-Oriented Inference Backend Configuration @@ -34,7 +34,7 @@ The configuration file for the service-oriented inference backend is configured ```python from ais_bench.benchmark.models import VLLMCustomAPI -models = [ +models = [ # Equivalent to the `models` imported via `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general` in a custom configuration file dict( attr="service", type=VLLMCustomAPI, @@ -71,15 +71,15 @@ The description of configurable parameters for the service-oriented inference ba | `path` | String | Tokenizer path, usually the same as the model path. The Tokenizer is loaded using `AutoTokenizer.from_pretrained(path)`. 
Specify an accessible local path, e.g., `/weight/DeepSeek-R1` | | `model` | String | Name of the model accessible on the server, which must be consistent with the name specified during service-oriented deployment | | `model_name` | String | Applicable only to Triton services. It is concatenated into the endpoint URI `/v2/models/{modelname}/{infer, generate, generate_stream}` and must be consistent with the name used during deployment | -| `stream` | Boolean | Whether the inference service is a streaming interface. Required Parameter. | -| `request_rate` | Float | Request sending rate (unit: requests per second). A request is sent every `1/request_rate` seconds; if the value is less than 0.1, requests are automatically merged and sent in batches. Valid range: [0, 64000]. When the `traffic_cfg` item is enabled, this function may be overwritten (for specific reasons, refer to 🔗 [Parameter Interpretation Section in the Description of Request Rate (RPS) Distribution Control and Visualization](../../advanced_tutorials/rps_distribution.md#parameter-interpretation)) | +| `stream` | Boolean | API model inference interface type. The default is False, meaning a non-streaming interface. When True, it indicates a streaming interface (for details, refer to 🔗 [Service-Oriented Inference Backend](#service-oriented-inference-backend)) | +| `request_rate` | Float | Request sending rate (unit: requests per second). A request is sent every `1/request_rate` seconds; in pressure testing scenarios, it represents the number of new server-side connections added per second; if the value is less than 0.1, the request sending rate is unlimited. Valid range: [0, 64000]. 
When the `traffic_cfg` item is enabled, this function may be overwritten (for specific reasons, refer to 🔗 [Parameter Interpretation Section in the Description of Request Rate (RPS) Distribution Control and Visualization](../../advanced_tutorials/rps_distribution.md#parameter-interpretation)) | | `use_timestamp` | Boolean | Whether to schedule requests according to the dataset's timestamp field. When True and the dataset contains timestamps, requests are sent by timestamp and **request_rate** / **traffic_cfg** are ignored; when False, request_rate and traffic_cfg apply. Default False. Used with timestamped datasets (e.g. Mooncake Trace). | | `traffic_cfg` | Dict | Parameters for controlling fluctuations in the request sending rate (for detailed usage instructions, refer to 🔗 [Description of Request Rate (RPS) Distribution Control and Visualization](../../advanced_tutorials/rps_distribution.md)). If this item is not filled in, the function is disabled by default | | `retry` | Int | Maximum number of retries after failing to connect to the server. Valid range: [0, 1000] | | `api_key` | String | Custom API key, default is an empty string. Only supports the `VLLMCustomAPI` and `VLLMCustomAPIChat` model type. | | `host_ip` | String | Server IP address, supporting valid IPv4 or IPv6, e.g., `127.0.0.1`, `::1`. 
When using an IPv6 literal, the tool automatically wraps it in brackets when building URLs, for example: `http://[::1]:8080/` | | `host_port` | Int | Server port number, which must be consistent with the port specified during service-oriented deployment | -| `url` | String | Custom URL path for accessing the inference service (needs to be configured when the base URL is not a combination of http://host_ip:host_port).For example, when `models`'s `type` is `VLLMCustomAPI`, configure `url` as `https://xxxxxxx/yyyy/`, the actual request URL accessed is `https://xxxxxxx/yyyy/v1/completions` | +| `url` | String | Custom URL path for accessing the inference service (needs to be configured when the base URL is not a combination of http/https://host_ip:host_port; after configuration, `host_ip` and `host_port` will be ignored). For example, when `models`'s `type` is `VLLMCustomAPI`, configure `url` as `https://xxxxxxx/yyyy/`, the actual request URL accessed is `https://xxxxxxx/yyyy/v1/completions` | | `max_out_len` | Int | Maximum output length of the inference response; the actual length may be limited by the server. Valid range: (0, 131072] | | `batch_size` | Int | Batch size for concurrent requests. 
Valid range: (0, 64000] | | `trust_remote_code` | Boolean | Whether the tokenizer trusts remote code, default is `False`| @@ -99,12 +99,12 @@ The description of configurable parameters for the service-oriented inference ba ## Local Model Backend -| Model Configuration Name | Description | Prerequisites for Use | Supported Prompt Formats (String Format or Dialogue Format) | Corresponding Source Code Configuration File Path | -| --- | --- | --- | --- | --- | -| `hf_base_model` | HuggingFace Base Model Backend | The basic dependencies of the evaluation tool have been installed; the HuggingFace model weight path must be specified in the configuration file (automatic download is not supported currently) | String Format | [hf_base_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_base_model.py) | -| `hf_chat_model` | HuggingFace Chat Model Backend | The basic dependencies of the evaluation tool have been installed; the HuggingFace model weight path must be specified in the configuration file (automatic download is not supported currently) | Dialogue Format | [hf_chat_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_chat_model.py) | -|`hf_qwenvl_model`| HuggingFace Chat QwenVL Model Backend|The basic dependencies of the evaluation tool have been installed; the HuggingFace model weight path must be specified in the configuration file (automatic download is not supported currently)|Dialogue Format|[hf_qwenvl_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_qwenvl_model.py)| -|`vllm_offline_vl_model`| vLLM Chat QwenVL Offline Inference Model Backend|The basic dependencies of the evaluation tool have been installed; the model weight path must be specified in the configuration file (automatic download is not supported currently)|Dialogue 
Format|[vllm_offline_vl_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_offline_models/vllm_offline_vl_model.py)| +| Model Configuration Name | Description | Prerequisites for Use | Supported Prompt Formats (String Format or Dialogue Format) | Configuration File Import Method | Corresponding Source Code Configuration File Path | +| --- | --- | --- | --- | --- | --- | +| `hf_base_model` | HuggingFace Base Model Backend | The basic dependencies of the evaluation tool have been installed; the HuggingFace model weight path must be specified in the configuration file (automatic download is not supported currently) | String Format | `from ais_bench.benchmark.configs.models.hf_models.hf_base_model import models as hf_base_model` | [hf_base_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_base_model.py) | +| `hf_chat_model` | HuggingFace Chat Model Backend | The basic dependencies of the evaluation tool have been installed; the HuggingFace model weight path must be specified in the configuration file (automatic download is not supported currently) | Dialogue Format | `from ais_bench.benchmark.configs.models.hf_models.hf_chat_model import models as hf_chat_model` | [hf_chat_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_chat_model.py) | +|`hf_qwenvl_model`| HuggingFace Chat QwenVL Model Backend|The basic dependencies of the evaluation tool have been installed; the HuggingFace model weight path must be specified in the configuration file (automatic download is not supported currently)|Dialogue Format|`from ais_bench.benchmark.configs.models.hf_models.hf_qwenvl_model import models as hf_qwenvl_model`|[hf_qwenvl_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_qwenvl_model.py)| +|`vllm_offline_vl_model`| vLLM Chat QwenVL Offline Inference Model Backend|The basic 
dependencies of the evaluation tool have been installed; the model weight path must be specified in the configuration file (automatic download is not supported currently)|Dialogue Format|`from ais_bench.benchmark.configs.models.vllm_offline_models.vllm_offline_vl_model import models as vllm_offline_vl_model`|[vllm_offline_vl_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_offline_models/vllm_offline_vl_model.py)| ### Parameter Description for Huggingface Local Model Backend Configuration @@ -112,7 +112,7 @@ The configuration file for the huggingface local model backend is configured usi ```python from ais_bench.benchmark.models import HuggingFacewithChatTemplate -models = [ +models = [ # Equivalent to the `models` imported via `from ais_bench.benchmark.configs.models.hf_models.hf_chat_model import models as hf_chat_model` in a custom configuration file dict( attr="local", # Backend type identifier type=HuggingFacewithChatTemplate, # Model type diff --git a/docs/source_en/base_tutorials/all_params/summarizer.md b/docs/source_en/base_tutorials/all_params/summarizer.md index 5115f8ea..69a1c17f 100644 --- a/docs/source_en/base_tutorials/all_params/summarizer.md +++ b/docs/source_en/base_tutorials/all_params/summarizer.md @@ -1,7 +1,7 @@ # Supported Result Summary Tasks -| Task Name | Description | Configuration File Path | -| -------------- | -------------- | -------------- | -| `example` | A simplified accuracy evaluation result summary template that covers all currently supported datasets and is the default template used. | [example.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/example.py) | -| `medium` | A general accuracy evaluation result summary template, suitable for multiple basic datasets. 
| [medium.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/medium.py) | -| `default_perf` | A full-scale performance evaluation result summary template that aggregates performance data of all requests. It supports manual configuration of performance statistics indicators via `default_perf.py`. | [default_perf.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/perf/default_perf.py) | -| `stable_stage` | A performance evaluation result summary template for the stable stage, which only aggregates request data when the system reaches the configured maximum concurrency. It supports manual configuration of performance statistics indicators via `stable_stage.py`. | [stable_stage.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/perf/stable_stage.py) | \ No newline at end of file +| Task Name | Description | Configuration File Import Method | Configuration File Path | +| -------------- | -------------- | -------------- | -------------- | +| `example` | A simplified accuracy evaluation result summary template that covers all currently supported datasets and is the default template used. | `from ais_bench.benchmark.configs.summarizers.example import summarizer` | [example.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/example.py) | +| `medium` | A general accuracy evaluation result summary template, suitable for multiple basic datasets.| `from ais_bench.benchmark.configs.summarizers.medium import summarizer` | [medium.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/medium.py) | +| `default_perf` | A full-scale performance evaluation result summary template that aggregates performance data of all requests. It supports manual configuration of performance statistics indicators via `default_perf.py`. 
| `from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer` | [default_perf.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/perf/default_perf.py) | +| `stable_stage` | A performance evaluation result summary template for the stable stage, which only aggregates request data when the system reaches the configured maximum concurrency. It supports manual configuration of performance statistics indicators via `stable_stage.py`. | `from ais_bench.benchmark.configs.summarizers.perf.stable_stage import summarizer` | [stable_stage.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/perf/stable_stage.py) | \ No newline at end of file diff --git a/docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark.md b/docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark.md index 1cdbc4d8..f717b9a5 100644 --- a/docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark.md +++ b/docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark.md @@ -14,23 +14,92 @@ Before performing service-oriented inference, the following conditions must be m ## Main Functional Scenarios ### Single-Task Evaluation -Please refer to 📚 [Quick Start](../../get_started/quick_start.md) on the homepage for details; no further elaboration here. +Please refer to 📚 [Quick Start](../../get_started/quick_start.md) on the homepage for details. ### Multi-Task Evaluation It supports configuring multiple models or multiple dataset tasks simultaneously and conducting batch evaluations with a single command, which is suitable for large-scale model horizontal comparison or multi-dataset accuracy comparison analysis. -#### Command Description -Users can specify multiple configuration tasks via the `--models` and `--datasets` parameters. 
The number of subtasks is the product of the number of tasks configured by `--models` and `--datasets`—that is, one model configuration and one dataset configuration form a subtask. Example command: -```bash -ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt -``` -The above command specifies 2 model tasks (`vllm_api_general_chat`, `vllm_api_stream_chat`) and 2 dataset tasks (`gsm8k_gen_4_shot_cot_str`, `aime2024_gen_0_shot_chat_prompt`), and will execute the following 4 combined accuracy test tasks: +#### Description of Sub-task Combinations + +In multi-task evaluation scenarios, the number of subtasks is the product of the number of tasks configured by `models` and the number of tasks configured by `datasets`—that is, one model configuration and one dataset configuration form a subtask. The following example simultaneously evaluates 2 model tasks (`vllm_api_general_chat`, `vllm_api_stream_chat`) and 2 dataset tasks (`gsm8k_gen_4_shot_cot_str`, `aime2024_gen_0_shot_chat_prompt`), and will execute the following 4 combined accuracy test tasks: + [vllm_api_general_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py) model task + [gsm8k_gen_4_shot_cot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py) dataset task + [vllm_api_general_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py) model task + [aime2024_gen_0_shot_chat_prompt](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_chat_prompt.py) dataset task + [vllm_api_stream_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py) model task + 
[gsm8k_gen_4_shot_cot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py) dataset task + [vllm_api_stream_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py) model task + [aime2024_gen_0_shot_chat_prompt](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_chat_prompt.py) dataset task +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +Refer to the [model_api_test_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/model_api_test_zh_cn.py) file from the quick start. Import multiple model tasks and dataset tasks within `with read_base():`, then combine them into the `models` and `datasets` lists. For a complete example, refer to [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets + +models = vllm_api_general_chat + vllm_api_stream_chat +# ...For other parameter configurations, please refer to the configuration file +``` + +After 
modifying the configuration file, execute the command: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py +``` + +#### Custom Model-Dataset Pairings (Optional) + +By default, the `models` list and `datasets` list in the above configuration are automatically combined as a Cartesian product, with the number of subtasks equal to the number of models × the number of datasets (in this example, 2 × 2 = 4). If you want to precisely control which models are paired with which datasets (e.g., letting some models only run on some datasets to avoid meaningless combinations), you can explicitly declare the pairing relationship in the configuration file via the `model_dataset_combinations` field: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets +models = vllm_api_general_chat + vllm_api_stream_chat + +# Key: Precisely control pairings via model_dataset_combinations +# The following example generates only 2 subtasks (the Cartesian product would generate 4): +# - vllm_api_general_chat + gsm8k_gen_4_shot_cot_str +# - vllm_api_stream_chat + aime2024_gen_0_shot_chat_prompt +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), +] 
+``` + +> ⚠️ **Note**: The unique identifier for models and datasets is determined by the `abbr` field. In the same configuration file, repeated combinations of models or datasets with the same `abbr` will be treated as duplicate tasks and skipped. When reusing model/dataset configurations via methods such as `.copy()`, the `abbr` must be explicitly modified to ensure uniqueness. See 📚 [Custom Model and Dataset Combinations](../../advanced_tutorials/run_custom_config.md#custom-model-and-dataset-combinations) for details. + +::: + +:::{tab-item} Alternative: Using Command-Line Parameters + +Users can specify multiple configuration tasks via the `--models` and `--datasets` parameters. Example command: + +```bash +ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt +``` + #### Modify Configuration Files Corresponding to Tasks The actual paths of the configuration files for model tasks and dataset tasks can be queried by executing the command with the `--search` parameter: ```bash @@ -50,14 +119,20 @@ The following configuration files to be modified will be queried: │ --datasets │ aime2024_gen_0_shot_chat_prompt │ /your_workspace/benchmark_test/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_chat_prompt.py │ ╘═════════════╧═════════════════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛ ``` -- Refer to 📚 [Service-Oriented Inference Backend Configuration Parameter Description](../all_params/models.md#parameter-description-for-local-model-backend-configuration) to configure the configuration files corresponding to the model tasks `vllm_api_general_chat` and `vllm_api_stream_chat` according to the actual situation. 
+- Refer to 📚 [Service-Oriented Inference Backend Configuration Parameter Description](../all_params/models.md#parameter-description-for-service-oriented-inference-backend-configuration) to configure the configuration files corresponding to the model tasks `vllm_api_general_chat` and `vllm_api_stream_chat` according to the actual situation. - Refer to 📚 [Configure Open-Source Datasets](../all_params/datasets.md#configuring-open-source-datasets) to configure the configuration files corresponding to the dataset tasks `gsm8k_gen_4_shot_cot_str` and `aime2024_gen_0_shot_chat_prompt` according to the actual situation. **Note**: If the dataset is placed in the default directory `ais_bench/datasets/`, no configuration is generally required. #### Execute the Evaluation Command + Execute the command: + ```bash ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt ``` + +::: +:::: + During execution, a timestamp directory will be created under the path specified by 📚 [`--work-dir`](../all_params/cli_args.md#common-parameters) (default: `outputs/default/`) to store execution details. After the task is completed, an example of the on-screen log showing the results is as follows: @@ -109,18 +184,55 @@ At the same time, the final generated directory structure is as follows: ``` ### Multi-Task Parallel Evaluation -By default, multiple subtasks are executed serially. Continuous Batch is enabled by default within a single task, and multiple processes will be launched to send and process requests according to the maximum concurrency configured by the user, allowing for large concurrency settings. When the concurrency of a single task is low, multi-task parallelism can be achieved by setting the 📚 [`--max-num-workers`](../all_params/cli_args.md#accuracy-evaluation-parameters) parameter. Example as follows: +By default, multiple subtasks are executed serially. 
Continuous Batch is enabled by default within a single task, and multiple processes will be launched to send and process requests according to the maximum concurrency configured by the user, allowing for large concurrency settings. When the concurrency of a single task is low, multi-task parallelism can be achieved by setting the 📚 [`--max-num-workers`](../all_params/cli_args.md#common-parameters) parameter. Example as follows: + +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +In the custom configuration file, `max_num_workers` no longer needs to be set; instead, it is passed via the command-line parameter [`--max-num-workers`](../all_params/cli_args.md#common-parameters). The configuration file example is identical to that in [Multi-Task Evaluation](#multi-task-evaluation). For a complete example, refer to [multi_task_parallel_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py): + +```python +# The complete example is identical to the configuration in Multi-Task Evaluation; the only difference lies in the execution command +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets + +models = vllm_api_general_chat + 
vllm_api_stream_chat +# ...For other parameter configurations, please refer to the configuration file +``` + +Execute the command (specify the parallelism count via `--max-num-workers 4`): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py --max-num-workers 4 +``` + +::: +:::{tab-item} Alternative: Using Command-Line Parameters + ```bash ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt --max-num-workers 4 ``` -In the example above, the maximum number of concurrent tasks is set to 4, so four subtasks will be executed simultaneously. This can be viewed on the command line dashboard: +::: +:::: +In the example above, the maximum number of concurrent tasks is set to 4, so four subtasks will be executed simultaneously. This can be viewed on the command-line dashboard: ``` Base path of result&log : outputs/default/20251106_113926 Task Progress Table (Updated at: 2025-11-06 11:39:58) Page: 1/1 Total 5 rows of data -Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to exit +Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to exit +--------------------------------+-----------+----------------------------------------------------+-------------+-------------+-----------------------------------------------+---------------------------------------------------+ | Task Name | Process | Progress | Time Cost | Status | Log Path | Extend Parameters | @@ -133,18 +245,28 @@ Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to +--------------------------------+-----------+----------------------------------------------------+-------------+-------------+-----------------------------------------------+---------------------------------------------------+ | vllm-api-stream-chat/aime2024 | 1250138 | [############### ] 15/30 [5.0 it/s] | 0:00:07 | inferencing | 
logs/infer/vllm-api-stream-chat/aime2024.out | {'POST': 20, 'RECV': 15, 'FINISH': 15, 'FAIL': 0} | +--------------------------------+-----------+----------------------------------------------------+-------------+-------------+-----------------------------------------------+---------------------------------------------------+ + ``` The generated result is consistent with the example in [Multi-Task Evaluation](#multi-task-evaluation). + ### Resumption After Interruption & Retesting of Failed Cases If the inference task fails due to an unexpected interruption or server exception during the evaluation, the breakpoint management function can be enabled via `--reuse` to resume the task. It also supports automatic retesting of only failed cases without re-running all tasks. Example as follows: 1. Assume the user first executes the inference evaluation with the following command. If the task is interrupted due to an abnormal exit or some requests fail due to server exceptions: + +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +First execution command (based on [single_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py)): + ```bash -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt +ais_bench ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py ``` + At this point, some inference results will be saved, and the following file content will be generated under the 📚 [`--work-dir`](../all_params/cli_args.md#common-parameters) directory: + ```bash # Under output/default 20250628_151326/ # Timestamp directory created by the test task @@ -158,11 +280,41 @@ At this point, some inference results will be saved, and the following file cont └── tmp_0_2766386_1749107195.json # Cache file, named in the format: tmp_{task_process_ID}_{process_number}_{timestamp}.json ``` +2. 
Resume the inference by specifying the task timestamp directory via the `--reuse` parameter (`--reuse` is a common parameter; when using a custom configuration file, it can still be appended via the command line): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py --reuse 20250628_151326 +``` + +::: +:::{tab-item} Alternative: Using Command-Line Parameters + +```bash +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt +``` +At this point, some inference results will be saved, and the following file content will be generated under the 📚 [`--work-dir`](../all_params/cli_args.md#common-parameters) directory: +```bash +# Under output/default +20250628_151326/ # Timestamp directory created by the test task +├── configs # A combined configuration file of the configuration files for model tasks, dataset tasks, and structure presentation tasks +│ └── 20250628_151326_29317.py +├── logs # Logs during execution; if --debug is added to the command, no process logs will be saved to disk (all will be printed directly) +│ └── infer # Logs of the inference phase +└── predictions # Directory for inference results, recording the input of each request, model output, and answers (for accuracy evaluation) + └── vllm-api-general-chat + └── tmp_demo_gsm8k # Inference output of completed requests + └── tmp_0_2766386_1749107195.json # Cache file, named in the format: tmp_{task_process_ID}_{process_number}_{timestamp}.json +``` 2. 
Resume the inference by specifying the task timestamp directory via the `--reuse` parameter: ```bash ais_bench --models vllm_api_general --datasets gsm8k_gen --reuse 20250628_151326 ``` + +::: +:::: + The following content will be printed in the log, indicating that the resumption task has started: + ```bash 02/20 13:14:15 - AISBench - INFO - Found 10 tmp items, run infer task from the last interrupted position ``` @@ -170,8 +322,56 @@ After the resumption is completed, the accuracy results of all requests will be > ⚠️ Note: Resumption after interruption and retesting of failed cases may change the order of requests, which may cause slight fluctuations in results. -💡 [Multi-Task Evaluation](#multi-task-evaluation) also supports resumption after interruption and retesting of failed cases for all or part of the tasks. -For example, if an interruption occurs when executing the following multi-task evaluation command: +💡[Multi-Task Evaluation](#multi-task-evaluation) also supports resumption after interruption and retesting of failed cases for all or part of the tasks. + +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +For example, an interruption occurs when executing the following multi-task evaluation command: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py +``` + +Resume all tasks after interruption in the following way: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py --reuse 20250628_151326 +``` + +You can also resume only part of the tasks after editing the custom configuration file. 
For a complete example, refer to [multi_task_resume_partial_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +datasets = gsm8k_datasets +models = vllm_api_general_chat +# ...For other parameter configurations, please refer to the configuration file +``` + +Then execute: + +```bash +# Resume only the vllm_api_general_chat + gsm8k_gen_4_shot_cot_str task after interruption +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py --reuse 20250628_151326 + +# Resume the two tasks of vllm_api_general_chat + gsm8k_gen_4_shot_cot_str and vllm_api_general_chat + aime2024_gen_0_shot_chat_prompts +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py --reuse 20250628_151326 +``` + +> 💡 If you need to resume only part of the combinations (e.g., `vllm_api_general_chat + aime2024`, `vllm_api_stream_chat + aime2024`), simply specify the corresponding model tasks and dataset tasks in the custom configuration file and then specify the timestamp via `--reuse`. See 📚 [Custom Model-Dataset Pairings](../../advanced_tutorials/run_custom_config.md#6-custom-model-dataset-pairings) for details. 
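The partial-resume idea in the note above can be sketched by selecting configurations by their `abbr` and pairing them through `model_dataset_combinations`. This is a minimal sketch, not the content of any preset file: the plain dicts stand in for the lists normally imported inside `read_base()`, the `abbr` values are assumed from the log paths shown in this document, and `pick` is a hypothetical helper introduced only for illustration.

```python
# Stand-in configs (assumption): in a real configuration file these lists
# come from the read_base() imports, and each entry carries many more fields.
models = [
    dict(abbr="vllm-api-general-chat"),
    dict(abbr="vllm-api-stream-chat"),
]
datasets = [
    dict(abbr="gsm8k"),
    dict(abbr="aime2024"),
]


def pick(configs, abbrs):
    """Hypothetical helper: keep the configs whose abbr is listed, preserving order."""
    wanted = set(abbrs)
    return [cfg for cfg in configs if cfg["abbr"] in wanted]


# Pair both models with aime2024 only, so gsm8k subtasks are left out.
model_dataset_combinations = [
    dict(
        models=pick(models, ["vllm-api-general-chat", "vllm-api-stream-chat"]),
        datasets=pick(datasets, ["aime2024"]),
    ),
]
```

Selecting by `abbr` rather than by list index keeps the pairing stable if the order of the imported lists changes. Whether `--reuse` then resumes exactly these subtasks should be confirmed against the linked pairing documentation.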
+ +::: +:::{tab-item} Alternative: Using Command-Line Parameters + ```bash ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt ``` @@ -181,27 +381,179 @@ ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_g ``` You can also resume only part of the tasks in the following ways: ```bash -# Resume only the task of vllm_api_general_chat + gsm8k_gen_4_shot_cot_str +# Resume only the vllm_api_general_chat + gsm8k_gen_4_shot_cot_str task after interruption ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_4_shot_cot_str --reuse 20250628_151326 # Resume the two tasks of vllm_api_general_chat + gsm8k_gen_4_shot_cot_str and vllm_api_general_chat + aime2024_gen_0_shot_chat_prompts ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt --reuse 20250628_151326 # Resume the two tasks of vllm_api_general_chat + aime2024_gen_0_shot_chat_prompts and vllm_api_stream_chat + aime2024_gen_0_shot_chat_prompts -ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets aime2024_gen_0_shot_chat_prompt --reuse 20250628 +ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets aime2024_gen_0_shot_chat_prompt --reuse 20250628_151326 ``` +::: +:::: + ### Merging Sub-dataset Inference -Some datasets are categorized into different sub-datasets, which will be split into multiple subtasks for inference during the inference process. Examples include 📚 [MMLU](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/mmlu/README_en.md) and 📚 [CEVAL](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/ceval/README_en.md). AISBench Benchmark supports merging datasets that consist of multiple small-scale datasets into a single task for unified evaluation. 
An example command is as follows: +Some datasets are categorized into different sub-datasets, which will be split into multiple subtasks for inference during the inference process. Examples include 📚 [MMLU](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/mmlu/README_en.md) and 📚 [CEVAL](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/ceval/README_en.md). AISBench Benchmark supports merging datasets that consist of multiple small-scale datasets into a single task for unified evaluation. An example is as follows: + +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +Modify the custom configuration file to import a dataset task that supports merged inference. For a complete example, refer to [ceval_merge_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_5_shot_str import ceval_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + +models = vllm_api_general +# ...For other parameter configurations, please refer to the configuration file +``` + +Execute the command (`--merge-ds` is a common parameter; when using a custom configuration file, it can still be appended via the command line): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py --merge-ds +``` + +::: +:::{tab-item} Alternative: Using Command-Line Parameters + ```bash ais_bench --models vllm_api_general --datasets ceval_gen --merge-ds ``` + +::: +:::: + > 
⚠️ Note: In merge mode, only the overall result will be generated, and the accuracy of individual sub-datasets will no longer be listed separately. Additionally, if you need to resume interrupted inference or re-run failed cases for inference results that were interrupted or failed in merge mode, you must also add `--merge-ds` to the command. +### Fixed Request Count Evaluation + +When the dataset scale is too large and you only want to perform accuracy testing on a subset of samples, you can use either of the following two approaches to control the data reading range. They achieve the same goal, so just pick the one that fits your habit: + +- **Basic approach**: Specify the number of data entries to read directly via the command-line parameter 📚 [`--num-prompts`](../all_params/cli_args.md#common-parameters). No configuration file modification is required, and it is the simplest to use. +- **Advanced approach (more powerful)**: Set the `reader_cfg.test_range` field of the dataset in the custom configuration file, which supports a more flexible sampling range (e.g., specifying a start index and custom step). For detailed usage, refer to 📚 [Custom Configuration Files](../../advanced_tutorials/run_custom_config.md). 
+ +Example as follows: + +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +**Method 1: Basic approach — Use `--num-prompts` to specify the number of entries to read** + +For a complete example, refer to [fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_stream_chat +# ...For other parameter configurations, please refer to the configuration file +``` + +Execute the command (specify reading only 1 sample via `--num-prompts 1`): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py --num-prompts 1 +``` + +**Method 2: Advanced approach — Use `test_range` to flexibly specify the reading range** + +If you need more flexible range control (e.g., specifying a start index and custom step), you can set the `reader_cfg.test_range` field of the dataset directly in the custom configuration file, without passing any command-line parameter. 
For a complete example, refer to [fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +# Key: control the sampling range flexibly via reader_cfg.test_range +# For example, '[0:8]' reads the first 8 samples; '[10:20]' reads samples from index 10 to 20 +datasets[0]['reader_cfg']['test_range'] = '[0:8]' + +models = vllm_api_stream_chat +# ...For other parameter configurations, please refer to the configuration file +``` + +Execute the command (test_range has been specified in the configuration file, no need to pass `--num-prompts`): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py +``` + +::: +:::{tab-item} Alternative: Using Command-Line Parameters + +```bash +ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --num-prompts 1 +``` +The above command only performs inference on the first entry in the sample dataset and only evaluates the accuracy of this one entry. + +::: +:::: + +> ⚠️ Note: Currently, the dataset is read sequentially in the default queue order; random sampling or shuffling is not supported. When `reader_cfg.test_range` in the configuration file and the command-line `--num-prompts` are both specified, the command-line parameter `--num-prompts` takes precedence. 
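To preview which samples a given `test_range` string would select, the sketch below parses the string into a Python slice. It assumes that `test_range` follows Python slice semantics (`[start:stop]` or `[start:stop:step]`, with `stop` excluded), which this document does not state explicitly; `parse_test_range` and `preview_indices` are hypothetical helpers, not part of AISBench.

```python
def parse_test_range(test_range):
    """Convert a string such as '[0:8]' or '[0:30:10]' into a slice object."""
    text = test_range.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"expected '[start:stop]' or '[start:stop:step]', got {test_range!r}")
    parts = text[1:-1].split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected two or three ':'-separated fields, got {test_range!r}")
    # An empty field means "no bound", exactly as in a normal Python slice.
    bounds = [int(part) if part.strip() else None for part in parts]
    return slice(*bounds)


def preview_indices(total, test_range):
    """Return the sample indices selected from a dataset holding `total` entries."""
    return list(range(total))[parse_test_range(test_range)]


print(preview_indices(30, "[0:8]"))      # first 8 samples
print(preview_indices(30, "[10:20]"))    # 10 samples starting at index 10
print(preview_indices(30, "[0:30:10]"))  # every 10th sample
```

Under this assumption, `'[10:20]'` selects indices 10 through 19, so the entry at index 20 itself is not read.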
### Multiple Independent Repeat Inference > After enabling this feature, the `dataset`/`number of requests` will be multiplied at the `data point level`, which will significantly increase inference time and memory usage. Please read 📚 [Accuracy Evaluation Scenario: Interpretation of Evaluation Metrics](../results_intro/accuracy_metric.md) first, and **confirm whether this feature is necessary for your current scenario** before enabling it. -This scenario aims to explore model capabilities from multiple dimensions such as reliability, stability, and overall accuracy. To enable it, configure the value of the 🔗[`num_return_sequences` parameter](../all_params/models.md#parameter-description-for-service-oriented-inference-backend-configuration) in the hyperparameter `generation_kwargs` within the `service-side inference backend configuration parameters`. Refer to the following example for the format (the value provided is for reference only): +This scenario aims to explore model capabilities from multiple dimensions such as reliability, stability, and overall accuracy. To enable it, configure the value of the 🔗[`num_return_sequences` parameter](../all_params/models.md#parameter-description-for-service-oriented-inference-backend-configuration) in the hyperparameter `generation_kwargs` within the `service-side inference backend configuration parameters`. 
+ +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +For a complete example, refer to [multi_repeat_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_stream_chat +# Key: Enable multiple independent repeat inference via generation_kwargs.num_return_sequences +models[0]["generation_kwargs"] = dict( + temperature=0.01, + ignore_eos=False, + num_return_sequences=5, # For specific functions and constraints, refer to the document accuracy_metric.md +) +# ...For other parameter configurations, please refer to the configuration file +``` + +Execute the command: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py +``` + +::: +:::{tab-item} Alternative: Using Command-Line Parameters + +Modify `generation_kwargs` in the model task configuration file: ```python models = [ @@ -211,11 +563,14 @@ models = [ num_return_sequences = 5, # For specific functions and constraints, refer to the document accuracy_metric.md ... # Other parameters ), - ... + ... # Other parameters ) ] ``` +::: +:::: + After the accuracy evaluation phase is completed, the results will be recorded in the log and printed in the running window. 
The format is as shown in the following example (data is for reference only): ```bash @@ -227,8 +582,26 @@ After the accuracy evaluation phase is completed, the results will be recorded i | aime2024 | 604a78 | cons@5 | gen | 13.33 | ``` -For **specific interpretation of indicators** and **parameter constraints** in the table above, please refer to 📚 [Accuracy Evaluation Scenario: Interpretation of Evaluation Metrics](accuracy_metric.md). +For **specific interpretation of indicators** and **parameter constraints** in the table above, please refer to 📚 [Accuracy Evaluation Scenario: Interpretation of Evaluation Metrics](accuracy_metric.md) + +## Implementation via Custom Configuration Files + +> 💡 All the above functional scenarios (multi-task evaluation, multi-task parallelism, resumption after interruption, merged sub-datasets, fixed request count evaluation, multiple independent repeat inference, re-evaluation of inference results, etc.) provide two startup methods (**⭐ Recommended: Using a Custom Configuration File**, **Alternative: Using Command-Line Parameters**). The custom configuration file is essentially a Python script, which supports all Python syntax such as loops, conditional statements, and list comprehensions. You can write model, dataset, summarizer, and other configurations into a single file—write once, reuse multiple times. 
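Because the configuration file is an ordinary Python script, repetitive model entries can be generated with a loop instead of being written by hand. The following is a minimal sketch under stated assumptions: `base_model` is a stand-in dict rather than an imported AISBench model configuration, and the `host_ip` and `host_port` field names and the port values are assumed for illustration. Only the `abbr` field and its uniqueness requirement come from this document.

```python
import copy

# Stand-in for one entry of an imported `models` list (assumption):
# host_ip and host_port are illustrative field names.
base_model = dict(
    abbr="vllm-api-general-chat",
    host_ip="localhost",
    host_port=8080,
)

ports = [8080, 8081, 8082]  # hypothetical service ports

models = []
for port in ports:
    cfg = copy.deepcopy(base_model)  # deep copy so nested dicts are not shared
    cfg["host_port"] = port
    # abbr identifies the model, so every copy needs a distinct value.
    cfg["abbr"] = f"{base_model['abbr']}-{port}"
    models.append(cfg)

print([cfg["abbr"] for cfg in models])
```

`copy.deepcopy` is used instead of `.copy()` so that nested settings such as `generation_kwargs` would not be shared between the generated entries.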
+ +All custom configuration file examples involved in this section have been uniformly stored in the `ais_bench/configs/accuracy_benchmark/` directory for easy reference and reuse: + +| Filename | Corresponding Scenario | +| --- | --- | +| [single_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py) | Single-task evaluation | +| [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py) | Multi-task evaluation | +| [multi_task_parallel_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py) | Multi-task parallel evaluation | +| [multi_task_resume_partial_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py) | Resumption after interruption & retesting of failed cases (partial tasks) | +| [ceval_merge_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py) | Merging sub-dataset inference | +| [fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py) | Fixed request count evaluation | +| [multi_repeat_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py) | Multiple independent repeat inference | +| [inference_re_eval_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py) | Re-evaluation of inference results | +> For a complete description of the custom configuration file syntax (including the top-level variables that can be defined, detailed field descriptions, advanced Python usage, etc.), please refer to 📚 [Running AISBench with a Custom Configuration File](../../advanced_tutorials/run_custom_config.md). 
The "Custom Configuration File Examples for Each Scenario" section also provides complete examples of 10 typical scenarios (such as service-oriented performance evaluation, synthetic dataset performance evaluation, steady-state performance evaluation, multi-turn dialogue performance evaluation, judge model evaluation, custom dataset evaluation, etc.). ## Other Functional Scenarios ### Re-evaluation of Inference Results @@ -241,25 +614,70 @@ graph LR; D --> E[Generate a summary report based on accuracy data] E --> F((Present results)) ``` - -Each link in the entire execution process is independently decoupled, and inference results can be re-evaluated repeatedly. If there is an issue with the accuracy data obtained from the first accuracy evaluation (e.g., failure to accurately extract valuable content from the response), you can modify the answer extraction method and perform re-evaluation of the inference results. The specific operations are as follows: +Each link in the entire execution process is independently decoupled, and inference results can be re-evaluated repeatedly. If there is an issue with the accuracy data obtained from the first accuracy evaluation (e.g., failure to accurately extract valuable content from the response), you can modify the answer extraction method and perform re-evaluation of the inference results. The specific operations are as follows. Assume the command used for the previous accuracy evaluation was: + +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + ```bash -ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt +ais_bench ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py ``` -And the timestamp of the saved results is `20250628_151326`. However, the accuracy data for 8 cases is incorrect, showing a score of 0: +At the same time, the timestamp of the saved results is `20250628_151326`. 
However, the accuracy data for 8 cases is incorrect, showing a score of 0: ```bash dataset version metric mode vllm_api_general_chat ----------------------- -------- -------- ----- ---------------------- demo_gsm8k 401e4c accuracy gen 00.00 ``` +Check `20250628_151326/predictions/vllm-api-general-chat/gsm8k.json` and find that the inference results actually contain the correct answers. + +**Re-evaluation steps:** + +1. Edit the custom configuration file (e.g., [inference_re_eval_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py)) to override the answer extraction function in the `eval_cfg` of the corresponding dataset according to actual needs (refer to the following example). The `pred_postprocessor` is responsible for extracting the answer from the model output and can be replaced or customized according to the actual situation. The complete example is as follows: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.datasets import gsm8k_postprocess, gsm8k_dataset_postprocess + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +models = vllm_api_general_chat +# ...For other parameter configurations, please refer to the configuration file + +# Key: Replace or modify the implementation of the answer extraction function +datasets[0]['eval_cfg']['pred_postprocessor'] = dict(type=gsm8k_postprocess) +datasets[0]['eval_cfg']['dataset_postprocessor'] = dict(type=gsm8k_dataset_postprocess) +``` + +2. 
On the basis of the first accuracy evaluation command, add `--mode eval` and `--reuse {timestamp of the inference results to be reused}` to perform repeated re-evaluation (`--mode` and `--reuse` are common parameters; when using a custom configuration file, they can still be appended via the command line): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py --mode eval --reuse 20250628_151326 +``` + +::: +:::{tab-item} Alternative: Using Command-Line Parameters +```bash +ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt +``` +At the same time, the timestamp of the saved results is `20250628_151326`. However, the accuracy data for 8 cases is incorrect, showing a score of 0: +```bash +dataset version metric mode vllm_api_general_chat +----------------------- -------- -------- ----- ---------------------- +demo_gsm8k 401e4c accuracy gen 00.00 +``` Check `20250628_151326/predictions/vllm-api-general-chat/gsm8k.json` and find that the inference results actually contain the correct answers. At this point, you can modify the configuration file corresponding to the `gsm8k_gen_4_shot_cot_chat_prompt` dataset task. 
Use the `--search` command to query the path of the corresponding configuration file: ```bash ais_bench --datasets gsm8k_gen_4_shot_cot_chat_prompt --search ``` - The configuration file path will be displayed as follows: ```bash ╒═════════════╤═══════════════════════════════════════╤═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕ @@ -267,6 +685,7 @@ The configuration file path will be displayed as follows: ╞═════════════╪═══════════════════════════════════════╪═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡ │ --datasets │ gsm8k_gen_4_shot_cot_chat_prompt │ /your_workspace/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_chat_prompt.py │ ╘═════════════╧═══════════════════════════════════════╧═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛ + ``` Open `gsm8k_gen_4_shot_cot_chat_prompt.py` and replace or modify the answer extraction function: @@ -281,9 +700,14 @@ gsm8k_eval_cfg = dict(evaluator=dict(type=Gsm8kEvaluator), pred_postprocessor=dict(type=gsm8k_postprocess), # Replace or modify the implementation of the answer extraction function dataset_postprocessor=dict(type=gsm8k_dataset_postprocess)) # ...... 
+ ``` You can add `--mode eval` and `--reuse {timestamp of the inference results to be reused}` to the command of the first accuracy evaluation to perform repeated re-evaluation: ```bash ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode eval --reuse 20250628_151326 -``` \ No newline at end of file + +``` + +::: +:::: \ No newline at end of file diff --git a/docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark_local.md b/docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark_local.md index fc2d2d46..740f7374 100644 --- a/docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark_local.md +++ b/docs/source_en/base_tutorials/scenes_intro/accuracy_benchmark_local.md @@ -2,29 +2,210 @@ Load models and datasets in a local environment, compare outputs with reference answers through a unified inference process, and evaluate the inherent accuracy of the model. Customize parameters such as batch size and sequence length, applicable to the **Huggingface Transformers** inference framework. ## Test Preparation -Before performing local model inference, the following conditions must be met: +Before performing service-oriented inference, the following conditions must be met: - Available model weights: Ensure that the model weight files to be tested are already available locally. Open-source weights can be obtained from 🔗 [Hugging Face Community](https://huggingface.co/models). - Dataset task preparation: Select a dataset from 📚 [Open-Source Datasets](../all_params/datasets.md#open-source-datasets), and choose the dataset task to execute in the "detailed introduction" document corresponding to the dataset. Prepare the dataset files according to the "detailed introduction" document of the selected dataset task. It is recommended to manually place the open-source dataset in the default directory `ais_bench/datasets/`, and the program will automatically load the dataset files during task execution. 
- Model task preparation: Select the model task to execute from 📚 [Local Model Backend](../all_params/models.md#local-model-backend). ## Main Functions -The main functions in the pure model accuracy evaluation scenario are similar to those in the service-oriented accuracy evaluation scenario. + +The main functions in the pure model accuracy evaluation scenario are similar to those in the service-oriented accuracy evaluation scenario, but the model task needs to be replaced with a local HuggingFace model task (such as [`HuggingFacewithChatTemplate`](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/models/huggingface_chat_model.py) or [`HuggingFaceBaseModel`](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/models/huggingface_base_model.py)). ### Pure Model Multi-Task Evaluation -Refer to [Usage of Service-Oriented Accuracy Multi-Task Evaluation](accuracy_benchmark.md#multi-task-evaluation). + +Supports simultaneous configuration of multiple dataset tasks through a single command for batch evaluation. 
For a complete example, refer to [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + +datasets = gsm8k_datasets + aime2024_datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # Replace with the actual local model weight path + tokenizer_path='THUDM/chatglm-6b', + # ...For other parameter configurations, see the configuration file + ) +] +``` + +Execution command: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py +``` + +#### Custom Model-Dataset Pairings (Optional) + +By default, the `models` list and `datasets` list in the above configuration will automatically be combined in a Cartesian product, and the number of sub-tasks is the number of models × the number of datasets (1 × 2 = 2 in this example). 
If you want to precisely control which models are paired with which datasets (for example, only let the model run a subset of datasets), you can explicitly declare the pairing relationship through the `model_dataset_combinations` field in the configuration file: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + +datasets = gsm8k_datasets + aime2024_datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # Replace with the actual local model weight path + tokenizer_path='THUDM/chatglm-6b', + ) +] + +# Key: Precisely control pairings through model_dataset_combinations +# The following example generates only 1 sub-task (the Cartesian product would generate 2): +# - hf-chat-model + gsm8k +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), +] +``` + +> ⚠️ **Note**: The unique identifier of a model or dataset is determined by the `abbr` field. In the same configuration file, combinations where models or datasets with the same `abbr` appear repeatedly will be considered duplicate tasks and will be skipped. When reusing model/dataset configurations through methods like `.copy()`, you must explicitly modify `abbr` to ensure uniqueness. For details, refer to 📚 [Custom Model-Dataset Combinations](../../advanced_tutorials/run_custom_config.md#custom-model-and-dataset-combinations). 
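
As a minimal sketch of the two points above, the following uses plain dictionaries rather than real AISBench config objects. The `abbr` values and paths are hypothetical placeholders, and the pairing logic only illustrates the counting, not the tool's internal implementation. It compares the default Cartesian product with an explicit pairing, and shows a `.copy()`-based reuse that keeps `abbr` unique:

```python
from itertools import product

# Plain-dict stand-ins for real model/dataset configs (hypothetical values).
models = [dict(abbr='hf-chat-model', path='/path/to/weights')]
datasets = [dict(abbr='demo_gsm8k'), dict(abbr='aime2024')]

# Default behaviour described above: every model is paired with every dataset.
cartesian_pairs = [(m['abbr'], d['abbr']) for m, d in product(models, datasets)]

# Explicit pairing: only the first model with the first dataset.
model_dataset_combinations = [
    dict(models=[models[0]], datasets=[datasets[0]]),
]
explicit_pairs = [
    (m['abbr'], d['abbr'])
    for combo in model_dataset_combinations
    for m, d in product(combo['models'], combo['datasets'])
]

# Reusing a config via .copy(): change abbr so the copy counts as a distinct task.
second_model = models[0].copy()
second_model['abbr'] = 'hf-chat-model-variant'
second_model['path'] = '/path/to/other/weights'
all_models = models + [second_model]
model_abbrs = [m['abbr'] for m in all_models]
abbrs_are_unique = len(model_abbrs) == len(set(model_abbrs))

print(cartesian_pairs)
print(explicit_pairs)
print(abbrs_are_unique)
```

Note that `dict.copy()` is shallow, so any nested dictionaries in a copied config are still shared with the original until they are replaced.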
+ +> 💡 For detailed usage, you can also refer to [Usage of Service-Oriented Accuracy Multi-Task Evaluation](accuracy_benchmark.md#multi-task-evaluation). ### Pure Model Multi-Task Parallel Evaluation -Refer to [Usage of Service-Oriented Accuracy Multi-Task Parallel Evaluation](accuracy_benchmark.md#multi-task-parallel-evaluation). + +Supports multi-task parallelism through the [`--max-num-workers`](../all_params/cli_args.md#common-parameters) command-line parameter. The configuration file example is exactly the same as [Pure Model Multi-Task Evaluation](#pure-model-multi-task-evaluation), the only difference is the execution command. + +Execution command (taking `max-num-workers 4` as an example): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py --max-num-workers 4 +``` + > ⚠️ Note: Multi-task parallel evaluation in pure model accuracy evaluation will occupy different GPU units. The number of GPU units required for parallel tasks should be less than or equal to the total number of available GPUs. +> 💡 For detailed usage, you can also refer to [Usage of Service-Oriented Accuracy Multi-Task Parallel Evaluation](accuracy_benchmark.md#multi-task-parallel-evaluation). + ### Pure Model Resumption After Interruption -During the pure model accuracy evaluation, if the task is interrupted, you can use the `--reuse` parameter to specify the task timestamp directory to continue the unfinished inference task, realizing breakpoint resumption. This function does not require re-running all tasks, but only performs supplementary inference on the unfinished parts. For details on usage, refer to [Usage of Service-Oriented Accuracy Resumption After Interruption](accuracy_benchmark.md#resumption-after-interruption-&-retesting-of-failed-cases). 
+ +During the pure model accuracy evaluation, if the task is interrupted, you can use the `--reuse` parameter to specify the task timestamp directory to continue the unfinished inference task, realizing breakpoint resumption. This function does not require re-running all tasks, but only performs supplementary inference on the unfinished parts. + +First execution command: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py +``` + +Specify the task timestamp directory through the `--reuse` parameter to continue (`--reuse` is a common parameter, and can still be appended through the command line when using a custom configuration file): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py --reuse 20250628_151326 +``` + > ⚠️ Note: Currently, pure model accuracy evaluation does not support automatic retesting of failed cases. +> 💡 For detailed usage, you can also refer to [Usage of Service-Oriented Accuracy Resumption After Interruption](accuracy_benchmark.md#resumption-after-interruption-&-retesting-of-failed-cases). + ### Pure Model Merged Sub-Dataset Inference -Refer to [Usage of Service-Oriented Accuracy Merged Sub-Dataset Inference](accuracy_benchmark.md#merging-sub-dataset-inference). + +Supports merging datasets containing multiple small-scale sub-datasets into a single task for unified evaluation. 
For a complete example, refer to [ceval_merge_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_5_shot_str import ceval_datasets as datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # Replace with the actual local model weight path + tokenizer_path='THUDM/chatglm-6b', + # ...For other parameter configurations, see the configuration file + ) +] +``` + +Execution command (`--merge-ds` is a common parameter, and can still be appended through the command line when using a custom configuration file): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py --merge-ds +``` + +> 💡 For detailed usage, you can also refer to [Usage of Service-Oriented Accuracy Merged Sub-Dataset Inference](accuracy_benchmark.md#merging-sub-dataset-inference). + +## Implementation via Custom Configuration Files + +> 💡 All the above functional scenarios (multi-task evaluation, multi-task parallel, resumption after interruption, merged sub-dataset, etc.) can be implemented through the [Custom Configuration File](../../advanced_tutorials/run_custom_config.md) approach. The configuration file is essentially a Python script, which supports all Python syntaxes such as loops, conditional judgments, and list comprehensions. Model, dataset, summarizer, and other configurations can be written into one file for one-time writing and multiple reuse. 
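
To illustrate the "configuration file is a Python script" point, here is a small sketch that builds a `models` list with a list comprehension and a conditional filter. The checkpoint names and paths are hypothetical, and a placeholder string stands in for the model class so the snippet runs without AISBench installed:

```python
# Shared settings; in a real config file 'type' would be the model class
# (a placeholder string is used here so the sketch is self-contained).
base_model = dict(type='HuggingFacewithChatTemplate')

# Hypothetical checkpoint names and local weight paths.
checkpoints = {
    'ckpt-a': '/path/to/checkpoint_a',
    'ckpt-b': '/path/to/checkpoint_b',
    'ckpt-c': '/path/to/checkpoint_c',
}
skip = {'ckpt-b'}  # conditional filter: leave this checkpoint out of the run

# One model config per remaining checkpoint, each with its own unique abbr.
models = [
    dict(base_model, abbr=f'hf-chat-{name}', path=path, tokenizer_path=path)
    for name, path in checkpoints.items()
    if name not in skip
]

for model in models:
    print(model['abbr'], model['path'])
```

Generating `abbr` from the loop variable keeps every generated entry distinct, which matters because `abbr` is what identifies a model within a configuration file.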
+ +All custom configuration file examples involved in this section are uniformly stored in the `ais_bench/configs/accuracy_benchmark_local/` directory for easy reference and reuse: + +| File Name | Corresponding Scenario | +| --- | --- | +| [single_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py) | Single-Task Evaluation | +| [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py) | Pure Model Multi-Task Evaluation / Multi-Task Parallel Evaluation | +| [ceval_merge_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py) | Merged Sub-Dataset Inference | +| [inference_re_eval_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py) | Re-Evaluation of Pure Model Inference Results | + +For details, refer to the "Pure Model Accuracy Evaluation" example in [Running AISBench via Custom Configuration Files](../../advanced_tutorials/run_custom_config.md#custom-configuration-file-examples-for-each-scenario). ## Other Functions + ### Re-Evaluation of Pure Model Inference Results -Refer to [Usage of Service-Oriented Accuracy Re-Evaluation of Inference Results](accuracy_benchmark.md#re-evaluation-of-inference-results). 
\ No newline at end of file + +For a complete example, refer to [inference_re_eval_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.datasets import gsm8k_postprocess, gsm8k_dataset_postprocess + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # Replace with the actual local model weight path + tokenizer_path='THUDM/chatglm-6b', + # ...For other parameter configurations, see the configuration file + ) +] + +# Key: Replace or modify the answer extraction function implementation +datasets[0]['eval_cfg']['pred_postprocessor'] = dict(type=gsm8k_postprocess) +datasets[0]['eval_cfg']['dataset_postprocessor'] = dict(type=gsm8k_dataset_postprocess) +``` + +Execution command (`--mode eval` and `--reuse` are common parameters, and can still be appended through the command line when using a custom configuration file): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py --mode eval --reuse 20250628_151326 +``` + +> 💡 For detailed usage, you can also refer to [Usage of Service-Oriented Accuracy Re-Evaluation of Inference Results](accuracy_benchmark.md#re-evaluation-of-inference-results). 
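
As a rough sketch of what a replacement answer extraction function might look like, the hypothetical helper below returns the last number found in a model output. It is not the implementation of `gsm8k_postprocess`; the function name and the extraction rule (unsigned numbers only, thousands separators removed) are assumptions chosen for illustration:

```python
import re


def extract_last_number(text: str) -> str:
    """Return the last unsigned number in the text as a string, or '' if none is found."""
    # Drop commas so that "1,234" is read as a single number.
    cleaned = text.replace(',', '')
    # Match integers and decimals; a sign is deliberately not handled in this sketch.
    matches = re.findall(r'\d+(?:\.\d+)?', cleaned)
    return matches[-1] if matches else ''


print(extract_last_number('She has 3 boxes of 12, so the answer is 36.'))
```

Whether a plain function like this can be passed as `dict(type=extract_last_number)` in the same way as `gsm8k_postprocess` is an assumption that should be checked against the AISBench source before relying on it.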
\ No newline at end of file diff --git a/docs/source_en/base_tutorials/scenes_intro/performance_benchmark.md b/docs/source_en/base_tutorials/scenes_intro/performance_benchmark.md index 7bdc5dd2..5d1fd299 100644 --- a/docs/source_en/base_tutorials/scenes_intro/performance_benchmark.md +++ b/docs/source_en/base_tutorials/scenes_intro/performance_benchmark.md @@ -1,40 +1,159 @@ -# Guide to Service-Oriented Performance Evaluation -## Introduction -AISBench Benchmark provides service-oriented performance evaluation capabilities. For streaming inference scenarios, it systematically evaluates key performance indicators of model services in real-world deployment environments—such as response latency (e.g., TTFT, Inter-Token Latency), throughput capacity (e.g., QPS, TPUT), and concurrent processing capability—by accurately recording the send time of each request, the return time of each stage, and the response content. +# Service-Oriented Performance Evaluation +Send batch requests to the service through a unified request interface to evaluate the service performance of the model in actual deployment scenarios. The request sending mode and request data can be customized to obtain performance indicators such as throughput and latency. It supports two deployment frameworks: **vLLM** and **vLLM-Ascend**, and provides complete performance analysis reports. -Users can flexibly control request content, request intervals, concurrent quantities, and other parameters by configuring service-oriented backend parameters to adapt to different evaluation scenarios (e.g., low-concurrency latency-sensitive scenarios, high-concurrency throughput-priority scenarios). The evaluation supports automated execution and outputs structured results, facilitating horizontal comparison of service performance differences across different models, deployment solutions, and hardware configurations. 
+## Quick Start +### Prerequisite + +Performance evaluation requires **a service environment to be prepared first** (i.e., a service program that provides OpenAI-compatible interfaces). + +A reference way to start the service (vLLM OpenAI-compatible service): + +```bash +vllm serve Qwen/Qwen2.5-7B-Instruct --port 8080 --max-model-len 4096 +``` + +Wait until the service has started and is listening on the port, then use the following configuration file for evaluation. + +:::{admonition} Recommended Practice +:class: tip + +For details on how to write the following custom configuration file, refer to [Custom Configuration Files](../../advanced_tutorials/run_custom_config.md#custom-configuration-file-examples-for-each-scenario). A custom configuration file supports richer parameter settings, such as `num_prompts` and `request_rate` (QPS sending mode). +::: + +### One-Click Evaluation + +After the service is started, the following **custom configuration file** can be used to send the `ShareGPT` dataset to the service at `request_rate=1` (QPS) for performance evaluation. 
+ +- Configuration file content: + ```python + from mmengine.config import read_base + from ais_bench.benchmark.models import vLLMCausalLM + from ais_bench.benchmark.partitioners import NaivePartitioner + from ais_bench.benchmark.runners.local_api import LocalAPIRunner + from ais_bench.benchmark.tasks import OpenICLInferTask + from ais_bench.benchmark.datasets import GenericDataset + + with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + + datasets = [ + dict( + type=GenericDataset, + abbr='sharegpt', + path='ais_bench/datasets/ShareGPT/ShareGPT.jsonl', + reader_cfg=dict( + input_columns=['prompt'], + output_column='completion', + ), + infer_cfg=dict( + prompt_template=dict( + type=PromptTemplate, + template=dict( + round=[ + dict( + role='HUMAN', + prompt='{prompt}', + ), + ], + ), + ), + retriever=dict(type=ZeroRetriever), + inferencer=dict( + type=GenInferencer, + generation_kwargs={ + 'max_new_tokens': 1024, + 'temperature': 0, + 'top_p': 1.0, + }, + ), + ), + ) + ] + + models = [ + dict( + type=vLLMCausalLM, + abbr='vllm-qwen2.5-7b', + path='Qwen/Qwen2.5-7B-Instruct', + model_kwargs=dict( + tokenizer_path='Qwen/Qwen2.5-7B-Instruct', + ), + url='http://localhost:8080/v1/chat/completions', + max_out_len=1024, + batch_size=50, + generation_kwargs={ + 'temperature': 0, + 'top_p': 1.0, + }, + ), + ] + + # Custom performance dimensions + stats_list = [ + 'request_rate', + 'num_prompts', + 'benchmark_duration', + 'avg_latency', + 'p99_latency', + 'qps', + 'tput', + 'concurrency', + ] + + # Number of requests to send + num_prompts = 50 + # Sending rate (QPS), only takes effect when not equal to -1 + request_rate = 1.0 + ``` + +- Execution command: + ```bash + ais_bench performance_qwen2_7b_sharegpt.py + ``` + +After the task is completed, you can view the performance result report in the `summary/` directory under the task output directory. 
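
For a sense of scale, the arithmetic behind `num_prompts` and `request_rate` can be sketched as follows. This assumes requests are dispatched at evenly spaced intervals, which is a simplification; the tool's actual sending pattern may differ, and the helper name is hypothetical:

```python
def estimated_send_window(num_prompts: int, request_rate: float) -> float:
    """Seconds between the first and last request under evenly spaced sending."""
    if num_prompts < 1:
        raise ValueError('num_prompts must be at least 1')
    if request_rate <= 0:
        raise ValueError('request_rate must be positive for a fixed-rate schedule')
    # The first request goes out at t=0, so there are num_prompts - 1 intervals.
    return (num_prompts - 1) / request_rate


# Values from the configuration above: 50 requests at 1 request per second.
print(estimated_send_window(50, 1.0))
```

Under this assumption the sending phase alone takes a little under a minute, and the total benchmark duration is longer because the last responses still have to complete.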
-## Quick Start for Service-Oriented Performance Evaluation ### Command Meaning + The meaning of the AISBench service-oriented performance evaluation command is the same as explained in 📚 [Tool Quick Start/Command Meaning](../../get_started/quick_start.md#command-meaning). On this basis, you need to add `--mode perf` or `-m perf` to enter the performance evaluation scenario. Take the following AISBench command as an example: + ```shell ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer default_perf --mode perf ``` + Among them: + - `--models` specifies the model task, i.e., the `vllm_api_stream_chat` model task. - `--datasets` specifies the dataset task, i.e., the `demo_gsm8k_gen_4_shot_cot_chat_prompt` dataset task. -- `--summarizer` specifies the result presentation task, i.e., the `default_perf` result presentation task (if `--summarizer` is not specified, the `default_perf` task is used by default in accuracy evaluation scenarios). It is generally used by default and does not need to be specified in the command line; subsequent commands will omit this parameter. +- `--summarizer` specifies the result presentation task, i.e., the `default_perf` result presentation task (if `--summarizer` is not specified, the `default_perf` task is used by default in performance evaluation scenarios). It is generally used by default and does not need to be specified in the command line; subsequent commands will omit this parameter. ### Task Meaning Query (Optional) + Specific information (introduction, usage constraints, etc.) 
about the selected model task `vllm_api_stream_chat`, dataset task `demo_gsm8k_gen_4_shot_cot_chat_prompt`, and result presentation task `default_perf` can be queried from the following links: + - `--models`: 📚 [Service-Oriented Inference Backend](../all_params/models.md#service-oriented-inference-backend) - `--datasets`: 📚 [Open-Source Datasets](../all_params/datasets.md#open-source-datasets) → 📚 [Detailed Introduction](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README_en.md) - `--summarizer`: 📚 [Result Summary Tasks](../all_params/summarizer.md#supported-result-summary-tasks) ### Preparations Before Running the Command + - `--models`: To use the `vllm_api_stream_chat` model task, you need to prepare an inference service that supports the `v1/chat/completions` sub-service. You can refer to 🔗 [VLLM Launch OpenAI-Compatible Server](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server) to start the inference service. - `--datasets`: To use the `demo_gsm8k_gen_4_shot_cot_chat_prompt` dataset task, you need to prepare the GSM8K dataset, which can be downloaded from 🔗 [GSM8K Dataset Compressed Package Provided by OpenCompass](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip). Deploy the unzipped `gsm8k/` folder to the `ais_bench/datasets` folder in the root path of the AISBench evaluation tool. -# Modification of Configuration Files Corresponding to Tasks +### Modification of Configuration Files Corresponding to Tasks + Each model task, dataset task, and result presentation task corresponds to a configuration file. The content of these configuration files must be modified before executing commands. The paths of these configuration files can be queried by adding `--search` to the original AISBench command. 
For example: + ```shell # Note: Whether to add "--mode perf" to the search command does not affect the search results ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode perf --search ``` + > ⚠️ **Note**: Executing a command with the `search` option will print the absolute path of the configuration file corresponding to the task. Executing the query command will yield the following results: + ```shell ╒══════════════╤═══════════════════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕ │ Task Type │ Task Name │ Config File Path │ @@ -43,12 +162,12 @@ Executing the query command will yield the following results: ├──────────────┼───────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ │ --datasets │ demo_gsm8k_gen_4_shot_cot_chat_prompt │ /your_workspace/benchmark/ais_bench/benchmark/configs/datasets/demo/demo_gsm8k_gen_4_shot_cot_chat_prompt.py │ ╘══════════════╧═══════════════════════════════════════╧════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛ - ``` - The dataset task configuration file `demo_gsm8k_gen_4_shot_cot_chat_prompt.py` in the quick start does not require additional modifications. For an introduction to the content of the dataset task configuration file, please refer to 📚 [Configure Open-Source Datasets](../all_params/datasets.md#configure-open-source-datasets) The model configuration file `vllm_api_stream_chat.py` contains configuration content related to model operation and needs to be modified according to actual conditions. The content that needs to be modified in the quick start is marked with comments. 
+ ```python from ais_bench.benchmark.models import VLLMCustomAPIChatStream @@ -78,14 +197,10 @@ models = [ ] ``` -# Execute Commands -After modifying the configuration files, execute the command to start the service performance evaluation: -```bash -ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt -m perf -``` +### View Task Execution Details -## View Task Execution Details After executing the AISBench command, the status of the ongoing task will be displayed on a real-time refreshing dashboard in the command line (press the "P" key on the keyboard to stop refreshing for copying dashboard information, and press "P" again to resume refreshing). For example: + ``` Base path of result&log : outputs/default/20251106_103326 Task Progress Table (Updated at: 2025-11-06 10:34:41) @@ -97,21 +212,24 @@ Press Up/Down arrow to page, 'P' to PAUZE/RESUME screen refresh, 'Ctrl + C' to +=================================+===========+=================================================+=============+=============+================================================+================================================+ | vllm-api-stream-chat/demo_gsm8k | 744887 | [########### ] 3/8 [0.1 it/s] | 0:00:54 | inferencing | logs/infer/vllm-api-stream-chat/demo_gsm8k.out | {'POST': 4, 'RECV': 3, 'FINISH': 3, 'FAIL': 0} | +---------------------------------+-----------+-------------------------------------------------+-------------+-------------+------------------------------------------------+------------------------------------------------+ -` - ``` Detailed logs of task execution will be continuously saved to the default output path, which is displayed on the real-time refreshing dashboard as `Log Path`. The `Log Path` (`logs/infer/vllm-api-stream-chat/demo_gsm8k.out`) is a subpath under the `Base path` (`outputs/default/20251106_103326`). 
Taking the above dashboard information as an example, the path to the detailed logs of task execution is: + ```shell # {Base path}/{Log Path} outputs/default/20251106_103326/logs/infer/vllm-api-stream-chat/demo_gsm8k.out ``` > 💡 If you want detailed logs to be printed directly during execution, you can add `--debug` to the command: -`ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt -m perf --debug` +> +> ```bash +> ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt -m perf --debug +> ``` -# View Performance Results -An example of performance results printed on the screen is as follows: +### View Performance Results + +The on-screen performance results are displayed as follows: ```bash [2025-11-06 10:35:43,667] [ais_bench] [INFO] Performance Results of task: vllm-api-stream-chat/demo_gsm8k: @@ -163,334 +281,701 @@ An example of performance results printed on the screen is as follows: ╘══════════════════════════╧═════════╧══════════════════╛ [2025-11-06 10:35:43,672] [ais_bench] [INFO] Performance Result files located in outputs/default/20251106_103326/performances/vllm-api-stream-chat. ``` -💡 For the meaning of specific performance parameters, please refer to 📚 [Explanation of Performance Evaluation Results](../results_intro/performance_metric.md) -# View Performance Details -After executing the AISBench command, more details of task execution will eventually be saved to the `Base path` (`outputs/default/20251106_103326`) +💡 For the meaning of specific performance parameters, refer to 📚 [Performance Evaluation Results Description](../results_intro/performance_metric.md) + +### Performance Details View + +After executing the AISBench command, more details of task execution will eventually be saved to the `Base path` (`outputs/default/20251106_103326`). 
+ +After the command execution ends, the task execution details in `outputs/default/20251106_103326` are as follows: -After the command execution is completed, the details of task execution in `outputs/default/20250628_151326` are as follows: ```shell 20251106_103326 # Unique directory generated based on timestamp for each experiment -├── configs # Automatically stored all dumped configuration files -├── logs # Logs during execution; if --debug is added to the command, no process logs will be saved to disk (all will be printed directly) -│ └── performance/ # Log files of the inference phase +├── configs # All dumped configuration files, stored automatically +├── logs # Logs during execution; if --debug is added to the command, there will be no on-disk logs (all printed directly) +│ └── performance/ # Log files from the inference phase └── performance # Performance evaluation results -│ └── vllm-api-stream-chat/ # Name of "service model configuration", corresponding to the abbr parameter of models in the model task configuration file -│ ├── demo_gsm8k.csv # Single-request performance output (CSV), consistent with the Performance Parameters table in the on-screen performance results -│ ├── demo_gsm8k.json # End-to-end performance output (JSON), consistent with the Common Metric table in the on-screen performance results -│ ├── demo_gsm8k_plot.html # Request concurrency visualization report (HTML) -│ └── ...... -``` -💡 It is recommended to open the request concurrency visualization report `demo_gsm8k_plot.html` using browsers such as Chrome or Edge. 
You can view the latency of each request and the number of concurrent service times perceived by the client at each moment: - ![full_plot_example.img](../../img/request_concurrency/full_plot_example.png) + └── vllm-api-stream-chat/ # "Service-oriented model configuration" name, corresponding to the abbr parameter of models in the model task configuration file + ├── demo_gsm8k.csv # Single-request performance output (CSV), consistent with the Performance Parameters table in the on-screen performance results + ├── demo_gsm8k.json # End-to-end performance output (JSON), consistent with the Common Metric table in the on-screen performance results + ├── demo_gsm8k_plot.html # Request concurrency visualization report (HTML) + └── ...... +``` + +💡 The `demo_gsm8k_plot.html` request concurrency visualization report is recommended to be opened with browsers such as Chrome or Edge, where you can see the latency of each request and the number of concurrent service requests perceived by the client at each moment: +![full_plot_example](../../img/request_concurrency/full_plot_example.png) + For instructions on using this HTML visualization file, please refer to 📚 [Instructions for Using Performance Test Visualization Concurrency Graphs](../results_intro/performance_visualization.md) -# Preconditions for Service-Oriented Performance Evaluation -Before conducting service-oriented inference, the following conditions must be met: +## Test Preparation + +Before performing service-oriented inference, the following conditions must be met: + +- Available model weights: Ensure that the model weight files to be tested are already available locally. Open-source weights can be obtained from 🔗 [Hugging Face Community](https://huggingface.co/models). +- Service environment preparation: Ensure that the model inference service is started through inference engines such as vLLM/vLLM-Ascend. 
The startup parameters need to ensure that the server's `max-model-len` and other configurations can accommodate the length of the prompt and output to be sent. +- Dataset preparation: Select a dataset suitable for performance evaluation scenarios, such as `ShareGPT`. For details, refer to 📚 [Datasets](../all_params/datasets.md#open-source-datasets). The user can also prepare a custom dataset, see [Custom Dataset Evaluation](#custom-dataset-evaluation). +- Model task preparation: Select the model task to execute from 📚 [vLLM Model Backend](../all_params/models.md#vllm-model-backend). + +:::{admonition} Service Startup Precautions +:class: warning + +- It is recommended to ensure that the service is fully started before starting the evaluation task, otherwise the task may fail due to connection failure. +- When the service fails, the tool will record the failure cause in the logs, and the user can troubleshoot based on the error information. +::: -- **Accessible Service-Oriented Model Service**: Ensure the service process can be directly accessed in the current environment. -- **Dataset Preparation**: - - **Open-Source Dataset**: Select a dataset from 📚 [Open-Source Datasets](../all_params/datasets.md#开源数据集), and choose the dataset task to execute from the "Detailed Introduction" document corresponding to the dataset. Prepare the dataset files by referring to the "Detailed Introduction" document of the selected dataset task. It is recommended to manually place the open-source dataset in the default directory `ais_bench/datasets/`; the program will automatically load the dataset files during task execution. - - **Randomly Synthesized Dataset**: Select `synthetic_gen` as the dataset task, and refer to 📚 [Randomly Synthesized Dataset](../../advanced_tutorials/synthetic_dataset.md) for other configurations. - - **Custom Dataset**: No need to specify a dataset task; refer to 📚 [Custom Dataset](../../advanced_tutorials/custom_dataset.md) for other configurations. 
-- **Service-Oriented Model Backend Configuration**: From [Service-Oriented Inference Backend](../all_params/models.md#服务化推理后端), select a sub-service with the interface type of `Streaming Interface` (⚠️ Other types are not supported). +## Main Functional Scenarios +### Single-Task Performance Evaluation + +#### Using a Custom Configuration File (Recommended) + +:::{tab-set} +:::{tab-item} ⭐ Custom Configuration File + +The configuration file content is consistent with the [Quick Start One-Click Evaluation](#one-click-evaluation). + +Execution command: + +```bash +ais_bench performance_qwen2_7b_sharegpt.py +``` -# Main Functional Scenarios -## Single-Task Evaluation -Refer to [Quick Start for Service-Oriented Performance Evaluation](#服务化性能测评快速入门) +::: +:::{tab-item} Alternative: Command-Line Parameters -## Multi-Task Evaluation -Supports simultaneous configuration of multiple models or multiple dataset tasks, enabling batch evaluation through a single command. This is suitable for serial execution of multiple test commands. +You can also use the preset configuration file for one-click evaluation: -### Command Description -Users can specify multiple configuration tasks via the `--models` and `--datasets` parameters. The number of subtasks is the product of the number of tasks configured in `--models` and `--datasets`—that is, one model configuration and one dataset configuration form a subtask. 
Example: ```bash -ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_str --mode perf +ais_bench --models vllm_qwen2_5_7b_chat --datasets sharegpt_gen_perf --url http://localhost:8080/v1/chat/completions ``` -The above command specifies 2 model tasks (`vllm_api_general_stream` `vllm_api_stream_chat`) and 2 dataset tasks (`gsm8k_gen_4_shot_cot_str` `aime2024_gen_0_shot_str`), and will execute the following 4 combined performance test tasks: + +::: +::: + +#### Specifying Custom Performance Dimensions + +AISBench supports users in customizing the statistical items of performance reports. By modifying the `stats_list` field in the custom configuration file, you can control which performance dimensions to output in the summary report. + +The `stats_list` field is a string list. Common configurable performance dimensions include: + +| Dimension | Description | +| --- | --- | +| `benchmark_duration` | Total benchmark duration | +| `num_prompts` | Total number of requests | +| `request_rate` | Sending rate (QPS) | +| `qps` | Actual QPS | +| `tput` | Total token throughput (tokens/second) | +| `prefill_token_throughput` | Prefill phase token throughput | +| `decode_token_throughput` | Decode phase token throughput | +| `concurrency` | Concurrency | +| `avg_latency` | Average end-to-end latency | +| `p50_latency` | P50 end-to-end latency | +| `p90_latency` | P90 end-to-end latency | +| `p99_latency` | P99 end-to-end latency | +| `ttft` | Time To First Token | +| `tpot` | Time Per Output Token | +| `itl` | Inter-Token Latency | +| `e2el` | End-to-End Latency | +| `output_tokens_per_request` | Average output tokens per request | +| `total_input_tokens` | Total input tokens | +| `total_output_tokens` | Total output tokens | + +The following is an example configuration that contains the most commonly used performance dimensions: + +```python +stats_list = [ + 'benchmark_duration', + 'num_prompts', + 
'request_rate', + 'qps', + 'tput', + 'concurrency', + 'avg_latency', + 'p50_latency', + 'p99_latency', +] +``` + +### Multi-Task Performance Evaluation + +Supports simultaneous configuration of multiple datasets or multiple sending parameter combinations (such as different `request_rate`s) for performance evaluation through a single command, facilitating the comparison of performance indicators of different sending strategies. + +#### Description of Sub-task Combinations + +In multi-task evaluation scenarios, the number of subtasks is the product of the number of tasks configured by `models` and the number of tasks configured by `datasets`; that is, one model configuration and one dataset configuration form a subtask. + +The following example simultaneously evaluates 2 model tasks (`vllm_api_general_stream`, `vllm_api_stream_chat`) and 2 dataset tasks (`gsm8k_gen_4_shot_cot_str`, `aime2024_gen_0_shot_str`), and will execute the following 4 combined performance test tasks: + + [vllm_api_general_stream](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py) Model Task + [gsm8k_gen_4_shot_cot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py) Dataset Task + [vllm_api_general_stream](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py) Model Task + [aime2024_gen_0_shot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_str.py) Dataset Task + [vllm_api_stream_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py) Model Task + [gsm8k_gen_4_shot_cot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py) Dataset Task +
[vllm_api_stream_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py) Model Task + [aime2024_gen_0_shot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_str.py) Dataset Task -### Modify Configuration Files Corresponding to Tasks -The actual paths of the configuration files corresponding to model tasks and dataset tasks can be queried by executing the command with the `--search` parameter: -```bash -ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_str --mode perf --search +#### Custom Model-Dataset Pairings (Optional) + +By default, the `models` list and `datasets` list in the configuration file are automatically combined as a Cartesian product, with the number of subtasks equal to the number of models × the number of datasets (in this example, 2 × 2 = 4). If you want to precisely control which models are paired with which datasets (e.g., letting some models only run on some datasets to avoid meaningless combinations), you can explicitly declare the pairing relationship in the configuration file via the `model_dataset_combinations` field: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_str import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import 
models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets +models = vllm_api_general_stream + vllm_api_stream_chat + +# Key: Precisely control pairings via model_dataset_combinations +# The following example generates only 2 subtasks (the Cartesian product would generate 4): +# - vllm_api_general_stream + gsm8k_gen_4_shot_cot_str +# - vllm_api_stream_chat + aime2024_gen_0_shot_str +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), +] ``` -The following configuration files to be modified will be displayed: -```bash -╒═════════════╤══════════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕ -│ Task Type │ Task Name │ Config File Path │ -╞═════════════╪══════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡ -│ --models │ vllm_api_general_stream │ /your_workspace/benchmark_test/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py │ -├─────────────┼──────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ -│ --models │ vllm_api_stream_chat │ /your_workspace/benchmark_test/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py │ -├─────────────┼──────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ -│ --datasets │ gsm8k_gen_4_shot_cot_str │ /your_workspace/benchmark_test/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py │ -├─────────────┼──────────────────────────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ -│ --datasets │ aime2024_gen_0_shot_str │ 
/your_workspace/benchmark_test/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_str.py │ -╘═════════════╧══════════════════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛ -``` -- Refer to 📚 [Description of Service-Oriented Inference Backend Configuration Parameters](../all_params/models.md#服务化推理后端配置参数说明) to configure the configuration files corresponding to the model tasks `vllm_api_general_stream` and `vllm_api_stream_chat` according to the actual situation. -- Refer to 📚 [Configure Open-Source Dataset](../all_params/datasets.md#配置开源数据集) to configure the configuration files corresponding to the dataset tasks `gsm8k_gen_4_shot_cot_str` and `aime2024_gen_0_shot_str` according to the actual situation. **Note**: If the dataset is placed in the default directory `ais_bench/datasets/`, no configuration is generally required. - -### Execute the Evaluation Command -Execute the command: -```bash -ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_str --mode perf + +> ⚠️ **Note**: The unique identifier for models and datasets is determined by the `abbr` field. In the same configuration file, repeated combinations of models or datasets with the same `abbr` will be treated as duplicate tasks and skipped. When reusing model/dataset configurations via methods such as `.copy()`, the `abbr` must be explicitly modified to ensure uniqueness. See 📚 [Custom Model and Dataset Combinations](../../advanced_tutorials/run_custom_config.md#custom-model-and-dataset-combinations) for details. + +#### Multi-Task Parallel + +Supports multi-task parallelism through the [`--max-num-workers`](../all_params/cli_args.md#common-parameters) command-line parameter. Different sub-tasks will be distributed to different processes for parallel execution. 
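The pairing rules and the `abbr` uniqueness note above can be sketched in plain Python. This is an illustrative model only, not AISBench's actual scheduler: `expand_subtasks` is a hypothetical helper, the dicts carry only an `abbr` field, and the `-port8081` suffix is an invented example value.

```python
from itertools import product


def expand_subtasks(models, datasets, combinations=None):
    """Hypothetical helper: list unique (model abbr, dataset abbr) subtasks.

    Without `combinations`, every model is paired with every dataset
    (Cartesian product). Pairs whose abbr values repeat are skipped.
    """
    if combinations is None:
        combinations = [dict(models=models, datasets=datasets)]
    seen, pairs = set(), []
    for combo in combinations:
        for model, dataset in product(combo['models'], combo['datasets']):
            key = (model['abbr'], dataset['abbr'])
            if key in seen:
                continue  # same abbr pair: treated as a duplicate task
            seen.add(key)
            pairs.append(key)
    return pairs


models = [dict(abbr='vllm-api-general-stream'), dict(abbr='vllm-api-stream-chat')]
datasets = [dict(abbr='gsm8k'), dict(abbr='aime2024')]

# Default behaviour: 2 models x 2 datasets = 4 subtasks
full = expand_subtasks(models, datasets)

# Explicit pairing: only the 2 declared subtasks
paired = expand_subtasks(models, datasets, [
    dict(models=[models[0]], datasets=[datasets[0]]),
    dict(models=[models[1]], datasets=[datasets[1]]),
])

# A .copy() that keeps the original abbr collapses into one subtask
clone = models[0].copy()
duplicated = expand_subtasks([models[0], clone], [datasets[0]])

# Giving the copy its own abbr (invented value) restores two subtasks
clone['abbr'] = 'vllm-api-general-stream-port8081'
distinct = expand_subtasks([models[0], clone], [datasets[0]])
```

In this sketch, the copied model only becomes a separate subtask once its `abbr` differs, which mirrors the requirement stated in the note.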
+ +#### Specifying Multiple Datasets for Performance Evaluation + +:::{tab-set} +:::{tab-item} ⭐ Custom Configuration File + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import vLLMCausalLM +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + +datasets = gsm8k_datasets + aime2024_datasets + +models = [ + dict( + type=vLLMCausalLM, + abbr='vllm-qwen2.5-7b', + path='Qwen/Qwen2.5-7B-Instruct', + model_kwargs=dict( + tokenizer_path='Qwen/Qwen2.5-7B-Instruct', + ), + url='http://localhost:8080/v1/chat/completions', + max_out_len=1024, + batch_size=50, + ), +] ``` -During execution, a timestamp directory will be created under the path specified by 📚 [`--work-dir`](../all_params/cli_args.md#公共参数) (the default path is `outputs/default/`) to save execution details. 
-After the 4 performance evaluation tasks are completed, the performance results of all 4 tasks will be printed at once: +Execution command: + ```bash -[2025-11-06 10:35:43,667] [ais_bench] [INFO] Performance Results of task: vllm-api-stream-chat/demo_gsm8k: -╒══════════════════════════╤═════════╤═════════════════╤═══════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤══════╕ -│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ -╞══════════════════════════╪═════════╪═════════════════╪═══════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪══════╡ -│ E2EL │ total │ 2754.0929 ms │ 2189.0804 ms │ 3366.1463 ms │ 2753.1668 ms │ 3048.2929 ms │ 3222.573 ms │ 3303.3894 ms │ 1319 │ -...... -╒══════════════════════════╤═════════╤═══════════════════╕ -│ Common Metric │ Stage │ Value │ -╞══════════════════════════╪═════════╪═══════════════════╡ -│ Benchmark Duration │ total │ 38039.9928 ms │ -...... -[2025-11-06 11:11:33,468] [ais_bench] [INFO] Performance Result files located in outputs/default/20251106_110904/performances/vllm-api-general-stream. -[2025-11-06 11:11:33,468] [ais_bench] [INFO] Performance Results of task: vllm-api-general-stream/aime2024: -╒══════════════════════════╤═════════╤═════════════════╤════════════════╤════════════════╤═══════════════╤═════════════════╤═════════════════╤═════════════════╤═════╕ -│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ -╞══════════════════════════╪═════════╪═════════════════╪════════════════╪════════════════╪═══════════════╪═════════════════╪═════════════════╪═════════════════╪═════╡ -│ E2EL │ total │ 2868.1822 ms │ 2277.1049 ms │ 3307.2084 ms │ 2941.6767 ms │ 3158.5361 ms │ 3220.2141 ms │ 3307.0174 ms │ 30 │ -...... 
-╒══════════════════════════╤═════════╤═══════════════════╕ -│ Common Metric │ Stage │ Value │ -╞══════════════════════════╪═════════╪═══════════════════╡ -│ Benchmark Duration │ total │ 3346.9782 ms │ -...... -[2025-11-06 11:11:33,471] [ais_bench] [INFO] Performance Result files located in outputs/default/20251106_110904/performances/vllm-api-general-stream. -[2025-11-06 11:11:33,471] [ais_bench] [INFO] Performance Results of task: vllm-api-stream-chat/gsm8k: -╒══════════════════════════╤═════════╤═════════════════╤════════════════╤═════════════════╤═════════════════╤═════════════════╤════════════════╤═════════════════╤══════╕ -│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ -╞══════════════════════════╪═════════╪═════════════════╪════════════════╪═════════════════╪═════════════════╪═════════════════╪════════════════╪═════════════════╪══════╡ -│ E2EL │ total │ 2753.3518 ms │ 2189.5185 ms │ 3339.4463 ms │ 2755.8153 ms │ 3039.7431 ms │ 3219.6642 ms │ 3313.0408 ms │ 1319 │ -...... -╒══════════════════════════╤═════════╤═══════════════════╕ -│ Common Metric │ Stage │ Value │ -╞══════════════════════════╪═════════╪═══════════════════╡ -│ Benchmark Duration │ total │ 38101.2396 ms │ -...... -[2025-11-06 11:11:33,474] [ais_bench] [INFO] Performance Result files located in outputs/default/20251106_110904/performances/vllm-api-stream-chat. 
-[2025-11-06 11:11:33,474] [ais_bench] [INFO] Performance Results of task: vllm-api-stream-chat/aime2024: -╒══════════════════════════╤═════════╤═════════════════╤═══════════════╤════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════╕ -│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ -╞══════════════════════════╪═════════╪═════════════════╪═══════════════╪════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════╡ -│ E2EL │ total │ 2745.4115 ms │ 2187.5882 ms │ 3288.4635 ms │ 2820.7541 ms │ 2988.8338 ms │ 3188.436 ms │ 3273.7475 ms │ 30 │ -...... -╒══════════════════════════╤═════════╤═══════════════════╕ -│ Common Metric │ Stage │ Value │ -╞══════════════════════════╪═════════╪═══════════════════╡ -│ Benchmark Duration │ total │ 3335.7672 ms │ -...... -[2025-11-06 11:11:33,477] [ais_bench] [INFO] Performance Result files located in outputs/default/20251106_110904/performances/vllm-api-stream-chat. 
-``` - -At the same time, the final generated directory structure is as follows: +ais_bench performance_multi_dataset.py +``` + +::: +:::{tab-item} Alternative: Command-Line Parameters + +Use the `--datasets` parameter to specify multiple datasets: + ```bash -# Under output/default -20251106_110904/ # Output directory corresponding to the task creation time -├── configs # A combined configuration file integrating configs for model tasks, dataset tasks, and structure presentation tasks -├── logs # Contains logs from the inference and accuracy evaluation phases; when the --debug command is added, logs will be printed directly to the screen without generating disk-stored files -│ └── performance # Log files from the inference phase -└── performances # Performance evaluation results - ├── vllm-api-general-stream # Name of the "service-oriented model configuration", corresponding to the abbr parameter in the models section of the model task configuration file - │ ├── aime2024.csv # Single-request performance output (CSV), consistent with the Performance Parameters table in the on-screen performance results display - │ ├── aime2024.json # End-to-end performance output (JSON), consistent with the Common Metric table in the on-screen performance results display - │ ├── aime2024_plot.html # Request concurrency visualization report (HTML) - │ ├── gsm8k.csv - │ ├── gsm8k.json - │ ├── gsm8k_plot.html - │ └── ...... - └── vllm-api-stream-chat - ├── aime2024.csv - ├── aime2024.json - ├── aime2024_plot.html - ├── gsm8k.csv - ├── gsm8k.json - ├── gsm8k_plot.html - └── ...... +ais_bench --models vllm_qwen2_5_7b_chat --datasets gsm8k_gen_4_shot_cot_str_perf,aime2024_gen_perf --url http://localhost:8080/v1/chat/completions +``` + +::: +::: + +#### Specifying Multiple Sending Rates for Performance Evaluation +The following configuration file example sends the `ShareGPT` dataset to the service at `request_rate=1, 2, 4, 8` respectively for performance evaluation.
+ +:::{tab-set} +:::{tab-item} ⭐ Custom Configuration File + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import vLLMCausalLM +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.datasets import GenericDataset + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + +datasets = [ + dict( + type=GenericDataset, + abbr=f'sharegpt_rate_{rate}', + path='ais_bench/datasets/ShareGPT/ShareGPT.jsonl', + reader_cfg=dict( + input_columns=['prompt'], + output_column='completion', + ), + infer_cfg=dict( + prompt_template=dict( + type=PromptTemplate, + template=dict( + round=[ + dict( + role='HUMAN', + prompt='{prompt}', + ), + ], + ), + ), + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + ) + for rate in [1, 2, 4, 8] +] + +models = [ + dict( + type=vLLMCausalLM, + abbr='vllm-qwen2.5-7b', + path='Qwen/Qwen2.5-7B-Instruct', + model_kwargs=dict( + tokenizer_path='Qwen/Qwen2.5-7B-Instruct', + ), + url='http://localhost:8080/v1/chat/completions', + max_out_len=1024, + batch_size=50, + ), +] + +# Each dataset uses the corresponding request_rate +request_rate = [1.0, 2.0, 4.0, 8.0] ``` -> ⚠️ Note: -> - In multi-task performance evaluation scenarios, the dataset tasks specified by `--datasets` must belong to different dataset types. Otherwise, performance data may be missing due to overwriting. For example, you cannot use `--datasets` to specify both the `aime2024_gen_0_shot_str` and `aime2024_gen_0_shot_chat_prompt` dataset tasks simultaneously. +Execution command: -### Custom Sequence Length Evaluation -#### 1 Configure Input and Output Distribution for Custom Sequence Datasets -To perform custom sequence length evaluation, you need to specify the special dataset task `synthetic_gen_string`. 
Execute the following command to retrieve the path of the configuration file corresponding to `synthetic_gen_string`: ```bash -ais_bench --models vllm_api_stream_chat --datasets synthetic_gen_string --search +ais_bench performance_multi_rate.py ``` -The result will be: + +::: +:::{tab-item} Alternative: Command-Line Parameters + +It is not supported to specify multiple sending rates for one dataset in a single command. It is recommended to use a custom configuration file. + +::: +::: + +#### Specifying Multiple Models for Performance Evaluation + +Supports simultaneous evaluation of multiple models on the same dataset, suitable for comparing the performance of different models. + +:::{tab-set} +:::{tab-item} ⭐ Custom Configuration File + +```python +models = [ + dict( + type=vLLMCausalLM, + abbr='vllm-qwen2.5-7b', + path='Qwen/Qwen2.5-7B-Instruct', + model_kwargs=dict( + tokenizer_path='Qwen/Qwen2.5-7B-Instruct', + ), + url='http://localhost:8080/v1/chat/completions', + max_out_len=1024, + batch_size=50, + ), + dict( + type=vLLMCausalLM, + abbr='vllm-qwen2.5-14b', + path='Qwen/Qwen2.5-14B-Instruct', + model_kwargs=dict( + tokenizer_path='Qwen/Qwen2.5-14B-Instruct', + ), + url='http://localhost:8080/v1/chat/completions', + max_out_len=1024, + batch_size=50, + ), +] ``` -╒══════════════╤═══════════════════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕ -│ Task Type │ Task Name │ Config File Path │ -╞══════════════╪═══════════════════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡ -│ --models │ vllm_api_stream_chat │ /your_workspace/benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py │ 
-├──────────────┼───────────────────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤ -│ --datasets │ synthetic_gen_string │ /your_workspace/benchmark/ais_bench/benchmark/configs/datasets/synthetic/synthetic_gen_string.py │ -╘══════════════╧═══════════════════════════════════════╧════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛ + +Execution command: + +```bash +ais_bench performance_multi_model.py ``` -Modify the `synthetic_config` in `/your_workspace/benchmark/ais_bench/benchmark/configs/datasets/synthetic/synthetic_gen_string.py`. The configuration content is as follows: -```python -synthetic_config = { - "Type": "string", - "RequestCount": 1000, # Number of requests (number of dataset entries) - "StringConfig": { - "Input": { - "Method": "uniform", - "Params": {"MinValue": 50, "MaxValue": 500} # Input length: 50-500 - }, - "Output": { - "Method": "uniform", - "Params": {"MinValue": 20, "MaxValue": 200} # Output length: 20-200 - } - } -} -``` -💡 For more custom input and output distributions, refer to 📚 [Random Synthetic Dataset](../../advanced_tutorials/synthetic_dataset.md) - -#### 2 Ensure the Inference Service Reaches the Set Maximum Output -To ensure the inference service achieves the set maximum output, you need to configure the special post-processing parameter `ignore_eos = True` in `generation_kwargs` of the 📚 [Service-Oriented Model Configuration](../all_params/models.md#Service-Oriented Inference Backend Configuration Parameter Description) to control the maximum output length of requests (preventing early termination). 
- -For example, modify the content of the configuration file [vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/models/vllm_api/vllm_api_stream_chat.py) corresponding to the `vllm_api_stream_chat` model task: +::: +:::{tab-item} Alternative: Command-Line Parameters + +```bash +ais_bench --models vllm_qwen2_5_7b_chat,vllm_qwen2_5_14b_chat --datasets sharegpt_gen_perf --url http://localhost:8080/v1/chat/completions +``` + +::: +::: + +### Synthetic Dataset Multi-Task Combinations + +In actual performance evaluation, it is sometimes necessary to simulate the input load in production environments, such as fixed-length inputs, Poisson-distributed request arrival, etc. AISBench supports users in defining custom performance evaluation datasets through the `SyntheticDataset`, and supports configuring the distribution of input sequence lengths, the distribution of output sequence lengths, the request arrival rate (QPS), etc. through parameters. The model-dataset sub-tasks generated by the synthetic dataset support combinations with each other. 
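To make the load-shaping idea above concrete, here is a minimal standard-library sketch of Poisson-distributed request arrival and uniformly distributed sequence lengths. It is not how AISBench generates synthetic data: `poisson_arrival_times` and `uniform_lengths` are hypothetical helpers, and the rate of 2 requests per second and the 50 to 500 token range are example values chosen for illustration.

```python
import random


def poisson_arrival_times(request_rate, num_requests, seed=0):
    """Hypothetical helper: send timestamps (seconds) for a Poisson process.

    In a Poisson arrival process the gaps between consecutive requests are
    exponentially distributed with mean 1 / request_rate.
    """
    rng = random.Random(seed)
    now, times = 0.0, []
    for _ in range(num_requests):
        now += rng.expovariate(request_rate)
        times.append(now)
    return times


def uniform_lengths(min_len, max_len, num_requests, seed=0):
    """Hypothetical helper: one token length per request, uniform in [min_len, max_len]."""
    rng = random.Random(seed)
    return [rng.randint(min_len, max_len) for _ in range(num_requests)]


arrivals = poisson_arrival_times(request_rate=2.0, num_requests=1000)
input_lens = uniform_lengths(50, 500, num_requests=1000)

# The long-run average rate should be close to the configured request_rate
observed_rate = len(arrivals) / arrivals[-1]
```

Individual gaps vary widely under this model, so the configured rate is only matched on average over many requests.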
+ +The following configuration file example sends synthetic datasets of different input lengths to the service for performance evaluation at `request_rate=2`: + +:::{tab-set} +:::{tab-item} ⭐ Custom Configuration File + ```python -from ais_bench.benchmark.models import VLLMCustomAPIChatStream +from mmengine.config import read_base +from ais_bench.benchmark.models import vLLMCausalLM +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.datasets import SyntheticDataset + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + +# Define multiple sub-datasets with different input/output lengths +datasets = [] +for input_len in [256, 512, 1024]: + for output_len in [256, 512]: + datasets.append( + dict( + type=SyntheticDataset, + abbr=f'syn_in{input_len}_out{output_len}', + num_infer_questions=100, + input_lens=[input_len], + output_lens=[output_len], + input_distribution='uniform', + output_distribution='uniform', + reader_cfg=dict( + input_columns=['query'], + output_column='answer', + ), + infer_cfg=dict( + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + ) + ) + models = [ dict( - attr="service", - type=VLLMCustomAPIChatStream, - abbr='vllm-api-stream-chat', - # Configure other model task parameters such as port and IP by yourself - generation_kwargs = dict( - # ..... 
- ignore_eos = True, # The inference service output ignores EOS (output length will definitely reach max_out_len) + type=vLLMCausalLM, + abbr='vllm-qwen2.5-7b', + path='Qwen/Qwen2.5-7B-Instruct', + model_kwargs=dict( + tokenizer_path='Qwen/Qwen2.5-7B-Instruct', + ), + url='http://localhost:8080/v1/chat/completions', + max_out_len=1024, + batch_size=50, + ), +] + +request_rate = 2.0 +``` + +Execution command: + +```bash +ais_bench performance_synthetic.py +``` + +::: +:::{tab-item} Alternative: Command-Line Parameters + +It is not supported to specify multiple synthetic datasets with different lengths in a single command. It is recommended to use a custom configuration file. + +::: +::: + +> 💡 For more configuration details of `SyntheticDataset`, please refer to 📚 [Datasets](../all_params/datasets.md#synthetic-dataset). + +### Custom Sequence Length Usage through Custom Config File Approach + +:::{admonition} Why use a custom config file? +:class: tip + +For the synthetic dataset scenario, in order to fully support the user's combination of multiple different input/output lengths, multiple different QPS sending rates, etc., it is **strongly recommended to use a custom configuration file**, because the command-line parameters can only support a single fixed length and a single QPS, and cannot satisfy the combinatorial requirements. +::: + +For detailed instructions on writing custom configuration files, please refer to [Custom Configuration Files](../../advanced_tutorials/run_custom_config.md#synthetic-dataset-performance-evaluation). + +### Custom Sequence Multi-Task Combinations + +For multi-task combinations based on custom sequence lengths, the user can combine different models and datasets for evaluation through the `model_dataset_combinations` field. 
+ +::::{tab-set} +:::{tab-item} ⭐ Custom Configuration File + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import vLLMCausalLM +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.datasets import SyntheticDataset + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + +datasets = [] +for input_len in [256, 512]: + for output_len in [256, 512]: + datasets.append( + dict( + type=SyntheticDataset, + abbr=f'syn_in{input_len}_out{output_len}', + num_infer_questions=100, + input_lens=[input_len], + output_lens=[output_len], + input_distribution='uniform', + output_distribution='uniform', + reader_cfg=dict( + input_columns=['query'], + output_column='answer', + ), + infer_cfg=dict( + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + ) ) - ) + +models = [ + dict( + type=vLLMCausalLM, + abbr='vllm-qwen2.5-7b', + path='Qwen/Qwen2.5-7B-Instruct', + model_kwargs=dict( + tokenizer_path='Qwen/Qwen2.5-7B-Instruct', + ), + url='http://localhost:8080/v1/chat/completions', + max_out_len=1024, + batch_size=50, + ), ] +# Key: Only specify partial models for partial datasets +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0], datasets[1]]), + dict(models=[models[0]], datasets=[datasets[2]]), +] ``` -#### 3 Start Performance Evaluation -Execute the following command: +Execution command: + ```bash -ais_bench --models vllm_api_stream_chat --datasets synthetic_gen_string -m perf +ais_bench performance_seq_combinations.py ``` -After completion, the output directory structure is the same as that described in the [Multi-Task Evaluation](#Multi-Task Evaluation) section. Corresponding CSV/JSON/HTML files will be generated under performance/vllm-api-stream-chat/synthetic*.
-> ⚠️ Note: -> - Some service-oriented backends do not support the `ignore_eos` post-processing parameter. In such cases, the actual number of output `Tokens` may not reach the configured maximum output length. You need to configure other post-processing parameters (e.g., parameters for limiting minimum output) to achieve the maximum output length. +::: +:::{tab-item} Alternative: Command-Line Parameters + +Not supported. + +::: +:::: ### Fixed Request Count Evaluation -When the dataset scale is too large and you only want to perform performance testing on a subset of samples, you can use the 📚 [`--num-prompts`](../all_params/cli_args.md#Performance Evaluation Parameters) parameter to specify the number of data entries to read. An example is as follows: + +When the dataset scale is too large and you only want to perform performance testing on a subset of samples, you can use either of the following two approaches to control the data reading range. They achieve the same goal, so just pick the one that fits your habit: + +- **Basic approach**: Specify the number of data entries to read directly via the command-line parameter 📚 [`--num-prompts`](../all_params/cli_args.md#common-parameters). No configuration file modification is required, and it is the simplest to use. +- **Advanced approach (more powerful)**: Set the `reader_cfg.test_range` field of the dataset in the custom configuration file, which supports a more flexible sampling range (e.g., specifying a start index and custom step). For detailed usage, refer to 📚 [Custom Configuration Files](../../advanced_tutorials/run_custom_config.md). 
+ +Example as follows: + +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +**Method 1: Basic approach — Use `--num-prompts` to specify the number of entries to read** + +For a complete example, refer to [performance_fixed_request.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_fixed_request.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_stream_chat +# ...For other parameter configurations, please refer to the configuration file +``` + +Execute the command (specify reading only 1 sample via `--num-prompts 1`): + ```bash -ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt -m perf --num-prompts 1 +ais_bench ais_bench/configs/performance_benchmark/performance_fixed_request.py --mode perf --num-prompts 1 ``` -The above command only performs inference on the first entry in the sample dataset and measures its performance. -> ⚠️ Note: Currently, the dataset is read sequentially in the default queue order; random sampling or shuffling is not supported. 
+**Method 2: Advanced approach — Use `test_range` to flexibly specify the reading range** -## Other Functional Scenarios -### Performance Result Recalculation -The main functional scenario evaluation tool for performance testing executes a complete workflow of performance sampling → calculation → aggregation: -```mermaid -graph LR; - A[Execute inference based on the given dataset] --> B((Performance打点数据)) - B --> C[Calculate metrics based on the打点数据] - C --> D((Performance data)) - D --> E[Generate an aggregated report based on the performance data] - E --> F((Present results)) +If you need more flexible range control (e.g., specifying a start index and custom step), you can set the `reader_cfg.test_range` field of the dataset directly in the custom configuration file, without passing any command-line parameter. For a complete example, refer to [performance_fixed_request.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_fixed_request.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +# Key: control the sampling range flexibly via reader_cfg.test_range +# For example, '[0:8]' reads the first 8 samples; '[10:20]' reads samples from index 10 to 20 +datasets[0]['reader_cfg']['test_range'] = '[0:8]' + +models = vllm_api_stream_chat +# ...For other parameter configurations, please refer to the configuration file ``` -*Note: "打点数据" (dǎdiǎn shùjù) refers to "instrumented data" or "sampled 
performance metrics" in this technical context.* -Each link in the execution workflow is independently decoupled. Calculation and aggregation can be repeatedly performed based on the results of performance sampling. If the directly printed performance data does not include data for relevant dimensions (e.g., missing 95th percentile data), you need to modify some configurations for recalculation. The specific operations are as follows: +Execute the command (test_range has been specified in the configuration file, no need to pass `--num-prompts`): -Assume the command used for the previous performance evaluation was: ```bash -ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode perf +ais_bench ais_bench/configs/performance_benchmark/performance_fixed_request.py --mode perf ``` -The printed `Performance Parameters` table is as follows: + +::: +:::{tab-item} Alternative: Using Command-Line Parameters + ```bash -[2025-11-06 11:11:33,463] [ais_bench] [INFO] Performance Results of task: vllm-api-general-stream/gsm8k: -╒══════════════════════════╤═════════╤═════════════════╤════════════════╤═════════════════╤═════════════════╤═════════════════╤════════════════╤═════════════════╤══════╕ -│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ -╞══════════════════════════╪═════════╪═════════════════╪════════════════╪═════════════════╪═════════════════╪═════════════════╪════════════════╪═════════════════╪══════╡ -│ E2EL │ total │ 2753.3518 ms │ 2189.5185 ms │ 3339.4463 ms │ 2755.8153 ms │ 3039.7431 ms │ 3219.6642 ms │ 3313.0408 ms │ 1319 │ -...... +ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode perf --num-prompts 1 +``` +The above command only performs inference on the first entry in the sample dataset and only measures the performance of this one entry. 
+ +::: +:::: + +> ⚠️ Note: Currently, the dataset is read sequentially in the default queue order; random sampling or shuffling is not supported. When `reader_cfg.test_range` in the configuration file and the command-line `--num-prompts` are both specified, the command-line parameter `--num-prompts` takes precedence. + +### Fixed Request Count Performance Evaluation + +In some scenarios, you want to fix the total number of requests without limiting the sending rate, that is, to send requests at maximum throughput. In this case, set `request_rate` to `-1`, which means requests are sent concurrently without rate limiting. + +::::{tab-set} +:::{tab-item} ⭐ Custom Configuration File + +```python +num_prompts = 100 +request_rate = -1 # -1 indicates concurrent sending without rate limiting ``` -*Note: "E2EL" stands for "End-to-End Latency" in this performance context.* -If you want to view performance data for the "P95" (95th percentile) dimension, you need to modify the content of the configuration file corresponding to the default result presentation task `default_perf` for `--summarizer`.
The path of `default_perf` can be queried using the `--search` command: +Execution command: + ```bash -╒══════════════╤══════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╕ -│ Task Type │ Task Name │ Config File Path │ -╞══════════════╪══════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╡ -│ --summarizer │ default_perf │ /your_workspace/benchmark/ais_bench/benchmark/configs/summarizers/perf/default_perf.py │ -╘══════════════╧══════════════╧═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╛ +ais_bench performance_fixed_request.py +``` +::: +:::{tab-item} Alternative: Command-Line Parameters + +```bash +ais_bench --models vllm_qwen2_5_7b_chat --datasets sharegpt_gen_perf --url http://localhost:8080/v1/chat/completions --num-prompts 100 --request-rate inf ``` -Modify the content of `default_perf.py`: -```py +::: +:::: + +## Implementation via Custom Configuration Files + +> 💡 All the above functional scenarios (multi-task evaluation, multi-task parallel, fixed request count, etc.) can be implemented through the [Custom Configuration File](../../advanced_tutorials/run_custom_config.md) approach. The configuration file is essentially a Python script, so it supports all Python syntax, such as loops, conditional statements, and list comprehensions. Model, dataset, summarizer, and other configurations can be written into a single file once and reused many times.
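
The idea that a configuration file is an ordinary Python script can be sketched as follows. This is a minimal illustration, not one of the preset example files: plain dicts stand in for real model entries, the `type=vLLMCausalLM` field is left out so the snippet runs without AISBench installed, and `PORTS` and `SMOKE_TEST` are hypothetical names introduced only for this example.

```python
# Hypothetical sketch: plain dicts stand in for real AISBench model entries.
# A real config would also set `type=...` on each entry.
BASE_MODEL = dict(
    abbr='vllm-qwen2.5-7b',
    path='Qwen/Qwen2.5-7B-Instruct',
    max_out_len=1024,
    batch_size=50,
)

PORTS = [8080, 8081, 8082]  # assumed service ports, for illustration only

# List comprehension: one model entry per port, each with its own abbr and url.
models = [
    dict(
        BASE_MODEL,
        abbr=f"{BASE_MODEL['abbr']}-port{port}",
        url=f'http://localhost:{port}/v1/chat/completions',
    )
    for port in PORTS
]

# Conditional + loop: shrink the batch size for a quick smoke run.
SMOKE_TEST = True  # hypothetical switch, not an AISBench option
if SMOKE_TEST:
    for model in models:
        model['batch_size'] = 1
```

Because each entry is built with `dict(BASE_MODEL, ...)`, the shared defaults live in one place and `BASE_MODEL` itself is never mutated; only the per-port copies change.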
+ +All custom configuration file examples involved in this section are uniformly stored in the `ais_bench/configs/performance_benchmark/` directory for easy reference and reuse: + +| File Name | Corresponding Scenario | +| --- | --- | +| [performance_qwen2_7b_sharegpt.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_qwen2_7b_sharegpt.py) | Single-Task Performance Evaluation | +| [performance_multi_dataset.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_multi_dataset.py) | Multi-Dataset Performance Evaluation | +| [performance_multi_rate.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_multi_rate.py) | Multi-Rate Performance Evaluation | +| [performance_multi_model.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_multi_model.py) | Multi-Model Performance Evaluation | +| [performance_synthetic.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_synthetic.py) | Synthetic Dataset Multi-Task Combinations | +| [performance_seq_combinations.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_seq_combinations.py) | Custom Sequence Multi-Task Combinations | +| [performance_fixed_request.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_fixed_request.py) | Fixed Request Count Performance Evaluation | +| [performance_re_eval.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_re_eval.py) | Performance Result Recalculation | + +For details, refer to the "Service-Oriented Performance Evaluation" example in [Running AISBench via Custom Configuration 
Files](../../advanced_tutorials/run_custom_config.md#custom-configuration-file-examples-for-each-scenario). + +## Other Functional Scenarios + +### Performance Result Recalculation + +In the actual evaluation process, the user may want to update the performance summary based on the existing inference results, for example, after modifying the `stats_list` configuration, recalculate the summary report without re-running the inference. + +AISBench supports recalculating performance summaries based on existing inference results through the `--mode perf` and `--reuse` parameters. + +For a complete example, refer to [performance_re_eval.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/performance_re_eval.py): + +```python from mmengine.config import read_base -from ais_bench.benchmark.summarizers import DefaultPerfSummarizer -from ais_bench.benchmark.calculators import DefaultPerfMetricCalculator - -summarizer = dict( - type=DefaultPerfSummarizer, - calculator=dict( - type=DefaultPerfMetricCalculator, - stats_list=["Average", "Min", "Max", "Median", "P95"], - ) -) +from ais_bench.benchmark.models import vLLMCausalLM +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + +models = [ + dict( + type=vLLMCausalLM, + abbr='vllm-qwen2.5-7b', + path='Qwen/Qwen2.5-7B-Instruct', + model_kwargs=dict( + tokenizer_path='Qwen/Qwen2.5-7B-Instruct', + ), + url='http://localhost:8080/v1/chat/completions', + max_out_len=1024, + batch_size=50, + ), +] + +# Recalculate the performance summary based on existing inference results +stats_list = [ + 'benchmark_duration', + 'num_prompts', + 'qps', + 'tput', + 'avg_latency', + 'p99_latency', +] ``` -Among them, the `stats_list` can hold data for up to 8 performance dimensions at 
the same time. -After the modification is completed, you can execute the following command to recalculate the performance metrics: +Execution command (`--mode perf` and `--reuse` are common parameters, and can still be appended through the command line when using a custom configuration file): ```bash -## Note: --summarizer default_perf must be specified -ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer default_perf --mode perf_viz --pressure --debug --reuse 20250628_151326 +ais_bench performance_re_eval.py --mode perf --reuse 20250628_151326 ``` -The on-screen performance results will be as follows: -```bash -[2025-11-06 11:11:33,463] [ais_bench] [INFO] Performance Results of task: vllm-api-general-stream/gsm8k: -╒══════════════════════════╤═════════╤════════════════╤═════════════════╤═════════════════╤════════════════╤═════════════════╤═════╕ -│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P95 │ N │ -╞══════════════════════════╪═════════╪════════════════╪═════════════════╪═════════════════╪════════════════╪═════════════════╪═════╡ -│ E2EL │ total │ 2761.6153 ms │ 2493.8016 ms │ 3086.0523 ms │ 2848.9603 ms │ 3021.0043 ms │ 8 │ -...... -╒══════════════════════════╤═════════╤═══════════════════╕ -│ Common Metric │ Stage │ Value │ -╞══════════════════════════╪═════════╪═══════════════════╡ -│ Benchmark Duration │ total │ 3090.7835 ms │ -...... -[2025-11-06 11:11:33,468] [ais_bench] [INFO] Performance Result files located in outputs/default/20251106_110904/performances/vllm-api-general-stream. - -``` -> ⚠️ The files `gsm8kdataset.csv`, `gsm8kdataset_details.json`, and `gsm8kdataset_plot.html` under `20251106_110904/performance/` will be regenerated (overwriting the original ones). - - -## Specifications for Service-Oriented Performance Testing -The scale of service-oriented performance testing determines the resource usage of the AISBench evaluation tool. 
Taking [Custom Sequence Length Evaluation](#Custom Sequence Length Evaluation) as an example, the test scale is mainly determined by the total number of requests (`RequestCount`), dataset input token length (`Input`), and output token length (`Output`). When tested on an `Intel(R) Xeon(R) Platinum 8480P` CPU, the resource usage under typical test scales is approximately as follows: - -| Total Number of Requests (`RequestCount`) | Dataset Input Token Length (`Input`) | Output Token Length (`Output`) | Maximum Memory Usage (GB) | Maximum Disk Usage (GB) | Performance Data Calculation Time (s) | Remarks | -|-------------------------------------------|--------------------------------------|---------------------------------|---------------------------|--------------------------|----------------------------------------|---------| -| 10,000 | 1024 | 1024 | < 16 | 0.12 | 3 | | -| 10,000 | 1024 | 4096 | < 16 | 0.16 | 4 | | -| 10,000 | 4096 | 4096 | < 16 | 0.17 | 6 | | -| 50,000 | 4096 | 4096 | < 32 | 0.80 | 30 | | -| 250,000 | 4096 | 4096 | < 64 | 4.00 | 150 | Maximum specification | - -> ⚠️ The maximum memory usage, maximum disk usage, and calculation time of performance data are roughly proportional to the value of (`RequestCount × (Input + Output)`). The maximum specification supported by a single machine in AISBench is `RequestCount × (Input + Output) = 250,000 × (4096 + 4096) = 2,048,000,000`.
\ No newline at end of file + +## Specifications + +The following specifications are required when using AISBench for performance evaluation: + +| Item | Specification | +| --- | --- | +| Service Status | The service must be running normally, and the listening port is consistent with the `url` field in the configuration | +| `max-model-len` | Must be greater than or equal to `prompt length + output length`, otherwise the service will reject the request | +| Network | The evaluation machine needs to be able to access the service address normally | +| Concurrency | The number of concurrent evaluations should not exceed the service's processing capacity to avoid request timeout/failure | +| Output Directory | Each task generates a timestamp directory containing `configs/`, `logs/`, `predictions/`, `results/`, `summary/`, `performances/` | \ No newline at end of file diff --git a/docs/source_en/best_practices/practice_ascend.md b/docs/source_en/best_practices/practice_ascend.md index a597841b..5ffc7c50 100644 --- a/docs/source_en/best_practices/practice_ascend.md +++ b/docs/source_en/best_practices/practice_ascend.md @@ -2,6 +2,8 @@ ### Version of AISBench Evaluation Tool Used for Reproduction The version of the AISBench evaluation tool used for reproduction in this paper is [v3.0-20250331](https://github.com/AISBench/benchmark/releases/tag/v3.0-20250331). +> 💡 All evaluation commands in this document can be implemented through the [custom configuration file approach](../advanced_tutorials/run_custom_config.md). Write configurations for models, datasets, summarizers, etc. into a single Python file for one-time writing and multiple reuse. The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. See [Running AISBench with Custom Configuration Files](../advanced_tutorials/run_custom_config.md) for details. + ### I. 
Background and Objectives #### 1.1 Significance of Reproduction ##### 1.1.1 Mathematical Reasoning Advantages of DeepSeek-R1 diff --git a/docs/source_en/best_practices/practice_nvidia.md b/docs/source_en/best_practices/practice_nvidia.md index 9b7ca0ae..044ba162 100644 --- a/docs/source_en/best_practices/practice_nvidia.md +++ b/docs/source_en/best_practices/practice_nvidia.md @@ -2,6 +2,7 @@ ### Version of AISBench Evaluation Tool Used for Reproduction The version of the AISBench evaluation tool used for reproduction in this paper is [v3.0-20250412](https://github.com/AISBench/benchmark/releases/tag/v3.0-20250412). +> 💡 All evaluation commands in this document can be implemented through the [custom configuration file approach](../advanced_tutorials/run_custom_config.md). Write configurations for models, datasets, summarizers, etc. into a single Python file for one-time writing and multiple reuse. The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. See [Running AISBench with Custom Configuration Files](../advanced_tutorials/run_custom_config.md) for details. ### I. Background and Objectives #### 1.1 Significance of Reproduction diff --git a/docs/source_en/best_practices/replicate_llm_datasets_accuracy.md b/docs/source_en/best_practices/replicate_llm_datasets_accuracy.md index e43afe64..16ca3421 100644 --- a/docs/source_en/best_practices/replicate_llm_datasets_accuracy.md +++ b/docs/source_en/best_practices/replicate_llm_datasets_accuracy.md @@ -1,47 +1,45 @@ # Reproducing Dataset Evaluation Results from Large Language Model (LLM) Papers (Technical Reports) — Taking the GPQA Dataset Used by DeepSeek R1 as an Example +> 💡 All evaluation commands in this document can be implemented through the [custom configuration file approach](../advanced_tutorials/run_custom_config.md). Write configurations for models, datasets, summarizers, etc. 
into a single Python file for one-time writing and multiple reuse. The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. See [Running AISBench with Custom Configuration Files](../advanced_tutorials/run_custom_config.md) for details. + ## Preface - Methodology -To reproduce the accuracy results reported in papers using the AISBench evaluation tool, it is essential to align with the testing methodology for the dataset as described in the model’s technical report or paper. The following configurations in the evaluation tool need to be aligned accordingly: +To reproduce the accuracy results reported in papers using the AISBench evaluation tool, it is essential to align with the testing methodology for the dataset as described in the model's technical report or paper. The following configurations in the evaluation tool need to be aligned accordingly: -### Model - Related Configurations +**Model - Related Configurations**: - Select the appropriate model task corresponding to the endpoint - Fully align the maximum output length -- Fully align the post - processing parameters +- Fully align the post-processing parameters -### Dataset - Related Configurations +**Dataset - Related Configurations**: - Fully align the prompt engineering - Fully align the answer extraction method - Align the accuracy evaluation metrics ---- - ## Example: Reproducing the Evaluation Results of the DeepSeek R1 Model on the GPQA Dataset ### Select the Appropriate Model Configuration File Corresponding to the Endpoint -For execution efficiency, inference services are generally used as the subjects under test when reproducing model accuracy. Inference services can be accessed via various endpoints, and the industry standard mainly adopts OpenAI - style endpoints. There are two primary OpenAI endpoints: `v1/completions` and `v1/chat/completions`. 
+For execution efficiency, inference services are generally used as the subjects under test when reproducing model accuracy. Inference services can be accessed via various endpoints, and the industry standard mainly adopts OpenAI-style endpoints. There are two primary OpenAI endpoints: `v1/completions` and `v1/chat/completions`. - **v1/completions**: The model generates text based on a "prefix continuation" logic and does not inherently distinguish between "instructions" and "content". Strong guidance through prompt engineering (e.g., adding "Please answer:") is required; otherwise, it may produce imitative outputs rather than executing instructions. For instance, inputting "Translate the following English to Chinese: Hello" might result in the continuation "Translate the following Chinese to English: Nihao" instead of a direct translation. - Therefore, it is suitable for single - turn text generation tasks (such as code completion, short - text writing, text continuation, and simple text classification) or scenarios that need to be compatible with legacy base models. + Therefore, it is suitable for single-turn text generation tasks (such as code completion, short-text writing, text continuation, and simple text classification) or scenarios that need to be compatible with legacy base models. - **v1/chat/completions**: The model natively understands the semantic roles of system/user/assistant, prioritizes executing user instructions, and ensures more stable dialogue consistency and intent alignment. It can complete tasks like translation and summarization without complex prompt wrapping. - Hence, it is ideal for modern LLM application scenarios such as multi - turn dialogues (customer service, chatbots), instruction - driven tasks (translation, summarization, data analysis), tool integration (function calling, retrieval - augmented generation), and multimodal interactions. 
+ Hence, it is ideal for modern LLM application scenarios such as multi-turn dialogues (customer service, chatbots), instruction-driven tasks (translation, summarization, data analysis), tool integration (function calling, retrieval-augmented generation), and multimodal interactions. -💡 As of January 2025, nearly all newly released LLM models support the `v1/chat/completions` endpoint, and the `v1/completions` endpoint has been largely deprecated. Consequently, model configuration files typically only use the model tasks for accessing the `v1/chat/completions` endpoint: **vllm_api_general_chat** (accessing the service via a non - streaming interface) and **vllm_api_stream_chat** (accessing the service via a streaming interface). +💡 As of January 2025, nearly all newly released LLM models support the `v1/chat/completions` endpoint, and the `v1/completions` endpoint has been largely deprecated. Consequently, model configuration files typically only use the model tasks for accessing the `v1/chat/completions` endpoint: **vllm_api_general_chat** (accessing the service via a non-streaming interface) and **vllm_api_stream_chat** (accessing the service via a streaming interface). Taking the model task `vllm_api_general_chat` as an example, the absolute path to its corresponding model configuration file can be obtained by running the following command: ```bash ais_bench --models vllm_api_general_chat --search ``` -⚠️ All subsequent model - related configurations will be modified in this configuration file. - ---- +⚠️ All subsequent model-related configurations will be modified in this configuration file. ### Fully Align the Maximum Output Length -The following description can be found on the [DeepSeek R1 Hugging Face Model Card](https://huggingface.co/deepseek - ai/DeepSeek - R1): +The following description can be found on the [DeepSeek R1 Hugging Face Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1): > ## 4. 
Evaluation Results -> ### DeepSeek - R1 - Evaluation +> ### DeepSeek-R1-Evaluation > For all our models, the maximum generation length is set to 32,768 tokens.... This indicates that the maximum output length of the DeepSeek R1 model is set to 32,768 tokens. @@ -54,7 +52,7 @@ models = [ dict( attr="service", type=VLLMCustomAPIChat, - abbr='vllm - api - general - chat', + abbr='vllm-api-general-chat', # ...... max_out_len=32768, # Maximum number of tokens output by the inference service # ...... @@ -62,17 +60,15 @@ models = [ ] ``` ---- - -### Fully Align the Post - processing Parameters -The following description is available on the [DeepSeek R1 Hugging Face Model Card](https://huggingface.co/deepseek - ai/DeepSeek - R1): +### Fully Align the Post-processing Parameters +The following description is available on the [DeepSeek R1 Hugging Face Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1): > ## 4. Evaluation Results -> ### DeepSeek - R1 - Evaluation -> ..., For benchmarks requiring sampling, we use a temperature of $0.6$, a top - p value of $0.95$, ... +> ### DeepSeek-R1-Evaluation +> ..., For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, ... -It can be seen from this that the post - processing parameters of the DeepSeek R1 model include a temperature of 0.6 and a top - p value of 0.95. +It can be seen from this that the post-processing parameters of the DeepSeek R1 model include a temperature of 0.6 and a top-p value of 0.95. -Taking the model task `vllm_api_general_chat` as an example, the configuration for the post - processing parameters is as follows: +Taking the model task `vllm_api_general_chat` as an example, the configuration for the post-processing parameters is as follows: ```python from ais_bench.benchmark.models import VLLMCustomAPIChat @@ -80,77 +76,125 @@ models = [ dict( attr="service", type=VLLMCustomAPIChat, - abbr='vllm - api - general - chat', + abbr='vllm-api-general-chat', # ...... 
- temperature=0.6, # Sampling temperature for text generation - top_p=0.95, # Top - p sampling parameter + generation_kwargs=dict( # Post-processing parameters are filled in here + temperature=0.6, + top_p=0.95, + ), # ...... ) ] ``` ---- - ### Fully Align Prompt Engineering -In the [DeepSeek R1 technical report](https://github.com/deepseek - ai/DeepSeek - R1/blob/main/DeepSeek_R1.pdf), the prompt format for the GPQA dataset is specified as follows: -> For GPQA, we use the 0 - shot chain - of - thought (CoT) prompt from the original GPQA paper. The prompt template is as follows: -> Q: [question] -> A: Let's think step by step. +Prompt engineering has a significant impact on model accuracy. Generally, model papers or technical reports disclose the prompts used for testing. If third-party tools are used for testing, the specific open-source tool is also explicitly stated. + +In the DeepSeek R1 paper, the prompt engineering used for the GPQA dataset test is described as follows: -In the AISBench dataset configuration file, the prompt engineering can be aligned by modifying the reader configuration, as shown below: +> Evaluation Prompts Following the setup in DeepSeek-V3, standard benchmarks such as MMLU, DROP, GPQADiamond, and SimpleQA are evaluated using prompts from the simple-evals framework. + +This indicates that the prompt engineering for DeepSeek R1 uses the prompt template from the simple-evals tool. You can refer to the [simple-evals](https://github.com/openai/simple-evals) project. The relevant portion of the prompt used by GPQA is as follows (you need to look at the code): ```python -# https://github.com/AISBench/benchmark/blob/master/ais_bench/benchmark/configs/datasets/gpqa/gpqa_gen_0_shot_cot_chat_prompt.py +QUERY_TEMPLATE_MULTICHOICE = """ +Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering. 
-gpqa_reader_cfg = dict( - # ...... - prompt_template='Q: {question}\nA: Let\'s think step by step.', - # ...... -) +{Question} + +A) {A} +B) {B} +C) {C} +D) {D} +""".strip() ``` ---- +Therefore, the prompt engineering in AISBench should be modified to: +```python +# https://github.com/AISBench/benchmark/blob/master/ais_bench/benchmark/configs/datasets/gpqa/gpqa_gen_0_shot_cot_chat_prompt.py + +## Prompt template identical to simple-evals +align_prompt = """ +Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering. + +{question} + +A) {A} +B) {B} +C) {C} +D) {D} +""".strip() + +## ...... + +gpqa_infer_cfg = dict( + prompt_template=dict( # Prompt engineering + type=PromptTemplate, + template=dict( + round=[ + dict(role='HUMAN', prompt=align_prompt), # Pass in the prompt template + ], )), + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer)) +``` +For a more detailed introduction to prompt engineering, please refer to [Prompt Template Introduction](../prompt/prompt_template.md). ### Fully Align the Answer Extraction Method -The answer format in the GPQA dataset is option - based (options A, B, C, D). In the DeepSeek R1 paper, the answer extraction method is to extract the final answer option (A/B/C/D) from the model - generated reasoning process. +How to extract answers from the model's inference results and evaluate them directly affects the evaluation scores. The answer extraction methods for LLM evaluation datasets generally fall into 3 categories: +1. For evaluation datasets of multiple-choice or Q&A types (such as ceval, gsm8k, etc.), the answer extraction method is generally based on fixed regular expressions. +2. 
For more complex datasets such as code-related ones or mathematical ones that need to include the problem-solving process (e.g., livecodebench, humaneval, math500, etc.), there is generally a unified evaluation library associated with the dataset that can be called. +3. For some open-ended or subjective datasets, it may be necessary to introduce a judge model for evaluation. -Therefore, in AISBench, a custom post - processing function for answer extraction needs to be implemented in the dataset configuration file, as shown below: -```python -# https://github.com/AISBench/benchmark/blob/master/ais_bench/benchmark/configs/datasets/gpqa/gpqa_gen_0_shot_cot_chat_prompt.py +In general, a model's paper or technical report explicitly states the answer extraction method only when the evaluation involves the third category of datasets (for example, using the GPT-4-1106 API as a judge model). For the second category of datasets, the evaluation library associated with the dataset is generally used by default for answer extraction. -import re +For the first category of datasets, if the model paper or technical report indicates which tool the prompt engineering comes from, the answer extraction method used by that tool can be adopted directly. Taking GPQA as an example, in simple-evals, the answer extraction method for GPQA is based on a fixed regular expression: +```python +# https://github.com/openai/simple-evals/blob/main/common.py +ANSWER_PATTERN_MULTICHOICE = r"(?i)Answer[ \t]*:[ \t]*\$?([A-D])\$?" 
-def gpqa_extract_answer(text): +# https://github.com/openai/simple-evals/blob/main/gpqa_eval.py +match = re.search(ANSWER_PATTERN_MULTICHOICE, response_text) +``` +Therefore, the answer extraction method in AISBench should be modified to be exactly the same regular expression as in simple-evals: +```python +# https://github.com/AISBench/benchmark/blob/master/ais_bench/benchmark/datasets/gpqa.py +@TEXT_POSTPROCESSORS.register_module() # Answer extraction function: extracts one option from A, B, C, D from the original model response string +def GPQA_Simple_Eval_postprocess(text: str) -> str: """ - Extract the final answer option (A/B/C/D) from the model - generated reasoning text + Extract one option from A, B, C, D from the original model response string as the answer. + + :param text: The original model response string. + :return: The extracted answer option (A, B, C, D). Returns None if no match is found. """ - ANSWER_PATTERN = r"Answer[ \t]*:[ \t]*\$?([A - D])\$?" + ANSWER_PATTERN = r"(?i)Answer[ \t]*:[ \t]*\$?([A-D])\$?" match = re.search(ANSWER_PATTERN, text) if match: return match.group(1) return None + +# https://github.com/AISBench/benchmark/blob/master/ais_bench/benchmark/configs/datasets/gpqa/gpqa_gen_0_shot_cot_chat_prompt.py + from ais_bench.benchmark.datasets import GPQADataset, GPQA_Simple_Eval_postprocess, GPQAEvaluator gpqa_eval_cfg = dict(evaluator=dict(type=GPQAEvaluator), - pred_postprocessor=dict(type=GPQA_Simple_Eval_postprocess, func=gpqa_extract_answer)) # Pass in the custom answer extraction function, which can also be directly defined in the dataset configuration file + pred_postprocessor=dict(type=GPQA_Simple_Eval_postprocess)) # Pass in the custom answer extraction function. The function itself can also be defined directly in the dataset configuration file ``` ---- - ### Align the Accuracy Evaluation Metrics Typically, model evaluation results are presented in a table. 
Take the results from DeepSeek as an example: -| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH - 500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating | +| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating | | ----- | ---------------- | ----------------- | ----------------- | ------------------- | -------------------- | ----------------- | -| GPT - 4o - 0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 | -| Claude - 3.5 - Sonnet - 1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 | -| o1 - mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 | +| GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 | +| Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 | +| o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 | -Here, `cons@64` and `pass@1` represent accuracy evaluation metrics. For detailed explanations of these metrics, refer to [Accuracy Metric Description](../base_tutorials/results_intro/accuracy_metric.md#ii - definition - and - relationship - between - passk - consk - and - avgn). +Here, `cons@64` and `pass@1` represent accuracy evaluation metrics. For detailed explanations of these metrics, refer to [Accuracy Metric Description](../base_tutorials/results_intro/accuracy_metric.md#ii-definition-and-relationship-between-passk-consk-and-avgn). Taking GPQA as an example, the table shows that `pass@1` is used as the accuracy evaluation metric. The description of pass@1 in the DeepSeek R1 paper is as follows: -> ..., and report pass@1 using a non - zero temperature. Specifically, we use a sampling temperature of 0.6 and a top - 𝑝 value of 0.95 to generate 𝑘 responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as -> ${\text{pass@1}} = \frac{1}{n} \sum_{i = 1}^{n} p_i$ + +> ..., and report pass@1 using a non-zero temperature. 
Specifically, we use a sampling temperature of 0.6 and a top-𝑝 value of 0.95 to generate 𝑘 responses (typically between 4 and 64, depending on the test set size) for each question. Pass@1 is then calculated as +> ${\text{pass@1}} = \frac{1}{n} \sum_{i=1}^{n} p_i$ Then in AISBench, configure the model configuration file as follows: ```python @@ -182,13 +226,10 @@ After the precision evaluation phase, the results will be recorded in the logs a ``` Among them, `avg@4` has the same meaning as `pass@1` (average over 4 runs) in DeepSeek. - > ⚠️ While `n` only affects the fluctuation range of the evaluation results and not the mathematical expectation, a larger `n` means more repeated runs for each test case, leading to higher resource consumption. When reproducing accuracy, adjustments should be made based on the actual resource availability. > 💡 If a paper does not specify the accuracy evaluation metric for a dataset, `pass@1` is generally used by default. Thus, omitting the configuration of `n` and `k` in the AISBench dataset configuration file defaults to `pass@1`. ---- - ## References -- DeepSeek R1 Hugging Face Model Card: https://huggingface.co/deepseek - ai/DeepSeek - R1 -- DeepSeek R1 Paper: https://github.com/deepseek - ai/DeepSeek - R1/blob/main/DeepSeek_R1.pdf \ No newline at end of file +- DeepSeek R1 Hugging Face Model Card: https://huggingface.co/deepseek-ai/DeepSeek-R1 +- DeepSeek R1 Paper: https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf \ No newline at end of file diff --git a/docs/source_en/conf.py b/docs/source_en/conf.py index d90e03bb..6b9d5f58 100644 --- a/docs/source_en/conf.py +++ b/docs/source_en/conf.py @@ -35,6 +35,7 @@ 'sphinx.ext.imgconverter', # Supports image format conversion 'sphinx.ext.mathjax', # Supports math formulas 'sphinx.ext.viewcode', # View source code files + 'sphinx_design', # Supports UI components such as tab-set and card ] # 4. 
If using Markdown, specify the source file suffixes @@ -58,6 +59,7 @@ 'dollarmath', # Supports $-delimited math formulas 'html_admonition', # Supports HTML admonitions 'replacements', # Supports text replacements + 'colon_fence', # Supports ::: fence directives (used for sphinx_design components such as tab-set) ] # (Optional) Configure the Mermaid output format diff --git a/docs/source_en/extended_benchmark/agent/harbor_bench.md b/docs/source_en/extended_benchmark/agent/harbor_bench.md index 306325cb..b85a447e 100644 --- a/docs/source_en/extended_benchmark/agent/harbor_bench.md +++ b/docs/source_en/extended_benchmark/agent/harbor_bench.md @@ -77,6 +77,8 @@ Terminal-Bench-2 pre-packaged images: Modify `ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py` under AISBench tool root directory: +> 💡 The above `harbor_terminal_bench_2_task.py` is a concrete application of the [custom configuration file approach](../../advanced_tutorials/run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. You can refer to this example file to write a configuration file that meets specific needs. See [Running AISBench with Custom Configuration Files](../../advanced_tutorials/run_custom_config.md) for details. + ```python models = [ dict( @@ -97,22 +99,24 @@ models = [ ] # ...... datasets = [] -datasets.append( - dict( - abbr=f'harbor_terminal-bench-2', - args=dict( - n_attempts=1, # -k/--n-attempts: Number of attempts per trial - timeout_multiplier=1.0, # --timeout-multiplier: Timeout multiplier - # ...... - n_concurrent_trials=5, # -n/--n-concurrent: Number of concurrent trials - # ...... - path="/path/to/terminal-bench-2/", # -p/--path: Local dataset path - # ...... - n_tasks=None, # --n-tasks: Maximum number of tasks, None runs all, try setting a few for quick testing - # ...... 
- ), +for task in sub_tasks: + datasets.append( + dict( + abbr=f'harbor_{task}', + args=dict( + n_attempts=1, # -k/--n-attempts: Number of attempts per trial + timeout_multiplier=1.0, # --timeout-multiplier: Timeout multiplier (all timeouts multiplied by this coefficient) + # ...... + n_concurrent_trials=5, # -n/--n-concurrent: Number of concurrent trials + # ...... + path="/path/to/terminal-bench-2/", # -p/--path: Local dataset path + # ...... + n_tasks=None, # --n-tasks: Maximum number of tasks, None defaults to running all, set a few for quick testing + # ...... + ), + ) ) -) + # ...... ``` diff --git a/docs/source_en/extended_benchmark/agent/swe_bench.md b/docs/source_en/extended_benchmark/agent/swe_bench.md index 59e4865e..ec1c7d49 100644 --- a/docs/source_en/extended_benchmark/agent/swe_bench.md +++ b/docs/source_en/extended_benchmark/agent/swe_bench.md @@ -21,6 +21,8 @@ Directory `ais_bench/configs/swe_bench_examples/` provides the following example - `mini_swe_agent_swe_bench_multilingual.py`: SWE-bench Multilingual (`SWE-bench/SWE-bench_Multilingual`) — multilingual issue statements. - `mini_swe_agent_swe_bench_multilingual_mini.py`: SWE-bench Multilingual Mini (**15**/**30**/**60** instances) — an AISBench-constructed Multilingual subset designed to significantly reduce evaluation cost; see the dataset card and construction repository: `https://modelers.cn/datasets/AISBench/SWE-Bench_Multilingual_mini` and `https://github.com/AISBench/datasets/tree/main/mini_datasets/swe_bench_multiligual_mini`. +> 💡 The example configuration files mentioned above are concrete applications of the [custom configuration file approach](../../advanced_tutorials/run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. You can refer to these example files to write a configuration file that meets specific needs. 
See [Running AISBench with Custom Configuration Files](../../advanced_tutorials/run_custom_config.md) for details. + ## 2. Prerequisites Before running, make sure the following dependencies are available: diff --git a/docs/source_en/extended_benchmark/agent/swe_bench_pro.md b/docs/source_en/extended_benchmark/agent/swe_bench_pro.md index 00501e8d..20303ab4 100644 --- a/docs/source_en/extended_benchmark/agent/swe_bench_pro.md +++ b/docs/source_en/extended_benchmark/agent/swe_bench_pro.md @@ -19,6 +19,8 @@ Directory `ais_bench/configs/swe_bench_pro_examples/` provides the following exa - `mini_swe_agent_swe_bench_pro_mini.py`: SWE-bench Pro Mini — commonly used for quick iterations. - `mini_swe_agent_swe_bench_pro_full.py`: SWE-bench Pro Full — the full test set. +> 💡 The example configuration files mentioned above are concrete applications of the [custom configuration file approach](../../advanced_tutorials/run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. You can refer to these example files to write a configuration file that meets specific needs. See [Running AISBench with Custom Configuration Files](../../advanced_tutorials/run_custom_config.md) for details. + ## 2. Prerequisites Before running, make sure the following dependencies are available: diff --git a/docs/source_en/extended_benchmark/agent/tau2_bench.md b/docs/source_en/extended_benchmark/agent/tau2_bench.md index d98c00c7..74a36126 100644 --- a/docs/source_en/extended_benchmark/agent/tau2_bench.md +++ b/docs/source_en/extended_benchmark/agent/tau2_bench.md @@ -64,6 +64,8 @@ Ensure local or cloud deployment of tested inference services following OpenAI c ### 3. Configure Custom Configuration File for τ²-Bench Tasks 1. 
Modify necessary configurations in `ais_bench/configs/agent_example/tau2_bench_task.py` under AISBench tool root directory (mainly configuring information about tested inference services and user-simulating inference services) + +> 💡 The above `tau2_bench_task.py` is a concrete application of the [custom configuration file approach](../../advanced_tutorials/run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. You can refer to this example file to write a configuration file that meets specific needs. See [Running AISBench with Custom Configuration Files](../../advanced_tutorials/run_custom_config.md) for details. ```python # ...... models = [ @@ -226,11 +228,11 @@ for task in sub_tasks: +-----------------------------------+-----------+------------------------------------------------------------+-------------+----------+-------------------------------------------------+---------------------+ | Task Name | Process | Progress | Time Cost | Status | Log Path | Extend Parameters | +===================================+===========+============================================================+=============+==========+=================================================+=====================+ -| openai-v1-chat/tau2_bench_airline | 1856223 | [###### ] 30/150 Running TAU2 Bench | 0:07:13 | running | logs/eval/openai-v1-chat/tau2_bench_airline.out | None | +| openai-v1-chat/tau2_bench_airline | 1856223 | [###### ] 30/250 Running TAU2 Bench | 0:07:13 | running | logs/eval/openai-v1-chat/tau2_bench_airline.out | None | +-----------------------------------+-----------+------------------------------------------------------------+-------------+----------+-------------------------------------------------+---------------------+ -| openai-v1-chat/tau2_bench_retail | 1856224 | [###### ] 75/342 Running TAU2 Bench | 0:11:56 | running | 
logs/eval/openai-v1-chat/tau2_bench_retail.out | None | +| openai-v1-chat/tau2_bench_retail | 1856224 | [###### ] 75/568 Running TAU2 Bench | 0:11:56 | running | logs/eval/openai-v1-chat/tau2_bench_retail.out | None | +-----------------------------------+-----------+------------------------------------------------------------+-------------+----------+-------------------------------------------------+---------------------+ -| openai-v1-chat/tau2_bench_telecom | 1856222 | [###### ] 76/342 Running TAU2 Bench | 1:09:51 | running | logs/eval/openai-v1-chat/tau2_bench_telecom.out | None | +| openai-v1-chat/tau2_bench_telecom | 1856222 | [###### ] 76/568 Running TAU2 Bench | 1:09:51 | running | logs/eval/openai-v1-chat/tau2_bench_telecom.out | None | +-----------------------------------+-----------+------------------------------------------------------------+-------------+----------+-------------------------------------------------+---------------------+ ``` diff --git a/docs/source_en/extended_benchmark/lmm_generate/gedit_bench.md b/docs/source_en/extended_benchmark/lmm_generate/gedit_bench.md index 63ab0b58..73a495a3 100644 --- a/docs/source_en/extended_benchmark/lmm_generate/gedit_bench.md +++ b/docs/source_en/extended_benchmark/lmm_generate/gedit_bench.md @@ -116,6 +116,8 @@ Place the dataset in the `${PATH_TO_WORKSPACE}/benchmark/ais_bench/datasets` dir In the container, navigate to the `${PATH_TO_WORKSPACE}/benchmark/ais_bench/configs/lmm_example` directory, open the `multi_device_run_qwen_image_edit.py` file, and edit the following content to set the model configuration: +> 💡 The above `multi_device_run_qwen_image_edit.py` is a concrete application of the [custom configuration file approach](../../advanced_tutorials/run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. 
You can refer to this example file to write a configuration file that meets specific needs. See [Running AISBench with Custom Configuration Files](../../advanced_tutorials/run_custom_config.md) for details. + ```python # ...... # ====== User configuration parameters ========= diff --git a/docs/source_en/extended_benchmark/lmm_generate/vbench.md b/docs/source_en/extended_benchmark/lmm_generate/vbench.md index 9fd8dd81..f2032f04 100644 --- a/docs/source_en/extended_benchmark/lmm_generate/vbench.md +++ b/docs/source_en/extended_benchmark/lmm_generate/vbench.md @@ -4,6 +4,8 @@ AISBench has **adapted to VBench 1.0**. The repository directory `ais_bench/configs/vbench_examples/` contains **standalone configuration file** examples for running quality/semantic dimension evaluation on generated videos on **GPU** or **NPU**. **AISBench currently does not include multimodal video generation**, so please generate videos first and then run the evaluation. (For Standard mode, see the [Dataset Generation](#dataset-generation) section.) +> 💡 The example configuration files under `vbench_examples/` mentioned above are concrete applications of the [custom configuration file approach](../../advanced_tutorials/run_custom_config.md). The configuration file is essentially a Python script that supports all Python syntax including loops, conditional statements, list comprehensions, etc. You can refer to these example files to write a configuration file that meets specific needs. See [Running AISBench with Custom Configuration Files](../../advanced_tutorials/run_custom_config.md) for details. 
+ ## Table of Contents - [Dependencies and Environment](#dependencies-and-environment) diff --git a/docs/source_en/get_started/quick_start.md b/docs/source_en/get_started/quick_start.md index 22af4efc..dc4a7b24 100644 --- a/docs/source_en/get_started/quick_start.md +++ b/docs/source_en/get_started/quick_start.md @@ -1,38 +1,116 @@ # Quick Start -## Command Meaning -A single or multiple evaluation tasks executed by the AISBench command are defined by a combination of model tasks (single or multiple), dataset tasks (single or multiple), and result presentation tasks (single). Other command-line options of AISBench specify the scenario of the evaluation task (e.g., accuracy evaluation scenario, performance evaluation scenario). Take the following AISBench command as an example: + +## Preparations Before Running the Command + +- An inference service that supports the `v1/chat/completions` sub-service is required. You can refer to 🔗 [Launching an OpenAI-Compatible Server with VLLM](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server) to start the inference service. +- The gsm8k dataset is required, which can be downloaded from 🔗 [the gsm8k dataset zip package provided by opencompass](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip). Deploy the unzipped `gsm8k/` folder to the `ais_bench/datasets` folder in the root directory of the AISBench evaluation tool. 
+ +## Start Evaluation (Choose One of Two Methods) + +| ⭐ Recommended: Using a Custom Configuration File | Alternative: Using Command-Line Arguments (Original Quick Start Method) | +| :--- | :--- | +| Modify a single file to centrally manage all configurations, with configuration written at any path | Specify via `--models` `--datasets` parameters | +| Write once, reuse multiple times | Each run requires inputting the full command | +| Supports all Python syntax for flexible extension | Only supports Cartesian product combinations | + +::::{tab-set} +:::{tab-item} ⭐ Recommended: Using a Custom Configuration File + +AISBench provides a pre-built custom configuration file [model_api_test_en.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/model_api_test_en.py), which centralizes common service-oriented inference test configurations (model selection, service address, port, generation parameters, etc.) in a single file, eliminating the need to find and modify multiple configuration files separately. This file is essentially a Python script that supports all Python syntax, allowing you to freely extend it. + +Open `ais_bench/configs/model_api_test_en.py` and modify the following configurations according to the actual situation (If you installed the tool via `pip3 install ais_bench_benchmark`, you can create `model_api_test_en.py` at any path and write the following configuration content into that file): + +```python +from mmengine.config import read_base + +with read_base(): +# Model tasks, choose one of them. 
For other model tasks, refer to: https://ais-bench-benchmark-rf.readthedocs.io/en/latest/base_tutorials/all_params/models.html to obtain more model tasks + # vllm_api_general is a base model that only supports text generation + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + # vllm_api_general_chat is a chat model that supports dialogue + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + # vllm_api_stream_chat is a streaming chat model that supports streaming dialogue + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + # vllm_api_general_stream is a streaming model that supports streaming generation + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + +# Dataset tasks, refer to: https://ais-bench-benchmark-rf.readthedocs.io/en/latest/get_started/datasets.html to obtain more dataset tasks + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = vllm_api_general_chat + +models[0]["path"] = "" # Specify the absolute path to the model serialized vocabulary file (generally not required for accuracy testing scenarios) +models[0]["model"] = "" # Specify the name of the model loaded on the server, configured according to the actual model name pulled by the VLLM inference service (configure as an empty string to automatically retrieve it) +models[0]["request_rate"] = 0 # Request sending frequency: send 1 request to the server every 1/request_rate seconds; if less than 0.001, all requests are sent at once +models[0]["api_key"] = "" # Custom API key, default is an empty string +models[0]["host_ip"] = "localhost" # Specify the IP of the inference service +models[0]["host_port"] = 8080 # Specify the port of the inference service +models[0]["url"] = "" # Custom URL 
path for accessing the inference service (needs to be configured when the base URL is not a combination of http://host_ip:host_port; after configuration, host_ip and host_port will be ignored) +models[0]["max_out_len"] = 512 # Maximum number of tokens output by the inference service +models[0]["batch_size"] = 1 # Maximum concurrency for sending requests +models[0]["trust_remote_code"] = False # Whether the tokenizer trusts remote code, default is False +models[0]["generation_kwargs"] = dict( # Model inference parameters, configured with reference to the VLLM documentation; the AISBench evaluation tool does not process these parameters and attaches them directly to the sent requests + temperature=0.01, + ignore_eos=False, +) + +# datasets[0]["path"] = "ais_bench/datasets/gsm8k" # Specify the absolute path of the dataset directory (required for accuracy testing scenarios) + +work_dir = 'outputs/default/' # Specify the working directory for saving task results and logs (default is outputs/default/) + +``` + +> 💡 The configuration file already pre-imports commonly used model types (`vllm_api_general`, `vllm_api_general_chat`, `vllm_api_stream_chat`, `vllm_api_general_stream`); to switch, just uncomment or modify the relevant lines. For more ways to use custom configuration files, please refer to 📚 [Running AISBench with a Custom Configuration File](../advanced_tutorials/run_custom_config.md). + +The selection, preparation, and usage of dataset tasks are described in the following steps: + +1. Select a dataset task from 📚 [Open Source Datasets](https://ais-bench-benchmark.readthedocs.io/en/latest/get_started/datasets.html#open-source-datasets). +2. Follow the 📚 [Detailed Introduction / Dataset Deployment](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README_en.md#dataset-deployment) for that dataset to prepare it. +3. 
Refer to 📚 [Detailed Introduction / Available Dataset Tasks](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README_en.md#available-dataset-tasks) to select an available dataset task, and copy the corresponding import statement (e.g., `from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets`) into the custom configuration file. + +After modifying the configuration file, run the following command to start the service-oriented accuracy evaluation: + +```bash +ais_bench ais_bench/configs/model_api_test_en.py +``` + +::: +:::{tab-item} Alternative: Using Command-Line Arguments + +If you prefer the command-line argument approach, AISBench also supports specifying tasks directly via the `--models`, `--datasets`, and `--summarizer` parameters. The following command-line approach produces **exactly the same result** as the custom configuration file approach above. + +One or more evaluation tasks executed by the AISBench command are defined by a combination of model tasks (one or more), dataset tasks (one or more), and a single result presentation task. Take the following AISBench command as an example: + ```shell ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example ``` + This command does not specify other command-line options, so it defaults to an accuracy evaluation scenario task, where: - `--models` specifies the model task, i.e., the `vllm_api_general_chat` model task. - - `--datasets` specifies the dataset task, i.e., the `demo_gsm8k_gen_4_shot_cot_chat_prompt` dataset task. +- `--summarizer` specifies the result presentation task, i.e., the `example` result presentation task (if `--summarizer` is not specified, the `example` task is used by default in the accuracy evaluation scenario). 
It is generally recommended to use the default, so there is no need to specify it in the command line. -- `--summarizer` specifies the result presentation task, i.e., the `example` result presentation task (if `--summarizer` is not specified, the `example` task is used by default in the accuracy evaluation scenario). It is generally recommended to use the default, so there is no need to specify it in the command line, and subsequent commands will omit it. - -## Task Meaning Query (Optional) -The specific information (introduction, usage constraints, etc.) of the selected model task `vllm_api_general_chat`, dataset task `demo_gsm8k_gen_4_shot_cot_chat_prompt`, and result presentation task `example` can be queried from the following links respectively: -- `--models`: 📚 [Service-Oriented Inference Backend](../base_tutorials/all_params/models.md#service-oriented-inference-backend) +For multi-task evaluation, please refer to: 📚 [Multi-Task Evaluation](../base_tutorials/scenes_intro/accuracy_benchmark.md#multi-task-evaluation) for accuracy scenarios and 📚 [Multi-Task Evaluation](../base_tutorials/scenes_intro/performance_benchmark.md#multi-task-evaluation) for performance scenarios. -- `--datasets`: 📚 [Open Source Datasets](../get_started/datasets.md#open-source-datasets) → 📚 [Detailed Introduction](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README_en.md) +For more flexible evaluation methods with self-combined tasks, you can refer to: 📚 [Running AISBench with a Custom Configuration File](../advanced_tutorials/run_custom_config.md#running-aisbench-with-a-custom-configuration-file). -- `--summarizer`: 📚 [Result Summary Tasks](../base_tutorials/all_params/summarizer.md) +The specific information (introduction, usage constraints, etc.) 
of the selected model task `vllm_api_general_chat`, dataset task `demo_gsm8k_gen_4_shot_cot_chat_prompt`, and result presentation task `example` can be queried from the following links respectively: -## Preparations Before Running the Command -- `--models`: To use the `vllm_api_general_chat` model task, you need to prepare an inference service that supports the `v1/chat/completions` sub-service. You can refer to 🔗 [Launching an OpenAI-Compatible Server with VLLM](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server) to start the inference service. -- `--datasets`: To use the `demo_gsm8k_gen_4_shot_cot_chat_prompt` dataset task, you need to prepare the gsm8k dataset, which can be downloaded from 🔗 [the gsm8k dataset zip package provided by opencompass](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip). Deploy the unzipped `gsm8k/` folder to the `ais_bench/datasets` folder in the root directory of the AISBench evaluation tool. +- `--models`: 📚 [Service-Oriented Inference Backend](https://ais-bench-benchmark.readthedocs.io/en/latest/base_tutorials/all_params/models.html#service-oriented-inference-backend) +- `--datasets`: 📚 [Open Source Datasets](https://ais-bench-benchmark.readthedocs.io/en/latest/get_started/datasets.html#open-source-datasets) → 📚 [Detailed Introduction](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README_en.md) +- `--summarizer`: 📚 [Result Summary Tasks](https://ais-bench-benchmark.readthedocs.io/en/latest/base_tutorials/all_params/summarizer.html) -## Modification of Configuration Files Corresponding to Tasks Each model task, dataset task, and result presentation task corresponds to a configuration file. You need to modify the content of these configuration files before running the command. The paths of these configuration files can be queried by adding `--search` to the original AISBench command. 
For example: + ```shell ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search ``` + > ⚠️ **Note**: Executing the command with the `search` option will print the absolute paths of the configuration files corresponding to the tasks. Executing the query command will yield the following results: + ```shell -06/28 11:52:25 - AISBench - INFO - Searching configs... ╒══════════════╤═══════════════════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕ │ Task Type │ Task Name │ Config File Path │ ╞══════════════╪═══════════════════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡ @@ -43,9 +121,10 @@ Executing the query command will yield the following results: ``` -- The dataset task configuration file `demo_gsm8k_gen_4_shot_cot_chat_prompt.py` in the quick start does not require additional modifications. For an introduction to the content of the dataset task configuration file, please refer to 📚 [Configuring Open Source Datasets](../get_started/datasets.md#configuring-open-source-datasets). +- The dataset task configuration file `demo_gsm8k_gen_4_shot_cot_chat_prompt.py` in the quick start does not require additional modifications. For an introduction to the content of the dataset task configuration file, please refer to 📚 [Configuring Open Source Datasets](https://ais-bench-benchmark.readthedocs.io/en/latest/base_tutorials/all_params/datasets.html#configuring-open-source-datasets). The model configuration file `vllm_api_general_chat.py` contains configuration content related to model operation and needs to be modified according to the actual situation. The content that needs to be modified in the quick start is marked with comments. 
+ ```python from ais_bench.benchmark.models import VLLMCustomAPIChat @@ -55,7 +134,7 @@ models = [ type=VLLMCustomAPIChat, abbr='vllm-api-general-chat', path="", # Specify the absolute path of the model serialized vocabulary file (configuration is generally not required for accuracy testing scenarios). - model="DeepSeek-R1", # Specify the name of the model loaded on the server, configured according to the actual model name pulled by the VLLM inference service (configure as an empty string to get it automatically) + model="", # Specify the name of the model loaded on the server, configured according to the actual model name pulled by the VLLM inference service (configure as an empty string to get it automatically) stream=False, request_rate=0, # Request sending frequency: send 1 request to the server every 1/request_rate seconds; if less than 0.1, all requests are sent at once use_timestamp=False, # Whether to schedule requests by dataset timestamp; used with timestamped datasets (e.g. Mooncake Trace) @@ -63,7 +142,7 @@ models = [ api_key="", # Custom API key, default is an empty string host_ip="localhost", # Specify the IP of the inference service host_port=8080, # Specify the port of the inference service - url="", # Custom access path for the inference service (required when the base URL is not http://host_ip:host_port, and will ignore host_ip and host_port) + url="", # Custom access path for the inference service (required when the base URL is not http://host_ip:host_port; after configuration, host_ip and host_port will be ignored) max_out_len=512, # Maximum number of tokens output by the inference service batch_size=1, # Maximum concurrency for sending requests trust_remote_code=False, # Whether to trust remote code in the tokenizer, default False; @@ -74,37 +153,46 @@ models = [ ) ] ``` -## Execute Command + After modifying the configuration file, run the following command to start the service-oriented accuracy evaluation: + ```bash ais_bench --models 
vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt ``` +::: +:::: + ## View Task Execution Details -After executing the AISBench command, the status of the running task will be displayed on a real-time refreshing dashboard in the command line (press the "P" key to pause refreshing for copying dashboard information, and press "P" again to resume refreshing). Example: + +After executing the AISBench command, the task management dashboard refreshes in real time in the command line to show the task execution status (press the "P" key to pause refreshing so that dashboard information can be copied, and press "P" again to resume refreshing). The task management dashboard supports monitoring the detailed execution status of multiple tasks simultaneously, including task name, progress, time cost, status, log path, and extended parameters. For example: + ``` Base path of result&log : outputs/default/20250628_151326 Task Progress Table (Updated at: 2025-11-06 10:08:21) Page: 1/1 Total 2 rows of data -Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to exit +Press Up/Down arrow to page, 'P' to PAUSE/RESUME screen refresh, 'Ctrl + C' to exit +----------------------------------+-----------+-------------------------------------------------+-------------+-------------+-------------------------------------------------+------------------------------------------------+ | Task Name | Process | Progress | Time Cost | Status | Log Path | Extend Parameters | -+==================================+===========+=================================================+=============+=============+=================================================+================================================+ ++==================================+===========+=================================================+=============+=============+=================================================+================================================+ |
vllm-api-general-chat/demo_gsm8k | 547141 | [############### ] 4/8 [0.5 it/s] | 0:00:11 | inferencing | logs/infer/vllm-api-general-chat/demo_gsm8k.out | {'POST': 5, 'RECV': 4, 'FINISH': 4, 'FAIL': 0} | +----------------------------------+-----------+-------------------------------------------------+-------------+-------------+-------------------------------------------------+------------------------------------------------+ + ``` Detailed logs of task execution are continuously written to the default output path, which is displayed on the real-time refreshing dashboard as `Log Path`. The `Log Path` (`logs/infer/vllm-api-general-chat/demo_gsm8k.out`) is located under the `Base path` (`outputs/default/20250628_151326`). Using the dashboard information above as an example, the path to the detailed task execution log is: + ```shell # {Base path}/{Log Path} outputs/default/20250628_151326/logs/infer/vllm-api-general-chat/demo_gsm8k.out ``` > 💡 To print detailed logs directly during execution, add the `--debug` parameter to the command: -`ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug` +> `ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug` The `Base path` (`outputs/default/20250628_151326`) contains all task execution details. After the command completes, the full execution details are structured as follows: + ```shell 20250628_151326/ ├── configs # Combined configuration file for model tasks, dataset tasks, and structure presentation tasks @@ -131,7 +219,9 @@ The `Base path` (`outputs/default/20250628_151326`) contains all task execution > ⚠️ **Note**: The content of task execution details written to disk varies across different evaluation scenarios. Please refer to the guide for the specific evaluation scenario. ### Output Results + Since there are only 8 data samples, the results will be generated quickly. 
Example output: + ```bash dataset version metric mode vllm_api_general_chat ----------------------- -------- -------- ----- ---------------------- diff --git a/docs/source_en/index.rst b/docs/source_en/index.rst index b49ef71d..06710951 100644 --- a/docs/source_en/index.rst +++ b/docs/source_en/index.rst @@ -21,7 +21,7 @@ To help you quickly get started with AISBench Benchmark Tool, we recommend learn * The :doc:`Quick Start ` provided in this tutorial will guide you through basic accuracy evaluation configuration and execution. * The :doc:`Dataset Preparation Guide ` will help you understand the supported datasets and how to prepare them for evaluation. * The Basic Tutorial section will introduce :doc:`Evaluation Scenario Introduction `, :doc:`Evaluation Result Explanation `, and :doc:`Detailed Parameter Description ` to help you better understand the use of major evaluation scenarios. -* For a deeper understanding of advanced usage of AISBench Benchmark Tool, you can refer to the :doc:`Advanced Tutorial `. +* For a deeper understanding of advanced usage of AISBench Benchmark Tool, you can refer to the :doc:`Advanced Tutorial `. We **strongly recommend** reading :doc:`Running AISBench with a Custom Configuration File `. The configuration file is essentially a Python script, so it supports all Python syntax, including loops, conditional statements, and list comprehensions. You can put the model, dataset, and summarizer configurations into a single file that is written once and reused many times, which covers nearly all evaluation scenarios. * You can refer to the :doc:`Best Practices ` section to learn best practices for using AISBench Benchmark Tool in different scenarios. * Finally, you can refer to the :doc:`Frequently Asked Questions ` section to solve problems encountered during the use of AISBench Benchmark Tool.
@@ -62,6 +62,7 @@ To help you quickly get started with AISBench Benchmark Tool, we recommend learn :hidden: extended_benchmark/lmm_generate/index + extended_benchmark/agent/index .. toctree:: :maxdepth: 2 diff --git a/docs/source_zh_cn/advanced_tutorials/custom_dataset.md b/docs/source_zh_cn/advanced_tutorials/custom_dataset.md index 296737b6..ba5ddd46 100644 --- a/docs/source_zh_cn/advanced_tutorials/custom_dataset.md +++ b/docs/source_zh_cn/advanced_tutorials/custom_dataset.md @@ -122,6 +122,8 @@ datasets = [ ] ``` +> 💡 上述配置文件方式本质上就是 [自定义配置文件方式](run_custom_config.md) 的简化应用。更复杂的场景(如多模型多数据集组合、自定义模型参数、裁判模型等)请参考 [自定义配置文件运行AISBench](run_custom_config.md#各场景自定义配置文件示例) 中"自定义数据集测评"示例。 + ### 数据集补充信息`.meta.json`使用指南 目前仅支持性能测评场景。ais_bench 会默认尝试对输入的数据集文件进行解析,因此在绝大多数情况下,`.meta.json` 文件都是 **不需要** 的。但是,如果原生数据集中没有指定max_tokens,或者需要通过配置进行数据采样等,则需要在 `.meta.json` 文件中进行指定。 diff --git a/docs/source_zh_cn/advanced_tutorials/judge_model_evaluate.md b/docs/source_zh_cn/advanced_tutorials/judge_model_evaluate.md index 0dd6c731..fc62cade 100644 --- a/docs/source_zh_cn/advanced_tutorials/judge_model_evaluate.md +++ b/docs/source_zh_cn/advanced_tutorials/judge_model_evaluate.md @@ -191,6 +191,11 @@ outputs/default/20260305_153318/logs/eval/vllm-api-general-chat/aime2025-judge.o ## 其他精度评测功能场景 从裁判模型的快速上手章节可以看到,除了需要额外修改数据配置文件中裁判模型的配置,其他测评执行方式与常规测评执行方式完全一致,因此其他精度评测功能场景的执行方式也是完全一致的。 + +### 通过自定义配置文件实现 + +> 💡 上述裁判模型测评场景也可以通过 [自定义配置文件方式](run_custom_config.md) 实现。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将被测模型、裁判模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用。详见 [自定义配置文件运行AISBench](run_custom_config.md#各场景自定义配置文件示例) 中"裁判模型测评"示例。 + ### 多任务测评 参考[精度评测场景多任务测评](../base_tutorials/scenes_intro/accuracy_benchmark.md#多任务测评) ### 多任务并行测评 diff --git a/docs/source_zh_cn/advanced_tutorials/multimodal_benchmark.md b/docs/source_zh_cn/advanced_tutorials/multimodal_benchmark.md index 9d2f191b..891b458c 100644 --- 
a/docs/source_zh_cn/advanced_tutorials/multimodal_benchmark.md +++
b/docs/source_zh_cn/advanced_tutorials/multimodal_benchmark.md @@ -28,6 +28,8 @@ ## 快速入门 + +> 💡 多模态测评场景也可以通过 [自定义配置文件方式](run_custom_config.md) 实现。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将多模态模型、多模态数据集、summarizer 等配置写入一个文件,一次编写、多次复用。详见 [自定义配置文件运行AISBench](run_custom_config.md)。 ### 多模态输入格式 服务化的多模态数据输入有多种格式,以图片+文本输入举例如下: - 方式1:本地文件格式,默认方法 diff --git a/docs/source_zh_cn/advanced_tutorials/multiturn_benchmark.md b/docs/source_zh_cn/advanced_tutorials/multiturn_benchmark.md index a98b8efc..fb917eda 100644 --- a/docs/source_zh_cn/advanced_tutorials/multiturn_benchmark.md +++ b/docs/source_zh_cn/advanced_tutorials/multiturn_benchmark.md @@ -151,6 +151,11 @@ ais_bench --models vllm_api_stream_chat --datasets sharegpt_gen -m perf --debug + +### 通过自定义配置文件实现 + +> 💡 上述多轮对话性能测评场景也可以通过 [自定义配置文件方式](run_custom_config.md) 实现。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用。详见 [自定义配置文件运行AISBench](run_custom_config.md#各场景自定义配置文件示例) 中"多轮对话性能测评"示例。 + ### 性能细节查看 执行AISBench命令后,任务执行更多细节最终会落盘在默认的输出路径,这个输出路径在运行中的打屏日志中有提示,例如: ```shell 06/28 15:13:26 - AISBench - INFO - Current exp folder: outputs/default/20250628_151326 ``` diff --git a/docs/source_zh_cn/advanced_tutorials/rps_distribution.md b/docs/source_zh_cn/advanced_tutorials/rps_distribution.md index c286b92e..6fa2913c 100644 --- a/docs/source_zh_cn/advanced_tutorials/rps_distribution.md +++ b/docs/source_zh_cn/advanced_tutorials/rps_distribution.md @@ -329,6 +329,10 @@ $\lambda_i = \lambda_{\text{start}} \times \left(\frac{\lambda_{end}}{\lambda_{s 4. **压测场景下,控制连接数创建的频率,不控制请求发送速率(每个连接创建后,会不间断的执行请求发送和处理返回)** 5.
**多轮对话场景下,仅第一轮的请求分布有效** +## 通过自定义配置文件实现 + +> 💡 上述 RPS 分布控制参数(`traffic_cfg`)在 [自定义配置文件方式](run_custom_config.md) 中同样适用。只需在模型配置的 dict 中添加 `traffic_cfg` 字段即可。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用。详见 [自定义配置文件运行AISBench](run_custom_config.md)。 + --- ## 配置与可视化示例 diff --git a/docs/source_zh_cn/advanced_tutorials/run_custom_config.md b/docs/source_zh_cn/advanced_tutorials/run_custom_config.md index 025832ef..817ee256 100644 --- a/docs/source_zh_cn/advanced_tutorials/run_custom_config.md +++ b/docs/source_zh_cn/advanced_tutorials/run_custom_config.md @@ -2,13 +2,678 @@ AISBench常规命令调用方式是通过`--models`指定模型任务,通过`--datasets`指定数据集任务,通过`--summarizer`指定结果呈现任务来决定运行的测评任务,AISBench同样也支持指定自定义的配置文件将这三类任务对应的配置文件信息组合在一起,从而实现自定义的任务组合运行。 +## 为什么使用自定义配置文件 + +AISBench 提供了两种运行方式:**命令行参数方式(CLI)** 与 **自定义配置文件方式**。在实际使用中,推荐优先使用自定义配置文件方式,原因如下: + +| 对比维度 | CLI 方式 | 配置文件方式 | +| --- | --- | --- | +| **可复用性** | 每次运行需要重新输入完整命令 | 配置文件可保存、版本管理、反复使用 | +| **表达能力** | 只能通过参数指定模型/数据集名称 | 可以精确控制模型参数、数据集采样范围、推理配置等所有细节 | +| **组合灵活性** | 仅支持笛卡尔积组合 | 支持 `model_dataset_combinations` 自定义任意模型-数据集配对 | +| **参数覆盖** | 无法修改预设模型/数据集内部参数 | 可直接修改 `abbr`、`test_range`、`host_ip`、`host_port` 等任意字段 | +| **批量运行** | 需要多次执行命令 | 一个配置文件即可同时运行多模型、多数据集组合 | +| **团队协作** | 命令难以共享和追溯 | 配置文件即代码,可提交到代码仓库进行 review 和复用 | + +**总结**:CLI 方式适合快速验证,配置文件方式适合正式的、可复现的、复杂的测评场景。 + +## 配置文件即 Python 脚本 + +AISBench 的自定义配置文件本质上就是一个 Python 脚本。这意味着你可以在配置文件中使用所有 Python 语法特性来灵活构建测评任务。 + +### 使用 for 循环批量构建模型配置 + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + +datasets = gsm8k_0_shot_cot_str + +models = [] +for port in [8080, 8081, 8082]: + models.append( + dict( + attr="service", + type=VLLMCustomAPIChat, + 
abbr=f'vllm-api-chat-port-{port}', +
path="", + model="", + request_rate=0, + retry=2, + host_ip="localhost", + host_port=port, + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.5, top_k=10, top_p=0.95), + ) + ) + +work_dir = 'outputs/multi_port_benchmark/' +``` + +### 使用列表推导式批量添加数据集 + +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat +datasets = [ + dict(d, abbr=f'my_{d["abbr"]}', reader_cfg=dict(d.get('reader_cfg', {}), test_range='[0:100]')) + for d in datasets +] + +models = vllm_api_general_chat +work_dir = 'outputs/my_benchmark/' +``` + +### 条件配置:根据环境变量切换 + +```python +import os +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat, VLLMCustomAPI + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + +datasets = gsm8k_0_shot_cot_str + +use_stream = os.environ.get('USE_STREAM', 'false').lower() == 'true' +model_type = VLLMCustomAPIChat if use_stream else VLLMCustomAPI + +models = [ + dict( + attr="service", + type=model_type, + abbr='vllm-api-conditional', + path="", + model="", + stream=use_stream, + request_rate=0, + retry=2, + host_ip=os.environ.get('HOST_IP', 'localhost'), + host_port=int(os.environ.get('HOST_PORT', '8080')), + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.5, top_k=10, top_p=0.95), + ) +] + +work_dir = 
'outputs/conditional_benchmark/' +``` + +### 使用 `.copy()` 复用并修改模型配置 + +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +datasets = gsm8k_0_shot_cot_str + +model_high_temp = [dict(m, generation_kwargs=m['generation_kwargs'].copy()) for m in vllm_api_general_chat] # 逐项复制:list.copy() 只是浅拷贝,两份配置会共享同一个 dict +model_high_temp[0]['abbr'] = vllm_api_general_chat[0]['abbr'] + '-high-temp' +model_high_temp[0]['generation_kwargs']['temperature'] = 0.9 + +model_low_temp = [dict(m, generation_kwargs=m['generation_kwargs'].copy()) for m in vllm_api_general_chat] # 同上,独立复制一份 +model_low_temp[0]['abbr'] = vllm_api_general_chat[0]['abbr'] + '-low-temp' +model_low_temp[0]['generation_kwargs']['temperature'] = 0.1 + +models = model_high_temp + model_low_temp +work_dir = 'outputs/temperature_comparison/' +``` + +## 配置文件完整变量参考 + +自定义配置文件中可以定义以下顶层变量。所有变量均为可选,但至少需要定义 `models` 和 `datasets` 才能运行推理任务。 + +| 变量名 | 类型 | 是否必需 | 说明 | +| --- | --- | --- | --- | +| `models` | `list[dict]` | 是(推理时) | 模型配置列表。每个元素是一个字典,至少包含 `type`(模型类)、`abbr`(唯一标识)字段。服务化模型还需 `attr="service"`、`host_ip`、`host_port` 等;本地模型还需 `path`、`tokenizer_path` 等 | +| `datasets` | `list[dict]` | 是(推理时) | 数据集配置列表。每个元素是一个字典,至少包含 `type`(数据集类)、`abbr`(唯一标识)、`reader_cfg`、`infer_cfg`、`eval_cfg` 字段 | +| `summarizer` | `dict` | 否 | 结果汇总器配置。通常从 `ais_bench.benchmark.configs.summarizers.example` 导入。包含 `attr` 和 `summary_groups` 字段 | +| `model_dataset_combinations` | `list[dict]` | 否 | 自定义模型-数据集配对列表。每个元素为 `dict(models=[...], datasets=[...])`。不指定时,默认对 `models` 和 `datasets` 做笛卡尔积组合 | +| `work_dir` | `str` | 否 | 工作目录,推理结果和日志将输出到此目录下。默认为 `outputs/default/` | +| `infer` | `dict` | 否 | 
推理流程配置。包含 `partitioner`(分区器)、`runner`(运行器,内含 `max_num_workers` 和 `task`)。不指定时使用默认推理流程 | +| `eval` | `dict` | 否 | 评测流程配置。结构同 `infer`。仅在需要独立评测阶段时使用(如 SWE-Bench、VBench 等场景) | + +### models 字段详解 + +每个模型配置字典的常用字段: + +| 字段 |
类型 | 说明 | +| --- | --- | --- | +| `type` | class | 模型类,如 `VLLMCustomAPIChat`、`VLLMCustomAPI`、`HuggingFaceBaseModel`、`HuggingFacewithChatTemplate` 等 | +| `abbr` | `str` | 模型唯一标识,用于结果表格中的列名。同一配置文件中相同 `abbr` 的模型与数据集组合会被视为重复任务而跳过 | +| `attr` | `str` | 模型属性,服务化模型为 `"service"`,本地模型为 `"local"` | +| `path` | `str` | 模型路径(本地模型必填,服务化模型可为空字符串) | +| `model` | `str` | 服务化推理时指定的模型名称 | +| `host_ip` | `str` | 推理服务 IP 地址(服务化模型) | +| `host_port` | `int` | 推理服务端口(服务化模型) | +| `stream` | `bool` | 是否使用流式推理 | +| `max_out_len` | `int` | 最大输出 token 数 | +| `batch_size` | `int` | 推理 batch size | +| `max_seq_len` | `int` | 最大输入序列长度 | +| `request_rate` | `int` | 请求速率限制,0 表示不限制 | +| `retry` | `int` | 请求失败重试次数 | +| `generation_kwargs` | `dict` | 生成参数,如 `temperature`、`top_k`、`top_p`、`seed` 等 | +| `tokenizer_path` | `str` | Tokenizer 路径(本地模型) | +| `model_kwargs` | `dict` | 模型加载参数(本地模型),如 `device_map` | +| `tokenizer_kwargs` | `dict` | Tokenizer 参数(本地模型),如 `padding_side` | +| `run_cfg` | `dict` | 多卡/多机运行配置(本地模型),如 `dict(num_gpus=1, num_procs=1)` | +| `pred_postprocessor` | `dict` | 模型输出后处理器,如 `dict(type=extract_non_reasoning_content)` | + +### datasets 字段详解 + +每个数据集配置字典的常用字段: + +| 字段 | 类型 | 说明 | +| --- | --- | --- | +| `type` | class | 数据集类,如 `GSM8KDataset`、`MATHDataset`、`SyntheticDataset` 等 | +| `abbr` | `str` | 数据集唯一标识,用于结果表格中的行名 | +| `path` | `str` | 数据集文件路径 | +| `reader_cfg` | `dict` | 读取器配置,包含 `input_columns`、`output_column`,可选 `test_range` 控制采样范围(如 `'[0:100]'`) | +| `infer_cfg` | `dict` | 推理配置,包含 `prompt_template`、`retriever`、`inferencer` | +| `eval_cfg` | `dict` | 评测配置,包含 `evaluator` 和可选的 `pred_postprocessor` | +| `judge_infer_cfg` | `dict` | 裁判模型推理配置(需要 LLM Judge 的数据集),包含 `judge_model`、`judge_dataset_type`、`prompt_template`、`retriever`、`inferencer` | + +### infer 字段详解 + +```python +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) +``` + ## 使用说明 ```bash ais_bench 
ais_bench/configs/{模型类型}_examples/{任务配置文件名} # 示例: ais_bench ais_bench/configs/api_examples/infer_vllm_api_general.py - ``` +``` + +## 各场景自定义配置文件示例 + +### 1. 服务化精度测评 + +通过 API 访问推理服务,使用真实数据集进行精度测评。适用于 vLLM、MindIE、TGI、Triton 等服务化部署场景。 + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_chat_prompt import gsm8k_datasets as gsm8k_0_shot_cot_chat + +datasets = [*gsm8k_0_shot_cot_chat] + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-general-chat', + model="", + request_rate=0, + retry=2, + host_ip="localhost", + host_port=8080, + max_out_len=512, + batch_size=1, + generation_kwargs=dict( + temperature=0.5, + top_k=10, + top_p=0.95, + seed=None, + repetition_penalty=1.03, + ) + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/api-vllm-general-chat/' +``` + +### 2. 
纯模型精度测评 + +使用 HuggingFace 本地模型直接进行推理测评,无需部署服务。 + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFaceBaseModel +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_chat_prompt import gsm8k_datasets as gsm8k_0_shot_cot_chat + +datasets = [*gsm8k_0_shot_cot_chat] + +models = [ + dict( + type=HuggingFaceBaseModel, + abbr='hf-base-model', + path='THUDM/chatglm-6b', + tokenizer_path='THUDM/chatglm-6b', + model_kwargs=dict(device_map='auto'), + tokenizer_kwargs=dict(padding_side='left'), + generation_kwargs=dict( + temperature=0.5, + top_k=10, + top_p=0.95, + do_sample=True, + seed=None, + repetition_penalty=1.03, + ), + max_out_len=100, + batch_size=1, + max_seq_len=2048, + batch_padding=True, + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/hf-base-model/' +``` + +### 3. 
服务化性能测评 + +使用合成数据集对推理服务进行性能压测,输出 TTFT(首 Token 延迟)、TPOT(每 Token 延迟)、E2EL(端到端延迟)等指标。 + +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import ( + synthetic_datasets, + ) + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import ( + models as vllm_api_general_stream, + ) + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import ( + models as vllm_api_stream_chat, + ) + +datasets = synthetic_datasets + +vllm_api_general_stream[0]["abbr"] = "demo-" + vllm_api_general_stream[0]["abbr"] +vllm_api_stream_chat[0]["abbr"] = "demo-" + vllm_api_stream_chat[0]["abbr"] + +models = vllm_api_general_stream + vllm_api_stream_chat + +work_dir = "outputs/demo_api-vllm-stream-perf/" +``` + +运行命令: + +```bash +ais_bench ais_bench/configs/api_examples/demo_infer_vllm_api_perf.py -m perf +``` + +### 4. 
合成数据集性能测评 + +自定义合成数据集的参数,控制请求数量、输入/输出 token 长度分布等。 + +```python +from mmengine.config import read_base +from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate +from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever +from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer +from ais_bench.benchmark.datasets import SyntheticDataset, MATHEvaluator, math_postprocess_v2 + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import ( + models as vllm_api_general_stream, + ) + +synthetic_config = { + "Type": "string", + "RequestCount": 100, + "TrustRemoteCode": False, + "StringConfig": { + "Input": { + "Method": "uniform", + "Params": {"MinValue": 1, "MaxValue": 500} + }, + "Output": { + "Method": "gaussian", + "Params": {"Mean": 200, "Var": 100, "MinValue": 1, "MaxValue": 500} + } + }, +} + +datasets = [ + dict( + abbr='synthetic_custom', + type=SyntheticDataset, + config=synthetic_config, + reader_cfg=dict(input_columns=['question', 'max_out_len'], output_column='answer'), + infer_cfg=dict( + prompt_template=dict(type=PromptTemplate, template="{question}"), + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + eval_cfg=dict( + evaluator=dict(type=MATHEvaluator, version='v2'), + pred_postprocessor=dict(type=math_postprocess_v2), + ), + ) +] + +models = vllm_api_general_stream +work_dir = 'outputs/synthetic_perf_custom/' +``` + +### 5. 
多模型多数据集组合 + +同时测评多个模型在多个数据集上的表现,利用笛卡尔积自动组合。 + +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + from ais_bench.benchmark.configs.datasets.mmlu.mmlu_gen_5_shot_str import mmlu_datasets as mmlu_5_shot_str + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat + mmlu_5_shot_str +models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat + +work_dir = 'outputs/multi_model_multi_dataset/' +``` + +### 6. 
自定义模型-数据集配对 + +通过 `model_dataset_combinations` 精确控制哪些模型与哪些数据集组合,避免不必要的笛卡尔积。 + +```python +from mmengine.config import read_base + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_0_shot_cot_str import gsm8k_datasets as gsm8k_0_shot_cot_str + from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat + +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), + dict(models=[models[2]], datasets=[datasets[0], datasets[1]]), +] + +work_dir = 'outputs/custom_combinations/' +``` + +### 7. 
裁判模型测评 + +对于需要 LLM Judge 评判的数据集(如 AIME 2025),在数据集的 `judge_infer_cfg` 中配置裁判模型。 + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.aime2025.aime2025_gen_0_shot_llmjudge import aime2025_datasets + +datasets = aime2025_datasets + +datasets[0]['judge_infer_cfg']['judge_model']['host_ip'] = 'localhost' +datasets[0]['judge_infer_cfg']['judge_model']['host_port'] = 8081 + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-judge-eval', + path="", + model="", + stream=True, + request_rate=0, + retry=2, + host_ip="localhost", + host_port=8080, + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + pred_postprocessor=dict(type=extract_non_reasoning_content), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/judge_eval/' +``` + +### 8. 
稳态性能测评 + +通过控制 `request_rate` 参数和 `stream` 参数,模拟稳态负载下的性能表现。 + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPI + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import ( + synthetic_datasets, + ) + +datasets = synthetic_datasets + +models = [] +for rate in [0, 5, 10, 20]: + model_cfg = dict( + attr="service", + type=VLLMCustomAPI, + abbr=f'vllm-api-steady-rate-{rate}', + path="", + model="", + stream=True, + request_rate=rate, + use_timestamp=False, + retry=2, + api_key="", + host_ip="localhost", + host_port=8080, + url="", + max_out_len=512, + batch_size=1, + trust_remote_code=False, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + ) + models.append(model_cfg) + +work_dir = 'outputs/steady_state_perf/' +``` + +### 9. 多轮对话性能测评 + +使用 ShareGPT 或 MTBench 多轮对话数据集进行性能测评。 + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.sharegpt.sharegpt_gen import sharegpt_datasets + +datasets = sharegpt_datasets + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr="vllm-multiturn-api-chat-stream", + path="", + model="", + stream=True, + request_rate=0, + retry=2, + api_key="", + host_ip="localhost", + host_port=8080, + url="", + max_out_len=512, + batch_size=1, + trust_remote_code=False, + generation_kwargs=dict(temperature=0.01, ignore_eos=False), + 
pred_postprocessor=dict(type=extract_non_reasoning_content), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 'outputs/multi_turn_benchmark/' +``` + +### 10. 自定义数据集测评 + +当需要使用自己的数据集进行测评时,可以通过自定义数据集配置实现。 + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import VLLMCustomAPIChat +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.openicl.icl_prompt_template import PromptTemplate +from ais_bench.benchmark.openicl.icl_retriever import ZeroRetriever +from ais_bench.benchmark.openicl.icl_inferencer import GenInferencer +from ais_bench.benchmark.datasets import GenericDataset, AccuracyEvaluator + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + +datasets = [ + dict( + abbr='my_custom_dataset', + type=GenericDataset, + path='/path/to/your/dataset.jsonl', + reader_cfg=dict( + input_columns=['question'], + output_column='answer', + ), + infer_cfg=dict( + prompt_template=dict( + type=PromptTemplate, + template=dict( + round=[ + dict(role='HUMAN', prompt='{question}'), + ], + ), + ), + retriever=dict(type=ZeroRetriever), + inferencer=dict(type=GenInferencer), + ), + eval_cfg=dict( + evaluator=dict(type=AccuracyEvaluator), + ), + ) +] + +models = [ + dict( + attr="service", + type=VLLMCustomAPIChat, + abbr='vllm-api-custom-dataset', + model="", + request_rate=0, + retry=2, + host_ip="localhost", + host_port=8080, + max_out_len=512, + batch_size=1, + generation_kwargs=dict(temperature=0.5, top_k=10, top_p=0.95), + ) +] + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict( + type=LocalAPIRunner, + max_num_workers=2, + task=dict(type=OpenICLInferTask), + ), +) + +work_dir = 
'outputs/custom_dataset/' +``` ## 自定义配置文件精度测评使用样例 @@ -29,15 +694,14 @@ with read_base(): from ais_bench.benchmark.configs.datasets.math.math500_gen_0_shot_cot_chat_prompt import math_datasets as math500_gen_0_shot_cot_chat from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general -# 只取部分样本进行 demo 测试 gsm8k_0_shot_cot_str[0]['abbr'] = 'demo_' + gsm8k_0_shot_cot_str[0]['abbr'] gsm8k_0_shot_cot_str[0]['reader_cfg']['test_range'] = '[0:8]' math500_gen_0_shot_cot_chat[0]['abbr'] = 'demo_' + math500_gen_0_shot_cot_chat[0]['abbr'] math500_gen_0_shot_cot_chat[0]['reader_cfg']['test_range'] = '[0:8]' -datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat # 指定数据集列表,可通过累加添加不同的数据集配置 -models = [ # 指定模型配置列表 +datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat +models = [ dict( attr="service", type=VLLMCustomAPIChat, @@ -46,8 +710,8 @@ models = [ # 指定模型配置列表 model="", request_rate = 0, retry = 2, - host_ip = "localhost", # 指定推理服务的IP - host_port = 8080, # 指定推理服务的端口 + host_ip = "localhost", + host_port = 8080, max_out_len = 512, batch_size=1, generation_kwargs = dict( @@ -107,12 +771,12 @@ with read_base(): models as vllm_api_stream_chat, ) -datasets = synthetic_datasets # 指定数据集列表 +datasets = synthetic_datasets vllm_api_general_stream[0]["abbr"] = "demo-" + vllm_api_general_stream[0]["abbr"] vllm_api_stream_chat[0]["abbr"] = "demo-" + vllm_api_stream_chat[0]["abbr"] -models = vllm_api_general_stream + vllm_api_stream_chat # 指定模型列表 +models = vllm_api_general_stream + vllm_api_stream_chat work_dir = "outputs/demo_api-vllm-stream-perf/" ``` @@ -134,7 +798,7 @@ ais_bench ais_bench/configs/api_examples/demo_infer_vllm_api_perf.py -m perf --m ### 输出结果 ```bash -[2025-12-05 12:10:44,147] [ais_bench] [INFO] Performance Results of task [demo-vllm-api-general-stream/syntheticdataset]: +[2025-12-05 12:10:44,147] [ais_bench] [INFO] Performance Results of task [demo-vllm-api-general-stream/syntheticdataset]: 
╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪═════╡ @@ -143,7 +807,7 @@ ais_bench ais_bench/configs/api_examples/demo_infer_vllm_api_perf.py -m perf --m │ TTFT │ total │ 103.5 ms │ 102.4 ms │ 107.0 ms │ 103.1 ms │ 103.3 ms │ 104.2 ms │ 106.8 ms │ 10 │ ... [2025-12-05 12:10:44,149] [ais_bench] [INFO] Performance Result files located in outputs/demo_api-vllm-general-stream-chat-perf/20251205_121020/performances/demo-vllm-api-general-stream-chat. -[2025-12-05 12:10:44,149] [ais_bench] [INFO] Performance Results of task [demo-vllm-api-stream-chat/syntheticdataset]: +[2025-12-05 12:10:44,149] [ais_bench] [INFO] Performance Results of task [demo-vllm-api-stream-chat/syntheticdataset]: ╒══════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤═════════════════╤════════════════╤═════════════════╤═════════════════╤═════╕ │ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ ╞══════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪═════════════════╪════════════════╪═════════════════╪═════════════════╪═════╡ @@ -166,12 +830,12 @@ with read_base(): from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat -models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat +models = vllm_api_general + vllm_api_general_chat + vllm_api_stream_chat datasets = gsm8k_0_shot_cot_str + math500_gen_0_shot_cot_chat model_dataset_combinations = [ - 
dict(models=[models[0]], datasets=[datasets[0]]), # 组合1,使用模型0(vllm_api_general)与数据集0(gsm8k_0_shot_cot_str)进行组合 - dict(models=[models[1]], datasets=[datasets[1]]), # 组合2,使用模型1(vllm_api_general_chat)与数据集1(math500_gen_0_shot_cot_chat)进行组合 - dict(models=[models[2]], datasets=[datasets[0], datasets[1]]), # 组合3,使用模型2(vllm_api_stream_chat)与数据集0(gsm8k_0_shot_cot_str)和数据集1(math500_gen_0_shot_cot_chat)进行组合 + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), + dict(models=[models[2]], datasets=[datasets[0], datasets[1]]), ... ] ``` @@ -189,8 +853,8 @@ vllm_api_general_copy[0]['port'] = 8081 models = vllm_api_general_copy + vllm_api_general datasets = math500_gen_0_shot_cot_chat model_dataset_combinations = [ - dict(models=[models[1]], datasets=datasets), # 组合1,使用模型1(vllm_api_general)与数据集(math500_gen_0_shot_cot_chat)进行组合 - dict(models=[models[0]], datasets=datasets), # 组合2,使用模型0(vllm_api_general_copy)与数据集0(math500_gen_0_shot_cot_chat)进行组合,由于vllm_api_general_copy与vllm_api_general的abbr相同,所以会被认为与组合1是相同任务,会被跳过,即便内部参数存在区别 + dict(models=[models[1]], datasets=datasets), + dict(models=[models[0]], datasets=datasets), ] ``` @@ -198,20 +862,24 @@ model_dataset_combinations = [ ```python vllm_api_general_copy = vllm_api_general.copy() -vllm_api_general_copy[0]['abbr'] = vllm_api_general[0]['abbr'] + '-copy' # 修改abbr,标识模型 +vllm_api_general_copy[0]['abbr'] = vllm_api_general[0]['abbr'] + '-copy' ``` 这样vllm_api_general_copy[0]与vllm_api_general[0]的abbr不同,组合2与组合1是不同任务,会被正常执行。 ## 预设自定义配置文件文件样例列表 -|文件名|简介| +| 文件名 | 简介 | | --- | --- | -|[infer_vllm_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_general.py)|基于gsm8k数据集使用vllm api(0.6+版本)访问v1/completions子服务进行评测,prompt格式为字符串格式,自定义了数据集路径| -|[infer_mindie_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_mindie_stream_api_general.py)|基于gsm8k数据集使用mindie stream 
api访问infer子服务进行评测,prompt格式为字符串格式,自定义了数据集路径| -|[infer_vllm_api_general_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_general_chat.py)|基于gsm8k数据集使用vllm api(0.6+版本)访问v1/chat/completions子服务进行评测,prompt格式为对话格式,自定义了数据集路径| -|[infer_vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_stream_chat.py)|基于gsm8k数据集使用vllm api(0.6+版本)访问v1/chat/completions子服务使用流式推理进行评测,prompt格式为对话格式,自定义了数据集路径| -|[infer_hf_base_model.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/hf_example/infer_hf_base_model.py)|基于gsm8k数据集使用huggingface base模型的推理接口进行评测,prompt格式为字符串格式,自定义了数据集路径| -|[infer_hf_chat_model.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/hf_example/infer_hf_chat_model.py)|基于gsm8k数据集使用huggingface chat模型的推理接口进行评测,prompt格式为字符串格式,自定义了数据集路径| +| [infer_vllm_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_general.py) | 基于gsm8k数据集使用vllm api(0.6+版本)访问v1/completions子服务进行评测,prompt格式为字符串格式,自定义了数据集路径 | +| [infer_vllm_api_general_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_general_chat.py) | 基于gsm8k数据集使用vllm api(0.6+版本)访问v1/chat/completions子服务进行评测,prompt格式为对话格式,自定义了数据集路径 | +| [infer_vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_stream_chat.py) | 基于gsm8k数据集使用vllm api(0.6+版本)访问v1/chat/completions子服务使用流式推理进行评测,prompt格式为对话格式,自定义了数据集路径 | +| [infer_vllm_api_old.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_vllm_api_old.py) | 基于gsm8k数据集使用旧版vllm api访问v1/completions子服务进行评测,prompt格式为字符串格式 | +| [infer_mindie_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/infer_mindie_stream_api_general.py) | 基于gsm8k数据集使用mindie stream 
api访问infer子服务进行评测,prompt格式为字符串格式,自定义了数据集路径 | +| [infer_hf_base_model.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/hf_example/infer_hf_base_model.py) | 基于gsm8k数据集使用huggingface base模型的推理接口进行评测,prompt格式为字符串格式,自定义了数据集路径 | +| [infer_hf_chat_model.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/hf_example/infer_hf_chat_model.py) | 基于gsm8k数据集使用huggingface chat模型的推理接口进行评测,prompt格式为对话格式,自定义了数据集路径 | +| [demo_infer_vllm_api.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/demo_infer_vllm_api.py) | Demo示例:同时评测v1/chat/completions与v1/completions两个接口在GSM8K与MATH数据集上的精度表现 | +| [demo_infer_vllm_api_perf.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/demo_infer_vllm_api_perf.py) | Demo示例:同时评测v1/chat/completions与v1/completions两个接口使用合成数据集进行流式性能测评 | +| [all_dataset_configs.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/all_dataset_configs.py) | 所有支持的数据集配置导入汇总,可在自定义配置文件中直接 `from ... 
import` 使用 | **注**: 上述自定义配置文件如果要评测其他数据集,请从[ais_bench/configs/api_examples/all_dataset_configs.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/api_examples/all_dataset_configs.py)导入其他数据集。 diff --git a/docs/source_zh_cn/advanced_tutorials/stable_stage.md b/docs/source_zh_cn/advanced_tutorials/stable_stage.md index a5d80643..178a1bd0 100644 --- a/docs/source_zh_cn/advanced_tutorials/stable_stage.md +++ b/docs/source_zh_cn/advanced_tutorials/stable_stage.md @@ -188,6 +188,10 @@ ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_cha ![full_plot_example.img](../img/request_concurrency/full_plot_example.png) 具体这个html中的图标如何查看请参考📚 [性能测试可视化并发图使用说明](../base_tutorials/results_intro/performance_visualization.md) +## 通过自定义配置文件实现 + +> 💡 上述稳态性能测评场景也可以通过 [自定义配置文件方式](run_custom_config.md) 实现。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用。详见 [自定义配置文件运行AISBench](run_custom_config.md#各场景自定义配置文件示例) 中"稳态性能测评"示例。 + ## 其他功能场景 ### 性能结果重计算 参考📚 [常规性能测试性能结果重计算](../base_tutorials/scenes_intro/performance_benchmark.md#性能结果重计算) diff --git a/docs/source_zh_cn/advanced_tutorials/synthetic_dataset.md b/docs/source_zh_cn/advanced_tutorials/synthetic_dataset.md index 81260dc4..50fcbb2c 100644 --- a/docs/source_zh_cn/advanced_tutorials/synthetic_dataset.md +++ b/docs/source_zh_cn/advanced_tutorials/synthetic_dataset.md @@ -294,3 +294,7 @@ synthetic_config = { 1. **`tokenid`模式**:该模式下的`tokenid`取值范围取决于在模型配置文件中指定的模型的词表范围 2. **`string`模式**:当MinValue=MaxValue时生成固定长度序列 + +## 七. 
通过自定义配置文件实现 + +> 💡 上述合成数据集测评场景也可以通过 [自定义配置文件方式](run_custom_config.md) 实现。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用。详见 [自定义配置文件运行AISBench](run_custom_config.md#各场景自定义配置文件示例) 中"合成数据集性能测评"示例。 diff --git a/docs/source_zh_cn/base_tutorials/all_params/cli_args.md b/docs/source_zh_cn/base_tutorials/all_params/cli_args.md index d43e4d98..aa8c5383 100644 --- a/docs/source_zh_cn/base_tutorials/all_params/cli_args.md +++ b/docs/source_zh_cn/base_tutorials/all_params/cli_args.md @@ -21,9 +21,10 @@ ais_bench [OPTIONS] 适用于所有模式,可同时与精度或性能参数联合使用。 | 参数| 说明| 示例| | ---- | ---- | ----| -| `--models`| 指定模型推理后端任务名称(对应 `ais_bench/benchmark/configs/models` 路径下一个已经实现的默认模型配置文件),支持传入多个任务名称。详情参考📚 [支持的模型](./models.md)| `--models vllm_api_general` | -| `--datasets` | 指定数据集任务名称(对应 `ais_bench/benchmark/configs/datasets` 路径下一个已经实现的默认数据集配置文件),可传入多个。详情参考📚 [支持的数据集类型](./datasets.md)| `--datasets gsm8k_gen` | -| `--summarizer` | 指定结果总结任务名称(对应 `ais_bench/benchmark/configs/summarizers` 路径下一个已经实现的默认模型配置文件)。详情参考📚 [支持的结果汇总任务](./summarizer.md) | `--summarizer medium`| +|`config`|指定自定义配置文件路径|`ais_bench /path/to/custom_config.py {other optional arguments}`| +| `--models`| 指定模型推理后端任务名称(对应 `ais_bench/benchmark/configs/models` 路径下一个已经实现的默认模型配置文件),支持传入多个任务名称。详情参考📚 [支持的模型](./models.md)。
<br>⚠️注意:指定了自定义配置文件路径后此参数无效| `--models vllm_api_general` |
+| `--datasets` | 指定数据集任务名称(对应 `ais_bench/benchmark/configs/datasets` 路径下一个已经实现的默认数据集配置文件),可传入多个。详情参考📚 [支持的数据集类型](./datasets.md)。<br>⚠️注意:指定了自定义配置文件路径后此参数无效| `--datasets gsm8k_gen` |
+| `--summarizer` | 指定结果总结任务名称(对应 `ais_bench/benchmark/configs/summarizers` 路径下一个已经实现的默认结果汇总配置文件)。详情参考📚 [支持的结果汇总任务](./summarizer.md)。<br>⚠️注意:指定了自定义配置文件路径后此参数无效| `--summarizer medium`|
 | `--mode` 或 `-m`| 运行模式,可选:`all`、`infer`、`eval`、`viz`、`perf`、`perf_viz`;默认 `all`。<br>详细请见 📚 [运行模式说明](./mode.md)。 | `--mode infer`<br>`-m all`|
 | `--reuse` 或 `-r`| 指定已有工作目录下的时间戳,继续执行并覆盖原有结果。结合`--mode`参数值,可用于推理中断续推,或基于已有推理结果执行精度计算、可视化结果打印。若不加参,则自动选取 `--work-dir` 下最新时间戳。| `--reuse 20250126_144254`<br>`-r 20250126_144254` |
 | `--work-dir` 或 `-w` | 指定评测工作目录,用于保存输出结果。默认 `outputs/default`。| `--work-dir /path/to/work`<br>
`-w /path/to/work` | diff --git a/docs/source_zh_cn/base_tutorials/all_params/models.md b/docs/source_zh_cn/base_tutorials/all_params/models.md index 7ccc7845..b1359e83 100644 --- a/docs/source_zh_cn/base_tutorials/all_params/models.md +++ b/docs/source_zh_cn/base_tutorials/all_params/models.md @@ -10,27 +10,27 @@ AISBench Benchmark 支持多种服务化推理后端,包括 vLLM、SGLang、Tr 以在 GPU 上部署的 vLLM 推理服务为例,您可以参考 [vLLM 官方文档](https://docs.vllm.ai/en/stable/getting_started/quickstart.html) 启动服务。 不同服务化后端对应的模型配置如下: -| 模型配置名称| 简介| 使用前提| 支持的测评模式 | 接口类型 | 支持的数据集 Prompt 格式 | 配置文件路径| -| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | -| `vllm_api_general` | 通过 vLLM 兼容 OpenAI 的 API 访问推理服务,接口为 `v1/completions`| 基于 vLLM 版本支持 `v1/completions` 子服务| 生成式测评、PPL模式测评 | 文本接口 | 字符串格式| [vllm_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general.py)| -| `vllm_api_general_stream`| 流式访问 vLLM 推理服务,接口为 `v1/completions`| 基于 vLLM 版本支持 `v1/completions` 子服务 | 生成式测评| 流式接口 | 字符串格式| [vllm_api_general_stream.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py) | -| `vllm_api_general_chat` | 通过 vLLM 兼容 OpenAI 的 API 访问推理服务,接口为 `v1/chat/completions` | 基于 vLLM 版本支持 `v1/chat/completions` 子服务 | 生成式测评、PPL模式测评 | 文本接口 | 字符串格式、对话格式、多模态格式 | [vllm_api_general_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py) | -| `vllm_api_stream_chat`| 流式访问 vLLM 推理服务,接口为 `v1/chat/completions`| 基于 vLLM 版本支持 `v1/chat/completions` 子服务 | 生成式测评 | 流式接口 | 字符串格式、对话格式、多模态格式 | [vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py) | -| `vllm_api_stream_chat_multiturn`| 多轮对话场景的流式访问 vLLM 推理服务,接口为 `v1/chat/completions`| 基于 vLLM 版本支持 `v1/chat/completions` 子服务 | 生成式测评 | 流式接口 | 对话格式 | 
[vllm_api_stream_chat_multiturn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat_multiturn.py) | -| `vllm_api_function_call_chat`| function call精度测评场景访问 vLLM 推理服务的API ,接口为 `v1/chat/completions`(只适用于[BFCL](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/BFCL/README.md)测评场景| 基于 vLLM 版本支持 `v1/chat/completions` 子服务 | 生成式测评 | 文本接口 | 对话格式 | [vllm_api_function_call_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_function_call_chat.py) | -| `vllm_api_old` | 通过 vLLM 兼容 API 访问推理服务,接口为 `generate`| 基于 vLLM 版本支持 `generate` 子服务 | 生成式测评 | 文本接口 | 字符串格式、多模态格式| [vllm_api_old.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_old.py)| -| `mindie_stream_api_general` | 通过 MindIE 流式 API 访问推理服务,接口为 `infer`| 基于 MindIE 版本支持 `infer` 子服务 | 生成式测评 | 流式接口 | 字符串格式、多模态格式| [mindie_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/mindie_api/mindie_stream_api_general.py) | -| `triton_api_general` | 通过 Triton API 访问推理服务,接口为 `v2/models/{model name}/generate` | 启动支持 Triton API 的推理服务 | 生成式测评 | 文本接口 | 字符串格式、多模态格式| [triton_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/triton_api/triton_api_general.py) | -| `triton_stream_api_general` | 通过 Triton 流式 API 访问推理服务,接口为 `v2/models/{model name}/generate_stream` | 启动支持 Triton API 的推理服务 | 生成式测评 | 流式接口 | 字符串格式、多模态格式 | [triton_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/triton_api/triton_stream_api_general.py) | -| `tgi_api_general` | 通过 TGI API 访问推理服务,接口为 `generate`| 启动支持 TGI API 的推理服务 | 生成式测评 | 文本接口 | 字符串格式、多模态格式| [tgi_api_general](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/tgi_api/tgi_api_general.py)| -| 
`tgi_stream_api_general` | 通过 TGI 流式 API 访问推理服务,接口为 `generate_stream`| 启动支持 TGI API 的推理服务 | 生成式测评 | 流式接口 | 字符串格式、多模态格式| [tgi_stream_api_general](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/tgi_api/tgi_stream_api_general.py) | +| 模型配置名称| 简介| 使用前提| 支持的测评模式 | 接口类型 | 支持的数据集 Prompt 格式 | 配套文件导入方式 | 配置文件路径| +| ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | +| `vllm_api_general` | 通过 vLLM 兼容 OpenAI 的 API 访问推理服务,接口为 `v1/completions`| 基于 vLLM 版本支持 `v1/completions` 子服务| 生成式测评、PPL模式测评 | 文本接口 | 字符串格式|`from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general`| [vllm_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general.py)| +| `vllm_api_general_stream`| 流式访问 vLLM 推理服务,接口为 `v1/completions`| 基于 vLLM 版本支持 `v1/completions` 子服务 | 生成式测评| 流式接口 | 字符串格式| `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream` | [vllm_api_general_stream.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py) | +| `vllm_api_general_chat` | 通过 vLLM 兼容 OpenAI 的 API 访问推理服务,接口为 `v1/chat/completions` | 基于 vLLM 版本支持 `v1/chat/completions` 子服务 | 生成式测评、PPL模式测评 | 文本接口 | 字符串格式、对话格式、多模态格式 | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat` | [vllm_api_general_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py) | +| `vllm_api_stream_chat`| 流式访问 vLLM 推理服务,接口为 `v1/chat/completions`| 基于 vLLM 版本支持 `v1/chat/completions` 子服务 | 生成式测评 | 流式接口 | 字符串格式、对话格式、多模态格式 | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat` | 
[vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py) | +| `vllm_api_stream_chat_multiturn`| 多轮对话场景的流式访问 vLLM 推理服务,接口为 `v1/chat/completions`| 基于 vLLM 版本支持 `v1/chat/completions` 子服务 | 生成式测评 | 流式接口 | 对话格式 | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat_multiturn import models as vllm_api_stream_chat_multiturn` | [vllm_api_stream_chat_multiturn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat_multiturn.py) | +| `vllm_api_function_call_chat`| function call精度测评场景访问 vLLM 推理服务的API ,接口为 `v1/chat/completions`(只适用于[BFCL](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/BFCL/README.md)测评场景| 基于 vLLM 版本支持 `v1/chat/completions` 子服务 | 生成式测评 | 文本接口 | 对话格式 | `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_function_call_chat import models as vllm_api_function_call_chat` | [vllm_api_function_call_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_function_call_chat.py) | +| `vllm_api_old` | 通过 vLLM 兼容 API 访问推理服务,接口为 `generate`| 基于 vLLM 版本支持 `generate` 子服务 | 生成式测评 | 文本接口 | 字符串格式、多模态格式| `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_old import models as vllm_api_old` | [vllm_api_old.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_old.py)| +| `mindie_stream_api_general` | 通过 MindIE 流式 API 访问推理服务,接口为 `infer`| 基于 MindIE 版本支持 `infer` 子服务 | 生成式测评 | 流式接口 | 字符串格式、多模态格式| `from ais_bench.benchmark.configs.models.mindie_api.mindie_stream_api_general import models as mindie_stream_api_general` | [mindie_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/mindie_api/mindie_stream_api_general.py) | +| `triton_api_general` | 通过 Triton API 访问推理服务,接口为 `v2/models/{model name}/generate` | 
启动支持 Triton API 的推理服务 | 生成式测评 | 文本接口 | 字符串格式、多模态格式| `from ais_bench.benchmark.configs.models.triton_api.triton_api_general import models as triton_api_general` | [triton_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/triton_api/triton_api_general.py) | +| `triton_stream_api_general` | 通过 Triton 流式 API 访问推理服务,接口为 `v2/models/{model name}/generate_stream` | 启动支持 Triton API 的推理服务 | 生成式测评 | 流式接口 | 字符串格式、多模态格式 | `from ais_bench.benchmark.configs.models.triton_api.triton_stream_api_general import models as triton_stream_api_general` | [triton_stream_api_general.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/triton_api/triton_stream_api_general.py) | +| `tgi_api_general` | 通过 TGI API 访问推理服务,接口为 `generate`| 启动支持 TGI API 的推理服务 | 生成式测评 | 文本接口 | 字符串格式、多模态格式| `from ais_bench.benchmark.configs.models.tgi_api.tgi_api_general import models as tgi_api_general` | [tgi_api_general](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/tgi_api/tgi_api_general.py)| +| `tgi_stream_api_general` | 通过 TGI 流式 API 访问推理服务,接口为 `generate_stream`| 启动支持 TGI API 的推理服务 | 生成式测评 | 流式接口 | 字符串格式、多模态格式| `from ais_bench.benchmark.configs.models.tgi_api.tgi_stream_api_general import models as tgi_stream_api_general` | [tgi_stream_api_general](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/tgi_api/tgi_stream_api_general.py) | ### 服务化推理后端配置参数说明 服务化推理后端配置文件采用Python语法格式配置,示例如下: ```python from ais_bench.benchmark.models import VLLMCustomAPI -models = [ +models = [ # 相当于自定义配置文件中通过 `from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general`中导入的models dict( attr="service", type=VLLMCustomAPI, @@ -91,19 +91,19 @@ models = [ - 当使用 IPv6 字面量(如 `::1`、`2001:db8::1`)作为 `host_ip` 时,工具会在生成的访问 URL 中自动为其添加方括号(例如 `http://[2001:db8::1]:8080/`),无需在配置中手动编写方括号。 ## 本地模型后端 -|模型配置名称|简介|使用前提|支持的prompt格式(字符串格式或对话格式)|对应源码配置文件路径| -| 
--- | --- | --- | --- | --- | -|`hf_base_model`|HuggingFace Base 模型后端|已安装评测工具基础依赖,需在配置文件中指定 HuggingFace 模型权重路径(当前不支持自动下载)|字符串格式|[hf_base_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_base_model.py)| -|`hf_chat_model`| HuggingFace Chat 模型后端|已安装评测工具基础依赖,需在配置文件中指定 HuggingFace 模型权重路径(当前不支持自动下载)|对话格式|[hf_chat_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_chat_model.py)| -|`hf_qwenvl_model`| HuggingFace Chat QwenVL模型后端|已安装评测工具基础依赖,需在配置文件中指定 HuggingFace 模型权重路径(当前不支持自动下载)|对话格式|[hf_qwenvl_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_qwenvl_model.py)| -|`vllm_offline_vl_model`| vllm Chat QwenVL离线推理模型后端|已安装评测工具基础依赖,需在配置文件中指定模型模型权重路径(当前不支持自动下载)|对话格式|[vllm_offline_vl_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_offline_models/vllm_offline_vl_model.py)| +|模型配置名称|简介|使用前提|支持的prompt格式(字符串格式或对话格式)| 配套文件导入方式 |对应源码配置文件路径| +| --- | --- | --- | --- | --- | --- | +|`hf_base_model`|HuggingFace Base 模型后端|已安装评测工具基础依赖,需在配置文件中指定 HuggingFace 模型权重路径(当前不支持自动下载)|字符串格式|`from ais_bench.benchmark.configs.models.hf_models.hf_base_model import models as hf_base_model`|[hf_base_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_base_model.py)| +|`hf_chat_model`| HuggingFace Chat 模型后端|已安装评测工具基础依赖,需在配置文件中指定 HuggingFace 模型权重路径(当前不支持自动下载)|对话格式|`from ais_bench.benchmark.configs.models.hf_models.hf_chat_model import models as hf_chat_model`|[hf_chat_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_chat_model.py)| +|`hf_qwenvl_model`| HuggingFace Chat QwenVL模型后端|已安装评测工具基础依赖,需在配置文件中指定 HuggingFace 模型权重路径(当前不支持自动下载)|对话格式|`from ais_bench.benchmark.configs.models.hf_models.hf_qwenvl_model import models as 
hf_qwenvl_model`|[hf_qwenvl_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/hf_models/hf_qwenvl_model.py)|
+|`vllm_offline_vl_model`| vllm Chat QwenVL离线推理模型后端|已安装评测工具基础依赖,需在配置文件中指定模型权重路径(当前不支持自动下载)|对话格式|`from ais_bench.benchmark.configs.models.vllm_offline_models.vllm_offline_vl_model import models as vllm_offline_vl_model`|[vllm_offline_vl_model](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_offline_models/vllm_offline_vl_model.py)|
 ### 本地huggingface模型后端配置参数说明
 本地huggingface模型后端配置文件采用Python语法格式配置,示例如下:
 ```python
 from ais_bench.benchmark.models import HuggingFacewithChatTemplate
-models = [
+models = [ # 相当于自定义配置文件中通过 `from ais_bench.benchmark.configs.models.hf_models.hf_chat_model import models as hf_chat_model`中导入的models
     dict(
         attr="local", # 后端类型标识
         type=HuggingFacewithChatTemplate, # 模型类型
diff --git a/docs/source_zh_cn/base_tutorials/all_params/summarizer.md b/docs/source_zh_cn/base_tutorials/all_params/summarizer.md
index cd579a41..73b9e44e 100644
--- a/docs/source_zh_cn/base_tutorials/all_params/summarizer.md
+++ b/docs/source_zh_cn/base_tutorials/all_params/summarizer.md
@@ -1,7 +1,7 @@
 # 支持的结果汇总任务
-| 任务名称 | 简介 | 配置文件路径 |
-| -------------- | -------------- | -------------- |
-| `example` | 简化版精度评测结果汇总模板,覆盖当前支持的所有数据集,是默认使用的模板。 | [example.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/example.py) |
-| `medium` | 通用精度评测结果汇总模板,适用于多个基础数据集。| [medium.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/medium.py) |
-| `default_perf` | 全量性能评测结果汇总模板,汇总所有请求的性能数据。支持通过 `default_perf.py` 手动配置性能统计指标。 | [default\_perf.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/perf/default_perf.py) |
-| `stable_stage` | 稳定阶段性能评测结果汇总模板,仅汇总系统达到配置最大并发时的请求数据。支持通过 `stable_stage.py` 手动配置性能统计指标。 | 
[stable\_stage.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/perf/stable_stage.py) |
+| 任务名称 | 简介 | 配置文件导入方式 | 配置文件路径 |
+| -------------- | -------------- | -------------- | -------------- |
+| `example` | 简化版精度评测结果汇总模板,覆盖当前支持的所有数据集,是默认使用的模板。 | `from ais_bench.benchmark.configs.summarizers.example import summarizer` | [example.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/example.py) |
+| `medium` | 通用精度评测结果汇总模板,适用于多个基础数据集。| `from ais_bench.benchmark.configs.summarizers.medium import summarizer` | [medium.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/medium.py) |
+| `default_perf` | 全量性能评测结果汇总模板,汇总所有请求的性能数据。支持通过 `default_perf.py` 手动配置性能统计指标。 | `from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer` | [default\_perf.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/perf/default_perf.py) |
+| `stable_stage` | 稳定阶段性能评测结果汇总模板,仅汇总系统达到配置最大并发时的请求数据。支持通过 `stable_stage.py` 手动配置性能统计指标。 | `from ais_bench.benchmark.configs.summarizers.perf.stable_stage import summarizer` | [stable\_stage.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/summarizers/perf/stable_stage.py) |
diff --git a/docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark.md b/docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark.md
index dab8fef2..5b82c283 100644
--- a/docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark.md
+++ b/docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark.md
@@ -12,22 +12,92 @@
 ## 主要功能场景
 ### 单任务测评
-请参考主页📚 [快速入门](../../get_started/quick_start.md),不做赘述。
+请参考主页📚 [快速入门](../../get_started/quick_start.md)。快速入门中已经提供了自定义配置文件与命令行参数两种启动方式。
 ### 多任务测评
 支持同时配置多个模型或多个数据集任务,通过单次命令进行批量测评,适用于大规模模型横向对比或多数据集精度对比分析。
-#### 命令说明
-用户可通过`--models`和`--datasets`参数指定多个配置任务,子任务数为`--models`配置任务数和`--datasets`配置任务数的乘积,即一个模型配置和一个数据集配置组成一个子任务,命令示例: -```bash -ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt -``` -上述命令指定了2个模型任务(`vllm_api_general_chat` `vllm_api_stream_chat`)和2个数据集任务(`gsm8k_gen_4_shot_cot_str` `aime2024_gen_0_shot_chat_prompt`),将执行以下4个组合精度测试任务: + +#### 子任务组合说明 + +多任务测评场景下,子任务数为`models`配置任务数和`datasets`配置任务数的乘积,即一个模型配置和一个数据集配置组成一个子任务。下面以同时测评2个模型任务(`vllm_api_general_chat`、`vllm_api_stream_chat`)和2个数据集任务(`gsm8k_gen_4_shot_cot_str`、`aime2024_gen_0_shot_chat_prompt`)为例,将执行以下4个组合精度测试任务: + [vllm_api_general_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py)模型任务 + [gsm8k_gen_4_shot_cot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py) 数据集任务 + [vllm_api_general_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py)模型任务 + [aime2024_gen_0_shot_chat_prompt](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_chat_prompt.py) 数据集任务 + [vllm_api_stream_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py)模型任务 + [gsm8k_gen_4_shot_cot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py) 数据集任务 + [vllm_api_stream_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py)模型任务 + [aime2024_gen_0_shot_chat_prompt](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_chat_prompt.py) 数据集任务 +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +参考快速入门中的 
[model_api_test_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/model_api_test_zh_cn.py) 文件,在`with read_base():`中导入多个模型任务和数据集任务,然后将其合并到 `models`、`datasets` 列表即可。完整样例请参考 [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets + +models = vllm_api_general_chat + vllm_api_stream_chat +# ...其余参数配置详见配置文件 +``` + +修改好配置文件后,执行命令: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py +``` + +#### 自定义模型-数据集配对(可选) + +默认情况下,上述配置中 `models` 列表与 `datasets` 列表会自动按笛卡尔积组合,子任务数为模型数 × 数据集数(本例为 2 × 2 = 4 个)。若希望精确控制哪些模型与哪些数据集配对(例如让部分模型只跑部分数据集、避免无意义的组合),可在配置文件中通过 `model_dataset_combinations` 字段显式声明配对关系: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from 
ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets +models = vllm_api_general_chat + vllm_api_stream_chat + +# 关键:通过 model_dataset_combinations 精确控制配对 +# 下例仅生成 2 个子任务(笛卡尔积会生成 4 个): +# - vllm_api_general_chat + gsm8k_gen_4_shot_cot_str +# - vllm_api_stream_chat + aime2024_gen_0_shot_chat_prompt +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), +] +``` + +> ⚠️ **注意**:模型与数据集的唯一标识由 `abbr` 字段决定。同一配置文件中,相同 `abbr` 的模型或数据集重复出现的组合会被视为重复任务而被跳过。当通过 `.copy()` 等方式复用模型/数据集配置时,必须显式修改 `abbr` 以保证唯一性。详见 📚 [自定义模型与数据集组合](../../advanced_tutorials/run_custom_config.md#自定义模型与数据集组合)。 + +::: + +:::{tab-item} 备选:使用命令行参数 + +用户可通过`--models`和`--datasets`参数指定多个配置任务,命令示例: + +```bash +ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt +``` + #### 修改任务对应的配置文件 模型任务和数据集任务对应的配置文件实际路径通过执行加`--search`命令查询: ```bash @@ -58,6 +128,9 @@ ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_g ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt ``` +::: +:::: + 执行过程中会在📚 [`--work-dir`](../all_params/cli_args.md#公共参数)路径(默认是`outputs/default/`)下创建时间戳目录用于保存执行细节。 任务结束后结果呈现的打屏日志示例如下: @@ -110,12 +183,50 @@ aime2024 604a78 accuracy gen 50.00 ├── summary_20250628_172032.md └── summary_20250628_172032.txt ``` + ### 多任务并行测评 -默认情况下,多个子任务采用串行执行,单个任务内默认开启Continuous Batch,会根据用户配置的最大并发拉起多个进程发送和处理请求,允许配置较大的并发。在单个任务并发较小时,可以通过设置📚 [`--max-num-workers`](../all_params/cli_args.md#精度测评参数)参数实现多任务并行,示例如下: +默认情况下,多个子任务采用串行执行,单个任务内默认开启Continuous 
Batch,会根据用户配置的最大并发拉起多个进程发送和处理请求,允许配置较大的并发。在单个任务并发较小时,可以通过设置📚 [`--max-num-workers`](../all_params/cli_args.md#公共参数)参数实现多任务并行,示例如下: + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +在自定义配置文件中不再需要设置 `max_num_workers`,而是通过命令行参数 [`--max-num-workers`](../all_params/cli_args.md#公共参数) 传递。配置文件样例与[多任务测评](#多任务测评)完全一致,完整样例请参考 [multi_task_parallel_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py): + +```python +# 完整样例与多任务测评中的配置一致,区别仅在执行命令 +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets + +models = vllm_api_general_chat + vllm_api_stream_chat +# ...其余参数配置详见配置文件 +``` + +执行命令(通过 `--max-num-workers 4` 指定并行数): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py --max-num-workers 4 +``` + +::: +:::{tab-item} 备选:使用命令行参数 ```bash ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt --max-num-workers 4 ``` + +::: +:::: 示例中指定任务最大并发数为4,四个子任务将会同时执行,可以在命令行看板上看到: ``` Base path of result&log : outputs/default/20251106_113926 @@ -144,6 +255,40 @@ Press Up/Down arrow to page, 'P' to PAUZE/RESUME screen refresh, 'Ctrl + C' to 
在测评过程中发生意外中断或服务器异常导致的推理任务失败时,可通过`--reuse`开启断点管理功能实现任务续测,亦支持仅对失败用例进行自动重测,无需重复运行全部任务。示例如下: 1、假设用户使用如下命令首次执行推理测评,若由于任务异常退出导致的任务中断或由于服务端异常导致部分请求失败 + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +首次执行命令(基于 [single_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py)): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py +``` + +此时部分推理结果会被保存下来,在📚 [`--work-dir`](../all_params/cli_args.md#公共参数)生成如下文件内容: + +```bash +# output/default下 +20250628_151326/ # 测试任务创建的时间戳目录 +├── configs # 模型任务、数据集任务和结构呈现任务对应的配置文件合成的一个配置 +│ └── 20250628_151326_29317.py +├── logs # 执行过程中日志,命令中如果加--debug,不会有过程日志落盘(都直接打印出来了) +│ └── infer # 推理阶段日志 +└── predictions # 推理结果目录,记录每条请求的输入、模型输出及答案(用于精度评估) + └── vllm-api-general-chat + └── tmp_demo_gsm8k # 已完成请求的推理输出 + └── tmp_0_2766386_1749107195.json # 缓存文件,命名格式为:tmp_{任务进程ID}_{进程编号}_{时间戳}.json +``` + +2、通过`--reuse`参数指定任务时间戳目录续推(`--reuse` 是公共参数,使用自定义配置文件时仍可通过命令行追加): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py --reuse 20250628_151326 +``` + +::: +:::{tab-item} 备选:使用命令行参数 + ```bash ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt ``` @@ -153,11 +298,11 @@ ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_ch # output/default下 20250628_151326/ # 测试任务创建的时间戳目录 ├── configs # 模型任务、数据集任务和结构呈现任务对应的配置文件合成的一个配置 -│   └── 20250628_151326_29317.py +│ └── 20250628_151326_29317.py ├── logs # 执行过程中日志,命令中如果加--debug,不会有过程日志落盘(都直接打印出来了) -│   └── infer # 推理阶段日志 +│ └── infer # 推理阶段日志 └── predictions # 推理结果目录,记录每条请求的输入、模型输出及答案(用于精度评估) -    └── vllm-api-general-chat + └── vllm-api-general-chat └── tmp_demo_gsm8k # 已完成请求的推理输出 └── tmp_0_2766386_1749107195.json # 缓存文件,命名格式为:tmp_{任务进程ID}_{进程编号}_{时间戳}.json ``` @@ -165,16 +310,70 @@ ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_ch ```bash ais_bench --models vllm_api_general --datasets gsm8k_gen --reuse 
20250628_151326 ``` + +::: +:::: + 日志中会打印如下内容,提示续推任务开启: + ```bash 02/20 13:14:15 - AISBench - INFO - Found 10 tmp items, run infer task from the last interrupted position ``` + 续推结束后,会重新所有请求的精度结果并打印,生成结果与📚 [快速入门](../../get_started/quick_start.md)示例一致。 > ⚠️ 注意:中断续测与失败重测可能改变请求顺序,可能引发结果微小波动。 💡[多任务测评](#多任务测评) 也支持全量和部分任务的中断续测 & 失败用例重测。 + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + 例如,执行如下多任务评测命令出现中断: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py +``` + +通过如下方式对全量任务中断续测: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py --reuse 20250628_151326 +``` + +也可以通过编辑自定义配置文件后仅对部分任务中断续测。完整样例请参考 [multi_task_resume_partial_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +datasets = gsm8k_datasets +models = vllm_api_general_chat +# ...其余参数配置详见配置文件 +``` + +然后执行: + +```bash +# 仅对 vllm_api_general_chat + gsm8k_gen_4_shot_cot_str 任务中断续测 +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py --reuse 20250628_151326 + +# 对vllm_api_general_chat + gsm8k_gen_4_shot_cot_str, vllm_api_general_chat + aime2024_gen_0_shot_chat_prompt两个任务续测 +ais_bench ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py --reuse 20250628_151326 +``` + +> 💡 如果需要对部分组合(例如 `vllm_api_general_chat + aime2024`、`vllm_api_stream_chat + aime2024`)续测,只需在自定义配置文件中指定对应模型任务和数据集任务后通过 `--reuse` 指定时间戳即可,详见 
📚 [自定义模型-数据集配对](../../advanced_tutorials/run_custom_config.md#6-自定义模型-数据集配对)。 + +::: +:::{tab-item} 备选:使用命令行参数 + ```bash ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_chat_prompt ``` @@ -192,27 +391,171 @@ ais_bench --models vllm_api_general_chat --datasets gsm8k_gen_4_shot_cot_str aim ais_bench --models vllm_api_general_chat vllm_api_stream_chat --datasets aime2024_gen_0_shot_chat_prompt --reuse 20250628_151326 ``` +::: +:::: + ### 合并子数据集推理 部分数据集会分类成不同的子数据集,在推理时会被划分为多个子任务行推理,例如:📚 [MMLU](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/mmlu/README.md)、📚 [CEVAL](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/ceval/README.md)。AISBench Benchmark支持将存在多个小规模数据集的数据集合并为一个任务进行统一测评。示例如下: + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +修改自定义配置文件,引入支持合并推理的数据集任务即可。完整样例请参考 [ceval_merge_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_5_shot_str import ceval_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + +models = vllm_api_general +# ...其余参数配置详见配置文件 +``` + +执行命令(`--merge-ds` 是公共参数,使用自定义配置文件时仍可通过命令行追加): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py --merge-ds +``` + +::: +:::{tab-item} 备选:使用命令行参数 + ```bash ais_bench --models vllm_api_general --datasets ceval_gen --merge-ds ``` -> ⚠️ 注意:合并模式下将只生成整体结果,子数据集精度不再单独列出。同时对合并模式下中断或失败的推理结果进行数据集中断续测 & 失败用例重测也必须在命令中加`--merge-ds` 
+ +::: +:::: + +> ⚠️ 注意:合并模式下将只生成整体结果,子数据集精度不再单独列出。同时对合并模式下中断或失败的推理结果进行数据集中断续测 & 失败用例重测也必须在命令中加`--merge-ds`。 ### 固定请求数测评 -当集规模过大,只想针数据对部分样本执行性能测试时,可使用 📚 [`--num-prompts`](../all_params/cli_args.md#性能测评参数) 参数指定读取的数据条数。示例如下: +当数据集规模过大,只想针对部分样本执行精度测试时,可使用以下两种方式控制读取的数据范围,二者作用一致,按使用习惯选择即可: + +- **基础方式**:通过命令行参数 📚 [`--num-prompts`](../all_params/cli_args.md#公共参数) 直接指定读取的数据条数,无需修改配置文件,使用最简单。 +- **进阶方式(功能更强大)**:在自定义配置文件中设置数据集的 `reader_cfg.test_range` 字段,支持更灵活的采样范围(如指定起始位置、自定义步长等),详细用法可参考 📚 [自定义配置文件](../../advanced_tutorials/run_custom_config.md)。 + +示例如下: + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +**方式一:基础方式 — 通过 `--num-prompts` 指定读取条数** + +完整样例请参考 [fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_stream_chat +# ...其余参数配置详见配置文件 +``` + +执行命令(通过 `--num-prompts 1` 指定仅读取 1 条样本): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py --num-prompts 1 +``` + +**方式二:进阶方式 — 通过 `test_range` 灵活指定读取范围** + +如果需要更灵活的范围控制(如指定起始索引、自定义步长等),可在自定义配置文件中直接设置数据集的 `reader_cfg.test_range` 字段,无需通过命令行参数。完整样例请参考 [fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import 
LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +# 关键:通过 reader_cfg.test_range 灵活控制采样范围 +# 例如:'[0:8]' 表示读取前 8 条样本;'[10:20]' 表示读取索引 10 到 20 的样本 +datasets[0]['reader_cfg']['test_range'] = '[0:8]' + +models = vllm_api_stream_chat +# ...其余参数配置详见配置文件 +``` + +执行命令(已在配置文件中指定 test_range,无需再传 `--num-prompts`): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py +``` + +::: +:::{tab-item} 备选:使用命令行参数 + ```bash ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --num-prompts 1 ``` 上述命令仅对示例数据集中的第一条记录进行推理并只对这一条记录进行精度评估。 -> ⚠️ 注意:当前数据集会按照默认队列顺序依次读取,不支持随机抽样或打乱顺序。 + +::: +:::: + +> ⚠️ 注意:当前数据集会按照默认队列顺序依次读取,不支持随机抽样或打乱顺序。同时配置文件中设置 `reader_cfg.test_range` 与命令行 `--num-prompts` 时,命令行参数 `--num-prompts` 优先级更高。 ### 多次独立重复推理 > 该功能开启后,由于`数据集`/`请求数量`将按照`数据点级别`成倍扩充,从而导致推理时间显著变长,且使用内存显著提高。请在阅读 📚 [精度评测场景:评估指标解析](../results_intro/accuracy_metric.md) 后,**确认当前场景是否需要开启该功能**。 -该场景旨在从可靠性、稳定性、整体准确性等多维度探究模型能力,开启方式为:在 `服务化推理后端配置参数` 中的超参 `generation_kwargs` 中配置 🔗[`num_return_sequences`参数数值](../all_params/models.md#服务化推理后端配置参数说明),格式按照以下示例内容(取值仅供参考): +该场景旨在从可靠性、稳定性、整体准确性等多维度探究模型能力,开启方式为:在 `服务化推理后端配置参数` 中的超参 `generation_kwargs` 中配置 🔗[`num_return_sequences`参数数值](../all_params/models.md#服务化推理后端配置参数说明)。 + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +完整样例请参考 [multi_repeat_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks 
import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_stream_chat +# 关键:通过 generation_kwargs.num_return_sequences 启用多次独立重复推理 +models[0]["generation_kwargs"] = dict( + temperature=0.01, + ignore_eos=False, + num_return_sequences=5, # 具体作用和约束请参考文档 accuracy_metric.md +) +# ...其余参数配置详见配置文件 +``` + +执行命令: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py +``` + +::: +:::{tab-item} 备选:使用命令行参数 + +修改模型任务配置文件中的 `generation_kwargs`: ```python models = [ @@ -222,11 +565,14 @@ models = [ num_return_sequences = 5, # 具体作用和约束请参考文档 accuracy_metric.md ... # 其它参数 ), - ... + ... # 其它参数 ) ] ``` +::: +:::: + 精度评估阶段结束后,结果会记录在日志和打屏在运行窗口,格式按照以下示例内容(数据仅供参考): ```bash @@ -240,6 +586,25 @@ models = [ 上表中,**具体指标解读**和**参数约束** 请参考📚 [精度评测场景:评估指标解析](accuracy_metric.md) +## 通过自定义配置文件实现 + +> 💡 上述所有功能场景(多任务测评、多任务并行、中断续测、合并子数据集、固定请求数测评、多次独立重复推理、推理结果重评估等)均提供了两种启动方式(**⭐ 推荐:使用自定义配置文件**、**备选:使用命令行参数**)。自定义配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用。 + +本章节涉及的所有自定义配置文件样例已统一存放在 `ais_bench/configs/accuracy_benchmark/` 目录下,便于查阅与复用: + +| 文件名 | 对应场景 | +| --- | --- | +| [single_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py) | 单任务测评 | +| [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_zh_cn.py) | 多任务测评 | +| [multi_task_parallel_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_parallel_zh_cn.py) | 多任务并行测评 | +| 
[multi_task_resume_partial_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_task_resume_partial_zh_cn.py) | 中断续测 & 失败用例重测(部分任务) | +| [ceval_merge_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/ceval_merge_zh_cn.py) | 合并子数据集推理 | +| [fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/fixed_prompts_zh_cn.py) | 固定请求数测评 | +| [multi_repeat_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/multi_repeat_zh_cn.py) | 多次独立重复推理 | +| [inference_re_eval_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py) | 推理结果重评估 | + +> 关于自定义配置文件语法的完整说明(包括可定义的顶层变量、字段详解、Python 高级用法等),请参考 📚 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md);其中"各场景自定义配置文件示例"章节还提供了 10 种典型场景的完整样例(如服务化性能测评、合成数据集性能测评、稳态性能测评、多轮对话性能测评、裁判模型测评、自定义数据集测评等)。 + ## 其他功能场景 ### 推理结果重评估 主要功能场景下评测任务的执行流程包括完整的推理 → 评估 → 汇总流程: @@ -254,6 +619,54 @@ graph LR; 整个执行流程中的每个环节都是独立解耦的,推理结果是可以反复重评估的,如果第一次执行精度评测的到的精度数据有问题(比如没有准确得提取出response中有价值的内容),就可以修改答案提取的方式,执行推理结果重评估。具体操作如下。 假设上次执行性能测评的命令是: + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/single_task_zh_cn.py +``` +同时提示落盘的时间戳为`20250628_151326`,但是8条case的精度数据有问题,只得了0分: +```bash +dataset version metric mode vllm_api_general_chat +----------------------- -------- -------- ----- ---------------------- +demo_gsm8k 401e4c accuracy gen 00.00 +``` +查看`20250628_151326/predictions/vllm-api-general-chat/gsm8k.json`,发现推理结果中实际给了正确的答案。 + +**重评估步骤:** + +1. 
编辑自定义配置文件(如 [inference_re_eval_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py)),按照实际需求覆盖对应数据集的 `eval_cfg` 中答案提取函数(参考下面示例);其中 `pred_postprocessor` 负责从模型输出中提取答案,可根据实际情况替换或自定义。完整样例如下: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.datasets import gsm8k_postprocess, gsm8k_dataset_postprocess + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + +models = vllm_api_general_chat +# ...其余参数配置详见配置文件 + +# 关键:替换或修改答案的提取函数实现 +datasets[0]['eval_cfg']['pred_postprocessor'] = dict(type=gsm8k_postprocess) +datasets[0]['eval_cfg']['dataset_postprocessor'] = dict(type=gsm8k_dataset_postprocess) +``` + +2. 在第一次精度评测命令的基础上叠加 `--mode eval` 和 `--reuse {复用的推理结果所在的时间戳}` 反复重评估(`--mode` 与 `--reuse` 是公共参数,使用自定义配置文件时仍可通过命令行追加): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark/inference_re_eval_zh_cn.py --mode eval --reuse 20250628_151326 +``` + +::: +:::{tab-item} 备选:使用命令行参数 + ```bash ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt ``` @@ -276,6 +689,7 @@ ais_bench --datasets gsm8k_gen_4_shot_cot_chat_prompt --search ╘═════════════╧═══════════════════════════════════════╧═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╛ ``` + 打开`gsm8k_gen_4_shot_cot_chat_prompt.py`替换或修改答案的提取函数 ```python # ...... 
@@ -295,4 +709,7 @@ gsm8k_eval_cfg = dict(evaluator=dict(type=Gsm8kEvaluator), ```bash ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode eval --reuse 20250628_151326 -``` \ No newline at end of file +``` + +::: +:::: \ No newline at end of file diff --git a/docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark_local.md b/docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark_local.md index 2187ddd6..bc6ed5a7 100644 --- a/docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark_local.md +++ b/docs/source_zh_cn/base_tutorials/scenes_intro/accuracy_benchmark_local.md @@ -8,18 +8,203 @@ - 模型任务准备:从📚 [本地模型后端](../all_params/models.md#本地模型后端)中选择要执行的模型任务。 ## 主要功能 -纯模型精度测评场景下主要功能与服务化精度测评场景相似。 + +纯模型精度测评场景下主要功能与服务化精度测评场景相似,但需要将模型任务替换为本地 HuggingFace 模型任务(如 [`HuggingFacewithChatTemplate`](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/models/huggingface_chat_model.py) 或 [`HuggingFaceBaseModel`](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/models/huggingface_base_model.py))。 + ### 纯模型多任务测评 -参考[服务化精度多任务测评使用方法](accuracy_benchmark.md#多任务测评) + +支持同时配置多个数据集任务,通过单次命令进行批量测评。完整样例请参考 [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + +datasets = gsm8k_datasets + 
aime2024_datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # 替换为实际的本地模型权重路径 + tokenizer_path='THUDM/chatglm-6b', + # ...其余参数配置详见配置文件 + ) +] +``` + +执行命令: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py +``` + +#### 自定义模型-数据集配对(可选) + +默认情况下,上述配置中 `models` 列表与 `datasets` 列表会自动按笛卡尔积组合,子任务数为模型数 × 数据集数(本例为 1 × 2 = 2 个)。若希望精确控制哪些模型与哪些数据集配对(例如只让该模型跑部分数据集),可在配置文件中通过 `model_dataset_combinations` 字段显式声明配对关系: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_chat_prompt import aime2024_datasets + +datasets = gsm8k_datasets + aime2024_datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # 替换为实际的本地模型权重路径 + tokenizer_path='THUDM/chatglm-6b', + ) +] + +# 关键:通过 model_dataset_combinations 精确控制配对 +# 下例仅生成 1 个子任务(笛卡尔积会生成 2 个): +# - hf-chat-model + gsm8k +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), +] +``` + +> ⚠️ **注意**:模型与数据集的唯一标识由 `abbr` 字段决定。同一配置文件中,相同 `abbr` 的模型或数据集重复出现的组合会被视为重复任务而被跳过。当通过 `.copy()` 等方式复用模型/数据集配置时,必须显式修改 `abbr` 以保证唯一性。详见 📚 [自定义模型与数据集组合](../../advanced_tutorials/run_custom_config.md#自定义模型与数据集组合)。 + +> 💡 详细使用方法也可参考[服务化精度多任务测评使用方法](accuracy_benchmark.md#多任务测评)。 + ### 纯模型多任务并行测评 -参考[服务化精度多任务并行测评使用方法](accuracy_benchmark.md#多任务并行测评)。 + +支持通过 [`--max-num-workers`](../all_params/cli_args.md#公共参数) 
命令行参数实现多任务并行。配置文件样例与[纯模型多任务测评](#纯模型多任务测评)完全一致,区别仅在执行命令。 + +执行命令(以 `--max-num-workers 4` 为例): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py --max-num-workers 4 +``` + > ⚠️ 注意:纯模型精度测评多任务并行会占用不同GPU单元,并行任务所需的GPU单元应小于等于可使用的GPU总数。 + +> 💡 详细使用方法也可参考[服务化精度多任务并行测评使用方法](accuracy_benchmark.md#多任务并行测评)。 + ### 纯模型中断续测 -在纯模型精度测评过程中,如遇任务中断,可通过 `--reuse` 参数指定任务时间戳目录,继续未完成的推理任务,实现断点续测。该功能无需重复运行全部任务,仅对未完成部分进行补充推理。使用详情可参考[服务化精度中断续测使用方法](accuracy_benchmark.md#中断续测--失败用例重测)。 + +在纯模型精度测评过程中,如遇任务中断,可通过 `--reuse` 参数指定任务时间戳目录,继续未完成的推理任务,实现断点续测。该功能无需重复运行全部任务,仅对未完成部分进行补充推理。 + +首次执行命令: + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py +``` + +通过 `--reuse` 参数指定任务时间戳目录续推(`--reuse` 是公共参数,使用自定义配置文件时仍可通过命令行追加): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py --reuse 20250628_151326 +``` + > ⚠️ 注意,纯模型精度测评当前不支持失败用例自动重测。 + +> 💡 详细使用方法也可参考[服务化精度中断续测使用方法](accuracy_benchmark.md#中断续测--失败用例重测)。 + ### 纯模型合并子数据集推理 -参考[服务化精度合并子数据集推理使用方法](accuracy_benchmark.md#合并子数据集推理)。 + +支持将存在多个小规模数据集的数据集合并为一个任务进行统一测评。完整样例请参考 [ceval_merge_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from ais_bench.benchmark.configs.datasets.ceval.ceval_gen_5_shot_str import ceval_datasets as datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # 替换为实际的本地模型权重路径 + tokenizer_path='THUDM/chatglm-6b', + # ...其余参数配置详见配置文件 + ) +] +``` + +执行命令(`--merge-ds` 是公共参数,使用自定义配置文件时仍可通过命令行追加): + +```bash 
+ais_bench ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py --merge-ds +``` + +> 💡 详细使用方法也可参考[服务化精度合并子数据集推理使用方法](accuracy_benchmark.md#合并子数据集推理)。 + +## 通过自定义配置文件实现 + +> 💡 上述所有功能场景(多任务测评、多任务并行、中断续测、合并子数据集等)均可以通过 [自定义配置文件方式](../../advanced_tutorials/run_custom_config.md) 实现。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用。 + +本章节涉及的所有自定义配置文件样例已统一存放在 `ais_bench/configs/accuracy_benchmark_local/` 目录下,便于查阅与复用: + +| 文件名 | 对应场景 | +| --- | --- | +| [single_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/single_task_zh_cn.py) | 单任务测评 | +| [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/multi_task_zh_cn.py) | 纯模型多任务测评 / 多任务并行测评 | +| [ceval_merge_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/ceval_merge_zh_cn.py) | 合并子数据集推理 | +| [inference_re_eval_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py) | 纯模型推理结果重评估 | + +详见 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md#各场景自定义配置文件示例) 中"纯模型精度测评"示例。 ## 其他功能 + ### 纯模型推理结果重评估 -参考[服务化精度推理结果重评估使用方法](accuracy_benchmark.md#推理结果重评估)。 \ No newline at end of file + +完整样例请参考 [inference_re_eval_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.models import HuggingFacewithChatTemplate +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask +from ais_bench.benchmark.datasets import gsm8k_postprocess, gsm8k_dataset_postprocess + +with read_base(): + from ais_bench.benchmark.configs.summarizers.example import summarizer + from 
ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = [ + dict( + type=HuggingFacewithChatTemplate, + abbr='hf-chat-model', + path='THUDM/chatglm-6b', # 替换为实际的本地模型权重路径 + tokenizer_path='THUDM/chatglm-6b', + # ...其余参数配置详见配置文件 + ) +] + +# 关键:替换或修改答案的提取函数实现 +datasets[0]['eval_cfg']['pred_postprocessor'] = dict(type=gsm8k_postprocess) +datasets[0]['eval_cfg']['dataset_postprocessor'] = dict(type=gsm8k_dataset_postprocess) +``` + +执行命令(`--mode eval` 与 `--reuse` 是公共参数,使用自定义配置文件时仍可通过命令行追加): + +```bash +ais_bench ais_bench/configs/accuracy_benchmark_local/inference_re_eval_zh_cn.py --mode eval --reuse 20250628_151326 +``` + +> 💡 详细使用方法也可参考[服务化精度推理结果重评估使用方法](accuracy_benchmark.md#推理结果重评估)。 \ No newline at end of file diff --git a/docs/source_zh_cn/base_tutorials/scenes_intro/performance_benchmark.md b/docs/source_zh_cn/base_tutorials/scenes_intro/performance_benchmark.md index fff58afa..77b2c31f 100644 --- a/docs/source_zh_cn/base_tutorials/scenes_intro/performance_benchmark.md +++ b/docs/source_zh_cn/base_tutorials/scenes_intro/performance_benchmark.md @@ -81,10 +81,27 @@ models = [ ] ``` ### 执行命令 + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +完成配置后,执行命令启动服务化性能评测: + +```bash +ais_bench ais_bench/configs/performance_benchmark/single_task_zh_cn.py --mode perf +``` + +::: +:::{tab-item} 备选:使用命令行参数 + 修改好配置文件后,执行命令启动服务化性能评测: + ```bash ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt -m perf ``` + +::: +:::: #### 查看任务执行细节 执行AISBench命令后,正在执行的任务状态会在命令行实时刷新的看板上显示(键盘按"P"键可以停止刷新,用于复制看板信息,再按"P"可以继续刷新),例如: ``` @@ -197,12 +214,93 @@ outputs/default/20251106_103326/logs/infer/vllm-api-stream-chat/demo_gsm8k.out - 服务化模型后端配置:从[服务化推理后端](../all_params/models.md#服务化推理后端)中选择接口类型为`流式接口`的子服务(⚠️ 其他不支持)。 ## 主要功能场景 -### 单任务评测 -参考[服务化性能测评快速入门](#服务化性能测评快速入门) +### 单任务测评 +参考[服务化性能测评快速入门](#服务化性能测评快速入门)。快速入门已经提供了两种启动方式: + +- ⭐ 推荐:使用自定义配置文件 [服务化性能测评快速入门-使用自定义配置文件](#-推荐使用自定义配置文件) +- 备选:使用命令行参数 
[服务化性能测评快速入门-使用命令行参数](#备选使用命令行参数) + ### 多任务测评 支持同时配置多个模型或多个数据集任务,通过单次命令进行批量测评,适用于多个测试命令串行执行。 -#### 命令说明 -用户可通过`--models`和`--datasets`参数指定多个配置任务,子任务数为`--models`配置任务数和`--datasets`配置任务数的乘积,即一个模型配置和一个数据集配置组成一个子任务,示例: + +#### 子任务组合说明 + +多任务测评场景下,子任务数为`models`配置任务数和`datasets`配置任务数的乘积,即一个模型配置和一个数据集配置组成一个子任务。 + +下面以同时测评2个模型任务(`vllm_api_general_stream`、`vllm_api_stream_chat`)和2个数据集任务(`gsm8k_gen_4_shot_cot_str`、`aime2024_gen_0_shot_str`)为例,将执行以下4个组合性能测试任务: + ++ [vllm_api_general_stream](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py)模型任务 + [gsm8k_gen_4_shot_cot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py) 数据集任务 ++ [vllm_api_general_stream](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_stream.py)模型任务 + [aime2024_gen_0_shot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_str.py) 数据集任务 ++ [vllm_api_stream_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py)模型任务 + [gsm8k_gen_4_shot_cot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/gsm8k/gsm8k_gen_4_shot_cot_str.py) 数据集任务 ++ [vllm_api_stream_chat](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py)模型任务 + [aime2024_gen_0_shot_str](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/aime2024/aime2024_gen_0_shot_str.py) 数据集任务 + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +在`with read_base():`中导入多个模型任务和数据集任务,然后将其合并到 `models`、`datasets` 列表即可。完整样例请参考 [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/multi_task_zh_cn.py): + +```python +from mmengine.config import 
read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_str import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets + +models = vllm_api_general_stream + vllm_api_stream_chat +# ...其余参数配置详见配置文件 +``` + +修改好配置文件后,执行命令: + +```bash +ais_bench ais_bench/configs/performance_benchmark/multi_task_zh_cn.py --mode perf +``` + +#### 自定义模型-数据集配对(可选) + +默认情况下,上述配置中 `models` 列表与 `datasets` 列表会自动按笛卡尔积组合,子任务数为模型数 × 数据集数(本例为 2 × 2 = 4 个)。若希望精确控制哪些模型与哪些数据集配对(例如让部分模型只跑部分数据集、避免无意义的组合),可在配置文件中通过 `model_dataset_combinations` 字段显式声明配对关系: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.gsm8k.gsm8k_gen_4_shot_cot_str import gsm8k_datasets + from ais_bench.benchmark.configs.datasets.aime2024.aime2024_gen_0_shot_str import aime2024_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +datasets = gsm8k_datasets + aime2024_datasets 
+models = vllm_api_general_stream + vllm_api_stream_chat + +# 关键:通过 model_dataset_combinations 精确控制配对 +# 下例仅生成 2 个子任务(笛卡尔积会生成 4 个): +# - vllm_api_general_stream + gsm8k_gen_4_shot_cot_str +# - vllm_api_stream_chat + aime2024_gen_0_shot_str +model_dataset_combinations = [ + dict(models=[models[0]], datasets=[datasets[0]]), + dict(models=[models[1]], datasets=[datasets[1]]), +] +``` + +> ⚠️ **注意**:模型与数据集的唯一标识由 `abbr` 字段决定。同一配置文件中,相同 `abbr` 的模型或数据集重复出现的组合会被视为重复任务而被跳过。当通过 `.copy()` 等方式复用模型/数据集配置时,必须显式修改 `abbr` 以保证唯一性。详见 📚 [自定义模型与数据集组合](../../advanced_tutorials/run_custom_config.md#自定义模型与数据集组合)。 + +::: + +:::{tab-item} 备选:使用命令行参数 + +用户可通过`--models`和`--datasets`参数指定多个配置任务,示例: ```bash ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_str --mode perf ``` @@ -240,6 +338,8 @@ ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k ```bash ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k_gen_4_shot_cot_str aime2024_gen_0_shot_str --mode perf ``` +::: +:::: 执行过程中会在📚 [`--work-dir`](../all_params/cli_args.md#公共参数)路径(默认是`outputs/default/`)下创建时间戳目录用于保存执行细节。 4个性能评测任务结束后会一次性打印4个任务的性能结果: @@ -328,7 +428,64 @@ ais_bench --models vllm_api_general_stream vllm_api_stream_chat --datasets gsm8k ### 自定义序列长度测评 + +自定义序列长度测评需要指定特殊的数据集任务 `synthetic_gen_string`,并在模型任务的 `generation_kwargs` 中配置 `ignore_eos = True` 以确保达到最大输出长度。 + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +完整样例请参考 [synthetic_gen_string_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/synthetic_gen_string_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + 
from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import synthetic_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +# 关键:自定义输入输出分布(可通过修改synthetic_config调整) +synthetic_config = { + "Type": "string", + "RequestCount": 1000, + "StringConfig": { + "Input": { + "Method": "uniform", + "Params": {"MinValue": 50, "MaxValue": 500} + }, + "Output": { + "Method": "uniform", + "Params": {"MinValue": 20, "MaxValue": 200} + } + } +} + +datasets = [] +for ds in synthetic_datasets: + ds = dict(ds) + ds["config"] = synthetic_config + datasets.append(ds) + +models = vllm_api_stream_chat +# 关键:性能测试时需将 ignore_eos 设置为 True 以确保达到最大输出长度 +models[0]["generation_kwargs"] = dict(temperature=0.01, ignore_eos=True) +# ...其余参数配置详见配置文件 +``` + +执行命令: + +```bash +ais_bench ais_bench/configs/performance_benchmark/synthetic_gen_string_zh_cn.py --mode perf +``` + +::: +:::{tab-item} 备选:使用命令行参数 + #### 1 配置自定义序列数据集输入输出分布 + 自定义序列长度测评需要指定特殊的数据集任务`synthetic_gen_string`,执行如下命令来检索`synthetic_gen_string`对应的配置文件所在路径”: ```bash ais_bench --models vllm_api_stream_chat --datasets synthetic_gen_string --search @@ -364,6 +521,7 @@ synthetic_config = { 💡更多的自定义输入输出分布可参考📚 [随机合成数据集](../../advanced_tutorials/synthetic_dataset.md) #### 2 确保推理服务达到设置的最大输出 + 为了确保推理服务达到设置的最大输出,需要在📚 [服务化模型配置](../all_params/models.md#服务化推理后端配置参数说明)的 `generation_kwargs` 中配置特殊的后处理参数`ignore_eos = True`,以控制请求的最大输出长度(不提前结束)。 例如修改`vllm_api_stream_chat`模型任务对应的配置文件[vllm_api_stream_chat.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/models/vllm_api/vllm_api_stream_chat.py)内容: @@ -385,23 +543,236 @@ models = [ ``` #### 3 启动性能测评 + 执行以下命令: ```bash ais_bench --models vllm_api_stream_chat --datasets synthetic_gen_string -m perf ``` + +::: +:::: 完成后,输出目录结构同[多任务测评](#多任务测评)章节所示,会在 performance/vllm-api-stream-chat/synthetic* 下生成相应的 CSV/JSON/HTML 文件。 + > ⚠️ 注意: > - 部分服务化后端不支持 `ignore_eos` 后处理参数,此时实际输出的 `Token` 
数可能无法达到所配置的最大输出长度,需要通过其他后处理参数的配置达到最大输出长度(例如限定最小输出的后处理参数等)。 +### 自定义序列的多任务组合测评 + +实际性能测评中,经常需要在同一推理服务上批量验证不同并发(`batch_size`)、不同请求频率(`request_rate`)、不同生成参数(`generation_kwargs`)下的表现,同时还要对比不同请求个数、不同输入/输出长度的组合。**自定义配置文件仅靠 Python 脚本即可批量生成上述所有组合任务**,无需手工复制粘贴多个配置文件。 + +下面的示例演示:将 `batch_size`、`request_rate`、`request_count`、`input_range`、`output_range` 这 5 个参数统一收束到 `tasks_params` 字典中,**用户希望配几个任务就在对应列表中追加几个元素**,无需关心其余模板代码;配置文件中的模板代码会按 `tasks_params` 的列表长度生成对应数量的模型任务与数据集任务,**并通过 `model_dataset_combinations` 按索引一一配对**(`models[i]` 配 `datasets[i]`,而非笛卡尔积),数据集名称按索引自动生成(如 `synthetic-string-0`、`synthetic-string-1`...)。 + +完整样例请参考 [multi_task_synthetic_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/multi_task_synthetic_zh_cn.py): + +```python +import copy +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.synthetic.synthetic_gen_string import synthetic_datasets as base_synthetic_datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as base_vllm_api_stream_chat + +# 关键:统一收束 batch_size / request_rate / request_count / input_range / output_range 五个参数 +# 用户希望配几个任务,就在对应列表中追加几个元素(同一分组内的列表长度需保持一致) +# 注意:models 与 datasets 的列表长度需保持一致,二者会按下标一一配对,而非笛卡尔积 +tasks_params = { + "models": { + "batch_size": [1, 2, 4, 8, 16, 32], + "request_rate": [0, 0, 0, 0, 0, 0], + }, + "datasets": { + "request_count": [100, 100, 100, 100, 100, 100], + "input_range": [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64)], + "output_range": [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64)], + }, +} + +# 关键:通过 deepcopy 复制同一个基础模型配置,按 tasks_params["models"] 批量覆盖 batch_size / request_rate +models = [] +for idx,
(batch_size, request_rate) in enumerate(zip(tasks_params["models"]["batch_size"], + tasks_params["models"]["request_rate"])): + model_cfg = copy.deepcopy(base_vllm_api_stream_chat[0]) + model_cfg["abbr"] = f"vllm-api-stream-chat-bs{batch_size}-rr{request_rate}" + model_cfg["host_ip"] = "localhost" + model_cfg["host_port"] = 8080 + model_cfg["max_out_len"] = 512 + model_cfg["batch_size"] = batch_size + model_cfg["request_rate"] = request_rate + # 关键:每个模型任务使用独立的 generation_kwargs + model_cfg["generation_kwargs"] = dict(temperature=0.01, ignore_eos=True) + models.append(model_cfg) + +# 关键:按 tasks_params["datasets"] 批量构建合成数据集任务,名称按索引自动生成 +datasets = [] +for idx, (request_count, input_range, output_range) in enumerate( + zip(tasks_params["datasets"]["request_count"], + tasks_params["datasets"]["input_range"], + tasks_params["datasets"]["output_range"]) +): + ds = dict(base_synthetic_datasets[0]) + ds["abbr"] = f"synthetic-string-{idx}" + ds["config"] = { + "Type": "string", + "RequestCount": request_count, + "StringConfig": { + "Input": { + "Method": "uniform", + "Params": {"MinValue": input_range[0], "MaxValue": input_range[1]}, + }, + "Output": { + "Method": "uniform", + "Params": {"MinValue": output_range[0], "MaxValue": output_range[1]}, + }, + }, + } + datasets.append(ds) + +# 关键:按索引一一配对 models[i] 与 datasets[i],避免笛卡尔积 +# 例如 models[0](batch_size=1) 仅与 datasets[0](input_range=(1,2)) 配对,而非与所有数据集交叉组合 +model_dataset_combinations = [ + dict(models=[models[idx]], datasets=[datasets[idx]]) + for idx in range(min(len(models), len(datasets))) +] + +work_dir = "outputs/default/" + +infer = dict( + partitioner=dict(type=NaivePartitioner), + runner=dict(type=LocalAPIRunner, task=dict(type=OpenICLInferTask)), +) +``` + +执行命令: + +```bash +ais_bench ais_bench/configs/performance_benchmark/multi_task_synthetic_zh_cn.py --mode perf +``` + +上述示例中: + +- `tasks_params["models"]` 内 `batch_size` 与 `request_rate` 两个列表**逐位配对**,共同决定模型任务数量。本例中两个列表各有 6 个元素,因此生成 6 个模型任务,分别对应 `batch_size` = 1 / 
2 / 4 / 8 / 16 / 32 与 `request_rate` = 0。每个任务使用唯一的 `abbr`(如 `vllm-api-stream-chat-bs1-rr0`)以保证结果可区分。 +- `tasks_params["datasets"]` 内 `request_count`、`input_range`、`output_range` 三个列表**逐位配对**,共同决定数据集任务数量。本例中三个列表各有 6 个元素,因此生成 6 个数据集任务;数据集名称按列表下标自动生成(`synthetic-string-0` ~ `synthetic-string-5`)。 +- 用户**仅需修改 `tasks_params` 的列表长度与元素值**,即可灵活调整任务数量与各项参数,无需触碰下方模板代码。 +- 通过 `model_dataset_combinations` 字段按索引一一配对 `models[i]` 与 `datasets[i]`,本例共生成 **6 个子任务**(而非笛卡尔积的 36 个),对应关系如下: + + | 子任务 | 模型 (`batch_size` / `request_rate`) | 数据集 (`input_range` / `output_range`) | + | --- | --- | --- | + | 1 | 1 / 0 | (1, 2) / (1, 2) | + | 2 | 2 / 0 | (2, 4) / (2, 4) | + | 3 | 4 / 0 | (4, 8) / (4, 8) | + | 4 | 8 / 0 | (8, 16) / (8, 16) | + | 5 | 16 / 0 | (16, 32) / (16, 32) | + | 6 | 32 / 0 | (32, 64) / (32, 64) | + +> ⚠️ **注意**:`tasks_params["models"]` 内 `batch_size` 与 `request_rate` 列表长度必须一致;`tasks_params["datasets"]` 内 `request_count`、`input_range`、`output_range` 列表长度必须一致;否则 `zip()` 截断后会丢失末尾元素。同时,由于本场景要求一一配对,`models` 列表与 `datasets` 列表的长度也建议保持一致;当长度不一致时,会以较短列表的长度为准。 + +> ⚠️ **注意**:性能测评时需将 `generation_kwargs` 中的 `ignore_eos` 设置为 `True`,以确保输出长度达到 `max_out_len` 限制;否则输出可能在到达限制长度前提前结束。 + +> 💡 关于 `model_dataset_combinations` 字段的更多用法(例如一对多、多对一等更复杂的配对),可参考 📚 [自定义模型与数据集组合](../../advanced_tutorials/run_custom_config.md#自定义模型与数据集组合)。 + +> 💡 关于合成数据集的更多分布类型(`uniform` / `gaussian` / `zipf`)与参数说明,可参考 📚 [随机合成数据集](../../advanced_tutorials/synthetic_dataset.md)。 + + ### 固定请求数测评 -当集规模过大,只想针数据对部分样本执行性能测试时,可使用 📚 [`--num-prompts`](../all_params/cli_args.md#性能测评参数) 参数指定读取的数据条数。示例如下: +当数据集规模过大,只想针对部分样本执行性能测试时,可使用以下两种方式控制读取的数据范围,二者作用一致,按使用习惯选择即可: + +- **基础方式**:通过命令行参数 📚 [`--num-prompts`](../all_params/cli_args.md#公共参数) 直接指定读取的数据条数,无需修改配置文件,使用最简单。 +- **进阶方式(功能更强大)**:在自定义配置文件中设置数据集的 `reader_cfg.test_range` 字段,支持更灵活的采样范围(如指定起始位置、自定义步长等),详细用法可参考 📚 [自定义配置文件](../../advanced_tutorials/run_custom_config.md)。 + +示例如下: + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +**方式一:基础方式 — 通过 `--num-prompts` 指定读取条数** + +完整样例请参考 
[fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/fixed_prompts_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +models = vllm_api_stream_chat +# ...其余参数配置详见配置文件 +``` + +执行命令(通过 `--num-prompts 1` 指定仅读取 1 条样本): + +```bash +ais_bench ais_bench/configs/performance_benchmark/fixed_prompts_zh_cn.py --mode perf --num-prompts 1 +``` + +**方式二:进阶方式 — 通过 `test_range` 灵活指定读取范围** + +如果需要更灵活的范围控制(如指定起始索引、自定义步长等),可在自定义配置文件中直接设置数据集的 `reader_cfg.test_range` 字段,无需通过命令行参数。完整样例请参考 [fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/fixed_prompts_zh_cn.py): + +```python +from mmengine.config import read_base +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.summarizers.perf.default_perf import summarizer + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +# 关键:通过 reader_cfg.test_range 灵活控制采样范围 +# 例如:'[0:8]' 表示读取前 8 条样本;'[10:20]' 表示读取索引 10 到 20 的样本 +datasets[0]['reader_cfg']['test_range'] = '[0:8]' + +models = vllm_api_stream_chat +# ...其余参数配置详见配置文件 +``` + 
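下面是一个最小示意,用于直观理解上文 `test_range` 切片字符串的含义。这里假设 `test_range` 按 Python 切片语义(左闭右开)解释,在该假设下 `'[10:20]'` 选中的是索引 10 至 19 的样本;`select_by_test_range` 是为演示而虚构的辅助函数,并非 AISBench 的接口,实际行为请以工具实现为准:

```python
import re


def select_by_test_range(samples, test_range):
    """按形如 '[start:stop]' 或 '[start:stop:step]' 的切片字符串选取样本。

    仅为示意用的假设实现,不代表 AISBench 内部对 test_range 的解析方式。
    """
    match = re.fullmatch(r"\[(-?\d*):(-?\d*)(?::(-?\d*))?\]", test_range.strip())
    if match is None:
        raise ValueError(f"unsupported test_range: {test_range!r}")
    # 空字符串或缺失的分组视为 None,与 Python 切片的缺省值一致
    start, stop, step = (int(part) if part else None for part in match.groups())
    return samples[slice(start, stop, step)]


if __name__ == "__main__":
    samples = list(range(100))  # 用整数索引代替真实样本
    print(select_by_test_range(samples, "[0:8]"))    # 前 8 条样本
    print(select_by_test_range(samples, "[10:20]"))  # 从索引 10 开始,到索引 20 之前为止
    print(select_by_test_range(samples, "[::25]"))   # 自定义步长
```

示意中用正则解析而不是 `eval`,只是为了让示例不依赖任意表达式求值;这属于示例自身的设计选择。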
+执行命令(已在配置文件中指定 test_range,无需再传 `--num-prompts`): + +```bash +ais_bench ais_bench/configs/performance_benchmark/fixed_prompts_zh_cn.py --mode perf +``` + +::: +:::{tab-item} 备选:使用命令行参数 + ```bash ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt -m perf --num-prompts 1 ``` 上述命令仅对示例数据集中的第一条记录进行推理并测量性能。 -> ⚠️ 注意:当前数据集会按照默认队列顺序依次读取,不支持随机抽样或打乱顺序。 +::: +:::: + +> ⚠️ 注意:当前数据集会按照默认队列顺序依次读取,不支持随机抽样或打乱顺序。同时配置文件中设置 `reader_cfg.test_range` 与命令行 `--num-prompts` 时,命令行参数 `--num-prompts` 优先级更高。 + + +## 通过自定义配置文件实现 + +> 💡 上述所有功能场景(多任务测评、自定义序列长度、固定请求数、性能结果重计算等)均提供了两种启动方式(**⭐ 推荐:使用自定义配置文件**、**备选:使用命令行参数**)。自定义配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用。 + +本章节涉及的所有自定义配置文件样例已统一存放在 `ais_bench/configs/performance_benchmark/` 目录下,便于查阅与复用: + +| 文件名 | 对应场景 | +| --- | --- | +| [single_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/single_task_zh_cn.py) | 单任务测评 | +| [multi_task_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/multi_task_zh_cn.py) | 多任务测评 | +| [synthetic_gen_string_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/synthetic_gen_string_zh_cn.py) | 自定义序列长度测评 | +| [multi_task_synthetic_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/multi_task_synthetic_zh_cn.py) | 自定义序列的多任务组合测评 | +| [fixed_prompts_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/fixed_prompts_zh_cn.py) | 固定请求数测评 | +| [perf_recalculate_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/perf_recalculate_zh_cn.py) | 性能结果重计算 | + +> 关于自定义配置文件语法的完整说明(包括可定义的顶层变量、字段详解、Python 高级用法等),请参考 📚 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md);其中"各场景自定义配置文件示例"章节还提供了 10 
种典型场景的完整样例(如服务化性能测评、合成数据集性能测评、稳态性能测评、多轮对话性能测评、裁判模型测评、自定义数据集测评等)。 ## 其他功能场景 ### 性能结果重计算 @@ -417,6 +788,63 @@ graph LR; 执行流程的每个环节是独立解耦的,计算和汇总可以基于性能采样的结果反复执行。如果直接打印出的性能数据不包含相关维度的数据(例如缺少percentage 95%的数据),就需要做一些配置修改来重计算,具体操作如下。 假设上次执行性能测评的命令是: + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +```bash +ais_bench ais_bench/configs/performance_benchmark/single_task_zh_cn.py --mode perf +``` +打印出的`Performance Parameters`表格如下所示: +```bash +[2025-11-06 11:11:33,463] [ais_bench] [INFO] Performance Results of task: vllm-api-general-stream/gsm8k: +╒══════════════════════════╤═════════╤═════════════════╤════════════════╤═════════════════╤═════════════════╤═════════════════╤════════════════╤═════════════════╤══════╕ +│ Performance Parameters │ Stage │ Average │ Min │ Max │ Median │ P75 │ P90 │ P99 │ N │ +╞══════════════════════════╪═════════╪═════════════════╪════════════════╪═════════════════╪═════════════════╪═════════════════╪════════════════╪═════════════════╪══════╡ +│ E2EL │ total │ 2753.3518 ms │ 2189.5185 ms │ 3339.4463 ms │ 2755.8153 ms │ 3039.7431 ms │ 3219.6642 ms │ 3313.0408 ms │ 1319 │ +...... 
+``` + +如果想知道"P95"维度的性能数据,需要在自定义配置文件(如 [perf_recalculate_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/performance_benchmark/perf_recalculate_zh_cn.py))中修改 `summarizer` 的 `stats_list` 字段,完整样例如下: + +```python +from mmengine.config import read_base +from ais_bench.benchmark.summarizers import DefaultPerfSummarizer +from ais_bench.benchmark.calculators import DefaultPerfMetricCalculator +from ais_bench.benchmark.partitioners import NaivePartitioner +from ais_bench.benchmark.runners.local_api import LocalAPIRunner +from ais_bench.benchmark.tasks import OpenICLInferTask + +with read_base(): + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + +# 关键:自定义结果呈现任务中的 stats_list,调整要呈现的性能维度 +summarizer = dict( + attr="performance", + type=DefaultPerfSummarizer, + calculator=dict( + type=DefaultPerfMetricCalculator, + stats_list=["Average", "Min", "Max", "Median", "P75", "P90", "P95", "P99"], + ) +) + +models = vllm_api_stream_chat +# ...其余参数配置详见配置文件 +``` + +其中`stats_list`中最多同时承载8个性能维度的数据。 + +修改完毕后可以执行如下命令重计算性能指标(`--mode perf_viz --pressure --reuse` 是公共参数,使用自定义配置文件时仍可通过命令行追加): + +```bash +## 注意必须指定 --mode perf_viz 以触发重计算 +ais_bench ais_bench/configs/performance_benchmark/perf_recalculate_zh_cn.py --mode perf_viz --pressure --debug --reuse 20250628_151326 +``` + +::: +:::{tab-item} 备选:使用命令行参数 + ```bash ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --mode perf ``` @@ -430,7 +858,7 @@ ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_cha ...... 
``` -如果想知道“P95”维度的性能数据,需要修改`--summarizer`对应的默认结果呈现任务default_perf对应的配置文件内容,default_perf的路径通过`--search`命令查询: +如果想知道"P95"维度的性能数据,需要修改`--summarizer`对应的默认结果呈现任务default_perf对应的配置文件内容,default_perf的路径通过`--search`命令查询: ```bash ╒══════════════╤══════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════════════╕ │ Task Type │ Task Name │ Config File Path │ @@ -462,6 +890,8 @@ summarizer = dict( ## 注意必须指定--summarizer default_perf ais_bench --models vllm_api_stream_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer default_perf --mode perf_viz --pressure --debug --reuse 20250628_151326 ``` +::: +:::: 性能结果打屏如下: ```bash [2025-11-06 11:11:33,463] [ais_bench] [INFO] Performance Results of task: vllm-api-general-stream/gsm8k: diff --git a/docs/source_zh_cn/best_practices/practice_ascend.md b/docs/source_zh_cn/best_practices/practice_ascend.md index b828c444..8f3547d8 100644 --- a/docs/source_zh_cn/best_practices/practice_ascend.md +++ b/docs/source_zh_cn/best_practices/practice_ascend.md @@ -1,6 +1,8 @@ # 基于昇腾800I-A2测评DeepSeek-R1数学能力,100%论文复现 ### 复现使用的aisbench评测工具版本 本文复现使用aisbench测评工具版本为[v3.0-20250331](https://github.com/AISBench/benchmark/releases/tag/v3.0-20250331) + +> 💡 本文档中的测评命令均可以通过 [自定义配置文件方式](../advanced_tutorials/run_custom_config.md) 实现,将模型、数据集、summarizer 等配置写入一个 Python 文件,一次编写、多次复用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。详见 [自定义配置文件运行AISBench](../advanced_tutorials/run_custom_config.md)。 ### 一 背景与目标 #### 1. 
1复现意义 diff --git a/docs/source_zh_cn/best_practices/practice_nvidia.md b/docs/source_zh_cn/best_practices/practice_nvidia.md index 9ecb9e56..cae93b84 100644 --- a/docs/source_zh_cn/best_practices/practice_nvidia.md +++ b/docs/source_zh_cn/best_practices/practice_nvidia.md @@ -1,6 +1,8 @@ # 基于英伟达A100加速卡测评DeepSeek-R1-Distill-Qwen-14B的数学能力,100%论文复现 ### 复现使用的aisbench评测工具版本 本文复现使用aisbench测评工具版本为[v3.0-20250412](https://github.com/AISBench/benchmark/releases/tag/v3.0-20250412) + +> 💡 本文档中的测评命令均可以通过 [自定义配置文件方式](../advanced_tutorials/run_custom_config.md) 实现,将模型、数据集、summarizer 等配置写入一个 Python 文件,一次编写、多次复用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。详见 [自定义配置文件运行AISBench](../advanced_tutorials/run_custom_config.md)。 ### 一 背景与目标 #### 1. 1 复现意义 diff --git a/docs/source_zh_cn/best_practices/replicate_llm_datasets_accuracy.md b/docs/source_zh_cn/best_practices/replicate_llm_datasets_accuracy.md index 46570f54..53085f0f 100644 --- a/docs/source_zh_cn/best_practices/replicate_llm_datasets_accuracy.md +++ b/docs/source_zh_cn/best_practices/replicate_llm_datasets_accuracy.md @@ -1,4 +1,7 @@ # 复现大语言模型(LLM)论文(技术报告)中的数据集测评结果(以DeepSeek R1使用的GPQA数据集为例) + +> 💡 本文档中的测评命令均可以通过 [自定义配置文件方式](../advanced_tutorials/run_custom_config.md) 实现,将模型、数据集、summarizer 等配置写入一个 Python 文件,一次编写、多次复用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。详见 [自定义配置文件运行AISBench](../advanced_tutorials/run_custom_config.md)。 + ## 前言-方法论 如果想要通过AISBench测评工具复现论文精度,需要对齐模型的技术报告或论文中对此数据集的测试方法,在评测工具这边需要对齐的如下: **模型相关配置**: diff --git a/docs/source_zh_cn/conf.py b/docs/source_zh_cn/conf.py index 1ecc6687..28d3a189 100644 --- a/docs/source_zh_cn/conf.py +++ b/docs/source_zh_cn/conf.py @@ -37,6 +37,7 @@ 'sphinx.ext.imgconverter', # 支持图片格式转换 'sphinx.ext.mathjax', # 支持数学公式 'sphinx.ext.viewcode', # 查看代码源文件 + 'sphinx_design', # 支持 tab-set、card 等 UI 组件 ] # 4. 
若使用 Markdown,需指定源文件后缀 @@ -60,6 +61,7 @@ 'dollarmath', # 支持 $ 分隔的数学公式 'html_admonition', # 支持 HTML 警告框 'replacements', # 支持文本替换 + 'colon_fence', # 支持 ::: 栅栏指令(用于 tab-set 等 sphinx_design 组件) ] # (可选)配置 Mermaid 输出格式 diff --git a/docs/source_zh_cn/extended_benchmark/agent/harbor_bench.md b/docs/source_zh_cn/extended_benchmark/agent/harbor_bench.md index be0a19f1..80ca1543 100644 --- a/docs/source_zh_cn/extended_benchmark/agent/harbor_bench.md +++ b/docs/source_zh_cn/extended_benchmark/agent/harbor_bench.md @@ -73,6 +73,8 @@ Terminal-Bench-2 预制打包镜像信息: 在 AISBench 工具根目录下修改 `ais_bench/configs/agent_example/harbor_terminal_bench_2_task.py`: +> 💡 上述 `harbor_terminal_bench_2_task.py` 即为 [自定义配置文件方式](../../advanced_tutorials/run_custom_config.md) 的具体应用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。你可以参考此示例文件自行编写满足特定需求的配置文件。详见 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md)。 + ```python models = [ dict( diff --git a/docs/source_zh_cn/extended_benchmark/agent/swe_bench.md b/docs/source_zh_cn/extended_benchmark/agent/swe_bench.md index 731113de..7c1915b1 100644 --- a/docs/source_zh_cn/extended_benchmark/agent/swe_bench.md +++ b/docs/source_zh_cn/extended_benchmark/agent/swe_bench.md @@ -21,6 +21,8 @@ SWE-bench是一个基准测试,用于评估大语言模型在从GitHub收集 - `mini_swe_agent_swe_bench_multilingual.py`:SWE-bench Multilingual(`SWE-bench/SWE-bench_Multilingual`),包含多语言 issue 描述的数据集。 - `mini_swe_agent_swe_bench_multilingual_mini.py`:SWE-bench Multilingual Mini(**15**/**30**/**60** 条),AISBench官方构造的 Multilingual 子集,用于显著降低评测成本;子集筛选/构造方式见数据集卡与构造仓库:`https://modelers.cn/datasets/AISBench/SWE-Bench_Multilingual_mini`、`https://github.com/AISBench/datasets/tree/main/mini_datasets/swe_bench_multiligual_mini`。 +> 💡 上述示例配置文件即为 [自定义配置文件方式](../../advanced_tutorials/run_custom_config.md) 的具体应用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。你可以参考这些示例文件自行编写满足特定需求的配置文件。详见 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md)。 + ## 2. 
前置依赖 diff --git a/docs/source_zh_cn/extended_benchmark/agent/swe_bench_pro.md b/docs/source_zh_cn/extended_benchmark/agent/swe_bench_pro.md index bb808762..4a3481f4 100644 --- a/docs/source_zh_cn/extended_benchmark/agent/swe_bench_pro.md +++ b/docs/source_zh_cn/extended_benchmark/agent/swe_bench_pro.md @@ -19,6 +19,8 @@ SWE-Bench Pro 是一个用于评估大语言模型在长时域软件工程任务 - `mini_swe_agent_swe_bench_pro_mini.py`:SWE-bench Pro Mini,适合先跑通流程/快速迭代。 - `mini_swe_agent_swe_bench_pro_full.py`:SWE-bench Pro Full,完整测试集。 +> 💡 上述示例配置文件即为 [自定义配置文件方式](../../advanced_tutorials/run_custom_config.md) 的具体应用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。你可以参考这些示例文件自行编写满足特定需求的配置文件。详见 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md)。 + ## 2. 前置依赖 运行前请确保以下依赖可用: diff --git a/docs/source_zh_cn/extended_benchmark/agent/tau2_bench.md b/docs/source_zh_cn/extended_benchmark/agent/tau2_bench.md index 5378163e..97a3454d 100644 --- a/docs/source_zh_cn/extended_benchmark/agent/tau2_bench.md +++ b/docs/source_zh_cn/extended_benchmark/agent/tau2_bench.md @@ -64,6 +64,8 @@ ### 3. 配置τ²-Bench任务的自定义配置文件 1. 在AISBench工具根目录下修改`ais_bench/configs/agent_example/tau2_bench_task.py`中必要的配置(主要是配置被测推理服务和模拟用户的推理服务的信息) + +> 💡 上述 `tau2_bench_task.py` 即为 [自定义配置文件方式](../../advanced_tutorials/run_custom_config.md) 的具体应用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。你可以参考此示例文件自行编写满足特定需求的配置文件。详见 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md)。 ```python # ...... 
models = [ diff --git a/docs/source_zh_cn/extended_benchmark/lmm_generate/gedit_bench.md b/docs/source_zh_cn/extended_benchmark/lmm_generate/gedit_bench.md index 80e6ca37..8e653592 100644 --- a/docs/source_zh_cn/extended_benchmark/lmm_generate/gedit_bench.md +++ b/docs/source_zh_cn/extended_benchmark/lmm_generate/gedit_bench.md @@ -89,6 +89,8 @@ pip install yunchang==0.6.0 #### 测评配置准备 在容器中`${PATH_TO_WORKSPACE}/benchmark/ais_bench/configs/lmm_example`目录下,打开`multi_device_run_qwen_image_edit.py`文件,编辑如下内容设置模型配置: + +> 💡 上述 `multi_device_run_qwen_image_edit.py` 即为 [自定义配置文件方式](../../advanced_tutorials/run_custom_config.md) 的具体应用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。你可以参考此示例文件自行编写满足特定需求的配置文件。详见 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md)。 ```python # ...... # ====== User configuration parameters ========= diff --git a/docs/source_zh_cn/extended_benchmark/lmm_generate/vbench.md b/docs/source_zh_cn/extended_benchmark/lmm_generate/vbench.md index cbadee8f..dc22747c 100644 --- a/docs/source_zh_cn/extended_benchmark/lmm_generate/vbench.md +++ b/docs/source_zh_cn/extended_benchmark/lmm_generate/vbench.md @@ -4,6 +4,8 @@ AISBench **已适配 VBench 1.0**。仓库目录 `ais_bench/configs/vbench_examples/` 下放的是 **独立配置文件** 示例,在 **GPU** 或 **NPU** 上对**生成视频**做质量/语义类维度测评。**当前 AISBench 不包含多模态视频生成**,请先完成视频生成后再进行测评(Standard模式参考[数据集生成](#数据集生成)章节)。 +> 💡 上述 `vbench_examples/` 下的示例配置文件即为 [自定义配置文件方式](../../advanced_tutorials/run_custom_config.md) 的具体应用。配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法。你可以参考这些示例文件自行编写满足特定需求的配置文件。详见 [自定义配置文件运行AISBench](../../advanced_tutorials/run_custom_config.md)。 + ## 目录 - [依赖与环境](#依赖与环境) diff --git a/docs/source_zh_cn/get_started/quick_start.md b/docs/source_zh_cn/get_started/quick_start.md index 8c28e1b5..bb10ed62 100644 --- a/docs/source_zh_cn/get_started/quick_start.md +++ b/docs/source_zh_cn/get_started/quick_start.md @@ -1,37 +1,117 @@ # 快速入门 -## 命令含义 
-AISBench命令执行的单个或多个评测任务是由模型任务(单个或多个)、数据集任务(单个或多个)和结果呈现任务(单个)的组合定义的,AISBench的其他命令行则规定了评测任务的场景(精度评测场景、性能评测场景等)。以如下AISBench命令为例: + +## 运行命令前置准备 + +- 需要准备支持`v1/chat/completions`子服务的推理服务,可以参考🔗 [VLLM启动OpenAI 兼容服务器](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server)启动推理服务 +- 需要准备gsm8k数据集,可以从🔗 [opencompass + 提供的gsm8k数据集压缩包](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip)下载。将解压后的`gsm8k/`文件夹部署到AISBench评测工具根路径下的`ais_bench/datasets`文件夹下。 + +## 启动测评(两种方式任选其一) + +| ⭐ 推荐:使用自定义配置文件 | 备选:使用命令行参数(原快速入门方式) | +| :--- | :--- | +| 修改一个文件,集中管理所有配置,在任意路径写配置 | 通过 `--models` `--datasets` 参数指定 | +| 一次编写,多次复用 | 每次运行需输入完整命令 | +| 支持 Python 全部语法,灵活扩展 | 仅支持笛卡尔积组合 | + +::::{tab-set} +:::{tab-item} ⭐ 推荐:使用自定义配置文件 + +AISBench 提供了预置的自定义配置文件 [model_api_test_zh_cn.py](https://github.com/AISBench/benchmark/tree/master/ais_bench/configs/model_api_test_zh_cn.py),将常见的推理服务化测试配置(模型选择、服务地址、端口、生成参数等)集中在一个文件中,无需分别查找和修改多个配置文件。该文件本质上是 Python 脚本,支持所有 Python 语法,你可以自由扩展。 + +打开 `ais_bench/configs/model_api_test_zh_cn.py`,根据实际情况修改以下配置(如果是`pip3 install ais_bench_benchmark`方式直接安装工具,可以在任意路径自行创建`model_api_test_zh_cn.py`,将以下配置内容写入该文件): + +```python +from mmengine.config import read_base + +with read_base(): +# 模型任务,选择其中一个,其他模型任务参考:https://ais-bench-benchmark-rf.readthedocs.io/zh-cn/latest/base_tutorials/all_params/models.html 获取更多模型任务 + # vllm_api_general 是基础模型,仅支持文本生成 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general import models as vllm_api_general + # vllm_api_general_chat 是对话模型,支持对话 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_chat import models as vllm_api_general_chat + # vllm_api_stream_chat 是流式对话模型,支持流式对话 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_stream_chat import models as vllm_api_stream_chat + # vllm_api_general_stream 是流式模型,支持流式生成 + from ais_bench.benchmark.configs.models.vllm_api.vllm_api_general_stream import models as vllm_api_general_stream + +# 
数据集任务,参考:https://ais-bench-benchmark-rf.readthedocs.io/zh-cn/latest/get_started/datasets.html 获取更多数据集任务 + from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets + +models = vllm_api_general_chat + +models[0]["path"] = "" # 指定模型序列化词表文件的绝对路径(精度测试场景一般不需要配置) +models[0]["model"] = "" # 指定服务端加载的模型名称,根据 VLLM 推理服务实际拉取的模型名称配置(配置为空字符串则自动获取) +models[0]["request_rate"] = 0 # 请求发送频率:每 1/request_rate 秒向服务端发送 1 条请求;小于 0.001 时一次性发送所有请求 +models[0]["api_key"] = "" # 自定义 API key,默认为空字符串 +models[0]["host_ip"] = "localhost" # 指定推理服务的 IP +models[0]["host_port"] = 8080 # 指定推理服务的端口 +models[0]["url"] = "" # 自定义访问推理服务的 URL 路径(当基础 URL 不是 http://host_ip:host_port 的组合时需要配置;配置后 host_ip 和 host_port 将被忽略) +models[0]["max_out_len"] = 512 # 推理服务输出的最大 token 数 +models[0]["batch_size"] = 1 # 发送请求的最大并发数 +models[0]["trust_remote_code"] = False # tokenizer 是否信任远程代码,默认为 False +models[0]["generation_kwargs"] = dict( # 模型推理参数,参考 VLLM 文档配置;AISBench 评测工具不做处理,直接附加到发送的请求中 + temperature=0.01, + ignore_eos=False, +) + +# datasets[0]["path"] = "ais_bench/datasets/gsm8k" # 指定数据集目录的绝对路径(精度测试场景需要配置) + +work_dir = 'outputs/default/' # 指定任务结果和日志的保存工作目录(默认为 outputs/default/) + +``` + +> 💡 配置文件中已预置了常用模型类型的导入(`vllm_api_general`、`vllm_api_general_chat`、`vllm_api_stream_chat`、`vllm_api_general_stream`),只需取消/修改注释即可切换。更多自定义配置文件的用法请参考 📚 [自定义配置文件运行AISBench](../advanced_tutorials/run_custom_config.md)。 + +数据集任务的选取、准备和使用参考如下步骤: + +1. 在📚 [开源数据集](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/get_started/datasets.html#id3)内选取数据集任务 +2. 进入数据的 📚 [详细介绍/数据集部署](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README.md#数据集部署)准备数据集 +3.
参考📚 [详细介绍/可用数据集任务](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README.md#可用数据集任务)选取可用数据集任务,并将对应的任务导入方式(例如`from ais_bench.benchmark.configs.datasets.demo.demo_gsm8k_gen_4_shot_cot_chat_prompt import gsm8k_datasets as datasets`)复制到自定义配置文件中 + +修改好配置文件后,执行如下命令启动服务化精度评测: + +```bash +ais_bench ais_bench/configs/model_api_test_zh_cn.py +``` + +::: +:::{tab-item} 备选:使用命令行参数 + +如果你更习惯使用命令行参数方式,AISBench 同样支持通过 `--models`、`--datasets`、`--summarizer` 参数直接指定任务。以下是与上述自定义配置文件方式**执行效果完全相同**的命令行方式。 + +AISBench命令执行的单个或多个评测任务是由模型任务(单个或多个)、数据集任务(单个或多个)和结果呈现任务(单个)的组合定义的。以如下AISBench命令为例: + ```shell ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --summarizer example ``` + 此命令没有指定其他命令行,默认是一个精度评测场景的任务,其中: -- `--models`指定了模型任务,即`vllm_api_general_chat`模型任务。 +- `--models`指定了模型任务,即`vllm_api_general_chat`模型任务。 - `--datasets`指定了数据集任务,即`demo_gsm8k_gen_4_shot_cot_chat_prompt`数据集任务。 +- `--summarizer`指定了结果呈现任务,即`example`结果呈现任务(不指定`--summarizer`精度评测场景默认使用`example`任务),一般使用默认,不需要在命令行中指定。 -- `--summarizer`指定了结果呈现任务,即`example`结果呈现任务(不指定`--summarizer`精度评测场景默认使用`example`任务),一般使用默认,不需要在命令行中指定,后续命令不指定。 - -## 任务含义查询(可选) -所选模型任务`vllm_api_general_chat`、数据集任务`demo_gsm8k_gen_4_shot_cot_chat_prompt`和结果呈现任务`example`的具体信息(简介,使用约束等)可以分别从如下链接中查询含义: -- `--models`: 📚 [服务化推理后端](../base_tutorials/all_params/models.md#服务化推理后端) +多任务测评请参考:📚 精度场景的[多任务测评](../base_tutorials/scenes_intro/accuracy_benchmark.md#多任务测评) 和 性能场景的[多任务测评](../base_tutorials/scenes_intro/performance_benchmark.md#多任务测评)。 -- `--datasets`: 📚 [开源数据集](../get_started/datasets.md#开源数据集) → 📚 [详细介绍](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README.md) +如需自行组合测评任务,实现更灵活的测评方式,可参考:📚 [自定义配置文件运行AISBench](../advanced_tutorials/run_custom_config.md#自定义配置文件运行AISBench)。 -- `--summarizer`: 📚 [结果汇总任务](../base_tutorials/all_params/summarizer.md) 
+所选模型任务`vllm_api_general_chat`、数据集任务`demo_gsm8k_gen_4_shot_cot_chat_prompt`和结果呈现任务`example`的具体信息(简介,使用约束等)可以分别从如下链接中查询含义:
-## 运行命令前置准备
-- `--models`: 使用`vllm_api_general_chat`模型任务,需要准备支持`v1/chat/completions`子服务的推理服务,可以参考🔗 [VLLM启动OpenAI 兼容服务器](https://docs.vllm.com.cn/en/latest/getting_started/quickstart.html#openai-compatible-server)启动推理服务
-- `--datasets`: 使用`demo_gsm8k_gen_4_shot_cot_chat_prompt`数据集任务,需要准备gsm8k数据集,可以从🔗 [opencompass
-提供的gsm8k数据集压缩包](http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip)下载。将解压后的`gsm8k/`文件夹部署到AISBench评测工具根路径下的`ais_bench/datasets`文件夹下。
+- `--models`: 📚 [服务化推理后端](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/base_tutorials/all_params/models.html#id2)
+- `--datasets`: 📚 [开源数据集](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/get_started/datasets.html#id3) → 📚 [详细介绍](https://github.com/AISBench/benchmark/tree/master/ais_bench/benchmark/configs/datasets/demo/README.md)
+- `--summarizer`: 📚 [结果汇总任务](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/base_tutorials/all_params/summarizer.html)
-## 任务对应配置文件修改
 每个模型任务、数据集任务和结果呈现任务都对应一个配置文件,运行命令前需要修改这些配置文件的内容。这些配置文件路径可以通过在原有AISBench命令基础上加上`--search`来查询,例如:
+
 ```shell
 ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --search
 ```
+
 > ⚠️ **注意**: 执行带search命令会打印出任务对应的配置文件的绝对路径。
 执行查询命令可以得到如下查询结果:
+
 ```shell
 ╒══════════════╤═══════════════════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
 │ Task Type │ Task Name │ Config File Path │
@@ -43,9 +123,10 @@ ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_ch
 ```
-- 快速入门中数据集任务配置文件`demo_gsm8k_gen_4_shot_cot_chat_prompt.py`不需要做额外修改,数据集任务配置文件内容介绍可参考📚 [配置开源数据集](../get_started/datasets.md#配置开源数据集)
+- 快速入门中数据集任务配置文件`demo_gsm8k_gen_4_shot_cot_chat_prompt.py`不需要做额外修改,数据集任务配置文件内容介绍可参考📚 [配置开源数据集](https://ais-bench-benchmark.readthedocs.io/zh-cn/latest/base_tutorials/all_params/datasets.html#id6)
 模型配置文件`vllm_api_general_chat.py`中包含了模型运行相关的配置内容,是需要依据实际情况修改的。快速入门中需要修改的内容用注释标明。
+
 ```python
 from ais_bench.benchmark.models import VLLMCustomAPIChat
@@ -63,7 +144,7 @@ models = [
         api_key="", # 自定义API key,默认是空字符串
         host_ip="localhost", # 指定推理服务的IP
         host_port=8080, # 指定推理服务的端口
-        url="", # 自定义访问推理服务的URL路径(当base url不是http://host_ip:host_port的组合时需要配置,配置后host_ip和host_port将被忽略)
+        url="", # 自定义访问推理服务的URL路径(当base url不是http://host_ip:host_port的组合时需要配置, 配置后host_ip和host_port会被忽略)
         max_out_len=512, # 推理服务输出的token的最大数量
         batch_size=1, # 请求发送的最大并发数
         trust_remote_code=False, # tokenizer是否信任远程代码,默认False;
@@ -75,13 +156,19 @@ models = [
 ]
 ```
-## 执行命令
 修改好配置文件后,执行命令启动服务化精度评测:
+
 ```bash
 ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt
 ```
+
+:::
+::::
+
 ## 查看任务执行细节
-执行AISBench命令后,正在执行的任务状态会在命令行实时刷新的看板上显示(键盘按"P"键可以停止刷新,用于复制看板信息,再按"P"可以继续刷新),例如:
+
+执行AISBench命令后,任务管理界面会在命令行实时刷新显示任务执行状态(键盘按"P"键可以暂停/恢复刷新,用于复制看板信息,再按"P"键可以继续刷新)。任务管理界面支持同时监控多个任务的详细执行状态,包括任务名称、进度、时间成本、状态、日志路径、扩展参数等信息,例如:
+
 ```
 Base path of result&log : outputs/default/20250628_151326
 Task Progress Table (Updated at: 2025-11-06 10:08:21)
@@ -97,47 +184,48 @@ Press Up/Down arrow to page, 'P' to PAUZE/RESUME screen refresh, 'Ctrl + C' to
 ```
 任务执行的细节日志会不断落盘在默认的输出路径,这个输出路径在实时刷新的看板上显示,即`Log Path`。`Log Path`(`logs/infer/vllm-api-general-chat/demo_gsm8k.out`)是在`Base path`(`outputs/default/20250628_151326`)下的路径,以上述的看板信息为例,任务执行的详细日志路径为:
+
 ```shell
 # {Base path}/{Log Path}
 outputs/default/20250628_151326/logs/infer/vllm-api-general-chat/demo_gsm8k.out
 ```
 > 💡 如果希望执行过程中将详细日志直接打印,执行命令时可以加上 `--debug`:
-`ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug`
-
-
-
+> `ais_bench --models vllm_api_general_chat --datasets demo_gsm8k_gen_4_shot_cot_chat_prompt --debug`
 `Base path`(`outputs/default/20250628_151326`)下包含了所有任务的执行细节,命令执行结束后所有的执行细节如下:
+
 ```shell
 20250628_151326/
 ├── configs # 模型任务、数据集任务和结构呈现任务对应的配置文件合成的一个配置
-│   └── 20250628_151326_29317.py
+│   └── 20250628_151326_29317.py
 ├── logs # 执行过程中日志,命令中如果加--debug,不会有过程日志落盘(都直接打印出来了)
-│   ├── eval
-│   │   └── vllm-api-general-chat
-│   │       └── demo_gsm8k.out # 基于predictions/文件夹下的推理结果的精度评测过程的日志
-│   └── infer
-│       └── vllm-api-general-chat
-│           └── demo_gsm8k.out # 推理过程日志
+│   ├── eval
+│   │   └── vllm-api-general-chat
+│   │       └── demo_gsm8k.out # 基于predictions/文件夹下的推理结果的精度评测过程的日志
+│   └── infer
+│       └── vllm-api-general-chat
+│           └── demo_gsm8k.out # 推理过程日志
 ├── predictions
-│   └── vllm-api-general-chat
-│       └── demo_gsm8k.json # 推理结果(推理服务返回的所有输出)
+│   └── vllm-api-general-chat
+│       └── demo_gsm8k.json # 推理结果(推理服务返回的所有输出)
 ├── results
-│   └── vllm-api-general-chat
-│       └── demo_gsm8k.json # 精度评测计算的原始分数
+│   └── vllm-api-general-chat
+│       └── demo_gsm8k.json # 精度评测计算的原始分数
 └── summary
     ├── summary_20250628_151326.csv # 最终精度分数呈现(表格格式)
     ├── summary_20250628_151326.md # 最终精度分数呈现(markdown格式)
     └── summary_20250628_151326.txt # # 最终精度分数呈现(文本格式)
 ```
-> ⚠️ **注意**: 不同评测场景落盘任务执行细节内容不同,具体请参考具体评测场景的指南。
+> ⚠️ **注意**: 不同评测场景落盘任务执行细节内容不同,具体请参考具体评测场景的指南。
 ### 输出结果
+
 因为只有8条数据,会很快跑出结果,结果显示的示例如下
+
 ```bash
 dataset version metric mode vllm_api_general_chat
 ----------------------- -------- -------- ----- ----------------------
 demo_gsm8k 401e4c accuracy gen 62.50
-```
\ No newline at end of file
+```
diff --git a/docs/source_zh_cn/index.rst b/docs/source_zh_cn/index.rst
index b03f38ce..d964092d 100644
--- a/docs/source_zh_cn/index.rst
+++ b/docs/source_zh_cn/index.rst
@@ -21,7 +21,7 @@ AISBench Benchmark 是基于 `OpenCompass ` 将引导你完成基本的精度评测配置和运行。
 * :doc:`数据集准备指南 ` 将帮助你了解支持的数据集及其准备方法。
 * 基础教程部分将介绍 :doc:`评测场景介绍 ` 、:doc:`评测结果说明 ` 以及 :doc:`详细参数说明 ` 等内容,帮助你更好地理解主要的评测场景的使用。
-* 如果想要更深入地了解 AISBench 评测工具的高级用法,可以参考 :doc:`进阶教程 `。
+* 如果想要更深入地了解 AISBench 评测工具的高级用法,可以参考 :doc:`进阶教程 `。**强烈推荐**阅读 :doc:`自定义配置文件运行AISBench `,配置文件本质上是 Python 脚本,支持循环、条件判断、列表推导等所有 Python 语法,可将模型、数据集、summarizer 等配置写入一个文件,一次编写、多次复用,覆盖几乎所有评测场景。
 * 你可以参考 :doc:`最佳实践` 部分,了解在不同场景下使用 AISBench 评测工具的最佳实践。
 * 最后,你可以参考 :doc:`常见问题 ` 部分,解决在使用 AISBench 评测工具过程中遇到的问题。
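
Reviewer note: the `index.rst` change above says a config file is an ordinary Python script that supports loops, conditionals and list comprehensions. The following is a minimal sketch of what that could look like, not an AISBench config that has been run. It uses plain dicts so it is self-contained. The fields `api_key`, `host_ip`, `host_port`, `max_out_len` and `batch_size` follow the `vllm_api_general_chat.py` excerpt in this diff; the `abbr` field, the `type` string, the port list and the `LONG_OUTPUT` switch are assumptions made for illustration, and a real config would pass the imported `VLLMCustomAPIChat` class instead of a string.

```python
# Hypothetical sketch: plain dicts stand in for real AISBench model configs.
BASE_MODEL = dict(
    type="VLLMCustomAPIChat",      # assumption: placeholder for the imported class
    abbr="vllm-api-general-chat",  # assumption: short name used in output paths
    api_key="",
    host_ip="localhost",
    host_port=8080,
    max_out_len=512,
    batch_size=1,
)

PORTS = [8080, 8081, 8082]  # hypothetical: one inference service per port

# List comprehension: one model entry per port, each a fresh copy of the base dict.
models = [
    dict(BASE_MODEL, abbr=f"{BASE_MODEL['abbr']}-{port}", host_port=port)
    for port in PORTS
]

# Conditional logic: optionally raise the output length for every entry.
LONG_OUTPUT = False  # hypothetical switch
if LONG_OUTPUT:
    for model in models:
        model["max_out_len"] = 2048
```

Building each entry with `dict(BASE_MODEL, ...)` copies the base dict instead of mutating it, so the shared defaults stay unchanged while each entry overrides only the fields that differ.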