[Bug] Accuracy results on the mmstar dataset are all 0 #254

### Operating system and version

openEuler 22.04.5 LTS

### Python environment used to install the tool

Python environment inside a Docker container

### Python version

3.11

### AISBench version

3.1.20260327

### AISBench command

ais_bench --models vllm_api_general_chat --datasets mmstar_gen --debug

### Model configuration file or custom configuration file content

from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr="vllm-api-general-chat",
        path="",
        model="qwen3-vl-32b",
        stream=False,
        request_rate=0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="0.0.0.0",
        host_port=8000,
        url="",
        max_out_len=16000,
        batch_size=128,
        trust_remote_code=False,
        generation_kwargs=dict(
            temperature=0.0,
            ignore_eos=True,
        ),
#        pred_postprocessor=dict(type=extract_non_reasoning_content),
    )
]
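
The `pred_postprocessor` line in the config above is commented out. For reference, the sketch below shows what a reasoning-stripping postprocessor is assumed to do: keep only the text after the last `</think>` tag, and return an empty answer when a reasoning block is opened but never closed. `strip_reasoning` is a hypothetical helper written for illustration; it is not the implementation of `extract_non_reasoning_content`, whose exact behavior is not confirmed here.

```python
def strip_reasoning(text: str) -> str:
    """Hypothetical stand-in for a reasoning-stripping postprocessor.

    Assumes reasoning is delimited by <think>...</think> tags; this is an
    assumption, not the confirmed behavior of extract_non_reasoning_content.
    """
    # Case 1: a closing tag exists, so keep only what follows the last one.
    if "</think>" in text:
        return text.rsplit("</think>", 1)[1].strip()
    # Case 2: reasoning was opened but never closed (e.g. generation hit the
    # output length limit), so nothing after the tag is a final answer.
    if "<think>" in text:
        return text.split("<think>", 1)[0].strip()
    # Case 3: no reasoning tags at all; return the trimmed text unchanged.
    return text.strip()


if __name__ == "__main__":
    print(strip_reasoning("<think>compare the options</think>\nB"))
```

Applied to a handful of raw predictions, a helper like this shows whether an option letter remains once any reasoning text is removed.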

### Expected behavior

Obtain accuracy results for the qwen3-vl-32b-instruct model on the mmstar dataset.

### Actual behavior

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| mmstar | d9b8ec | Overall | gen | 0.00 |
| mmstar | d9b8ec | coarse perception | gen | 0.00 |
| mmstar | d9b8ec | fine-grained perception | gen | 0.00 |
| mmstar | d9b8ec | instance reasoning | gen | 0.00 |
| mmstar | d9b8ec | logical reasoning | gen | 0.00 |
| mmstar | d9b8ec | math | gen | 0.00 |
| mmstar | d9b8ec | science & technology | gen | 0.00 |
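
To narrow down whether the 0.00 scores come from answer extraction rather than from wrong answers, accuracy can be recomputed offline from the raw predictions. A minimal sketch, assuming the predictions are available as hypothetical `(category, prediction, gold)` tuples; this record layout and the naive A-D regex are illustrative assumptions, not AISBench's output format or its MMStar evaluator.

```python
import re
from collections import defaultdict
from typing import Iterable, Optional, Tuple

# Naive extractor: the first standalone uppercase letter A-D counts as the
# chosen option. This is an illustrative assumption, not AISBench's logic.
_OPTION_RE = re.compile(r"\b([A-D])\b")


def extract_option(prediction: str) -> Optional[str]:
    """Return the first standalone option letter A-D, or None if absent."""
    match = _OPTION_RE.search(prediction)
    return match.group(1) if match else None


def per_category_accuracy(
    records: Iterable[Tuple[str, str, str]],
) -> dict:
    """Compute accuracy in percent per category plus an 'Overall' entry.

    Each record is a hypothetical (category, raw_prediction, gold_letter) tuple.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, prediction, gold in records:
        correct = extract_option(prediction) == gold
        # Count every record toward its own category and toward Overall.
        for key in (category, "Overall"):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


if __name__ == "__main__":
    sample = [
        ("math", "The answer is B.", "B"),
        ("math", "", "C"),
        ("science & technology", "A", "A"),
    ]
    print(per_category_accuracy(sample))
```

If `extract_option` returns `None` for most predictions, that would suggest the zero scores stem from the format of the model output rather than from its choices; this is a diagnostic heuristic, not a confirmed root cause.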

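### Pre-checks

- [x] I have read the Quick Start in the home page documentation, and it did not solve the problem
- [x] I have searched the FAQ, and there is no duplicate issue
- [x] I have searched the existing issues, and there is no duplicate issue
- [x] I have updated to the latest version, and the problem persists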
