
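[Question] Do AISBench and EvalScope compute performance metrics differently? With the same service configuration and the same test scenario, the performance results differ significantly. #375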

Description

@Zz-Dai

Question description

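I deployed the DeepSeek-V4-Flash-W8A8-MTP model on a single 910B4-1 node with 8 cards and tested a 5120-token input / 512-token output scenario. The AISBench and EvalScope results differ considerably on the TPOT metric.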

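AISBench results:

| Scenario | Concurrency | TTFT avg (s) | TTFT P99 (s) | TPOT avg (s) | TPOT P99 (s) | Total throughput, TPS (input + output) | Output throughput, TPS | Avg TPS per concurrent request (output only) |
|---|---|---|---|---|---|---|---|---|
| Input 5120 / output 512 | 1 | 0.494 | 0.516 | 0.027 | 0.034 | 400.833 | 36.414 | 36.414 |
| | 2 | 0.613 | 0.940 | 0.027 | 0.034 | 766.749 | 69.655 | 34.828 |
| | 4 | 0.645 | 1.480 | 0.029 | 0.039 | 1448.444 | 131.588 | 32.897 |
| | 8 | 0.757 | 2.557 | 0.033 | 0.044 | 2047.012 | 224.660 | 28.082 |
| | 16 | 0.884 | 4.778 | 0.041 | 0.056 | 4087.934 | 371.367 | 23.210 |
| | 32 | 1.436 | 9.003 | 0.066 | 0.092 | 5008.737 | 455.017 | 14.219 |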
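
The derived columns in the AISBench table above can be sanity-checked against the raw output throughput. The sketch below assumes, without confirmation from the AISBench documentation, that total TPS equals output TPS scaled by (input + output) / output tokens, and that per-concurrency TPS equals output TPS divided by concurrency. It uses the 5124 measured input tokens reported in this issue. The helper name `inconsistent_rows` and the 1% tolerance are illustrative choices.

```python
# Consistency check for the AISBench table.
# Assumed relationships (not confirmed by the tool's documentation):
#   total TPS           = output TPS * (input + output) / output
#   per-concurrency TPS = output TPS / concurrency
INPUT_TOKENS = 5124   # measured input tokens reported in the issue
OUTPUT_TOKENS = 512

# (concurrency, total TPS, output TPS, per-concurrency output TPS), copied from the table
AISBENCH_ROWS = [
    (1, 400.833, 36.414, 36.414),
    (2, 766.749, 69.655, 34.828),
    (4, 1448.444, 131.588, 32.897),
    (8, 2047.012, 224.660, 28.082),
    (16, 4087.934, 371.367, 23.210),
    (32, 5008.737, 455.017, 14.219),
]


def inconsistent_rows(rows, rel_tol=0.01):
    """Return the concurrency levels whose reported values deviate from the assumed formulas."""
    scale = (INPUT_TOKENS + OUTPUT_TOKENS) / OUTPUT_TOKENS
    flagged = []
    for concurrency, total_tps, output_tps, per_conc_tps in rows:
        expected_total = output_tps * scale
        expected_per_conc = output_tps / concurrency
        total_off = abs(expected_total - total_tps) / total_tps > rel_tol
        per_conc_off = abs(expected_per_conc - per_conc_tps) / per_conc_tps > rel_tol
        if total_off or per_conc_off:
            flagged.append(concurrency)
    return flagged


if __name__ == "__main__":
    print(inconsistent_rows(AISBENCH_ROWS))
```

Under these assumptions only the concurrency-8 row is flagged: 224.660 × 5636 / 512 ≈ 2473, whereas the table reports 2047.012, so that cell may be a transcription error rather than a measured value.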

Test command:
ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --summarizer default_perf --mode perf --num-prompts 16
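The gsm8k dataset's test.jsonl was modified so that the input length is 5120 (the input tokens actually measured in the test were 5124).
Corresponding vllm_api_stream_chat.py configuration: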

from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.postprocess.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr="vllm-api-stream-chat",
        path="/home/weights/DeepSeek-V4-Flash-w8a8-mtp",
        model="dsv4",
        request_rate=0.0,
        use_timestamp=False,
        retry=2,
        api_key="",
        host_ip="localhost",
        host_port=8900,
        url="",
        max_out_len=512,
        batch_size=1,
        trust_remote_code=True,
        generation_kwargs=dict(
            temperature=0,
            ignore_eos=True,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]

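EvalScope results:

| Scenario | Concurrency | TTFT avg (s) | TTFT P99 (s) | TPOT avg (s) | TPOT P99 (s) | Total throughput, TPS (input + output) | Output throughput, TPS | Avg TPS per concurrent request (output only) |
|---|---|---|---|---|---|---|---|---|
| Input 5120 / output 512 | 1 | 0.492 | 0.518 | 0.044 | 0.050 | 244.720 | 22.230 | 22.230 |
| | 2 | 0.678 | 0.949 | 0.043 | 0.050 | 486.950 | 44.240 | 22.120 |
| | 4 | 1.146 | 9.082 | 0.047 | 0.065 | 870.210 | 79.050 | 19.763 |
| | 8 | 0.939 | 6.871 | 0.053 | 0.080 | 1582.640 | 143.770 | 17.971 |
| | 16 | 1.071 | 5.489 | 0.062 | 0.097 | 2701.080 | 245.380 | 15.336 |
| | 32 | 1.621 | 9.354 | 0.095 | 0.148 | 3523.770 | 320.120 | 10.004 |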

Test command:

## perf.py
from evalscope.perf.main import run_perf_benchmark
from evalscope.perf.arguments import Arguments

task_cfg = Arguments(
    parallel=[1,2,4,8,16,32,34,35],
    number=[16,32,64,128,256,512,544,560],
    model='dsv4',
    url='http://localhost:8900/v1/chat/completions',
    api='openai',
    dataset='sharegpt',
    dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json',
    min_tokens=512,
    max_tokens=512,
    prefix_length=0,
    min_prompt_length=5120,
    max_prompt_length=5120,
    tokenizer_path='/home/weights/DeepSeek-V4-Flash-w8a8-mtp',
    extra_args={'ignore_eos': True}
)
results = run_perf_benchmark(task_cfg)

python perf.py

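As the results show, in the same scenario AISBench's TPOT is consistently better (lower) than EvalScope's, which appears in the output throughput as AISBench reporting higher values than EvalScope. Do these two tools differ in how they compute performance statistics?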
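
One way to narrow this down is to check whether each tool's numbers are internally consistent. The sketch below cross-checks the concurrency-1 rows by computing the per-request duration in two ways: from the output throughput, and from TTFT plus TPOT. It assumes the commonly used definition TPOT = (end-to-end latency − TTFT) / (output tokens − 1) and that every request produced exactly 512 output tokens. Neither assumption is confirmed for either tool, and the function names are illustrative.

```python
OUTPUT_TOKENS = 512  # assumes ignore_eos forced exactly 512 output tokens per request


def latency_from_throughput(concurrency, output_tps, output_tokens=OUTPUT_TOKENS):
    """Per-request duration implied by aggregate output throughput at a fixed concurrency."""
    return concurrency * output_tokens / output_tps


def latency_from_ttft_tpot(ttft, tpot, output_tokens=OUTPUT_TOKENS):
    """Per-request duration implied by TTFT plus (N - 1) inter-token intervals."""
    return ttft + (output_tokens - 1) * tpot


# Concurrency-1 rows from the two tables: (avg TTFT, avg TPOT, output TPS)
REPORTED = {
    "AISBench": (0.494, 0.027, 36.414),
    "EvalScope": (0.492, 0.044, 22.230),
}

for tool, (ttft, tpot, tps) in REPORTED.items():
    from_tps = latency_from_throughput(1, tps)
    from_tpot = latency_from_ttft_tpot(ttft, tpot)
    print(f"{tool}: {from_tps:.2f} s from throughput, {from_tpot:.2f} s from TTFT/TPOT")
```

For both tools the two estimates agree within about 2% (roughly 14 s per request for AISBench and 23 s for EvalScope). If the assumptions hold, this suggests the requests themselves took different amounts of time in the two runs, rather than identical timings being summarized with different formulas. The check does not identify the cause. In particular, it cannot rule out effects from the different prompt sources (modified gsm8k versus ShareGPT) or from how each tool counts output tokens.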

Appendix

vllm bench test:

Test command:

for CONCURRENCY in 1 2 4; do
    NUM_PROMPTS=$((CONCURRENCY * 16))

    echo "Running benchmark with concurrency: ${CONCURRENCY}, num_prompts: ${NUM_PROMPTS}"

    vllm bench serve \
        --backend vllm \
        --host 80.48.37.142 \
        --port 8900 \
        --model dsv4 \
        --tokenizer /home/weights/DeepSeek-V4-Flash-w8a8-mtp \
        --dataset-name sharegpt \
        --dataset-path dsv4-evalscope/ShareGPT_V3_unfiltered_cleaned_split.json \
        --num-prompts ${NUM_PROMPTS} \
        --request-rate ${CONCURRENCY} \
        --input-len 5120 \
        --output-len 512 \
        --ignore-eos \
        --save-result \
        --result-dir ./benchmark_results
done

Test results:

| Concurrency | TTFT avg (s) | TTFT P99 (s) | TPOT avg (s) | TPOT P99 (s) |
|---|---|---|---|---|
| 1 | 4.481 | 8.457 | 0.059 | 0.084 |
| 2 | 5.849 | 8.515 | 0.065 | 0.085 |
| 4 | 23.227 | 77.846 | 0.125 | 0.184 |
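
When comparing these numbers with the tables above, note that `--request-rate` in `vllm bench serve` sets the arrival rate in requests per second, while a cap on in-flight requests is set with `--max-concurrency`. The "concurrency" values here are therefore arrival rates. As a rough sketch, Little's law (in-flight ≈ arrival rate × latency) estimates how many requests were actually in flight, capped by the number of prompts in each run. The function name is illustrative, and the latency is approximated as TTFT + 511 × TPOT from the averages in the table.

```python
def estimated_in_flight(request_rate, ttft, tpot, num_prompts, output_tokens=512):
    """Little's law estimate (L = lambda * W), capped by the total number of requests in the run."""
    latency = ttft + (output_tokens - 1) * tpot  # approximate end-to-end latency in seconds
    return min(num_prompts, request_rate * latency)


# (request rate, avg TTFT, avg TPOT) from the vllm bench table; num_prompts = rate * 16 as in the loop
for rate, ttft, tpot in [(1, 4.481, 0.059), (2, 5.849, 0.065), (4, 23.227, 0.125)]:
    print(rate, estimated_in_flight(rate, ttft, tpot, num_prompts=rate * 16))
```

With the reported averages, the estimate hits the `num_prompts` cap in all three runs (16, 32, and 64 requests), so the server was probably handling far more than 1, 2, or 4 simultaneous requests. That would be consistent with the much higher TTFT and TPOT in this table, although the sketch is only an approximation.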

Pre-submission checks

  • I have read the quick start in the main documentation, and it does not answer my question.
