APITest test optimization: add GPU Cache mode to significantly speed up testing by cangtianhuang · Pull Request #648 · PFCCLab/PaddleAPITest

cangtianhuang · 2026-06-15T06:07:04Z

Title: ⚡ Add GPU Cache Mode to speed up Big Tensor Accuracy tests

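🧭 Background

In big tensor / accuracy scenarios, PaddleAPITest's own framework overhead is significant. Historical heatmaps show that it is concentrated in:

- generating large-shape numpy inputs
- constructing Paddle / Torch tensors and moving data between CPU and GPU
- copying large tensor outputs back to the CPU for the accuracy comparison
- frequently clearing the GPU allocator cache after each case

This PR adds an explicit opt-in flag, --use_gpu_cache_mode. In big tensor scenarios it reuses GPU inputs, reduces CPU/GPU round trips, and performs part of the accuracy comparison on the GPU, which lowers the fixed framework-side overhead of PaddleAPITest.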

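🔎 Problem analysis

A single-case heatmap for paddle.matmul(Tensor([1, 32, 1024], bfloat16), Tensor([1024, 32768], bfloat16)) shows the following for the default accuracy path:

| Mode | `case.test_total` avg | Main hotspots |
| --- | --- | --- |
| Default accuracy | 1111.42 ms/iter | `gen_numpy_input` 669.29 ms; `torch_assert_accuracy` 220.74 ms |
| cached numpy accuracy | 356.94 ms/iter | `torch_assert_accuracy` 160.75 ms; numpy generation about 0.17 ms |

This shows that the main framework overhead of the original path comes from two stages:

- generating large tensor numpy inputs and constructing the inputs from CPU to GPU
- copying large tensor outputs back to the CPU and running the full accuracy comparison there

cached numpy removes the numpy generation cost, but the CPU-side comparison and the input/output transfers then become the next bottleneck.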

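🛠️ Implementation

GPU cache mode is provided as an explicit performance mode and does not change the default accuracy / paddle_only semantics. When it is enabled, the overall data flow becomes:

Generate/cache Torch inputs on the GPU
        │
        ├── Torch path reuses the GPU tensor directly
        │
        └── Paddle path clones a Paddle tensor from the same data via DLPack

Paddle/Torch forward/backward
        │
        └── Non test-tolerance paths prefer to run assert_close on the GPU

The mode is enabled explicitly through a parameter and environment variables and is off by default. Paths that need CPU tolerance output, such as test_tol=True, keep the original CPU compare behavior.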

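🔧 Main changes

1. Add the `--use_gpu_cache_mode` performance mode

A use_gpu_cache_mode parameter is added to engineV2 / engineV4 and passed through. When enabled, it:

- sets USE_GPU_CACHE_MODE=True
- sets SKIP_GPU_CLEANUP=True
- automatically ignores --use_cached_numpy=True
- uses the GPU tensor cache for eligible tensors
- keeps the accuracy comparison on the GPU wherever possible
- skips some of the GPU empty_cache() calls after a case finishes, reducing allocator cleanup overhead

The mode is off by default and requires an explicit opt-in from the user.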
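
The option handling listed above could look roughly like the following. This is a minimal sketch, not the engine's actual code: `apply_gpu_cache_mode` and the shape of its return value are hypothetical, and only the two environment variable names and the precedence over `--use_cached_numpy` come from the description.

```python
import os


def apply_gpu_cache_mode(use_gpu_cache_mode: bool, use_cached_numpy: bool, env=None) -> dict:
    """Return the effective options and export the related environment variables.

    Hypothetical helper: the real engine code may organize this differently.
    """
    env = os.environ if env is None else env
    if not use_gpu_cache_mode:
        # Default path: leave the environment and the numpy cache option untouched.
        return {"use_gpu_cache_mode": False, "use_cached_numpy": use_cached_numpy}

    # The two variables named in the PR description.
    env["USE_GPU_CACHE_MODE"] = "True"
    env["SKIP_GPU_CLEANUP"] = "True"
    # GPU cache mode takes precedence, so a requested numpy cache is ignored.
    return {"use_gpu_cache_mode": True, "use_cached_numpy": False}
```

Passing a plain dict as `env` keeps the sketch side-effect free; omitting it writes to `os.environ`, so child worker processes would inherit the settings.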

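2. Introduce a GPU tensor cache

New GPU cache logic:

- adds the cached_gpu_inputs cache
- adds helpers for checking GPU cache mode and for clearing the cache
- constructs the Torch tensor directly on the GPU for eligible tensors
- clones the Paddle tensor from it via DLPack
- gives Paddle and Torch the same random data, reducing CPU numpy generation and CPU->GPU transfers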
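
The caching idea can be illustrated with a small framework-free sketch. Everything here is an assumption made for illustration: the cache key of `(shape, dtype)`, the eligibility rule in `is_cacheable`, and the helper names. Only the name `cached_gpu_inputs` appears in the description. The tensor factory is injected so the sketch runs without a GPU.

```python
from typing import Callable, Dict, Tuple

CacheKey = Tuple[Tuple[int, ...], str]

# Hypothetical stand-in for the cached_gpu_inputs cache named in the PR.
cached_gpu_inputs: Dict[CacheKey, object] = {}


def is_cacheable(dtype: str) -> bool:
    # Assumed eligibility rule: only floating-point dtypes use the GPU cache.
    return dtype in {"float16", "bfloat16", "float32", "float64"}


def get_cached_input(shape, dtype: str, build: Callable[[tuple, str], object]):
    """Return the cached tensor for (shape, dtype), building it on first use."""
    if not is_cacheable(dtype):
        return build(tuple(shape), dtype)  # bypass the cache entirely
    key = (tuple(shape), dtype)
    if key not in cached_gpu_inputs:
        cached_gpu_inputs[key] = build(tuple(shape), dtype)
    return cached_gpu_inputs[key]


def clear_gpu_cache() -> None:
    """Drop all cached inputs so their device memory can be released."""
    cached_gpu_inputs.clear()
```

In a real run, `build` would create the tensor on the GPU, and consumers would clone the cached tensor before any in-place operation so that later cases still see the original data.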

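3. Optimize the big tensor accuracy compare path

Accuracy compare behavior is adjusted as follows:

- The default mode keeps the original CPU compare behavior.
- test_tol=True still uses the CPU compare.
- When GPU cache mode is enabled and the path is not a test tolerance path:
  - the Paddle output is converted to a Torch tensor via DLPack
  - it is compared with the Torch output directly on the GPU using torch.testing.assert_close()
  - this reduces the cost of moving large tensor outputs back to the CPU
- In GPU cache mode, Torch outputs avoid .cpu().detach() where possible, to prevent unnecessary data movement.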
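
A rough sketch of the on-device comparison follows. It assumes PyTorch is installed; `compare_on_device` and the tolerance values are hypothetical, and a Torch tensor stands in for the Paddle output because any object exposing the DLPack protocol is consumed the same way by `torch.utils.dlpack.from_dlpack`.

```python
import torch
from torch.utils import dlpack


def compare_on_device(framework_output, torch_output, rtol=1e-2, atol=1e-2):
    """Compare two outputs without copying either one to the CPU.

    `framework_output` is anything exposing the DLPack protocol; in the PR it
    would be a Paddle tensor, here a Torch tensor stands in for it.
    """
    # Zero-copy view of the other framework's buffer as a Torch tensor.
    as_torch = dlpack.from_dlpack(framework_output)
    # Raises AssertionError if the values differ beyond the tolerances.
    torch.testing.assert_close(as_torch, torch_output, rtol=rtol, atol=atol)


device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU fallback for the sketch
expected = torch.randn(4, 8, device=device)
actual = expected.clone()
compare_on_device(actual, expected)  # passes silently when the outputs match
```

Because both operands stay on the same device, the comparison itself runs there and only the pass/fail result needs to reach the host.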

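4. Reduce GPU memory cleanup overhead

In GPU cache mode, some torch.cuda.empty_cache() and paddle.device.cuda.empty_cache() calls are skipped, reducing the synchronization overhead caused by clearing the allocator cache when big tensor cases run back to back.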
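
One way to gate the cleanup is sketched below. `maybe_empty_cache` is a hypothetical helper, and the exact set of cleanup calls that the PR skips is not specified in the description; the cleanup callables are injected so the sketch runs without a GPU.

```python
import os


def maybe_empty_cache(cleanup_fns, env=None) -> bool:
    """Run allocator cleanup callables unless SKIP_GPU_CLEANUP is set.

    Returns True if cleanup ran. In real code `cleanup_fns` would hold
    torch.cuda.empty_cache and paddle.device.cuda.empty_cache.
    """
    env = os.environ if env is None else env
    if env.get("SKIP_GPU_CLEANUP", "False") == "True":
        return False  # keep the allocator cache warm for the next case
    for fn in cleanup_fns:
        fn()
    return True
```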

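5. Adapt the run scripts and profiling tool

- Update the run script examples so that gpu cache mode is easy to enable explicitly.
- Update the matmul heatmap profiling tool so that it can express the engine-equivalent --use_gpu_cache_mode=True.
- Tolerate a missing cache event list, so that the export stage does not fail after the heatmap run has already finished.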

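📁 Changed files

| Category | File | Description |
| --- | --- | --- |
| Engine parameters | `engineV2.py` | Adds and passes through `use_gpu_cache_mode`; sets the GPU cache mode environment variables and behavior |
| Engine parameters | `engineV4.py` | Adapts to the GPU cache mode parameter and run path |
| GPU cache | `tester/api_config/config_analyzer.py` | Adds the GPU tensor cache, mode check, cache clearing, and the GPU tensor construction path |
| Accuracy compare | `tester/base.py` | Supports GPU-side compare in GPU cache mode, reducing copies back to the CPU |
| Accuracy flow | `tester/accuracy.py` | Passes through and uses gpu cache mode; avoids unnecessary CPU detach of Torch outputs |
| Run scripts | `run-example.sh` | Adds a gpu cache mode example |
| Run scripts | `run-v4.sh` | Adds a gpu cache mode example |
| Profiling | `tools/prof/paddleapitest_matmul_heatmap.py` | Supports `--use-gpu-cache-mode` and generates gpu cache mode heatmap data |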

✅ Validation

Run the gpu cache mode heatmap on gpu3:

CUDA_VISIBLE_DEVICES=3 python3 tools/prof/paddleapitest_matmul_heatmap.py \
  --mode accuracy \
  --device gpu:0 \
  --use-gpu-cache-mode \
  --warmup 2 \
  --repeat 10 \
  --output-dir profiler_outputs/matmul_accuracy_gpu_cache_mode_heatmap_gpu3

Steady-state results in GPU cache mode:

| Stage | avg |
| --- | --- |
| `case.test_total` | 114.90 ms/iter |
| `APITestAccuracy.test` | 114.87 ms/iter |
| `APITestBase.torch_assert_accuracy` | 1.96 ms/iter |
| `APITestBase.gen_torch_output_and_output_grad` | 0.41 ms/iter |
| `APITestBase.gen_paddle_input` | 0.22 ms/iter |
| `APITestBase.gen_torch_input` | 0.21 ms/iter |
| `APITestBase.gen_numpy_input` | 0.11 ms/iter |
| `TensorConfig.get_paddle_tensor` | 0.11 ms/iter |
| `TensorConfig.get_torch_tensor` | 0.11 ms/iter |
| `TensorConfig.get_numpy_tensor` | 0.07 ms/iter |
| `TensorConfig.get_gpu_paddle_tensor` | 0.04 ms/iter |
| `TensorConfig.get_gpu_torch_tensor` | 0.03 ms/iter |
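
For readers unfamiliar with steady-state averages, the sketch below shows the general idea behind a warmup/repeat measurement: untimed warmup iterations followed by a timed average. It is not the profiling tool's implementation; `steady_state_avg_ms` is a hypothetical helper whose defaults mirror the `--warmup 2 --repeat 10` values used in the command.

```python
import time


def steady_state_avg_ms(fn, warmup: int = 2, repeat: int = 10) -> float:
    """Average wall-clock time of fn() in ms/iter, excluding warmup runs."""
    for _ in range(warmup):
        fn()  # warmup iterations populate caches and are not timed
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) * 1000.0 / repeat
```

When `fn` launches GPU work, it would also need to synchronize the device before returning; otherwise the clock only captures kernel launch time.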

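Performance gains:

| Baseline | Change | Speedup | Time reduction |
| --- | --- | --- | --- |
| Default accuracy | 1111.42 -> 114.90 ms/iter | about 9.7x | about 89.7% |
| cached numpy accuracy | 356.94 -> 114.90 ms/iter | about 3.1x | about 67.8% |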
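
The speedup and time-reduction figures follow directly from the measured averages. The short check below reproduces them; the function name is illustrative only.

```python
def speedup_and_reduction(before_ms: float, after_ms: float):
    """Return (speedup factor, percentage of time saved)."""
    speedup = before_ms / after_ms
    reduction_pct = (1.0 - after_ms / before_ms) * 100.0
    return speedup, reduction_pct


for name, before in [("default accuracy", 1111.42), ("cached numpy accuracy", 356.94)]:
    speedup, reduction = speedup_and_reduction(before, 114.90)
    print(f"{name}: {speedup:.1f}x, {reduction:.1f}% less time")
# default accuracy: 9.7x, 89.7% less time
# cached numpy accuracy: 3.1x, 67.8% less time
```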

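Conclusion

GPU cache mode largely removes the main framework-side bottlenecks:

- gen_numpy_input drops from 669.29 ms/iter in default accuracy to 0.11 ms/iter
- torch_assert_accuracy drops from 160.75 ms/iter with cached numpy to 1.96 ms/iter
- fetching Paddle / Torch input tensors drops to about 0.1 ms/iter on both sides
- total time per case drops from 1111.42 ms/iter in default accuracy to 114.90 ms/iter

The main remaining bottleneck is the Paddle/Torch matmul forward/backward kernel execution and synchronization wait inside APITestAccuracy.test, which has not yet been broken down further. It is left for longer-term follow-up work.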

cangtianhuang and others added 6 commits June 15, 2026 13:53

⚡️ Support gpu cache mode to accelerate testing

bc51803

🚸 Optimize run sh

06b839e

Merge remote-tracking branch 'upstream/main' into perf/big-tensor

5eaf2b0

Merge remote-tracking branch 'upstream/main' into perf/big-tensor

9aed219

Merge remote-tracking branch 'upstream/main' into perf/big-tensor

81370a5

✨ Add apitest self heatmap test

8ceb9fe

cangtianhuang merged commit d803faf into PFCCLab:main Jun 22, 2026
1 check passed

cangtianhuang merged 6 commits into PFCCLab:main from cangtianhuang:perf/big-tensor
