APITest test optimization: add GPU Cache mode to significantly speed up testing #648
Merged
Title: ⚡ Add GPU Cache Mode to speed up Big Tensor Accuracy tests
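🧭 Background
In big tensor / accuracy scenarios, the overhead of the PaddleAPITest framework itself is significant. Historical heatmaps show where this overhead is concentrated; the problem analysis below gives the details.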
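This PR adds an explicit opt-in `--use_gpu_cache_mode` flag for big tensor scenarios. It reuses GPU inputs, reduces CPU/GPU round trips, and performs part of the accuracy compare on the GPU, which lowers the fixed framework-side overhead of PaddleAPITest.
🔎 Problem analysis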
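A single-case heatmap for `paddle.matmul(Tensor([1, 32, 1024], bfloat16), Tensor([1024, 32768], bfloat16))` shows the following average times under `case.test_total`:
- Default accuracy path: `gen_numpy_input` 669.29 ms; `torch_assert_accuracy` 220.74 ms
- With cached numpy: `torch_assert_accuracy` 160.75 ms; numpy generation about 0.17 ms
So the main framework overhead in the original path comes from two stages: numpy input generation (`gen_numpy_input`) and the accuracy compare (`torch_assert_accuracy`).
Cached numpy solves the numpy generation problem, but the CPU-side compare and the input/output transfers then become the next bottleneck.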
🛠️ Implementation
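GPU cache mode is provided as an explicit performance mode and does not change the default accuracy / paddle_only semantics. Once enabled, the overall data flow changes so that GPU inputs are reused, CPU/GPU round trips are reduced, and part of the accuracy compare runs on the GPU.
The mode is enabled explicitly through environment variables and a command-line argument, and is off by default. Paths that need CPU tolerance output, such as `test_tol=True`, keep the original CPU compare behavior.
🔧 Main changes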
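The input-reuse part of the data flow described in the implementation notes can be pictured as a cache keyed by tensor configuration. The following is a minimal sketch under stated assumptions: the `GpuInputCache` class, its `get` method, and the injected `upload` callable are hypothetical names for illustration, and only the attribute name `cached_gpu_inputs` is taken from this PR. It is not PaddleAPITest's actual implementation.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class GpuInputCache:
    """Memoize device-resident inputs by tensor configuration (illustrative only)."""

    def __init__(self, upload: Callable[[Tuple[int, ...], str], Any]):
        # `upload` builds a tensor on the device. In a real setup it might wrap
        # something like torch.from_numpy(arr).cuda(); that is an assumption here.
        self._upload = upload
        self.cached_gpu_inputs: Dict[Hashable, Any] = {}
        self.uploads = 0  # number of host-to-device transfers actually performed

    def get(self, shape, dtype: str) -> Any:
        key = (tuple(shape), dtype)
        if key not in self.cached_gpu_inputs:
            # Cache miss: pay the transfer cost once for this configuration.
            self.cached_gpu_inputs[key] = self._upload(tuple(shape), dtype)
            self.uploads += 1
        return self.cached_gpu_inputs[key]


def fake_upload(shape, dtype):
    """Stand-in for a real device transfer so the sketch runs without a GPU."""
    return {"shape": shape, "dtype": dtype, "device": "gpu"}


cache = GpuInputCache(fake_upload)
a = cache.get([1, 32, 1024], "bfloat16")
b = cache.get([1, 32, 1024], "bfloat16")   # same configuration: served from cache
w = cache.get([1024, 32768], "bfloat16")   # new configuration: one more upload
```

Reusing a cached input like this is only safe when the API under test does not modify its inputs in place; a real cache would need to account for that.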
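1. Add the `--use_gpu_cache_mode` performance mode
The `use_gpu_cache_mode` parameter is added to engineV2 / engineV4 and passed through. When enabled:
- `USE_GPU_CACHE_MODE=True`
- `SKIP_GPU_CLEANUP=True`
- `--use_cached_numpy=True`
- some `empty_cache()` calls are skipped, reducing allocator cleanup overhead
This mode is off by default and requires an explicit opt-in from the user.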
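As a rough illustration of how an opt-in flag could fan out into the settings listed above, here is a sketch using only the standard library. The helper names `parse_bool`, `apply_gpu_cache_mode`, and `maybe_cleanup` are hypothetical, and the assumption that cache mode forces cached numpy on is inferred from the list rather than confirmed; the real engines may wire this differently.

```python
import argparse
import os


def parse_bool(text: str) -> bool:
    """Interpret 'True'/'False'-style command-line values (hypothetical helper)."""
    return text.strip().lower() in {"1", "true", "yes"}


def apply_gpu_cache_mode(argv, env=None):
    """Parse the flag and, only if it is set, export the related switches."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--use_gpu_cache_mode", type=parse_bool, default=False)
    parser.add_argument("--use_cached_numpy", type=parse_bool, default=False)
    args = parser.parse_args(argv)

    if args.use_gpu_cache_mode:  # opt-in only: defaults stay untouched otherwise
        env["USE_GPU_CACHE_MODE"] = "True"
        env["SKIP_GPU_CLEANUP"] = "True"
        args.use_cached_numpy = True  # assumption: cache mode implies cached numpy
    return args


def maybe_cleanup(cleanup_fn, env=None) -> bool:
    """Run cleanup_fn unless SKIP_GPU_CLEANUP is set; return whether it ran."""
    env = os.environ if env is None else env
    if env.get("SKIP_GPU_CLEANUP") == "True":
        return False
    cleanup_fn()
    return True
```

In a real run, `cleanup_fn` would be something like `torch.cuda.empty_cache`; passing a plain dict as `env` keeps the sketch from touching the process environment.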
2. Introduce a GPU tensor cache
New GPU cache logic:
- `cached_gpu_inputs` cache
3. Optimize the big tensor accuracy compare path
Accuracy compare behavior is adjusted:
- `test_tol=True` still keeps the CPU compare
- `torch.testing.assert_close()`
- `.cpu().detach()`: avoids unnecessary data movement
4. Reduce GPU memory cleanup overhead
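In GPU cache mode, some `torch.cuda.empty_cache()` and `paddle.device.cuda.empty_cache()` calls are skipped, which reduces the synchronization overhead caused by allocator cache cleanup when big tensor cases run back to back.
5. Adapt the run scripts and profiling tool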
- `--use_gpu_cache_mode=True`
📁 Changed files
- `engineV2.py`: `use_gpu_cache_mode`; sets the GPU cache mode environment variables and behavior
- `engineV4.py`
- `tester/api_config/config_analyzer.py`
- `tester/base.py`
- `tester/accuracy.py`
- `run-example.sh`
- `run-v4.sh`
- `tools/prof/paddleapitest_matmul_heatmap.py`: `--use-gpu-cache-mode`; generates gpu cache mode heatmap data
✅ Verification
The gpu cache mode heatmap was run on `gpu3`. The steady-state results under GPU cache mode cover the following entries:
- `case.test_total`
- `APITestAccuracy.test`
- `APITestBase.torch_assert_accuracy`
- `APITestBase.gen_torch_output_and_output_grad`
- `APITestBase.gen_paddle_input`
- `APITestBase.gen_torch_input`
- `APITestBase.gen_numpy_input`
- `TensorConfig.get_paddle_tensor`
- `TensorConfig.get_torch_tensor`
- `TensorConfig.get_numpy_tensor`
- `TensorConfig.get_gpu_paddle_tensor`
- `TensorConfig.get_gpu_torch_tensor`
The performance gains are summarized below.
Conclusion
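GPU cache mode largely resolves the original main framework-side bottlenecks:
- `gen_numpy_input` drops from 669.29 ms/iter in default accuracy to 0.11 ms/iter
- `torch_assert_accuracy` drops from 160.75 ms/iter after cached numpy to 1.96 ms/iter
The main remaining bottleneck is the Paddle/Torch matmul forward/backward kernel execution and synchronization wait inside `APITestAccuracy.test`, which has not yet been broken down further and is left for longer-term follow-up.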
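To put the two reported reductions in proportion, the ratios can be worked out directly from the figures in the conclusion. This is plain arithmetic on the numbers quoted in this PR; note that the two stages use different baselines (default accuracy for `gen_numpy_input`, cached numpy for `torch_assert_accuracy`), so the ratios are not directly comparable with each other.

```python
# (before_ms, after_ms) per iteration, as quoted in the conclusion.
timings_ms = {
    "gen_numpy_input": (669.29, 0.11),        # baseline: default accuracy path
    "torch_assert_accuracy": (160.75, 1.96),  # baseline: after cached numpy
}

# Ratio of time before to time after for each stage.
speedups = {name: before / after for name, (before, after) in timings_ms.items()}

for name, ratio in speedups.items():
    print(f"{name}: roughly {ratio:.0f}x less time per iteration")
```

These ratios describe framework-side stages only; they say nothing about the kernel execution time that the conclusion identifies as the remaining bottleneck.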