APITest test fix: fix stale checkpoint entries left behind when a run is force-killed externally #646
Merged
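Background
When PaddleAPITest's engineV2 / engineV4 run API cases in batch, the worker used to write the checkpoint early, shortly after it started executing a case. This leads to inconsistencies when a run is interrupted abnormally:
- After a case is killed by SIGKILL/SIGTERM, the next resume incorrectly skips that case.
This PR changes the checkpoint write semantics: a case is written to the checkpoint only after the main process has confirmed that its result has been archived/classified. Neither engineV2 nor engineV4 writes a checkpoint for an external kill, so the case re-enters the pending set on the next run. engineV4 does not retry it immediately within the current run, so a case that keeps getting killed cannot block the current batch.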
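The ordering described above (classify first, checkpoint second, never checkpoint an external kill) can be illustrated with a minimal sketch. This is not the PR's code: `is_external_kill`, `finalize_case`, and the `classify` callback are hypothetical names, and the sketch assumes the POSIX convention where Python reports death-by-signal as a negative exit code.

```python
import signal

# Signals treated as an "external kill" in this sketch (assumption).
EXTERNAL_KILL_SIGNALS = {int(signal.SIGKILL), int(signal.SIGTERM)}


def is_external_kill(returncode):
    """True if the exit code means the worker died from SIGKILL/SIGTERM.

    On POSIX, subprocess and multiprocessing report a signal death as -signum.
    """
    return (
        returncode is not None
        and returncode < 0
        and -returncode in EXTERNAL_KILL_SIGNALS
    )


def finalize_case(case, returncode, classify, checkpoint_path):
    """Archive the result first, then checkpoint; skip both on an external kill.

    `classify` is a hypothetical callback that archives the result and
    returns a status such as "done", "timeout", "crashed" or "error".
    """
    if is_external_kill(returncode):
        # Not checkpointed, so the next run puts the case back in the pending set.
        return None
    status = classify(case, returncode)
    # Only reached once classification has completed.
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        f.write(case + "\n")
    return status
```

Because the checkpoint append happens after `classify` returns, a kill that lands mid-case leaves no checkpoint entry behind.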
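In addition, engineV4's compute-sanitizer mode now isolates the child process's result logs at the source. The sanitizer child no longer writes directly to the main result log directory; it first writes to a per-case isolated log directory. The parent worker merges the isolated logs back into the main .tmp logs only when the sanitizer return code indicates that the child's results can be trusted. If the sanitizer reports an error, the isolated logs are discarded and the main process writes the single cuda_error classification, which avoids creating conflicts first and deduplicating them afterwards.
Main changes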
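1. Adjust engineV2 checkpoint write timing
- Remove the checkpoint write at the start of run_test_case().
- Handle cases killed by SIGKILL/SIGTERM specially: no checkpoint is written, so the case re-enters the pending set on the next run.
2. Adjust engineV4 checkpoint and external kill handling
- Remove the checkpoint write at the start of run_test_case().
- Write the checkpoint in one place, after the done/timeout/crashed/error classification.
- SIGKILL/SIGTERM cases: no checkpoint is written and the case is not retried within the current run, so a case that keeps getting killed cannot block the current batch.
3. Improve engineV4 compute-sanitizer crash semantics
- crash_source="child".
4. Isolate compute-sanitizer child process result logs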
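As a rough illustration of the isolation scheme in this item, the sketch below writes nothing to the main .tmp until the return code is known, then either merges or discards the isolated directory. The helper names (`isolated_dir`, `settle_sanitizer_logs`), the flat one-file-per-result layout, and append-style merging are assumptions for illustration, not the PR's implementation.

```python
import shutil
from pathlib import Path

# Per the PR text, return codes 0 and 2 mean the child flow reached the wrapper layer.
MERGE_RETURNCODES = {0, 2}


def isolated_dir(tmp_root, slot, pid):
    """Per-case directory following the .tmp/sanitizer/slot_<slot>_<pid> pattern."""
    return Path(tmp_root) / "sanitizer" / f"slot_{slot}_{pid}"


def settle_sanitizer_logs(tmp_root, iso_dir, returncode):
    """Merge isolated logs into the main .tmp on trusted exits, else discard them.

    Returns True when the logs were merged. On False, the caller (the main
    process) is expected to write the final classification itself.
    """
    iso_dir = Path(iso_dir)
    merged = returncode in MERGE_RETURNCODES
    if merged and iso_dir.is_dir():
        for src in sorted(iso_dir.iterdir()):
            if src.is_file():
                # Append to the same-named log in the main .tmp directory.
                with open(Path(tmp_root) / src.name, "a", encoding="utf-8") as dst:
                    dst.write(src.read_text(encoding="utf-8"))
    # The isolated directory is removed in both cases, so nothing is left behind.
    shutil.rmtree(iso_dir, ignore_errors=True)
    return merged
```

Deciding at merge time keeps a config out of the pass log whenever the sanitizer exit code later turns out to be untrustworthy.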
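- The sanitizer child no longer writes directly to TEST_LOG_PATH/.tmp.
- It writes to an isolated directory instead: TEST_LOG_PATH/.tmp/sanitizer/slot_<slot>_<pid>/...
- Return code 0 or 2: the child process ran all the way through to the wrapper layer. The isolated .tmp artifacts are merged back into the main .tmp and are still aggregated by the existing aggregate_logs().
- Return code 98, 99, the sanitizer error exitcode, or any other non-zero exit: the isolated result logs are not merged, and the main process writes the final classification.
- A sanitizer error is recorded only as cuda_error, so the same config cannot appear in several result logs at once, such as pass and cuda_error.
- Options / Paddle version information: avoids repeating the startup parameters in the main log for every case.
- .tmp/sanitizer leftovers are cleaned up.
5. Add stale result log cleanup logic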
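Add cleanup_uncheckpointed_result_logs() to tester/api_config/log_writer.py:
- .tmp logs;
- final-state log lines for cases not recorded in checkpoint.txt;
- if checkpoint.txt does not exist, cleanup is skipped so that historical result logs are not deleted by mistake;
- the cleanup covers every kind of result log other than the checkpoint, for example pass, paddle error, accuracy error, timeout, crash, OOM, and CUDA error.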
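The cleanup rules above could look roughly like the following. This is a sketch under stated assumptions, not the body of cleanup_uncheckpointed_result_logs(): the function name `drop_uncheckpointed_lines` is hypothetical, and it assumes every result log is a `*.txt` file in one directory with exactly one config string per line.

```python
from pathlib import Path


def drop_uncheckpointed_lines(log_dir, checkpoint_name="checkpoint.txt"):
    """Drop result-log lines whose case is absent from the checkpoint.

    Returns the number of lines dropped, or None when cleanup is skipped
    because no checkpoint file exists.
    """
    log_dir = Path(log_dir)
    checkpoint = log_dir / checkpoint_name
    if not checkpoint.exists():
        # No checkpoint: leave historical result logs untouched.
        return None
    text = checkpoint.read_text(encoding="utf-8")
    done = {line.strip() for line in text.splitlines() if line.strip()}
    dropped = 0
    for log in sorted(log_dir.glob("*.txt")):
        if log.name == checkpoint_name:
            continue  # the checkpoint itself is never rewritten
        lines = log.read_text(encoding="utf-8").splitlines()
        kept = [line for line in lines if line.strip() in done]
        dropped += len(lines) - len(kept)
        log.write_text("".join(line + "\n" for line in kept), encoding="utf-8")
    return dropped
```

Returning None for the "no checkpoint" case makes the skip distinguishable from a run that found nothing to remove.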
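Behavior changes
Checkpoint semantics
Before:
After:
This prevents a case from being checkpointed before its archiving/classification has finished. Externally killed cases are not checkpointed and are retried on the next run.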
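On the resume side, the effect is that anything missing from the checkpoint is simply tested again. A minimal sketch, assuming the checkpoint is a text file with one case per line (`pending_cases` is a hypothetical helper, not a function from the PR):

```python
from pathlib import Path


def pending_cases(all_cases, checkpoint_path):
    """Return the cases not yet checkpointed, preserving input order."""
    path = Path(checkpoint_path)
    done = set()
    if path.exists():
        text = path.read_text(encoding="utf-8")
        done = {line.strip() for line in text.splitlines() if line.strip()}
    # A case killed externally was never checkpointed, so it stays pending.
    return [case for case in all_cases if case not in done]
```

With the old early-write behavior, a killed case would already be in `done` here and would be skipped.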
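External kill handling policy
compute-sanitizer result archiving policy
- When returncode == 0 or returncode == 2, the isolated results are merged back into the main log.
- When the sanitizer reports an error, the case is archived only as cuda_error.
- Fatal exits such as returncode == 98/99 no longer merge the isolated result logs; the main process archives them as oom / cuda_error respectively.
- Startup Options are not repeated in the main log for every case.
- The .tmp/sanitizer isolation directory is cleaned up automatically when a batch test starts and when it ends.
Result log recovery policy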