tune-all should look for runonce_gemm_cuda, runonce_gemm_cute, and runonce_gemm_ref. Rename the previous binary runonce_gemm to runonce_gemm_cute, so that tune-all can handle the case of multiple outputs.
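A rough sketch of the lookup, assuming tune-all searches a single build directory for the three binaries. The helper name `find_binaries` and the `bin_dir` argument are hypothetical. The note does not say how tune-all actually locates binaries.

```python
import os
from pathlib import Path

# Binary names taken from the note; runonce_gemm is the old name that
# is expected to be renamed to runonce_gemm_cute.
BINARY_NAMES = ["runonce_gemm_cuda", "runonce_gemm_cute", "runonce_gemm_ref"]


def find_binaries(bin_dir):
    """Return {name: Path} for each expected binary found in bin_dir.

    Hypothetical helper: a binary counts as found only if it is a
    regular file and is executable by the current user.
    """
    found = {}
    for name in BINARY_NAMES:
        candidate = Path(bin_dir) / name
        if candidate.is_file() and os.access(candidate, os.X_OK):
            found[name] = candidate
    return found
```

With this shape, a leftover runonce_gemm binary under the old name is simply ignored, and any of the three variants that is missing can be skipped rather than failing the whole run.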