Add VMP prepared to DFT variant #60
That's very strange! The product should be an order of magnitude slower. What are the dimensions of the vector and the matrix that you are using in the vector-matrix product, and how many products are you doing with the same prepared matrix? Also, I think that there is an avx variant of |
I put the wrong variant in the PR description, sorry. The AVX variant is correctly used (and is the bottleneck on my desktop, at ~50% of my OnionPIR execution time).
I am not sure why the product should be slower. Just to be clear, the version I added is the usual VMP dft to dft which takes a prepared matrix, but changing the input vector to be in reim4 DFT instead of reim DFT (PMAT x reim4 -> reim). Then instead of calling In my tests:
For those I get the very noticeable performance improvements (1.5x to 2.5x). I have also tried it in a smaller micro-benchmark for a half-external product, with
On my laptop these parameters give me a 1.2x speedup for a GLWEGadget x bivariate polynomial half-external product. |
Wow, I did not expect that the spqlios product would be that efficient. But if we count the cycles, it can make sense: 30 SIMD complex products (approx. 200 avx2 SIMD floating point multiplications) can indeed in certain cases be faster than 2 random-access memory reads... [which seem to be happening right now] Ok, if it results in a 2.5x end-to-end speedup in the use-case, we will have to go in the direction of the PR. I need to think about it a little bit. |
ngama75
left a comment
Ok in principle,
mostly some renaming to do (in addition to the fixes you mentioned already) before we can merge.
module->func.znx_small_single_product = fft64_znx_small_single_product;
module->func.znx_small_single_product_tmp_bytes = fft64_znx_small_single_product_tmp_bytes;
module->func.vmp_prepare_contiguous = fft64_vmp_prepare_contiguous_ref;
module->func.vmp_prepare_contiguous_vec = fft64_vmp_prepare_contiguous_vec_ref;
Let's rename using this convention:
vmp_prepare_vector_ref
vmp_prepare_vector_avx
vmp_prepare_vector_tmp_bytes
vmp_apply_pvec_to_dft_ref
vmp_apply_pvec_to_dft_avx
vmp_apply_pvec_to_dft_tmp_bytes
/** @brief opaque type that represents a vector of znx in DFT space */
typedef struct vec_znx_dft_t VEC_ZNX_DFT;
/** @brief opaque type that represents a vector of znx in prepared for vmp DFT space */
typedef struct vec_znx_dft_t VMP_PVEC;
- typedef struct vec_znx_dft_t VMP_PVEC;
+ typedef struct vmp_pvec_t VMP_PVEC;
(so that the compiler considers the two types as distinct)
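To illustrate why the separate struct tag matters, here is a minimal sketch (C11, for `_Generic`). Only the two typedefs come from the diff above; `HANDLE_KIND`, `kind_of_dft` and `kind_of_pvec` are hypothetical names made up for this example. With both typedefs pointing at `struct vec_znx_dft_t`, the macro would not even compile, because `_Generic` rejects two compatible association types.

```c
#include <assert.h>

/* Opaque handles with distinct struct tags (the tags are never defined here). */
typedef struct vec_znx_dft_t VEC_ZNX_DFT;
typedef struct vmp_pvec_t VMP_PVEC;

/* Hypothetical helper: selects a constant from the static pointer type.
 * It only compiles because the two pointer types are incompatible. */
#define HANDLE_KIND(p) _Generic((p), VEC_ZNX_DFT*: 1, VMP_PVEC*: 2)

/* Hypothetical wrappers so the selection can be observed at run time. */
int kind_of_dft(VEC_ZNX_DFT* p) { return HANDLE_KIND(p); }
int kind_of_pvec(VMP_PVEC* p) { return HANDLE_KIND(p); }
```

In practice the benefit is that passing a `VEC_ZNX_DFT*` where a `VMP_PVEC*` is expected draws an incompatible-pointer diagnostic instead of being accepted silently.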
Fixed the 2 issues I found, applied the renamings and corrected the typedef in 7e43a9f. |
Intended to be a more performant alternative to vmp_apply_dft_to_dft when one can prepare the input vector (e.g. PIR). The prepared vec is in reim4 dft instead of the reim dft that the current version uses, which can result in significant performance gains in certain memory-bound vmp workflows.
Motivated by my findings while implementing a variation on the Hypets' tutorial OnionPIR protocol. On my desktop machine, the
reim4_extract_1blk_from_contiguous_reim_avx (reim to reim4 copy) inside vmp_apply_dft_to_dft accounted for the majority of execution time. On a PIR protocol like OnionPIR, the vectors we use in a VMP are the columns of the database, and therefore it makes sense to have them prepared in the most efficient format.
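For readers unfamiliar with the two layouts, a rough scalar sketch of the reim to reim4 copy follows. This is not the library's AVX routine: `sketch_extract_1blk` is a hypothetical name, and the layouts are assumptions (reim: a row of m complex values stored as m reals followed by m imaginaries; reim4: a block of 4 complex values stored as 4 reals followed by 4 imaginaries).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar sketch of extracting one reim4 block per row.
 * Assumed input : nrows contiguous reim rows, each 2*m doubles
 *                 (m real parts, then m imaginary parts).
 * Assumed output: nrows reim4 blocks, each 8 doubles
 *                 (4 real parts, then 4 imaginary parts). */
void sketch_extract_1blk(size_t m, size_t nrows, size_t blk,
                         double* dst, const double* src) {
  assert(m % 4 == 0 && blk < m / 4);
  for (size_t row = 0; row < nrows; ++row) {
    const double* re = src + row * 2 * m + 4 * blk; /* real parts of this block */
    const double* im = re + m;                      /* matching imaginary parts */
    double* out = dst + row * 8;
    for (size_t k = 0; k < 4; ++k) {
      out[k] = re[k];
      out[4 + k] = im[k];
    }
  }
}
```

Under these assumptions each row contributes two short reads that sit m doubles apart, and consecutive rows sit 2*m doubles apart, so the copy is dominated by strided memory access rather than arithmetic. Preparing the vector in reim4 once moves that cost out of every product.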
In my current experiments, using
vmp_apply_prepared_to_dft instead of vmp_apply_dft_to_dft can result in a performance gain of 1.5x to 2.5x (depending on the machine used) on the overall PIR server execution time.