Add VMP prepared to DFT variant by fraret · Pull Request #60 · tfhe/spqlios-arithmetic

fraret · 2026-06-11T12:56:58Z

Intended to be a more performant alternative to vmp_apply_dft_to_dft when one can prepare the input vector (e.g. PIR). The prepared vec is in reim4 DFT instead of the reim DFT that the current version uses, which can result in significant performance gains in certain memory-bound VMP workflows.

Motivated by my findings while implementing a variation on the Hypets' tutorial OnionPIR protocol. On my desktop machine, the reim4_extract_1blk_from_contiguous_reim_avx (reim to reim4 copy) inside vmp_apply_dft_to_dft accounted for the majority of execution time.
In a PIR protocol like OnionPIR, the vectors we use in a VMP are the columns of the database, so it makes sense to keep them prepared in the most efficient format.

In my current experiments, using vmp_apply_prepared_to_dft instead of vmp_apply_dft_to_dft can result in a performance gain of 1.5x to 2.5x (depending on the machine used) on the overall PIR server execution time.

ngama75 · 2026-06-11T13:38:55Z

That's very strange! The product should be an order of magnitude slower: what are the dimensions of the vector and the matrix that you are using in the vector-matrix product? (And how many products are you doing with the same prepared matrix?)

Also, I think that there is an AVX variant of reim4_extract_1blk_from_contiguous_reim_ref in the code: if it is not used, it may be a bug.

fraret · 2026-06-11T14:15:06Z

> Also, I think that there is an AVX variant of reim4_extract_1blk_from_contiguous_reim_ref in the code: if it is not used, it may be a bug.

I put the wrong variant in the PR description, sorry. The AVX variant is correctly used (and is the bottleneck on my desktop, with ~50% of my OnionPIR exec time).

> That's very strange! The product should be an order of magnitude slower: what are the dimensions of the vector and the matrix that you are using in the vector-matrix product? (And how many products are you doing with the same prepared matrix?)

I am not sure why the product should be slower. Just to be clear, the version I added is the usual VMP dft to dft which takes a prepared matrix, but changing the input vector to be in reim4 DFT instead of reim DFT (PMAT x reim4 -> reim). Then instead of calling reim4_extract_1blk_from_contiguous_reim_avx to copy a block to the tmp space, we can read directly from the input vector.
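To make the layout difference concrete, here is a minimal sketch, not the code from this PR. It assumes that a reim row stores m real parts followed by m imaginary parts, and that a reim4 block stores 4 real parts followed by 4 imaginary parts. The helper names `extract_1blk_from_reim` and `prepare_vector_reim4`, and the block-major order of the prepared buffer, are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Assumed reim layout:  row = [re_0 .. re_{m-1} | im_0 .. im_{m-1}]      */
/* Assumed reim4 block:  [re_0 re_1 re_2 re_3 | im_0 im_1 im_2 im_3]      */

/* Hypothetical helper: gather block `blk` of every row of a contiguous
 * reim vector into dst (8 doubles per row). This models the copy that
 * runs on every product when the input is in reim layout. */
static void extract_1blk_from_reim(size_t m, size_t nrows, size_t blk,
                                   double* dst, const double* src) {
  for (size_t row = 0; row < nrows; ++row) {
    const double* re = src + row * 2 * m + 4 * blk; /* 4 real parts      */
    const double* im = re + m;                      /* 4 imaginary parts */
    memcpy(dst + 8 * row, re, 4 * sizeof(double));
    memcpy(dst + 8 * row + 4, im, 4 * sizeof(double));
  }
}

/* Hypothetical one-time preparation: store all m/4 blocks back to back
 * (block-major), so block `blk` of all rows is one contiguous range.
 * dst must hold nrows * 2 * m doubles; m must be a multiple of 4. */
static void prepare_vector_reim4(size_t m, size_t nrows,
                                 double* dst, const double* src) {
  for (size_t blk = 0; blk < m / 4; ++blk) {
    extract_1blk_from_reim(m, nrows, blk, dst + blk * nrows * 8, src);
  }
}
```

Under these assumptions, block `blk` of all rows lives at `dst + blk * nrows * 8` in the prepared buffer, so a product loop could read it in place instead of gathering it into temporary space on every call.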

In my tests:

N = 4096
The matrix is prepared and then used as many times as columns in the DB (128 or 1024 for my tests)
I am using nrows values between 512 and 4096 for testing (ell_tilde = 4 and either 128 or 1024 DB rows depending on laptop/desktop due to RAM limitations)
For ncols, it should be the glwegadget number of limbs, which is 12
The vectors are also of depth either 512 or 4096

For those I get very noticeable performance improvements (1.5x to 2.5x).

I have also tried it in a smaller micro-benchmark for a half-external product, with

N = 16384 (2^14)
ell_tilde = nrows = 15
ncols = 30
vectors of ell_tilde (15) length/depth

On my laptop these parameters give me a 1.2x speedup for a GLWEGadget x bivariate polynomial half-external product.

ngama75 · 2026-06-11T15:52:30Z

Wow, I did not expect that the spqlios product would be that efficient. But if we count the cycles, it can make sense: 30 SIMD complex products (approx. 200 AVX2 SIMD floating-point multiplications) can indeed in certain cases be faster than 2 random-access memory reads... [which seem to be happening right now]

Ok, if it results in a 2.5x end-to-end speedup in the use-case, we will have to go in the direction of the PR.
It just means that it is a new layout, and in the overall PIR use-case, it cannot be fully opaque.

I need to think about it a little bit.

MGeorgie

ok

ngama75

Ok in principle,
mostly some renaming to do (in addition to the fixes you mentioned already) before we can merge.

ngama75 · 2026-06-20T06:22:14Z

  module->func.znx_small_single_product = fft64_znx_small_single_product;
  module->func.znx_small_single_product_tmp_bytes = fft64_znx_small_single_product_tmp_bytes;
  module->func.vmp_prepare_contiguous = fft64_vmp_prepare_contiguous_ref;
+  module->func.vmp_prepare_contiguous_vec = fft64_vmp_prepare_contiguous_vec_ref;


Let's rename using this convention:

vmp_prepare_vector_ref
vmp_prepare_vector_avx
vmp_prepare_vector_tmp_bytes
vmp_apply_pvec_to_dft_ref
vmp_apply_pvec_to_dft_avx
vmp_apply_pvec_to_dft_tmp_bytes

ngama75 · 2026-06-20T06:24:02Z

 /** @brief opaque type that represents a vector of znx in DFT space */
 typedef struct vec_znx_dft_t VEC_ZNX_DFT;
+/** @brief opaque type that represents a vector of znx in prepared for vmp DFT space */
+typedef struct vec_znx_dft_t VMP_PVEC;


Suggested change

-typedef struct vec_znx_dft_t VMP_PVEC;
+typedef struct vmp_pvec_t VMP_PVEC;

(so that the compiler considers the two types as distinct)
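As an illustration of why the separate struct tag matters, here is a small sketch using C11 `_Generic`. The two typedefs mirror the ones discussed above; the `LAYOUT_TAG` macro and the two `tag_of_*` helpers are hypothetical and exist only for this example.

```c
/* Both structs are only forward-declared, so they stay opaque to callers. */
typedef struct vec_znx_dft_t VEC_ZNX_DFT;
typedef struct vmp_pvec_t VMP_PVEC;

/* _Generic selects a branch from the static type of its argument.
 * If both typedefs named the same struct, the two associations would be
 * compatible types and this macro would be rejected at compile time. */
#define LAYOUT_TAG(p) _Generic((p), VEC_ZNX_DFT*: 1, VMP_PVEC*: 2)

/* Hypothetical helpers: each returns the tag for its own pointer type. */
static int tag_of_dft(VEC_ZNX_DFT* v) { return LAYOUT_TAG(v); }
static int tag_of_pvec(VMP_PVEC* v) { return LAYOUT_TAG(v); }
```

With distinct tags, passing a `VEC_ZNX_DFT*` where a `VMP_PVEC*` is expected is an incompatible pointer conversion that the compiler can diagnose. With a shared tag, the same call would compile silently and read reim data as if it were in the prepared layout.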

fraret · 2026-06-22T08:59:18Z

Fixed the 2 issues I found, applied the renamings and corrected the typedef in 7e43a9f.
It should hopefully be ready to merge.

fraret requested review from MGeorgie and ngama75 June 11, 2026 12:56

fraret commented Jun 19, 2026

Comment thread spqlios/arithmetic/vector_matrix_product.c Outdated

fraret commented Jun 19, 2026

Comment thread spqlios/arithmetic/vec_znx_arithmetic.h Outdated

MGeorgie approved these changes Jun 19, 2026

ngama75 reviewed Jun 20, 2026

fraret added 2 commits June 22, 2026 11:05

Prepared vec for VMP renames

083e1cd

fraret force-pushed the vmp_prepared_vec branch from 7e43a9f to 083e1cd Compare June 22, 2026 09:05

fraret added the check-on-arm64 Check on arm64 label Jun 22, 2026

MGeorgie merged commit 53b8857 into main Jun 22, 2026
5 checks passed

fraret deleted the vmp_prepared_vec branch June 25, 2026 14:46

fraret mentioned this pull request Jun 26, 2026

Changes for a v0.2.0 Prividema/prividema-fhe#70

Open
