Add SIMD vectorization for check-to-variable loop by eugeneyuchunlin · Pull Request #41 · trmue/relay

eugeneyuchunlin · 2026-06-16T22:27:30Z

What

Restructure the min-sum accumulator loop in compute_check_to_variable to break the serial dependency chain that blocked instruction-level parallelism.

Why

The original implementation of compute_check_to_variable had a loop-carried dependency when finding the minimum and second-minimum messages: each iteration's compare-select depended on the previous result, preventing LLVM from vectorizing the loop. Storing min_ind also increased CPU–memory traffic.
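
For reference, the dependency described above can be sketched as follows. This is an illustrative reconstruction, not the code that was removed: the function name min_two_serial is hypothetical, and f32 is assumed in place of the crate's generic numeric type. Each iteration reads the (min, second_min, min_ind) state written by the previous one, so the compare-selects cannot overlap.

```rust
/// Hypothetical serial baseline: a single running (min, second_min, min_ind)
/// triple. Every iteration depends on the state left by the previous one.
fn min_two_serial(messages: &[f32]) -> (f32, f32, usize) {
    let mut min = f32::MAX;
    let mut second_min = f32::MAX;
    let mut min_ind = 0usize;
    for (i, m) in messages.iter().enumerate() {
        let a = m.abs();
        if a < min {
            // New smallest magnitude: the old minimum becomes the runner-up.
            second_min = min;
            min = a;
            min_ind = i;
        } else if a < second_min {
            second_min = a;
        }
    }
    (min, second_min, min_ind)
}
```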

How

The change introduces independent accumulator lanes to break the serial dependency chain, giving LLVM an opportunity to exploit instruction-level parallelism. This is done by an inline helper function, min_two_magnitudes, that maintains multiple independent accumulators while scanning a row. The function defines a constant, LANES, that controls the degree of unrolling. It is set to 4 for two reasons:

ARM NEON provides only 128-bit registers, which fit exactly 4 x f32 lanes.
qLDPC codes typically have low check-node connectivity (~6), so lanes beyond 4 yield empty iterations rather than useful work. This applies to both ARM NEON and x86 AVX2.
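
A minimal sketch of the lane-based reduction, under stated assumptions: f32 stands in for the generic type N used in the crate, f32::min and f32::max replace the explicit compare-selects, and the tail handling and lane merge are illustrative guesses rather than the merged code. The names LANES, min1, min2 and the use of chunks_exact follow the PR.

```rust
const LANES: usize = 4;

/// Returns the smallest and second-smallest magnitudes in `messages`.
fn min_two_magnitudes(messages: &[f32]) -> (f32, f32) {
    // One (min1, min2) pair per lane; lane k never reads lane j's state.
    let mut min1 = [f32::MAX; LANES];
    let mut min2 = [f32::MAX; LANES];

    let mut chunks = messages.chunks_exact(LANES);
    for chunk in &mut chunks {
        for k in 0..LANES {
            let a = chunk[k].abs();
            let lo = a.min(min1[k]);
            let hi = a.max(min1[k]);
            min1[k] = lo;
            min2[k] = hi.min(min2[k]);
        }
    }

    // Scalar pass over the elements that did not fill a whole chunk.
    let mut min = f32::MAX;
    let mut second = f32::MAX;
    for m in chunks.remainder() {
        let a = m.abs();
        let hi = a.max(min);
        min = a.min(min);
        second = second.min(hi);
    }

    // Fold the per-lane results into the final pair.
    for k in 0..LANES {
        let hi = min1[k].max(min);
        min = min1[k].min(min);
        second = second.min(hi).min(min2[k]);
    }
    (min, second)
}
```

A row with a duplicated minimum, such as [2.0, -2.0, 3.0], yields the same value for both outputs, which is what the index-free emission step relies on.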

The minimum message index is no longer tracked. Instead, outgoing messages compare against the minimum value during emission, which removes a random-access load on every output edge.
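
The index-free emission can be sketched as below. The function name emit_magnitudes is hypothetical, f32 is assumed, and sign handling is left out to keep the focus on the magnitude selection. An edge whose own magnitude equals the row minimum receives the second minimum; every other edge receives the minimum. When two edges tie for the minimum, the second minimum equals the minimum, so both branches give the same value and no index is needed.

```rust
/// Hypothetical emission step: picks the outgoing magnitude for each edge by
/// comparing its own magnitude against the row minimum (signs omitted).
fn emit_magnitudes(messages: &[f32], min: f32, second_min: f32) -> Vec<f32> {
    messages
        .iter()
        .map(|m| if m.abs() == min { second_min } else { min })
        .collect()
}
```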

Output is numerically identical to the original implementation.

Improvement

Arm system

CPU: Apple M1 Pro
RAM: Unified LPDDR5 16GB
OS: macOS Tahoe 26.1

min_sum_144_12_12/100_samples
                        time:   [573.20 ms 574.61 ms 576.10 ms]
                        change: [-28.382% -28.197% -27.979%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking min_sum_144_12_12/100_samples_par: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.9s or enable flat sampling.
min_sum_144_12_12/100_samples_par
                        time:   [103.94 ms 106.36 ms 108.73 ms]
                        change: [-30.643% -27.995% -25.589%] (p = 0.00 < 0.05)
                        Performance has improved.

x86 system

System info:

CPU: AMD R7 9800x3d
RAM: DDR5 32GB
OS: Ubuntu 26.04

min_sum_144_12_12/100_samples
                        time:   [461.68 ms 463.17 ms 464.82 ms]
                        change: [-15.678% -15.259% -14.830%] (p = 0.00 < 0.05)
                        Performance has improved.
min_sum_144_12_12/100_samples_par
                        time:   [69.213 ms 69.694 ms 70.329 ms]
                        change: [-16.957% -15.543% -13.889%] (p = 0.00 < 0.05)
                        Performance has improved.

The original implementation of compute_check_to_variable used a serial loop to find the minimum and second-minimum messages. The running state introduced a loop-carried dependency, limiting instruction-level parallelism (ILP) and making vectorization difficult. This change introduces an inline helper function, min_two_magnitudes, that maintains multiple independent accumulators while scanning a row. Breaking the dependency chain exposes additional ILP and gives LLVM more opportunities to generate vectorized code.

The implementation also removes explicit tracking of the minimum message index. Instead, outgoing messages select between the minimum and second-minimum magnitudes by comparing against the minimum value during emission. This simplifies the reduction and avoids additional per-element bookkeeping.

Additionally, alpha scaling and optional fixed-point rescaling are now performed once per row on the two candidate output magnitudes rather than once per edge, reducing the number of arithmetic operations and eliminating the final matrix-wide rescaling pass.
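
The per-row scaling can be illustrated with a small comparison. Both function names are hypothetical, f32 is assumed, and only alpha scaling is shown (the fixed-point rescaling is omitted). The per-row variant performs two multiplications per row regardless of the row's degree, and produces the same values as scaling every edge.

```rust
/// Hypothetical per-edge variant: one multiplication for every edge.
fn emit_scale_per_edge(messages: &[f32], min: f32, second_min: f32, alpha: f32) -> Vec<f32> {
    messages
        .iter()
        .map(|m| alpha * if m.abs() == min { second_min } else { min })
        .collect()
}

/// Hypothetical per-row variant: scale the two candidates once, then select.
fn emit_scale_per_row(messages: &[f32], min: f32, second_min: f32, alpha: f32) -> Vec<f32> {
    let scaled_min = alpha * min;
    let scaled_second = alpha * second_min;
    messages
        .iter()
        .map(|m| if m.abs() == min { scaled_second } else { scaled_min })
        .collect()
}
```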

The LANES constant in min_two_magnitudes is fixed at 4 because qLDPC codes typically have low check-node connectivity. Although AVX2 provides 256-bit registers, which would allow LANES=8, in the typical use case this would push most of the row into the serial remainder loop. A small LANES value is therefore preferable.
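
The effect of the lane count on a short row can be checked directly with chunks_exact. The helper split_row below is hypothetical and only counts how many elements fall into full chunks versus the remainder; lanes must be nonzero, since chunks_exact panics on a chunk size of 0.

```rust
/// Hypothetical helper: for a row of `degree` entries, returns
/// (elements covered by full chunks, elements left for the remainder loop).
fn split_row(degree: usize, lanes: usize) -> (usize, usize) {
    let row = vec![0.0f32; degree];
    let chunks = row.chunks_exact(lanes);
    let tail = chunks.remainder().len();
    (chunks.len() * lanes, tail)
}
```

For a row of degree 6, split_row(6, 4) gives (4, 2), while split_row(6, 8) gives (0, 6): with 8 lanes the whole row is handled by the remainder loop.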

trmue · 2026-06-17T13:43:21Z

Thank you! The automated regression found a few formatting issues (no newline at the end of the file, trailing whitespace, brackets):


Fix End of Files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing crates/relay_bp/src/bp/min_sum.rs

Trim Trailing Whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing crates/relay_bp/src/bp/min_sum.rs

fmt......................................................................Failed
- hook id: fmt
- files were modified by this hook
cargo check..............................................................Passed
clippy...................................................................Passed
mypy.....................................................................Passed
pre-commit hook(s) made changes.
If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
diff --git a/crates/relay_bp/src/bp/min_sum.rs b/crates/relay_bp/src/bp/min_sum.rs
index 217c271..2a1d94f 100644
--- a/crates/relay_bp/src/bp/min_sum.rs
+++ b/crates/relay_bp/src/bp/min_sum.rs
@@ -335,14 +335,13 @@ where
     /// updates. Unlike a single running (min, second_min, min_ind) triple,
     /// the lanes carry no dependency on each other, so the CPU can overlap
     /// iterations and LLVM can detect the vectorization opportunities.
-    /// This makes uses of the SIMD instructions more efficient. 
+    /// This makes uses of the SIMD instructions more efficient.
     #[inline]
     fn min_two_magnitudes(messages: &[N]) -> (N, N) {
-
         // Fix the number of lanes to 4 since Arm Neon registers are 128 bits wide.
-        // Although AVX2 registers are 256 bits wide, we still stick to 4 lanes since 
+        // Although AVX2 registers are 256 bits wide, we still stick to 4 lanes since
         // qLDPC codes typically have a small number of check-node connectivity.
-        // Wider lanes would not provide significant benefits for the typical use case. 
+        // Wider lanes would not provide significant benefits for the typical use case.
         const LANES: usize = 4;
         let mut min1 = [N::max_value(); LANES];
         let mut min2 = [N::max_value(); LANES];
@@ -352,11 +351,11 @@ where
         //   min2' = min(max(abs_msg, min1), min2)
         let mut chunks = messages.chunks_exact(LANES);
         for chunck in &mut chunks {
-            for k in 0..LANES{
+            for k in 0..LANES {
                 let abs_msg = chunck[k].abs();
                 let lo = if abs_msg < min1[k] { abs_msg } else { min1[k] };
                 let hi = if abs_msg < min1[k] { min1[k] } else { abs_msg };
-                
+
                 min1[k] = lo;
                 if hi < min2[k] {
                     min2[k] = hi;
@@ -378,7 +377,6 @@ where
         let mut min_message = N::max_value();
         let mut second_min_message = N::max_value();
         for k in 0..LANES {
-
             // lo = min(min1[k], min_message)
             // hi = max(min1[k], min_message)
             // second_min = min(hi, min2[k], second_min_message)
@@ -987,4 +985,4 @@ mod tests {
 
         assert_eq!(results[0].decoding.len(), 8785);
     }
-}
\ No newline at end of file
+}

eugeneyuchunlin · 2026-06-17T14:21:17Z

Thanks for pointing that out! I’ve fixed the formatting issues and pushed an update.

eugeneyuchunlin · 2026-06-24T14:52:26Z

Hi @trmue, just wanted to follow up on this PR. Would you mind taking a look when you have a chance and approving the remaining 3 workflows if everything looks okay? I believe the previous formatting issues have been fixed, but please feel free to let me know if there's anything else I can do to help move this forward. Thanks!

trmue · 2026-06-24T15:14:16Z

Yeah, sorry for only getting back to this now. Thank you for the PR!

eugeneyuchunlin added 2 commits June 16, 2026 14:05

Fix the formatting issue

b864241

trmue merged commit 19d7023 into trmue:main Jun 24, 2026
13 checks passed


trmue merged 3 commits into trmue:main from eugeneyuchunlin:main
