
Add SIMD vectorization for check-to-variable loop #41

Merged
trmue merged 3 commits into trmue:main from eugeneyuchunlin:main
Jun 24, 2026

@eugeneyuchunlin

Contributor

What

Restructure the min-sum accumulator loop in compute_check_to_variable to break the serial dependency chain that blocked instruction-level parallelism.

Why

The original implementation of compute_check_to_variable had a loop-carried dependency when finding the minimum and second-minimum messages: each iteration's compare-select depended on the previous result, preventing LLVM from vectorizing the loop. Storing min_ind also increased CPU–memory traffic.
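
For reference, the kind of loop-carried dependency described above can be sketched as follows. This is a minimal illustration, not the code removed by this PR: the function name `min_two_serial` and the concrete `f32` type are assumptions made for the example.

```rust
/// Serial scan with a single running (min, second_min, min_ind) triple.
/// Every iteration reads the state written by the previous one, so the
/// compare-select operations form one long dependency chain.
/// NOTE: hypothetical helper for illustration only.
fn min_two_serial(messages: &[f32]) -> (f32, f32, usize) {
    let mut min = f32::MAX;
    let mut second_min = f32::MAX;
    let mut min_ind = 0usize;
    for (i, m) in messages.iter().enumerate() {
        let abs_msg = m.abs();
        if abs_msg < min {
            // The old minimum becomes the second minimum.
            second_min = min;
            min = abs_msg;
            min_ind = i;
        } else if abs_msg < second_min {
            second_min = abs_msg;
        }
    }
    (min, second_min, min_ind)
}
```

Because `min`, `second_min`, and `min_ind` are all updated conditionally on the previous iteration's values, the compiler has little room to process several elements at once.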

How

The change introduces independent accumulator lanes to break the serial dependency chain, giving LLVM an opportunity to exploit instruction-level parallelism and vectorize the loop. This is done by an inline helper function, min_two_magnitudes, that maintains multiple independent accumulators while scanning a row. The function defines a constant, LANES, that controls the degree of unrolling. It is set to 4 for two reasons:

  1. ARM NEON only provides 128-bit registers, which fit exactly 4 x f32 lanes.
  2. qLDPC codes typically have low check-node connectivity (~6), so lanes beyond 4 yield empty iterations rather than useful work. This applies to both ARM NEON and x86 AVX2.
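
The lane-based reduction can be sketched roughly as below. This is a simplified illustration that uses `f32` instead of the crate's generic numeric type; the lane fold and the remainder handling are assumptions about how the pieces fit together, not a copy of the merged code.

```rust
const LANES: usize = 4;

/// Returns the smallest and second-smallest magnitudes in `messages`.
/// Sketch only: concrete f32 type and fold order are assumptions.
fn min_two_magnitudes(messages: &[f32]) -> (f32, f32) {
    // One independent (min1, min2) pair per lane.
    let mut min1 = [f32::MAX; LANES];
    let mut min2 = [f32::MAX; LANES];

    let mut chunks = messages.chunks_exact(LANES);
    for chunk in &mut chunks {
        for k in 0..LANES {
            let abs_msg = chunk[k].abs();
            // min1' = min(abs_msg, min1); min2' = min(max(abs_msg, min1), min2)
            let lo = abs_msg.min(min1[k]);
            let hi = abs_msg.max(min1[k]);
            min1[k] = lo;
            min2[k] = min2[k].min(hi);
        }
    }

    // Fold the per-lane results into a single (min, second_min) pair.
    let mut min = f32::MAX;
    let mut second_min = f32::MAX;
    for k in 0..LANES {
        let lo = min1[k].min(min);
        let hi = min1[k].max(min);
        second_min = second_min.min(hi).min(min2[k]);
        min = lo;
    }

    // Elements that did not fill a whole chunk are handled serially.
    for m in chunks.remainder() {
        let abs_msg = m.abs();
        let lo = abs_msg.min(min);
        let hi = abs_msg.max(min);
        second_min = second_min.min(hi);
        min = lo;
    }
    (min, second_min)
}
```

Lane `k` only ever reads and writes `min1[k]` and `min2[k]`, so the four updates in the inner loop are independent of each other and can map onto one 128-bit register of four `f32` values.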

The minimum message index is no longer tracked. Instead, outgoing messages compare against the minimum value during emission, which removes a random-access load on every output edge.
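
The value-based selection during emission might look like the sketch below. The function name `emit_magnitudes` and the `f32` type are hypothetical, and sign handling is omitted so that only the magnitude selection is shown.

```rust
/// For each edge, emit the minimum magnitude over all *other* edges.
/// An edge whose own magnitude equals the row minimum receives the
/// second minimum; every other edge receives the minimum.
/// NOTE: hypothetical helper; signs are intentionally left out.
fn emit_magnitudes(messages: &[f32], min: f32, second_min: f32) -> Vec<f32> {
    messages
        .iter()
        .map(|m| if m.abs() == min { second_min } else { min })
        .collect()
}
```

When two edges tie for the minimum, `second_min` equals `min`, so both edges still receive the same value they would get from index-based tracking.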

Output is numerically identical to the original implementation.

Improvement

Arm system

  • CPU: Apple M1 Pro
  • RAM: Unified LPDDR5 16GB
  • OS: macOS Tahoe 26.1

min_sum_144_12_12/100_samples
                        time:   [573.20 ms 574.61 ms 576.10 ms]
                        change: [-28.382% -28.197% -27.979%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking min_sum_144_12_12/100_samples_par: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.9s or enable flat sampling.
min_sum_144_12_12/100_samples_par
                        time:   [103.94 ms 106.36 ms 108.73 ms]
                        change: [-30.643% -27.995% -25.589%] (p = 0.00 < 0.05)
                        Performance has improved.

x86 system

System info:

  • CPU: AMD Ryzen 7 9800X3D
  • RAM: DDR5 32GB
  • OS: Ubuntu 26.04

min_sum_144_12_12/100_samples
                        time:   [461.68 ms 463.17 ms 464.82 ms]
                        change: [-15.678% -15.259% -14.830%] (p = 0.00 < 0.05)
                        Performance has improved.
min_sum_144_12_12/100_samples_par
                        time:   [69.213 ms 69.694 ms 70.329 ms]
                        change: [-16.957% -15.543% -13.889%] (p = 0.00 < 0.05)
                        Performance has improved.

The original implementation of compute_check_to_variable used a serial
loop to find the minimum and second-minimum messages. The running
state introduced a loop-carried dependency, limiting instruction-level
parallelism (ILP) and making vectorization difficult.

This change introduces an inline helper function min_two_magnitudes that
maintains multiple independent accumulators while scanning a row.
Breaking the dependency chain exposes additional ILP and gives LLVM more
opportunities to generate vectorized code.

The implementation also removes explicit tracking of the minimum message
index. Instead, outgoing messages select between the minimum and second
minimum magnitudes by comparing against the minimum value during
emission. This simplifies the reduction and avoids additional
per-element bookkeeping.

Additionally, alpha scaling and optional fixed-point rescaling are now
performed once per row on the two candidate output magnitudes rather
than once per edge, reducing arithmetic operations and eliminating the
final matrix-wide rescaling pass.

The LANES constant in min_two_magnitudes is fixed to 4 because qLDPC
codes typically have low check-node connectivity. Although AVX2
provides 256-bit registers, which would allow LANES=8, rows in the
typical use case are too short to fill 8-wide chunks, so their elements
would fall through to the serial remainder loop. A small LANES value is
therefore preferable.

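
The once-per-row scaling mentioned in this commit message can be illustrated with a small sketch. The function name `emit_scaled`, the `alpha` parameter, and the `f32` type are assumptions for the example; the optional fixed-point rescaling and sign handling are left out.

```rust
/// Scale the two candidate magnitudes once, then select per edge.
/// This costs two multiplications per row instead of one per edge.
/// NOTE: hypothetical helper; fixed-point rescaling and signs omitted.
fn emit_scaled(messages: &[f32], min: f32, second_min: f32, alpha: f32) -> Vec<f32> {
    let scaled_min = alpha * min;
    let scaled_second = alpha * second_min;
    messages
        .iter()
        .map(|m| if m.abs() == min { scaled_second } else { scaled_min })
        .collect()
}
```

Each edge receives the same product it would have received from per-edge scaling, since the multiplication is applied to the same two candidate values either way.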
@trmue

trmue commented Jun 17, 2026

Owner

Thank you! The automated checks found a few formatting issues (no newline at the end of the file, trailing whitespace, brackets):


Fix End of Files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing crates/relay_bp/src/bp/min_sum.rs

Trim Trailing Whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing crates/relay_bp/src/bp/min_sum.rs

fmt......................................................................Failed
- hook id: fmt
- files were modified by this hook
cargo check..............................................................Passed
clippy...................................................................Passed
mypy.....................................................................Passed
pre-commit hook(s) made changes.
If you are seeing this message in CI, reproduce locally with: `pre-commit run --all-files`.
To run `pre-commit` as part of git workflow, use `pre-commit install`.
All changes made by hooks:
diff --git a/crates/relay_bp/src/bp/min_sum.rs b/crates/relay_bp/src/bp/min_sum.rs
index 217c271..2a1d94f 100644
--- a/crates/relay_bp/src/bp/min_sum.rs
+++ b/crates/relay_bp/src/bp/min_sum.rs
@@ -335,14 +335,13 @@ where
     /// updates. Unlike a single running (min, second_min, min_ind) triple,
     /// the lanes carry no dependency on each other, so the CPU can overlap
     /// iterations and LLVM can detect the vectorization opportunities.
-    /// This makes uses of the SIMD instructions more efficient. 
+    /// This makes uses of the SIMD instructions more efficient.
     #[inline]
     fn min_two_magnitudes(messages: &[N]) -> (N, N) {
-
         // Fix the number of lanes to 4 since Arm Neon registers are 128 bits wide.
-        // Although AVX2 registers are 256 bits wide, we still stick to 4 lanes since 
+        // Although AVX2 registers are 256 bits wide, we still stick to 4 lanes since
         // qLDPC codes typically have a small number of check-node connectivity.
-        // Wider lanes would not provide significant benefits for the typical use case. 
+        // Wider lanes would not provide significant benefits for the typical use case.
         const LANES: usize = 4;
         let mut min1 = [N::max_value(); LANES];
         let mut min2 = [N::max_value(); LANES];
@@ -352,11 +351,11 @@ where
         //   min2' = min(max(abs_msg, min1), min2)
         let mut chunks = messages.chunks_exact(LANES);
         for chunck in &mut chunks {
-            for k in 0..LANES{
+            for k in 0..LANES {
                 let abs_msg = chunck[k].abs();
                 let lo = if abs_msg < min1[k] { abs_msg } else { min1[k] };
                 let hi = if abs_msg < min1[k] { min1[k] } else { abs_msg };
-                
+
                 min1[k] = lo;
                 if hi < min2[k] {
                     min2[k] = hi;
@@ -378,7 +377,6 @@ where
         let mut min_message = N::max_value();
         let mut second_min_message = N::max_value();
         for k in 0..LANES {
-
             // lo = min(min1[k], min_message)
             // hi = max(min1[k], min_message)
             // second_min = min(hi, min2[k], second_min_message)
@@ -987,4 +985,4 @@ mod tests {
 
         assert_eq!(results[0].decoding.len(), 8785);
     }
-}
\ No newline at end of file
+}

@eugeneyuchunlin

Contributor Author

Thanks for pointing that out! I’ve fixed the formatting issues and pushed an update.

@eugeneyuchunlin

Contributor Author

Hi @trmue, just wanted to follow up on this PR. Would you mind taking a look when you have a chance and approving the remaining 3 workflows if everything looks okay? I believe the previous formatting issues have been fixed, but please let me know if there's anything else I can do to help move this forward. Thanks!

@trmue

trmue commented Jun 24, 2026

Owner

Yeah, sorry for only getting back to this now. Thank you for the PR!

@trmue trmue merged commit 19d7023 into trmue:main Jun 24, 2026
13 checks passed
2 participants