Add SIMD vectorization for check-to-variable loop #41
Merged
The original implementation of compute_check_to_variable used a serial loop to find the minimum and second-minimum message magnitudes. The running state introduced a loop-carried dependency, which limited instruction-level parallelism (ILP) and made vectorization difficult. This change introduces an inline helper function, min_two_magnitudes, that maintains multiple independent accumulators while scanning a row. Breaking the dependency chain exposes additional ILP and gives LLVM more opportunities to generate vectorized code.

The implementation also removes explicit tracking of the minimum message index. Instead, outgoing messages select between the minimum and second-minimum magnitudes by comparing against the minimum value during emission. This simplifies the reduction and avoids additional per-element bookkeeping.

Additionally, alpha scaling and optional fixed-point rescaling are now performed once per row on the two candidate output magnitudes rather than once per edge, which reduces the number of arithmetic operations and eliminates the final matrix-wide rescaling pass.
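The reduction and emission described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the body of min_two_magnitudes and the emit_magnitudes helper are assumptions, sign handling and fixed-point rescaling are left out, and only the names min_two_magnitudes, LANES and alpha come from the description.

```rust
/// Number of independent accumulator lanes (the description fixes this to 4).
const LANES: usize = 4;

/// Return the smallest and second-smallest magnitude found in `row`.
/// Hypothetical reconstruction of the helper named in the description.
#[inline]
fn min_two_magnitudes(row: &[f32]) -> (f32, f32) {
    // Each lane owns its own (min, second min) pair, so within one chunk
    // no lane has to wait for the result of another lane.
    let mut min1 = [f32::INFINITY; LANES];
    let mut min2 = [f32::INFINITY; LANES];

    let mut chunks = row.chunks_exact(LANES);
    for chunk in &mut chunks {
        for j in 0..LANES {
            let m = chunk[j].abs();
            // Branch-free update: the larger of (m, old min1) competes for min2.
            let hi = m.max(min1[j]);
            min1[j] = m.min(min1[j]);
            min2[j] = min2[j].min(hi);
        }
    }

    // Serial tail for the elements that do not fill a whole chunk.
    let mut best = f32::INFINITY;
    let mut second = f32::INFINITY;
    for &x in chunks.remainder() {
        let m = x.abs();
        let hi = m.max(best);
        best = m.min(best);
        second = second.min(hi);
    }

    // Merge every lane's pair into the final pair.
    for j in 0..LANES {
        let hi = min1[j].max(best);
        best = min1[j].min(best);
        second = second.min(hi).min(min2[j]);
    }
    (best, second)
}

/// Emit output magnitudes for one row (hypothetical helper; signs omitted).
fn emit_magnitudes(incoming: &[f32], alpha: f32, out: &mut [f32]) {
    let (min1, min2) = min_two_magnitudes(incoming);
    // Scale the two candidates once per row instead of once per edge.
    let (scaled1, scaled2) = (alpha * min1, alpha * min2);
    for (o, &m) in out.iter_mut().zip(incoming) {
        // An edge holding the minimum receives the second minimum; no index needed.
        *o = if m.abs() == min1 { scaled2 } else { scaled1 };
    }
}
```

For a row such as [3.0, -1.0, 2.5, -4.0, 0.5, 6.0], the first four entries go through the lanes, the last two through the serial tail, and the merged result is (0.5, 1.0).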
The LANES constant in min_two_magnitudes is fixed to 4 because qLDPC codes typically have low check-node degree. Although AVX2 provides 256-bit registers, which would allow LANES=8, rows in the typical use case are too short to fill eight lanes, so the code would fall through to the serial part. A small LANES value is therefore preferable.
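The trade-off can be made concrete with a small sketch. It assumes the helper handles full LANES-wide chunks in the accumulator lanes and everything left over in a serial tail; lane_split is a hypothetical name, and the check-node degree of 6 used below is only an illustrative value.

```rust
/// Split a row of `row_len` messages into the count handled by full
/// `lanes`-wide chunks and the count left for the serial tail loop.
/// Hypothetical helper for illustration only.
fn lane_split(row_len: usize, lanes: usize) -> (usize, usize) {
    let in_lanes = (row_len / lanes) * lanes;
    (in_lanes, row_len - in_lanes)
}
```

With a row of 6 messages, lane_split(6, 4) gives (4, 2), so most of the row uses the independent accumulators, while lane_split(6, 8) gives (0, 6), so the whole row is processed serially.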
Owner
Thank you! The automated regression found a few formatting issues (no newline at the end of the file, trailing whitespace, brackets).
Contributor
Author
Thanks for pointing that out! I’ve fixed the formatting issues and pushed an update.
Contributor
Author
Hi @trmue, just wanted to follow up on this PR. Would you mind taking a look when you have a chance and approving the remaining 3 workflows if everything looks okay? I believe the previous formatting issues have been fixed, but please feel free to let me know if there's anything else I can do to help move this forward. Thanks!
Owner
Yeah, sorry for only getting back to this now. Thank you for the PR!
What
Restructure the min-sum accumulator loop in compute_check_to_variable to break the serial dependency chain that blocked instruction-level parallelism.
Why
The original implementation of compute_check_to_variable had a loop-carried dependency when finding the minimum and second-minimum message: each iteration's compare-select depended on the previous result, preventing LLVM from vectorizing the loop. Storing min_ind also increased CPU–memory traffic.
How
The change introduces independent accumulator lanes to break the serial dependency chain and gives LLVM an opportunity to exploit instruction-level parallelism. This is accomplished by an inline helper function, min_two_magnitudes, that maintains multiple independent accumulators while scanning a row. The function defines a constant, LANES, that controls the degree of unrolling. It is set to 4 for the following two reasons:
4 x f32 lanes
The minimum message index is no longer tracked. Instead, outgoing messages compare against the minimum value during emission, which removes a random-access load on every output edge.
Output is numerically identical to the original implementation.
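One way to see why dropping the index does not change the output is the sketch below, which assumes finite, non-NaN magnitudes. Both functions are hypothetical reconstructions (only the name min_ind comes from the description): the first tracks the index of the minimum in a serial scan, the second selects by comparing each magnitude with the minimum value. When two edges tie for the minimum, the second minimum equals the minimum, so both variants emit the same value on every edge.

```rust
/// Baseline sketch: serial scan that records the index of the minimum.
fn emit_with_index(mags: &[f32]) -> Vec<f32> {
    let (mut min1, mut min2, mut min_ind) = (f32::INFINITY, f32::INFINITY, usize::MAX);
    for (i, &m) in mags.iter().enumerate() {
        // Each iteration depends on min1/min2 from the previous one.
        if m < min1 {
            min2 = min1;
            min1 = m;
            min_ind = i;
        } else if m < min2 {
            min2 = m;
        }
    }
    (0..mags.len())
        .map(|i| if i == min_ind { min2 } else { min1 })
        .collect()
}

/// Index-free sketch: select by comparing against the minimum value.
fn emit_with_compare(mags: &[f32]) -> Vec<f32> {
    let (mut min1, mut min2) = (f32::INFINITY, f32::INFINITY);
    for &m in mags {
        let hi = m.max(min1);
        min1 = m.min(min1);
        min2 = min2.min(hi);
    }
    mags.iter()
        .map(|&m| if m == min1 { min2 } else { min1 })
        .collect()
}
```

For [3.0, 1.0, 2.0] both functions return [1.0, 2.0, 1.0]; for the tied row [2.0, 1.0, 1.0, 3.0] both return 1.0 on every edge.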
Improvement
Arm system
x86 system
System info: