LLVM does not appear to combine successive relaxed atomic operations, even simple loads and stores of the same location. This may be a bug, or there may be some other reason for it.
Because the initialisation functions currently work by inlining the dispatch function at the end (which makes codegen far simpler), each one stores the selected function pointer and then immediately reads it back, which is a redundant read (see example assembly below).
Hopefully there is some way to make LLVM optimise out the redundant read. Failing that, it may be possible to pass state between the initialisation and dispatch functions without making codegen even more complex than it already is.
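For reference, the pattern described above can be sketched in Rust roughly as follows. This is a minimal illustration, not the actual generated code: `JUMP_REF`, `select_impl`, `dot_generic`, `init` and `dispatch` are hypothetical stand-ins, and feature detection is reduced to always picking the generic implementation.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Signature shared by every implementation (assumed for illustration).
type DotFn = fn(&[f32], &[f32]) -> f32;

fn dot_generic(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Hypothetical jump reference; it starts out pointing at the initialiser.
static JUMP_REF: AtomicPtr<()> = AtomicPtr::new(init as *mut ());

// Stand-in for the real selection logic, which would query CPU features.
fn select_impl() -> DotFn {
    dot_generic
}

// Dispatch: relaxed load of the jump reference, then an indirect call.
#[inline(always)]
fn dispatch(a: &[f32], b: &[f32]) -> f32 {
    let raw = JUMP_REF.load(Ordering::Relaxed);
    // SAFETY: JUMP_REF only ever holds pointers derived from `DotFn` values.
    let f: DotFn = unsafe { std::mem::transmute::<*mut (), DotFn>(raw) };
    f(a, b)
}

// Initialiser: relaxed store of the selected pointer, then the inlined
// dispatch reloads the value that was just written.
fn init(a: &[f32], b: &[f32]) -> f32 {
    JUMP_REF.store(select_impl() as *mut (), Ordering::Relaxed);
    dispatch(a, b)
}
```

After inlining, `init` contains a relaxed store to `JUMP_REF` immediately followed by a relaxed load of the same location, which is the store/load pair the report is about.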
Example assembly
special_test::fast_dot_product_32::_init_x86:
push rsi
push rdi
push rbx
sub rsp, 32
mov rdi, qword ptr [rip + __imp__ZN10std_detect6detect5cache5CACHE17h3e14a7e61b6c48e6E]
mov rax, qword ptr [rdi]
test rax, rax
je .LBB7_1
.LBB7_2:
movzx esi, ax
shr esi, 15
mov rax, qword ptr [rdi]
test rax, rax
je .LBB7_3
.LBB7_4:
shr eax, 10
and eax, 1
lea r8, [rip + special_test::fast_dot_product_32::_x86_sse41]
lea r9, [rip + special_test::fast_dot_product_32::_generic]
test al, al
cmovne r9, r8
lea rax, [rip + special_test::fast_dot_product_32::_x86_avx2]
test sil, sil
cmove rax, r9
mov qword ptr [rip + special_test::fast_dot_product_32::JUMP_REF_x86], rax
; this is where the dispatch function is inlined
mov rax, qword ptr [rip + special_test::fast_dot_product_32::JUMP_REF_x86]
add rsp, 32
pop rbx
pop rdi
pop rsi
rex64 jmp rax
.LBB7_1:
mov rsi, rdx
mov rbx, rcx
call std_detect::detect::cache::detect_and_initialize
mov rcx, rbx
mov rdx, rsi
jmp .LBB7_2
.LBB7_3:
mov rdi, rdx
mov rbx, rcx
call std_detect::detect::cache::detect_and_initialize
mov rcx, rbx
mov rdx, rdi
jmp .LBB7_4
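One possible shape for passing state between the initialisation and dispatch steps is to have the initialiser call through the pointer it has just selected instead of reloading it. The sketch below only illustrates the idea under assumed names (`JUMP_REF`, `select_impl`, `dot_generic`, `init_passing_state`, `dispatch`); whether it fits the existing codegen is an open question.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Signature shared by every implementation (assumed for illustration).
type DotFn = fn(&[f32], &[f32]) -> f32;

fn dot_generic(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Hypothetical jump reference; it starts out pointing at the initialiser.
static JUMP_REF: AtomicPtr<()> = AtomicPtr::new(init_passing_state as *mut ());

// Stand-in for the real selection logic, which would query CPU features.
fn select_impl() -> DotFn {
    dot_generic
}

// Initialiser: publish the selected pointer for later callers, then call
// through the local copy so no load of JUMP_REF is needed here.
fn init_passing_state(a: &[f32], b: &[f32]) -> f32 {
    let selected = select_impl();
    JUMP_REF.store(selected as *mut (), Ordering::Relaxed);
    selected(a, b)
}

// Dispatch used by every call after initialisation: one relaxed load.
fn dispatch(a: &[f32], b: &[f32]) -> f32 {
    let raw = JUMP_REF.load(Ordering::Relaxed);
    // SAFETY: JUMP_REF only ever holds pointers derived from `DotFn` values.
    let f: DotFn = unsafe { std::mem::transmute::<*mut (), DotFn>(raw) };
    f(a, b)
}
```

The trade-off is that the tail of the initialiser is no longer a plain inlined copy of the dispatch function, so the two code paths have to be generated separately.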