use 128-bit AVX ops for atomic load/store when possible by Smattr · Pull Request #323 · Smattr/rumur

Smattr · 2026-05-10T16:11:36Z

Since this code was originally written, all x86 vendors have committed to guaranteeing that aligned 128-bit AVX moves are atomic.¹ Using these is cheaper than the __sync CAS built-ins (which lower to CMPXCHG16B) because they do not need a Read For Ownership (RFO).

Profiling misc/pending-queue.m and misc/pending-queue-4k.m with 8 threads shows no significant effect on runtime performance. However, the change is expected to benefit higher thread counts, where cacheline bouncing becomes more of a concern.

¹ https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
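
To illustrate the idea, here is a minimal sketch of a 128-bit load/store built on a single aligned VMOVDQA, with a runtime check for AVX. It is not the code from this PR: the names `slot128_t`, `try_load128`, and `try_store128` are hypothetical, and the sketch simply reports failure when AVX is unavailable so the caller can keep using its existing CAS-based path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit slot type; VMOVDQA requires 16-byte alignment. */
typedef struct {
  _Alignas(16) uint64_t word[2];
} slot128_t;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <emmintrin.h>
#define HAVE_AVX_PATH 1

/* One aligned 128-bit load. The "memory" clobber is a compiler barrier
 * that stops other memory accesses being reordered across the move. */
__attribute__((target("avx")))
static void avx_load(const slot128_t *src, slot128_t *out) {
  __m128i v;
  __asm__ volatile("vmovdqa %1, %0" : "=x"(v) : "m"(*src) : "memory");
  memcpy(out, &v, sizeof(v)); /* copy into the caller's private buffer */
}

/* One aligned 128-bit store, with the same compiler barrier. */
__attribute__((target("avx")))
static void avx_store(slot128_t *dst, const slot128_t *in) {
  __m128i v;
  memcpy(&v, in, sizeof(v)); /* read the caller's private value */
  __asm__ volatile("vmovdqa %1, %0" : "=m"(*dst) : "x"(v) : "memory");
}

/* Runtime check, so the AVX instruction is never executed on a CPU
 * that lacks it. */
static bool have_avx(void) { return __builtin_cpu_supports("avx") != 0; }
#else
#define HAVE_AVX_PATH 0
static bool have_avx(void) { return false; }
#endif

/* Returns true if the load was done with a single AVX move; false means
 * the caller must fall back to another mechanism (for example a CAS). */
static bool try_load128(const slot128_t *src, slot128_t *out) {
  assert(((uintptr_t)src & 15u) == 0);
#if HAVE_AVX_PATH
  if (have_avx()) {
    avx_load(src, out);
    return true;
  }
#endif
  (void)out;
  return false;
}

/* Store counterpart of try_load128, with the same return convention. */
static bool try_store128(slot128_t *dst, const slot128_t *in) {
  assert(((uintptr_t)dst & 15u) == 0);
#if HAVE_AVX_PATH
  if (have_avx()) {
    avx_store(dst, in);
    return true;
  }
#endif
  (void)in;
  return false;
}
```

A caller would try `try_load128` first and use its CAS-based load only when it returns false. The sketch relies only on the ordering x86 gives ordinary loads and stores; a sequentially consistent store would need an additional fence.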


Set insertion was optimistically assuming the current slot is empty and trying to CAS into it. This works if your model of the hardware is that a CAS is not significantly more expensive than a read. However on platforms like x86-64, (1) reads are naturally atomic so an “atomic” read is cheap and (2) CAS involves a Read For Ownership (RFO) that can involve expensive cache operations.

This change makes set insertion check the CAS might actually succeed before attempting it. On some in-tree models, it seems like a reasonable speed up:

┌─────────────────────────┬────────┬────────┐
│                         │ before │ after  │
├─────────────────────────┼────────┼────────┤
│ misc/pending-queue.m    │ 8m03s  │ 7m41s  │
│ misc/pending-queue-4k.m │ 8m02s  │ 7m51s  │
└─────────────────────────┴────────┴────────┘
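
The check-before-CAS pattern from this commit message can be sketched with C11 atomics as follows. This is an illustration under assumptions, not Rumur's actual set: `set_t`, `SET_SIZE`, `EMPTY`, and `set_insert` are hypothetical names, slots are 64-bit rather than 128-bit, and probing is plain linear probing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SET_SIZE = 8 };    /* hypothetical fixed capacity */
#define EMPTY UINT64_C(0) /* hypothetical "slot unused" sentinel */

typedef struct {
  _Atomic uint64_t slot[SET_SIZE];
} set_t;

/* Insert value into the set. Returns true if this call inserted it,
 * false if it was already present or the set is full. */
static bool set_insert(set_t *s, uint64_t value) {
  assert(value != EMPTY);
  for (size_t i = 0; i < SET_SIZE; i++) {
    size_t idx = (size_t)((value + i) % SET_SIZE);

    /* Cheap read first: no cache line ownership is required. */
    uint64_t seen = atomic_load_explicit(&s->slot[idx], memory_order_acquire);

    if (seen == EMPTY) {
      /* Only now, when the CAS might succeed, pay for it. */
      uint64_t expected = EMPTY;
      if (atomic_compare_exchange_strong(&s->slot[idx], &expected, value))
        return true;
      /* Lost a race; expected now holds whatever the winner wrote. */
      seen = expected;
    }

    if (seen == value)
      return false; /* already in the set */
    /* Slot holds a different value: continue probing. */
  }
  return false; /* no free slot found */
}
```

The optimistic variant described above would skip the initial load and go straight to the compare-exchange on every probed slot, which requests exclusive ownership of the cache line even when the slot is already occupied.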

Smattr added 3 commits May 10, 2026 10:06

remove an unnecessary ';'

db90ade

Smattr force-pushed the smattr/8822389e-a2ef-4173-9c14-1993a470fd1f branch from 7a0e6da to db90ade Compare May 10, 2026 17:06

Smattr merged commit 8fe151f into main May 10, 2026
42 of 43 checks passed

Smattr deleted the smattr/8822389e-a2ef-4173-9c14-1993a470fd1f branch May 10, 2026 21:31
