
use 128-bit AVX ops for atomic load/store when possible#323

Merged
Smattr merged 3 commits into main from smattr/8822389e-a2ef-4173-9c14-1993a470fd1f
May 10, 2026


Smattr commented May 10, 2026

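Since this code was originally written, all x86 vendors have committed to the guarantee that aligned 128-bit AVX moves are atomic on CPUs that support AVX.¹ Using these is cheaper than the __sync CAS built-ins (which lower to CMPXCHG16B) because CMPXCHG16B always takes the cache line exclusively via a Read For Ownership (RFO), even when it is only being used to read, whereas a plain load does not.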

Profiling misc/pending-queue.m and misc/pending-queue-4k.m with 8 threads shows no significant effect on runtime performance. However, this is expected to benefit higher thread counts, where cache line bouncing becomes more of a concern.

¹ https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
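
For illustration, here is a minimal sketch of what a 128-bit atomic load and store built on a single aligned move could look like. It is not the code in this PR: `slot128_t`, `load128`, `store128` and `have_atomic_128bit_moves` are hypothetical names, and the sketch assumes GCC or Clang targeting x86-64. The runtime check matters because the atomicity guarantee only covers CPUs that advertise AVX.

```c
#include <assert.h>
#include <emmintrin.h> /* __m128i */
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit slot: two 64-bit words that must change as a unit.
 * The 16-byte alignment is required by the aligned move used below. */
typedef struct {
  _Alignas(16) uint64_t word[2];
} slot128_t;

/* Aligned 128-bit moves are only guaranteed atomic on CPUs with AVX, so
 * callers should check this before using load128/store128. */
static int have_atomic_128bit_moves(void) {
  return __builtin_cpu_supports("avx");
}

/* Read *src with one VMOVDQA. Inline assembly pins the access to a single
 * instruction; the "memory" clobber stops the compiler reordering other
 * memory accesses across it. */
static slot128_t load128(const slot128_t *src) {
  assert((uintptr_t)src % 16 == 0);
  __m128i v;
  __asm__ volatile("vmovdqa %1, %0" : "=x"(v) : "m"(*src) : "memory");
  slot128_t out;
  memcpy(&out, &v, sizeof(out));
  return out;
}

/* Write value to *dst with one VMOVDQA. This sketch adds no fence, so it is
 * not a sequentially consistent store. */
static void store128(slot128_t *dst, slot128_t value) {
  assert((uintptr_t)dst % 16 == 0);
  __m128i v;
  memcpy(&v, &value, sizeof(v));
  __asm__ volatile("vmovdqa %1, %0" : "=m"(*dst) : "x"(v) : "memory");
}
```

A caller would test `have_atomic_128bit_moves()` once at startup and fall back to the CMPXCHG16B-based path when it returns 0.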

Smattr added 3 commits May 10, 2026 10:06
Set insertion optimistically assumed the current slot was empty and
tried to CAS into it. This works if your model of the hardware is that
a CAS is not significantly more expensive than a read. However, on
platforms like x86-64, (1) reads are naturally atomic so an “atomic”
read is cheap and (2) CAS involves a Read For Ownership (RFO) that can
involve expensive cache operations.

This change makes set insertion check that the CAS might actually
succeed before attempting it. On some in-tree models, this gives a
modest speedup:

  ┌─────────────────────────┬────────┬────────┐
  │                         │ before │  after │
  ├─────────────────────────┼────────┼────────┤
  │ misc/pending-queue.m    │  8m03s │  7m41s │
  │ misc/pending-queue-4k.m │  8m02s │  7m51s │
  └─────────────────────────┴────────┴────────┘
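
The read-before-CAS pattern described above can be sketched as follows. This is an illustrative example, not the in-tree set implementation: `set_t`, `set_insert`, `SLOT_COUNT`, the `EMPTY` sentinel, the 64-bit element type and the linear probing scheme are all assumptions made to keep the example small, and it uses C11 atomics rather than the __sync built-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { SLOT_COUNT = 8 };  /* hypothetical, fixed table size */
#define EMPTY UINT64_C(0) /* hypothetical sentinel; 0 is never a valid element */

typedef struct {
  _Atomic uint64_t slot[SLOT_COUNT];
} set_t;

/* Insert value using linear probing. Returns true if this call inserted it,
 * false if it was already present or the table is full. */
static bool set_insert(set_t *s, uint64_t value) {
  assert(value != EMPTY);
  for (size_t i = 0; i < SLOT_COUNT; i++) {
    size_t idx = (size_t)((value + i) % SLOT_COUNT);

    /* Cheap read first: this only needs a shared copy of the cache line. */
    uint64_t seen = atomic_load_explicit(&s->slot[idx], memory_order_acquire);

    if (seen == EMPTY) {
      /* The slot looked empty, so the CAS might succeed. Only now pay for
       * exclusive ownership of the line. */
      uint64_t expected = EMPTY;
      if (atomic_compare_exchange_strong(&s->slot[idx], &expected, value))
        return true;
      /* Lost a race; expected now holds whatever the winner stored. */
      seen = expected;
    }

    if (seen == value)
      return false; /* already in the set */

    /* Slot holds a different element: probe the next one. */
  }
  return false; /* table full */
}
```

When a slot is already occupied, which is the common case once the set fills up, this version never issues a CAS on it, so the cache line can stay shared between threads.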
Smattr force-pushed the smattr/8822389e-a2ef-4173-9c14-1993a470fd1f branch from 7a0e6da to db90ade May 10, 2026 17:06
Smattr merged commit 8fe151f into main May 10, 2026
42 of 43 checks passed
Smattr deleted the smattr/8822389e-a2ef-4173-9c14-1993a470fd1f branch May 10, 2026 21:31