use 128-bit AVX ops for atomic load/store when possible#323
Merged
Conversation
Since this code was originally written, all x86 vendors have committed to the guarantee that 128-bit AVX moves are atomic.¹ Using these is cheaper than the __sync CAS built-ins (that lower to a CMPXCHG16B) because they do not need a Read For Ownership (RFO). Profiling misc/pending-queue.m and misc/pending-queue-4k.m with 8 threads indicates this does not have a significant effect on runtime performance. But this is expected to benefit higher thread counts where cacheline bouncing becomes more of a concern. ¹ https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
Set insertion was optimistically assuming the current slot is empty and trying to CAS into it. This works if your model of the hardware is that a CAS is not significantly more expensive than a read. However on platforms like x86-64, (1) reads are naturally atomic so an “atomic” read is cheap and (2) CAS involves a Read For Ownership (RFO) that can involve expensive cache operations. This change makes set insertion check the CAS might actually succeed before attempting it. On some in-tree models, it seems like a reasonable speed up: ┌─────────────────────────┬────────┬────────┐ │ │ before │ after │ ├─────────────────────────┼────────┼────────┤ │ misc/pending-queue.m │ 8m03s │ 7m41s │ │ misc/pending-queue-4k.m │ 8m02s │ 7m51s │ └─────────────────────────┴────────┴────────┘
7a0e6da to
db90ade
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Since this code was originally written, all x86 vendors have committed to the guarantee that 128-bit AVX moves are atomic.¹ Using these is cheaper than the __sync CAS built-ins (that lower to a CMPXCHG16B) because they do not need a Read For Ownership (RFO).
Profiling misc/pending-queue.m and misc/pending-queue-4k.m with 8 threads indicates this does not have a significant effect on runtime performance. But this is expected to benefit higher thread counts where cacheline bouncing becomes more of a concern.
¹ https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688