SyncTools - Header-Only Concurrency Utilities

A small collection of header-only C++17 helpers - fast queues, a futex-powered mutex, and a couple of object pools.

Features

Header-only - just drop the include/ folder into your project
Zero external dependencies - relies only on the C++ Standard Library
Lock-free queues - bounded MPSC & SPSC implementations for predictable performance
Object pools - fast memory recycling with optional per-thread caches
Futex-based High-Performance Mutex - lightweight spin-then-block design with optional ThreadSanitizer & deadlock checks
Small footprint and comprehensive unit tests
CI builds on GCC and Clang; coverage (~83%) is measured on GCC

Quick Start

Below is a tiny producer -> consumer demo you can paste into main.cpp to make sure everything builds and runs:

#include <iostream>
#include <mutex>
#include <sync_tools/mutex.hpp>
#include <sync_tools/spsc_queue.hpp>
#include <thread>

int main()
{
    // Single-producer / single-consumer queue with 1 Ki elements
    SyncTools::BoundedSPSCQueue<int, 1024> q;
    SyncTools::Mutex print_mutex;  // Just to serialise std::cout

    std::thread producer(
        [&]
        {
            for (int i = 1; i <= 10; ++i)
            {
                q.try_push(i);
            }
            q.stop();  // tell consumer we're done
        });

    std::thread consumer(
        [&]
        {
            while (auto v = q.blocking_pop())
            {
                std::scoped_lock lk(print_mutex);
                std::cout << "got " << *v << '\n';
            }
        });

    producer.join();
    consumer.join();
}

Build & run (Linux / Clang):

clang++ main.cpp -std=c++17 -I/path/to/SyncTools/include -pthread -O3 -o demo
./demo

You should see got 1 … got 10 printed on the console.

Installation

FetchContent (recommended)

Add the library directly from GitHub and let CMake manage include paths:

include(FetchContent)

FetchContent_Declare(
  SyncTools
  GIT_REPOSITORY https://github.com/Kirpatik/SyncTools.git
  GIT_TAG        v1.0.0
)
FetchContent_MakeAvailable(SyncTools)

target_link_libraries(your_target PRIVATE SyncTools::SyncTools)

Submodule / Vendored

git submodule add https://github.com/Kirpatik/SyncTools.git external/SyncTools

Then in your CMakeLists.txt:

add_subdirectory(external/SyncTools)

target_link_libraries(your_target PRIVATE SyncTools::SyncTools)

System-wide install

Build and install once, then find it via find_package:

cmake -S SyncTools -B build -DCMAKE_INSTALL_PREFIX=/usr/local
cmake --build build --target install

find_package(SyncTools REQUIRED)

target_link_libraries(your_target PRIVATE SyncTools::SyncTools)

All options are header-only; linking merely propagates include paths & compile flags.

Building & Running Benchmarks

The project ships with Google Benchmark suites (disabled by default).

mkdir build && cd build
cmake .. -DENABLE_BENCHMARKS=ON
make -j$(nproc)
./benchmarks/benchmarks

Building & Running Tests

Similarly, unit tests use GoogleTest:

mkdir build && cd build
cmake .. -DENABLE_TESTING=ON
make -j$(nproc)
./tests/tests

Usage

BoundedMPSCQueue

SyncTools::BoundedMPSCQueue<T, N> is a lock-free multi-producer / single-consumer ring buffer that uses per-slot sequence numbers (à la Vyukov) to avoid locks and the ABA problem.

Property	Value
Concurrency	Many producers, exactly one consumer
Lock-free push	Producers CAS a shared `head`; no contention with the consumer
Memory ordering	`relaxed` on hot path, `acquire/release` only when crossing slots
Capacity	Power-of-two `N`; memory footprint ≈ `N*sizeof(T)+overhead`
Back-off	Optional spinning via `blocking_push` before giving up

How it works

Producer reads head, checks cell’s sequence == head -> cell free. CAS head -> head+1 and constructs object in-place.
Consumer reads tail, checks cell’s sequence == tail+1 -> element ready. Destroys object, stores seq = tail+N to mark free.
Sequence numbers monotonically increase, so ABA is impossible within 2*capacity.
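
The steps above can be sketched as follows. This is a minimal illustration of the per-slot sequence number technique, not the SyncTools source: the class name SketchMpscQueue and its try_push / try_pop members are hypothetical, and it assumes T is default-constructible and move-assignable (it assigns into the cell instead of constructing in place).

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical sketch of a Vyukov-style bounded queue; not the SyncTools implementation.
template <typename T, std::size_t N>
class SketchMpscQueue
{
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

    struct Cell
    {
        std::atomic<std::size_t> seq{0};
        T value{};  // assumption: default-constructible T (real code would construct in place)
    };

public:
    SketchMpscQueue()
    {
        for (std::size_t i = 0; i < N; ++i)
        {
            cells_[i].seq.store(i, std::memory_order_relaxed);  // cell i is free for ticket i
        }
    }

    // Safe to call from many producer threads.
    bool try_push(T v)
    {
        std::size_t head = head_.load(std::memory_order_relaxed);
        for (;;)
        {
            Cell& cell = cells_[head & (N - 1)];
            const std::size_t seq = cell.seq.load(std::memory_order_acquire);
            const auto diff = static_cast<std::ptrdiff_t>(seq - head);
            if (diff == 0)
            {
                // Cell is free: try to claim ticket `head`.
                if (head_.compare_exchange_weak(head, head + 1, std::memory_order_relaxed))
                {
                    cell.value = std::move(v);
                    cell.seq.store(head + 1, std::memory_order_release);  // publish to consumer
                    return true;
                }
                // A failed CAS reloaded `head`; loop and retry.
            }
            else if (diff < 0)
            {
                return false;  // cell still holds an unconsumed element: queue is full
            }
            else
            {
                head = head_.load(std::memory_order_relaxed);  // another producer got ahead
            }
        }
    }

    // Must be called from exactly one consumer thread.
    std::optional<T> try_pop()
    {
        const std::size_t tail = tail_;
        Cell& cell = cells_[tail & (N - 1)];
        if (cell.seq.load(std::memory_order_acquire) != tail + 1)
        {
            return std::nullopt;  // nothing published in this cell yet
        }
        std::optional<T> out{std::move(cell.value)};
        cell.seq.store(tail + N, std::memory_order_release);  // mark free for the next lap
        tail_ = tail + 1;
        return out;
    }

private:
    std::array<Cell, N> cells_;
    std::atomic<std::size_t> head_{0};
    std::size_t tail_{0};  // owned by the single consumer, so no atomic needed
};
```

Because each cell's sequence number only moves forward (ticket, ticket + 1, ticket + N), a producer holding a stale `head` sees a mismatch and reloads rather than overwriting a live element.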

Example - logging fan-in

#include <cstddef>
#include <iostream>
#include <string>
#include <sync_tools/mpsc_queue.hpp>
#include <thread>
#include <vector>

struct LogMsg
{
    std::string text;
};

constexpr size_t QUEUE_CAP = 1 << 12;
SyncTools::BoundedMPSCQueue<LogMsg, QUEUE_CAP> logq;

void logger()
{
    while (auto m = logq.blocking_pop())
    {
        std::cout << m->text << '\n';
    }
}

void worker(int id)
{
    for (int i = 0; i < 1000; ++i)
    {
        logq.blocking_push(LogMsg{"msg " + std::to_string(id) + ":" + std::to_string(i)});
    }
}

int main()
{
    std::thread consumer(logger);
    std::vector<std::thread> producers;
    for (int i = 0; i < 4; ++i)
    {
        producers.emplace_back(worker, i);
    }
    for (auto& t : producers)
    {
        t.join();
    }
    logq.stop();
    consumer.join();
}

Mutex

SyncTools::Mutex is a lightweight futex-based lock designed for the uncontended-first workloads typical in modern C++ servers and games.

Property	Value
Size	4 bytes data + padding to 64-byte cache line
Contention strategy	spin (exponential back-off) -> `futex` sleep
Fairness	Wakes one waiter; prevents starvation
Debug aids	Optional ThreadSanitizer annotations & deadlock checks

How it works

Fast path - an atomic CAS tries to move the state from Free->Locked; succeeds for the uncontended case without syscalls.
Spin-then-yield - failing CAS enters a short adaptive spin (1,2,4…1024 cpu_relax loops) to let the owner unlock.
Blocking path - after spins it marks the state Contended and calls futex(FUTEX_WAIT_PRIVATE); kernel parks the thread.
Unlock - owner swaps state to Free; if previous state was Contended it issues futex(FUTEX_WAKE_PRIVATE, 1) to wake one waiter.

This design avoids kernel calls in the common case.
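
A three-state futex lock along these lines could look like the sketch below. It is Linux-only and is not the SyncTools source: SketchFutexMutex, cpu_relax, futex and hammer are hypothetical names, the state values are assumptions, and there are no ThreadSanitizer annotations or deadlock checks. The syscall usage follows the futex(2) interface.

```cpp
#include <atomic>
#include <cassert>
#include <linux/futex.h>
#include <mutex>
#include <sys/syscall.h>
#include <thread>
#include <unistd.h>
#include <vector>

// Hypothetical sketch (Linux only); not the SyncTools implementation.
class SketchFutexMutex
{
    enum : int { Free = 0, Locked = 1, Contended = 2 };

public:
    void lock()
    {
        // Fast path: Free -> Locked without a syscall.
        int expected = Free;
        if (state_.compare_exchange_strong(expected, Locked, std::memory_order_acquire))
        {
            return;
        }
        // Spin with exponential back-off: 1, 2, 4 ... 1024 relax iterations.
        for (int spins = 1; spins <= 1024; spins *= 2)
        {
            for (int i = 0; i < spins; ++i)
            {
                cpu_relax();
            }
            expected = Free;
            if (state_.compare_exchange_strong(expected, Locked, std::memory_order_acquire))
            {
                return;
            }
        }
        // Blocking path: mark Contended; if the old value was Free we now own the lock,
        // otherwise sleep until the word no longer reads Contended and try again.
        while (state_.exchange(Contended, std::memory_order_acquire) != Free)
        {
            futex(FUTEX_WAIT_PRIVATE, Contended);
        }
    }

    void unlock()
    {
        // Only pay for a syscall if a waiter may be parked.
        if (state_.exchange(Free, std::memory_order_release) == Contended)
        {
            futex(FUTEX_WAKE_PRIVATE, 1);
        }
    }

private:
    static void cpu_relax()
    {
#if defined(__x86_64__) || defined(__i386__)
        __builtin_ia32_pause();
#else
        std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler barrier stand-in
#endif
    }

    long futex(int op, int val)
    {
        static_assert(sizeof(std::atomic<int>) == sizeof(int), "futex needs a plain 32-bit word");
        return syscall(SYS_futex, reinterpret_cast<int*>(&state_), op, val, nullptr, nullptr, 0);
    }

    std::atomic<int> state_{Free};
};

// Usage: lock()/unlock() are enough for std::lock_guard to work with the sketch.
inline int hammer(int threads, int iters)
{
    SketchFutexMutex m;
    int counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
    {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i)
            {
                std::lock_guard<SketchFutexMutex> guard(m);
                ++counter;
            }
        });
    }
    for (auto& th : pool)
    {
        th.join();
    }
    return counter;
}
```

A thread that wins the lock on the blocking path leaves the state at Contended, so its unlock always issues a wake. That costs one extra syscall now and then, but it means a sleeping waiter is never missed.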

Example

#include <cstdio>
#include <mutex>
#include <sync_tools/mutex.hpp>
#include <thread>
#include <vector>

SyncTools::Mutex mtx;
int shared = 0;

void worker()
{
    for (int i = 0; i < 100'000; ++i)
    {
        std::scoped_lock lock{mtx};  // RAII wrapper works because Mutex satisfies BasicLockable
        ++shared;
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i)
    {
        threads.emplace_back(worker);
    }
    for (auto& t : threads)
    {
        t.join();
    }
    std::printf("shared=%d\n", shared);
}

(SyncTools::Mutex models the BasicLockable interface, so it integrates with std::lock_guard and std::scoped_lock.)

ObjectPool

SyncTools::ObjectPool<T> is a fixed-capacity arena allocator that delivers constant-time acquire/release with minimal bookkeeping. Capacity is rounded to the next power-of-two so the circular index mask is a single &.

Property	Value
Capacity growth	Fixed (next power-of-two of constructor argument)
Thread safety	Internal futex-based `Mutex` guarding a ring buffer of pointers
Batch API	`acquire_batch` / `release_batch` fetch or return up to N objects in one lock
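
To make the ring-of-pointers idea concrete, here is a minimal sketch under stated assumptions. SketchObjectPool is a hypothetical name, std::mutex stands in for the futex-based Mutex, T must be default-constructible, and the batch API is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical sketch of a fixed-capacity pool; not the SyncTools implementation.
template <typename T>
class SketchObjectPool
{
public:
    explicit SketchObjectPool(std::size_t requested)
        : capacity_(next_pow2(requested)),
          mask_(capacity_ - 1),
          storage_(std::make_unique<T[]>(capacity_)),
          ring_(capacity_)
    {
        for (std::size_t i = 0; i < capacity_; ++i)
        {
            ring_[i] = &storage_[i];  // every object starts out free
        }
        count_ = capacity_;
    }

    std::size_t capacity() const { return capacity_; }

    // Returns nullptr when the pool is exhausted.
    T* acquire()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0)
        {
            return nullptr;
        }
        T* p = ring_[read_ & mask_];  // power-of-two capacity: wrap with a single &
        ++read_;
        --count_;
        return p;
    }

    void release(T* p)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(count_ < capacity_ && "released more objects than were acquired");
        ring_[write_ & mask_] = p;
        ++write_;
        ++count_;
    }

private:
    static std::size_t next_pow2(std::size_t n)
    {
        std::size_t p = 1;
        while (p < n)
        {
            p <<= 1;
        }
        return p;
    }

    std::size_t capacity_;
    std::size_t mask_;
    std::unique_ptr<T[]> storage_;  // the objects themselves, allocated once
    std::vector<T*> ring_;          // circular buffer of pointers to free objects
    std::size_t read_ = 0;
    std::size_t write_ = 0;
    std::size_t count_ = 0;
    std::mutex mutex_;
};
```

In this sketch a requested size of 3 yields a capacity of 4, and a fifth acquire() returns nullptr until something is released.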

Example

#include <sync_tools/object_pool.hpp>

struct Bullet
{
    float x, y, vx, vy;
};

int main()
{
    SyncTools::ObjectPool<Bullet> pool{/*capacity*/ 4096};

    Bullet* b = pool.acquire();
    if (b)
    {
        // initialise and use bullet …
        pool.release(b);
    }
}

ThreadCachedObjectPool

SyncTools::ThreadCachedObjectPool<T, BatchSize, FlushThreshold> layers per-thread caches on top of a global ObjectPool to remove the mutex from the fast path.

Property	Value
Fast path	No locking - L1 resident array per thread
Batch size	`BatchSize` (default 1024) objects fetched from the global pool when the cache empties
Flush policy	When the cache reaches `FlushThreshold` (default 2048) it flushes half back to the global pool
Contention	Only when the local cache is empty or full; the cost of the global lock is amortised over a whole batch of operations
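
The refill and flush policy from the table can be sketched as below. This is an illustration only: SketchCachedPool, LocalCache and global_free are hypothetical names, the cache is an explicit per-thread handle rather than thread-local storage, std::mutex stands in for the library's Mutex, and the template defaults are shrunk to 4 and 8 so the behaviour is easy to follow.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical sketch of a cache layered over a locked global pool; not the SyncTools implementation.
template <typename T, std::size_t BatchSize = 4, std::size_t FlushThreshold = 8>
class SketchCachedPool
{
public:
    // One per thread; only its owner touches it, so it needs no locking.
    struct LocalCache
    {
        std::vector<T*> items;
    };

    explicit SketchCachedPool(std::size_t capacity) : storage_(capacity)
    {
        global_.reserve(capacity);
        for (auto& obj : storage_)
        {
            global_.push_back(&obj);
        }
    }

    // Fast path: pop from the local cache. Slow path: refill one batch under the lock.
    T* acquire(LocalCache& cache)
    {
        if (cache.items.empty())
        {
            refill(cache);
        }
        if (cache.items.empty())
        {
            return nullptr;  // global pool exhausted as well
        }
        T* p = cache.items.back();
        cache.items.pop_back();
        return p;
    }

    // Fast path: push to the local cache. Slow path: flush half when the threshold is hit.
    void release(LocalCache& cache, T* p)
    {
        cache.items.push_back(p);
        if (cache.items.size() >= FlushThreshold)
        {
            flush_half(cache);
        }
    }

    std::size_t global_free()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return global_.size();
    }

private:
    void refill(LocalCache& cache)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        for (std::size_t i = 0; i < BatchSize && !global_.empty(); ++i)
        {
            cache.items.push_back(global_.back());
            global_.pop_back();
        }
    }

    void flush_half(LocalCache& cache)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::size_t keep = cache.items.size() / 2;
        while (cache.items.size() > keep)
        {
            global_.push_back(cache.items.back());
            cache.items.pop_back();
        }
    }

    std::vector<T> storage_;   // backing objects, allocated once
    std::vector<T*> global_;   // global free list, guarded by mutex_
    std::mutex mutex_;
};
```

With these toy parameters the lock is taken once per 4 acquires and once per flush, which is the amortisation the table describes. The library's defaults (1024 and 2048) stretch that interval much further.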

Example

#include <sync_tools/object_pool.hpp>
#include <thread>
#include <vector>

struct Particle
{
    float pos[3];
    float vel[3];
};

SyncTools::ThreadCachedObjectPool<Particle> tpool{/*global*/ 1 << 16};

void simulate()
{
    Particle* p = tpool.acquire();
    if (!p)
    {
        return;  // pool exhausted
    }

    // … simulate particle …

    tpool.release(p);
}

int main()
{
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
    {
        workers.emplace_back(simulate);
    }
    for (auto& t : workers)
    {
        t.join();
    }
}

BoundedSPSCQueue

SyncTools::BoundedSPSCQueue<T, N> provides a lock-free single-producer / single-consumer channel with optional blocking helpers (blocking_push/blocking_pop) implemented as spin-waits.

Property	Value
Concurrency	Exactly one producer, one consumer
Wait-free guarantee	Both push & pop
Capacity	Compile-time constant power-of-two `N`
Stop control	`run()` / `stop()` gate blocking methods
Destructor	Drains queue, calls `T` destructors
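
For readers new to SPSC rings, the core of such a queue fits in a few lines. The sketch below is an assumption-laden illustration, not the SyncTools source: SketchSpscQueue is a hypothetical name, T must be default-constructible, and the blocking helpers and run() / stop() gate are omitted.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>

// Hypothetical sketch of a wait-free SPSC ring; not the SyncTools implementation.
template <typename T, std::size_t N>
class SketchSpscQueue
{
    static_assert(N != 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Producer thread only.
    bool try_push(T v)
    {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == N)
        {
            return false;  // full
        }
        slots_[head & (N - 1)] = std::move(v);
        head_.store(head + 1, std::memory_order_release);  // publish the element
        return true;
    }

    // Consumer thread only.
    std::optional<T> try_pop()
    {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
        {
            return std::nullopt;  // empty
        }
        std::optional<T> out{std::move(slots_[tail & (N - 1)])};
        tail_.store(tail + 1, std::memory_order_release);  // hand the slot back
        return out;
    }

private:
    std::array<T, N> slots_{};
    // Separate cache lines so producer and consumer do not false-share.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Each index has exactly one writer, so neither side ever retries: both operations finish in a bounded number of steps, which is what the wait-free row refers to.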

Example - audio sample pipe

#include <atomic>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <iostream>
#include <sync_tools/spsc_queue.hpp>
#include <thread>

constexpr size_t CAP = 1024;
SyncTools::BoundedSPSCQueue<float, CAP> samples;
std::atomic<bool> running{true};

void producer()
{
    float phase = 0.0f;
    while (running.load())
    {
        float sample = std::sin(phase);
        phase += 0.01f;
        samples.blocking_push(sample);
    }
}

void consumer()
{
    while (running.load())
    {
        if (auto s = samples.blocking_pop())
        {
            std::cout << s.value() << '\n';
        }
    }
}

int main()
{
    std::thread prod(producer), cons(consumer);
    std::this_thread::sleep_for(std::chrono::seconds(1));
    running.store(false);
    samples.stop();
    prod.join();
    cons.join();
}

Benchmarks

All measurements were taken on a 6-core (12-thread) AMD Ryzen 5 5600 (3.5 GHz base, 32 MB L3, Ubuntu 22.04) compiled with gcc 11.4.0, flags -O3 -march=native, on June 10 2025.

How to reproduce:

mkdir build && cd build && cmake .. -DENABLE_BENCHMARKS=ON -DCMAKE_BUILD_TYPE=Release && make -j$(nproc) && ./benchmarks/benchmarks

What each benchmark measures

Group	Google-Benchmark fixture	What it does & why it matters
Short lock	`BM_ShortLock<Mutex>`	Threads repeatedly take a zero-work critical section. Measures pure lock/unlock latency under different thread counts (1 -> 16). Ideal for uncontended or micro-contention scenarios such as reference-count updates or short atomic sections.
Mixed lock	`BM_MixedLock<Mutex>`	Same as above but the thread spins ~100 NOPs inside the critical section, emulating a read-dominant workload (10% writes, 90% reads). Shows how the mutex behaves when the protected region is small but non-trivial.
Fairness	`BM_MutexFairness<Mutex>`	All threads compete for the same mutex while the test records time-to-acquire for every lock. Reports average wait and per-thread ops/s - reveals starvation or convoy effects.
Bounded MPSC queue	`BM_Queue<BoundedMPSCQueue>`	12 producer threads push 1M items each; one consumer drains them. The queue is optionally prefilled (0/0.5/-1 capacity) to stress the steady-state path. Reports throughput (M items/s) and compares against Boost and TBB queues.
Bounded SPSC queue	`BM_SPSCQueue<…>`	Single producer + single consumer, otherwise identical to MPSC test. Highlights cache-friendly, lock-free communication when only two threads are involved.
Acquire/Release	`BM_AcquireRelease<Pool>`	Each thread loops 1M times: acquire object from pool -> use -> release. Counts raw allocation + recycle throughput. Implementations: thread-cached pool (SyncTools), TBB scalable allocator wrapper, Boost.ObjectPool wrapper, native new/delete.
Pipeline	`BM_Pipeline<Pool>`	3-stage hand-off: producer acquires object + pushes to per-thread SPSC queue -> consumer pops and recycles. Half threads are producers, half consumers. Simulates real object flow (e.g., packet processing).

Each Google-Benchmark fixture sets state.SetIterationTime() to wall-clock seconds (UseRealTime) so results are stable even if the OS steals CPU.
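
As a rough idea of what the Fairness row measures, the sketch below records time-to-acquire for every lock and averages it. It is not the project's benchmark: it uses plain std::chrono instead of Google Benchmark, and WaitStats and measure_avg_wait are hypothetical names. Any type with lock() / unlock() can be plugged in as MutexT.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper types; not part of SyncTools or Google Benchmark.
struct WaitStats
{
    std::uint64_t acquisitions = 0;
    double avg_wait_ns = 0.0;
};

template <typename MutexT>
WaitStats measure_avg_wait(int threads, int iters_per_thread)
{
    using Clock = std::chrono::steady_clock;
    MutexT mtx;
    std::vector<std::uint64_t> wait_ns(static_cast<std::size_t>(threads), 0);
    std::vector<std::thread> workers;

    for (int t = 0; t < threads; ++t)
    {
        workers.emplace_back([&, t] {
            std::uint64_t local = 0;
            for (int i = 0; i < iters_per_thread; ++i)
            {
                const auto start = Clock::now();
                mtx.lock();
                const auto acquired = Clock::now();  // wait = time spent getting the lock
                mtx.unlock();
                local += static_cast<std::uint64_t>(
                    std::chrono::duration_cast<std::chrono::nanoseconds>(acquired - start).count());
            }
            wait_ns[static_cast<std::size_t>(t)] = local;  // one slot per thread, no sharing
        });
    }
    for (auto& w : workers)
    {
        w.join();
    }

    WaitStats stats;
    stats.acquisitions =
        static_cast<std::uint64_t>(threads) * static_cast<std::uint64_t>(iters_per_thread);
    std::uint64_t total = 0;
    for (const auto ns : wait_ns)
    {
        total += ns;
    }
    if (stats.acquisitions != 0)
    {
        stats.avg_wait_ns = static_cast<double>(total) / static_cast<double>(stats.acquisitions);
    }
    return stats;
}
```

For example, `measure_avg_wait<std::mutex>(8, 100000)` gives a baseline to compare another lock type against. Keeping the per-thread totals separate also makes it easy to spot a thread that waits far longer than its peers.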

1. Lock micro-benchmarks — latency (ns, lower = better)

Threads	std::mutex	SyncTools::Mutex	tbb::spin_mutex	PThreadAdaptive	AbseilMutex
Short lock
1	5.10	3.52	3.03	6.62	4.20
2	34.2	10.6	16.0	93.8	101
4	77.9	29.2	72.3	391	297
8	182	87.3	243	1006	576
16	382	188	461	1766	1068
Mixed lock
1	46.6	50.7	47.7	51.0	52.2
2	244	95.0	99.4	239	171
4	462	191	219	545	534
8	1246	490	547	1380	965
16	3642	964	1136	3645	1847
Fairness - avg. wait
1	24.5	23.7	23.4	40.0	23.6
2	100	85.4	85.3	77.6	85.7
4	405	191	216	403	307
8	580	435	608	980	521
16	1509	792	1219	2552	1219

2. Queue throughput (million items/s, higher = better)

Queue	0-fill	8k fill	16k fill
Bounded MPSC (SyncTools)	8.84	9.49	9.76
Boost lock-free	3.81	3.79	3.84
TBB bounded	4.49	5.19	5.29
Bounded SPSC (SyncTools)	1.36	1.42	1.36
Boost SPSC	1.36	1.50	1.47

3. Object pools & pipeline (items/s, higher = better)

Acquire & release

Threads	Thread-cached pool	TBB pool	ObjectPool	new/delete
2	1.46G	73.8M	52.4M	98.9M
4	1.32G	70.1M	19.2M	84.4M
6	1.12G	61.2M	10.7M	79.8M
8	855M	52.9M	6.90M	66.2M
10	727M	46.9M	4.92M	60.1M
12	603M	40.1M	3.84M	52.3M
14	501M	37.5M	3.31M	46.6M
16	556M	38.3M	2.90M	50.0M

3-stage pipeline

Threads	Pipeline Thread-cached	Pipeline TBB	Pipeline ObjectPool	Pipeline new/delete
2	57.6M	11.6M	15.6M	8.55M
4	49.9M	11.4M	7.59M	8.32M
6	48.9M	10.7M	4.94M	8.13M
8	45.2M	9.85M	3.71M	7.46M
10	44.4M	10.1M	2.81M	7.75M
12	39.8M	9.18M	2.27M	7.24M
14	35.5M	7.04M	2.11M	5.81M
16	26.2M	6.16M	1.82M	4.96M

Key take-aways

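Under contention (2-16 threads) SyncTools::Mutex is about 2.0-3.8x faster than std::mutex in the short- and mixed-lock tests and also beats tbb::spin_mutex and Abseil. With a single thread all locks are within a few nanoseconds of each other, and tbb::spin_mutex is slightly faster in the short-lock test. In the fairness test SyncTools::Mutex has the lowest average wait from 4 threads up.
The bounded MPSC queue sustains ~9-10M items/s, about 2.3-2.5x Boost and 1.8-2.0x TBB throughput at every prefill level. The bounded SPSC queue is roughly on par with Boost SPSC (1.36-1.42M vs 1.36-1.50M items/s).
The thread-cached object pool delivers about 13-20x the throughput of TBB's scalable allocator wrapper and about 11-16x that of plain new/delete.
In the 3-stage pipeline the thread-cached pool moves 26-58M items/s depending on thread count, about 4-5x the TBB-based pipeline.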

Raw numbers come from Google-Benchmark; five independent runs show ±3% variance.

License

SyncTools is licensed under the Apache License 2.0. See the LICENSE file for details.


