A high-performance, multithreaded, shared-nothing, Redis-compatible in-memory key-value store written in modern C++20. Evolved from the single-threaded RapidKV that was built to outperform Redis, VortexKV scales across all available cores to challenge Dragonfly — the fastest multithreaded Redis alternative.
~11.9M SET ops/sec pipelined · ~9.9M GET ops/sec pipelined · Up to 3.7× faster than Dragonfly on pipelined reads · Sub-millisecond p99 latency across the board
- 11.9M ops/sec peak pipelined SET throughput (1.6× Dragonfly)
- 9.9M ops/sec peak pipelined GET throughput (3.7× Dragonfly)
- 2.58M ops/sec non-pipelined SET/GET — matching Dragonfly head-to-head
- Shared-nothing architecture — each shard owns its thread, io_context, database, and allocator; zero cross-shard locking
- Near-linear shard scaling with hash-based key routing and `asio::post` for cross-shard requests
- All the single-threaded RapidKV innovations retained: Robin Hood FlatMap (SoA), CompactString, zero-copy RESP parser, slab allocator, SIEVE eviction
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ CLIENT CONNECTIONS (TCP) │
│ memtier / redis-cli / any RESP client │
└──────────────────────────────────┬──────────────────────────────────────────────────┘
│ TCP accept()
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ SERVER (main thread, single acceptor) │
│ ┌───────────────────────────────────────────────────────────────────────────────┐ │
│ │ • asio::io_context for accept loop only │ │
│ │ • Round-robin connection assignment → GetNextShard() │ │
│ │ • Generates global routing hash seed (randomized) │ │
│ │ • Creates N Shard objects (N = hardware_concurrency or config) │ │
│ └───────────────────────────────────────────────────────────────────────────────┘ │
└──────────┬──────────┬──────────┬──────────────────────┬──────────┬──────────────────┘
│ │ │ ... │ │
socket move socket move socket move socket move socket move
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌──────────────┐┌──────────────┐┌──────────────┐┌──────────────┐┌──────────────┐
│ SHARD 0 ││ SHARD 1 ││ SHARD 2 ││ ... ││ SHARD N-1 │
│ (pinned to ││ (pinned to ││ (pinned to ││ ││ (pinned to │
│ core 0) ││ core 1) ││ core 2) ││ ││ core N-1) │
│──────────────││──────────────││──────────────││──────────────││──────────────│
│ Own thread ││ Own thread ││ Own thread ││ Own thread ││ Own thread │
│(std::jthread)││(std::jthread)││(std::jthread)││(std::jthread)││(std::jthread)│
│──────────────││──────────────││──────────────││──────────────││──────────────│
│Own io_context││Own io_context││Own io_context││Own io_context││Own io_context│
│──────────────││──────────────││──────────────││──────────────││──────────────│
│Own Database ││Own Database ││Own Database ││Own Database ││Own Database │
│──────────────││──────────────││──────────────││──────────────││──────────────│
│Own Dispatcher││Own Dispatcher││Own Dispatcher││Own Dispatcher││Own Dispatcher│
│──────────────││──────────────││──────────────││──────────────││──────────────│
│Maintenance ││Maintenance ││Maintenance ││Maintenance ││Maintenance │
│Timer (TTL) ││Timer (TTL) ││Timer (TTL) ││Timer (TTL) ││Timer (TTL) │
└──────┬───────┘└──────────────┘└──────────────┘└──────────────┘└──────────────┘
│
│ Each shard manages multiple Connection objects
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ CONNECTION (per-client) │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ LinearBuffer (read) │ Double-Buffered Write (primary/secondary) │ │
│ │ with compaction │ swap-and-send on async_write completion │ │
│ ├───────────────────────────────────────────────────────────────────────────┤ │
│ │ Zero-Copy RESP Parser │ Zero-Alloc RESP Writer │ │
│ │ (string_view into buffer) │ (stack buffers + reserveContiguous) │ │
│ ├───────────────────────────────────────────────────────────────────────────┤ │
│ │ Pipeline Batch (parse all commands in one read) │ │
│ ├───────────────────────────────────────────────────────────────────────────┤ │
│ │ KEY-BASED ROUTING: hash(key, seed) % N_shards │ │
│ │ ┌─────────────────────┐ ┌──────────────────────────┐ │ │
│ │ │ Local: same shard │ │ Remote: asio::post to │ │ │
│ │ │ → direct dispatch() │ │ target shard io_context │ │ │
│ │ │ │ │ → callback posts back │ │ │
│ │ └─────────────────────┘ └──────────────────────────┘ │ │
│ ├───────────────────────────────────────────────────────────────────────────┤ │
│ │ Command Dispatcher (uint64_t-packed name → O(1) linear scan in L1) │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────┬───────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DATABASE LAYER (per-shard) │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Robin Hood FlatMap (Structure-of-Arrays layout) │ │
│ │ ┌────────────┬────────────┬────────────────┬─────────────┐ │ │
│ │ │ metadata[] │ keys[] │ values[] │ hashes[] │ │ │
│ │ │ (2B/slot) │(CompactStr)│(variant<CS,ll>)│ (uint32_t) │ │ │
│ │ └────────────┴────────────┴────────────────┴─────────────┘ │ │
│ │ • 7-bit hash tag early rejection (~99% false-positive avoidance) │ │
│ │ • Incremental resizing (no latency spikes) │ │
│ │ • SIEVE eviction (visited bit in metadata byte) │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────┐ ┌────────────────────────────────────────┐ │
│ │ CompactString (16B SSO) │ │ Slab Allocator (thread_local) │ │
│ │ • Inline ≤15 bytes │ │ • 7 size classes: 32B–2048B │ │
│ │ • 16-byte aligned │ │ • 1MB page allocation │ │
│ │ • Binary-safe, move-only │ │ • O(1) alloc/dealloc free-list │ │
│ │ • 4 keys per cache line │ │ • Fallback to ::operator new │ │
│ └────────────────────────────┘ └────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
EXTERNAL DEPENDENCIES:
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Asio 1.36 │ mimalloc 2.2 │ rapidhash │ libdivide │
│ (async I/O) │ (allocator │ (fast hash │ (fast modulo │
│ │ override) │ function) │ for shard │
│ │ │ │ routing) │
└──────────────┴──────────────┴──────────────┴──────────────┘
With shards = 1, VortexKV matches the single-threaded RapidKV within a few percent — validating that the shared-nothing abstraction adds near-zero overhead when cross-shard routing isn't needed.
- Each Shard owns its own `asio::io_context`, `std::jthread`, `Database`, and `CommandDispatcher` — no shared mutable state, no mutexes, no atomic contention on the hot path
- Shards are cache-line aligned (`alignas(64)`) to prevent false sharing
- Each shard thread is pinned to a dedicated CPU core via `PinThreadToCore()` for optimal cache locality
- The main thread's acceptor distributes new connections round-robin across shards
- The memory budget is divided equally across shards (`maxMemoryUsage / shardCount`)
- Every command is routed to the shard that owns its key: `rapidhash(key, seed) % shard_count`
- libdivide provides branchless fast modulo — avoids expensive hardware division on every request
- If a command targets the local shard, it executes inline with zero overhead
- If it targets a remote shard, the request is posted to that shard's `io_context` via `ExecuteRemote()`, which executes the command and posts the response back to the caller's context — fully asynchronous, no blocking
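The routing scheme above can be sketched in a few lines. This is an illustrative sketch, not the project's code: `std::hash` stands in for rapidhash, a plain `%` stands in for libdivide's precomputed branchless modulo, and `ShardRouter` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string_view>

// Sketch of key-based shard routing. The seed is randomized at startup so
// the key-to-shard mapping differs per process (hash-flooding resistance).
struct ShardRouter {
    std::uint64_t seed;
    std::uint32_t shardCount;

    std::uint32_t ShardFor(std::string_view key) const {
        // Mix the seed into the hash; the real code uses rapidhash(key, seed).
        std::uint64_t h = std::hash<std::string_view>{}(key) ^ seed;
        return static_cast<std::uint32_t>(h % shardCount);
    }
};
```

With libdivide, a `divider` would be built once at startup and the remainder computed via multiply-shift, avoiding a hardware `div` on every request.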
- All RESP commands in a single read are parsed into a batch (`m_pipelinedRequests`)
- Each command is routed independently; local commands complete immediately, remote commands complete asynchronously
- A completion counter tracks outstanding responses; `FlushPipelinedResponses()` fires only when all commands in the batch have completed, preserving response ordering
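The completion-counter bookkeeping can be sketched as follows (hypothetical names, not VortexKV's actual types). Responses are written into fixed slots so ordering is preserved even when remote commands finish out of order, and the flush callback fires only when the last outstanding command completes:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Sketch of pipelined-batch response ordering with a completion counter.
class PipelineBatch {
public:
    PipelineBatch(std::size_t commands, std::function<void()> onFlush)
        : m_responses(commands), m_pending(commands), m_onFlush(std::move(onFlush)) {}

    // Called (possibly by a remote shard posting back) with the slot index
    // the command occupied in the original pipeline order.
    void Complete(std::size_t slot, std::string response) {
        m_responses[slot] = std::move(response);
        if (--m_pending == 0) m_onFlush();  // all commands done: flush in order
    }

    const std::vector<std::string>& Responses() const { return m_responses; }

private:
    std::vector<std::string> m_responses;  // slot-indexed, preserves ordering
    std::size_t m_pending;                 // outstanding completion counter
    std::function<void()> m_onFlush;
};
```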
- Each shard runs its own event loop (`io_context::run()`) — all I/O and maintenance timers run on the shard's thread
- Connections use async reads/writes with `shared_from_this()` (`std::enable_shared_from_this`) lifetime management
- `TCP_NODELAY` enabled on every accepted socket
- `asio::recycling_allocator` used for handler allocation to avoid heap churn
- Socket handoff from the acceptor thread to a shard uses `release()`/`assign()` to transfer the native FD without cross-thread `io_context` conflicts
- Union-based string that stores strings ≤15 bytes inline without heap allocation
- Heap-allocated strings use the slab allocator to prevent fragmentation
- Aligned to 16 bytes — four keys fit in a single 64-byte cache line during hash table probing
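A minimal sketch of the 16-byte SSO layout, assuming the inline path only (field names are hypothetical; the real CompactString adds a heap/slab path, a little-endian-dependent bit layout, and move-only semantics):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Sketch: strings of 15 bytes or fewer live entirely inside the object.
struct alignas(16) SmallString {
    char data[15];
    std::uint8_t len;  // also serves as the inline/heap discriminator in the real thing

    explicit SmallString(std::string_view s) {
        assert(s.size() <= 15);  // sketch covers the inline path only
        std::memcpy(data, s.data(), s.size());
        len = static_cast<std::uint8_t>(s.size());
    }
    std::string_view view() const { return {data, len}; }
};
static_assert(sizeof(SmallString) == 16);  // four per 64-byte cache line
```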
- SoA layout: Separate arrays for keys, values, metadata, and hashes — metadata scanning touches only 2-byte entries for excellent cache utilization
- 2-byte metadata per slot: 1 byte PSL + 1 byte tag (7 hash bits + 1 visited bit)
- Tag-based early rejection: Check PSL + 7-bit hash tag before comparing keys — avoids ~99% of false-positive key comparisons
- Incremental resizing: Entries migrated N-at-a-time per operation, spreading resize cost across requests — no latency spikes
- SIEVE eviction: Clock-hand algorithm using the visited bit — approximates LRU with O(1) bookkeeping
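The SIEVE clock-hand idea can be sketched with a plain bit vector standing in for the FlatMap's per-slot metadata (names are hypothetical, and the real implementation sweeps the metadata array in place):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of SIEVE eviction: a hit sets the slot's visited bit. To evict, a
// clock hand sweeps the slots; visited entries get a second chance (bit
// cleared), and the first unvisited entry is the victim. This approximates
// LRU with O(1) amortized bookkeeping and no linked-list maintenance.
struct SieveCache {
    std::vector<bool> visited;
    std::size_t hand = 0;

    explicit SieveCache(std::size_t slots) : visited(slots, false) {}

    void Touch(std::size_t slot) { visited[slot] = true; }  // record a hit

    std::size_t Evict() {
        for (;;) {
            if (!visited[hand]) {
                std::size_t victim = hand;
                hand = (hand + 1) % visited.size();
                return victim;
            }
            visited[hand] = false;               // second chance
            hand = (hand + 1) % visited.size();
        }
    }
};
```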
- Command names (≤8 chars) are uppercased and packed into a `uint64_t` — dispatch is a single integer comparison instead of `strcmp`
- A fixed-size array of 32 `CommandEntry` structs fits in L1 cache; the hottest commands (GET, SET) are registered first for early exit
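The packing trick can be sketched as follows (`PackCommand` is a hypothetical name for illustration): up to 8 name bytes are uppercased and packed little-endian into one integer, so matching a command is a single 64-bit compare.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Sketch: pack an ASCII command name (≤8 chars) into a uint64_t,
// uppercasing as we go so lookup is case-insensitive.
constexpr std::uint64_t PackCommand(std::string_view name) {
    std::uint64_t packed = 0;
    for (std::size_t i = 0; i < name.size() && i < 8; ++i) {
        char c = name[i];
        if (c >= 'a' && c <= 'z') c -= 32;  // ASCII uppercase
        packed |= static_cast<std::uint64_t>(
                      static_cast<unsigned char>(c)) << (8 * i);
    }
    return packed;
}

static_assert(PackCommand("get") == PackCommand("GET"));  // case-insensitive
static_assert(PackCommand("GET") != PackCommand("SET"));  // distinct commands
```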
- The parser produces `string_view`s directly into the network read buffer — no allocations, no copies
- The writer uses `std::to_chars` into stack buffers and writes directly into the `LinearBuffer`
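The zero-allocation writer path for integer replies can be sketched like this; a `std::string` stands in for the real `LinearBuffer`, and `WriteRespInteger` is a hypothetical name:

```cpp
#include <cassert>
#include <charconv>
#include <string>
#include <string_view>

// Sketch: std::to_chars renders the digits into a stack buffer (no heap
// allocation), then the framed RESP integer reply ":<n>\r\n" is appended
// to the connection's output buffer in one go.
void WriteRespInteger(std::string& out, long long value) {
    char digits[24];  // enough for any 64-bit integer plus sign
    auto [end, ec] = std::to_chars(digits, digits + sizeof(digits), value);
    assert(ec == std::errc{});  // 24 bytes always suffices for long long
    out += ':';
    out.append(digits, end);
    out += "\r\n";
}
```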
- Two response buffers per connection (primary + secondary)
- While the OS writes the primary buffer, new responses accumulate in the secondary
- Swap-and-send on completion — no write stalls
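The swap-and-send scheme can be sketched without the networking layer (asio omitted; names are hypothetical, and plain `std::string`s stand in for the real buffers):

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>

// Sketch: while the OS drains the primary buffer, new responses accumulate
// in the secondary. On async_write completion the buffers swap and the
// accumulated data is sent, so appending a response never stalls.
struct DoubleBuffer {
    std::string primary;    // currently being written to the socket
    std::string secondary;  // accumulating new responses meanwhile
    bool writeInFlight = false;

    void Queue(std::string_view resp) {
        if (writeInFlight) {
            secondary += resp;              // a write is in progress: stage it
        } else {
            primary += resp;                // idle: start writing immediately
            writeInFlight = true;           // (real code kicks off async_write)
        }
    }

    // Called from the async_write completion handler.
    void OnWriteComplete() {
        primary.clear();
        std::swap(primary, secondary);      // swap-and-send
        writeInFlight = !primary.empty();   // next write starts if data queued
    }
};
```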
- 7 size classes (32B–2048B) with free-list allocation
- 1MB page allocation to reduce `mmap` syscalls
- O(1) alloc/dealloc; fallback to `::operator new` for oversized allocations
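Mapping a request size to one of the 7 size classes can be sketched as below, assuming the classes double from 32B to 2048B (the class-selection logic is illustrative, not the project's actual allocator):

```cpp
#include <cassert>
#include <cstddef>

// Sketch: 7 slab size classes (32, 64, 128, 256, 512, 1024, 2048 bytes);
// anything larger falls back to ::operator new.
constexpr std::size_t kNumClasses = 7;

// Returns the size-class index (0..6), or kNumClasses for oversized requests.
constexpr std::size_t SizeClassFor(std::size_t bytes) {
    std::size_t cls = 0, capacity = 32;
    while (cls < kNumClasses && capacity < bytes) {
        capacity *= 2;
        ++cls;
    }
    return cls;  // cls == kNumClasses means "use ::operator new"
}

static_assert(SizeClassFor(1) == 0);     // smallest class: 32B
static_assert(SizeClassFor(33) == 1);    // next class up: 64B
static_assert(SizeClassFor(2048) == 6);  // largest slab class
static_assert(SizeClassFor(2049) == 7);  // oversized: operator new fallback
```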
- Microsoft's mimalloc replaces the default allocator via `MI_OVERRIDE`, providing better multithreaded allocation performance and reduced fragmentation
| Command | Description |
|---|---|
| `PING [message]` | Connection test |
| `ECHO message` | Echo back message |
| `SET key value [PX milliseconds]` | Set key-value with optional TTL |
| `GET key` | Retrieve value |
| `DEL key` | Delete a key |
| `EXISTS key` | Check key existence |
| `INCR key` | Increment integer value |
| `DECR key` | Decrement integer value |
| `INCRBY key delta` | Increment by delta |
| `DECRBY key delta` | Decrement by delta |
| `EXPIRE key milliseconds` | Set TTL on key |
| `PERSIST key` | Remove TTL |
| `TTL key` | Get remaining TTL |
Note: `DEL` and `EXISTS` currently operate on a single key only. Multi-key variants (`DEL key1 key2 ...`, `EXISTS key1 key2 ...`) are not supported due to the shared-nothing architecture — each key may reside on a different shard, requiring cross-shard coordination to atomically count results across multiple keys.
| Component | Details |
|---|---|
| CPU | AMD EPYC 9554P (64C/128T) |
| Server Cores | 64 threads pinned to cores 0–63 |
| Benchmark Cores | 64 threads pinned to cores 64–127 via taskset -c 64-127 |
| Benchmark Tool | memtier_benchmark |
| Value Size | 256 bytes |
| Key Range | 1–10,000,000 |
| Server | Launch Command |
|---|---|
| Redis | redis-server --save "" --appendonly no |
| Dragonfly | dragonfly --dbfilename "" --snapshot_cron "" --cache_mode=true --version_check=false --proactor_threads=64 |
| VortexKV | ./VortexKV VortexKV.config (64 shards) |
All three servers ran on the same bare-metal machine. Redis was configured with all persistence disabled. Dragonfly was configured in cache mode with snapshots disabled. Servers were restarted between runs for a clean state.
64 threads × 10 connections, ratio 1:0 (SET only), 256B values, 100 seconds
| Metric | Redis | Dragonfly | VortexKV | VortexKV vs Redis | VortexKV vs Dragonfly |
|---|---|---|---|---|---|
| Ops/sec | 66,631 | 2,489,825 | 2,575,169 | 38.6× | +3.4% |
| Avg Latency | 9.508 ms | 0.254 ms | 0.248 ms | −97.4% | −2.4% |
| p99 Latency | 11.967 ms | 0.351 ms | 0.343 ms | −97.1% | −2.3% |
| p99.9 Latency | 19.967 ms | 0.607 ms | 0.695 ms | −96.5% | +14.5% |
64 threads × 10 connections, ratio 0:1 (GET only), 256B values, 100 seconds, 10M keys pre-populated
| Metric | Redis | Dragonfly | VortexKV | VortexKV vs Redis | VortexKV vs Dragonfly |
|---|---|---|---|---|---|
| Ops/sec | 70,780 | 2,552,889 | 2,541,428 | 35.9× | −0.4% |
| Avg Latency | 9.131 ms | 0.253 ms | 0.249 ms | −97.3% | −1.6% |
| p99 Latency | 9.407 ms | 0.351 ms | 0.343 ms | −96.4% | −2.3% |
| p99.9 Latency | 18.303 ms | 0.535 ms | 0.511 ms | −97.2% | −4.5% |
Key takeaway: Without pipelining, VortexKV and Dragonfly are effectively neck-and-neck at ~2.5M ops/sec — both ~36× faster than single-threaded Redis. VortexKV has a slight latency edge.
64 threads × 10 connections, pipeline = 30, ratio 1:0, 256B values, 200K requests/client
| Metric | Redis (16t×10c) | Dragonfly | VortexKV | VortexKV vs Redis | VortexKV vs Dragonfly |
|---|---|---|---|---|---|
| Ops/sec | 567,642 | 7,417,396 | 11,880,297 | 20.9× | +60.2% |
| Avg Latency | 8.298 ms | 2.358 ms | 1.769 ms | −78.7% | −25.0% |
| p99 Latency | 15.743 ms | 4.047 ms | 2.319 ms | −85.3% | −42.7% |
| p99.9 Latency | 23.679 ms | 5.663 ms | 5.695 ms | −75.9% | +0.6% |
64 threads × 10 connections, pipeline = 30, ratio 0:1, 256B values, 100 seconds, 10M keys pre-populated
| Metric | Redis (16t×10c) | Dragonfly | VortexKV | VortexKV vs Redis | VortexKV vs Dragonfly |
|---|---|---|---|---|---|
| Ops/sec | 558,308 | 2,688,873 | 9,889,346 | 17.7× | 3.68× |
| Avg Latency | 8.592 ms | 7.140 ms | 1.918 ms | −77.7% | −73.1% |
| p99 Latency | 17.023 ms | 7.807 ms | 2.655 ms | −84.4% | −66.0% |
| p99.9 Latency | 17.279 ms | 8.767 ms | 3.023 ms | −82.5% | −65.5% |
Key takeaway: Under pipelining, VortexKV's shared-nothing architecture with per-shard databases dominates. Pipelined SET is 60% faster than Dragonfly; pipelined GET is 3.7× faster — the biggest gap in the entire suite.
256B values, 25 seconds, scaling client threads proportionally to server shards
Dragonfly:

| Shards | Threads × Clients | Ops/sec | Avg Latency | p99 Latency | p99.9 Latency |
|---|---|---|---|---|---|
| 1 | 2 × 10 | 83,664 | 0.239 ms | 0.287 ms | 0.407 ms |
| 4 | 8 × 10 | 311,311 | 0.257 ms | 0.367 ms | 0.407 ms |
| 16 | 32 × 10 | 1,064,231 | 0.301 ms | 0.415 ms | 0.495 ms |
| 32 | 64 × 10 | 1,736,663 | 0.384 ms | 0.559 ms | 0.655 ms |
| 64 | 64 × 10 | 2,403,368 | 0.256 ms | 0.351 ms | 0.495 ms |
VortexKV:

| Shards | Threads × Clients | Ops/sec | Avg Latency | p99 Latency | p99.9 Latency |
|---|---|---|---|---|---|
| 1 | 2 × 10 | 73,510 | 0.272 ms | 0.295 ms | 0.855 ms |
| 4 | 8 × 10 | 287,612 | 0.278 ms | 0.415 ms | 0.855 ms |
| 16 | 32 × 10 | 959,061 | 0.321 ms | 0.447 ms | 0.895 ms |
| 32 | 64 × 10 | 1,572,802 | 0.391 ms | 0.575 ms | 0.951 ms |
| 64 | 64 × 10 | 2,551,297 | 0.250 ms | 0.351 ms | 0.959 ms |
| Shards | Dragonfly (ops/sec) | VortexKV (ops/sec) | Δ | Dragonfly p99 | VortexKV p99 |
|---|---|---|---|---|---|
| 1 | 83,664 | 73,510 | −12.1% | 0.287 ms | 0.295 ms |
| 4 | 311,311 | 287,612 | −7.6% | 0.367 ms | 0.415 ms |
| 16 | 1,064,231 | 959,061 | −9.9% | 0.415 ms | 0.447 ms |
| 32 | 1,736,663 | 1,572,802 | −9.4% | 0.559 ms | 0.575 ms |
| 64 | 2,403,368 | 2,551,297 | +6.2% | 0.351 ms | 0.351 ms |
Key takeaway: At low shard counts, Dragonfly leads by ~8–12% — likely due to its mature per-thread optimizations. As shards scale up, the gap narrows significantly, and at full 64-core saturation VortexKV overtakes Dragonfly with +6.2% higher throughput and matching p99 latency. The shared-nothing architecture's advantage materializes at high core counts, where cross-core contention becomes the bottleneck.
| Requirement | Minimum Version |
|---|---|
| C++ Standard | C++20 |
| CMake | 3.20+ |
| Compiler | GCC 12+ / Clang 15+ / MSVC 2022+ |
| Git | Any recent version (for FetchContent) |
| Dependency | Version | Purpose |
|---|---|---|
| Asio | 1.36.0 | Async networking (header-only) |
| mimalloc | 2.2.7 | High-performance allocator |
| RapidHash | vendored | Fast hashing |
| libdivide | vendored | Fast integer division |
| Google Test | 1.15.2 | Unit testing (optional) |
| Google Benchmark | 1.9.1 | Micro-benchmarks (optional) |
# Clone
git clone https://github.com/shashank2602/VortexKV.git
cd VortexKV
# Build (Release)
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)
# Build with tests and benchmarks
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_EXTRAS=ON
cmake --build . -j$(nproc)
# Run
./VortexKV # default: 0.0.0.0:8080
./VortexKV ../VortexKV.config # custom config
# Run tests
ctest --output-on-failure
# Or run individual test binaries
./tests/TestCompactString
./tests/TestFlatMap
./tests/TestDatabase
./tests/TestRespProtocol
./tests/TestIntegration
# Run benchmarks
./benchmarks/BenchmarkFlatMap
./benchmarks/BenchmarkCompactString
./benchmarks/BenchmarkDatabase
./benchmarks/BenchmarkCommandDispatcher

# Clone
git clone https://github.com/shashank2602/VortexKV.git
cd VortexKV
# Build (Release)
mkdir build && cd build
cmake ..
cmake --build . --config Release
# Build with tests and benchmarks
cmake .. -DBUILD_EXTRAS=ON
cmake --build . --config Release
# Run
.\Release\VortexKV.exe
.\Release\VortexKV.exe ..\VortexKV.config
# Run tests
ctest -C Release --output-on-failure

# VortexKV.config
port 8080
bind 0.0.0.0
shards 64 # defaults to hardware_concurrency
maintenance_interval_ms 200
max_memory_usage 4294967296 # 4 GB
VortexKV/
├── CMakeLists.txt
├── src/
│ ├── main.cpp
│ ├── server/
│ │ ├── server.cpp # Acceptor, shard pool management
│ │ ├── shard.cpp # Per-shard thread, io_context, maintenance
│ │ └── connection.cpp # Per-connection pipeline, routing, I/O
│ ├── commands/
│ │ ├── commandDispatcher.cpp # Integer-packed command dispatch
│ │ └── commands.cpp # Command implementations
│ ├── protocol/
│ │ ├── respParser.cpp # Zero-copy RESP parser
│ │ └── respWriter.cpp # Zero-alloc RESP writer
│ ├── storage/
│ │ ├── database.cpp # Database operations (SET, GET, DEL, etc.)
│ │ ├── compactString.cpp # 16-byte SSO string
│ │ └── slabAllocator.cpp # Slab allocator (7 size classes)
│ └── utility/
│ ├── utility.cpp # Helpers, thread pinning
│ └── linearBuffer.cpp # Linear buffer implementation
├── inc/ # Headers (mirrors src/ structure)
├── external/
│ ├── rapidhash/ # rapidhash for key routing
│ └── libdivide/ # Fast integer division
├── tests/ # Google Test suite
├── benchmarks/ # Google Benchmark microbenchmarks
├── VortexKV.config # Sample configuration
└── CMakeLists.txt
- String-only values — no lists, sets, sorted sets, or hashes
- No persistence — all data is in-memory only (no RDB/AOF snapshots)
- No replication or clustering
- No AUTH, ACL, or TLS support
- Limited command set — 13 commands
- Single-key DEL/EXISTS — multi-key variants not yet supported (see Supported Commands)
- Little-endian only — The CompactString SSO bit layout assumes little-endian architecture
This project is open source.