A persistent key-value storage engine built from scratch in C++17, implementing the same LSM-tree architecture used by LevelDB, RocksDB, and Cassandra.
500,000 sequential writes, 100,000 random reads, 100-byte values.
| Metric | Baseline (v1) | Current | LevelDB (published) |
|---|---|---|---|
| Sequential write | 26,666 ops/s | 224,823 ops/s | ~566,000 ops/s |
| Random read (hit) | 37,347 ops/s | 27,075 ops/s | ~62,000 ops/s |
| Negative read (miss) | 1,419,825 ops/s | 2,978,277 ops/s | bloom-gated |
| Read avg latency | 26.68 us | 36.83 us | 16 us |
| Read p99 latency | 40.83 us | 61.42 us | — |
| Disk usage | 61.3 MB | 60.0 MB | — |
8.4x write improvement from baseline to current (WAL open/close fix).
WRITE PATH READ PATH
---------- ---------
PUT(k,v) -> WAL append (disk) GET(k)
-> MemTable insert (RAM) |
| v
(when full) MemTable (RAM)
v | miss
Flush to SSTable (disk) v
| Bloom filter check
(background) | maybe
v v
Compaction SSTable scan (disk)
(merge SSTables) | miss
v
Older SSTables...
| File | What it does |
|---|---|
include/kvstore.h / src/kvstore.cpp |
Public API + orchestration (flush, compaction, recovery) |
include/memtable.h / src/memtable.cpp |
In-memory sorted store (std::map) |
include/wal.h / src/wal.cpp |
Write-ahead log for crash recovery (binary format) |
include/sstable.h / src/sstable.cpp |
Immutable sorted on-disk tables (data + sparse index + bloom + footer) |
include/bloom.h / src/bloom.cpp |
Bloom filter (~1% FP rate, FNV-1a hash, double hashing) |
include/binio.h |
Shared binary I/O helpers (length-prefixed strings, fixed-width ints) |
include/tombstone.h |
Delete marker (NUL-padded sentinel value) |
+----------------------------+
| DATA BLOCK | [key_len:4][key][val_len:4][val] ...sorted...
+----------------------------+
| INDEX BLOCK (sparse) | [count:4] then [key_len:4][key][offset:8] every 16th key
+----------------------------+
| BLOOM BLOCK | [m:8][k:4][nbytes:4][bit_array]
+----------------------------+
| FOOTER | [index_offset:8][bloom_offset:8] (last 16 bytes)
+----------------------------+
make # build static library
make test # run tests
make bench # run benchmarks (results appended to bench/results.md)Requires C++17 and pthreads.
- WAL before RAM — durable log first, then in-memory insert. Crash between the two loses nothing.
- Binary format everywhere — length-prefixed
[len:4][bytes], no delimiters. Values can contain any bytes. - Immutable SSTables — written once, never modified. Enables lock-free reads during compaction.
- Tombstone deletes — delete = write a marker. Compaction reclaims space.
- Sparse index — every 16th key checkpointed. Binary search + bounded forward scan.
- Bloom filter per SSTable — 0.95% measured false positive rate. Skips disk entirely for absent keys.
- Background compaction — full merge on a separate thread, mutex only for snapshot/swap.
- Atomic MANIFEST — write to temp file + rename. Never seen half-written.
-
fsyncfor true power-loss durability - Size-tiered compaction (merge same-size groups instead of full merge)
- Block-based index segments (4KB blocks instead of fixed key count)
- Background flush (currently holds the lock)
- Write batching / group commit
- Skip list or arena allocator for MemTable (cache-friendly)
- Raw POSIX
write()instead ofstd::ofstream - Java implementation for comparison