A thread-safe LSM-tree storage engine written in pure in Java 21+. Built with Project Panama (Foreign Memory API), lock-free concurrency, and background flush/compaction.
| Feature | Implementation |
|---|---|
| LSM-tree architecture | MemTable (write) -> SSTable flush (disk) -> compaction |
| Off-heap memory | MemorySegment + Arena for memory-mapped SSTable access, zero-copy reads |
| Thread-safe state management | Immutable DaoState snapshots via AtomicReference + CAS transitions |
| Non-blocking reads | get() and range iterators work against a consistent snapshot, never block on flush/compact |
| Background flush | Auto-triggered when MemTable exceeds flushThresholdBytes, runs in dedicated executor |
| Background compaction | Merges SSTables in a separate executor, readers continue on old tables until completion |
| Merge iterator | Priority-based merge across active MemTable, flushing MemTable, and N SSTables |
| Tombstone support | Logical deletes propagated through merge, removed during compaction |
| Crash recovery | Atomic file moves; interrupted compaction recovered on next startup |
| Binary search in SSTable | O(log N) point lookups via trailing index |
[Entry 0: key_size (8 bytes) | key | value_size (8 bytes) | value ]
[Entry 1: key_size (8 bytes) | key | value_size (8 bytes) | value ]
...
[Entry N: key_size (8 bytes) | key | value_size = -1 (tombstone) ]
[Index: offset_0 (8 bytes) | offset_1 | ... | offset_N ]
[Footer: entry_count (8 bytes) | index_start_offset (8 bytes) ]
- Trailing index allows single-pass sequential writes during flush
- Footer provides O(1) access to entry count and index location
- Value size
-1marks a tombstone (deleted entry)
State is managed as an immutable record, replaced atomically:
public record DaoState(
MemTable activeMemTable, // accepts writes
MemTable flushingMemTable, // being written to disk (or null)
List<SSTable> ssTables // immutable list of on-disk tables
) {
}
private final AtomicReference<DaoState> state;State transitions:
upsert(): state.get().activeMemTable().put(entry)
flush: [active=A, flushing=null] -> [active=new, flushing=A]
write A to disk
[active=B, flushing=A] -> [active=B, flushing=null, tables+=[new SSTable]]
compact: merge SSTables from snapshot -> write compacted file
atomically replace old tables, preserve tables added by concurrent flushes
Two single-thread executors guarantee ordering:
flushExecutor-- at most one flush in progresscompactExecutor-- at most one compaction in progress
Reads (get, get(from, to)) take a snapshot via state.get() and never block.
Config config = new Config(
Path.of("/data/mydb"),
64 * 1024 * 1024 // flush threshold: 64 MB
);
Dao<MemorySegment, Entry<MemorySegment>> dao = new MemorySegmentDao(config);
dao.upsert(new BaseEntry<>(key, value));
dao.upsert(new BaseEntry<>(key, null)); // tombstone
Entry<MemorySegment> entry = dao.get(key);
Iterator<Entry<MemorySegment>> range = dao.get(fromKey, toKey);
dao.compact();
dao.close(); // blocks until background flush/compact complete# Requires -Xmx64m
./gradlew clean testMIT License - see LICENSE file for details.