Storage format

Storage format is inspired by SSTables from Cassandra. It is append-only SortedMap<byte[], byte[]> like structure. Changes are never overwritten, obsolete files are only deleted on compaction or cleanup.

Each update is stored in single update file. ySearch for an entry starts at newest file (with highest versionID). If no key is found, it continues at older file. And so on until key is not found, or no older files are left.

Keys in each file are stored in sorted order, so the key in file can be found efficiently using binary search. Key removals are handled by tombstones; special value which indicates delete.

Compaction and index files

Over time the number of update files grows. It would be too slow to do binary search over all of them. There is compaction process running on background. It takes multiple update files and merges them into single index file. Binary search is then performed over single merged file. Old update files are preserved for possible rollbacks.

Sometimes (after massive delete) all index files will be merged into single file. My plan is to start with compaction strategies from Cassandra or another blog post. We need to find best algorithm for our scenario.

Too many files?

DB will store data in directory with many files. I expect a few thousand files with sizes between 100KB to 10MB. Some filesystems do not handle well too many files in single folder (millions), in that case multiple subfolders could be used to minimize number of files on each level.

If single file is preferred, I could use single file with block abstraction (similar to KVStore block layer). I think many-file approach would be more robust, faster and more space efficient.

Cleanup

Cleanup (destroys old versions) deletes old update files and removes possibility to rollback to older versions. To preserver data merged file must be produced by compaction process, before the update files are deleted.

File format

Update file has following structure

File header with feature bits
Optional checksum of entire file to catch data corruption (CRC32?)
Sorted table of keys. Those are fixed size, binary search can use fixed offsets to perform efficient binary search
Values, those are referenced from key table by offset. Values are in separate section to make binary search on keys more CPU cache friendly.

Merged file has the structure with following differences:

Header contains minimal and maximal versionIDs
To save space it does not contain values, but reference to update file offset. Cleanup needs to move values before update files are deleted.

Rollback

Rollback removes update files bigger than versionID. It also removes merged files where maximal versionID is beyond rollback versionID.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!