Kernstor is a high-performance distributed block storage system for on-premises IaaS, similar to AWS EBS.
It provides persistent block storage volumes that can be attached to virtual machines running on hypervisor nodes. It is designed for:
- High performance: Direct NVMe access via io_uring, bypassing the kernel block layer
- Durability: 3-way synchronous replication across storage nodes
- Flexibility: Instant snapshots and clones using copy-on-write
┌─────────────────────────────────────────────────────────────────┐
│                        Hypervisor Nodes                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                          │
│  │   VM    │  │   VM    │  │   VM    │                          │
│  └────┬────┘  └────┬────┘  └────┬────┘                          │
│       │            │            │                               │
│  ┌────┴────────────┴────────────┴────┐                          │
│  │     Linux NVMe/TCP Initiator      │                          │
│  └────────────────┬──────────────────┘                          │
└───────────────────│─────────────────────────────────────────────┘
                    │ NVMe/TCP
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                        Storage Frontend                         │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ NVMe/TCP Target  │  Volume Manager  │  Replication Coord  │  │
│  └───────────────────────────────────────────────────────────┘  │
└───────────────────│─────────────────────────────────────────────┘
                    │ Internal Protocol
   ┌────────────────┼────────────────┐
   ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Storage Node │ │ Storage Node │ │ Storage Node │
│  (Primary)   │ │  (Replica)   │ │  (Replica)   │
│  ┌────────┐  │ │  ┌────────┐  │ │  ┌────────┐  │
│  │io_uring│  │ │  │io_uring│  │ │  │io_uring│  │
│  └───┬────┘  │ │  └───┬────┘  │ │  └───┬────┘  │
│      │       │ │      │       │ │      │       │
│  ┌───┴───┐   │ │  ┌───┴───┐   │ │  ┌───┴───┐   │
│  │ NVMe  │   │ │  │ NVMe  │   │ │  │ NVMe  │   │
│  └───────┘   │ │  └───────┘   │ │  └───────┘   │
└──────────────┘ └──────────────┘ └──────────────┘
The storage frontend exposes block volumes to hypervisors via the NVMe/TCP protocol. It handles:
- NVMe/TCP target implementation
- Volume-to-storage-node mapping
- Replication coordination (ensures writes reach all replicas)
- Client connection management
Storage nodes store the actual block data on local NVMe drives. Features:
- Direct NVMe access via io_uring for minimal latency
- Configuration-driven device management (devices specified by NGUID)
- Block allocation and management
- Write-ahead logging for crash recovery
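The write-ahead logging bullet above can be illustrated with a minimal in-memory sketch. It is not Kernstor's implementation: `WalRecord`, `BlockStore`, and their methods are hypothetical names, and the log is a plain `Vec` where a real storage node would use durable media. The point is only the ordering (log first, then apply) and the replay on recovery.

```rust
use std::collections::HashMap;

/// One logged write: target block and new contents (hypothetical record layout).
#[derive(Clone, Debug, PartialEq)]
pub struct WalRecord {
    pub lba: u64,
    pub data: Vec<u8>,
}

/// In-memory stand-in for a storage node's log and applied block state.
#[derive(Default)]
pub struct BlockStore {
    wal: Vec<WalRecord>,           // would be on durable media in a real node
    blocks: HashMap<u64, Vec<u8>>, // applied state, treated as lost on a crash
}

impl BlockStore {
    /// Log first, then apply: the record exists before the block map changes.
    pub fn write(&mut self, lba: u64, data: &[u8]) {
        self.wal.push(WalRecord { lba, data: data.to_vec() });
        self.blocks.insert(lba, data.to_vec());
    }

    /// Read the currently applied contents of a block, if any.
    pub fn read(&self, lba: u64) -> Option<&[u8]> {
        self.blocks.get(&lba).map(|v| v.as_slice())
    }

    /// Simulate a crash: applied state is dropped, only the log survives.
    pub fn crash(self) -> Vec<WalRecord> {
        self.wal
    }

    /// Rebuild state by replaying the log in order; later records win.
    pub fn recover(wal: Vec<WalRecord>) -> Self {
        let mut blocks = HashMap::new();
        for rec in &wal {
            blocks.insert(rec.lba, rec.data.clone());
        }
        BlockStore { wal, blocks }
    }
}
```

A production log would also need checksums, truncation after checkpoints, and explicit flushes, none of which are modeled here.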
The control plane manages cluster state via a gRPC API:
- Volume lifecycle (create, delete, resize, attach, detach)
- Snapshot and clone operations
- Storage node membership and health monitoring
- Replica placement decisions
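As an illustration of what a replica placement decision might look like, here is a small sketch. The policy is an assumption, not something this README specifies: it picks at most one node per rack (using a `rack` label like the one passed to `node register`) and skips draining or unhealthy nodes. The `Node` struct, its integer `id`, and `place_replicas` are hypothetical.

```rust
use std::collections::HashSet;

/// Simplified view of a registered storage node (hypothetical fields).
#[derive(Clone, Debug)]
pub struct Node {
    pub id: u32,        // real nodes are identified by UUID; an integer keeps the sketch short
    pub rack: String,   // e.g. from a `rack=rack-a` label
    pub draining: bool, // draining nodes receive no new chunks
    pub healthy: bool,
}

/// Pick `replicas` nodes, at most one per rack, in input order.
/// Returns None when there are not enough eligible racks.
pub fn place_replicas(nodes: &[Node], replicas: usize) -> Option<Vec<u32>> {
    if replicas == 0 {
        return Some(Vec::new());
    }
    let mut used_racks: HashSet<&str> = HashSet::new();
    let mut chosen = Vec::with_capacity(replicas);
    for n in nodes.iter().filter(|n| n.healthy && !n.draining) {
        // `insert` returns false if this rack already holds a replica.
        if used_racks.insert(n.rack.as_str()) {
            chosen.push(n.id);
            if chosen.len() == replicas {
                return Some(chosen);
            }
        }
    }
    None
}
```

A real placement policy would likely also weigh free capacity and load; this sketch only shows the failure-domain constraint.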
Storage nodes use Linux io_uring with NVMe passthrough (IORING_OP_URING_CMD) to bypass the kernel block layer entirely. This provides:
- Sub-100μs latency for local NVMe operations
- High IOPS with minimal CPU overhead
- Direct submission of NVMe commands
The standard NVMe/TCP protocol lets hypervisors use the native Linux NVMe initiator, providing:
- Kernel-level NVMe driver integration
- Familiar operational model
- No custom drivers required on hypervisor nodes
Writes use parallel fan-out for 3-way data replication:
- Head node sends writes to both replicas simultaneously
- Majority quorum: acknowledges after local + 1 replica completes
- Low latency due to parallel I/O
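The fan-out and quorum rule above can be sketched with standard-library threads and a channel. This is a simplified model, not the actual data path: replicas are simulated by `replica_write` (a hypothetical function that sleeps and returns a result), and the local write is passed in as an already-known outcome rather than run in parallel.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Simulated replica write: waits for `latency`, then reports success or failure.
fn replica_write(latency: Duration, ok: bool) -> bool {
    thread::sleep(latency);
    ok
}

/// Returns true once the local write plus at least one of the two replicas
/// has succeeded (2 of 3 copies), without waiting for the slower replica.
pub fn replicate(local_ok: bool, replicas: [(Duration, bool); 2]) -> bool {
    if !local_ok {
        return false;
    }
    let (tx, rx) = mpsc::channel();
    // Fan out: both replica writes start at the same time.
    for (latency, ok) in replicas {
        let tx = tx.clone();
        thread::spawn(move || {
            // The receiver may already be gone if quorum was reached; ignore that.
            let _ = tx.send(replica_write(latency, ok));
        });
    }
    drop(tx); // the channel closes once both replica threads have reported
    // Acknowledge on the first successful replica result.
    for result in rx {
        if result {
            return true;
        }
    }
    false
}
```

Because the caller returns on the first successful replica, the acknowledged latency tracks the faster of the two replicas; the slower one completes in the background.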
Snapshots are instant and space-efficient:
- No data copying at snapshot time
- Blocks are shared between volume and snapshots
- Reference counting tracks block sharing
- Clones created instantly from snapshots
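To make the reference-counting idea concrete, here is a minimal copy-on-write sketch. The types `Pool` and `BlockMap` are hypothetical and entirely in memory; Kernstor's actual metadata layout is not described in this README. A snapshot copies only the logical-to-physical map and bumps reference counts, and a write copies a block only when that block is shared.

```rust
use std::collections::HashMap;

pub type BlockId = usize;

/// Physical block pool with one reference count per block.
#[derive(Default)]
pub struct Pool {
    data: Vec<Vec<u8>>,
    refs: Vec<u32>,
}

impl Pool {
    fn alloc(&mut self, bytes: &[u8]) -> BlockId {
        self.data.push(bytes.to_vec());
        self.refs.push(1);
        self.data.len() - 1
    }

    /// Number of physical blocks ever allocated (freed blocks are not reused here).
    pub fn allocated(&self) -> usize {
        self.data.len()
    }
}

/// Logical-to-physical map for a volume, snapshot, or clone.
#[derive(Clone, Default)]
pub struct BlockMap {
    map: HashMap<u64, BlockId>,
}

impl BlockMap {
    /// Snapshot or clone: copy the map and bump refcounts; no block data is copied.
    pub fn snapshot(&self, pool: &mut Pool) -> BlockMap {
        for &id in self.map.values() {
            pool.refs[id] += 1;
        }
        self.clone()
    }

    /// Write one logical block, copying only if the physical block is shared.
    pub fn write(&mut self, pool: &mut Pool, lba: u64, bytes: &[u8]) {
        match self.map.get(&lba).copied() {
            // Exclusively owned: safe to overwrite in place.
            Some(id) if pool.refs[id] == 1 => pool.data[id] = bytes.to_vec(),
            // Shared with a snapshot or clone: drop our reference, write elsewhere.
            Some(id) => {
                pool.refs[id] -= 1;
                let new_id = pool.alloc(bytes);
                self.map.insert(lba, new_id);
            }
            // First write to this logical block.
            None => {
                let new_id = pool.alloc(bytes);
                self.map.insert(lba, new_id);
            }
        }
    }

    pub fn read<'a>(&self, pool: &'a Pool, lba: u64) -> Option<&'a [u8]> {
        self.map.get(&lba).map(|&id| pool.data[id].as_slice())
    }
}
```

In this model a clone is the same operation as a snapshot: it starts from a copy of the map and diverges block by block as it is written.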
Communication between frontend and storage nodes uses:
- Data path: Custom binary protocol optimized for block I/O
- Control path: gRPC for metadata, health checks, and coordination
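The wire format of the data-path protocol is not documented here, so the following is purely a hypothetical illustration of what a compact fixed-size request header for block I/O could look like. The field set (operation, chunk, offset, length) mirrors the CLI's `read`, `write`, and `flush` commands; the 21-byte little-endian layout, the `Header` and `Op` names, and the use of a `u64` chunk identifier are all assumptions.

```rust
use std::convert::TryInto;

/// Required alignment for offsets and lengths (the CLI notes 4KB-aligned data).
pub const BLOCK: u64 = 4096;
/// Encoded header size: 1 (op) + 8 (chunk) + 8 (offset) + 4 (length).
pub const HEADER_LEN: usize = 21;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum Op {
    Read = 1,
    Write = 2,
    Flush = 3,
}

/// Hypothetical request header; real chunk IDs may well be UUIDs instead.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Header {
    pub op: Op,
    pub chunk: u64,
    pub offset: u64,
    pub length: u32,
}

impl Header {
    /// Serialize to a fixed-size little-endian byte array.
    pub fn encode(&self) -> [u8; HEADER_LEN] {
        let mut buf = [0u8; HEADER_LEN];
        buf[0] = self.op as u8;
        buf[1..9].copy_from_slice(&self.chunk.to_le_bytes());
        buf[9..17].copy_from_slice(&self.offset.to_le_bytes());
        buf[17..21].copy_from_slice(&self.length.to_le_bytes());
        buf
    }

    /// Parse a header; returns None for an unknown opcode.
    pub fn decode(buf: &[u8; HEADER_LEN]) -> Option<Header> {
        let op = match buf[0] {
            1 => Op::Read,
            2 => Op::Write,
            3 => Op::Flush,
            _ => return None,
        };
        Some(Header {
            op,
            chunk: u64::from_le_bytes(buf[1..9].try_into().ok()?),
            offset: u64::from_le_bytes(buf[9..17].try_into().ok()?),
            length: u32::from_le_bytes(buf[17..21].try_into().ok()?),
        })
    }

    /// True when both offset and length are multiples of 4 KiB.
    pub fn is_aligned(&self) -> bool {
        self.offset % BLOCK == 0 && u64::from(self.length) % BLOCK == 0
    }
}
```

A fixed-size header like this can be read with a single exact-length read before the payload, which is one common reason to prefer a custom binary framing over a general-purpose RPC encoding on the hot path.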
| Component | Technology |
|---|---|
| Language | Rust |
| Async I/O | io_uring (via io-uring crate) |
| Hypervisor protocol | NVMe/TCP |
| Control plane RPC | gRPC (tonic) |
| Configuration | TOML |
| Serialization | Protocol Buffers |
- Linux 5.19+ (for io_uring NVMe passthrough)
- NVMe drives with character device support
- Rust 1.75+
cargo build --release
Config files are provided in config/:
config/frontend-1.toml:
nvme_tcp_port = 4420
[control_plane]
grpc_port = 9001
[raft]
node_id = 1
listen_addr = "[::1]:6001"
storage_path = "data/frontend-1/raft.redb"
bootstrap = true
[raft.peers]
1 = "[::1]:6001"
2 = "[::1]:6002"
3 = "[::1]:6003"
config/frontend-2.toml: (node_id = 2, listen_addr = "[::1]:6002", nvme_tcp_port = 4421, grpc_port = 9002, bootstrap = false)
config/frontend-3.toml: (node_id = 3, listen_addr = "[::1]:6003", nvme_tcp_port = 4422, grpc_port = 9003, bootstrap = false)
Start all three frontends in separate terminals:
# Terminal 1 - start first (bootstrap node)
cargo run --release --bin kernstor-frontend -- -c config/frontend-1.toml
# Terminal 2
cargo run --release --bin kernstor-frontend -- -c config/frontend-2.toml
# Terminal 3
cargo run --release --bin kernstor-frontend -- -c config/frontend-3.toml
Config files are provided in config/:
config/node-1.toml:
uuid = "00000000-0000-0000-0000-000000000001"
grpc_port = 9101
[[storage]]
id = "10000000-0000-0000-0000-000000000001" # Unique drive UUID
backend = "file:data/node-1/storage.bin"
metadata_path = "data/node-1/metadata.redb"
listen_address = "::1"
port = 4431
hugepages = 0
capacity_bytes = 1073741824 # 1GB
config/node-2.toml: (uuid = ...002, grpc_port = 9102, port = 4432, etc.)
config/node-3.toml: (uuid = ...003, grpc_port = 9103, port = 4433, etc.)
Start all three storage nodes:
# Terminal 4
sudo cargo run --release --bin kernstor-node -- -c config/node-1.toml
# Terminal 5
sudo cargo run --release --bin kernstor-node -- -c config/node-2.toml
# Terminal 6
sudo cargo run --release --bin kernstor-node -- -c config/node-3.toml
Note: Storage nodes require sudo for io_uring NVMe passthrough.
To find NGUIDs on your system:
sudo nvme id-ns /dev/nvme0n1 -o json | jq -r .nguid
config/node-nvme.toml:
uuid = "your-node-uuid"
grpc_port = 9101
[[storage]]
id = "your-drive-uuid" # Unique UUID for this drive
backend = "nvme:your-nguid-here"
metadata_path = "/var/lib/kernstor/ns1.redb"
listen_address = "::"
port = 4421
hugepages = 512 # 1GB of 2MB hugepages
The kernstor-cli tool manages the cluster:
# Default control plane address: [::1]:9001
# Override with: --control-plane HOST:PORT
# Register a storage node with the cluster (UUID is fetched from the node)
kernstor-cli node register \
--grpc-address [::1]:9101 \
--label name=node-1 \
--label rack=rack-a
# List all registered nodes
kernstor-cli node list
# Get details of a specific node
kernstor-cli node get --id 00000000-0000-0000-0000-000000000001
# Mark node for draining (no new chunks placed)
kernstor-cli node drain --id 00000000-0000-0000-0000-000000000001
# Permanently remove a node
kernstor-cli node decommission --id 00000000-0000-0000-0000-000000000001
# Create a volume (size: K, M, G, T suffixes supported)
kernstor-cli volume create --name my-volume --size 100G --replication 3
# List all volumes
kernstor-cli volume list
# Get volume details (with chunk placement)
kernstor-cli volume get --id <volume-uuid> --show-chunks
# Delete a volume
kernstor-cli volume delete --id <volume-uuid>
# Write data to a chunk (data must be 4KB aligned)
kernstor-cli --addr [::1]:4431 write \
--drive <chunk-id> \
--offset 0 \
--data @file.bin # or hex: 0x00112233...
# Read data from a chunk
kernstor-cli --addr [::1]:4431 read \
--drive <chunk-id> \
--offset 0 \
--length 4096
# Flush data to durable storage
kernstor-cli --addr [::1]:4431 flush --drive <chunk-id>
kernstor/
├── crates/
│ ├── kernstor-node/ # Storage node daemon (io_uring reactor)
│ ├── kernstor-frontend/ # Frontend daemon (NVMe/TCP + Raft + Control Plane)
│ ├── kernstor-cli/ # CLI tool for cluster management
│ ├── kernstor-proto/ # Protocol Buffers + gRPC definitions
│ └── kernstor-raft/ # Raft consensus library wrapper
├── config/ # Example configuration files
├── Cargo.toml # Workspace manifest
└── README.md
TBD