MalteJ/kernstor

Kernstor

A high-performance distributed block storage system for on-premises IaaS, similar to AWS EBS.

Vision

Kernstor provides persistent block storage volumes that can be attached to virtual machines running on hypervisor nodes. It is designed for:

  • High performance: Direct NVMe access via io_uring, bypassing the kernel block layer
  • Durability: 3-way synchronous replication across storage nodes
  • Flexibility: Instant snapshots and clones using copy-on-write

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Hypervisor Nodes                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                          │
│  │   VM    │  │   VM    │  │   VM    │                          │
│  └────┬────┘  └────┬────┘  └────┬────┘                          │
│       │            │            │                               │
│  ┌────┴────────────┴────────────┴────┐                          │
│  │       Linux NVMe/TCP Initiator    │                          │
│  └────────────────┬──────────────────┘                          │
└───────────────────│─────────────────────────────────────────────┘
                    │ NVMe/TCP
                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                      Storage Frontend                           │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  NVMe/TCP Target  │  Volume Manager  │  Replication Coord │  │
│  └───────────────────────────────────────────────────────────┘  │
└───────────────────│─────────────────────────────────────────────┘
                    │ Internal Protocol
        ┌───────────┼───────────┐
        ▼           ▼           ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Storage Node │ │ Storage Node │ │ Storage Node │
│   (Primary)  │ │  (Replica)   │ │  (Replica)   │
│  ┌────────┐  │ │  ┌────────┐  │ │  ┌────────┐  │
│  │io_uring│  │ │  │io_uring│  │ │  │io_uring│  │
│  └───┬────┘  │ │  └───┬────┘  │ │  └───┬────┘  │
│      │       │ │      │       │ │      │       │
│  ┌───┴───┐   │ │  ┌───┴───┐   │ │  ┌───┴───┐   │
│  │ NVMe  │   │ │  │ NVMe  │   │ │  │ NVMe  │   │
│  └───────┘   │ │  └───────┘   │ │  └───────┘   │
└──────────────┘ └──────────────┘ └──────────────┘

Components

Storage Frontend

Exposes block volumes to hypervisors via the NVMe/TCP protocol. Handles:

  • NVMe/TCP target implementation
  • Volume-to-storage-node mapping
  • Replication coordination (ensures writes reach all replicas)
  • Client connection management
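The volume-to-storage-node mapping can be pictured as simple offset arithmetic over fixed-size chunks. The following is a minimal sketch, not the project's actual code: the chunk size, `ChunkLocation`, and `locate` are assumptions made for illustration.

```rust
/// Hypothetical chunk size; the real value is not documented in this README.
const CHUNK_SIZE: u64 = 64 * 1024 * 1024; // 64 MiB, assumed for illustration

/// Location of a byte offset inside a chunked volume.
#[derive(Debug, PartialEq, Eq)]
struct ChunkLocation {
    chunk_index: u64,
    offset_in_chunk: u64,
}

/// Map a volume byte offset to the chunk that holds it.
fn locate(volume_offset: u64) -> ChunkLocation {
    ChunkLocation {
        chunk_index: volume_offset / CHUNK_SIZE,
        offset_in_chunk: volume_offset % CHUNK_SIZE,
    }
}

fn main() {
    // An I/O 4 KiB into the second chunk of the volume.
    let loc = locate(CHUNK_SIZE + 4096);
    println!("{:?}", loc);
}
```

A frontend would then look up which storage nodes hold the replicas of `chunk_index` and forward the I/O there.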

Storage Nodes

Store the actual block data on local NVMe drives. Features:

  • Direct NVMe access via io_uring for minimal latency
  • Configuration-driven device management (devices specified by NGUID)
  • Block allocation and management
  • Write-ahead logging for crash recovery

Control Plane

Manages cluster state via gRPC API:

  • Volume lifecycle (create, delete, resize, attach, detach)
  • Snapshot and clone operations
  • Storage node membership and health monitoring
  • Replica placement decisions

Key Design Decisions

1. io_uring with NVMe Passthrough

Uses Linux io_uring with NVMe passthrough (IORING_OP_URING_CMD) to bypass the kernel block layer entirely. This provides:

  • Sub-100μs latency for local NVMe operations
  • High IOPS with minimal CPU overhead
  • Direct submission of NVMe commands

2. NVMe/TCP for Hypervisor Connectivity

Standard NVMe/TCP protocol allows hypervisors to use the native Linux NVMe initiator, providing:

  • Kernel-level NVMe driver integration
  • Familiar operational model
  • No custom drivers required on hypervisor nodes

3. Parallel Replication

Uses parallel fan-out for 3-way data replication:

  • Head node sends writes to both replicas simultaneously
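  • Majority quorum: a write is acknowledged once the local write and one replica write have completed (2 of 3)
  • Low latency: the replica writes run in parallel, so the acknowledgement does not wait for the slowest replica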
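The fan-out and quorum rule above can be sketched with plain threads and a channel. This is an illustrative simulation under stated assumptions, not Kernstor's replication code: `Target` and `replicate` are hypothetical names, and sleeps stand in for device and network latency.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// One simulated write target: how long the write takes and whether it succeeds.
#[derive(Clone, Copy)]
struct Target {
    delay_ms: u64,
    succeeds: bool,
}

/// Issue the write to all three targets in parallel and return `true`
/// as soon as `quorum` of them have reported success.
fn replicate(targets: [Target; 3], quorum: usize) -> bool {
    let (tx, rx) = mpsc::channel();
    for target in targets {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(target.delay_ms));
            // The receiver may already be gone once quorum was reached.
            let _ = tx.send(target.succeeds);
        });
    }
    drop(tx); // the loop below ends once every worker has reported

    let mut acks = 0;
    for ok in rx {
        if ok {
            acks += 1;
        }
        if acks >= quorum {
            return true; // do not wait for the slowest target
        }
    }
    false
}

fn main() {
    let fast = Target { delay_ms: 1, succeeds: true };
    let slow = Target { delay_ms: 200, succeeds: true };
    let start = std::time::Instant::now();
    let acked = replicate([fast, fast, slow], 2);
    println!("acked={} after {:?}", acked, start.elapsed());
}
```

In this simulation the call returns after the two fast writes finish, while the slow replica completes in the background.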

4. Copy-on-Write Snapshots

Snapshots are instant and space-efficient:

  • No data copying at snapshot time
  • Blocks are shared between volume and snapshots
  • Reference counting tracks block sharing
  • Clones created instantly from snapshots
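To make the reference-counting idea concrete, here is a toy in-memory model. It is a sketch only: `BlockStore`, `BlockMap`, and whole-block writes are assumptions for illustration, and because every write replaces a full block, the shared case redirects to a fresh block instead of copying old data first.

```rust
use std::collections::HashMap;

/// Toy block store: physical blocks plus one reference count per block.
struct BlockStore {
    blocks: Vec<Vec<u8>>,
    refcount: Vec<u32>,
}

/// A volume or snapshot is a map from logical block number to physical block id.
type BlockMap = HashMap<u64, usize>;

impl BlockStore {
    fn new() -> Self {
        BlockStore { blocks: Vec::new(), refcount: Vec::new() }
    }

    fn alloc(&mut self, data: Vec<u8>) -> usize {
        self.blocks.push(data);
        self.refcount.push(1);
        self.blocks.len() - 1
    }

    /// Snapshot: copy the map and bump refcounts; no block data is copied.
    fn snapshot(&mut self, volume: &BlockMap) -> BlockMap {
        for &id in volume.values() {
            self.refcount[id] += 1;
        }
        volume.clone()
    }

    /// Whole-block write: update in place if the block is unshared,
    /// otherwise leave the shared block alone and point the volume at a new one.
    fn write(&mut self, volume: &mut BlockMap, lba: u64, data: Vec<u8>) {
        match volume.get(&lba).copied() {
            Some(id) if self.refcount[id] == 1 => self.blocks[id] = data,
            Some(id) => {
                self.refcount[id] -= 1;
                let new_id = self.alloc(data);
                volume.insert(lba, new_id);
            }
            None => {
                let new_id = self.alloc(data);
                volume.insert(lba, new_id);
            }
        }
    }

    fn read<'a>(&'a self, map: &BlockMap, lba: u64) -> Option<&'a [u8]> {
        map.get(&lba).map(|&id| self.blocks[id].as_slice())
    }
}

fn main() {
    let mut store = BlockStore::new();
    let mut volume = BlockMap::new();
    store.write(&mut volume, 0, vec![1; 4]);
    let snap = store.snapshot(&volume);
    store.write(&mut volume, 0, vec![2; 4]);
    println!("snapshot sees {:?}, volume sees {:?}",
             store.read(&snap, 0), store.read(&volume, 0));
}
```

A clone in this model would be another `snapshot` call whose resulting map is treated as a writable volume.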

5. Hybrid Internal Protocol

Communication between frontend and storage nodes uses:

  • Data path: Custom binary protocol optimized for block I/O
  • Control path: gRPC for metadata, health checks, and coordination
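A data-path protocol of this kind typically starts each request with a small fixed-size header. The layout below is purely hypothetical (the README does not specify the wire format); it only illustrates how a compact little-endian header might be encoded and decoded.

```rust
/// Hypothetical operation codes; not the project's actual wire values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Opcode {
    Read = 1,
    Write = 2,
    Flush = 3,
}

/// Hypothetical request header: opcode, request id, byte offset, byte length.
#[derive(Debug, PartialEq, Eq)]
struct Header {
    opcode: Opcode,
    request_id: u64,
    offset: u64,
    length: u32,
}

const HEADER_LEN: usize = 1 + 8 + 8 + 4;

fn read_u64(bytes: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(bytes);
    u64::from_le_bytes(a)
}

fn read_u32(bytes: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(bytes);
    u32::from_le_bytes(a)
}

impl Header {
    /// Serialize the header as little-endian fields in a fixed-size buffer.
    fn encode(&self) -> [u8; HEADER_LEN] {
        let mut buf = [0u8; HEADER_LEN];
        buf[0] = self.opcode as u8;
        buf[1..9].copy_from_slice(&self.request_id.to_le_bytes());
        buf[9..17].copy_from_slice(&self.offset.to_le_bytes());
        buf[17..21].copy_from_slice(&self.length.to_le_bytes());
        buf
    }

    /// Parse a header; returns `None` on short input or an unknown opcode.
    fn decode(buf: &[u8]) -> Option<Header> {
        if buf.len() < HEADER_LEN {
            return None;
        }
        let opcode = match buf[0] {
            1 => Opcode::Read,
            2 => Opcode::Write,
            3 => Opcode::Flush,
            _ => return None,
        };
        Some(Header {
            opcode,
            request_id: read_u64(&buf[1..9]),
            offset: read_u64(&buf[9..17]),
            length: read_u32(&buf[17..21]),
        })
    }
}

fn main() {
    let header = Header { opcode: Opcode::Write, request_id: 7, offset: 8192, length: 4096 };
    let bytes = header.encode();
    println!("{} bytes on the wire, decoded: {:?}", bytes.len(), Header::decode(&bytes));
}
```

A fixed header like this can be parsed without allocation, which is one reason a custom binary format may be preferred over gRPC on the hot path.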

Technology Stack

Component            Technology
Language             Rust
Async I/O            io_uring (via io-uring crate)
Hypervisor protocol  NVMe/TCP
Control plane RPC    gRPC (tonic)
Configuration        TOML
Serialization        Protocol Buffers

Requirements

  • Linux 5.19+ (for io_uring NVMe passthrough)
  • NVMe drives exposed through the NVMe generic character device (e.g. /dev/ng0n1)
  • Rust 1.75+

Quick Start

Build

cargo build --release

Running a 3-Node Frontend Cluster

Config files are provided in config/:

config/frontend-1.toml:

nvme_tcp_port = 4420

[control_plane]
grpc_port = 9001

[raft]
node_id = 1
listen_addr = "[::1]:6001"
storage_path = "data/frontend-1/raft.redb"
bootstrap = true

[raft.peers]
1 = "[::1]:6001"
2 = "[::1]:6002"
3 = "[::1]:6003"

config/frontend-2.toml is identical except for: node_id = 2, listen_addr = "[::1]:6002", nvme_tcp_port = 4421, grpc_port = 9002, bootstrap = false

config/frontend-3.toml is identical except for: node_id = 3, listen_addr = "[::1]:6003", nvme_tcp_port = 4422, grpc_port = 9003, bootstrap = false

Start all three frontends in separate terminals:

# Terminal 1 - start first (bootstrap node)
cargo run --release --bin kernstor-frontend -- -c config/frontend-1.toml

# Terminal 2
cargo run --release --bin kernstor-frontend -- -c config/frontend-2.toml

# Terminal 3
cargo run --release --bin kernstor-frontend -- -c config/frontend-3.toml

Running 3 Storage Nodes (File Backend)

Config files are provided in config/:

config/node-1.toml:

uuid = "00000000-0000-0000-0000-000000000001"
grpc_port = 9101

[[storage]]
id = "10000000-0000-0000-0000-000000000001"  # Unique drive UUID
backend = "file:data/node-1/storage.bin"
metadata_path = "data/node-1/metadata.redb"
listen_address = "::1"
port = 4431
hugepages = 0
capacity_bytes = 1073741824  # 1 GiB

config/node-2.toml: (uuid = ...002, grpc_port = 9102, port = 4432, etc.)

config/node-3.toml: (uuid = ...003, grpc_port = 9103, port = 4433, etc.)

Start all three storage nodes:

# Terminal 4
sudo cargo run --release --bin kernstor-node -- -c config/node-1.toml

# Terminal 5
sudo cargo run --release --bin kernstor-node -- -c config/node-2.toml

# Terminal 6
sudo cargo run --release --bin kernstor-node -- -c config/node-3.toml

Note: Storage nodes require sudo for io_uring NVMe passthrough.

Running with NVMe Drives

To find NGUIDs on your system:

sudo nvme id-ns /dev/nvme0n1 -o json | jq -r .nguid

config/node-nvme.toml:

uuid = "your-node-uuid"
grpc_port = 9101

[[storage]]
id = "your-drive-uuid"  # Unique UUID for this drive
backend = "nvme:your-nguid-here"
metadata_path = "/var/lib/kernstor/ns1.redb"
listen_address = "::"
port = 4421
hugepages = 512  # 512 x 2 MiB hugepages = 1 GiB
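The `backend` strings in the configs above follow a `scheme:value` shape (`file:<path>` or `nvme:<nguid>`). As a hedged sketch of how such a string might be validated, the code below assumes the NGUID is written as 32 hexadecimal digits, which is how a 16-byte NGUID is normally printed; `Backend` and `parse_backend` are hypothetical names, not the project's API.

```rust
/// Hypothetical parsed form of the `backend` config value.
#[derive(Debug, PartialEq, Eq)]
enum Backend {
    /// Path to a backing file.
    File(String),
    /// NGUID as 32 lowercase hex digits (16 bytes).
    Nvme(String),
}

fn parse_backend(s: &str) -> Result<Backend, String> {
    let (scheme, rest) = s
        .split_once(':')
        .ok_or_else(|| format!("missing scheme in {:?}", s))?;
    match scheme {
        "file" if !rest.is_empty() => Ok(Backend::File(rest.to_string())),
        "nvme" => {
            if rest.len() == 32 && rest.chars().all(|c| c.is_ascii_hexdigit()) {
                Ok(Backend::Nvme(rest.to_ascii_lowercase()))
            } else {
                Err(format!("NGUID must be 32 hex digits, got {:?}", rest))
            }
        }
        _ => Err(format!("unsupported backend {:?}", s)),
    }
}

fn main() {
    println!("{:?}", parse_backend("file:data/node-1/storage.bin"));
    // The placeholder from the example config is rejected until a real NGUID is filled in.
    println!("{:?}", parse_backend("nvme:your-nguid-here"));
}
```

Rejecting placeholder values at startup gives a clearer error than failing later when the device lookup finds no match.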

CLI Commands

The kernstor-cli tool manages the cluster:

# Default control plane address: [::1]:9001
# Override with: --control-plane HOST:PORT

Node Management

# Register a storage node with the cluster (UUID is fetched from the node)
kernstor-cli node register \
  --grpc-address [::1]:9101 \
  --label name=node-1 \
  --label rack=rack-a

# List all registered nodes
kernstor-cli node list

# Get details of a specific node
kernstor-cli node get --id 00000000-0000-0000-0000-000000000001

# Mark node for draining (no new chunks placed)
kernstor-cli node drain --id 00000000-0000-0000-0000-000000000001

# Permanently remove a node
kernstor-cli node decommission --id 00000000-0000-0000-0000-000000000001
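Labels such as `rack=rack-a` and the drain state give the control plane what it needs for placement decisions. The sketch below shows one possible policy, assumed for illustration only: skip draining nodes and use each rack at most once. `Node` and `place` are hypothetical and do not describe Kernstor's actual placement algorithm.

```rust
use std::collections::HashSet;

/// Hypothetical view of a registered storage node.
#[derive(Debug)]
struct Node {
    name: &'static str,
    rack: &'static str,
    draining: bool,
}

/// Pick exactly `replicas` nodes in registration order, skipping draining
/// nodes and never using the same rack twice. Returns `None` if that is impossible.
fn place(nodes: &[Node], replicas: usize) -> Option<Vec<&Node>> {
    let mut used_racks = HashSet::new();
    let mut chosen = Vec::new();
    for node in nodes.iter().filter(|n| !n.draining) {
        if chosen.len() == replicas {
            break;
        }
        if used_racks.insert(node.rack) {
            chosen.push(node);
        }
    }
    if chosen.len() == replicas {
        Some(chosen)
    } else {
        None
    }
}

fn main() {
    let nodes = vec![
        Node { name: "node-1", rack: "rack-a", draining: false },
        Node { name: "node-2", rack: "rack-a", draining: false },
        Node { name: "node-3", rack: "rack-b", draining: true },
        Node { name: "node-4", rack: "rack-b", draining: false },
        Node { name: "node-5", rack: "rack-c", draining: false },
    ];
    match place(&nodes, 3) {
        Some(picked) => {
            for node in picked {
                println!("{} ({})", node.name, node.rack);
            }
        }
        None => println!("not enough eligible nodes"),
    }
}
```

A real policy would likely also weigh free capacity and load; this sketch only covers the two inputs visible in the CLI examples.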

Volume Management

# Create a volume (size: K, M, G, T suffixes supported)
kernstor-cli volume create --name my-volume --size 100G --replication 3

# List all volumes
kernstor-cli volume list

# Get volume details (with chunk placement)
kernstor-cli volume get --id <volume-uuid> --show-chunks

# Delete a volume
kernstor-cli volume delete --id <volume-uuid>
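The `--size` suffixes could be parsed along the following lines. This is a sketch under an explicit assumption: the suffixes are treated as binary multiples (1K = 1024 bytes), which the README does not state. `parse_size` is a hypothetical helper, not the CLI's real parser.

```rust
/// Parse sizes such as "4096", "512M", or "100G" into bytes.
/// Assumes binary multiples (K = 2^10, M = 2^20, G = 2^30, T = 2^40).
fn parse_size(input: &str) -> Result<u64, String> {
    let s = input.trim();
    let (digits, multiplier) = match s.chars().last() {
        Some('K') => (&s[..s.len() - 1], 1u64 << 10),
        Some('M') => (&s[..s.len() - 1], 1u64 << 20),
        Some('G') => (&s[..s.len() - 1], 1u64 << 30),
        Some('T') => (&s[..s.len() - 1], 1u64 << 40),
        Some(c) if c.is_ascii_digit() => (s, 1),
        _ => return Err(format!("invalid size {:?}", input)),
    };
    let value: u64 = digits
        .parse()
        .map_err(|_| format!("invalid size {:?}", input))?;
    value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("size {:?} overflows u64", input))
}

fn main() {
    for s in ["100G", "512M", "4096", "12X"] {
        println!("{} -> {:?}", s, parse_size(s));
    }
}
```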

Data Plane Operations

# Write data to a chunk (data must be 4 KiB aligned)
kernstor-cli --addr [::1]:4431 write \
  --drive <chunk-id> \
  --offset 0 \
  --data @file.bin  # or hex: 0x00112233...

# Read data from a chunk
kernstor-cli --addr [::1]:4431 read \
  --drive <chunk-id> \
  --offset 0 \
  --length 4096

# Flush data to durable storage
kernstor-cli --addr [::1]:4431 flush --drive <chunk-id>
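Before issuing a write, a client can check the 4 KiB alignment rule locally. The sketch below assumes the rule applies to both the offset and the length, which the README does not spell out; `check_aligned` and `pad_to_block` are hypothetical helpers.

```rust
const BLOCK_SIZE: u64 = 4096;

/// Verify that an I/O request starts and ends on 4 KiB boundaries.
fn check_aligned(offset: u64, length: u64) -> Result<(), String> {
    if length == 0 {
        return Err("length must be non-zero".to_string());
    }
    if offset % BLOCK_SIZE != 0 {
        return Err(format!("offset {} is not a multiple of {}", offset, BLOCK_SIZE));
    }
    if length % BLOCK_SIZE != 0 {
        return Err(format!("length {} is not a multiple of {}", length, BLOCK_SIZE));
    }
    Ok(())
}

/// Zero-pad a buffer up to the next 4 KiB boundary.
fn pad_to_block(mut data: Vec<u8>) -> Vec<u8> {
    let block = BLOCK_SIZE as usize;
    let remainder = data.len() % block;
    if remainder != 0 {
        data.resize(data.len() + (block - remainder), 0);
    }
    data
}

fn main() {
    let payload = pad_to_block(vec![0xAB; 100]);
    println!("padded length: {}", payload.len());
    println!("{:?}", check_aligned(0, payload.len() as u64));
    println!("{:?}", check_aligned(512, 4096));
}
```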

Project Structure

kernstor/
├── crates/
│   ├── kernstor-node/         # Storage node daemon (io_uring reactor)
│   ├── kernstor-frontend/     # Frontend daemon (NVMe/TCP + Raft + Control Plane)
│   ├── kernstor-cli/          # CLI tool for cluster management
│   ├── kernstor-proto/        # Protocol Buffers + gRPC definitions
│   └── kernstor-raft/         # Raft consensus library wrapper
├── config/                    # Example configuration files
├── Cargo.toml                 # Workspace manifest
└── README.md

License

TBD
