Container-Runtime

A lightweight Docker-like container runtime built from scratch in C.
Runs isolated containers using Linux namespaces, captures output through a bounded-buffer logging pipeline,
enforces memory limits via a kernel module, and exposes a supervisor CLI.

How It Works in 30 Seconds

engine supervisor starts a long-running process that owns all container lifecycle — spawning, reaping, and state tracking
engine start <name> calls clone() with Linux namespace flags, chroot()s into an Alpine rootfs, and mounts /proc — giving each container its own PID, UTS, and mount namespace
Pipes + bounded buffer capture container stdout/stderr asynchronously; a dedicated logging thread drains the buffer to per-container log files
monitor.ko is a kernel module that fires a 1-second timer per container, reads /proc/<pid>/status for RSS, and sends SIGKILL (or logs a soft warning) when limits are exceeded
A UNIX domain socket at /tmp/mini_runtime.sock carries all CLI commands (start, stop, ps, logs) to the supervisor and returns responses

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          USER SPACE                                 │
│                                                                     │
│   ┌──────────────┐   UNIX socket    ┌──────────────────────────┐   │
│   │  CLI client  │ ◄──────────────► │     Supervisor process   │   │
│   │  engine      │                  │  ┌──────────────────────┐│   │
│   │  start/stop  │                  │  │ metadata_lock        ││   │
│   │  ps/logs     │                  │  │ container list       ││   │
│   └──────────────┘                  │  │ SIGCHLD handler      ││   │
│                                     │  └──────────┬───────────┘│   │
│                                     └─────────────┼────────────┘   │
│                        clone()                    │                 │
│               ┌───────────────────────────────────┘                 │
│               │                                                     │
│   ┌───────────▼──────────┐   ┌──────────────────────────────────┐  │
│   │   Container α        │   │   Container β                    │  │
│   │  PID · UTS · MNT ns  │   │  PID · UTS · MNT ns              │  │
│   │  chroot → Alpine     │   │  chroot → Alpine                 │  │
│   └──────────┬───────────┘   └──────────┬───────────────────────┘  │
│              │  pipe                    │  pipe                     │
│              └──────────┬───────────────┘                           │
│                         ▼                                           │
│              ┌──────────────────────┐                               │
│              │   Bounded Buffer     │  ← mutex + not_empty/full     │
│              │   pipe_reader_thread │     condition variables       │
│              └──────────┬───────────┘                               │
│                         ▼                                           │
│              ┌──────────────────────┐                               │
│              │   Logging thread     │ ──► /logs/<name>.log          │
│              └──────────────────────┘                               │
│                                                                     │
├─────────────────────────────────────────────────────────────────────┤
│                         KERNEL SPACE                                │
│                                                                     │
│   ┌──────────────────────────────────────────────────────────────┐ │
│   │  monitor.ko  │  1s timer  │  RSS check  │  SIGKILL/SIGWARN  │ │
│   │  ioctl ◄──────────────────────────────────── supervisor      │ │
│   └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Team Information

Name	SRN
Vivian Sobers E	PES1UG24CS901
Dhawal Pathak	PES1UG24CS151

Build, Load, and Run Instructions

Dependencies

sudo apt update
sudo apt install -y build-essential linux-headers-$(uname -r)

Prepare Root Filesystem

cd Container-Runtime
mkdir rootfs
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/x86_64/alpine-minirootfs-3.20.3-x86_64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-x86_64.tar.gz -C rootfs

Build

cd boilerplate
make

This builds engine, monitor.ko, memory_hog, cpu_hog, and io_pulse.

Copy Workloads into Rootfs

cp cpu_hog io_pulse memory_hog ../rootfs/

Load Kernel Module

sudo insmod monitor.ko
ls -l /dev/container_monitor

Start Supervisor

# Terminal 1
sudo ./engine supervisor ../rootfs

Launch Containers

# Terminal 2
sudo ./engine start alpha ../rootfs /bin/hostname
sudo ./engine start beta  ../rootfs /bin/sh
sudo ./engine ps
sudo ./engine logs alpha
sudo ./engine stop alpha

Run a Container in Foreground

sudo ./engine run mycontainer ../rootfs /cpu_hog 5

Memory Limit Test

sudo ./engine start memtest ../rootfs /memory_hog 1 500 --soft-mib 3 --hard-mib 6
sudo dmesg | grep memtest

Scheduling Experiments

# Two CPU-bound containers with different priorities
time sudo ./engine run cpu_normal ../rootfs /cpu_hog 10 --nice 0  &
time sudo ./engine run cpu_nice   ../rootfs /cpu_hog 10 --nice 15 &
wait

# CPU-bound vs I/O-bound
time sudo ./engine run cpu_exp ../rootfs /cpu_hog  10    --nice 0 &
time sudo ./engine run io_exp  ../rootfs /io_pulse 20 200         &
wait

Unload Module and Clean Up

sudo rmmod monitor
unlink /tmp/mini_runtime.sock

Warning

The runtime requires root privileges for clone() with namespace flags and insmod. Never run untrusted workloads without additional seccomp/capability restrictions.

Note

chroot is used for filesystem isolation in this implementation. While sufficient for the project scope, pivot_root is the production-grade choice as it prevents privileged escape from the container.

IPC & Control Flow

CLI → Supervisor command lifecycle

sequenceDiagram
    participant CLI as engine (CLI client)
    participant Sock as UNIX socket<br/>/tmp/mini_runtime.sock
    participant Sup as Supervisor process
    participant Cont as Container (child)

    CLI->>Sock: connect()
    CLI->>Sock: send("start alpha /bin/sh")
    Sock->>Sup: accept() + recv()
    Sup->>Cont: clone(CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS)
    Cont->>Cont: chroot(rootfs) + mount /proc + exec()
    Sup->>Sup: record metadata (pid, state, start_time)
    Sup->>Sock: send("OK: alpha started, pid=1234")
    Sock->>CLI: recv() → print response

    Note over Cont: Container runs workload...

    Cont-->>Sup: exit → SIGCHLD
    Sup->>Sup: waitpid() + update state → exited

Pipe → Bounded buffer → Log file

flowchart LR
    C1["Container α\nstdout/stderr"] -->|pipe fd| R1["pipe_reader\nthread α"]
    C2["Container β\nstdout/stderr"] -->|pipe fd| R2["pipe_reader\nthread β"]
    R1 -->|produce| BB["Bounded Buffer\n━━━━━━━━━━━━\nmutex lock\nnot_full cvar\nnot_empty cvar"]
    R2 -->|produce| BB
    BB -->|consume| LT["Logging thread"]
    LT --> LF1["logs/alpha.log"]
    LT --> LF2["logs/beta.log"]

Memory enforcement (kernel module)

flowchart TD
    IM["insmod monitor.ko"] --> TI["1-second kernel timer fires"]
    TI --> RL["Read /proc/pid/status\n→ VmRSS"]
    RL --> SC{RSS > soft limit?}
    SC -- Yes --> WN["dmesg: SOFT LIMIT WARNING\n(container keeps running)"]
    SC -- No --> HC{RSS > hard limit?}
    WN --> HC
    HC -- Yes --> KL["send SIGKILL to container PID"]
    KL --> US["Supervisor: SIGCHLD →\nwaitpid() → state = killed"]
    HC -- No --> TI

Demo Screenshots

Screenshot 1 — Multi-container supervision

Two containers (alpha, beta) running under a single supervisor process alongside previously tracked containers.

Screenshot 2 — Metadata tracking

ps output showing container ID, state, host PID, and start time for all tracked containers.

Screenshot 3 — Bounded-buffer logging

Log file contents captured through the pipe → bounded buffer → logging thread pipeline. Container hostname output routed to its log file.

Screenshot 4 — CLI and IPC

stop command issued from the CLI client, routed to the supervisor over a UNIX domain socket, supervisor responds and updates state.

Screenshot 5 — Soft-limit warning

dmesg output showing the kernel module emitting a SOFT LIMIT warning when the container's RSS exceeded 3 MiB.

Screenshot 6 — Hard-limit enforcement

dmesg output showing the kernel module killing the container when RSS exceeded 6 MiB. Supervisor metadata updates state to killed.

Screenshot 7 — Scheduling experiment

Two CPU-bound containers run simultaneously with nice=0 and nice=15. The lower-priority container took ~2× longer to complete the same workload.

Screenshot 8 — Clean teardown

Supervisor exits cleanly on SIGINT. No zombie processes remain. Container states reflect final exit conditions.

Engineering Analysis

4.1 Isolation Mechanisms

Each container is created using clone() with CLONE_NEWPID, CLONE_NEWUTS, and CLONE_NEWNS flags. These give the container its own PID namespace (processes inside see themselves starting at PID 1), its own UTS namespace (allowing a distinct hostname), and its own mount namespace (so filesystem mounts don't leak to the host). After clone(), the child calls chroot() into the Alpine rootfs, then mounts /proc inside the new mount namespace so process visibility works correctly inside the container.

The host kernel is still fully shared — all containers run on the same kernel, share the same physical memory management, and are subject to the same scheduler. Namespaces provide isolation of views, not true separation. The host can always see all container processes by their host PIDs.

4.2 Supervisor and Process Lifecycle

The long-running supervisor is essential because it maintains the authoritative state of all containers. Without a persistent parent, there is no process to reap children (causing zombies), no place to store metadata, and no endpoint for the CLI to talk to.

Each container is created via clone() making the supervisor its direct parent. When a container exits, the kernel delivers SIGCHLD to the supervisor. The sigchld_handler calls waitpid(-1, &status, WNOHANG) in a loop to reap all exited children and update their metadata. The stop_requested flag distinguishes a graceful stop (supervisor sent SIGTERM) from a hard limit kill (kernel sent SIGKILL) from a natural exit.

4.3 IPC, Threads, and Synchronization

The runtime uses two IPC mechanisms. A pipe per container carries stdout/stderr from the container to a pipe_reader_thread in the supervisor. A UNIX domain socket (/tmp/mini_runtime.sock) carries CLI commands and responses between the client process and the supervisor.

The bounded buffer sits between the pipe reader threads (producers) and the logging thread (consumer). Without synchronization, producers and the consumer would race on head, tail, and count, causing lost data or corruption. A pthread_mutex protects all accesses to the buffer struct. Two condition variables — not_empty and not_full — allow producers to block when the buffer is full and the consumer to block when it is empty, avoiding busy-waiting. The container metadata list is protected by a separate metadata_lock mutex, kept distinct from the buffer lock to avoid deadlock between the logging path and the CLI command path.

4.4 Memory Management and Enforcement

RSS (Resident Set Size) measures the physical memory currently mapped and present in RAM for a process. It does not include swapped-out pages, shared libraries counted once across processes, or memory that has been allocated but not yet touched (due to lazy allocation). RSS is therefore a conservative but practical measure of actual memory pressure a process is causing.

Soft and hard limits serve different purposes. The soft limit is a warning threshold — it signals that a container is approaching its budget without terminating it, giving the runtime or operator a chance to react. The hard limit is a hard enforcement boundary — the process is killed when it crosses it. Enforcement belongs in kernel space because user-space monitoring is inherently racy: by the time a user-space monitor reads RSS and decides to kill a process, the process could have allocated significantly more memory. The kernel module's timer fires every second and can send SIGKILL atomically within the same execution context as the RSS check.

4.5 Scheduling Behavior

Linux uses the Completely Fair Scheduler (CFS) which allocates CPU time proportionally based on each task's weight. The nice value maps to a weight: nice=0 gets a baseline weight of 1024, while nice=15 gets a significantly lower weight, meaning CFS gives it proportionally less CPU time when competing with a higher-priority process.

Our experiments confirmed this. Two identical cpu_hog processes running for 10 seconds each completed in 9.3s (nice=0) and 19.3s (nice=15) respectively when running simultaneously — the lower-priority container effectively got half the CPU share. In the CPU vs I/O experiment, the CPU-bound container finished in 4s while the I/O-bound container took 13s. The CPU-bound process got more CPU time because the I/O-bound process was frequently sleeping between write iterations, voluntarily yielding the CPU. CFS correctly identified the I/O-bound process as less CPU-hungry and prioritized the CPU-bound one when it was runnable.

Design Decisions and Tradeoffs

Namespace Isolation

Choice: CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS with chroot.

Tradeoff: chroot is simpler to implement than pivot_root but is less secure — a privileged process inside the container could potentially escape.

Justification: For this project's scope, chroot provides the required filesystem isolation without the complexity of pivot_root.

Supervisor Architecture

Choice: Single long-running supervisor process with a UNIX socket accept loop.

Tradeoff: The accept loop handles one request at a time sequentially. Concurrent CLI commands (e.g., two simultaneous start calls) are serialized. A threaded accept loop would handle this better.

Justification: Sequential handling is safe, simple, and sufficient for the demo workload. It avoids concurrency bugs in the command dispatch path.

IPC and Logging

Choice: Pipes for log data, UNIX domain socket for control commands.

Tradeoff: Log data is limited to CONTROL_MESSAGE_LEN bytes in the response. For large logs, streaming over a second socket would be better.

Justification: Two distinct channels keeps log throughput and control latency independent. A single channel for both would mean large log reads could block CLI responsiveness.

Kernel Monitor

Choice: Mutex-protected linked list with a 1-second periodic timer.

Tradeoff: A 1-second polling interval means a process could exceed its hard limit by up to 1 second's worth of allocations before being killed.

Justification: A mutex is appropriate here because the timer callback and ioctl handler can sleep (they run in process context), making a spinlock unnecessary. The 1-second interval is a reasonable tradeoff between enforcement latency and kernel overhead.

Scheduling Experiments

Choice: Used nice values rather than CPU affinity for priority experiments.

Tradeoff: nice affects scheduling weight but both processes still run on all CPUs. CPU affinity would isolate them more strictly but would require a multi-core setup to observe meaningful differences.

Justification: nice directly exercises CFS weight-based scheduling which is the core Linux scheduling mechanism, making the results more illustrative of scheduler behavior.

Scheduler Experiment Results

Experiment 1 — Two CPU-bound containers with different priorities

Both containers ran /cpu_hog 10 (burn CPU for 10 seconds) simultaneously.

Container	Nice Value	Wall Time	CPU Share
`cpu_normal`	0	9.295s	~67%
`cpu_nice`	15	19.295s	~33%

The nice=15 container took approximately 2× longer to complete the same workload. CFS assigned cpu_normal roughly twice the CPU share of cpu_nice due to the weight difference between nice=0 and nice=15.

Experiment 2 — CPU-bound vs I/O-bound container

Container	Type	Wall Time	Why
`cpu_exp`	CPU-bound (`nice=0`, 10s)	4.063s	Got extra CPU time while `io_exp` slept
`io_exp`	I/O-bound (20 iter × 200ms sleep)	13.169s	Woke up to scheduling delays behind `cpu_exp`

The CPU-bound container finished significantly faster than its 10-second target because the I/O-bound container spent most of its time sleeping. CFS detected that the I/O-bound process was not consuming its full CPU quota and gave the CPU-bound process more time. These results demonstrate two core CFS properties: weight-based fairness under contention, and throughput-oriented behavior where sleeping processes do not block CPU-hungry ones.

Experiment 1: nice priority
  cpu_normal (nice=0)  ████████████░░░░░░░░░░  9.3s
  cpu_nice   (nice=15) ████████████████████░░  19.3s
                       0s        10s       20s

Experiment 2: CPU vs I/O bound
  cpu_exp (CPU-bound)  ████░░░░░░░░░░░░░░░░░░  4.1s
  io_exp  (I/O-bound)  █████████████░░░░░░░░░  13.2s
                       0s        10s       20s

Licensed under the Apache 2.0 License

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.github/workflows		.github/workflows
screenshots		screenshots
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Container-Runtime

How It Works in 30 Seconds

System Architecture

Team Information

Build, Load, and Run Instructions

Dependencies

Prepare Root Filesystem

Build

Copy Workloads into Rootfs

Load Kernel Module

Start Supervisor

Launch Containers

Run a Container in Foreground

Memory Limit Test

Scheduling Experiments

Unload Module and Clean Up

IPC & Control Flow

CLI → Supervisor command lifecycle

Pipe → Bounded buffer → Log file

Memory enforcement (kernel module)

Demo Screenshots

Screenshot 1 — Multi-container supervision

Screenshot 2 — Metadata tracking

Screenshot 3 — Bounded-buffer logging

Screenshot 4 — CLI and IPC

Screenshot 5 — Soft-limit warning

Screenshot 6 — Hard-limit enforcement

Screenshot 7 — Scheduling experiment

Screenshot 8 — Clean teardown

Engineering Analysis

4.1 Isolation Mechanisms

4.2 Supervisor and Process Lifecycle

4.3 IPC, Threads, and Synchronization

4.4 Memory Management and Enforcement

4.5 Scheduling Behavior

Design Decisions and Tradeoffs

Namespace Isolation

Supervisor Architecture

IPC and Logging

Kernel Monitor

Scheduling Experiments

Scheduler Experiment Results

Experiment 1 — Two CPU-bound containers with different priorities

Experiment 2 — CPU-bound vs I/O-bound container

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages