Home
DeepDock is a lightweight tool for automating the management of Docker containers that require NVIDIA GPU acceleration. Built for research laboratories and other multi-user servers, it exposes complex container operations through a simple FastAPI-based REST API and a supporting CLI tool.
It covers the entire container lifecycle, from pulling the base image and creating a persistent environment to monitoring GPU resources in real time and isolating each user's volumes, so that GPU-accelerated workloads from several users can run in parallel without contending for the same resources.
- Environment Creation and Maintenance: Generation and control of containers based on CUDA images.
- Deterministic GPU Allocation: Ensures that specific GPUs (by ID) are exclusively allocated to a container, preventing resource contention.
- Real-Time Monitoring: Collection and exposure of vital hardware metrics (CPU, RAM) and GPU metrics (utilization, memory, temperature) via NVML.
- Image and Lifecycle Management: Complete operations for pulling, listing, creating, starting, stopping, and removing containers and images.
- Data Isolation: Automatic configuration of persistent volumes per user.
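To illustrate what deterministic, exclusive GPU allocation involves, here is a minimal in-memory sketch. It is not DeepDock's actual implementation: the `GpuAllocator` class, its method names, and the container names are hypothetical, and a real service would also need to persist this state and pass the chosen IDs to Docker.

```python
from typing import Dict, Iterable, List, Optional


class GpuAllocationError(Exception):
    """Raised when a request names an unknown GPU or one owned by another container."""


class GpuAllocator:
    """Tracks which container (if any) owns each GPU ID. Hypothetical helper."""

    def __init__(self, gpu_ids: Iterable[int]) -> None:
        # None means the GPU is free.
        self._owner: Dict[int, Optional[str]] = {gpu_id: None for gpu_id in gpu_ids}

    def allocate(self, container: str, requested: Iterable[int]) -> List[int]:
        """Grant all requested GPUs to `container`, or none of them."""
        wanted = sorted(set(requested))
        unknown = [g for g in wanted if g not in self._owner]
        if unknown:
            raise GpuAllocationError(f"unknown GPU IDs: {unknown}")
        busy = [g for g in wanted if self._owner[g] not in (None, container)]
        if busy:
            raise GpuAllocationError(f"GPUs already allocated: {busy}")
        for g in wanted:
            self._owner[g] = container
        return wanted

    def release(self, container: str) -> None:
        """Free every GPU currently owned by `container`."""
        for gpu_id, owner in self._owner.items():
            if owner == container:
                self._owner[gpu_id] = None
```

The request is checked in full before any GPU is marked as taken, so a rejected request leaves the allocation table unchanged.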
Primary Users: DeepDock is designed primarily for novice researchers, undergraduate students (ICs), and new postgraduate students, who often struggle to set up and configure Deep Learning (DL) environments. It provides a simple, direct interface to isolated, GPU-accelerated computing environments, removing the overhead of configuring Docker and GPU resources by hand.
Program Nature: Finished utility tool, serving as a specialized orchestration layer for Docker-based machine learning infrastructures.
Setting up and configuring Deep Learning (DL) environments is difficult, particularly for novice users, undergraduate researchers (ICs), and new postgraduate students. In a shared laboratory or research cluster, each project has its own dependency requirements, so environments must be isolated from one another. Resource isolation matters just as much, especially for GPUs: it keeps a beginner from accidentally contending for resources or interfering with colleagues' stable working environments. This API was conceived for this kind of laboratory, where researchers' levels of expertise vary widely. Its primary goal is to remove the overhead of environment configuration so that all users can focus on their core research tasks.
- FastAPI-based REST API: Modern, high-performance API for managing resources programmatically.
- Real-Time GPU Monitoring: Uses NVML to provide live GPU metrics (utilization, memory, temperature).
- Simplified Container Lifecycle: Endpoints for pulling, listing, creating, starting, stopping, and inspecting containers.
- Automatic Volume Management: Handles per-user volume binding for persistent data storage.
- NVIDIA Container Toolkit Compatibility: Ensures seamless and automatic GPU access inside containers.
- Lightweight & Deployable: Minimal configuration required, compatible with standard Docker setups.
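Per-user volume binding can be sketched as follows. This is an assumption-laden example rather than DeepDock's own code: the function name, the host root `/srv/deepdock/volumes`, the container mount point `/workspace`, and the username rule are all hypothetical. The returned mapping follows the `volumes` dictionary format accepted by the Docker SDK for Python.

```python
import re
from pathlib import PurePosixPath
from typing import Dict

# Hypothetical defaults; a real deployment would read these from configuration.
HOST_VOLUME_ROOT = "/srv/deepdock/volumes"
CONTAINER_MOUNT_POINT = "/workspace"

# Conservative username rule so a name can never escape the volume root.
_USERNAME_RE = re.compile(r"[a-z_][a-z0-9_-]{0,31}")


def user_volume_binding(
    username: str,
    host_root: str = HOST_VOLUME_ROOT,
    container_path: str = CONTAINER_MOUNT_POINT,
) -> Dict[str, Dict[str, str]]:
    """Build a read-write bind mount of the user's own directory."""
    if not _USERNAME_RE.fullmatch(username):
        raise ValueError(f"invalid username: {username!r}")
    host_path = PurePosixPath(host_root) / username
    return {str(host_path): {"bind": container_path, "mode": "rw"}}
```

Validating the username before building the path is what keeps one user's container from being pointed at another user's directory, for example through a name such as `../bob`.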
Specification of Functional and Non-Functional Requirements:
- FR1 (API Interface): Must provide a FastAPI-based REST API for managing resources programmatically.
- FR2 (Environment Isolation): Must enable the creation of isolated environments with specific dependency requirements (e.g., CUDA versions) and exclusive GPU allocation.
- FR3 (Single Machine Abstraction): Each isolated environment must provide the abstraction of a single, dedicated machine, accessible and manageable via SSH for the user.
- FR4 (Data Persistence): Must manage volumes automatically, binding a persistent volume per user to facilitate the storage of and access to large datasets.
- FR5 (Monitoring): Must provide real-time GPU monitoring through NVML (utilization, memory, and temperature), alongside monitoring of the host machine's CPU and RAM.
- FR6 (Lifecycle Management): Must provide endpoints covering the container lifecycle: pulling, listing, creating, starting, stopping, restarting, and removing containers.
- NFR1 (Compatibility): Must be compatible with the NVIDIA Container Toolkit so that GPU access inside containers is seamless and automatic.
- NFR2 (Deployment): Must be lightweight and easy to deploy, requiring minimal configuration and working with standard Docker setups.
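The GPU side of the monitoring requirement (FR5) could be met along these lines. This is a hedged sketch, assuming the `pynvml` module (distributed in the `nvidia-ml-py` package) is installed; the function name and the shape of the returned dictionaries are hypothetical, not DeepDock's actual schema. Host CPU and RAM metrics are left out because they need a separate library.

```python
from typing import Dict, List


def read_gpu_metrics() -> List[Dict[str, int]]:
    """Return one metrics dict per visible GPU, or [] when NVML is unavailable."""
    try:
        import pynvml  # assumption: provided by the nvidia-ml-py package
    except ImportError:
        return []

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no NVIDIA driver or no GPU on this host

    metrics: List[Dict[str, int]] = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU
            )
            metrics.append(
                {
                    "gpu_id": index,
                    "utilization_percent": util.gpu,
                    "memory_used_bytes": mem.used,
                    "memory_total_bytes": mem.total,
                    "temperature_celsius": temp,
                }
            )
    except pynvml.NVMLError:
        # A query failed partway through; return whatever was collected.
        pass
    finally:
        pynvml.nvmlShutdown()
    return metrics
```

Returning an empty list instead of raising lets the same code run on a development machine without a GPU; an API endpoint built on top of it could report that case explicitly.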
Navigate the DeepDock Wiki:
- Getting Started: Quick overview of installation and service startup
- Installation & Setup: Guide to Prerequisites and Installation
- API Reference: Detailed Endpoints and Schemas