Container Development Guide for HPC Clusters

A comprehensive guide for building, customizing, and deploying containers on HPC clusters using Enroot and Slurm.

Overview

This workflow enables you to:

Import container images from Docker Hub or NVIDIA NGC
Customize containers by installing dependencies
Save modified containers for reuse
Scale workloads across multiple nodes and GPUs

Prerequisites:

Access to an HPC or large AI cluster with Slurm and Enroot installed (the srun --container-* options used below are provided by the Pyxis plugin for Slurm)
Shared storage mounted at /mnt/shared (or equivalent)
Appropriate resource allocations

Step 1: Pull Container Images

Import container images using enroot import and save them as .sqsh files in shared storage.

From NVIDIA NGC (GPU-enabled containers)

enroot import -o /mnt/shared/containers/cuda13.0.2_ubuntu22.04.sqsh \
  docker://nvcr.io#nvidia/cuda:13.0.2-cudnn-devel-ubuntu22.04

From Docker Hub

enroot import -o /mnt/shared/containers/ubuntu22.04.sqsh \
  docker://ubuntu:22.04

Naming Convention: Use descriptive names like <base>_<version>_<description>.sqsh
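The naming convention is easier to keep consistent with a small helper. The sketch below is only an illustration: `sqsh_name` is a hypothetical function, not part of Enroot or Slurm, and it assumes the three parts contain no spaces or slashes.

```shell
# Hypothetical helper: build "<base>_<version>_<description>.sqsh".
sqsh_name() {
  if [ "$#" -ne 3 ]; then
    echo "usage: sqsh_name <base> <version> <description>" >&2
    return 1
  fi
  # Join the three parts with underscores and add the .sqsh extension.
  printf '%s_%s_%s.sqsh\n' "$1" "$2" "$3"
}

sqsh_name ubuntu 22.04 ffmpeg   # prints ubuntu_22.04_ffmpeg.sqsh
```

The result can be used as the output name, for example `enroot import -o "/mnt/shared/containers/$(sqsh_name ubuntu 22.04 base)" docker://ubuntu:22.04`.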

Step 2: Customize Containers

Launch an interactive container session to install dependencies and customize your environment.

Basic Interactive Session

srun --nodes=1 \
  --ntasks=1 \
  --gpus=1 \
  --container-writable \
  --container-remap-root \
  --container-mounts=/mnt/shared:/mnt/shared \
  --container-image=/mnt/shared/containers/ubuntu22.04.sqsh \
  --container-save=/mnt/shared/containers/ubuntu22.04_custom.sqsh \
  --pty bash

Parameter Breakdown

| Parameter | Description |
| --- | --- |
| `--nodes=1` | Number of nodes to allocate |
| `--ntasks=1` | Number of tasks (processes) to run |
| `--gpus=1` | Number of GPUs to allocate (optional) |
| `--container-writable` | Make the container filesystem writable so packages can be installed |
| `--container-remap-root` | Run as root inside the container, mapped to your own user on the host (avoids permission issues) |
| `--container-mounts` | Mount host directories into the container (source:destination) |
| `--container-image` | Path to the base container image |
| `--container-save` | Path where the modified container will be saved on exit |
| `--pty bash` | Launch an interactive bash shell |

Install Dependencies

Once inside the container, install your required packages:

# Update package lists
apt update

# Install system packages
apt install -y ffmpeg vim git wget curl

Save and Exit

Press CTRL+D or type exit to close the session. Your modifications will be automatically saved to the specified --container-save path.

Verify the Saved Container

ls -lh /mnt/shared/containers/

You should see your new .sqsh file with a timestamp indicating when it was saved.
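The same check can be scripted. The following is a minimal sketch, assuming you want to confirm that the saved image exists, is non-empty, and has a newer modification time than the base image; `check_saved_image` is a hypothetical name.

```shell
# Hypothetical helper: succeed only if the saved image exists, is non-empty,
# and is newer than the base image it was derived from.
check_saved_image() {
  base=$1
  saved=$2
  if [ ! -s "$saved" ]; then
    echo "missing or empty: $saved" >&2
    return 1
  fi
  # -nt compares modification times (supported by bash and dash).
  if [ ! "$saved" -nt "$base" ]; then
    echo "not newer than base image: $saved" >&2
    return 1
  fi
  echo "ok: $saved"
}
```

With the paths from Step 2, this would be called as `check_saved_image /mnt/shared/containers/ubuntu22.04.sqsh /mnt/shared/containers/ubuntu22.04_custom.sqsh`.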

Step 3: Deploy at Scale

Run your customized container across multiple nodes and GPUs.

Single Node, Single GPU

srun --nodes=1 \
  --gpus=1 \
  --container-mounts=/mnt/shared:/mnt/shared \
  --container-image=/mnt/shared/containers/ubuntu22.04_custom.sqsh \
  bash -c 'echo "Running on: $(hostname)" && your-command'

Multi-Node, Multi-GPU

srun --nodes=8 \
  --gpus-per-node=8 \
  --ntasks-per-node=8 \
  --container-mounts=/mnt/shared:/mnt/shared \
  --container-image=/mnt/shared/containers/cuda13.0.2_ubuntu22.04_custom.sqsh \
  bash -c 'echo "Node: $(hostname), GPU: $CUDA_VISIBLE_DEVICES" && nvidia-smi'

Running Python Scripts

srun --nodes=4 \
  --gpus-per-node=8 \
  --ntasks-per-node=8 \
  --container-mounts=/mnt/shared:/mnt/shared,/home/$USER:/home/$USER \
  --container-image=/mnt/shared/containers/pytorch_custom.sqsh \
  python /mnt/shared/scripts/train_model.py

Using Batch Scripts

Create a Slurm batch script (job.sh):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=24:00:00
#SBATCH --job-name=my_container_job

srun --container-mounts=/mnt/shared:/mnt/shared \
  --container-image=/mnt/shared/containers/my_custom_container.sqsh \
  python /mnt/shared/my_script.py

Submit the job:

sbatch job.sh

Best Practices

Container Organization

Store all containers in a centralized location (e.g., /mnt/shared/containers/)
Use version tags in container names (e.g., pytorch_2.0_cuda11.8.sqsh)
Document installed packages in a separate README or requirements file

Resource Management

Request only the resources you need (--gpus, --nodes, --time)
Set --ntasks-per-node equal to the number of GPUs per node for distributed training (one task per GPU)
Monitor resource usage with squeue, sinfo, and sacct
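When each task drives one GPU, the task usually needs to know its rank. The sketch below shows one way a per-task wrapper could derive it. `SLURM_PROCID`, `SLURM_LOCALID`, and `SLURM_NTASKS` are set by Slurm for every task; the output names `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are an assumption based on what PyTorch's environment-variable initialization reads, and your framework may expect different names. `export_rank_env` is a hypothetical function.

```shell
# Hypothetical wrapper logic: translate Slurm's per-task variables into the
# rank variables that many distributed-training launchers expect.
export_rank_env() {
  # Fall back to single-process defaults when run outside of Slurm.
  RANK=${SLURM_PROCID:-0}
  LOCAL_RANK=${SLURM_LOCALID:-0}
  WORLD_SIZE=${SLURM_NTASKS:-1}
  export RANK LOCAL_RANK WORLD_SIZE
  echo "rank $RANK of $WORLD_SIZE (local rank $LOCAL_RANK)"
}

export_rank_env
```

A complete PyTorch launch would also need `MASTER_ADDR` and `MASTER_PORT`, which are not covered here.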

Storage Mounts

Mount only necessary directories to reduce overhead
Use read-only mounts when modifications aren't needed
Be mindful of storage quotas and clean up old containers
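As a sketch of the read-only advice, the helper below joins several mount specs into one `--container-mounts` value. It assumes the `--container-*` options come from the Pyxis plugin, whose mount syntax accepts an optional third field of flags such as `ro`; confirm with `srun --help` on your cluster. `join_mounts` and the paths are hypothetical.

```shell
# Hypothetical helper: join mount specs into one --container-mounts value.
# Each argument is SRC:DST or SRC:DST:FLAGS (for example, ":ro" for read-only).
join_mounts() {
  result=""
  for spec in "$@"; do
    # Separate entries with commas, without a leading comma.
    result="${result:+$result,}$spec"
  done
  printf '%s\n' "$result"
}

# Placeholder paths: datasets mounted read-only, results mounted read-write.
join_mounts /mnt/shared/datasets:/data:ro /mnt/shared/results:/results
```

The printed string can then be passed as `--container-mounts="$(join_mounts ...)"`.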

Reproducibility

Save container build steps in a script
Tag containers with versions or dates
Keep a changelog of modifications
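A changelog can be as simple as one dated line per modification. The following is a minimal sketch; `record_change` and the changelog location are hypothetical, so adapt them to your own conventions.

```shell
# Hypothetical helper: append "<date> <image>: <message>" to a changelog file.
record_change() {
  changelog=$1
  image=$2
  message=$3
  printf '%s %s: %s\n' "$(date +%Y-%m-%d)" "$image" "$message" >> "$changelog"
}

# Example call (the changelog path is a placeholder):
# record_change /mnt/shared/containers/CHANGELOG.txt \
#   ubuntu22.04_custom.sqsh "installed ffmpeg vim git wget curl"
```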

Security

Don't store sensitive data (passwords, API keys) inside containers
Use environment variables or mounted config files for credentials
Regularly update base images for security patches
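If credentials are supplied through environment variables, a job script can fail early when one is missing instead of failing partway through a run. This is a sketch under that assumption; `require_env` and the variable name in the example are hypothetical.

```shell
# Hypothetical helper: fail early if a required credential variable is unset
# or empty. The value itself is never printed.
require_env() {
  name=$1
  # Indirect lookup without bash-only syntax; only pass trusted variable names.
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "required environment variable is not set: $name" >&2
    return 1
  fi
}

# Example call near the top of a job script:
# require_env MY_API_KEY || exit 1
```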

Troubleshooting

Container Import Fails

Issue: enroot import fails with network errors

Solution:

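# Check network connectivity (send three probes, then stop)
ping -c 3 registry-1.docker.io

# Try with explicit proxy settings (replace the placeholder address)
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
# Some tools, including curl, read only the lowercase form of http_proxy
export http_proxy=$HTTP_PROXY
export https_proxy=$HTTPS_PROXY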

Permission Denied Errors

Issue: Cannot write to directories inside the container

Solution: Always use --container-remap-root so that the root user inside the container maps to your own UID

Container Not Saved

Issue: Changes are lost after exiting the container

Solution: Ensure you're using both --container-writable and --container-save

GPU Not Detected

Issue: nvidia-smi not working inside container

Solution:

Use a CUDA-enabled base image (from NGC)
Check that the host has NVIDIA drivers installed
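A quick way to narrow this down is to check whether `nvidia-smi` is on the PATH at all before asking it to list devices. The sketch below assumes it is run inside the container on a compute node; `gpu_check` is a hypothetical name, and `nvidia-smi -L` lists the GPUs the driver can see.

```shell
# Hypothetical helper: report whether nvidia-smi is available, then list GPUs.
gpu_check() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found in PATH" >&2
    return 1
  fi
  # Prints one line per visible GPU.
  nvidia-smi -L
}
```

If the tool is missing, the image or the driver setup is the likely cause; if it is present but lists nothing, check that the job actually requested GPUs.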

Mount Points Not Working

Issue: Cannot access mounted directories

Solution:

# Verify mount syntax (source:destination)
--container-mounts=/host/path:/container/path

# Check that the source directory exists on the host
ls /host/path

# Ensure you have read/write permissions on the source
ls -ld /host/path
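These checks can be combined into one function that validates a mount spec before it is passed to srun. This is a minimal sketch; `check_mount` is a hypothetical name, and it only inspects the host where it runs, so a path that exists on a login node may still be missing on a compute node.

```shell
# Hypothetical helper: check a SRC:DST mount spec before passing it to srun.
check_mount() {
  spec=$1
  src=${spec%%:*}   # text before the first colon
  rest=${spec#*:}   # text after the first colon
  if [ "$rest" = "$spec" ] || [ -z "$src" ] || [ -z "$rest" ]; then
    echo "expected SRC:DST, got: $spec" >&2
    return 1
  fi
  if [ ! -e "$src" ]; then
    echo "source does not exist on this host: $src" >&2
    return 1
  fi
  if [ ! -r "$src" ]; then
    echo "source is not readable: $src" >&2
    return 1
  fi
  echo "ok: $spec"
}
```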

Out of Disk Space

Issue: Container save fails due to insufficient space

Solution:

# Check available space
df -h /mnt/shared/containers

# Clean up old containers
rm /mnt/shared/containers/old_container.sqsh
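To catch the problem before a long session ends in a failed save, free space can be checked up front. The sketch below assumes a POSIX `df`; `has_free_space` is a hypothetical name, and the threshold you pass is your own estimate of the image size.

```shell
# Hypothetical helper: succeed if DIR has at least NEED_KB kilobytes free.
has_free_space() {
  dir=$1
  need_kb=$2
  # -P forces one output line per filesystem; -k reports 1K blocks.
  # Column 4 of the second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  if [ "$avail_kb" -lt "$need_kb" ]; then
    echo "only ${avail_kb} KB free in $dir, need ${need_kb} KB" >&2
    return 1
  fi
}

# Example: require roughly 20 GiB (an arbitrary figure) before starting.
# has_free_space /mnt/shared/containers 20971520 || exit 1
```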

Additional Resources

Questions or Issues? Contact your HPC support team or file an issue in your organization's documentation repository.
