A comprehensive guide for building, customizing, and deploying containers on HPC clusters using Enroot and Slurm.
- Overview
- Step 1: Pull Container Images
- Step 2: Customize Containers
- Step 3: Deploy at Scale
- Best Practices
- Troubleshooting
This workflow enables you to:
- Import container images from Docker Hub or NVIDIA NGC
- Customize containers by installing dependencies
- Save modified containers for reuse
- Scale workloads across multiple nodes and GPUs
Prerequisites:
- Access to an HPC/Large AI cluster with Slurm and Enroot installed
- Shared storage mounted at /mnt/shared (or equivalent)
- Appropriate resource allocations
Import container images using enroot import and save them as .sqsh files in shared storage.
enroot import -o /mnt/shared/containers/cuda13.0.2_ubuntu22.04.sqsh \
docker://nvcr.io#nvidia/cuda:13.0.2-cudnn-devel-ubuntu22.04

enroot import -o /mnt/shared/containers/ubuntu22.04.sqsh \
docker://ubuntu:22.04

Naming Convention: Use descriptive names like <base>_<version>_<description>.sqsh
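The naming convention can be applied consistently with a small helper. This is a minimal sketch: the function name `sqsh_name` is hypothetical, and replacing spaces, slashes, and colons with `-` is an assumption made here to keep file names shell-friendly.

```shell
# Hypothetical helper: prints "<base>_<version>_<description>.sqsh".
sqsh_name() {
    # Join the three parts, then replace spaces, slashes, and colons with "-".
    printf '%s_%s_%s.sqsh\n' "$1" "$2" "$3" | sed 's|[ /:]|-|g'
}

sqsh_name pytorch 2.0 cuda11.8   # prints pytorch_2.0_cuda11.8.sqsh
```

The result can be passed directly to the -o option, for example `enroot import -o "/mnt/shared/containers/$(sqsh_name ubuntu 22.04 base)" docker://ubuntu:22.04`.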
Launch an interactive container session to install dependencies and customize your environment.
srun --nodes=1 \
--ntasks=1 \
--gpus=1 \
--container-writable \
--container-remap-root \
--container-mounts=/mnt/shared:/mnt/shared \
--container-image=/mnt/shared/containers/ubuntu22.04.sqsh \
--container-save=/mnt/shared/containers/ubuntu22.04_custom.sqsh \
--pty bash

| Parameter | Description |
|---|---|
| --nodes=1 | Number of nodes to allocate |
| --ntasks=1 | Number of tasks (processes) to run |
| --gpus=1 | Number of GPUs to allocate (optional) |
| --container-writable | Allow modifications to the container |
| --container-remap-root | Map container root to your user (avoids permission issues) |
| --container-mounts | Mount host directories into the container |
| --container-image | Path to the base container image |
| --container-save | Path where the modified container will be saved |
| --pty bash | Launch an interactive bash shell |
Once inside the container, install your required packages:
# Update package lists
apt update
# Install system packages
apt install -y ffmpeg vim git wget curl

Press CTRL+D or type exit to close the session. Your modifications will be automatically saved to the specified --container-save path.

ls -lh /mnt/shared/containers/

You should see your new .sqsh file with a timestamp indicating when it was saved.
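The visual check above can also be scripted, for instance before submitting a large job. The following is a sketch only: `check_sqsh` is a hypothetical helper, and it verifies just that the path ends in .sqsh and points to a non-empty regular file, not that the image itself is valid.

```shell
# Hypothetical helper: succeeds only for a non-empty regular file named *.sqsh.
check_sqsh() {
    case "$1" in
        *.sqsh) ;;
        *) echo "not a .sqsh path: $1" >&2; return 1 ;;
    esac
    # -f: regular file, -s: size greater than zero
    if [ ! -f "$1" ] || [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 1
    fi
    echo "ok: $1"
}

# Example call with the path passed to --container-save above.
check_sqsh /mnt/shared/containers/ubuntu22.04_custom.sqsh || true
```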
Run your customized container across multiple nodes and GPUs.
srun --nodes=1 \
--gpus=1 \
--container-mounts=/mnt/shared:/mnt/shared \
--container-image=/mnt/shared/containers/ubuntu22.04_custom.sqsh \
bash -c 'echo "Running on: $(hostname)" && your-command'

srun --nodes=8 \
--gpus-per-node=8 \
--ntasks-per-node=8 \
--container-mounts=/mnt/shared:/mnt/shared \
--container-image=/mnt/shared/containers/cuda13.0.2_ubuntu22.04_custom.sqsh \
bash -c 'echo "Node: $(hostname), GPU: $CUDA_VISIBLE_DEVICES" && nvidia-smi'

srun --nodes=4 \
--gpus-per-node=8 \
--ntasks-per-node=8 \
--container-mounts=/mnt/shared:/mnt/shared,/home/$USER:/home/$USER \
--container-image=/mnt/shared/containers/pytorch_custom.sqsh \
python /mnt/shared/scripts/train_model.py

Create a Slurm batch script (job.sh):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=24:00:00
#SBATCH --job-name=my_container_job
srun --container-mounts=/mnt/shared:/mnt/shared \
--container-image=/mnt/shared/containers/my_custom_container.sqsh \
python /mnt/shared/my_script.py

Submit the job:

sbatch job.sh

- Store all containers in a centralized location (e.g., /mnt/shared/containers/)
- Use version tags in container names (e.g., pytorch_2.0_cuda11.8.sqsh)
- Document installed packages in a separate README or requirements file
- Request only the resources you need (--gpus, --nodes, --time)
- Use --ntasks-per-node to match the number of GPUs for distributed training
- Monitor resource usage with squeue, sinfo, and sacct
- Mount only necessary directories to reduce overhead
- Use read-only mounts when modifications aren't needed
- Be mindful of storage quotas and clean up old containers
- Save container build steps in a script
- Tag containers with versions or dates
- Keep a changelog of modifications
- Don't store sensitive data (passwords, API keys) inside containers
- Use environment variables or mounted config files for credentials
- Regularly update base images for security patches
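Two of the practices above, saving build steps in a script and tagging containers with dates, can be combined in one place. The sketch below is a non-interactive variant of the Step 2 session. The file name build_container.sh, the `run` and `build_container` helpers, and the DRY_RUN switch are all hypothetical, and it assumes --container-save behaves the same for a one-shot command as it does for the interactive shell.

```shell
#!/bin/bash
# build_container.sh (hypothetical name): scripted version of Step 2.
BASE=/mnt/shared/containers/ubuntu22.04.sqsh
TAG=$(date +%Y%m%d)                       # date tag in YYYYMMDD form
OUT=/mnt/shared/containers/ubuntu22.04_custom_${TAG}.sqsh
PACKAGES="ffmpeg vim git wget curl"       # same packages as the interactive example

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

build_container() {
    run srun --nodes=1 --ntasks=1 \
        --container-writable \
        --container-remap-root \
        --container-mounts=/mnt/shared:/mnt/shared \
        --container-image="$BASE" \
        --container-save="$OUT" \
        bash -c "apt update && apt install -y $PACKAGES"
}

build_container
```

Running the script as is only prints the srun command, which makes it easy to review. Setting DRY_RUN=0 executes it. Keeping the script under version control also serves as the changelog of modifications.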
Issue: enroot import fails with network errors
Solution:
# Check network connectivity
ping registry-1.docker.io
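# Try with explicit proxy settings
export HTTP_PROXY=http://proxy.example.com:8080
export HTTPS_PROXY=http://proxy.example.com:8080
# curl, which enroot uses for downloads, reads only the lowercase http_proxy variable
export http_proxy="$HTTP_PROXY"
export https_proxy="$HTTPS_PROXY"

Issue: Cannot write to directories inside the container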
Solution: Always use --container-remap-root so that the container's root user maps to your UID
Issue: Changes are lost after exiting the container
Solution: Ensure you're using both --container-writable and --container-save
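Both fixes above come down to a missing flag, so they can be caught before a session starts. This is a rough sketch: `lint_container_args` is a hypothetical helper that only inspects the argument list it is given, and the rules it encodes are the two recommendations from this section, not checks performed by Slurm itself.

```shell
# Hypothetical pre-flight check for an srun argument list.
lint_container_args() {
    local args=" $* "
    local status=0
    # --container-save needs --container-writable, or there is nothing to save.
    case "$args" in
        *" --container-save="*)
            case "$args" in
                *" --container-writable "*) ;;
                *) echo "warning: --container-save without --container-writable"; status=1 ;;
            esac
            ;;
    esac
    # A writable container usually also wants --container-remap-root.
    case "$args" in
        *" --container-writable "*)
            case "$args" in
                *" --container-remap-root "*) ;;
                *) echo "warning: --container-writable without --container-remap-root"; status=1 ;;
            esac
            ;;
    esac
    return $status
}

# prints "warning: --container-save without --container-writable"
lint_container_args --container-save=/mnt/shared/containers/ubuntu22.04_custom.sqsh || true
```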
Issue: nvidia-smi not working inside container
Solution:
- Use a CUDA-enabled base image (from NGC)
- Check that the host has NVIDIA drivers installed
Issue: Cannot access mounted directories
Solution:
# Verify mount syntax (source:destination)
--container-mounts=/host/path:/container/path
# Check that source directory exists on host
ls /host/path
# Ensure you have read/write permissions

Issue: Container save fails due to insufficient space
Solution:
# Check available space
df -h /mnt/shared/containers
# Clean up old containers
rm /mnt/shared/containers/old_container.sqsh

Questions or Issues? Contact your HPC support team or file an issue in your organization's documentation repository.