bgiddwani-ai/slurm_image_dev

Container Development Guide for HPC Clusters

A comprehensive guide for building, customizing, and deploying containers on HPC clusters using Enroot and Slurm.


Overview

This workflow enables you to:

  1. Import container images from Docker Hub or NVIDIA NGC
  2. Customize containers by installing dependencies
  3. Save modified containers for reuse
  4. Scale workloads across multiple nodes and GPUs

Prerequisites:

  • Access to an HPC or large AI cluster with Slurm, Enroot, and the Pyxis plugin installed (Pyxis provides the srun --container-* options used below)
  • Shared storage mounted at /mnt/shared (or equivalent)
  • Appropriate resource allocations

Step 1: Pull Container Images

Import container images using enroot import and save them as .sqsh files in shared storage.

From NVIDIA NGC (GPU-enabled containers)

enroot import -o /mnt/shared/containers/cuda13.0.2_ubuntu22.04.sqsh \
  docker://nvcr.io#nvidia/cuda:13.0.2-cudnn-devel-ubuntu22.04

From Docker Hub

enroot import -o /mnt/shared/containers/ubuntu22.04.sqsh \
  docker://ubuntu:22.04

Naming Convention: Use descriptive names like <base>_<version>_<description>.sqsh
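To keep names consistent, the convention can be wrapped in a small helper. This is a minimal sketch; the function name sqsh_name is hypothetical and not part of Enroot or Slurm.

```shell
# Hypothetical helper: build an image filename as <base>_<version>_<description>.sqsh.
# The body runs in a subshell, so its variables do not leak into the caller.
sqsh_name() (
  name="${1}_${2}"
  # The description is optional; append it only when one is given.
  if [ -n "${3:-}" ]; then
    name="${name}_${3}"
  fi
  printf '%s.sqsh\n' "$name"
)

sqsh_name pytorch 2.0 cuda11.8   # prints pytorch_2.0_cuda11.8.sqsh
```

The result can be passed straight to enroot import, for example -o "/mnt/shared/containers/$(sqsh_name pytorch 2.0 cuda11.8)".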


Step 2: Customize Containers

Launch an interactive container session to install dependencies and customize your environment.

Basic Interactive Session

srun --nodes=1 \
  --ntasks=1 \
  --gpus=1 \
  --container-writable \
  --container-remap-root \
  --container-mounts=/mnt/shared:/mnt/shared \
  --container-image=/mnt/shared/containers/ubuntu22.04.sqsh \
  --container-save=/mnt/shared/containers/ubuntu22.04_custom.sqsh \
  --pty bash

Parameter Breakdown

Parameter                 Description
--nodes=1                 Number of nodes to allocate
--ntasks=1                Number of tasks (processes) to run
--gpus=1                  Number of GPUs to allocate (optional)
--container-writable      Allow modifications to the container
--container-remap-root    Map container root to your user (avoids permission issues)
--container-mounts        Mount host directories into the container
--container-image         Path to the base container image
--container-save          Path where the modified container will be saved
--pty bash                Launch an interactive bash shell

Install Dependencies

Once inside the container, install your required packages:

# Update package lists
apt update

# Install system packages
apt install -y ffmpeg vim git wget curl

Save and Exit

Press CTRL+D or type exit to close the session. Your modifications will be automatically saved to the specified --container-save path.

Verify the Saved Container

ls -lh /mnt/shared/containers/

You should see your new .sqsh file with a timestamp indicating when it was saved.
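A listing only shows that a file exists. As a slightly stronger check, the sketch below tests that the file is non-empty and begins with the SquashFS magic bytes ("hsqs"). The helper name is_squashfs is hypothetical, and passing this check does not prove the image is complete.

```shell
# Hypothetical helper: succeed if the file is non-empty and starts with
# the SquashFS magic bytes "hsqs".
is_squashfs() (
  [ -s "$1" ] || exit 1            # must exist and be non-empty
  magic="$(head -c 4 "$1")"        # first four bytes of the file
  [ "$magic" = "hsqs" ]
)

# Path taken from the interactive example above.
image=/mnt/shared/containers/ubuntu22.04_custom.sqsh
if is_squashfs "$image"; then
  echo "OK: $image looks like a SquashFS image"
else
  echo "WARNING: $image is missing, empty, or not a SquashFS image"
fi
```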


Step 3: Deploy at Scale

Run your customized container across multiple nodes and GPUs.

Single Node, Single GPU

srun --nodes=1 \
  --gpus=1 \
  --container-mounts=/mnt/shared:/mnt/shared \
  --container-image=/mnt/shared/containers/ubuntu22.04_custom.sqsh \
  bash -c 'echo "Running on: $(hostname)" && your-command'

Multi-Node, Multi-GPU

srun --nodes=8 \
  --gpus-per-node=8 \
  --ntasks-per-node=8 \
  --container-mounts=/mnt/shared:/mnt/shared \
  --container-image=/mnt/shared/containers/cuda13.0.2_ubuntu22.04_custom.sqsh \
  bash -c 'echo "Node: $(hostname), GPU: $CUDA_VISIBLE_DEVICES" && nvidia-smi'
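With 8 nodes and 8 tasks per node, srun starts 64 tasks, and each task can identify itself through the environment variables Slurm sets (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID). The sketch below prints them; the function name task_info is hypothetical, and the defaults only exist so the snippet also runs outside a Slurm job.

```shell
# Sketch: describe where the current task sits in the job.
# Outside Slurm these variables are unset, so single-task defaults are used.
task_info() {
  printf 'rank %s of %s (local rank %s) on %s\n' \
    "${SLURM_PROCID:-0}" "${SLURM_NTASKS:-1}" "${SLURM_LOCALID:-0}" "$(hostname)"
}

task_info
```

Saved as a script on shared storage, this could be run in place of nvidia-smi in the command above to confirm the task layout before starting a real workload.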

Running Python Scripts

srun --nodes=4 \
  --gpus-per-node=8 \
  --ntasks-per-node=8 \
  --container-mounts=/mnt/shared:/mnt/shared,/home/$USER:/home/$USER \
  --container-image=/mnt/shared/containers/pytorch_custom.sqsh \
  python /mnt/shared/scripts/train_model.py

Using Batch Scripts

Create a Slurm batch script (job.sh):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH --time=24:00:00
#SBATCH --job-name=my_container_job

srun --container-mounts=/mnt/shared:/mnt/shared \
  --container-image=/mnt/shared/containers/my_custom_container.sqsh \
  python /mnt/shared/my_script.py

Submit the job:

sbatch job.sh
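On success, sbatch prints a confirmation line of the form "Submitted batch job <id>". Capturing the ID makes it easier to follow the job afterwards. The helper below is a hypothetical sketch, and the sample ID is made up.

```shell
# Hypothetical helper: extract the numeric job ID from sbatch's confirmation line.
job_id_from() {
  printf '%s\n' "$1" | sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Sample line in the format sbatch prints on success (the ID is made up).
job_id="$(job_id_from 'Submitted batch job 12345')"
echo "$job_id"   # prints 12345

# On the cluster (not run here):
#   job_id="$(job_id_from "$(sbatch job.sh)")"
#   squeue -j "$job_id"                               # current queue state
#   sacct -j "$job_id" --format=JobID,State,Elapsed   # accounting record
```

sbatch also has a --parsable option that prints the job ID without the surrounding text, which avoids the parsing step.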

Best Practices

Container Organization

  • Store all containers in a centralized location (e.g., /mnt/shared/containers/)
  • Use version tags in container names (e.g., pytorch_2.0_cuda11.8.sqsh)
  • Document installed packages in a separate README or requirements file

Resource Management

  • Request only the resources you need (--gpus, --nodes, --time)
  • Use --ntasks-per-node to match the number of GPUs for distributed training
  • Monitor resource usage with squeue, sinfo, and sacct

Storage Mounts

  • Mount only necessary directories to reduce overhead
  • Use read-only mounts when modifications aren't needed
  • Be mindful of storage quotas and clean up old containers
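As a sketch of the mount advice above, the helper below joins individual mount entries into the comma-separated value that --container-mounts expects. It assumes the optional third field (SRC:DST:FLAGS) accepts ro for read-only mounts, as described in the Pyxis documentation; check srun --help on your cluster to confirm. The function name and the dataset and results paths are hypothetical.

```shell
# Hypothetical helper: join SRC:DST[:FLAGS] entries with commas.
# The body runs in a subshell so the IFS change stays local.
join_mounts() (
  IFS=,
  printf '%s\n' "$*"
)

# Example paths are assumptions; the dataset is mounted read-only.
mounts="$(join_mounts \
  /mnt/shared/datasets:/data:ro \
  /mnt/shared/results:/results)"

echo "--container-mounts=$mounts"
# prints --container-mounts=/mnt/shared/datasets:/data:ro,/mnt/shared/results:/results
```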

Reproducibility

  • Save container build steps in a script
  • Tag containers with versions or dates
  • Keep a changelog of modifications
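One way to keep build steps in a script is to run the Step 2 customization non-interactively. The sketch below reuses the flags and paths from the interactive example; the variable names, the dated output filename, and the run and build_image helpers are assumptions for illustration. By default it only prints the command; set RUN=1 to execute it on the cluster.

```shell
# Sketch of a scripted (non-interactive) image build.
# Paths and package list mirror the interactive example; adjust as needed.
BASE_IMAGE="${BASE_IMAGE:-/mnt/shared/containers/ubuntu22.04.sqsh}"
OUT_IMAGE="${OUT_IMAGE:-/mnt/shared/containers/ubuntu22.04_custom_$(date +%Y%m%d).sqsh}"
PACKAGES="${PACKAGES:-ffmpeg vim git wget curl}"

# Print the command unless RUN=1, in which case execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

build_image() {
  run srun --nodes=1 --ntasks=1 \
    --container-writable --container-remap-root \
    --container-mounts=/mnt/shared:/mnt/shared \
    --container-image="$BASE_IMAGE" \
    --container-save="$OUT_IMAGE" \
    bash -c "export DEBIAN_FRONTEND=noninteractive; apt update && apt install -y $PACKAGES"
}

build_image
```

Keeping this file under version control alongside the image name doubles as the changelog: each change to the package list shows up in the script's history.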

Security

  • Don't store sensitive data (passwords, API keys) inside containers
  • Use environment variables or mounted config files for credentials
  • Regularly update base images for security patches
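As an illustration of the mounted-config-file approach, the sketch below reads a secret from a file and refuses to use it if group or other users have any access. The helper name read_secret and the example path are hypothetical, and the permission check relies on GNU stat, which is assumed to be available on the cluster nodes.

```shell
# Hypothetical helper: print a secret stored in a file, refusing files
# that group or others can access. Relies on GNU stat (stat -c '%a').
read_secret() (
  [ -r "$1" ] || { echo "cannot read $1" >&2; exit 1; }
  mode="$(stat -c '%a' "$1")"
  case "$mode" in
    *00) cat "$1" ;;   # owner-only permissions, e.g. 600 or 400
    *)   echo "refusing $1: mode $mode is too open" >&2; exit 1 ;;
  esac
)

# Usage inside a job (path is an assumption; the home directory must be mounted):
#   API_KEY="$(read_secret "$HOME/.config/myapp/api_key")" || exit 1
#   export API_KEY
```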

Troubleshooting

Container Import Fails

Issue: enroot import fails with network errors

Solution:

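# Check network connectivity (ping may be blocked even when HTTPS works)
ping -c 3 registry-1.docker.io
curl -sSI https://registry-1.docker.io/v2/

# Try with explicit proxy settings
# (curl reads http_proxy only in lowercase, so set both spellings)
export http_proxy=http://proxy.example.com:8080
export https_proxy=http://proxy.example.com:8080
export HTTP_PROXY="$http_proxy"
export HTTPS_PROXY="$https_proxy"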

Permission Denied Errors

Issue: Cannot write to directories inside the container

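Solution: Use --container-remap-root so that you appear as root inside the container (mapped to your own UID on the host). To modify the container filesystem itself, also pass --container-writable.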

Container Not Saved

Issue: Changes are lost after exiting the container

Solution: Ensure you're using both --container-writable and --container-save

GPU Not Detected

Issue: nvidia-smi not working inside container

Solution:

  • Use a CUDA-enabled base image (from NGC)
  • Check that the host has NVIDIA drivers installed
  • Confirm that the job step requested GPUs (for example --gpus=1 or --gpus-per-node=8)

Mount Points Not Working

Issue: Cannot access mounted directories

Solution:

# Verify mount syntax (source:destination)
--container-mounts=/host/path:/container/path

# Check that source directory exists on host
ls /host/path

# Ensure you have read/write permissions on the source directory
ls -ld /host/path

Out of Disk Space

Issue: Container save fails due to insufficient space

Solution:

# Check available space
df -h /mnt/shared/containers

# Clean up old containers
rm /mnt/shared/containers/old_container.sqsh
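Before deleting anything, it can help to list which images have not been modified for a while. This is a minimal sketch; the helper name old_images and the 90-day threshold are arbitrary assumptions.

```shell
# Hypothetical helper: list .sqsh files directly inside a directory whose
# last modification is more than N days old.
old_images() {
  find "$1" -maxdepth 1 -type f -name '*.sqsh' -mtime +"$2" -print
}

# Review the candidates first; nothing is deleted here.
if [ -d /mnt/shared/containers ]; then
  old_images /mnt/shared/containers 90
fi
```

Check with your team that nobody still depends on a listed image before passing it to rm.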

Questions or Issues? Contact your HPC support team or file an issue in your organization's documentation repository.

About

Container environment development inside Slurm setup via BCM
