GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
Updated May 6, 2026 - Python
GPU telemetry with workload attribution. One OTLP agent per node ties hardware metrics (NVIDIA, AMD, Intel Gaudi) to the K8s pod or Slurm job burning the GPU — so you know who's paying for that idle H100.
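The attribution join described above can be sketched as follows. This is a minimal illustration with hypothetical data shapes, not this project's actual API: a real agent would read GPU UUIDs and utilization from DCGM/NVML and the GPU-to-pod mapping from the kubelet PodResources socket or the Slurm job environment.

```python
# Hypothetical sketch: join per-GPU utilization with the workload that
# was allocated each GPU, so idle-but-allocated devices are attributable.

def attribute(gpu_metrics, gpu_owner_map):
    """Pair each GPU's utilization with its owning pod/job (or mark it free)."""
    rows = []
    for uuid, util_pct in gpu_metrics.items():
        owner = gpu_owner_map.get(uuid, "<unallocated>")
        rows.append({"gpu": uuid, "util_pct": util_pct, "owner": owner})
    return rows

# Example inputs: utilization as DCGM would report it, ownership as the
# kubelet/Slurm would report it (names here are invented for illustration).
gpu_metrics = {"GPU-aaa": 92.0, "GPU-bbb": 0.0}
gpu_owner_map = {"GPU-aaa": "ml-team/trainer-0"}

for row in attribute(gpu_metrics, gpu_owner_map):
    print(row)  # GPU-bbb shows 0.0% utilization with no owner: the idle H100
```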
Simulate NVIDIA GPUs for testing: 7 behavior profiles, scales to 1000+ GPUs, with a Docker-ready Prometheus exporter that mimics DCGM.
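A simulator like this boils down to rendering fake per-GPU samples in the Prometheus exposition format under one of several behavior profiles. The sketch below is an assumption-laden stand-in (the profile names and value ranges are invented); only the metric name mirrors the real DCGM Exporter's DCGM_FI_DEV_GPU_UTIL gauge.

```python
import random

# Hypothetical sketch of one scrape's worth of simulated GPU metrics in
# Prometheus text exposition format. Profiles and ranges are illustrative.

def render_metrics(num_gpus, profile="idle", seed=0):
    """Render simulated DCGM-style utilization for num_gpus devices."""
    ranges = {"idle": (0, 5), "training": (85, 100), "bursty": (0, 100)}
    lo, hi = ranges[profile]
    rng = random.Random(seed)  # seeded for reproducible test scrapes
    lines = ["# TYPE DCGM_FI_DEV_GPU_UTIL gauge"]
    for i in range(num_gpus):
        util = rng.uniform(lo, hi)
        lines.append(f'DCGM_FI_DEV_GPU_UTIL{{gpu="{i}"}} {util:.1f}')
    return "\n".join(lines)

print(render_metrics(4, profile="training"))
```

Serving this string from an HTTP endpoint (or via the prometheus_client library) is all Prometheus needs to scrape it like a real exporter.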
GPU-native agent-swarm orchestration for the NVIDIA AI stack — NeMo, NIM, Triton, DCGM, NGC, NIXL, OpenShell. Spawn GPU-pinned agent teams across DGX/HGX nodes with NVLink-aware scheduling, task DAGs, adaptive scheduling, and full observability.
NVIDIA DCGM Exporter container only.
Production-grade health monitoring and predictive fault management system for NVIDIA A100/H100 GPU fleets
kubectl plugin that compares requested GPU resources against DCGM Exporter utilization metrics and generates rightsizing recommendations with projected monthly cost savings. Supports nvidia.com/gpu and amd.com/gpu — the gap VPA leaves open.
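The rightsizing arithmetic such a plugin performs can be sketched in a few lines. Everything below is an assumption for illustration (the 20% headroom factor, the per-GPU hourly price, and the ~730 hours/month constant are invented defaults, not this plugin's actual parameters).

```python
import math

# Hypothetical rightsizing math: compare GPUs requested by a workload with
# its observed average DCGM utilization, and project the monthly saving at
# an assumed per-GPU price.

def rightsize(requested_gpus, avg_util_pct, gpu_hourly_usd=2.50):
    """Recommend a GPU count sized to observed utilization, with headroom."""
    # Effective demand in whole GPUs, plus 20% headroom, never below 1.
    needed = max(1, math.ceil(requested_gpus * (avg_util_pct / 100) * 1.2))
    needed = min(needed, requested_gpus)  # never recommend scaling up
    saved_usd = (requested_gpus - needed) * gpu_hourly_usd * 730  # hrs/month
    return needed, saved_usd

# A pod requesting 8 GPUs but averaging 20% utilization:
rec, usd = rightsize(requested_gpus=8, avg_util_pct=20.0)
print(rec, usd)  # recommends 2 GPUs, projecting $10950.0/month saved
```

A real implementation would pull `avg_util_pct` from a PromQL range query over DCGM Exporter metrics rather than a point sample.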
Complete security toolkit for enterprise NVIDIA GPU infrastructure. Includes NIST 800-53 controls, Zero Trust architecture, threat models, incident response playbooks, forensic scripts, and monitoring configurations for H100/A100/L40S and other datacenter GPUs.