GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
Updated May 6, 2026 - Python
GPU telemetry with workload attribution. One OTLP agent per node ties hardware metrics (NVIDIA, AMD, Intel Gaudi) to the K8s pod or Slurm job burning the GPU — so you know who's paying for that idle H100.
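The attribution join described above can be sketched as follows. This is a minimal illustration with hypothetical data shapes, not this project's actual API: a real agent would read GPU UUIDs and utilization from DCGM/NVML and the GPU-to-pod mapping from the kubelet PodResources socket or the Slurm job environment.

```python
# Hypothetical sketch: join per-GPU utilization with the workload that
# was allocated each GPU, so idle-but-allocated devices are attributable.

def attribute(gpu_metrics, gpu_owner_map):
    """Pair each GPU's utilization with its owning pod/job (or mark it free)."""
    rows = []
    for uuid, util_pct in gpu_metrics.items():
        owner = gpu_owner_map.get(uuid, "<unallocated>")
        rows.append({"gpu": uuid, "util_pct": util_pct, "owner": owner})
    return rows

# Example inputs: utilization as DCGM would report it, ownership as the
# kubelet/Slurm would report it (names here are invented for illustration).
gpu_metrics = {"GPU-aaa": 92.0, "GPU-bbb": 0.0}
gpu_owner_map = {"GPU-aaa": "ml-team/trainer-0"}

for row in attribute(gpu_metrics, gpu_owner_map):
    print(row)  # GPU-bbb shows 0.0% utilization with no owner: the idle H100
```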
Simulate NVIDIA GPUs for testing: 7 behavior profiles, scales to 1000+ GPUs, with a Docker-ready Prometheus exporter that mimics DCGM.
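A simulator like this boils down to rendering fake per-GPU samples in the Prometheus exposition format under one of several behavior profiles. The sketch below is an assumption-laden stand-in (the profile names and value ranges are invented); only the metric name mirrors the real DCGM Exporter's DCGM_FI_DEV_GPU_UTIL gauge.

```python
import random

# Hypothetical sketch of one scrape's worth of simulated GPU metrics in
# Prometheus text exposition format. Profiles and ranges are illustrative.

def render_metrics(num_gpus, profile="idle", seed=0):
    """Render simulated DCGM-style utilization for num_gpus devices."""
    ranges = {"idle": (0, 5), "training": (85, 100), "bursty": (0, 100)}
    lo, hi = ranges[profile]
    rng = random.Random(seed)  # seeded for reproducible test scrapes
    lines = ["# TYPE DCGM_FI_DEV_GPU_UTIL gauge"]
    for i in range(num_gpus):
        util = rng.uniform(lo, hi)
        lines.append(f'DCGM_FI_DEV_GPU_UTIL{{gpu="{i}"}} {util:.1f}')
    return "\n".join(lines)

print(render_metrics(4, profile="training"))
```

Serving this string from an HTTP endpoint (or via the prometheus_client library) is all Prometheus needs to scrape it like a real exporter.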
GPU-native agent-swarm orchestration for the NVIDIA AI stack — NeMo, NIM, Triton, DCGM, NGC, NIXL, OpenShell. Spawn GPU-pinned agent teams across DGX/HGX nodes with NVLink-aware scheduling, task DAGs, adaptive scheduling, and full observability.
NVIDIA DCGM Exporter container only.
Production-grade health monitoring and predictive fault management system for NVIDIA A100/H100 GPU fleets
kubectl plugin that compares requested GPU resources against DCGM Exporter utilization metrics and generates rightsizing recommendations with projected monthly cost savings. Supports nvidia.com/gpu and amd.com/gpu — the gap VPA leaves open.
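The rightsizing arithmetic such a plugin performs can be sketched in a few lines. Everything below is an assumption for illustration (the 20% headroom factor, the per-GPU hourly price, and the ~730 hours/month constant are invented defaults, not this plugin's actual parameters).

```python
import math

# Hypothetical rightsizing math: compare GPUs requested by a workload with
# its observed average DCGM utilization, and project the monthly saving at
# an assumed per-GPU price.

def rightsize(requested_gpus, avg_util_pct, gpu_hourly_usd=2.50):
    """Recommend a GPU count sized to observed utilization, with headroom."""
    # Effective demand in whole GPUs, plus 20% headroom, never below 1.
    needed = max(1, math.ceil(requested_gpus * (avg_util_pct / 100) * 1.2))
    needed = min(needed, requested_gpus)  # never recommend scaling up
    saved_usd = (requested_gpus - needed) * gpu_hourly_usd * 730  # hrs/month
    return needed, saved_usd

# A pod requesting 8 GPUs but averaging 20% utilization:
rec, usd = rightsize(requested_gpus=8, avg_util_pct=20.0)
print(rec, usd)  # recommends 2 GPUs, projecting $10950.0/month saved
```

A real implementation would pull `avg_util_pct` from a PromQL range query over DCGM Exporter metrics rather than a point sample.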
Complete security toolkit for enterprise NVIDIA GPU infrastructure. Includes NIST 800-53 controls, Zero Trust architecture, threat models, incident response playbooks, forensic scripts, and monitoring configurations for H100/A100/L40S and other datacenter GPUs.