Summary
Implement a wrapper layer to integrate all-smi with AAMI for unified AI accelerator monitoring. all-smi provides Prometheus-compatible metrics for various AI accelerators (NVIDIA, AMD, Intel Gaudi, Google TPU, Tenstorrent, Rebellions, Furiosa).
Background
- all-smi: Open-source tool that exposes Prometheus metrics for multiple AI accelerator vendors
- Approach: Thin wrapper instead of full internalization to minimize maintenance overhead
- Benefit: Single exporter for 7+ accelerator types with native Prometheus integration
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                          Config Server                          │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │ Target Registry │  │ Exporter Config │  │  SD Generator   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘  │
└───────────────────────────────────┬─────────────────────────────┘
                                    │ HTTP API
                                    ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Target Node                           │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐  │
│  │     all-smi     │  │   aami-agent    │  │  Node Exporter  │  │
│  │   (port 9401)   │  │    (wrapper)    │  │   (port 9100)   │  │
│  └────────┬────────┘  └────────┬────────┘  └─────────────────┘  │
│           │                    │                                │
│           │   manage/monitor   │                                │
│           ◄────────────────────┘                                │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼ scrape
┌─────────────────────────────────────────────────────────────────┐
│                           Prometheus                            │
└─────────────────────────────────────────────────────────────────┘
Implementation Phases
Phase 1: Foundation (1 week)
Phase 2: Installation Automation (3-4 days)
Phase 3: Health Check & Management (3-4 days)
Phase 4: Dashboard & Alerts (3-4 days)
New Directories to Create
The following directories need to be created as part of this implementation:
| Directory | Purpose |
|---|---|
| dashboards/ | Grafana dashboard JSON files |
| scripts/systemd/ | systemd service unit files |
| deploy/ansible/roles/all-smi/ | Ansible role for all-smi deployment |
| deploy/kubernetes/all-smi/ | Kubernetes manifests for all-smi |
File Structure
services/
├── config-server/
│   ├── internal/
│   │   ├── domain/
│   │   │   └── exporter.go              # Add all_smi to ExporterType
│   │   └── service/
│   │       └── service_discovery.go     # Generate all-smi SD targets
│   └── configs/
│       └── defaults/
│           └── alert-templates.yaml     # Add accelerator alert templates
scripts/
├── node/
│   ├── install-all-smi.sh               # Installation script
│   ├── all-smi-health-check.sh          # Health check
│   └── bootstrap.sh                     # Add all-smi option
└── systemd/                             # NEW DIRECTORY
    └── all-smi.service                  # systemd service
deploy/
├── ansible/
│   └── roles/
│       └── all-smi/                     # Ansible role (NEW)
└── kubernetes/
    └── all-smi/
        └── daemonset.yaml               # K8s DaemonSet (NEW)
dashboards/                              # NEW DIRECTORY
└── all-smi-overview.json                # Grafana dashboard
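The change to service_discovery.go is only named above, not described. As a rough sketch of what generating all-smi SD targets could look like, the following builds target groups in the JSON shape used by Prometheus file-based service discovery (a list of objects with `targets` and `labels`). The `Target` struct, the `BuildAllSMIGroups` function, and the `exporter` and `hostname` label names are assumptions made for illustration; the actual config-server types may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strconv"
)

// Target is a hypothetical, simplified view of a node in the Target Registry.
type Target struct {
	Hostname string
	IP       string
}

// SDGroup is one target group in Prometheus file-based service discovery.
type SDGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// BuildAllSMIGroups (hypothetical) emits one group per node, pointing at the
// all-smi port on that node.
func BuildAllSMIGroups(targets []Target, port int) []SDGroup {
	groups := make([]SDGroup, 0, len(targets))
	for _, t := range targets {
		groups = append(groups, SDGroup{
			Targets: []string{net.JoinHostPort(t.IP, strconv.Itoa(port))},
			Labels: map[string]string{
				"exporter": "all_smi", // label names are assumptions
				"hostname": t.Hostname,
			},
		})
	}
	return groups
}

func main() {
	nodes := []Target{{Hostname: "gpu-node-01", IP: "10.0.0.11"}}
	out, err := json.MarshalIndent(BuildAllSMIGroups(nodes, 9401), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Emitting one group per node keeps per-node labels simple; grouping several nodes that share labels into one entry would work equally well with Prometheus.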
Code Changes
ExporterType Extension
// internal/domain/exporter.go
type ExporterType string

const (
	ExporterTypeNodeExporter ExporterType = "node_exporter"
	ExporterTypeDCGMExporter ExporterType = "dcgm_exporter"
	ExporterTypeAllSMI       ExporterType = "all_smi" // New
	ExporterTypeCustom       ExporterType = "custom"
)
Default Ports
| Exporter | Default Port |
|---|---|
| node_exporter | 9100 |
| dcgm_exporter | 9400 |
| all_smi | 9401 |
| custom | 9090 |
Note: all-smi uses port 9401 to avoid conflict with dcgm_exporter (9400)
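One way the port table could be expressed next to the ExporterType constants is a simple lookup map. This is a minimal sketch, not the existing config-server code: the `defaultPorts` map and the `DefaultPort` helper are hypothetical names, and only the port values come from the table above.

```go
package main

import "fmt"

// ExporterType mirrors the type declared in internal/domain/exporter.go.
type ExporterType string

const (
	ExporterTypeNodeExporter ExporterType = "node_exporter"
	ExporterTypeDCGMExporter ExporterType = "dcgm_exporter"
	ExporterTypeAllSMI       ExporterType = "all_smi"
	ExporterTypeCustom       ExporterType = "custom"
)

// defaultPorts (hypothetical) holds the values from the Default Ports table.
var defaultPorts = map[ExporterType]int{
	ExporterTypeNodeExporter: 9100,
	ExporterTypeDCGMExporter: 9400,
	ExporterTypeAllSMI:       9401,
	ExporterTypeCustom:       9090,
}

// DefaultPort (hypothetical helper) returns the default port for a known
// exporter type; the boolean is false for an unknown type.
func DefaultPort(t ExporterType) (int, bool) {
	port, ok := defaultPorts[t]
	return port, ok
}

func main() {
	for _, t := range []ExporterType{ExporterTypeDCGMExporter, ExporterTypeAllSMI} {
		port, _ := DefaultPort(t)
		fmt.Printf("%s -> %d\n", t, port)
	}
}
```

Returning a boolean alongside the port lets callers reject unknown exporter types instead of silently scraping port 0.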
all-smi vs DCGM Exporter
Relationship
- DCGM Exporter: NVIDIA-specific, uses dcgm_* metric prefix (e.g., dcgm_gpu_temp, dcgm_fb_used)
- all-smi: Multi-vendor support, uses allsmi_* metric prefix
Recommendation
| Scenario | Recommended Exporter |
|---|---|
| NVIDIA GPU only, need deep DCGM metrics | dcgm_exporter |
| Multi-vendor accelerators | all_smi |
| Mixed environment | Both (different ports) |
Alert Template Migration
Current alert templates use DCGM metrics. New all-smi templates will use:
# Example: all-smi GPU temperature alert
- id: allsmi_high_gpu_temperature
  name: High GPU Temperature (all-smi)
  query_template: |
    allsmi_gpu_temperature > {{ .threshold }}
Note: Verify actual metric names from all-smi documentation before implementation.
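The `{{ .threshold }}` placeholder follows Go template syntax, so a query_template could plausibly be rendered with the standard text/template package. The sketch below assumes that is how the alert template system works; `RenderQuery` is a hypothetical helper, and `allsmi_gpu_temperature` is the unverified example metric name from the template above.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"text/template"
)

// RenderQuery (hypothetical helper) fills a query_template with parameters.
// missingkey=error makes rendering fail when a parameter is absent, instead
// of writing "<no value>" into the PromQL expression.
func RenderQuery(queryTemplate string, params map[string]any) (string, error) {
	tmpl, err := template.New("query").Option("missingkey=error").Parse(queryTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, params); err != nil {
		return "", err
	}
	// A YAML block scalar ("|") leaves a trailing newline; trim it.
	return strings.TrimSpace(buf.String()), nil
}

func main() {
	query, err := RenderQuery(
		"allsmi_gpu_temperature > {{ .threshold }}\n",
		map[string]any{"threshold": 85},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(query)
}
```

Failing on a missing parameter surfaces a misconfigured alert when the rule is generated rather than when Prometheus rejects it.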
Supported Accelerators (via all-smi)
- NVIDIA GPUs (CUDA)
- AMD GPUs (ROCm)
- NVIDIA Jetson
- Apple Silicon GPUs
- Intel Gaudi NPUs
- Google Cloud TPUs
- Tenstorrent NPUs
- Rebellions NPUs
- Furiosa NPUs
Dependencies
| Item | Status |
|---|---|
| config-server Exporter model | ✅ Exists (domain/exporter.go) |
| Service Discovery | ✅ Implemented (service/service_discovery.go) |
| Bootstrap script | ✅ Exists (scripts/node/bootstrap.sh) |
| Alert Template system | ✅ Implemented |
| dashboards/ directory | ⚠️ To be created |
| scripts/systemd/ directory | ⚠️ To be created |
Estimated Timeline
| Phase | Duration | Deliverables |
|---|---|---|
| Phase 1 | 1 week | config-server extension, API |
| Phase 2 | 3-4 days | Installation automation scripts |
| Phase 3 | 3-4 days | Health check, management |
| Phase 4 | 3-4 days | Dashboard, alert rules |
| Total | ~2.5 weeks | |