A lightweight, production-grade telemetry agent that collects real-time system metrics, publishes them to AWS CloudWatch, and fires threshold-based alerts via AWS SNS.
Built to mirror real-world NOC and CloudOps monitoring patterns — config-driven, modular, and resilient.
Modern infrastructure teams need visibility into system health at all times. cloud-telemetry-agent provides a deployable Python-based agent that continuously monitors host-level metrics and integrates directly with AWS observability services.
This project demonstrates core CloudOps and SRE competencies:
- Agent-based metric collection
- Cloud-native telemetry publishing
- Threshold-driven alerting pipelines
- Structured logging for downstream ingestion
- Production resilience patterns (retry logic, graceful error handling)
Host Machine
│
├── collector.py # Collects CPU, memory, disk metrics via psutil
├── publisher.py # Publishes telemetry to AWS CloudWatch
├── alerter.py # Evaluates thresholds, fires alerts via AWS SNS
├── logger.py # Structured JSON logging (file + console)
└── utils.py # Retry decorator for resilient AWS calls
│
▼
AWS CloudWatch # Metrics storage and monitoring
AWS SNS # Alert delivery (email, SMS, webhook)
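To make the publisher stage of the diagram more concrete, here is a minimal sketch of how collected readings could be shaped into the `MetricData` list that boto3's CloudWatch `put_metric_data` call accepts. The function names `build_metric_data` and `publish`, the dimension names `Host` and `Environment`, and the `Percent` unit are assumptions made for illustration; they are not taken from publisher.py.

```python
from datetime import datetime, timezone


def build_metric_data(metrics, hostname, environment, timestamp=None):
    """Convert {name: value} readings into CloudWatch MetricData entries."""
    timestamp = timestamp or datetime.now(timezone.utc)
    # Assumed dimension names; the real agent may label them differently.
    dimensions = [
        {"Name": "Host", "Value": hostname},
        {"Name": "Environment", "Value": environment},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": "Percent",
        }
        for name, value in metrics.items()
    ]


def publish(metrics, namespace, region, hostname, environment):
    """Send the readings to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the payload builder works without it

    client = boto3.client("cloudwatch", region_name=region)
    return client.put_metric_data(
        Namespace=namespace,
        MetricData=build_metric_data(metrics, hostname, environment),
    )
```

Keeping the payload builder separate from the network call is one possible design choice: the builder can be checked offline, while `publish` is the only part that touches AWS.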
- Real-time metric collection: CPU, memory, and disk utilization via psutil
- CloudWatch integration: Metrics published with host and environment dimensions
- SNS alerting: Configurable thresholds trigger immediate notifications
- Structured JSON logging: Every event logged with timestamp, level, and module
- Retry logic: Automatic retries with backoff on AWS API failures
- Config-driven: Zero hardcoded values; fully controlled via config.ini
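The retry behaviour listed above can be sketched as a small decorator with exponential backoff. This is an illustrative version, not the code in utils.py: the name `retry`, its parameters (`max_attempts`, `base_delay`, `backoff`, `exceptions`, `sleep`), and the default values are all assumptions.

```python
import functools
import time


def retry(max_attempts=3, base_delay=1.0, backoff=2.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, multiplying the delay by `backoff` after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= backoff
        return wrapper
    return decorator
```

In practice the `exceptions` tuple would probably be narrowed to transient AWS errors (for example `botocore.exceptions.ClientError`) so that programming mistakes are not retried. The injectable `sleep` argument exists only to keep the sketch easy to exercise without real delays.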
| Component | Technology |
|---|---|
| Language | Python 3.x |
| Metric Collection | psutil |
| AWS SDK | boto3 / botocore |
| Alerting | AWS SNS |
| Monitoring | AWS CloudWatch |
| Configuration | configparser |
| Logging | Python logging + custom JSON formatter |
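For the logging row in the table above, a custom JSON formatter on top of the standard `logging` module might look roughly like the following. The class name `JsonFormatter` and the helper `get_logger` are hypothetical; the four output fields are chosen to match the keys seen in the agent's log lines (timestamp, level, message, module).

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
        }
        return json.dumps(payload, ensure_ascii=False)


def get_logger(name, log_file=None):
    """Return a logger that writes JSON lines to the console and, optionally, a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handlers = [logging.StreamHandler()]
        if log_file:
            handlers.append(logging.FileHandler(log_file))
        for handler in handlers:
            handler.setFormatter(JsonFormatter())
            logger.addHandler(handler)
    return logger
```

One JSON object per line keeps the output easy to ship to a log aggregator later, since each line can be parsed independently.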
cloud-telemetry-agent/
├── agent/
│ ├── __init__.py
│ ├── collector.py # System metric collection
│ ├── publisher.py # CloudWatch publisher
│ ├── alerter.py # SNS alert engine
│ ├── logger.py # JSON log formatter
│ └── utils.py # Retry decorator utility
├── config/
│ └── config.ini # All configuration lives here
├── logs/
│ └── telemetry.log # Structured JSON log output
├── tests/
│ └── __init__.py
├── requirements.txt
└── main.py # Agent entry point
- Python 3.8+
- AWS account (Free Tier compatible)
- AWS CLI installed and configured
git clone https://github.com/Alex-CloudOps/cloud-telemetry-agent.git
cd cloud-telemetry-agent
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
pip install -r requirements.txt
- Create an IAM user with CloudWatchFullAccess and AmazonSNSFullAccess
- Configure AWS CLI: aws configure
- Create an SNS topic and subscribe your email
- Update config/config.ini with your AWS details
Edit config/config.ini before running:
[aws]
region = us-east-2
cloudwatch_namespace = CloudTelemetryAgent
sns_topic_arn = arn:aws:sns:us-east-2:YOUR_ACCOUNT_ID:cloud-telemetry-alerts
[agent]
hostname = your-hostname
interval_seconds = 60
environment = production
[thresholds]
cpu_percent = 85
memory_percent = 90
disk_percent = 90
Run the agent:
python main.py
Sample log output:
{"timestamp": "2026-03-07T07:13:09.260119+00:00", "level": "INFO", "message": "Starting metric collection cycle", "module": "collector"}
{"timestamp": "2026-03-07T07:13:10.261839+00:00", "level": "INFO", "message": "Collected cpu_percent: 3.9%", "module": "collector"}
{"timestamp": "2026-03-07T07:13:10.263920+00:00", "level": "INFO", "message": "Collected memory_percent: 83.7%", "module": "collector"}
{"timestamp": "2026-03-07T07:13:10.534326+00:00", "level": "INFO", "message": "CloudWatch publish complete — HTTP 200", "module": "publisher"}
{"timestamp": "2026-03-07T07:13:10.539335+00:00", "level": "INFO", "message": "Alert check complete — 0 alert(s) fired", "module": "alerter"}
When a metric breaches its configured threshold, an alert is immediately published via AWS SNS:
Subject: CloudTelemetryAgent Alert - memory_percent
⚠️ ALERT: memory_percent is at 91.0% on my-server-01 (threshold: 90.0%)
Environment: production
Timestamp: 2026-03-07T07:03:10.193033+00:00
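The path from config.ini thresholds to an alert like the one above can be sketched as follows. This is an illustration under stated assumptions, not the contents of alerter.py: the helper names (`load_thresholds`, `find_breaches`, `format_alert`) are hypothetical, and treating a value equal to the threshold as a breach (`>=`) is a guess about the intended semantics.

```python
import configparser
from datetime import datetime, timezone

# Inline copy of the relevant config.ini sections so the sketch runs standalone.
SAMPLE_CONFIG = """
[agent]
hostname = my-server-01
environment = production

[thresholds]
cpu_percent = 85
memory_percent = 90
disk_percent = 90
"""


def load_thresholds(config):
    """Read every key in [thresholds] as a float."""
    return {name: config.getfloat("thresholds", name) for name in config["thresholds"]}


def find_breaches(metrics, thresholds):
    """Return (name, value, limit) for each metric at or above its threshold."""
    return [
        (name, value, thresholds[name])
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]  # assumed: >= counts as a breach
    ]


def format_alert(name, value, limit, hostname, environment, timestamp=None):
    """Build a subject and body shaped like the sample alert."""
    timestamp = timestamp or datetime.now(timezone.utc)
    subject = f"CloudTelemetryAgent Alert - {name}"
    body = (
        f"⚠️ ALERT: {name} is at {value:.1f}% on {hostname} (threshold: {limit:.1f}%)\n"
        f"Environment: {environment}\n"
        f"Timestamp: {timestamp.isoformat()}"
    )
    return subject, body


config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)
thresholds = load_thresholds(config)
breaches = find_breaches({"cpu_percent": 3.9, "memory_percent": 91.0}, thresholds)
for name, value, limit in breaches:
    subject, body = format_alert(
        name, value, limit, config["agent"]["hostname"], config["agent"]["environment"]
    )
    print(subject)
    print(body)
```

With boto3, the resulting pair could then be handed to an SNS client's `publish(TopicArn=..., Subject=subject, Message=body)` call, using the `sns_topic_arn` value from the `[aws]` section.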
- Network I/O metrics collection
- Continuous polling loop with configurable interval
- CloudWatch Logs integration for centralized log shipping
- Docker containerization for portable deployment
- Power BI dashboard integration via exported telemetry data
- Unit tests with pytest
Alex Evans | CloudOps & NOC Engineer
GitHub | alex.evans.cloudops@gmail.com
Built to demonstrate production-grade CloudOps and observability engineering practices.