cloud-telemetry-agent

A lightweight, production-grade telemetry agent that collects real-time system metrics, publishes them to AWS CloudWatch, and fires threshold-based alerts via AWS SNS.

Built to mirror real-world NOC and CloudOps monitoring patterns — config-driven, modular, and resilient.


Overview

Modern infrastructure teams need visibility into system health at all times. cloud-telemetry-agent provides a deployable Python-based agent that continuously monitors host-level metrics and integrates directly with AWS observability services.

This project demonstrates core CloudOps and SRE competencies:

  • Agent-based metric collection
  • Cloud-native telemetry publishing
  • Threshold-driven alerting pipelines
  • Structured logging for downstream ingestion
  • Production resilience patterns (retry logic, graceful error handling)
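The retry pattern named in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the repository's actual utils.py: the decorator name `retry` and its parameters (`max_attempts`, `base_delay`, `exceptions`, `sleep`) are hypothetical, and exponential backoff is an assumed choice since the README only says "backoff".

```python
import functools
import time


def retry(max_attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call with exponential backoff (hypothetical helper)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Wait base_delay, then 2x, then 4x, and so on.
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

For AWS calls, `exceptions` would typically be narrowed to something like `botocore.exceptions.ClientError` so that programming errors are not retried. The `sleep` parameter exists only to make the delay injectable in tests.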

Architecture

Host Machine
    │
    ├── collector.py       # Collects CPU, memory, disk metrics via psutil
    ├── publisher.py       # Publishes telemetry to AWS CloudWatch
    ├── alerter.py         # Evaluates thresholds, fires alerts via AWS SNS
    ├── logger.py          # Structured JSON logging (file + console)
    └── utils.py           # Retry decorator for resilient AWS calls
            │
            ▼
    AWS CloudWatch         # Metrics storage and monitoring
    AWS SNS                # Alert delivery (email, SMS, webhook)
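As a rough illustration of the collection step at the top of the diagram, the sketch below reads the three metrics through psutil. The function name `collect_metrics` and its parameters are assumptions made for this example; the repository's collector.py may be organized differently.

```python
def collect_metrics(disk_path="/", cpu_interval=1.0):
    """Return CPU, memory, and disk utilization as percentages (illustrative)."""
    # psutil is a third-party package (pip install psutil). It is imported here
    # so that the module itself can be loaded on hosts where it is missing.
    import psutil

    return {
        # Blocks for cpu_interval seconds while sampling CPU usage.
        "cpu_percent": psutil.cpu_percent(interval=cpu_interval),
        "memory_percent": psutil.virtual_memory().percent,
        # disk_path is an assumed default; on Windows use a drive such as "C:\\".
        "disk_percent": psutil.disk_usage(disk_path).percent,
    }
```

Returning a plain dict keyed by metric name keeps the collector independent of AWS, so the publisher and alerter can consume the same structure.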

Features

  • Real-time metric collection — CPU, memory, and disk utilization via psutil
  • CloudWatch integration — Metrics published with host and environment dimensions
  • SNS alerting — Configurable thresholds trigger immediate notifications
  • Structured JSON logging — Every event logged with timestamp, level, and module
  • Retry logic — Automatic retries with backoff on AWS API failures
  • Config-driven — Zero hardcoded values; fully controlled via config.ini
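To make the CloudWatch integration bullet concrete, here is a hedged sketch of how collected values could be shaped into a `put_metric_data` request. The helper names `build_metric_data` and `publish`, and the dimension names `Host` and `Environment`, are assumptions for illustration; only the boto3 call `put_metric_data(Namespace=..., MetricData=...)` and the `Percent` unit are taken from the AWS API.

```python
from datetime import datetime, timezone


def build_metric_data(metrics, hostname, environment, timestamp=None):
    """Shape a {name: percent} dict into CloudWatch MetricData entries."""
    timestamp = timestamp or datetime.now(timezone.utc)
    # Dimension names are hypothetical; the README only says "host and environment".
    dimensions = [
        {"Name": "Host", "Value": hostname},
        {"Name": "Environment", "Value": environment},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Timestamp": timestamp,
            "Value": float(value),
            "Unit": "Percent",
        }
        for name, value in metrics.items()
    ]


def publish(client, namespace, metric_data):
    """Send the entries through a boto3 CloudWatch client; return the HTTP status."""
    response = client.put_metric_data(Namespace=namespace, MetricData=metric_data)
    return response["ResponseMetadata"]["HTTPStatusCode"]
```

In use, `client` would be created with `boto3.client("cloudwatch", region_name=...)`. Passing the client in as an argument keeps the payload logic testable without AWS credentials.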

Tech Stack

| Component | Technology |
| --- | --- |
| Language | Python 3.x |
| Metric Collection | psutil |
| AWS SDK | boto3 / botocore |
| Alerting | AWS SNS |
| Monitoring | AWS CloudWatch |
| Configuration | configparser |
| Logging | Python logging + custom JSON formatter |

Project Structure

cloud-telemetry-agent/
├── agent/
│   ├── __init__.py
│   ├── collector.py       # System metric collection
│   ├── publisher.py       # CloudWatch publisher
│   ├── alerter.py         # SNS alert engine
│   ├── logger.py          # JSON log formatter
│   └── utils.py           # Retry decorator utility
├── config/
│   └── config.ini         # All configuration lives here
├── logs/
│   └── telemetry.log      # Structured JSON log output
├── tests/
│   └── __init__.py
├── requirements.txt
└── main.py                # Agent entry point

Getting Started

Prerequisites

  • Python 3.8+
  • AWS account (Free Tier compatible)
  • AWS CLI installed and configured

Installation

git clone https://github.com/Alex-CloudOps/cloud-telemetry-agent.git
cd cloud-telemetry-agent
python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
pip install -r requirements.txt

AWS Setup

  1. Create an IAM user and attach the CloudWatchFullAccess and AmazonSNSFullAccess managed policies
  2. Configure AWS CLI: aws configure
  3. Create an SNS topic and subscribe your email
  4. Update config/config.ini with your AWS details

Configuration

Edit config/config.ini before running:

[aws]
region = us-east-2
cloudwatch_namespace = CloudTelemetryAgent
sns_topic_arn = arn:aws:sns:us-east-2:YOUR_ACCOUNT_ID:cloud-telemetry-alerts

[agent]
hostname = your-hostname
interval_seconds = 60
environment = production

[thresholds]
cpu_percent = 85
memory_percent = 90
disk_percent = 90
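A file like the one above could be read with the standard-library configparser along these lines. This is a sketch under assumptions: the function name `load_config` and the shape of the returned dict are made up for illustration, while the section and option names come from the sample config.

```python
import configparser


def load_config(path="config/config.ini"):
    """Parse the agent's INI file into a plain dict (illustrative helper)."""
    parser = configparser.ConfigParser()
    # read() returns the list of files it parsed; an empty list means not found.
    if not parser.read(path):
        raise FileNotFoundError(path)
    return {
        "region": parser.get("aws", "region"),
        "namespace": parser.get("aws", "cloudwatch_namespace"),
        "sns_topic_arn": parser.get("aws", "sns_topic_arn"),
        "hostname": parser.get("agent", "hostname"),
        "interval_seconds": parser.getint("agent", "interval_seconds"),
        "environment": parser.get("agent", "environment"),
        # Every option under [thresholds] becomes a float limit keyed by metric name.
        "thresholds": {
            name: parser.getfloat("thresholds", name)
            for name in parser.options("thresholds")
        },
    }
```

Reading thresholds generically means a new metric limit can be added in config.ini without touching the loader.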

Run the Agent

python main.py

Sample Output

{"timestamp": "2026-03-07T07:13:09.260119+00:00", "level": "INFO", "message": "Starting metric collection cycle", "module": "collector"}
{"timestamp": "2026-03-07T07:13:10.261839+00:00", "level": "INFO", "message": "Collected cpu_percent: 3.9%", "module": "collector"}
{"timestamp": "2026-03-07T07:13:10.263920+00:00", "level": "INFO", "message": "Collected memory_percent: 83.7%", "module": "collector"}
{"timestamp": "2026-03-07T07:13:10.534326+00:00", "level": "INFO", "message": "CloudWatch publish complete — HTTP 200", "module": "publisher"}
{"timestamp": "2026-03-07T07:13:10.539335+00:00", "level": "INFO", "message": "Alert check complete — 0 alert(s) fired", "module": "alerter"}

Alert Example

When a metric breaches its configured threshold, an alert is immediately published via AWS SNS:

Subject: CloudTelemetryAgent Alert - memory_percent

⚠️ ALERT: memory_percent is at 91.0% on my-server-01 (threshold: 90.0%)
Environment: production
Timestamp: 2026-03-07T07:03:10.193033+00:00
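The threshold check behind an alert like this one can be sketched as below. The function names (`find_breaches`, `format_alert`, `send_alerts`) are hypothetical, and treating a breach as "strictly greater than the threshold" is an assumption, since the README does not say whether equality counts. The SNS call uses boto3's `publish(TopicArn=..., Subject=..., Message=...)`.

```python
from datetime import datetime, timezone


def find_breaches(metrics, thresholds):
    """Return (name, value, limit) for each metric above its threshold."""
    return [
        (name, value, thresholds[name])
        for name, value in metrics.items()
        # Assumption: a value equal to the limit does not fire an alert.
        if name in thresholds and value > thresholds[name]
    ]


def format_alert(name, value, limit, hostname, environment, namespace, now=None):
    """Build the subject and body in the layout of the example alert."""
    now = now or datetime.now(timezone.utc)
    subject = f"{namespace} Alert - {name}"
    message = (
        f"⚠️ ALERT: {name} is at {value:.1f}% on {hostname} (threshold: {limit:.1f}%)\n"
        f"Environment: {environment}\n"
        f"Timestamp: {now.isoformat()}"
    )
    return subject, message


def send_alerts(sns_client, topic_arn, alerts):
    """Publish each (subject, message) pair to the topic; return the count sent."""
    for subject, message in alerts:
        sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=message)
    return len(alerts)
```

Here `sns_client` would come from `boto3.client("sns", region_name=...)`. The emoji stays in the message body because SNS subjects must be ASCII.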

Roadmap

  • Network I/O metrics collection
  • Continuous polling loop with configurable interval
  • CloudWatch Logs integration for centralized log shipping
  • Docker containerization for portable deployment
  • Power BI dashboard integration via exported telemetry data
  • Unit tests with pytest

Author

Alex Evans | CloudOps & NOC Engineer
GitHub | alex.evans.cloudops@gmail.com


Built to demonstrate production-grade CloudOps and observability engineering practices.
