Ultra-Light Edge Observability Stack

A production-grade monitoring stack designed for resource-constrained edge devices. Most observability setups assume abundant memory and CPU; this one doesn't. The entire stack (metrics collection, storage, and dashboarding) runs in under 150MB of RAM on a 2-core device, with a scrape success rate above 99%.

The Problem

Running Prometheus + Grafana on edge hardware is non-trivial. Out of the box, a naive deployment will:

Blow through a 300MB memory budget
Hit scrape timeouts from CPU-heavy services
Fail silently due to content-type mismatches between Flask and Prometheus

This project documents the exact issues encountered and the fixes applied to get a stable, lean observability stack running in a constrained environment.

Performance

Metric	Before	After
Total Memory	210–230 MB	141 MB (53% under 300MB budget)
CPU Usage	0.6–0.8% (with spikes)	0.4% avg (smooth)
Scrape Success Rate	60–70%	99%+
Prometheus Target Status	DOWN (red)	UP (green)

Hardware Target

2-core CPU @ 2 GHz
500 MB usable RAM
Memory Budget: 300 MB maximum

Architecture

┌─────────────────────────────────────────────────┐
│                 Edge Device                      │
│  (2-core CPU, 500MB RAM)                        │
│                                                  │
│  ┌──────────────────┐         ┌──────────────┐ │
│  │  Sensor Service  │────────▶│  Prometheus  │ │
│  │   (Flask App)    │ metrics │  (Scraper)   │ │
│  │   64MB limit     │  :8000  │  128MB limit │ │
│  └──────────────────┘         └──────┬───────┘ │
│                                       │         │
│                                       ▼         │
│                               ┌──────────────┐ │
│                               │   Grafana    │ │
│                               │ (Visualize)  │ │
│                               │  96MB limit  │ │
│                               └──────────────┘ │
└─────────────────────────────────────────────────┘
                                       │
                                       ▼
                              User Dashboard (Browser)

Quick Start

Prerequisites: Docker Desktop or Docker Engine, Docker Compose, 500MB RAM available

git clone https://github.com/yourusername/edge-observability-stack
cd edge-observability-stack
docker compose up -d

# Verify containers are running
docker ps

# Check memory usage (should be ~141MB total)
docker stats --no-stream

Access Points

Service	URL	Credentials
Grafana	http://localhost:3000	admin / admin
Prometheus	http://localhost:9090	—
Sensor Service	http://localhost:8000/sensor	—
Metrics Endpoint	http://localhost:8000/metrics	—

Challenges & Solutions

Getting this stack stable on constrained hardware required diagnosing and fixing four distinct issues. Here's what broke and how it was fixed.

1. Content-Type Mismatch → Scrape Failures

Symptom: Prometheus showing "Error scraping target" on every scrape cycle.

received unsupported Content-Type 'text/html; charset=utf-8'
and no fallback_scrape_protocol specified for target

Root cause: When a view returns raw bytes, Flask sends them with its default Content-Type, text/html; charset=utf-8, which Prometheus rejects outright.

# BEFORE
return generate_latest()  # Flask silently sets text/html

Fix: Explicitly return a Response object with the correct Prometheus mimetype.

# AFTER
return Response(
    generate_latest(),
    mimetype='text/plain; version=0.0.4; charset=utf-8'
)

Result: Scrape success rate jumped from 60–70% to 99%+. Prometheus target went from DOWN (red) to UP (green).

[Screenshot: Prometheus targets page before the fix, showing scrape errors]

[Screenshot: Prometheus targets page after the fix, target UP with no errors]
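To confirm the fix from the outside, it can help to inspect the Content-Type header the endpoint actually returns. The sketch below uses only the standard library; `looks_like_prometheus_text` is a hypothetical helper, and its rule (text/plain with a version parameter) is a deliberate simplification, not Prometheus's real content negotiation logic.

```python
from email.message import Message


def looks_like_prometheus_text(header_value: str) -> bool:
    """Simplified check: media type is text/plain and a version parameter is present."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() returns the lower-cased media type without parameters.
    is_plain_text = msg.get_content_type() == "text/plain"
    # get_param() returns None when the parameter is absent.
    has_version = msg.get_param("version") is not None
    return is_plain_text and has_version


# The header Flask sent before the fix, and the one sent after it.
before = "text/html; charset=utf-8"
after = "text/plain; version=0.0.4; charset=utf-8"
print(before, "->", looks_like_prometheus_text(before))
print(after, "->", looks_like_prometheus_text(after))
```

The header value itself can be read with `curl -sI http://localhost:8000/metrics` and passed to the helper.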

2. CPU Busy-Wait Loop → Scrape Timeouts

Symptom: Scrapes timing out, CPU usage elevated and spiky, delayed metrics delivery.

Root cause: A 2-million-iteration loop ran on every single scrape request.

# BEFORE — runs on every /metrics request
for _ in range(2000000):
    pass  # serves no purpose, burns CPU

Fix: Removed entirely — it had no functional purpose.

Result: CPU dropped from ~4% to 0.4% (90% reduction). Scrape timeouts eliminated.
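To get a feel for how much time such a loop adds to each scrape, it can be timed in isolation. This is a minimal sketch with a hypothetical helper name; the numbers it prints depend entirely on the hardware and Python version, so treat them as indicative only.

```python
import time


def time_busy_loop(iterations: int) -> float:
    """Return wall-clock seconds spent in an empty loop of the given length."""
    start = time.perf_counter()
    for _ in range(iterations):
        pass  # same no-op body as the loop that was removed
    return time.perf_counter() - start


elapsed = time_busy_loop(2_000_000)
print(f"2,000,000 empty iterations took {elapsed * 1000:.1f} ms")
```

On a slow or throttled edge CPU, this cost is paid on every /metrics request, on top of the real work of serializing metrics.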

3. 5MB Memory Blob → Memory Spikes

Symptom: Regular memory spikes visible in Grafana, up to 15MB per scrape.

Root cause: A 5MB static string was allocated at module load, then repeated one to three times on each request, creating up to 15MB of temporary data.

# BEFORE
import random

data_blob = "X" * 5_000_000                   # 5MB baseline, allocated at import time
temp_data = data_blob * random.randint(1, 3)  # up to 15MB per request!

Fix: Replaced with a realistic 1KB payload.

# AFTER
data_blob = "X" * 1_000  # 1KB — reflects realistic sensor data

Result: The blob shrank by 99.98% (5MB to 1KB). Spikes disappeared entirely.
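The sizes involved can be checked directly. The sketch below is illustrative: the helper names are hypothetical, and `sys.getsizeof` reports CPython's in-memory size of the string object, which is slightly larger than the character count.

```python
import sys


def reduction_percent(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`, to two decimals."""
    return round((1 - after / before) * 100, 2)


def blob_sizes() -> dict:
    """Measure the old blob, its worst-case per-request copy, and the new blob."""
    old_blob = "X" * 5_000_000
    worst_case = old_blob * 3  # random.randint(1, 3) can return 3
    new_blob = "X" * 1_000
    return {
        "old_blob": sys.getsizeof(old_blob),
        "worst_case_request": sys.getsizeof(worst_case),
        "new_blob": sys.getsizeof(new_blob),
    }


print(blob_sizes())
print(reduction_percent(5_000_000, 1_000), "% smaller")
```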

4. Missing Import → /sensor Route Crashes

Symptom: Every request to /sensor crashed with a NameError.

Root cause: jsonify() was used in the route handler but never imported.

# BEFORE — missing jsonify
from flask import Flask, Response

# AFTER — fixed
from flask import Flask, Response, jsonify

Result: /sensor route works correctly on every request.
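This kind of bug survives startup because Python looks up global names when a function runs, not when it is defined. The standalone sketch below reproduces the effect without Flask; `jsonify_stub` is a deliberately undefined, hypothetical name standing in for the missing `jsonify` import.

```python
def sensor_handler():
    # Defining this function succeeds even though `jsonify_stub` does not exist.
    # The name is only resolved when the function body executes.
    return jsonify_stub({"status": "ok"})


def call_safely() -> str:
    """Invoke the handler and report what happened instead of crashing."""
    try:
        sensor_handler()
    except NameError as exc:
        return f"NameError: {exc}"
    return "ok"


print(call_safely())
```

A smoke test that calls every route once after deployment would surface this class of error before real traffic does.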

Key Optimizations

Sensor Service

Issue	Before	After	Impact
Data blob	5 MB static	1 KB realistic	99.98% reduction
CPU loop	2M iterations/scrape	Removed	Eliminated timeouts
Memory	~80 MB	~40–50 MB	37–50% reduction
Content-Type	text/html	Prometheus format	99%+ scrape success

Prometheus

prometheus:
  command:
    - "--storage.tsdb.retention.time=12h"  # down from 48h — saves ~30% memory
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--web.enable-lifecycle"
  mem_limit: 128m

Scrape interval raised from 5s to 15s: about 67% fewer scrape requests, with no meaningful data loss for this use case.
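The combined effect of the two settings is easiest to see as plain arithmetic. The sketch below counts scrapes per hour and samples kept per time series over the retention window; the helper names are hypothetical, and the figures describe sample counts only, not Prometheus's actual memory or disk usage.

```python
def scrapes_per_hour(interval_seconds: int) -> int:
    """How many scrape requests hit the target each hour."""
    return 3600 // interval_seconds


def samples_retained(retention_hours: int, interval_seconds: int) -> int:
    """Samples kept per series across the whole retention window."""
    return retention_hours * scrapes_per_hour(interval_seconds)


before = samples_retained(retention_hours=48, interval_seconds=5)
after = samples_retained(retention_hours=12, interval_seconds=15)
print(f"scrapes/hour: {scrapes_per_hour(5)} -> {scrapes_per_hour(15)}")
print(f"samples per series: {before} -> {after}")
```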

Container Images

Switching from python:3.10 (~900MB) to python:3.10-slim (~150MB) cut the image size by 83%.

Custom Metrics

A custom counter tracks failed sensor events by failure category:

from prometheus_client import Counter

FAILED_EVENTS = Counter(
    "sensor_failed_events_total",
    "Total failed sensor events",
    ["reason"]  # Labels: cpu_spike, route_error
)

Label	Triggered when
`cpu_spike`	CPU_SPIKE gauge flips to 1 (simulated degraded state)
`route_error`	`/sensor` route throws an exception

Example Grafana query:

rate(sensor_failed_events_total{reason="cpu_spike"}[1m])

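Use rate() here rather than the raw counter value: a counter only ever increases, so its absolute value says little, while rate() shows how often events occur per second over the window and makes anomaly detection practical.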
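As a rough intuition for what rate() computes, the sketch below derives a per-second increase from a list of counter samples. It is a simplified model under stated assumptions: `simple_rate` is a hypothetical function, it treats any drop in value as a counter reset, and it skips the window-boundary extrapolation that Prometheus applies, so its results will not match PromQL exactly.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: Sequence[Sample]) -> float:
    """Per-second increase across the samples, treating any drop as a reset."""
    if len(samples) < 2:
        return 0.0  # simplification: PromQL would return no data here
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter only goes down when the process restarted from zero,
        # in which case the whole current value counts as new increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Illustrative samples: six failures recorded over 60 seconds.
window = [(0, 10), (15, 10), (30, 13), (60, 16)]
print(simple_rate(window))
```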

Memory Budget

NAME                    MEM USAGE / LIMIT
sensor-service-1        ~45MB  / 64MB
prometheus-1            ~55MB  / 128MB
grafana-1               ~55MB  / 96MB
────────────────────────────────────────
TOTAL                   ~141MB / 288MB   ✓ 53% under 300MB budget
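The budget figures can be reproduced with a few lines of arithmetic. In this sketch the container limits are copied from the architecture diagram, the 141MB figure is the measured total reported above, and the helper names are hypothetical.

```python
BUDGET_MB = 300

# Per-container memory limits as listed in the architecture diagram.
LIMITS_MB = {"sensor-service": 64, "prometheus": 128, "grafana": 96}


def limits_fit_budget(limits: dict, budget_mb: int = BUDGET_MB) -> bool:
    """True when the containers cannot exceed the budget even at their limits."""
    return sum(limits.values()) <= budget_mb


def headroom_percent(used_mb: float, budget_mb: int = BUDGET_MB) -> float:
    """Share of the budget left unused, as a percentage."""
    return round((1 - used_mb / budget_mb) * 100, 1)


print("sum of limits:", sum(LIMITS_MB.values()), "MB")
print("limits fit budget:", limits_fit_budget(LIMITS_MB))
print("headroom at 141 MB:", headroom_percent(141), "%")
```

Because the limits themselves sum to less than the budget, the stack stays within 300MB even if every container grows to its cap.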

Dashboard: Before vs After

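[Screenshot: Grafana dashboard before optimization, with CPU at 0.6–0.8%, memory spikes at regular intervals, and total usage of 210–230MB]

[Screenshot: Grafana dashboard after optimization, with CPU down to 0.2%, memory stable, and spikes gone]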

Testing

# Start services
docker compose up -d

# All containers should be in "Up" state
docker ps

# Total memory should be ~141MB
docker stats --no-stream

# Test sensor endpoint
curl http://localhost:8000/sensor

# Confirm custom metric is present
curl http://localhost:8000/metrics | grep sensor_failed

# Confirm Prometheus target is UP (green)
# `open` is macOS-only; on Linux use xdg-open or visit the URL in a browser
open http://localhost:9090/targets

Expected results:

All containers in "Up" state
Total memory ~141MB (under 300MB budget)
Prometheus target status: UP (green)
Grafana dashboards display data with no gaps
No content-type errors in Prometheus logs
CPU stable around 0.4%
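The target-status check can also be scripted against the Prometheus HTTP API instead of reading the web page. The sketch below assumes Prometheus is reachable at http://localhost:9090 as in the Access Points table; `all_targets_up` and `fetch_targets` are hypothetical helper names, and only the `health` field of each active target is inspected.

```python
import json
import urllib.request


def all_targets_up(payload: dict) -> bool:
    """True when at least one active target exists and every one reports health 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return bool(targets) and all(t.get("health") == "up" for t in targets)


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the target list from the /api/v1/targets endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Offline illustration with a hand-written payload in the API's response shape.
sample = {"status": "success", "data": {"activeTargets": [{"health": "up"}]}}
print("all targets up:", all_targets_up(sample))
```

With the stack running, `all_targets_up(fetch_targets())` performs the same check against the live instance.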

Stack

Prometheus — metrics scraping and storage
Grafana — visualization and dashboards
Flask + prometheus_client — sensor service and metrics exposition
Docker Compose — container orchestration
