Feature: Implement Structured Logging, Monitoring & Observability
Problem
The FluentMeet backend currently has no centralized logging strategy, no performance metrics, and no alerting. Log output is unstructured, making it unsearchable in aggregation tools. There is no way to trace a single request or audio chunk across multiple services (WebSocket handler → Kafka → STT worker → Translation worker → TTS worker → egress), making latency bottlenecks and silent failures nearly impossible to diagnose in production. Without health checks and alerting, service degradations go unnoticed until users report them.
Proposed Solution
Implement a three-pillar observability stack tailored to FastAPI and the Kafka AI pipeline:
- Structured Logging: JSON-formatted logs with correlation IDs propagated across all services, collected and shipped to a log aggregation platform (e.g., Loki, Datadog, or CloudWatch).
- Metrics: Prometheus metrics exposed via a /metrics endpoint, scraped by a Prometheus server, and visualized in Grafana dashboards.
- Alerting: Alertmanager (or cloud-native equivalent) rules for critical failure conditions and latency SLO breaches.
User Stories
- As a DevOps engineer, I want all logs to be structured JSON, so I can query and filter them with a log aggregation tool without writing regex parsers.
- As a developer, I want a correlation_id (trace ID) automatically attached to every log line and propagated through the Kafka pipeline, so I can trace a single audio chunk end-to-end without guessing.
- As a platform operator, I want Prometheus metrics for request rate, error rate, and translation latency, so I can set SLO-based alert thresholds and detect regressions before users do.
- As an on-call engineer, I want to receive an alert when Kafka consumer lag exceeds a threshold or service error rate spikes, so I am notified of outages proactively rather than reactively.
Acceptance Criteria
- Structured Logging:
  - All log output is JSON-formatted using python-json-logger or structlog.
  - Every log record includes: timestamp, level, service, logger, message, correlation_id, and request_id (for HTTP requests).
  - A CorrelationIDMiddleware automatically generates and attaches an X-Correlation-ID header to every HTTP request and response.
  - The correlation_id is injected into the Kafka message envelope so it flows through all pipeline workers.
  - Log level is configurable via the LOG_LEVEL environment variable (default: INFO).
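To make the logging criteria above concrete, here is a minimal sketch using only the standard library. The proposal itself names structlog or python-json-logger; the stdlib `logging` module is swapped in here so the example runs without dependencies. `SERVICE_NAME`, `JsonFormatter`, and `configure_logging` are hypothetical names, not part of the proposal.

```python
import json
import logging
import os
from contextvars import ContextVar
from datetime import datetime, timezone

# Holds the correlation ID for the current request or pipeline message.
correlation_id_var: ContextVar[str] = ContextVar("correlation_id", default="-")

SERVICE_NAME = "fluentmeet-api"  # hypothetical service name


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": SERVICE_NAME,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id_var.get(),
        }
        # request_id only exists for HTTP requests (passed via `extra`).
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            payload["request_id"] = request_id
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging() -> None:
    """Install the JSON handler on the root logger; level comes from LOG_LEVEL."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(os.getenv("LOG_LEVEL", "INFO").upper())
```

Because the correlation ID lives in a `ContextVar`, any code running in the same async context picks it up without passing it through function arguments.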
- Prometheus Metrics: the following metrics are tracked and exposed at GET /metrics.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| http_requests_total | Counter | method, endpoint, status_code | Total HTTP requests |
| http_request_duration_seconds | Histogram | method, endpoint | Request latency |
| pipeline_stage_duration_seconds | Histogram | stage (stt, translate, tts) | Per-stage AI latency |
| pipeline_end_to_end_duration_seconds | Histogram | (none) | Full audio chunk latency |
| kafka_consumer_lag | Gauge | topic, consumer_group | Kafka consumer offset lag |
| active_rooms_total | Gauge | (none) | Number of currently active meeting rooms |
| active_ws_connections_total | Gauge | type (audio, captions, signaling) | Active WebSocket connections |
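As an illustration of how pipeline_stage_duration_seconds could be fed, the sketch below times a stage with a context manager. It is dependency-free on purpose: `stage_durations` is a hypothetical in-memory stand-in for the histogram, and `time_stage` is an assumed helper name. With prometheus_client, the append would become an `observe()` call on a `Histogram` labelled by stage.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Stand-in for a histogram labelled by `stage`: stage name -> observed seconds.
stage_durations = defaultdict(list)


@contextmanager
def time_stage(stage: str):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_durations[stage].append(time.perf_counter() - start)


with time_stage("stt"):
    time.sleep(0.01)  # placeholder for the real speech-to-text call
```

Recording in a `finally` block means failed stages still contribute a latency sample, which keeps slow failures visible on the dashboard.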
- Health Check Endpoints:
  - GET /health: basic liveness check (already implemented); returns {"status": "ok"}.
  - GET /health/ready: readiness check; verifies connectivity to PostgreSQL, Redis, and Kafka. Returns 503 if any dependency is unreachable.
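The readiness logic could look roughly like the sketch below, which runs the dependency checks concurrently and maps the outcome to a status code. The check callables, the per-check timeout, and the response body shape are all assumptions; the proposal only specifies the 503 behaviour.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# Each check is an async callable that raises on failure (hypothetical contract).
Check = Callable[[], Awaitable[None]]


async def readiness(checks: Dict[str, Check], timeout: float = 2.0) -> Tuple[int, dict]:
    """Run all dependency checks concurrently; return (status_code, body)."""

    async def run(name: str, check: Check) -> Tuple[str, str]:
        try:
            await asyncio.wait_for(check(), timeout=timeout)
            return name, "ok"
        except Exception as exc:  # a timeout or connection error marks it down
            return name, f"unreachable: {type(exc).__name__}"

    results = dict(await asyncio.gather(*(run(n, c) for n, c in checks.items())))
    healthy = all(state == "ok" for state in results.values())
    body = {"status": "ok" if healthy else "unavailable", "dependencies": results}
    return (200 if healthy else 503), body
```

In the FastAPI handler, the returned tuple would be turned into a JSON response with the matching status code, with one check each for PostgreSQL, Redis, and Kafka.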
- Alerting Rules: the following alert rules are configured (evaluated by Prometheus and routed through Alertmanager).

| Alert | Condition | Severity |
| --- | --- | --- |
| HighErrorRate | HTTP 5xx rate > 5% over 5 min | Critical |
| HighTranslationLatency | P95 end-to-end pipeline latency > 2 s over 5 min | Warning |
| KafkaConsumerLagHigh | Consumer lag > 1000 messages | Warning |
| ServiceDown | Health check returns non-200 | Critical |
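One possible translation of these conditions into PromQL is sketched below as Python data, arranged in the groups/rules layout that Prometheus rule files use. The expressions assume the metric and label names from the metrics table. The ServiceDown expression additionally assumes the health endpoint is probed by the blackbox exporter (which exposes `probe_success`); the proposal does not say how the health check is monitored. `to_rule_group` and the group name are hypothetical.

```python
# PromQL sketches for the four alerts; thresholds come from the table above.
ALERT_RULES = [
    {
        "alert": "HighErrorRate",
        "expr": (
            'sum(rate(http_requests_total{status_code=~"5.."}[5m]))'
            " / sum(rate(http_requests_total[5m])) > 0.05"
        ),
        "severity": "critical",
    },
    {
        "alert": "HighTranslationLatency",
        "expr": (
            "histogram_quantile(0.95, sum by (le) "
            "(rate(pipeline_end_to_end_duration_seconds_bucket[5m]))) > 2"
        ),
        "severity": "warning",
    },
    {
        "alert": "KafkaConsumerLagHigh",
        "expr": "max by (topic, consumer_group) (kafka_consumer_lag) > 1000",
        "severity": "warning",
    },
    {
        # Assumption: health checks are probed by the blackbox exporter.
        "alert": "ServiceDown",
        "expr": "probe_success == 0",
        "severity": "critical",
    },
]


def to_rule_group(name: str = "fluentmeet") -> dict:
    """Arrange the rules in the groups/rules layout of a Prometheus rule file."""
    return {
        "groups": [
            {
                "name": name,
                "rules": [
                    {"alert": r["alert"], "expr": r["expr"], "labels": {"severity": r["severity"]}}
                    for r in ALERT_RULES
                ],
            }
        ]
    }
```

The resulting structure would still need to be written out as YAML for infra/alerts.yml, and the expressions should be checked against the label names the instrumentation actually emits.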
- All Kafka pipeline workers log stage entry/exit with correlation_id at DEBUG level and errors at ERROR with full stack traces.
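A worker that satisfies this criterion might follow the pattern sketched below. The `Envelope` fields, the `handle_message` helper, and the logger name are assumptions for illustration; the real envelope schema is expected to live in app/schemas/pipeline.py and may differ.

```python
import json
import logging
from dataclasses import asdict, dataclass
from typing import Callable

logger = logging.getLogger("pipeline.worker")


@dataclass
class Envelope:
    """Hypothetical Kafka message envelope; the field names are assumptions."""

    correlation_id: str
    payload: str

    def to_bytes(self) -> bytes:
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, raw: bytes) -> "Envelope":
        return cls(**json.loads(raw.decode("utf-8")))


def handle_message(raw: bytes, stage: str, process: Callable[[str], str]) -> bytes:
    """Run one stage and return an outgoing envelope with the same correlation_id."""
    envelope = Envelope.from_bytes(raw)
    context = {"correlation_id": envelope.correlation_id, "stage": stage}
    logger.debug("stage entry", extra=context)
    try:
        result = process(envelope.payload)
    except Exception:
        # logger.exception logs at ERROR level and includes the stack trace.
        logger.exception("stage failed", extra=context)
        raise
    logger.debug("stage exit", extra=context)
    return Envelope(envelope.correlation_id, result).to_bytes()
```

The key property is that the outgoing envelope reuses the incoming correlation_id, so the STT, translation, and TTS workers all log under the same ID for a given audio chunk.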
- The GET /metrics endpoint is protected: accessible only from internal/monitoring network ranges (not publicly exposed).
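If the /metrics restriction is enforced in the application rather than purely by network policy, the check could be as small as the sketch below. The CIDR ranges are placeholders (RFC 1918 private networks plus loopback), since the real monitoring ranges are deployment-specific and not given here; `is_metrics_client_allowed` is a hypothetical helper.

```python
import ipaddress

# Example internal ranges only; replace with the actual monitoring networks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "127.0.0.0/8")
]


def is_metrics_client_allowed(client_ip: str) -> bool:
    """Return True if the caller's address falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed address: deny
    return any(address in network for network in ALLOWED_NETWORKS)
```

Note that behind a reverse proxy the application sees the proxy's address, so this check is only meaningful if the client IP is derived from a trusted forwarding header or the proxy itself blocks external access to /metrics.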
Proposed Technical Details
- Logging Library: structlog configured with JSONRenderer for production; ConsoleRenderer for local development (detected via ENVIRONMENT setting).
- Metrics Library: prometheus-fastapi-instrumentator for automatic HTTP metrics, plus manual prometheus_client gauges/histograms for pipeline and Kafka metrics.
- Correlation ID: asgi-correlation-id middleware generates a UUID v4 per request, stored in a ContextVar and accessible throughout the async call stack.
- Kafka Lag Metric: A background task polls the Kafka broker for consumer group lag every 30 seconds and updates the kafka_consumer_lag Gauge.
- New/Modified Files:
  - app/core/logging.py: structlog configuration [NEW]
  - app/core/middleware.py: add CorrelationIDMiddleware [MODIFY]
  - app/core/metrics.py: Prometheus metric definitions and helpers [NEW]
  - app/api/v1/endpoints/health.py: extend with /health/ready [NEW]
  - app/main.py: register logging, middleware, metrics, and health router [MODIFY]
  - infra/prometheus.yml: Prometheus scrape config [NEW]
  - infra/alerts.yml: Alertmanager alert rules [NEW]
  - infra/docker-compose.yml: add Prometheus and Grafana services [MODIFY]
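For the Kafka lag metric described in the technical details, the core arithmetic is the log end offset minus the committed offset, summed per topic. The sketch below keeps the Kafka client out of the picture: `fetch` and `publish` are hypothetical callables standing in for the broker query and the Gauge update, and treating a missing commit as offset 0 is a simplification.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

TopicPartition = Tuple[str, int]
Offsets = Dict[TopicPartition, int]


def consumer_lag_by_topic(end_offsets: Offsets, committed: Offsets) -> Dict[str, int]:
    """Sum per-partition lag (end offset minus committed offset) per topic."""
    lag: Dict[str, int] = {}
    for (topic, partition), end in end_offsets.items():
        done = committed.get((topic, partition), 0)  # simplification: no commit means 0
        lag[topic] = lag.get(topic, 0) + max(end - done, 0)
    return lag


async def poll_lag(
    fetch: Callable[[], Awaitable[Tuple[Offsets, Offsets]]],
    publish: Callable[[str, int], None],
    interval: float = 30.0,
) -> None:
    """Every `interval` seconds, fetch offsets and publish the lag per topic."""
    while True:
        end_offsets, committed = await fetch()
        for topic, value in consumer_lag_by_topic(end_offsets, committed).items():
            publish(topic, value)
        await asyncio.sleep(interval)
```

In the service, `poll_lag` would be started as a background task at application startup, with `publish` setting the kafka_consumer_lag Gauge for the relevant topic and consumer group.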
Tasks
- Configure structlog in app/core/logging.py with JSON output for production and pretty output for development.
- Add CorrelationIDMiddleware in app/core/middleware.py to generate and propagate X-Correlation-ID.
- Inject correlation_id into Kafka message envelopes in app/schemas/pipeline.py.
- Add correlation_id logging in all Kafka worker services (stt_worker, translation_worker, tts_worker).
- Define Prometheus metrics in app/core/metrics.py.
- Integrate prometheus-fastapi-instrumentator for automatic HTTP metrics.
- Implement GET /health/ready in app/api/v1/endpoints/health.py.
- Add Prometheus and Grafana services to infra/docker-compose.yml.
- Define alert rules in infra/alerts.yml.
- Test CorrelationIDMiddleware (assert header is present in response).
- Test GET /health/ready (mock unhealthy dependencies).
Open Questions/Considerations
- Which log aggregation platform will be used in production: Loki (self-hosted), Datadog, or AWS CloudWatch? This affects the log shipping configuration.
- Should the GET /metrics endpoint be protected by a shared secret header (e.g., Bearer token) or restricted by network policy only?
- Should we implement distributed tracing (OpenTelemetry + Jaeger/Tempo) in addition to correlation IDs, for a full trace waterfall view across services?
- What is the agreed SLO for end-to-end translation latency (e.g., P95 < 500ms)? This determines the HighTranslationLatency alert threshold, which the alerting table currently sets at 2 s.