Public docs-only mirror of observability practices applied across my microservice projects.
This repository is a documentation showcase of how I instrument, alert on, and operate microservice observability — covering metrics, logs, traces, and the operational glue that makes the three pillars actually useful in practice.
It contains:
- ADRs for non-obvious choices (tail-sampling, alert layering, structured-logging contracts, SLO formalization)
- Sanitized sample alert rules (Prometheus)
- A reading order for evaluators
It does not contain production dashboards, real alert thresholds tuned to a specific tenant, or company-internal runbooks. Those live in private repos (for example, cuckoo-echo) and are shared on request.
OTel SDKs in services → OTel Collector (tail-sampling + attribute scrub) → Prometheus / Tempo / Loki backends → three-tier alerting (symptoms / causes / capacity). Each segment is detailed in the ADRs below.
Source / re-render:
diagrams/observability-pipeline.mmd (see diagrams/README.md).
This is one corner of a three-repo showcase triangle covering my main practice areas:
- cuckoo-echo-showcase — multi-tenant AI customer-service SaaS architecture
- cicd-platform-showcase — CI/CD & release governance (this showcase's SLO definitions feed cicd-platform-showcase's canary gating in ADR-0004)
- You are here — observability platform (this repo)
All three are docs-only and intentionally cross-reference where decisions span domains.
This showcase aggregates practices from:
- cuckoo-echo (private) — multi-tenant AI customer-service platform, where the full Langfuse + Prometheus + OpenTelemetry + ELK stack was wired up and 15 production runbooks were written
- cuckoo (public) — polyglot monorepo, where the same patterns are applied at smaller scale and visible in source
Together they cover: OTel SDK instrumentation, Prometheus + Grafana metrics, Loki log aggregation, Jaeger / Tempo distributed tracing, structured logging contracts, multi-window burn-rate SLO alerting, and tail-sampling for cost control.
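
To make the structured logging contract concrete (JSON-only output, masked sensitive fields, trace-id required), here is a minimal sketch. It uses Python's standard `logging` module rather than the structlog setup used in the real services, and the `SENSITIVE_KEYS` set, the `fields` attribute, and the choice to raise on a missing trace-id are all assumptions made for illustration, not the contract as written in ADR-0002.

```python
import json
import logging

# Hypothetical list of keys to mask; the real contract may differ.
SENSITIVE_KEYS = {"password", "token", "email"}


class JsonContractFormatter(logging.Formatter):
    """Render a log record as one JSON object and enforce the contract."""

    def format(self, record: logging.LogRecord) -> str:
        trace_id = getattr(record, "trace_id", None)
        if not trace_id:
            # Assumption: a missing trace-id is treated as a contract violation.
            raise ValueError("trace_id is required by the logging contract")

        # Extra key/value pairs are passed via a hypothetical 'fields' attribute.
        fields = getattr(record, "fields", {})
        payload = {
            key: ("***" if key in SENSITIVE_KEYS else value)
            for key, value in fields.items()
        }
        # Core keys are written last so caller fields cannot overwrite them.
        payload.update(
            level=record.levelname,
            logger=record.name,
            message=record.getMessage(),
            trace_id=trace_id,
        )
        return json.dumps(payload, sort_keys=True)
```

With this formatter attached to a handler, a call such as `logger.info("login ok", extra={"trace_id": "abc123", "fields": {"user": "u1", "password": "hunter2"}})` would emit a single JSON line with the password replaced by `***`.
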
| Pillar | Practice | Status here |
|---|---|---|
| Metrics | Prometheus + Grafana — RED + USE + custom SLO metrics | ADR-0003, sample alert-rules.yml |
| Logs | Structured (JSON) via structlog / logback-encoder; sensitive-data masking | ADR-0002 |
| Traces | OTel SDK → OTel Collector → Tempo / Jaeger; trace-id propagation across HTTP, gRPC, async (Kafka) | ADR-0001 |
| Sampling | Tail-sampling at collector tier — keep error spans, sample success spans | ADR-0004 |
| Alerting | Multi-window burn-rate (30m/6h/3d) + per-tier symptoms (api / data / queue) | ADR-0003 |
| SLO | Explicit SLI definition per service + error-budget tracking + budget-driven release gates | covered in ADR-0003 |
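
The burn-rate idea behind the Alerting and SLO rows above can be sketched in a few lines. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The window names match the 30m/6h/3d windows in the table, but the thresholds below are illustrative assumptions, not the values in the sample alert-rules.yml. This sketch also checks each window independently; production rules often pair each window with a shorter confirmation window, which is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnWindow:
    name: str
    threshold: float  # burn-rate multiple at which this window fires


# Illustrative thresholds only (assumed, not taken from the repository).
WINDOWS = [
    BurnWindow("30m", 14.4),
    BurnWindow("6h", 6.0),
    BurnWindow("3d", 1.0),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def firing_windows(
    error_ratios: dict[str, float],
    slo_target: float,
    windows: list[BurnWindow] = WINDOWS,
) -> list[str]:
    """Return the names of windows whose burn rate meets their threshold."""
    return [
        w.name
        for w in windows
        if burn_rate(error_ratios[w.name], slo_target) >= w.threshold
    ]
```

For a 99.9% SLO, a 2% error ratio over the last 30 minutes is a burn rate of roughly 20, so the 30m window would fire while quieter 6h and 3d windows would not.
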
| ID | Decision | Why it matters |
|---|---|---|
| ADR-0001 | OTel SDK in all services, not vendor SDKs | Avoids lock-in; enables sampling decisions to live at the collector tier |
| ADR-0002 | JSON-only logs, masked sensitive fields, trace-id required | Logs become a queryable dataset, not a stream of strings |
| ADR-0003 | Three-tier alert layering: symptom / cause / saturation | Avoids "alert fatigue" while keeping diagnostics in reach |
| ADR-0004 | Tail-sampling at collector, not head-sampling at SDK | Keeps 100% of error/slow traces while cutting volume ~70% |
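
The tail-sampling decision in ADR-0004 can be illustrated with a small sketch: the decision is made once all spans of a trace are available, so error and slow traces are always kept and only successful traces are sampled. This is a plain-Python model of the policy, not the OTel Collector's processor; the `Span` type, the 5000 ms slow threshold, and the 30% keep rate are hypothetical values chosen for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(
    spans: list[Span],
    slow_ms: float = 5000.0,      # assumed "slow" threshold
    success_rate: float = 0.3,    # assumed keep rate for healthy traces
) -> bool:
    """Decide whether to keep a trace after all of its spans have arrived."""
    if not spans:
        return False
    # Always keep traces that contain an error span.
    if any(s.is_error for s in spans):
        return True
    # Always keep traces that contain a slow span.
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True
    # Sample the rest deterministically: hash the trace id into [0, 1)
    # so every collector instance makes the same decision for a trace.
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < success_rate
```

Hashing the trace id instead of drawing a random number keeps the decision repeatable, which matters when the same trace may be evaluated more than once.
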
- docs/samples/alert-rules.yml: Prometheus alert rules (multi-window burn-rate + per-tier symptoms)
- More samples (Grafana dashboard JSON, OTel Collector config) live in private repos and are shared on request
For a 10-minute walkthrough:
1. docs/adr/0001-otel-everywhere.md: instrumentation foundation
2. docs/adr/0003-alert-layering.md: alert philosophy (the most underrated piece of observability)
3. docs/samples/alert-rules.yml: the philosophy as code
4. docs/adr/0004-tail-sampling.md: cost control without losing signal
5. docs/adr/0002-structured-logging-contract.md: the data contract that makes logs useful
Real implementations live in:
- cuckoo (public): monitoring/ directory, apps/*/observability/ per service
- cuckoo-echo (private): monitoring/ (Prometheus / Loki / Tempo configs), monitoring/dashboards/ (Grafana JSON), shared/logging.py (structlog setup), shared/tracing.py (OTel init), 15 runbooks under docs/runbooks/
If you are evaluating this body of work, open an issue here or reach me directly. I'll grant time-boxed read access to the relevant private repo.
- Single-author body of work. Patterns described come from real implementations; no claim of "operated by a 24/7 SRE rotation across multiple regions for years."
- No production thresholds shared verbatim. The sample alert-rules.yml uses indicative thresholds (5% error rate, 5s P95). Real systems require thresholds tuned to that system's traffic pattern and SLOs; apply ADR-0003's framework, not the literal numbers.
- Cost claims (e.g. "tail-sampling reduces volume ~70%") come from a specific dataset in cuckoo-echo's local Docker Compose run; real reductions vary by traffic shape.
MIT — applies to documentation in this showcase repository.
