feat(#163): Add baseline observability #276

Merged: mvillmow merged 3 commits into main from 163-auto-impl on Jun 29, 2026
Conversation

@mvillmow

@mvillmow mvillmow commented Jun 4, 2026

Contributor

Summary

Implement issue #163: Add baseline observability to ProjectTelemachy with correlation IDs, structured logging, Prometheus metrics, and OpenTelemetry tracing.

Changes

New observability layer (telemetry.py)

  • Correlation IDs: Every log record carries a per-execution workflow_id, propagated via contextvars, so a single run can be traced end to end across async boundaries
  • Structured logging: JSON or plain-text formatters, selected via the LOG_FORMAT setting; the formatters fall back to defaults instead of crashing when the correlation filter is not attached
  • Prometheus metrics: Workflow completion, task outcomes, and HTTP latency exposed via a /metrics endpoint (opt-in via METRICS_ENABLED)
  • OpenTelemetry tracing: Spans for each workflow phase via the get_tracer() lazy factory; console exporter only (OTLP is a planned follow-up)
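
To illustrate the correlation-ID and safe-formatter bullets above, here is a minimal standard-library sketch. It is not the code in telemetry.py: the names `workflow_id_var`, `WorkflowIdFilter`, `SafeJsonFormatter`, `child`, and `run_workflow` are hypothetical, and the `"-"` default is an assumption.

```python
import asyncio
import contextvars
import json
import logging

# Hypothetical names: the PR does not show telemetry.py's actual identifiers.
workflow_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "workflow_id", default="-"
)


class WorkflowIdFilter(logging.Filter):
    """Copy the current workflow_id from the context onto each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.workflow_id = workflow_id_var.get()
        return True


class SafeJsonFormatter(logging.Formatter):
    """Emit JSON; fall back to "-" when the filter was never attached."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                # getattr keeps the formatter from raising without the filter.
                "workflow_id": getattr(record, "workflow_id", "-"),
            }
        )


async def child(name: str) -> str:
    # Tasks created by gather copy the caller's context, so the ID is visible.
    return f"{name}:{workflow_id_var.get()}"


async def run_workflow(workflow_id: str) -> list[str]:
    token = workflow_id_var.set(workflow_id)
    try:
        return await asyncio.gather(child("provision"), child("monitor"))
    finally:
        workflow_id_var.reset(token)  # avoid leaking the ID into later runs
```

Resetting the token in a `finally` block mirrors the set/reset pairing described for executor.py, so a failed run cannot leave a stale ID in the context.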

Architecture

  • telemetry.py: Logging filters, JSON/plain formatters, metrics singletons, tracing setup with idempotent initialization and thread safety
  • config.py: New observability settings (LOG_FORMAT, METRICS_ENABLED, METRICS_PORT, OTEL_ENABLED, OTEL_SERVICE_NAME, OTEL_EXPORTER) with single-source validation
  • cli.py: _setup_logging rewritten to attach the filter, set formatters, and start tracing/metrics before any httpx clients are instantiated
  • executor.py: Contextvars set/reset in _run, spans around each phase (provision, teams, monitor, teardown), metrics for workflow/task outcomes
  • agamemnon_client.py: Endpoint label normalization to control cardinality, per-attempt metrics
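
Endpoint label normalization keeps the number of distinct metric label values bounded. The sketch below shows one common way to do it, under stated assumptions: the function name `normalize_endpoint`, the `{id}` placeholder, and the choice to collapse numeric and UUID segments are hypothetical, since the PR does not show which segments agamemnon_client.py actually rewrites.

```python
import re

# Hypothetical patterns: the PR does not show which path segments are collapsed.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)
_NUMERIC = re.compile(r"^\d+$")


def normalize_endpoint(path: str) -> str:
    """Replace ID-like path segments with a placeholder to bound label cardinality."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = [
        "{id}" if _UUID.match(seg) or _NUMERIC.match(seg) else seg
        for seg in path.split("/")
    ]
    return "/".join(segments)
```

With this approach, requests for different resources under the same route share one label value instead of each creating a new time series.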

Testing

  • 26 new tests covering correlation ID propagation, log record defaults, metrics idempotency/thread safety, task terminal states, tracer behavior, endpoint normalization
  • All 70 tests pass with 78.41% coverage (target 75%)
  • Lint (ruff) and type checking (mypy) clean

Documentation

  • Updated .env.example with observability settings
  • Updated CLAUDE.md with new environment variables and observability subsection
  • Updated docs/license-audit.md with 4 new packages (all Apache-2.0, compatible with MIT distribution)

Closes #163

@mvillmow mvillmow enabled auto-merge (squash) June 28, 2026 18:45
mvillmow added a commit that referenced this pull request Jun 29, 2026
Telemachy CI pins setup-pixi to v0.67.2, which can only read pixi
lock-format v6 and fails with "Lock-file version 7 is newer than
supported; Maximum supported version: 6" on any v7 lock. Open PRs
#263/#271/#273/#276 intentionally ship v7 locks (multi-platform
macOS/Windows + large mcp/otel dep trees), so their entire pipeline
dies at `pixi install`.

Bump all 9 pixi-version pins (_required.yml x7, release.yml x2) to
v0.70.2, matching sibling repo Agamemnon which already runs v0.70.2
on main with v7 locks. v0.70.2 is backward-compatible: `pixi install
--locked` against main's current v6 lock succeeds (warn-only, lock not
rewritten), so this does not red main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
@mvillmow mvillmow force-pushed the 163-auto-impl branch 2 times, most recently from 43f8a7e to aa13850 Compare June 29, 2026 08:26
mvillmow and others added 3 commits June 29, 2026 01:36
…acing

Implement issue #163: Add observability to ProjectTelemachy with:

- Correlation IDs: Every log record carries per-execution workflow_id via
  contextvars for end-to-end tracing across async boundaries
- Structured logging: JSON or plain-text formatters via LOG_FORMAT setting
- Prometheus metrics: Workflow completion, task outcomes, HTTP latency
  exposed via /metrics endpoint (opt-in via METRICS_ENABLED)
- OpenTelemetry tracing: Spans for each workflow phase via get_tracer()
  lazy factory; console exporter only (OTLP planned follow-up)

Architecture:
- New telemetry.py: logging filters, JSON/plain formatters, metrics,
  tracing setup with idempotent initialization and thread safety
- config.py: New observability settings with validation (single source)
- cli.py: _setup_logging rewritten to attach filter, set formatters,
  start tracing/metrics before httpx clients are instantiated
- executor.py: Contextvars set/reset in _run, spans around each phase,
  metrics for workflow/task outcomes
- agamemnon_client.py: Endpoint label normalization, per-attempt metrics

Tests (26 new):
- Correlation ID propagation into gather children
- Log record defaults for missing filter
- Metrics idempotency and thread safety
- Metrics increments on success/failure
- Task terminal state transitions
- Tracer factory behavior
- Endpoint label normalization

License audit: Added 4 new packages (prometheus-client, opentelemetry-*),
all Apache-2.0 compatible with MIT distribution.

All 70 tests pass with 78.41% coverage (target 75%).
Lint and type checking clean (ruff, mypy).

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
No follow-up items discovered during implementation that qualify under
strict scope rules (core defects, security findings, safety hazards, or
critical bugs). Contextvars cleanup verified, thread safety confirmed,
idempotency guaranteed, and test coverage exceeds target.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
Signed-off-by: mvillmow <4211002+mvillmow@users.noreply.github.com>
@mvillmow mvillmow merged commit 3fdf919 into main Jun 29, 2026
13 of 14 checks passed
@mvillmow mvillmow deleted the 163-auto-impl branch June 29, 2026 08:46
Development

Successfully merging this pull request may close these issues.

[MAJOR] §9: No observability — no metrics, no tracing, no correlation IDs in logs
