feat(observability): cloud/host OTel attributes + VM-metrics collector #148

Merged: pedromvgomes merged 1 commit into main from feature/add-service-exposed-ports on Jun 28, 2026

@pedromvgomes

Summary

Two coupled slices that give Grafana Cloud cloud/host context, sharing one resource-attribute set so VM metrics and app telemetry correlate on host.id.

Slice 1 — Resource-attribute enrichment (ADR-0030)

Inject four more OTel resource attributes that only inforge knows at deploy time:

| Attribute | Value | Env var |
| --- | --- | --- |
| cloud.provider | hetzner (provider self-names) | INFORGE_CLOUD_PROVIDER |
| cloud.region | Hetzner network_zone | INFORGE_CLOUD_REGION |
| cloud.availability_zone | Hetzner location | INFORGE_CLOUD_AVAILABILITY_ZONE |
| host.type | server-type SKU | INFORGE_HOST_TYPE |
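
A minimal sketch of how the mapping in the table might be turned into environment entries, skipping any attribute that has no value. The type `CloudHostAttrs`, the function `buildCloudEnv`, and the sample values are hypothetical; the PR description does not show the real field names or the body of `buildEnv`.

```go
package main

import (
	"fmt"
	"sort"
)

// CloudHostAttrs is a hypothetical stand-in for the four provider-supplied
// values; the real struct and field names are not shown in the PR.
type CloudHostAttrs struct {
	CloudProvider         string // cloud.provider
	CloudRegion           string // cloud.region (Hetzner network_zone)
	CloudAvailabilityZone string // cloud.availability_zone (Hetzner location)
	HostType              string // host.type (server-type SKU)
}

// buildCloudEnv returns KEY=VALUE entries for the non-empty attributes only,
// sorted so that the result is deterministic.
func buildCloudEnv(a CloudHostAttrs) []string {
	pairs := map[string]string{
		"INFORGE_CLOUD_PROVIDER":          a.CloudProvider,
		"INFORGE_CLOUD_REGION":            a.CloudRegion,
		"INFORGE_CLOUD_AVAILABILITY_ZONE": a.CloudAvailabilityZone,
		"INFORGE_HOST_TYPE":               a.HostType,
	}
	env := make([]string, 0, len(pairs))
	for k, v := range pairs {
		if v == "" {
			continue // an empty value means the variable is not emitted at all
		}
		env = append(env, k+"="+v)
	}
	sort.Strings(env)
	return env
}

func main() {
	// Sample values are illustrative only.
	env := buildCloudEnv(CloudHostAttrs{CloudProvider: "hetzner", HostType: "cx22"})
	for _, e := range env {
		fmt.Println(e)
	}
}
```

Omitting the variable, rather than emitting it with an empty value, lets the consumer distinguish "not known" from "known to be empty" without extra parsing.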
  • Provider-supplied fields on types.ComputeOutputs, populated in hetzner.Create(), read off the host in renderDescriptor, carried in bootstrapper.Deployment, emitted by buildEnv omit-if-empty.
  • host.name/os.type deliberately dropped (the process can self-detect them).
  • Descriptor schema v4 → v5 (forced: the strict KnownFields decoder makes any field addition a major bump; safe via the pinned-bootstrapper lockstep).
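
To illustrate why a strict decoder turns any field addition into a breaking change, here is a sketch that uses the standard library's `encoding/json` with `DisallowUnknownFields` as a stand-in for the YAML `KnownFields` behaviour the PR refers to. The struct `descriptorV4`, its fields, and the document contents are hypothetical; the real descriptor is YAML and its schema is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// descriptorV4 is a hypothetical old-schema struct that predates the new
// cloud/host fields.
type descriptorV4 struct {
	SchemaVersion int    `json:"schema_version"`
	Service       string `json:"service"`
}

// decodeStrict rejects any field that the target struct does not declare.
func decodeStrict(doc string, out any) error {
	dec := json.NewDecoder(strings.NewReader(doc))
	dec.DisallowUnknownFields()
	return dec.Decode(out)
}

func main() {
	// A newer writer adds one field; an older strict reader cannot accept it.
	newer := `{"schema_version":5,"service":"api","cloud_provider":"hetzner"}`
	var d descriptorV4
	err := decodeStrict(newer, &d)
	fmt.Println("old reader rejects new document:", err != nil)
}
```

Because an older reader fails outright instead of ignoring the extra field, writer and reader have to move together, which is consistent with the lockstep the PR relies on.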

Slice 2 — Host VM-metrics collector (ADR-0031)

  • New pure internal/otelcol package renders an off-the-shelf OpenTelemetry Collector Contrib config (hostmetrics → otlphttp) + the idempotent install shell (version-pinned .deb, checksum-verified, apt-get install keeping our config).
  • program.provisionObservability: always-on per-host pass, gated on env-level config (variables.yaml observability.otlp_endpoint + the observability/otlp_auth secret in secrets.enc.yaml). No endpoint → no-op; endpoint but no credential → deploy fails.
  • Agent runs unprivileged (process scraper off). Credential is base64'd, pulumi.ToSecret-wrapped (encrypted in state), written with mode 0600 and owned by the collector user, and referenced via the collector's ${file:…} provider (never inlined).
  • Stamps the same ADR-0030 attribute set + host.id.
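
The gating rule described above (no endpoint is a no-op, an endpoint without a credential fails the deploy) can be sketched as a small pure function. `ObservabilityConfig`, `Plan`, `planObservability`, and the sample credential are hypothetical names and values chosen for illustration; the Pulumi secret wrapping is left out because it needs the Pulumi SDK.

```go
package main

import (
	"encoding/base64"
	"errors"
	"fmt"
)

// ObservabilityConfig is a hypothetical view of the env-level settings.
type ObservabilityConfig struct {
	OTLPEndpoint string // variables.yaml: observability.otlp_endpoint
	OTLPAuth     string // secrets.enc.yaml: observability/otlp_auth
}

// ErrMissingCredential is returned when an endpoint is set without a credential.
var ErrMissingCredential = errors.New("observability.otlp_endpoint is set but observability/otlp_auth is missing")

// Plan says whether the per-host pass should do anything, and with what.
type Plan struct {
	Enabled           bool
	Endpoint          string
	EncodedCredential string // base64 of the raw secret
}

// planObservability applies the gate: no endpoint is a no-op, an endpoint
// without a credential is an error, and both together enable the collector.
func planObservability(c ObservabilityConfig) (Plan, error) {
	if c.OTLPEndpoint == "" {
		return Plan{}, nil
	}
	if c.OTLPAuth == "" {
		return Plan{}, ErrMissingCredential
	}
	return Plan{
		Enabled:           true,
		Endpoint:          c.OTLPEndpoint,
		EncodedCredential: base64.StdEncoding.EncodeToString([]byte(c.OTLPAuth)),
	}, nil
}

func main() {
	_, err := planObservability(ObservabilityConfig{OTLPEndpoint: "https://otlp.example.invalid"})
	fmt.Println("deploy fails without credential:", err != nil)
}
```

Keeping the decision separate from the provisioning side effects makes the three cases easy to unit-test without a live host.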

Notes / follow-ups

  • Consumer side is a separate PR in wardnet-cloud: a four-row addition to the (attribute, env_var) table in crates/common/src/telemetry.rs::resource() mapping the new vars. It's inert until inforge emits them, so this PR can merge first.
  • No new inforge binary → goreleaser/CI untouched (the off-the-shelf .deb is fetched onto the host).
  • Two runtime assumptions encoded from the packaging source but not yet exercised on a live host: the .deb's default unit loads /etc/otelcol-contrib/config.yaml, and the collector's ${file:…} provider reads the credential cleanly. A single inforge deploy against a test env would confirm both.
  • Branch name (add-service-exposed-ports) is stale: that work merged in #141 (feat(service): add exposed_ports for private-network port exposure); this branch now carries the observability slices.
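
For the two runtime assumptions noted above, it may help to see roughly what a rendered collector config could look like. The sketch below is not the internal/otelcol implementation: `CollectorInput`, `renderConfig`, the credential path, the choice of scrapers, and the `Basic` authorization header are all assumptions made for illustration. Only the `hostmetrics` receiver, `resource` processor, `otlphttp` exporter, and the `${file:…}` reference come from the collector itself or from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Attr is one optional resource attribute (for example cloud.region).
type Attr struct{ Key, Value string }

// CollectorInput is a hypothetical input type; the real package API is not
// shown in the PR description.
type CollectorInput struct {
	Endpoint       string // OTLP/HTTP endpoint (required)
	CredentialPath string // file holding the base64'd credential (required)
	HostID         string // host.id, the correlation key (required)
	Attrs          []Attr // optional attributes, rendered in the order given
}

// renderConfig builds the YAML by plain string concatenation, so identical
// input always yields byte-identical output.
func renderConfig(in CollectorInput) (string, error) {
	switch {
	case in.Endpoint == "":
		return "", errors.New("otelcol: endpoint is required")
	case in.CredentialPath == "":
		return "", errors.New("otelcol: credential path is required")
	case in.HostID == "":
		return "", errors.New("otelcol: host.id is required")
	}
	var b strings.Builder
	// The process scraper is simply not listed, so no extra privileges are needed.
	b.WriteString("receivers:\n  hostmetrics:\n    scrapers:\n")
	b.WriteString("      cpu:\n      memory:\n      filesystem:\n      network:\n")
	b.WriteString("processors:\n  resource:\n    attributes:\n")
	writeAttr(&b, "host.id", in.HostID)
	for _, a := range in.Attrs {
		if a.Value != "" { // omit-if-empty
			writeAttr(&b, a.Key, a.Value)
		}
	}
	b.WriteString("exporters:\n  otlphttp:\n")
	fmt.Fprintf(&b, "    endpoint: %q\n", in.Endpoint)
	// Assumption: the credential is sent as a Basic auth header and the
	// collector expands the ${file:...} reference at load time.
	fmt.Fprintf(&b, "    headers:\n      Authorization: \"Basic ${file:%s}\"\n", in.CredentialPath)
	b.WriteString("service:\n  pipelines:\n    metrics:\n")
	b.WriteString("      receivers: [hostmetrics]\n      processors: [resource]\n      exporters: [otlphttp]\n")
	return b.String(), nil
}

// writeAttr appends one upsert action for the resource processor.
func writeAttr(b *strings.Builder, key, value string) {
	fmt.Fprintf(b, "      - key: %s\n        value: %q\n        action: upsert\n", key, value)
}

func main() {
	cfg, err := renderConfig(CollectorInput{
		Endpoint:       "https://otlp.example.invalid/otlp",
		CredentialPath: "/etc/otelcol-contrib/otlp_auth", // hypothetical path
		HostID:         "12345",
		Attrs:          []Attr{{Key: "cloud.provider", Value: "hetzner"}, {Key: "cloud.region", Value: ""}},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(cfg)
}
```

Only the file reference appears in the rendered text, so the config itself can be logged or diffed without exposing the credential.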

Test plan

  • go build ./..., go test -race ./..., golangci-lint run ./... — all clean.
  • New unit tests: internal/otelcol (render determinism, omit-empty, required-field errors, install/credential/apply scripts); bootstrapper env + descriptor round-trip carry the new attributes; program descriptor round-trip asserts the cloud/host fields.
  • wardnet-cloud telemetry.rs tests updated (in the separate PR).

Merge Commit Message

feat(observability): cloud/host OTel attributes + VM-metrics collector (ADR-0030, ADR-0031)

https://claude.ai/code/session_017Kyd98NzojozMZ19d5UCZ2

Two coupled slices that give Grafana Cloud cloud/host context, sharing one
resource-attribute set so VM metrics and app telemetry correlate on host.id.

Resource-attribute enrichment (ADR-0030): inject four more OTel resource
attributes inforge alone knows at deploy — cloud.provider, cloud.region
(Hetzner network_zone), cloud.availability_zone (Hetzner location) and host.type
(server-type SKU). They are provider-supplied fields on types.ComputeOutputs,
read off the host in renderDescriptor, carried in bootstrapper.Deployment, and
emitted by buildEnv omit-if-empty. Descriptor schema bumps v4 -> v5 (the strict
KnownFields decoder makes any field addition a major bump).

Host VM-metrics collector (ADR-0031): new pure internal/otelcol package renders
an off-the-shelf OpenTelemetry Collector Contrib config (hostmetrics -> otlphttp)
and the idempotent install shell (version-pinned .deb, checksum-verified, apt
install keeping our config). program.provisionObservability is an always-on
per-host pass gated on env-level config (variables.yaml observability.otlp_endpoint
plus the observability/otlp_auth secret in secrets.enc.yaml). The agent runs
unprivileged (process scraper off); the credential is base64'd, ToSecret-wrapped
(encrypted in state), written 0600 to the collector user, and referenced from the
config via the collector's ${file:...} provider.

The consumer-side telemetry.rs mapping for the new attributes lands separately in
the wardnet-cloud repo.

Claude-Session: https://claude.ai/code/session_017Kyd98NzojozMZ19d5UCZ2
@pedromvgomes pedromvgomes merged commit 78ac27a into main Jun 28, 2026
2 checks passed
@pedromvgomes pedromvgomes deleted the feature/add-service-exposed-ports branch June 28, 2026 01:01