diff --git a/AGENTS.md b/AGENTS.md index d46fa89..0a1778b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -383,6 +383,37 @@ source. The grain is create-and-destroy (Hetzner bills a server until it is dele - **Network segregation** is a structural invariant — never peer Networks or share one across envs; see `.agents/rules/ephemeral-network-segregation.md`. +## Observability (ADR-0030, ADR-0031) + +Two coupled pieces give Grafana Cloud cloud/host context. Both stamp the **same** +resource-attribute set so VM metrics and app telemetry correlate on `host.id`. + +- **Resource-attribute enrichment (ADR-0030).** Beyond the #134 set, inforge injects four + more OTel attributes where it is the sole authority: `cloud.provider` + (`INFORGE_CLOUD_PROVIDER`), `cloud.region` (`INFORGE_CLOUD_REGION` = Hetzner `network_zone`), + `cloud.availability_zone` (`INFORGE_CLOUD_AVAILABILITY_ZONE` = Hetzner `location`), and + `host.type` (`INFORGE_HOST_TYPE` = server-type SKU). They are **provider-supplied** plain-string + fields on `types.ComputeOutputs` (`CloudProvider/CloudRegion/AvailabilityZone/MachineType`), + populated by `hetzner.Create()` (plan-time constants, no apply), read off the host in + `renderDescriptor`, carried in `bootstrapper.Deployment`, and emitted by `buildEnv` + **omit-if-empty** (a provider that doesn't supply one emits nothing). `host.name`/`os.type` + were deliberately dropped (self-detectable by the process). This bumped the descriptor to + **v5** (the strict `KnownFields` decoder makes any field addition a major bump). The consumer + side is a four-row addition to the `(attribute, env_var)` table in wardnet-cloud + `crates/common/src/telemetry.rs::resource()`. 
+- **Host VM-metrics collector (ADR-0031).** `internal/otelcol` (pure, Pulumi-free, like + `internal/nginx`) renders an off-the-shelf **OTel Collector Contrib** config (`hostmetrics` → + `otlphttp`) and the idempotent install shell (download the version-pinned `.deb`, verify the + release checksum, `apt-get install` the local file keeping our config on upgrade). The + `process` scraper is **off** so the agent runs **unprivileged** as the `.deb`'s `otelcol-contrib` + user. `program.provisionObservability` is an **always-on** per-host pass **gated on env-level + config**: `variables.yaml` `observability.otlp_endpoint` (non-secret) + the OTLP Basic-auth + credential in `secrets.enc.yaml` under the reserved `observability/otlp_auth` + (`otelcol.AuthSecret*`). With no endpoint it is a no-op; with an endpoint but no credential it + fails the deploy. The credential is base64'd, `pulumi.ToSecret`-wrapped (encrypted in state), + written `0600` owned by the collector user, and referenced from the config via the collector's + `${file:…}` provider (never inlined). The config stamps the ADR-0030 attribute set + `host.id`. + ## Conventions - **Provider binary names are load-bearing.** Pulumi locates plugins by the exact filename diff --git a/docs/adr/0030-otel-resource-attribute-enrichment.md b/docs/adr/0030-otel-resource-attribute-enrichment.md new file mode 100644 index 0000000..77fcb3c --- /dev/null +++ b/docs/adr/0030-otel-resource-attribute-enrichment.md @@ -0,0 +1,67 @@ +# OTel resource-attribute enrichment (cloud.*/host.*) + +Extends the #134 observability env-var contract so every service's telemetry +(metrics, traces, logs shipped to Grafana Cloud) is tagged with the cloud/host +resource attributes inforge alone knows at deploy time. Today the only host-level +attribute is `host.id` (`INFORGE_HOST_ID`); the rest of the deploy's ground truth +(provider, region, datacenter, machine size) is resolved inside the Hetzner +provider and discarded. 
+ +We inject **four** new resource attributes — and only four: the ones where inforge +is the *sole* authority, i.e. facts the running process provably cannot determine +itself. + +| OTel attribute | Value | Source | +|---|---|---| +| `cloud.provider` | `hetzner` | provider self-names | +| `cloud.region` | Hetzner `network_zone` (e.g. `us-east`) | provider | +| `cloud.availability_zone` | Hetzner `location` (e.g. `ash`) | provider | +| `host.type` | server-type SKU (e.g. `cx23`) | provider | + +## Considered options + +- **`host.name` and `os.type` — rejected.** `host.name` (the OS hostname) is + readable by the process via `gethostname()` and would duplicate the already-injected + `host.id` (the unique cloud resource name). `os.type` is always `linux` and is + trivially self-detectable (the OTel SDK ships an OS resource detector). Neither is + inforge's unique knowledge, so injecting them adds contract surface for no value. +- **`cloud.region` = Hetzner `network_zone`, not the inforge abstract region.** The + abstract region (`us-east-1`) is already in `INFORGE_DEPLOYMENT_REGION` and mapping + *that* would be zero new work, but it is inforge's portable abstraction, not the + provider's region — and it would pair a `cloud.region` from one naming system with a + `cloud.availability_zone` from another. OTel semconv defines `cloud.region` as the + *provider's* region; the provider-native hierarchy `network_zone ⊃ location` maps + exactly onto `cloud.region ⊃ cloud.availability_zone`. We pay one extra provider-sourced + string to keep both `cloud.*` geo attributes consistent and semantically correct. + +## How + +Provider-supplied facts reach the descriptor through `types.ComputeOutputs` — the +existing provider→program boundary object, already keyed per-host in `computeOut`. +It gains four plain-`string` fields (`CloudProvider`, `CloudRegion`, +`AvailabilityZone`, `MachineType`), populated by `hetzner.Create()` (plan-time +constants, no Pulumi apply). 
`renderDescriptor` reads them off the host's +`ComputeOutputs` and writes them into `bootstrapper.Deployment`; `buildEnv` emits +`INFORGE_CLOUD_PROVIDER`, `INFORGE_CLOUD_REGION`, `INFORGE_CLOUD_AVAILABILITY_ZONE`, +and `INFORGE_HOST_TYPE` (omitting any that are empty, so a future non-Hetzner provider +that doesn't supply them emits nothing). This keeps `renderDescriptor` provider-agnostic +and makes `cloud.provider` self-named rather than a hardcoded constant; it also makes +**global services correct for free**, since a global host carries its real placement in +its own `ComputeOutputs` rather than a recomputed abstraction. + +The four new env-var names are already protected by the reserved `INFORGE_*` prefix, +so no validation change is needed. + +## Consequences + +- **Descriptor version bumps 4 → 5.** The bootstrapper decodes descriptors strictly + (`KnownFields(true)`), so any field addition is a breaking change for an older + bootstrapper — a major bump is forced, not a judgment call. It is safe because the + pinned `inforge-bootstrap` binary and the descriptor are written by the same deploy + (identical lockstep to the #134 v3→v4 bump). +- **Cross-repo, decoupled by the env-var contract.** The consumer side is a four-row + addition to the `(attribute, env_var)` table in wardnet-cloud + `crates/common/src/telemetry.rs::resource()`; each row is best-effort (a missing/empty + var is omitted), so inforge can ship and deploy first and the attributes start + populating once wardnet-cloud picks them up. The collector of ADR-0031 reuses this + exact attribute set so host metrics correlate with app telemetry on `host.id`. 
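Reviewer note on the consumer side of ADR-0030: the best-effort `(attribute, env_var)` table described above can be sketched as follows. This is a minimal Go illustration, not the wardnet-cloud implementation (which lives in Rust in `telemetry.rs::resource()`); the names `attrEnv` and `resourceFromEnv` are hypothetical, and only the attribute and env-var names are taken from the ADR.

```go
package main

import (
	"fmt"
	"sort"
)

// attrEnv pairs each OTel resource attribute with the env var inforge injects.
// The pairs come from ADR-0030 plus the existing host.id contract; the table
// shape is a hypothetical stand-in for the consumer's real one.
var attrEnv = [][2]string{
	{"host.id", "INFORGE_HOST_ID"},
	{"cloud.provider", "INFORGE_CLOUD_PROVIDER"},
	{"cloud.region", "INFORGE_CLOUD_REGION"},
	{"cloud.availability_zone", "INFORGE_CLOUD_AVAILABILITY_ZONE"},
	{"host.type", "INFORGE_HOST_TYPE"},
}

// resourceFromEnv (hypothetical) builds the attribute set from a lookup
// function such as os.LookupEnv. A missing or empty variable is skipped, so
// each row is best-effort and an older deploy simply yields fewer attributes.
func resourceFromEnv(lookup func(string) (string, bool)) map[string]string {
	attrs := make(map[string]string, len(attrEnv))
	for _, row := range attrEnv {
		if v, ok := lookup(row[1]); ok && v != "" {
			attrs[row[0]] = v
		}
	}
	return attrs
}

func main() {
	// Example environment: the availability zone is absent and the host type
	// is blank, so both are omitted from the result.
	env := map[string]string{
		"INFORGE_HOST_ID":        "wardnet-prd-use1-vm-bridge-01",
		"INFORGE_CLOUD_PROVIDER": "hetzner",
		"INFORGE_CLOUD_REGION":   "us-east",
		"INFORGE_HOST_TYPE":      "",
	}
	attrs := resourceFromEnv(func(k string) (string, bool) {
		v, ok := env[k]
		return v, ok
	})

	// Print in sorted key order so the output is stable.
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, attrs[k])
	}
}
```

Taking the lookup as a parameter keeps the sketch testable without touching the process environment; in a real process `os.LookupEnv` would be passed instead.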
diff --git a/docs/adr/0031-host-vm-metrics-collector.md b/docs/adr/0031-host-vm-metrics-collector.md new file mode 100644 index 0000000..d24c741 --- /dev/null +++ b/docs/adr/0031-host-vm-metrics-collector.md @@ -0,0 +1,66 @@ +# Host VM-metrics collector + +Application telemetry (ADR-0030, #134) tells us nothing about the VM itself: CPU, +memory, disk, network. A running service cannot report its host's resource usage, and +inforge owns the host, so collecting VM metrics is inforge's responsibility. We install +an off-the-shelf **OpenTelemetry Collector** on every VM, scraping host metrics and +exporting them OTLP/HTTP to the same Grafana Cloud endpoint the services use, tagged +with the **same** resource-attribute set as ADR-0030 so host metrics and app telemetry +correlate on `host.id`. + +## Decisions + +- **OTel Collector with the `hostmetrics` receiver → `otlphttp` exporter**, not Grafana + Alloy or Prometheus `node_exporter`. It reuses the exact protocol, endpoint, Basic + auth, and OTel resource-attribute model the services already use; Alloy adds a second + config dialect and integrations we don't need, and `node_exporter` is the wrong + protocol/model (Prom remote_write, label-based, no OTel resource attributes). +- **Off-the-shelf official `.deb` via apt, not a binary we build.** The OpenTelemetry + Collector Contrib `.deb` ships the binary, a system user, and a systemd unit, and + handles upgrades. A custom minimal collector via `ocb` would be more inforge-idiomatic + (tiny static binary, checksum-verified download) but is build/maintenance overhead for + a solved problem. inforge owns only the rendered config and the credential. Trade-off: + we depend on upstream packaging (user, unit, paths) rather than a build we control. The + `.deb` itself is version-pinned and checksum-verified before `apt-get install` of the + local file, keeping inforge's raw-download invariant (`bootstrapDownloadStep`).
+- **Always-on, gated on env-level observability config.** If the env defines the OTLP + endpoint (`variables.yaml`) and auth token (`secrets.enc.yaml`), every VM inforge + provisions gets the collector; otherwise none is installed (it would have nowhere to + export). No per-compute opt-in flag — host metrics are wanted uniformly, and this + matches "we own the host." +- **Unprivileged agent; `process` scraper off.** The host signals we want (cpu, memory, + load, disk, filesystem, network throughput) come from world-readable `/proc` and + `/sys` — no root needed. The agent runs as the `.deb`'s unprivileged service user. Only + the `process` scraper (per-process inventory across all users) needs root or + `CAP_SYS_PTRACE`/`CAP_DAC_READ_SEARCH`; it stays **off** by default and is a separate + opt-in if per-process metrics are ever wanted. +- **Credential delivered as an agent-user `0600` file, not at runtime.** The host agent + is not a service: it has no per-service Infisical identity, no descriptor, no `files:` + projection, and must start on boot independently. So the env-level OTLP token + (decrypted once per deploy from `secrets.enc.yaml`) is written to the host over SSH + (`command.remote`, the same transport that places descriptors/units/nginx config) into + a `0600` file owned by the agent's service user, referenced by the collector's + Basic-auth header. + +## Considered options for the credential + +- **`systemd-creds` encrypted-at-rest — rejected.** It keeps plaintext off disk by + decrypting into a tmpfs cred dir at start, but it binds to a TPM2 when present, else to + a host master key (`/var/lib/systemd/credential.secret`) that is itself on disk. + Hetzner Cloud VMs don't expose a vTPM, so "encrypted at rest" degrades to "encrypted + with an on-disk key" — no meaningful defense against the disk/root-exfil threat the + tmpfs posture targets. 
And it costs three integration points (a unit drop-in, an + on-host `systemd-creds encrypt` step at deploy, and a wrapper to bridge the credential + *file* into the collector's env/config-based `basicauth`). The security gain mostly + evaporates without a TPM, so we take the simpler `0600` file. (If TPM-backed hosts + ever exist, revisit.) + +## Consequences + +- A persisted secret on the host diverges from the services' tmpfs-only, + never-on-disk secret posture — accepted because the agent is host infrastructure that + must boot before any inforge interaction, and the file is `0600` to the agent user only. +- New pure package `internal/otelcol` (stdlib-only, like `internal/nginx`) renders the + collector config from the host's resource attributes + endpoint and owns the on-host + path scheme; a `provisionObservability` pass in `program.go` installs/configures the + agent per host, gated on env-level observability config. diff --git a/internal/bootstrapper/descriptor.go b/internal/bootstrapper/descriptor.go index 56b3b5b..f03ffc5 100644 --- a/internal/bootstrapper/descriptor.go +++ b/internal/bootstrapper/descriptor.go @@ -28,10 +28,11 @@ import ( // producer and consumer can never disagree on the schema version. Because parsing // is strict (KnownFields), any field addition is a breaking change for an older // reader, so it bumps this major: v2 added the Deployment block, v3 added the -// Files map, and v4 swapped Deployment.Namespace for Deployment.HostID, so an older -// bootstrapper meeting a newer descriptor fails cleanly on the version rather than -// on an unknown field. -const SupportedVersion = 4 +// Files map, and v4 swapped Deployment.Namespace for Deployment.HostID; v5 added the +// cloud/host resource-identity fields (CloudProvider/CloudRegion/AvailabilityZone/ +// MachineType, ADR-0030). An older bootstrapper meeting a newer descriptor fails +// cleanly on the version rather than on an unknown field. 
+const SupportedVersion = 5 // Descriptor is the versioned, secret-free on-host contract inforge writes to // /etc/wardnet/services//descriptor.yaml (0644 root). It names the service, @@ -70,6 +71,14 @@ type Deployment struct { // It is stable per host (does NOT change across restarts), so it is injected as // INFORGE_HOST_ID — the OTel host.id resource attribute. HostID string `yaml:"host_id"` + // The four fields below are provider-supplied cloud/host resource identity + // (ADR-0030), injected as INFORGE_CLOUD_*/INFORGE_HOST_TYPE → the OTel + // cloud.provider/cloud.region/cloud.availability_zone/host.type attributes. + // Each is omitempty: a provider that does not supply one writes nothing. + CloudProvider string `yaml:"cloud_provider,omitempty"` // cloud.provider, e.g. "hetzner" + CloudRegion string `yaml:"cloud_region,omitempty"` // cloud.region, e.g. "us-east" + AvailabilityZone string `yaml:"availability_zone,omitempty"` // cloud.availability_zone, e.g. "ash" + MachineType string `yaml:"machine_type,omitempty"` // host.type, e.g. "cx23" } // LoadDescriptor reads and parses the descriptor at path. 
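Reviewer note on the version bump: the "fail cleanly on the version rather than on an unknown field" behaviour can be illustrated with a small sketch. It makes two loud assumptions. First, it uses the standard library's `encoding/json` with `DisallowUnknownFields` as a stand-in for the YAML decoder's `KnownFields(true)`, so it runs without third-party modules. Second, the lenient version pre-pass before the strict decode is a guess at one way to order the checks; the real `ParseDescriptor` is not shown in this diff and may differ. The names `descriptor`, `parse`, and `supportedVersion` are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// supportedVersion mirrors the idea of SupportedVersion in the diff.
const supportedVersion = 5

// descriptor is a trimmed, hypothetical stand-in for the on-host descriptor.
type descriptor struct {
	Version int    `json:"version"`
	Service string `json:"service"`
}

// parse checks the version leniently first, then decodes strictly. An older
// reader meeting a newer document therefore reports a version error instead
// of an unknown-field error.
func parse(doc []byte) (descriptor, error) {
	// Lenient pre-pass: only the version is read; other fields are ignored.
	var head struct {
		Version int `json:"version"`
	}
	if err := json.Unmarshal(doc, &head); err != nil {
		return descriptor{}, fmt.Errorf("read version: %w", err)
	}
	if head.Version != supportedVersion {
		return descriptor{}, fmt.Errorf("unsupported descriptor version %d (want %d)",
			head.Version, supportedVersion)
	}

	// Strict pass: any field the struct does not declare is an error.
	dec := json.NewDecoder(bytes.NewReader(doc))
	dec.DisallowUnknownFields()
	var d descriptor
	if err := dec.Decode(&d); err != nil {
		return descriptor{}, fmt.Errorf("decode descriptor: %w", err)
	}
	return d, nil
}

func main() {
	docs := []string{
		`{"version": 5, "service": "ghost"}`,                       // accepted
		`{"version": 6, "service": "ghost", "new_field": "x"}`,     // rejected on version
		`{"version": 5, "service": "ghost", "unexpected": "oops"}`, // rejected on field
	}
	for _, doc := range docs {
		d, err := parse([]byte(doc))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("ok:", d.Service)
	}
}
```

Because strict decoding rejects any undeclared field, adding even optional fields forces the version bump; checking the version first only changes which error an older reader reports.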
diff --git a/internal/bootstrapper/descriptor_test.go b/internal/bootstrapper/descriptor_test.go index 09fb397..26d9ca2 100644 --- a/internal/bootstrapper/descriptor_test.go +++ b/internal/bootstrapper/descriptor_test.go @@ -7,7 +7,7 @@ import ( "github.com/stretchr/testify/require" ) -const validDescriptor = `version: 4 +const validDescriptor = `version: 5 service: ghost exec: /srv/wardnet/ghost/run user: ghost @@ -59,9 +59,9 @@ func TestParseDescriptorRejectsUnknownField(t *testing.T) { func TestParseDescriptorRequiresFields(t *testing.T) { cases := map[string]string{ - "service": "version: 4\nexec: /x\nuser: ghost\nprovider:\n kind: infisical\n", - "exec": "version: 4\nservice: ghost\nuser: ghost\nprovider:\n kind: infisical\n", - "user": "version: 4\nservice: ghost\nexec: /x\nprovider:\n kind: infisical\n", + "service": "version: 5\nexec: /x\nuser: ghost\nprovider:\n kind: infisical\n", + "exec": "version: 5\nservice: ghost\nuser: ghost\nprovider:\n kind: infisical\n", + "user": "version: 5\nservice: ghost\nexec: /x\nprovider:\n kind: infisical\n", } for missing, doc := range cases { _, err := ParseDescriptor([]byte(doc)) @@ -72,7 +72,7 @@ func TestParseDescriptorRequiresFields(t *testing.T) { // TestParseDescriptorSecretLess: a descriptor with no provider is a secret-less // service — valid as long as it carries no env mapping. func TestParseDescriptorSecretLess(t *testing.T) { - doc := "version: 4\nservice: ghost\nexec: /x\nuser: ghost\n" + doc := "version: 5\nservice: ghost\nexec: /x\nuser: ghost\n" d, err := ParseDescriptor([]byte(doc)) require.NoError(t, err) assert.Equal(t, "", d.Provider.Kind) @@ -82,7 +82,7 @@ func TestParseDescriptorSecretLess(t *testing.T) { // TestParseDescriptorRejectsEnvWithoutProvider: env with no provider is a // producer bug — there is nothing to resolve the keys against. 
func TestParseDescriptorRejectsEnvWithoutProvider(t *testing.T) { - doc := "version: 4\nservice: ghost\nexec: /x\nuser: ghost\nenv:\n DATABASE_URL: infra/DATABASE_URL\n" + doc := "version: 5\nservice: ghost\nexec: /x\nuser: ghost\nenv:\n DATABASE_URL: infra/DATABASE_URL\n" _, err := ParseDescriptor([]byte(doc)) require.Error(t, err) assert.Contains(t, err.Error(), "provider.kind is empty") diff --git a/internal/bootstrapper/env.go b/internal/bootstrapper/env.go index 4cca39d..ade8ed6 100644 --- a/internal/bootstrapper/env.go +++ b/internal/bootstrapper/env.go @@ -13,7 +13,8 @@ const minimalPATH = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bi // ReservedEnvPrefix is the environment-variable namespace inforge owns and // injects itself (the deployment/observability context: INFORGE_DEPLOYMENT_*, -// INFORGE_SERVICE_NAMESPACE, INFORGE_INSTANCE_ID, INFORGE_HOST_ID). A service must +// INFORGE_SERVICE_NAMESPACE, INFORGE_INSTANCE_ID, INFORGE_HOST_ID, INFORGE_HOST_TYPE, +// INFORGE_CLOUD_*). A service must // not map a secret to a name under this prefix — validation rejects it up front, // and buildEnv rejects it as a backstop — so an injected value can never silently // shadow (or be shadowed by) a service secret. @@ -58,6 +59,20 @@ func buildEnv(d Descriptor, secrets map[string]string, home, instanceID string) "INFORGE_DEPLOYMENT_FQDN="+dpl.FQDN, "INFORGE_HOST_ID="+dpl.HostID, ) + // Provider-supplied cloud/host resource identity (ADR-0030) → OTel + // cloud.provider/cloud.region/cloud.availability_zone/host.type. Unlike the + // always-present block above, each is omitted when empty so a provider that + // does not supply it emits nothing (the consumer omits empty attrs anyway). 
+ for _, kv := range [][2]string{ + {"INFORGE_CLOUD_PROVIDER", dpl.CloudProvider}, + {"INFORGE_CLOUD_REGION", dpl.CloudRegion}, + {"INFORGE_CLOUD_AVAILABILITY_ZONE", dpl.AvailabilityZone}, + {"INFORGE_HOST_TYPE", dpl.MachineType}, + } { + if kv[1] != "" { + env = append(env, kv[0]+"="+kv[1]) + } + } } names := make([]string, 0, len(d.Env)) diff --git a/internal/bootstrapper/env_test.go b/internal/bootstrapper/env_test.go index d082f9f..bd75684 100644 --- a/internal/bootstrapper/env_test.go +++ b/internal/bootstrapper/env_test.go @@ -62,18 +62,26 @@ func TestBuildEnvDeployment(t *testing.T) { Service: "bridge", User: "bridge", Deployment: Deployment{ - Region: "us-east-1", - RegionSlug: "use1", - Environment: "prd", - BaseDomain: "wardnet.network", - FQDN: "bridge.svc.prd.use1.wardnet.network", - HostID: "wardnet-prd-use1-vm-bridge-01", + Region: "us-east-1", + RegionSlug: "use1", + Environment: "prd", + BaseDomain: "wardnet.network", + FQDN: "bridge.svc.prd.use1.wardnet.network", + HostID: "wardnet-prd-use1-vm-bridge-01", + CloudProvider: "hetzner", + CloudRegion: "us-east", + AvailabilityZone: "ash", + MachineType: "cx23", }, } env, err := buildEnv(d, nil, "/home/bridge", "inst-9") require.NoError(t, err) + assert.Contains(t, env, "INFORGE_CLOUD_PROVIDER=hetzner") + assert.Contains(t, env, "INFORGE_CLOUD_REGION=us-east") + assert.Contains(t, env, "INFORGE_CLOUD_AVAILABILITY_ZONE=ash") + assert.Contains(t, env, "INFORGE_HOST_TYPE=cx23") assert.Contains(t, env, "INFORGE_DEPLOYMENT_REGION=us-east-1") assert.Contains(t, env, "INFORGE_DEPLOYMENT_REGION_SLUG=use1") assert.Contains(t, env, "INFORGE_DEPLOYMENT_ENV=prd") @@ -90,6 +98,28 @@ func TestBuildEnvDeployment(t *testing.T) { } } +// TestBuildEnvOmitsEmptyCloudAttrs: a deployment whose provider did not supply the +// cloud/host identity (the fields are empty) emits no INFORGE_CLOUD_*/INFORGE_HOST_TYPE +// vars at all — they are omitted, not emitted blank (the always-present deployment +// block still appears). 
+func TestBuildEnvOmitsEmptyCloudAttrs(t *testing.T) { + d := Descriptor{ + Service: "bridge", + User: "bridge", + Deployment: Deployment{Region: "us-east-1", HostID: "wardnet-prd-use1-vm-bridge-01"}, + } + + env, err := buildEnv(d, nil, "/home/bridge", "inst-9") + require.NoError(t, err) + + assert.Contains(t, env, "INFORGE_HOST_ID=wardnet-prd-use1-vm-bridge-01") + for _, e := range env { + for _, name := range []string{"INFORGE_CLOUD_PROVIDER", "INFORGE_CLOUD_REGION", "INFORGE_CLOUD_AVAILABILITY_ZONE", "INFORGE_HOST_TYPE"} { + assert.False(t, strings.HasPrefix(e, name+"="), "%s must be omitted when empty", name) + } + } +} + // TestBuildEnvRejectsReservedName: a secret mapped to a reserved INFORGE_* name // must fail the start rather than emit a duplicate that collides with the injected // deployment context. diff --git a/internal/otelcol/config.go b/internal/otelcol/config.go new file mode 100644 index 0000000..0435b32 --- /dev/null +++ b/internal/otelcol/config.go @@ -0,0 +1,111 @@ +package otelcol + +import ( + "fmt" + + "gopkg.in/yaml.v3" +) + +// ServiceName is the OTel service.name stamped on host metrics. It is fixed (host +// metrics are one logical "service" per fleet) and distinct from any app service; +// correlation with app telemetry is on host.id, not service.name. +const ServiceName = "wardnet-host-metrics" + +// CollectionInterval is how often the hostmetrics receiver scrapes. +const CollectionInterval = "60s" + +// hostScrapers are the host-level signals collected. Each reads world-readable +// /proc or /sys, so the collector needs no privilege (ADR-0031). The `process` +// scraper — per-process inventory across all users, which needs root or +// CAP_SYS_PTRACE — is deliberately omitted; adding it is a separate opt-in. +var hostScrapers = []string{"cpu", "load", "memory", "disk", "filesystem", "network", "paging"} + +// Attributes is the resource identity stamped on every host metric. 
It is the same +// set inforge injects into app telemetry (ADR-0030) so host metrics and app +// telemetry correlate on host.id. HostID is required; the rest are best-effort — +// an empty field is omitted (a provider that did not supply it stamps nothing). +type Attributes struct { + HostID string // host.id + CloudProvider string // cloud.provider + CloudRegion string // cloud.region + AvailabilityZone string // cloud.availability_zone + MachineType string // host.type + Environment string // deployment.environment.name + RegionSlug string // region +} + +// Render builds the collector config: a hostmetrics receiver, a resource processor +// stamping attrs, and an otlphttp exporter to endpoint authenticated with the +// Basic-auth value read from CredentialPath via the collector's ${file:…} provider +// (so the secret is never in the config text). endpoint and attrs.HostID are +// required. +func Render(endpoint string, attrs Attributes) (string, error) { + if endpoint == "" { + return "", fmt.Errorf("otelcol: empty OTLP endpoint") + } + if attrs.HostID == "" { + return "", fmt.Errorf("otelcol: empty host id") + } + + scrapers := map[string]any{} + for _, s := range hostScrapers { + scrapers[s] = map[string]any{} + } + + // service.name is always present; the rest are appended only when non-empty so a + // missing attribute is omitted rather than stamped blank. 
+ resourceAttrs := []map[string]any{upsert("service.name", ServiceName)} + for _, kv := range [][2]string{ + {"host.id", attrs.HostID}, + {"host.type", attrs.MachineType}, + {"cloud.provider", attrs.CloudProvider}, + {"cloud.region", attrs.CloudRegion}, + {"cloud.availability_zone", attrs.AvailabilityZone}, + {"deployment.environment.name", attrs.Environment}, + {"region", attrs.RegionSlug}, + } { + if kv[1] != "" { + resourceAttrs = append(resourceAttrs, upsert(kv[0], kv[1])) + } + } + + cfg := map[string]any{ + "receivers": map[string]any{ + "hostmetrics": map[string]any{ + "collection_interval": CollectionInterval, + "scrapers": scrapers, + }, + }, + "processors": map[string]any{ + "resource": map[string]any{"attributes": resourceAttrs}, + "batch": map[string]any{}, + }, + "exporters": map[string]any{ + "otlphttp": map[string]any{ + "endpoint": endpoint, + "headers": map[string]any{ + "Authorization": fmt.Sprintf("Basic ${file:%s}", CredentialPath), + }, + }, + }, + "service": map[string]any{ + "pipelines": map[string]any{ + "metrics": map[string]any{ + "receivers": []string{"hostmetrics"}, + "processors": []string{"resource", "batch"}, + "exporters": []string{"otlphttp"}, + }, + }, + }, + } + + b, err := yaml.Marshal(cfg) + if err != nil { + return "", fmt.Errorf("otelcol: marshal config: %w", err) + } + return string(b), nil +} + +func upsert(key, value string) map[string]any { + return map[string]any{"key": key, "value": value, "action": "upsert"} +} diff --git a/internal/otelcol/config_test.go b/internal/otelcol/config_test.go new file mode 100644 index 0000000..75a3d7d --- /dev/null +++ b/internal/otelcol/config_test.go @@ -0,0 +1,120 @@ +package otelcol + +import ( + "strings" + "testing" + + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" + "gopkg.in/yaml.v3" +) + +func fullAttrs() Attributes { + return Attributes{ + HostID: "wardnet-prd-use1-vm-bridge-01", + CloudProvider: "hetzner", + CloudRegion: "us-east", + 
AvailabilityZone: "ash", + MachineType: "cx23", + Environment: "prd", + RegionSlug: "use1", + } +} + +func TestRenderParsesAndCarriesAttributes(t *testing.T) { + out, err := Render("https://otlp.example/v1", fullAttrs()) + require.NoError(t, err) + + // It must be valid YAML. + var cfg map[string]any + require.NoError(t, yaml.Unmarshal([]byte(out), &cfg)) + + // Endpoint and the file-provider auth header (secret never inlined). + assert.Contains(t, out, "endpoint: https://otlp.example/v1") + assert.Contains(t, out, "Authorization: Basic ${file:"+CredentialPath+"}") + assert.NotContains(t, out, "instanceID") + + // Every resource attribute is present, including the shared ADR-0030 set. + for _, want := range []string{ + "service.name", ServiceName, + "host.id", "wardnet-prd-use1-vm-bridge-01", + "host.type", "cx23", + "cloud.provider", "hetzner", + "cloud.region", "us-east", + "cloud.availability_zone", "ash", + "deployment.environment.name", "prd", + "region", "use1", + } { + assert.Contains(t, out, want) + } + + // The hostmetrics receiver is wired with the host scrapers and no `process`. + assert.Contains(t, out, "hostmetrics") + assert.NotContains(t, out, "process:") +} + +func TestRenderOmitsEmptyAttributes(t *testing.T) { + // Only the required host id is set; the provider supplied nothing else. + out, err := Render("https://otlp.example/v1", Attributes{HostID: "node-1"}) + require.NoError(t, err) + + assert.Contains(t, out, "node-1") + for _, absent := range []string{"cloud.provider", "cloud.region", "cloud.availability_zone", "host.type"} { + assert.NotContains(t, out, absent, "an empty attribute must be omitted") + } + // service.name is always stamped. 
+ assert.Contains(t, out, ServiceName) +} + +func TestRenderIsDeterministic(t *testing.T) { + a, err := Render("https://otlp.example/v1", fullAttrs()) + require.NoError(t, err) + b, err := Render("https://otlp.example/v1", fullAttrs()) + require.NoError(t, err) + assert.Equal(t, a, b, "render must be byte-stable for a stable resource graph") +} + +func TestRenderRequiresEndpointAndHostID(t *testing.T) { + _, err := Render("", fullAttrs()) + require.Error(t, err) + assert.Contains(t, err.Error(), "endpoint") + + _, err = Render("https://otlp.example/v1", Attributes{}) + require.Error(t, err) + assert.Contains(t, err.Error(), "host id") +} + +func TestInstallScriptPinsVersionAndVerifies(t *testing.T) { + s := InstallScript("0.155.0") + assert.Contains(t, s, "ver='0.155.0'") + assert.Contains(t, s, "otelcol-contrib_${ver}_linux_${arch}.deb") + assert.Contains(t, s, "releases/download/v0.155.0") + assert.Contains(t, s, ChecksumsAsset) + // Verifies the checksum and installs the local .deb keeping our config. + assert.Contains(t, s, "checksum mismatch") + assert.Contains(t, s, "apt-get install -y") + assert.Contains(t, s, "--force-confold") + // Idempotent skip when already at this version. + assert.Contains(t, s, "already installed") +} + +func TestCredentialScriptIsOwnedAndPrivate(t *testing.T) { + s := CredentialScript("YWJjOnRrbg==") + assert.Contains(t, s, CredentialPath) + assert.Contains(t, s, "-m '0600'") + assert.Contains(t, s, "chown '"+User+":"+User+"'") + // The value travels base64'd, decoded on the host — never as plaintext argv. 
+ assert.Contains(t, s, "base64 -d") +} + +func TestApplyScriptEnablesAndRestarts(t *testing.T) { + cfg, err := Render("https://otlp.example/v1", fullAttrs()) + require.NoError(t, err) + s := ApplyScript(cfg) + assert.Contains(t, s, ConfigPath) + assert.Contains(t, s, "systemctl enable '"+Service+"'") + assert.Contains(t, s, "systemctl restart '"+Service+"'") + // The config body travels base64-encoded (decoded on the host), so the rendered + // bytes appear as their base64, not as plaintext. + assert.True(t, strings.Contains(s, base64Encode(cfg))) +} diff --git a/internal/otelcol/install.go b/internal/otelcol/install.go new file mode 100644 index 0000000..b686aaf --- /dev/null +++ b/internal/otelcol/install.go @@ -0,0 +1,90 @@ +package otelcol + +import ( + "fmt" + "strings" +) + +// InstallScript renders the idempotent shell that installs the off-the-shelf +// otelcol-contrib .deb at the given version onto a host (ADR-0031). It mirrors the +// inforge-bootstrap download step: detect the host arch, pin the version, download +// the .deb + release checksums, verify the sha256, then apt-install the local .deb +// (apt runs the package's postinstall, creating the otelcol-contrib user and the +// systemd unit). It is a no-op when the pinned version is already installed. +// +// The install does NOT write the collector config or enable the unit — that is the +// caller's apply step, which depends on this one. apt is told to keep our config on +// a package upgrade (--force-confold) so a later .deb bump never reverts the config +// inforge wrote. +// +// version carries no leading "v" (e.g. "0.155.0"); it is single-quoted into a shell +// var so it is injection-safe while composing with the host-side ${arch} expansion. +func InstallScript(version string) string { + base := DownloadBaseURL(version) + return strings.Join([]string{ + "set -e", + "ver=" + shQuote(version), + // Skip entirely when this exact version is already installed. 
+ `if [ "$(dpkg-query -W -f='${Version}' ` + Package + ` 2>/dev/null || true)" = "$ver" ]; then`, + fmt.Sprintf(` echo "%s $ver already installed"; exit 0`, Package), + "fi", + "arch=$(uname -m)", + "case \"$arch\" in", + " x86_64) arch=amd64 ;;", + " aarch64) arch=arm64 ;;", + " *) echo \"unsupported host arch: $arch\" >&2; exit 1 ;;", + "esac", + fmt.Sprintf(`asset="%s_${ver}_linux_${arch}.deb"`, Package), + fmt.Sprintf("base=%s", shQuote(base)), + fmt.Sprintf("sums_name=%s", shQuote(ChecksumsAsset)), + "tmp=$(mktemp --suffix=.deb)", + "sums=$(mktemp)", + `trap 'rm -f "$tmp" "$sums"' EXIT`, + `curl -fsSL "${base}/${asset}" -o "$tmp"`, + `curl -fsSL "${base}/${sums_name}" -o "$sums"`, + // Pull the expected sha256 for exactly this asset; an absent line is fatal. + `want=$(awk -v f="$asset" '$2==f {print $1}' "$sums")`, + `[ -n "$want" ] || { echo "no checksum for $asset in release" >&2; exit 1; }`, + `got=$(sha256sum "$tmp" | awk '{print $1}')`, + `[ "$want" = "$got" ] || { echo "checksum mismatch for $asset" >&2; exit 1; }`, + // Install the local .deb. --force-confold keeps the config file inforge owns + // across a package upgrade; noninteractive avoids any conffile prompt. + `sudo DEBIAN_FRONTEND=noninteractive apt-get install -y ` + + `-o Dpkg::Options::=--force-confdef -o Dpkg::Options::=--force-confold "$tmp"`, + }, "\n") +} + +// ApplyScript writes the rendered config (0644, root) and restarts the collector. The +// credential file is written separately by the caller (it carries a secret and is +// chown'd to the collector user); this enables the unit and restarts it so a changed +// config or credential takes effect. Restart (not reload) is used because a changed +// ${file:…} credential is only re-read on start.
+func ApplyScript(config string) string { + return strings.Join([]string{ + "set -e", + WriteFileScript(ConfigPath, config, "0644", "root:root"), + fmt.Sprintf("sudo systemctl enable %s", shQuote(Service)), + fmt.Sprintf("sudo systemctl restart %s", shQuote(Service)), + }, "\n") +} + +// CredentialScript writes the base64 OTLP Basic-auth value to CredentialPath as a +// 0600 file owned by the collector user, so only the collector can read it. The +// value is supplied already base64-encoded by the caller. +func CredentialScript(authB64 string) string { + return WriteFileScript(CredentialPath, authB64, "0600", User+":"+User) +} + +// WriteFileScript renders sudo shell that writes content to absPath with the given +// mode and owner, creating the parent dir. content is base64-encoded for transport +// so arbitrary bytes (incl. a secret) survive intact and never appear in argv as +// plaintext. The decode is piped straight into a root-owned install. +func WriteFileScript(absPath, content, mode, owner string) string { + enc := base64Encode(content) + dir := parentDir(absPath) + return strings.Join([]string{ + fmt.Sprintf("sudo install -d -m 0755 %s", shQuote(dir)), + fmt.Sprintf(`printf %%s %s | base64 -d | sudo install -m %s /dev/stdin %s`, shQuote(enc), shQuote(mode), shQuote(absPath)), + fmt.Sprintf("sudo chown %s %s", shQuote(owner), shQuote(absPath)), + }, "\n") +} diff --git a/internal/otelcol/paths.go b/internal/otelcol/paths.go new file mode 100644 index 0000000..3e38b44 --- /dev/null +++ b/internal/otelcol/paths.go @@ -0,0 +1,57 @@ +// Package otelcol renders the host VM-metrics collector's config and the +// idempotent shell that installs the off-the-shelf OpenTelemetry Collector Contrib +// .deb on a VM (ADR-0031). It is deploy-side only (never imported by +// inforge-bootstrap) and holds no provider/Pulumi dependencies — the program wires +// it to hosts, exactly as internal/nginx is wired to ingress hosts. 
+package otelcol + +import "fmt" + +// On-host names the install + provisioning paths must agree on. They are fixed by +// the official otelcol-contrib .deb (verified against the packaging at v0.155.0): +// the unit reads EnvironmentFile /etc/otelcol-contrib/otelcol-contrib.conf whose +// default OTELCOL_OPTIONS points ExecStart at ConfigPath, and runs as the +// unprivileged User created by the package postinstall. +const ( + // Package is the dpkg package name the .deb installs. + Package = "otelcol-contrib" + // Service is the systemd unit the .deb ships and enables. + Service = "otelcol-contrib.service" + // User/Group is the unprivileged account the unit runs as. The credential file + // is chown'd to it so only the collector can read it. + User = "otelcol-contrib" + // ConfigPath is the config the unit loads by default; we overwrite it. + ConfigPath = "/etc/otelcol-contrib/config.yaml" + // CredentialPath holds the base64 OTLP Basic-auth value (0600, owned by User), + // referenced from the rendered config via the collector's ${file:…} provider so + // the secret never appears in the config itself. + CredentialPath = "/etc/otelcol-contrib/otlp-auth" + // DefaultVersion is the otelcol-contrib release installed when an env does not + // pin one. Bump deliberately (a new .deb is downloaded + apt-installed on next + // deploy). + DefaultVersion = "0.155.0" +) + +// AuthSecretContainer / AuthSecretKey locate the OTLP Basic-auth credential in the +// env's secrets.enc.yaml. The stored value is the raw "instanceID:token" Grafana +// Cloud hands out for OTLP Basic auth; inforge base64-encodes it at deploy and +// writes the result to CredentialPath. +const ( + AuthSecretContainer = "observability" + AuthSecretKey = "otlp_auth" +) + +// DebAsset is the release asset filename for a version+arch, e.g. +// "otelcol-contrib_0.155.0_linux_amd64.deb". version carries no leading "v". 
+func DebAsset(version, arch string) string { + return fmt.Sprintf("%s_%s_linux_%s.deb", Package, version, arch) +} + +// DownloadBaseURL is the GitHub release download base for a version (tag has the +// leading "v"; the asset name does not). +func DownloadBaseURL(version string) string { + return fmt.Sprintf("https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v%s", version) +} + +// ChecksumsAsset is the release checksums file the install step verifies against. +const ChecksumsAsset = "opentelemetry-collector-releases_otelcol-contrib_checksums.txt" diff --git a/internal/otelcol/shell.go b/internal/otelcol/shell.go new file mode 100644 index 0000000..5e83a91 --- /dev/null +++ b/internal/otelcol/shell.go @@ -0,0 +1,26 @@ +package otelcol + +import ( + "encoding/base64" + "path" + "strings" +) + +// shQuote single-quotes s for safe interpolation into a POSIX shell command, +// escaping any embedded single quote as the standard '\'' sequence. The package +// renders host shell but stays Pulumi-free (it is pure, like internal/nginx), so it +// carries its own minimal quoting rather than importing the remote helpers. +func shQuote(s string) string { + return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'" +} + +// base64Encode returns the standard base64 of s, for transporting arbitrary bytes +// (including a secret) through a shell command without exposing them in argv. +func base64Encode(s string) string { + return base64.StdEncoding.EncodeToString([]byte(s)) +} + +// parentDir is the directory containing absPath. 
+func parentDir(absPath string) string { + return path.Dir(absPath) +} diff --git a/internal/types/types.go b/internal/types/types.go index eebd24b..e1971f3 100644 --- a/internal/types/types.go +++ b/internal/types/types.go @@ -331,9 +331,20 @@ type NetworkOutputs struct { // PrivateIP is the host's address on its attached private network — empty in // preview, and used by the ingress tier to proxy_pass to a backend that lives on // a different host within the same Hetzner Network (cross-host routing). +// The four metadata fields below are provider-supplied OTel resource-identity facts, +// known at plan time (plain strings, not Pulumi outputs). They are the host/cloud +// ground truth a running process cannot determine for itself; renderDescriptor reads +// them off the host's outputs and injects them as INFORGE_* env vars (ADR-0030), and +// the host metrics collector (ADR-0031) stamps the same values. A provider that does +// not supply one leaves it empty, and the empty value is omitted downstream. type ComputeOutputs struct { PublicIP pulumi.StringOutput PrivateIP pulumi.StringOutput + + CloudProvider string // cloud.provider, e.g. "hetzner" + CloudRegion string // cloud.region — the provider's region, e.g. Hetzner network_zone "us-east" + AvailabilityZone string // cloud.availability_zone — the datacenter, e.g. Hetzner location "ash" + MachineType string // host.type — the server-type SKU, e.g. "cx23" } // FirewallPorts is the derived inbound port plan for one host, computed by the @@ -537,8 +548,21 @@ type SSHConfig struct { // realizations) now live in regions.yaml (see internal/regions); variables.yaml // carries only the base domain and SSH material. 
type EnvironmentVariables struct { - BaseDomain string `yaml:"base_domain"` - SSH SSHConfig `yaml:"ssh"` + BaseDomain string `yaml:"base_domain"` + SSH SSHConfig `yaml:"ssh"` + Observability ObservabilityConfig `yaml:"observability"` +} + +// ObservabilityConfig is the optional env-level observability block (ADR-0031). +// When OTLPEndpoint is set, inforge installs the host VM-metrics collector on every +// VM; the matching auth secret must exist in secrets.enc.yaml or the deploy fails. +// When it is empty, no collector is installed: the agent is always-on but gated +// on this config being present, so an env that has not set up Grafana Cloud gets +// nothing. OTLPEndpoint is the non-secret OTLP/HTTP base URL; the Basic-auth +// credential is NOT here; it lives in secrets.enc.yaml under the reserved +// observability container (see otelcol.AuthSecretContainer and otelcol.AuthSecretKey). +type ObservabilityConfig struct { + OTLPEndpoint string `yaml:"otlp_endpoint"` } // Resources is the full set of resource specs for one region. diff --git a/program/mtls_descriptor_test.go b/program/mtls_descriptor_test.go index c8a87f6..78cc11e 100644 --- a/program/mtls_descriptor_test.go +++ b/program/mtls_descriptor_test.go @@ -15,7 +15,7 @@ func TestRenderDescriptorMeshFiles(t *testing.T) { svc := types.ServiceSpec{Name: "bridge", Container: "bridge", Host: "bridge", Type: "raw", User: "bridge", Pki: "wardnet-mesh"} // A mesh service with a provider (a bundle) advertises its leaf/key/bundle. 
bundle := &types.ServiceSecretsBundle{ProviderKind: "infisical", URL: "https://x", Environment: "prod", SecretPath: "/bridge"} - out, err := renderDescriptor(svc, bundle, "ws-1", "prd", "us-east-1", "use1", "wardnet.network", "bridge-01") + out, err := renderDescriptor(svc, types.ComputeOutputs{}, bundle, "ws-1", "prd", "us-east-1", "use1", "wardnet.network", "bridge-01") require.NoError(t, err) d, err := bootstrapper.ParseDescriptor([]byte(out)) @@ -28,7 +28,7 @@ func TestRenderDescriptorMeshNoProviderNoFiles(t *testing.T) { // A mesh service with no provider yet (secret-less; provider/identity lands // in #109) emits no files: — never an unsatisfiable descriptor. svc := types.ServiceSpec{Name: "bridge", Container: "bridge", Host: "bridge", Type: "raw", User: "bridge", Pki: "wardnet-mesh"} - out, err := renderDescriptor(svc, nil, "", "prd", "us-east-1", "use1", "wardnet.network", svc.Name+"-01") + out, err := renderDescriptor(svc, types.ComputeOutputs{}, nil, "", "prd", "us-east-1", "use1", "wardnet.network", svc.Name+"-01") require.NoError(t, err) d, err := bootstrapper.ParseDescriptor([]byte(out)) @@ -38,7 +38,7 @@ func TestRenderDescriptorMeshNoProviderNoFiles(t *testing.T) { func TestRenderDescriptorNoMeshNoFiles(t *testing.T) { svc := types.ServiceSpec{Name: "plain", Container: "plain", Host: "plain", Type: "raw", User: "plain"} - out, err := renderDescriptor(svc, nil, "", "prd", "us-east-1", "use1", "wardnet.network", svc.Name+"-01") + out, err := renderDescriptor(svc, types.ComputeOutputs{}, nil, "", "prd", "us-east-1", "use1", "wardnet.network", svc.Name+"-01") require.NoError(t, err) d, err := bootstrapper.ParseDescriptor([]byte(out)) diff --git a/program/program.go b/program/program.go index e62d6d2..a54da39 100644 --- a/program/program.go +++ b/program/program.go @@ -4,6 +4,7 @@ package program import ( + "encoding/base64" "encoding/json" "fmt" "os" @@ -21,6 +22,7 @@ import ( "github.com/wardnet/inforge/internal/manifest" 
"github.com/wardnet/inforge/internal/meshcert" "github.com/wardnet/inforge/internal/naming" + "github.com/wardnet/inforge/internal/otelcol" "github.com/wardnet/inforge/internal/regions" "github.com/wardnet/inforge/internal/registry" iremote "github.com/wardnet/inforge/internal/remote" @@ -220,6 +222,23 @@ func Run(ctx *pulumi.Context) error { } } + // Env-level observability (ADR-0031): when otlp_endpoint is configured, every VM + // gets the host-metrics collector. The OTLP Basic-auth credential lives in the + // env secret store as the raw "instanceID:token"; base64 it once into the header + // value and mark it secret so it is encrypted in Pulumi state. An endpoint set + // with no credential is a hard misconfiguration (fail at up, skipped in preview). + var obsAuthB64 pulumi.StringOutput + if vars.Observability.OTLPEndpoint != "" { + authRaw := "" + if c, ok := encSecrets[otelcol.AuthSecretContainer]; ok { + authRaw = c[otelcol.AuthSecretKey] + } + if authRaw == "" && !ctx.DryRun() { + return fmt.Errorf("observability: otlp_endpoint is set but secrets.enc.yaml has no %s/%s credential", otelcol.AuthSecretContainer, otelcol.AuthSecretKey) + } + obsAuthB64 = pulumi.ToSecret(pulumi.String(base64.StdEncoding.EncodeToString([]byte(authRaw)))).(pulumi.StringOutput) + } + // Post-process each scope through the host-level pipeline. 
The order within a // scope is load-bearing: DNS records first (ACME HTTP-01 needs the A-record to // exist), then app seeds + ingress (nginx/ACME), then service secrets, then the @@ -252,6 +271,9 @@ func Run(ctx *pulumi.Context) error { if err := provisionServices(ctx, sc.res, computeOutputs[sc.key], bundles, gates, vars.SSH.DeployPrivateKey, env, sc.key, sc.slug, vars.BaseDomain, inforgeVersion); err != nil { return err } + if err := provisionObservability(ctx, sc.res, computeOutputs[sc.key], gates, vars.Observability, obsAuthB64, vars.SSH.DeployPrivateKey, env, sc.slug); err != nil { + return err + } } return nil @@ -460,6 +482,82 @@ func provisionService(ctx *pulumi.Context, svc types.ServiceSpec, host types.Com return nil } +// provisionObservability installs the host VM-metrics collector on every VM in this +// region (ADR-0031), gated on the env defining observability config: it is a no-op +// when obs.OTLPEndpoint is empty. The agent is always-on otherwise — every host +// gets it, no per-compute opt-in — and stamps the same cloud/host resource identity +// (ADR-0030) inforge injects into app telemetry, so host metrics correlate with app +// telemetry on host.id. It is scoped to regional hosts, matching provisionServices +// (global-placement hosts are not service-provisioned in this loop either). +// +// authB64 is the base64 OTLP Basic-auth value, already marked secret by the caller; +// the credential write is built inside an ApplyT over it so the secret is encrypted +// in Pulumi state (never written as plaintext), mirroring deliverServiceSecrets. 
+func provisionObservability(ctx *pulumi.Context, res types.Resources, computeOut map[string]types.ComputeOutputs, gates map[string]pulumi.Resource, obs types.ObservabilityConfig, authB64 pulumi.StringOutput, deployPrivateKey, env, slug string) error { + if obs.OTLPEndpoint == "" { + return nil + } + deployUserByCompute := naming.DeployUsersByHost(res.Compute) + for _, hostKey := range sortedKeys(computeOut) { + host := computeOut[hostKey] + deployUser := deployUserByCompute[hostKey] + if !ctx.DryRun() { + if deployUser == "" { + return fmt.Errorf("observability: host %q has no deploy_user; inforge needs one to SSH and install the collector", hostKey) + } + if deployPrivateKey == "" { + return fmt.Errorf("observability: no deploy private key configured (set the deploy_private_key stack config or INFORGE_DEPLOY_PRIVATE_KEY)") + } + } + gate, err := cloudInitGate(ctx, gates, hostKey, host, deployPrivateKey, env, slug) + if err != nil { + return err + } + conn := iremote.Connection(host.PublicIP, deployUser, deployPrivateKey) + name := naming.Resource(env, slug, "otelcol", hostKey) + + install := otelcol.InstallScript(otelcol.DefaultVersion) + installCmd, err := remote.NewCommand(ctx, name+"-install", &remote.CommandArgs{ + Connection: conn, + Create: pulumi.String(install), + Update: pulumi.String(install), + Triggers: pulumi.Array{pulumi.String(install)}, + }, pulumi.DependsOn([]pulumi.Resource{gate})) + if err != nil { + return fmt.Errorf("observability: host %q: install collector: %w", hostKey, err) + } + + config, err := otelcol.Render(obs.OTLPEndpoint, otelcol.Attributes{ + HostID: naming.Resource(env, slug, "vm", hostKey), + CloudProvider: host.CloudProvider, + CloudRegion: host.CloudRegion, + AvailabilityZone: host.AvailabilityZone, + MachineType: host.MachineType, + Environment: env, + RegionSlug: slug, + }) + if err != nil { + return fmt.Errorf("observability: host %q: render config: %w", hostKey, err) + } + // The credential is secret; build the write+apply 
script inside an ApplyT over + // the secret so the whole command's Create is encrypted in state. Order: write + // the 0600 credential, then the config + enable + restart (a changed ${file:} + // credential is only re-read on start). + applyScript := authB64.ApplyT(func(b64 string) string { + return otelcol.CredentialScript(b64) + "\n" + otelcol.ApplyScript(config) + }).(pulumi.StringOutput) + if _, err := remote.NewCommand(ctx, name+"-config", &remote.CommandArgs{ + Connection: conn, + Create: applyScript, + Update: applyScript, + Triggers: pulumi.Array{applyScript}, + }, pulumi.DependsOn([]pulumi.Resource{installCmd})); err != nil { + return fmt.Errorf("observability: host %q: configure collector: %w", hostKey, err) + } + } + return nil +} + // provisionApps seeds each app's on-host folder with a placeholder bundle on its // ingress host, so the app's nginx server block and Let's Encrypt certificate // provision before the first real release (slice D). It runs once per app over @@ -680,7 +778,7 @@ func deliverServiceSecrets(ctx *pulumi.Context, svc types.ServiceSpec, host type // descriptor.yaml depends on the workspace ID (provider.project), so render it // inside an ApplyT on that output. 
descriptor := bundle.Project.ApplyT(func(project string) (string, error) { - return renderDescriptor(svc, bundle, project, env, region, slug, baseDomain, computeKey) + return renderDescriptor(svc, host, bundle, project, env, region, slug, baseDomain, computeKey) }).(pulumi.StringOutput) // Encrypt {client_id, client_secret} to the host key inside an ApplyT over the @@ -742,7 +840,7 @@ func deliverServiceDescriptor(ctx *pulumi.Context, svc types.ServiceSpec, host t conn := iremote.Connection(host.PublicIP, deployUser, deployPrivateKey) name := naming.Resource(env, slug, "svc", svc.Name) - descriptor, err := renderDescriptor(svc, nil, "", env, region, slug, baseDomain, computeKey) + descriptor, err := renderDescriptor(svc, host, nil, "", env, region, slug, baseDomain, computeKey) if err != nil { return err } @@ -770,7 +868,7 @@ func deliverServiceDescriptor(ctx *pulumi.Context, svc types.ServiceSpec, host t // fqdn/host) is derived from the deployment context and is present for every // service, secret-bearing or not. hostKey is the service's resolved compute key // ("-", e.g. "bridge-01"); the host id is its full VM resource name. -func renderDescriptor(svc types.ServiceSpec, bundle *types.ServiceSecretsBundle, project, env, region, slug, baseDomain, hostKey string) (string, error) { +func renderDescriptor(svc types.ServiceSpec, host types.ComputeOutputs, bundle *types.ServiceSecretsBundle, project, env, region, slug, baseDomain, hostKey string) (string, error) { // The global scope is region-less: globalScope is an internal output-map key, not // an abstract region, so it must not leak into the on-host descriptor. Surface an // empty INFORGE_DEPLOYMENT_REGION (matching the already-empty RegionSlug) rather @@ -793,6 +891,12 @@ func renderDescriptor(svc types.ServiceSpec, bundle *types.ServiceSecretsBundle, // "-" hostKey as the name segment yields the same string as // naming.ResourceInstance, so the host id matches the cloud server name. 
HostID: naming.Resource(env, slug, "vm", hostKey), + // Provider-supplied cloud/host resource identity, off the host's own + // outputs (ADR-0030) — empty for a provider that does not supply them. + CloudProvider: host.CloudProvider, + CloudRegion: host.CloudRegion, + AvailabilityZone: host.AvailabilityZone, + MachineType: host.MachineType, }, } if bundle != nil { diff --git a/program/program_test.go b/program/program_test.go index c3b5cdc..e185038 100644 --- a/program/program_test.go +++ b/program/program_test.go @@ -545,7 +545,14 @@ func TestRenderDescriptorRoundTrips(t *testing.T) { Env: map[string]string{"DATABASE_URL": "infra/DATABASE_URL"}, } - out, err := renderDescriptor(svc, bundle, "ws-123", "prd", "us-east-1", "use1", "wardnet.network", "ghost-01") + // Provider-supplied cloud/host identity flows off the host's outputs (ADR-0030). + host := types.ComputeOutputs{ + CloudProvider: "hetzner", + CloudRegion: "us-east", + AvailabilityZone: "ash", + MachineType: "cx23", + } + out, err := renderDescriptor(svc, host, bundle, "ws-123", "prd", "us-east-1", "use1", "wardnet.network", "ghost-01") require.NoError(t, err) d, err := bootstrapper.ParseDescriptor([]byte(out)) @@ -567,6 +574,11 @@ func TestRenderDescriptorRoundTrips(t *testing.T) { assert.Equal(t, "ghost.svc.prd.use1.wardnet.network", d.Deployment.FQDN) // host.id is the full VM resource name built from the resolved compute key. assert.Equal(t, "wardnet-prd-use1-vm-ghost-01", d.Deployment.HostID) + // Provider-supplied cloud/host resource identity round-trips (ADR-0030). 
+ assert.Equal(t, "hetzner", d.Deployment.CloudProvider) + assert.Equal(t, "us-east", d.Deployment.CloudRegion) + assert.Equal(t, "ash", d.Deployment.AvailabilityZone) + assert.Equal(t, "cx23", d.Deployment.MachineType) } // TestRenderDescriptorGlobalScopeIsRegionLess: a global service is region-less, so @@ -578,7 +590,7 @@ func TestRenderDescriptorGlobalScopeIsRegionLess(t *testing.T) { // Driven exactly as the scopes loop drives the global slice: region=globalScope, // slug="". - out, err := renderDescriptor(svc, nil, "", "prd", globalScope, "", "wardnet.network", "tenants-01") + out, err := renderDescriptor(svc, types.ComputeOutputs{}, nil, "", "prd", globalScope, "", "wardnet.network", "tenants-01") require.NoError(t, err) d, err := bootstrapper.ParseDescriptor([]byte(out)) @@ -595,7 +607,7 @@ func TestRenderDescriptorGlobalScopeIsRegionLess(t *testing.T) { func TestRenderDescriptorSecretLess(t *testing.T) { svc := types.ServiceSpec{Name: "ghost", Container: "ghost", User: "ghost"} - out, err := renderDescriptor(svc, nil, "", "prd", "us-east-1", "use1", "wardnet.network", "ghost-01") + out, err := renderDescriptor(svc, types.ComputeOutputs{}, nil, "", "prd", "us-east-1", "use1", "wardnet.network", "ghost-01") require.NoError(t, err) d, err := bootstrapper.ParseDescriptor([]byte(out)) diff --git a/providers/hetzner/compute.go b/providers/hetzner/compute.go index b102fa8..76c55d1 100644 --- a/providers/hetzner/compute.go +++ b/providers/hetzner/compute.go @@ -161,7 +161,17 @@ func (h *HetznerCompute) Create( // omit an explicit Ip; .Elem() unwraps the *string output to "" in preview. privateIP := server.Networks.Index(pulumi.Int(0)).Ip().Elem() - return types.ComputeOutputs{PublicIP: server.Ipv4Address, PrivateIP: privateIP}, nil + return types.ComputeOutputs{ + PublicIP: server.Ipv4Address, + PrivateIP: privateIP, + // Provider-supplied OTel resource identity (ADR-0030): all plan-time constants + // resolved above, so they need no Pulumi apply. 
network_zone ⊃ location maps + // onto cloud.region ⊃ cloud.availability_zone. + CloudProvider: "hetzner", + CloudRegion: regionCfg.NetworkZone, + AvailabilityZone: regionCfg.Location, + MachineType: serverType, + }, nil } // ensureFirewall returns the hcloud.Firewall for the spec name, creating it if it