31 changes: 31 additions & 0 deletions AGENTS.md
@@ -383,6 +383,37 @@ source. The grain is create-and-destroy (Hetzner bills a server until it is dele
- **Network segregation** is a structural invariant — never peer Networks or share one across envs;
see `.agents/rules/ephemeral-network-segregation.md`.

## Observability (ADR-0030, ADR-0031)

Two coupled pieces give telemetry in Grafana Cloud its cloud/host context. Both stamp the **same**
resource-attribute set so VM metrics and app telemetry correlate on `host.id`.

- **Resource-attribute enrichment (ADR-0030).** Beyond the #134 set, inforge injects four
more OTel attributes where it is the sole authority: `cloud.provider`
(`INFORGE_CLOUD_PROVIDER`), `cloud.region` (`INFORGE_CLOUD_REGION` = Hetzner `network_zone`),
`cloud.availability_zone` (`INFORGE_CLOUD_AVAILABILITY_ZONE` = Hetzner `location`), and
`host.type` (`INFORGE_HOST_TYPE` = server-type SKU). They are **provider-supplied** plain-string
fields on `types.ComputeOutputs` (`CloudProvider/CloudRegion/AvailabilityZone/MachineType`),
populated by `hetzner.Create()` (plan-time constants, no apply), read off the host in
`renderDescriptor`, carried in `bootstrapper.Deployment`, and emitted by `buildEnv`
**omit-if-empty** (a provider that doesn't supply one emits nothing). `host.name`/`os.type`
were deliberately dropped (self-detectable by the process). This bumped the descriptor to
**v5** (the strict `KnownFields` decoder makes any field addition a major bump). The consumer
side is a four-row addition to the `(attribute, env_var)` table in wardnet-cloud
`crates/common/src/telemetry.rs::resource()`.
- **Host VM-metrics collector (ADR-0031).** `internal/otelcol` (pure, Pulumi-free, like
`internal/nginx`) renders an off-the-shelf **OTel Collector Contrib** config (`hostmetrics` →
`otlphttp`) and the idempotent install shell (download the version-pinned `.deb`, verify the
release checksum, `apt-get install` the local file while keeping our config on upgrade). The
`process` scraper is **off** so the agent runs **unprivileged** as the `.deb`'s `otelcol-contrib`
user. `program.provisionObservability` is an **always-on** per-host pass **gated on env-level
config**: `variables.yaml` `observability.otlp_endpoint` (non-secret) + the OTLP Basic-auth
credential in `secrets.enc.yaml` under the reserved `observability/otlp_auth`
(`otelcol.AuthSecret*`). With no endpoint it is a no-op; with an endpoint but no credential it
fails the deploy. The credential is base64'd, `pulumi.ToSecret`-wrapped (encrypted in state),
written `0600` owned by the collector user, and referenced from the config via the collector's
`${file:…}` provider (never inlined). The config stamps the ADR-0030 attribute set + `host.id`.
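
The endpoint/credential gate described above can be sketched as a small pure function. This is an illustration only: `ObservabilityConfig`, `shouldInstallCollector`, and `ErrMissingCredential` are hypothetical names, not the ones used in `program.provisionObservability`, and the example endpoint is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// ObservabilityConfig is a hypothetical stand-in for the env-level settings:
// the endpoint would come from variables.yaml, the credential from
// secrets.enc.yaml.
type ObservabilityConfig struct {
	OTLPEndpoint string // observability.otlp_endpoint (non-secret)
	OTLPAuth     string // observability/otlp_auth (secret); empty means unset
}

// ErrMissingCredential signals an endpoint configured without a credential.
var ErrMissingCredential = errors.New("observability: otlp_endpoint is set but the otlp_auth credential is missing")

// shouldInstallCollector reports whether the per-host pass has work to do.
// No endpoint is a no-op (false, nil); an endpoint without a credential is an
// error that should fail the deploy.
func shouldInstallCollector(c ObservabilityConfig) (bool, error) {
	if c.OTLPEndpoint == "" {
		return false, nil
	}
	if c.OTLPAuth == "" {
		return false, ErrMissingCredential
	}
	return true, nil
}

func main() {
	cases := []ObservabilityConfig{
		{},
		{OTLPEndpoint: "https://otlp.example.invalid/otlp"},
		{OTLPEndpoint: "https://otlp.example.invalid/otlp", OTLPAuth: "user:token"},
	}
	for _, c := range cases {
		install, err := shouldInstallCollector(c)
		fmt.Printf("endpoint=%q install=%v err=%v\n", c.OTLPEndpoint, install, err)
	}
}
```

Failing on a half-configured env, rather than silently skipping it, keeps a typo in the secret name from turning into a host that quietly exports nothing.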

## Conventions

- **Provider binary names are load-bearing.** Pulumi locates plugins by the exact filename
67 changes: 67 additions & 0 deletions docs/adr/0030-otel-resource-attribute-enrichment.md
@@ -0,0 +1,67 @@
# OTel resource-attribute enrichment (cloud.*/host.*)

Extends the #134 observability env-var contract so every service's telemetry
(metrics, traces, logs shipped to Grafana Cloud) is tagged with the cloud/host
resource attributes inforge alone knows at deploy time. Today the only host-level
attribute is `host.id` (`INFORGE_HOST_ID`); the rest of the deploy's ground truth
(provider, region, datacenter, machine size) is resolved inside the Hetzner
provider and discarded.

We inject **four** new resource attributes — and only four: the ones where inforge
is the *sole* authority, i.e. facts the running process provably cannot determine
itself.

| OTel attribute | Value | Source |
|---|---|---|
| `cloud.provider` | `hetzner` | provider self-names |
| `cloud.region` | Hetzner `network_zone` (e.g. `us-east`) | provider |
| `cloud.availability_zone` | Hetzner `location` (e.g. `ash`) | provider |
| `host.type` | server-type SKU (e.g. `cx23`) | provider |

## Considered options

- **`host.name` and `os.type` — rejected.** `host.name` (the OS hostname) is
readable by the process via `gethostname()` and would duplicate the already-injected
`host.id` (the unique cloud resource name). `os.type` is always `linux` and is
trivially self-detectable (the OTel SDK ships an OS resource detector). Neither is
inforge's unique knowledge, so injecting them adds contract surface for no value.
- **`cloud.region` = Hetzner `network_zone`, not the inforge abstract region.** The
abstract region (`us-east-1`) is already in `INFORGE_DEPLOYMENT_REGION` and mapping
*that* would be zero new work, but it is inforge's portable abstraction, not the
provider's region — and it would pair a `cloud.region` from one naming system with a
`cloud.availability_zone` from another. OTel semconv defines `cloud.region` as the
*provider's* region; the provider-native hierarchy `network_zone ⊃ location` maps
exactly onto `cloud.region ⊃ cloud.availability_zone`. We pay one extra provider-sourced
string to keep both `cloud.*` geo attributes consistent and semantically correct.

## How

Provider-supplied facts reach the descriptor through `types.ComputeOutputs` — the
existing provider→program boundary object, already keyed per-host in `computeOut`.
It gains four plain-`string` fields (`CloudProvider`, `CloudRegion`,
`AvailabilityZone`, `MachineType`), populated by `hetzner.Create()` (plan-time
constants, no Pulumi apply). `renderDescriptor` reads them off the host's
`ComputeOutputs` and writes them into `bootstrapper.Deployment`; `buildEnv` emits
`INFORGE_CLOUD_PROVIDER`, `INFORGE_CLOUD_REGION`, `INFORGE_CLOUD_AVAILABILITY_ZONE`,
and `INFORGE_HOST_TYPE` (omitting any that are empty, so a future non-Hetzner provider
that doesn't supply them emits nothing). This keeps `renderDescriptor` provider-agnostic
and makes `cloud.provider` self-named rather than a hardcoded constant; it also makes
**global services correct for free**, since a global host carries its real placement in
its own `ComputeOutputs` rather than a recomputed abstraction.

The four new env-var names are already protected by the reserved `INFORGE_*` prefix,
so no validation change is needed.
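
The flow described in this section can be sketched end to end with trimmed stand-in types. The struct shapes and the helper names `deploymentFor` and `cloudEnv` are assumptions for illustration; the real `types.ComputeOutputs`, `bootstrapper.Deployment`, `renderDescriptor`, and `buildEnv` carry more than is shown here.

```go
package main

import "fmt"

// ComputeOutputs is a trimmed, hypothetical mirror of types.ComputeOutputs:
// only the four provider-supplied fields named above are shown.
type ComputeOutputs struct {
	CloudProvider    string
	CloudRegion      string
	AvailabilityZone string
	MachineType      string
}

// Deployment is a trimmed, hypothetical mirror of bootstrapper.Deployment.
type Deployment struct {
	HostID           string
	CloudProvider    string
	CloudRegion      string
	AvailabilityZone string
	MachineType      string
}

// deploymentFor copies the provider's facts verbatim. It never inspects which
// provider produced them, which is what keeps this step provider-agnostic.
func deploymentFor(hostID string, out ComputeOutputs) Deployment {
	return Deployment{
		HostID:           hostID,
		CloudProvider:    out.CloudProvider,
		CloudRegion:      out.CloudRegion,
		AvailabilityZone: out.AvailabilityZone,
		MachineType:      out.MachineType,
	}
}

// cloudEnv emits one NAME=value entry per non-empty field, in a fixed order.
func cloudEnv(d Deployment) []string {
	pairs := [][2]string{
		{"INFORGE_CLOUD_PROVIDER", d.CloudProvider},
		{"INFORGE_CLOUD_REGION", d.CloudRegion},
		{"INFORGE_CLOUD_AVAILABILITY_ZONE", d.AvailabilityZone},
		{"INFORGE_HOST_TYPE", d.MachineType},
	}
	var env []string
	for _, kv := range pairs {
		if kv[1] != "" {
			env = append(env, kv[0]+"="+kv[1])
		}
	}
	return env
}

func main() {
	hetzner := ComputeOutputs{CloudProvider: "hetzner", CloudRegion: "us-east", AvailabilityZone: "ash", MachineType: "cx23"}
	fmt.Println(cloudEnv(deploymentFor("wardnet-prd-use1-vm-bridge-01", hetzner)))

	// A provider that supplies nothing yields no cloud/host variables at all.
	fmt.Println(len(cloudEnv(deploymentFor("other-host", ComputeOutputs{}))))
}
```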

## Consequences

- **Descriptor version bumps 4 → 5.** The bootstrapper decodes descriptors strictly
(`KnownFields(true)`), so any field addition is a breaking change for an older
bootstrapper — a major bump is forced, not a judgment call. It is safe because the
pinned `inforge-bootstrap` binary and the descriptor are written by the same deploy
(the same lockstep as the #134 v3→v4 bump).
- **Cross-repo, decoupled by the env-var contract.** The consumer side is a four-row
addition to the `(attribute, env_var)` table in wardnet-cloud
`crates/common/src/telemetry.rs::resource()`; each row is best-effort (a missing/empty
var is omitted), so inforge can ship and deploy first and the attributes start
populating once wardnet-cloud picks them up. The collector of ADR-0031 reuses this
exact attribute set so host metrics correlate with app telemetry on `host.id`.
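
To see why strict decoding forces the version bump, here is a minimal sketch that uses `encoding/json` with `DisallowUnknownFields` as a standard-library stand-in for the YAML decoder's `KnownFields(true)`. The `descriptorV4` shape, the JSON encoding, and the "peek at the version first" step are all assumptions made for illustration, not the bootstrapper's actual parsing code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// descriptorV4 is a hypothetical, trimmed-down schema as an older reader knows it.
type descriptorV4 struct {
	Version int    `json:"version"`
	HostID  string `json:"host_id"`
}

const supportedVersion = 4

// parseStrict checks the version leniently first, then decodes strictly so
// that any unknown field is an error.
func parseStrict(doc []byte) (descriptorV4, error) {
	var head struct {
		Version int `json:"version"`
	}
	if err := json.Unmarshal(doc, &head); err != nil {
		return descriptorV4{}, err
	}
	if head.Version != supportedVersion {
		return descriptorV4{}, fmt.Errorf("unsupported descriptor version %d (want %d)", head.Version, supportedVersion)
	}

	var d descriptorV4
	dec := json.NewDecoder(bytes.NewReader(doc))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&d); err != nil {
		return descriptorV4{}, err
	}
	return d, nil
}

func main() {
	docs := []string{
		`{"version":4,"host_id":"h1"}`,                            // old writer, old reader
		`{"version":4,"host_id":"h1","cloud_provider":"hetzner"}`, // new field, no bump
		`{"version":5,"host_id":"h1","cloud_provider":"hetzner"}`, // new field, bumped
	}
	for _, doc := range docs {
		_, err := parseStrict([]byte(doc))
		fmt.Printf("%s -> err=%v\n", doc, err)
	}
}
```

Without the bump, the old reader rejects the document on an unknown field; with it, the failure is a version mismatch that names the actual problem.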
66 changes: 66 additions & 0 deletions docs/adr/0031-host-vm-metrics-collector.md
@@ -0,0 +1,66 @@
# Host VM-metrics collector

Application telemetry (ADR-0030, #134) tells us nothing about the VM itself — CPU,
memory, disk, network. A running service cannot report its host's resource usage, and
inforge owns the host, so collecting VM metrics is inforge's responsibility. We install
an off-the-shelf **OpenTelemetry Collector** on every VM, scraping host metrics and
exporting them OTLP/HTTP to the same Grafana Cloud endpoint the services use, tagged
with the **same** resource-attribute set as ADR-0030 so host metrics and app telemetry
correlate on `host.id`.

## Decisions

- **OTel Collector with the `hostmetrics` receiver → `otlphttp` exporter**, not Grafana
Alloy or Prometheus `node_exporter`. It reuses the exact protocol, endpoint, Basic
auth, and OTel resource-attribute model the services already use; Alloy adds a second
config dialect and integrations we don't need, and `node_exporter` is the wrong
protocol/model (Prometheus scrape/remote_write, label-based, no OTel resource attributes).
- **Off-the-shelf official `.deb` via apt, not a binary we build.** The OpenTelemetry
Collector Contrib `.deb` ships the binary, a system user, and a systemd unit, and
handles upgrades. A custom minimal collector via `ocb` would be more inforge-idiomatic
(tiny static binary, checksum-verified download) but is build/maintenance overhead for
a solved problem. inforge owns only the rendered config and the credential. Trade-off:
`apt` from a third-party repo is less pinned/reproducible than inforge's existing
checksum-verified raw-download invariant (`bootstrapDownloadStep`); we accept that for
not maintaining a collector build.
- **Always-on, gated on env-level observability config.** If the env defines the OTLP
endpoint (`variables.yaml`) and auth token (`secrets.enc.yaml`), every VM inforge
provisions gets the collector; otherwise none is installed (it would have nowhere to
export). No per-compute opt-in flag — host metrics are wanted uniformly, and this
matches "we own the host."
- **Unprivileged agent; `process` scraper off.** The host signals we want (cpu, memory,
load, disk, filesystem, network throughput) come from world-readable `/proc` and
`/sys` — no root needed. The agent runs as the `.deb`'s unprivileged service user. Only
the `process` scraper (per-process inventory across all users) needs root or
`CAP_SYS_PTRACE`/`CAP_DAC_READ_SEARCH`; it stays **off** by default and is a separate
opt-in if per-process metrics are ever wanted.
- **Credential delivered as an agent-user `0600` file, not at runtime.** The host agent
is not a service: it has no per-service Infisical identity, no descriptor, no `files:`
projection, and must start on boot independently. So the env-level OTLP token
(decrypted once per deploy from `secrets.enc.yaml`) is written to the host over SSH
(`command.remote`, the same transport that places descriptors/units/nginx config) into
a `0600` file owned by the agent's service user, referenced by the collector's
Basic-auth header.
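
The credential decision in the last bullet can be sketched locally as follows. This is a simplified illustration under stated assumptions: it writes to the local filesystem rather than over SSH, it assumes the file holds a standard HTTP `Basic` header value, and it skips the `chown` to the agent user because that needs root. The function names are hypothetical.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"os"
	"path/filepath"
)

// basicAuthValue builds a standard HTTP Basic header value: "Basic " followed
// by base64(user:token).
func basicAuthValue(user, token string) string {
	return "Basic " + base64.StdEncoding.EncodeToString([]byte(user+":"+token))
}

// writeCredential writes value to path readable and writable by the owner only.
func writeCredential(path, value string) error {
	if err := os.WriteFile(path, []byte(value), 0o600); err != nil {
		return err
	}
	// WriteFile applies the mode only when it creates the file, so chmod
	// explicitly to also tighten a pre-existing file.
	return os.Chmod(path, 0o600)
}

// writeAndStat writes the credential into a temporary directory and reports
// the resulting permission bits and content, then cleans up.
func writeAndStat(value string) (os.FileMode, string, error) {
	dir, err := os.MkdirTemp("", "otlp-auth-demo")
	if err != nil {
		return 0, "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "otlp-auth")
	if err := writeCredential(path, value); err != nil {
		return 0, "", err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, "", err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, "", err
	}
	return info.Mode().Perm(), string(data), nil
}

func main() {
	mode, content, err := writeAndStat(basicAuthValue("user", "pass"))
	fmt.Printf("mode=%o content=%q err=%v\n", mode, content, err)
}
```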

## Considered options for the credential

- **`systemd-creds` encrypted-at-rest — rejected.** It keeps plaintext off disk by
decrypting into a tmpfs cred dir at start, but it binds to a TPM2 when present, else to
a host master key (`/var/lib/systemd/credential.secret`) that is itself on disk.
Hetzner Cloud VMs don't expose a vTPM, so "encrypted at rest" degrades to "encrypted
with an on-disk key" — no meaningful defense against the disk/root-exfil threat the
tmpfs posture targets. And it costs three integration points (a unit drop-in, an
on-host `systemd-creds encrypt` step at deploy, and a wrapper to bridge the credential
*file* into the collector's env/config-based `basicauth`). The security gain mostly
evaporates without a TPM, so we take the simpler `0600` file. (If TPM-backed hosts
ever exist, revisit.)

## Consequences

- A persisted secret on the host diverges from the services' tmpfs-only,
never-on-disk secret posture — accepted because the agent is host infrastructure that
must boot before any inforge interaction, and the file is `0600` to the agent user only.
- New pure package `internal/otelcol` (stdlib-only, like `internal/nginx`) renders the
collector config from the host's resource attributes + endpoint and owns the on-host
path scheme; a `provisionObservability` pass in `program.go` installs/configures the
agent per host, gated on env-level observability config.
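
A rough sketch of what a stdlib-only config renderer along these lines might look like, using `text/template`. Everything here is an assumption for illustration: the type and function names, the 60s collection interval, the use of the contrib `resource` processor to stamp attributes, and the credential path, which the caller passes in. It is not the actual `internal/otelcol` output.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Attr is one resource attribute to stamp on every metric.
type Attr struct{ Key, Value string }

// HostFacts is a hypothetical bundle of the ADR-0030 attribute set plus host.id.
type HostFacts struct {
	HostID, CloudProvider, CloudRegion, AvailabilityZone, MachineType string
}

// resourceAttrs maps the facts onto OTel attribute names, omitting empty values.
func resourceAttrs(h HostFacts) []Attr {
	all := []Attr{
		{"host.id", h.HostID},
		{"cloud.provider", h.CloudProvider},
		{"cloud.region", h.CloudRegion},
		{"cloud.availability_zone", h.AvailabilityZone},
		{"host.type", h.MachineType},
	}
	var out []Attr
	for _, a := range all {
		if a.Value != "" {
			out = append(out, a)
		}
	}
	return out
}

// configTmpl enables the unprivileged scrapers only; the process scraper is absent.
const configTmpl = `receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu: {}
      memory: {}
      load: {}
      disk: {}
      filesystem: {}
      network: {}
processors:
  resource:
    attributes:
{{- range .Attrs}}
      - key: {{.Key}}
        value: {{printf "%q" .Value}}
        action: upsert
{{- end}}
exporters:
  otlphttp:
    endpoint: {{printf "%q" .Endpoint}}
    headers:
      Authorization: ${file:{{.AuthFile}}}
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      processors: [resource]
      exporters: [otlphttp]
`

var tmpl = template.Must(template.New("otelcol").Parse(configTmpl))

type params struct {
	Endpoint, AuthFile string
	Attrs              []Attr
}

// renderConfig renders the collector YAML; the credential is referenced by
// file path and never inlined.
func renderConfig(endpoint, authFile string, h HostFacts) (string, error) {
	var b strings.Builder
	err := tmpl.Execute(&b, params{Endpoint: endpoint, AuthFile: authFile, Attrs: resourceAttrs(h)})
	return b.String(), err
}

func main() {
	cfg, err := renderConfig("https://otlp.example.invalid/otlp", "/path/to/otlp-auth",
		HostFacts{HostID: "wardnet-prd-use1-vm-bridge-01", CloudProvider: "hetzner", CloudRegion: "us-east", AvailabilityZone: "ash", MachineType: "cx23"})
	if err != nil {
		panic(err)
	}
	fmt.Print(cfg)
}
```

Keeping the renderer a pure function of its inputs means it can be unit-tested without Pulumi or a host, which is the same property the ADR attributes to `internal/nginx`.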
17 changes: 13 additions & 4 deletions internal/bootstrapper/descriptor.go
@@ -28,10 +28,11 @@ import (
// producer and consumer can never disagree on the schema version. Because parsing
// is strict (KnownFields), any field addition is a breaking change for an older
// reader, so it bumps this major: v2 added the Deployment block, v3 added the
-// Files map, and v4 swapped Deployment.Namespace for Deployment.HostID, so an older
-// bootstrapper meeting a newer descriptor fails cleanly on the version rather than
-// on an unknown field.
-const SupportedVersion = 4
+// Files map, and v4 swapped Deployment.Namespace for Deployment.HostID; v5 added the
+// cloud/host resource-identity fields (CloudProvider/CloudRegion/AvailabilityZone/
+// MachineType, ADR-0030). An older bootstrapper meeting a newer descriptor fails
+// cleanly on the version rather than on an unknown field.
+const SupportedVersion = 5

// Descriptor is the versioned, secret-free on-host contract inforge writes to
// /etc/wardnet/services/<svc>/descriptor.yaml (0644 root). It names the service,
@@ -70,6 +71,14 @@ type Deployment struct {
// It is stable per host (does NOT change across restarts), so it is injected as
// INFORGE_HOST_ID — the OTel host.id resource attribute.
HostID string `yaml:"host_id"`
// The four fields below are provider-supplied cloud/host resource identity
// (ADR-0030), injected as INFORGE_CLOUD_*/INFORGE_HOST_TYPE → the OTel
// cloud.provider/cloud.region/cloud.availability_zone/host.type attributes.
// Each is omitempty: a provider that does not supply one writes nothing.
CloudProvider string `yaml:"cloud_provider,omitempty"` // cloud.provider, e.g. "hetzner"
CloudRegion string `yaml:"cloud_region,omitempty"` // cloud.region, e.g. "us-east"
AvailabilityZone string `yaml:"availability_zone,omitempty"` // cloud.availability_zone, e.g. "ash"
MachineType string `yaml:"machine_type,omitempty"` // host.type, e.g. "cx23"
}

// LoadDescriptor reads and parses the descriptor at path.
12 changes: 6 additions & 6 deletions internal/bootstrapper/descriptor_test.go
@@ -7,7 +7,7 @@ import (
"github.com/stretchr/testify/require"
)

-const validDescriptor = `version: 4
+const validDescriptor = `version: 5
service: ghost
exec: /srv/wardnet/ghost/run
user: ghost
@@ -59,9 +59,9 @@ func TestParseDescriptorRejectsUnknownField(t *testing.T) {

func TestParseDescriptorRequiresFields(t *testing.T) {
cases := map[string]string{
-"service": "version: 4\nexec: /x\nuser: ghost\nprovider:\n kind: infisical\n",
-"exec": "version: 4\nservice: ghost\nuser: ghost\nprovider:\n kind: infisical\n",
-"user": "version: 4\nservice: ghost\nexec: /x\nprovider:\n kind: infisical\n",
+"service": "version: 5\nexec: /x\nuser: ghost\nprovider:\n kind: infisical\n",
+"exec": "version: 5\nservice: ghost\nuser: ghost\nprovider:\n kind: infisical\n",
+"user": "version: 5\nservice: ghost\nexec: /x\nprovider:\n kind: infisical\n",
}
for missing, doc := range cases {
_, err := ParseDescriptor([]byte(doc))
@@ -72,7 +72,7 @@ func TestParseDescriptorRequiresFields(t *testing.T) {
// TestParseDescriptorSecretLess: a descriptor with no provider is a secret-less
// service — valid as long as it carries no env mapping.
func TestParseDescriptorSecretLess(t *testing.T) {
-doc := "version: 4\nservice: ghost\nexec: /x\nuser: ghost\n"
+doc := "version: 5\nservice: ghost\nexec: /x\nuser: ghost\n"
d, err := ParseDescriptor([]byte(doc))
require.NoError(t, err)
assert.Equal(t, "", d.Provider.Kind)
@@ -82,7 +82,7 @@ func TestParseDescriptorSecretLess(t *testing.T) {
// TestParseDescriptorRejectsEnvWithoutProvider: env with no provider is a
// producer bug — there is nothing to resolve the keys against.
func TestParseDescriptorRejectsEnvWithoutProvider(t *testing.T) {
-doc := "version: 4\nservice: ghost\nexec: /x\nuser: ghost\nenv:\n DATABASE_URL: infra/DATABASE_URL\n"
+doc := "version: 5\nservice: ghost\nexec: /x\nuser: ghost\nenv:\n DATABASE_URL: infra/DATABASE_URL\n"
_, err := ParseDescriptor([]byte(doc))
require.Error(t, err)
assert.Contains(t, err.Error(), "provider.kind is empty")
17 changes: 16 additions & 1 deletion internal/bootstrapper/env.go
@@ -13,7 +13,8 @@ const minimalPATH = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bi

// ReservedEnvPrefix is the environment-variable namespace inforge owns and
// injects itself (the deployment/observability context: INFORGE_DEPLOYMENT_*,
-// INFORGE_SERVICE_NAMESPACE, INFORGE_INSTANCE_ID, INFORGE_HOST_ID). A service must
+// INFORGE_SERVICE_NAMESPACE, INFORGE_INSTANCE_ID, INFORGE_HOST_ID, INFORGE_HOST_TYPE,
+// INFORGE_CLOUD_*). A service must
// not map a secret to a name under this prefix — validation rejects it up front,
// and buildEnv rejects it as a backstop — so an injected value can never silently
// shadow (or be shadowed by) a service secret.
@@ -58,6 +59,20 @@ func buildEnv(d Descriptor, secrets map[string]string, home, instanceID string)
"INFORGE_DEPLOYMENT_FQDN="+dpl.FQDN,
"INFORGE_HOST_ID="+dpl.HostID,
)
// Provider-supplied cloud/host resource identity (ADR-0030) → OTel
// cloud.provider/cloud.region/cloud.availability_zone/host.type. Unlike the
// always-present block above, each is omitted when empty so a provider that
// does not supply it emits nothing (the consumer omits empty attrs anyway).
for _, kv := range [][2]string{
{"INFORGE_CLOUD_PROVIDER", dpl.CloudProvider},
{"INFORGE_CLOUD_REGION", dpl.CloudRegion},
{"INFORGE_CLOUD_AVAILABILITY_ZONE", dpl.AvailabilityZone},
{"INFORGE_HOST_TYPE", dpl.MachineType},
} {
if kv[1] != "" {
env = append(env, kv[0]+"="+kv[1])
}
}
}

names := make([]string, 0, len(d.Env))
42 changes: 36 additions & 6 deletions internal/bootstrapper/env_test.go
@@ -62,18 +62,26 @@ func TestBuildEnvDeployment(t *testing.T) {
Service: "bridge",
User: "bridge",
Deployment: Deployment{
-Region: "us-east-1",
-RegionSlug: "use1",
-Environment: "prd",
-BaseDomain: "wardnet.network",
-FQDN: "bridge.svc.prd.use1.wardnet.network",
-HostID: "wardnet-prd-use1-vm-bridge-01",
+Region: "us-east-1",
+RegionSlug: "use1",
+Environment: "prd",
+BaseDomain: "wardnet.network",
+FQDN: "bridge.svc.prd.use1.wardnet.network",
+HostID: "wardnet-prd-use1-vm-bridge-01",
+CloudProvider: "hetzner",
+CloudRegion: "us-east",
+AvailabilityZone: "ash",
+MachineType: "cx23",
},
}

env, err := buildEnv(d, nil, "/home/bridge", "inst-9")
require.NoError(t, err)

assert.Contains(t, env, "INFORGE_CLOUD_PROVIDER=hetzner")
assert.Contains(t, env, "INFORGE_CLOUD_REGION=us-east")
assert.Contains(t, env, "INFORGE_CLOUD_AVAILABILITY_ZONE=ash")
assert.Contains(t, env, "INFORGE_HOST_TYPE=cx23")
assert.Contains(t, env, "INFORGE_DEPLOYMENT_REGION=us-east-1")
assert.Contains(t, env, "INFORGE_DEPLOYMENT_REGION_SLUG=use1")
assert.Contains(t, env, "INFORGE_DEPLOYMENT_ENV=prd")
@@ -90,6 +98,28 @@ func TestBuildEnvDeployment(t *testing.T) {
}
}

// TestBuildEnvOmitsEmptyCloudAttrs: a deployment whose provider did not supply the
// cloud/host identity (the fields are empty) emits no INFORGE_CLOUD_*/INFORGE_HOST_TYPE
// vars at all — they are omitted, not emitted blank (the always-present deployment
// block still appears).
func TestBuildEnvOmitsEmptyCloudAttrs(t *testing.T) {
d := Descriptor{
Service: "bridge",
User: "bridge",
Deployment: Deployment{Region: "us-east-1", HostID: "wardnet-prd-use1-vm-bridge-01"},
}

env, err := buildEnv(d, nil, "/home/bridge", "inst-9")
require.NoError(t, err)

assert.Contains(t, env, "INFORGE_HOST_ID=wardnet-prd-use1-vm-bridge-01")
for _, e := range env {
for _, name := range []string{"INFORGE_CLOUD_PROVIDER", "INFORGE_CLOUD_REGION", "INFORGE_CLOUD_AVAILABILITY_ZONE", "INFORGE_HOST_TYPE"} {
assert.False(t, strings.HasPrefix(e, name+"="), "%s must be omitted when empty", name)
}
}
}

// TestBuildEnvRejectsReservedName: a secret mapped to a reserved INFORGE_* name
// must fail the start rather than emit a duplicate that collides with the injected
// deployment context.