client: add Prometheus HTTP SD endpoint for local allocations by dberkerdem · Pull Request #28116 · hashicorp/nomad

dberkerdem · 2026-06-12T08:13:49Z

Description

Adds GET /v1/client/allocations/prometheus-sd to the client agent HTTP API, serving the node's running allocations as Prometheus HTTP SD target groups.

One target group is emitted per allocated port of every running allocation, labeled with __meta_nomad_* labels covering namespace, job, task group, allocation, node, and port, plus job/group meta as __meta_nomad_meta_<key> (group meta taking precedence over job meta). A ?port=<label> query parameter filters to a single port label (e.g. ?port=metrics). The endpoint serves local client state only and requires node:read ACL capability, so scrape-target discovery fans out to the client nodes instead of funneling through the servers. Requests on agents without a client return 400, matching the other /v1/client/* endpoints.
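The rendering rules described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the Alloc and Port types and the targetGroups function are hypothetical stand-ins for Nomad's structs, and every label name except the __meta_nomad_meta_<key> pattern is an assumption, since the description does not spell out the exact names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"sort"
	"strconv"
)

// TargetGroup follows the Prometheus HTTP SD wire format:
// a list of "host:port" targets plus a label set.
type TargetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// Port and Alloc are simplified, hypothetical stand-ins for Nomad's structs.
type Port struct {
	Label  string
	HostIP string
	Value  int
}

type Alloc struct {
	ID, Namespace, JobID, TaskGroup string
	JobMeta, GroupMeta              map[string]string
	Ports                           []Port
}

// targetGroups emits one group per allocated port. An empty portFilter keeps
// every port; otherwise only ports with that label are kept.
func targetGroups(allocs []Alloc, portFilter string) []TargetGroup {
	groups := []TargetGroup{}
	for _, a := range allocs {
		for _, p := range a.Ports {
			if portFilter != "" && p.Label != portFilter {
				continue
			}
			// Label names below are assumed for illustration.
			labels := map[string]string{
				"__meta_nomad_namespace":  a.Namespace,
				"__meta_nomad_job":        a.JobID,
				"__meta_nomad_task_group": a.TaskGroup,
				"__meta_nomad_alloc_id":   a.ID,
				"__meta_nomad_port_label": p.Label,
			}
			// Write job meta first, then group meta, so group values win.
			for k, v := range a.JobMeta {
				labels["__meta_nomad_meta_"+k] = v
			}
			for k, v := range a.GroupMeta {
				labels["__meta_nomad_meta_"+k] = v
			}
			groups = append(groups, TargetGroup{
				Targets: []string{net.JoinHostPort(p.HostIP, strconv.Itoa(p.Value))},
				Labels:  labels,
			})
		}
	}
	// Sort by target address so repeated calls produce the same order.
	sort.Slice(groups, func(i, j int) bool {
		return groups[i].Targets[0] < groups[j].Targets[0]
	})
	return groups
}

func main() {
	allocs := []Alloc{{
		ID: "a1b2", Namespace: "default", JobID: "web", TaskGroup: "app",
		JobMeta:   map[string]string{"team": "platform"},
		GroupMeta: map[string]string{"team": "web"},
		Ports: []Port{
			{Label: "metrics", HostIP: "10.0.0.5", Value: 9100},
			{Label: "http", HostIP: "10.0.0.5", Value: 8080},
		},
	}}
	out, err := json.MarshalIndent(targetGroups(allocs, "metrics"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With the "metrics" filter, the sample allocation yields a single group for 10.0.0.5:9100 whose team meta label carries the group value rather than the job value.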

The change is purely additive: a new endpoint file, one route registration in http.go, and a small client method exposing running allocations.

Per the AI usage guidelines: AI tooling was used to assist with boilerplate and porting this change onto main (conflict resolution, license headers); all code has been human-reviewed and tested.

Testing & Reproduction steps

Unit tests cover target-group rendering (including legacy task-network ports), port filtering, allocation status filtering, incomplete-allocation handling, meta label precedence, deterministic output ordering, empty-list (not null) JSON wire format, ACL enforcement, and IPv6 host-IP bracketing.

go test ./command/agent/ -run 'TestAllocPromSDTargetGroups|TestPromSDTargetGroup|TestClientServiceDiscoveryRequest'  # 15 tests, all pass
go test ./client/ -run 'TestClient_Allocations'  # pass

Manual check: run a dev agent with a job exposing a labeled port; curl 'http://127.0.0.1:4646/v1/client/allocations/prometheus-sd?port=metrics' should then return HTTP SD JSON consumable by a Prometheus http_sd_configs block.

Links

Feature request: #28115

Contributor Checklist

Changelog Entry: added (will rename the .changelog/ file to this PR's number once assigned)
Testing: unit tests included for the new endpoint and client method
Documentation: happy to submit an API docs page for the new endpoint to web-unified-docs if the feature direction is approved

Add GET /v1/client/service_discovery to the client agent HTTP API, serving the node's running allocations as Prometheus HTTP SD target groups (https://prometheus.io/docs/prometheus/latest/http_sd/). One target group is emitted per allocated port of every running allocation, labeled with __meta_nomad_* labels covering namespace, job, task group, allocation, node, and port, plus job/group meta as __meta_nomad_meta_<key>. A ?port=<label> query parameter filters to a single port label (e.g. ?port=metrics). The endpoint serves local client state only and requires node:read, so scrape-target discovery fans out to the client nodes instead of funneling through the servers.

hashicorp-cla-app · 2026-06-12T08:14:10Z

All committers have signed the CLA.

schmichael

Took a first pass and it looks pretty good! Not approving yet because there are some design questions worth discussing in the issue: #28115

schmichael · 2026-06-12T23:13:02Z

+	// Listing every allocation on the node spans namespaces, so require
+	// node:read like the other node-level client endpoints.
+	aclObj, err := s.ResolveToken(req)


As per my comment on the issue, if we switch to <namespace>:read-job this should change to filter out allocations in namespaces the token is not allowed to read.
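The per-namespace filtering suggested here could look roughly like the sketch below. It is an assumption-laden illustration: Alloc is a simplified stand-in, and the allowed predicate is a hypothetical placeholder for whatever check the resolved ACL object would provide for read-job on a namespace.

```go
package main

import "fmt"

// Alloc is a simplified, hypothetical stand-in for a Nomad allocation.
type Alloc struct {
	ID        string
	Namespace string
}

// filterByNamespace keeps only allocations whose namespace passes allowed.
// allowed is a hypothetical predicate standing in for an ACL capability check.
func filterByNamespace(allocs []Alloc, allowed func(ns string) bool) []Alloc {
	out := make([]Alloc, 0, len(allocs))
	for _, a := range allocs {
		if allowed(a.Namespace) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	allocs := []Alloc{
		{ID: "a1", Namespace: "default"},
		{ID: "a2", Namespace: "payments"},
	}
	// Pretend the token may only read jobs in the "default" namespace.
	visible := filterByNamespace(allocs, func(ns string) bool { return ns == "default" })
	fmt.Println(len(visible), visible[0].ID) // 1 a1
}
```

Under this approach a token with access to no namespaces would receive an empty list rather than a permission error, which is a design choice the issue discussion would need to settle.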

schmichael · 2026-06-12T23:16:31Z

+	}
+	maps.Copy(baseLabels, nodeLabels)
+
+	// Expose job and task group meta (group overrides job) so schedulers


"schedulers" is a bit weird in this context because in Nomad code that usually refers to the scheduler/ package.

Suggested change

-	// Expose job and task group meta (group overrides job) so schedulers
+	// Expose job and task group meta (group overrides job) so users

schmichael · 2026-06-12T23:23:44Z

+		// A running allocation always carries its job and allocated
+		// resources; their absence means corrupted client state. Skip
+		// the allocation but say so, because its targets silently
+		// disappearing from a successful response is otherwise
+		// undebuggable.
+		if alloc.Job == nil || alloc.AllocatedResources == nil {
+			logger.Warn("skipping running allocation with incomplete state in service discovery",
+				"alloc_id", alloc.ID, "job_id", alloc.JobID,
+				"has_job", alloc.Job != nil, "has_resources", alloc.AllocatedResources != nil)
+			continue
+		}


Panicking here is fine and probably preferable for what should be "unreachable" code. At least if someone has hit these conditions I'd love to hear about it! Since panics by handlers are recovered and logged by the http server, there's no availability danger.

The downside is that 1 corrupted job breaks service discovery if we panic and return a 500 from the handler. I think this is probably desirable because if unexpected corruption has taken place, what other corruption is lurking? This node probably needs to be drained and rebuilt.

dberkerdem requested review from a team as code owners June 12, 2026 08:13

changelog: rename entry to PR number

e5c4a76

jrasell added this to Nomad - Community Issues Triage Jun 12, 2026

github-project-automation Bot moved this to Needs Triage in Nomad - Community Issues Triage Jun 12, 2026

rename endpoint to /v1/client/allocations/prometheus-sd

ad48383

Avoid overloading 'service discovery', which already names Nomad's native service catalog — this endpoint serves allocations, not services. Dashed path segments also match the existing API style.

schmichael reviewed Jun 12, 2026

dberkerdem wants to merge 3 commits into hashicorp:main from dberkerdem:f-client-http-sd


3 participants
