client: add Prometheus HTTP SD endpoint for local allocations #28116

Open
dberkerdem wants to merge 3 commits into hashicorp:main from dberkerdem:f-client-http-sd

dberkerdem commented Jun 12, 2026

Description

Adds GET /v1/client/allocations/prometheus-sd to the client agent HTTP API, serving the node's running allocations as Prometheus HTTP SD target groups.

One target group is emitted per allocated port of every running allocation, labeled with __meta_nomad_* labels covering namespace, job, task group, allocation, node, and port, plus job/group meta as __meta_nomad_meta_<key> (group meta taking precedence over job meta). A ?port=<label> query parameter filters to a single port label (e.g. ?port=metrics). The endpoint serves local client state only and requires node:read ACL capability, so scrape-target discovery fans out to the client nodes instead of funneling through the servers. Requests on agents without a client return 400, matching the other /v1/client/* endpoints.
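
A minimal sketch of the labeling rules described above, for illustration only. The type and function names, and the exact label names other than the __meta_nomad_meta_ prefix, are assumptions and are not taken from the PR's code. Sanitizing meta keys into valid Prometheus label names is not addressed here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"maps"
	"net"
	"strconv"
)

// targetGroup mirrors the Prometheus HTTP SD wire format:
// a list of "host:port" targets plus a flat label set.
type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// mergeMeta returns __meta_nomad_meta_<key> labels. Group meta is copied
// last, so it overrides job meta when both define the same key.
func mergeMeta(jobMeta, groupMeta map[string]string) map[string]string {
	merged := make(map[string]string, len(jobMeta)+len(groupMeta))
	maps.Copy(merged, jobMeta)
	maps.Copy(merged, groupMeta)

	labels := make(map[string]string, len(merged))
	for k, v := range merged {
		labels["__meta_nomad_meta_"+k] = v
	}
	return labels
}

// portTargetGroup builds one target group for a single allocated port.
// The base labels are copied so the caller's map is never mutated.
func portTargetGroup(base map[string]string, portLabel, hostIP string, port int) targetGroup {
	labels := make(map[string]string, len(base)+1)
	maps.Copy(labels, base)
	labels["__meta_nomad_port_label"] = portLabel // hypothetical label name

	return targetGroup{
		Targets: []string{net.JoinHostPort(hostIP, strconv.Itoa(port))},
		Labels:  labels,
	}
}

func main() {
	base := mergeMeta(
		map[string]string{"team": "core", "tier": "job"},
		map[string]string{"tier": "group"},
	)
	base["__meta_nomad_job"] = "web" // hypothetical label name and value

	tg := portTargetGroup(base, "metrics", "10.0.0.5", 9102)
	out, err := json.MarshalIndent([]targetGroup{tg}, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Copying job meta first and group meta second keeps the precedence rule in one place, so the label-building loop does not need to know which source a key came from.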

The change is purely additive: a new endpoint file, one route registration in http.go, and a small client method exposing running allocations.

Per the AI usage guidelines: AI tooling was used to assist with boilerplate and porting this change onto main (conflict resolution, license headers); all code has been human-reviewed and tested.

Testing & Reproduction steps

Unit tests cover target-group rendering (including legacy task-network ports), port filtering, allocation status filtering, incomplete-allocation handling, meta label precedence, deterministic output ordering, empty-list (not null) JSON wire format, ACL enforcement, and IPv6 host-IP bracketing.

go test ./command/agent/ -run 'TestAllocPromSDTargetGroups|TestPromSDTargetGroup|TestClientServiceDiscoveryRequest'  # 15 tests, all pass
go test ./client/ -run 'TestClient_Allocations'  # pass
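
Two of the wire-format properties in the test list, the empty list (not null) and IPv6 bracketing, come down to standard library behavior. The sketch below is illustrative only; encodeTargets and target are hypothetical helper names, not functions from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"strconv"
)

// encodeTargets marshals a list of targets, normalizing a nil slice to an
// empty one so the JSON body is [] rather than null.
func encodeTargets(targets []string) (string, error) {
	if targets == nil {
		targets = []string{}
	}
	b, err := json.Marshal(targets)
	return string(b), err
}

// target formats a host IP and port as a scrape target. net.JoinHostPort
// adds the square brackets that IPv6 literals require.
func target(hostIP string, port int) string {
	return net.JoinHostPort(hostIP, strconv.Itoa(port))
}

func main() {
	var none []string

	raw, _ := json.Marshal(none)
	fmt.Println(string(raw)) // null

	fixed, _ := encodeTargets(none)
	fmt.Println(fixed) // []

	fmt.Println(target("2001:db8::1", 9102)) // [2001:db8::1]:9102
	fmt.Println(target("10.0.0.5", 9102))    // 10.0.0.5:9102
}
```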

Manual check: run a dev agent with a job exposing a labeled port; curl 'http://127.0.0.1:4646/v1/client/allocations/prometheus-sd?port=metrics' then returns HTTP SD JSON consumable by a Prometheus http_sd_configs block.
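
The same check can be approximated in-process. This sketch uses a stand-in handler with made-up sample data rather than a real agent; sdHandler, discover, and the "port" label key are hypothetical. It shows the ?port= filter and the response shape Prometheus expects from an HTTP SD endpoint (HTTP 200, Content-Type application/json, a JSON list of target groups).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type targetGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels"`
}

// sample stands in for the node's running allocations; all values are made up.
var sample = []targetGroup{
	{Targets: []string{"10.0.0.5:9102"}, Labels: map[string]string{"port": "metrics"}},
	{Targets: []string{"10.0.0.5:8080"}, Labels: map[string]string{"port": "http"}},
}

// sdHandler serves the sample groups, optionally filtered by ?port=<label>.
func sdHandler(w http.ResponseWriter, r *http.Request) {
	want := r.URL.Query().Get("port")
	out := []targetGroup{} // non-nil so an empty result encodes as []
	for _, tg := range sample {
		if want == "" || tg.Labels["port"] == want {
			out = append(out, tg)
		}
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(out)
}

// discover issues a GET against the stand-in handler and decodes the result.
func discover(query string) ([]targetGroup, error) {
	req := httptest.NewRequest(http.MethodGet, "/v1/client/allocations/prometheus-sd"+query, nil)
	rec := httptest.NewRecorder()
	sdHandler(rec, req)

	var groups []targetGroup
	err := json.NewDecoder(rec.Body).Decode(&groups)
	return groups, err
}

func main() {
	groups, err := discover("?port=metrics")
	if err != nil {
		panic(err)
	}
	for _, g := range groups {
		fmt.Println(g.Targets, g.Labels)
	}
}
```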

Links

Feature request: #28115

Contributor Checklist

  • Changelog Entry: added (will rename the .changelog/ file to this PR's number once assigned)
  • Testing: unit tests included for the new endpoint and client method
  • Documentation: happy to submit an API docs page for the new endpoint to web-unified-docs if the feature direction is approved

Add GET /v1/client/service_discovery to the client agent HTTP API,
serving the node's running allocations as Prometheus HTTP SD target
groups (https://prometheus.io/docs/prometheus/latest/http_sd/).

One target group is emitted per allocated port of every running
allocation, labeled with __meta_nomad_* labels covering namespace,
job, task group, allocation, node, and port, plus job/group meta as
__meta_nomad_meta_<key>. A ?port=<label> query parameter filters to
a single port label (e.g. ?port=metrics).

The endpoint serves local client state only and requires node:read,
so scrape-target discovery fans out to the client nodes instead of
funneling through the servers.
@dberkerdem dberkerdem requested review from a team as code owners June 12, 2026 08:13
hashicorp-cla-app Bot commented Jun 12, 2026

CLA assistant check
All committers have signed the CLA.

Avoid overloading 'service discovery', which already names Nomad's
native service catalog — this endpoint serves allocations, not
services. Dashed path segments also match the existing API style.

schmichael (Member) left a comment

Took a first pass and it looks pretty good! Not approving yet because there are some design questions worth discussing in the issue: #28115

Comment on lines +68 to +70
// Listing every allocation on the node spans namespaces, so require
// node:read like the other node-level client endpoints.
aclObj, err := s.ResolveToken(req)

As per my comment on the issue, if we switch to <namespace>:read-job this should change to filter out allocations in namespaces the token is not allowed to read.
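
A rough sketch of what per-namespace filtering could look like, assuming the handler has some way to ask whether the token may read jobs in a given namespace. The alloc type, the nsChecker callback, and filterByNamespace are hypothetical stand-ins and do not reflect Nomad's ACL API.

```go
package main

import "fmt"

// alloc is a pared-down stand-in for an allocation.
type alloc struct {
	ID        string
	Namespace string
}

// nsChecker is a hypothetical stand-in for a resolved ACL token: it reports
// whether the caller may read jobs in the given namespace.
type nsChecker func(namespace string) bool

// filterByNamespace drops allocations in namespaces the token cannot read,
// instead of rejecting the whole request.
func filterByNamespace(allocs []alloc, allowed nsChecker) []alloc {
	out := make([]alloc, 0, len(allocs))
	for _, a := range allocs {
		if allowed(a.Namespace) {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	allocs := []alloc{
		{ID: "a1", Namespace: "default"},
		{ID: "a2", Namespace: "payments"},
		{ID: "a3", Namespace: "default"},
	}
	onlyDefault := func(ns string) bool { return ns == "default" }

	for _, a := range filterByNamespace(allocs, onlyDefault) {
		fmt.Println(a.ID, a.Namespace)
	}
}
```

With this shape a token scoped to one namespace still gets a successful response, just with fewer target groups, which differs from the all-or-nothing node:read check.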

}
maps.Copy(baseLabels, nodeLabels)

// Expose job and task group meta (group overrides job) so schedulers

"schedulers" is a bit weird in this context because in Nomad code that usually refers to the scheduler/ package.

Suggested change
- // Expose job and task group meta (group overrides job) so schedulers
+ // Expose job and task group meta (group overrides job) so users

Comment on lines +102 to +112
// A running allocation always carries its job and allocated
// resources; their absence means corrupted client state. Skip
// the allocation but say so, because its targets silently
// disappearing from a successful response is otherwise
// undebuggable.
if alloc.Job == nil || alloc.AllocatedResources == nil {
	logger.Warn("skipping running allocation with incomplete state in service discovery",
		"alloc_id", alloc.ID, "job_id", alloc.JobID,
		"has_job", alloc.Job != nil, "has_resources", alloc.AllocatedResources != nil)
	continue
}

Panicking here is fine and probably preferable for what should be "unreachable" code. At least if someone has hit these conditions I'd love to hear about it! Since panics by handlers are recovered and logged by the http server, there's no availability danger.

The downside is that 1 corrupted job breaks service discovery if we panic and return a 500 from the handler. I think this is probably desirable because if unexpected corruption has taken place, what other corruption is lurking? This node probably needs to be drained and rebuilt.
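
To make the trade-off concrete, here is a generic sketch of the recover-and-return-500 pattern. It is not Nomad's actual HTTP wrapper; recoverPanics and statusFor are hypothetical names, and the panicking handler only simulates the corrupted-state case.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// recoverPanics wraps a handler so that a panic is logged and turned into
// a 500 response instead of taking down the process.
func recoverPanics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				log.Printf("handler panic: %v", rec)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// statusFor runs a toy discovery handler and returns the HTTP status code.
// When corrupt is true the handler panics, simulating an allocation whose
// job or resources are unexpectedly missing.
func statusFor(corrupt bool) int {
	h := recoverPanics(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if corrupt {
			panic("running allocation with nil job")
		}
		fmt.Fprint(w, "[]")
	}))

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(true), statusFor(false)) // 500 200
}
```

The whole response fails in the corrupted case, so every target on the node disappears from discovery until the node is fixed, which is the downside discussed above.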
