manager: expose Prometheus workqueue/reflector metrics; drop dead webhook port#27

Merged
tamalsaha merged 4 commits into master from manager-probe-server on Jun 3, 2026
@tamalsaha tamalsaha commented Jun 3, 2026

Summary

Make the OCM AddOn fargocd manager subcommand observable. Before this PR /metrics returned only go_* / process_* collectors — enough to confirm the pod was alive but blind to reconcile backlog or list/watch behaviour. Companion to kubeops/installer#485, which wires the chart-side surface (ClusterIP Service, ServiceMonitor, and probes against the addon-framework's existing /healthz on :8443).

1. Workqueue metrics

Blank-import k8s.io/component-base/metrics/prometheus/workqueue in the manager package. Its init() registers the standard workqueue_{depth, adds_total, queue_duration_seconds, work_duration_seconds, retries_total, longest_running_processor_seconds, unfinished_work_seconds} collectors against legacyregistry (which backs the addon-framework's /metrics) and calls workqueue.SetProvider.
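
A blank import is enough here because Go runs a package's init() as soon as the package is linked in, even when no identifier from it is used. The sketch below models that side effect in a single file. The `registered` map and `providerSet` flag are hypothetical stand-ins for legacyregistry and workqueue.SetProvider, not the component-base API; the real wiring is only the import line shown in the comment.

```go
package main

import "fmt"

// In the real manager package the wiring is a single line (taken from the PR text):
//
//	import _ "k8s.io/component-base/metrics/prometheus/workqueue"
//
// Everything below is a stand-in model, not the component-base API.

// registered stands in for legacyregistry: the collector names visible on /metrics.
var registered = map[string]bool{}

// providerSet stands in for the effect of calling workqueue.SetProvider.
var providerSet bool

// init runs once when the package is loaded, before main. This is why a
// blank import, which uses no identifiers, still triggers registration.
func init() {
	for _, name := range []string{
		"workqueue_depth",
		"workqueue_adds_total",
		"workqueue_queue_duration_seconds",
		"workqueue_work_duration_seconds",
		"workqueue_retries_total",
		"workqueue_longest_running_processor_seconds",
		"workqueue_unfinished_work_seconds",
	} {
		registered[name] = true
	}
	providerSet = true
}

func main() {
	// By the time main starts, init has already populated the registry.
	fmt.Println(len(registered), providerSet)
}
```

The metric names are the seven listed above; nothing else about the upstream package is assumed.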

2. Reflector (informer) metrics

k8s.io/client-go/tools/cache exposes a MetricsProvider interface but ships no off-the-shelf Prometheus wrapper, so register a labelled set of collectors and bind them via SetReflectorMetricsProvider:

  • reflector_lists_total, reflector_list_duration_seconds, reflector_items_per_list
  • reflector_watches_total, reflector_short_watches_total, reflector_watch_duration_seconds, reflector_items_per_watch
  • reflector_last_resource_version

Each is labelled name=<resource> so per-informer behaviour is distinguishable. go.mod promotes github.com/prometheus/client_golang from indirect to direct; no vendor change (already vendored transitively).
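
The shape of such a provider can be sketched without any dependencies. The version below is a minimal model under stated assumptions, not the code in this PR: the three small interfaces and the eight constructor method names are redeclared locally to mirror, as far as I know, cache.MetricsProvider in client-go, and the in-memory `store` is a hypothetical replacement for the Prometheus vectors. Real code would register Prometheus collectors and pass the provider to cache.SetReflectorMetricsProvider.

```go
package main

import (
	"fmt"
	"sync"
)

// Local stand-ins mirroring the small metric interfaces in client-go's cache package.
type CounterMetric interface{ Inc() }
type SummaryMetric interface{ Observe(float64) }
type GaugeMetric interface{ Set(float64) }

// store is a hypothetical in-memory replacement for Prometheus vectors.
type store struct {
	mu     sync.Mutex
	values map[string]float64
}

func newStore() *store { return &store{values: map[string]float64{}} }

func (s *store) add(key string, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[key] += v
}

func (s *store) set(key string, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.values[key] = v
}

func (s *store) get(key string) float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.values[key]
}

// series builds a key such as reflector_lists_total{name="pods"}.
func series(metric, name string) string {
	return fmt.Sprintf("%s{name=%q}", metric, name)
}

// sample satisfies all three interfaces for one metric and one reflector name.
type sample struct {
	s            *store
	metric, name string
}

func (m sample) Inc()          { m.s.add(series(m.metric, m.name), 1) }
func (m sample) Set(v float64) { m.s.set(series(m.metric, m.name), v) }
func (m sample) Observe(v float64) {
	m.s.add(series(m.metric+"_sum", m.name), v)
	m.s.add(series(m.metric+"_count", m.name), 1)
}

// provider hands out one labelled metric per reflector name.
type provider struct{ s *store }

func (p provider) NewListsMetric(n string) CounterMetric { return sample{p.s, "reflector_lists_total", n} }
func (p provider) NewListDurationMetric(n string) SummaryMetric {
	return sample{p.s, "reflector_list_duration_seconds", n}
}
func (p provider) NewItemsInListMetric(n string) SummaryMetric {
	return sample{p.s, "reflector_items_per_list", n}
}
func (p provider) NewWatchesMetric(n string) CounterMetric { return sample{p.s, "reflector_watches_total", n} }
func (p provider) NewShortWatchesMetric(n string) CounterMetric {
	return sample{p.s, "reflector_short_watches_total", n}
}
func (p provider) NewWatchDurationMetric(n string) SummaryMetric {
	return sample{p.s, "reflector_watch_duration_seconds", n}
}
func (p provider) NewItemsInWatchMetric(n string) SummaryMetric {
	return sample{p.s, "reflector_items_per_watch", n}
}
func (p provider) NewLastResourceVersionMetric(n string) GaugeMetric {
	return sample{p.s, "reflector_last_resource_version", n}
}

func main() {
	s := newStore()
	p := provider{s: s}

	// Two reflectors with different names keep separate series.
	pods := p.NewListsMetric("pods")
	pods.Inc()
	pods.Inc()
	p.NewListsMetric("secrets").Inc()

	fmt.Println(
		s.get(series("reflector_lists_total", "pods")),
		s.get(series("reflector_lists_total", "secrets")),
	) // prints "2 1"
}
```

Keying every series by the reflector name is what makes per-informer behaviour distinguishable; the reflector names "pods" and "secrets" are illustrative only.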

Caveat: cache.SetReflectorMetricsProvider is gated by a sync.Once upstream, so only the first caller takes effect. If a future vendored package calls it before we do, our call is silently ignored and the reflector_* series never receive data. Nothing in the current import graph does this; flagging for the future.
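
The first-caller-wins behaviour of a sync.Once-gated setter can be shown with the standard library alone. This is a minimal sketch of the pattern; `metricsFactory` and its fields are hypothetical names, not client-go's internals.

```go
package main

import (
	"fmt"
	"sync"
)

// metricsFactory is a hypothetical stand-in for a package-level metrics
// factory whose provider can be installed only once.
type metricsFactory struct {
	once     sync.Once
	provider string
}

// SetProvider installs p only on the first call. Later calls return without
// error and without effect, which is what makes the failure mode silent.
func (f *metricsFactory) SetProvider(p string) {
	f.once.Do(func() { f.provider = p })
}

func main() {
	f := &metricsFactory{provider: "noop"}

	f.SetProvider("other-package") // e.g. an init() in some dependency runs first
	f.SetProvider("ours")          // ignored: the Once has already fired

	fmt.Println(f.provider) // prints "other-package"
}
```

Because the losing call reports nothing, the only visible symptom would be missing or static reflector_* data on /metrics.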

3. Drop dead webhook port from the embedded fargocd chart

pkg/manager/agent-manifests/fargocd is the spoke chart shipped via OCM (kept byte-identical with installer/charts/fargocd). The deployment declared containerPort: 9443 (https) and the service forwarded port 443 to it, but the operator only constructs a controller-runtime webhook server via webhook.NewServer(...) and never calls SetupWebhookWithManager. No admission handlers are registered, so nothing was listening behind the TLS endpoint. Drop both port entries. The installer-side copy is dropped in kubeops/installer#485.

Resulting /metrics surface

| Family | Source |
| --- | --- |
| workqueue_* (7 metrics, labelled by queue) | blank import |
| reflector_* (8 metrics, labelled by resource) | custom provider |
| go_*, process_* | legacyregistry defaults (already there) |
| apiserver_*, etcd3_*, … | genericapiserver defaults; definitions exist but values stay at 0 (no real apiserver/etcd) |

Probes are not introduced here; the chart in kubeops/installer#485 points liveness/readiness at the addon-framework's existing HTTPS /healthz on the same :8443 endpoint.
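
Pointing probes at an HTTPS endpoint with a self-signed certificate works because kubelet's httpGet probes skip certificate verification and treat any status from 200 to 399 as success. The sketch below imitates that check against a local test server; `probe` and `newHealthServer` are hypothetical helpers written for this illustration, and httptest's built-in certificate stands in for the addon-framework's runtime self-signed one.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probe performs an HTTPS GET without verifying the server certificate and
// reports whether the status code falls in the 200-399 range.
func probe(url string) (bool, error) {
	client := &http.Client{
		Timeout: 5 * time.Second,
		Transport: &http.Transport{
			// Skipping verification is only appropriate for probe-style checks.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 400, nil
}

// newHealthServer starts a local TLS server that answers /healthz with "ok".
// Its certificate is not trusted by a default client, much like a runtime
// self-signed certificate would not be.
func newHealthServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	})
	return httptest.NewTLSServer(mux)
}

func main() {
	srv := newHealthServer()
	defer srv.Close()

	ok, err := probe(srv.URL + "/healthz")
	fmt.Println(ok, err) // prints "true <nil>"
}
```

A request to any other path on this test server returns 404, which the same check reports as a failed probe.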

Test plan

  • go build ./... clean
  • go vet ./... clean
  • helm template of the embedded fargocd chart renders without the 9443 / 443 entries
  • In-cluster: scrape https://<pod>:8443/metrics (with -k for the self-signed cert) and confirm workqueue_depth{name=…} and reflector_lists_total{name=…} appear with non-empty label values once informers start
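
The in-cluster check can be partly automated once the /metrics body has been fetched. The helper below is a rough sketch: `hasNamedSeries` is a hypothetical function, the sample text is made up for illustration, and the label parsing is deliberately naive (it would mis-split label values that contain commas or braces).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sample is invented exposition text, not output captured from a real pod.
const sample = `# HELP workqueue_depth Current depth of workqueue
# TYPE workqueue_depth gauge
workqueue_depth{name="example-queue"} 0
reflector_lists_total{name=""} 3
go_goroutines 42
`

// hasNamedSeries reports whether body contains at least one sample of metric
// whose name label is present and non-empty.
func hasNamedSeries(body, metric string) bool {
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip comments and any line that is not exactly this metric with labels.
		if strings.HasPrefix(line, "#") || !strings.HasPrefix(line, metric+"{") {
			continue
		}
		end := strings.Index(line, "}")
		if end < 0 {
			continue
		}
		labels := line[len(metric)+1 : end]
		for _, pair := range strings.Split(labels, ",") {
			k, v, ok := strings.Cut(pair, "=")
			if ok && strings.TrimSpace(k) == "name" && strings.Trim(strings.TrimSpace(v), `"`) != "" {
				return true
			}
		}
	}
	return false
}

func main() {
	fmt.Println(
		hasNamedSeries(sample, "workqueue_depth"),
		hasNamedSeries(sample, "reflector_lists_total"),
	) // prints "true false"
}
```

In the sample, reflector_lists_total is reported as missing because its name label is empty, which is the condition the test plan asks to rule out.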

The fargocd chart's deployment declared a containerPort 9443 (`https`)
and the service forwarded port 443 to it, but the operator only
constructs a controller-runtime webhook server -- no admission handlers
are ever registered, so nothing was listening behind that TLS endpoint.
Drop both from the embedded chart copy under pkg/manager/agent-manifests
(the installer-side copy is updated in the kubeops/installer repo).

For the OCM AddOn `fargocd manager` subcommand, the addon-framework
controller binds HTTPS on :8443 with a runtime self-signed cert
(SAN=localhost), which makes kubelet probes awkward. Stand up a plain
HTTP probe server (default :8081) that serves /healthz and /readyz,
gated by --health-probe-bind-address (set to empty to disable). The
embedded fargocd chart still uses controller-runtime's own probe
plumbing on the same port name, so naming stays consistent.

Signed-off-by: Tamal Saha <tamal@appscode.com>
tamalsaha added 2 commits June 3, 2026 14:25
The addon-framework controller exposes Prometheus /metrics on its
:8443 HTTPS endpoint via genericapiserver, but the framework itself
registers no collectors. Without a workqueue metrics provider, the
endpoint only returns Go runtime + process collectors, which is
enough to confirm the pod is alive but says nothing about reconcile
backlog or throughput.

Blank-import k8s.io/component-base/metrics/prometheus/workqueue in
the manager package so its init() registers the prometheus provider
against client-go's workqueue (via workqueue.SetProvider) and adds
the standard workqueue_{depth,adds_total,queue_duration_seconds,
work_duration_seconds,retries_total,longest_running_processor_seconds,
unfinished_work_seconds} collectors to legacyregistry.

The ServiceMonitor shipped by the fargocd-manager chart in
kubeops/installer#485 picks these up automatically — no chart
change needed.

Signed-off-by: Tamal Saha <tamal@appscode.com>
Pairs with the workqueue blank import: workqueue metrics cover
reconcile backlog/throughput on the addon controllers, reflector
metrics cover the list/watch behaviour of the informers feeding
those reconcilers.

client-go/tools/cache exposes a MetricsProvider interface but ships
no off-the-shelf prometheus wrapper, so register a labelled set of
collectors (reflector_{lists_total, list_duration_seconds,
items_per_list, watches_total, short_watches_total,
watch_duration_seconds, items_per_watch, last_resource_version})
against legacyregistry and bind them via SetReflectorMetricsProvider.
Each metric is labelled by reflector name (the watched resource
type) so per-informer behaviour is distinguishable.

go.mod: promote github.com/prometheus/client_golang from indirect
to direct -- it was already vendored transitively, so no vendor/
change.

Signed-off-by: Tamal Saha <tamal@appscode.com>
@tamalsaha tamalsaha changed the title from "manager: serve health probes on :8081; drop dead webhook port" to "manager: expose health probes and Prometheus workqueue/reflector metrics" Jun 3, 2026
The chart change in kubeops/installer#485 points liveness/readiness
probes at the addon-framework's existing HTTPS /healthz on :8443
(kubelet's httpGet skips TLS verification, so the runtime self-signed
cert is fine). With that, the dedicated plain-HTTP server on :8081
has no consumer -- drop the listener, the --health-probe-bind-address
flag, and the ProbeAddr option.

The workqueue and reflector metrics wired up in the previous commits
stay (they live on the same :8443 endpoint as /healthz).

Signed-off-by: Tamal Saha <tamal@appscode.com>
tamalsaha added a commit to kubeops/installer that referenced this pull request Jun 3, 2026
Drop the dedicated probes (8081) container port and point the
readiness + liveness httpGet at the metrics port (8443) with
scheme: HTTPS. The OCM addon-framework's genericapiserver already
serves /healthz there, and kubelet's httpGet probes skip TLS
verification so the runtime self-signed cert (SAN=localhost) is
not an issue.

Drops the dependency on the now-removed --health-probe-bind-address
plumbing in kubeops/fargocd#27.

Signed-off-by: Tamal Saha <tamal@appscode.com>
@tamalsaha tamalsaha changed the title from "manager: expose health probes and Prometheus workqueue/reflector metrics" to "manager: expose Prometheus workqueue/reflector metrics; drop dead webhook port" Jun 3, 2026
@tamalsaha tamalsaha merged commit 81c0b0d into master Jun 3, 2026
4 checks passed
@tamalsaha tamalsaha deleted the manager-probe-server branch June 3, 2026 08:58