feat: Support tracing configuration for REST API services #2449

@thossain-nv

Is this a new feature, an enhancement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem this feature solves

Tracing support in the REST API services is present but not enabled, because there is no mechanism for configuring a tracer provider.

Current State

  • The main API has tracing config and middleware, but it installs the middleware only when tracing.enabled is true and never installs a real tracer provider or exporter, so the resulting spans are not exported. See rest-api/api/internal/server/server.go:
if cfg.GetTracingEnabled() {
    svcName := cfg.GetTracingServiceName()
    if svcName != "" {
        e.Use(otelecho.Middleware(svcName, otelecho.WithSkipper(skipTracingRoutes), otelecho.WithPropagators(otprop.OT{})))
    }
}

Feature Description

Enable Tracing for REST API Services

The rest-api services should use repo config to enable tracing, and standard OpenTelemetry environment variables to configure export.

Configuration Model

  • tracing.enabled: Enables tracing for API and Workflow.
  • tracing.serviceName: Default service name when OTEL_SERVICE_NAME is not set.
  • OTEL_EXPORTER_OTLP_*: Configures the OTLP exporter endpoint, protocol, headers, and TLS behavior.
  • OTEL_RESOURCE_ATTRIBUTES: Adds shared resource metadata.
  • OTEL_PROPAGATORS: Configures trace context propagation.

Helm

Enable tracing for API and Workflow in Helm values:

nico-rest-api:
  config:
    tracing:
      enabled: true
      serviceName: nico-rest-api

nico-rest-workflow:
  config:
    tracing:
      enabled: true
      serviceName: nico-rest-workflow

Add standard OTEL environment variables to each traced deployment:

env:
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector.observability.svc.cluster.local:4317
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: service.namespace=nico-rest,deployment.environment=prod
  - name: OTEL_PROPAGATORS
    value: tracecontext,baggage
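
To illustrate how a bootstrap might turn the protocol variable above into a choice of exporter, here is a minimal sketch using only the standard library. The names exporterKind and selectExporter are hypothetical and not part of the repo. Checking OTEL_EXPORTER_OTLP_TRACES_PROTOCOL first, and falling back to http/protobuf when nothing is set, are assumptions based on the OpenTelemetry environment variable conventions rather than anything this issue specifies.

```go
package main

import (
	"fmt"
	"strings"
)

// exporterKind names which OTLP trace exporter package a bootstrap would use.
// Hypothetical type, introduced only for this sketch.
type exporterKind string

const (
	exporterGRPC exporterKind = "grpc"          // would map to otlptracegrpc
	exporterHTTP exporterKind = "http/protobuf" // would map to otlptracehttp
)

// selectExporter picks an exporter kind from the standard protocol env vars.
// getenv is injected (os.Getenv in a real binary) so the logic is testable.
func selectExporter(getenv func(string) string) (exporterKind, error) {
	// Assumption: the traces-specific variable wins over the generic one.
	proto := getenv("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
	if proto == "" {
		proto = getenv("OTEL_EXPORTER_OTLP_PROTOCOL")
	}
	switch strings.ToLower(strings.TrimSpace(proto)) {
	case "grpc":
		return exporterGRPC, nil
	case "", "http/protobuf":
		// Assumption: fall back to http/protobuf when no protocol is set.
		return exporterHTTP, nil
	default:
		return "", fmt.Errorf("unsupported OTLP protocol %q", proto)
	}
}
```

With the example values above (OTEL_EXPORTER_OTLP_PROTOCOL set to grpc), this sketch would select the gRPC exporter. Rejecting unknown protocols with an error, rather than silently falling back, is one possible design choice that makes misconfiguration visible at startup.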

In most cases, leave OTEL_SERVICE_NAME unset for API and Workflow so that tracing.serviceName is used. Set it only when a deployment needs a more specific service identity, for example:

env:
  - name: OTEL_SERVICE_NAME
    value: nico-rest-cloud-worker
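
The precedence between OTEL_SERVICE_NAME and tracing.serviceName could be expressed as a small helper. This is a sketch only: resolveServiceName is a hypothetical name, and trimming whitespace is an assumption added for robustness.

```go
package main

import "strings"

// resolveServiceName returns the service name a bootstrap would report.
// OTEL_SERVICE_NAME wins when set; otherwise the configured
// tracing.serviceName value (configName) is used as the fallback.
// getenv is injected (os.Getenv in a real binary) so the logic is testable.
func resolveServiceName(configName string, getenv func(string) string) string {
	if name := strings.TrimSpace(getenv("OTEL_SERVICE_NAME")); name != "" {
		return name
	}
	return strings.TrimSpace(configName)
}
```

For example, with tracing.serviceName set to nico-rest-workflow and OTEL_SERVICE_NAME set to nico-rest-cloud-worker, the helper would return nico-rest-cloud-worker. With the environment variable unset, it would return nico-rest-workflow.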

Apply the OTEL exporter env vars to all services that should participate in end-to-end traces:

nico-rest-api
nico-rest-cloud-worker
nico-rest-site-worker
nico-rest-site-manager
nico-rest-cert-manager

Describe your ideal solution

  • Add a shared tracing package, for example rest-api/common/pkg/otel/trace.go, that:
    • Creates an OTLP trace exporter from standard env vars such as OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_EXPORTER_OTLP_HEADERS, and TLS/insecure settings.
    • Supports grpc and http/protobuf protocols by using the matching OTLP trace exporter package.
    • Builds an SDK tracer provider with a batch span processor and resource attributes from OTEL_RESOURCE_ATTRIBUTES, OTEL_SERVICE_NAME, and the configured service name fallback.
    • Sets the global tracer provider and a standard propagator, preferably W3C Trace Context + Baggage, with optional legacy ot propagation only if needed for compatibility.
    • Returns a shutdown function so each binary can flush spans on exit.
  • Wire the shared bootstrap into rest-api/api/cmd/api/main.go before DB, Temporal, and Echo setup. Keep tracing.enabled as the on/off switch and use tracing.serviceName only as a fallback when OTEL_SERVICE_NAME is unset.
  • Wire the same bootstrap into rest-api/workflow/cmd/workflow/main.go, then pass the existing tInterceptors into tsdkClient.Options so Temporal spans are emitted.
  • Replace ad hoc gates:
    • In rest-api/db/pkg/db/session.go, enable bunotel when a real tracer provider is enabled, or pass an explicit tracing option into NewSessionFromConfig if avoiding globals is preferred.
    • In the site-workflow gRPC clients, replace LS_SERVICE_NAME with a standard/shared tracing-enabled check.
  • Update cert-manager/site-manager startup by making rest-api/cert-manager/pkg/core/otel.go call the shared bootstrap when standard OTEL endpoint env vars are present. This lets existing otelhttp wrappers emit spans without adding new service-specific config.
  • Clean up stale LightStep references: remove the LightStep TODO in rest-api/cert-manager/pkg/core/httpservice.go, replace LS_SERVICE_NAME usage, and update the old site-agent OTEL_EXPORTER_OTLP_SPAN_* env names to standard OTEL names if that path is still supported.
  • Add focused tests for the shared bootstrap and startup gates:
    • Provider is no-op when tracing is disabled or no exporter endpoint is configured.
    • OTEL_SERVICE_NAME overrides config service name.
    • Protocol selection chooses gRPC vs HTTP/protobuf correctly.
    • API/workflow startup enables middleware/interceptors when tracing is enabled.

Describe any alternatives you have considered

No response

Additional context

No response

Code of Conduct

  • I agree to follow NCX Infra Controller's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
