[FEAT] OTel: option to use span links instead of parent context for fire-and-forget child runs #3641

@Mdev303

Problem

HatchetInstrumentor currently wires every spawned run as a parent-child OTel relationship: the caller's trace context
is injected into additionalMetadata (instrumentor.js, injectContext) and the worker extracts it and uses it as
the parent of hatchet.start_step_run (instrumentor.js, passed as parentContext to startActiveSpan).

This is the right default for synchronous, awaited orchestration. It is the wrong default for fire-and-forget
patterns, which are very common with Hatchet:

  • Calling runNoWait in a loop to fan out N child runs.
  • Children whose start time depends on rate limits, queue depth, or worker availability.
  • Long-running children spawned from a short-lived parent.

In all of these, the parent trace's duration is artificially extended to whenever the slowest child finishes,
because OTel backends derive a trace's duration from its earliest span start to its latest span end, so the trace
looks "open" until every descendant span has finished. Concretely:

  • Aggregate p50/p95 of the parent transaction become meaningless (they reflect downstream rate-limit waits, not the
    parent's own work).
  • The waterfall view is dominated by N deeply nested hatchet.run_workflow / hatchet.start_step_run spans.
  • Trace-search and trace-duration alerts on the parent become noisy or useless.
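
To make the duration effect concrete, here is a minimal sketch with hypothetical numbers. It assumes a backend that reports trace duration as earliest span start to latest span end across all spans sharing a trace ID; `FinishedSpan` and `traceDurations` are illustrative names, not part of Hatchet or the OTel SDK.

```typescript
// Minimal stand-in for a finished span; not the OTel SDK's span type.
interface FinishedSpan {
  traceId: string;
  name: string;
  startMs: number;
  endMs: number;
}

// Trace duration computed as earliest start to latest end per trace ID.
function traceDurations(spans: FinishedSpan[]): Record<string, number> {
  const bounds: Record<string, { start: number; end: number }> = {};
  for (const s of spans) {
    const b = bounds[s.traceId];
    if (b === undefined) {
      bounds[s.traceId] = { start: s.startMs, end: s.endMs };
    } else {
      b.start = Math.min(b.start, s.startMs);
      b.end = Math.max(b.end, s.endMs);
    }
  }
  const durations: Record<string, number> = {};
  for (const traceId of Object.keys(bounds)) {
    durations[traceId] = bounds[traceId].end - bounds[traceId].start;
  }
  return durations;
}

// Hypothetical timings: the spawner takes 40 ms; the child waits on a rate
// limit, starts at 55 s, and finishes at 60 s.
const parentChild: FinishedSpan[] = [
  { traceId: "t1", name: "spawner", startMs: 0, endMs: 40 },
  { traceId: "t1", name: "hatchet.start_step_run", startMs: 55000, endMs: 60000 },
];
// Same timings, but the child is a linked root in its own trace.
const linked: FinishedSpan[] = [
  { traceId: "t1", name: "spawner", startMs: 0, endMs: 40 },
  { traceId: "t2", name: "hatchet.start_step_run", startMs: 55000, endMs: 60000 },
];

const parentChildDurations = traceDurations(parentChild); // t1: 60000
const linkedDurations = traceDurations(linked);           // t1: 40, t2: 5000
```

With parent-child wiring the spawner's trace reads as 60 s even though its own work took 40 ms; with a link, the spawner's trace stays at 40 ms and the child's 5 s run is reported in its own trace.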

OpenTelemetry has a first-class concept that is designed exactly for this: span links. A link references the
spawning span without making the child a descendant in the same trace, so each child becomes its own trace (cheap to
query, navigable back to the spawning span), and the parent trace closes when the spawn loop returns.

Proposal

Add a config option to HatchetInstrumentor so users can opt specific task types into link semantics instead of parent
semantics. A predicate is the most flexible shape:

new HatchetInstrumentor({
  // Return true to use a span link instead of parent context for this task's
  // hatchet.start_step_run span. The actionId is the worker's task identifier.
  useLinksInsteadOfParent: (actionId: string) => boolean,
});

Behavior when the predicate returns true for a given action:

  • In instrumentor.js, instead of calling startActiveSpan(name, opts, parentContext, fn), call
    startSpan(name, { ...opts, links: [{ context: spanContextFromParent }] }) with no parent context argument. The
    child span becomes a root in a new trace, with a link pointing back to the spawning span.
  • Caller side (instrumentor.js:325) is unchanged — traceparent is still injected into additionalMetadata. The
    worker just chooses to interpret it as a link rather than a parent.

Default: () => false, preserving today's behavior exactly.
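
The worker-side decision described above can be sketched without touching the SDK. This is a rough illustration, not the instrumentor's actual code: `SpanContextLike`, `SpanRelation`, `parseTraceparent`, and `resolveRelation` are hypothetical names, the types are local stand-ins shaped like `@opentelemetry/api`'s `SpanContext` and `Link`, and it assumes the metadata key is literally `traceparent` in the current four-field W3C format.

```typescript
// Local stand-in shaped like @opentelemetry/api's SpanContext.
interface SpanContextLike {
  traceId: string;
  spanId: string;
  traceFlags: number;
  isRemote: boolean;
}

// What the worker should do with the spawner's context (hypothetical type).
type SpanRelation =
  | { kind: "parent"; parent: SpanContextLike }
  | { kind: "link"; links: { context: SpanContextLike }[] }
  | { kind: "root" };

// Parse a W3C traceparent value: version-traceId-spanId-flags.
// Only the current four-field layout is handled in this sketch.
function parseTraceparent(value: string | undefined): SpanContextLike | undefined {
  if (value === undefined) return undefined;
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(value.trim());
  if (m === null) return undefined;
  const version = m[1];
  const traceId = m[2];
  const spanId = m[3];
  const flags = m[4];
  // Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, traceFlags: parseInt(flags, 16), isRemote: true };
}

// Decide between parent and link semantics for one action.
// The predicate defaults to () => false, matching the proposed default.
function resolveRelation(
  actionId: string,
  metadata: Record<string, string>,
  useLinksInsteadOfParent: (actionId: string) => boolean = () => false,
): SpanRelation {
  const spawner = parseTraceparent(metadata["traceparent"]);
  if (spawner === undefined) return { kind: "root" };
  return useLinksInsteadOfParent(actionId)
    ? { kind: "link", links: [{ context: spawner }] }
    : { kind: "parent", parent: spawner };
}
```

In a real implementation, the "link" case would presumably pass `links` together with `root: true` in the span options, since `SpanOptions.root` tells the tracer to ignore any ambient active context. Whether to keep `startActiveSpan` (so spans created inside the task still nest under the new root) or switch to `startSpan` is a design choice left to the maintainers.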

Why per-task granularity (not a global flag)

A single Hatchet app legitimately wants both behaviors. Synchronous orchestrators (await workflow.run(...)) want
parent-child so latency rolls up. Fan-out spawners (workflow.runNoWait(...) in a loop) want links so the parent trace
doesn't get held open. A global boolean would force one or the other; a per-task predicate lets each workflow declare
what it is.

Per-call granularity (a runNoWait(input, { trace: "link" }) option) would be strictly more flexible, but it's a bigger
API surface change and the per-task predicate covers the vast majority of cases — usually the "kind of relationship" is
a property of the task, not the call site.
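
As an illustration of how a per-task predicate might be assembled in application code, here is a small sketch. `linkFor` and the action IDs are hypothetical; only the predicate's signature, `(actionId: string) => boolean`, comes from the proposal.

```typescript
type LinkPredicate = (actionId: string) => boolean;

// Build a predicate from exact action IDs and/or ID prefixes (hypothetical helper).
function linkFor(rules: { exact?: string[]; prefixes?: string[] }): LinkPredicate {
  const exact = rules.exact ?? [];
  const prefixes = rules.prefixes ?? [];
  return (actionId) =>
    exact.indexOf(actionId) !== -1 ||
    prefixes.some((p) => actionId.indexOf(p) === 0);
}

// Fan-out tasks opt into link semantics; everything else keeps parent-child.
const useLinks = linkFor({
  exact: ["notifications:send-email"],
  prefixes: ["fanout:"],
});
```

Under the proposal, `useLinks` would be passed as `useLinksInsteadOfParent` when constructing `HatchetInstrumentor`, so the choice lives in one place rather than at each spawn site.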

Workaround today

import { context, ROOT_CONTEXT } from "@opentelemetry/api";

context.with(ROOT_CONTEXT, () => {
  void someWorkflow.runNoWait(input);
});

This detaches the child from the current trace, so it starts as its own root. It works but loses the cross-trace link
entirely (no clickable jump back from child to spawner), and it pushes OTel plumbing into application code at every
spawn site.

Labels: accepted, enhancement