Problem
HatchetInstrumentor currently wires every spawned run as a parent-child OTel relationship: the caller's trace context
is injected into additionalMetadata (instrumentor.js, injectContext) and the worker extracts it and uses it as
the parent of hatchet.start_step_run (instrumentor.js, passed as parentContext to startActiveSpan).
This is the right default for synchronous, awaited orchestration. It is the wrong default for fire-and-forget
patterns, which are very common with Hatchet:
- Calling runNoWait in a loop to fan out N child runs.
- Children whose start time depends on rate limits, queue depth, or worker availability.
- Long-running children spawned from a short-lived parent.
In all of these, the parent trace's duration is artificially extended to whenever the slowest child finishes,
because a trace's duration runs from its earliest span start to its latest span end, so backends treat the trace
as "open" while any descendant span is still active. Concretely:
- Aggregate p50/p95 of the parent transaction become meaningless (they reflect downstream rate-limit waits, not the
parent's own work).
- The waterfall view is dominated by N deeply nested hatchet.run_workflow / hatchet.start_step_run spans.
- Trace-search and trace-duration alerts on the parent become noisy / useless.
OpenTelemetry has a first-class concept designed for exactly this: span links. A link references the
spawning span without making the child a descendant within the same trace, so each child becomes its own trace (cheap to
query, navigable from the parent), and the parent trace closes when the spawn loop returns.
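To make the duration effect concrete, here is a small worked sketch. It does not use the OpenTelemetry SDK: SpanRecord, traceDurationMs, and simulate are hypothetical names for an in-memory model, and the timings (a 50 ms spawner, children that start after 1 s in the queue and end at 2 s, 9 s, and 30 s) are made-up illustration values.

```typescript
// Minimal in-memory model of spans; these are NOT the OpenTelemetry SDK types.
interface SpanRecord {
  name: string;
  traceId: string;
  startMs: number;
  endMs: number;
  // Span links, modelled here simply as the trace IDs they point at.
  linkedTraceIds: string[];
}

// Trace duration as a backend would show it: latest span end minus earliest span start.
function traceDurationMs(spans: SpanRecord[], traceId: string): number {
  const own = spans.filter((s) => s.traceId === traceId);
  if (own.length === 0) return 0;
  const start = Math.min(...own.map((s) => s.startMs));
  const end = Math.max(...own.map((s) => s.endMs));
  return end - start;
}

// Hypothetical timings: the spawner runs for 50 ms; children finish much later.
const childEndTimes = [2000, 9000, 30000];

function simulate(mode: "parent" | "link"): SpanRecord[] {
  const parent: SpanRecord = {
    name: "spawner",
    traceId: "trace-parent",
    startMs: 0,
    endMs: 50,
    linkedTraceIds: [],
  };
  const children = childEndTimes.map((endMs, i): SpanRecord => ({
    name: "hatchet.start_step_run",
    // Parent semantics: the child joins the spawner's trace.
    // Link semantics: the child gets its own trace and only references the spawner.
    traceId: mode === "parent" ? "trace-parent" : `trace-child-${i}`,
    startMs: 1000,
    endMs,
    linkedTraceIds: mode === "link" ? ["trace-parent"] : [],
  }));
  return [parent, ...children];
}

// Duration of the spawner's trace under each relationship.
const parentChildMs = traceDurationMs(simulate("parent"), "trace-parent");
const linkedMs = traceDurationMs(simulate("link"), "trace-parent");
// Duration of the slowest child's own trace under link semantics.
const slowestChildMs = traceDurationMs(simulate("link"), "trace-child-2");
```

With parent semantics the spawner's trace lasts 30000 ms (the slowest child's end time); with links it lasts 50 ms, and the slowest child is reported as its own 29000 ms trace.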
Proposal
Add a config option to HatchetInstrumentor so users can opt specific task types into link semantics instead of parent
semantics. A predicate is the most flexible shape:
new HatchetInstrumentor({
  // Return true to use a span link instead of parent context for this task's
  // hatchet.start_step_run span. The actionId is the worker's task identifier.
  useLinksInsteadOfParent: (actionId: string) => boolean,
});
Behavior when the predicate returns true for a given action:
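- In instrumentor.js, instead of calling startActiveSpan(name, opts, parentContext, fn), call startSpan(name, { ...opts, links: [{ context: spanContextFromParent }] }) (no parent context argument). The child span becomes a root in
a new trace, with a link pointing back to the spawning span. Two details to settle in the implementation: setting root: true in the span options would guarantee a new trace even if some other context happens to be active on the worker, and keeping startActiveSpan (minus the parentContext argument) rather than startSpan would keep spans created inside the task nested under hatchet.start_step_run.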
- Caller side (instrumentor.js:325) is unchanged: traceparent is still injected into additionalMetadata. The
worker just chooses to interpret it as a link rather than a parent.
Default: () => false, preserving today's behavior exactly.
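The worker-side decision described above can be sketched as a pure function. This is a rough illustration under stated assumptions, not the instrumentor's actual code: SpanContextLike, LinkLike, and SpanOptionsLike are local types that mirror the shapes of SpanContext, Link, and SpanOptions in @opentelemetry/api, and parseTraceparent and planStepSpan are hypothetical helpers. The hand-written traceparent parser stands in for propagation.extract plus trace.getSpanContext, which real code would use instead.

```typescript
// Local structural stand-ins for the @opentelemetry/api shapes (assumption: not the real imports).
interface SpanContextLike {
  traceId: string;
  spanId: string;
  traceFlags: number;
  isRemote?: boolean;
}
interface LinkLike {
  context: SpanContextLike;
}
interface SpanOptionsLike {
  root?: boolean;
  links?: LinkLike[];
}

type UseLinksPredicate = (actionId: string) => boolean;

// Parse a W3C traceparent value: "<version>-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
function parseTraceparent(header: string | undefined): SpanContextLike | undefined {
  if (!header) return undefined;
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!m || m[1] === "ff") return undefined; // version "ff" is invalid
  const traceId = m[2];
  const spanId = m[3];
  // All-zero IDs are invalid per the Trace Context spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, traceFlags: parseInt(m[4], 16), isRemote: true };
}

interface StepSpanPlan {
  // "parent": use the spawner as parent context (today's behavior).
  // "link":   start a root span that carries a link to the spawner.
  // "none":   no usable trace context was found in the metadata.
  mode: "parent" | "link" | "none";
  options: SpanOptionsLike;
  parent?: SpanContextLike;
}

// Decide how hatchet.start_step_run should relate to the spawning span.
function planStepSpan(
  actionId: string,
  additionalMetadata: Record<string, string>,
  useLinksInsteadOfParent: UseLinksPredicate = () => false,
): StepSpanPlan {
  const spawner = parseTraceparent(additionalMetadata["traceparent"]);
  if (!spawner) return { mode: "none", options: {} };
  if (useLinksInsteadOfParent(actionId)) {
    return { mode: "link", options: { root: true, links: [{ context: spawner }] } };
  }
  return { mode: "parent", options: {}, parent: spawner };
}
```

In "link" mode the returned options would be spread into the span options and no parent context passed; in "parent" mode the existing call path stays as it is.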
Why per-task granularity (not a global flag)
A single Hatchet app legitimately wants both behaviors. Synchronous orchestrators (await workflow.run(...)) want
parent-child so latency rolls up. Fan-out spawners (workflow.runNoWait(...) in a loop) want links so the parent trace
doesn't get held open. A global boolean would force one or the other; a per-task predicate lets each workflow declare
what it is.
Per-call granularity (a runNoWait(input, { trace: "link" }) option) would be strictly more flexible, but it's a bigger
API surface change and the per-task predicate covers the vast majority of cases — usually the "kind of relationship" is
a property of the task, not the call site.
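As an illustration of how a per-task predicate might be assembled in application code, here is a small sketch. The linkFor helper and the action IDs in it are hypothetical; the real format of actionId depends on how the worker names its tasks.

```typescript
type UseLinksPredicate = (actionId: string) => boolean;

// Hypothetical helper: build a predicate from exact action IDs and/or ID prefixes.
function linkFor(rules: { exact?: string[]; prefixes?: string[] }): UseLinksPredicate {
  const exact = new Set(rules.exact ?? []);
  const prefixes = rules.prefixes ?? [];
  return (actionId) =>
    exact.has(actionId) || prefixes.some((prefix) => actionId.startsWith(prefix));
}

// Made-up action IDs: fan-out tasks use links, everything else keeps parent-child.
const useLinksInsteadOfParent = linkFor({
  exact: ["send-email:deliver"],
  prefixes: ["fanout-"],
});
```

The resulting function would be passed as the useLinksInsteadOfParent option, so each task's relationship is declared once rather than at every spawn site.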
Workaround today
import { context, ROOT_CONTEXT } from "@opentelemetry/api";

context.with(ROOT_CONTEXT, () => {
  void someWorkflow.runNoWait(input);
});
This detaches the child from the current trace, so it starts as its own root. It works but loses the cross-trace link
entirely (no clickable jump back from child to spawner), and it pushes OTel plumbing into application code at every
spawn site.