Skip to content

Add fork safety and cooperative cancellation to native trace exporter#5835

Open
lloeki wants to merge 3 commits into
lloeki/native-transportfrom
lloeki/native-trace-exporter-fork-safety
Open

Add fork safety and cooperative cancellation to native trace exporter#5835
lloeki wants to merge 3 commits into
lloeki/native-transportfrom
lloeki/native-trace-exporter-fork-safety

Conversation

@lloeki
Copy link
Copy Markdown
Member

@lloeki lloeki commented May 28, 2026

What does this PR do?

Add fork safety and cooperative request cancellation to the native trace exporter C extension:

  1. Fork safety native methods: _native_before_fork, _native_after_fork_in_parent,
    _native_after_fork_in_child on TraceExporter, delegating to libdatadog's
    SharedRuntime fork hooks.

  2. Cooperative cancellation: Replace RUBY_UBF_IO with a per-send cancellation token.
    When Ruby interrupts the thread (shutdown, Thread#kill), the custom UBF cancels the
    token, which cooperatively aborts the in-flight HTTP request in the Rust runtime. This
    replaces the signal-based approach which could not actually cancel the Rust HTTP pipeline.

  3. AtForkMonkeyPatch wiring: Register a :child callback in
    Transport::Native::Transport#initialize that calls _native_after_fork_in_child to
    recreate the tokio runtime in forked child processes (Puma, Unicorn, Passenger).

Motivation:

FUP to #5690. Companion to DataDog/libdatadog#2051 which adds the FFI surface this
extension calls.

Without fork hooks, the Rust tokio runtime is dead in child processes after fork, and
subsequent send calls would hang or fail. Without cooperative cancellation, Thread#kill
or Ruby shutdown during a send could leave the HTTP request running in the background.

Change log entry

None (not yet wired into the default tracer transport; no user-visible change).

Additional Notes:

  • AtForkMonkeyPatch only supports :child stage — before_fork and after_fork_in_parent
    are exposed as native methods but not wired into automatic callbacks. The child path is
    the critical one: SharedRuntime::after_fork_child creates a fresh runtime regardless of
    whether the parent was prepared.
  • The cancellation token is stack-allocated per send call and passed by address to both
    the GVL-released function and the UBF.

AI was used to accelerate implementation; all code was reviewed and understood.

How to test the change?

bundle exec rake clean compile
bundle exec rspec spec/datadog/tracing/transport/native/

48 native exporter specs pass end-to-end. Fork safety and cancellation will need
integration tests with actual forking (future work).

@dd-octo-sts dd-octo-sts Bot added the tracing label May 28, 2026
@datadog-datadog-prod-us1-2
Copy link
Copy Markdown

datadog-datadog-prod-us1-2 Bot commented May 28, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 16 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-rb | install-dependencies: [2.6, amd64]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). Compilation error due to unknown type name 'ddog_TracerSpan' in trace_exporter.c.

DataDog/apm-reliability/dd-trace-rb | install-dependencies: [2.6, arm64]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). Compilation error in trace_exporter.c:15:1: unknown type name ‘ddog_TracerSpan’ and multiple implicit declaration of functions causing make failure.

DataDog/apm-reliability/dd-trace-rb | install-dependencies: [2.7, amd64]   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). Compilation error in trace_exporter.c:15:1: unknown type name ‘ddog_TracerSpan’.

View all 16 failed jobs.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 67d68aa | Docs | Datadog PR Page | Give us feedback!

@lloeki lloeki marked this pull request as ready for review May 28, 2026 14:40
@lloeki lloeki requested review from a team as code owners May 28, 2026 14:40
@lloeki lloeki requested review from vpellan and wantsui and removed request for a team May 28, 2026 14:41
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa6e0db5c2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +666 to +667
ddog_TraceExporterCancelToken cancel_token =
ddog_trace_exporter_cancel_token_new();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Update libdatadog before calling new exporter APIs

This now calls the cancellation-token API, but the repository still declares and locks libdatadog to 33.0.0.1.0, whose installed datadog/data-pipeline.h does not define ddog_TraceExporterCancelToken or the new cancel/fork functions. I verified with the current lockfile dependency by running bundle exec rake compile, which fails in this file on these new symbols, so anyone building the gem against the declared dependency cannot install the native extension until the libdatadog dependency is bumped or the calls are guarded.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this depends on DataDog/libdatadog#2051 being merged.

Comment on lines 680 to +682
rb_thread_call_without_gvl2(
send_chunks_without_gvl, &args,
RUBY_UBF_IO, NULL);
interrupt_exporter_call, &cancel_token);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Propagate interrupts after cooperative cancellation

When Thread#kill or shutdown interrupts an in-flight send, the new UBF cancels the token and can make the Rust call return with args.send_ran == true; this loop then skips check_if_pending_exception() and falls through to create a transport error response instead of reliably raising the pending interrupt. In that scenario a writer thread that was killed during a native send may continue running after cancellation, so the pending exception should be checked after the GVL call once native response cleanup is safe, not only when the send never started.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member

@marcotc marcotc May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such a tiny window here for this to happen in send_chunks_without_gvl:

  args->send_ran = true;
  return NULL;

(give or take other internals in rb_thread_call_without_gvl2)

But I guess that's an accurate concern.

Copy link
Copy Markdown
Member

@Strech Strech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a question on reporting it

Core::Utils::AtForkMonkeyPatch.at_fork(:child) do
exporter._native_after_fork_in_child
rescue => e
Datadog.logger.warn { "Native transport after-fork reset failed: #{e}" }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a warning? Or that means - no traces for the fork child?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking the implementation of _native_after_fork_in_child, this looks like just a defensive rescue for uncommon failures like "TraceExporter has not been initialized or was already freed", which are errors likely "impossible" to happen on a well coded implementation (aka exceptions caught here should happen during code changes in development, not in production). I think this error message is just for ourselves really, since we don't expect this to fail.

But, I agree that we should log what would happen if it failed, something like: "Native transport after-fork reset failed. Traces might not be send to Datadog: "

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, I think it makes sense at least to explain the consequences of that error in the warning.

Copy link
Copy Markdown
Contributor

@vpellan vpellan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering the same question as Sergey otherwise LGTM

Copy link
Copy Markdown
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only #5835 (comment) needs to be acked/addressed.

@VianneyRuhlmann
Copy link
Copy Markdown

AtForkMonkeyPatch only supports :child stage — before_fork and after_fork_in_parent
are exposed as native methods but not wired into automatic callbacks. The child path is
the critical one: SharedRuntime::after_fork_child creates a fresh runtime regardless of
whether the parent was prepared.

This a blocker, all hooks need to be called. Otherwise the child may inherit a locked mutex causing deadlocks when dropping ressources. In addition some system ressources need to be dropped before the fork or may cause panics when dropped in the child (kqueue handle on macos).

Another requirement of libdatadog functions is that the rust functions are not interrupted by a fork. In dd-trace-py this is handled by joining on all threads that run python functions before the fork.

lloeki added 3 commits May 29, 2026 13:51
Expose `_native_before_fork`, `_native_after_fork_in_parent`, and
`_native_after_fork_in_child` instance methods that delegate to
libdatadog's SharedRuntime fork hooks. These coordinate the tokio
runtime lifecycle around process forks (Puma, Unicorn, Passenger).
Create a cancellation token per send call and pass it to the custom
unblock function. When Ruby interrupts the thread (shutdown,
Thread#kill), the UBF cancels the token, which cooperatively aborts
the in-flight HTTP request in the Rust runtime. This replaces the
signal-based RUBY_UBF_IO which could not actually cancel the Rust
HTTP pipeline.
Register a `:child` callback that calls `_native_after_fork_in_child`
on the exporter to recreate the tokio runtime in forked child
processes. Without this, the Rust runtime is dead after fork and
subsequent send calls would hang or fail.

The `AtForkMonkeyPatch` only supports `:child` stage, so
`before_fork` and `after_fork_in_parent` are not called. The child
path is the critical one: it creates a fresh runtime regardless of
whether the parent was prepared.
@lloeki lloeki force-pushed the lloeki/native-trace-exporter-fork-safety branch from 0ba47a9 to 67d68aa Compare May 29, 2026 12:02
@lloeki lloeki force-pushed the lloeki/native-transport branch from cb933d2 to 66f6fa8 Compare May 29, 2026 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants