Add fork safety and cooperative cancellation to native trace exporter#5835
Add fork safety and cooperative cancellation to native trace exporter#5835lloeki wants to merge 3 commits into
Conversation
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa6e0db5c2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ddog_TraceExporterCancelToken cancel_token = | ||
| ddog_trace_exporter_cancel_token_new(); |
There was a problem hiding this comment.
Update libdatadog before calling new exporter APIs
This now calls the cancellation-token API, but the repository still declares and locks libdatadog to 33.0.0.1.0, whose installed datadog/data-pipeline.h does not define ddog_TraceExporterCancelToken or the new cancel/fork functions. I verified with the current lockfile dependency by running bundle exec rake compile, which fails in this file on these new symbols, so anyone building the gem against the declared dependency cannot install the native extension until the libdatadog dependency is bumped or the calls are guarded.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Correct, this depends on DataDog/libdatadog#2051 being merged.
| rb_thread_call_without_gvl2( | ||
| send_chunks_without_gvl, &args, | ||
| RUBY_UBF_IO, NULL); | ||
| interrupt_exporter_call, &cancel_token); |
There was a problem hiding this comment.
Propagate interrupts after cooperative cancellation
When Thread#kill or shutdown interrupts an in-flight send, the new UBF cancels the token and can make the Rust call return with args.send_ran == true; this loop then skips check_if_pending_exception() and falls through to create a transport error response instead of reliably raising the pending interrupt. In that scenario a writer thread that was killed during a native send may continue running after cancellation, so the pending exception should be checked after the GVL call once native response cleanup is safe, not only when the send never started.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Such a tiny window here for this to happen in send_chunks_without_gvl:
args->send_ran = true;
return NULL;(give or take other internals in rb_thread_call_without_gvl2)
But I guess that's an accurate concern.
Strech
left a comment
There was a problem hiding this comment.
LGTM, just a question on reporting it
| Core::Utils::AtForkMonkeyPatch.at_fork(:child) do | ||
| exporter._native_after_fork_in_child | ||
| rescue => e | ||
| Datadog.logger.warn { "Native transport after-fork reset failed: #{e}" } |
There was a problem hiding this comment.
Is this a warning? Or that means - no traces for the fork child?
There was a problem hiding this comment.
Looking the implementation of _native_after_fork_in_child, this looks like just a defensive rescue for uncommon failures like "TraceExporter has not been initialized or was already freed", which are errors likely "impossible" to happen on a well coded implementation (aka exceptions caught here should happen during code changes in development, not in production). I think this error message is just for ourselves really, since we don't expect this to fail.
But, I agree that we should log what would happen if it failed, something like: "Native transport after-fork reset failed. Traces might not be send to Datadog: "
There was a problem hiding this comment.
Exactly, I think it makes sense at least to explain the consequences of that error in the warning.
vpellan
left a comment
There was a problem hiding this comment.
Wondering the same question as Sergey otherwise LGTM
marcotc
left a comment
There was a problem hiding this comment.
Only #5835 (comment) needs to be acked/addressed.
This a blocker, all hooks need to be called. Otherwise the child may inherit a locked mutex causing deadlocks when dropping ressources. In addition some system ressources need to be dropped before the fork or may cause panics when dropped in the child (kqueue handle on macos). Another requirement of libdatadog functions is that the rust functions are not interrupted by a fork. In dd-trace-py this is handled by joining on all threads that run python functions before the fork. |
Expose `_native_before_fork`, `_native_after_fork_in_parent`, and `_native_after_fork_in_child` instance methods that delegate to libdatadog's SharedRuntime fork hooks. These coordinate the tokio runtime lifecycle around process forks (Puma, Unicorn, Passenger).
Create a cancellation token per send call and pass it to the custom unblock function. When Ruby interrupts the thread (shutdown, Thread#kill), the UBF cancels the token, which cooperatively aborts the in-flight HTTP request in the Rust runtime. This replaces the signal-based RUBY_UBF_IO which could not actually cancel the Rust HTTP pipeline.
Register a `:child` callback that calls `_native_after_fork_in_child` on the exporter to recreate the tokio runtime in forked child processes. Without this, the Rust runtime is dead after fork and subsequent send calls would hang or fail. The `AtForkMonkeyPatch` only supports `:child` stage, so `before_fork` and `after_fork_in_parent` are not called. The child path is the critical one: it creates a fresh runtime regardless of whether the parent was prepared.
0ba47a9 to
67d68aa
Compare
cb933d2 to
66f6fa8
Compare
What does this PR do?
Add fork safety and cooperative request cancellation to the native trace exporter C extension:
Fork safety native methods:
_native_before_fork,_native_after_fork_in_parent,_native_after_fork_in_childonTraceExporter, delegating to libdatadog'sSharedRuntimefork hooks.Cooperative cancellation: Replace
RUBY_UBF_IOwith a per-send cancellation token.When Ruby interrupts the thread (shutdown,
Thread#kill), the custom UBF cancels thetoken, which cooperatively aborts the in-flight HTTP request in the Rust runtime. This
replaces the signal-based approach which could not actually cancel the Rust HTTP pipeline.
AtForkMonkeyPatch wiring: Register a
:childcallback inTransport::Native::Transport#initializethat calls_native_after_fork_in_childtorecreate the tokio runtime in forked child processes (Puma, Unicorn, Passenger).
Motivation:
FUP to #5690. Companion to DataDog/libdatadog#2051 which adds the FFI surface this
extension calls.
Without fork hooks, the Rust tokio runtime is dead in child processes after fork, and
subsequent send calls would hang or fail. Without cooperative cancellation,
Thread#killor Ruby shutdown during a send could leave the HTTP request running in the background.
Change log entry
None (not yet wired into the default tracer transport; no user-visible change).
Additional Notes:
AtForkMonkeyPatchonly supports:childstage —before_forkandafter_fork_in_parentare exposed as native methods but not wired into automatic callbacks. The child path is
the critical one:
SharedRuntime::after_fork_childcreates a fresh runtime regardless ofwhether the parent was prepared.
the GVL-released function and the UBF.
AI was used to accelerate implementation; all code was reviewed and understood.
How to test the change?
48 native exporter specs pass end-to-end. Fork safety and cancellation will need
integration tests with actual forking (future work).