Skip to content

plugin-native: teardown race in NativeNodeWrapper::run_source between detached worker thread and dlclose #481

@staging-devin-ai-integration

Description

Summary

NativeNodeWrapper::run_source returns from its pre-Start exit branches (Shutdown / control-channel-close / pre-Start UpdateParams handshake) without joining or signalling the per-instance worker thread. The detached worker continues to hold an Arc<InstanceState> and ultimately runs InstanceState::drop, which calls back into the loaded .so via destroy_instance. When the test (or process) scope ends quickly after run() returns, the worker can still be inside the .so at the moment dlclose runs, causing a SIGSEGV during process teardown.

Repro

Run the new fixture-based source tests under coverage instrumentation:

CARGO_LLVM_COV_TARGET_DIR=target/coverage \
  cargo llvm-cov --no-report nextest -p streamkit-plugin-native --test source_plugin

Without the test-side workaround introduced in the accompanying PR, three tests intermittently abort with signal 11 after they have already asserted and reported test result: ok:

  • source_plugin_shutdown_before_start_exits_cleanly
  • source_plugin_control_channel_close_before_start_exits_cleanly
  • source_plugin_update_params_before_start_takes_effect

(occasionally also the running-then-shutdown variants, which hit the same teardown path).

Why coverage exposes it

cargo-llvm-cov instruments the .so with first-touch profile setup, which makes the very first FFI call into a freshly loaded library noticeably slower. The pre-Start exit branches return without ever forcing the worker through an FFI call (no Tick / Process / Flush ever runs), so the first FFI hit is the eventual destroy_instance triggered from worker drop — exactly when dlclose is most likely to race it.

Pinned by

The new tests in crates/plugin-native/tests/source_plugin.rs call a small drain_detached_worker().await helper (200 ms sleep) before returning so the detached worker can complete destroy_instance before the subprocess starts unloading. The helper carries a doc comment that links back to this issue. The race still exists in production code — only the test reproducer is masked.

  • File: crates/plugin-native/src/wrapper.rs (function run_source, ~lines 1740-1789)
  • Tests: crates/plugin-native/tests/source_plugin.rs (helper drain_detached_worker and the pre-Start tests)

Suggested fix

Make run_source deterministic with respect to its worker on the pre-Start exit branches. Two viable approaches:

  1. Drop the locally-held InstanceWorker before returning from each early-exit branch and explicitly wait on a worker-shutdown signal before run() returns; or
  2. Drive the worker through an explicit no-op FFI "warm" call during new() so the first FFI hit can never coincide with teardown.

The current InstanceWorker::Drop doc-comment already calls out that joining synchronously is unacceptable; the fix needs to route through the async path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingtesting

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions