Summary
NativeNodeWrapper::run_source returns from its pre-Start exit branches (Shutdown / control-channel-close / pre-Start UpdateParams handshake) without joining or signalling the per-instance worker thread. The detached worker continues to hold an Arc<InstanceState> and ultimately runs InstanceState::drop, which calls back into the loaded .so via destroy_instance. When the test (or process) scope ends quickly after run() returns, the worker can still be inside the .so at the moment dlclose runs, causing a SIGSEGV during process teardown.
Repro
Run the new fixture-based source tests under coverage instrumentation:
CARGO_LLVM_COV_TARGET_DIR=target/coverage \
cargo llvm-cov --no-report nextest -p streamkit-plugin-native --test source_plugin
Without the test-side workaround introduced in the accompanying PR, three tests intermittently abort with signal 11 after they have already asserted and reported test result: ok:
source_plugin_shutdown_before_start_exits_cleanly
source_plugin_control_channel_close_before_start_exits_cleanly
source_plugin_update_params_before_start_takes_effect
(occasionally also the running-then-shutdown variants, which hit the same teardown path).
Why coverage exposes it
cargo-llvm-cov instruments the .so with first-touch profile setup, which makes the very first FFI call into a freshly loaded library noticeably slower. The pre-Start exit branches return without ever forcing the worker through an FFI call (no Tick / Process / Flush ever runs), so the first FFI hit is the eventual destroy_instance triggered from worker drop — exactly when dlclose is most likely to race it.
Pinned by
The new tests in crates/plugin-native/tests/source_plugin.rs call a small drain_detached_worker().await helper (200 ms sleep) before returning so the detached worker can complete destroy_instance before the subprocess starts unloading. The helper carries a doc comment that links back to this issue. The race still exists in production code — only the test reproducer is masked.
- File:
crates/plugin-native/src/wrapper.rs (function run_source, ~lines 1740-1789)
- Tests:
crates/plugin-native/tests/source_plugin.rs (helper drain_detached_worker and the pre-Start tests)
Suggested fix
Make run_source deterministic with respect to its worker on the pre-Start exit branches. Two viable approaches:
- Drop the locally-held
InstanceWorker before returning from each early-exit branch and explicitly wait on a worker-shutdown signal before run() returns; or
- Drive the worker through an explicit no-op FFI "warm" call during
new() so the first FFI hit can never coincide with teardown.
The current InstanceWorker::Drop doc-comment already calls out that joining synchronously is unacceptable; the fix needs to route through the async path.
Summary
NativeNodeWrapper::run_sourcereturns from its pre-Start exit branches (Shutdown / control-channel-close / pre-Start UpdateParams handshake) without joining or signalling the per-instance worker thread. The detached worker continues to hold anArc<InstanceState>and ultimately runsInstanceState::drop, which calls back into the loaded.soviadestroy_instance. When the test (or process) scope ends quickly afterrun()returns, the worker can still be inside the.soat the moment dlclose runs, causing a SIGSEGV during process teardown.Repro
Run the new fixture-based source tests under coverage instrumentation:
Without the test-side workaround introduced in the accompanying PR, three tests intermittently abort with signal 11 after they have already asserted and reported
test result: ok:source_plugin_shutdown_before_start_exits_cleanlysource_plugin_control_channel_close_before_start_exits_cleanlysource_plugin_update_params_before_start_takes_effect(occasionally also the running-then-shutdown variants, which hit the same teardown path).
Why coverage exposes it
cargo-llvm-covinstruments the .so with first-touch profile setup, which makes the very first FFI call into a freshly loaded library noticeably slower. The pre-Start exit branches return without ever forcing the worker through an FFI call (no Tick / Process / Flush ever runs), so the first FFI hit is the eventualdestroy_instancetriggered from worker drop — exactly when dlclose is most likely to race it.Pinned by
The new tests in
crates/plugin-native/tests/source_plugin.rscall a smalldrain_detached_worker().awaithelper (200 ms sleep) before returning so the detached worker can completedestroy_instancebefore the subprocess starts unloading. The helper carries a doc comment that links back to this issue. The race still exists in production code — only the test reproducer is masked.crates/plugin-native/src/wrapper.rs(functionrun_source, ~lines 1740-1789)crates/plugin-native/tests/source_plugin.rs(helperdrain_detached_workerand the pre-Start tests)Suggested fix
Make
run_sourcedeterministic with respect to its worker on the pre-Start exit branches. Two viable approaches:InstanceWorkerbefore returning from each early-exit branch and explicitly wait on a worker-shutdown signal beforerun()returns; ornew()so the first FFI hit can never coincide with teardown.The current
InstanceWorker::Dropdoc-comment already calls out that joining synchronously is unacceptable; the fix needs to route through the async path.