Skip to content

fix(plugin-native): join worker thread on all run_source exit paths#519

Open
staging-devin-ai-integration[bot] wants to merge 1 commit into
mainfrom
devin/1779719238-fix-source-worker-join
Open

fix(plugin-native): join worker thread on all run_source exit paths#519
staging-devin-ai-integration[bot] wants to merge 1 commit into
mainfrom
devin/1779719238-fix-source-worker-join

Conversation

@staging-devin-ai-integration
Copy link
Copy Markdown
Contributor

@staging-devin-ai-integration staging-devin-ai-integration Bot commented May 25, 2026

Summary

  • run_source returned from pre-Start exit branches (Shutdown, control-channel close, cancellation) and post-Start exits (shutdown during ticks, tick error, normal completion) without joining the per-instance worker thread. The detached worker could still be inside the .so when dlclose ran during process teardown, causing SIGSEGV under coverage instrumentation.
  • Store the worker thread's JoinHandle in InstanceWorker and add an async shutdown() method that drops the channel sender then joins the thread via spawn_blocking.
  • All exit paths in run_source now call shutdown() before returning. The post-Start tick loop is restructured to converge at a single exit point rather than using early returns, so the join always happens.
  • The Drop impl remains as a fallback for run_processor and ?-error exits — it detaches the thread (safe because the Arc<InstanceState> keeps the plugin alive).
  • Removed the test-side drain_detached_worker() 200ms sleep workaround since the production code now guarantees the worker has exited.

Review & Validation

  • Verify the shutdown() method correctly sequences: drop sender → spawn_blocking join — no deadlock risk since the worker only blocks on the channel.
  • Confirm the borrow scope { let worker_tx = &worker.tx; loop { … } } releases the borrow before worker.shutdown().await.
  • Run source plugin tests under coverage: CARGO_LLVM_COV_TARGET_DIR=target/coverage cargo llvm-cov --no-report nextest -p streamkit-plugin-native --test source_plugin

Closes #481

Link to Devin session: https://staging.itsdev.in/sessions/6f336a5db7d444018cff5cf532c6f6c7
Requested by: @streamer45


Devin Review

Status Commit
🟢 Reviewed b988723
Open in Devin Review (Staging)

Store the worker thread's JoinHandle in InstanceWorker and add an
async shutdown() method that drops the channel sender then joins
the thread via spawn_blocking.  All pre-Start and post-Start exit
paths in run_source now call shutdown() so the worker completes
before the function returns, preventing use-after-dlclose races
during process teardown.

The test-side drain_detached_worker() sleep workaround is removed
since the production code now guarantees the worker has exited.

Closes #481

Signed-off-by: streamkit-devin <devin@streamkit.dev>
@staging-devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

Copy link
Copy Markdown
Contributor Author

@staging-devin-ai-integration staging-devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

Open in Devin Review (Staging)
Debug

Playground

Comment on lines +426 to +441
impl InstanceWorker {
/// Drop the channel sender (signalling the worker to exit) and join
/// its thread via `spawn_blocking` so we don't block the async runtime.
async fn shutdown(mut self) {
let handle = self.join_handle.take();
let node_id = self.node_id.clone();
drop(self);
if let Some(h) = handle {
let _ = tokio::task::spawn_blocking(move || {
if let Err(panic) = h.join() {
tracing::warn!(node = %node_id, "Worker thread panicked: {panic:?}");
}
})
.await;
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: Processor workers are still detached on non-source exit paths

The PR adds a join handle to every InstanceWorker, but worker.shutdown().await is only wired into the source-plugin lifecycle. Processor plugins still rely on InstanceWorker::Drop detaching the thread when run_processor returns or errors, because run_processor never calls shutdown before returning at crates/plugin-native/src/wrapper.rs:1635-1650. This appears intentional given the existing timeout contract and the comment in Drop, but reviewers should be aware that the teardown-race fix is scoped to source plugins/tests rather than all native plugin workers.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct — this is intentional. The Drop fallback (detach) remains for run_processor and is documented in both the Drop comment and the PR summary. Extending shutdown() to run_processor would require the same restructuring there and is a separate concern; the issue (#481) is specifically about the source plugin teardown race exposed by coverage instrumentation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed tracking issues for the two observations above:

Comment on lines +1973 to +1995
let (reply_tx, reply_rx) = tokio::sync::oneshot::channel();
self.send_to_worker(
WorkerCallContext {
op: "tick",
node: &node_name,
state_tx: Some(&context.state_tx),
telemetry: Some(&telemetry),
metric_labels: &self.state.labels_tick,
},
worker_tx,
WorkerRequest::Tick { reply: reply_tx },
)
.await?;

// Send outputs produced by tick. If the output channel is closed,
// stop ticking — source nodes have no input-close backstop so we must
// detect consumer disconnect here.
let mut output_closed = false;
for (pin, pkt) in reply.outputs {
if context.output_sender.send(&pin, pkt).await.is_err() {
tracing::debug!(node = %node_name, "Output channel closed during tick");
output_closed = true;
let reply = self
.await_reply(
"tick",
&node_name,
reply_rx,
Some(&context.state_tx),
Some(&telemetry),
&self.state.labels_tick,
)
.await?;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: Early timeout errors bypass source shutdown join

Inside the source tick loop, send_to_worker(...).await?, await_reply(...).await?, and apply_params_update(...).await? still return early from run_source before the final worker.shutdown().await at crates/plugin-native/src/wrapper.rs:2067. That means timed-out or wedged plugin calls keep the old detach behavior instead of joining, which avoids blocking forever on a stuck FFI call but also means the new deterministic join only applies to clean source exits and plugin-reported tick errors.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — this is deliberate. The ? exits are for dead-worker or timeout scenarios where the FFI call may be stuck indefinitely. Joining there would risk blocking spawn_blocking for the full timeout duration (or forever if the plugin wedges). The Arc<InstanceState> keeps the plugin instance alive until the worker finishes, so detaching is safe in those error paths.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in #523 — a bounded join with a short secondary timeout could be explored there.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 25, 2026

Codecov Report

❌ Patch coverage is 67.34694% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.41%. Comparing base (b2d01c2) to head (b988723).

Files with missing lines Patch % Lines
crates/plugin-native/src/wrapper.rs 67.34% 48 Missing ⚠️

❌ Your patch check has failed because the patch coverage (67.34%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #519      +/-   ##
==========================================
+ Coverage   79.39%   79.41%   +0.01%     
==========================================
  Files         232      232              
  Lines       66904    66935      +31     
  Branches     1909     1909              
==========================================
+ Hits        53117    53155      +38     
+ Misses      13781    13774       -7     
  Partials        6        6              
Flag Coverage Δ
backend 79.14% <67.34%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
core 85.27% <ø> (ø)
engine 83.59% <ø> (-0.10%) ⬇️
api 89.98% <ø> (ø)
nodes 75.38% <ø> (+0.04%) ⬆️
server 80.26% <ø> (ø)
plugin-native 83.55% <67.34%> (+0.07%) ⬆️
plugin-wasm 91.90% <ø> (ø)
ui-services 84.67% <ø> (ø)
ui-components 60.49% <ø> (ø)
Files with missing lines Coverage Δ
crates/plugin-native/src/wrapper.rs 81.50% <67.34%> (+0.11%) ⬆️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

plugin-native: teardown race in NativeNodeWrapper::run_source between detached worker thread and dlclose

2 participants