ref: Stop arroyo ChildProcessTerminated from creating Sentry issues #8094
Merged
A multiprocessing worker in a consumer being killed by a signal (almost always an OOM-kill of the worker, sometimes a native crash) makes arroyo raise ChildProcessTerminated. arroyo logs it at ERROR via logger.exception and then re-raises it, so a single worker death surfaces as several ERROR-level Sentry issues (SNUBA-BA7/BA8/BA9/BAD/B1T) -- four from the "Caught exception, shutting down..." ERROR log (grouped by wherever the SIGCHLD interrupted the run loop) plus one from the re-raised, unhandled exception. The consumer simply shuts down and is restarted by the orchestrator, so these are transient/operational noise, not actionable bugs.

These are ERROR-level, so the WARN->log policy from #8077 does not cover them. Filter them by exception type in before_send (an isinstance check on the cause/context chain -- the robust, non-string approach), alongside the existing AllocationPolicyViolations / RPCAllocationPolicyException filtering. The worker death is still observable via logs and arroyo's sigchld.detected metric.

Fixes SNUBA-BA7
Fixes SNUBA-BA8
Fixes SNUBA-BA9
Fixes SNUBA-BAD
Fixes SNUBA-B1T

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB
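As a rough illustration of the type-based filter this commit describes, the sketch below uses stand-ins rather than the real arroyo and Snuba code. `ChildProcessTerminated` here is a local placeholder class, `NOISE_TYPES` and `_is_noise` are hypothetical names, and the event and hint are plain dicts instead of the sentry_sdk typed objects. It assumes only what the PR states: that `hint["exc_info"]` carries the exception tuple.

```python
from typing import Any, Dict, Optional


class ChildProcessTerminated(Exception):
    """Stand-in for arroyo's exception of the same name (assumption)."""


# Hypothetical tuple of exception types treated as operational noise.
NOISE_TYPES = (ChildProcessTerminated,)


def _is_noise(exc: Optional[BaseException]) -> bool:
    """Return True if exc, or anything reachable through its chain, is noise."""
    seen = set()  # ids of visited exceptions, guards against cyclic chains
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, NOISE_TYPES):
            return True
        seen.add(id(exc))
        if exc.__cause__ is not None:
            exc = exc.__cause__          # explicit `raise ... from cause`
        elif not exc.__suppress_context__:
            exc = exc.__context__        # implicit context, if not suppressed
        else:
            exc = None                   # `raise ... from None` ends the walk
    return False


def before_send(event: Dict[str, Any], hint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Drop the event (return None) when the attached exception is noise."""
    exc_info = hint.get("exc_info")
    if exc_info and _is_noise(exc_info[1]):
        return None
    return event
```

Matching on type rather than on the log message keeps the filter working even if the wording of the "Caught exception, shutting down..." log changes.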
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 503f013.
…ore_send

After `raise X from None`, Python keeps the prior exception on __context__ but sets __suppress_context__ = True. The chain walk used `__cause__ or __context__`, which followed that suppressed context and could drop a legitimate Sentry event whose explicitly-suppressed context happened to contain a noise type. Follow the chain the way CPython's own traceback printer does: explicit cause wins, else the implicit context unless it was suppressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB
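The difference between the naive step and the corrected one can be shown with a short sketch. `next_in_chain` and `make_suppressed` are hypothetical names chosen for illustration, not the functions in this PR; the helper simply encodes the rule stated above (explicit cause first, otherwise the implicit context unless it was suppressed).

```python
from typing import Optional


def next_in_chain(exc: BaseException) -> Optional[BaseException]:
    """Return the next exception a traceback display would follow, if any."""
    if exc.__cause__ is not None:
        return exc.__cause__
    if exc.__suppress_context__:
        return None
    return exc.__context__


def make_suppressed() -> BaseException:
    """Build an exception raised with `from None` inside an except block."""
    try:
        try:
            raise KeyError("noise")
        except KeyError:
            raise ValueError("legitimate") from None
    except ValueError as err:
        return err


err = make_suppressed()
naive_next = err.__cause__ or err.__context__  # still reaches the KeyError
fixed_next = next_in_chain(err)                # stops: context was suppressed
print(type(naive_next).__name__, fixed_next)   # prints: KeyError None
```

With the naive step, a filter looking for `KeyError` would wrongly match the `ValueError` event; with the corrected step the walk ends at the `ValueError` itself.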
RPCAllocationPolicyException.__init__ requires (message, routing_decision_dict). The test passed only the message, which would raise TypeError before reaching before_send. Pass an empty routing_decision_dict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB
Build on the pre-commit autofix (ruff line-wrap) with the type fixes the autofix did not make: annotate event/hint as sentry_sdk Event/Hint, drop the dict(event) calls (a plain dict isn't assignable to the Event TypedDict param) and the bare `-> dict` return (disallow_any_generics), asserting identity against the typed event instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB
MeredithAnya
approved these changes
Jun 23, 2026

Summary
SNUBA-BA7, SNUBA-BA8, SNUBA-BA9, SNUBA-BAD, and SNUBA-B1T are all the same underlying event: a multiprocessing worker in the transactions consumer was killed by a signal (ChildProcessTerminated: 17, where 17 is SIGCHLD, i.e. the parent was notified that a child died; it is not the child's own exit code).

Root cause
arroyo's RunTaskWithMultiprocessing installs a SIGCHLD handler. When a worker child terminates unexpectedly while the strategy is still running, the handler raises ChildProcessTerminated. In StreamProcessor.run() arroyo does: ... and then re-raises. So a single worker death fans out into multiple Sentry issues:

- Four issues (SNUBA-BA7/BA8/BAD/B1T, handled: yes, logger: arroyo.processing.processor): the ERROR-level "Caught exception, shutting down..." log, captured by LoggingIntegration. They're separate issues only because Sentry groups them by wherever the SIGCHLD happened to interrupt the run loop (submit, pickle.dumps, buffer_callback, poll, …).
- One issue (SNUBA-BA9, handled: no, mechanism: excepthook): the same exception, re-raised and uncaught, crashing the process.

The worker death itself is almost always an OOM-kill of the worker (sometimes a native crash). The consumer then shuts down and is restarted by the orchestrator, so it recovers on its own. These have 0 users impacted and super_low/low Seer actionability.

These are ERROR-level, so the WARN → log policy from #8077 does not cover them, which is why they surfaced as top unresolved issues the day after #8077 shipped.

Change
Extend before_send in snuba/environment.py to drop ChildProcessTerminated by exception type (an isinstance check walking the __cause__/__context__ chain, the robust, non-string approach that #8077's review preferred), alongside the existing AllocationPolicyViolations / RPCAllocationPolicyException filtering. Both the logging path and the excepthook path flow through before_send with hint["exc_info"] set, so this suppresses all 5 issues at once.

The events are still captured as logs/breadcrumbs, and the underlying worker death remains observable via arroyo's sigchld.detected metric.

Note / trade-off
This stops the noise; it does not stop workers from dying. If the worker deaths are OOM-kills (the most likely cause), the real mitigation is operational (worker memory limits, --processes, and input_block_size / output_block_size), and those settings live in the deployment config, not this repo. Worth keeping an eye on the sigchld.detected metric for the transactions consumer; a sustained spike there would indicate a genuine crash-loop rather than occasional worker churn.

Test
Adds tests/test_environment.py covering before_send: pass-through (no/None/unrelated exc), dropping ChildProcessTerminated (top-level and nested in a cause chain), dropping the allocation-policy exceptions, and safe termination on a cyclic exception chain.

Fixes SNUBA-BA7
Fixes SNUBA-BA8
Fixes SNUBA-BA9
Fixes SNUBA-BAD
Fixes SNUBA-B1T
🤖 Generated with Claude Code
https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB