ref: Stop arroyo ChildProcessTerminated from creating Sentry issues #8094

Merged
phacops merged 5 commits into master from claude/happy-pascal-cyazn0
Jun 23, 2026

phacops commented Jun 23, 2026

Summary

SNUBA-BA7, SNUBA-BA8, SNUBA-BA9, SNUBA-BAD, and SNUBA-B1T are all the same underlying event: a multiprocessing worker in the transactions consumer was killed by a signal, and arroyo raised ChildProcessTerminated: 17. The 17 is the signal number of SIGCHLD (on Linux), the signal that tells the parent a child died; it is not the child's own exit code.

Root cause

arroyo's RunTaskWithMultiprocessing installs a SIGCHLD handler. When a worker child terminates unexpectedly while the strategy is still running, the handler raises ChildProcessTerminated. In StreamProcessor.run() arroyo does:

except Exception:
    logger.exception("Caught exception, shutting down...")  # ERROR level, with exc_info

...and then re-raises. So a single worker death fans out into multiple Sentry issues:

  • 4 issues (SNUBA-BA7/BA8/BAD/B1T, handled: yes, logger: arroyo.processing.processor) — the ERROR-level "Caught exception, shutting down..." log, captured by LoggingIntegration. They're separate issues only because Sentry groups them by wherever the SIGCHLD happened to interrupt the run loop (submit, pickle.dumps, buffer_callback, poll, …).
  • 1 issue (SNUBA-BA9, handled: no, mechanism: excepthook) — the same exception, re-raised and uncaught, crashing the process.
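
The log-then-re-raise fan-out described above can be reproduced in isolation. The following is a minimal sketch, not arroyo's code: `ChildProcessTerminated` here is a local stand-in class, `ListHandler` is a hypothetical handler that plays the role of Sentry's LoggingIntegration, and the outer `try` plays the role of the excepthook.

```python
import logging


class ChildProcessTerminated(RuntimeError):
    """Local stand-in for arroyo's exception (assumption, for illustration)."""


class ListHandler(logging.Handler):
    """Hypothetical handler that records log records instead of sending them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def run(logger: logging.Logger) -> None:
    try:
        # Simulates the SIGCHLD handler raising inside the run loop.
        raise ChildProcessTerminated(17)
    except Exception:
        # First report: an ERROR-level log record carrying exc_info.
        logger.exception("Caught exception, shutting down...")
        # Second report: the same exception propagates uncaught.
        raise


logger = logging.getLogger("sketch.processor")
logger.propagate = False  # keep the sketch quiet on stderr
handler = ListHandler()
logger.addHandler(handler)

uncaught = None
try:
    run(logger)
except ChildProcessTerminated as exc:
    uncaught = exc  # what an excepthook would see
```

One raised exception ends up in two places: once attached to the ERROR log record, and once as the uncaught exception. Both carry the same exception object, which is why filtering by exception type can cover both paths.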

The worker death itself is almost always an OOM-kill of the worker (sometimes a native crash). The consumer then shuts down and is restarted by the orchestrator — it recovers on its own. These have 0 users impacted and super_low/low Seer actionability.

These are ERROR-level, so the WARN → log policy from #8077 does not cover them, which is why they surfaced as top unresolved issues the day after #8077 shipped.

Change

Extend before_send in snuba/environment.py to drop ChildProcessTerminated by exception type (an isinstance check walking the __cause__/__context__ chain — the robust, non-string approach #8077's review preferred), alongside the existing AllocationPolicyViolations / RPCAllocationPolicyException filtering. Both the logging path and the excepthook path flow through before_send with hint["exc_info"] set, so this suppresses all 5 issues at once.
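
As a rough illustration of the approach (not the code in snuba/environment.py), a type-based filter might look like the sketch below. `ChildProcessTerminated` is a local stand-in, and `NOISE_TYPES` and `chain_contains` are hypothetical names introduced here. The chain walk prefers an explicit cause, skips a suppressed context, and guards against cycles.

```python
from typing import Any, Optional


class ChildProcessTerminated(RuntimeError):
    """Local stand-in for arroyo's exception (assumption, for illustration)."""


# Hypothetical name: the exception types whose events should be dropped.
NOISE_TYPES = (ChildProcessTerminated,)


def chain_contains(exc: Optional[BaseException], types: tuple) -> bool:
    """Return True if exc or anything in its cause/context chain matches."""
    seen: set[int] = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, types):
            return True
        seen.add(id(exc))  # cycle guard
        if exc.__cause__ is not None:
            exc = exc.__cause__  # explicit `raise ... from ...` wins
        elif exc.__suppress_context__:
            exc = None  # `raise ... from None` hides the context
        else:
            exc = exc.__context__
    return False


def before_send(event: dict[str, Any], hint: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Drop the event (return None) when its exception chain is pure noise."""
    exc_info = hint.get("exc_info")
    if exc_info and chain_contains(exc_info[1], NOISE_TYPES):
        return None
    return event
```

Returning `None` from `before_send` tells the Sentry SDK to discard the event; returning the event passes it through unchanged.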

The events are still captured as logs/breadcrumbs, and the underlying worker death remains observable via arroyo's sigchld.detected metric.

Note / trade-off

This stops the noise; it does not stop workers from dying. If the worker deaths are OOM-kills (the most likely cause), the real mitigation is operational — worker memory limits, --processes, and input_block_size/output_block_size — which live in the deployment config, not this repo. Worth keeping an eye on the sigchld.detected metric for the transactions consumer; a sustained spike there would indicate a genuine crash-loop rather than occasional worker churn.

Test

Adds tests/test_environment.py covering before_send: pass-through (no/None/unrelated exc), dropping ChildProcessTerminated (top-level and nested in a cause chain), dropping the allocation-policy exceptions, and safe termination on a cyclic exception chain.

Fixes SNUBA-BA7
Fixes SNUBA-BA8
Fixes SNUBA-BA9
Fixes SNUBA-BAD
Fixes SNUBA-B1T

🤖 Generated with Claude Code

https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB

A multiprocessing worker in a consumer being killed by a signal (almost
always an OOM-kill of the worker, sometimes a native crash) makes arroyo
raise ChildProcessTerminated. arroyo logs it at ERROR via logger.exception
and then re-raises it, so a single worker death surfaces as several
ERROR-level Sentry issues (SNUBA-BA7/BA8/BA9/BAD/B1T) -- four from the
"Caught exception, shutting down..." ERROR log (grouped by wherever the
SIGCHLD interrupted the run loop) plus one from the re-raised, unhandled
exception. The consumer simply shuts down and is restarted by the
orchestrator, so these are transient/operational noise, not actionable bugs.

These are ERROR-level, so the WARN->log policy from #8077 does not cover
them. Filter them by exception type in before_send (an isinstance check on
the cause/context chain -- the robust, non-string approach), alongside the
existing AllocationPolicyViolations / RPCAllocationPolicyException filtering.
The worker death is still observable via logs and arroyo's sigchld.detected
metric.

Fixes SNUBA-BA7
Fixes SNUBA-BA8
Fixes SNUBA-BA9
Fixes SNUBA-BAD
Fixes SNUBA-B1T

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB

phacops requested a review from a team as a code owner on June 23, 2026 18:11

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 503f013.

Comment thread snuba/environment.py Outdated
…ore_send

After `raise X from None`, Python keeps the prior exception on __context__ but
sets __suppress_context__ = True. The chain walk used `__cause__ or __context__`,
which followed that suppressed context and could drop a legitimate Sentry event
whose explicitly-suppressed context happened to contain a noise type. Follow the
chain the way CPython's own traceback printer does: explicit cause wins, else
the implicit context unless it was suppressed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB
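
The difference between the two chain walks in the commit message above can be seen in a small sketch. `Noise`, `naive_next`, and `printer_next` are hypothetical names used only for illustration; `printer_next` follows the rule the commit describes (explicit cause wins, else the implicit context unless it was suppressed).

```python
from typing import Optional


class Noise(Exception):
    """Hypothetical stand-in for a filtered 'noise' exception type."""


def naive_next(exc: BaseException) -> Optional[BaseException]:
    # The original walk: ignores __suppress_context__.
    return exc.__cause__ or exc.__context__


def printer_next(exc: BaseException) -> Optional[BaseException]:
    # Explicit cause wins; otherwise follow the context only if not suppressed.
    if exc.__cause__ is not None:
        return exc.__cause__
    if exc.__suppress_context__:
        return None
    return exc.__context__


def make_suppressed() -> BaseException:
    """Build a real error raised with `from None` while handling Noise."""
    try:
        try:
            raise Noise("worker died")
        except Noise:
            raise ValueError("real bug") from None
    except ValueError as exc:
        return exc


err = make_suppressed()
```

Here `err.__context__` still holds the `Noise` instance, so `naive_next(err)` reaches it and a filter built on that walk would drop the `ValueError` event. `printer_next(err)` stops at `err`, so the legitimate event survives.
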
Comment thread tests/test_environment.py Outdated
claude and others added 3 commits June 23, 2026 18:15
RPCAllocationPolicyException.__init__ requires (message, routing_decision_dict).
The test passed only the message, which would raise TypeError before reaching
before_send. Pass an empty routing_decision_dict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB
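
The failure mode in the commit message above is easy to reproduce with a stand-in. The class below is not snuba's `RPCAllocationPolicyException`; it only mirrors the two-argument constructor the message describes, and `constructs_ok` is a hypothetical helper.

```python
from typing import Any


class RPCAllocationPolicyException(Exception):
    """Stand-in mirroring the (message, routing_decision_dict) constructor."""

    def __init__(self, message: str, routing_decision_dict: dict[str, Any]) -> None:
        super().__init__(message)
        self.routing_decision_dict = routing_decision_dict


def constructs_ok(*args: Any) -> bool:
    """Return True if the exception can be built with the given arguments."""
    try:
        RPCAllocationPolicyException(*args)
    except TypeError:
        return False
    return True
```

With only the message, construction fails with `TypeError` before any filter under test is reached; adding an empty dict as the second argument lets the test build the exception.
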
Build on the pre-commit autofix (ruff line-wrap) with the type fixes the
autofix did not make: annotate event/hint as sentry_sdk Event/Hint, drop the
dict(event) calls (a plain dict isn't assignable to the Event TypedDict param)
and the bare `-> dict` return (disallow_any_generics), asserting identity
against the typed event instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NmYA6zVesfHV8aXRSFGkjB

phacops merged commit 7c2f1f5 into master on Jun 23, 2026
68 checks passed
phacops deleted the claude/happy-pascal-cyazn0 branch on June 23, 2026 22:49