Skip to content

stall-detection: rate-limit symbolication of resolved stalls#16

Open
josefbacik wants to merge 1 commit into
anthropics:anthropic-1.49.0from
josefbacik:stall-detection-symbolication-rate-limit
Open

stall-detection: rate-limit symbolication of resolved stalls#16
josefbacik wants to merge 1 commit into
anthropics:anthropic-1.49.0from
josefbacik:stall-detection-symbolication-rate-limit

Conversation

@josefbacik

Copy link
Copy Markdown

Motivation

A production systing memory trace of public-api (Python service, multiple hundred-MB Rust PyO3 .sos, triokio tokio runtime per worker) showed the stall monitor thread's symbolication pipeline at ~54 GB/min of allocations, ~21% of the node's entire allocation volume — stacks anchored at run_monitorsymbolicate_tracegimli::resolveelf::parse.

Two mechanisms compound the cost:

  1. Cache thrash. backtrace::resolve caches at most 4 mappings globally (MAPPINGS_CACHE_SIZE in the backtrace crate). A process with dozens of .sos thrashes that cache — each symbolication re-parses ELF/DWARF for most frames. A single stall trace spanning 5+ .sos can evict the whole cache before it finishes.
  2. Global-lock feedback loop. backtrace::resolve takes the crate's global cache mutex. The monitor thread holding it while parsing a 300 MB .so's DWARF can block any other in-process resolver (e.g. an eyre::Report backtrace, see antenv-rs #470087). If that blocked caller is itself a tokio worker inside poll, the monitor detects another stall — which triggers more symbolication.

An in-fork symbolication cache (per-.so addr2line::Context keyed by mapping) was considered and rejected: the parsed abbreviation tables and line-program headers are ~5–30 MB of private memory per .so for binaries this size. With ~8 Rust .sos × 24 worker processes per pod, that's several GB of new resident memory for a feature that should be lightweight observability.

What this does

Add a SymbolicationLimiter that permits at most one symbolicated resolved-stall report per min_symbolication_interval (default: the escalation threshold's default, 10 s).

  • Resolved stalls that arrive faster than the interval are still detected, counted, and reported via tracing and the on_stall callback — with raw {ip:#x} frames instead of symbolicated ones. Stall counting is never suppressed, only the expensive per-frame lookup.
  • Escalated stalls (those that cross escalation_threshold) always symbolicate — they are rare and the cost is justified.
  • The tracing::warn! line carries symbolication_skipped_since_last so operators can see how much the limiter is suppressing without watching allocation dashboards.
  • Duration::ZERO disables the limiter, restoring pre-limiter behavior.
  • symbolicated_frames.len() == backtrace_frames.len() still holds for callback consumers that zip the two — rate-limited entries are formatted as {ip:#x}, which is also what a failed resolution produces.
  • New builder method stall_detection_min_symbolication_interval(Duration).

Bump version to 1.49.7000+anthropic.

Testing

  • 4 new unit tests on SymbolicationLimiter (first request, suppression window, zero-interval disable, clock-skew safety via saturating_duration_since).
  • 2 new integration tests in rt_stall_detection.rs exercising a real runtime with three back-to-back blocking stalls: rate_limits_symbolication asserts exactly one symbolicated event at a large interval, zero_interval_symbolicates_every_stall asserts every event symbolicates when disabled.
  • All 17 existing rt_stall_detection tests pass unchanged.
  • cargo check without stall-detection feature — no impact on default builds.
  • cargo clippy --features full,stall-detection --lib --tests — no new warnings.

Follow-up (monorepo side, separate PR)

  • Bump the [patch.crates-io] pin to 1.49.7000+anthropic in the root workspace + the other four workspaces.
  • Add TOKIO_STALL_DETECTION_MIN_SYMBOLICATION_SECS env var parsing in ants-rs/ant-async/src/stall_detection.rs alongside the existing TOKIO_STALL_DETECTION_* vars.
  • Regenerate buildinfra/patches/tokio.patch for the Bazel build.

Planning to A/B via a GrowthBook gate on the env var before default-on.

🤖 Generated with Claude Code

Symbolicating a stall trace calls backtrace::resolve per instruction
pointer, which parses each mapped object's DWARF debug info and caches at
most 4 mappings globally (MAPPINGS_CACHE_SIZE in the backtrace crate). A
process with dozens of loaded .so files thrashes that cache — every
symbolication re-parses ELF/DWARF for most frames. A production trace of
a Python service loading several hundred-MB Rust PyO3 extensions showed
the stall monitor's symbolication at ~54 GB/min of allocations, ~21% of
the node's entire allocation volume.

It also contends on the backtrace crate's global cache mutex, so the
monitor thread's symbolication can block any concurrent in-process
resolver (e.g. an eyre::Report backtrace) — and if that blocked caller is
itself a tokio worker inside poll, trigger another stall detection, which
symbolicates again: a positive feedback loop.

An in-fork symbolication cache was considered and rejected: a cached
addr2line::Context materializes the parsed abbreviation tables and
line-program headers in private memory, roughly 5-30 MB per hundred-MB
.so. With eight Rust .so files and two dozen worker processes per pod
that's several GB of new resident memory for a feature that should be
lightweight observability.

Rate-limit instead. Resolved stalls that arrive faster than
min_symbolication_interval (default: the escalation threshold, 10s) are
still detected, counted, and reported via tracing and on_stall, but with
raw hex instruction pointers instead of symbolicated frames — the
expensive per-frame lookup is skipped. Escalated stalls always
symbolicate; they are rare and the cost is justified. The log line
carries symbolication_skipped_since_last so operators can see how much
the limiter is suppressing. Duration::ZERO restores the pre-limiter
behavior.

symbolicated_frames stays the same length as backtrace_frames so callback
consumers that zip the two are unaffected; rate-limited entries are
formatted as {ip:#x}, which is also what a failed resolution produces.

Bump version to 1.49.7000+anthropic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@antsujay antsujay left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems reasonable to me!

@njsmith

njsmith commented May 26, 2026

Copy link
Copy Markdown
Collaborator

Seems fine, though it's more convoluted than I would have done. I'd just put a rate limit on stall reporting. In practice the way this is used, if you're hitting stalls so frequently that you're having trouble reporting them all, then those reports are not valuable -- either you screwed up your reporting thresholds and should fix those, or your program really does stall constantly in which case stalls are cheap and easy to reproduce and the only useful thing to do is pick some stall report and fix it and then remeasure, repeat. Either way, getting an exhaustive report has no value, so no point in jumping through hoops to make sure we're still providing non-symbolicated reports.

This would also provide some protection against other possible pathological interactions between the instrumentation and the code-being-instrumented.

(Fun example from Anthropic's early days: we had a version of this running on Python, PyTorch's networking code's timeout strategy relied on the kernel's SO_TIMEOUT feature -> the socket operations caused stalls -> we used a signal to collect a traceback -> the signal forces the socket sycall to EINTR and be reissued -> from the kernel's perspective this is a new syscall with a new SO_TIMEOUT -> if your 10 second timeout is being reset every 5 seconds, that is the same as not having a timeout at all. Tldr turning on stall detection made PyTorch hang irrecoverably whenever a peer died.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants