stall-detection: rate-limit symbolication of resolved stalls by josefbacik · Pull Request #16 · anthropics/tokio

josefbacik · 2026-05-18T18:29:18Z

Motivation

A production systing memory trace of public-api (Python service, multiple hundred-MB Rust PyO3 .sos, triokio tokio runtime per worker) showed the stall monitor thread's symbolication pipeline at ~54 GB/min of allocations, ~21% of the node's entire allocation volume — stacks anchored at run_monitor → symbolicate_trace → gimli::resolve → elf::parse.

Two mechanisms compound the cost:

Cache thrash. backtrace::resolve caches at most 4 mappings globally (MAPPINGS_CACHE_SIZE in the backtrace crate). A process with dozens of .sos thrashes that cache — each symbolication re-parses ELF/DWARF for most frames. A single stall trace spanning 5+ .sos can evict the whole cache before it finishes.
Global-lock feedback loop. backtrace::resolve takes the crate's global cache mutex. The monitor thread holding it while parsing a 300 MB .so's DWARF can block any other in-process resolver (e.g. an eyre::Report backtrace, see antenv-rs #470087). If that blocked caller is itself a tokio worker inside poll, the monitor detects another stall — which triggers more symbolication.
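
The cache-thrash mechanism can be illustrated with a toy model. This is a minimal sketch, assuming the mapping cache evicts the least recently used entry; `MappingCache` and `parses_for` are hypothetical names for illustration and are not the backtrace crate's code. Each "parse" stands in for one ELF/DWARF parse of a mapped object.

```rust
use std::collections::VecDeque;

/// Cache capacity, matching the limit described above.
const MAPPINGS_CACHE_SIZE: usize = 4;

/// Toy LRU cache of parsed mappings (hypothetical, for illustration only).
struct MappingCache {
    /// Mapping ids, most recently used at the front.
    entries: VecDeque<usize>,
    /// Number of times a mapping had to be parsed (cache misses).
    parses: usize,
}

impl MappingCache {
    fn new() -> Self {
        Self { entries: VecDeque::with_capacity(MAPPINGS_CACHE_SIZE), parses: 0 }
    }

    /// Look up the mapping for one frame, "parsing" it on a miss.
    fn resolve(&mut self, mapping_id: usize) {
        if let Some(pos) = self.entries.iter().position(|&id| id == mapping_id) {
            // Hit: move the entry to the front.
            let id = self.entries.remove(pos).unwrap();
            self.entries.push_front(id);
        } else {
            // Miss: this is where the expensive parse would happen.
            self.parses += 1;
            if self.entries.len() == MAPPINGS_CACHE_SIZE {
                self.entries.pop_back(); // evict the least recently used
            }
            self.entries.push_front(mapping_id);
        }
    }
}

/// Resolve `rounds` traces that each visit mappings `0..distinct` in order,
/// and return how many parses were needed in total.
fn parses_for(distinct: usize, rounds: usize) -> usize {
    let mut cache = MappingCache::new();
    for _ in 0..rounds {
        for id in 0..distinct {
            cache.resolve(id);
        }
    }
    cache.parses
}
```

Under this model, ten traces over 4 objects cost 4 parses in total (`parses_for(4, 10)`), while ten traces over 5 objects cost 50 (`parses_for(5, 10)`): once the working set exceeds the capacity by one, a cyclic access pattern misses on every frame.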

An in-fork symbolication cache (per-.so addr2line::Context keyed by mapping) was considered and rejected: the parsed abbreviation tables and line-program headers are ~5–30 MB of private memory per .so for binaries this size. With ~8 Rust .sos × 24 worker processes per pod, that's several GB of new resident memory for a feature that should be lightweight observability.
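
The memory estimate can be checked with a few lines of arithmetic. This is a sketch using the approximate figures from the paragraph above; `cache_footprint_mb` is a hypothetical helper, and it assumes every worker process caches one context per .so.

```rust
/// Extra resident memory in MB if each worker process holds one cached
/// context per .so (hypothetical helper; inputs are the PR's rough figures).
fn cache_footprint_mb(sos: u64, workers: u64, mb_per_context: u64) -> u64 {
    sos * workers * mb_per_context
}
```

With 8 .sos and 24 workers there are 192 cached contexts per pod, so `cache_footprint_mb(8, 24, 5)` gives 960 MB at the low end and `cache_footprint_mb(8, 24, 30)` gives 5760 MB at the high end.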

What this does

Add a SymbolicationLimiter that permits at most one symbolicated resolved-stall report per min_symbolication_interval (default: the escalation threshold's default, 10 s).

Resolved stalls that arrive faster than the interval are still detected, counted, and reported via tracing and the on_stall callback — with raw {ip:#x} frames instead of symbolicated ones. Stall counting is never suppressed, only the expensive per-frame lookup.
Escalated stalls (those that cross escalation_threshold) always symbolicate — they are rare and the cost is justified.
The tracing::warn! line carries symbolication_skipped_since_last so operators can see how much the limiter is suppressing without watching allocation dashboards.
Duration::ZERO disables the limiter, restoring pre-limiter behavior.
symbolicated_frames.len() == backtrace_frames.len() still holds for callback consumers that zip the two — rate-limited entries are formatted as {ip:#x}, which is also what a failed resolution produces.
New builder method stall_detection_min_symbolication_interval(Duration).
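
The behavior listed above can be sketched as follows. This is an illustrative reconstruction from the description, not the code in this PR: the field names, the `try_acquire` method, and the `render_frames` helper are all assumptions. Escalated stalls are not modeled; per the description they would bypass the limiter entirely.

```rust
use std::time::{Duration, Instant};

/// Hypothetical limiter: at most one symbolicated report per `min_interval`.
struct SymbolicationLimiter {
    min_interval: Duration,
    last_symbolicated: Option<Instant>,
    skipped_since_last: u64,
}

impl SymbolicationLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_symbolicated: None, skipped_since_last: 0 }
    }

    /// Returns `Some(n)` if this report may symbolicate, where `n` is the
    /// number of reports suppressed since the last permitted one.
    /// Returns `None` if the report falls inside the suppression window.
    fn try_acquire(&mut self, now: Instant) -> Option<u64> {
        let allowed = self.min_interval.is_zero() // zero interval disables the limiter
            || match self.last_symbolicated {
                None => true, // the first request always passes
                // Saturating subtraction: a `now` earlier than `last` counts as zero elapsed.
                Some(last) => now.saturating_duration_since(last) >= self.min_interval,
            };
        if allowed {
            self.last_symbolicated = Some(now);
            Some(std::mem::take(&mut self.skipped_since_last))
        } else {
            self.skipped_since_last += 1;
            None
        }
    }
}

/// Produce exactly one string per frame, so the output can always be zipped
/// with the raw frames. Skipped or unresolved frames fall back to hex.
fn render_frames(
    ips: &[usize],
    symbolicate: bool,
    resolve: impl Fn(usize) -> Option<String>,
) -> Vec<String> {
    ips.iter()
        .map(|&ip| {
            let name = if symbolicate { resolve(ip) } else { None };
            name.unwrap_or_else(|| format!("{ip:#x}"))
        })
        .collect()
}
```

In this sketch the value returned by `try_acquire` is what a caller would log as `symbolication_skipped_since_last`, and a `None` result means the report is still emitted, only with `render_frames(ips, false, ..)`.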

Bump version to 1.49.7000+anthropic.

Testing

4 new unit tests on SymbolicationLimiter (first request, suppression window, zero-interval disable, clock-skew safety via saturating_duration_since).
2 new integration tests in rt_stall_detection.rs exercising a real runtime with three back-to-back blocking stalls: rate_limits_symbolication asserts exactly one symbolicated event at a large interval, zero_interval_symbolicates_every_stall asserts every event symbolicates when disabled.
All 17 existing rt_stall_detection tests pass unchanged.
cargo check without stall-detection feature — no impact on default builds.
cargo clippy --features full,stall-detection --lib --tests — no new warnings.

Follow-up (monorepo side, separate PR)

Bump the [patch.crates-io] pin to 1.49.7000+anthropic in the root workspace + the other four workspaces.
Add TOKIO_STALL_DETECTION_MIN_SYMBOLICATION_SECS env var parsing in ants-rs/ant-async/src/stall_detection.rs alongside the existing TOKIO_STALL_DETECTION_* vars.
Regenerate buildinfra/patches/tokio.patch for the Bazel build.
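
The env var parsing in the second item might look roughly like the sketch below. Only the variable name comes from the list above; the function names, the choice of whole seconds as the unit, and the fall-back-to-default behavior on bad input are assumptions, and the existing TOKIO_STALL_DETECTION_* parsing may follow different conventions.

```rust
use std::time::Duration;

/// Variable name taken from the follow-up list.
const MIN_SYMBOLICATION_SECS_VAR: &str = "TOKIO_STALL_DETECTION_MIN_SYMBOLICATION_SECS";

/// Parse a raw value as whole seconds (assumed unit).
/// Returns `None` for a missing or unparsable value so the caller keeps the default.
fn parse_min_symbolication_interval(raw: Option<&str>) -> Option<Duration> {
    let secs: u64 = raw?.trim().parse().ok()?;
    Some(Duration::from_secs(secs))
}

/// Read the interval from the process environment, if set and valid.
fn min_symbolication_interval_from_env() -> Option<Duration> {
    let raw = std::env::var(MIN_SYMBOLICATION_SECS_VAR).ok();
    parse_min_symbolication_interval(raw.as_deref())
}
```

A value of `0` parses to `Duration::ZERO`, which under the semantics described earlier would disable the limiter, so an A/B gate could toggle the feature purely through this variable.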

Planning to A/B via a GrowthBook gate on the env var before default-on.

🤖 Generated with Claude Code

Symbolicating a stall trace calls backtrace::resolve per instruction pointer, which parses each mapped object's DWARF debug info and caches at most 4 mappings globally (MAPPINGS_CACHE_SIZE in the backtrace crate). A process with dozens of loaded .so files thrashes that cache: every symbolication re-parses ELF/DWARF for most frames.

A production trace of a Python service loading several hundred-MB Rust PyO3 extensions showed the stall monitor's symbolication at ~54 GB/min of allocations, ~21% of the node's entire allocation volume. It also contends on the backtrace crate's global cache mutex, so the monitor thread's symbolication can block any concurrent in-process resolver (e.g. an eyre::Report backtrace) and, if that blocked caller is itself a tokio worker inside poll, trigger another stall detection, which symbolicates again: a positive feedback loop.

An in-fork symbolication cache was considered and rejected: a cached addr2line::Context materializes the parsed abbreviation tables and line-program headers in private memory, roughly 5-30 MB per hundred-MB .so. With eight Rust .so files and two dozen worker processes per pod that's several GB of new resident memory for a feature that should be lightweight observability.

Rate-limit instead. Resolved stalls that arrive faster than min_symbolication_interval (default: the escalation threshold, 10s) are still detected, counted, and reported via tracing and on_stall, but with raw hex instruction pointers instead of symbolicated frames; the expensive per-frame lookup is skipped. Escalated stalls always symbolicate; they are rare and the cost is justified. The log line carries symbolication_skipped_since_last so operators can see how much the limiter is suppressing. Duration::ZERO restores the pre-limiter behavior.

symbolicated_frames stays the same length as backtrace_frames so callback consumers that zip the two are unaffected; rate-limited entries are formatted as {ip:#x}, which is also what a failed resolution produces.

Bump version to 1.49.7000+anthropic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

antsujay

seems reasonable to me!

njsmith · 2026-05-26T19:35:19Z

Seems fine, though it's more convoluted than I would have done. I'd just put a rate limit on stall reporting. In practice the way this is used, if you're hitting stalls so frequently that you're having trouble reporting them all, then those reports are not valuable -- either you screwed up your reporting thresholds and should fix those, or your program really does stall constantly in which case stalls are cheap and easy to reproduce and the only useful thing to do is pick some stall report and fix it and then remeasure, repeat. Either way, getting an exhaustive report has no value, so no point in jumping through hoops to make sure we're still providing non-symbolicated reports.

This would also provide some protection against other possible pathological interactions between the instrumentation and the code-being-instrumented.

(Fun example from Anthropic's early days: we had a version of this running on Python, PyTorch's networking code's timeout strategy relied on the kernel's SO_TIMEOUT feature -> the socket operations caused stalls -> we used a signal to collect a traceback -> the signal forces the socket syscall to return EINTR and be reissued -> from the kernel's perspective this is a new syscall with a new SO_TIMEOUT -> if your 10 second timeout is being reset every 5 seconds, that is the same as not having a timeout at all. Tldr turning on stall detection made PyTorch hang irrecoverably whenever a peer died.)

edwinsmith requested review from antsujay, edwinsmith and njsmith May 22, 2026 10:05

antsujay approved these changes May 22, 2026

josefbacik wants to merge 1 commit into anthropics:anthropic-1.49.0 from josefbacik:stall-detection-symbolication-rate-limit
