stall-detection: rate-limit symbolication of resolved stalls#16
stall-detection: rate-limit symbolication of resolved stalls#16josefbacik wants to merge 1 commit into
Conversation
Symbolicating a stall trace calls backtrace::resolve per instruction
pointer, which parses each mapped object's DWARF debug info and caches at
most 4 mappings globally (MAPPINGS_CACHE_SIZE in the backtrace crate). A
process with dozens of loaded .so files thrashes that cache — every
symbolication re-parses ELF/DWARF for most frames. A production trace of
a Python service loading several hundred-MB Rust PyO3 extensions showed
the stall monitor's symbolication at ~54 GB/min of allocations, ~21% of
the node's entire allocation volume.
It also contends on the backtrace crate's global cache mutex, so the
monitor thread's symbolication can block any concurrent in-process
resolver (e.g. an eyre::Report backtrace) — and if that blocked caller is
itself a tokio worker inside poll, trigger another stall detection, which
symbolicates again: a positive feedback loop.
An in-fork symbolication cache was considered and rejected: a cached
addr2line::Context materializes the parsed abbreviation tables and
line-program headers in private memory, roughly 5-30 MB per hundred-MB
.so. With eight Rust .so files and two dozen worker processes per pod
that's several GB of new resident memory for a feature that should be
lightweight observability.
Rate-limit instead. Resolved stalls that arrive faster than
min_symbolication_interval (default: the escalation threshold, 10s) are
still detected, counted, and reported via tracing and on_stall, but with
raw hex instruction pointers instead of symbolicated frames — the
expensive per-frame lookup is skipped. Escalated stalls always
symbolicate; they are rare and the cost is justified. The log line
carries symbolication_skipped_since_last so operators can see how much
the limiter is suppressing. Duration::ZERO restores the pre-limiter
behavior.
symbolicated_frames stays the same length as backtrace_frames so callback
consumers that zip the two are unaffected; rate-limited entries are
formatted as {ip:#x}, which is also what a failed resolution produces.
Bump version to 1.49.7000+anthropic.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
antsujay
left a comment
There was a problem hiding this comment.
seems reasonable to me!
|
Seems fine, though it's more convoluted than I would have done. I'd just put a rate limit on stall reporting. In practice the way this is used, if you're hitting stalls so frequently that you're having trouble reporting them all, then those reports are not valuable -- either you screwed up your reporting thresholds and should fix those, or your program really does stall constantly in which case stalls are cheap and easy to reproduce and the only useful thing to do is pick some stall report and fix it and then remeasure, repeat. Either way, getting an exhaustive report has no value, so no point in jumping through hoops to make sure we're still providing non-symbolicated reports. This would also provide some protection against other possible pathological interactions between the instrumentation and the code-being-instrumented. (Fun example from Anthropic's early days: we had a version of this running on Python, PyTorch's networking code's timeout strategy relied on the kernel's |
Motivation
A production systing memory trace of
public-api(Python service, multiple hundred-MB Rust PyO3.sos,triokiotokio runtime per worker) showed the stall monitor thread's symbolication pipeline at ~54 GB/min of allocations, ~21% of the node's entire allocation volume — stacks anchored atrun_monitor→symbolicate_trace→gimli::resolve→elf::parse.Two mechanisms compound the cost:
backtrace::resolvecaches at most 4 mappings globally (MAPPINGS_CACHE_SIZEin thebacktracecrate). A process with dozens of.sos thrashes that cache — each symbolication re-parses ELF/DWARF for most frames. A single stall trace spanning 5+.sos can evict the whole cache before it finishes.backtrace::resolvetakes the crate's global cache mutex. The monitor thread holding it while parsing a 300 MB.so's DWARF can block any other in-process resolver (e.g. aneyre::Reportbacktrace, see antenv-rs #470087). If that blocked caller is itself a tokio worker insidepoll, the monitor detects another stall — which triggers more symbolication.An in-fork symbolication cache (per-
.soaddr2line::Contextkeyed by mapping) was considered and rejected: the parsed abbreviation tables and line-program headers are ~5–30 MB of private memory per.sofor binaries this size. With ~8 Rust.sos × 24 worker processes per pod, that's several GB of new resident memory for a feature that should be lightweight observability.What this does
Add a
SymbolicationLimiterthat permits at most one symbolicated resolved-stall report permin_symbolication_interval(default: the escalation threshold's default, 10 s).tracingand theon_stallcallback — with raw{ip:#x}frames instead of symbolicated ones. Stall counting is never suppressed, only the expensive per-frame lookup.escalation_threshold) always symbolicate — they are rare and the cost is justified.tracing::warn!line carriessymbolication_skipped_since_lastso operators can see how much the limiter is suppressing without watching allocation dashboards.Duration::ZEROdisables the limiter, restoring pre-limiter behavior.symbolicated_frames.len() == backtrace_frames.len()still holds for callback consumers that zip the two — rate-limited entries are formatted as{ip:#x}, which is also what a failed resolution produces.stall_detection_min_symbolication_interval(Duration).Bump version to
1.49.7000+anthropic.Testing
SymbolicationLimiter(first request, suppression window, zero-interval disable, clock-skew safety viasaturating_duration_since).rt_stall_detection.rsexercising a real runtime with three back-to-back blocking stalls:rate_limits_symbolicationasserts exactly one symbolicated event at a large interval,zero_interval_symbolicates_every_stallasserts every event symbolicates when disabled.rt_stall_detectiontests pass unchanged.cargo checkwithoutstall-detectionfeature — no impact on default builds.cargo clippy --features full,stall-detection --lib --tests— no new warnings.Follow-up (monorepo side, separate PR)
[patch.crates-io]pin to1.49.7000+anthropicin the root workspace + the other four workspaces.TOKIO_STALL_DETECTION_MIN_SYMBOLICATION_SECSenv var parsing inants-rs/ant-async/src/stall_detection.rsalongside the existingTOKIO_STALL_DETECTION_*vars.buildinfra/patches/tokio.patchfor the Bazel build.Planning to A/B via a GrowthBook gate on the env var before default-on.
🤖 Generated with Claude Code