fix(autopilot): upload the hard crash and stop false app-hangs #15
Merged
Two issues surfaced by the scheduled run-demo jobs:
- The headline "Convoluted Chain" crash was captured but never reached Sentry. The crash upload mode was async, so on a one-shot CI run the daemon was torn down with the job before it finished uploading (and nothing relaunches to flush it). Add SentryConfig::crash_upload_sync and set it for the headless runner so the process blocks until the daemon has sent the minidump, then exits non-zero.
- The autopilot produced spurious "(anonymous namespace)::sleep_ms" app hangs that didn't group (different durations). The loop only fed the app-hang watchdog once per iteration, so a slow backend call plus the 1.5s pace wait exceeded the 2s threshold. Heartbeat after every step, wait via a heartbeating idle(), and widen the headless app-hang timeout to 4s. Now only the deliberate 8s app-hang scenario trips the watchdog, and it groups under its "app-hang" fingerprint.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
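To illustrate why the once-per-iteration heartbeat produced false hangs, here is a minimal sketch of the timing. It is a model, not the repository's code: `HangWatchdog`, `advance`, the 100 ms idle slice, and the 700 ms backend latency are all assumptions chosen for illustration, and time is a manual counter rather than a real clock so the behaviour is deterministic.

```cpp
#include <algorithm>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical model of the app-hang watchdog. Time is a manual counter
// (advance) instead of a real clock, so the example is deterministic.
class HangWatchdog {
public:
    explicit HangWatchdog(Ms threshold) : threshold_(threshold) {}

    // Simulate `d` of wall time passing without a heartbeat.
    void advance(Ms d) {
        since_beat_ += d;
        if (since_beat_ > threshold_) tripped_ = true;
    }

    // Called by the main loop to signal that it is still responsive.
    void heartbeat() { since_beat_ = Ms{0}; }

    bool tripped() const { return tripped_; }

private:
    Ms threshold_;
    Ms since_beat_{0};
    bool tripped_ = false;
};

// Wait for `total`, heartbeating after every `slice`, so the pace wait
// itself never accumulates towards the hang threshold.
void idle(HangWatchdog& wd, Ms total, Ms slice) {
    while (total > Ms{0}) {
        const Ms step = std::min(total, slice);
        wd.advance(step);  // a real implementation would sleep here
        wd.heartbeat();
        total -= step;
    }
}

// Old loop shape: backend call, pace wait, then a single heartbeat.
bool old_iteration_trips(Ms backend, Ms pace, Ms threshold) {
    HangWatchdog wd(threshold);
    wd.advance(backend);
    wd.advance(pace);
    wd.heartbeat();
    return wd.tripped();
}

// New loop shape: heartbeat after the step, then a heartbeating idle().
bool new_iteration_trips(Ms backend, Ms pace, Ms threshold) {
    HangWatchdog wd(threshold);
    wd.advance(backend);
    wd.heartbeat();
    idle(wd, pace, Ms{100});  // 100 ms slice is an assumed value
    return wd.tripped();
}
```

With an assumed 700 ms backend call, the old shape accumulates 2.2 s against the 2 s threshold and trips, while the new shape does not. A deliberate 8 s stall still trips the widened 4 s threshold, which is the intended behaviour.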
So a manual dispatch on a feature branch exercises that branch's binary, and scheduled master runs keep using master builds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Cap the backend checkout call to 3s (curl + WinHTTP) and widen the
headless app-hang threshold to 6s, so a slow Flask response can no longer
be mistaken for a UI hang. This removes the spurious app-hang that was
grouping into the backend-error issue ("checkout failed / App hung").
- run-demo: relaunch the binary once after the autopilot crash so the native
backend uploads any crash still on disk (belt-and-suspenders with the sync
upload). Temporarily enable EMPOWER_DEBUG to verify in CI.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hard crash (the Convoluted Chain, transaction "firmware.flash") was never a capture/format problem - it just wasn't uploading before the one-shot CI job tore down. crash_upload_sync handles that on its own (the equivalent of crashpad's prompt upload), so the relaunch/flush and the temporary debug logging are removed. Crash reporting stays on the native backend with a minidump. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native backend already keeps the crashed process alive until the daemon is done (crash_upload_mode=SYNC, the default). The reason the crash didn't reach Sentry from CI was the daemon's flush window: shutdown_timeout defaults to 2s, too short for our ~1MB crash envelope (minidump + screenshot), so the SDK dumped it to disk to send "on next restart" - which never happens in a one-shot CI run. Raise shutdown_timeout to 8s (under the ~10s crash-handler wait cap) when sync upload is enabled, so the upload completes in-process. This is the native-backend equivalent of crashpad's wait_for_upload; no relaunch/flush needed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
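The timeout choice described above can be sketched as follows. This is an illustration under assumptions, not the actual change: the `SentryConfig` struct here carries only the one field named in the commit messages, and `shutdown_timeout_for` and the constant names are hypothetical. In sentry-native the resulting value would presumably be handed to the SDK's shutdown-timeout option, but the source does not show that call.

```cpp
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical stand-in for the project's config; only the field named in
// the commit message is modelled.
struct SentryConfig {
    bool crash_upload_sync;
};

// Values taken from the commit message; the names are assumptions.
constexpr Ms kDefaultShutdownTimeout{2000};  // SDK default flush window
constexpr Ms kSyncShutdownTimeout{8000};     // enough for the ~1MB envelope
constexpr Ms kCrashHandlerWaitCap{10000};    // approximate handler wait cap

// The raised timeout must stay under the crash handler's own wait cap,
// otherwise the handler gives up before the daemon finishes flushing.
static_assert(kSyncShutdownTimeout < kCrashHandlerWaitCap,
              "sync shutdown timeout must fit inside the crash-handler wait cap");

// Pick the daemon flush window: longer only when sync upload is enabled.
constexpr Ms shutdown_timeout_for(const SentryConfig& cfg) {
    return cfg.crash_upload_sync ? kSyncShutdownTimeout
                                 : kDefaultShutdownTimeout;
}
```

Keeping the cap as a `static_assert` means a later bump of the timeout past the handler's wait fails at compile time rather than silently dropping crashes in CI.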
The Convoluted Chain crashes by jumping to a corrupted callback pointer (0xc0c0c0c0), which symbolicates oddly server-side and made the crash issue hard to recognize in CI. Use a plain null dereference (SIGSEGV at 0x0) for the autopilot's final crash so it surfaces as an obvious, normally-symbolicated crash issue. The Convoluted Chain remains in the Chaos Lab for manual/GUI and Seer demos. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root cause of the missing CI crashes: the external crash reporter
(Sentry.CrashReporter) is bundled next to the binary and was auto-detected
for every component. With it set, the SDK hands the crash envelope to that
separate GUI app to submit - but headless/CI can't launch it ("execv failed:
Permission denied"), so the crash was written out for the reporter and never
sent (while logs/sessions still uploaded, which is why ingest looked fine).
Gate it behind SentryConfig::use_external_crash_reporter, enabled only for the
interactive GUI. Headless and the smoke test now submit crashes themselves.
Also reverts the temporary null-deref simplification: the autopilot's final
crash is the Convoluted Chain again.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
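A rough sketch of the gating described in this commit, assuming a per-component config. Only `SentryConfig::use_external_crash_reporter` comes from the commit message; the `Component` enum, `config_for`, and `external_reporter_path` are hypothetical names, and an empty string stands in for "no external reporter configured".

```cpp
#include <string>

// Hypothetical set of components that initialise the SDK.
enum class Component { InteractiveGui, HeadlessRunner, SmokeTest };

// Only the flag named in the commit message is modelled here.
struct SentryConfig {
    bool use_external_crash_reporter;
};

// Enable the external reporter only where a GUI app can actually be launched.
SentryConfig config_for(Component c) {
    return SentryConfig{c == Component::InteractiveGui};
}

// Return the reporter path to hand to the SDK, or an empty string when the
// component should submit crashes itself. `detected` is whatever the
// auto-detection found next to the binary.
std::string external_reporter_path(const SentryConfig& cfg,
                                   const std::string& detected) {
    return cfg.use_external_crash_reporter ? detected : std::string();
}
```

The point of the explicit flag is that auto-detection alone is not enough: the reporter being present on disk says nothing about whether the current process is allowed to launch it.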
Follow-up to #12, fixing two issues observed in the scheduled `run-demo` jobs.

1. The hard crash never reached Sentry

The autopilot's final Convoluted Chain crash was captured to disk but not uploaded: crash upload was async, so on a one-shot CI run the crash daemon got torn down with the job before it finished sending (and nothing relaunches to flush the pending envelope).

Fix: add `SentryConfig::crash_upload_sync` and enable it for the headless runner, so the process blocks until the daemon has uploaded the minidump, then exits non-zero. Verified locally: a single `--crash convoluted` run now sends the minidump envelope before exit.

2. Spurious, ungrouped app-hangs

The runs showed `(anonymous namespace)::sleep_ms` "App hung for ~2.1s / ~2.3s" issues that didn't group. These were false positives: the autopilot fed the app-hang watchdog only once per iteration, so a slow backend call plus the 1.5s pace wait exceeded the 2s threshold.

Fix: heartbeat after every step, wait via a heartbeating `idle()`, and widen the headless app-hang timeout to 4s. Now only the deliberate 8s app-hang scenario trips the watchdog, grouped under its `app-hang` fingerprint. (The `(anonymous namespace)::` prefix is just how Sentry renders internal-linkage C++ symbols.)

🤖 Generated with Claude Code