
fix(autopilot): upload the hard crash and stop false app-hangs #15

Merged
mujacica merged 9 commits into master from fix/autopilot-crash-and-app-hangs
Jun 25, 2026


@mujacica


Follow-up to #12, fixing two issues observed in the scheduled run-demo jobs.

1. The hard crash never reached Sentry

The autopilot's final Convoluted Chain crash was captured to disk but not uploaded: crash upload was async, so on a one-shot CI run the crash daemon got torn down with the job before it finished sending (and nothing relaunches to flush the pending envelope).

Fix: add SentryConfig::crash_upload_sync and enable it for the headless runner, so the process blocks until the daemon has uploaded the minidump, then exits non-zero. Verified locally: a single --crash convoluted run now sends the minidump envelope before exit.

2. Spurious, ungrouped app-hangs

The runs showed (anonymous namespace)::sleep_ms "App hung for ~2.1s / ~2.3s" issues that didn't group. These were false positives: the autopilot fed the app-hang watchdog only once per iteration, so a slow backend call plus the 1.5s pace wait exceeded the 2s threshold.

Fix: heartbeat after every step, wait via a heartbeating idle(), and widen the headless app-hang timeout to 4s. Now only the deliberate 8s app-hang scenario trips the watchdog, grouped under its app-hang fingerprint. (The (anonymous namespace):: prefix is just how Sentry renders internal-linkage C++ symbols.)
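
For illustration, the heartbeating wait could look roughly like the sketch below. `HangWatchdog`, its members, and the 100 ms slice are hypothetical names and values chosen for this example, not the identifiers used in the repository; only the idea (sleep in short slices and feed the watchdog after each one) comes from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Hypothetical single-threaded stand-in for the app-hang watchdog.
// A real watchdog would read last_beat_ from another thread and need
// synchronization; that is left out to keep the sketch short.
class HangWatchdog {
public:
    explicit HangWatchdog(Ms threshold)
        : threshold_(threshold), last_beat_(Clock::now()) {}

    // Called by the main loop to report that it is still making progress.
    void heartbeat() {
        last_beat_ = Clock::now();
        ++beats_;
    }

    // True when the last heartbeat is older than the threshold.
    bool hung() const { return Clock::now() - last_beat_ > threshold_; }

    int beats() const { return beats_; }

private:
    Ms threshold_;
    Clock::time_point last_beat_;
    int beats_ = 0;
};

// Wait for roughly `total`, sleeping in short slices and feeding the
// watchdog after each one, so the wait itself never looks like a hang.
inline void idle(HangWatchdog& dog, Ms total, Ms slice = Ms(100)) {
    assert(slice > Ms(0));
    const auto deadline = Clock::now() + total;
    while (Clock::now() < deadline) {
        const Clock::duration remaining = deadline - Clock::now();
        std::this_thread::sleep_for(std::min<Clock::duration>(remaining, slice));
        dog.heartbeat();
    }
}
```

With a helper like this, the 1.5s pace wait becomes `idle(dog, Ms(1500))`, and the longest stretch without a heartbeat during the wait is about one slice rather than the whole wait.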

🤖 Generated with Claude Code

mujacica and others added 9 commits June 25, 2026 15:46
Two issues surfaced by the scheduled run-demo jobs:

- The headline "Convoluted Chain" crash was captured but never reached
  Sentry. The crash upload mode was async, so on a one-shot CI run the
  daemon was torn down with the job before it finished uploading (and
  nothing relaunches to flush it). Add SentryConfig::crash_upload_sync and
  set it for the headless runner so the process blocks until the daemon has
  sent the minidump, then exits non-zero.

- The autopilot produced spurious "(anonymous namespace)::sleep_ms" app
  hangs that didn't group (different durations). The loop only fed the
  app-hang watchdog once per iteration, so a slow backend call plus the
  1.5s pace wait exceeded the 2s threshold. Heartbeat after every step,
  wait via a heartbeating idle(), and widen the headless app-hang timeout
  to 4s. Now only the deliberate 8s app-hang scenario trips the watchdog,
  and it groups under its "app-hang" fingerprint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
So a manual dispatch on a feature branch exercises that branch's binary,
and scheduled master runs keep using master builds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Cap the backend checkout call to 3s (curl + WinHTTP) and widen the
  headless app-hang threshold to 6s, so a slow Flask response can no longer
  be mistaken for a UI hang. This removes the spurious app-hang that was
  grouping into the backend-error issue ("checkout failed / App hung").
- run-demo: relaunch the binary once after the autopilot crash so the native
  backend uploads any crash still on disk (belt-and-suspenders with the sync
  upload). Temporarily enable EMPOWER_DEBUG to verify in CI.
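
The timing argument behind these thresholds can be checked with a small sketch. The helper names are hypothetical, and the 700 ms backend latency used in the usage note is an illustrative assumption; the 1.5s pace wait, the 2s and 6s thresholds, and the 3s backend cap are the values quoted in these commit messages.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <numeric>
#include <vector>

using Ms = std::chrono::milliseconds;

// Longest stretch without a heartbeat when the loop feeds the watchdog
// once per iteration: every step in the iteration adds up.
inline Ms gap_once_per_iteration(const std::vector<Ms>& steps) {
    return std::accumulate(steps.begin(), steps.end(), Ms(0));
}

// Longest stretch when the loop feeds the watchdog after every step:
// only the slowest single step matters.
inline Ms gap_per_step(const std::vector<Ms>& steps) {
    return steps.empty() ? Ms(0)
                         : *std::max_element(steps.begin(), steps.end());
}

// The watchdog reports a hang when the unfed gap exceeds its threshold.
inline bool trips_watchdog(Ms gap, Ms threshold) {
    assert(threshold > Ms(0));
    return gap > threshold;
}
```

For example, an assumed 700 ms backend call followed by the 1.5s pace wait gives `gap_once_per_iteration` = 2200 ms, which exceeds the old 2s threshold. With a heartbeat after every step and the backend call capped at 3s, `gap_per_step` is at most 3000 ms, well under the 6s threshold.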

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hard crash (the Convoluted Chain, transaction "firmware.flash") was
never a capture/format problem - it just wasn't uploading before the
one-shot CI job tore down. crash_upload_sync handles that on its own (the
equivalent of crashpad's prompt upload), so the relaunch/flush and the
temporary debug logging are removed. Crash reporting stays on the native
backend with a minidump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native backend already keeps the crashed process alive until the daemon
is done (crash_upload_mode=SYNC, the default). The reason the crash didn't
reach Sentry from CI was the daemon's flush window: shutdown_timeout defaults
to 2s, too short for our ~1MB crash envelope (minidump + screenshot), so the
SDK dumped it to disk to send "on next restart" - which never happens in a
one-shot CI run. Raise shutdown_timeout to 8s (under the ~10s crash-handler
wait cap) when sync upload is enabled, so the upload completes in-process.

This is the native-backend equivalent of crashpad's wait_for_upload; no
relaunch/flush needed.
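
A minimal sketch of that rule, assuming a simplified `SentryConfig` with only the two relevant fields. The struct layout, the constant names, and `effective_shutdown_timeout` are hypothetical; the 2s default, the 8s value, and the ~10s cap are the figures given above. Using `std::max` so that a larger configured timeout is not lowered is a choice made for this sketch, not something the commit states.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical mirror of the relevant configuration fields.
struct SentryConfig {
    bool crash_upload_sync = false;
    Ms shutdown_timeout{2000};  // SDK default quoted in the commit message
};

constexpr Ms kSyncUploadTimeout{8000};      // flush window for sync upload
constexpr Ms kCrashHandlerWaitCap{10000};   // approximate ("~10s") wait cap

// Returns the shutdown timeout to hand to the SDK: unchanged for async
// upload, raised to at least 8s when sync upload is enabled.
inline Ms effective_shutdown_timeout(const SentryConfig& cfg) {
    static_assert(kSyncUploadTimeout < kCrashHandlerWaitCap,
                  "flush window must stay under the crash-handler wait cap");
    assert(cfg.shutdown_timeout >= Ms(0));
    if (!cfg.crash_upload_sync) {
        return cfg.shutdown_timeout;
    }
    return std::max(cfg.shutdown_timeout, kSyncUploadTimeout);
}
```

The `static_assert` keeps the invariant from the commit message visible at compile time: if the flush window were ever raised past the crash-handler cap, the process could be killed before the upload finishes.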

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Convoluted Chain crashes by jumping to a corrupted callback pointer
(0xc0c0c0c0), which symbolicates oddly server-side and made the crash issue
hard to recognize in CI. Use a plain null dereference (SIGSEGV at 0x0) for the
autopilot's final crash so it surfaces as an obvious, normally-symbolicated
crash issue. The Convoluted Chain remains in the Chaos Lab for manual/GUI and
Seer demos.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Root cause of the missing CI crashes: the external crash reporter
(Sentry.CrashReporter) is bundled next to the binary and was auto-detected
for every component. With it set, the SDK hands the crash envelope to that
separate GUI app to submit - but headless/CI can't launch it ("execv failed:
Permission denied"), so the crash was written out for the reporter and never
sent (while logs/sessions still uploaded, which is why ingest looked fine).

Gate it behind SentryConfig::use_external_crash_reporter, enabled only for the
interactive GUI. Headless and the smoke test now submit crashes themselves.
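
The gating can be sketched as follows. `Component`, `config_for`, and `external_reporter_path` are hypothetical names for this example, and `SentryConfig` is reduced to the one flag; only the rule itself (auto-detection is honored for the interactive GUI and ignored everywhere else) is taken from the commit message. The sketch assumes C++17 for `std::optional`.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical list of components that initialize Sentry.
enum class Component { InteractiveGui, Headless, SmokeTest };

// Hypothetical mirror of the relevant configuration flag.
struct SentryConfig {
    bool use_external_crash_reporter = false;
};

// Only the interactive GUI opts in to the external crash reporter.
inline SentryConfig config_for(Component component) {
    SentryConfig cfg;
    cfg.use_external_crash_reporter = (component == Component::InteractiveGui);
    return cfg;
}

// Returns the reporter path to pass to the SDK, or std::nullopt so the
// SDK submits the crash itself. `detected` is the auto-detected path of
// the bundled reporter, if one was found next to the binary.
inline std::optional<std::string> external_reporter_path(
        const SentryConfig& cfg,
        const std::optional<std::string>& detected) {
    if (detected) {
        assert(!detected->empty());
    }
    if (!cfg.use_external_crash_reporter) {
        return std::nullopt;  // ignore auto-detection for headless/CI
    }
    return detected;
}
```

Under this rule a headless run never hands its envelope to a GUI reporter it cannot launch, even when the reporter binary is present on disk.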

Also reverts the temporary null-deref simplification: the autopilot's final
crash is the Convoluted Chain again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mujacica merged commit 65080e4 into master Jun 25, 2026
7 checks passed
@mujacica deleted the fix/autopilot-crash-and-app-hangs branch June 25, 2026 15:38