Dev loop leaves orphaned processes: dev.ts signal handlers preempt lifecycle shutdown; unhandled EADDRINUSE crashes after full boot #459

@markhayden

Summary

Killing bun run dev routinely leaves orphaned processes behind: the in-process HTTP server's antfly child keeps 127.0.0.1:3738 bound, and (when the process is wedged) the dev process itself survives SIGTERM holding :3737. Multiple half-dead generations then overlap and produce phantom UI behavior. This predates the antfly-zig migration — the migration just made it much easier to hit.

Diagnosed live on a machine that had accumulated several generations overnight (2026-06-05/06).

Architecture (for context)

bun run dev → scripts/dev.ts, which imports server.ts in-process (there is no separate server child). The one dev process has two children: the tailwind watcher and (via the antfly adapter) the antfly swarm server. So "the server child holding 3737" is actually the dev process itself.

Three interacting defects

1. dev.ts's signal handlers preempt the lifecycle shutdown

scripts/dev.ts registerShutdown() registers SIGINT/SIGTERM handlers before await import('../server'):

process.on('SIGINT', () => { cleanup(); process.exit(0) })   // cleanup() only kills tailwind
process.on('SIGTERM', () => { cleanup(); process.exit(0) })

src/core/lifecycle.ts registerShutdownHandlers() later registers the real async shutdown (plugins → dispatch/watchdog/doctor → watcher → search.shutdown() → stopAntflyServer()) on the same signals. Node runs signal listeners in registration order: dev.ts's handler runs first and calls process.exit(0) synchronously, so the lifecycle handler never executes. The antfly child is spawned with detached: false, but a child process still outlives its parent unless something kills it, so it is orphaned on 3738.
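The ordering problem can be reproduced in isolation. The following is a minimal sketch, not the project's code: a plain EventEmitter stands in for `process`, and a thrown `FakeExit` stands in for `process.exit(0)` (both are assumptions made for illustration, since a real exit would end the demo itself). It only shows that a listener registered later never runs once an earlier one terminates the chain.

```typescript
import { EventEmitter } from "node:events";

// Stand-in for process.exit(): a real exit ends the process, so no later
// listener runs. Throwing out of emit() cuts the listener chain the same way.
class FakeExit extends Error {
  code: number;
  constructor(code: number) {
    super(`exit(${code})`);
    this.code = code;
  }
}

function simulateSigterm(): string[] {
  const log: string[] = [];
  const fakeProcess = new EventEmitter(); // hypothetical stand-in for `process`

  // Registered first, like the handler in scripts/dev.ts.
  fakeProcess.on("SIGTERM", () => {
    log.push("dev.ts: kill tailwind");
    throw new FakeExit(0);
  });

  // Registered later, like the lifecycle shutdown.
  fakeProcess.on("SIGTERM", () => {
    log.push("lifecycle: stopAntflyServer");
  });

  try {
    fakeProcess.emit("SIGTERM");
  } catch (err) {
    if (!(err instanceof FakeExit)) throw err;
    log.push(`process exited with code ${err.code}`);
  }
  return log;
}
```

Calling `simulateSigterm()` returns a log with no lifecycle entry, which matches the orphaned-antfly symptom described above.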

2. Unhandled EADDRINUSE crashes after full boot, skipping cleanup

server.ts calls server.listen(port, ...) with no 'error' listener, at the end of main() — after the watcher is started, dispatch state is written, and antfly is spawned/adopted. If another generation holds 3737, the EADDRINUSE 'error' event throws as an uncaught exception (verified under Bun 1.3.13: prints error: Failed to start server. Is port 34737 in use? and exits 1). An uncaught exception does not run the signal-based lifecycle shutdown → the freshly spawned/adopted antfly is orphaned again. Every retry against a squatted port mints another orphan.
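One way to close this gap is sketched below, under stated assumptions: `attachListenErrorHandler`, `ListenErrorDeps`, and the injected `log`/`shutdown`/`exit` callbacks are hypothetical names, not the project's API. In the real server.ts, `shutdown` would be whatever function the lifecycle module exposes for the graceful path.

```typescript
// Error shape emitted by a Node-style server when listen() fails.
type ListenError = Error & { code?: string };

// Minimal surface needed from the server; an http.Server satisfies it.
interface ErrorEmitter {
  on(event: "error", listener: (err: ListenError) => void): unknown;
}

// Hypothetical dependencies, injected so the handler can be exercised without a real process.
interface ListenErrorDeps {
  log: (message: string) => void;
  shutdown: () => Promise<void>; // assumed: the same graceful path the signal handlers use
  exit: (code: number) => void;
}

function attachListenErrorHandler(server: ErrorEmitter, port: number, deps: ListenErrorDeps): void {
  server.on("error", (err: ListenError) => {
    if (err.code === "EADDRINUSE") {
      deps.log(
        `Port ${port} is already in use. Find the holder with: lsof -nP -iTCP:${port} -sTCP:LISTEN`,
      );
    } else {
      deps.log(`Server error: ${err.message}`);
    }
    // Stop children (including antfly) before exiting non-zero.
    deps
      .shutdown()
      .catch((shutdownErr: unknown) => deps.log(`Shutdown failed: ${String(shutdownErr)}`))
      .finally(() => deps.exit(1));
  });
}
```

The handler would need to be attached before `server.listen(port, ...)` is called so the 'error' event is never unhandled.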

3. When the process is deadlocked, no JS handler can run at all

The antfly-zig migration put the private instance's --data-dir at ~/.bakin/antfly/, inside the chokidar-watched content dir. antfly's segment/WAL churn floods the watcher and deadlocks Bun natively (sampled: main thread + all 12 Bun Pool threads + the File Watcher thread parked on os_unfair_lock/__ulock_wait2, 0% CPU, every HTTP request hangs). A deadlocked event loop can't run SIGINT/SIGTERM JS handlers → operator escalates to kill -9 → antfly orphaned. (Observed directly: dev.ts survived two SIGTERMs, needed SIGKILL.)

The watcher fix (shouldIgnoreContentWatcherPath now ignores antfly/) ships on the migration branch (PR #457), as does a belt-and-braces sync process.on('exit') hook in the adapter that kills the antfly child on any JS-level exit. This issue tracks the dev-loop/server defects (1) and (2), which are independent of the migration.
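For orientation, the two migration-branch mitigations could look roughly like the sketch below. This is an illustration only: `ignoresAntflyDir` and `killChildOnExit` are hypothetical names, the actual `shouldIgnoreContentWatcherPath` logic in PR #457 is not shown in this issue, and the sketch assumes paths are given relative to the content dir.

```typescript
// Assumed: the path is relative to the watched content dir.
// Returns true for the antfly data dir itself and anything beneath it.
function ignoresAntflyDir(relativePath: string): boolean {
  const segments = relativePath.split(/[\\/]+/).filter(Boolean);
  return segments[0] === "antfly";
}

// Minimal surfaces, so the hook can be exercised without a real child process.
interface KillableChild {
  killed: boolean;
  kill(signal?: string): boolean;
}
interface ExitEmitter {
  on(event: "exit", listener: () => void): unknown;
}

// 'exit' listeners must be synchronous. kill() only sends a signal, so it
// qualifies as a last-resort cleanup on any JS-level exit.
function killChildOnExit(proc: ExitEmitter, child: KillableChild): void {
  proc.on("exit", () => {
    if (!child.killed) child.kill("SIGTERM");
  });
}
```

Neither mitigation helps when the event loop is deadlocked and the operator resorts to kill -9, because no JS runs in that case.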

Proposed fix (small, branch off main)

  • scripts/dev.ts: until the server's lifecycle handlers are registered, the dev shutdown handler owns tailwind cleanup and the fall-through exit. Once they are registered, it must NOT call process.exit(0): it should only kill tailwind and let the lifecycle listener (registered later on the same signal) run the full shutdown and exit.
  • server.ts: attach server.on('error', ...): on EADDRINUSE, log a clear message naming the port and the lsof -nP -iTCP:3737 -sTCP:LISTEN remediation, then run the same graceful-shutdown path (so the antfly child is stopped) and exit non-zero. Optionally pre-flight the bind early in main() to fail before any side effects.
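The dev.ts half of the proposal can be sketched as a handler factory. This is a minimal illustration under assumptions: `makeDevSignalHandler`, `DevShutdownDeps`, and the `lifecycleRegistered` probe are hypothetical names, and how dev.ts actually learns that the lifecycle handlers are installed is left open.

```typescript
// Hypothetical dependencies, injected so the handler can be tested without signals.
interface DevShutdownDeps {
  killTailwind: () => void;
  lifecycleRegistered: () => boolean; // assumed flag: true once server.ts has installed its handlers
  exit: (code: number) => void;
}

function makeDevSignalHandler(deps: DevShutdownDeps): () => void {
  return () => {
    // Tailwind is always dev.ts's responsibility.
    deps.killTailwind();
    // Once the lifecycle listener exists, return without exiting so that
    // listener (registered later on the same signal) can run the full shutdown.
    if (deps.lifecycleRegistered()) return;
    // Early boot: nothing else will exit the process, so do it here.
    deps.exit(0);
  };
}
```

In dev.ts the returned function would be registered for both signals, for example `process.on('SIGTERM', handler)`, in place of the current inline handlers.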

Repro

  1. bun run dev, wait for ready.
  2. kill -TERM <dev pid> → dev exits, lsof -nP -iTCP:3738 -sTCP:LISTEN still shows antfly (defect 1).
  3. bun run dev again (3738 squatter is adopted by the probe) — now squat 3737 with anything and start a third: full boot, then uncaught EADDRINUSE crash, antfly stays (defect 2).
