
Fix FD-leak self-shutdown becoming a half-dead zombie #81

Open
rephapeng wants to merge 1 commit into anvie:main from rephapeng:fix/fd-leak-zombie-shutdown

rephapeng commented Jun 24, 2026


Summary

Three fixes for the recurring failure where evonic.service dies on its own: the docker-backend FD watchdog sends SIGTERM to the process at fd > 400, but the process survived as a half-dead zombie (runtime drained, web server still serving), so systemd never restarted it.

Background

The FD watchdog (backend/tools/lib/backends/docker_backend.py) sends SIGTERM to the process when the open-FD count climbs past 400, to prevent a cascade. The original FD source was an SFTP permission-denied retry loop (fixed in #3 below). With that loop fixed, two further problems remained; they are addressed by #1 and #2 below.
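
For context, the watchdog's check can be pictured roughly as follows. This is a minimal sketch, not the code in docker_backend.py: the names FD_LIMIT, count_open_fds, should_terminate and check_once are hypothetical, and it assumes a Linux host where /proc/self/fd lists the process's open descriptors.

```python
import os
import signal

FD_LIMIT = 400  # threshold quoted in this PR; the constant name is hypothetical


def count_open_fds(fd_dir: str = "/proc/self/fd") -> int:
    """Count this process's open file descriptors (Linux-only assumption)."""
    return len(os.listdir(fd_dir))


def should_terminate(fd_count: int, limit: int = FD_LIMIT) -> bool:
    """Return True once the FD count has climbed past the limit."""
    return fd_count > limit


def check_once() -> bool:
    """Send SIGTERM to our own process if the FD count is over the limit."""
    if should_terminate(count_open_fds()):
        os.kill(os.getpid(), signal.SIGTERM)
        return True
    return False
```

The point of the sketch is that the watchdog only delivers a signal. Whether the process actually exits depends on the SIGTERM handler, which is what #2 below addresses.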

Fixes

1. Close the rate-limiter's thread-local SQLite connections on teardown (app.py)

The dominant remaining FD leak. Flask's dev server runs threaded=True (one thread per request). The before_request rate-limit check opens a thread-local SQLite connection — 3 FDs each in WAL mode (db + -wal + -shm). teardown_request closed the main db connection but not the api_rate_limit / rate_limit ones, so they accumulated until GC — ~180 FDs on api_rate_limit.db alone, tripping the watchdog.

Fix: teardown_request now also calls api_rate_limit.close() and rate_limit.close(), mirroring the existing db.close() pattern.

Verified: open handles on api_rate_limit.db stayed flat at 8 across 70+ requests (previously grew unbounded toward 180+).

2. Guarantee the process actually exits on SIGTERM (backend/agent_runtime/runtime.py)

_signal_handler runs graceful_shutdown() then sys.exit(0). But sys.exit() does not terminate the process directly; it only raises SystemExit, here in the main thread, where Python runs signal handlers. When SIGTERM arrives while the main thread is blocked in the threaded WSGI accept loop, the server swallows that SystemExit. The result: queue workers and the background executor stop, but the process keeps serving requests. systemd still sees it as active, so Restart=always never fires.

Fix: before sys.exit(0), arm a daemon threading.Timer backstop that calls os._exit(0) after WORKER_JOIN_TIMEOUT_SECONDS + 3 seconds. The clean exit (running atexit Docker cleanup) is still attempted first; the backstop guarantees the process dies, and thus restarts, if the clean exit is swallowed.
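
A minimal sketch of such a backstop, under stated assumptions: arm_hard_exit_backstop, the graceful_shutdown stub and the value 5 for WORKER_JOIN_TIMEOUT_SECONDS are hypothetical, and arming the timer before graceful_shutdown() is one possible ordering rather than a description of runtime.py.

```python
import os
import sys
import threading

WORKER_JOIN_TIMEOUT_SECONDS = 5  # placeholder value, not taken from runtime.py


def graceful_shutdown() -> None:
    """Stand-in for draining queue workers and the background executor."""


def arm_hard_exit_backstop(delay: float, exit_fn=os._exit) -> threading.Timer:
    """Schedule exit_fn(0) to run after `delay` seconds on a daemon thread."""
    # Pass the callable and its argument separately: writing os._exit(0) as
    # the second argument would exit immediately instead of scheduling it.
    timer = threading.Timer(delay, exit_fn, args=(0,))
    timer.daemon = True  # a successful clean exit is never delayed by the timer
    timer.start()
    return timer


def signal_handler(signum, frame) -> None:
    # Assumed ordering: arm first so a hung shutdown is covered as well.
    arm_hard_exit_backstop(WORKER_JOIN_TIMEOUT_SECONDS + 3)
    graceful_shutdown()
    sys.exit(0)  # raises SystemExit; the timer fires only if this is swallowed
```

Such a handler would be registered with signal.signal(signal.SIGTERM, signal_handler). Because os._exit skips atexit handlers, the delay should be long enough for the clean path to finish when it does work.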

3. Resolve the remote evonic dir against the REMOTE $HOME (backend/tools/lib/backends/ssh_backend.py)

Root cause of the original FD leak. _ensure_evonic_on_remote used os.path.expanduser('~/.evonic/evonic'), which expands ~ against the local process HOME (/root, since evonic runs as root). On the deploy VPS we log in as ubuntu, who can't write under /root, producing an endless SFTP [Errno 13] Permission denied retry loop (~every 10–20s) that leaked sockets until fd > 400.

Fix: _resolve_remote_evonic_dir() resolves the leading ~ via the remote shell (printf %s "$HOME"), cached per connection.
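
The idea can be sketched as below. This is an assumption-laden illustration, not the code in ssh_backend.py: RemoteDirResolver and its run_remote callable (anything that runs a command on the remote host and returns its stdout) are hypothetical names.

```python
import posixpath
from typing import Callable, Optional


class RemoteDirResolver:
    """Expand a leading ~ using the remote $HOME, cached per connection."""

    def __init__(self, run_remote: Callable[[str], str]):
        self._run_remote = run_remote
        self._home: Optional[str] = None  # filled on first lookup

    def _remote_home(self) -> str:
        if self._home is None:
            # Ask the remote shell; os.path.expanduser would use the local HOME.
            self._home = self._run_remote('printf %s "$HOME"').strip()
        return self._home

    def resolve(self, path: str) -> str:
        if path == "~":
            return self._remote_home()
        if path.startswith("~/"):
            return posixpath.join(self._remote_home(), path[2:])
        return path  # absolute or relative paths are left untouched
```

With one resolver per connection, the remote shell is queried once and later calls reuse the cached home directory. posixpath is used so that the join follows the remote host's path conventions regardless of the local platform.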

Testing

  • ast.parse(...) on both changed Python modules — parse OK
  • Imports models.api_rate_limit.close / models.rate_limit.close resolve
  • Live: restarted the service, hammered 70+ /api/* requests, confirmed FD count stays bounded and queue workers/scheduler come back up

Three fixes for the recurring "evonic dies on its own" failure (the FD
watchdog SIGTERMs at fd>400), where the process survived as a half-dead
zombie:

1. app.py teardown_request: also close the thread-local SQLite connections
   for api_rate_limit and rate_limit. The before_request rate-limit check
   opened a per-thread connection (3 FDs each in WAL mode) on every /api/*
   request; with Flask's thread-per-request these accumulated until GC
   (~180 FDs on api_rate_limit.db alone) — the dominant FD-leak source now
   that the SFTP loop is fixed. Mirrors the existing db.close() pattern.
   Verified flat at 8 handles across 70+ requests (was growing unbounded).

2. runtime._signal_handler: arm a daemon hard-exit backstop before sys.exit.
   sys.exit() only raises SystemExit in the main thread; when SIGTERM lands
   while the threaded WSGI server blocks in its accept loop, the server
   swallows SystemExit — runtime drains but the process keeps serving.
   systemd still sees it active so Restart=always never fires. os._exit
   backstop guarantees the restart after the graceful attempt.

3. ssh_backend: resolve _REMOTE_EVONIC_DIR's ~ against the REMOTE $HOME
   instead of os.path.expanduser (local HOME). The local expansion caused
   the original SFTP permission-denied retry loop that leaked sockets
   (root-cause fix, previously uncommitted).