Fix FD-leak self-shutdown becoming a half-dead zombie by rephapeng · Pull Request #81 · anvie/evonic

rephapeng · 2026-06-24T14:24:30Z

Summary

Three fixes for the recurring failure where evonic.service shuts itself down: the docker-backend watchdog sends SIGTERM once the process exceeds 400 open FDs, but the process survived as a half-dead zombie (runtime drained, web server still serving), so systemd never restarted it.

Background

The FD watchdog (backend/tools/lib/backends/docker_backend.py) sends SIGTERM to the process when the open-FD count climbs past 400, to prevent a cascade. The original FD source was an SFTP permission-denied retry loop (fixed by item 3 below). With that loop fixed, two further problems remained:

Fixes

1. Close the rate-limiter's thread-local SQLite connections on teardown (`app.py`)

The dominant remaining FD leak. Flask's dev server runs threaded=True (one thread per request). The before_request rate-limit check opens a thread-local SQLite connection — 3 FDs each in WAL mode (db + -wal + -shm). teardown_request closed the main db connection but not the api_rate_limit / rate_limit ones, so they accumulated until GC — ~180 FDs on api_rate_limit.db alone, tripping the watchdog.

Fix: teardown_request now also calls api_rate_limit.close() and rate_limit.close(), mirroring the existing db.close() pattern.

Verified: open handles on api_rate_limit.db stayed flat at 8 across 70+ requests (previously grew unbounded toward 180+).

2. Guarantee the process actually exits on SIGTERM (`backend/agent_runtime/runtime.py`)

_signal_handler runs graceful_shutdown() then sys.exit(0). But sys.exit() only raises SystemExit in the main thread, and when SIGTERM arrives while the main thread is blocked in the threaded WSGI accept loop, the server swallows that SystemExit. The result: queue workers + background executor stop, but the process keeps serving requests. systemd still sees it as active, so Restart=always never fires.

Fix: before calling sys.exit(0), arm a daemon threading.Timer that fires after WORKER_JOIN_TIMEOUT_SECONDS + 3 seconds and calls os._exit(0) as a backstop. The clean exit (which runs the atexit Docker cleanup) is still attempted first; the backstop guarantees the process dies, and is therefore restarted, if the clean exit is swallowed.
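A rough sketch of the backstop follows. It is not the code in runtime.py: the value of `WORKER_JOIN_TIMEOUT_SECONDS`, the helper names `arm_hard_exit_backstop` and `install_handler`, the injectable `exit_fn`, and the choice to arm the timer at the top of the handler are all assumptions made for illustration.

```python
import os
import signal
import sys
import threading

WORKER_JOIN_TIMEOUT_SECONDS = 10  # assumed value; the real constant lives in the runtime


def arm_hard_exit_backstop(delay, exit_fn=os._exit):
    """Start a daemon timer that hard-exits the process after `delay` seconds.

    The callable and its argument are passed separately; writing
    os._exit(0) inline would terminate the process immediately.
    """
    timer = threading.Timer(delay, exit_fn, args=(0,))
    timer.daemon = True  # never keeps the interpreter alive on a clean exit
    timer.start()
    return timer


def graceful_shutdown():
    """Placeholder for draining queue workers and the background executor."""


def _signal_handler(signum, frame):
    # Arm the backstop first so it also covers a hung graceful shutdown.
    arm_hard_exit_backstop(WORKER_JOIN_TIMEOUT_SECONDS + 3)
    graceful_shutdown()
    sys.exit(0)  # clean path: runs atexit hooks if SystemExit propagates


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, _signal_handler)
```

Because os._exit skips atexit handlers and buffered-output flushing, it suits a last-resort path only; the timer delay leaves the clean exit time to finish first.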

3. Resolve the remote evonic dir against the REMOTE `$HOME` (`backend/tools/lib/backends/ssh_backend.py`)

Root cause of the original FD leak. _ensure_evonic_on_remote used os.path.expanduser('~/.evonic/evonic'), which expands ~ against the local process HOME (/root, since evonic runs as root). On the deploy VPS we log in as ubuntu, who can't write under /root, producing an endless SFTP [Errno 13] Permission denied retry loop (~every 10–20s) that leaked sockets until fd > 400.

Fix: _resolve_remote_evonic_dir() resolves the leading ~ via the remote shell (printf %s "$HOME"), cached per connection.
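The per-connection resolution could look roughly like this. It is a sketch under assumptions: `RemoteDirResolver` and the `run_remote` callable (which runs a command on the remote host and returns its stdout) are hypothetical stand-ins for whatever the SSH backend actually uses.

```python
import posixpath

REMOTE_EVONIC_DIR = "~/.evonic/evonic"


class RemoteDirResolver:
    """Expand a leading ~ using the remote user's $HOME, cached per connection."""

    def __init__(self, run_remote):
        # run_remote: hypothetical callable(command: str) -> str (remote stdout)
        self._run_remote = run_remote
        self._home = None

    def resolve(self, path=REMOTE_EVONIC_DIR):
        # Paths without a leading ~ need no remote round trip.
        if path != "~" and not path.startswith("~/"):
            return path
        if self._home is None:
            # Ask the remote shell, not the local os.path.expanduser.
            self._home = self._run_remote('printf %s "$HOME"').strip()
        if path == "~":
            return self._home
        return posixpath.join(self._home, path[2:])
```

With one resolver per SSH connection, the remote shell is queried at most once, and a login as ubuntu yields a path under that user's home rather than under the local /root.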

Testing

ast.parse(...) on both changed Python modules — parse OK
Imports models.api_rate_limit.close / models.rate_limit.close resolve
Live: restarted the service, hammered 70+ /api/* requests, confirmed FD count stays bounded and queue workers/scheduler come back up

Three fixes for the recurring "evonic dies on its own" failure (the FD watchdog sends SIGTERM at fd > 400), where the process survived as a half-dead zombie:

1. app.py teardown_request: also close the thread-local SQLite connections for api_rate_limit and rate_limit. The before_request rate-limit check opened a per-thread connection (3 FDs each in WAL mode) on every /api/* request; with Flask's thread-per-request model these accumulated until GC (~180 FDs on api_rate_limit.db alone). This is the dominant FD-leak source now that the SFTP loop is fixed. Mirrors the existing db.close() pattern. Verified flat at 8 handles across 70+ requests (was growing unbounded).

2. runtime._signal_handler: arm a daemon hard-exit backstop before sys.exit. sys.exit() only raises SystemExit in the main thread; when SIGTERM lands while the threaded WSGI server blocks in its accept loop, the server swallows SystemExit, so the runtime drains but the process keeps serving. systemd still sees it as active, so Restart=always never fires. The os._exit backstop guarantees the restart after the graceful attempt.

3. ssh_backend: resolve the ~ in _REMOTE_EVONIC_DIR against the REMOTE $HOME instead of using os.path.expanduser (local HOME). This was the cause of the original SFTP permission-denied retry loop that leaked sockets (root-cause fix, previously uncommitted).

rephapeng force-pushed the fix/fd-leak-zombie-shutdown branch 2 times, most recently from a178330 to e827f64 on June 24, 2026 14:33

rephapeng mentioned this pull request Jun 24, 2026

Sync main: bulk kanban create tool + FD-leak zombie-shutdown fixes rephapeng/evonic#1

Merged

rephapeng wants to merge 1 commit into anvie:main from rephapeng:fix/fd-leak-zombie-shutdown
