Fix FD-leak self-shutdown becoming a half-dead zombie #81
Open
rephapeng wants to merge 1 commit into
Three fixes for the recurring "evonic mati sendiri" ("evonic dies on its own") failure, where the FD watchdog SIGTERMs the process at fd > 400 but the process survived as a half-dead zombie:

1. app.py teardown_request: also close the thread-local SQLite connections for api_rate_limit and rate_limit. The before_request rate-limit check opened a per-thread connection (3 FDs each in WAL mode) on every /api/* request; with Flask's thread-per-request model these accumulated until GC (~180 FDs on api_rate_limit.db alone). This is the dominant FD-leak source now that the SFTP loop is fixed. Mirrors the existing db.close() pattern. Verified flat at 8 handles across 70+ requests (was growing unbounded).
2. runtime._signal_handler: arm a daemon hard-exit backstop before sys.exit. sys.exit() only raises SystemExit in the main thread; when SIGTERM lands while the threaded WSGI server blocks in its accept loop, the server swallows SystemExit, so the runtime drains but the process keeps serving. systemd still sees it as active, so Restart=always never fires. The os._exit backstop guarantees the restart after the graceful attempt.
3. ssh_backend: resolve the ~ in _REMOTE_EVONIC_DIR against the REMOTE $HOME instead of using os.path.expanduser (local HOME). This was the source of the original SFTP permission-denied retry loop that leaked sockets (root-cause fix, previously uncommitted).
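A minimal sketch of the hard-exit backstop from item 2, not the actual runtime.py code. The names make_signal_handler, backstop_delay and hard_exit are hypothetical; hard_exit is injectable only so the sketch can be exercised without killing the interpreter. Arming the timer before the graceful shutdown is an assumption about ordering.

```python
import os
import sys
import threading


def make_signal_handler(graceful_shutdown, backstop_delay, hard_exit=os._exit):
    """Build a SIGTERM handler with a hard-exit backstop (hypothetical helper)."""

    def handler(signum, frame):
        # Arm the backstop first: a daemon timer that force-exits the process
        # if the clean exit below never actually terminates it.
        backstop = threading.Timer(backstop_delay, hard_exit, args=(0,))
        backstop.daemon = True
        backstop.start()

        graceful_shutdown()  # drain workers, run cleanup
        sys.exit(0)  # raises SystemExit; a server loop may swallow it

    return handler
```

In a real service the handler would be registered with signal.signal(signal.SIGTERM, handler), with backstop_delay set to something like WORKER_JOIN_TIMEOUT_SECONDS + 3. If SystemExit propagates normally the process is gone before the timer fires; if it is swallowed, os._exit ends the process without running atexit handlers.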
a178330 to
e827f64
Compare
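The remote-home resolution described in item 3 of the commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code in ssh_backend.py: resolve_remote_dir and run_remote are hypothetical names, run_remote is assumed to run a shell command on the remote host and return its stdout, and the per-connection caching is left out.

```python
import posixpath


def resolve_remote_dir(path, run_remote):
    """Expand a leading ~ using the remote $HOME (hypothetical helper).

    run_remote is an assumed callable that runs a shell command on the
    remote host and returns its stdout as a string.
    """
    if path == "~" or path.startswith("~/"):
        # Ask the remote shell for its own HOME instead of using the local one.
        remote_home = run_remote('printf %s "$HOME"').strip()
        rest = path[2:]
        return posixpath.join(remote_home, rest) if rest else remote_home
    # Paths without a leading ~ are passed through unchanged.
    return path
```

By contrast, os.path.expanduser would expand the same path against the HOME of the local process, which is the mismatch the commit message describes.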
Summary
Three fixes for the recurring failure where evonic.service dies on its own: the docker-backend watchdog SIGTERMs the process at fd > 400, but the process kept surviving as a half-dead zombie (runtime drained, web server still serving), so systemd never restarted it.

Background
The FD-watchdog (backend/tools/lib/backends/docker_backend.py) sends SIGTERM to the process when the open-FD count climbs past 400, to prevent a cascade. The original FD source was an SFTP permission-denied retry loop (addressed by fix 3 below). After that, two further problems remained, addressed by fixes 1 and 2 below.

Fixes
1. Close the rate-limiter's thread-local SQLite connections on teardown (app.py)

The dominant remaining FD leak. Flask's dev server runs threaded=True (one thread per request). The before_request rate-limit check opens a thread-local SQLite connection, which costs 3 FDs in WAL mode (db + -wal + -shm). teardown_request closed the main db connection but not the api_rate_limit / rate_limit ones, so they accumulated until GC: ~180 FDs on api_rate_limit.db alone, tripping the watchdog.

Fix: teardown_request now also calls api_rate_limit.close() and rate_limit.close(), mirroring the existing db.close() pattern.

Verified: open handles on api_rate_limit.db stayed flat at 8 across 70+ requests (previously grew unbounded toward 180+).

2. Guarantee the process actually exits on SIGTERM (backend/agent_runtime/runtime.py)

_signal_handler runs graceful_shutdown() then sys.exit(0). But sys.exit() only raises SystemExit in the main thread, and when SIGTERM arrives while the main thread is blocked in the threaded WSGI accept loop, the server swallows that SystemExit. The result: queue workers + background executor stop, but the process keeps serving requests. systemd still sees it as active, so Restart=always never fires.

Fix: before sys.exit(0), arm a daemon threading.Timer that calls os._exit(0) after WORKER_JOIN_TIMEOUT_SECONDS + 3 seconds as a backstop. The clean exit (running atexit Docker cleanup) is still attempted first; the backstop guarantees the process dies, and thus restarts, if the clean exit is swallowed.

3. Resolve the remote evonic dir against the REMOTE $HOME (backend/tools/lib/backends/ssh_backend.py)

Root cause of the original FD leak. _ensure_evonic_on_remote used os.path.expanduser('~/.evonic/evonic'), which expands ~ against the local process HOME (/root, since evonic runs as root). On the deploy VPS we log in as ubuntu, who can't write under /root, producing an endless SFTP [Errno 13] Permission denied retry loop (~every 10–20s) that leaked sockets until fd > 400.

Fix: _resolve_remote_evonic_dir() resolves the leading ~ via the remote shell (printf %s "$HOME"), cached per connection.

Testing
- ast.parse(...) on both changed Python modules: parse OK
- models.api_rate_limit.close / models.rate_limit.close resolve
- /api/* requests, confirmed FD count stays bounded and queue workers/scheduler come back up
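The bounded-handle check can be illustrated with a small simulation of fix 1. This is a sketch under assumptions, not the project's models.api_rate_limit module: ThreadLocalStore, handle_request and serve are hypothetical names, an in-memory database stands in for api_rate_limit.db, and a counter of explicitly closed connections stands in for real FD counts.

```python
import sqlite3
import threading


class ThreadLocalStore:
    """Hypothetical stand-in for a module keeping one SQLite connection per thread."""

    def __init__(self, path):
        self._path = path
        self._local = threading.local()
        self._lock = threading.Lock()
        self.open_count = 0  # connections opened and not yet explicitly closed

    def conn(self):
        c = getattr(self._local, "conn", None)
        if c is None:
            c = sqlite3.connect(self._path)
            self._local.conn = c
            with self._lock:
                self.open_count += 1
        return c

    def close(self):
        # Close only the calling thread's connection, if it has one.
        c = getattr(self._local, "conn", None)
        if c is not None:
            c.close()
            self._local.conn = None
            with self._lock:
                self.open_count -= 1


def handle_request(store, close_on_teardown):
    """Simulate one request: a rate-limit query, then the teardown step."""
    try:
        store.conn().execute("SELECT 1")
    finally:
        if close_on_teardown:
            store.close()


def serve(store, n_requests, close_on_teardown):
    """Run each simulated request on its own thread, as a threaded server would."""
    for _ in range(n_requests):
        t = threading.Thread(target=handle_request, args=(store, close_on_teardown))
        t.start()
        t.join()
    return store.open_count
```

With close_on_teardown=True the counter returns to zero after every request; with False it grows by one per request, which mirrors the unbounded growth the description reports before the fix.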