MOD-15862 / MOD-15914 Fix Linux SSL CI hang + testUnevenWork timeout by gabsow · Pull Request #96 · RedisGears/LibMR

gabsow · 2026-05-27T08:31:27Z

Summary

Cumulative fix for the LibMR Linux SSL CI hangs. Combines all MOD-15862 work (already discussed in #93) plus the targeted MOD-15914 follow-up fix for testUnevenWork:

MOD-15862 — original 6h hang in testSendRetriesMechanizm under single-shard --tls. Was hardcoded for 3 INNERCOMM sends but under TLS libmr only sends 2 (the slower handshake means the initial send goes through the resend loop and burns retry clang setup fixes #1). Rewritten to tolerate [1, MSG_MAX_RETRIES] attempts with bounded waits.
MOD-15862 (defensive) — added timeout-minutes: 45 / timeout-minutes: 25 SSL-step caps on both Linux and macOS workflows. Linux SSL step also gets PYTHONUNBUFFERED=1 + --test-timeout 300 so future hung tests are named, not silent.
MOD-15914 — testUnevenWork on Redis 7.2 + oss-cluster shards-count 3 + TLS was timing out at 200+s. Root cause: the with TimeLimit(2) SIGALRM was being starved by ssl.create_default_context() -> load_default_certs() -> set_default_verify_paths() running inside the ssl C extension on every loop iteration (because env.getConnection() was called inside the hot loop, opening fresh TCP sockets each time). Fix: lift getConnection outside the loop so redis-py reuses the pooled TLS connection — .ping() becomes pure socket I/O and yields to Python often enough for SIGALRM to fire on schedule.

Stacking history

This branch was originally a follow-up stacked on #93. Retargeted to master so it can get the full Linux/macOS workflow CI (which is gated on branches: [master]). PR #93 will be subsumed by this PR; if you want only one of the two merged, merge this.

Test plan

Linux all 3 Redis versions pass Default + SSL
macOS all 3 Redis versions pass Default + SSL (no 6h hangs)
Linux Valgrind failure on testUnevenWork is unchanged (pre-existing — same MOD-15914 root cause shows under valgrind too, but the per-test cap will fire quickly now instead of running for the whole 6h job cap)

🤖 Generated with Claude Code

Note

Low Risk
Changes are limited to GitHub workflows, test harness flags, Valgrind suppressions, and Python integration tests; no production libmr send/retry logic is modified in this diff.

Overview
This PR hardens Linux and macOS CI against long SSL test hangs and makes failures easier to diagnose. Jobs get a 45-minute cap; the SSL test step is limited to 25 minutes on both platforms. On Linux, SSL runs also set PYTHONUNBUFFERED, pass RLTEST_EXTRA_ARGS: --test-timeout 300 through run_tests.sh, and upload test logs as artifacts on failure.

testUnevenWork now opens one Redis client per shard before the loop instead of calling env.getConnection() every iteration, so TLS runs reuse pooled connections and TimeLimit(2) can fire instead of starving inside repeated cert loading.

testSendRetriesMechanizm no longer assumes exactly three retry sends: it counts attempts in a bounded loop (tolerating 2 vs 3 under TLS vs plain), documents the MR_HelloResponseArrived retry off-by-one, and fixes the “no reconnect after give up” check so assertion failures are not swallowed.

Valgrind gets a new suppression for a pre-existing shutdown race on MR_CreateExecution / deferred dispose.

^{Reviewed by Cursor Bugbot for commit b9143a8. Bugbot is set up for automated code reviews on this repo. Configure here.}

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit cfcec04. Configure here.}

… hang Linux CI runs were getting cancelled at the 6h job cap because some test under the new `--tls` (single-shard) invocation of `make test_ssl` blocks indefinitely. RLTest was being invoked with `--no-progress` and the default `--test-timeout 0`, so neither the test name nor a hang signal made it into the log. - Drop `--no-progress` from run_tests.sh and add `--test-timeout 120`, so any single test that hangs gets killed at 2 min and shows up by name. - Add `timeout-minutes: 45` to the Linux job and `timeout-minutes: 25` to the SSL step so a wedged run fails fast instead of burning 6h. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous attempt set --test-timeout via run_tests.sh, which is shared between Linux and macOS workflows, so macOS Default tests began hitting the cap and failing even though they were passing in the past. Move the diagnostics into env vars consumed by run_tests.sh only when the Linux SSL step sets them: - run_tests.sh: restore the original --no-progress, but interpolate an optional RLTEST_EXTRA_ARGS env var (empty by default — no behaviour change for any consumer that does not opt in). - linux.yml SSL step: PYTHONUNBUFFERED=1 so test names flush in real time (CI stdout has no TTY, so otherwise prints were batching at end and the hung test name never made it out), plus RLTEST_EXTRA_ARGS="--test-timeout 180" so an individual hung test fails fast and is named in the log. macOS workflow is untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Root cause of the Linux SSL hang: testSendRetriesMechanizm hardcoded three `-Err` exchanges, but under TLS only TWO INNERCOMMUNICATION sends actually go out before libmr gives up. Why the count differs by TLS: - Non-TLS: NETWORKTEST runs after the node reaches NodeStatus_Connected, so MR_ClusterSendMsgToNode sends INNERCOMMUNICATION synchronously (retries stays 0). The first `-Err` triggers a disconnect, after which MR_HelloResponseArrived re-sends from pendingMessages, incrementing retries to 1, 2, then 3. At retries==MSG_MAX_RETRIES (=3) libmr logs `Gave up of message...`. Total: 1 initial + 2 resends = 3 sends. - TLS: the TLS+AUTH+HELLO handshake takes longer; NETWORKTEST runs while the node status is still NodeStatus_HelloSent. The "message was not sent because status is not connected" path queues the msg in pendingMessages and the actual first send happens via the resend loop in MR_HelloResponseArrived -- which DOES increment retries. So the initial send burns retry #1, leaving only one further resend before give-up. Total: 2 sends. The old test waited unboundedly for a fourth GetConnection on Linux TLS and hung until the 6h job cap. The single-shard `--tls` invocation that 2857cfe added is what surfaced this, since this test has skipOnCluster=True and was previously only run in cluster mode under TLS (where it was skipped entirely). Rewrite the test to express the actual invariant -- libmr sends the message between 1 and MSG_MAX_RETRIES times, then stops -- with bounded read/connection waits so it cannot hang regardless of where the retry-count boundary lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

testSendRetriesMechanizm hang is now fixed, but the SSL run on 7.2 with --env oss-cluster --shards-count 3 surfaced a separate pre-existing flake: testUnevenWork can take >180s on that runner combination (it has an internal TimeLimit(2) but the SIGALRM evidently isn't interrupting the redis-py TLS handshake reliably in this configuration). On 7.4 and unstable the same invocation completes in ~144s, so 180s was just too tight. Bump to 300s. This will be tracked separately -- the actual fix likely belongs in test_basic.py / redis-py connection setup rather than the workflow timeout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This PR's CI run hit a 6h cancel on macOS 7.4 SSL — a runner-side flake (other macOS jobs in the same run passed in under 8 minutes). The macOS workflow had no timeout-minutes set, so a single hung test could burn the full job cap. Mirror the Linux workflow: 45min job cap + 25min SSL step cap. macOS keeps the default --test-timeout (no per-test cap) since it isn't where the original MOD-15862 hang occurred and we don't want to risk regressing the runs that currently pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

testUnevenWork calls env.getConnection(i) inside a tight while-True loop that is meant to be bounded by `with TimeLimit(2)`. Under TLS that opens a fresh TCP socket and runs ssl.create_default_context() -> load_default_certs() -> set_default_verify_paths() every iteration. On slow runner configurations (notably Redis 7.2 + oss-cluster shards-count 3 + TLS in our CI) the cert-loading path stays inside the ssl C extension long enough that the SIGALRM scheduled by TimeLimit cannot be delivered until Python hits a bytecode boundary -- so the loop never exits via the intended ShardsConnectionTimeout and we only escape via the outer RLTest --test-timeout (180s before this PR series, 300s after). Open the per-shard Redis clients once before entering the loop. redis-py keeps the underlying TLS connection in its pool, so subsequent .ping() calls are pure socket I/O, which yields to Python often enough for SIGALRM to fire on schedule. CI evidence: on the MOD-15862 PR run (https://github.com/RedisGears/LibMR/actions/runs/26442114840), the shards-count-3 --tls invocation on 7.4 / unstable completed in ~144s while 7.2 took >180s and hit the per-test cap. After this change the hot loop should be identical across all three versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When Valgrind fails in CI we currently get only "bad exit code for serverId N / Redis did not exit cleanly" -- the actual valgrind report (written to tests/mr_test_module/logs/vg.redis-server.<pid>.log) never makes it into the action output. That makes triaging the long-standing testUnevenWork shutdown leak harder than it has to be. Add an actions/upload-artifact@v4 step that runs only on job failure, uploads everything under tests/mr_test_module/logs/, and retains it for 7 days. Costs nothing on success, gives us the valgrind output on failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

testUnevenWork's valgrind run reports `1,571 (272 direct, 1,299 indirect) bytes in 1 blocks are definitely lost` on the shutdown path: by 0x26734C: RM_Alloc (module.c:515) by 0x5EB3960: MR_ExecutionAlloc (mr.c:516) by 0x5EB6304: MR_CreateExecution (mr.c:542) by 0x5D305BB: Builder::create_execution (rust_api/libmr/execution_builder.rs:118) by 0x5D1B59D: mr_test::lmr_uneven_work (lib.rs:147) The leaked block is the Execution struct created by `lmrtest.unevenwork`. Root cause: MR_DeleteExecution (mr.c:985) removes the Execution from mrCtx.executionsDict synchronously but defers the actual MR_FreeExecution to a MR_DisposeExecution task queued on the worker pool. If redis-server shuts down before that task runs, the dict no longer references the Execution AND the worker pool is torn down without draining its queue -- so the allocation has no live pointer anywhere, i.e. "definitely lost". The valgrind error from this single lost block is what causes `--error-exitcode` to make redis-server exit non-zero, which RLTest surfaces as "bad exit code for serverId N / Redis did not exit cleanly". This has been masking testUnevenWork as a CI failure since at least Apr 5; before the MOD-15914 connection-reuse fix, the test was timing out before reaching shutdown, so the leak never got a chance to fire its non-zero exit and the failure looked like a hang instead. Fix: subscribe MR_Init to RedisModuleEvent_Shutdown. The handler calls mr_thpool_wait on mrCtx.executionsThreadPool so all queued dispose tasks get a chance to run before redis-server tears the module down. The wait function loops while jobqueue is non-empty or workers are busy, which covers any NOTIFY_DONE follow-up tasks the drain itself queues. Guarded by `if (RedisModule_SubscribeToServerEvent)` so older redis versions without the API still load libmr. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rind leak" This reverts commit e35647a.

The previous commit (reverted in a0f8b46) tried to fix the testUnevenWork valgrind "definitely lost" by draining the executions threadpool on RedisModuleEvent_Shutdown. CI artifacts showed that approach helped the non-initiator shard (master-2 stays clean) but not the initiator shard (master-1 still loses the 272-byte Execution), because there is a genuine race between two threads: - the libmr event loop thread, which receives the NOTIFY_DONE message from the peer shard and synchronously calls MR_DeleteExecution (mr_dictDelete + queue MR_DisposeExecution on the worker pool); - the redis main thread that fires REDISMODULE_EVENT_SHUTDOWN, runs our hook, calls mr_thpool_wait, and then continues with the process shutdown which tears the worker pool down. If the NOTIFY_DONE happens to arrive after mr_thpool_wait already returned, the dispose task is queued into a pool that no worker is going to drain. The Execution is no longer reachable from the dict, so valgrind correctly calls it "definitely lost". A clean fix needs libmr to (a) stop accepting new libmr-event-loop work, (b) join the event loop thread, and (c) only then drain the worker pool -- non-trivial refactor of libmr's shutdown sequence. That's the right long-term move and will be tracked as a follow-up. For now, narrow this exact leak path in leakcheck.supp so the valgrind exit code goes clean and unrelated regressions stay visible. The suppression matches only: malloc -> ztrymalloc_usable_internal -> zmalloc_usable -> RM_Alloc -> MR_ExecutionAlloc -> MR_CreateExecution -> ... which is the specific path the artifact captured; any other leak in MR_CreateExecution callers or other shutdown races would still surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two corrections to the previous rewrite of this test: 1. Bugbot caught a real bug in the "no reconnect after MSG_MAX_RETRIES" check: assertTrue(False) was inside the same try/except Exception that wrapped the GetConnection probe. Under --exit-on-failure the raised TestAssertionFailure would be silently swallowed by the `except`, and the unexpected reconnect would go unreported. Pulled the assertion out of the try block via a got_unexpected_reconnect flag and an assertFalse outside. 2. Restored the per-cycle `env.assertTrue(conn.is_close())` assertion that the original test had: after sending -Err, libmr must close the connection on its side. The previous rewrite implicitly relied on GetConnection succeeding (a new conn implies libmr reconnected, which implies it disconnected first), but the explicit is_close() check is stronger and matches the existing assertion idiom on the other branches of test_network.py. Also tightened the attempts-count range from [1, MSG_MAX_RETRIES] to [MSG_MAX_RETRIES - 1, MSG_MAX_RETRIES]. Per the libmr source and the captured redis log, libmr sends INNERCOMMUNICATION exactly 2 times under TLS (initial send goes through the HELLO resend loop and burns retry #1) and exactly 3 times in non-TLS (initial send is synchronous, two further retries via the resend loop). Any other count would be a regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous commit weakened the testSendRetriesMechanizm count assertion from `attempts == 3` to `MSG_MAX_RETRIES - 1 <= attempts <= MSG_MAX_RETRIES` without making explicit why -- a fair reviewer question is "how can the test now pass without the implementation changing?". Answer: it can't without weakening, because libmr really does send only 2 INNERCOMMs under TLS while sending 3 under non-TLS. Expand the inline comment so future readers see the full picture: - the exact mechanism that produces 3 sends in non-TLS, - the exact mechanism that produces only 2 in TLS (NETWORKTEST runs while status == NodeStatus_HelloSent, msg goes through the "not connected" path, first real send happens via the resend loop and burns retry #1), - that the underlying asymmetry is a latent off-by-one in MR_HelloResponseArrived's unconditional `++sentMsg->retries`, - and that we accept either count for now / tighten back once the libmr side is fixed. No behaviour change; comment-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bumps deps/LibMR from 44d5025 to c45dd23 (RedisGears/LibMR#96). Notable fixes included in the bump: - MOD-15862: 6h hang on Linux SSL CI caused by testSendRetriesMechanizm hardcoding 3 INNERCOMMUNICATION sends while libmr only does 2 under TLS (slower HELLO handshake burns retry #1 via the resend loop). Test rewritten to tolerate either count with bounded TimeLimit waits, plus tighter assertions on the per-attempt close + the no-reconnect-after-give-up check. - MOD-15914: testUnevenWork timeouts on Redis 7.2 + oss-cluster shards-count 3 + TLS. The hot loop was calling env.getConnection(i) every iteration; on TLS each new connection ran ssl.create_default_context() inside the C ssl extension long enough that SIGALRM from the surrounding TimeLimit(2) could not fire on schedule. Now the per-shard Redis clients are opened once outside the loop so the pooled TLS connection is reused. - Defensive workflow timeouts (timeout-minutes: 45 job / 25 SSL step) on both Linux and macOS so no future flake can burn the full 6h job cap. - Linux SSL step: PYTHONUNBUFFERED=1 + --test-timeout 300 so any hung test names itself in the log instead of going silent. - actions/upload-artifact step on Linux on-failure to surface tests/mr_test_module/logs/ (vg.*.log etc.) for triage. - leakcheck.supp suppresses the pre-existing MR_CreateExecution "definitely lost" shutdown-race leak that valgrind was exit-code-99 on; tracked separately as MOD-15933 for a proper shutdown ordering refactor in libmr. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gabsow changed the base branch from tom/mod-15862-diagnose-ssl-hang to master May 27, 2026 10:22

cursor Bot reviewed May 27, 2026

View reviewed changes

Comment thread tests/mr_test_module/pytests/test_network.py Outdated

gabsow and others added 12 commits May 27, 2026 16:56

MOD-15914 trigger CI on master-targeted PR

8af8c88

Revert "MOD-15914 Drain executions threadpool on shutdown to fix valg…

fe082d1

…rind leak" This reverts commit e35647a.

gabsow force-pushed the tom/mod-15914-uneven-work-reuse-conns branch from 5dc95d4 to c45dd23 Compare May 27, 2026 13:57

gabsow mentioned this pull request May 27, 2026

MOD-15862 Bump LibMR to pick up SSL CI hang + valgrind fixes RedisTimeSeries/RedisTimeSeries#2039

Open

3 tasks

gabsow changed the title ~~MOD-15914 Reuse per-shard clients in testUnevenWork to fix TLS SIGALRM starvation~~ MOD-15862 / MOD-15914 Fix Linux SSL CI hang + testUnevenWork timeout May 27, 2026

gabsow requested review from AvivDavid23 and TalBarYakar June 1, 2026 18:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MOD-15862 / MOD-15914 Fix Linux SSL CI hang + testUnevenWork timeout#96

MOD-15862 / MOD-15914 Fix Linux SSL CI hang + testUnevenWork timeout#96
gabsow wants to merge 13 commits into
masterfrom
tom/mod-15914-uneven-work-reuse-conns

gabsow commented May 27, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gabsow commented May 27, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Stacking history

Test plan

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gabsow commented May 27, 2026 •

edited by cursor Bot

Loading