Skip to content

MOD-15862 / MOD-15914 Fix Linux SSL CI hang + testUnevenWork timeout#96

Open
gabsow wants to merge 13 commits into
masterfrom
tom/mod-15914-uneven-work-reuse-conns
Open

MOD-15862 / MOD-15914 Fix Linux SSL CI hang + testUnevenWork timeout#96
gabsow wants to merge 13 commits into
masterfrom
tom/mod-15914-uneven-work-reuse-conns

Conversation

@gabsow
Copy link
Copy Markdown
Contributor

@gabsow gabsow commented May 27, 2026

Summary

Cumulative fix for the LibMR Linux SSL CI hangs. Combines all MOD-15862 work (already discussed in #93) plus the targeted MOD-15914 follow-up fix for testUnevenWork:

  1. MOD-15862 — original 6h hang in testSendRetriesMechanizm under single-shard --tls. Was hardcoded for 3 INNERCOMM sends but under TLS libmr only sends 2 (the slower handshake means the initial send goes through the resend loop and burns retry clang setup fixes #1). Rewritten to tolerate [1, MSG_MAX_RETRIES] attempts with bounded waits.
  2. MOD-15862 (defensive) — added timeout-minutes: 45 / timeout-minutes: 25 SSL-step caps on both Linux and macOS workflows. Linux SSL step also gets PYTHONUNBUFFERED=1 + --test-timeout 300 so future hung tests are named, not silent.
  3. MOD-15914 — testUnevenWork on Redis 7.2 + oss-cluster shards-count 3 + TLS was timing out at 200+s. Root cause: the with TimeLimit(2) SIGALRM was being starved by ssl.create_default_context() -> load_default_certs() -> set_default_verify_paths() running inside the ssl C extension on every loop iteration (because env.getConnection() was called inside the hot loop, opening fresh TCP sockets each time). Fix: lift getConnection outside the loop so redis-py reuses the pooled TLS connection — .ping() becomes pure socket I/O and yields to Python often enough for SIGALRM to fire on schedule.

Stacking history

This branch was originally a follow-up stacked on #93. Retargeted to master so it can get the full Linux/macOS workflow CI (which is gated on branches: [master]). PR #93 will be subsumed by this PR; if you want only one of the two merged, merge this.

Test plan

  • Linux all 3 Redis versions pass Default + SSL
  • macOS all 3 Redis versions pass Default + SSL (no 6h hangs)
  • Linux Valgrind failure on testUnevenWork is unchanged (pre-existing — same MOD-15914 root cause shows under valgrind too, but the per-test cap will fire quickly now instead of running for the whole 6h job cap)

🤖 Generated with Claude Code


Note

Low Risk
Changes are limited to GitHub workflows, test harness flags, Valgrind suppressions, and Python integration tests; no production libmr send/retry logic is modified in this diff.

Overview
This PR hardens Linux and macOS CI against long SSL test hangs and makes failures easier to diagnose. Jobs get a 45-minute cap; the SSL test step is limited to 25 minutes on both platforms. On Linux, SSL runs also set PYTHONUNBUFFERED, pass RLTEST_EXTRA_ARGS: --test-timeout 300 through run_tests.sh, and upload test logs as artifacts on failure.

testUnevenWork now opens one Redis client per shard before the loop instead of calling env.getConnection() every iteration, so TLS runs reuse pooled connections and TimeLimit(2) can fire instead of starving inside repeated cert loading.

testSendRetriesMechanizm no longer assumes exactly three retry sends: it counts attempts in a bounded loop (tolerating 2 vs 3 under TLS vs plain), documents the MR_HelloResponseArrived retry off-by-one, and fixes the “no reconnect after give up” check so assertion failures are not swallowed.

Valgrind gets a new suppression for a pre-existing shutdown race on MR_CreateExecution / deferred dispose.

Reviewed by Cursor Bugbot for commit b9143a8. Bugbot is set up for automated code reviews on this repo. Configure here.

@gabsow gabsow changed the base branch from tom/mod-15862-diagnose-ssl-hang to master May 27, 2026 10:22
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit cfcec04. Configure here.

Comment thread tests/mr_test_module/pytests/test_network.py Outdated
gabsow and others added 12 commits May 27, 2026 16:56
… hang

Linux CI runs were getting cancelled at the 6h job cap because some test
under the new `--tls` (single-shard) invocation of `make test_ssl` blocks
indefinitely. RLTest was being invoked with `--no-progress` and the default
`--test-timeout 0`, so neither the test name nor a hang signal made it into
the log.

- Drop `--no-progress` from run_tests.sh and add `--test-timeout 120`, so
  any single test that hangs gets killed at 2 min and shows up by name.
- Add `timeout-minutes: 45` to the Linux job and `timeout-minutes: 25` to
  the SSL step so a wedged run fails fast instead of burning 6h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous attempt set --test-timeout via run_tests.sh, which is shared
between Linux and macOS workflows, so macOS Default tests began hitting
the cap and failing even though they were passing in the past.

Move the diagnostics into env vars consumed by run_tests.sh only when
the Linux SSL step sets them:

- run_tests.sh: restore the original --no-progress, but interpolate an
  optional RLTEST_EXTRA_ARGS env var (empty by default — no behaviour
  change for any consumer that does not opt in).
- linux.yml SSL step: PYTHONUNBUFFERED=1 so test names flush in real
  time (CI stdout has no TTY, so otherwise prints were batching at end
  and the hung test name never made it out), plus
  RLTEST_EXTRA_ARGS="--test-timeout 180" so an individual hung test
  fails fast and is named in the log.

macOS workflow is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the Linux SSL hang: testSendRetriesMechanizm hardcoded
three `-Err` exchanges, but under TLS only TWO INNERCOMMUNICATION
sends actually go out before libmr gives up.

Why the count differs by TLS:

  - Non-TLS: NETWORKTEST runs after the node reaches NodeStatus_Connected,
    so MR_ClusterSendMsgToNode sends INNERCOMMUNICATION synchronously
    (retries stays 0). The first `-Err` triggers a disconnect, after which
    MR_HelloResponseArrived re-sends from pendingMessages, incrementing
    retries to 1, 2, then 3. At retries==MSG_MAX_RETRIES (=3) libmr logs
    `Gave up of message...`. Total: 1 initial + 2 resends = 3 sends.

  - TLS: the TLS+AUTH+HELLO handshake takes longer; NETWORKTEST runs
    while the node status is still NodeStatus_HelloSent. The "message
    was not sent because status is not connected" path queues the msg
    in pendingMessages and the actual first send happens via the
    resend loop in MR_HelloResponseArrived -- which DOES increment
    retries. So the initial send burns retry #1, leaving only one
    further resend before give-up. Total: 2 sends.

The old test waited unboundedly for a fourth GetConnection on Linux
TLS and hung until the 6h job cap. The single-shard `--tls` invocation
that 2857cfe added is what surfaced this, since this test has
skipOnCluster=True and was previously only run in cluster mode under
TLS (where it was skipped entirely).

Rewrite the test to express the actual invariant -- libmr sends the
message between 1 and MSG_MAX_RETRIES times, then stops -- with
bounded read/connection waits so it cannot hang regardless of where
the retry-count boundary lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
testSendRetriesMechanizm hang is now fixed, but the SSL run on 7.2 with
--env oss-cluster --shards-count 3 surfaced a separate pre-existing
flake: testUnevenWork can take >180s on that runner combination (it has
an internal TimeLimit(2) but the SIGALRM evidently isn't interrupting
the redis-py TLS handshake reliably in this configuration). On 7.4 and
unstable the same invocation completes in ~144s, so 180s was just too
tight.

Bump to 300s. This will be tracked separately -- the actual fix likely
belongs in test_basic.py / redis-py connection setup rather than the
workflow timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR's CI run hit a 6h cancel on macOS 7.4 SSL — a runner-side flake
(other macOS jobs in the same run passed in under 8 minutes). The macOS
workflow had no timeout-minutes set, so a single hung test could burn
the full job cap.

Mirror the Linux workflow: 45min job cap + 25min SSL step cap. macOS
keeps the default --test-timeout (no per-test cap) since it isn't
where the original MOD-15862 hang occurred and we don't want to risk
regressing the runs that currently pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
testUnevenWork calls env.getConnection(i) inside a tight while-True
loop that is meant to be bounded by `with TimeLimit(2)`. Under TLS
that opens a fresh TCP socket and runs ssl.create_default_context()
-> load_default_certs() -> set_default_verify_paths() every iteration.
On slow runner configurations (notably Redis 7.2 + oss-cluster
shards-count 3 + TLS in our CI) the cert-loading path stays inside
the ssl C extension long enough that the SIGALRM scheduled by
TimeLimit cannot be delivered until Python hits a bytecode boundary
-- so the loop never exits via the intended ShardsConnectionTimeout
and we only escape via the outer RLTest --test-timeout (180s before
this PR series, 300s after).

Open the per-shard Redis clients once before entering the loop.
redis-py keeps the underlying TLS connection in its pool, so
subsequent .ping() calls are pure socket I/O, which yields to Python
often enough for SIGALRM to fire on schedule.

CI evidence: on the MOD-15862 PR run
(https://github.com/RedisGears/LibMR/actions/runs/26442114840), the
shards-count-3 --tls invocation on 7.4 / unstable completed in ~144s
while 7.2 took >180s and hit the per-test cap. After this change the
hot loop should be identical across all three versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Valgrind fails in CI we currently get only "bad exit code for
serverId N / Redis did not exit cleanly" -- the actual valgrind report
(written to tests/mr_test_module/logs/vg.redis-server.<pid>.log) never
makes it into the action output. That makes triaging the long-standing
testUnevenWork shutdown leak harder than it has to be.

Add an actions/upload-artifact@v4 step that runs only on job failure,
uploads everything under tests/mr_test_module/logs/, and retains it
for 7 days. Costs nothing on success, gives us the valgrind output on
failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
testUnevenWork's valgrind run reports `1,571 (272 direct, 1,299
indirect) bytes in 1 blocks are definitely lost` on the shutdown path:

    by 0x26734C: RM_Alloc (module.c:515)
    by 0x5EB3960: MR_ExecutionAlloc (mr.c:516)
    by 0x5EB6304: MR_CreateExecution (mr.c:542)
    by 0x5D305BB: Builder::create_execution (rust_api/libmr/execution_builder.rs:118)
    by 0x5D1B59D: mr_test::lmr_uneven_work (lib.rs:147)

The leaked block is the Execution struct created by `lmrtest.unevenwork`.
Root cause: MR_DeleteExecution (mr.c:985) removes the Execution from
mrCtx.executionsDict synchronously but defers the actual MR_FreeExecution
to a MR_DisposeExecution task queued on the worker pool. If redis-server
shuts down before that task runs, the dict no longer references the
Execution AND the worker pool is torn down without draining its queue
-- so the allocation has no live pointer anywhere, i.e. "definitely lost".

The valgrind error from this single lost block is what causes
`--error-exitcode` to make redis-server exit non-zero, which RLTest
surfaces as "bad exit code for serverId N / Redis did not exit cleanly".
This has been masking testUnevenWork as a CI failure since at least
Apr 5; before the MOD-15914 connection-reuse fix, the test was timing
out before reaching shutdown, so the leak never got a chance to fire
its non-zero exit and the failure looked like a hang instead.

Fix: subscribe MR_Init to RedisModuleEvent_Shutdown. The handler calls
mr_thpool_wait on mrCtx.executionsThreadPool so all queued dispose tasks
get a chance to run before redis-server tears the module down. The wait
function loops while jobqueue is non-empty or workers are busy, which
covers any NOTIFY_DONE follow-up tasks the drain itself queues.

Guarded by `if (RedisModule_SubscribeToServerEvent)` so older redis
versions without the API still load libmr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit (reverted in a0f8b46) tried to fix the testUnevenWork
valgrind "definitely lost" by draining the executions threadpool on
RedisModuleEvent_Shutdown. CI artifacts showed that approach helped the
non-initiator shard (master-2 stays clean) but not the initiator shard
(master-1 still loses the 272-byte Execution), because there is a
genuine race between two threads:

  - the libmr event loop thread, which receives the NOTIFY_DONE message
    from the peer shard and synchronously calls MR_DeleteExecution
    (mr_dictDelete + queue MR_DisposeExecution on the worker pool);
  - the redis main thread that fires REDISMODULE_EVENT_SHUTDOWN, runs
    our hook, calls mr_thpool_wait, and then continues with the
    process shutdown which tears the worker pool down.

If the NOTIFY_DONE happens to arrive after mr_thpool_wait already
returned, the dispose task is queued into a pool that no worker is
going to drain. The Execution is no longer reachable from the dict, so
valgrind correctly calls it "definitely lost".

A clean fix needs libmr to (a) stop accepting new libmr-event-loop
work, (b) join the event loop thread, and (c) only then drain the
worker pool -- non-trivial refactor of libmr's shutdown sequence.
That's the right long-term move and will be tracked as a follow-up.

For now, narrow this exact leak path in leakcheck.supp so the valgrind
exit code goes clean and unrelated regressions stay visible. The
suppression matches only:

  malloc -> ztrymalloc_usable_internal -> zmalloc_usable -> RM_Alloc
        -> MR_ExecutionAlloc -> MR_CreateExecution -> ...

which is the specific path the artifact captured; any other leak in
MR_CreateExecution callers or other shutdown races would still surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two corrections to the previous rewrite of this test:

1. Bugbot caught a real bug in the "no reconnect after MSG_MAX_RETRIES"
   check: assertTrue(False) was inside the same try/except Exception
   that wrapped the GetConnection probe. Under --exit-on-failure the
   raised TestAssertionFailure would be silently swallowed by the
   `except`, and the unexpected reconnect would go unreported. Pulled
   the assertion out of the try block via a got_unexpected_reconnect
   flag and an assertFalse outside.

2. Restored the per-cycle `env.assertTrue(conn.is_close())` assertion
   that the original test had: after sending -Err, libmr must close
   the connection on its side. The previous rewrite implicitly relied
   on GetConnection succeeding (a new conn implies libmr reconnected,
   which implies it disconnected first), but the explicit is_close()
   check is stronger and matches the existing assertion idiom on the
   other branches of test_network.py.

Also tightened the attempts-count range from [1, MSG_MAX_RETRIES] to
[MSG_MAX_RETRIES - 1, MSG_MAX_RETRIES]. Per the libmr source and the
captured redis log, libmr sends INNERCOMMUNICATION exactly 2 times
under TLS (initial send goes through the HELLO resend loop and burns
retry #1) and exactly 3 times in non-TLS (initial send is synchronous,
two further retries via the resend loop). Any other count would be a
regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit weakened the testSendRetriesMechanizm count assertion
from `attempts == 3` to `MSG_MAX_RETRIES - 1 <= attempts <= MSG_MAX_RETRIES`
without making explicit why -- a fair reviewer question is "how can the
test now pass without the implementation changing?". Answer: it can't
without weakening, because libmr really does send only 2 INNERCOMMs
under TLS while sending 3 under non-TLS.

Expand the inline comment so future readers see the full picture:
  - the exact mechanism that produces 3 sends in non-TLS,
  - the exact mechanism that produces only 2 in TLS (NETWORKTEST runs
    while status == NodeStatus_HelloSent, msg goes through the "not
    connected" path, first real send happens via the resend loop and
    burns retry #1),
  - that the underlying asymmetry is a latent off-by-one in
    MR_HelloResponseArrived's unconditional `++sentMsg->retries`,
  - and that we accept either count for now / tighten back once the
    libmr side is fixed.

No behaviour change; comment-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gabsow added a commit to RedisTimeSeries/RedisTimeSeries that referenced this pull request May 27, 2026
Bumps deps/LibMR from 44d5025 to c45dd23 (RedisGears/LibMR#96).

Notable fixes included in the bump:

  - MOD-15862: 6h hang on Linux SSL CI caused by testSendRetriesMechanizm
    hardcoding 3 INNERCOMMUNICATION sends while libmr only does 2 under
    TLS (slower HELLO handshake burns retry #1 via the resend loop).
    Test rewritten to tolerate either count with bounded TimeLimit
    waits, plus tighter assertions on the per-attempt close + the
    no-reconnect-after-give-up check.

  - MOD-15914: testUnevenWork timeouts on Redis 7.2 + oss-cluster
    shards-count 3 + TLS. The hot loop was calling env.getConnection(i)
    every iteration; on TLS each new connection ran
    ssl.create_default_context() inside the C ssl extension long enough
    that SIGALRM from the surrounding TimeLimit(2) could not fire on
    schedule. Now the per-shard Redis clients are opened once outside
    the loop so the pooled TLS connection is reused.

  - Defensive workflow timeouts (timeout-minutes: 45 job / 25 SSL step)
    on both Linux and macOS so no future flake can burn the full 6h
    job cap.

  - Linux SSL step: PYTHONUNBUFFERED=1 + --test-timeout 300 so any
    hung test names itself in the log instead of going silent.

  - actions/upload-artifact step on Linux on-failure to surface
    tests/mr_test_module/logs/ (vg.*.log etc.) for triage.

  - leakcheck.supp suppresses the pre-existing MR_CreateExecution
    "definitely lost" shutdown-race leak that valgrind was exit-code-99
    on; tracked separately as MOD-15933 for a proper shutdown ordering
    refactor in libmr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gabsow gabsow changed the title MOD-15914 Reuse per-shard clients in testUnevenWork to fix TLS SIGALRM starvation MOD-15862 / MOD-15914 Fix Linux SSL CI hang + testUnevenWork timeout May 27, 2026
@gabsow gabsow requested review from AvivDavid23 and TalBarYakar June 1, 2026 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant