MOD-15862 Diagnostic: bound SSL CI runs with per-test + step timeouts by gabsow · Pull Request #93 · RedisGears/LibMR

gabsow · 2026-05-26T07:53:37Z

Summary

Linux CI runs were being cancelled at the 6h job cap. Root cause: commit 2857cfe (MOD-14850) added a single-shard --tls invocation to make test_ssl so the new testNoCrashOnTLSHandshakeFailure could run. That invocation also un-skipped the entire test_network.py ShardMock suite under TLS for the first time, and at least one of those tests is hanging on Linux.
The hang is invisible in the log because run_tests.sh passes RLTest --no-progress and the default --test-timeout 0 (no timeout), so the test name never makes it out before the job is killed.

This PR is the diagnostic step — not the fix:

Drop --no-progress from run_tests.sh and add --test-timeout 120, so any single test that hangs gets killed at 2 min and shows up by name.
Add timeout-minutes: 45 to the Linux job and timeout-minutes: 25 to the SSL step so a wedged run fails fast instead of burning 6h.

Once this run completes, the failing test name will be in the log and the next PR can fix the actual hang.

Test plan

Linux CI run finishes within 45 minutes (vs. previous 6h hang)
If a test still hangs, the offending test name is visible in the SSL step output before it's killed by --test-timeout
macOS CI continues to pass on unstable/7.4 (no behaviour change expected there)

🤖 Generated with Claude Code

Note

Low Risk
Changes are limited to CI configuration and test harness behavior; no production cluster or auth logic is modified.

Overview
Adds CI guardrails so wedged Linux SSL runs fail fast instead of hitting the 6h job cap: 45m job timeout on Linux/macOS, 25m on the SSL test step, and on Linux SSL PYTHONUNBUFFERED, plus RLTEST_EXTRA_ARGS (e.g. --test-timeout 300) passed through run_tests.sh.

Relaxes testSendRetriesMechanizm so it no longer assumes exactly three hard-coded error/reconnect cycles. It now loops with timeouts, allows 1–3 INNERCOMMUNICATION attempts (TLS can queue the first send in the retry path), and still asserts no reconnect after MSG_MAX_RETRIES.

^{Reviewed by Cursor Bugbot for commit 09446a2. Bugbot is set up for automated code reviews on this repo. Configure here.}

… hang Linux CI runs were getting cancelled at the 6h job cap because some test under the new `--tls` (single-shard) invocation of `make test_ssl` blocks indefinitely. RLTest was being invoked with `--no-progress` and the default `--test-timeout 0`, so neither the test name nor a hang signal made it into the log. - Drop `--no-progress` from run_tests.sh and add `--test-timeout 120`, so any single test that hangs gets killed at 2 min and shows up by name. - Add `timeout-minutes: 45` to the Linux job and `timeout-minutes: 25` to the SSL step so a wedged run fails fast instead of burning 6h. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous attempt set --test-timeout via run_tests.sh, which is shared between Linux and macOS workflows, so macOS Default tests began hitting the cap and failing even though they were passing in the past. Move the diagnostics into env vars consumed by run_tests.sh only when the Linux SSL step sets them: - run_tests.sh: restore the original --no-progress, but interpolate an optional RLTEST_EXTRA_ARGS env var (empty by default — no behaviour change for any consumer that does not opt in). - linux.yml SSL step: PYTHONUNBUFFERED=1 so test names flush in real time (CI stdout has no TTY, so otherwise prints were batching at end and the hung test name never made it out), plus RLTEST_EXTRA_ARGS="--test-timeout 180" so an individual hung test fails fast and is named in the log. macOS workflow is untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Root cause of the Linux SSL hang: testSendRetriesMechanizm hardcoded three `-Err` exchanges, but under TLS only TWO INNERCOMMUNICATION sends actually go out before libmr gives up. Why the count differs by TLS: - Non-TLS: NETWORKTEST runs after the node reaches NodeStatus_Connected, so MR_ClusterSendMsgToNode sends INNERCOMMUNICATION synchronously (retries stays 0). The first `-Err` triggers a disconnect, after which MR_HelloResponseArrived re-sends from pendingMessages, incrementing retries to 1, 2, then 3. At retries==MSG_MAX_RETRIES (=3) libmr logs `Gave up of message...`. Total: 1 initial + 2 resends = 3 sends. - TLS: the TLS+AUTH+HELLO handshake takes longer; NETWORKTEST runs while the node status is still NodeStatus_HelloSent. The "message was not sent because status is not connected" path queues the msg in pendingMessages and the actual first send happens via the resend loop in MR_HelloResponseArrived -- which DOES increment retries. So the initial send burns retry #1, leaving only one further resend before give-up. Total: 2 sends. The old test waited unboundedly for a fourth GetConnection on Linux TLS and hung until the 6h job cap. The single-shard `--tls` invocation that 2857cfe added is what surfaced this, since this test has skipOnCluster=True and was previously only run in cluster mode under TLS (where it was skipped entirely). Rewrite the test to express the actual invariant -- libmr sends the message between 1 and MSG_MAX_RETRIES times, then stops -- with bounded read/connection waits so it cannot hang regardless of where the retry-count boundary lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gabsow · 2026-05-26T08:46:56Z

Root cause found and fixed in latest commit (175346e)

The hung test is testSendRetriesMechanizm in test_network.py:432. It was hardcoded to send 3 -Err responses and read 3 MRTESTS.INNERCOMMUNICATION requests — but under TLS, libmr only sends two before giving up.

Why the count differs

Per src/cluster.c:445-451, MSG_MAX_RETRIES=3 and the resend loop increments retries before checking retries < 3:

Non-TLS: NETWORKTEST runs after NodeStatus_Connected, so MR_ClusterSendMsgToNode sends synchronously (retries=0). After each -Err + reconnect, MR_HelloResponseArrived re-sends and bumps retries to 1, 2, 3. Total 3 sends.
TLS: the TLS+AUTH+HELLO handshake is slower, so NETWORKTEST runs while status is NodeStatus_HelloSent. The msg hits the "message was not sent because status is not connected" path, gets queued, and the actual first send happens via the resend loop — which burns retry clang setup fixes #1. Only 2 sends total.

The redis log embedded in the timeout dump confirmed this: only two Received an invalid status reply log lines before Gave up of message because failed to send it for more than 3 time.

Why this surfaced now

Commit 2857cfe (MOD-14850) added a single-shard --tls invocation to test_ssl. testSendRetriesMechanizm has skipOnCluster=True and was previously only ever run in cluster mode under TLS — i.e., always skipped. Under the new single-shard --tls invocation it ran for the first time and immediately hung. macOS happens to pass it (probably faster TLS path → status reaches Connected before NETWORKTEST), so only Linux saw the 6h burn.

Fix

Rewrite the test to express the actual invariant — libmr sends the message between 1 and MSG_MAX_RETRIES times then stops — with bounded read/GetConnection timeouts so the test cannot hang regardless of where the count lands.

testSendRetriesMechanizm hang is now fixed, but the SSL run on 7.2 with --env oss-cluster --shards-count 3 surfaced a separate pre-existing flake: testUnevenWork can take >180s on that runner combination (it has an internal TimeLimit(2) but the SIGALRM evidently isn't interrupting the redis-py TLS handshake reliably in this configuration). On 7.4 and unstable the same invocation completes in ~144s, so 180s was just too tight. Bump to 300s. This will be tracked separately -- the actual fix likely belongs in test_basic.py / redis-py connection setup rather than the workflow timeout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This PR's CI run hit a 6h cancel on macOS 7.4 SSL — a runner-side flake (other macOS jobs in the same run passed in under 8 minutes). The macOS workflow had no timeout-minutes set, so a single hung test could burn the full job cap. Mirror the Linux workflow: 45min job cap + 25min SSL step cap. macOS keeps the default --test-timeout (no per-test cap) since it isn't where the original MOD-15862 hang occurred and we don't want to risk regressing the runs that currently pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 09446a2. Configure here.}

cursor · 2026-05-27T08:31:37Z



-"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command
+"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress ${RLTEST_EXTRA_ARGS:-} --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command


--no-progress not removed despite stated PR intent

Medium Severity

The PR description explicitly states "Drop --no-progress from run_tests.sh" as a key diagnostic measure, noting that "the hang is invisible in the log because run_tests.sh passes RLTest --no-progress…so the test name never makes it out before the job is killed." However, --no-progress is still present on the updated line. This defeats the PR's diagnostic goal of making hanging test names visible in CI output before --test-timeout kills them.

^{Reviewed by Cursor Bugbot for commit 09446a2. Configure here.}

cursor Bot reviewed May 26, 2026

View reviewed changes

Comment thread tests/mr_test_module/pytests/run_tests.sh Outdated

gabsow and others added 2 commits May 26, 2026 11:08

gabsow and others added 2 commits May 26, 2026 12:38

gabsow mentioned this pull request May 27, 2026

MOD-15862 / MOD-15914 Fix Linux SSL CI hang + testUnevenWork timeout #96

Open

3 tasks

cursor Bot reviewed May 27, 2026

View reviewed changes

gabsow closed this Jun 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MOD-15862 Diagnostic: bound SSL CI runs with per-test + step timeouts#93

MOD-15862 Diagnostic: bound SSL CI runs with per-test + step timeouts#93
gabsow wants to merge 5 commits into
masterfrom
tom/mod-15862-diagnose-ssl-hang

gabsow commented May 26, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

gabsow commented May 26, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant



		"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command
		"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress ${RLTEST_EXTRA_ARGS:-} --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command

Conversation

gabsow commented May 26, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Uh oh!

gabsow commented May 26, 2026

Root cause found and fixed in latest commit (175346e)

Why the count differs

Why this surfaced now

Fix

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 27, 2026

Choose a reason for hiding this comment

--no-progress not removed despite stated PR intent

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

gabsow commented May 26, 2026 •

edited by cursor Bot

Loading

`--no-progress` not removed despite stated PR intent