Skip to content

MOD-15862 Diagnostic: bound SSL CI runs with per-test + step timeouts#93

Closed
gabsow wants to merge 5 commits into
masterfrom
tom/mod-15862-diagnose-ssl-hang
Closed

MOD-15862 Diagnostic: bound SSL CI runs with per-test + step timeouts#93
gabsow wants to merge 5 commits into
masterfrom
tom/mod-15862-diagnose-ssl-hang

Conversation

@gabsow
Copy link
Copy Markdown
Contributor

@gabsow gabsow commented May 26, 2026

Summary

  • Linux CI runs were being cancelled at the 6h job cap. Root cause: commit 2857cfe (MOD-14850) added a single-shard --tls invocation to make test_ssl so the new testNoCrashOnTLSHandshakeFailure could run. That invocation also un-skipped the entire test_network.py ShardMock suite under TLS for the first time, and at least one of those tests is hanging on Linux.
  • The hang is invisible in the log because run_tests.sh passes RLTest --no-progress and the default --test-timeout 0 (no timeout), so the test name never makes it out before the job is killed.

This PR is the diagnostic step — not the fix:

  • Drop --no-progress from run_tests.sh and add --test-timeout 120, so any single test that hangs gets killed at 2 min and shows up by name.
  • Add timeout-minutes: 45 to the Linux job and timeout-minutes: 25 to the SSL step so a wedged run fails fast instead of burning 6h.

Once this run completes, the failing test name will be in the log and the next PR can fix the actual hang.

Test plan

  • Linux CI run finishes within 45 minutes (vs. previous 6h hang)
  • If a test still hangs, the offending test name is visible in the SSL step output before it's killed by --test-timeout
  • macOS CI continues to pass on unstable/7.4 (no behaviour change expected there)

🤖 Generated with Claude Code


Note

Low Risk
Changes are limited to CI configuration and test harness behavior; no production cluster or auth logic is modified.

Overview
Adds CI guardrails so wedged Linux SSL runs fail fast instead of hitting the 6h job cap: 45m job timeout on Linux/macOS, 25m on the SSL test step, and on Linux SSL PYTHONUNBUFFERED, plus RLTEST_EXTRA_ARGS (e.g. --test-timeout 300) passed through run_tests.sh.

Relaxes testSendRetriesMechanizm so it no longer assumes exactly three hard-coded error/reconnect cycles. It now loops with timeouts, allows 1–3 INNERCOMMUNICATION attempts (TLS can queue the first send in the retry path), and still asserts no reconnect after MSG_MAX_RETRIES.

Reviewed by Cursor Bugbot for commit 09446a2. Bugbot is set up for automated code reviews on this repo. Configure here.

… hang

Linux CI runs were getting cancelled at the 6h job cap because some test
under the new `--tls` (single-shard) invocation of `make test_ssl` blocks
indefinitely. RLTest was being invoked with `--no-progress` and the default
`--test-timeout 0`, so neither the test name nor a hang signal made it into
the log.

- Drop `--no-progress` from run_tests.sh and add `--test-timeout 120`, so
  any single test that hangs gets killed at 2 min and shows up by name.
- Add `timeout-minutes: 45` to the Linux job and `timeout-minutes: 25` to
  the SSL step so a wedged run fails fast instead of burning 6h.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tests/mr_test_module/pytests/run_tests.sh Outdated
gabsow and others added 2 commits May 26, 2026 11:08
The previous attempt set --test-timeout via run_tests.sh, which is shared
between Linux and macOS workflows, so macOS Default tests began hitting
the cap and failing even though they were passing in the past.

Move the diagnostics into env vars consumed by run_tests.sh only when
the Linux SSL step sets them:

- run_tests.sh: restore the original --no-progress, but interpolate an
  optional RLTEST_EXTRA_ARGS env var (empty by default — no behaviour
  change for any consumer that does not opt in).
- linux.yml SSL step: PYTHONUNBUFFERED=1 so test names flush in real
  time (CI stdout has no TTY, so otherwise prints were batching at end
  and the hung test name never made it out), plus
  RLTEST_EXTRA_ARGS="--test-timeout 180" so an individual hung test
  fails fast and is named in the log.

macOS workflow is untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the Linux SSL hang: testSendRetriesMechanizm hardcoded
three `-Err` exchanges, but under TLS only TWO INNERCOMMUNICATION
sends actually go out before libmr gives up.

Why the count differs by TLS:

  - Non-TLS: NETWORKTEST runs after the node reaches NodeStatus_Connected,
    so MR_ClusterSendMsgToNode sends INNERCOMMUNICATION synchronously
    (retries stays 0). The first `-Err` triggers a disconnect, after which
    MR_HelloResponseArrived re-sends from pendingMessages, incrementing
    retries to 1, 2, then 3. At retries==MSG_MAX_RETRIES (=3) libmr logs
    `Gave up of message...`. Total: 1 initial + 2 resends = 3 sends.

  - TLS: the TLS+AUTH+HELLO handshake takes longer; NETWORKTEST runs
    while the node status is still NodeStatus_HelloSent. The "message
    was not sent because status is not connected" path queues the msg
    in pendingMessages and the actual first send happens via the
    resend loop in MR_HelloResponseArrived -- which DOES increment
    retries. So the initial send burns retry #1, leaving only one
    further resend before give-up. Total: 2 sends.

The old test waited unboundedly for a fourth GetConnection on Linux
TLS and hung until the 6h job cap. The single-shard `--tls` invocation
that 2857cfe added is what surfaced this, since this test has
skipOnCluster=True and was previously only run in cluster mode under
TLS (where it was skipped entirely).

Rewrite the test to express the actual invariant -- libmr sends the
message between 1 and MSG_MAX_RETRIES times, then stops -- with
bounded read/connection waits so it cannot hang regardless of where
the retry-count boundary lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gabsow
Copy link
Copy Markdown
Contributor Author

gabsow commented May 26, 2026

Root cause found and fixed in latest commit (175346e)

The hung test is testSendRetriesMechanizm in test_network.py:432. It was hardcoded to send 3 -Err responses and read 3 MRTESTS.INNERCOMMUNICATION requests — but under TLS, libmr only sends two before giving up.

Why the count differs

Per src/cluster.c:445-451, MSG_MAX_RETRIES=3 and the resend loop increments retries before checking retries < 3:

  • Non-TLS: NETWORKTEST runs after NodeStatus_Connected, so MR_ClusterSendMsgToNode sends synchronously (retries=0). After each -Err + reconnect, MR_HelloResponseArrived re-sends and bumps retries to 1, 2, 3. Total 3 sends.
  • TLS: the TLS+AUTH+HELLO handshake is slower, so NETWORKTEST runs while status is NodeStatus_HelloSent. The msg hits the "message was not sent because status is not connected" path, gets queued, and the actual first send happens via the resend loop — which burns retry clang setup fixes #1. Only 2 sends total.

The redis log embedded in the timeout dump confirmed this: only two Received an invalid status reply log lines before Gave up of message because failed to send it for more than 3 time.

Why this surfaced now

Commit 2857cfe (MOD-14850) added a single-shard --tls invocation to test_ssl. testSendRetriesMechanizm has skipOnCluster=True and was previously only ever run in cluster mode under TLS — i.e., always skipped. Under the new single-shard --tls invocation it ran for the first time and immediately hung. macOS happens to pass it (probably faster TLS path → status reaches Connected before NETWORKTEST), so only Linux saw the 6h burn.

Fix

Rewrite the test to express the actual invariant — libmr sends the message between 1 and MSG_MAX_RETRIES times then stops — with bounded read/GetConnection timeouts so the test cannot hang regardless of where the count lands.

gabsow and others added 2 commits May 26, 2026 12:38
testSendRetriesMechanizm hang is now fixed, but the SSL run on 7.2 with
--env oss-cluster --shards-count 3 surfaced a separate pre-existing
flake: testUnevenWork can take >180s on that runner combination (it has
an internal TimeLimit(2) but the SIGALRM evidently isn't interrupting
the redis-py TLS handshake reliably in this configuration). On 7.4 and
unstable the same invocation completes in ~144s, so 180s was just too
tight.

Bump to 300s. This will be tracked separately -- the actual fix likely
belongs in test_basic.py / redis-py connection setup rather than the
workflow timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This PR's CI run hit a 6h cancel on macOS 7.4 SSL — a runner-side flake
(other macOS jobs in the same run passed in under 8 minutes). The macOS
workflow had no timeout-minutes set, so a single hung test could burn
the full job cap.

Mirror the Linux workflow: 45min job cap + 25min SSL step cap. macOS
keeps the default --test-timeout (no per-test cap) since it isn't
where the original MOD-15862 hang occurred and we don't want to risk
regressing the runs that currently pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 09446a2. Configure here.



"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command
"${PYTHON:-python}" -m RLTest --verbose-information-on-failure --no-progress ${RLTEST_EXTRA_ARGS:-} --randomize-ports --module $MODULE_PATH --clear-logs "$@" --oss_password "password" --enable-debug-command
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--no-progress not removed despite stated PR intent

Medium Severity

The PR description explicitly states "Drop --no-progress from run_tests.sh" as a key diagnostic measure, noting that "the hang is invisible in the log because run_tests.sh passes RLTest --no-progress…so the test name never makes it out before the job is killed." However, --no-progress is still present on the updated line. This defeats the PR's diagnostic goal of making hanging test names visible in CI output before --test-timeout kills them.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 09446a2. Configure here.

@gabsow gabsow closed this Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant