Skip to content

Retry transient request failures with a short per-attempt timeout#164

Merged
kellerza merged 1 commit into
kellerza:mainfrom
jason-curtis:retry-on-timeout
Jun 30, 2026
Merged

Retry transient request failures with a short per-attempt timeout#164
kellerza merged 1 commit into
kellerza:mainfrom
jason-curtis:retry-on-timeout

Conversation

@jason-curtis

@jason-curtis jason-curtis commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Problem

_request_json retries only ServerDisconnectedError. An asyncio.TimeoutError or ClientError (the usual symptom of packet loss) raises SmaConnectionException immediately with no retry, on a single request with a 15s timeout. On a lossy network a single long attempt often fails outright, so Home Assistant marks the whole device unavailable on one missed poll and the entities flap (cycling available and unavailable every minute or two), even though most polls would succeed on a quick retry.

Change

Retry all transient connection failures (timeouts, disconnects, generic client errors) with a fresh connection, not just ServerDisconnectedError. DEFAULT_TIMEOUT becomes a per-attempt timeout (5s) with DEFAULT_REQUEST_RETRIES (3) attempts, so the total request time stays about 15s, now spread over 3 independent attempts instead of 1. A fresh short attempt resets TCP backoff and samples the link several times, which recovers better than one long attempt whose stuck connection keeps retransmitting.

Evidence

Measured against a real SMA inverter on a WiFi link with 64 to 86% packet loss, 18 poll cycles on the identical link:

Strategy Per-poll success
current (1 x 15s) 44%
3 x 6s retry 72%
6 x 2.5s retry 72%

Retry roughly halves the flapping. The residual failures are multi-second dead bursts longer than any timeout budget, which the client cannot fix. A separate live test in Home Assistant showed about 45% fewer dropouts at the same poll interval.

Notes

The exact values (5s and 3 attempts) keep the original 15s worst-case budget and are easy to tune. DEFAULT_TIMEOUT changes meaning from a single request timeout to a per-attempt timeout; I can add a separate constant instead if you prefer to keep its semantics. The previous "Server at ... disconnected N times." message is folded into the existing "Could not connect to SMA at ...: ..." message. Existing tests pass unchanged (25 passed) and ruff format and ruff check are clean.

@codecov

codecov Bot commented Jun 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.87%. Comparing base (63058ff) to head (2d1c389).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #164      +/-   ##
==========================================
- Coverage   99.87%   99.87%   -0.01%     
==========================================
  Files          12       12              
  Lines         792      791       -1     
==========================================
- Hits          791      790       -1     
  Misses          1        1              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/pysma/sma_webconnect.py Outdated
@jason-curtis jason-curtis marked this pull request as ready for review June 29, 2026 16:32
_request_json retried only ServerDisconnectedError; asyncio.TimeoutError
and ClientError raised SmaConnectionException immediately on a single 15s
request. Retry all transient connection failures with a fresh connection,
and make DEFAULT_TIMEOUT per-attempt (5s) with DEFAULT_REQUEST_RETRIES (3),
keeping total request time about 15s across 3 attempts.

Measured on a real inverter on a 64-86% packet-loss link, per-poll success
rose from 44% to 72%.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kellerza kellerza merged commit be86cb3 into kellerza:main Jun 30, 2026
8 checks passed
@jason-curtis jason-curtis deleted the retry-on-timeout branch June 30, 2026 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants