fix: tighten WS heap defenses (cap=4, broadcast floor 32K)#61
Merged
Conversation
Under sustained multi-protocol load (4+ WS clients + BLE + USB) the
default WS client cap of 8 lets per-client queue + AsyncWebSocketMessage
allocations overcommit heap to the point where lwIP TX buffers starve
silently — WiFi reports status=CONNECTED, disc/rec never increment, but
packets drop. Two small knobs together close the gap:
* cleanupClients(4) — halves worst-case connection overcommit.
Real workloads (on-device UI + PRs/Decenza + occasional second
client) cap out at 3–4; 8 was always headroom for edge cases that
don't happen on this hardware.
* WS_BROADCAST_HEAP_FLOOR raised 25000 → 32000. The original 25K
floor sat too close to the lwIP starvation knee: a 4-client status
burst dipped free heap to ~22K, below where pbufs can be allocated.
32K keeps the post-burst trough in safe territory.
Validated on hardware: 2h+ soak with 4 ws_drop_repro generators + BLE
(Decenza) + USB serial, 0% ping loss, 0 reconnects, 0 panics, ~300
broadcasts skipped by the existing wsBroadcastHeapOk gate. minheap
stable across the run (no leak signature).
Holding for an 8h soak before merge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 27, 2026
tadelv
pushed a commit
to skialpine/openscale
that referenced
this pull request
May 28, 2026
The /snapshot WebSocket was the only protocol pushing periodic state
snapshots. BT notifies only weight + on-event (power_off, buttons); USB
binary follows the same shape. WiFi did a periodic status frame every
5 s when events_enabled, which:
- allocated ~310 B x N clients every 5 s for state the on-device web
apps never consumed (verified: zero references to battery_percent,
timer_seconds, charging, low_power, etc. in web_apps/)
- was the dominant heap-pressure source in the 8 h soak that
motivated PR decentespresso#61
- was inconsistent with BT, surprising clients that interoperate
across both transports
After this change, clients request status on demand via
{"command":"status"} -- same model as BT's 0x22 voltage request, same
per-client reply path that already exists (sendWebsocketStatus, kept
intact). The events_enabled flag now gates only button events and
power_off broadcasts -- the cross-protocol event types that BT also
notifies, just without an opt-in over there. Weight, button, and
power_off push behavior on WiFi is unchanged.
Cleanup follows:
- sendWebsocketStatusAll() removed (no callers; was the periodic
broadcast helper)
- t_lastWebsocketStatusUpdate global removed (no periodic tick to
track)
- WEBSOCKET_STATUS_NOTIFY_INTERVAL_MS constant removed
- WS event handler no longer initialises/resets the period timer
- README updated: status frame is on-request only; the "events on"
paragraph no longer mentions periodic status; cross-references the
BT pattern explicitly
Verified on hardware:
20 s window, events on, rate 10k -> 201 weight frames (10.05 Hz)
+ 0 status frames. On-request {"command":"status"} still answers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tadelv
pushed a commit
that referenced
this pull request
May 28, 2026
…#62) * fix(ws): remove periodic status broadcast; align push pattern with BT The /snapshot WebSocket was the only protocol pushing periodic state snapshots. BT notifies only weight + on-event (power_off, buttons); USB binary follows the same shape. WiFi did a periodic status frame every 5 s when events_enabled, which: - allocated ~310 B x N clients every 5 s for state the on-device web apps never consumed (verified: zero references to battery_percent, timer_seconds, charging, low_power, etc. in web_apps/) - was the dominant heap-pressure source in the 8 h soak that motivated PR #61 - was inconsistent with BT, surprising clients that interoperate across both transports After this change, clients request status on demand via {"command":"status"} -- same model as BT's 0x22 voltage request, same per-client reply path that already exists (sendWebsocketStatus, kept intact). The events_enabled flag now gates only button events and power_off broadcasts -- the cross-protocol event types that BT also notifies, just without an opt-in over there. Weight, button, and power_off push behavior on WiFi is unchanged. Cleanup follows: - sendWebsocketStatusAll() removed (no callers; was the periodic broadcast helper) - t_lastWebsocketStatusUpdate global removed (no periodic tick to track) - WEBSOCKET_STATUS_NOTIFY_INTERVAL_MS constant removed - WS event handler no longer initialises/resets the period timer - README updated: status frame is on-request only; the "events on" paragraph no longer mentions periodic status; cross-references the BT pattern explicitly Verified on hardware: 20 s window, events on, rate 10k -> 201 weight frames (10.05 Hz) + 0 status frames. On-request {"command":"status"} still answers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: address review findings on periodic-status removal Three findings from the code review on cb569d3: 1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame is invisible (the next weight frame is <=500 ms away, status <=5 s); crashing is not." The "status <=5 s" recovery guarantee no longer holds after this PR removes the only periodic status broadcaster. Rewrote to describe what's actually still gated by wsBroadcastHeapOk (weight + button + power_off). 2. include/websocket.h:243-247 had an addendum paragraph framing the surviving printfAll comment as "Note: status is no longer broadcast periodically...". The historical framing reads as documentation for a function that no longer exists; dropping it leaves the printfAll technical commentary applying cleanly to sendWebsocketWeightAll (which is what it now sits above). 3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch directly from the AsyncTCP task. CLAUDE.md threading model #1 footgun: "stopWatch.* -- No -- multi-field (running flag + start ts + accumulator) and also mutated from loop(), BLE and USB; a status- frame read can tear a write across tasks." The bug is pre-existing on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you find along the way" applies. Mirroring the snapshot pattern used by PR #57: new g_timerRunning/g_timerElapsed volatiles in parameter.h, refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping block, read by sendWebsocketStatus. (PR #57 adds the same snapshot independently; merge order between #57 and #62 will produce a trivial same-lines conflict that git auto-resolves.) Verified on hardware: flash + 12s window with events on + rate 10k -> 121 weight frames at 10 Hz, 0 periodic status frames; on-request {"command":"status"} returns the full 16-field frame with timer_running/timer_seconds populated from the snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two small knobs that together close the silent-WiFi-drop failure mode observed during multi-protocol soaks on top of #56's WS-OOM defense.
websocket.cleanupClients(4)— halve the WS client cap from the lib default (8) to 4WS_BROADCAST_HEAP_FLOORraised 25 KB → 32 KBWhy
Under sustained multi-protocol load (4+ WS clients + BLE + USB), per-client queue +
AsyncWebSocketMessageallocations overcommit heap until lwIP can't allocate pbufs. Symptom is nasty:wifi_statusstaysCONNECTED,disc/recnever increment, but packets drop silently. The on-device watchdog never trips because nothing has crashed.The existing
wsBroadcastHeapOk()gate (PR #58) prevents thebad_alloc → abortreboot, but at the original 25 KB floor a 4-client status burst still dipped free heap to ~22 KB — below the lwIP starvation knee. 32 KB keeps the post-burst trough in safe territory.cap=8 was always overhead for edge cases. Real workloads (on-device UI + PRs/Decenza app + an occasional second viewer) cap out at 3–4 concurrent clients on this hardware.
Validation
2h+ soak on hardware (skialpine/test/adc-lib+telemetry worktree with these edits applied):
disc=0 rec=0for the full run)bad_alloc, 0 AsyncTCP stallsHolding for an 8h soak before merge.
What this doesn't fix
This is a brake, not a root-cause fix. The underlying pressure — large
statusAllbroadcasts × 4 clients allocating ~2.3 KB per burst — is still there; we're just guaranteeing the gate catches it. Follow-up will split immutable fields (firmware_version,protocol_version,reset_reason) and quasi-immutable fields (stall_count,last_stall_*,adc_recovery_count,soc_temp_max_c) out of the hotstatusAllpath into a one-shotsession_infoframe + on-change deltas. Targets ~37% payload reduction on the status broadcast.Test plan
pio run -e esp32s3)[ws] low heap … skip broadcaststill fires correctly under load🤖 Generated with Claude Code