fix: prevent WS-broadcast OOM crash under connection churn #58
Merged
Root-caused from a captured panic backtrace: under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses and AsyncWebSocket's printfAll path allocates an AsyncWebSocketMessage per client -> operator new throws std::bad_alloc -> (Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot. That OOM-reboot is the "weight stops being collected under load" failure (not thermal -- die temp was 33 C).

Decoded stack:

  operator new -> __cxa_throw -> std::terminate -> abort
  AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490)
  AsyncWebSocket::printfAll
  sendWebsocketWeightAll (websocket.h)   <- loop() 10 Hz broadcast

Fix:
- Heap-gate every broadcast-to-all helper (weight, status, button, power-off) with wsBroadcastHeapOk(): skip the frame when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog) instead of allocating into an exhausted heap and crashing. Dropping a frame is invisible; the next is <=500 ms away.
- Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32) so a backed-up/half-open client can't hoard heap.
- Document the footgun in CLAUDE.md (notes + troubleshooting table).

Verified on hardware: under the exact crashing load (conn_churn --rst 8x8 + 10 Hz WS + mDNS + BT) free heap was driven to 6436 bytes (old build died at 4684), the gate engaged ([ws] low heap 17736 < 25000 -> skip broadcast), and there was no abort, no reboot, weight stream continuous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
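The heap-floor gate described in the commit message can be sketched in a host-compilable form roughly as follows. This is a minimal illustration, not the code from websocket.h: the names broadcastHeapOk, FreeHeapFn, and GatedBroadcaster are hypothetical, and the free-heap query is injected as a callable so the logic can be exercised off-device (on the firmware it would presumably wrap the platform's free-heap call). Only the 25000-byte floor and the skip-instead-of-allocate behavior come from the commit message.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Floor taken from the commit message (25 KB, logged as 25000).
constexpr std::uint32_t WS_BROADCAST_HEAP_FLOOR = 25000;

// Hypothetical stand-in for the platform's free-heap query.
using FreeHeapFn = std::function<std::uint32_t()>;

// True when free heap is at or above the floor, i.e. a per-client
// allocation is considered safe. Below the floor the frame is skipped.
inline bool broadcastHeapOk(const FreeHeapFn& freeHeap,
                            std::uint32_t floor = WS_BROADCAST_HEAP_FLOOR) {
    return freeHeap() >= floor;
}

// Hypothetical wrapper: every broadcast-to-all helper goes through
// broadcast(), which refuses to call `send` when heap is low.
struct GatedBroadcaster {
    FreeHeapFn freeHeap;
    std::uint32_t sent = 0;     // frames actually handed to `send`
    std::uint32_t skipped = 0;  // frames dropped by the gate

    template <typename SendFn>
    bool broadcast(SendFn&& send) {
        if (!broadcastHeapOk(freeHeap)) {
            ++skipped;          // drop the frame instead of allocating
            return false;
        }
        std::forward<SendFn>(send)();
        ++sent;
        return true;
    }
};
```

Keeping the check in one helper means the weight, status, button, and power-off broadcasts all share the same floor, and a skipped frame costs only a counter increment.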
This was referenced May 26, 2026
skialpine added a commit to skialpine/openscale that referenced this pull request on May 27, 2026
Rebased onto current main (post decentespresso#58 WS-OOM gate, post decentespresso#60 ADC library swap) and restructured from the original "embed telemetry in status" shape to deliver telemetry out-of-band, so apps that don't care about diagnostics (on-device UI, decentespresso app, third-party scale apps) aren't paying ~21% extra bytes on every status broadcast tick.

The collection bits (SoC die-temperature sampler, weight-stall watchdog, ADC recovery counter widened to volatile uint32_t, reset-reason capture at boot, StopWatch-tear-fix timer snapshot) are unchanged from review-round-2.

Wire-protocol delivery is reorganized into three frames:

session_info
  One-shot, server→client on WS_EVT_CONNECT. Carries the fields immutable for the connection (protocol_version, firmware_version, reset_reason). Clients no longer have to ask for these.

debug events
  Broadcast on diagnostic-relevant change. Emitted by:
  - sendWebsocketDebugStall(true) on stall onset
  - sendWebsocketDebugStall(false) on stall resume
  - sendWebsocketDebugAdcRecovery() on ADC power-cycle
  - sendWebsocketDebugTempPeak() on new SoC max temp
  Subscribers see the event the moment it happens; non-subscribers ignore the unknown type. No periodic broadcast: temp_peak fires at most once per warm-up curve, stall_*/adc_recovery only on real events.

debug reply
  On-request snapshot per-client. Send {"command":"debug"} (also accepted: "diag") and the server replies with the full diagnostic set (current/peak temp, stall state + count + last, recovery count). Per-client, not a broadcast, so no heap cost for other clients.

All event broadcasts go through the existing wsBroadcastHeapOk() gate (PR decentespresso#58). Per-client sends don't need it (one allocation, not one-per-client).

Net effect:
- Status frame stays at its current 16 fields (≈310 B payload). Apps that don't care about diagnostics get back ~21% per status broadcast, which is meaningful under multi-client load where allocations stack.
- Diagnostic consumers (soak tools, debug dashboards, this PR's own thermal_load_test.sh) get *immediate* notification of stall / recovery events instead of waiting up to 5 s for the next status.
- session_info means clients no longer have to roundtrip a status request just to learn reset_reason after reconnect.

Also includes:
- Updated ADS1232 debug callback for the new library's field set (rebased from old API; the old dataMin/Max/Avg/StdDev / tareInProgress/tareTimes are gone in the upstream lib). Callback stays dormant by default (registered but setDebugEnabled(false) is the default in the lib).
- StopWatch read-tear fix: sendWebsocketStatus and StatusAll now read g_timerRunning/g_timerElapsed (snapshotted once per main-loop pass) instead of touching stopWatch directly from the AsyncTCP task.
- CLAUDE.md: "Fixing bugs you find along the way" guidance from review round 2.
- README.md: full documentation of session_info + debug frames.
- tools/thermal_load_test.sh: 1-hour multi-protocol soak runner.

Build verified clean on esp32s3 (RAM 17.2%, Flash 45.5%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
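The "broadcast on change, never periodically" behavior of the stall events above can be illustrated with a small edge detector. This is only a sketch under assumptions: StallEdgeDetector and its members are hypothetical names, and the callback merely stands in for a sender shaped like sendWebsocketDebugStall(bool); the actual firmware logic may be organized differently.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical edge detector: invokes `emit` only when the stall state
// changes, so an unchanged state produces no traffic at all.
struct StallEdgeDetector {
    // Stand-in for a sender shaped like sendWebsocketDebugStall(bool).
    std::function<void(bool)> emit;
    bool stalled = false;          // last observed state
    std::uint32_t stallCount = 0;  // number of stall onsets seen

    // Call once per main-loop pass with the current stall verdict.
    void update(bool nowStalled) {
        if (nowStalled == stalled) {
            return;                // no change, no event
        }
        stalled = nowStalled;
        if (nowStalled) {
            ++stallCount;          // count onsets only, not resumes
        }
        if (emit) {
            emit(nowStalled);      // true = onset, false = resume
        }
    }
};
```

With this shape, a stall that lasts many loop passes yields exactly two events (onset and resume), which is what keeps the diagnostic channel quiet in normal operation.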
Independent of PR #57 (the scale-telemetry PR). They share two files (include/websocket.h, CLAUDE.md) but the changes don't overlap semantically, so both can land in either order. This PR carries the bug fix alone, against main.

Root cause (from a captured + decoded panic backtrace)

Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an AsyncWebSocketMessage per client and operator new throws std::bad_alloc; Arduino-ESP32 builds with -fno-exceptions, so the throw goes to std::terminate() → abort() → reboot (reset_reason=panic). This is the "weight stops being collected under load" failure, not a thermal one (die temp was 33 °C).

Decoded stack:

  operator new -> __cxa_throw -> std::terminate -> abort
  AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490)
  AsyncWebSocket::printfAll
  sendWebsocketWeightAll (websocket.h)   <- loop() 10 Hz broadcast

The existing 15 KB heap watchdog (wifi_setup.cpp) can't prevent it: it has a 2 s debounce and defers reboot up to 60 s while BLE is connected, so the 10 Hz allocation hits bad_alloc long before the watchdog acts.

Fix (38 lines added, 0 removed)

- wsBroadcastHeapOk() heap-floor gate on every broadcast-to-all helper (sendWebsocketWeightAll, sendWebsocketStatusAll, button, power-off): when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).
- -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.

Verification (on hardware)

Re-ran the exact load that crashed the unpatched build (conn_churn --rst 8×8 + 10 Hz WS + mDNS, BT connected):

- The gate engaged: [ws] low heap 17736 < 25000 -> skip broadcast.
- No abort, no reboot, weight stream uninterrupted (uptime continuous through 3500+ churn cycles).

Separately, a 2-hour sustained soak (full multi-protocol load) on this fix: 0 stalls, 0 reboots, 0 lost frames, peak SoC 50.3 °C.
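To make the effect of the per-client queue cap from the Fix section concrete, here is a minimal sketch of a bounded outbound queue. It is an illustration only: BoundedOutbox is a hypothetical class, not the library's implementation, and the drop-newest policy on overflow is an assumption (the library's actual behavior when its limit is reached may differ). The capacity of 8 mirrors the WS_MAX_QUEUED_MESSAGES=8 build flag.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical per-client outbox with a hard cap on queued messages.
// Once the cap is hit, new messages are rejected, so a client that has
// stopped draining its queue cannot hold more than MaxQueued payloads.
template <std::size_t MaxQueued>
class BoundedOutbox {
public:
    // Returns false and drops `msg` when the queue is already full
    // (drop-newest is an assumption made for this sketch).
    bool enqueue(std::string msg) {
        if (queue_.size() >= MaxQueued) {
            ++dropped_;
            return false;
        }
        queue_.push_back(std::move(msg));
        return true;
    }

    // Moves the oldest message into `out`; returns false when empty.
    bool popFront(std::string& out) {
        if (queue_.empty()) {
            return false;
        }
        out = std::move(queue_.front());
        queue_.pop_front();
        return true;
    }

    std::size_t size() const { return queue_.size(); }
    std::size_t dropped() const { return dropped_; }

private:
    std::deque<std::string> queue_;
    std::size_t dropped_ = 0;
};
```

With a cap of 8 instead of 32, the worst-case heap held by one stalled client shrinks to a quarter, which is why the cap complements the heap-floor gate rather than replacing it.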
🤖 Generated with Claude Code