fix: prevent WS-broadcast OOM crash under connection churn by skialpine · Pull Request #58 · decentespresso/openscale

skialpine · 2026-05-26T05:49:24Z

Independent of PR #57 (the scale-telemetry PR). They share two files (include/websocket.h, CLAUDE.md) but the changes don't overlap semantically — both can land in either order. This PR carries the bug fix alone, against main.

Root cause (from a captured + decoded panic backtrace)

Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an AsyncWebSocketMessage per client and operator new throws std::bad_alloc. Arduino-ESP32 builds with -fno-exceptions, so the throw goes straight to std::terminate() → abort() → reboot (reset_reason=panic). This is the "weight stops being collected under load" failure; it is not thermal (die temp was 33 °C).

Decoded stack:

operator new -> __cxa_throw -> std::terminate -> abort
AsyncWebSocketClient::_queueMessage   (AsyncWebSocket.cpp:490)
AsyncWebSocket::printfAll
sendWebsocketWeightAll                 (include/websocket.h, loop() 10 Hz broadcast)

The existing 15 KB heap watchdog (wifi_setup.cpp) can't prevent it: it has a 2 s debounce and defers the reboot by up to 60 s while BLE is connected, so the 10 Hz broadcast allocation hits std::bad_alloc long before the watchdog acts.
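A back-of-the-envelope sketch of why the watchdog loses that race, using only the figures quoted above (10 Hz broadcast, 2 s debounce, up to 60 s BLE deferral). The constant and function names are illustrative and not taken from the firmware, and adding the deferral on top of the debounce is an assumption about how the two windows combine.

```cpp
#include <cstdint>

// Figures quoted in this PR description; the names are illustrative only.
constexpr uint32_t kBroadcastHz       = 10;  // weight broadcast rate from loop()
constexpr uint32_t kWatchdogDebounceS = 2;   // heap watchdog debounce window
constexpr uint32_t kBleDeferralMaxS   = 60;  // max reboot deferral while BLE is connected

// Number of broadcast ticks that run inside a window of the given length.
constexpr uint32_t broadcastTicksIn(uint32_t seconds) {
    return kBroadcastHz * seconds;
}

// Ticks that elapse before the watchdog's debounce alone has expired.
constexpr uint32_t kTicksBeforeDebounce = broadcastTicksIn(kWatchdogDebounceS);

// Ticks that elapse if the reboot is also deferred for the full BLE window
// (assumes the deferral is added on top of the debounce).
constexpr uint32_t kTicksBeforeDeferredReboot =
    broadcastTicksIn(kWatchdogDebounceS + kBleDeferralMaxS);
```

Under these assumptions about 20 broadcast ticks run before the debounce expires, and up to 620 if the reboot is deferred for the full 60 s. Each tick allocates one message per connected client, so any one of them can fail first.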

Fix (38 lines added, 0 removed)

- wsBroadcastHeapOk() heap-floor gate on every broadcast-to-all helper (sendWebsocketWeightAll, sendWebsocketStatusAll, button, power-off): when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).
- -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.
- CLAUDE.md: documented the footgun (notes + troubleshooting table).
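The heap-floor gate can be sketched roughly as follows. This is a minimal host-runnable illustration, not the code in include/websocket.h: the real wsBroadcastHeapOk() presumably takes no arguments and queries the platform directly (on Arduino-ESP32, ESP.getFreeHeap() is the usual call), whereas here the free-heap query is injected so the logic can be exercised off-device. FreeHeapFn and broadcastIfHeapOk are hypothetical names, and the 25000-byte floor is taken from the log line quoted in the verification notes.

```cpp
#include <cstdint>
#include <functional>

// 25 KB floor from the PR; the log line shows it compared as 25000 bytes.
constexpr uint32_t WS_BROADCAST_HEAP_FLOOR = 25000;

// Hypothetical stand-in for the platform's free-heap query, injectable so the
// sketch runs on a host machine instead of an ESP32.
using FreeHeapFn = std::function<uint32_t()>;

// True when free heap is at or above the floor, i.e. a broadcast may allocate.
// (Assumed signature; the firmware's helper likely takes no parameters.)
inline bool wsBroadcastHeapOk(const FreeHeapFn& freeHeap) {
    return freeHeap() >= WS_BROADCAST_HEAP_FLOOR;
}

// Hypothetical broadcast wrapper: skips the frame rather than allocating into
// a nearly exhausted heap. Returns true if the frame was sent.
inline bool broadcastIfHeapOk(const FreeHeapFn& freeHeap,
                              const std::function<void()>& sendFrame) {
    if (!wsBroadcastHeapOk(freeHeap)) {
        return false;  // frame dropped; nothing is allocated
    }
    sendFrame();
    return true;
}
```

Checking before the send, rather than reacting to a failed allocation, matters here because a failed operator new cannot be caught in a build without exceptions.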

Verification (on hardware)

Re-ran the exact load that crashed the unpatched build — conn_churn --rst 8×8 + 10 Hz WS + mDNS, BT connected:

- Free heap driven to 6436 bytes (old build crashed at ~4684).
- Gate engaged: [ws] low heap 17736 < 25000 -> skip broadcast.
- No abort, no reboot, weight stream uninterrupted (uptime continuous through 3500+ churn cycles).

Separately, a 2-hour sustained soak (full multi-protocol load) on this fix: 0 stalls, 0 reboots, 0 lost frames, peak SoC 50.3 °C.

🤖 Generated with Claude Code

Root-caused from a captured panic backtrace: under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses and AsyncWebSocket's printfAll path allocates an AsyncWebSocketMessage per client -> operator new throws std::bad_alloc -> (Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot. That OOM-reboot is the "weight stops being collected under load" failure (not thermal -- die temp was 33 C).

Decoded stack:

operator new -> __cxa_throw -> std::terminate -> abort
AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490)
AsyncWebSocket::printfAll
sendWebsocketWeightAll (websocket.h) <- loop() 10 Hz broadcast

Fix:

- Heap-gate every broadcast-to-all helper (weight, status, button, power-off) with wsBroadcastHeapOk(): skip the frame when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog) instead of allocating into an exhausted heap and crashing. Dropping a frame is invisible; the next is <=500 ms away.
- Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32) so a backed-up/half-open client can't hoard heap.
- Document the footgun in CLAUDE.md (notes + troubleshooting table).

Verified on hardware: under the exact crashing load (conn_churn --rst 8x8 + 10 Hz WS + mDNS + BT) free heap was driven to 6436 bytes (old build died at 4684), the gate engaged ([ws] low heap 17736 < 25000 -> skip broadcast), and there was no abort, no reboot, weight stream continuous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rebased onto current main (post decentespresso#58 WS-OOM gate, post decentespresso#60 ADC library swap) and restructured from the original "embed telemetry in status" shape to deliver telemetry out-of-band, so apps that don't care about diagnostics (on-device UI, decentespresso app, third-party scale apps) aren't paying ~21% extra bytes on every status broadcast tick.

The collection bits (SoC die-temperature sampler, weight-stall watchdog, ADC recovery counter widened to volatile uint32_t, reset-reason capture at boot, StopWatch-tear-fix timer snapshot) are unchanged from review-round-2.

Wire-protocol delivery is reorganized into three frames:

session_info: one-shot, server→client on WS_EVT_CONNECT. Carries the fields immutable for the connection (protocol_version, firmware_version, reset_reason). Clients no longer have to ask for these.

debug events: broadcast on diagnostic-relevant change. Emitted by:

- sendWebsocketDebugStall(true) on stall onset
- sendWebsocketDebugStall(false) on stall resume
- sendWebsocketDebugAdcRecovery() on ADC power-cycle
- sendWebsocketDebugTempPeak() on new SoC max temp

Subscribers see the event the moment it happens; non-subscribers ignore the unknown type. No periodic broadcast: temp_peak fires at most once per warm-up curve, stall_*/adc_recovery only on real events.

debug reply: on-request snapshot per-client. Send {"command":"debug"} (also accepted: "diag") and the server replies with the full diagnostic set (current/peak temp, stall state + count + last, recovery count). Per-client, not a broadcast, so there is no heap cost for other clients.

All event broadcasts go through the existing wsBroadcastHeapOk() gate (PR decentespresso#58). Per-client sends don't need it (one allocation, not one-per-client).

Net effect:

- Status frame stays at its current 16 fields (≈310 B payload). Apps that don't care about diagnostics get back ~21% per status broadcast, which is meaningful under multi-client load where allocations stack.
- Diagnostic consumers (soak tools, debug dashboards, this PR's own thermal_load_test.sh) get *immediate* notification of stall / recovery events instead of waiting up to 5 s for the next status.
- session_info means clients no longer have to roundtrip a status request just to learn reset_reason after reconnect.

Also includes:

- Updated ADS1232 debug callback for the new library's field set (rebased from old API; the old dataMin/Max/Avg/StdDev / tareInProgress/tareTimes are gone in the upstream lib). Callback stays dormant by default (registered but setDebugEnabled(false) is the default in the lib).
- StopWatch read-tear fix: sendWebsocketStatus and StatusAll now read g_timerRunning/g_timerElapsed (snapshotted once per main-loop pass) instead of touching stopWatch directly from the AsyncTCP task.
- CLAUDE.md: "Fixing bugs you find along the way" guidance from review round 2.
- README.md: full documentation of session_info + debug frames.
- tools/thermal_load_test.sh: 1-hour multi-protocol soak runner.

Build verified clean on esp32s3 (RAM 17.2%, Flash 45.5%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
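The delivery rules described in that commit message can be summarised in a small sketch. The Delivery enum and both function names are hypothetical and exist only for illustration; the only details taken from the text are the three delivery shapes, the two accepted command strings ("debug" and "diag"), and the statement that broadcasts go through the heap gate while per-client sends do not.

```cpp
#include <cstring>

// The three delivery shapes named in the commit message (enum is illustrative).
enum class Delivery {
    OneShotOnConnect,   // session_info, sent to the connecting client only
    BroadcastOnChange,  // debug events, sent to every client
    ReplyToRequester    // debug reply, sent to the requesting client only
};

// Only broadcasts cost one allocation per connected client, so only they are
// routed through the wsBroadcastHeapOk() gate.
constexpr bool needsHeapGate(Delivery d) {
    return d == Delivery::BroadcastOnChange;
}

// Command values that request the per-client diagnostic snapshot.
inline bool isDebugCommand(const char* command) {
    return std::strcmp(command, "debug") == 0 ||
           std::strcmp(command, "diag") == 0;
}
```

Extracting the "command" value from the incoming JSON is left out; the firmware's actual parsing is not shown in this PR.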

tadelv merged commit adac24a into decentespresso:main May 26, 2026

This was referenced May 26, 2026

BLE: nimble_host LoadProhibited crash in BLEServer::removePeerDevice (race on m_connectedServersMap) #59 (Open)

fix: tighten WS heap defenses (cap=4, broadcast floor 32K) #61 (Open)

skialpine mentioned this pull request May 27, 2026

Scale telemetry: SoC temp, weight-stall watchdog, reset reason #57 (Open, 2 tasks)

tadelv merged 1 commit into decentespresso:main from skialpine:fix/ws-oom-only
