fix: prevent WS-broadcast OOM crash under connection churn #58

Merged
tadelv merged 1 commit into decentespresso:main from skialpine:fix/ws-oom-only
May 26, 2026

@skialpine
Contributor

Independent of PR #57 (the scale-telemetry PR). They share two files (include/websocket.h, CLAUDE.md) but the changes don't overlap semantically — both can land in either order. This PR carries the bug fix alone, against main.

Root cause (from a captured + decoded panic backtrace)

Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an AsyncWebSocketMessage per client and operator new throws std::bad_alloc; Arduino-ESP32 builds with -fno-exceptions, so the throw goes to std::terminate() → abort() → reboot (reset_reason=panic). This is the "weight stops being collected under load" failure, not a thermal one (die temp was 33 °C).

Decoded stack:

operator new -> __cxa_throw -> std::terminate -> abort
AsyncWebSocketClient::_queueMessage   (AsyncWebSocket.cpp:490)
AsyncWebSocket::printfAll
sendWebsocketWeightAll                 (include/websocket.h, loop() 10 Hz broadcast)

The existing 15 KB heap watchdog (wifi_setup.cpp) can't prevent it: it has a 2 s debounce and defers reboot up to 60 s while BLE is connected, so the 10 Hz broadcast allocation fails with std::bad_alloc long before the watchdog acts.
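
To put numbers on that timing gap, here is a small back-of-the-envelope sketch. It uses only the figures quoted above (10 Hz broadcast, 2 s debounce, 60 s BLE deferral); the constant and function names are made up for illustration and are not from the firmware.

```cpp
// Worked arithmetic: how many broadcast ticks fit inside each watchdog delay.
// All names here are hypothetical; only the three figures come from the PR text.
constexpr int kBroadcastHz = 10;      // weight broadcast rate
constexpr int kDebounceSeconds = 2;   // heap watchdog debounce
constexpr int kBleDeferSeconds = 60;  // max reboot deferral while BLE is connected

// Number of broadcast ticks that occur within the given number of seconds.
constexpr int broadcastsDuring(int seconds) {
    return kBroadcastHz * seconds;
}

// 20 ticks pass before the debounce alone can expire.
static_assert(broadcastsDuring(kDebounceSeconds) == 20, "debounce window");
// 600 ticks pass during the longest BLE deferral.
static_assert(broadcastsDuring(kBleDeferSeconds) == 600, "BLE deferral window");
```

Each of those ticks allocates one message per connected client, so the watchdog has no realistic chance of reacting before an allocation fails.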

Fix (38 lines added, 0 removed)

  • wsBroadcastHeapOk() heap-floor gate on every broadcast-to-all helper (sendWebsocketWeightAll, sendWebsocketStatusAll, button, power-off): when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).
  • -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.
  • CLAUDE.md: documented the footgun (notes + troubleshooting table).
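
For readers who have not opened the diff, the two mechanisms can be sketched roughly as follows. This is a host-side illustration under stated assumptions, not the merged code: the helper names (`heapAboveFloor`, `broadcastIfHeapOk`, `BoundedOutbox`) are hypothetical, free heap is passed in as a parameter so the sketch runs off-device (on the ESP32 it would presumably come from something like `ESP.getFreeHeap()`), the floor is taken as 25000 bytes because that is the value shown in the gate's log line, and dropping the newest frame at the cap is this sketch's choice, since the PR does not say what the library does when the queue is full.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Assumed values: 25000 matches the "[ws] low heap 17736 < 25000" log line,
// 8 matches the -D WS_MAX_QUEUED_MESSAGES=8 build flag.
constexpr std::uint32_t kHeapFloorBytes = 25000;
constexpr std::size_t kMaxQueuedMessages = 8;

// Hypothetical stand-in for the gate described above: true when it is
// considered safe to allocate broadcast frames.
inline bool heapAboveFloor(std::uint32_t freeHeapBytes) {
    return freeHeapBytes >= kHeapFloorBytes;
}

// Hypothetical broadcast wrapper: skips the send entirely (no allocation)
// when free heap is below the floor. Returns whether the frame was sent.
template <typename SendFn>
bool broadcastIfHeapOk(std::uint32_t freeHeapBytes, SendFn send) {
    if (!heapAboveFloor(freeHeapBytes)) {
        return false;  // frame dropped; the next one follows shortly
    }
    send();
    return true;
}

// Hypothetical per-client outbound queue with a hard cap, so one stalled
// client cannot accumulate an unbounded number of pending frames.
class BoundedOutbox {
public:
    // Returns false (and stores nothing) once the cap is reached.
    bool push(std::string frame) {
        if (queue_.size() >= kMaxQueuedMessages) {
            return false;
        }
        queue_.push_back(std::move(frame));
        return true;
    }

    std::size_t size() const { return queue_.size(); }

private:
    std::deque<std::string> queue_;
};
```

With the heap reading from the verification run, `broadcastIfHeapOk(17736, ...)` returns false without invoking the send callback, which is the behavior the log line describes.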

Verification (on hardware)

Re-ran the exact load that crashed the unpatched build — conn_churn --rst 8×8 + 10 Hz WS + mDNS, BT connected:

  • Free heap driven to 6436 bytes (old build crashed at ~4684).
  • Gate engaged: [ws] low heap 17736 < 25000 -> skip broadcast.
  • No abort, no reboot, weight stream uninterrupted (uptime continuous through 3500+ churn cycles).

Separately, a 2-hour sustained soak (full multi-protocol load) on this fix: 0 stalls, 0 reboots, 0 lost frames, peak SoC 50.3 °C.

🤖 Generated with Claude Code

Root-caused from a captured panic backtrace: under sustained multi-client
WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap
collapses and AsyncWebSocket's printfAll path allocates an
AsyncWebSocketMessage per client -> operator new throws std::bad_alloc ->
(Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot.
That OOM-reboot is the "weight stops being collected under load" failure
(not thermal -- die temp was 33 C). Decoded stack:

  operator new -> __cxa_throw -> std::terminate -> abort
  AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490)
  AsyncWebSocket::printfAll
  sendWebsocketWeightAll (websocket.h)  <- loop() 10 Hz broadcast

Fix:
- Heap-gate every broadcast-to-all helper (weight, status, button,
  power-off) with wsBroadcastHeapOk(): skip the frame when free heap is
  below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog)
  instead of allocating into an exhausted heap and crashing. Dropping a
  frame is invisible; the next is <=500 ms away.
- Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib
  default 32) so a backed-up/half-open client can't hoard heap.
- Document the footgun in CLAUDE.md (notes + troubleshooting table).

Verified on hardware: under the exact crashing load (conn_churn --rst 8x8
+ 10 Hz WS + mDNS + BT) free heap was driven to 6436 bytes (old build
died at 4684), the gate engaged ([ws] low heap 17736 < 25000 -> skip
broadcast), and there was no abort, no reboot, weight stream continuous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tadelv tadelv merged commit adac24a into decentespresso:main May 26, 2026
skialpine added a commit to skialpine/openscale that referenced this pull request May 27, 2026
Rebased onto current main (post decentespresso#58 WS-OOM gate, post decentespresso#60 ADC library
swap) and restructured from the original "embed telemetry in status"
shape to deliver telemetry out-of-band, so apps that don't care about
diagnostics (on-device UI, decentespresso app, third-party scale apps)
aren't paying ~21% extra bytes on every status broadcast tick.

The collection bits — SoC die-temperature sampler, weight-stall watchdog,
ADC recovery counter widened to volatile uint32_t, reset-reason capture
at boot, StopWatch-tear-fix timer snapshot — are unchanged from review-
round-2.

Wire-protocol delivery is reorganized into three frames:

  session_info  one-shot, server→client on WS_EVT_CONNECT. Carries the
                fields immutable for the connection (protocol_version,
                firmware_version, reset_reason). Clients no longer have
                to ask for these.

  debug events  broadcast on diagnostic-relevant change. Emitted by:
                  - sendWebsocketDebugStall(true)  on stall onset
                  - sendWebsocketDebugStall(false) on stall resume
                  - sendWebsocketDebugAdcRecovery() on ADC power-cycle
                  - sendWebsocketDebugTempPeak()    on new SoC max temp
                Subscribers see the event the moment it happens; non-
                subscribers ignore the unknown type. No periodic
                broadcast — temp_peak fires at most once per warm-up
                curve, stall_*/adc_recovery only on real events.

  debug reply   on-request snapshot per-client. Send {"command":"debug"}
                (also accepted: "diag") and the server replies with the
                full diagnostic set (current/peak temp, stall state +
                count + last, recovery count). Per-client, not a
                broadcast — no heap cost for other clients.
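
The "only on real events" behavior of the stall events above amounts to edge detection: a notification goes out when the stall state changes, not on every loop pass. A minimal sketch of that pattern, assuming a boolean stall flag sampled once per pass; `StallEdgeDetector` is a hypothetical name and the callback stands in for a call such as sendWebsocketDebugStall(bool):

```cpp
#include <functional>
#include <utility>

// Hypothetical edge detector: invokes the callback only when the stall
// state flips, so a long stall produces one onset and one resume event.
class StallEdgeDetector {
public:
    explicit StallEdgeDetector(std::function<void(bool)> onChange)
        : onChange_(std::move(onChange)) {}

    // Call once per loop pass with the current stall state.
    void update(bool stalledNow) {
        if (stalledNow == stalled_) {
            return;  // no transition, nothing to emit
        }
        stalled_ = stalledNow;
        onChange_(stalledNow);  // true = onset, false = resume
    }

private:
    bool stalled_ = false;  // assumed initial state: not stalled
    std::function<void(bool)> onChange_;
};
```

Feeding it the sequence false, true, true, true, false yields exactly two callbacks (onset, then resume), regardless of how many passes the stall lasted.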

All event broadcasts go through the existing wsBroadcastHeapOk() gate
(PR decentespresso#58). Per-client sends don't need it (one allocation, not one-per-
client).

Net effect:

  - Status frame stays at its current 16 fields (≈310 B payload). Apps
    that don't care about diagnostics get back ~21% per status broadcast
    — meaningful under multi-client load where allocations stack.
  - Diagnostic consumers (soak tools, debug dashboards, this PR's own
    thermal_load_test.sh) get *immediate* notification of stall /
    recovery events instead of waiting up to 5 s for the next status.
  - session_info means clients no longer have to roundtrip a status
    request just to learn reset_reason after reconnect.

Also includes:

  - Updated ADS1232 debug callback for the new library's field set
    (rebased from old API; the old dataMin/Max/Avg/StdDev /
    tareInProgress/tareTimes are gone in the upstream lib). Callback
    stays dormant by default (registered but setDebugEnabled(false) is
    the default in the lib).
  - StopWatch read-tear fix: sendWebsocketStatus and StatusAll now read
    g_timerRunning/g_timerElapsed (snapshotted once per main-loop pass)
    instead of touching stopWatch directly from the AsyncTCP task.
  - CLAUDE.md: "Fixing bugs you find along the way" guidance from
    review round 2.
  - README.md: full documentation of session_info + debug frames.
  - tools/thermal_load_test.sh: 1-hour multi-protocol soak runner.
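
The snapshot approach in the StopWatch item can be illustrated with a generic pattern: one task publishes plain values once per pass, and other tasks read only those values. The sketch below is an assumption-laden illustration, not the firmware's code: it uses `std::atomic` and hypothetical names (`TimerSnapshot`, `publishTimerSnapshot`, `readTimerSnapshot`), and it assumes the elapsed value is a 32-bit millisecond count, which the commit message does not state.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical snapshot written by the main loop, read by the network task.
struct TimerSnapshot {
    std::atomic<bool> running{false};
    std::atomic<std::uint32_t> elapsedMs{0};  // unit is an assumption
};

// Plain copy handed to the code that builds a status frame.
struct TimerView {
    bool running;
    std::uint32_t elapsedMs;
};

// Main loop: call once per pass with values taken from the stopwatch.
inline void publishTimerSnapshot(TimerSnapshot& snap, bool running,
                                 std::uint32_t elapsedMs) {
    snap.elapsedMs.store(elapsedMs, std::memory_order_relaxed);
    snap.running.store(running, std::memory_order_release);
}

// Network task: never touches the stopwatch, only the published values.
// Each field is read whole, but the pair is not jointly atomic: the two
// fields may come from adjacent passes.
inline TimerView readTimerSnapshot(const TimerSnapshot& snap) {
    const bool running = snap.running.load(std::memory_order_acquire);
    return TimerView{running, snap.elapsedMs.load(std::memory_order_relaxed)};
}
```

The point of the pattern is that the reader can no longer observe the stopwatch mid-update; if the two fields must always agree with each other, a lock or a single packed word would be needed instead.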

Build verified clean on esp32s3 (RAM 17.2%, Flash 45.5%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>