fix: prevent WS-broadcast OOM crash under connection churn#1
Open
skialpine wants to merge 8 commits into
Open
Conversation
Diagnostics for the field "weight stops being collected under sustained load" reports (suspected thermal). Adds to the WS status frame and serial logs: - soc_temp_c / soc_temp_max_c: live + peak ESP32-S3 die temperature (temperatureRead()), sampled every 2s on the main loop. - weight_stalled + stall_count + last_stall_ms + last_stall_temp_c: a watchdog in pureScale() that flags when the ADS1232 raw value is frozen/railed for >8s (a live cell dithers every sample), recording the die temp at the moment of the stall to correlate failures with heat. (This failure is not firmware- recoverable, so it's surfaced, not silently retried.) - reset_reason: esp_reset_reason() captured at boot, so a brownout/panic/WDT reset is attributable instead of looking like a clean power-on. Telemetry-only; no behavior change to the weight/WiFi/BLE paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nitor) Drives USB 10Hz + WS 10Hz + HTTP/WS churn + mDNS (BT driven externally) and polls the new temp/stall telemetry every ~60s, watching for the weight-stall failure and the die temp at which it occurs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Review follow-ups on the telemetry watchdog: - Skip the stall check while b_adc_recovery_active (the ADS1232 power-cycle freezes the raw value by design); re-seed the window on resume so a genuine signal-timeout recovery isn't miscounted as a railed/frozen stall. - Check every 250 ms instead of every loop iteration -- the ADC only produces ~10 samples/s, so polling getDebugInfo() at full loop rate (with its sqrt + dataset passes) just burns CPU/heat, which is counterproductive on the chip we're trying to characterize. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
From the toolkit review of PR decentespresso#57: - temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON and a frozen peak since NaN compares false); keep last valid + log once. - g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the AsyncTCP status path); status frame casts it for printf. - Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop keeps re-seeding the stall window so weight_stalled may never trip -- the climbing recovery count makes that failure mode visible. i_adc_recovery_count is now volatile (newly read cross-task). - reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons (CPU_LOCKUP/USB/JTAG) stay attributable. - Comment fixes: volatile cross-task rationale; stall-window re-seed wording + recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- DURATION/IP/HOST now positional args; warn (don't silently skip) if no USB port. - Telemetry monitor logs reset_reason + adc_recovery_count per line, waits a full status interval after (re)connect, tracks peak temp / stalls / recoveries / reboots across the whole run (so a firmware reset doesn't lose the peak), and prints a SUMMARY line with a PASS/FAIL verdict. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses the iteration-2 review findings on PR decentespresso#57: - Status frame no longer reads the multi-field stopWatch object directly off the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The loop task now snapshots it into aligned volatiles (g_timerRunning/ g_timerElapsed) that both status frames read. - Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps counting truthfully over a long soak instead of saturating; update the WS format specifier %u -> %lu accordingly. - SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON. - Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed sentinel) at boot/rollover. - README: document the new status-frame telemetry fields. - thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status frames or a crashed load generator, and exit non-zero on FAIL so it works as a CI gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- thermal_load_test.sh: close the silent-PASS holes a review found. A flapping wedge (scale answers one frame per reconnect, resetting the consecutive-miss streak) now fails via a cumulative total_no_status counter, not just max_no_status_streak. Each load generator's PID is captured and waited on individually so a never-started/crashed driver (non-zero exit) fails the run instead of being missed by a Traceback-only grep. A run that never saw soc_temp_max_c (peak stuck at the -999 sentinel) also fails, since the thermal data the test exists to capture is absent. - CLAUDE.md: add "Fixing bugs you find along the way" — pre-existing bugs get fixed in the same change, not deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root-caused from a captured panic backtrace: under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses and AsyncWebSocket's printfAll path allocates an AsyncWebSocketMessage per client -> operator new throws std::bad_alloc -> (Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot. That OOM-reboot is the "weight stops being collected under load" failure (not thermal -- die temp was 33 C). Decoded stack: operator new -> __cxa_throw -> std::terminate -> abort AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490) AsyncWebSocket::printfAll sendWebsocketWeightAll (websocket.h) <- loop() 10 Hz broadcast Fix: - Heap-gate every broadcast-to-all helper (weight, status, button, power-off) with wsBroadcastHeapOk(): skip the frame when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog) instead of allocating into an exhausted heap and crashing. Dropping a frame is invisible; the next is <=500 ms away. - Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32) so a backed-up/half-open client can't hoard heap. - Document the footgun in CLAUDE.md (notes + troubleshooting table). Stacked on the scale-telemetry branch (PR decentespresso#57) whose reset_reason / heap telemetry made this diagnosable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f3adb84 to
62a09f9
Compare
skialpine
added a commit
that referenced
this pull request
May 27, 2026
Three findings from the code review on cb569d3: 1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame is invisible (the next weight frame is <=500 ms away, status <=5 s); crashing is not." The "status <=5 s" recovery guarantee no longer holds after this PR removes the only periodic status broadcaster. Rewrote to describe what's actually still gated by wsBroadcastHeapOk (weight + button + power_off). 2. include/websocket.h:243-247 had an addendum paragraph framing the surviving printfAll comment as "Note: status is no longer broadcast periodically...". The historical framing reads as documentation for a function that no longer exists; dropping it leaves the printfAll technical commentary applying cleanly to sendWebsocketWeightAll (which is what it now sits above). 3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch directly from the AsyncTCP task. CLAUDE.md threading model #1 footgun: "stopWatch.* -- No -- multi-field (running flag + start ts + accumulator) and also mutated from loop(), BLE and USB; a status- frame read can tear a write across tasks." The bug is pre-existing on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you find along the way" applies. Mirroring the snapshot pattern used by PR decentespresso#57: new g_timerRunning/g_timerElapsed volatiles in parameter.h, refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping block, read by sendWebsocketStatus. (PR decentespresso#57 adds the same snapshot independently; merge order between decentespresso#57 and decentespresso#62 will produce a trivial same-lines conflict that git auto-resolves.) Verified on hardware: flash + 12s window with events on + rate 10k -> 121 weight frames at 10 Hz, 0 periodic status frames; on-request {"command":"status"} returns the full 16-field frame with timer_running/timer_seconds populated from the snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacked on
feat/scale-telemetry(PR decentespresso#57). Base will be retargeted tomainonce decentespresso#57 merges. The telemetry in decentespresso#57 (reset_reason, heap) is what made this diagnosable.Root cause (from a captured + decoded panic backtrace)
Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an
AsyncWebSocketMessageper client andoperator newthrowsstd::bad_alloc; Arduino-ESP32 builds-fno-exceptions, so the throw goes tostd::terminate()→abort()→ reboot (reset_reason=panic). This is the "weight stops being collected under load" failure — not thermal (die temp was 33 °C).Decoded stack:
The existing 15 KB heap watchdog can't prevent it: it has a 2 s debounce and defers reboot up to 60 s while BLE is connected, so the 10 Hz allocation
bad_allocs long before it acts.Fix
wsBroadcastHeapOk()heap-floor gate on every broadcast-to-all helper (sendWebsocketWeightAll,sendWebsocketStatusAll, button, power-off): when free heap is belowWS_BROADCAST_HEAP_FLOOR(25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).-D WS_MAX_QUEUED_MESSAGES=8(lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.Verification (on hardware)
Re-ran the exact load that crashed the unpatched build —
conn_churn --rst8×8 + 10 Hz WS + mDNS, BT connected:[ws] low heap 17736 < 25000 -> skip broadcast.abort, no reboot, weight stream uninterrupted (uptime continuous).🤖 Generated with Claude Code