Scale telemetry: SoC temp, weight-stall watchdog, reset reason by skialpine · Pull Request #57 · decentespresso/openscale

skialpine · 2026-05-25T18:28:49Z

Summary

Diagnostics for field reports of "weight stops being collected under sustained multi-protocol load" — the only recovery seen was a long battery-out cooldown (a quick power-cycle didn't help), pointing at a thermal/analog failure rather than firmware state. Adds telemetry to confirm/rule it out, with no behavior change to the weight/WiFi/BLE paths.

New fields in the /snapshot WS status frame (and serial logs):

soc_temp_c / soc_temp_max_c — live + peak ESP32-S3 die temp (temperatureRead()), sampled every 2 s.
weight_stalled / stall_count / last_stall_ms / last_stall_temp_c — a watchdog in pureScale() that flags when the ADS1232 raw value is frozen/railed >8 s (a live cell dithers every sample), recording the die temp at failure to correlate stalls with heat. Skipped during the deliberate ADC power-cycle recovery; throttled to 250 ms.
reset_reason — esp_reset_reason() at boot, so a brownout/panic/WDT reset is attributable.
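
The stall watchdog described above can be sketched roughly as follows. This is a host-side Python illustration, not the firmware's pureScale() code: the class name, method name and attribute names are hypothetical, and only the 8 s window, the 250 ms throttle and the skip-during-recovery rule come from the description.

```python
class StallWatchdog:
    """Flags a stall when the raw ADC value stays identical for longer than window_ms."""

    def __init__(self, window_ms=8000, throttle_ms=250):
        self.window_ms = window_ms
        self.throttle_ms = throttle_ms
        self.last_raw = None
        self.last_change_ms = None   # None means "re-seed the window on the next sample"
        self.last_check_ms = None
        self.stalled = False
        self.stall_count = 0
        self.last_stall_ms = 0
        self.last_stall_temp_c = None

    def update(self, now_ms, raw, soc_temp_c, recovery_active=False):
        if recovery_active:
            # The raw value is frozen on purpose while the ADC is power-cycled,
            # so skip the check and re-seed the window once recovery ends.
            self.last_change_ms = None
            return self.stalled
        if self.last_check_ms is not None and now_ms - self.last_check_ms < self.throttle_ms:
            return self.stalled      # throttled: checked too recently
        self.last_check_ms = now_ms
        if self.last_change_ms is None or raw != self.last_raw:
            # First sample, resume after recovery, or a live (dithering) cell.
            self.last_raw = raw
            self.last_change_ms = now_ms
            self.stalled = False
            return False
        if not self.stalled and now_ms - self.last_change_ms > self.window_ms:
            # Stall onset: record the die temperature at the moment of failure.
            self.stalled = True
            self.stall_count += 1
            self.last_stall_ms = now_ms
            self.last_stall_temp_c = soc_temp_c
        return self.stalled
```

Feeding it a constant raw value every 250 ms trips the flag once the value has been unchanged for more than 8 s, and a single changed sample clears it again.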

Plus tools/thermal_load_test.sh: a 1-hour USB+WiFi+churn+mDNS soak that polls the telemetry (BT driven externally).

Threading: new cross-task scalars are volatile per CLAUDE.md; the status frame only reads them.

Test plan

Builds for esp32s3; flashed; status frame reports the new fields.
1-hour multi-protocol soak to capture peak temp + any stall and the die temp at which it occurs.

🤖 Generated with Claude Code

skialpine · 2026-05-25T18:29:25Z

Code review

Reviewed the telemetry diff (deep bug scan + threading/CLAUDE.md check). Found one real issue, now fixed in this PR (commit 8e345d9):

Stall watchdog false-trip during ADC recovery — the watchdog read getDebugInfo().rawValue, which is frozen by definition during the firmware's own powerDown()/powerUp() recovery, so a genuine signal-timeout recovery would be miscounted as a railed/frozen stall (corrupting the very metric this adds). Fixed: skip the check while b_adc_recovery_active and re-seed the window on resume.
Also (cost): it ran getDebugInfo() (which does a sqrt + dataset passes) every loop iteration though only rawValue is used — wasteful on a chip we're characterizing for heat. Now throttled to 250 ms (the ADC only samples ~10/s).

Verified clean: printf format/arg pairing in both status frames; cross-task reads are volatile (benign torn read only); temperatureRead()/resetReasonStr() safe; at-rest false-positive risk is low (24-bit raw at SAMPLES=1 dithers every sample, so 8 s of byte-identical raw is a genuine freeze).

🤖 Generated with Claude Code

From the toolkit review of PR decentespresso#57:

- temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON and a frozen peak since NaN compares false); keep last valid + log once.
- g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the AsyncTCP status path); status frame casts it for printf.
- Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop keeps re-seeding the stall window so weight_stalled may never trip -- the climbing recovery count makes that failure mode visible. i_adc_recovery_count is now volatile (newly read cross-task).
- reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons (CPU_LOCKUP/USB/JTAG) stay attributable.
- Comment fixes: volatile cross-task rationale; stall-window re-seed wording + recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
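
The "unknown_<code>" fallback for reset_reason amounts to a lookup with a numeric default. A minimal sketch, assuming a name table like the one below: the labels are hypothetical and the firmware's resetReasonStr() may use different strings, while the numeric codes shown are a subset of ESP-IDF's esp_reset_reason_t and should be checked against the IDF headers in use.

```python
# Illustrative subset only; label strings are assumptions, not the firmware's.
RESET_REASON_NAMES = {
    1: "poweron",    # ESP_RST_POWERON
    3: "software",   # ESP_RST_SW
    4: "panic",      # ESP_RST_PANIC
    5: "int_wdt",    # ESP_RST_INT_WDT
    6: "task_wdt",   # ESP_RST_TASK_WDT
    9: "brownout",   # ESP_RST_BROWNOUT
}


def reset_reason_str(code):
    """Map a reset code to a label, keeping unmapped codes attributable by number."""
    return RESET_REASON_NAMES.get(code, "unknown_%d" % code)
```

With this shape, a code that has no entry in the table still shows up in logs as a distinct value instead of collapsing into a generic "unknown".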

Addresses the iteration-2 review findings on PR decentespresso#57:

- Status frame no longer reads the multi-field stopWatch object directly off the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The loop task now snapshots it into aligned volatiles (g_timerRunning/g_timerElapsed) that both status frames read.
- Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps counting truthfully over a long soak instead of saturating; update the WS format specifier %u -> %lu accordingly.
- SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON.
- Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed sentinel) at boot/rollover.
- README: document the new status-frame telemetry fields.
- thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status frames or a crashed load generator, and exit non-zero on FAIL so it works as a CI gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
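
The SoC temperature guard (reject non-finite readings, keep the last valid value, log once, track the peak) could look something like this in a host-side sketch. The class and attribute names are hypothetical; only the isfinite() check, the keep-last-valid rule and the log-once behaviour come from the commit messages.

```python
import math


class SocTempTracker:
    """Keeps the last valid die temperature and the running peak."""

    def __init__(self):
        self.current_c = None
        self.max_c = None
        self.warned = False

    def sample(self, reading_c):
        """Returns True only when the reading sets a new peak."""
        if not math.isfinite(reading_c):
            # NaN or +/-inf would produce invalid JSON and, for NaN, a peak
            # that never updates, so keep the last valid value and log once.
            if not self.warned:
                print("soc temp: invalid reading, keeping last valid value")
                self.warned = True
            return False
        self.current_c = reading_c
        if self.max_c is None or reading_c > self.max_c:
            self.max_c = reading_c
            return True
        return False
```

Returning a "new peak" flag is one possible way to decide when a peak notification is worth sending; the firmware may structure this differently.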

Rebased onto current main (post decentespresso#58 WS-OOM gate, post decentespresso#60 ADC library swap) and restructured from the original "embed telemetry in status" shape to deliver telemetry out-of-band, so apps that don't care about diagnostics (on-device UI, decentespresso app, third-party scale apps) aren't paying ~21% extra bytes on every status broadcast tick.

The collection bits -- SoC die-temperature sampler, weight-stall watchdog, ADC recovery counter widened to volatile uint32_t, reset-reason capture at boot, StopWatch-tear-fix timer snapshot -- are unchanged from review-round-2.

Wire-protocol delivery is reorganized into three frames:

session_info: one-shot, server→client on WS_EVT_CONNECT. Carries the fields immutable for the connection (protocol_version, firmware_version, reset_reason). Clients no longer have to ask for these.

debug events: broadcast on diagnostic-relevant change. Emitted by:
- sendWebsocketDebugStall(true) on stall onset
- sendWebsocketDebugStall(false) on stall resume
- sendWebsocketDebugAdcRecovery() on ADC power-cycle
- sendWebsocketDebugTempPeak() on new SoC max temp
Subscribers see the event the moment it happens; non-subscribers ignore the unknown type. No periodic broadcast -- temp_peak fires at most once per warm-up curve, stall_*/adc_recovery only on real events.

debug reply: on-request snapshot per-client. Send {"command":"debug"} (also accepted: "diag") and the server replies with the full diagnostic set (current/peak temp, stall state + count + last, recovery count). Per-client, not a broadcast -- no heap cost for other clients.

All event broadcasts go through the existing wsBroadcastHeapOk() gate (PR decentespresso#58). Per-client sends don't need it (one allocation, not one-per-client).

Net effect:
- Status frame stays at its current 16 fields (≈310 B payload). Apps that don't care about diagnostics get back ~21% per status broadcast -- meaningful under multi-client load where allocations stack.
- Diagnostic consumers (soak tools, debug dashboards, this PR's own thermal_load_test.sh) get *immediate* notification of stall / recovery events instead of waiting up to 5 s for the next status.
- session_info means clients no longer have to roundtrip a status request just to learn reset_reason after reconnect.

Also includes:
- Updated ADS1232 debug callback for the new library's field set (rebased from old API; the old dataMin/Max/Avg/StdDev / tareInProgress/tareTimes are gone in the upstream lib). Callback stays dormant by default (registered but setDebugEnabled(false) is the default in the lib).
- StopWatch read-tear fix: sendWebsocketStatus and StatusAll now read g_timerRunning/g_timerElapsed (snapshotted once per main-loop pass) instead of touching stopWatch directly from the AsyncTCP task.
- CLAUDE.md: "Fixing bugs you find along the way" guidance from review round 2.
- README.md: full documentation of session_info + debug frames.
- tools/thermal_load_test.sh: 1-hour multi-protocol soak runner.

Build verified clean on esp32s3 (RAM 17.2%, Flash 45.5%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
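
On the client side, the three-frame layout suggests keeping one running snapshot and folding each frame into it. A minimal sketch under stated assumptions: the function name is hypothetical, and it assumes session_info frames carry "type":"session_info" and that the on-request debug reply reuses "type":"debug" without an "event" key. The event names and field names are taken from the PR text.

```python
import json


def merge_frame(snapshot, raw_text):
    """Fold one WS text frame into a running dict; returns the frame type or None."""
    try:
        frame = json.loads(raw_text)
    except ValueError:
        return None                      # not JSON: ignore
    if not isinstance(frame, dict):
        return None
    ftype = frame.get("type")
    if ftype in ("session_info", "debug"):
        for key, value in frame.items():
            if key not in ("type", "event", "ms"):
                snapshot[key] = value    # latest value wins
        if ftype == "debug":
            event = frame.get("event")
            if event == "stall_start":
                snapshot["weight_stalled"] = True
            elif event == "stall_end":
                snapshot["weight_stalled"] = False
    return ftype                         # other types are returned but left unmerged
```

A client that does not care about diagnostics can simply skip the "debug" branch; frames with an unrecognised type pass through without touching the snapshot.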

skialpine · 2026-05-27T14:00:46Z

Force-pushed: rebased onto current main (post #58 + #60) and restructured telemetry delivery.

What changed in the protocol: the 8 telemetry fields no longer ride on the hot status broadcast. They move out into three new frame shapes:

session_info — sent once per client on WS connect. Carries protocol_version, firmware_version, reset_reason (the connection-immutable fields).
debug events — broadcast on actual diagnostic changes: stall_start / stall_end, adc_recovery (count incremented), temp_peak (new SoC max). Subscribers get the event the moment it happens; non-subscribers ignore the unknown type.
debug reply: per-client on-request snapshot via {"command":"debug"}.

Status frame stays at its current 16 fields (~310 B payload) instead of bloating to ~470 B. Apps that don't care about diagnostics get back ~21% per broadcast.

Why now: the 8-hour soak validating #61's heap-defense knobs (cap=4 + 32K floor) showed that under marginal RSSI (-83) and multi-client load, status broadcasts were the dominant heap-pressure source — gate fired 5,000+ times. Slimming the status payload at the source is structurally better than just gating harder.

Backward compat: the previous status frame shape (which had the 8 telemetry fields appended) had not shipped on any merged release, so there's nothing to deprecate. Clients written against the README's old example will fall back to ignoring unknown fields cleanly anyway.

Build clean, validated diff against current main. Squashed to one commit since the original 7-commit history was reshuffled enough to make rebase-without-conflict impossible — happy to break it back into reviewable chunks if that's preferred.

protocol_version and firmware_version are session-immutable and now delivered exclusively via the session_info frame (sent on WS_EVT_CONNECT, also requestable via {"command":"session_info"}). Removing them from status saves ~50 bytes per broadcast across all clients, on top of the ~80 bytes already saved by moving telemetry to debug frames.

Status frame: 16 fields -> 14 fields. Net payload trim vs original PR decentespresso#57 status (24 fields) is ~37%.

Added {"command":"session_info"} (alias: "session") so clients have an explicit way to re-request the session-immutable fields if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

skialpine · 2026-05-27T14:18:53Z

Code review

Found 3 issues:

tools/thermal_load_test.sh reads telemetry fields exclusively from status frames, but commit f72d6e2 moved them all to session_info and debug frames. The monitor never subscribes to debug events nor reads session_info, so peak stays at its -999 sentinel and the verdict path forces RESULT=FAIL on every run regardless of device behavior. Fix: handle all three frame types in the Python monitor (the script also unconditionally sends events on but never debug subscribe). The script's header comment at line 5 still says "WS status-frame telemetry" — also stale.

openscale/tools/thermal_load_test.sh

Lines 67 to 96 in f72d6e2

    
            try:
                d=json.loads(w.recv())
                if d.get("type")=="status": st=d
            except Exception: pass
        return st
    ws=connect()
    peak=-999.0; total_stalls=0; reboots=0; max_recov=0
    prev_stalls=None; prev_max=None; last_reset="?"
    no_status_streak=0; max_no_status_streak=0; total_no_status=0
    first=True
    while time.time()<end:
        st = first_status(ws, 9 if first else 2.5); first=False
        if st:
            soc=st.get('soc_temp_c'); mx=st.get('soc_temp_max_c'); sc=st.get('stall_count',0) or 0
            recov=st.get('adc_recovery_count',0) or 0; rr=st.get('reset_reason','?')
            last_reset=rr; no_status_streak=0
            if isinstance(mx,(int,float)) and mx>peak: peak=mx
            if recov>max_recov: max_recov=recov
            # reboot heuristic: since-boot counters or peak dropped vs last frame
            if prev_stalls is not None and (sc<prev_stalls or (isinstance(mx,(int,float)) and isinstance(prev_max,(int,float)) and mx<prev_max-3)):
                reboots+=1
                print("[%s] *** REBOOT detected (counters reset; reset_reason=%s) ***"%(time.strftime('%H:%M:%S'),rr),flush=True)
            total_stalls=max(total_stalls, sc)
            prev_stalls=sc; prev_max=mx
            flag=" *** STALL ***" if st.get('weight_stalled') else ""
            print("[%s] soc=%5sC max=%5sC stalled=%-5s stalls=%s recov=%s last_stall_ms=%s stall_temp=%s reset=%s grams=%s chg=%s%s"%(
                time.strftime('%H:%M:%S'), soc, mx, st.get('weight_stalled'), sc, recov,
                st.get('last_stall_ms'), st.get('last_stall_temp_c'), rr, st.get('grams'),
                st.get('charging'), flag), flush=True)
        else:
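
The reboot heuristic in the excerpt above (a since-boot counter going backwards, or the peak temperature dropping by more than 3 degrees between frames) can be pulled out into a small testable function. This is a sketch mirroring the quoted logic; the function name and the drop_c parameter are hypothetical.

```python
def looks_like_reboot(prev_stalls, stalls, prev_max_c, max_c, drop_c=3.0):
    """True when since-boot telemetry appears to have been reset between two frames."""
    if prev_stalls is None:
        return False                     # no previous frame to compare against
    if stalls < prev_stalls:
        return True                      # a since-boot counter went backwards
    numeric = (int, float)
    if isinstance(max_c, numeric) and isinstance(prev_max_c, numeric):
        return max_c < prev_max_c - drop_c   # peak can only fall after a reset
    return False
```

Keeping the comparison separate from frame parsing means it behaves the same whichever frame type the fields arrive in.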

Debug event broadcasts (sendWebsocketDebugStall, sendWebsocketDebugAdcRecovery, sendWebsocketDebugTempPeak) gate only on !b_wifiEnabled || websocket.count() == 0. They skip the b_websocketEventsEnabled check that every other event broadcast in this file uses (sendWebsocketButton L188, sendWebsocketPowerOff L199, sendWebsocketStatusAll L246). README says "A client must send events on before periodic status, local scale button presses, or power-off notifications are emitted." — new debug frames break this opt-in contract. Either add !b_websocketEventsEnabled to the early-return, or document the always-on choice explicitly in the README's debug-frame section.

openscale/include/websocket.h

Lines 317 to 350 in f72d6e2

    
    // keep their own running snapshot from session_info + the on-request debug reply,
    // and update it from these deltas. Keeps each event small (~80-140 B).
    void sendWebsocketDebugStall(bool started) {
      if (!b_wifiEnabled || websocket.count() == 0) return;
      if (!wsBroadcastHeapOk()) return;
      if (started) {
        websocket.printfAll("{\"type\":\"debug\",\"event\":\"stall_start\",\"stall_count\":%lu,\"last_stall_ms\":%lu,\"last_stall_temp_c\":%.1f,\"ms\":%lu}",
                            (unsigned long)g_stallCount,
                            g_lastStallMs,
                            g_lastStallTempC,
                            millis());
      } else {
        websocket.printfAll("{\"type\":\"debug\",\"event\":\"stall_end\",\"ms\":%lu}", millis());
      }
    }
    void sendWebsocketDebugAdcRecovery() {
      if (!b_wifiEnabled || websocket.count() == 0) return;
      if (!wsBroadcastHeapOk()) return;
      websocket.printfAll("{\"type\":\"debug\",\"event\":\"adc_recovery\",\"adc_recovery_count\":%lu,\"ms\":%lu}",
                          (unsigned long)i_adc_recovery_count,
                          millis());
    }
    // temp_peak is broadcast when g_socTempMaxC ticks up. Rate-limited at the call
    // site (main loop) -- not here -- since the temp sampler already throttles.
    void sendWebsocketDebugTempPeak(float maxC) {
      if (!b_wifiEnabled || websocket.count() == 0) return;
      if (!wsBroadcastHeapOk()) return;
      websocket.printfAll("{\"type\":\"debug\",\"event\":\"temp_peak\",\"soc_temp_max_c\":%.1f,\"ms\":%lu}",
                          maxC,
                          millis());
    }
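
The first option suggested above (adding the events opt-in to the early return) reduces to a single predicate. A minimal Python sketch of that decision, assuming hypothetical parameter names that stand in for b_wifiEnabled, websocket.count(), b_websocketEventsEnabled and wsBroadcastHeapOk(); it is not the firmware code.

```python
def should_broadcast_debug_event(wifi_enabled, client_count, events_enabled, heap_ok):
    """Decide whether a debug event broadcast should be sent at all."""
    if not wifi_enabled or client_count == 0:
        return False          # nobody to send to
    if not events_enabled:
        return False          # honours the "events on" opt-in contract
    return heap_ok            # finally, respect the broadcast heap gate
```

The ordering keeps the cheap checks first, so the heap gate is consulted only when a broadcast would otherwise go out.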

i_adc_recovery_count documentation contradicts the implementation. include/parameter.h:197-200 says "counts truthfully ... over a long soak instead of saturating at 255" and README:256 says "number of ADC power-cycle recoveries since boot". But src/hds.ino:1011 calls resetAdcRecoveryState() after every successful scale.update(), which zeroes the counter. The actual semantics — confirmed by the display-logic use at src/hds.ino:1830 (>= ADC_ERROR_RECOVERY_COUNT ? "ADC ERROR" : "ADC RECOVER") — is "consecutive failed recoveries in the current episode," not a lifetime total. A client tracking adc_recovery_count expecting monotonic growth will see it ping-pong to 1 and back to 0. Either fix the docs (parameter.h block-comment + README field description) or rename for a true lifetime counter.

openscale/include/parameter.h

Lines 195 to 201 in f72d6e2

    
    static bool b_adc_recovery_active = false;
    // volatile: written on the main loop -- incremented on each ADC power-cycle
    // recovery, reset to 0 by resetAdcRecoveryState() -- and read in the WS status
    // frame (which can be built on the AsyncTCP task). uint32_t (not uint8_t) so a
    // *perpetual* recovery loop -- the one failure mode the stall watchdog is blind
    // to -- keeps counting truthfully over a long soak instead of saturating at 255.
    static volatile uint32_t i_adc_recovery_count = 0;

1 below-threshold issue (50-74)

[Worth a look] src/hds.ino:22 comment says adsDebugCallback is "enabled during soak tests" — the actual enablement is via the USB binary protocol in include/usbcomm.h, not anything soak-test-specific. One-line clarification (score: 50).

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

Four fixes from the code review on the session_info/debug refactor:

1. tools/thermal_load_test.sh: rebuild the inline Python telemetry monitor for the new frame layout. The previous version read all telemetry from status frames, which silently broke when those fields moved to debug+session_info, forcing RESULT=FAIL unconditionally. New monitor merges fields from any of the frame types (status / debug events / debug snapshot / session_info) into a running snapshot dict, and polls {"command":"debug"} every iteration so soc_temp_c/peak/stall_count stay fresh without waiting for sparse-by-design event firings. Header comment updated to match. Verified end-to-end on hardware: 2 min run -> RESULT=PASS with all telemetry fields populated.
2. include/websocket.h: debug event broadcasts now gate on b_websocketEventsEnabled, matching the existing opt-in model (sendWebsocketButton/PowerOff/StatusAll). The on-request {"command":"debug"} snapshot remains always-available regardless of the events flag, so a client that wants only diagnostic snapshots without subscribing to periodic status doesn't have to enable events. Comment block updated to document the gate behaviour.
3. include/parameter.h + README.md: i_adc_recovery_count documentation rewritten to describe its actual per-episode semantic (resets to 0 on every successful scale.update() via resetAdcRecoveryState() at hds.ino:1011). The previous comment claimed it accumulated across long soaks, which is false; the variable's actual purpose is the threshold the OLED uses to switch from "ADC RECOVER" to "ADC ERROR" (hds.ino:1830). README now points clients to summing adc_recovery debug events for a lifetime count.
4. src/hds.ino: adsDebugCallback comment now points at the actual enablement path (USB binary protocol command 0x25 in usbcomm.h) instead of the vague "enabled during soak tests".

Build clean.

PR decentespresso#57 review thread: decentespresso#57 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
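
Because adc_recovery_count is per-episode, a client that wants a lifetime figure has to count the adc_recovery events itself, as the README change suggests. A minimal sketch of that tally: the class name is hypothetical, the frame shape follows the sendWebsocketDebugAdcRecovery() format string quoted earlier, and "lifetime" here means the lifetime of this client session, since events missed while disconnected are not recoverable.

```python
import json


class RecoveryTally:
    """Counts adc_recovery events seen by this client; episode mirrors the device field."""

    def __init__(self):
        self.lifetime = 0     # events observed since this tally was created
        self.episode = 0      # last per-episode count reported by the device

    def on_frame(self, raw_text):
        try:
            frame = json.loads(raw_text)
        except ValueError:
            return
        if not isinstance(frame, dict):
            return
        if frame.get("type") == "debug" and frame.get("event") == "adc_recovery":
            self.lifetime += 1
            self.episode = frame.get("adc_recovery_count", 0)
```

The two numbers diverge as soon as a second episode starts: the device-reported count restarts at 1 while the client-side tally keeps climbing.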

Three findings from the code review on cb569d3:

1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame is invisible (the next weight frame is <=500 ms away, status <=5 s); crashing is not." The "status <=5 s" recovery guarantee no longer holds after this PR removes the only periodic status broadcaster. Rewrote to describe what's actually still gated by wsBroadcastHeapOk (weight + button + power_off).
2. include/websocket.h:243-247 had an addendum paragraph framing the surviving printfAll comment as "Note: status is no longer broadcast periodically...". The historical framing reads as documentation for a function that no longer exists; dropping it leaves the printfAll technical commentary applying cleanly to sendWebsocketWeightAll (which is what it now sits above).
3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch directly from the AsyncTCP task. CLAUDE.md threading model #1 footgun: "stopWatch.* -- No -- multi-field (running flag + start ts + accumulator) and also mutated from loop(), BLE and USB; a status-frame read can tear a write across tasks." The bug is pre-existing on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you find along the way" applies. Mirroring the snapshot pattern used by PR decentespresso#57: new g_timerRunning/g_timerElapsed volatiles in parameter.h, refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping block, read by sendWebsocketStatus. (PR decentespresso#57 adds the same snapshot independently; merge order between decentespresso#57 and decentespresso#62 will produce a trivial same-lines conflict that git auto-resolves.)

Verified on hardware: flash + 12s window with events on + rate 10k -> 121 weight frames at 10 Hz, 0 periodic status frames; on-request {"command":"status"} returns the full 16-field frame with timer_running/timer_seconds populated from the snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two README gaps surfaced while writing client-integration docs:

1. The canonical text-commands list at lines 110-129 enumerates every other WS command (tare, events, timer, display, ..., status, battery, info) but omitted the four new ones added by this PR. A reader using that block as the complete vocabulary would conclude `debug` / `session_info` don't exist. Added them with their aliases.
2. The "Event broadcasts" paragraph in the debug section didn't mention the events_enabled gate. After the review-fix commit (d1cedcc) added that gate -- matching the existing button / power-off opt-in model -- the docs hadn't caught up. Clients reading the README would think debug events arrive unprompted, then be surprised when they didn't. Clarified that events on is required for push, and the on-request snapshot is always available regardless.

Pure docs change; no behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

skialpine force-pushed the feat/scale-telemetry branch from f3adb84 to 62a09f9 on May 27, 2026 14:00

skialpine mentioned this pull request May 27, 2026

fix(ws): remove periodic status broadcast; align push pattern with BT #62

Open

4 tasks

skialpine wants to merge 4 commits into decentespresso:main from skialpine:feat/scale-telemetry
