Scale telemetry: SoC temp, weight-stall watchdog, reset reason#57
Scale telemetry: SoC temp, weight-stall watchdog, reset reason#57skialpine wants to merge 4 commits into
Conversation
Code reviewReviewed the telemetry diff (deep bug scan + threading/CLAUDE.md check). Found one real issue, now fixed in this PR (commit
Verified clean: printf format/arg pairing in both status frames; cross-task reads are 🤖 Generated with Claude Code |
From the toolkit review of PR decentespresso#57: - temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON and a frozen peak since NaN compares false); keep last valid + log once. - g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the AsyncTCP status path); status frame casts it for printf. - Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop keeps re-seeding the stall window so weight_stalled may never trip -- the climbing recovery count makes that failure mode visible. i_adc_recovery_count is now volatile (newly read cross-task). - reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons (CPU_LOCKUP/USB/JTAG) stay attributable. - Comment fixes: volatile cross-task rationale; stall-window re-seed wording + recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses the iteration-2 review findings on PR decentespresso#57: - Status frame no longer reads the multi-field stopWatch object directly off the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The loop task now snapshots it into aligned volatiles (g_timerRunning/ g_timerElapsed) that both status frames read. - Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps counting truthfully over a long soak instead of saturating; update the WS format specifier %u -> %lu accordingly. - SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON. - Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed sentinel) at boot/rollover. - README: document the new status-frame telemetry fields. - thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status frames or a crashed load generator, and exit non-zero on FAIL so it works as a CI gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (post decentespresso#58 WS-OOM gate, post decentespresso#60 ADC library swap) and restructured from the original "embed telemetry in status" shape to deliver telemetry out-of-band, so apps that don't care about diagnostics (on-device UI, decentespresso app, third-party scale apps) aren't paying ~21% extra bytes on every status broadcast tick. The collection bits — SoC die-temperature sampler, weight-stall watchdog, ADC recovery counter widened to volatile uint32_t, reset-reason capture at boot, StopWatch-tear-fix timer snapshot — are unchanged from review- round-2. Wire-protocol delivery is reorganized into three frames: session_info one-shot, server→client on WS_EVT_CONNECT. Carries the fields immutable for the connection (protocol_version, firmware_version, reset_reason). Clients no longer have to ask for these. debug events broadcast on diagnostic-relevant change. Emitted by: - sendWebsocketDebugStall(true) on stall onset - sendWebsocketDebugStall(false) on stall resume - sendWebsocketDebugAdcRecovery() on ADC power-cycle - sendWebsocketDebugTempPeak() on new SoC max temp Subscribers see the event the moment it happens; non- subscribers ignore the unknown type. No periodic broadcast — temp_peak fires at most once per warm-up curve, stall_*/adc_recovery only on real events. debug reply on-request snapshot per-client. Send {"command":"debug"} (also accepted: "diag") and the server replies with the full diagnostic set (current/peak temp, stall state + count + last, recovery count). Per-client, not a broadcast — no heap cost for other clients. All event broadcasts go through the existing wsBroadcastHeapOk() gate (PR decentespresso#58). Per-client sends don't need it (one allocation, not one-per- client). Net effect: - Status frame stays at its current 16 fields (≈310 B payload). Apps that don't care about diagnostics get back ~21% per status broadcast — meaningful under multi-client load where allocations stack. - Diagnostic consumers (soak tools, debug dashboards, this PR's own thermal_load_test.sh) get *immediate* notification of stall / recovery events instead of waiting up to 5 s for the next status. - session_info means clients no longer have to roundtrip a status request just to learn reset_reason after reconnect. Also includes: - Updated ADS1232 debug callback for the new library's field set (rebased from old API; the old dataMin/Max/Avg/StdDev / tareInProgress/tareTimes are gone in the upstream lib). Callback stays dormant by default (registered but setDebugEnabled(false) is the default in the lib). - StopWatch read-tear fix: sendWebsocketStatus and StatusAll now read g_timerRunning/g_timerElapsed (snapshotted once per main-loop pass) instead of touching stopWatch directly from the AsyncTCP task. - CLAUDE.md: "Fixing bugs you find along the way" guidance from review round 2. - README.md: full documentation of session_info + debug frames. - tools/thermal_load_test.sh: 1-hour multi-protocol soak runner. Build verified clean on esp32s3 (RAM 17.2%, Flash 45.5%). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f3adb84 to
62a09f9
Compare
|
Force-pushed: rebased onto current main (post #58 + #60) and restructured telemetry delivery. What changed in the protocol: the 8 telemetry fields no longer ride on the hot
Status frame stays at its current 16 fields (~310 B payload) instead of bloating to ~470 B. Apps that don't care about diagnostics get back ~21% per broadcast. Why now: the 8-hour soak validating #61's heap-defense knobs (cap=4 + 32K floor) showed that under marginal RSSI (-83) and multi-client load, status broadcasts were the dominant heap-pressure source — gate fired 5,000+ times. Slimming the status payload at the source is structurally better than just gating harder. Backward compat: the previous Build clean, validated diff against current main. Squashed to one commit since the original 7-commit history was reshuffled enough to make rebase-without-conflict impossible — happy to break it back into reviewable chunks if that's preferred. |
protocol_version and firmware_version are session-immutable and now
delivered exclusively via the session_info frame (sent on WS_EVT_CONNECT,
also requestable via {"command":"session_info"}). Removing them from
status saves ~50 bytes per broadcast across all clients — on top of the
~80 bytes already saved by moving telemetry to debug frames.
Status frame: 16 fields -> 14 fields. Net payload trim vs original
PR decentespresso#57 status (24 fields) is ~37%.
Added {"command":"session_info"} (alias: "session") so clients have an
explicit way to re-request the session-immutable fields if needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code reviewFound 3 issues:
openscale/tools/thermal_load_test.sh Lines 67 to 96 in f72d6e2
Lines 317 to 350 in f72d6e2
Lines 195 to 201 in f72d6e2 1 below-threshold issue (50-74)
🤖 Generated with Claude Code If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Four fixes from the code review on the session_info/debug refactor:
1. tools/thermal_load_test.sh — rebuild the inline Python telemetry
monitor for the new frame layout. The previous version read all
telemetry from status frames, which silently broke when those fields
moved to debug+session_info, forcing RESULT=FAIL unconditionally.
New monitor merges fields from any of the three frame types
(status / debug events / debug snapshot / session_info) into a
running snapshot dict, and polls {"command":"debug"} every iteration
so soc_temp_c/peak/stall_count stay fresh without waiting for
sparse-by-design event firings. Header comment updated to match.
Verified end-to-end on hardware: 2 min run -> RESULT=PASS with all
telemetry fields populated.
2. include/websocket.h — debug event broadcasts now gate on
b_websocketEventsEnabled, matching the existing opt-in model
(sendWebsocketButton/PowerOff/StatusAll). The on-request
{"command":"debug"} snapshot remains always-available regardless of
the events flag, so a client that wants only diagnostic snapshots
without subscribing to periodic status doesn't have to enable events.
Comment block updated to document the gate behaviour.
3. include/parameter.h + README.md — i_adc_recovery_count
documentation rewritten to describe its actual per-episode semantic
(resets to 0 on every successful scale.update() via
resetAdcRecoveryState() at hds.ino:1011). The previous comment
claimed it accumulated across long soaks, which is false; the
variable's actual purpose is the threshold the OLED uses to switch
from "ADC RECOVER" to "ADC ERROR" (hds.ino:1830). README now points
clients to summing adc_recovery debug events for a lifetime count.
4. src/hds.ino — adsDebugCallback comment now points at the actual
enablement path (USB binary protocol command 0x25 in usbcomm.h)
instead of the vague "enabled during soak tests".
Build clean. PR decentespresso#57 review thread:
decentespresso#57 (comment)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from the code review on cb569d3: 1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame is invisible (the next weight frame is <=500 ms away, status <=5 s); crashing is not." The "status <=5 s" recovery guarantee no longer holds after this PR removes the only periodic status broadcaster. Rewrote to describe what's actually still gated by wsBroadcastHeapOk (weight + button + power_off). 2. include/websocket.h:243-247 had an addendum paragraph framing the surviving printfAll comment as "Note: status is no longer broadcast periodically...". The historical framing reads as documentation for a function that no longer exists; dropping it leaves the printfAll technical commentary applying cleanly to sendWebsocketWeightAll (which is what it now sits above). 3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch directly from the AsyncTCP task. CLAUDE.md threading model #1 footgun: "stopWatch.* -- No -- multi-field (running flag + start ts + accumulator) and also mutated from loop(), BLE and USB; a status- frame read can tear a write across tasks." The bug is pre-existing on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you find along the way" applies. Mirroring the snapshot pattern used by PR decentespresso#57: new g_timerRunning/g_timerElapsed volatiles in parameter.h, refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping block, read by sendWebsocketStatus. (PR decentespresso#57 adds the same snapshot independently; merge order between decentespresso#57 and decentespresso#62 will produce a trivial same-lines conflict that git auto-resolves.) Verified on hardware: flash + 12s window with events on + rate 10k -> 121 weight frames at 10 Hz, 0 periodic status frames; on-request {"command":"status"} returns the full 16-field frame with timer_running/timer_seconds populated from the snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two README gaps surfaced while writing client-integration docs:
1. The canonical text-commands list at lines 110-129 enumerates every
other WS command (tare, events, timer, display, ..., status, battery,
info) but omitted the four new ones added by this PR. A reader using
that block as the complete vocabulary would conclude `debug` /
`session_info` don't exist. Added them with their aliases.
2. The "Event broadcasts" paragraph in the debug section didn't mention
the events_enabled gate. After the review-fix commit (d1cedcc) added
that gate -- matching the existing button / power-off opt-in model --
the docs hadn't caught up. Clients reading the README would think
debug events arrive unprompted, then be surprised when they didn't.
Clarified that events on is required for push, and the on-request
snapshot is always available regardless.
Pure docs change; no behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Diagnostics for field reports of "weight stops being collected under sustained multi-protocol load" — the only recovery seen was a long battery-out cooldown (a quick power-cycle didn't help), pointing at a thermal/analog failure rather than firmware state. Adds telemetry to confirm/rule it out, with no behavior change to the weight/WiFi/BLE paths.
New fields in the
/snapshotWS status frame (and serial logs):soc_temp_c/soc_temp_max_c— live + peak ESP32-S3 die temp (temperatureRead()), sampled every 2 s.weight_stalled/stall_count/last_stall_ms/last_stall_temp_c— a watchdog inpureScale()that flags when the ADS1232 raw value is frozen/railed >8 s (a live cell dithers every sample), recording the die temp at failure to correlate stalls with heat. Skipped during the deliberate ADC power-cycle recovery; throttled to 250 ms.reset_reason—esp_reset_reason()at boot, so a brownout/panic/WDT reset is attributable.Plus
tools/thermal_load_test.sh: a 1-hour USB+WiFi+churn+mDNS soak that polls the telemetry (BT driven externally).Threading: new cross-task scalars are
volatileper CLAUDE.md; the status frame only reads them.Test plan
🤖 Generated with Claude Code