Scale telemetry: SoC temp, weight-stall watchdog, reset reason #57

Open
skialpine wants to merge 4 commits into decentespresso:main from skialpine:feat/scale-telemetry

@skialpine
Contributor

Summary

Diagnostics for field reports of "weight stops being collected under sustained multi-protocol load" — the only recovery seen was a long battery-out cooldown (a quick power-cycle didn't help), pointing at a thermal/analog failure rather than firmware state. Adds telemetry to confirm/rule it out, with no behavior change to the weight/WiFi/BLE paths.

New fields in the /snapshot WS status frame (and serial logs):

  • soc_temp_c / soc_temp_max_c — live + peak ESP32-S3 die temp (temperatureRead()), sampled every 2 s.
  • weight_stalled / stall_count / last_stall_ms / last_stall_temp_c — a watchdog in pureScale() that flags when the ADS1232 raw value is frozen/railed >8 s (a live cell dithers every sample), recording the die temp at failure to correlate stalls with heat. Skipped during the deliberate ADC power-cycle recovery; throttled to 250 ms.
  • reset_reason: esp_reset_reason() captured at boot, so a brownout/panic/WDT reset is attributable.
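
The stall check described in the list above can be modelled on the host as follows. This is a minimal sketch of the logic as the PR text describes it (raw value frozen for more than 8 s, skipped during ADC recovery, window re-seeded on resume), not the firmware's pureScale() code. The class and field names are hypothetical, and the caller is assumed to invoke update() roughly every 250 ms.

```python
from dataclasses import dataclass
from typing import Optional

STALL_MS = 8000  # frozen-for threshold taken from the PR text (>8 s)


@dataclass
class StallWatchdog:
    """Host-side model of the frozen-raw-value check; not the firmware code."""
    last_raw: Optional[int] = None
    last_change_ms: Optional[int] = None  # None means "window needs re-seeding"
    stalled: bool = False
    stall_count: int = 0
    last_stall_ms: int = 0
    last_stall_temp_c: float = 0.0

    def update(self, now_ms: int, raw: int, temp_c: float,
               recovery_active: bool = False) -> bool:
        if recovery_active:
            # Raw is frozen by design during an ADC power-cycle: skip the
            # check and force a re-seed once sampling resumes.
            self.last_change_ms = None
            return self.stalled
        if self.last_change_ms is None or raw != self.last_raw:
            # First sample, resume after recovery, or a live (dithering) cell.
            self.last_raw = raw
            self.last_change_ms = now_ms
            self.stalled = False
            return False
        if not self.stalled and now_ms - self.last_change_ms > STALL_MS:
            self.stalled = True
            self.stall_count += 1
            self.last_stall_ms = now_ms
            self.last_stall_temp_c = temp_c  # die temp at the moment of failure
        return self.stalled
```

In this model the onset of a stall is counted once per frozen window, and any change in the raw value clears the flag, which matches the assumption that a live load cell dithers on every sample.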

Plus tools/thermal_load_test.sh: a 1-hour USB+WiFi+churn+mDNS soak that polls the telemetry (BT driven externally).

Threading: new cross-task scalars are volatile per CLAUDE.md; the status frame only reads them.

Test plan

  • Builds for esp32s3; flashed; status frame reports the new fields.
  • 1-hour multi-protocol soak to capture peak temp + any stall and the die temp at which it occurs.

🤖 Generated with Claude Code

@skialpine
Contributor Author

Code review

Reviewed the telemetry diff (deep bug scan + threading/CLAUDE.md check). Found one real issue, now fixed in this PR (commit 8e345d9):

  • Stall watchdog false-trip during ADC recovery — the watchdog read getDebugInfo().rawValue, which is frozen by definition during the firmware's own powerDown()/powerUp() recovery, so a genuine signal-timeout recovery would be miscounted as a railed/frozen stall (corrupting the very metric this adds). Fixed: skip the check while b_adc_recovery_active and re-seed the window on resume.
  • Also (cost): it ran getDebugInfo() (which does a sqrt + dataset passes) every loop iteration though only rawValue is used — wasteful on a chip we're characterizing for heat. Now throttled to 250 ms (the ADC only samples ~10/s).

Verified clean: printf format/arg pairing in both status frames; cross-task reads are volatile (benign torn read only); temperatureRead()/resetReasonStr() safe; at-rest false-positive risk is low (24-bit raw at SAMPLES=1 dithers every sample, so 8 s of byte-identical raw is a genuine freeze).

🤖 Generated with Claude Code

skialpine added a commit to skialpine/openscale that referenced this pull request May 25, 2026
From the toolkit review of PR decentespresso#57:
- temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON
  and a frozen peak since NaN compares false); keep last valid + log once.
- g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the
  AsyncTCP status path); status frame casts it for printf.
- Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop
  keeps re-seeding the stall window so weight_stalled may never trip -- the
  climbing recovery count makes that failure mode visible. i_adc_recovery_count
  is now volatile (newly read cross-task).
- reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons
  (CPU_LOCKUP/USB/JTAG) stay attributable.
- Comment fixes: volatile cross-task rationale; stall-window re-seed wording +
  recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
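
The "unknown_<code>" fallback mentioned in the commit above could look roughly like this on the host side. The table is an illustrative subset of ESP-IDF esp_reset_reason_t values, and the label strings are hypothetical rather than the ones resetReasonStr() actually emits.

```python
# Illustrative subset of ESP-IDF esp_reset_reason_t codes. The labels are
# hypothetical and may differ from the strings the firmware reports.
RESET_REASONS = {
    1: "poweron",
    3: "software",
    4: "panic",
    5: "int_wdt",
    6: "task_wdt",
    9: "brownout",
}


def reset_reason_str(code: int) -> str:
    """Map a reset code to a label, keeping unmapped codes attributable."""
    return RESET_REASONS.get(code, "unknown_%d" % code)
```

The point of the numeric suffix is that a reason missing from the table still reaches the log with its code intact instead of collapsing into a generic "unknown".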
skialpine added a commit to skialpine/openscale that referenced this pull request May 25, 2026
Addresses the iteration-2 review findings on PR decentespresso#57:

- Status frame no longer reads the multi-field stopWatch object directly off
  the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The
  loop task now snapshots it into aligned volatiles (g_timerRunning/
  g_timerElapsed) that both status frames read.
- Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a
  perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps
  counting truthfully over a long soak instead of saturating; update the WS
  format specifier %u -> %lu accordingly.
- SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON.
- Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed
  sentinel) at boot/rollover.
- README: document the new status-frame telemetry fields.
- thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status
  frames or a crashed load generator, and exit non-zero on FAIL so it works as
  a CI gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
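
The reasoning behind the SoC temp guard can be illustrated with a small host-side sketch. Every comparison against NaN is false, so once a NaN is stored as the peak no later reading can replace it, and a check for NaN alone would still let +/-inf through. The tracker class below is hypothetical and only models the "keep last valid" behaviour described in these commits.

```python
import math


class SocTempTracker:
    """Keeps the last valid reading and the peak; rejects NaN and +/-inf."""

    def __init__(self):
        self.current_c = None
        self.max_c = None
        self.rejected = 0  # stands in for the firmware's "log once"

    def sample(self, reading: float) -> bool:
        if not math.isfinite(reading):
            # Non-finite values would produce invalid JSON and, for NaN,
            # a peak that never updates again because NaN compares false.
            self.rejected += 1
            return False
        self.current_c = reading
        if self.max_c is None or reading > self.max_c:
            self.max_c = reading
        return True
```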
Rebased onto current main (post decentespresso#58 WS-OOM gate, post decentespresso#60 ADC library
swap) and restructured from the original "embed telemetry in status"
shape to deliver telemetry out-of-band, so apps that don't care about
diagnostics (on-device UI, decentespresso app, third-party scale apps)
aren't paying ~21% extra bytes on every status broadcast tick.

The collection bits — SoC die-temperature sampler, weight-stall watchdog,
ADC recovery counter widened to volatile uint32_t, reset-reason capture
at boot, StopWatch-tear-fix timer snapshot — are unchanged from review-
round-2.

Wire-protocol delivery is reorganized into three frames:

  session_info  one-shot, server→client on WS_EVT_CONNECT. Carries the
                fields immutable for the connection (protocol_version,
                firmware_version, reset_reason). Clients no longer have
                to ask for these.

  debug events  broadcast on diagnostic-relevant change. Emitted by:
                  - sendWebsocketDebugStall(true)  on stall onset
                  - sendWebsocketDebugStall(false) on stall resume
                  - sendWebsocketDebugAdcRecovery() on ADC power-cycle
                  - sendWebsocketDebugTempPeak()    on new SoC max temp
                Subscribers see the event the moment it happens; non-
                subscribers ignore the unknown type. No periodic
                broadcast — temp_peak fires at most once per warm-up
                curve, stall_*/adc_recovery only on real events.

  debug reply   on-request snapshot per-client. Send {"command":"debug"}
                (also accepted: "diag") and the server replies with the
                full diagnostic set (current/peak temp, stall state +
                count + last, recovery count). Per-client, not a
                broadcast — no heap cost for other clients.

All event broadcasts go through the existing wsBroadcastHeapOk() gate
(PR decentespresso#58). Per-client sends don't need it (one allocation, not one-per-
client).
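
A client consuming the three frames described above might fold them into one running snapshot along these lines. This is a sketch under assumptions: the session_info frame is assumed to carry "type":"session_info", and the on-request debug reply is assumed to be a "type":"debug" object without an "event" key. The function name merge_frame is hypothetical.

```python
import json


def merge_frame(snapshot: dict, text: str) -> dict:
    """Fold one WebSocket text frame into a running diagnostic snapshot."""
    try:
        frame = json.loads(text)
    except ValueError:
        return snapshot  # not JSON: ignore
    if not isinstance(frame, dict):
        return snapshot
    kind = frame.get("type")
    if kind == "session_info":
        for key in ("protocol_version", "firmware_version", "reset_reason"):
            if key in frame:
                snapshot[key] = frame[key]
    elif kind == "debug":
        event = frame.get("event")
        if event == "stall_start":
            snapshot["weight_stalled"] = True
        elif event == "stall_end":
            snapshot["weight_stalled"] = False
        # Events and the on-request reply both carry plain fields: copy them.
        for key, value in frame.items():
            if key not in ("type", "event", "ms"):
                snapshot[key] = value
    # Any other type (weight, status, ...) is left to other handlers.
    return snapshot
```

Because every event is a small delta, the snapshot is only as complete as the frames seen so far. Requesting {"command":"debug"} once after connecting would be one way to seed it.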

Net effect:

  - Status frame stays at its current 16 fields (≈310 B payload). Apps
    that don't care about diagnostics get back ~21% per status broadcast
    — meaningful under multi-client load where allocations stack.
  - Diagnostic consumers (soak tools, debug dashboards, this PR's own
    thermal_load_test.sh) get *immediate* notification of stall /
    recovery events instead of waiting up to 5 s for the next status.
  - session_info means clients no longer have to roundtrip a status
    request just to learn reset_reason after reconnect.

Also includes:

  - Updated ADS1232 debug callback for the new library's field set
    (rebased from old API; the old dataMin/Max/Avg/StdDev /
    tareInProgress/tareTimes are gone in the upstream lib). Callback
    stays dormant by default (registered but setDebugEnabled(false) is
    the default in the lib).
  - StopWatch read-tear fix: sendWebsocketStatus and StatusAll now read
    g_timerRunning/g_timerElapsed (snapshotted once per main-loop pass)
    instead of touching stopWatch directly from the AsyncTCP task.
  - CLAUDE.md: "Fixing bugs you find along the way" guidance from
    review round 2.
  - README.md: full documentation of session_info + debug frames.
  - tools/thermal_load_test.sh: 1-hour multi-protocol soak runner.

Build verified clean on esp32s3 (RAM 17.2%, Flash 45.5%).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skialpine force-pushed the feat/scale-telemetry branch from f3adb84 to 62a09f9 (May 27, 2026 14:00)
@skialpine
Contributor Author

Force-pushed: rebased onto current main (post #58 + #60) and restructured telemetry delivery.

What changed in the protocol: the 8 telemetry fields no longer ride on the hot status broadcast. They move out into three new frame shapes:

  • session_info — sent once per client on WS connect. Carries protocol_version, firmware_version, reset_reason (the connection-immutable fields).
  • debug events — broadcast on actual diagnostic changes: stall_start / stall_end, adc_recovery (count incremented), temp_peak (new SoC max). Subscribers get the event the moment it happens; non-subscribers ignore the unknown type.
  • debug reply: per-client on-request snapshot via {"command":"debug"}.

Status frame stays at its current 16 fields (~310 B payload) instead of bloating to ~470 B. Apps that don't care about diagnostics get back ~21% per broadcast.

Why now: the 8-hour soak validating #61's heap-defense knobs (cap=4 + 32K floor) showed that under marginal RSSI (-83) and multi-client load, status broadcasts were the dominant heap-pressure source — gate fired 5,000+ times. Slimming the status payload at the source is structurally better than just gating harder.

Backward compat: the previous status frame shape (which had the 8 telemetry fields appended) had not shipped on any merged release, so there's nothing to deprecate. Clients written against the README's old example will fall back to ignoring unknown fields cleanly anyway.

Build clean, validated diff against current main. Squashed to one commit since the original 7-commit history was reshuffled enough to make rebase-without-conflict impossible — happy to break it back into reviewable chunks if that's preferred.

protocol_version and firmware_version are session-immutable and now
delivered exclusively via the session_info frame (sent on WS_EVT_CONNECT,
also requestable via {"command":"session_info"}). Removing them from
status saves ~50 bytes per broadcast across all clients — on top of the
~80 bytes already saved by moving telemetry to debug frames.

Status frame: 16 fields -> 14 fields. Net payload trim vs original
PR decentespresso#57 status (24 fields) is ~37%.

Added {"command":"session_info"} (alias: "session") so clients have an
explicit way to re-request the session-immutable fields if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skialpine
Contributor Author

Code review

Found 3 issues:

  1. tools/thermal_load_test.sh reads telemetry fields exclusively from status frames, but commit f72d6e2 moved them all to session_info and debug frames. The monitor never subscribes to debug events nor reads session_info, so peak stays at its -999 sentinel and the verdict path forces RESULT=FAIL on every run regardless of device behavior. Fix: handle all three frame types in the Python monitor (the script also unconditionally sends events on but never debug subscribe). The script's header comment at line 5 still says "WS status-frame telemetry" — also stale.

        try:
            d=json.loads(w.recv())
            if d.get("type")=="status": st=d
        except Exception: pass
    return st
ws=connect()
peak=-999.0; total_stalls=0; reboots=0; max_recov=0
prev_stalls=None; prev_max=None; last_reset="?"
no_status_streak=0; max_no_status_streak=0; total_no_status=0
first=True
while time.time()<end:
    st = first_status(ws, 9 if first else 2.5); first=False
    if st:
        soc=st.get('soc_temp_c'); mx=st.get('soc_temp_max_c'); sc=st.get('stall_count',0) or 0
        recov=st.get('adc_recovery_count',0) or 0; rr=st.get('reset_reason','?')
        last_reset=rr; no_status_streak=0
        if isinstance(mx,(int,float)) and mx>peak: peak=mx
        if recov>max_recov: max_recov=recov
        # reboot heuristic: since-boot counters or peak dropped vs last frame
        if prev_stalls is not None and (sc<prev_stalls or (isinstance(mx,(int,float)) and isinstance(prev_max,(int,float)) and mx<prev_max-3)):
            reboots+=1
            print("[%s] *** REBOOT detected (counters reset; reset_reason=%s) ***"%(time.strftime('%H:%M:%S'),rr),flush=True)
        total_stalls=max(total_stalls, sc)
        prev_stalls=sc; prev_max=mx
        flag=" *** STALL ***" if st.get('weight_stalled') else ""
        print("[%s] soc=%5sC max=%5sC stalled=%-5s stalls=%s recov=%s last_stall_ms=%s stall_temp=%s reset=%s grams=%s chg=%s%s"%(
            time.strftime('%H:%M:%S'), soc, mx, st.get('weight_stalled'), sc, recov,
            st.get('last_stall_ms'), st.get('last_stall_temp_c'), rr, st.get('grams'),
            st.get('charging'), flag), flush=True)
    else:

  2. Debug event broadcasts (sendWebsocketDebugStall, sendWebsocketDebugAdcRecovery, sendWebsocketDebugTempPeak) gate only on !b_wifiEnabled || websocket.count() == 0. They skip the b_websocketEventsEnabled check that every other event broadcast in this file uses (sendWebsocketButton L188, sendWebsocketPowerOff L199, sendWebsocketStatusAll L246). README says "A client must send events on before periodic status, local scale button presses, or power-off notifications are emitted." The new debug frames break this opt-in contract. Either add !b_websocketEventsEnabled to the early-return, or document the always-on choice explicitly in the README's debug-frame section.

// keep their own running snapshot from session_info + the on-request debug reply,
// and update it from these deltas. Keeps each event small (~80-140 B).
void sendWebsocketDebugStall(bool started) {
  if (!b_wifiEnabled || websocket.count() == 0) return;
  if (!wsBroadcastHeapOk()) return;
  if (started) {
    websocket.printfAll("{\"type\":\"debug\",\"event\":\"stall_start\",\"stall_count\":%lu,\"last_stall_ms\":%lu,\"last_stall_temp_c\":%.1f,\"ms\":%lu}",
                        (unsigned long)g_stallCount,
                        g_lastStallMs,
                        g_lastStallTempC,
                        millis());
  } else {
    websocket.printfAll("{\"type\":\"debug\",\"event\":\"stall_end\",\"ms\":%lu}", millis());
  }
}
void sendWebsocketDebugAdcRecovery() {
  if (!b_wifiEnabled || websocket.count() == 0) return;
  if (!wsBroadcastHeapOk()) return;
  websocket.printfAll("{\"type\":\"debug\",\"event\":\"adc_recovery\",\"adc_recovery_count\":%lu,\"ms\":%lu}",
                      (unsigned long)i_adc_recovery_count,
                      millis());
}
// temp_peak is broadcast when g_socTempMaxC ticks up. Rate-limited at the call
// site (main loop) -- not here -- since the temp sampler already throttles.
void sendWebsocketDebugTempPeak(float maxC) {
  if (!b_wifiEnabled || websocket.count() == 0) return;
  if (!wsBroadcastHeapOk()) return;
  websocket.printfAll("{\"type\":\"debug\",\"event\":\"temp_peak\",\"soc_temp_max_c\":%.1f,\"ms\":%lu}",
                      maxC,
                      millis());
}

  3. i_adc_recovery_count documentation contradicts the implementation. include/parameter.h:197-200 says "counts truthfully ... over a long soak instead of saturating at 255" and README:256 says "number of ADC power-cycle recoveries since boot". But src/hds.ino:1011 calls resetAdcRecoveryState() after every successful scale.update(), which zeroes the counter. The actual semantics, confirmed by the display-logic use at src/hds.ino:1830 (>= ADC_ERROR_RECOVERY_COUNT ? "ADC ERROR" : "ADC RECOVER"), are "consecutive failed recoveries in the current episode," not a lifetime total. A client tracking adc_recovery_count and expecting monotonic growth will see it ping-pong to 1 and back to 0. Either fix the docs (parameter.h block-comment + README field description) or rename for a true lifetime counter.

static bool b_adc_recovery_active = false;
// volatile: written on the main loop -- incremented on each ADC power-cycle
// recovery, reset to 0 by resetAdcRecoveryState() -- and read in the WS status
// frame (which can be built on the AsyncTCP task). uint32_t (not uint8_t) so a
// *perpetual* recovery loop -- the one failure mode the stall watchdog is blind
// to -- keeps counting truthfully over a long soak instead of saturating at 255.
static volatile uint32_t i_adc_recovery_count = 0;
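
Given the per-episode semantics described in issue 3, a client that wants a lifetime total could count adc_recovery events itself rather than trust the field value. The sketch below assumes one adc_recovery debug event is emitted per ADC power-cycle, as the frame description in this PR states. The class name is hypothetical.

```python
import json


class RecoveryTally:
    """Counts adc_recovery debug events seen on one client connection."""

    def __init__(self):
        self.lifetime = 0  # events observed by this client
        self.episode = 0   # last adc_recovery_count reported (per-episode)

    def feed(self, text: str) -> None:
        try:
            frame = json.loads(text)
        except ValueError:
            return
        if not isinstance(frame, dict):
            return
        if frame.get("type") == "debug" and frame.get("event") == "adc_recovery":
            self.lifetime += 1
            self.episode = frame.get("adc_recovery_count", 0)
```

The tally is a lower bound: events skipped by the heap gate, or fired while the client was disconnected, are not seen.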


1 below-threshold issue (50-74)
  • [Worth a look] src/hds.ino:22 comment says adsDebugCallback is "enabled during soak tests" — the actual enablement is via the USB binary protocol in include/usbcomm.h, not anything soak-test-specific. One-line clarification (score: 50).

🤖 Generated with Claude Code

Four fixes from the code review on the session_info/debug refactor:

1. tools/thermal_load_test.sh — rebuild the inline Python telemetry
   monitor for the new frame layout. The previous version read all
   telemetry from status frames, which silently broke when those fields
   moved to debug+session_info, forcing RESULT=FAIL unconditionally.
   New monitor merges fields from any of the three frame types
   (status / debug events / debug snapshot / session_info) into a
   running snapshot dict, and polls {"command":"debug"} every iteration
   so soc_temp_c/peak/stall_count stay fresh without waiting for
   sparse-by-design event firings. Header comment updated to match.
   Verified end-to-end on hardware: 2 min run -> RESULT=PASS with all
   telemetry fields populated.

2. include/websocket.h — debug event broadcasts now gate on
   b_websocketEventsEnabled, matching the existing opt-in model
   (sendWebsocketButton/PowerOff/StatusAll). The on-request
   {"command":"debug"} snapshot remains always-available regardless of
   the events flag, so a client that wants only diagnostic snapshots
   without subscribing to periodic status doesn't have to enable events.
   Comment block updated to document the gate behaviour.

3. include/parameter.h + README.md — i_adc_recovery_count
   documentation rewritten to describe its actual per-episode semantic
   (resets to 0 on every successful scale.update() via
   resetAdcRecoveryState() at hds.ino:1011). The previous comment
   claimed it accumulated across long soaks, which is false; the
   variable's actual purpose is the threshold the OLED uses to switch
   from "ADC RECOVER" to "ADC ERROR" (hds.ino:1830). README now points
   clients to summing adc_recovery debug events for a lifetime count.

4. src/hds.ino — adsDebugCallback comment now points at the actual
   enablement path (USB binary protocol command 0x25 in usbcomm.h)
   instead of the vague "enabled during soak tests".
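
The gate order from item 2 can be written as a single predicate. This is a host-side model with hypothetical names, not the code in include/websocket.h. The heap check is reduced to a boolean that stands in for wsBroadcastHeapOk().

```python
def should_push_debug_event(wifi_enabled: bool, client_count: int,
                            events_enabled: bool, heap_ok: bool) -> bool:
    """Model of the early-return chain for debug event broadcasts."""
    if not wifi_enabled or client_count == 0:
        return False  # nobody to send to
    if not events_enabled:
        return False  # opt-in contract: client has not sent "events on"
    return heap_ok    # final gate: skip the broadcast under heap pressure
```

The on-request {"command":"debug"} snapshot is outside this predicate, since item 2 keeps it available regardless of the events flag.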

Build clean. PR decentespresso#57 review thread:
decentespresso#57 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
skialpine added a commit to skialpine/openscale that referenced this pull request May 27, 2026
Three findings from the code review on cb569d3:

1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame
   is invisible (the next weight frame is <=500 ms away, status <=5 s);
   crashing is not." The "status <=5 s" recovery guarantee no longer
   holds after this PR removes the only periodic status broadcaster.
   Rewrote to describe what's actually still gated by wsBroadcastHeapOk
   (weight + button + power_off).

2. include/websocket.h:243-247 had an addendum paragraph framing the
   surviving printfAll comment as "Note: status is no longer broadcast
   periodically...". The historical framing reads as documentation for
   a function that no longer exists; dropping it leaves the printfAll
   technical commentary applying cleanly to sendWebsocketWeightAll
   (which is what it now sits above).

3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch
   directly from the AsyncTCP task. CLAUDE.md threading model #1
   footgun: "stopWatch.* -- No -- multi-field (running flag + start ts +
   accumulator) and also mutated from loop(), BLE and USB; a status-
   frame read can tear a write across tasks." The bug is pre-existing
   on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you
   find along the way" applies. Mirroring the snapshot pattern used by
   PR decentespresso#57: new g_timerRunning/g_timerElapsed volatiles in parameter.h,
   refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping
   block, read by sendWebsocketStatus. (PR decentespresso#57 adds the same snapshot
   independently; merge order between decentespresso#57 and decentespresso#62 will produce a trivial
   same-lines conflict that git auto-resolves.)

Verified on hardware:
  flash + 12s window with events on + rate 10k -> 121 weight frames at
  10 Hz, 0 periodic status frames; on-request {"command":"status"}
  returns the full 16-field frame with timer_running/timer_seconds
  populated from the snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two README gaps surfaced while writing client-integration docs:

  1. The canonical text-commands list at lines 110-129 enumerates every
     other WS command (tare, events, timer, display, ..., status, battery,
     info) but omitted the four new ones added by this PR. A reader using
     that block as the complete vocabulary would conclude `debug` /
     `session_info` don't exist. Added them with their aliases.

  2. The "Event broadcasts" paragraph in the debug section didn't mention
     the events_enabled gate. After the review-fix commit (d1cedcc) added
     that gate -- matching the existing button / power-off opt-in model --
     the docs hadn't caught up. Clients reading the README would think
     debug events arrive unprompted, then be surprised when they didn't.
     Clarified that events on is required for push, and the on-request
     snapshot is always available regardless.

Pure docs change; no behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>