Skip to content

fix: tighten WS heap defenses (cap=4, broadcast floor 32K)#61

Merged
tadelv merged 1 commit into
decentespresso:mainfrom
skialpine:fix/ws-heap-defense
May 28, 2026
Merged

fix: tighten WS heap defenses (cap=4, broadcast floor 32K)#61
tadelv merged 1 commit into
decentespresso:mainfrom
skialpine:fix/ws-heap-defense

Conversation

@skialpine
Copy link
Copy Markdown
Contributor

Summary

Two small knobs that together close the silent-WiFi-drop failure mode observed during multi-protocol soaks on top of #56's WS-OOM defense.

  • websocket.cleanupClients(4) — halve the WS client cap from the lib default (8) to 4
  • WS_BROADCAST_HEAP_FLOOR raised 25 KB → 32 KB

Why

Under sustained multi-protocol load (4+ WS clients + BLE + USB), per-client queue + AsyncWebSocketMessage allocations overcommit heap until lwIP can't allocate pbufs. Symptom is nasty: wifi_status stays CONNECTED, disc/rec never increment, but packets drop silently. The on-device watchdog never trips because nothing has crashed.

The existing wsBroadcastHeapOk() gate (PR #58) prevents the bad_alloc → abort reboot, but at the original 25 KB floor a 4-client status burst still dipped free heap to ~22 KB — below the lwIP starvation knee. 32 KB keeps the post-burst trough in safe territory.

cap=8 was always overhead for edge cases. Real workloads (on-device UI + PRs/Decenza app + an occasional second viewer) cap out at 3–4 concurrent clients on this hardware.

Validation

2h+ soak on hardware (skialpine/test/adc-lib+telemetry worktree with these edits applied):

  • 4 ws_drop_repro generators at 10 Hz each + BLE (Decenza app, ~1 Hz heartbeats) + USB serial
  • 0% ping loss, 0 reconnects (disc=0 rec=0 for the full run)
  • 0 panics, 0 bad_alloc, 0 AsyncTCP stalls
  • ~300 broadcasts skipped by the existing heap gate (working as designed)
  • minheap stable at ~3.6 KB across the run — no leak signature

Holding for an 8h soak before merge.

What this doesn't fix

This is a brake, not a root-cause fix. The underlying pressure — large statusAll broadcasts × 4 clients allocating ~2.3 KB per burst — is still there; we're just guaranteeing the gate catches it. Follow-up will split immutable fields (firmware_version, protocol_version, reset_reason) and quasi-immutable fields (stall_count, last_stall_*, adc_recovery_count, soc_temp_max_c) out of the hot statusAll path into a one-shot session_info frame + on-change deltas. Targets ~37% payload reduction on the status broadcast.

Test plan

  • Build clean (pio run -e esp32s3)
  • 2h+ multi-protocol soak — 0 drops
  • 8h soak validation (in progress)
  • Functional smoke: on-device UI + Decenza app + PRs WS coexist
  • Verify [ws] low heap … skip broadcast still fires correctly under load

🤖 Generated with Claude Code

Under sustained multi-protocol load (4+ WS clients + BLE + USB) the
default WS client cap of 8 lets per-client queue + AsyncWebSocketMessage
allocations overcommit heap to the point where lwIP TX buffers starve
silently — WiFi reports status=CONNECTED, disc/rec never increment, but
packets drop. Two small knobs together close the gap:

  * cleanupClients(4) — halves worst-case connection overcommit.
    Real workloads (on-device UI + PRs/Decenza + occasional second
    client) cap out at 3–4; 8 was always headroom for edge cases that
    don't happen on this hardware.

  * WS_BROADCAST_HEAP_FLOOR raised 25000 → 32000. The original 25K
    floor sat too close to the lwIP starvation knee: a 4-client status
    burst dipped free heap to ~22K, below where pbufs can be allocated.
    32K keeps the post-burst trough in safe territory.

Validated on hardware: 2h+ soak with 4 ws_drop_repro generators + BLE
(Decenza) + USB serial, 0% ping loss, 0 reconnects, 0 panics, ~300
broadcasts skipped by the existing wsBroadcastHeapOk gate. minheap
stable across the run (no leak signature).

Holding for an 8h soak before merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tadelv tadelv merged commit cd37927 into decentespresso:main May 28, 2026
tadelv pushed a commit to skialpine/openscale that referenced this pull request May 28, 2026
The /snapshot WebSocket was the only protocol pushing periodic state
snapshots. BT notifies only weight + on-event (power_off, buttons); USB
binary follows the same shape. WiFi did a periodic status frame every
5 s when events_enabled, which:

  - allocated ~310 B x N clients every 5 s for state the on-device web
    apps never consumed (verified: zero references to battery_percent,
    timer_seconds, charging, low_power, etc. in web_apps/)
  - was the dominant heap-pressure source in the 8 h soak that
    motivated PR decentespresso#61
  - was inconsistent with BT, surprising clients that interoperate
    across both transports

After this change, clients request status on demand via
{"command":"status"} -- same model as BT's 0x22 voltage request, same
per-client reply path that already exists (sendWebsocketStatus, kept
intact). The events_enabled flag now gates only button events and
power_off broadcasts -- the cross-protocol event types that BT also
notifies, just without an opt-in over there. Weight, button, and
power_off push behavior on WiFi is unchanged.

Cleanup follows:
  - sendWebsocketStatusAll() removed (no callers; was the periodic
    broadcast helper)
  - t_lastWebsocketStatusUpdate global removed (no periodic tick to
    track)
  - WEBSOCKET_STATUS_NOTIFY_INTERVAL_MS constant removed
  - WS event handler no longer initialises/resets the period timer
  - README updated: status frame is on-request only; the "events on"
    paragraph no longer mentions periodic status; cross-references the
    BT pattern explicitly

Verified on hardware:
  20 s window, events on, rate 10k -> 201 weight frames (10.05 Hz)
  + 0 status frames. On-request {"command":"status"} still answers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tadelv pushed a commit that referenced this pull request May 28, 2026
…#62)

* fix(ws): remove periodic status broadcast; align push pattern with BT

The /snapshot WebSocket was the only protocol pushing periodic state
snapshots. BT notifies only weight + on-event (power_off, buttons); USB
binary follows the same shape. WiFi did a periodic status frame every
5 s when events_enabled, which:

  - allocated ~310 B x N clients every 5 s for state the on-device web
    apps never consumed (verified: zero references to battery_percent,
    timer_seconds, charging, low_power, etc. in web_apps/)
  - was the dominant heap-pressure source in the 8 h soak that
    motivated PR #61
  - was inconsistent with BT, surprising clients that interoperate
    across both transports

After this change, clients request status on demand via
{"command":"status"} -- same model as BT's 0x22 voltage request, same
per-client reply path that already exists (sendWebsocketStatus, kept
intact). The events_enabled flag now gates only button events and
power_off broadcasts -- the cross-protocol event types that BT also
notifies, just without an opt-in over there. Weight, button, and
power_off push behavior on WiFi is unchanged.

Cleanup follows:
  - sendWebsocketStatusAll() removed (no callers; was the periodic
    broadcast helper)
  - t_lastWebsocketStatusUpdate global removed (no periodic tick to
    track)
  - WEBSOCKET_STATUS_NOTIFY_INTERVAL_MS constant removed
  - WS event handler no longer initialises/resets the period timer
  - README updated: status frame is on-request only; the "events on"
    paragraph no longer mentions periodic status; cross-references the
    BT pattern explicitly

Verified on hardware:
  20 s window, events on, rate 10k -> 201 weight frames (10.05 Hz)
  + 0 status frames. On-request {"command":"status"} still answers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address review findings on periodic-status removal

Three findings from the code review on cb569d3:

1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame
   is invisible (the next weight frame is <=500 ms away, status <=5 s);
   crashing is not." The "status <=5 s" recovery guarantee no longer
   holds after this PR removes the only periodic status broadcaster.
   Rewrote to describe what's actually still gated by wsBroadcastHeapOk
   (weight + button + power_off).

2. include/websocket.h:243-247 had an addendum paragraph framing the
   surviving printfAll comment as "Note: status is no longer broadcast
   periodically...". The historical framing reads as documentation for
   a function that no longer exists; dropping it leaves the printfAll
   technical commentary applying cleanly to sendWebsocketWeightAll
   (which is what it now sits above).

3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch
   directly from the AsyncTCP task. CLAUDE.md threading model #1
   footgun: "stopWatch.* -- No -- multi-field (running flag + start ts +
   accumulator) and also mutated from loop(), BLE and USB; a status-
   frame read can tear a write across tasks." The bug is pre-existing
   on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you
   find along the way" applies. Mirroring the snapshot pattern used by
   PR #57: new g_timerRunning/g_timerElapsed volatiles in parameter.h,
   refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping
   block, read by sendWebsocketStatus. (PR #57 adds the same snapshot
   independently; merge order between #57 and #62 will produce a trivial
   same-lines conflict that git auto-resolves.)

Verified on hardware:
  flash + 12s window with events on + rate 10k -> 121 weight frames at
  10 Hz, 0 periodic status frames; on-request {"command":"status"}
  returns the full 16-field frame with timer_running/timer_seconds
  populated from the snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skialpine skialpine deleted the fix/ws-heap-defense branch May 28, 2026 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants