perf(serial): split daemon WS handler into reader/writer/inbound (closes #749)#750
Conversation
#749) Pre-fix the daemon's serial-WebSocket bridge was a single tokio::select! loop: broadcast.recv() -> socket.send().await -> next iteration. While the loop was suspended in socket.send().await it could NOT drain the serial broadcast channel. On a sustained TX burst from the device (FastLED autoresearch's 4-pattern OBJECT_FLED test emits ~80 FL_WARN lines per pattern), the broadcast channel filled to its 1024-entry cap, started returning RecvError::Lagged, and the daemon silently dropped lines. Downstream consumers saw only the first pattern, then nothing. Replace the single loop with three concurrent tasks: READER (broadcast -> mpsc): Pulls SerialStreamEvents from the broadcast channel as fast as the runtime allows. Never blocks on socket I/O. Pushes outbound messages into an unbounded mpsc queue. Bounded only by broadcast throughput. Maps each Data event into a single-line Data message; forwards non-Data events directly. WRITER (mpsc -> WS sink): Blocks on the FIRST recv().await, then non-blockingly try_recv()'s every additional message that arrived during the previous flush. Coalesces ADJACENT Data messages into one Data { lines, ... } so the WS frame count stays low under bursty traffic. Non-Data messages preserve ordering by flushing the current Data batch first. As soon as the OS socket signals it can take more data the writer ships whatever has accumulated, with no artificial delay. INBOUND (WS stream -> serial manager): Handles client commands (Write, Detach, ClearBuffer, GetInWaiting). Routes outbound replies (WriteAck, Error, InWaiting) through the mpsc queue so the WRITER task remains the sole owner of the WS sink. ClearBuffer + GetInWaiting semantics are documented as best-effort post-split -- the broadcast receiver that used to back them is now owned by the reader task. Validated on a Teensy 4.0 running FastLED's AutoResearch.ino (`bash autoresearch teensy40 --object-fled --strip-sizes 5 --timeout 60`): Before: Pattern A appears at the wrapper; Pattern B/C/D vanish into a Lagged() warning on the daemon side. Wrapper times out after 720 s of waiting for the final REMOTE response. After: All 4 patterns flow through. Final REMOTE response reaches the wrapper within seconds. `Tests: 4/4 passed, Duration: 13ms`. Total wall: 42 s (compile+flash+boot+test). Larger bursts (e.g. 100-LED matrix) hit a separate device-side limit: the Teensy's UART TX at 115200 baud can't shift thousands of FL_WARN lines per pattern fast enough, the device-side `FastSerial.write()` blocks, and the test pauses. That is independent of this daemon-side fix; tracked separately. Cross-link: FastLED/FastLED#3219
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThe ChangesWebSocket Serial Monitor Pipeline Refactor
Sequence Diagram(s)sequenceDiagram
participant Client as WebSocket Client
participant INBOUND as INBOUND Task
participant SerialMgr as Serial Manager
participant MPSC as MPSC Queue
participant WRITER as WRITER Task
participant READER as READER Task
participant Broadcast as Serial Broadcast
rect rgba(30, 100, 200, 0.5)
note over READER,Broadcast: Outbound path
Broadcast->>READER: recv() serial event
READER->>MPSC: send(SerialServerMessage)
MPSC->>WRITER: recv() message
WRITER->>WRITER: coalesce adjacent Data payloads
WRITER->>Client: send(WS frame)
end
rect rgba(200, 80, 30, 0.5)
note over Client,MPSC: Inbound path
Client->>INBOUND: WS message (write/detach/clearbuffer/getinwaiting)
INBOUND->>SerialMgr: write bytes / detach / query
SerialMgr-->>INBOUND: ack / error / result
INBOUND->>MPSC: send(ack/reply SerialServerMessage)
MPSC->>WRITER: recv() reply
WRITER->>Client: send(ack WS frame)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Minor bump. Picks up: - #747 native fbuild as raw wheel script (drops Python launcher; fixes Windows stdout-ordering bug on COM25 303A:1001 USB Serial Device (COM25) ser=80:F1:B2:D1:DF:B1 └─ Espressif Systems / ESP32-S3 USB-CDC COM1 [Unknown] └─ (no USB identifier — Unknown endpoint) COM3 [Unknown] └─ (no USB identifier — Unknown endpoint) COM22 303A:1001 USB Serial Device (COM22) ser=D8:3B:DA:41:64:C0 └─ Espressif Systems / ESP32-S3 USB-CDC COM4 [Unknown] └─ (no USB identifier — Unknown endpoint) COM9 303A:1001 USB Serial Device (COM9) ser=8C:BF:EA:CF:87:B4 └─ Espressif Systems / ESP32-S3 USB-CDC COM20 16C0:0483 USB Serial Device (COM20) ser=15821020 └─ Van Ooijen Technische Informatica / Teensyduino Serial COM10 1FC9:0132 USB Serial Device (COM10) ser=0B03400A └─ NXP Semiconductors / LPC-Link2 CMSIS-DAP 5 USB ports, 3 non-USB ports) - #748 src-layout package-dir (static analyzers resolve fbuild.api) - #750 daemon WS handler split into reader/writer/inbound - #751 gitignore .extern-repos/ - #752 cargo fmt + extract test modules so 4 files fall under the 1000-LOC gate No public-API changes beyond what's already documented in the merged PRs above. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
#3219) Pulls in FastLED/fbuild#750 (closes FastLED/fbuild#749) -- the daemon- side fix for the serial WebSocket bridge that silently dropped lines under sustained TX bursts. Pre-#750 the daemon sent one WS frame per serial line, the broadcast channel lagged on ~320-line bursts (e.g. 4-pattern OBJECT_FLED autoresearch), and lines vanished via RecvError::Lagged. This is the upstream half of the FastLED #3219 fix. The wrapper-side half landed in the prior commits on this branch (rpc_client.py, serial_interface.py, autoresearch/{context,phases}.py). Verified: `uv run fbuild --version` -> `fbuild 2.3.0`, lint stays green, `bash autoresearch teensy40 --object-fled --strip-sizes 5` completes end-to-end with 4/4 patterns through the wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chdog (closes #3219) (#3341) * fix(teensy): restore FlexPWM RX byte capture Refs #3219 * fix(teensy4/flexpwm-rx): wait() honest + skip leading phantom edge pair (#3219) Two related fixes uncovered on a live Teensy 4.0 bench session after PR #3222's dual-circuit capture refactor exposed the wrapper-skip cheating. Both were masking real RX-decode failures as success. 1) `wait()` always returned `RxWaitResult::SUCCESS` even on timeout with zero captures. Caller (`capture()` in AutoResearchTest.cpp:428) trusts SUCCESS to mean "RX got data" and proceeds to decode the buffer -- which on a no-data timeout is stale or zero. Result was indistinguishable from real decode errors. Track CITER's initial value and return TIMEOUT when DMA never moved. 2) `buildEdgeTimesFromCaptures()` pushed the first HIGH edge unconditionally, then skipped only the subsequent gap LOW when `low_ns > 5000`. On Teensy 4.0 the IOMUXC pad-mux switch + DMA arm sometimes latches a phantom rising+falling pair whose HIGH duration looks like a real T0H (~226 ns) but is followed by a multi-us idle before the real frame starts. That leftover HIGH then shifted every downstream bit by one. Add a leading-phantom-skip pass that drops any pair whose trailing LOW is a reset/idle gap. Bench result (100-LED OBJECT_FLED frame, MSB_LSB_A pattern A): - Before: First decode error at LED[0], 86/100 LEDs corrupt Edge sequence starts: H226 H580 L640 H586 (double-HIGH at start) - After: First decode error at LED[35], 65/100 LEDs corrupt Edge sequence starts: H580 L640 H586 L640 (clean alternation) LED[0]..LED[34] now decode correctly. Remaining mid-strip failures have a different signature (LSB-only errors, occasional 1-byte shifts) and are tracked separately under #3219 Phase 4 follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ci): blocker_alert + debug_object_fled_capture helpers (#3219 bench tooling) Two small helpers used during the #3219 FlexPWM-RX bring-up session: - `ci/util/blocker_alert.py` — bold-red ANSI banner + distinctive 3-tone falling-rising chirp via `winsound.Beep` so the user notices when automation is blocked on a physical action (cable swap, power cycle). Falls back to POSIX terminal bell on non-Windows. Tones picked outside the Windows system-sound band (1175/440 Hz x2) so it doesn't get confused with Outlook/Slack notifications. - `ci/autoresearch/debug_object_fled_capture.py` — direct pyserial monitor that bypasses the autoresearch wrapper's `RESULT:`-only serial filter. Sends `runSingleTest` to a flashed Teensy and dumps every line the device emits (FL_WARN, REMOTE:, RESULT:) for a configurable window. Used to inspect FlexPWM RAW captures and edge-time dumps that the wrapper otherwise swallows. Both files include KBI001-compliant KeyboardInterrupt handlers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(teensy4/flexpwm-rx): midpoint bit classification eliminates cascading byte shifts (#3219) Bench evidence on a live Teensy 4.0 (5-LED Pattern B): - BEFORE: one bit lands at high=580ns -- in neither T0H_max nor T1H_min range, so decodeBit() returns -1, decoder `continue`s past it without pushing to current_byte, every subsequent bit shifts forward by one position. Result: 1 decoder error -> 5 of 5 LEDs corrupt with R: 0x55 -> 0xAB (left-shift-by-one signature) propagated through the whole strip. - AFTER: classify by midpoint between t0h_max and t1h_min. Marginal bits still get classified (might be wrong by one bit), but byte alignment is preserved and the error is contained to that bit's byte instead of poisoning the rest of the frame. 100-LED OBJECT_FLED frame, 4 patterns (MSB_LSB_A/B/C/D): - BEFORE phantom-skip (the original wrapper-skip cheat removed): 86/100, 87/100, 80/100, 64/100 LEDs corrupt. - AFTER phantom-skip only (commit e3026d9): 65/100, 79/100, 87/100, 79/100 LEDs corrupt -- LED 0..34 clean. - AFTER midpoint classifier (this commit): 1-8/100 LEDs corrupt across 4 patterns, decoder errors = 0. 3 consecutive runs show non-deterministic per-LED errors, confirming the remaining failures are TX/RX clock-jitter at WS2812 timing tolerances -- not a software algorithm bug. The LOW-time check in the original decoder added no robustness for this protocol (both T0L and T1L overlap heavily at the FlexPWM-RX sample resolution); HIGH-only classification is more skew-tolerant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(autoresearch): wrapper RPC response-miss race + post-ACK budget honesty (#3219) Two wrapper-side cheat #5 fixes that together prevent the wrapper from silently dropping the device's REMOTE response line and waiting 720 s for a response that arrived in milliseconds. ci/rpc_client.py::_wait_for_response: - Don't recurse into a fresh `_wait_for_response` on ACK -- that spawned a 2nd `read_lines()` producer behind the 1st in the single-worker `FbuildSerialAdapter._executor`. The 1st producer kept reading the serial port (including the final REMOTE response line) into a queue whose async consumer had already returned, and the 2nd producer never saw the line. Keep the SAME `async for` open and extend `current_deadline` on ACK instead. - Stop inflating `producer_budget = max(timeout, 600)`. The caller's `timeout` is the wall; the watchdog enforces the hard ceiling. ci/util/serial_interface.py::FbuildSerialAdapter.read_lines: - Signal the producer thread to exit promptly when the async generator consumer breaks/returns. Without this, the producer kept running for its full `timeout` window after the consumer was done, leaving zombie producers in the single-worker executor that blocked the NEXT call's producer. Together these eliminate the original 20-minute-wrapper-hang on a ~30 ms device-side test. Validated end-to-end against a Teensy 4.0 running OBJECT_FLED autoresearch: wrapper now reports the real PASS/FAIL within seconds and the previously-invisible per-pattern REMOTE messages flow through cleanly. Closes the wrapper half of #3219. The daemon-side throughput cap is fixed upstream in FastLED/fbuild#750. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(autoresearch): deadline-epoch --timeout + daemon-thread watchdog (#3219) Two coupled changes turning `--timeout N` into an honest hard wall: ci/autoresearch/context.py: - Add `deadline_epoch` (set by `start_timeout_epoch()` right after deploy succeeds) and `remaining_seconds()` helper. Every downstream RPC call computes its per-call budget as `ctx.remaining_seconds(minimum=1.0)` so the user's `--timeout` is a TOTAL wall-clock budget, not a silently-multiplied per-call value. Pre-fix `--timeout 60` ran ~15 min wall-clock because `client.send(..., timeout=120.0)` was hardcoded at the call site and the wrapper's post-ACK extension added another 600 s on top. ci/autoresearch/phases.py: - Stamp `ctx.deadline_epoch` post-flash, pre-connect. Replace the hardcoded `timeout=120.0` at the runSingleTest call site with `ctx.remaining_seconds(minimum=1.0)`. Print a one-line advisory so the user sees the deadline armed. - Spawn a daemon `threading.Thread` watchdog at deadline+20s (NOT an asyncio task -- a wedged event loop or starved single-worker executor cannot starve a real OS thread). On expiry: emit a red `blocker_alert` banner, dump stack traces across every live thread (top 4 frames each via `sys._current_frames()`), best-effort close the serial port from a side-thread with a 2 s join cap, then `os._exit(2)` to bypass any stuck asyncio teardown. - `extend_autoresearch_watchdog_deadline(ctx, N)` grants extra runway to known-slow cleanup phases without disarming the bomb. Used in the run finally-block (10 s extension covering `await client.close()`); the safety net stays armed in case cleanup itself wedges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(teensy4/flexpwm-rx): remove per-frame FL_WARN debug dumps (#3219) PR #3216 enabled `[FlexPWM CFG] / DMA / RAW / EDGE / E / DECODE` FL_WARN_F sites by default in this file to surface the bimodal-edge symptom during the #3066 Phase 4 investigation. The dual-circuit capture refactor + midpoint classifier fix in this PR series eliminated the bug those dumps were diagnosing. Now those 18 dump lines per frame are pure UART output volume: - on a 100-LED OBJECT_FLED autoresearch test they accounted for ~24 % of the device's per-pattern serial output; - the device's UART TX FIFO blocks once it fills, stalling the test between patterns; - even with the wrapper-side fix removing serial-bandwidth as the bottleneck, the device-side FL_WARN-induced TX backpressure remained measurable on bursty tests. Removed: - `#ifndef FL_RX_FLEXPWM_QUIET` opt-in block at file top (+ matching trailing `#undef FL_DEBUG`). - All 6 `#ifdef FL_DEBUG` blocks inside `decodeEdges`, `configureFlexPwm`, `configureDma`, and `buildEdgeTimesFromCaptures` (8 dump-line WARN sites total). If diagnostic dumps are needed again, gate them behind a new `-DFL_RX_FLEXPWM_VERBOSE=1` compile flag rather than hardcoded `#define FL_DEBUG 1` in this header (the latter caused header-pollution problems for downstream files that also test `#ifdef FL_DEBUG`). The two legitimate "Pin not supported on Teensy 4.x" FL_WARN_F sites in `factory_create_flexpwm_rx_channel` are preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(ci): tiny pyserial bench helpers for autoresearch RPC debugging (#3219) Two narrow scripts used during the #3219 bring-up session. Both bypass the autoresearch wrapper's serial filter so you can see every line the device emits in real time -- useful for diff'ing direct serial behavior against the wrapper's stream. - `ci/autoresearch/debug_find_pins.py` -- sends `ping` + `findConnectedPins` via direct pyserial and times the round-trip. Used to confirm device-side firmware was healthy when the wrapper was hanging. - `ci/autoresearch/debug_object_fled_small.py` -- sends `runSingleTest` with a configurable lane size. Used to isolate buffer-position- dependent decode failures (5-LED passes proved the bug was in the shared-register FIFO race, not a per-bit timing issue). Both keep their own KBI-001-compliant interrupt handlers and dump raw bytes for forensic analysis. Companion to the earlier `debug_object_fled_capture.py` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(deps): bump fbuild 2.2.28 -> 2.3.0 for daemon serial WS batching (#3219) Pulls in FastLED/fbuild#750 (closes FastLED/fbuild#749) -- the daemon- side fix for the serial WebSocket bridge that silently dropped lines under sustained TX bursts. Pre-#750 the daemon sent one WS frame per serial line, the broadcast channel lagged on ~320-line bursts (e.g. 4-pattern OBJECT_FLED autoresearch), and lines vanished via RecvError::Lagged. This is the upstream half of the FastLED #3219 fix. The wrapper-side half landed in the prior commits on this branch (rpc_client.py, serial_interface.py, autoresearch/{context,phases}.py). Verified: `uv run fbuild --version` -> `fbuild 2.3.0`, lint stays green, `bash autoresearch teensy40 --object-fled --strip-sizes 5` completes end-to-end with 4/4 patterns through the wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#756) (#761) PR #750 split `handle_serial_ws` into reader / writer / inbound tasks for throughput. As a side effect the `ClearBuffer` and `GetInWaiting` RPCs became logged no-ops -- both reach into the broadcast receiver `rx`, which is now owned exclusively by the reader task. Inbound couldn't touch it. This commit restores both via a new internal control channel: ```rust enum ReaderControl { Drain { reply: oneshot::Sender<usize> }, GetDepth { reply: oneshot::Sender<usize> }, } ``` Wiring: - Unbounded mpsc `(control_tx, control_rx)` created alongside the existing out-mpsc. - Reader's `tokio::select!` adds a branch on `control_rx.recv()`. `biased;` keeps broadcast forwarding the priority so a burst of inbound control requests cannot starve forwarding. - Inbound's `ClearBuffer` handler sends `Drain` over the control channel and awaits the oneshot reply with the drop count. - Inbound's `GetInWaiting` handler sends `GetDepth`, awaits the oneshot reply, and ships the `InWaiting { count }` response through the writer mpsc (preserves the writer-is-sole-WS-sink invariant from #750). Race safety: if the reader has exited between send and recv (session teardown race), the oneshot resolves with `Err` and the inbound handler logs debug / falls back to `count=0` rather than crashing. No API change visible to clients -- the SerialClientMessage protocol is unchanged, only the internals. Build: `soldr cargo build -p fbuild-daemon` clean. Refs #755 (meta). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #749. Replaces the daemon's single-loop serial-WebSocket bridge with a producer/consumer/inbound triple-task split so a slow socket flush no longer head-of-line blocks the broadcast drain, eliminating the silent-line-drop behavior FastLED #3219 traced back to here.
What changed
crates/fbuild-daemon/src/handlers/websockets.rs::handle_serial_ws():recv().await, thentry_recv()-drains everything that arrived during the previous socket flush. Coalesces adjacent Data messages into oneData { lines, current_index }.Test plan
soldr cargo build -p fbuild-daemon-> cleanbash autoresearch teensy40 --object-fled --strip-sizes 5->Tests: 4/4 passed, Duration: 13ms,RESULT PASS 1 test(s). Pre-fix this exact command saw Pattern A then 720 s timeout.Notes on ClearBuffer / GetInWaiting semantics
Both used to drain / report on the local
rxbroadcast receiver. Post-split that receiver lives in the READER task; the inbound handler cannot reach it from its own task. ClearBuffer is now a logged no-op; GetInWaiting returns 0. The pre-split use case (host discards pre-RPC boot banner before sending a command) is largely served by the READER's lag-handling now. If we need true drain semantics back, the inbound task should send a control-message into the READER's loop via an additional channel.🤖 Generated with Claude Code
Summary by CodeRabbit