feat(util): SnapshotPublisher — lock-free single-writer seqcount snapshot#653
feat(util): SnapshotPublisher — lock-free single-writer seqcount snapshot#653sritchie wants to merge 2 commits into
Conversation
…shot Foundation for the rate-decoupled filter stack outlined in local-plans/PLAN_RATE_DECOUPLED_FILTER.md. This commit lands the infrastructure only — no callsite migrations yet. Future commits will introduce g_AhrsSnapshot / g_FlapSnapshot / g_SensorSnapshot publishers and migrate web handlers, display task, log producer, and housekeeping off xAhrsMutex onto this primitive. The pattern (Linux seqcount-style version-counter with release/acquire fences across the data memcpy) gives: - Lock-free writer: publish() never blocks. Writer rate scales with IMU rate (208/416 Hz today, potentially 1666 Hz once #645 lands). - Lock-free readers: any number of concurrent readers; never block the writer; never block each other. - Coherent reads: read() spins until the seqcount agrees across the data copy. Never returns torn data. The writer's publish completes in bounded time (one memcpy of trivially-copyable bytes), so starvation is impossible. - Bounded-retry tryRead() for real-time-deadlined consumers that would rather skip a frame than spin. Returns false on bailout; caller is responsible for not acting on `out` in that case. Single-writer is a hard requirement; UB if violated. For multiple producers, use multiple SnapshotPublisher instances and have readers compose them (the natural pattern for #645's predict-publishes-on- ImuReadTask + correct-publishes-on-SensorReadTask split). Memory ordering uses release-store on writer side around the memcpy, acquire-load on reader side around the memcpy, plus explicit atomic_thread_fence(release/acquire) before/after the memcpy. The fences are belt-and-suspenders against speculative reordering — on x86 they're no-ops (TSO model), on ARM/Xtensa they emit the right barriers (dmb / memw). Initial implementation without the fences exhibited torn-read bugs caught by the native race tests in this commit; the fences fix them. Verification Native test suite covers: - Default state (read before publish returns zeroed payload) - Single-threaded round-trip - Most-recent-publish-wins semantics - tryRead success when idle - Diagnostic version counter advances correctly - Concurrent sequence-counter monotonicity (writer + reader threads, 200000 reads; asserts the seqcount catches every torn read) - Concurrent two-pattern integrity (writer alternates 0xAA/0x55; reader never observes a mix; 100000 reads) - tryRead coherence under tight-loop write pressure (asserts every true return is coherent; bailout rate not gated since pathological test load far exceeds real-hardware publish rates) - Pointer-bearing payload round-trip pio test -e native: 1177/1178 passed, 1 skipped (host_main_cli, as always). 9 new tests added. pio run -e esp32s3-v4p: SUCCESS, no firmware bloat (header-only template, only ~200 bytes when instantiated per payload type). Related - local-plans/PLAN_RATE_DECOUPLED_FILTER.md — composed sequence this commit is step 1 of. - #645 — LogProducerTask architecture that depends on this primitive. - #644 — atomic AHRS snapshots called out as a deferred 416 Hz item; this is the infrastructure for that work. - #649 — AHRS lifecycle cleanup; folds in once the migration commits start touching AHRS state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Firmware ArtifactsMain firmware (Gen3 box)
Each External display firmware
Each The huVVer-AVI binary is built but not yet validated on real hardware — see the bring-up checklist for what to verify on first flash. X-Plane plugin
Drop the Built from
|
Review against
|
Two independent bulldog reviews converged on the same set of fixes, which together harden the contract and tests without changing runtime behavior: 1. Test tolerances tightened to 0. test_concurrent_seq_is_monotonic and test_concurrent_two_pattern_integrity previously allowed up to 1 torn read in 100000–200000 iterations. read() is documented as "NEVER returns torn data" — a nonzero allowance contradicts the contract. Both tests now assert == 0. If they ever fail on CI, the response is to find the race, not widen the tolerance. 2. Fence comments rewritten to be honest. The standalone atomic_thread_fence calls in publish() and read()/tryRead() are STRICTLY REDUNDANT with the release-store / acquire-load that surround them under the C++20 memory model. They're kept as belt-and-suspenders against a future edit that accidentally downgrades a version_.store/load to memory_order_relaxed. Cost is one MEMW per op on Xtensa (~5 cycles); negligible against the memcpy. The old comments claimed the fences were load-bearing, which would mislead the next reader; rewritten to identify the actual load-bearing sync (the release-store of v+2 + acquire-load of v2) and the defensive role of the fences. 3. read() doc now warns DEADLINED CALLERS MUST USE tryRead(). Without the warning, a future contributor could reach for read() from a real-time path because it's the friendlier API, not realizing that read() can spin for the duration of one publish under contention — or longer if the writer task is preempted mid-publish (e.g. by a high-priority interrupt). The audio task, IMU task, and deterministic-cadence display task all must use tryRead() and skip frames on bailout. 4. Per-read latency claim corrected. The previous "~100 ns memcpy for 256 bytes" estimate ignored the per-op atomic-load barrier cost. Realistic Xtensa LX7 at 240 MHz: ~25–40 cycles memcpy for a 100-byte payload (~150 ns) + 2 acquire-loads with MEMW (~5 cycles each); ~150–200 ns per read total. Still 10–50× better than the mutex it replaces, just not the optimistic number originally documented. 5. New test_version_wrap_arithmetic. At 416 Hz × 2 increments per publish, version_ wraps every ~59 days of continuous operation. Not reachable in a single flight, but worth locking the property: the parity check (v & 1u) and equality comparison (v1 == v2) must remain correct across uint32_t wrap. The math is right (unsigned arithmetic is wrap-defined); the new test asserts the key cases (0xFFFFFFFE → 0xFFFFFFFF → 0 don't spuriously match). 6. File-level comment fixed. Previously promised a "synthetic- bailout test below" that didn't exist. The actual test (test_tryread_coherence_under_pressure) asserts the contract that matters (every true return is verified word-for-word coherent) without over-asserting on a workload-dependent bailout rate. Comment now accurately describes what's there. Tests: 10 cases (was 9), all passing including the strict-zero tolerance and the new wrap test. Native suite: 1178 succeeded / 1 skipped (was 1177); no regressions. pio run -e esp32s3-v4p: SUCCESS Bulldog reviews this addresses: - Local Agent tool reviewer (this session) - #653 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eview Four days post May 23. The concurrency primitive landed in code review (PR #653, open). Two other substantial PRs landed: #646 (universal writer + web-mutex hardening) and #647 (experimental 416 Hz IMU rate). Major addition: A new top-level section "Step 1 in flight (PR #653, 2026-05-27)" placed right after "Stream topology and rate planning." Names the merged design explicitly: ┌─ Consumers ──────────┐ ↑ reads ┌─ LiveDataFrame composition (pure) ─┐ ↑ N lock-free reads ┌─ Per-stream SnapshotPublisher<T> ──┐ ↑ single-writer publish() per stream ┌─ Producer tasks ───────────────────┐ Per-stream lock-free seqcount snapshots (PR #653's primitive) underneath the composed LiveDataFrame view (future PR). Three real updates to the plan's framing locked in by #653: 1. Concurrency primitive is lock-free seqcount, not mutex. The May 23 reconciliation proposed extending PR #634's narrow xAhrsMutex pattern. That was wrong-design; PR #653 is right. 2. Per-stream snapshots are the foundation; LiveDataFrame is the composition view. Single-writer constraint is per-stream (ImuReadTask writes IMU; SensorReadTask writes pressures). 3. DataServer migration is the first validation target, not LogSensor. Validates the lock-free win against PR #647's 18-20 ms IMU-stall measurement. The May 23 LogSensor-first recommendation was wrong because PR #634's existing snapshot uses xAhrsMutex — exactly what the new primitive eliminates. Running scoreboard updated with May 23 → 27 cluster (#653, #647, #646, plus #647 side-effect fixes). §1 of the planning section updated to "in flight (PR #653 is commit 1)" with the merged-design explanation. "Next priority" rewritten: - Earlier options (a/b/c/d) superseded by the #653 sequence. - Now framed as a commit-by-commit sequence following the local plan (commits 2-7), each independently bench-validated. - Commit 2: migrate Housekeeping (smallest blast radius). - Commit 3: migrate every read-only consumer including DataServer — this is where the 18-20 ms IMU stall measurement validates the lock-free design. - Commit 6: the local plan needs revision before this lands — the "two writers, same publisher" approach violates SnapshotPublisher's single-writer invariant (flagged in the #653 review). - Commit 7: LogProducerTask, enables always-on 416 Hz. - User's upcoming binary + pilot-tuning logs drop in as additional writer tasks against the same snapshots. Recommendation now: (a') wait for #653 to merge, (b') start commit 2 (Housekeeping), (c') draft the commit-6 revision in parallel.
|
Addressed both bulldog reviews in commit d828052. Concrete changes:
Plan doc updates that flow from the second review:
Tests: 10 cases (was 9), all passing including the strict-zero tolerance. Native suite: 1178/1179 (1 skipped), no regressions. The bulldog review at #653 (comment) is the one that surfaced the two-writer-on-one-publisher unsafety. That alone was worth getting before this lands. |
SnapshotPublisher — foundation for the rate-decoupled filter stack
Header-only template class implementing the standard Linux-kernel-style seqcount snapshot pattern in C++20, plus a 10-test native suite that covers the race semantics. No callsite changes — this is pure infrastructure that future PRs build on.
Step 1 of the multi-commit plan in `local-plans/PLAN_RATE_DECOUPLED_FILTER.md` that decouples sample rate, AHRS update rate, and log capture rate. Each future migration commit introduces one specific publisher (`g_AhrsSnapshot`, `g_FlapSnapshot`, etc.) and migrates one or more consumers off `xAhrsMutex` onto this primitive.
What it does
```cpp
struct AhrsSnapshotPayload { /* trivially copyable */ };
SnapshotPublisher g_AhrsSnap;
// One writer task (e.g. ImuReadTask), lock-free:
g_AhrsSnap.publish(payload);
// Many readers (web handlers, display, log producer), lock-free + coherent:
const AhrsSnapshotPayload s = g_AhrsSnap.read();
// Or with explicit bailout for real-time-deadlined callers:
AhrsSnapshotPayload s;
if (g_AhrsSnap.tryRead(s)) { /* coherent / } else { / skip */ }
```
Why seqcount instead of a mutex
Today's pattern (`xAhrsMutex`, established by PR #634 closing #520) requires every web handler / display / log consumer to take the same mutex as `ImuReadTask`. Under web load this contention produced 18-20 ms IMU stalls every config save (PR #647 bench data) and 50 ms-class web-handler timeouts. The seqcount pattern eliminates the cross-core contention entirely:
Memory ordering
Initial implementation without explicit `atomic_thread_fence` calls between the memcpy and the seqcount loads produced torn reads in the race tests. The current implementation adds defensive release/acquire fences around the memcpy on both sides. These are STRICTLY REDUNDANT with the release-store / acquire-load semantics already on the version counter — kept as belt-and-suspenders against a future edit that downgrades the atomic ops to relaxed by accident. The header documents which sync is load-bearing vs defensive.
Tests
10 native tests, all passing:
```
test_default_read_returns_zero_payload
test_publish_then_read
test_most_recent_publish_wins
test_tryread_succeeds_when_idle
test_version_for_telemetry_advances
test_concurrent_seq_is_monotonic # writer+reader threads, 200000 reads, asserts ==0 torn
test_concurrent_two_pattern_integrity # alternating 0xAA/0x55, 100000 reads, asserts ==0 mixed
test_tryread_coherence_under_pressure # asserts every true return is coherent
test_pointer_bearing_payload_round_trip
test_version_wrap_arithmetic # parity + equality across uint32_t wrap
```
The strict-zero tolerance on the concurrent coherence tests is load-bearing: `read()` is documented as "NEVER returns torn data," and any nonzero count is a real race in the seqcount, not test flakiness.
Verification
```
pio run -e esp32s3-v4p # SUCCESS, header-only template (no firmware bloat)
pio test -e native # 1179 cases: 1178 succeeded, 1 skipped
```
Bulldog review feedback addressed
Two independent reviewers converged on the same fixes; a second commit on this branch (`d8280524`) addresses all of them:
Migration sequencing
Plan now calls for DataServer to migrate first (validates the lock-free win against PR #647's measured 18-20 ms IMU stalls under web load), then LogSensor::Write, then web handlers, then Housekeeping. Original plan had Housekeeping first; per bulldog feedback, DataServer is the right load-bearing first migration since it tests against the actual observed contention scenario.
Two-writer constraint
Single-writer is a hard invariant of this primitive. The original plan's "two writers publishing to one snapshot" option (open question 1 in the plan doc) is unsafe — two interleaved `version_.load(relaxed) → store(v+1) → store(v+2)` sequences can leave the counter persistently odd. Plan now mandates separate publishers per writer (e.g. `g_AhrsPredictSnap` and `g_AhrsCorrectSnap` in the future predict/correct split), composed at a view layer.
Related