EKFQ: Cholesky-failure observability counter (partial #593) #621

Open
sritchie wants to merge 7 commits into master from chore/ekfq-cholesky-counter

@sritchie
Collaborator

EKFQ: Cholesky-failure observability counter

When EKFQ::correct() aborts on a non-PSD Cholesky diagonal (sum <= 0.0f at EKFQ.cpp:601), the entire 8-measurement batch update is silently dropped. The filter falls back to predict-only for that frame and the pilot has no way to know. This PR adds a session-persistent counter and surfaces it via the existing TASKS console command when EKFQ is the active algorithm.

Strict subset of #593 item #1. Items #2 (stack hoist — filed separately as #617), #3 (log provenance — deferred for coordinated log-with-config work), and #4 (audio-sweep τ — currently a no-op since shared kCompFadeTauSec makes sweep.py's hardcoded 0.5 correct) are out of scope.

Changes

  • EKFQ gains three private uint32_t members (updateCallCount_, failedUpdateCount_, lastFailedCallNum_) and three public accessors. Counters survive init() — a failure burst before a reseed is still diagnostic.

  • ++updateCallCount_ fires at the top of EKFQ::correct(). (It originally sat at the top of EKFQ::update(), but EkfqPipeline::Step() calls predict() and correct() separately rather than going through update(), so correct() is the only site that fires on the production path. Caught by the final whole-branch review; fixed in commit 2ee45b2d.)

  • The Cholesky guard inside correct() bumps failedUpdateCount_ and stamps lastFailedCallNum_ before its early return.

  • The sketch-side AHRS wrapper exposes pass-through accessors EkfqUpdateCallCount(), EkfqFailedUpdateCount(), EkfqLastFailedCallNum(), plus IsEkfqActive().

  • The TASKS console command appends one line when EKFQ is active:

    EKFQ              0 failed updates (last @ call #0, 12459 total)
    

    (Three values: failure count, last-failed call number, total correct() call count. The total denominator was an addition during implementation — the spec called for two values; the third makes the counter ratio interpretable post-flight without external context.)

    Width %-16s matches PrintTaskInfo's existing format for column alignment.

  • 6 new native unit tests in test/test_ekfq/test_ekfq.cpp:

    • test_ekfq_counter_starts_at_zero
    • test_ekfq_counter_unchanged_on_normal_update
    • test_ekfq_counter_bumps_on_degenerate_S
    • test_ekfq_counter_persists_across_init
    • test_ekfq_counter_increments_by_one_on_batch_failure (documents batch semantics — one failure-per-correct(), not one per measurement)
    • test_ekfq_counter_advances_via_direct_predict_correct (locks in the production-path fix from 2ee45b2d)
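
The counter semantics listed above can be sketched in isolation. This is a minimal illustration, not the code in EKFQ.cpp: the class name `EkfqCounterSketch`, the `pivot` parameter (a stand-in for the Cholesky diagonal term), the placeholder state, and the 1-based call numbering are all assumptions. Only the member and accessor names come from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for EKFQ: only the counter bookkeeping is modelled,
// none of the filter math.
class EkfqCounterSketch {
public:
    // Reseeds the placeholder filter state. The counters are deliberately
    // not reset, so a failure burst before a reseed stays visible.
    void init() { state_ = 0.0f; }

    // Placeholder propagation step; touches no counters.
    void predict() { state_ *= 1.0f; }

    // `pivot` stands in for the Cholesky diagonal term checked by the guard.
    // Returns false when the whole batch update is dropped.
    bool correct(float pivot) {
        ++updateCallCount_;  // counted on every call, success or failure
        if (pivot <= 0.0f) {
            ++failedUpdateCount_;                   // one bump per correct() call
            lastFailedCallNum_ = updateCallCount_;  // assumed 1-based call number
            assert(failedUpdateCount_ <= updateCallCount_);
            return false;
        }
        state_ += pivot;  // placeholder for the measurement update
        return true;
    }

    // Convenience wrapper: inherits the increment through correct().
    bool update(float pivot) {
        predict();
        return correct(pivot);
    }

    std::uint32_t getUpdateCallCount() const { return updateCallCount_; }
    std::uint32_t getFailedUpdateCount() const { return failedUpdateCount_; }
    std::uint32_t getLastFailedCallNum() const { return lastFailedCallNum_; }

private:
    float state_ = 0.0f;
    std::uint32_t updateCallCount_ = 0;
    std::uint32_t failedUpdateCount_ = 0;
    std::uint32_t lastFailedCallNum_ = 0;
};
```

Because the increment lives in correct(), a caller that invokes predict() and correct() directly advances the total exactly as a caller of update() does, which is the property the last test in the list is meant to lock in.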

Relationship to PERF telemetry (#605/#612/#615)

This counter is intentionally separate from the PERF subsystem. PERF measures wall-clock duration of EkfqCorrect scopes and is compile-gated (-DONSPEED_PERF_ENABLED, only built in the esp32s3-v4p-perf env). Production firmware carries zero PERF instrumentation by design.

PERF cannot answer this counter's question: "of the 208 EkfqCorrect calls this second, did the Cholesky guard fire?" A scope emits one event per call whether correct() did real work or hit sum<=0 and bailed. The new counter is always-on in production firmware and per-session persistent.
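
A rough sketch of why a scope-based measurement cannot stand in for the counter. Nothing here is the real PERF API: `ScopeStats`, `ScopedEvent`, `GuardCounter`, and `correctSketch` are hypothetical names, and the scope only counts events instead of timing them.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical event sink; the real PERF subsystem records durations.
struct ScopeStats {
    std::uint32_t events = 0;
};

// RAII scope: emits exactly one event when it goes out of scope,
// regardless of which return path the enclosing function takes.
class ScopedEvent {
public:
    explicit ScopedEvent(ScopeStats& stats) : stats_(stats) {}
    ~ScopedEvent() { ++stats_.events; }
    ScopedEvent(const ScopedEvent&) = delete;
    ScopedEvent& operator=(const ScopedEvent&) = delete;

private:
    ScopeStats& stats_;
};

// Hypothetical always-on counter pair, mirroring the idea in this PR.
struct GuardCounter {
    std::uint32_t calls = 0;
    std::uint32_t failures = 0;
};

// `pivot` stands in for the Cholesky diagonal term.
bool correctSketch(float pivot, ScopeStats& perf, GuardCounter& counter) {
    ScopedEvent scope(perf);  // fires on both return paths below
    ++counter.calls;
    if (pivot <= 0.0f) {
        ++counter.failures;   // only the guard path records a failure
        assert(counter.failures <= counter.calls);
        return false;
    }
    return true;
}
```

After any mix of good and degenerate calls, `perf.events` equals `counter.calls`, so the scope alone says nothing about how many calls bailed out; only `counter.failures` carries that information.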

Testing

pio test -e native                          # 1127 cases (1126 pass + 1 pre-existing skip)
pio test -e native -f test_ekfq             # 13/13 pass
pio run -e esp32s3-v4p                      # zero warnings, 22.2% RAM / 15.6% Flash
./tools/regression/run_snapshot.py          # all 5 fixtures bit-identical
./scripts/check_core_purity.sh              # passes (no platform API leaked into core)

Bench-test recommendations (post-merge)

  1. Cold-boot, EKFQ active, idle for 30 s. `TASKS` should show total ≈ 6240 (30 s × 208 Hz). If total is 0, the production-path wiring regressed.
  2. Cold-boot, Madgwick active. `TASKS` should omit the EKFQ line entirely (not show 0 0 0).
  3. Switch Madgwick → EKFQ mid-session. The EKFQ line appears after the switch; total reflects only post-switch correct() calls.
  4. Long-duration run (15+ min) on a healthy EKFQ unit. total keeps climbing, failed updates stays at 0.
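
The expectations in the checklist above can be written down as a small helper for eyeballing bench output. This is a sketch under assumptions: `expectedCorrectCalls` and `ekfqTasksLine` are hypothetical names, the 208 Hz rate is taken from this PR's text, and the two spaces after the `%-16s` column are inferred from the sample line rather than from the firmware source.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// correct() rate quoted in this PR; an assumption for any other build.
constexpr std::uint32_t kCorrectRateHz = 208;

// Expected total correct() calls after `seconds` of uninterrupted EKFQ uptime.
constexpr std::uint32_t expectedCorrectCalls(std::uint32_t seconds) {
    return seconds * kCorrectRateHz;
}

// Builds the diagnostic line, or an empty string when EKFQ is not the
// active algorithm (the line is omitted rather than printed with zeros).
std::string ekfqTasksLine(bool ekfqActive, std::uint32_t failed,
                          std::uint32_t lastFailedCall, std::uint32_t total) {
    if (!ekfqActive) {
        return std::string();
    }
    char buf[128];
    const int n = std::snprintf(
        buf, sizeof buf,
        "%-16s  %lu failed updates (last @ call #%lu, %lu total)", "EKFQ",
        static_cast<unsigned long>(failed),
        static_cast<unsigned long>(lastFailedCall),
        static_cast<unsigned long>(total));
    // The longest possible line fits comfortably in the buffer.
    assert(n > 0 && n < static_cast<int>(sizeof buf));
    return std::string(buf);
}
```

For the 30 s idle check, `expectedCorrectCalls(30)` is 6240; for a 15-minute run, `expectedCorrectCalls(900)` is 187200, well inside the range of a uint32_t.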

Spec → branch drift (cosmetic, not blocking)

Spec line 145 named the branch chore/ekfq-failed-update-counter; the actual branch is chore/ekfq-cholesky-counter. Purely cosmetic.

🤖 Generated with Claude Code

sritchie and others added 7 commits May 20, 2026 23:11
Adds three uint32_t instance members and three public accessors for
post-flight observability of correct()'s non-PSD Cholesky abort path.
Counters survive init() — a failure burst before a reseed is still
diagnostic.

Refs #593 item #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
update() bumps updateCallCount_ once per call. The Cholesky guard
inside correct() bumps failedUpdateCount_ and stamps
lastFailedCallNum_ before the early return. Pure observability; no
algorithmic change. Regression goldens unchanged.

Refs #593 item #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five tests covering: zero-init, monotonic call-count on normal updates,
counter bump on a synthetically degenerate measurement, counter
persistence across init(), and the batch-form single-increment
semantic (one failure per update(), not one per measurement).

Refs #593 item #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three accessors exposing EKFQ's getUpdateCallCount /
getFailedUpdateCount / getLastFailedCallNum through the sketch-side
AHRS wrapper, plus IsEkfqActive() so ConsoleSerial can decide
whether to print the diagnostic line.

Refs #593 item #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the active algorithm is EKFQ, the TASKS command appends one
line summarising the Cholesky-failure counter, last-failed call
number, and total update() call count. Omitted when Madgwick is
active so the output isn't misleading about an algorithm that's
not running.

Refs #593 item #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task 4's code-quality review renamed AHRS wrapper methods for
self-consistency:
- EkfqFailedUpdates → EkfqFailedUpdateCount
- EkfqLastFailedCall → EkfqLastFailedCallNum

Updates the in-repo plan doc to match the merged code. Also trims
the IsEkfqActive() doc comment to match the implementation (removed
caller reference per CLAUDE.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EkfqPipeline::Step() calls EKFQ::predict() and EKFQ::correct()
separately (to apply comp-fade to TAS between them), bypassing the
update() wrapper. The previous increment inside update() never fired
in production, so the TASKS console line would have printed
"0 total" forever regardless of uptime.

Move the increment to the top of correct() so it fires once per
production frame. The update() convenience wrapper transitively
inherits the increment.

Adds a regression test exercising the production path
(predict()+correct() called directly) to lock in the fix.

Refs #593 item #1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@github-actions

Firmware Artifacts

Main firmware (Gen3 box)

| Variant | Download |
| --- | --- |
| V4P (Phil's box) | onspeed-V4P-4.22.2-dev.117+11c7239.zip |
| V4B (Bob's box) | onspeed-V4B-4.22.2-dev.117+11c7239.zip |

Each .zip contains three files: firmware.bin, bootloader.bin, and partitions.bin.
For an OTA update, you only need firmware.bin — upload it at http://onspeed.local/upgrade.
For a USB flash (initial setup or recovery), you need all three — see the flashing docs.

External display firmware

| Board | Download |
| --- | --- |
| M5Stack Basic | m5-display-basic-4.22.2-dev.117+11c7239.zip |
| M5Stack Core2 | m5-display-core2-4.22.2-dev.117+11c7239.zip |
| huVVer-AVI (⚠️ unverified, see #298) | m5-display-huvver-avi-4.22.2-dev.117+11c7239.zip |

Each .zip contains firmware.bin, bootloader.bin, and partitions.bin. For an OTA update on M5Stack, hold Button B during boot to enter WiFi update mode and upload firmware.bin. For a USB flash, see the external display docs.

The huVVer-AVI binary is built but not yet validated on real hardware — see the bring-up checklist for what to verify on first flash.

X-Plane plugin

| Platform | Download |
| --- | --- |
| macOS | AOA-Tone-FlyOnSpeed-mac_x64.xpl |
| Windows | AOA-Tone-FlyOnSpeed-win_x64.xpl |
| Linux | AOA-Tone-FlyOnSpeed-lin_x64.xpl |

Drop the .xpl into X-Plane 12/Resources/plugins/AOA-Tone-FlyOnSpeed/<arch>/. Restart X-Plane to load. See the X-Plane plugin docs for install details and usage.


Built from chore/ekfq-cholesky-counter at 2ee45b2df89f3bb41cdee071df95e84fa79cd256.

Downloading these artifacts requires a GitHub login. Artifacts expire after 30 days.
