fw/drivers/imu/lis2dw12: harden against FIFO/INT1 stream stalls #1484

Open
jplexer wants to merge 5 commits into coredevices:main from jplexer:jp/firm-2490-accel-stream-stall
@jplexer jplexer commented Jun 11, 2026

Member

Summary

Investigation of FIRM-2490 ("Step count not working, started working after toggling health") showed the accel sample stream on obelix stalls: steps stop counting, activity sees frozen orientation, and stationary mode engages on a moving wrist. Recreating the accel session (the health toggle) fixed it because it fully re-arms the FIFO and INT1 — this series makes the driver do that itself.

Root cause mechanics:

  • INT1 is edge-triggered while the sensor's FIFO-threshold output is level-based: if the drain work is delayed past FIFO overrun (~280 ms slack at 25 Hz), the line latches high and no further edges arrive.
  • The existing INT1 watchdog timed the last INT1 edge, but shake/wake-up function interrupts share the INT1 pad (CTRL7 INT2_ON_INT1), so wrist motion kept feeding the watchdog while the FIFO stream was dead — fully silent stalls.
  • Its recovery only rewrote FIFO_CTRL (field logs show it firing 3x in 52 s without keeping the stream alive) and discarded queued samples.
  • accel_peek returns the cached last FIFO sample during active sampling, so a dead stream freezes stationary/orientation data and the stall masks itself.
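
The ~280 ms slack quoted in the first bullet can be reproduced with a little arithmetic. The sketch below assumes the LIS2DW12's 32-sample FIFO and a threshold of 25 samples (inferred from the 1000 ms period at 25 Hz mentioned elsewhere in this PR); the helper name is hypothetical and not taken from the driver.

```c
#include <stdint.h>

/* LIS2DW12 FIFO depth in samples. */
#define FIFO_DEPTH_SAMPLES 32u

/*
 * Time between the FIFO-threshold level asserting and the FIFO overrunning,
 * i.e. how late the drain work may run before the line can latch high.
 * Hypothetical helper, for illustration only.
 */
static uint32_t fifo_overrun_slack_ms(uint32_t odr_hz, uint32_t threshold_samples) {
  if (odr_hz == 0u || threshold_samples >= FIFO_DEPTH_SAMPLES) {
    return 0u;
  }
  /* Samples of headroom above the threshold, converted to milliseconds. */
  return ((FIFO_DEPTH_SAMPLES - threshold_samples) * 1000u) / odr_hz;
}
```

With `odr_hz = 25` and `threshold_samples = 25`, this gives (32 - 25) x 40 ms = 280 ms.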

Changes

  1. Watchdog tracks successful FIFO burst reads instead of INT1 edges, with a 2x period margin (1x tripped on normal jitter: 1004 ms vs 1000 ms period).
  2. Shared recovery helper (watchdog + overrun path): quiesce INT routing (forces a latched-high pad to produce a fresh edge), drain queued samples, re-assert ODR/FIFO/INT routing, clear latched sources, log a recovery counter.
  3. accel_peek detects stale cached data and queues the stall check itself, breaking the self-masking loop.
  4. Drain the FIFO on subscriber reconfiguration instead of discarding up to a full FIFO of samples (resolves existing FIXME).
  5. Floor the stall threshold at 1 s so small-batch subscribers (e.g. per-sample at 25 Hz) don't trip spurious recoveries.

Fixes FIRM-2490
Related: FIRM-2285, FIRM-1626, FIRM-1141

Testing

  • obelix_pvt and getafix_dvt (both lis2dw12 boards) build clean
  • ./waf test green (200 suites, including test_accel_manager)
  • On-watch validation pending (QEMU does not emulate the lis2dw12): watch for the log lines "FIFO stream stalled for N ms" / "Recovering accel stream (count N)" followed by steps continuing to count, and confirm there are no recovery storms

🤖 Generated with Claude Code

jplexer and others added 5 commits June 11, 2026 15:40
…tall detection

The INT1 watchdog used the time of the last INT1 edge to decide whether
the FIFO stream had stalled. Shake/wake-up function interrupts are
routed onto the same INT1 pad (CTRL7 INT2_ON_INT1), so wrist motion
kept kicking the watchdog while the FIFO threshold stream itself was
dead, leaving the stall undetected until the accel session was
recreated (e.g. by toggling Health on the phone). Field logs from
FIRM-2490 show exactly this: steps frozen and orientation stale with no
watchdog warnings at all.

Track the time of the last successful FIFO burst read instead, which is
the actual signal we care about. The trip threshold gains a 2x margin
over the FIFO threshold period: the old 1x threshold tripped at 1004 ms
against a 1000 ms period, i.e. on normal scheduling jitter.

Fixes FIRM-2490

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Joshua Jun <lets@throw.rocks>
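
A minimal sketch of the idea in this commit, assuming a millisecond tick and hypothetical names (`StreamWatchdog`, `watchdog_note_fifo_read`, and so on are illustrative, not the driver's identifiers): only a successful FIFO burst read refreshes the timestamp, while function interrupts that share the INT1 pad leave it untouched.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical watchdog state; field names are illustrative. */
typedef struct {
  uint32_t last_fifo_read_ms; /* time of the last successful FIFO burst read */
  uint32_t fifo_period_ms;    /* expected interval between FIFO threshold events */
} StreamWatchdog;

/* Called only after a FIFO burst read succeeds. */
static void watchdog_note_fifo_read(StreamWatchdog *wd, uint32_t now_ms) {
  wd->last_fifo_read_ms = now_ms;
}

/* Shake/wake-up interrupts share the INT1 pad but do not feed the watchdog. */
static void watchdog_note_function_int(StreamWatchdog *wd, uint32_t now_ms) {
  (void)wd;
  (void)now_ms;
}

/* True when no FIFO read has succeeded for more than 2x the expected period. */
static bool watchdog_stream_stalled(const StreamWatchdog *wd, uint32_t now_ms) {
  uint32_t elapsed_ms = now_ms - wd->last_fifo_read_ms; /* unsigned: wrap-safe */
  return elapsed_ms > 2u * wd->fifo_period_ms;
}
```

With a 1000 ms period, a read that arrives 1004 ms after the previous one no longer trips the check, and a shake interrupt in the middle of a stall does not postpone detection.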
…very

The watchdog and overrun recovery paths only rewrote FIFO_CTRL. That
discards up to a full FIFO (~1.3 s of samples at 25 Hz) on every
recovery, and it cannot repair upsets to ODR (CTRL1) or INT routing
(CTRL4/5/7) — field logs from FIRM-2490 show recovery firing three
times in under a minute without keeping the stream alive.

Introduce a shared recovery helper used by both the watchdog and the
FIFO overrun branch that quiesces INT routing (forcing a latched-high
pad low so re-enabling yields a fresh rising edge), drains queued
samples before they are lost to the bypass write, re-asserts ODR, FIFO
mode and INT routing, and clears latched function INT sources. The
previously unused num_recoveries counter is now incremented and logged
so field logs can distinguish first-time from repeat stalls.

Drained samples are timestamped at read time like regular FTH reads,
so their timestamps are late by up to one FIFO period; this is
unchanged from the existing behavior.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Joshua Jun <lets@throw.rocks>
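
The recovery sequence described in this commit can be sketched against a fake sensor. Everything here is an assumption made for illustration: `FakeSensor` stands in for the real registers, `DriverState` for the driver's desired configuration, and `recover_stream` is a hypothetical name, not the helper added by the PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for sensor state; a real driver would use register reads/writes. */
typedef struct {
  uint8_t odr;         /* stands in for CTRL1 */
  uint8_t fifo_mode;   /* stands in for FIFO_CTRL */
  uint8_t int_routing; /* stands in for CTRL4/5/7 */
  uint8_t fifo_level;  /* samples currently queued in the FIFO */
  bool latched_src;    /* a latched function interrupt source */
} FakeSensor;

/* Configuration the driver wants, plus bookkeeping. */
typedef struct {
  uint8_t odr;
  uint8_t fifo_mode;
  uint8_t int_routing;
  uint32_t samples_delivered;
  uint32_t num_recoveries;
} DriverState;

/* Shared by the watchdog and the overrun branch in this sketch. */
static void recover_stream(FakeSensor *hw, DriverState *drv) {
  /* 1. Quiesce INT routing so a latched-high pad drops low. */
  hw->int_routing = 0u;
  /* 2. Drain queued samples before the FIFO mode rewrite would discard them. */
  drv->samples_delivered += hw->fifo_level;
  hw->fifo_level = 0u;
  /* 3. Re-assert ODR, FIFO mode and INT routing from the desired config. */
  hw->odr = drv->odr;
  hw->fifo_mode = drv->fifo_mode;
  hw->int_routing = drv->int_routing;
  /* 4. Clear latched function interrupt sources. */
  hw->latched_src = false;
  /* 5. Count the recovery so logs can tell first-time from repeat stalls. */
  drv->num_recoveries++;
}
```

Zeroing the fake sensor's fields before calling `recover_stream` mimics the fault-injection approach mentioned later in the thread: the configuration is restored from the driver's own state rather than assumed to be intact.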
…ling

While sampling is active, accel_peek returns the last FIFO sample, so a
stalled stream freezes the data consumed by stationary-mode motion
detection and activity orientation checks. The stall then masks itself:
the watch reports being motionless and flat, enters stationary mode on
a moving wrist, and never logs anything (FIRM-2490).

Detect staleness directly in accel_peek and queue the shared stall
check, giving a second, caller-driven trigger alongside the watchdog
timer. The check re-validates staleness on the serialized driver work
queue, so schedule/execute races with reconfiguration are benign, and
a pending flag prevents frequent peek callers from flooding the work
queue.

The cached (stale but bounded) sample is still returned rather than an
error or a one-shot measurement: an error would regress stationary
detection, and the one-shot path rewrites CTRL1 mid-stream and blocks
the caller for up to 100 ms.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Joshua Jun <lets@throw.rocks>
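
A sketch of the peek-side trigger, under stated assumptions: the types and function names are hypothetical, a counter stands in for the driver work queue, and the staleness threshold is passed in rather than derived. It shows the two properties the commit message calls out: a pending flag that stops frequent callers from flooding the queue, and a re-validation step when the queued work actually runs.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
  int16_t x, y, z;
} Sample;

typedef struct {
  Sample cached;               /* last FIFO sample */
  uint32_t cached_at_ms;       /* when the cached sample was read */
  uint32_t stall_threshold_ms; /* staleness limit */
  bool check_pending;          /* a stall check is already queued */
  uint32_t checks_queued;      /* stands in for the real work queue */
  uint32_t recoveries;
} PeekState;

static bool is_stale(const PeekState *s, uint32_t now_ms) {
  return (now_ms - s->cached_at_ms) > s->stall_threshold_ms;
}

/* Caller side: always returns the cached sample, queues at most one check. */
static Sample peek(PeekState *s, uint32_t now_ms) {
  if (is_stale(s, now_ms) && !s->check_pending) {
    s->check_pending = true;
    s->checks_queued++;
  }
  return s->cached;
}

/* Work-queue side: re-validate before recovering, since state may have changed. */
static void stall_check_work(PeekState *s, uint32_t now_ms) {
  s->check_pending = false;
  if (is_stale(s, now_ms)) {
    s->recoveries++; /* the real driver would run its shared recovery here */
  }
}
```

Returning the cached sample unconditionally mirrors the design choice explained above: the caller never blocks and never sees an error, and the stall is handled out of band.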
accel_set_num_samples discarded any samples queued in the FIFO via the
bypass write, losing up to a full FIFO of data on every subscriber
reconfiguration. Drain the FIFO first when sampling was previously
active, resolving the long-standing FIXME.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Joshua Jun <lets@throw.rocks>
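
The drain-before-bypass ordering can be illustrated with a fake FIFO. This is a sketch with hypothetical names (`FakeFifo`, `reconfigure`), not the code in `accel_set_num_samples`; it only shows that queued samples are copied out when sampling was previously active, and that the bypass write still runs in every case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 32u

typedef struct {
  int16_t samples[FIFO_DEPTH];
  size_t level; /* number of queued samples */
} FakeFifo;

/* Bypass write: empties the FIFO and discards whatever is still queued. */
static void fifo_bypass(FakeFifo *f) {
  f->level = 0u;
}

/* Copy all queued samples into `out`; returns how many were copied. */
static size_t fifo_drain(FakeFifo *f, int16_t out[FIFO_DEPTH]) {
  size_t n = f->level;
  for (size_t i = 0u; i < n; i++) {
    out[i] = f->samples[i];
  }
  f->level = 0u;
  return n;
}

/* Reconfiguration: drain first if a stream was running, then bypass. */
static size_t reconfigure(FakeFifo *f, bool was_sampling, int16_t out[FIFO_DEPTH]) {
  size_t delivered = 0u;
  if (was_sampling) {
    delivered = fifo_drain(f, out);
  }
  fifo_bypass(f);
  return delivered;
}
```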
The stall threshold scales with the FIFO threshold period, which is
derived from the most demanding subscriber. An app requesting
per-sample updates at 25 Hz yields a 40 ms period and thus an 80 ms
threshold, which normal work-queue latency can exceed, tripping
spurious recoveries. Floor the threshold at one second, matching the
watchdog timer granularity.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Joshua Jun <lets@throw.rocks>
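
The floored threshold works out as follows. This is a minimal sketch with hypothetical helper and macro names; the 2x margin and the one-second floor are the values given in this PR.

```c
#include <stdint.h>

#define STALL_MARGIN 2u        /* multiple of the FIFO threshold period */
#define STALL_FLOOR_MS 1000u   /* matches the watchdog timer granularity */

/* FIFO threshold period for a subscriber wanting `samples_per_update` samples. */
static uint32_t fifo_period_ms(uint32_t odr_hz, uint32_t samples_per_update) {
  if (odr_hz == 0u) {
    return 0u;
  }
  return (samples_per_update * 1000u) / odr_hz;
}

/* Stall threshold: 2x the period, but never below the floor. */
static uint32_t stall_threshold_ms(uint32_t period_ms) {
  uint32_t threshold_ms = STALL_MARGIN * period_ms;
  return (threshold_ms < STALL_FLOOR_MS) ? STALL_FLOOR_MS : threshold_ms;
}
```

A per-sample subscriber at 25 Hz has a 40 ms period, so the unfloored threshold would be 80 ms; with the floor it becomes 1000 ms. A 1000 ms period is unaffected and keeps its 2000 ms threshold.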
@jplexer jplexer requested a review from gmarull as a code owner June 11, 2026 16:04

@gmarull gmarull left a comment

Member

let me take this carefully, first question: has this been tested on hw?

jplexer commented Jun 11, 2026

Member Author

regression-tested normal steps/shake on getafix_dvt2, and verified the recovery path with a fault-injection build (killed INT routing / FIFO mode / ODR behind the driver's back), all recovering successfully
