pps_integrity + stream_soak: PPS marker fidelity & stream-integrity harness #32

Draft
ringof wants to merge 46 commits into main from claude/pps-development

ringof (Owner) commented Jun 20, 2026

Summary

A measurement harness for the RX888 mk II PPS in-band marker and the
streaming path it rides on, plus the investigation it drove to a resolution.

  • pps_integrity — drives the PPS marker at 1 Hz and validates each edge
    against the sample stream.
  • stream_soak — the no-marker control.
  • tests/pps_log_stats.py — offline analysis of the -v log and the
    --statslog CSV; tests/pps_knob_sweep.sh: the -q/-p sweep.

Result: the in-band marker is byte-lossless

The headline question — does the marker drop data at production rates? — is
resolved: no. A 3 h run at 129.6 MSPS with byte-exact producer/consumer
drain counters:

| quantity | value | meaning |
|---|---|---|
| glDMACount × 16384 − delivered | +119 MB = 42.6 ppm | the apparent "loss" |
| per marker | 10.8 KB (~0.7 buffer) | the partial-buffer over-count |
| drain backlog (apiProd − apiCons) | net 0, peak ±1 buffer, 0 sustained steps | nothing orphaned |
| apiProd ≠ apiCons | 1027 / 10,800 s | producer counter independent, not mirrored |
| real clock (budget) | −0.47 ppm | TCXO ~nominal; no fast clock |
| bad_xfers, PIB, faults | 0 | clean device |

The "~42 ppm loss" we chased for weeks was a measurement artifact.
glDMACount counts each committed DMA buffer (PROD_EVENT) as a full 16 KB
regardless of fill. The marker fires a GPIF thread-switch that abandons a
partial buffer, so glDMACount × 16 KB over-counts "produced" bytes by
~10.8 KB/marker — a fixed ~42 ppm at 1 Hz, which is why it was suspiciously
constant across every run while the real clock varied. stream_soak (no
marker) shows no gap because it makes no partial buffers.
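
The arithmetic behind the constant gap can be checked in a few lines. This is a minimal sketch using the figures quoted above; the function name is hypothetical, 2 bytes per sample is taken from the tools' "samples x 2" convention, and "10.8 KB" is assumed to mean 10.8 × 1024 bytes.

```python
# Sketch: a fixed per-marker over-count becomes a constant ppm "gap".
# Assumptions (not from the firmware): KB = 1024 bytes, 2 bytes per sample.

def overcount_ppm(sample_rate_hz, bytes_per_sample,
                  overcount_bytes_per_marker, marker_hz=1.0):
    """Apparent 'loss' in ppm when every marker over-counts a fixed byte count."""
    produced_bytes_per_s = sample_rate_hz * bytes_per_sample
    overcount_bytes_per_s = overcount_bytes_per_marker * marker_hz
    return overcount_bytes_per_s / produced_bytes_per_s * 1e6


# 129.6 MSPS, 1 Hz marker, ~10.8 KB abandoned per marker
gap_ppm = overcount_ppm(129.6e6, 2, 10.8 * 1024)
print(f"apparent gap: {gap_ppm:.1f} ppm")
```

Because the over-count depends only on the marker rate and the partial-buffer size, the result does not move with the real clock offset, which matches the "suspiciously constant" behaviour described above.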

Why "no loss" (byte-exact). The firmware now exposes the producer/consumer
socket xferCount registers; their instantaneous difference — the backlog
actually in the DMA channel — holds 0 ± exactly one buffer for the entire
3 h and ends where it started. A real orphan would make the producer
permanently outrun the consumer, so the end backlog bounds the total leak at
< 6 ppb over 2.8 TB delivered. The producer counter is independent (differs
from the consumer in 1027/10,800 s), so the 0 backlog is real, not a mirror.
The "fast clock masking loss" reading is dead — the real offset is −0.47 ppm
and the +42.6 ppm gap is a constant artifact on top of it.
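
The backlog reasoning can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the counter width (32 bits) and both function names are hypothetical, and the backlog is expressed in bytes for the bound.

```python
# Sketch: signed producer/consumer backlog with counter wrap-around, and the
# leak bound implied by the end-of-run backlog.
# Assumption: 32-bit free-running counters; names are hypothetical.

def signed_backlog(prod, cons, bits=32):
    """Producer minus consumer as a signed value, tolerant of counter wrap."""
    mask = (1 << bits) - 1
    diff = (prod - cons) & mask
    if diff >= 1 << (bits - 1):
        diff -= 1 << bits          # interpret the top half of the range as negative
    return diff


def leak_bound_ppb(end_backlog_bytes, delivered_bytes):
    """Upper bound on permanently orphaned data, in parts per billion."""
    return abs(end_backlog_bytes) / delivered_bytes * 1e9


# One 16 KB buffer of residual backlog against ~2.8 TB delivered
bound = leak_bound_ppb(16384, 2.8e12)
print(f"leak bound: {bound:.2f} ppb")
```

Reading the difference as unsigned would turn a backlog of −1 buffer into a huge positive number, which is why the host has to treat the wrap-around difference as signed.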

What this PR changed about the tools

Following review feedback, the tools report measurements, not verdicts.
PASS/FAIL keys only on unambiguous faults (bad_xfers, device reset, spurious
shorts); the glDMACount gap, drain backlog, continuity dips, and rate offset
are printed as raw numbers. Interpretation lives in doc/pps_integrity.md and
in pps_log_stats.py over the --statslog CSV.

How we got here (dead ends, recorded)

  • stream_soak control — lossless, isolating the effect to the marker.
  • KBA231382 (INVALID_SEQUENCE) — ruled out; the marker isn't a CPU commit.
  • Edge / metastability — a 1 kΩ fast edge fixed spurious shorts (chatter)
    but not the gap → edge governs fidelity, not the gap.
  • Marker mechanism — firmware-confirmed: a forced GPIF thread-switch
    (TH0_PPS_COMMIT = DATA_CNT_HIT firing unconditionally), no THR_WRAPUP,
    AUTO_MANY_TO_ONE, 4 × 16 KB buffers — the source of the partial buffers.
  • "Drain-side orphan" hypothesis — refuted. Building the consumer-side
    telemetry to test it produced the byte-exact backlog, which shows no
    orphaning. (Three firmware iterations: CONS_EVENT never fires on AUTO; the
    socket read had to sum both producer sockets; the host had to read the
    wrap-around backlog as signed.) The "12-buffer dip / 26.9× boundary
    enrichment" were continuity-detector artifacts, not real loss.

Still open: data corruption (delivered ≠ correct)

Byte accounting proves delivery, not correctness. Next experiment: inject a
known 10 MHz tone, down-convert and track the unwrapped phase — a single
dropped/garbled sample is a 27.8° phase step, and correlating those with the
marker positions tests the partial-buffer splice directly (the LTC2208 has no
digital test pattern, so the tone is the ground truth).
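
The 27.8° figure is 360° × f_tone / f_sample, and a dropped sample shows up as one extra step of that size. A minimal sketch of the detection idea, assuming the samples are already complex (analytic) after down-conversion; the function names and the 5° tolerance are hypothetical and the data is synthetic.

```python
import cmath
import math


def phase_step_deg(tone_hz, sample_rate_hz):
    """Nominal phase advance per sample, in degrees."""
    return 360.0 * tone_hz / sample_rate_hz


def find_phase_steps(samples, tone_hz, sample_rate_hz, tol_deg=5.0):
    """Indices where the sample-to-sample phase advance departs from nominal."""
    nominal = 2.0 * math.pi * tone_hz / sample_rate_hz
    hits = []
    for n in range(1, len(samples)):
        advance = cmath.phase(samples[n] * samples[n - 1].conjugate())
        # wrap the error into (-pi, pi] before comparing
        err = cmath.phase(cmath.exp(1j * (advance - nominal)))
        if abs(math.degrees(err)) > tol_deg:
            hits.append(n)
    return hits


FS, TONE = 129.6e6, 10e6
tone = [cmath.exp(2j * math.pi * TONE * n / FS) for n in range(1000)]
glitched = tone[:500] + tone[501:]          # drop exactly one sample
print(f"{phase_step_deg(TONE, FS):.1f} deg per sample")
print(find_phase_steps(glitched, TONE, FS))
```

On real captures the tolerance would have to sit above the tone's phase noise, and the hit indices would then be compared against the marker positions.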

What's in the tree

  • src/pps_integrity.c, src/stream_soak.c; rx888_get_transfer_bytes()
    accessor (only public-API addition; in the CI symbol set).
  • --statslog CSV (raw per-second GETSTATS + host counters) on both tools;
    the byte-exact drain backlog (signed); measurement-only summaries.
  • tests/pps_integrity_smoke.sh, tests/stream_soak_smoke.sh (in
    make check); tests/pps_knob_sweep.sh; tests/pps_log_stats.py (console
    log + CSV).
  • Makefile / README / .gitignore / doc/pps_integrity.md updated.

make all, make check, make check-asan pass.

🤖 Generated with Claude Code

claude added 3 commits June 20, 2026 01:50
Refine PLAN_PPS_INTEGRITY.md against the current tree and resolve the
draft's open questions:

- Full-transfer size: add a one-line rx888_get_transfer_bytes() getter
  (buf_bytes is already computed in struct rx888; valid after start()).
- GETSTATS decoder: copy the decode into pps_integrity, citing
  src/fx3_cmd/fx3_stats.h as the layout authority.
- GPIO base: OR librx888's dither/randomizer bits into the marker
  toggle so GPIOFX3 full-word writes don't clear them.

Also documents the BIAS_VHF (bit 9 / 0x200) marker-bit alias, a
double-counting-proof atomic marker state machine, the missing
$(LIBUSB_CFLAGS) on the build target, and the test/install wiring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Forward-reference the one public-API addition the pps_integrity tool
needs (full-transfer size for short-marker detection). Marked planned /
not yet implemented, with the post-start() validity caveat and the note
that it must join the CI exported-symbol assertion when it lands.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Add src/pps_integrity.c: streams via librx888 while toggling the PPS
marker GPIO (BIAS_VHF / bit 9) at 1 Hz on a second EP0-only libusb
handle, verifying every rising edge yields exactly one short bulk
transfer with no spurious shorts and no misses. Per-second status line
plus a PASS/FAIL summary with GETSTATS deltas (PIB errors, streaming
faults, boot-count reset detection).

Supporting changes:
- librx888: add rx888_get_transfer_bytes() read-only accessor so the
  tool can derive the full-transfer sample count (short-marker
  threshold). Only valid after rx888_start(). Added to the CI
  exported-symbol assertion.
- Makefile: pps_integrity target (static librx888.a + libusb +
  $(LIBUSB_CFLAGS)); add to BINS so install/uninstall pick it up.
- tests: pps_integrity_smoke.sh (help/exit codes, arg validation,
  no-device path) wired into `make check`.
- docs: doc/pps_integrity.md updated to implemented state; note that
  firmware GETSTATS exposes no pps_count/pps_fail, so host-side
  counters are authoritative. README + librx888.md updated.

make all, make check, and make check-asan all pass; ABI symbol set
matches the CI assertion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
@ringof ringof changed the title doc: pps_integrity implementation plan pps_integrity: PPS in-band marker fidelity tool Jun 20, 2026
claude added 18 commits June 20, 2026 02:23
Mirror rx888_stream's -f: accept a firmware path and pass it through to
cfg.firmware_path so the tool can upload the FX3 image when the device
is in bootloader mode, instead of requiring a separate rx888_stream or
fx3_cmd load step first. NULL (default) keeps the prior behaviour: the
device must already be in app mode.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
At 32 MSPS a couple of markers per run can fail to register as short
transfers (mid-run MISS with zero spurious). To tell a near-full short
slipping past SHORT_MARGIN from a genuinely dropped/merged marker, track
the smallest transfer size per 1-second window and expose it under -v as
a 'minxfer' column. On an ok second this is the marker size; on a MISS
second it shows whether any short arrived at all.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
For the marker-drop investigation, make each run self-documenting:
- Categorise every miss from the window's smallest transfer (minxfer):
  'true drops' (min >= full — no marker emitted) vs 'near-full' (a short
  that slipped past SHORT_MARGIN). A short below threshold would have
  registered as a marker, so near-full should always be 0 — reporting it
  proves the detection threshold isn't hiding real drops.
- Print drop rate as a percentage of edges plus a per-hour projection,
  so long runs at different sample rates yield directly comparable
  numbers without log post-processing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The marker is the leftover partial buffer each PPS second, so its size is
~(samplerate mod full-transfer-samples) and drifts with the ADC/host
clock offset. When that remainder drifts within a buffer-boundary band,
the partial is empty (suppressed) or indistinguishable from a full
transfer and no in-band short can exist — an inherent blind spot of a
short-transfer marker, not a fidelity failure.
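
The marker-size model above can be sketched numerically. The full-transfer size used here (524,288 samples, i.e. a 1 MiB transfer of 2-byte samples) and the band width are assumptions for illustration; the real values come from rx888_get_transfer_bytes() and the tool's DANGER_BAND.

```python
# Sketch: expected marker size and the buffer-boundary blind spot.
# Assumptions: full transfer = 524288 samples; band = 4096 samples (hypothetical).

def marker_samples(sample_rate_hz, full_transfer_samples):
    """Leftover partial transfer after one second of samples."""
    return sample_rate_hz % full_transfer_samples


def near_boundary(remainder, full_transfer_samples, band):
    """True when the partial is nearly empty or nearly full."""
    return remainder < band or full_transfer_samples - remainder < band


FULL = 524_288
size = marker_samples(129_600_000, FULL)
print(size, near_boundary(size, FULL, 4096))
```

A clock offset of a few tens of ppm shifts the remainder by a few thousand samples at this rate, which is how a run can drift into or out of the band.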

Classify each miss accordingly:
- blind-spot: minxfer < full (near-full partial past SHORT_MARGIN), or
  minxfer == full while the live remainder (last good marker) is within
  DANGER_BAND of a boundary. Benign, reported as NOTE.
- anomalous: minxfer == full with the remainder mid-buffer — the marker
  should have been plainly visible but vanished. Real failure.

Verdict now requires anomalous == 0 (and spurious == 0, no faults/reset),
replacing the blunt +-2 edge/marker tolerance which couldn't tell a
benign boundary miss from a real drop. Doc updated with the marker-size
model, the blind-spot characterisation, and the firmware-side fix
(ZLP / forced minimum-length delimiter).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The marker after a hard miss is often ~2x the typical size: the skipped
flush rolls forward and the next edge flushes the combined partial, so no
samples are lost — the delimiter was merely delayed and is reconstructable
in post. Classify such a miss as MERGE (WARN, recoverable) rather than
ANOM (FAIL, possible data loss), using a running-median baseline of recent
normal markers (recovery spikes excluded) and a MERGE_PCT threshold. The
decision is deferred one second until the next marker resolves it.

Verdict: merge and blind-spot misses no longer fail the run; only
anomalous misses (and spurious shorts / faults) do.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Marker classification only answered "is the delimiter present" — never
"did we lose samples". Add direct loss accounting:

- Inter-marker continuity: count samples delivered between consecutive
  markers and compare to (seconds spanned) x the learned per-second rate
  (running median of single-second intervals). Anchored to the marker
  (a fixed point in the stream), so it is free of host-window jitter;
  over ~1 s the clock drift is negligible, so a dropped DMA buffer is an
  obvious deficit. This also cross-validates MERGE: a merge spans 2s, so
  its interval must equal 2x a normal one — if short, the merge dropped
  data.
- Surface librx888 ok_xfers/bad_xfers (transport-level loss) and the
  total samples delivered.
- Reinterpret firmware PIB errors against the deficit: PIB with a deficit
  = host-visible loss; PIB without = recovered (NOTE).

Verdict now fails on any confirmed loss (bad_xfers > 0 or an inter-marker
deficit) in addition to anomalous misses and spurious shorts.
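
A minimal sketch of the inter-marker continuity check described above. It is not the tool's implementation: the function name, the window length, and the way the number of spanned seconds is estimated (rounding the interval against the learned rate) are assumptions.

```python
from statistics import median


def continuity_deficits(marker_positions, nominal_rate, tol_samples, window=15):
    """Flag inter-marker intervals that are short of samples.

    marker_positions: cumulative delivered-sample index at each detected marker.
    Returns a list of (position, seconds_spanned, deficit_samples).
    """
    recent = []                      # recent clean single-second intervals
    flagged = []
    for prev, cur in zip(marker_positions, marker_positions[1:]):
        interval = cur - prev
        rate = median(recent) if recent else nominal_rate
        spans = max(1, round(interval / rate))
        deficit = spans * rate - interval
        if deficit > tol_samples:
            flagged.append((cur, spans, deficit))
        elif spans == 1:
            recent = (recent + [interval])[-window:]   # learn only from clean seconds
    return flagged


# Synthetic stream at 1000 samples/s: one merged 2 s interval, one 100-sample deficit
positions = [0, 1000, 2000, 3000, 5000, 5900, 6900]
print(continuity_deficits(positions, nominal_rate=1000, tol_samples=50))
```

The merged interval (3000 to 5000) spans two seconds and is not flagged, which is the cross-validation of MERGE mentioned above; only the short interval is reported.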

make check and make check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
3 h of 64 MSPS data showed the MERGE-vs-anomalous split (recovery marker
>= 170% of the running median) cut a single continuous population of
recovery sizes arbitrarily — smaller recoveries tagged "anomalous",
larger ones "merge" — while the inter-marker continuity check proved all
37 misses were lossless. The recovery-marker size was the wrong
discriminator.

Replace it with the continuity ground truth: when the next marker
resolves a hard miss, classify by whether the spanning interval shows a
sample deficit:
  - no deficit -> "recovered" (delimiter displaced one edge, data intact,
    reconstructable) -> WARN
  - deficit (or run ended unresolved) -> "lost" -> FAIL

Drop MERGE_PCT and the marker-size baseline ring. Verdict now fails on
lost markers, spurious shorts, confirmed sample loss (bad_xfers or
inter-marker deficit), or device faults. Summary reports
blind-spot/recovered/lost; docs updated.

make check and make check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Parse --rate as a double and convert to Hz (rounded to nearest sample/s)
so production rates like ka9q-radio's 129.6 Msps can be tested exactly,
instead of only whole MSPS. Reject non-numeric or trailing-garbage rates
(e.g. '129.6x') with a usage error. Startup line and summary now show
both MSPS and the resolved Hz.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Four host-side additions so the loss claim is defensible, not just
circumstantial:

- Absolute sample budget: expected = samplerate x elapsed vs delivered,
  reported as +/-ppm. Catches sustained loss the self-referential
  per-interval rate would absorb. Honest caveat: the ppm is clock offset
  AND uniform loss combined; the host can't separate them (that needs a
  firmware "samples produced" counter via GETSTATS — not a stream embed).
- Per-second PIB polling: GETSTATS read each second on the EP0 handle
  localises overflow events to the second and correlates each with that
  interval's continuity result, and catches a mid-run device reset
  immediately. Summary reports PIB seconds and how many coincided with a
  deficit.
- Loss-floor honesty: report LOSS_TOL and the largest inter-marker
  deficit actually observed, so the summary states its own sensitivity
  and the headroom to the floor.
- Learned-vs-nominal rate cross-check (ppm).

Verdict unchanged on hard loss (bad_xfers / inter-marker deficit); the
gross-budget shortfall beyond localised loss + 100 ppm is a WARN, since
it cannot be separated from a slow clock host-side.
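
The absolute budget is a one-line ratio; a sketch with a hypothetical function name follows. The sign convention (negative means fewer samples than expected) is an assumption chosen to match the "+/-ppm" reporting described above.

```python
def budget_ppm(delivered_samples, sample_rate_hz, elapsed_s):
    """Delivered vs expected samples, in ppm (negative = shortfall)."""
    expected = sample_rate_hz * elapsed_s
    return (delivered_samples - expected) / expected * 1e6


# 5 samples short over 10 s at 1 MSPS
shortfall = budget_ppm(9_999_995, 1_000_000, 10)
print(f"{shortfall:+.2f} ppm")
```

A slow clock and uniform loss both push this number negative by the same amount, so on its own it cannot say which one occurred.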

make check and make check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Use the firmware's glDMACount (GETSTATS byte 0 = DMA buffers produced)
against librx888 ok_xfers (buffers delivered): equal deltas = no loss,
and the comparison is independent of the ADC clock (a slow clock lowers
both equally; only real loss makes produced exceed delivered). This is
the decisive answer to "were all DMA buffers delivered?" and the clean
separation of clock offset from loss.

- Read dma_count in GETSTATS; snapshot librx888 ok_xfers at start and end.
- Read the end firmware+lib counters BEFORE rx888_stop (STOPFX3 resets
  glDMACount); subtract in_flight and allow one queue_depth of slack for
  the startup ramp and non-atomic read skew.
- Report DMA buffers produced/delivered/in-flight/undelivered. Lost
  buffers fail the run (third independent loss detector alongside
  bad_xfers and inter-marker continuity).
- With all loss detectors clear, report the rate offset as a pure ADC
  clock calibration ("ADC clock: +X ppm") instead of "clock+loss
  combined"; mid-run reset now also fails.

Add a THEORY OF OPERATION block to the source and expand
doc/pps_integrity.md (loss detectors, clock-vs-loss separation,
sensitivity, what we deliberately don't do) so the harness is
maintainable. No stream embeds.

make all, make check, make check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Hardware at 64/129 MSPS showed a false "buffers lost" FAIL: glDMACount
counts the firmware's 16 KB DMA buffers while librx888 ok_xfers counts
1 MB USB transfers (~64 DMA buffers each), so comparing the raw counts
reported ~64x "undelivered". In bytes the two match exactly (no loss),
confirmed by the +19/+44 ppm budget (delivered ABOVE expected).

Fixes:
- Compare in BYTES: produced_bytes = glDMACount x FW_DMA_BUF_BYTES (16384)
  vs delivered_bytes = samples x 2, with in-flight subtracted and a
  runtime guard that marks the check indeterminate (not failing) if the
  firmware DMA-buffer-size assumption is off by a large factor. Baseline
  samples_total at start so the delta aligns with glDMACount.
- Make inter-marker continuity a DIAGNOSTIC localizer, not a hard fail:
  marker-detection timing jitters by ~a marker at high rates, so isolated
  single-interval deficits are redistribution, not loss (also a false
  FAIL in the runs). It is now reported and corroborated against the
  byte-exact produced-vs-delivered, and gated to a fully-warmed baseline.
- Misses are now "blind-spot"/"displaced" (marker timing only); data loss
  is judged solely by bad_xfers + produced-vs-delivered. Verdict updated.
- Report bytes (GB/MB) for DMA buffers and the gross floor.

Update the THEORY block and doc/pps_integrity.md to match. No stream
embeds. make all, check, check-asan pass.
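
The unit mismatch and its fix can be illustrated with a short sketch. The 16384-byte buffer and 2 bytes per sample come from the text above; the function name and the guard threshold are hypothetical, and the in-flight correction is left out for brevity.

```python
FW_DMA_BUF_BYTES = 16384     # firmware GPIF->DMA buffer size
BYTES_PER_SAMPLE = 2


def produced_vs_delivered(dma_count_delta, samples_delta, max_ratio_error=0.5):
    """Compare firmware-produced and host-delivered data in bytes.

    Returns ("ok", undelivered_bytes), or ("indeterminate", None) when the two
    totals disagree so grossly that the buffer-size assumption is suspect.
    """
    produced = dma_count_delta * FW_DMA_BUF_BYTES
    delivered = samples_delta * BYTES_PER_SAMPLE
    if delivered == 0 or abs(produced / delivered - 1.0) > max_ratio_error:
        return ("indeterminate", None)
    return ("ok", produced - delivered)


# One 1 MiB USB transfer carries 64 DMA buffers: equal in bytes, 64x apart in counts
print(produced_vs_delivered(64, 524_288))
```

Comparing the raw counts (64 buffers vs 1 transfer) is what produced the false "buffers lost" result; in bytes the same run balances to zero.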

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Firmware maintainer confirmed the GPIF->DMA buffer size is 16 KB, so the
produced-vs-delivered byte comparison rests on a verified constant (the
runtime guard stays as defense-in-depth against future firmware changes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
A real 3h/129 MSPS run dropped ~82 MB (produced > delivered) with PIB=0
and bad_xfers=0 — DMA->USB consumer-side loss the budget masked (a +28
ppm fast clock minus ~30 ppm loss read as -1.2 ppm). Three follow-ups:

- Add -q/--queuedepth and -p/--reqsize (pass-through to librx888
  cfg.queue_depth / req_packets) so more in-flight buffering can be tried
  against the throughput ceiling.
- Report loss as ppm and percent next to the MB, and on a failing run
  print the IMPLIED ADC clock (budget_ppm + loss_ppm added back) so the
  clock-masking is visible at a glance.
- Document the empirical ceiling (64 MSPS lossless, 129 MSPS ~30 ppm
  consumer-side loss) and why the clock-independent check is essential.

make all, check, check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Automate the -q/-p knob matrix against the high-rate DMA->USB drain
ceiling: run low-rate controls, sweep queue depth at the high rate, pick
the depth with the least clock-independent (produced-vs-delivered) loss,
sweep transfer size at that depth, then print a summary table (undelivered
MB, result, spurious, displaced per cell). Env-configurable
(RX888_FW/HIRATE/LORATE/DUR_*/QSWEEP/PSWEEP); hardware test, not wired
into `make check`. Guards for missing binary/firmware and a no-device
abort. Referenced from doc/pps_integrity.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The scientific control we never ran: the same data-integrity
instrumentation as pps_integrity but with NO PPS marker injected, so we
can tell whether anomalies are inherent to the streaming path or caused
by the marker mechanism.

Measures (host clock assumed disciplined):
- effective sample rate from delivered-samples / elapsed-time (ppm);
- buffers produced vs delivered in BYTES (glDMACount x 16384 vs
  samples x 2) — clock-independent loss, same as pps_integrity;
- SHORT transfers: any transfer below full. With no marker applied each
  is a pure anomaly (FX3 committing a partial buffer) — the control for
  pps_integrity's "spurious shorts";
- bad_xfers, per-second PIB, streaming faults, mid-run device reset.

Shares the -q/-p knobs and the two-handle design. FAILs on confirmed
loss (bad_xfers or produced>delivered), device reset, or early stop;
short transfers are a WARN. With loss detectors clear, reports the rate
offset as a pure ADC clock calibration.

Wired in: Makefile target + BINS (install follows), .gitignore,
tests/stream_soak_smoke.sh in `make check`, README + doc/pps_integrity.md
("Control" section). make all, check, check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…ting

The 64 MSPS stream_soak control reported undelivered -33.5 MB (~ one queue
depth) on a provably lossless run. Cause: produced and delivered are both
measured from the start snapshot, so the steady in-flight already cancels
in their difference, but we then subtracted the END in_flight again —
double-counting one queue depth and biasing undelivered by ~ -queue.

Subtract the in_flight DELTA (end - start) instead, which cancels the
steady pipeline regardless of whether the start snapshot caught the ramp.
This re-centers the zero-loss baseline at ~0 and tightens the loss floor
to the intended ~slack (4 MB / ~3 ppm over 3 h) instead of ~37 MB.
Report now shows in-flight start->end for transparency.
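
The double-counting and its correction can be shown side by side. A minimal sketch with hypothetical function names; the 32 MiB in-flight figure is an assumed queue depth chosen to resemble the -33.5 MB bias reported above.

```python
def undelivered_old(produced, delivered, in_flight_end):
    """Previous formula: subtracts the whole end-of-run in-flight amount."""
    return produced - delivered - in_flight_end


def undelivered_fixed(produced, delivered, in_flight_start, in_flight_end):
    """Corrected formula: subtracts only the change in in-flight data."""
    return produced - delivered - (in_flight_end - in_flight_start)


# Lossless run: produced == delivered (both measured from the start snapshot),
# with a steady 32 MiB pipeline in flight at both ends.
QUEUE = 32 * 1024 * 1024
print(undelivered_old(10**12, 10**12, QUEUE))            # biased by one queue depth
print(undelivered_fixed(10**12, 10**12, QUEUE, QUEUE))   # re-centred at zero
```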

make all, check, check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The 129.6 MSPS stream_soak control settled it: the bare stream sustains
259 MB/s for 3 h with zero short transfers and zero loss (produced ==
delivered, PIB=0, bad_xfers=0). So 129.6 is not a throughput ceiling; the
loss in pps_integrity is caused by the PPS marker injection (the firmware
forced partial-buffer commit perturbing the near-capacity DMA->USB
pipeline). The direction is conclusive: the lossy run was at the LOWER
throughput (129) with the marker, while the higher bare rate (129.6) was
clean. Rewrite the doc section accordingly and point to the out-of-band
PPS / GPSDO direction as the fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The loss magnitude (~140-250 KB / ~0.5-1 ms per event, ~500x a clean
commit's cost) shows the marker loss is a stall/glitch, not the intrinsic
~0-sample cost of a forced commit. Two untested input conditions are the
prime suspects: the 100 kOhm series resistor (~4 us RC edge -> metastable
detection) and clock-domain crossing of the async PPS into the GPIF (needs
a real 2-flop synchronizer). So this is NOT a verdict that in-band marking
is unworkable.

Reframe the doc section: throughput exonerated (firm), clock-masking shown
(firm), but root cause OPEN pending the edge + synchronizer experiments.
Note that in-band, if fixed, is the finest time source (the short
transfer length encodes the edge sample index sample-exactly), so it is
worth recovering; out-of-band MCU latch remains the safe fallback.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
@ringof ringof changed the title pps_integrity: PPS in-band marker fidelity tool pps_integrity + stream_soak: PPS marker fidelity & stream-integrity harness Jun 20, 2026
claude added 7 commits June 21, 2026 01:18
Replace the earlier contaminated ~82 MB / 129 figure with the trustworthy
3 h baseline (sleep inhibited, in-flight-corrected): at 129.6 MSPS the
marker run loses 118 MB = 42.3 ppm while the device is pristine (PIB=0,
bad_xfers=0, faults=0) and the no-marker control at the same rate loses
nothing. The loss consists of buffers that glDMACount counts as produced but
that never become a USB transfer: dropped in the commit handoff, not overflow. The
budget read -0.5 ppm (a +41.8 ppm fast clock masked the 42.3 ppm loss);
only produced-vs-delivered saw it. Marker size swings the full range
(~2.5k-521k samples), consistent with a metastable commit trigger. This
is the number to beat for the resistor + GPIF-synchronizer experiments.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Reusable analyzer for -v pps_integrity logs. Reports the key correlation
from the 129.6 baseline: dip (loss) seconds are 26.9x enriched in
buffer-boundary-adjacent marker positions (near-empty/near-full minxfer)
vs normal seconds — the signature of a commit-vs-buffer-completion race.
Also reports the post-MISS ~2x merge ratio (displaced markers are clean,
distinct from dips), dip/spur/MISS co-occurrence, dip magnitude, and
inter-dip timing. Accepts multiple logs for a comparison table
(_baseline/_lowR/_sync), so the edge + GPIF-synchronizer experiments can
be judged by whether dips/hour drops AND boundary-enrichment collapses
toward 1x. Stdlib only; referenced from doc/pps_integrity.md.
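
One plausible way to compute a boundary-enrichment ratio of this kind is sketched below. The exact definition used by pps_log_stats.py is not given here, so the function, the band width, and the input shape are assumptions for illustration only.

```python
def boundary_enrichment(seconds, full, band):
    """Ratio of boundary-adjacent fractions: dip seconds vs normal seconds.

    seconds: iterable of (minxfer_samples, is_dip) pairs, one per second.
    Returns None when either group is empty or no normal second is adjacent.
    """
    def adjacent(minxfer):
        return minxfer < band or full - minxfer < band

    dips = [adjacent(m) for m, is_dip in seconds if is_dip]
    normal = [adjacent(m) for m, is_dip in seconds if not is_dip]
    if not dips or not normal or not any(normal):
        return None
    return (sum(dips) / len(dips)) / (sum(normal) / len(normal))


FULL_XFER = 524_288
rows = ([(100, False)] + [(200_000, False)] * 9        # 1 of 10 normal seconds adjacent
        + [(50, True), (524_200, True), (250_000, True), (300_000, True)])  # 2 of 4 dips
print(boundary_enrichment(rows, FULL_XFER, band=4096))
```

A ratio near 1 would mean dips are no more likely near a buffer boundary than any other second, which is the collapse the comparison table is meant to detect.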

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Accidentally committed tests/__pycache__/*.pyc from a compile check.
Remove it and ignore __pycache__/ and *.pyc going forward.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Read the actual KBA (KBA231382) rather than the title: of its two
INVALID_SEQUENCE (error 71) scenarios, only Scenario 1 (out-of-order
commit against the PIB_SOCKET_0/1 ping-pong) fits our data; Scenario 2
(host-slow backpressure) is ruled out by the lossless stream_soak
control at the same rate. Records the three caveats that keep this a
firmware lead pending source (AUTO vs manual channel; faults=0 implying
a silently swallowed commit failure; the KBA fix not preventing loss)
and the four firmware questions it raises.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…uild

The RX888 runs the PPS_CTL_ENABLE=1 build, where the marker commit is
pure GPIF hardware (TH0/TH1_PPS_COMMIT states) on an AUTO_MANY_TO_ONE
channel -- no CPU CommitBuffer/SetWrapUp, cross-route correct by
construction. That eliminates the entire KBA231382 commit-ordering class
(the SetWrapUp-on-AUTO anti-pattern lives only in synth_pps.c, the
PPS_CTL_ENABLE=0 path, which never runs). With software commit ordering
ruled out, the loss is internal to the GPIF state machine; the ~250 KB
(~15 DMA buffers, ~1 ms) per-event magnitude is a stall signature, not a
one-buffer slip, pointing at the slow CTL[2] edge driving the control
comparator metastable. Recasts the KBA section from open lead to closed
cause and makes the fast-edge experiment the decisive next test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Walk back over-confident framing. The 42.3 ppm is a rare catastrophic
mode (3.68% of markers carry ~all the loss), not a per-marker cost; the
other 96.3% are 'below a jittery detector', not verified zero. The
metastability diagnosis is inference, not a captured event -- record the
alternative that the AUTO channel may not tolerate an async mid-descriptor
commit regardless of edge quality. Note that 43.6% of loss events are not
boundary-adjacent (unexplained by the boundary race), that the uniform
marker-size spread is expected (not evidence of metastability), and that
the result is n=1. The fast-edge experiment is a test that can fail, not a
foregone conclusion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…parator, no sync)

Firmware-source read of the GPIF config and vendored SDK turns the
input-condition suspects into a confirmed signal path (not yet a confirmed
cause):
- CTL[2] sampled on the external ADC clock (129.6 MHz); the ~4 us RC edge
  spans ~500 consecutive sample clocks in the input threshold.
- Control comparator in TOGGLE mode (GPIF_CONFIG bit 12), sampling CTL[2]
  directly -- no intermediate state, no 2-flop synchronizer; a metastable
  sample propagates straight into the PPS_COMMIT transition. PPS vs
  original waveform differs in exactly two bits (CTRL_COMP_ENABLE,
  CTRL_COMP_TOGGLE).
- DLL correctly off (sync mode, external clock) -- removes the one
  firmware-config lever previously flagged (KBA210733).
- SDK 1.3.4.

Recasts the experiment order as edge-first (resistor/buffer attacks the
500-cycle dwell directly; a synchronizer state is belt-and-suspenders),
while keeping the error bars: the path is confirmed but causality is not,
and 43.6% of loss events remain non-boundary and unexplained.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
claude added 18 commits June 21, 2026 06:09
…menu

Captures forward-looking firmware features that would convert the host
tools' inferences into device-reported ground truth, independent of
whether in-band PPS survives the edge experiment. Ranked candidates
grouped by goal (dropped-sample detection, timing/clock offset, health
monitoring, validation), each with what it buys the existing tools and
the open feasibility question for firmware. Highlights the two cheapest
high-value GETSTATS additions (drop counter, sample-granular produced
count) and the GPIF counter-capture-on-CTL-edge latch as a
non-perturbing alternative to the in-band marker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Firmware-confirmed: glDMACount increments on CY_U3P_DMA_CB_PROD_EVENT
(producer commits a filled buffer into the DMA channel, before USB
consumption). So produced > delivered means buffers entered the channel
but never drained out the USB consumer socket -- the gap is inside the
FX3, on the DMA->USB drain, not across the wire. Producer fill is fine;
the drain loses them.

Rules out host backpressure two ways: the no-marker stream_soak control
at the same data rate is lossless, and the loss correlates 26.9x with the
device-internal buffer-fill phase the host can't see. Adds the PROD_EVENT
accounting nuance (per-buffer regardless of fill -> glDMACount x 16 KB may
over-count forced partials) and the delivered-side continuity cross-check
showing the bulk loss is real, not a rounding artifact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…collision

Experiment 2 (1 kOhm fast edge, ~100x faster) result, two independent 3h
runs analyzed with pps_log_stats:
- Fidelity improved: spurious shorts 403->0, displaced markers 61->32.
- Loss unchanged: 42.3->41.2 ppm; boundary-enrichment intact 26.9x->25.8x;
  dip magnitude identical (median 98347->98220 samples ~= 12 DMA buffers).

Conclusion: edge quality governs marker fidelity, not data loss. The loss
is a stable, reproducible structural commit-vs-buffer-boundary collision,
not edge metastability. This also down-weights the synchronizer as a loss
fix (it makes the commit deterministic but cannot move where it lands vs
the boundary). The lever the data points at is boundary-aware commit logic
in the GPIF state machine (defer/suppress the partial commit in the danger
band) plus the open question of whether a near-boundary consumer wrap-up
orphans the ~12 buffers in the DMA->USB drain. Strengthens the case for #5
(capture the edge sample index without forcing a commit).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Firmware source resolves the mechanism, superseding the metastability/
synchronizer framing:
- The marker is a forced GPIF thread-switch, NOT a commit. BETA_THR_WRAPUP
  is set in no waveform state; TH0_PPS_COMMIT is the normal DATA_CNT_HIT
  transition firing unconditionally, abandoning a partial buffer. The short
  transfer is an implicit DMA-adapter side effect.
- Channel holds 4 x 16 KB buffers (AUTO_MANY_TO_ONE), not 64; the 64 is
  host URB assembly. So the ~12-buffer loss is a drain-side descriptor
  desync (3x the whole pool), silent because the consumer side is
  uninstrumented (CONS_EVENT commented out).

Drops the synchronizer; makes CONS_EVENT telemetry the decisive next step;
adds three forward paths: A0 explicit THR_WRAPUP (one bit, with the
thread-shutdown/re-arm caveat), A boundary-aware switch, B capture-don't-
commit GPIO latch. Fixes the stale KBA closing that still named the edge as
the leading suspect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The firmware now counts consumer drains (CY_U3P_DMA_CB_CONS_EVENT) and
exposes glDMAConsCount at GETSTATS [36..39]. Both tools now request the
40-byte response and decode it (gracefully absent on older firmware:
cons_valid stays 0, all other fields still decode).

pps_integrity:
- per-second 'orphan+N (drain)' note when produced-consumed steps beyond
  the in-flight wobble, landing on the same line as the dip so the orphan
  is correlated with the marker event;
- summary 'DMA drain' line (consumed buffers, produced-consumed orphaned
  MB) plus a cross-check against host undelivered: agreement pins the loss
  to the in-chip producer->consumer hop; a consumer->host gap would indict
  the USB wire instead.

stream_soak: same decode + a 'DMA drain' baseline line (expected ~in-flight
depth with no marker) -- the no-orphaning control the marker steps stand
out against.

Doc: loss-accounting section documents the new field and cross-check.
make check / check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
When glDMAConsCount is present in GETSTATS but pinned at 0 (CONS_EVENT not
firing -- expected on an AUTO channel, where consumer callbacks are not
delivered to the CPU), the prior code reported the entire stream as
'orphaned' and spammed a per-second orphan note at the full production
rate. Now: the per-second note only fires once the consumer counter is
actually advancing, and the summary prints a plain 'glDMAConsCount did not
advance -- CONS_EVENT not firing' diagnostic instead of a misleading
multi-GB orphan figure. Confirmed against a live run where consumed stayed
0 over 680k produced buffers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…isters

Firmware replaced the (never-firing) CONS_EVENT callback with direct socket
xferCount register reads: GETSTATS grows to 48 bytes with producer xferCount
[36..39], consumer-API [40..43], consumer-raw [44..47], all in bytes that
wrap ~16 s. The meaningful quantity is the instantaneous backlog =
producer - consumer (wrap-safe; both wrap together, backlog << 2^32).
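
A minimal sketch of the wrap-safe subtraction described above, for readers
checking the arithmetic. The function name is hypothetical, and the
259.2 MB/s byte rate assumes 16-bit samples at the 129.6 MSPS quoted in the PR
summary; neither is taken from the tool source.

```python
MOD32 = 1 << 32  # the xferCount registers are treated as 32-bit byte counters


def backlog_bytes(producer: int, consumer: int) -> int:
    """Producer minus consumer modulo 2**32.

    Because both counters wrap at the same modulus, the difference stays
    correct across a wrap as long as the true backlog is far below 2**32.
    """
    return (producer - consumer) % MOD32


# Producer has already wrapped past zero, consumer has not yet wrapped:
one_buffer = backlog_bytes(0x0000_2000, 0xFFFF_E000)  # 0x4000 = 16384 bytes

# Wrap period at the assumed byte rate (129.6 MSPS x 2 bytes/sample).
ASSUMED_BYTE_RATE = 129.6e6 * 2
wrap_period_s = MOD32 / ASSUMED_BYTE_RATE  # roughly 16.6 s
```

The wrap period this gives is consistent with the "~16 s" figure above.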

Host side now:
- requests 48 bytes; decodes the three drain counters (drain_valid gated on
  >=48 so older firmware degrades cleanly);
- fw_backlog() helper; per-second 'orphan+NKB (drain)' note when the backlog
  steps past the in-flight wobble, with a 256 MB sanity cap so a non-tracking
  counter can't spam;
- summary reports backlog start->end and orphaned MB (byte-exact), the
  API-vs-raw-register skew, and the cross-check vs host undelivered (AGREE
  when they match within in-flight slack).
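
The length-gated decode can be sketched as below. This is an illustration,
not the tool's code: decode_drain and the dict keys are hypothetical names,
and little-endian byte order is an assumption (the commit does not state
it). Only the three drain fields at [36..47] are decoded, because the layout
of the earlier bytes is not given here.

```python
import struct

DRAIN_OFFSET = 36   # producer xferCount starts at byte 36
DRAIN_MIN_LEN = 48  # responses shorter than this carry no drain counters


def decode_drain(resp: bytes):
    """Return the three drain counters, or None on older (short) responses."""
    if len(resp) < DRAIN_MIN_LEN:
        return None  # degrade cleanly: the caller treats drain as not valid
    # ASSUMPTION: three consecutive little-endian unsigned 32-bit values.
    prod, cons_api, cons_raw = struct.unpack_from("<3I", resp, DRAIN_OFFSET)
    return {"prod": prod, "cons_api": cons_api, "cons_raw": cons_raw}
```

Returning None rather than zeros keeps "old firmware" distinguishable from
"counters present but reading 0", which is the ambiguity a later commit in
this PR addresses on the reporting side.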

stream_soak mirrors the decode + a flat-backlog baseline line. Doc updated
for the new layout and the wrap-safe backlog method. make check / check-asan
pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…elds

A firmware response < 48 B (no socket-xferCount drain counters) previously
produced silence -- no DMA drain line at all -- which is ambiguous (did it
work? old firmware?). Now both tools print an explicit 'DMA drain: not
reported by this firmware (GETSTATS < 48 B)' line so the operator knows to
flash the socket-xferCount build rather than wondering why the cross-check
is missing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The 48-byte firmware now returns the drain counters, but apiProd (producer
xferCount via CyU3PDmaMultiChannelGetStatus) appears to cover only a single
socket of the MANY_TO_ONE channel, not the summed full-stream producer byte
count. It therefore advances at a different rate from the single full-rate
consumer socket, so producer-consumer drifts by GB and wraps instead of
reading as the small in-flight+orphaned backlog.

The sanity guard previously only checked the END backlog, so it printed a
nonsense 'orphaned -4082 MB'. Now both tools check the start backlog and the
orphan magnitude and, when the pair is incoherent, print an explicit
'producer/consumer counters incoherent ... producer xferCount is not the
full-stream byte count (one socket of the many-to-one?)' instead of a bogus
figure. The consumer side is fine (API == raw register, skew 0); the fix is
firmware-side: sum both PIB producer sockets for apiProd.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The producer-sum firmware fix landed: drain counters are now coherent
(backlog ~0, skew 0) instead of GB-scale garbage. On a window with no real
loss, orphan ~0 and undelivered within slack both read clean, so the
cross-check now says 'both clear -- no loss this window' rather than the
mildly alarming 'see delta'. AGREE / see-delta still apply once real loss
clears the in-flight slack on a longer run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
For short (e.g. 10 min) confirmation runs, each 'orphan+NKB (drain)' note
now also prints the cumulative backlog (backlog X.XXMB), so the drain
staircase is legible live without waiting for the summary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…ither)

The 15-min run exposed a contradiction: glDMACount x 16384 shows ~42 ppm
'loss' while the byte-exact producer-consumer backlog shows ~0. Both
readings are in question -- glDMACount over-counts the partial marker
buffers (each counted as a full 16 KB), AND an *exactly* 0.000 backlog is
implausible (expect nonzero in-flight), so apiProd may be mirroring apiCons.

Rather than trust either, the summary now:
- prints the cross-check as 'CONFLICT: glDMACount says loss, byte-exact says
  clean (unresolved)';
- adds a NOTE that spells out both failure modes and the two things to
  confirm (apiProd is an independent producer-socket sum AND shows nonzero
  steady in-flight);
- labels the loss line 'CONTESTED ... treat as unresolved' instead of
  asserting 'dropped data';
- stops asserting 'the fast clock masked the loss' on the implied-clock line:
  it carries 'UNPROVEN' when the drain disagrees, 'IF the loss is real' when
  no drain counter is present, and only confirms when the drain agrees.

No PASS/FAIL flip: we do not over-trust the suspect 0.000 either.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…he 0.000

The exactly-0 backlog needs proof it is not apiProd mirroring apiCons. Two
diagnostics added to the drain summary:
- 'drain raw': start/end apiProd, apiCons and rawCons values -- if
  apiProd == apiCons byte-for-byte, the producer is mirroring the consumer
  and the 0 backlog is vacuous;
- 'drain inflight': peak producer-consumer backlog seen across the whole run
  -- a genuine independent producer wobbles nonzero as buffers fill/drain; a
  peak of 0 for the entire run is the mirroring red flag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Stop inferring from the cooked summary -- capture the raw stream. --statslog
FILE writes one CSV row per GETSTATS poll (plus start/end snapshots) with
every unprocessed counter we acquire: host (samples_total, ok/bad_xfers,
in_flight) and firmware (dma_count, drain_prod/cons/raw, backlog, pib,
faults, boot), tagged with the per-second stat and minxfer. Offline analysis
(did the backlog ever go nonzero? does drain_prod track drain_cons exactly?
does backlog step at dips?) then works off ground truth instead of the
derived metrics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
stream_soak gains the same -l/--statslog as pps_integrity, writing the
identical CSV schema (marker columns 0/'soak') so one analyser reads both --
the no-marker control's drain trace is the flat-backlog baseline.

pps_log_stats.py now auto-detects a --statslog CSV (vs the -v console log)
and analyses the raw counters to settle the questions the summary can only
infer:
- producer-vs-consumer independence: drain_prod==drain_cons fraction and the
  backlog distribution -> 'backlog 0 on EVERY row -> apiProd MIRRORS apiCons
  (vacuous)' vs 'wobbles nonzero -> INDEPENDENT';
- orphan steps: backlog jumps > 6 buffers, with the second and its minxfer;
- glDMACount-vs-delivered growth: smooth per-second (per-marker partial-buffer
  over-count artifact) vs bursty/concentrated (real loss events).
Verified on synthetic mirror/independent/artifact/real CSVs; the existing
console-log path is unchanged. make check / check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…klog

The real run.csv exposed a bug: backlog is unsigned (prod-cons mod 2^32),
and read-skew lets the consumer be caught one buffer AHEAD, wrapping to
~2^32 -- which the analyzer misread as a 262143-buffer 'orphan step'. Now
interpret backlog as signed, so the wobble reads as +/-1 buffer.
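
The misread and the fix can be reproduced in a few lines. This is a sketch
with a hypothetical helper name, assuming 16384-byte buffers as used
elsewhere in this PR; it is not the analyzer's actual code.

```python
BUF = 16384        # DMA buffer size in bytes
MOD32 = 1 << 32


def signed_backlog(raw_u32: int) -> int:
    """Reinterpret an unsigned (prod - cons) mod 2**32 value as signed."""
    raw_u32 %= MOD32
    return raw_u32 - MOD32 if raw_u32 >= MOD32 // 2 else raw_u32


# Read skew catches the consumer one buffer ahead of the producer:
raw = (0 - BUF) % MOD32

unsigned_bufs = raw // BUF                 # the bogus huge "orphan step"
signed_bufs = signed_backlog(raw) // BUF   # the real one-buffer wobble
```

Here unsigned_bufs comes out as 262143, the figure the analyzer reported,
while signed_bufs is -1.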

With that, the drain analysis is decisive. Independence is judged from the
count of rows where prod!=cons (a hard mirror gives 0); orphan accumulation
from the net start->end change and any sustained steps (in-flight wobble
returns to ~0, a real orphan ratchets up). When the backlog shows no
accumulation, the glDMACount-vs-delivered growth is BY DEFINITION the
partial-buffer over-count, so the gap section now defers to the backlog
verdict instead of the host-in-flight-confounded smooth/bursty heuristic
(median buf/marker shown for scale).
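
A rough sketch of the three measurements named above, taking per-row
producer and consumer byte counters as plain lists. The function name, the
step threshold (6 buffers, borrowed from the earlier orphan-step rule) and
the hold length are illustrative assumptions, not the values or code in
pps_log_stats.py.

```python
def drain_summary(prod, cons, buf=16384, step_bufs=6, hold=3):
    """Summarise a drain trace: independence, net change, sustained steps."""
    if not prod or len(prod) != len(cons):
        raise ValueError("need equal-length, non-empty counter lists")
    mod = 1 << 32

    def signed(x):
        x %= mod
        return x - mod if x >= mod // 2 else x

    backlog = [signed(p - c) for p, c in zip(prod, cons)]
    rows_differ = sum(1 for b in backlog if b != 0)  # 0 means a hard mirror
    net = backlog[-1] - backlog[0]                   # start->end change
    peak = max(abs(b) for b in backlog)              # in-flight wobble size

    # A step counts as sustained once the backlog has stayed more than
    # step_bufs buffers above its starting value for `hold` rows in a row.
    steps, run = 0, 0
    for b in backlog:
        if b - backlog[0] > step_bufs * buf:
            run += 1
            if run == hold:
                steps += 1
        else:
            run = 0
    return {"rows_differ": rows_differ, "net": net,
            "peak": peak, "sustained_steps": steps}
```

On a trace that only wobbles by one buffer and returns to zero, this reports
a nonzero rows_differ (independent producer), net 0 and no sustained steps,
which is the pattern the 10-min run below shows.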

Verified on the 10-min run.csv: apiProd independent (53/612 rows differ by
1 buffer), net backlog 0, no steps -> the +7.5 MB glDMACount gap is
over-count (12 KB/marker), not dropped data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Strip the interpretive verdicts the tools had accreted -- they are
diagnostic instruments, not adjudicators. Removed:
- PASS/FAIL driven by the glDMACount-vs-delivered gap (it over-counts
  partial marker buffers); PASS/FAIL now reflects only unambiguous
  transport/device faults (bad_xfers, device reset, spurious, etc.);
- 'fast clock masked the loss' / 'implied clock' add-back -> a plain
  'Rate offset' line that states the budget and gap ppm and says they are
  conflated, without asserting which dominates;
- 'undelivered/dropped data', 'orphaned', 'CONFLICT/CONTESTED/AGREE',
  'mirrored/suspect', 'OVER-COUNT vs REAL' verdicts -> neutral measurement
  labels (glDMACount-delivered gap; backlog start/end/net; peak |in-flight|;
  prod!=cons N/M s; raw P/C/Craw).

The numbers are all still there (more of them, signed backlog); the
interpretation is left to the analyst and the --statslog CSV. make check /
check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
… was over-count

The 3 h byte-exact drain run settled the investigation, so the doc no longer
claims '42 ppm drain-side loss'. Rewrites:
- new 'Resolution' section: the marker drops no data (backlog net 0 over 3 h,
  leak < 6 ppb); the 42.6 ppm gap is glDMACount x 16 KB counting partial
  marker buffers as full (~10.8 KB/marker); real clock ~0 ppm; the
  'fast clock masked the loss' reading is dead;
- condensed 'How we got here' history (stream_soak control, KBA ruled out,
  thread-switch mechanism, edge experiment = fidelity not loss, the
  CONS_EVENT-dead-on-AUTO -> socket-register saga, orphan hypothesis refuted);
- new 'Still open: data corruption' -> the 10 MHz-tone phase-tracking test;
- sample-loss-accounting section: glDMACount gap is over-count, the byte-exact
  drain backlog is the authoritative loss check;
- clock-separation, Final-report example, Pass criteria, Takeaways retitled to
  match the tools-report-measurements design and the real 3 h output;
- top-of-doc status callout.
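
For anyone re-deriving the headline numbers, the arithmetic can be sketched
as follows. Two assumptions are made that this message does not state: the
byte rate is 129.6 MSPS (from the PR summary) times 2 bytes per sample, and
the '< 6 ppb' leak bound is read as one 16 KB buffer of backlog resolution
divided by the total bytes in 3 h. Function names are illustrative.

```python
BUF = 16384                       # bytes per DMA buffer
ASSUMED_BYTE_RATE = 129.6e6 * 2   # ASSUMPTION: 16-bit samples at 129.6 MSPS


def over_count_ppm(over_bytes_per_marker, marker_hz, byte_rate):
    """Apparent 'loss' from counting partial marker buffers as full."""
    return over_bytes_per_marker * marker_hz / byte_rate * 1e6


def leak_floor_ppb(resolution_bytes, byte_rate, seconds):
    """Smallest leak a net-zero backlog of this resolution can rule out."""
    return resolution_bytes / (byte_rate * seconds) * 1e9


# ~10.8 KB (taken as KiB) over-counted per marker, one marker per second:
gap_ppm = over_count_ppm(10.8 * 1024, 1.0, ASSUMED_BYTE_RATE)

# One buffer of resolution over a 3 h (10,800 s) run:
floor_ppb = leak_floor_ppb(BUF, ASSUMED_BYTE_RATE, 3 * 3600)
```

Under these assumptions gap_ppm lands close to the 42.6 ppm gap and
floor_ppb just under 6, matching the figures in the Resolution bullet.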

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL