pps_integrity + stream_soak: PPS marker fidelity & stream-integrity harness #32
Draft
ringof wants to merge 46 commits into
Refine PLAN_PPS_INTEGRITY.md against the current tree and resolve the draft's open questions:
- Full-transfer size: add a one-line rx888_get_transfer_bytes() getter (buf_bytes is already computed in struct rx888; valid after start()).
- GETSTATS decoder: copy the decode into pps_integrity, citing src/fx3_cmd/fx3_stats.h as the layout authority.
- GPIO base: OR librx888's dither/randomizer bits into the marker toggle so GPIOFX3 full-word writes don't clear them.
Also documents the BIAS_VHF (bit 9 / 0x200) marker-bit alias, a double-counting-proof atomic marker state machine, the missing $(LIBUSB_CFLAGS) on the build target, and the test/install wiring.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
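The GPIO-base point can be illustrated with a minimal sketch. Only the marker bit (bit 9 / 0x200) comes from the commit message; the `DITHER` and `RANDOMIZER` bit positions and the function name are hypothetical placeholders, not librx888's actual values.

```python
MARKER_BIT = 0x200  # BIAS_VHF alias, bit 9 (from the commit message)

# Hypothetical bit positions, for illustration only; the real values
# live in librx888 and are not stated in this PR text.
DITHER = 0x01
RANDOMIZER = 0x02


def marker_gpio_word(base_bits: int, marker_high: bool) -> int:
    """Build the full GPIO word for one marker toggle.

    The write replaces the whole word, so the caller's base bits are
    carried through unchanged and only the marker bit is set or cleared.
    """
    base = base_bits & ~MARKER_BIT
    return (base | MARKER_BIT) if marker_high else base


base = DITHER | RANDOMIZER
high = marker_gpio_word(base, True)    # marker set, base bits preserved
low = marker_gpio_word(high, False)    # marker cleared, base bits preserved
```

Writing `MARKER_BIT` alone would clear the base bits on every edge, which is the failure the commit guards against.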
Forward-reference the one public-API addition the pps_integrity tool needs (full-transfer size for short-marker detection). Marked planned / not yet implemented, with the post-start() validity caveat and the note that it must join the CI exported-symbol assertion when it lands. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Add src/pps_integrity.c: streams via librx888 while toggling the PPS marker GPIO (BIAS_VHF / bit 9) at 1 Hz on a second EP0-only libusb handle, verifying every rising edge yields exactly one short bulk transfer with no spurious shorts and no misses. Per-second status line plus a PASS/FAIL summary with GETSTATS deltas (PIB errors, streaming faults, boot-count reset detection).
Supporting changes:
- librx888: add rx888_get_transfer_bytes() read-only accessor so the tool can derive the full-transfer sample count (short-marker threshold). Only valid after rx888_start(). Added to the CI exported-symbol assertion.
- Makefile: pps_integrity target (static librx888.a + libusb + $(LIBUSB_CFLAGS)); add to BINS so install/uninstall pick it up.
- tests: pps_integrity_smoke.sh (help/exit codes, arg validation, no-device path) wired into `make check`.
- docs: doc/pps_integrity.md updated to implemented state; note that firmware GETSTATS exposes no pps_count/pps_fail, so host-side counters are authoritative. README + librx888.md updated.
make all, make check, and make check-asan all pass; ABI symbol set matches the CI assertion.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Mirror rx888_stream's -f: accept a firmware path and pass it through to cfg.firmware_path so the tool can upload the FX3 image when the device is in bootloader mode, instead of requiring a separate rx888_stream or fx3_cmd load step first. NULL (default) keeps the prior behaviour: the device must already be in app mode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
At 32 MSPS a couple of markers per run can fail to register as short transfers (mid-run MISS with zero spurious). To tell a near-full short slipping past SHORT_MARGIN from a genuinely dropped/merged marker, track the smallest transfer size per 1-second window and expose it under -v as a 'minxfer' column. On an ok second this is the marker size; on a MISS second it shows whether any short arrived at all. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
For the marker-drop investigation, make each run self-documenting:
- Categorise every miss from the window's smallest transfer (minxfer): 'true drops' (min >= full, i.e. no marker emitted) vs 'near-full' (a short that slipped past SHORT_MARGIN). A short below threshold would have registered as a marker, so near-full should always be 0; reporting it proves the detection threshold isn't hiding real drops.
- Print drop rate as a percentage of edges plus a per-hour projection, so long runs at different sample rates yield directly comparable numbers without log post-processing.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
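The two normalised figures are simple ratios. A minimal sketch follows; the function name and the edge count in the usage lines are illustrative assumptions, not values taken from the tool.

```python
def drop_stats(edges: int, misses: int, elapsed_s: float) -> tuple[float, float]:
    """Return (drop rate as % of edges, misses projected per hour)."""
    pct = 100.0 * misses / edges if edges else 0.0
    per_hour = misses * 3600.0 / elapsed_s if elapsed_s > 0 else 0.0
    return pct, per_hour


# Illustrative numbers: a 3 h run at 1 Hz with 37 misses.
pct, per_hour = drop_stats(edges=10800, misses=37, elapsed_s=10800.0)
print(f"drop rate {pct:.3f}% of edges, {per_hour:.1f}/hour")
```

Both figures are independent of sample rate, which is what makes runs at different rates comparable.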
The marker is the leftover partial buffer each PPS second, so its size is ~(samplerate mod full-transfer-samples) and drifts with the ADC/host clock offset. When that remainder drifts within a buffer-boundary band, the partial is empty (suppressed) or indistinguishable from a full transfer and no in-band short can exist. This is an inherent blind spot of a short-transfer marker, not a fidelity failure.
Classify each miss accordingly:
- blind-spot: minxfer < full (near-full partial past SHORT_MARGIN), or minxfer == full while the live remainder (last good marker) is within DANGER_BAND of a boundary. Benign, reported as NOTE.
- anomalous: minxfer == full with the remainder mid-buffer; the marker should have been plainly visible but vanished. Real failure.
Verdict now requires anomalous == 0 (and spurious == 0, no faults/reset), replacing the blunt +-2 edge/marker tolerance which couldn't tell a benign boundary miss from a real drop.
Doc updated with the marker-size model, the blind-spot characterisation, and the firmware-side fix (ZLP / forced minimum-length delimiter).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
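The marker-size model and the two-way classification can be sketched as follows. The rule is the one stated in the commit message; the `DANGER_BAND` value, the 524288-sample full transfer (1 MB of 16-bit samples) and the function names are assumptions for illustration.

```python
FULL = 524288        # samples per full transfer; assumes 1 MB transfers of 2-byte samples
DANGER_BAND = 4096   # hypothetical band width in samples; the tool's value is not stated


def expected_marker_samples(rate_hz: int, full: int = FULL) -> int:
    """Nominal marker size: the partial buffer left over each second."""
    return rate_hz % full


def classify_miss(minxfer: int, full: int, last_marker: int, band: int) -> str:
    """Classify a second in which no short transfer registered as a marker."""
    if minxfer < full:
        # A short arrived but was too close to full to pass SHORT_MARGIN.
        return "blind-spot"
    # No short at all: benign only if the remainder sits near a buffer boundary.
    near_boundary = min(last_marker, full - last_marker) <= band
    return "blind-spot" if near_boundary else "anomalous"


print(expected_marker_samples(64_000_000))            # 36864
print(classify_miss(FULL, FULL, 260_000, DANGER_BAND))  # anomalous
```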
The marker after a hard miss is often ~2x the typical size: the skipped flush rolls forward and the next edge flushes the combined partial, so no samples are lost — the delimiter was merely delayed and is reconstructable in post. Classify such a miss as MERGE (WARN, recoverable) rather than ANOM (FAIL, possible data loss), using a running-median baseline of recent normal markers (recovery spikes excluded) and a MERGE_PCT threshold. The decision is deferred one second until the next marker resolves it. Verdict: merge and blind-spot misses no longer fail the run; only anomalous misses (and spurious shorts / faults) do. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Marker classification only answered "is the delimiter present", never "did we lose samples". Add direct loss accounting:
- Inter-marker continuity: count samples delivered between consecutive markers and compare to (seconds spanned) x the learned per-second rate (running median of single-second intervals). Anchored to the marker (a fixed point in the stream), so it is free of host-window jitter; within ~1s clock drift is negligible, so a dropped DMA buffer is an obvious deficit. This also cross-validates MERGE: a merge spans 2s, so its interval must equal 2x a normal one; if short, the merge dropped data.
- Surface librx888 ok_xfers/bad_xfers (transport-level loss) and the total samples delivered.
- Reinterpret firmware PIB errors against the deficit: PIB with a deficit = host-visible loss; PIB without = recovered (NOTE).
Verdict now fails on any confirmed loss (bad_xfers > 0 or an inter-marker deficit) in addition to anomalous misses and spurious shorts.
make check and make check-asan pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
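A minimal sketch of the inter-marker continuity arithmetic, assuming the learned rate is a plain median over recent single-second intervals. The function names and sample values are illustrative; the tool's window length and tolerance are not stated here.

```python
from statistics import median


def learned_rate(single_second_intervals: list[int]) -> float:
    """Per-second sample rate learned from normal (1 s) marker intervals."""
    return median(single_second_intervals)


def interval_deficit(samples_between: int, seconds_spanned: int, rate: float) -> float:
    """Samples missing from an interval; positive means fewer arrived than expected."""
    return seconds_spanned * rate - samples_between


rate = learned_rate([64_000_010, 63_999_990, 64_000_000])
# A merge spans two seconds, so a lossless one carries twice the normal count.
print(interval_deficit(128_000_000, 2, rate))   # lossless merge
print(interval_deficit(127_900_000, 2, rate))   # merge that dropped data
```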
3 h of 64 MSPS data showed the MERGE-vs-anomalous split (recovery marker
>= 170% of the running median) cut a single continuous population of
recovery sizes arbitrarily — smaller recoveries tagged "anomalous",
larger ones "merge" — while the inter-marker continuity check proved all
37 misses were lossless. The recovery-marker size was the wrong
discriminator.
Replace it with the continuity ground truth: when the next marker
resolves a hard miss, classify by whether the spanning interval shows a
sample deficit:
- no deficit -> "recovered" (delimiter displaced one edge, data intact,
reconstructable) -> WARN
- deficit (or run ended unresolved) -> "lost" -> FAIL
Drop MERGE_PCT and the marker-size baseline ring. Verdict now fails on
lost markers, spurious shorts, confirmed sample loss (bad_xfers or
inter-marker deficit), or device faults. Summary reports
blind-spot/recovered/lost; docs updated.
make check and make check-asan pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Parse --rate as a double and convert to Hz (rounded to nearest sample/s) so production rates like ka9q-radio's 129.6 Msps can be tested exactly, instead of only whole MSPS. Reject non-numeric or trailing-garbage rates (e.g. '129.6x') with a usage error. Startup line and summary now show both MSPS and the resolved Hz. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
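The parsing rule can be sketched in a few lines. This is a Python analogue of the described behaviour, not the tool's C code; the function name is hypothetical.

```python
import math


def parse_rate_msps(text: str) -> int:
    """Parse a rate given in MSPS and return it in Hz, rounded to the nearest sample/s.

    Raises ValueError for non-numeric input or trailing garbage such as '129.6x'.
    """
    msps = float(text)  # float() rejects trailing garbage with ValueError
    if not math.isfinite(msps) or msps <= 0:
        raise ValueError(f"invalid rate: {text!r}")
    return round(msps * 1e6)


print(parse_rate_msps("129.6"))  # 129600000
print(parse_rate_msps("64"))     # 64000000
```

Rounding after the multiply matters: 129.6 is not exactly representable in binary floating point, so truncation could land one sample per second low.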
Four host-side additions so the loss claim is defensible, not just circumstantial:
- Absolute sample budget: expected = samplerate x elapsed vs delivered, reported as +/-ppm. Catches sustained loss the self-referential per-interval rate would absorb. Honest caveat: the ppm is clock offset AND uniform loss combined; the host can't separate them (that needs a firmware "samples produced" counter via GETSTATS, not a stream embed).
- Per-second PIB polling: GETSTATS read each second on the EP0 handle localises overflow events to the second and correlates each with that interval's continuity result, and catches a mid-run device reset immediately. Summary reports PIB seconds and how many coincided with a deficit.
- Loss-floor honesty: report LOSS_TOL and the largest inter-marker deficit actually observed, so the summary states its own sensitivity and the headroom to the floor.
- Learned-vs-nominal rate cross-check (ppm).
Verdict unchanged on hard loss (bad_xfers / inter-marker deficit); the gross-budget shortfall beyond localised loss + 100 ppm is a WARN, since it cannot be separated from a slow clock host-side.
make check and make check-asan pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
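The absolute sample budget reduces to one ratio. A minimal sketch, with a hypothetical function name and illustrative inputs:

```python
def budget_ppm(delivered_samples: int, rate_hz: int, elapsed_s: float) -> float:
    """Signed offset of delivered vs expected samples, in parts per million.

    Positive means more samples than nominal arrived (fast clock); negative
    means fewer (slow clock, uniform loss, or both; indistinguishable here).
    """
    expected = rate_hz * elapsed_s
    return (delivered_samples - expected) / expected * 1e6


# Illustrative: 100 s at 64 MSPS with 121600 extra samples reads as +19 ppm.
print(f"{budget_ppm(6_400_121_600, 64_000_000, 100.0):+.1f} ppm")
```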
Use the firmware's glDMACount (GETSTATS byte 0 = DMA buffers produced)
against librx888 ok_xfers (buffers delivered): equal deltas = no loss,
and the comparison is independent of the ADC clock (a slow clock lowers
both equally; only real loss makes produced exceed delivered). This is
the decisive answer to "were all DMA buffers delivered?" and the clean
separation of clock offset from loss.
- Read dma_count in GETSTATS; snapshot librx888 ok_xfers at start and end.
- Read the end firmware+lib counters BEFORE rx888_stop (STOPFX3 resets
glDMACount); subtract in_flight and allow one queue_depth of slack for
the startup ramp and non-atomic read skew.
- Report DMA buffers produced/delivered/in-flight/undelivered. Lost
buffers fail the run (third independent loss detector alongside
bad_xfers and inter-marker continuity).
- With all loss detectors clear, report the rate offset as a pure ADC
clock calibration ("ADC clock: +X ppm") instead of "clock+loss
combined"; mid-run reset now also fails.
Add a THEORY OF OPERATION block to the source and expand
doc/pps_integrity.md (loss detectors, clock-vs-loss separation,
sensitivity, what we deliberately don't do) so the harness is
maintainable. No stream embeds.
make all, make check, make check-asan pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Hardware at 64/129 MSPS showed a false "buffers lost" FAIL: glDMACount counts the firmware's 16 KB DMA buffers while librx888 ok_xfers counts 1 MB USB transfers (~64 DMA buffers each), so comparing the raw counts reported ~64x "undelivered". In bytes the two match exactly (no loss), confirmed by the +19/+44 ppm budget (delivered ABOVE expected).
Fixes:
- Compare in BYTES: produced_bytes = glDMACount x FW_DMA_BUF_BYTES (16384) vs delivered_bytes = samples x 2, with in-flight subtracted and a runtime guard that marks the check indeterminate (not failing) if the firmware DMA-buffer-size assumption is off by a large factor. Baseline samples_total at start so the delta aligns with glDMACount.
- Make inter-marker continuity a DIAGNOSTIC localizer, not a hard fail: marker-detection timing jitters by ~a marker at high rates, so isolated single-interval deficits are redistribution, not loss (also a false FAIL in the runs). It is now reported and corroborated against the byte-exact produced-vs-delivered, and gated to a fully-warmed baseline.
- Misses are now "blind-spot"/"displaced" (marker timing only); data loss is judged solely by bad_xfers + produced-vs-delivered. Verdict updated.
- Report bytes (GB/MB) for DMA buffers and the gross floor.
Update the THEORY block and doc/pps_integrity.md to match. No stream embeds.
make all, check, check-asan pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
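The unit fix can be sketched as below. The 16384-byte buffer size and 2 bytes per sample come from the commit message; the guard's factor-of-two bounds and the function name are assumptions, since the message only says "off by a large factor". In-flight handling is omitted for brevity.

```python
FW_DMA_BUF_BYTES = 16384   # firmware DMA buffer size (from the commit message)
BYTES_PER_SAMPLE = 2       # 16-bit samples


def produced_minus_delivered(dma_count_delta: int, samples_delta: int):
    """Return produced - delivered in bytes, or None if the check is indeterminate."""
    produced = dma_count_delta * FW_DMA_BUF_BYTES
    delivered = samples_delta * BYTES_PER_SAMPLE
    # Hypothetical guard: if the two differ by more than 2x, the buffer-size
    # assumption is probably wrong, so report indeterminate instead of failing.
    if delivered == 0 or not (0.5 <= produced / delivered <= 2.0):
        return None
    return produced - delivered


# One 1 MB USB transfer is 64 firmware DMA buffers: counts differ 64x, bytes match.
print(produced_minus_delivered(64, 524288))   # 0
print(produced_minus_delivered(1, 524288))    # None (units mismatch caught by the guard)
```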
Firmware maintainer confirmed the GPIF->DMA buffer size is 16 KB, so the produced-vs-delivered byte comparison rests on a verified constant (the runtime guard stays as defense-in-depth against future firmware changes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
A real 3h/129 MSPS run dropped ~82 MB (produced > delivered) with PIB=0 and bad_xfers=0: DMA->USB consumer-side loss the budget masked (a +28 ppm fast clock minus ~30 ppm loss read as -1.2 ppm).
Three follow-ups:
- Add -q/--queuedepth and -p/--reqsize (pass-through to librx888 cfg.queue_depth / req_packets) so more in-flight buffering can be tried against the throughput ceiling.
- Report loss as ppm and percent next to the MB, and on a failing run print the IMPLIED ADC clock (budget_ppm + loss_ppm added back) so the clock-masking is visible at a glance.
- Document the empirical ceiling (64 MSPS lossless, 129 MSPS ~30 ppm consumer-side loss) and why the clock-independent check is essential.
make all, check, check-asan pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Automate the -q/-p knob matrix against the high-rate DMA->USB drain ceiling: run low-rate controls, sweep queue depth at the high rate, pick the depth with the least clock-independent (produced-vs-delivered) loss, sweep transfer size at that depth, then print a summary table (undelivered MB, result, spurious, displaced per cell). Env-configurable (RX888_FW/HIRATE/LORATE/DUR_*/QSWEEP/PSWEEP); hardware test, not wired into `make check`. Guards for missing binary/firmware and a no-device abort. Referenced from doc/pps_integrity.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The scientific control we never ran: the same data-integrity
instrumentation as pps_integrity but with NO PPS marker injected, so we
can tell whether anomalies are inherent to the streaming path or caused
by the marker mechanism.
Measures (host clock assumed disciplined):
- effective sample rate from delivered-samples / elapsed-time (ppm);
- buffers produced vs delivered in BYTES (glDMACount x 16384 vs
samples x 2) — clock-independent loss, same as pps_integrity;
- SHORT transfers: any transfer below full. With no marker applied each
is a pure anomaly (FX3 committing a partial buffer) — the control for
pps_integrity's "spurious shorts";
- bad_xfers, per-second PIB, streaming faults, mid-run device reset.
Shares the -q/-p knobs and the two-handle design. FAILs on confirmed
loss (bad_xfers or produced>delivered), device reset, or early stop;
short transfers are a WARN. With loss detectors clear, reports the rate
offset as a pure ADC clock calibration.
Wired in: Makefile target + BINS (install follows), .gitignore,
tests/stream_soak_smoke.sh in `make check`, README + doc/pps_integrity.md
("Control" section). make all, check, check-asan pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…ting The 64 MSPS stream_soak control reported undelivered -33.5 MB (~ one queue depth) on a provably lossless run. Cause: produced and delivered are both measured from the start snapshot, so the steady in-flight already cancels in their difference, but we then subtracted the END in_flight again — double-counting one queue depth and biasing undelivered by ~ -queue. Subtract the in_flight DELTA (end - start) instead, which cancels the steady pipeline regardless of whether the start snapshot caught the ramp. This re-centers the zero-loss baseline at ~0 and tightens the loss floor to the intended ~slack (4 MB / ~3 ppm over 3 h) instead of ~37 MB. Report now shows in-flight start->end for transparency. make all, check, check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
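The double-counting and its correction can be shown side by side. A minimal sketch with hypothetical function names; the 33.5 MB steady in-flight figure mirrors the bias reported above.

```python
def undelivered_old(produced: int, delivered: int, in_flight_end: int) -> int:
    """Previous form: subtracts the END in-flight, double-counting one queue depth."""
    return produced - delivered - in_flight_end


def undelivered_fixed(produced: int, delivered: int,
                      in_flight_start: int, in_flight_end: int) -> int:
    """Corrected form: subtracts only the in-flight DELTA over the window."""
    return produced - delivered - (in_flight_end - in_flight_start)


Q = 33_500_000  # steady in-flight bytes, about one queue depth

# Lossless run, start snapshot taken with the pipeline already full:
print(undelivered_old(10**12, 10**12, Q))        # -33500000 (biased)
print(undelivered_fixed(10**12, 10**12, Q, Q))   # 0

# Lossless run, start snapshot taken during the ramp (pipeline still empty):
print(undelivered_fixed(10**12 + Q, 10**12, 0, Q))  # 0
```

The delta form reads zero in both start-snapshot cases, which is the property the commit relies on.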
The 129.6 MSPS stream_soak control settled it: the bare stream sustains 259 MB/s for 3 h with zero short transfers and zero loss (produced == delivered, PIB=0, bad_xfers=0). So 129.6 is not a throughput ceiling; the loss in pps_integrity is caused by the PPS marker injection (the firmware forced partial-buffer commit perturbing the near-capacity DMA->USB pipeline). The direction is conclusive: the lossy run was at the LOWER throughput (129) with the marker, while the higher bare rate (129.6) was clean. Rewrite the doc section accordingly and point to the out-of-band PPS / GPSDO direction as the fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The loss magnitude (~140-250 KB / ~0.5-1 ms per event, ~500x a clean commit's cost) shows the marker loss is a stall/glitch, not the intrinsic ~0-sample cost of a forced commit. Two untested input conditions are the prime suspects: the 100 kOhm series resistor (~4 us RC edge -> metastable detection) and clock-domain crossing of the async PPS into the GPIF (needs a real 2-flop synchronizer). So this is NOT a verdict that in-band marking is unworkable. Reframe the doc section: throughput exonerated (firm), clock-masking shown (firm), but root cause OPEN pending the edge + synchronizer experiments. Note that in-band, if fixed, is the finest time source (the short transfer length encodes the edge sample index sample-exactly), so it is worth recovering; out-of-band MCU latch remains the safe fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Replace the earlier contaminated ~82 MB / 129 figure with the trustworthy 3 h baseline (sleep inhibited, in-flight-corrected): at 129.6 MSPS the marker run loses 118 MB = 42.3 ppm while the device is pristine (PIB=0, bad_xfers=0, faults=0) and the no-marker control at the same rate loses nothing. Loss is buffers glDMACount counts as produced but that never become a USB transfer — dropped in the commit handoff, not overflow. The budget read -0.5 ppm (a +41.8 ppm fast clock masked the 42.3 ppm loss); only produced-vs-delivered saw it. Marker size swings the full range (~2.5k-521k samples), consistent with a metastable commit trigger. This is the number to beat for the resistor + GPIF-synchronizer experiments. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
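The clock-masking arithmetic is worth making explicit. A minimal sketch using the figures quoted above; the function name is hypothetical.

```python
def implied_clock_ppm(budget_ppm: float, loss_ppm: float) -> float:
    """Add the clock-independent loss back onto the gross budget.

    The budget sees (clock offset - loss); restoring the loss recovers
    the clock offset that was hiding it.
    """
    return budget_ppm + loss_ppm


budget = -0.5   # gross budget reading, ppm
loss = 42.3     # produced-vs-delivered loss, ppm
print(f"implied ADC clock: {implied_clock_ppm(budget, loss):+.1f} ppm")  # +41.8 ppm
```

A budget near zero therefore says nothing on its own: a fast clock and a matching loss cancel almost exactly in this run.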
Reusable analyzer for -v pps_integrity logs. Reports the key correlation from the 129.6 baseline: dip (loss) seconds are 26.9x enriched in buffer-boundary-adjacent marker positions (near-empty/near-full minxfer) vs normal seconds — the signature of a commit-vs-buffer-completion race. Also reports the post-MISS ~2x merge ratio (displaced markers are clean, distinct from dips), dip/spur/MISS co-occurrence, dip magnitude, and inter-dip timing. Accepts multiple logs for a comparison table (_baseline/_lowR/_sync), so the edge + GPIF-synchronizer experiments can be judged by whether dips/hour drops AND boundary-enrichment collapses toward 1x. Stdlib only; referenced from doc/pps_integrity.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
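One plausible reading of the enrichment figure is a ratio of two fractions: the share of dip seconds that are boundary-adjacent over the share of normal seconds that are. The sketch below assumes that definition; the analyzer's exact formula is not given here, and the counts in the usage line are invented for illustration.

```python
def boundary_enrichment(dip_boundary: int, dip_total: int,
                        normal_boundary: int, normal_total: int):
    """Ratio of boundary-adjacent fractions (dip seconds vs normal seconds).

    Returns None when either population is empty or no normal second is
    boundary-adjacent, since the ratio is undefined in those cases.
    """
    if dip_total == 0 or normal_total == 0 or normal_boundary == 0:
        return None
    dip_frac = dip_boundary / dip_total
    normal_frac = normal_boundary / normal_total
    return dip_frac / normal_frac


# Invented counts: half of the dips near a boundary vs 5% of normal seconds.
ratio = boundary_enrichment(50, 100, 5, 100)
print(f"{ratio:.1f}x enriched")
```

A value near 1x means dips are no more likely at buffer boundaries than anywhere else.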
Accidentally committed tests/__pycache__/*.pyc from a compile check. Remove it and ignore __pycache__/ and *.pyc going forward. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Read the actual KBA (KBA231382) rather than the title: of its two INVALID_SEQUENCE (error 71) scenarios, only Scenario 1 (out-of-order commit against the PIB_SOCKET_0/1 ping-pong) fits our data; Scenario 2 (host-slow backpressure) is ruled out by the lossless stream_soak control at the same rate. Records the three caveats that keep this a firmware lead pending source (AUTO vs manual channel; faults=0 implying a silently swallowed commit failure; the KBA fix not preventing loss) and the four firmware questions it raises. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…uild The RX888 runs the PPS_CTL_ENABLE=1 build, where the marker commit is pure GPIF hardware (TH0/TH1_PPS_COMMIT states) on an AUTO_MANY_TO_ONE channel -- no CPU CommitBuffer/SetWrapUp, cross-route correct by construction. That eliminates the entire KBA231382 commit-ordering class (the SetWrapUp-on-AUTO anti-pattern lives only in synth_pps.c, the PPS_CTL_ENABLE=0 path, which never runs). With software commit ordering ruled out, the loss is internal to the GPIF state machine; the ~250 KB (~15 DMA buffers, ~1 ms) per-event magnitude is a stall signature, not a one-buffer slip, pointing at the slow CTL[2] edge driving the control comparator metastable. Recasts the KBA section from open lead to closed cause and makes the fast-edge experiment the decisive next test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Walk back over-confident framing. The 42.3 ppm is a rare catastrophic mode (3.68% of markers carry ~all the loss), not a per-marker cost; the other 96.3% are 'below a jittery detector', not verified zero. The metastability diagnosis is inference, not a captured event -- record the alternative that the AUTO channel may not tolerate an async mid-descriptor commit regardless of edge quality. Note that 43.6% of loss events are not boundary-adjacent (unexplained by the boundary race), that the uniform marker-size spread is expected (not evidence of metastability), and that the result is n=1. The fast-edge experiment is a test that can fail, not a foregone conclusion. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…parator, no sync)
Firmware-source read of the GPIF config and vendored SDK turns the input-condition suspects into a confirmed signal path (not yet a confirmed cause):
- CTL[2] sampled on the external ADC clock (129.6 MHz); the ~4 us RC edge spans ~500 consecutive sample clocks in the input threshold.
- Control comparator in TOGGLE mode (GPIF_CONFIG bit 12), sampling CTL[2] directly -- no intermediate state, no 2-flop synchronizer; a metastable sample propagates straight into the PPS_COMMIT transition. PPS vs original waveform differs in exactly two bits (CTRL_COMP_ENABLE, CTRL_COMP_TOGGLE).
- DLL correctly off (sync mode, external clock) -- removes the one firmware-config lever previously flagged (KBA210733).
- SDK 1.3.4.
Recasts the experiment order as edge-first (resistor/buffer attacks the 500-cycle dwell directly; a synchronizer state is belt-and-suspenders), while keeping the error bars: the path is confirmed but causality is not, and 43.6% of loss events remain non-boundary and unexplained.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
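The "~500 sample clocks" figure follows directly from the edge time and the sampling clock. A minimal check, treating the ~4 us edge as the full dwell time in the threshold region (an approximation taken from the message, not a measured value):

```python
def dwell_cycles(edge_time_s: float, clock_hz: float) -> float:
    """Number of sampling-clock cycles that elapse during the edge."""
    return edge_time_s * clock_hz


print(f"{dwell_cycles(4e-6, 129.6e6):.0f} cycles")  # about 518, i.e. the ~500 quoted
```

Shortening the edge by a given factor reduces the dwell by the same factor, which is the reasoning behind running the edge experiment first.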
…menu Captures forward-looking firmware features that would convert the host tools' inferences into device-reported ground truth, independent of whether in-band PPS survives the edge experiment. Ranked candidates grouped by goal (dropped-sample detection, timing/clock offset, health monitoring, validation), each with what it buys the existing tools and the open feasibility question for firmware. Highlights the two cheapest high-value GETSTATS additions (drop counter, sample-granular produced count) and the GPIF counter-capture-on-CTL-edge latch as a non-perturbing alternative to the in-band marker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Firmware-confirmed: glDMACount increments on CY_U3P_DMA_CB_PROD_EVENT (producer commits a filled buffer into the DMA channel, before USB consumption). So produced > delivered means buffers entered the channel but never drained out the USB consumer socket -- the gap is inside the FX3, on the DMA->USB drain, not across the wire. Producer fill is fine; the drain loses them. Rules out host backpressure two ways: the no-marker stream_soak control at the same data rate is lossless, and the loss correlates 26.9x with the device-internal buffer-fill phase the host can't see. Adds the PROD_EVENT accounting nuance (per-buffer regardless of fill -> glDMACount x 16 KB may over-count forced partials) and the delivered-side continuity cross-check showing the bulk loss is real, not a rounding artifact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…collision Experiment 2 (1 kOhm fast edge, ~100x faster) result, two independent 3h runs analyzed with pps_log_stats: - Fidelity improved: spurious shorts 403->0, displaced markers 61->32. - Loss unchanged: 42.3->41.2 ppm; boundary-enrichment intact 26.9x->25.8x; dip magnitude identical (median 98347->98220 samples ~= 12 DMA buffers). Conclusion: edge quality governs marker fidelity, not data loss. The loss is a stable, reproducible structural commit-vs-buffer-boundary collision, not edge metastability. This also down-weights the synchronizer as a loss fix (it makes the commit deterministic but cannot move where it lands vs the boundary). The lever the data points at is boundary-aware commit logic in the GPIF state machine (defer/suppress the partial commit in the danger band) plus the open question of whether a near-boundary consumer wrap-up orphans the ~12 buffers in the DMA->USB drain. Strengthens the case for #5 (capture the edge sample index without forcing a commit). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Firmware source resolves the mechanism, superseding the metastability/synchronizer framing:
- The marker is a forced GPIF thread-switch, NOT a commit. BETA_THR_WRAPUP is set in no waveform state; TH0_PPS_COMMIT is the normal DATA_CNT_HIT transition firing unconditionally, abandoning a partial buffer. The short transfer is an implicit DMA-adapter side effect.
- Channel holds 4 x 16 KB buffers (AUTO_MANY_TO_ONE), not 64; the 64 is host URB assembly. So the ~12-buffer loss is a drain-side descriptor desync (3x the whole pool), silent because the consumer side is uninstrumented (CONS_EVENT commented out).
Drops the synchronizer; makes CONS_EVENT telemetry the decisive next step; adds three forward paths: A0 explicit THR_WRAPUP (one bit, with the thread-shutdown/re-arm caveat), A boundary-aware switch, B capture-don't-commit GPIO latch. Fixes the stale KBA closing that still named the edge as the leading suspect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The firmware now counts consumer drains (CY_U3P_DMA_CB_CONS_EVENT) and exposes glDMAConsCount at GETSTATS [36..39]. Both tools now request the 40-byte response and decode it (gracefully absent on older firmware: cons_valid stays 0, all other fields still decode). pps_integrity: - per-second 'orphan+N (drain)' note when produced-consumed steps beyond the in-flight wobble, landing on the same line as the dip so the orphan is correlated with the marker event; - summary 'DMA drain' line (consumed buffers, produced-consumed orphaned MB) plus a cross-check against host undelivered: agreement pins the loss to the in-chip producer->consumer hop; a consumer->host gap would indict the USB wire instead. stream_soak: same decode + a 'DMA drain' baseline line (expected ~in-flight depth with no marker) -- the no-orphaning control the marker steps stand out against. Doc: loss-accounting section documents the new field and cross-check. make check / check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
When glDMAConsCount is present in GETSTATS but pinned at 0 (CONS_EVENT not firing -- expected on an AUTO channel, where consumer callbacks are not delivered to the CPU), the prior code reported the entire stream as 'orphaned' and spammed a per-second orphan note at the full production rate. Now: the per-second note only fires once the consumer counter is actually advancing, and the summary prints a plain 'glDMAConsCount did not advance -- CONS_EVENT not firing' diagnostic instead of a misleading multi-GB orphan figure. Confirmed against a live run where consumed stayed 0 over 680k produced buffers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…isters
Firmware replaced the (never-firing) CONS_EVENT callback with direct socket xferCount register reads: GETSTATS grows to 48 bytes with producer xferCount [36..39], consumer-API [40..43], consumer-raw [44..47], all in bytes that wrap ~16 s. The meaningful quantity is the instantaneous backlog = producer - consumer (wrap-safe; both wrap together, backlog << 2^32).
Host side now:
- requests 48 bytes; decodes the three drain counters (drain_valid gated on >=48 so older firmware degrades cleanly);
- fw_backlog() helper; per-second 'orphan+NKB (drain)' note when the backlog steps past the in-flight wobble, with a 256 MB sanity cap so a non-tracking counter can't spam;
- summary reports backlog start->end and orphaned MB (byte-exact), the API-vs-raw-register skew, and the cross-check vs host undelivered (AGREE when they match within in-flight slack).
stream_soak mirrors the decode + a flat-backlog baseline line. Doc updated for the new layout and the wrap-safe backlog method.
make check / check-asan pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
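The decode and the wrap-safe subtraction can be sketched as follows. The offsets and the 48-byte gate come from the commit message; little-endian byte order is an assumption (it is not stated above), and this `fw_backlog` is a Python stand-in for the C helper of the same name, not its actual code.

```python
import struct

GETSTATS_LEN = 48      # response size that carries the drain counters
DRAIN_OFFSET = 36      # producer [36..39], consumer-API [40..43], consumer-raw [44..47]
U32_MASK = 0xFFFFFFFF


def decode_drain(buf: bytes):
    """Return (producer, cons_api, cons_raw), or None on older, shorter responses."""
    if len(buf) < GETSTATS_LEN:
        return None
    # Assumption: counters are little-endian unsigned 32-bit values.
    return struct.unpack_from("<3I", buf, DRAIN_OFFSET)


def fw_backlog(producer: int, consumer: int) -> int:
    """Producer minus consumer modulo 2^32; valid while the true backlog is << 2^32."""
    return (producer - consumer) & U32_MASK


# The producer has wrapped past zero while the consumer has not yet wrapped.
buf = bytes(36) + struct.pack("<3I", 0x0000_0100, 0xFFFF_FF00, 0xFFFF_FF00)
prod, cons_api, cons_raw = decode_drain(buf)
print(fw_backlog(prod, cons_api))   # 512 bytes, despite the wrap
```

At 129.6 MSPS x 2 bytes the counters advance 259.2 MB/s, so 2^32 bytes wrap in about 16.6 s, consistent with the "~16 s" above.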
…elds A firmware response < 48 B (no socket-xferCount drain counters) previously produced silence -- no DMA drain line at all -- which is ambiguous (did it work? old firmware?). Now both tools print an explicit 'DMA drain: not reported by this firmware (GETSTATS < 48 B)' line so the operator knows to flash the socket-xferCount build rather than wondering why the cross-check is missing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The 48-byte firmware now returns the drain counters, but apiProd (producer xferCount via CyU3PDmaMultiChannelGetStatus) appears to be a single socket of the MANY_TO_ONE channel, not the summed full-stream producer byte count. It therefore advances at a different rate than the single full-rate consumer socket, so producer-consumer drifts GB-scale and wraps instead of being the small in-flight+orphaned backlog. The sanity guard previously only checked the END backlog, so it printed a nonsense 'orphaned -4082 MB'. Now both tools check the start backlog and the orphan magnitude and, when the pair is incoherent, print an explicit 'producer/consumer counters incoherent ... producer xferCount is not the full-stream byte count (one socket of the many-to-one?)' instead of a bogus figure. The consumer side is fine (API == raw register, skew 0); the fix is firmware-side: sum both PIB producer sockets for apiProd. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
The producer-sum firmware fix landed: drain counters are now coherent (backlog ~0, skew 0) instead of GB-scale garbage. On a window with no real loss, orphan ~0 and undelivered within slack both read clean, so the cross-check now says 'both clear -- no loss this window' rather than the mildly alarming 'see delta'. AGREE / see-delta still apply once real loss clears the in-flight slack on a longer run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
For short (e.g. 10 min) confirmation runs, each 'orphan+NKB (drain)' note now also prints the cumulative backlog (backlog X.XXMB), so the drain staircase is legible live without waiting for the summary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…ither) The 15-min run exposed a contradiction: glDMACount x 16384 shows ~42 ppm 'loss' while the byte-exact producer-consumer backlog shows ~0. Both readings are in question -- glDMACount over-counts the partial marker buffers (each counted as a full 16 KB), AND an *exactly* 0.000 backlog is implausible (expect nonzero in-flight), so apiProd may be mirroring apiCons. Rather than trust either, the summary now: - cross-check prints 'CONFLICT: glDMACount says loss, byte-exact says clean (unresolved)'; - a NOTE spells out both failure modes and the two things to confirm (apiProd is an independent producer-socket sum AND shows nonzero steady in-flight); - the loss line becomes 'CONTESTED ... treat as unresolved' instead of asserting 'dropped data'; - the implied-clock line no longer asserts 'the fast clock masked the loss' -- it carries 'UNPROVEN' when the drain disagrees, 'IF the loss is real' when no drain counter is present, and only confirms when the drain agrees. No PASS/FAIL flip: we do not over-trust the suspect 0.000 either. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…he 0.000 The exactly-0 backlog needs proof it is not apiProd mirroring apiCons. Two diagnostics added to the drain summary: - 'drain raw': start/end apiProd, apiCons and rawCons values -- if apiProd == apiCons byte-for-byte, the producer is mirroring the consumer and the 0 backlog is vacuous; - 'drain inflight': peak producer-consumer backlog seen across the whole run -- a genuine independent producer wobbles nonzero as buffers fill/drain; a peak of 0 for the entire run is the mirroring red flag. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Stop inferring from the cooked summary -- capture the raw stream. --statslog FILE writes one CSV row per GETSTATS poll (plus start/end snapshots) with every unprocessed counter we acquire: host (samples_total, ok/bad_xfers, in_flight) and firmware (dma_count, drain_prod/cons/raw, backlog, pib, faults, boot), tagged with the per-second stat and minxfer. Offline analysis (did the backlog ever go nonzero? does drain_prod track drain_cons exactly? does backlog step at dips?) then works off ground truth instead of the derived metrics. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
stream_soak gains the same -l/--statslog as pps_integrity, writing the identical CSV schema (marker columns 0/'soak') so one analyser reads both -- the no-marker control's drain trace is the flat-backlog baseline. pps_log_stats.py now auto-detects a --statslog CSV (vs the -v console log) and analyses the raw counters to settle the questions the summary can only infer: - producer-vs-consumer independence: drain_prod==drain_cons fraction and the backlog distribution -> 'backlog 0 on EVERY row -> apiProd MIRRORS apiCons (vacuous)' vs 'wobbles nonzero -> INDEPENDENT'; - orphan steps: backlog jumps > 6 buffers, with the second and its minxfer; - glDMACount-vs-delivered growth: smooth per-second (per-marker partial-buffer over-count artifact) vs bursty/concentrated (real loss events). Verified on synthetic mirror/independent/artifact/real CSVs; the existing console-log path is unchanged. make check / check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
…klog The real run.csv exposed a bug: backlog is unsigned (prod-cons mod 2^32), and read-skew lets the consumer be caught one buffer AHEAD, wrapping to ~2^32 -- which the analyzer misread as a 262143-buffer 'orphan step'. Now interpret backlog as signed, so the wobble reads as +/-1 buffer. With that, the drain analysis is decisive: independence from the count of rows where prod!=cons (a hard mirror is 0); orphan accumulation from the net start->end change and sustained steps (in-flight wobble returns to ~0, a real orphan ratchets up). When the backlog shows no accumulation, the glDMACount-vs-delivered growth is BY DEFINITION the partial-buffer over-count, so the gap section now defers to the backlog verdict instead of the host-in-flight-confounded smooth/bursty heuristic (median buf/marker shown for scale). Verified on the 10-min run.csv: apiProd independent (53/612 rows differ by 1 buffer), net backlog 0, no steps -> the +7.5 MB glDMACount gap is over-count (12 KB/marker), not dropped data. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Strip the interpretive verdicts the tools had accreted -- they are diagnostic instruments, not adjudicators. Removed: - PASS/FAIL driven by the glDMACount-vs-delivered gap (it over-counts partial marker buffers); PASS/FAIL now reflects only unambiguous transport/device faults (bad_xfers, device reset, spurious, etc.); - 'fast clock masked the loss' / 'implied clock' add-back -> a plain 'Rate offset' line that states the budget and gap ppm and says they are conflated, without asserting which dominates; - 'undelivered/dropped data', 'orphaned', 'CONFLICT/CONTESTED/AGREE', 'mirrored/suspect', 'OVER-COUNT vs REAL' verdicts -> neutral measurement labels (glDMACount-delivered gap; backlog start/end/net; peak |in-flight|; prod!=cons N/M s; raw P/C/Craw). The numbers are all still there (more of them, signed backlog); the interpretation is left to the analyst and the --statslog CSV. make check / check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
… was over-count The 3 h byte-exact drain run settled the investigation, so the doc no longer claims '42 ppm drain-side loss'. Rewrites: - new 'Resolution' section: the marker drops no data (backlog net 0 over 3 h, leak < 6 ppb); the 42.6 ppm gap is glDMACount x 16 KB counting partial marker buffers as full (~10.8 KB/marker); real clock ~0 ppm; the 'fast clock masked the loss' reading is dead; - condensed 'How we got here' history (stream_soak control, KBA ruled out, thread-switch mechanism, edge experiment = fidelity not loss, the CONS_EVENT-dead-on-AUTO -> socket-register saga, orphan hypothesis refuted); - new 'Still open: data corruption' -> the 10 MHz-tone phase-tracking test; - sample-loss-accounting section: glDMACount gap is over-count, the byte-exact drain backlog is the authoritative loss check; - clock-separation, Final-report example, Pass criteria, Takeaways retitled to match the tools-report-measurements design and the real 3 h output; - top-of-doc status callout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
Summary
A measurement harness for the RX888 mk II PPS in-band marker and the
streaming path it rides on, plus the investigation it drove to a resolution.
- pps_integrity: drives the PPS marker at 1 Hz and validates each edge against the sample stream.
- stream_soak: the no-marker control.
- tests/pps_log_stats.py: offline analysis of the -v log and the --statslog CSV; tests/pps_knob_sweep.sh: -q/-p sweep.
Result: the in-band marker is byte-lossless
The headline question — does the marker drop data at production rates? — is
resolved: no. A 3 h run at 129.6 MSPS with byte-exact producer/consumer
drain counters:
[Results table; only the row labels survive: glDMACount × 16384 − delivered | (… apiProd − apiCons) | apiProd ≠ apiCons | (… budget) | bad_xfers, PIB, faults]
The "~42 ppm loss" we chased for weeks was a measurement artifact.
glDMACount counts each committed DMA buffer (PROD_EVENT) as a full 16 KB
regardless of fill. The marker fires a GPIF thread-switch that abandons a
partial buffer, so glDMACount × 16 KB over-counts "produced" bytes by
~10.8 KB/marker: a fixed ~42 ppm at 1 Hz, which is why it was suspiciously
constant across every run while the real clock varied. stream_soak (no
marker) shows no gap because it makes no partial buffers.
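The over-count arithmetic can be checked in a few lines. This is a sketch, not code from the tools: it assumes 2 bytes per sample (not stated here, but consistent with 2.8 TB delivered in 3 h at 129.6 MSPS) and takes KB as 1024 bytes, matching 16 KB = 16384.

```python
# Sketch: turn the per-marker over-count into a ppm figure.
# ASSUMPTION: 2 bytes per sample; the PR text does not state the sample width.
SAMPLE_RATE_HZ = 129.6e6
BYTES_PER_SAMPLE = 2
OVERCOUNT_BYTES_PER_MARKER = 10.8 * 1024  # "~10.8 KB/marker" from the text
MARKER_RATE_HZ = 1.0                      # marker toggled at 1 Hz


def overcount_ppm(overcount_bytes, marker_hz, byte_rate):
    """Over-counted bytes per second as a fraction of the stream rate, in ppm."""
    return overcount_bytes * marker_hz / byte_rate * 1e6


byte_rate = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE
gap_ppm = overcount_ppm(OVERCOUNT_BYTES_PER_MARKER, MARKER_RATE_HZ, byte_rate)
print(f"glDMACount over-count: {gap_ppm:.1f} ppm")
```

Because the numerator depends only on the marker rate and the abandoned-buffer fill, the figure is independent of the actual clock offset, which is consistent with the gap being constant across runs.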
Why "no loss" (byte-exact). The firmware now exposes the producer/consumer
socket xferCount registers; their instantaneous difference, the backlog
actually in the DMA channel, holds 0 ± exactly one buffer for the entire
3 h and ends where it started. A real orphan would make the producer
permanently outrun the consumer, so the end backlog bounds the total leak at
< 6 ppb over 2.8 TB delivered. The producer counter is independent (differs
from the consumer in 1027/10,800 s), so the 0 backlog is real, not a mirror.
The "fast clock masking loss" reading is dead — the real offset is −0.47 ppm
and the +42.6 ppm gap is a constant artifact on top of it.
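A minimal sketch of the signed, wrap-safe backlog and the leak bound it implies. The function names are illustrative, not the tools' actual helpers, and the delivered-byte total assumes 2 bytes per sample.

```python
# Sketch: signed wrap-safe backlog from two free-running 32-bit byte counters.
# Names here are hypothetical; they are not taken from pps_integrity.c.
WRAP = 1 << 32
BUF_BYTES = 16384  # one 16 KB DMA buffer


def signed_backlog(prod, cons):
    """producer - consumer modulo 2^32, read as a signed 32-bit value.

    Read skew can catch the consumer one buffer ahead; read as unsigned that
    wraps to ~2^32, so the upper half of the range is mapped to negatives.
    """
    diff = (prod - cons) % WRAP
    return diff - WRAP if diff >= WRAP // 2 else diff


def leak_bound_ppb(backlog_bytes, delivered_bytes):
    """Upper bound on leaked fraction given the end-of-run backlog, in ppb."""
    return abs(backlog_bytes) / delivered_bytes * 1e9


# Consumer sampled one buffer ahead of the producer: reads as -1 buffer.
ahead = signed_backlog(1_000_000, 1_000_000 + BUF_BYTES)
# Producer has wrapped past 2^32, consumer has not: still +1 buffer.
wrapped = signed_backlog(5, WRAP - BUF_BYTES + 5)

# ASSUMPTION: 2 bytes/sample, so 3 h at 129.6 MSPS is about 2.8 TB.
delivered = int(129.6e6 * 2 * 3 * 3600)
bound = leak_bound_ppb(BUF_BYTES, delivered)
print(ahead // BUF_BYTES, wrapped // BUF_BYTES, f"{bound:.2f} ppb")
```

With an end backlog within one buffer of the start, the bound is one buffer over the delivered total, which is where a figure just under 6 ppb comes from.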
What this PR changed about the tools
Following review feedback, the tools report measurements, not verdicts.
PASS/FAIL keys only on unambiguous faults (bad_xfers, device reset, spurious
shorts); the glDMACount gap, drain backlog, continuity dips, and rate offset
are printed as raw numbers. Interpretation lives in doc/pps_integrity.md and
in pps_log_stats.py over the --statslog CSV.
How we got here (dead ends, recorded)
- stream_soak control: lossless, isolating the effect to the marker.
- … (INVALID_SEQUENCE): ruled out; the marker isn't a CPU commit.
- … but not the gap → edge governs fidelity, not the gap.
- … (TH0_PPS_COMMIT=DATA_CNT_HIT firing unconditionally), no THR_WRAPUP, AUTO_MANY_TO_ONE, 4 × 16 KB buffers: the source of the partial buffers.
- … telemetry to test it produced the byte-exact backlog, which shows no orphaning. (Three firmware iterations: CONS_EVENT never fires on AUTO; the socket read had to sum both producer sockets; the host had to read the wrap-around backlog as signed.) The "12-buffer dip / 26.9× boundary enrichment" were continuity-detector artifacts, not real loss.
Still open: data corruption (delivered ≠ correct)
Byte accounting proves delivery, not correctness. Next experiment: inject a
known 10 MHz tone, down-convert and track the unwrapped phase — a single
dropped/garbled sample is a 27.8° phase step, and correlating those with the
marker positions tests the partial-buffer splice directly (the LTC2208 has no
digital test pattern, so the tone is the ground truth).
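The 27.8° figure is 360° × f_tone / f_s, and a dropped sample shows up as exactly that step in the down-converted phase. The sketch below illustrates this on a synthetic tone. It assumes an ideal complex (analytic) tone for brevity, whereas the real ADC stream is real-valued and would need filtering after mixing; all names are illustrative.

```python
# Sketch: a single dropped sample appears as a fixed phase step after
# down-conversion. ASSUMPTION: ideal complex tone, no noise, no filtering.
import cmath
import math

FS = 129.6e6    # sample rate from the PR text
F_TONE = 10e6   # injected tone from the PR text


def phase_step_deg(f_tone, fs, dropped=1):
    """Expected phase jump when `dropped` consecutive samples go missing."""
    return 360.0 * f_tone * dropped / fs


def residual_phase_deg(samples, f_tone, fs):
    """Mix each sample down by the nominal tone; return residual phase in degrees."""
    out = []
    for n, s in enumerate(samples):
        mixed = s * cmath.exp(-2j * math.pi * f_tone * n / fs)
        out.append(math.degrees(cmath.phase(mixed)))
    return out


tone = [cmath.exp(2j * math.pi * F_TONE * n / FS) for n in range(64)]
damaged = tone[:32] + tone[33:]          # drop sample 32
res = residual_phase_deg(damaged, F_TONE, FS)
step = res[32] - res[31]                 # jump at the splice point
print(f"expected {phase_step_deg(F_TONE, FS):.2f} deg, observed {step:.2f} deg")
```

On a real capture the residual phase would also carry noise and any clock offset as a slow ramp, so the detector would look for steps relative to a local fit rather than an absolute value.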
What's in the tree
- src/pps_integrity.c, src/stream_soak.c; rx888_get_transfer_bytes() accessor (only public-API addition; in the CI symbol set).
- --statslog CSV (raw per-second GETSTATS + host counters) on both tools; the byte-exact drain backlog (signed); measurement-only summaries.
- tests/pps_integrity_smoke.sh, tests/stream_soak_smoke.sh (in make check); tests/pps_knob_sweep.sh; tests/pps_log_stats.py (console log + CSV).
- .gitignore / doc/pps_integrity.md updated.
make all, make check, make check-asan pass.
🤖 Generated with Claude Code