pps_integrity + stream_soak: PPS marker fidelity & stream-integrity harness by ringof · Pull Request #32 · ringof/rx888-tools

ringof · 2026-06-20T01:51:10Z

Summary

A measurement harness for the RX888 mk II PPS in-band marker and the
streaming path it rides on, plus the investigation it drove to a resolution.

- pps_integrity: drives the PPS marker at 1 Hz and validates each edge against the sample stream.
- stream_soak: the no-marker control.
- tests/pps_log_stats.py: offline analysis of the -v log and the --statslog CSV; tests/pps_knob_sweep.sh: -q/-p sweep.

Result: the in-band marker is byte-lossless

The headline question — does the marker drop data at production rates? — is
resolved: no. A 3 h run at 129.6 MSPS with byte-exact producer/consumer
drain counters:

| quantity | value | meaning |
|---|---|---|
| `glDMACount × 16384 − delivered` | +119 MB = 42.6 ppm | the apparent "loss" |
| per marker | 10.8 KB (~0.7 buffer) | the partial-buffer over-count |
| drain backlog (`apiProd − apiCons`) | net 0, peak ±1 buffer, 0 sustained steps | nothing orphaned |
| `apiProd ≠ apiCons` | 1027 / 10,800 s | producer counter independent, not mirrored |
| real clock (`budget`) | −0.47 ppm | TCXO ~nominal; no fast clock |
| `bad_xfers`, PIB, faults | 0 | clean device |

The "~42 ppm loss" we chased for weeks was a measurement artifact.
glDMACount counts each committed DMA buffer (PROD_EVENT) as a full 16 KB
regardless of fill. The marker fires a GPIF thread-switch that abandons a
partial buffer, so glDMACount × 16 KB over-counts "produced" bytes by
~10.8 KB/marker — a fixed ~42 ppm at 1 Hz, which is why it was suspiciously
constant across every run while the real clock varied. stream_soak (no
marker) shows no gap because it makes no partial buffers.
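The size of the artifact follows from simple arithmetic. A minimal Python sketch, assuming 16-bit samples and reading the quoted ~10.8 KB per-marker over-count as KiB; the function name and parameters are illustrative and not part of the tools:

```python
def overcount_ppm(overcount_bytes_per_marker, marker_hz, sample_rate_hz,
                  bytes_per_sample=2):
    """Apparent 'loss' (ppm) caused by counting partial buffers as full ones."""
    stream_bytes_per_s = sample_rate_hz * bytes_per_sample
    return overcount_bytes_per_marker * marker_hz / stream_bytes_per_s * 1e6

# ~10.8 KiB over-counted once per second at 129.6 MSPS (259.2 MB/s)
artifact = overcount_ppm(10.8 * 1024, 1.0, 129_600_000)
print(f"{artifact:.1f} ppm")
```

Because the over-count depends only on the marker rate and the buffer fill, not on the ADC clock, it stays constant from run to run, which matches the behaviour described above.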

Why "no loss" (byte-exact). The firmware now exposes the producer/consumer
socket xferCount registers; their instantaneous difference — the backlog
actually in the DMA channel — holds 0 ± exactly one buffer for the entire
3 h and ends where it started. A real orphan would make the producer
permanently outrun the consumer, so the end backlog bounds the total leak at
< 6 ppb over 2.8 TB delivered. The producer counter is independent (differs
from the consumer in 1027/10,800 s), so the 0 backlog is real, not a mirror.
The "fast clock masking loss" reading is dead — the real offset is −0.47 ppm
and the +42.6 ppm gap is a constant artifact on top of it.
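The leak bound can be checked by hand. A short sketch, assuming 16-bit samples and taking one 16 KB DMA buffer as the backlog uncertainty (the helper name is hypothetical):

```python
FW_DMA_BUF_BYTES = 16384  # firmware DMA buffer size quoted in this PR

def leak_bound_ppb(backlog_uncertainty_bytes, delivered_bytes):
    """Upper bound on undetected leak, in parts per billion."""
    return backlog_uncertainty_bytes / delivered_bytes * 1e9

# 3 h at 129.6 MSPS, 2 bytes per sample: roughly 2.8 TB delivered
delivered = 129_600_000 * 2 * 3 * 3600
bound = leak_bound_ppb(FW_DMA_BUF_BYTES, delivered)
print(f"delivered {delivered / 1e12:.2f} TB, bound {bound:.2f} ppb")
```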

What this PR changed about the tools

Following review feedback, the tools report measurements, not verdicts.
PASS/FAIL keys only on unambiguous faults (bad_xfers, device reset, spurious
shorts); the glDMACount gap, drain backlog, continuity dips, and rate offset
are printed as raw numbers. Interpretation lives in doc/pps_integrity.md and
in pps_log_stats.py over the --statslog CSV.

How we got here (dead ends, recorded)

- stream_soak control: lossless, isolating the effect to the marker.
- KBA231382 (INVALID_SEQUENCE): ruled out; the marker isn't a CPU commit.
- Edge / metastability: a 1 kΩ fast edge fixed spurious shorts (chatter) but not the gap → edge governs fidelity, not the gap.
- Marker mechanism: firmware-confirmed. It is a forced GPIF thread-switch (TH0_PPS_COMMIT = DATA_CNT_HIT firing unconditionally), with no THR_WRAPUP, on an AUTO_MANY_TO_ONE channel of 4 × 16 KB buffers. This is the source of the partial buffers.
- "Drain-side orphan" hypothesis: refuted. Building the consumer-side telemetry to test it produced the byte-exact backlog, which shows no orphaning. (Three firmware iterations: CONS_EVENT never fires on AUTO; the socket read had to sum both producer sockets; the host had to read the wrap-around backlog as signed.) The "12-buffer dip" and "26.9× boundary enrichment" findings were continuity-detector artifacts, not real loss.

Still open: data corruption (delivered ≠ correct)

Byte accounting proves delivery, not correctness. Next experiment: inject a
known 10 MHz tone, down-convert and track the unwrapped phase — a single
dropped/garbled sample is a 27.8° phase step, and correlating those with the
marker positions tests the partial-buffer splice directly (the LTC2208 has no
digital test pattern, so the tone is the ground truth).
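The 27.8° figure is the phase the tone advances per sample at 129.6 MSPS. The sketch below checks that number and shows how a single dropped sample appears as a step in the residual phase. It is an idealised illustration: it assumes complex (analytic) samples are already available, whereas real ADC data would first need I/Q mixing and filtering. All names are hypothetical.

```python
import cmath
import math

def phase_step_deg(tone_hz, sample_rate_hz):
    """Phase advance of the tone per sample, in degrees."""
    return 360.0 * tone_hz / sample_rate_hz

def residual_phase_deg(samples, tone_hz, sample_rate_hz):
    """Remove the nominal rotation and return the unwrapped residual phase."""
    w = 2.0 * math.pi * tone_hz / sample_rate_hz
    out, prev, acc = [], None, 0.0
    for n, s in enumerate(samples):
        ph = cmath.phase(s * cmath.exp(-1j * w * n))
        if prev is None:
            acc = ph
        else:
            # wrap the increment into [-pi, pi) before accumulating
            acc += (ph - prev + math.pi) % (2.0 * math.pi) - math.pi
        prev = ph
        out.append(math.degrees(acc))
    return out

FS, TONE = 129_600_000, 10_000_000
w = 2.0 * math.pi * TONE / FS
# Ideal tone with source sample 50 dropped from the stream
kept = list(range(50)) + list(range(51, 101))
samples = [cmath.exp(1j * w * k) for k in kept]
res = residual_phase_deg(samples, TONE, FS)
print(f"step per sample: {phase_step_deg(TONE, FS):.2f} deg")
print(f"observed jump:   {res[-1] - res[0]:.2f} deg")
```

Correlating the indices where the residual steps with the marker positions is then a matter of comparing two lists of sample offsets.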

What's in the tree

- src/pps_integrity.c, src/stream_soak.c; rx888_get_transfer_bytes() accessor (only public-API addition; in the CI symbol set).
- --statslog CSV (raw per-second GETSTATS + host counters) on both tools; the byte-exact drain backlog (signed); measurement-only summaries.
- tests/pps_integrity_smoke.sh, tests/stream_soak_smoke.sh (in make check); tests/pps_knob_sweep.sh; tests/pps_log_stats.py (console log + CSV).
- Makefile / README / .gitignore / doc/pps_integrity.md updated.

make all, make check, make check-asan pass.

🤖 Generated with Claude Code

Refine PLAN_PPS_INTEGRITY.md against the current tree and resolve the draft's open questions: - Full-transfer size: add a one-line rx888_get_transfer_bytes() getter (buf_bytes is already computed in struct rx888; valid after start()). - GETSTATS decoder: copy the decode into pps_integrity, citing src/fx3_cmd/fx3_stats.h as the layout authority. - GPIO base: OR librx888's dither/randomizer bits into the marker toggle so GPIOFX3 full-word writes don't clear them. Also documents the BIAS_VHF (bit 9 / 0x200) marker-bit alias, a double-counting-proof atomic marker state machine, the missing $(LIBUSB_CFLAGS) on the build target, and the test/install wiring. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Forward-reference the one public-API addition the pps_integrity tool needs (full-transfer size for short-marker detection). Marked planned / not yet implemented, with the post-start() validity caveat and the note that it must join the CI exported-symbol assertion when it lands. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Add src/pps_integrity.c: streams via librx888 while toggling the PPS marker GPIO (BIAS_VHF / bit 9) at 1 Hz on a second EP0-only libusb handle, verifying every rising edge yields exactly one short bulk transfer with no spurious shorts and no misses. Per-second status line plus a PASS/FAIL summary with GETSTATS deltas (PIB errors, streaming faults, boot-count reset detection). Supporting changes: - librx888: add rx888_get_transfer_bytes() read-only accessor so the tool can derive the full-transfer sample count (short-marker threshold). Only valid after rx888_start(). Added to the CI exported-symbol assertion. - Makefile: pps_integrity target (static librx888.a + libusb + $(LIBUSB_CFLAGS)); add to BINS so install/uninstall pick it up. - tests: pps_integrity_smoke.sh (help/exit codes, arg validation, no-device path) wired into `make check`. - docs: doc/pps_integrity.md updated to implemented state; note that firmware GETSTATS exposes no pps_count/pps_fail, so host-side counters are authoritative. README + librx888.md updated. make all, make check, and make check-asan all pass; ABI symbol set matches the CI assertion. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Mirror rx888_stream's -f: accept a firmware path and pass it through to cfg.firmware_path so the tool can upload the FX3 image when the device is in bootloader mode, instead of requiring a separate rx888_stream or fx3_cmd load step first. NULL (default) keeps the prior behaviour: the device must already be in app mode. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

At 32 MSPS a couple of markers per run can fail to register as short transfers (mid-run MISS with zero spurious). To tell a near-full short slipping past SHORT_MARGIN from a genuinely dropped/merged marker, track the smallest transfer size per 1-second window and expose it under -v as a 'minxfer' column. On an ok second this is the marker size; on a MISS second it shows whether any short arrived at all. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

For the marker-drop investigation, make each run self-documenting: - Categorise every miss from the window's smallest transfer (minxfer): 'true drops' (min >= full — no marker emitted) vs 'near-full' (a short that slipped past SHORT_MARGIN). A short below threshold would have registered as a marker, so near-full should always be 0 — reporting it proves the detection threshold isn't hiding real drops. - Print drop rate as a percentage of edges plus a per-hour projection, so long runs at different sample rates yield directly comparable numbers without log post-processing. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
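The two summary figures described in this commit reduce to a pair of ratios. A minimal sketch with a hypothetical helper name; the example numbers are invented for illustration and are not taken from a run:

```python
def drop_stats(drops, edges, elapsed_s):
    """Return (drop rate as % of edges, projected drops per hour)."""
    pct = 100.0 * drops / edges if edges else 0.0
    per_hour = drops * 3600.0 / elapsed_s if elapsed_s > 0 else 0.0
    return pct, per_hour

# Hypothetical: 2 missed markers out of 3600 edges in a one-hour run
pct, per_hour = drop_stats(2, 3600, 3600.0)
print(f"{pct:.3f}% of edges, {per_hour:.1f}/hour")
```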

The marker is the leftover partial buffer each PPS second, so its size is ~(samplerate mod full-transfer-samples) and drifts with the ADC/host clock offset. When that remainder drifts within a buffer-boundary band, the partial is empty (suppressed) or indistinguishable from a full transfer and no in-band short can exist — an inherent blind spot of a short-transfer marker, not a fidelity failure. Classify each miss accordingly: - blind-spot: minxfer < full (near-full partial past SHORT_MARGIN), or minxfer == full while the live remainder (last good marker) is within DANGER_BAND of a boundary. Benign, reported as NOTE. - anomalous: minxfer == full with the remainder mid-buffer — the marker should have been plainly visible but vanished. Real failure. Verdict now requires anomalous == 0 (and spurious == 0, no faults/reset), replacing the blunt +-2 edge/marker tolerance which couldn't tell a benign boundary miss from a real drop. Doc updated with the marker-size model, the blind-spot characterisation, and the firmware-side fix (ZLP / forced minimum-length delimiter). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
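The marker-size model and the miss classification from this commit can be sketched as follows. The full-transfer size (524288 samples, i.e. a 1 MiB transfer of 16-bit samples) and the DANGER_BAND width are assumptions chosen for illustration; the tool's actual constants are not shown in this PR text.

```python
FULL_XFER_SAMPLES = 524288  # assumption: 1 MiB transfer, 2 bytes per sample
DANGER_BAND = 4096          # hypothetical band width, in samples

def marker_remainder(samples_per_second, full=FULL_XFER_SAMPLES):
    """Expected marker size: the partial transfer left over each second."""
    return samples_per_second % full

def near_boundary(remainder, full=FULL_XFER_SAMPLES, band=DANGER_BAND):
    """True when the partial is nearly empty or nearly a full transfer."""
    return remainder < band or remainder > full - band

def classify_miss(minxfer, last_good_marker,
                  full=FULL_XFER_SAMPLES, band=DANGER_BAND):
    """Classify a missed marker from the window's smallest transfer."""
    if minxfer < full:
        return "blind-spot"   # a near-full short slipped past SHORT_MARGIN
    if near_boundary(last_good_marker, full, band):
        return "blind-spot"   # remainder was sitting in the boundary band
    return "anomalous"        # marker should have been plainly visible

print(marker_remainder(129_600_000))
print(classify_miss(FULL_XFER_SAMPLES, last_good_marker=100))
```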

The marker after a hard miss is often ~2x the typical size: the skipped flush rolls forward and the next edge flushes the combined partial, so no samples are lost — the delimiter was merely delayed and is reconstructable in post. Classify such a miss as MERGE (WARN, recoverable) rather than ANOM (FAIL, possible data loss), using a running-median baseline of recent normal markers (recovery spikes excluded) and a MERGE_PCT threshold. The decision is deferred one second until the next marker resolves it. Verdict: merge and blind-spot misses no longer fail the run; only anomalous misses (and spurious shorts / faults) do. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Marker classification only answered "is the delimiter present" — never "did we lose samples". Add direct loss accounting: - Inter-marker continuity: count samples delivered between consecutive markers and compare to (seconds spanned) x the learned per-second rate (running median of single-second intervals). Anchored to the marker (a fixed point in the stream), so it is free of host-window jitter; within ~1s clock drift is negligible, so a dropped DMA buffer is an obvious deficit. This also cross-validates MERGE: a merge spans 2s, so its interval must equal 2x a normal one — if short, the merge dropped data. - Surface librx888 ok_xfers/bad_xfers (transport-level loss) and the total samples delivered. - Reinterpret firmware PIB errors against the deficit: PIB with a deficit = host-visible loss; PIB without = recovered (NOTE). Verdict now fails on any confirmed loss (bad_xfers > 0 or an inter-marker deficit) in addition to anomalous misses and spurious shorts. make check and make check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
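The inter-marker continuity check can be illustrated with a running median. This is a sketch of the idea rather than the tool's code: the window length, the LOSS_TOL value, and the decision to keep flagged intervals out of the baseline are all assumptions.

```python
from collections import deque
from statistics import median

def continuity_deficits(intervals, loss_tol, window=16):
    """intervals: (seconds_spanned, samples_delivered) between markers.

    Returns the sample deficit per interval, or 0 when within tolerance.
    """
    baseline = deque(maxlen=window)  # recent clean single-second intervals
    out = []
    for secs, delivered in intervals:
        deficit = 0
        if baseline:
            expected = secs * median(baseline)
            if expected - delivered > loss_tol:
                deficit = expected - delivered
        out.append(deficit)
        if secs == 1 and deficit == 0:
            baseline.append(delivered)
    return out

# Toy data: a clean 2 s merge, then one interval that is 100 samples short
data = [(1, 1000)] * 5 + [(2, 2000), (1, 900), (1, 1000)]
print(continuity_deficits(data, loss_tol=10))
```

A merge spanning two seconds passes only if it carries twice the per-second baseline, which is the cross-validation described above.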

3 h of 64 MSPS data showed the MERGE-vs-anomalous split (recovery marker >= 170% of the running median) cut a single continuous population of recovery sizes arbitrarily — smaller recoveries tagged "anomalous", larger ones "merge" — while the inter-marker continuity check proved all 37 misses were lossless. The recovery-marker size was the wrong discriminator. Replace it with the continuity ground truth: when the next marker resolves a hard miss, classify by whether the spanning interval shows a sample deficit: - no deficit -> "recovered" (delimiter displaced one edge, data intact, reconstructable) -> WARN - deficit (or run ended unresolved) -> "lost" -> FAIL Drop MERGE_PCT and the marker-size baseline ring. Verdict now fails on lost markers, spurious shorts, confirmed sample loss (bad_xfers or inter-marker deficit), or device faults. Summary reports blind-spot/recovered/lost; docs updated. make check and make check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Parse --rate as a double and convert to Hz (rounded to nearest sample/s) so production rates like ka9q-radio's 129.6 Msps can be tested exactly, instead of only whole MSPS. Reject non-numeric or trailing-garbage rates (e.g. '129.6x') with a usage error. Startup line and summary now show both MSPS and the resolved Hz. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
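The parsing rule described here (accept a fractional MSPS value, round to the nearest Hz, reject trailing garbage) looks roughly like this in Python. The C implementation is not reproduced in this PR text, so the function below is a behavioural sketch with a hypothetical name.

```python
import math

def parse_rate_msps(text):
    """Parse a rate in MSPS and return whole samples per second."""
    try:
        msps = float(text)  # raises ValueError on input such as '129.6x'
    except ValueError:
        raise ValueError(f"invalid rate: {text!r}") from None
    if not math.isfinite(msps) or msps <= 0:
        raise ValueError(f"invalid rate: {text!r}")
    return round(msps * 1e6)

print(parse_rate_msps("129.6"))
```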

Four host-side additions so the loss claim is defensible, not just circumstantial: - Absolute sample budget: expected = samplerate x elapsed vs delivered, reported as +/-ppm. Catches sustained loss the self-referential per-interval rate would absorb. Honest caveat: the ppm is clock offset AND uniform loss combined; the host can't separate them (that needs a firmware "samples produced" counter via GETSTATS — not a stream embed). - Per-second PIB polling: GETSTATS read each second on the EP0 handle localises overflow events to the second and correlates each with that interval's continuity result, and catches a mid-run device reset immediately. Summary reports PIB seconds and how many coincided with a deficit. - Loss-floor honesty: report LOSS_TOL and the largest inter-marker deficit actually observed, so the summary states its own sensitivity and the headroom to the floor. - Learned-vs-nominal rate cross-check (ppm). Verdict unchanged on hard loss (bad_xfers / inter-marker deficit); the gross-budget shortfall beyond localised loss + 100 ppm is a WARN, since it cannot be separated from a slow clock host-side. make check and make check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
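The absolute sample budget is a single ratio. In this sketch a positive value means more samples were delivered than the nominal rate predicts; the function name and the example scenario are illustrative assumptions.

```python
def budget_ppm(delivered_samples, sample_rate_hz, elapsed_s):
    """Offset of delivered vs expected samples, in ppm (positive = above)."""
    expected = sample_rate_hz * elapsed_s
    return (delivered_samples - expected) / expected * 1e6

# Hypothetical: one 16 KB buffer (8192 samples) missing every second for 10 s
rate = 129_600_000
print(f"{budget_ppm(rate * 10 - 8192 * 10, rate, 10.0):.1f} ppm")
```

As the commit message notes, this number alone cannot distinguish a slow clock from uniform loss.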

Use the firmware's glDMACount (GETSTATS byte 0 = DMA buffers produced) against librx888 ok_xfers (buffers delivered): equal deltas = no loss, and the comparison is independent of the ADC clock (a slow clock lowers both equally; only real loss makes produced exceed delivered). This is the decisive answer to "were all DMA buffers delivered?" and the clean separation of clock offset from loss. - Read dma_count in GETSTATS; snapshot librx888 ok_xfers at start and end. - Read the end firmware+lib counters BEFORE rx888_stop (STOPFX3 resets glDMACount); subtract in_flight and allow one queue_depth of slack for the startup ramp and non-atomic read skew. - Report DMA buffers produced/delivered/in-flight/undelivered. Lost buffers fail the run (third independent loss detector alongside bad_xfers and inter-marker continuity). - With all loss detectors clear, report the rate offset as a pure ADC clock calibration ("ADC clock: +X ppm") instead of "clock+loss combined"; mid-run reset now also fails. Add a THEORY OF OPERATION block to the source and expand doc/pps_integrity.md (loss detectors, clock-vs-loss separation, sensitivity, what we deliberately don't do) so the harness is maintainable. No stream embeds. make all, make check, make check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Hardware at 64/129 MSPS showed a false "buffers lost" FAIL: glDMACount counts the firmware's 16 KB DMA buffers while librx888 ok_xfers counts 1 MB USB transfers (~64 DMA buffers each), so comparing the raw counts reported ~64x "undelivered". In bytes the two match exactly (no loss), confirmed by the +19/+44 ppm budget (delivered ABOVE expected). Fixes: - Compare in BYTES: produced_bytes = glDMACount x FW_DMA_BUF_BYTES (16384) vs delivered_bytes = samples x 2, with in-flight subtracted and a runtime guard that marks the check indeterminate (not failing) if the firmware DMA-buffer-size assumption is off by a large factor. Baseline samples_total at start so the delta aligns with glDMACount. - Make inter-marker continuity a DIAGNOSTIC localizer, not a hard fail: marker-detection timing jitters by ~a marker at high rates, so isolated single-interval deficits are redistribution, not loss (also a false FAIL in the runs). It is now reported and corroborated against the byte-exact produced-vs-delivered, and gated to a fully-warmed baseline. - Misses are now "blind-spot"/"displaced" (marker timing only); data loss is judged solely by bad_xfers + produced-vs-delivered. Verdict updated. - Report bytes (GB/MB) for DMA buffers and the gross floor. Update the THEORY block and doc/pps_integrity.md to match. No stream embeds. make all, check, check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
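The unit fix can be shown in a few lines: compare bytes, not counts, and treat the result as indeterminate when the buffer-size assumption is clearly off. The guard factor and the function signature are hypothetical; only the 16384-byte constant and the 2 bytes per sample come from the text.

```python
FW_DMA_BUF_BYTES = 16384  # firmware DMA buffer size

def undelivered_bytes(dma_count_delta, samples_delta, in_flight_bytes=0,
                      bytes_per_sample=2, guard_factor=4.0):
    """Produced minus delivered in bytes, or None when indeterminate."""
    produced = dma_count_delta * FW_DMA_BUF_BYTES
    delivered = samples_delta * bytes_per_sample
    if produced <= 0 or delivered <= 0:
        return None
    ratio = produced / delivered
    if ratio > guard_factor or ratio < 1.0 / guard_factor:
        return None  # buffer-size assumption looks wrong: do not fail the run
    return produced - delivered - in_flight_bytes

# 64 DMA buffers of 16 KB make up one 1 MiB USB transfer (524288 samples)
print(undelivered_bytes(64, 524288))
```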

Firmware maintainer confirmed the GPIF->DMA buffer size is 16 KB, so the produced-vs-delivered byte comparison rests on a verified constant (the runtime guard stays as defense-in-depth against future firmware changes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

A real 3h/129 MSPS run dropped ~82 MB (produced > delivered) with PIB=0 and bad_xfers=0: DMA->USB consumer-side loss that the budget masked (a +28 ppm fast clock minus ~30 ppm loss read as -1.2 ppm). Three follow-ups: - Add -q/--queuedepth and -p/--reqsize (pass-through to librx888 cfg.queue_depth / req_packets) so more in-flight buffering can be tried against the throughput ceiling. - Report loss as ppm and percent next to the MB, and on a failing run print the IMPLIED ADC clock (budget_ppm + loss_ppm added back) so the clock-masking is visible at a glance. - Document the empirical ceiling (64 MSPS lossless, 129 MSPS ~30 ppm consumer-side loss) and why the clock-independent check is essential. make all, check, check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
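The implied-clock figure is just the budget with the measured loss added back. A minimal sketch using the approximate numbers from this commit message; the function names are hypothetical:

```python
def implied_clock_ppm(budget_ppm, loss_ppm):
    """ADC clock offset once the clock-independent loss is added back."""
    return budget_ppm + loss_ppm

def ppm_to_percent(ppm):
    """Convert parts per million to percent."""
    return ppm / 1e4

# About -1.2 ppm budget with about 30 ppm loss: roughly a +28 ppm clock
print(f"{implied_clock_ppm(-1.2, 30.0):+.1f} ppm implied clock")
print(f"{ppm_to_percent(30.0):.4f} % loss")
```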

Automate the -q/-p knob matrix against the high-rate DMA->USB drain ceiling: run low-rate controls, sweep queue depth at the high rate, pick the depth with the least clock-independent (produced-vs-delivered) loss, sweep transfer size at that depth, then print a summary table (undelivered MB, result, spurious, displaced per cell). Env-configurable (RX888_FW/HIRATE/LORATE/DUR_*/QSWEEP/PSWEEP); hardware test, not wired into `make check`. Guards for missing binary/firmware and a no-device abort. Referenced from doc/pps_integrity.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

The scientific control we never ran: the same data-integrity instrumentation as pps_integrity but with NO PPS marker injected, so we can tell whether anomalies are inherent to the streaming path or caused by the marker mechanism. Measures (host clock assumed disciplined): - effective sample rate from delivered-samples / elapsed-time (ppm); - buffers produced vs delivered in BYTES (glDMACount x 16384 vs samples x 2) — clock-independent loss, same as pps_integrity; - SHORT transfers: any transfer below full. With no marker applied each is a pure anomaly (FX3 committing a partial buffer) — the control for pps_integrity's "spurious shorts"; - bad_xfers, per-second PIB, streaming faults, mid-run device reset. Shares the -q/-p knobs and the two-handle design. FAILs on confirmed loss (bad_xfers or produced>delivered), device reset, or early stop; short transfers are a WARN. With loss detectors clear, reports the rate offset as a pure ADC clock calibration. Wired in: Makefile target + BINS (install follows), .gitignore, tests/stream_soak_smoke.sh in `make check`, README + doc/pps_integrity.md ("Control" section). make all, check, check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

…ting The 64 MSPS stream_soak control reported undelivered -33.5 MB (~ one queue depth) on a provably lossless run. Cause: produced and delivered are both measured from the start snapshot, so the steady in-flight already cancels in their difference, but we then subtracted the END in_flight again — double-counting one queue depth and biasing undelivered by ~ -queue. Subtract the in_flight DELTA (end - start) instead, which cancels the steady pipeline regardless of whether the start snapshot caught the ramp. This re-centers the zero-loss baseline at ~0 and tightens the loss floor to the intended ~slack (4 MB / ~3 ppm over 3 h) instead of ~37 MB. Report now shows in-flight start->end for transparency. make all, check, check-asan pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
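The double-counting is easiest to see with numbers. In the sketch below the queue of 32 × 1 MiB transfers is an assumption, picked because it reproduces the ~33.5 MB bias; both helper names are hypothetical.

```python
def undelivered_biased(produced, delivered, in_flight_end):
    """Old form: subtracts the whole end-of-run in-flight amount."""
    return produced - delivered - in_flight_end

def undelivered_corrected(produced, delivered, in_flight_start, in_flight_end):
    """New form: subtracts only the change in in-flight bytes."""
    return produced - delivered - (in_flight_end - in_flight_start)

QUEUE_BYTES = 32 * 1024 * 1024          # assumed steady in-flight depth
produced = delivered = 10 * 1024 ** 3   # lossless run: deltas already match

print(undelivered_biased(produced, delivered, QUEUE_BYTES) / 1e6)
print(undelivered_corrected(produced, delivered, QUEUE_BYTES, QUEUE_BYTES) / 1e6)
```

With a steady pipeline the start and end in-flight amounts are equal, so the corrected form returns zero on a lossless run.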

The 129.6 MSPS stream_soak control settled it: the bare stream sustains 259 MB/s for 3 h with zero short transfers and zero loss (produced == delivered, PIB=0, bad_xfers=0). So 129.6 is not a throughput ceiling; the loss in pps_integrity is caused by the PPS marker injection (the firmware forced partial-buffer commit perturbing the near-capacity DMA->USB pipeline). The direction is conclusive: the lossy run was at the LOWER throughput (129) with the marker, while the higher bare rate (129.6) was clean. Rewrite the doc section accordingly and point to the out-of-band PPS / GPSDO direction as the fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

The loss magnitude (~140-250 KB / ~0.5-1 ms per event, ~500x a clean commit's cost) shows the marker loss is a stall/glitch, not the intrinsic ~0-sample cost of a forced commit. Two untested input conditions are the prime suspects: the 100 kOhm series resistor (~4 us RC edge -> metastable detection) and clock-domain crossing of the async PPS into the GPIF (needs a real 2-flop synchronizer). So this is NOT a verdict that in-band marking is unworkable. Reframe the doc section: throughput exonerated (firm), clock-masking shown (firm), but root cause OPEN pending the edge + synchronizer experiments. Note that in-band, if fixed, is the finest time source (the short transfer length encodes the edge sample index sample-exactly), so it is worth recovering; out-of-band MCU latch remains the safe fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Replace the earlier contaminated ~82 MB / 129 figure with the trustworthy 3 h baseline (sleep inhibited, in-flight-corrected): at 129.6 MSPS the marker run loses 118 MB = 42.3 ppm while the device is pristine (PIB=0, bad_xfers=0, faults=0) and the no-marker control at the same rate loses nothing. Loss is buffers glDMACount counts as produced but that never become a USB transfer — dropped in the commit handoff, not overflow. The budget read -0.5 ppm (a +41.8 ppm fast clock masked the 42.3 ppm loss); only produced-vs-delivered saw it. Marker size swings the full range (~2.5k-521k samples), consistent with a metastable commit trigger. This is the number to beat for the resistor + GPIF-synchronizer experiments. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Reusable analyzer for -v pps_integrity logs. Reports the key correlation from the 129.6 baseline: dip (loss) seconds are 26.9x enriched in buffer-boundary-adjacent marker positions (near-empty/near-full minxfer) vs normal seconds — the signature of a commit-vs-buffer-completion race. Also reports the post-MISS ~2x merge ratio (displaced markers are clean, distinct from dips), dip/spur/MISS co-occurrence, dip magnitude, and inter-dip timing. Accepts multiple logs for a comparison table (_baseline/_lowR/_sync), so the edge + GPIF-synchronizer experiments can be judged by whether dips/hour drops AND boundary-enrichment collapses toward 1x. Stdlib only; referenced from doc/pps_integrity.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
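One plausible reading of the enrichment figure is the fraction of dip seconds whose marker sits near a buffer boundary, divided by the same fraction for normal seconds. The sketch below implements that reading on toy data; it is an assumption about the statistic, not a copy of pps_log_stats.py, and the band width is hypothetical.

```python
def boundary_enrichment(seconds, full, band):
    """seconds: iterable of (minxfer, is_dip). Returns the ratio or None."""
    def adjacent(minxfer):
        return minxfer < band or minxfer > full - band

    dips = [adjacent(m) for m, is_dip in seconds if is_dip]
    normals = [adjacent(m) for m, is_dip in seconds if not is_dip]
    if not dips or not normals or not any(normals):
        return None  # ratio undefined without both populations
    return (sum(dips) / len(dips)) / (sum(normals) / len(normals))

FULL, BAND = 524288, 4096
toy = ([(100, True)] * 2 + [(200000, True)] * 2 +      # 2 of 4 dips adjacent
       [(100, False)] + [(200000, False)] * 9)         # 1 of 10 normals adjacent
print(boundary_enrichment(toy, FULL, BAND))
```

A ratio near 1 would mean dips are no more likely near a boundary than any other second.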

Accidentally committed tests/__pycache__/*.pyc from a compile check. Remove it and ignore __pycache__/ and *.pyc going forward. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Read the actual KBA (KBA231382) rather than the title: of its two INVALID_SEQUENCE (error 71) scenarios, only Scenario 1 (out-of-order commit against the PIB_SOCKET_0/1 ping-pong) fits our data; Scenario 2 (host-slow backpressure) is ruled out by the lossless stream_soak control at the same rate. Records the three caveats that keep this a firmware lead pending source (AUTO vs manual channel; faults=0 implying a silently swallowed commit failure; the KBA fix not preventing loss) and the four firmware questions it raises. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

…uild The RX888 runs the PPS_CTL_ENABLE=1 build, where the marker commit is pure GPIF hardware (TH0/TH1_PPS_COMMIT states) on an AUTO_MANY_TO_ONE channel -- no CPU CommitBuffer/SetWrapUp, cross-route correct by construction. That eliminates the entire KBA231382 commit-ordering class (the SetWrapUp-on-AUTO anti-pattern lives only in synth_pps.c, the PPS_CTL_ENABLE=0 path, which never runs). With software commit ordering ruled out, the loss is internal to the GPIF state machine; the ~250 KB (~15 DMA buffers, ~1 ms) per-event magnitude is a stall signature, not a one-buffer slip, pointing at the slow CTL[2] edge driving the control comparator metastable. Recasts the KBA section from open lead to closed cause and makes the fast-edge experiment the decisive next test. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Walk back over-confident framing. The 42.3 ppm is a rare catastrophic mode (3.68% of markers carry ~all the loss), not a per-marker cost; the other 96.3% are 'below a jittery detector', not verified zero. The metastability diagnosis is inference, not a captured event -- record the alternative that the AUTO channel may not tolerate an async mid-descriptor commit regardless of edge quality. Note that 43.6% of loss events are not boundary-adjacent (unexplained by the boundary race), that the uniform marker-size spread is expected (not evidence of metastability), and that the result is n=1. The fast-edge experiment is a test that can fail, not a foregone conclusion. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

…parator, no sync) Firmware-source read of the GPIF config and vendored SDK turns the input-condition suspects into a confirmed signal path (not yet a confirmed cause): - CTL[2] sampled on the external ADC clock (129.6 MHz); the ~4 us RC edge spans ~500 consecutive sample clocks in the input threshold. - Control comparator in TOGGLE mode (GPIF_CONFIG bit 12), sampling CTL[2] directly -- no intermediate state, no 2-flop synchronizer; a metastable sample propagates straight into the PPS_COMMIT transition. PPS vs original waveform differs in exactly two bits (CTRL_COMP_ENABLE, CTRL_COMP_TOGGLE). - DLL correctly off (sync mode, external clock) -- removes the one firmware-config lever previously flagged (KBA210733). - SDK 1.3.4. Recasts the experiment order as edge-first (resistor/buffer attacks the 500-cycle dwell directly; a synchronizer state is belt-and-suspenders), while keeping the error bars: the path is confirmed but causality is not, and 43.6% of loss events remain non-boundary and unexplained. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

…menu Captures forward-looking firmware features that would convert the host tools' inferences into device-reported ground truth, independent of whether in-band PPS survives the edge experiment. Ranked candidates grouped by goal (dropped-sample detection, timing/clock offset, health monitoring, validation), each with what it buys the existing tools and the open feasibility question for firmware. Highlights the two cheapest high-value GETSTATS additions (drop counter, sample-granular produced count) and the GPIF counter-capture-on-CTL-edge latch as a non-perturbing alternative to the in-band marker. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Firmware-confirmed: glDMACount increments on CY_U3P_DMA_CB_PROD_EVENT (producer commits a filled buffer into the DMA channel, before USB consumption). So produced > delivered means buffers entered the channel but never drained out the USB consumer socket -- the gap is inside the FX3, on the DMA->USB drain, not across the wire. Producer fill is fine; the drain loses them. Rules out host backpressure two ways: the no-marker stream_soak control at the same data rate is lossless, and the loss correlates 26.9x with the device-internal buffer-fill phase the host can't see. Adds the PROD_EVENT accounting nuance (per-buffer regardless of fill -> glDMACount x 16 KB may over-count forced partials) and the delivered-side continuity cross-check showing the bulk loss is real, not a rounding artifact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

…collision Experiment 2 (1 kOhm fast edge, ~100x faster) result, two independent 3h runs analyzed with pps_log_stats: - Fidelity improved: spurious shorts 403->0, displaced markers 61->32. - Loss unchanged: 42.3->41.2 ppm; boundary-enrichment intact 26.9x->25.8x; dip magnitude identical (median 98347->98220 samples ~= 12 DMA buffers). Conclusion: edge quality governs marker fidelity, not data loss. The loss is a stable, reproducible structural commit-vs-buffer-boundary collision, not edge metastability. This also down-weights the synchronizer as a loss fix (it makes the commit deterministic but cannot move where it lands vs the boundary). The lever the data points at is boundary-aware commit logic in the GPIF state machine (defer/suppress the partial commit in the danger band) plus the open question of whether a near-boundary consumer wrap-up orphans the ~12 buffers in the DMA->USB drain. Strengthens the case for #5 (capture the edge sample index without forcing a commit). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Firmware source resolves the mechanism, superseding the metastability/ synchronizer framing: - The marker is a forced GPIF thread-switch, NOT a commit. BETA_THR_WRAPUP is set in no waveform state; TH0_PPS_COMMIT is the normal DATA_CNT_HIT transition firing unconditionally, abandoning a partial buffer. The short transfer is an implicit DMA-adapter side effect. - Channel holds 4 x 16 KB buffers (AUTO_MANY_TO_ONE), not 64; the 64 is host URB assembly. So the ~12-buffer loss is a drain-side descriptor desync (3x the whole pool), silent because the consumer side is uninstrumented (CONS_EVENT commented out). Drops the synchronizer; makes CONS_EVENT telemetry the decisive next step; adds three forward paths: A0 explicit THR_WRAPUP (one bit, with the thread-shutdown/re-arm caveat), A boundary-aware switch, B capture-don't- commit GPIO latch. Fixes the stale KBA closing that still named the edge as the leading suspect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

The firmware now counts consumer drains (CY_U3P_DMA_CB_CONS_EVENT) and exposes glDMAConsCount at GETSTATS [36..39]. Both tools now request the 40-byte response and decode it (gracefully absent on older firmware: cons_valid stays 0, all other fields still decode).

pps_integrity:

- per-second 'orphan+N (drain)' note when produced-consumed steps beyond the in-flight wobble, landing on the same line as the dip so the orphan is correlated with the marker event;
- summary 'DMA drain' line (consumed buffers, produced-consumed orphaned MB) plus a cross-check against host undelivered: agreement pins the loss to the in-chip producer->consumer hop; a consumer->host gap would indict the USB wire instead.

stream_soak: same decode + a 'DMA drain' baseline line (expected ~in-flight depth with no marker) -- the no-orphaning control the marker steps stand out against.

Doc: loss-accounting section documents the new field and cross-check. make check / check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

When glDMAConsCount is present in GETSTATS but pinned at 0 (CONS_EVENT not firing -- expected on an AUTO channel, where consumer callbacks are not delivered to the CPU), the prior code reported the entire stream as 'orphaned' and spammed a per-second orphan note at the full production rate. Now: the per-second note only fires once the consumer counter is actually advancing, and the summary prints a plain 'glDMAConsCount did not advance -- CONS_EVENT not firing' diagnostic instead of a misleading multi-GB orphan figure. Confirmed against a live run where consumed stayed 0 over 680k produced buffers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

…isters

Firmware replaced the (never-firing) CONS_EVENT callback with direct socket xferCount register reads: GETSTATS grows to 48 bytes with producer xferCount [36..39], consumer-API [40..43], consumer-raw [44..47], all in bytes that wrap ~16 s. The meaningful quantity is the instantaneous backlog = producer - consumer (wrap-safe; both wrap together, backlog << 2^32).

Host side now:

- requests 48 bytes; decodes the three drain counters (drain_valid gated on >=48 so older firmware degrades cleanly);
- fw_backlog() helper; per-second 'orphan+NKB (drain)' note when the backlog steps past the in-flight wobble, with a 256 MB sanity cap so a non-tracking counter can't spam;
- summary reports backlog start->end and orphaned MB (byte-exact), the API-vs-raw-register skew, and the cross-check vs host undelivered (AGREE when they match within in-flight slack).

stream_soak mirrors the decode + a flat-backlog baseline line. Doc updated for the new layout and the wrap-safe backlog method. make check / check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
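To make the 48-byte layout and the wrap-safe subtraction concrete, here is a minimal Python sketch of the host-side decode. It is an illustration, not the tools' actual code: the little-endian byte order and the function signatures are assumptions, `decode_drain` is a hypothetical name, and only the offsets, the 48-byte gate, and the `fw_backlog` name come from the commit message.

```python
import struct

DRAIN_OFFSET = 36    # producer xferCount at [36..39], then consumer-API, consumer-raw
DRAIN_MIN_LEN = 48   # shorter responses come from firmware without drain counters


def decode_drain(resp: bytes):
    """Return (prod, cons_api, cons_raw), or None when the counters are absent.

    Assumes three little-endian uint32 fields (byte order is an assumption).
    """
    if len(resp) < DRAIN_MIN_LEN:
        return None  # older firmware: degrade cleanly instead of guessing
    return struct.unpack_from("<III", resp, DRAIN_OFFSET)


def fw_backlog(prod: int, cons: int) -> int:
    """Backlog in bytes, modulo 2^32, so a counter wrap cancels out."""
    return (prod - cons) & 0xFFFFFFFF


# The producer has wrapped past 2^32 while the consumer has not yet:
# the modular difference is still exactly one 16 KB buffer.
print(fw_backlog(5, 0xFFFFC005))
```

The modular subtraction gives a meaningful answer only while the true backlog stays far below 2^32 bytes, which is the condition the commit states.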

…elds

A firmware response < 48 B (no socket-xferCount drain counters) previously produced silence -- no DMA drain line at all -- which is ambiguous (did it work? old firmware?). Now both tools print an explicit 'DMA drain: not reported by this firmware (GETSTATS < 48 B)' line so the operator knows to flash the socket-xferCount build rather than wondering why the cross-check is missing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

The 48-byte firmware now returns the drain counters, but apiProd (producer xferCount via CyU3PDmaMultiChannelGetStatus) appears to be a single socket of the MANY_TO_ONE channel, not the summed full-stream producer byte count. It therefore advances at a different rate than the single full-rate consumer socket, so producer-consumer drifts GB-scale and wraps instead of being the small in-flight+orphaned backlog.

The sanity guard previously only checked the END backlog, so it printed a nonsense 'orphaned -4082 MB'. Now both tools check the start backlog and the orphan magnitude and, when the pair is incoherent, print an explicit 'producer/consumer counters incoherent ... producer xferCount is not the full-stream byte count (one socket of the many-to-one?)' instead of a bogus figure. The consumer side is fine (API == raw register, skew 0); the fix is firmware-side: sum both PIB producer sockets for apiProd.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

The producer-sum firmware fix landed: drain counters are now coherent (backlog ~0, skew 0) instead of GB-scale garbage. On a window with no real loss, orphan ~0 and undelivered within slack both read clean, so the cross-check now says 'both clear -- no loss this window' rather than the mildly alarming 'see delta'. AGREE / see-delta still apply once real loss clears the in-flight slack on a longer run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

For short (e.g. 10 min) confirmation runs, each 'orphan+NKB (drain)' note now also prints the cumulative backlog (backlog X.XXMB), so the drain staircase is legible live without waiting for the summary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

…ither)

The 15-min run exposed a contradiction: glDMACount x 16384 shows ~42 ppm 'loss' while the byte-exact producer-consumer backlog shows ~0. Both readings are in question -- glDMACount over-counts the partial marker buffers (each counted as a full 16 KB), AND an *exactly* 0.000 backlog is implausible (expect nonzero in-flight), so apiProd may be mirroring apiCons.

Rather than trust either, the summary now:

- cross-check prints 'CONFLICT: glDMACount says loss, byte-exact says clean (unresolved)';
- a NOTE spells out both failure modes and the two things to confirm (apiProd is an independent producer-socket sum AND shows nonzero steady in-flight);
- the loss line becomes 'CONTESTED ... treat as unresolved' instead of asserting 'dropped data';
- the implied-clock line no longer asserts 'the fast clock masked the loss' -- it carries 'UNPROVEN' when the drain disagrees, 'IF the loss is real' when no drain counter is present, and only confirms when the drain agrees.

No PASS/FAIL flip: we do not over-trust the suspect 0.000 either.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

…he 0.000

The exactly-0 backlog needs proof it is not apiProd mirroring apiCons. Two diagnostics added to the drain summary:

- 'drain raw': start/end apiProd, apiCons and rawCons values -- if apiProd == apiCons byte-for-byte, the producer is mirroring the consumer and the 0 backlog is vacuous;
- 'drain inflight': peak producer-consumer backlog seen across the whole run -- a genuine independent producer wobbles nonzero as buffers fill/drain; a peak of 0 for the entire run is the mirroring red flag.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

Stop inferring from the cooked summary -- capture the raw stream. --statslog FILE writes one CSV row per GETSTATS poll (plus start/end snapshots) with every unprocessed counter we acquire: host (samples_total, ok/bad_xfers, in_flight) and firmware (dma_count, drain_prod/cons/raw, backlog, pib, faults, boot), tagged with the per-second stat and minxfer. Offline analysis (did the backlog ever go nonzero? does drain_prod track drain_cons exactly? does backlog step at dips?) then works off ground truth instead of the derived metrics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

stream_soak gains the same -l/--statslog as pps_integrity, writing the identical CSV schema (marker columns 0/'soak') so one analyser reads both -- the no-marker control's drain trace is the flat-backlog baseline.

pps_log_stats.py now auto-detects a --statslog CSV (vs the -v console log) and analyses the raw counters to settle the questions the summary can only infer:

- producer-vs-consumer independence: drain_prod==drain_cons fraction and the backlog distribution -> 'backlog 0 on EVERY row -> apiProd MIRRORS apiCons (vacuous)' vs 'wobbles nonzero -> INDEPENDENT';
- orphan steps: backlog jumps > 6 buffers, with the second and its minxfer;
- glDMACount-vs-delivered growth: smooth per-second (per-marker partial-buffer over-count artifact) vs bursty/concentrated (real loss events).

Verified on synthetic mirror/independent/artifact/real CSVs; the existing console-log path is unchanged. make check / check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
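The producer-vs-consumer independence check described above can be sketched roughly as follows. This is not the code in tests/pps_log_stats.py: the function names are hypothetical, and the exact CSV column names `drain_prod` and `drain_cons` are assumed from the counter names listed in the --statslog commit.

```python
import csv
import io


def independence(csv_text: str) -> tuple[int, int]:
    """Return (rows where producer != consumer, total rows)."""
    differ = total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        total += 1
        if int(row["drain_prod"]) != int(row["drain_cons"]):
            differ += 1
    return differ, total


def verdict(differ: int, total: int) -> str:
    """A hard mirror never differs; an independent producer wobbles."""
    if total == 0:
        return "no data"
    return "INDEPENDENT" if differ else "MIRRORS"


# Synthetic two-row log: the second poll catches one buffer in flight.
sample = "drain_prod,drain_cons\n16384,16384\n49152,32768\n"
print(verdict(*independence(sample)))
```

Counting differing rows, rather than averaging the backlog, keeps the test meaningful even when the in-flight wobble averages to zero.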

…klog

The real run.csv exposed a bug: backlog is unsigned (prod-cons mod 2^32), and read-skew lets the consumer be caught one buffer AHEAD, wrapping to ~2^32 -- which the analyzer misread as a 262143-buffer 'orphan step'. Now interpret backlog as signed, so the wobble reads as +/-1 buffer.

With that, the drain analysis is decisive: independence from the count of rows where prod!=cons (a hard mirror is 0); orphan accumulation from the net start->end change and sustained steps (in-flight wobble returns to ~0, a real orphan ratchets up). When the backlog shows no accumulation, the glDMACount-vs-delivered growth is BY DEFINITION the partial-buffer over-count, so the gap section now defers to the backlog verdict instead of the host-in-flight-confounded smooth/bursty heuristic (median buf/marker shown for scale).

Verified on the 10-min run.csv: apiProd independent (53/612 rows differ by 1 buffer), net backlog 0, no steps -> the +7.5 MB glDMACount gap is over-count (12 KB/marker), not dropped data.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
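The unsigned-versus-signed reading is easy to reproduce. The sketch below uses a hypothetical helper name and is not the analyser's actual code. It reinterprets the mod-2^32 difference as a two's-complement 32-bit value and shows how "consumer one buffer ahead" yields the 262143-buffer figure when read unsigned.

```python
BUFFER_BYTES = 16384


def signed_backlog(prod: int, cons: int) -> int:
    """Producer minus consumer as a signed 32-bit byte count."""
    diff = (prod - cons) & 0xFFFFFFFF
    # Values in the upper half of the range are small negative backlogs.
    return diff - (1 << 32) if diff >= (1 << 31) else diff


# Read-skew case: the consumer was sampled one buffer ahead of the producer.
prod, cons = 1_000_000, 1_000_000 + BUFFER_BYTES
unsigned = (prod - cons) & 0xFFFFFFFF
print(unsigned // BUFFER_BYTES)                     # the misread "orphan step"
print(signed_backlog(prod, cons) // BUFFER_BYTES)   # the actual wobble
```

The signed reading stays valid only while the true backlog magnitude is below 2^31 bytes, far above the few-buffer wobble reported here.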

Strip the interpretive verdicts the tools had accreted -- they are diagnostic instruments, not adjudicators. Removed:

- PASS/FAIL driven by the glDMACount-vs-delivered gap (it over-counts partial marker buffers); PASS/FAIL now reflects only unambiguous transport/device faults (bad_xfers, device reset, spurious, etc.);
- 'fast clock masked the loss' / 'implied clock' add-back -> a plain 'Rate offset' line that states the budget and gap ppm and says they are conflated, without asserting which dominates;
- 'undelivered/dropped data', 'orphaned', 'CONFLICT/CONTESTED/AGREE', 'mirrored/suspect', 'OVER-COUNT vs REAL' verdicts -> neutral measurement labels (glDMACount-delivered gap; backlog start/end/net; peak |in-flight|; prod!=cons N/M s; raw P/C/Craw).

The numbers are all still there (more of them, signed backlog); the interpretation is left to the analyst and the --statslog CSV. make check / check-asan pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL

… was over-count

The 3 h byte-exact drain run settled the investigation, so the doc no longer claims '42 ppm drain-side loss'. Rewrites:

- new 'Resolution' section: the marker drops no data (backlog net 0 over 3 h, leak < 6 ppb); the 42.6 ppm gap is glDMACount x 16 KB counting partial marker buffers as full (~10.8 KB/marker); real clock ~0 ppm; the 'fast clock masked the loss' reading is dead;
- condensed 'How we got here' history (stream_soak control, KBA ruled out, thread-switch mechanism, edge experiment = fidelity not loss, the CONS_EVENT-dead-on-AUTO -> socket-register saga, orphan hypothesis refuted);
- new 'Still open: data corruption' -> the 10 MHz-tone phase-tracking test;
- sample-loss-accounting section: glDMACount gap is over-count, the byte-exact drain backlog is the authoritative loss check;
- clock-separation, Final-report example, Pass criteria, Takeaways retitled to match the tools-report-measurements design and the real 3 h output;
- top-of-doc status callout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RVt7wuQHhUfGFMCo7q1aHL
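Both headline figures can be checked with back-of-envelope arithmetic. The sketch below rests on assumptions not stated in the commit: 2 bytes per sample, "KB" meaning 1024 bytes, and the leak bound corresponding to a one-buffer resolution of the end backlog. The 129.6 MSPS rate, the 3 h duration, and the 1 Hz marker rate are taken from the PR summary.

```python
SAMPLE_RATE = 129.6e6        # samples per second (from the PR summary)
BYTES_PER_SAMPLE = 2         # assumption: 16-bit samples
RUN_SECONDS = 3 * 3600       # the 3 h run
BUFFER_BYTES = 16384         # one DMA buffer
OVERCOUNT_PER_MARKER = 10.8 * 1024  # ~10.8 KB per marker, KB taken as 1024 B
MARKER_HZ = 1                # one marker per second

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE
total_bytes = bytes_per_second * RUN_SECONDS

# Over-count: phantom bytes per second relative to real bytes per second.
overcount_ppm = OVERCOUNT_PER_MARKER * MARKER_HZ / bytes_per_second * 1e6

# Leak bound: one buffer of backlog uncertainty over the whole run.
leak_bound_ppb = BUFFER_BYTES / total_bytes * 1e9

print(f"over-count ~ {overcount_ppm:.1f} ppm")
print(f"leak bound ~ {leak_bound_ppb:.2f} ppb")
```

Under these assumptions the over-count comes out close to the reported 42.6 ppm and the bound falls just under 6 ppb, consistent with the figures in the commit. The over-count scales with marker rate rather than with clock error, which matches the observation that the gap stayed constant across runs.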

claude added 3 commits June 20, 2026 01:50

ringof changed the title ~~doc: pps_integrity implementation plan~~ pps_integrity: PPS in-band marker fidelity tool Jun 20, 2026

claude added 18 commits June 20, 2026 02:23

ringof changed the title ~~pps_integrity: PPS in-band marker fidelity tool~~ pps_integrity + stream_soak: PPS marker fidelity & stream-integrity harness Jun 20, 2026

claude added 7 commits June 21, 2026 01:18

claude added 18 commits June 21, 2026 06:09


pps_integrity + stream_soak: PPS marker fidelity & stream-integrity harness#32

ringof wants to merge 46 commits into main from claude/pps-development


2 participants


Summary

Result: the in-band marker is byte-lossless

What this PR changed about the tools

How we got here (dead ends, recorded)

Still open: data corruption (delivered ≠ correct)

What's in the tree
