Summary
osdtrace occasionally emits an op_w row whose queue_lat is an unsigned-underflow value (~UINT64_MAX − N) far larger than the row's total op_lat. The pattern reproduces on ops written to a single-replica pool — both peer slots are the (-1, 0) sentinel, the per-stage bluestore_lat / prepare / aio_wait / seq_wait / kv_commit / osd_lat values are all 0, but queue_lat is the underflow.
Sample row
Captured during a cephadm v18.2.7 (reef) cluster's test workload (.mgr pool, id 0):
osd 1 pg 0.0 op_w size 27 client 14233 tid 3126
throttle_lat 0 recv_lat 7 dispatch_lat 6
queue_lat 18446743547040292 osd_lat 0
peers [(-1, 0), (-1, 0)]
bluestore_lat 0 (prepare 0 aio_wait 0 (aio_size 0) seq_wait 0 kv_commit 0)
op_lat 2895
queue_lat = 18446743547040292 (UINT64_MAX − ~9·10⁸) → ~unsigned underflow of ~9·10⁸ µs
op_lat = 2895 µs (sane)
peers = [(-1, 0), (-1, 0)] → pool size = 1, no replication path taken
- every other per-stage latency = 0
How it was caught
The new strict verify_osdtrace_output added in PR #106 (tests/lib/verify-trace-output.sh) enforces each-sub-latency <= op_lat per row. This invariant flagged the row with:
ERROR: Sub-latency queue_lat=18446743547040292 µs > op_lat=2895 µs (op=op_w osd=1 pool=0 tid=3126)
Reproduction
Suggested investigation
src/osdtrace.bpf.c — look at the code path that handles op_w for pools with size == 1 (no replicas). The queue-start timestamp is probably set behind a code path that's only entered when replication runs; on the fast path the per-CPU queue_start_ts map entry is stale (zero / leftover from the prior op), so the end-time subtraction underflows.
Workaround in test infrastructure (PR #106)
PR #106 has the test verifier downstream of this bug. Short-term, the verifier may skip the strict sub-latency check for op_w rows where both peer slots are (-1, …) so CI stays useful while this bug is fixed. Long-term fix belongs in the BPF prog.
Summary
osdtraceoccasionally emits anop_wrow whosequeue_latis an unsigned-underflow value (~UINT64_MAX − N) far larger than the row's totalop_lat. The pattern reproduces on ops written to a single-replica pool — both peer slots are the(-1, 0)sentinel, the per-stagebluestore_lat/prepare/aio_wait/seq_wait/kv_commit/osd_latvalues are all0, butqueue_latis the underflow.Sample row
Captured during a cephadm v18.2.7 (reef) cluster's test workload (
.mgrpool, id 0):queue_lat = 18446743547040292(UINT64_MAX − ~9·10⁸) → ~unsigned underflow of ~9·10⁸ µsop_lat = 2895µs (sane)peers = [(-1, 0), (-1, 0)]→ pool size = 1, no replication path takenHow it was caught
The new strict
verify_osdtrace_outputadded in PR #106 (tests/lib/verify-trace-output.sh) enforceseach-sub-latency <= op_latper row. This invariant flagged the row with:Reproduction
cephadm-rgw-ci-tests, PR Add cephadm + radosgw E2E coverage for quincy → tentacle #106build-ubuntu-cephadm (ubuntu-22.04, reef)in workflow run https://github.com/taodd/cephtrace/actions/runs/26276979755op_wper ~thousands of rows.Suggested investigation
src/osdtrace.bpf.c— look at the code path that handlesop_wfor pools withsize == 1(no replicas). The queue-start timestamp is probably set behind a code path that's only entered when replication runs; on the fast path the per-CPUqueue_start_tsmap entry is stale (zero / leftover from the prior op), so the end-time subtraction underflows.Workaround in test infrastructure (PR #106)
PR #106 has the test verifier downstream of this bug. Short-term, the verifier may skip the strict sub-latency check for op_w rows where both peer slots are
(-1, …)so CI stays useful while this bug is fixed. Long-term fix belongs in the BPF prog.