ITER-0002: codec comparison benchmark + tradeoff guidance by nnunley · Pull Request #13 · forest-rs/leit

nnunley · 2026-05-30T04:11:38Z

Summary

The measurement half of ITER-0002: a Criterion benchmark comparing the two postings codecs, plus tradeoff guidance grounded in the results.

Stacked on #12 (postings-codecs), which is stacked on #11. Until the ancestors merge, this PR's diff against main includes their commits; review only the benchmark + leit_index accessor + tradeoff doc here.

What's in it

crates/leit_wind_tunnel_index/benches/codec_compare.rs — Criterion benchmark measuring encode time, decode time, and compressed size for DeltaVarintCodec vs BlockDeltaCodec over the deterministic wind-tunnel corpus (1K + 10K docs, multi-field title+body, Zipfian), with a lossless sanity gate so a broken codec fails the bench.
leit_index: PostingEntry made public + InMemoryIndex::postings_by_term() accessor — the minimal surface to extract doc-sorted (SegmentLocalDocId, TermFreq) postings (the lowering stands in for the ITER-0004 segment-write boundary).
docs/2026-05-30-codec-tradeoffs.md — decode-cost vs memory guidance.
Criterion stays out of every primary crate (added only to leit_wind_tunnel_index dev-deps); the no-Criterion sentinels remain green.
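
The three measured quantities and the lossless gate can be sketched without Criterion. This is a minimal stand-in, not the contents of codec_compare.rs: it times a single pass with std::time::Instant instead of Criterion's statistical sampling. `ToyCodec`, `GapU16`, `Measurement`, and `measure` are hypothetical names invented for illustration; the real `Codec` trait in leit_postings may look different.

```rust
use std::time::{Duration, Instant};

/// Hypothetical posting: (doc id, term frequency), 8 bytes uncompressed.
pub type Posting = (u32, u32);

/// Hypothetical codec interface (an assumption, not the leit_postings trait).
pub trait ToyCodec {
    fn encode(&self, postings: &[Posting], out: &mut Vec<u8>);
    fn decode(&self, bytes: &[u8], out: &mut Vec<Posting>);
}

/// Toy codec: a 2-byte doc-id gap plus a 1-byte term frequency per posting.
/// Panics if the input is not doc-sorted or a value does not fit.
pub struct GapU16;

impl ToyCodec for GapU16 {
    fn encode(&self, postings: &[Posting], out: &mut Vec<u8>) {
        let mut prev = 0u32;
        for &(doc, tf) in postings {
            let gap = doc.checked_sub(prev).expect("postings must be doc-sorted");
            assert!(gap <= u16::MAX as u32, "gap too large for toy codec");
            assert!(tf <= u8::MAX as u32, "tf too large for toy codec");
            out.extend_from_slice(&(gap as u16).to_le_bytes());
            out.push(tf as u8);
            prev = doc;
        }
    }

    fn decode(&self, bytes: &[u8], out: &mut Vec<Posting>) {
        let mut prev = 0u32;
        for chunk in bytes.chunks_exact(3) {
            prev += u16::from_le_bytes([chunk[0], chunk[1]]) as u32;
            out.push((prev, chunk[2] as u32));
        }
    }
}

pub struct Measurement {
    pub encode_time: Duration,
    pub decode_time: Duration,
    pub compressed_bytes: usize,
    /// Compressed size as a fraction of the 8-byte/posting baseline.
    pub ratio_of_baseline: f64,
}

pub fn measure<C: ToyCodec>(codec: &C, postings: &[Posting]) -> Measurement {
    let mut bytes = Vec::new();
    let t0 = Instant::now();
    codec.encode(postings, &mut bytes);
    let encode_time = t0.elapsed();

    let mut decoded = Vec::with_capacity(postings.len());
    let t1 = Instant::now();
    codec.decode(&bytes, &mut decoded);
    let decode_time = t1.elapsed();

    // Lossless sanity gate: a codec that does not round-trip fails the run.
    assert_eq!(decoded.as_slice(), postings, "codec round-trip is not lossless");

    let baseline = postings.len() * 8;
    let ratio_of_baseline = if baseline == 0 {
        0.0
    } else {
        bytes.len() as f64 / baseline as f64
    };
    Measurement { encode_time, decode_time, compressed_bytes: bytes.len(), ratio_of_baseline }
}
```

Calling `measure(&GapU16, &postings)` on three postings yields 9 compressed bytes against a 24-byte baseline. Putting the gate inside the measurement function means a size or speed figure is never reported for a codec that loses data.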

Results (baseline)

Codec	Compressed size	Decode
DeltaVarint	~25.4–25.6% of the 8-byte/posting baseline (~2.03–2.05 B/posting)	fastest
BlockDelta	~26.3–27.4% (~2.10–2.19 B/posting)	~4–11% slower (per-block header)
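
The percentage and bytes-per-posting figures in the table are the same measurement in two units. A small sketch of the conversion, assuming only the 8-byte/posting baseline stated above (`percent_of_baseline` is a hypothetical helper, not part of the PR):

```rust
/// Uncompressed baseline quoted in the table: 8 bytes per posting.
pub const BASELINE_BYTES_PER_POSTING: f64 = 8.0;

/// Convert a measured bytes-per-posting figure into percent of the baseline.
pub fn percent_of_baseline(bytes_per_posting: f64) -> f64 {
    100.0 * bytes_per_posting / BASELINE_BYTES_PER_POSTING
}
```

For example, 2.03 B/posting is 25.375% of the baseline and 2.19 B/posting is 27.375%, which agree with the table's ~25.4% and ~27.4% to the reported precision.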

Guidance: DeltaVarint for v1 simplicity/speed; BlockDelta when the per-block header earns its keep via selective/skip decode (ITER-0003) and Phase-3 WAND.
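
To illustrate why a per-block header can pay for itself, here is a minimal sketch of header-driven skipping. It assumes each block records the first and last doc id it covers, as the BlockDelta commit message describes; `BlockHeader` and `first_candidate_block` are hypothetical names, and the actual ITER-0003 cursor design may differ.

```rust
/// Hypothetical per-block header: the doc-id range a block covers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockHeader {
    pub first_doc: u32,
    pub last_doc: u32,
}

/// Index of the first block that can contain a doc id >= `target`,
/// or `None` if every block ends before `target`.
/// Assumes `headers` is sorted by doc range, as in a doc-sorted postings list.
pub fn first_candidate_block(headers: &[BlockHeader], target: u32) -> Option<usize> {
    // Binary search over headers only; no block payload is decoded here.
    let idx = headers.partition_point(|h| h.last_doc < target);
    if idx < headers.len() { Some(idx) } else { None }
}
```

Blocks before the returned index never need decoding. A plain delta stream has no such entry points, because each doc id is reconstructed from the sum of all preceding gaps, so reaching a target means decoding from the start of the list.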

Run: cargo bench -p leit_wind_tunnel_index --bench codec_compare

🤖 Generated with Claude Code


…paths)

Completes ITER-0000 walking skeleton (T6-T9) atop the leit_wind_tunnel harness:
- leit_wind_tunnel_index: index_build/{1k,10k} indexing-throughput benches
- leit_wind_tunnel_query: five execution paths (single/OR/AND/fielded + BM25F cross-field) x {1k,10k}, index built once outside the timed region, ExecutionWorkspace reused across iterations
- Criterion isolated to the two bench crates (dev-dependencies only); primary crates and leit_benchmark untouched
- CI: exclude the three wind-tunnel crates from the no_std/wasm jobs (std-only, mirroring leit_benchmark); no cargo bench step added
- harness docs: note the relationship to leit_benchmark (smoke test vs performance lab)

… (STORY-0096)

ITER-0001 dependency hygiene per the usage-site rule: the leit_wind_tunnel harness uses only rapidhash in its library surface; leit_core/leit_index/leit_text are used solely by its #[cfg(test)] integration tests, so they move to [dev-dependencies] and no longer appear in the harness's production dependency graph. The bench crates were already correct (empty lib; all deps dev). Library build, 17 unit tests, and both bench crates verified green.

… (STORY-0112)

ITER-0001: BlockId, FilterExprId, SegmentOrd, SegmentLocalDocId in leit_core, each a #[repr(transparent)] newtype over a [u8; 4] little-endian inner deriving bytemuck Pod/Zeroable. The on-disk form is the in-memory form: a &[u8] slice from an mmap'd buffer casts in place to &[Id] with no allocation or deserialization (zero-copy), stable across host endianness; ordering is numeric.

bytemuck chosen over zerocopy because zerocopy's derives emit internal #[allow(non_ascii_idents)]/#[allow(non_local_definitions)] that conflict with the workspace's forbid-level Linebender lints (E0453); bytemuck is no_std and lint-clean under the same forbid set.

Proven by SCENARIO-0005 (6 unit tests: value + slice + unaligned round-trip, numeric ordering, LE byte layout).
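
The newtype pattern described in this commit can be sketched as follows. This is a dependency-free illustration, not the leit_core code: `DocIdLe` and `cast_ids` are hypothetical names, and the `unsafe` slice cast stands in for the checked casts that bytemuck provides in the actual commit.

```rust
use core::cmp::Ordering;

/// Hypothetical id type: a u32 stored as little-endian bytes.
/// Size 4, alignment 1, and every bit pattern is a valid value.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct DocIdLe([u8; 4]);

impl DocIdLe {
    pub fn new(v: u32) -> Self {
        Self(v.to_le_bytes())
    }
    pub fn get(self) -> u32 {
        u32::from_le_bytes(self.0)
    }
}

// Ordering must be written by hand: a derived `Ord` would compare the bytes
// lexicographically, which is not numeric order for little-endian data.
impl Ord for DocIdLe {
    fn cmp(&self, other: &Self) -> Ordering {
        self.get().cmp(&other.get())
    }
}

impl PartialOrd for DocIdLe {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// View a byte slice as ids without copying.
/// Returns `None` if the length is not a multiple of 4.
pub fn cast_ids(bytes: &[u8]) -> Option<&[DocIdLe]> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    // SAFETY: `DocIdLe` is `repr(transparent)` over `[u8; 4]`, so it has
    // alignment 1 and no invalid bit patterns; the length was checked above.
    Some(unsafe {
        core::slice::from_raw_parts(bytes.as_ptr().cast::<DocIdLe>(), bytes.len() / 4)
    })
}
```

Because the alignment is 1, the cast works at any byte offset of a mapped buffer, and because the stored bytes are always little-endian, the same file reads identically on big-endian hosts. The manual `Ord` matters: 256 is stored as `[0, 1, 0, 0]` and 255 as `[255, 0, 0, 0]`, so a byte-wise comparison would order them the wrong way round.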

Records the design-decidable decisions for the Phase 2 segment format (DEC-01..10) with rationale, a Phase 3 forward-compatibility audit, and decision->enforcement traceability.

Human-confirmed key calls:
- DEC-01 segment offsets: u64 (no size cap; removes the only Phase 3 format-migration risk)
- DEC-10 integrity: single footer checksum, verified in Full validation mode
- DEC-06 block-aware API: public dedicated BlockCursor trait (Phase 3 WAND consumes it without a format/API break)
- DEC-05 header: fixed-layout little-endian POD, absolute u64 section offsets, magic + version + format_flags, reserved stored-fields/columnar slots

Decision-documentation ACs of STORY-0078/0081-0084/0090/0043-0047 are satisfied here (decided:ITER-0001); their code-enforcement ACs are deferred to ITER-0003/0004.

Forward constraint recorded for ITER-0005: block-metadata schema must carry per-block max_score + doc-range for Phase 3 WAND/MaxScore.

…ORY-0112 AC-2)

ITER-0001 audit corrective: SCENARIO-0005 now also exercises try_from_bytes/try_cast_slice (Ok on well-formed, Err on malformed) per AC-2's validated-read obligation.

…elta) [ITER-0002]

Codec layer for ITER-0002. A Codec trait with two implementations over a stable v1 block format, plus the layout decisions (DEC-11 fixed 128-doc blocks, DEC-12 layout) and a new TermFreq segment-resident type.
- DeltaVarint (CodecId 0) + BlockDelta (CodecId 1, 128-doc independently-decodable blocks with validated first/last-doc header range).
- Hand-rolled LEB128 varint into a type-enforced [u8;5]; no_std + alloc; no new deps.
- API speaks named segment-resident types SegmentLocalDocId + TermFreq (no anonymous u32 drift); EntityId stays the in-memory abstraction, lowered at the segment boundary.
- Decode into caller-provided &mut Vec<..>: scratch-ownership-agnostic (TODO(ITER-0003) / STORY-0079). Doc-sorted precondition enforced via checked_sub (deterministic panic).
- CodecId marker per list; segment-format reservation deferred:ITER-0004.

Stories: STORY-0002/0003/0004/0005(AC1-2)/0009 done; STORY-0087/0088 decided.
Proof: SCENARIO-0006 (36 leit_postings tests). PAR spec + quality reviewed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
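
As a rough illustration of the varint and delta steps named in this commit, the sketch below encodes a u32 as unsigned LEB128 into a fixed [u8; 5] and writes doc-id gaps with a checked_sub precondition. The function names and signatures are assumptions for illustration, not the leit_postings API.

```rust
/// Encode `v` as unsigned LEB128. A u32 needs at most 5 bytes (5 * 7 = 35 bits),
/// so a fixed 5-byte buffer always suffices. Returns the buffer and bytes used.
pub fn encode_varint(mut v: u32) -> ([u8; 5], usize) {
    let mut buf = [0u8; 5];
    let mut n = 0;
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            buf[n] = byte; // final byte: continuation bit clear
            return (buf, n + 1);
        }
        buf[n] = byte | 0x80; // more bytes follow
        n += 1;
    }
}

/// Decode one LEB128 u32 from the front of `bytes`.
/// Returns the value and bytes consumed, or `None` on truncated or overlong input.
pub fn decode_varint(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut v: u32 = 0;
    for (i, &b) in bytes.iter().take(5).enumerate() {
        let payload = (b & 0x7f) as u32;
        // The fifth byte may carry only the top 4 bits of a u32.
        if i == 4 && payload > 0x0f {
            return None;
        }
        v |= payload << (7 * i);
        if b & 0x80 == 0 {
            return Some((v, i + 1));
        }
    }
    None
}

/// Append the gaps between consecutive doc ids as varints.
/// Panics deterministically if `docs` is not sorted ascending.
pub fn encode_doc_gaps(docs: &[u32], out: &mut Vec<u8>) {
    let mut prev = 0u32;
    for &doc in docs {
        let gap = doc.checked_sub(prev).expect("doc ids must be sorted ascending");
        let (buf, n) = encode_varint(gap);
        out.extend_from_slice(&buf[..n]);
        prev = doc;
    }
}
```

Small gaps dominate a doc-sorted list, and any gap below 128 costs a single byte. Using checked_sub rather than wrapping subtraction turns an unsorted input into an immediate panic instead of a silently corrupt stream.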

…ER-0002 T6-T7]

SCENARIO-0070 (process-level): Criterion benchmark comparing DeltaVarint vs BlockDelta over the deterministic wind-tunnel corpus (1K/10K, multi-field, Zipfian). Measures encode time, decode time, and compressed size vs the 8-byte/posting baseline, with a lossless sanity gate. Baseline: DeltaVarint ~25%, BlockDelta ~26-27% of uncompressed; DeltaVarint decode ~4-11% faster.
- crates/leit_wind_tunnel_index/benches/codec_compare.rs (+ [[bench]], leit_postings/leit_core dev-deps). Criterion stays out of all primary crates (SCENARIO-0061/0069 pass).
- leit_index: PostingEntry made public + InMemoryIndex::postings_by_term() accessor, the minimal surface needed to extract doc-sorted (SegmentLocalDocId, TermFreq) postings (lowering stands in for the ITER-0004 segment-write boundary).
- docs/2026-05-30-codec-tradeoffs.md: STORY-0006 AC-3 decode-cost vs memory guidance.

Stories: STORY-0006 (benchmark + guidance).
Proof: SCENARIO-0070. PAR reviewed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

nnunley and others added 9 commits May 29, 2026 19:47

docs: add wind tunnel infrastructure design spec

46f133e

feat: add leit_wind_tunnel harness (deterministic corpus generator, Z…

a2b1641

…ipfian vocab, query fixtures) using rapidhash::v1

test(leit_core): prove validated try_* reads for segment ID types (ST…

a625f33

This was referenced May 30, 2026

ITER-0003: cursor architecture (leit_postings) #14

Open

ITER-0003B: index wiring & ranking equivalence #15

Open

nnunley wants to merge 9 commits into forest-rs:main from nnunley:codec-benchmark