ITER-0002: codec comparison benchmark + tradeoff guidance #13
Open
nnunley wants to merge 9 commits into
…ipfian vocab, query fixtures) using rapidhash::v1
…paths)
Completes ITER-0000 walking skeleton (T6-T9) atop the leit_wind_tunnel
harness:
- leit_wind_tunnel_index: index_build/{1k,10k} indexing-throughput benches
- leit_wind_tunnel_query: five execution paths (single/OR/AND/fielded +
BM25F cross-field) x {1k,10k}, index built once outside the timed region,
ExecutionWorkspace reused across iterations
- Criterion isolated to the two bench crates (dev-dependencies only);
primary crates and leit_benchmark untouched
- CI: exclude the three wind-tunnel crates from the no_std/wasm jobs
(std-only, mirroring leit_benchmark); no cargo bench step added
- harness docs: note the relationship to leit_benchmark (smoke test vs
performance lab)
… (STORY-0096) ITER-0001 dependency hygiene per the usage-site rule: the leit_wind_tunnel harness uses only rapidhash in its library surface; leit_core/leit_index/leit_text are used solely by its #[cfg(test)] integration tests, so they move to [dev-dependencies] and no longer appear in the harness's production dependency graph. The bench crates were already correct (empty lib; all deps dev). Library build, 17 unit tests, and both bench crates verified green.
… (STORY-0112) ITER-0001: BlockId, FilterExprId, SegmentOrd, SegmentLocalDocId in leit_core, each a #[repr(transparent)] newtype over a [u8; 4] little-endian inner deriving bytemuck Pod/Zeroable. The on-disk form is the in-memory form: a &[u8] slice from an mmap'd buffer casts in place to &[Id] with no allocation or deserialization (zero-copy), stable across host endianness; ordering is numeric. bytemuck chosen over zerocopy because zerocopy's derives emit internal #[allow(non_ascii_idents)]/#[allow(non_local_definitions)] that conflict with the workspace's forbid-level Linebender lints (E0453); bytemuck is no_std and lint-clean under the same forbid set. Proven by SCENARIO-0005 (6 unit tests: value + slice + unaligned round-trip, numeric ordering, LE byte layout).
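A minimal sketch of the ID layout described in this commit, for readers who want to see why the on-disk form can double as the in-memory form. `DocIdLe` and `cast_ids` are hypothetical names, not the `leit_core` API. The commit uses bytemuck for the cast; the small `unsafe` block below only stands in for that so the sketch stays dependency-free.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for an ID such as `SegmentLocalDocId`:
/// four little-endian bytes, so size 4 and alignment 1.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct DocIdLe([u8; 4]);

impl DocIdLe {
    pub fn new(value: u32) -> Self {
        Self(value.to_le_bytes())
    }

    pub fn get(self) -> u32 {
        u32::from_le_bytes(self.0)
    }

    pub fn to_bytes(self) -> [u8; 4] {
        self.0
    }
}

// Ordering is numeric, not lexicographic over the stored bytes.
impl Ord for DocIdLe {
    fn cmp(&self, other: &Self) -> Ordering {
        self.get().cmp(&other.get())
    }
}

impl PartialOrd for DocIdLe {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Validated in-place view of a byte buffer as IDs (no allocation).
/// Assumption: this mimics what a checked slice cast would do.
pub fn cast_ids(bytes: &[u8]) -> Result<&[DocIdLe], &'static str> {
    if bytes.len() % 4 != 0 {
        return Err("length is not a multiple of 4");
    }
    // SAFETY: `DocIdLe` is repr(transparent) over `[u8; 4]`, so it has
    // alignment 1 and every bit pattern is valid; the length was checked
    // above and the returned slice borrows from `bytes`.
    Ok(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr().cast::<DocIdLe>(), bytes.len() / 4)
    })
}
```

Because the inner type is a byte array, the cast works at any offset into the buffer, which is the property the unaligned round-trip test in SCENARIO-0005 is described as checking.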
Records the design-decidable decisions for the Phase 2 segment format (DEC-01..10) with rationale, a Phase 3 forward-compatibility audit, and decision->enforcement traceability. Human-confirmed key calls:
- DEC-01 segment offsets: u64 (no size cap; removes the only Phase 3 format-migration risk)
- DEC-10 integrity: single footer checksum, verified in Full validation mode
- DEC-06 block-aware API: public dedicated BlockCursor trait (Phase 3 WAND consumes it without a format/API break)
- DEC-05 header: fixed-layout little-endian POD, absolute u64 section offsets, magic + version + format_flags, reserved stored-fields/columnar slots
Decision-documentation ACs of STORY-0078/0081-0084/0090/0043-0047 are satisfied here (decided:ITER-0001); their code-enforcement ACs are deferred to ITER-0003/0004. Forward constraint recorded for ITER-0005: block-metadata schema must carry per-block max_score + doc-range for Phase 3 WAND/MaxScore.
…ORY-0112 AC-2) ITER-0001 audit corrective: SCENARIO-0005 now also exercises try_from_bytes/try_cast_slice (Ok on well-formed, Err on malformed) per AC-2's validated-read obligation.
…elta) [ITER-0002] Codec layer for ITER-0002. A Codec trait with two implementations over a stable v1 block format, plus the layout decisions (DEC-11 fixed 128-doc blocks, DEC-12 layout) and a new TermFreq segment-resident type.
- DeltaVarint (CodecId 0) + BlockDelta (CodecId 1, 128-doc independently-decodable blocks with validated first/last-doc header range).
- Hand-rolled LEB128 varint into a type-enforced [u8;5]; no_std + alloc; no new deps.
- API speaks named segment-resident types SegmentLocalDocId + TermFreq (no anonymous u32 drift); EntityId stays the in-memory abstraction, lowered at the segment boundary.
- Decode into caller-provided &mut Vec<..>: scratch-ownership-agnostic (TODO(ITER-0003) / STORY-0079). Doc-sorted precondition enforced via checked_sub (deterministic panic).
- CodecId marker per list; segment-format reservation deferred:ITER-0004.
Stories: STORY-0002/0003/0004/0005(AC1-2)/0009 done; STORY-0087/0088 decided.
Proof: SCENARIO-0006 (36 leit_postings tests). PAR spec + quality reviewed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
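The delta + LEB128 scheme this commit describes can be sketched roughly as follows. This is an illustration under assumptions, not the `leit_postings` code: the function names are hypothetical, doc IDs are plain `u32` rather than `SegmentLocalDocId`, and term frequencies are left out to keep it short.

```rust
/// Encode a u32 as LEB128 into a fixed 5-byte buffer.
/// Returns the buffer and the number of bytes actually used (1..=5).
pub fn encode_varint(mut value: u32) -> ([u8; 5], usize) {
    let mut buf = [0u8; 5];
    let mut len = 0;
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            buf[len] = byte; // last byte: continuation bit clear
            return (buf, len + 1);
        }
        buf[len] = byte | 0x80; // more bytes follow
        len += 1;
    }
}

/// Decode one varint from the front of `input`.
/// Returns (value, bytes consumed), or None on truncated/overlong input.
pub fn decode_varint(input: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &byte) in input.iter().take(5).enumerate() {
        let low = (byte & 0x7f) as u32;
        // The fifth byte may only carry the top 4 bits of a u32.
        if i == 4 && low > 0x0f {
            return None;
        }
        value |= low << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Delta-encode ascending doc IDs as varint gaps.
/// Panics deterministically if the input is not sorted (checked_sub).
pub fn encode_deltas(doc_ids: &[u32], out: &mut Vec<u8>) {
    let mut prev = 0u32;
    for &doc in doc_ids {
        let gap = doc
            .checked_sub(prev)
            .expect("doc IDs must be sorted ascending");
        let (buf, len) = encode_varint(gap);
        out.extend_from_slice(&buf[..len]);
        prev = doc;
    }
}

/// Decode gaps back into doc IDs, appending to a caller-provided Vec.
pub fn decode_deltas(mut input: &[u8], out: &mut Vec<u32>) -> Option<()> {
    let mut prev = 0u32;
    while !input.is_empty() {
        let (gap, used) = decode_varint(input)?;
        prev = prev.checked_add(gap)?;
        out.push(prev);
        input = &input[used..];
    }
    Some(())
}
```

Decoding into a caller-provided `Vec` mirrors the scratch-ownership-agnostic choice noted above: the caller decides whether the buffer is reused across calls.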
…ER-0002 T6-T7] SCENARIO-0070 (process-level): Criterion benchmark comparing DeltaVarint vs BlockDelta over the deterministic wind-tunnel corpus (1K/10K, multi-field, Zipfian). Measures encode time, decode time, and compressed size vs the 8-byte/posting baseline, with a lossless sanity gate. Baseline: DeltaVarint ~25%, BlockDelta ~26-27% of uncompressed; DeltaVarint decode ~4-11% faster.
- crates/leit_wind_tunnel_index/benches/codec_compare.rs (+ [[bench]], leit_postings/leit_core dev-deps). Criterion stays out of all primary crates (SCENARIO-0061/0069 pass).
- leit_index: PostingEntry made public + InMemoryIndex::postings_by_term() accessor, the minimal surface needed to extract doc-sorted (SegmentLocalDocId, TermFreq) postings (lowering stands in for the ITER-0004 segment-write boundary).
- docs/2026-05-30-codec-tradeoffs.md: STORY-0006 AC-3 decode-cost vs memory guidance.
Stories: STORY-0006 (benchmark + guidance).
Proof: SCENARIO-0070. PAR reviewed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
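To make the "percent of uncompressed" figures concrete, here is a hedged sketch of how a size ratio against the 8-byte/posting baseline and a lossless gate could be computed. All names are hypothetical, and the split of the 8 bytes into a 4-byte doc ID plus a 4-byte term frequency is an assumption; the commit only states the 8-byte total.

```rust
/// Assumed uncompressed layout: 4-byte doc ID + 4-byte term frequency.
pub const BASELINE_BYTES_PER_POSTING: usize = 8;

/// Compressed size as a fraction of the uncompressed baseline.
/// Returns None for an empty postings list (ratio undefined).
pub fn size_ratio(compressed_bytes: usize, postings: usize) -> Option<f64> {
    if postings == 0 {
        return None;
    }
    Some(compressed_bytes as f64 / (postings * BASELINE_BYTES_PER_POSTING) as f64)
}

/// The baseline itself, written as a trivial "codec" for comparison.
pub fn raw_encode(postings: &[(u32, u32)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(postings.len() * BASELINE_BYTES_PER_POSTING);
    for &(doc, tf) in postings {
        out.extend_from_slice(&doc.to_le_bytes());
        out.extend_from_slice(&tf.to_le_bytes());
    }
    out
}

pub fn raw_decode(bytes: &[u8]) -> Vec<(u32, u32)> {
    bytes
        .chunks_exact(BASELINE_BYTES_PER_POSTING)
        .map(|c| {
            (
                u32::from_le_bytes([c[0], c[1], c[2], c[3]]),
                u32::from_le_bytes([c[4], c[5], c[6], c[7]]),
            )
        })
        .collect()
}

/// Lossless gate: round-trip the postings and panic on any mismatch,
/// so a broken codec fails before any timing is reported.
/// Returns the encoded size in bytes.
pub fn assert_lossless<E, D>(postings: &[(u32, u32)], encode: E, decode: D) -> usize
where
    E: Fn(&[(u32, u32)]) -> Vec<u8>,
    D: Fn(&[u8]) -> Vec<(u32, u32)>,
{
    let bytes = encode(postings);
    assert_eq!(decode(&bytes).as_slice(), postings, "codec is not lossless");
    bytes.len()
}
```

Under this definition, a codec reported at ~25% would emit about 2 bytes per posting on average.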
This was referenced May 30, 2026
Summary
The measurement half of ITER-0002: a Criterion benchmark comparing the two postings codecs, plus tradeoff guidance grounded in the results.
Stacked on #12 (postings-codecs), which is stacked on #11. Until the ancestors merge, this PR's diff against main includes their commits; review only the benchmark + leit_index accessor + tradeoff doc here.
What's in it
- crates/leit_wind_tunnel_index/benches/codec_compare.rs: Criterion benchmark measuring encode time, decode time, and compressed size for DeltaVarintCodec vs BlockDeltaCodec over the deterministic wind-tunnel corpus (1K + 10K docs, multi-field title+body, Zipfian), with a lossless sanity gate so a broken codec fails the bench.
- leit_index: PostingEntry made public + InMemoryIndex::postings_by_term() accessor, the minimal surface to extract doc-sorted (SegmentLocalDocId, TermFreq) postings (the lowering stands in for the ITER-0004 segment-write boundary).
- docs/2026-05-30-codec-tradeoffs.md: decode-cost vs memory guidance.
- … leit_wind_tunnel_index dev-deps); the no-Criterion sentinels remain green.
Results (baseline)
Guidance: DeltaVarint for v1 simplicity/speed; BlockDelta when the per-block header earns its keep via selective/skip decode (ITER-0003) and Phase-3 WAND.
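For context on the guidance above, a rough sketch of what "selective/skip decode" could look like with a per-block first/last-doc header. This is not the BlockDelta format itself: `BlockHeader`, `build_headers`, and `candidate_block` are hypothetical names, and only the 128-doc block size and the first/last-doc range come from the PR.

```rust
/// Fixed block size from DEC-11.
pub const BLOCK_DOCS: usize = 128;

/// Hypothetical per-block header carrying the doc range of the block.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BlockHeader {
    pub first_doc: u32,
    pub last_doc: u32,
}

/// Split ascending doc IDs into blocks of up to 128 and record each range.
pub fn build_headers(sorted_docs: &[u32]) -> Vec<BlockHeader> {
    sorted_docs
        .chunks(BLOCK_DOCS) // chunks are never empty
        .map(|block| BlockHeader {
            first_doc: block[0],
            last_doc: block[block.len() - 1],
        })
        .collect()
}

/// Index of the only block whose range can contain `target`, if any.
/// Every other block can be skipped without decoding its payload.
pub fn candidate_block(headers: &[BlockHeader], target: u32) -> Option<usize> {
    // First block whose last_doc is >= target (headers are ascending).
    let idx = headers.partition_point(|h| h.last_doc < target);
    match headers.get(idx) {
        Some(h) if h.first_doc <= target => Some(idx),
        _ => None,
    }
}
```

With a plain delta stream, finding a doc ID means decoding every gap before it. With headers, a lookup touches the header array and at most one block, which is where the extra 1-2% of size would pay for itself.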
Run:
cargo bench -p leit_wind_tunnel_index --bench codec_compare
🤖 Generated with Claude Code