Phase 2 architectural decisions + core ID types#11

Open
nnunley wants to merge 7 commits into
forest-rs:mainfrom
nnunley:phase2-decisions

Collaborator

@nnunley nnunley commented May 30, 2026

Summary

This PR records the Phase 2 segment-format architectural decisions (those that can be decided by design now, without measurement) and adds the core ID types they depend on.

Stacked on #10 (wind tunnel infrastructure) — please review/merge #10 first. Until it lands, the diff here also shows #10's commits; GitHub will trim this PR to its own changes once #10 merges.

Core ID types (leit_core)

BlockId, FilterExprId, SegmentOrd, SegmentLocalDocId — each a #[repr(transparent)] newtype over a [u8; 4] little-endian value deriving bytemuck Pod/Zeroable. The on-disk form is the in-memory form: a &[u8] slice from an mmap'd buffer casts in place to &[Id] with no allocation or deserialization, stable across host endianness, viewable from any byte offset (alignment-1). Ordering is numeric. Proven by unit tests (single/slice/unaligned round-trip, numeric ordering, LE layout).
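To make the "on-disk form is the in-memory form" claim concrete, here is a minimal std-only sketch of one such ID type. It is an illustration under assumptions, not the code in this PR: the PR derives bytemuck's Pod/Zeroable and uses its checked casts, whereas this sketch hand-writes the cast with `unsafe` so it compiles without external crates. The method names `new` and `get` and the helper `view_ids` are hypothetical.

```rust
use std::cmp::Ordering;

/// Illustrative stand-in for one of the ID types: four little-endian bytes.
/// `repr(transparent)` over `[u8; 4]` gives size 4 and alignment 1.
#[repr(transparent)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct SegmentLocalDocId([u8; 4]);

impl SegmentLocalDocId {
    /// Hypothetical constructor: store the value as little-endian bytes.
    pub const fn new(value: u32) -> Self {
        Self(value.to_le_bytes())
    }

    /// Hypothetical accessor: decode the little-endian bytes.
    pub const fn get(self) -> u32 {
        u32::from_le_bytes(self.0)
    }
}

// Ordering is numeric, not lexicographic over the stored bytes.
impl Ord for SegmentLocalDocId {
    fn cmp(&self, other: &Self) -> Ordering {
        self.get().cmp(&other.get())
    }
}

impl PartialOrd for SegmentLocalDocId {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Hypothetical helper: view a byte slice as IDs in place, with no copy.
/// Returns `None` when the length is not a multiple of 4.
pub fn view_ids(bytes: &[u8]) -> Option<&[SegmentLocalDocId]> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    // SAFETY: the ID is `repr(transparent)` over `[u8; 4]`, so it has
    // alignment 1 and every bit pattern is valid; the length was checked
    // above and the returned slice borrows from `bytes`.
    Some(unsafe {
        std::slice::from_raw_parts(bytes.as_ptr().cast::<SegmentLocalDocId>(), bytes.len() / 4)
    })
}
```

Because the alignment is 1, `view_ids(&buf[1..9])` succeeds even though the sub-slice starts at an odd offset. With the bytemuck derives, the same view would presumably be obtained through a checked cast such as `bytemuck::try_cast_slice`, with no hand-written `unsafe`.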

bytemuck was chosen over serde and zerocopy: serde isn't zero-copy on the mmap read path, and zerocopy's derives emit #[allow]s that conflict with the workspace's forbid-level lints (E0453). bytemuck is no_std and lint-clean under the same lint set.

Phase 2 architectural decisions

docs/2026-05-30-phase2-architectural-decisions.md records DEC-01..10 with rationale, a Phase 3 forward-compatibility audit, and decision→enforcement traceability. Headline calls:

  • u64 segment offsets (no size cap)
  • Fixed-layout little-endian POD header with absolute offsets, magic + version + format_flags, reserved stored-fields/columnar slots
  • Footer integrity checksum, verified in Full validation mode (3 modes: HeaderOnly / Structural / Full)
  • Fully-borrowed zero-copy section views; strict builder/reader type separation
  • Public dedicated BlockCursor trait so a future WAND/MaxScore path consumes block metadata without a format/API break
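The header and validation bullets above can be pictured with a small sketch. Everything concrete here is an assumption made for illustration: the magic bytes, the 32-byte field layout, the two section names, and the placeholder checksum are not taken from the decisions document, which does not specify them in this summary. Only the shape is taken from the list: a fixed little-endian header with magic, version, and format_flags, absolute u64 offsets, a footer checksum, and three validation modes.

```rust
/// Validation depth, named after the three modes listed above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum ValidationMode {
    HeaderOnly,
    Structural,
    Full,
}

#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    TooShort,
    BadMagic,
    UnsupportedVersion(u32),
    OffsetOutOfRange,
    ChecksumMismatch,
}

// Hypothetical constants: the real magic, version, and sizes are not given here.
pub const MAGIC: [u8; 4] = *b"LEIT";
pub const VERSION: u32 = 1;
pub const HEADER_LEN: usize = 32;
pub const FOOTER_LEN: usize = 4;

/// Hypothetical layout: magic(4) version(4) flags(4) reserved(4) then two u64 offsets.
#[derive(Debug, PartialEq, Eq)]
pub struct Header {
    pub version: u32,
    pub format_flags: u32,
    pub postings_offset: u64,
    pub block_meta_offset: u64,
}

fn read_u32(b: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([b[at], b[at + 1], b[at + 2], b[at + 3]])
}

fn read_u64(b: &[u8], at: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&b[at..at + 8]);
    u64::from_le_bytes(raw)
}

/// Placeholder checksum for the sketch only; not the algorithm the format uses.
pub fn toy_checksum(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(u32::from(b)))
}

/// Each mode performs the checks of the previous mode plus one more layer.
pub fn validate(segment: &[u8], mode: ValidationMode) -> Result<Header, HeaderError> {
    if segment.len() < HEADER_LEN + FOOTER_LEN {
        return Err(HeaderError::TooShort);
    }
    if !segment.starts_with(&MAGIC) {
        return Err(HeaderError::BadMagic);
    }
    let version = read_u32(segment, 4);
    if version != VERSION {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    let header = Header {
        version,
        format_flags: read_u32(segment, 8),
        postings_offset: read_u64(segment, 16),
        block_meta_offset: read_u64(segment, 24),
    };
    if mode == ValidationMode::HeaderOnly {
        return Ok(header);
    }
    // Structural: absolute offsets must land between the header and the footer.
    let body_end = (segment.len() - FOOTER_LEN) as u64;
    for off in [header.postings_offset, header.block_meta_offset] {
        if off < HEADER_LEN as u64 || off > body_end {
            return Err(HeaderError::OffsetOutOfRange);
        }
    }
    if mode == ValidationMode::Structural {
        return Ok(header);
    }
    // Full: recompute the checksum over everything before the footer.
    let (body, footer) = segment.split_at(segment.len() - FOOTER_LEN);
    if toy_checksum(body) != read_u32(footer, 0) {
        return Err(HeaderError::ChecksumMismatch);
    }
    Ok(header)
}
```

In this sketch a flipped payload byte passes `HeaderOnly` and `Structural` validation and is caught only in `Full` mode, which is the trade-off the three modes are meant to expose: cheaper opens versus end-to-end integrity.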

These are decisions of record; the code that enforces them lands in the segment-format / cursor iterations.
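As a rough picture of what the BlockCursor decision is aiming at, the sketch below shows a dedicated trait exposing per-block metadata and a consumer that skips blocks that cannot be competitive. The trait's method names, the `BlockMeta` fields, and `SliceBlockCursor` are all hypothetical; the real trait surface lands in a later iteration. The skipping loop is a simplified block-max filter in the spirit of WAND/MaxScore pruning, not an implementation of either algorithm.

```rust
/// Hypothetical per-block metadata: a doc-id range plus an upper bound on score.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct BlockMeta {
    pub first_doc: u32,
    pub last_doc: u32,
    pub max_score: f32,
}

/// Hypothetical shape of a dedicated block-level cursor, separate from
/// any per-document cursor.
pub trait BlockCursor {
    /// Metadata for the block the cursor is positioned on, if any.
    fn current_block(&self) -> Option<BlockMeta>;
    /// Move to the next block; a no-op once the cursor is exhausted.
    fn advance_block(&mut self);
}

/// Illustrative implementation backed by a borrowed slice of block metadata.
pub struct SliceBlockCursor<'a> {
    blocks: &'a [BlockMeta],
    pos: usize,
}

impl<'a> SliceBlockCursor<'a> {
    pub fn new(blocks: &'a [BlockMeta]) -> Self {
        SliceBlockCursor { blocks, pos: 0 }
    }
}

impl<'a> BlockCursor for SliceBlockCursor<'a> {
    fn current_block(&self) -> Option<BlockMeta> {
        self.blocks.get(self.pos).copied()
    }

    fn advance_block(&mut self) {
        if self.pos < self.blocks.len() {
            self.pos += 1;
        }
    }
}

/// Skip blocks whose score upper bound cannot beat `threshold` and return
/// the first block that might contain a competitive document.
pub fn next_competitive_block<C: BlockCursor>(cursor: &mut C, threshold: f32) -> Option<BlockMeta> {
    while let Some(meta) = cursor.current_block() {
        if meta.max_score > threshold {
            return Some(meta);
        }
        cursor.advance_block();
    }
    None
}
```

The point of keeping this as a public trait is that a pruning consumer like `next_competitive_block` depends only on block metadata, so it could be added later without changing the segment format or the existing cursor API.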

Also

Wind-tunnel dependency hygiene: the harness's test-only leit_* deps moved to [dev-dependencies] (used only by its #[cfg(test)] integration tests), keeping them out of its production dependency graph.

nnunley added 7 commits May 29, 2026 19:47
…ipfian vocab, query fixtures) using rapidhash::v1
…paths)

Completes ITER-0000 walking skeleton (T6-T9) atop the leit_wind_tunnel
harness:

- leit_wind_tunnel_index: index_build/{1k,10k} indexing-throughput benches
- leit_wind_tunnel_query: five execution paths (single/OR/AND/fielded +
  BM25F cross-field) x {1k,10k}, index built once outside the timed region,
  ExecutionWorkspace reused across iterations
- Criterion isolated to the two bench crates (dev-dependencies only);
  primary crates and leit_benchmark untouched
- CI: exclude the three wind-tunnel crates from the no_std/wasm jobs
  (std-only, mirroring leit_benchmark); no cargo bench step added
- harness docs: note the relationship to leit_benchmark (smoke test vs
  performance lab)
… (STORY-0096)

ITER-0001 dependency hygiene per the usage-site rule: the leit_wind_tunnel
harness uses only rapidhash in its library surface; leit_core/leit_index/
leit_text are used solely by its #[cfg(test)] integration tests, so they
move to [dev-dependencies] and no longer appear in the harness's production
dependency graph. The bench crates were already correct (empty lib; all deps
dev). Library build, 17 unit tests, and both bench crates verified green.
… (STORY-0112)

ITER-0001: BlockId, FilterExprId, SegmentOrd, SegmentLocalDocId in leit_core,
each a #[repr(transparent)] newtype over a [u8; 4] little-endian inner deriving
bytemuck Pod/Zeroable. The on-disk form is the in-memory form: a &[u8] slice
from an mmap'd buffer casts in place to &[Id] with no allocation or
deserialization (zero-copy), stable across host endianness; ordering is numeric.

bytemuck chosen over zerocopy because zerocopy's derives emit internal
#[allow(non_ascii_idents)]/#[allow(non_local_definitions)] that conflict with the
workspace's forbid-level Linebender lints (E0453); bytemuck is no_std and
lint-clean under the same forbid set. Proven by SCENARIO-0005 (6 unit tests:
value + slice + unaligned round-trip, numeric ordering, LE byte layout).

Records the design-decidable decisions for the Phase 2 segment format
(DEC-01..10) with rationale, a Phase 3 forward-compatibility audit, and
decision->enforcement traceability. Human-confirmed key calls:

- DEC-01 segment offsets: u64 (no size cap; removes the only Phase 3
  format-migration risk)
- DEC-10 integrity: single footer checksum, verified in Full validation mode
- DEC-06 block-aware API: public dedicated BlockCursor trait (Phase 3 WAND
  consumes it without a format/API break)
- DEC-05 header: fixed-layout little-endian POD, absolute u64 section offsets,
  magic + version + format_flags, reserved stored-fields/columnar slots

Decision-documentation ACs of STORY-0078/0081-0084/0090/0043-0047 are satisfied
here (decided:ITER-0001); their code-enforcement ACs are deferred to
ITER-0003/0004. Forward constraint recorded for ITER-0005: block-metadata schema
must carry per-block max_score + doc-range for Phase 3 WAND/MaxScore.
…ORY-0112 AC-2)

ITER-0001 audit corrective: SCENARIO-0005 now also exercises try_from_bytes/
try_cast_slice (Ok on well-formed, Err on malformed) per AC-2's validated-read
obligation.