perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442
Open
samuelburnham wants to merge 17 commits into
Open
perf(rust-kernel): Intern-assigned uids, symbolic Nat offsets, environment-machine WHNF reducer#442samuelburnham wants to merge 17 commits into
samuelburnham wants to merge 17 commits into
Conversation
zisk/scripts/bench-cycles.sh runs the standard dumped-input suite (~/benchdata/zisk-inputs) through ziskemu, records steps, and verifies the committed `failures` publics word is zero. check_one mirrors the Zisk guest's one-work-item path natively so IX_* instrumentation can target a single expensive constant.
Port of IxVM dbc4177/5dcab7f/f7cfe23 to the Rust kernel, three coordinated changes: - whnf (try_nat_offset_stuck): leave `Nat.add base (Lit n)` (symbolic base, n>0) and `Nat.div/mod base (Lit k)` (k>=2) STUCK in compact form instead of delta-unfolding into succ^n towers / the division algorithm. Iota over such majors already works via cleanup_nat_offset_major. - linear-rec: collapse `succ^o(Nat.rec base succ-step (Lit n))` with a SYMBOLIC base to `Nat.add base (Lit n+o)` instead of declining into n iota steps. - def-eq (try_def_eq_offset): decompose both sides to base + offset (literals, succ layers, add-lit read in O(1)) and strip the shared offset in ONE step. The previous one-succ-peel recursed through full is_def_eq per layer, which blew MAX_DEF_EQ_DEPTH=2000 on deep offsets. Guest cycles (ziskemu, vs d2ea3d9 baseline): Int16.instRxcHasSize_eq 5,697,166,200 -> 56,028,269 (-99.0%) binsearch 95,800,411 -> 87,735,518 (-8.4%) Nat.add_comm 18,690,815 -> 17,805,210 (-4.7%) mergesort shard 4,005,320,347 -> 3,920,521,476 (-2.1%) Vector.extract_append._proof_1 unchanged (+0.02%), rest ~0. 604 kernel tests pass (one test updated: whnf of `x + 2` now asserts the compact stuck form; new test covers tower-vs-compact def-eq).
Every KExpr/KUniv construction used to compute a blake3 hash of its content for identity (intern key, cache keys, equality) — ~20% of guest cycles on reduction-heavy constants (app_hash 22-33% cumulative). Identity is now an intern-assigned u64 uid: - uids come from a process-global atomic counter and are NEVER reused, so uid equality implies structural equality and cache keys cannot alias across intern-table clears (a stale key can only miss). - InternTable keys on shallow structural keys (ExprKey/UnivKey): variant tag + child uids + semantic payload, mirroring exactly what the old content hash covered (display names, binder info, mdata excluded). intern_expr/intern_univ canonicalize children recursively when handed nodes built outside the table, preserving the old "structural identity" interning semantics. - PartialEq for KExpr/KUniv is now structural with the uid fast path (canonical-vs-canonical compares O(1)); hash_eq remains the fast-but-incomplete uid check for cache/quick-path callers where a false negative only costs a fallback. - Cache ctx components stay blake3 (CtxAddr) but hash uids instead of content addrs; push_local/push_let intern their frame types so equal context suffixes share cache entries. - ingress's hash-first intern probe becomes key-first (same skip-the- construction win, no hash). Soundness note: affirmative identity (equal uid => equal term) is what def-eq quick paths and caches rely on, and it holds by construction; incomplete negatives only cost cache misses. Nat/Str literals keep their blake3 blob Addresses (ingress-meaningful). Guest cycles (ziskemu, vs previous commit): mergesort shard 3,920,521,476 -> 2,526,226,208 (-35.6%) Vector.extract_..._1 7,692,517,694 -> 5,369,879,578 (-30.2%) Nat.add_comm 17,805,210 -> 12,687,822 (-28.7%) binsearch 87,735,518 -> 64,562,040 (-26.4%) Int16.instRxc..._eq 56,028,269 -> 45,024,529 (-19.6%) natgcdcomm/stringappend/rbmap: -11.2% / -6.9% / -2.7% 604 kernel tests pass (16 hash-determinism tests updated to assert the preserved property — identity ignores names/bi/mdata — via structural equality, shallow-key equality, or shared-table interning).
The WHNF loops (full + no-delta) and the def-eq lazy-delta loop probed all five primitive recognizers (native/bitvec/nat/decidable/string) every iteration; each collects its own app spine and runs its own gauntlet of 32-byte address compares — pure tax on ordinary constant-headed terms. Classify the head once per iteration via an allocation-free app-chain walk + per-address memo (KEnv::prim_family_cache), and call at most the one matching family reducer. Family head-address sets are disjoint, so dispatch order semantics are unchanged. The nat-offset-stuck probe gates on the Nat family too. Mirrors the IxVM prim_family dispatch memo (ix-aiur, measured there as part of -28% on Vector._proof_2). Guest cycles (ziskemu, vs previous commit): mergesort shard 2,526,226,208 -> 2,399,597,758 (-5.0%) Vector.extract_..._1 5,369,879,578 -> 5,112,313,900 (-4.8%) Nat.add_comm 12,687,822 -> 12,249,578 (-3.5%) Int16.instRxc..._eq 45,024,529 -> 43,835,207 (-2.6%) binsearch 64,562,040 -> 64,095,836 (-0.7%) others ~0 (+25 steps cache-population noise on two inputs) 604 kernel tests pass.
Env::get_anon paid one blake3 content-hash per constant at parse time. The binding check moves into LazyConstant::get() (pending_addr field): constants shipped in a closure but never forced by the typechecker are never hashed, while everything the kernel certifies is still verified on materialization — checking a constant forces it. The compile-side from_constant path (pre-populated cache) is exempt as before. Guest cycles: rbmap -9.5%, natgcdcomm -6.4%, stringappend -4.2%, Int16 -3.7%, binsearch -1.6%, nataddcomm -0.3%; mergesort/vector +0.01% (re-verification of the few constants materialized more than once). 200 ixon + 604 kernel tests pass (tampering test updated to assert rejection at get() instead of at parse).
Both complete at ~42M steps, matching Int16 — the nat-offset compaction makes the cost range-independent (pre-fix, Int16's 2^16 range alone was 5.7B steps and Int32/Int64 would not have completed).
Records the two-identity-layer boundary after 1e3029d: constants/blobs keep their blake3 Ixon Addresses (claims, Merkle roots, proof store, cross-run reuse, and per-constant inclusion proofs are all unchanged), while the removed blake3 hashing was the kernel-internal per-node scheme used only for intern/cache identity inside one checker run — never serialized, never part of a proof artifact. Documents the soundness argument (equal uid => equal term, never reused), the boundary rule (uids must not escape into serialized artifacts), and what a future sub-constant commitment feature would need to recompute at the egress boundary.
PrimFamily is referenced by the public KEnv::prim_family_cache field, so it must be pub; drop a redundant path qualification.
H-15 (perf): the transient-nat probes run on every whnf/whnf_core call BEFORE the cache lookup; they paid two collect_app_spine Vec allocations plus a KConst::Recr clone per call on the overwhelmingly common non-Nat-recursor path. Pre-filter with an allocation-free spine_head_and_len; spine collection + Recr lookup now happen only for actual Nat-recursor heads. Also drop the redundant rules.clone() in the iota IotaInfo build (rules is already owned from the try_get_const clone). H-12 (hardening): output-size cap (2^26 bits) on natively computed shiftLeft/pow results — a ~12-byte adversarial term could otherwise demand a multi-GiB allocation (cycle explosion in-guest). Oversized results stay stuck, like the existing pow exponent guard. Ported from jcb/fixes 3763356 (pre-workspace layout), adapted to kernel-riscv. Guest cycles (ziskemu, vs previous commit): mergesort shard 2,399,808,204 -> 2,313,445,541 (-3.6%) Vector.extract_..._1 5,111,066,294 -> 4,949,833,244 (-3.2%) Nat.add_comm 12,217,888 -> 11,798,013 (-3.4%) Int16/32/64 rxc ~42.2M -> ~41.3M (-2.1%) binsearch 63,115,445 -> 62,640,316 (-0.75%) small inputs ~0. 604 kernel tests pass.
Closes the adversarial-binding review of the uid-interning design: - ixon: blob bytes are blake3-verified against their address at load in all three env deserializers (ports jcb/fixes H-1). This is the one place adversarial CONTENT enters kernel identity: ExprKey::Nat/Str and literal structural equality key on the blob Address (mirroring the old content hash), which is sound only if address binds bytes. Costs +0.06..+0.29% guest cycles (blobs are a small fraction of env bytes; constants stay lazily verified). - kernel: fresh_uid aborts on counter exhaustion instead of wrapping (>2^67 guest cycles to reach; identity soundness should not rest on "probably won't happen"). - kernel: literal structural-equality arms compare values as well as blob addresses (defense in depth — redundant under the load-time check, degrades to inequality if that invariant were ever violated). - kernel: intern hits assert structural equality in debug builds. - docs/kernel_identity.md: adversarial-model section — why uid identity cannot be cache-poisoned (uids are assigned by a sequential counter, never computed from content, so there is no hash function to collide; intern keys are exact structural Eq, never digests; cache hits require uid equality; map-bucket collisions cost a key compare, not a wrong hit). 604 kernel + 200 ixon tests pass.
…d story The *_mdata_with_addr constructors were the only API through which a caller-supplied (hence potentially duplicated) uid could enter a node; nothing outside expr.rs uses them since the ingress key-first rework, so make them private — uid uniqueness per process is now enforced by visibility, not convention. Docs: spell out the within-shard (impossible by assignment) vs between-shards (certain and harmless; uids never cross the process boundary) collision analysis.
Krivine-style machine for the structural WHNF fragment, ported from the proven IxVM implementation (ix-aiur d0d3375 Phase A hybrid entry, b4c4ea3 Phase B closure-iota; measured there Int16 -33.8%, Vector -17.1%, mergeSort -6.8%). Covers the Rust-specific mapping the IxVM side doesn't have: cheap_proj/cheap_rec threading, fuel retention (adversarial posture), ambient let/LDecl zeta via the exit-to-outer- loop contract, iterative loop for stack discipline, clo_subst readback with lbr guards + per-call scratch, unchanged cache/uid semantics. Phased A (beta/zeta) then B (closure-iota, the UTF-8-class payoff), with the known curried-sharing regression class added to the bench suite before implementation.
…beta path) Port of the IxVM environment machine (~/ix-aiur d0d3375; design in docs/env_machine_whnf.md). whnf_core's App arm enters a Krivine-style machine when a beta fires: beta/zeta become O(1) environment pushes onto a persistent cons-list of closures, and substitution materializes only at machine exits (clo_subst readback) — work proportional to what the reduction consumes, not to every beta performed. A beta chain ending in another beta never materializes its intermediate bodies. - subst.rs: Clo/MEnv closure types (carried env length; per-Clo OnceCell readback memo standing in for Aiur's global memoization), clo_subst/clo_readback with lbr guards, per-call (uid, depth) memo from a new InternTable scratch pool, results interned like subst. - whnf.rs: iterative machine_whnf loop + machine_exit, replacing the multi-lambda peel + simul_subst block. Fuel charged on beta/zeta at the eager path's one-substitution-event granularity. Every exit reads back and re-enters the outer loop, so ambient let/LDecl zeta, iota, prim dispatch, and cheap_proj/cheap_rec def-eq discipline are unchanged; all caches and keys untouched. - bench-cycles.sh: add natreclinear/natfoldsucc (foldAdd-class sentinels for the known curried-sharing regression class). 609 kernel + 200 ixon tests pass; native check_one true on Nat.add_comm, Nat.gcd_comm, Int16.instRxcHasSize_eq, natRecLinearCheck, Nat.fold_succ; perf_check whole-env failure parity with HEAD on 11 envs. Guest cycle benchmarks pending (box busy).
Port of IxVM try_iota_c (~/ix-aiur 097e9d5). The machine's Const exit no longer reads its spine back when the head is a recursor on the main ctor-rule path: only the major premise materializes (readback + the same full whnf the eager path uses), and on a hit the rule RHS re-enters the machine with params/motives/minors and post-major arguments as their ORIGINAL closures plus the ctor fields wrapped closed. The rule's Lam-chain betas push them straight into an environment, so unselected minors — dropped match/Decidable branches, the UTF-8 codec class — are never substituted and never read back. Off-main-path cases miss to the plain readback exit by construction, preserving semantics exactly: K recursors (ctor synthesis needs infer + def-eq), Nat literal majors (transient-work cache discipline and the linear-rec/offset shortcuts stay on the plain path), Str literal majors, struct-eta candidates, and stuck majors. cheap_rec mode skips machine-iota entirely, mirroring the eager gating. A miss costs one readback of the major; the plain path's whnf of it then hits a warm cache. Machine-native delta (IxVM Phase C1.5) does NOT port at this layer: their machine lives in a whnf that includes delta, while ours sits inside whnf_core, which must stay delta-free for def-eq's lazy-delta unfold ordering. Spanning the whnf delta loop with closures is a separate, larger restructure. 610 kernel + 200 ixon tests pass (new closure-iota-via-beta unit test covers the spine arithmetic incl. post-major shift); native check_one true on the usual five constants; perf_check whole-env failure parity on 11 envs. Guest cycle benchmarks pending alongside Phase A's.
map_or_else over map().unwrap_or_else(), pass ExprKey by reference in timed_intern_or_build, explicit clone for spec_params dedup snapshot.
fbf4c7a to
4d2452d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request implements Rust-kernel and ixon optimizations for the Zisk guest, measured against the base #411 as deterministic guest cycle counts (
ziskemu -m) over dumped shard inputs (zisk/scripts/bench-cycles.sh), with every counted run's committedfailurespublics word verified zero.Key Results
Int16.instRxcHasSize_eq: 5.70 B → 42.0 M guest steps (135.6×);Int32/Int64variants confirmed at the same magnitude (~42 M) — in-circuit cost is now integer-width-independentVector.extract_append._proof_1−32.4%; the Init Array/VectorExtractproof family −22% to −33% each (five multi-billion-step constants)Std.Tactic.BVDecide.BVExpr.bitblast.goCache_Inv_of_Inv._mutual(28.35 B steps, the most expensive constant measured in Init+Std): previously crashed the 512 MB guest (OOM-class null read) → now completes as a valid kernel pass — lazy closures cut peak RAM, not just cyclesdocs/kernel_identity.md), blob bytes verified at load in all deserializersFifteen Commits (oldest first)
bench: guest cycle harness + native single-constant check
bench-cycles.shsuite over dumped guest stdins with failures-word validation;check_oneexample mirrors the guest's reuse-mode check nativelykernel: keep symbolic Nat offsets compact (whnf stuck + offset def-eq)
Nat.add x lit/Nat.div|mod x kstay stuck as compact offsets instead of materializingsucc^n(x)chains or the division algorithm; bulk offset def-eq decides offset pairs directlyInt16.instRxcHasSize_eqcollapse (5.70 B → 56 M) and the UTF-8-codec-class fixkernel: intern-assigned uids replace per-node blake3 content hashing
kernel: memoized prim-family dispatch in the WHNF/def-eq reduction loops
ixon: defer per-constant address verification to first materialization
get(); blob bytes still verified eagerly at loadbench: add Int32/Int64 instRxcHasSize_eq to the cycle suite
docs: kernel uid identity vs Ixon content addressing
kernel: fix PrimFamily visibility/qualification warnings
kernel: port jcb/fixes H-15 whnf probe pre-filter + H-12 nat output caps
kernel+ixon: harden term identity against adversarial inputs
kernel: privatize uid-accepting constructors; document cross-shard uid story
docs: design for the environment-machine WHNF port
kernel: environment-machine WHNF (Phase A — lazy substitution on the beta path)
whnf_core's App arm enters a closure machine when a beta fires: beta/zeta are O(1) environment pushes; substitution materializes only at machine exits (clo_substreadback), so work is proportional to what the reduction consumeskernel: closure-iota at the machine's recursor exit (Phase B)
Decidablebranches) are never substituted and never read backdocs: env-machine WHNF design is implemented (Phases A+B)