feat: Add SP1 and Zisk kernel proving backends#411
Draft
samuelburnham wants to merge 18 commits into
Draft
Conversation
d2ea3d9 to
554d76d
Compare
Move Rust sources from src/ into per-crate trees under crates/ (aiur, common, compile, ffi, ixon, kernel) and convert the root Cargo.toml into a workspace manifest with shared dependencies, lints, and profiles.
Fix CI Fix licenses Updates Fixups
The `#[quickcheck]` attribute macro lives in `quickcheck_macros`, not `quickcheck`. Without this import the ixon test target fails to compile.
Re-home jcb's monolithic `crate::ix::{profile,shard}` into `crates/kernel`
(ProfileSink must be reachable from KEnv) and rewrite imports to the
workspace convention (ix_common::address, ixon::merkle, crate::{profile,shard},
ix_kernel::{profile,shard} from the FFI). No logic changes to the ported code.
ix-kernel builds; all 12 shard partitioner tests + the profile-sink
delta-edge test pass. (Pre-existing, unrelated: ix-ffi fails to build on
kernel-riscv too — private IOBuffer fields in aiur/protocol.rs from the
workspace split.)
…olding Replace the static range/byte partition heuristics with the offline profiler/partitioner plan as the primary sharding path: * --shard-plan <manifest.ixes>: prove a `.ixes` partition (from `ix profile` + `ix shard`, i.e. real per-block heartbeats balanced + cross-shard delta-ingress minimized). Each manifest shard → one closure-injected leaf proof; leaves tree-fold into one aggregate whose committed subject is bound to the env. Maps manifest block addresses to work items by ingress-block (a Muts work item's `primary` is a member projection, not the block addr), validates full coverage, and composes with --execute and --only-shard. * New `ix-kernel` example `shard_plan`: profile a .ixe + partition → .ixes, out of circuit, without the Lean/FFI layer (handy until the pre-existing ix-ffi build break is fixed). Validated end-to-end in --execute on nataddcomm: 34 work items → 4 balanced shards (heartbeats 78–120), full coverage, 0 failures. Removed as superseded by the planner: * --topo + topo_order_items (the graph cut subsumes within-env scheduling) * --closure-stats / closure_report (manifest reports exact cross_ingress) * --overlap-stats / overlap_report (Design-C measurement scaffolding) The host `shard` partitioner uses rayon + std::time, so the module is gated to non-riscv targets (it runs in the host planner, never the guest); `profile` stays available everywhere since KEnv holds a ProfileSink (always None on the guest). --shard-consts/--shard-bytes remain as no-profile fallbacks.
Add `partition_for_cycle_cap(profile, max_cycles, cycles_per_heartbeat, eps)`: pick the smallest N whose heaviest shard fits a Zisk guest-cycle budget, then partition. Converts the cap to a heartbeat cap (cycles ÷ cycles/hb), estimates N from total load, and grows N proportionally to any overshoot — re-partitioning each round (cheap) to converge on the minimal feasible N. Detects the atomic-block floor: when the largest mutual block alone exceeds the cap, flags `infeasible_atomic_floor` (no N can fit it; split upstream or raise the cap). The cap is the leaf prover's trace size, which sets peak prover RAM; doc gives the conversion from a host-RAM budget (peak_GB ≈ 45 + 32·cycles_B, measured on this prover) so users think in RAM, pass cycles, and get N. Wire it into the `shard_plan` example: `--max-cycles C` (alternative to `--shards N`), `--cycles-per-heartbeat K` (default 208k, mergesort-calibrated; recalibrate per env via one --execute). Verified on mergesort: 4.5e9 → N=2 (heaviest 3.9B cyc), 2e9 → N=5 (1.76B cyc); largest atomic block 2553 hb.
…→215k) Measured whole-env Zisk cycles ÷ profiler heartbeats across 12 envs: - large shardable envs (>20k hb) cluster 194-208k whole-env - per-shard ratio runs ~5-8% higher (a shard memoizes less than the whole env); mergesort's heaviest shard is ~216k - tiny arithmetic envs ~130-160k; heavy-def-eq envs (array/string-assoc) ~258k The budget needs the per-shard heaviest ratio (so it never under-predicts a leaf's cycles → never OOMs). Set the default to 215k (matches the measured mergesort heaviest shard; predicts 4.03B vs 4.042B actual). Genuinely workload-dependent, so the doc + --cycles-per-heartbeat recalibration guidance stand; the cap's RAM headroom absorbs the residual. mergesort still → N=2.
Add cycle_cap_for_ram(ram_gb): invert the measured single-leaf prover model (peak_GB ≈ 45 + 32·cycles_B) at ~78% of RAM to a per-shard cycle cap; 0 if the box can't hold the ~45 GB prover base. Compose with the existing partition_for_cycle_cap so RAM → cap → N from total heartbeats. shard_plan example: default mode (no flags) reads /proc/meminfo and sizes N automatically — the user just passes the .ixe. --shards / --max-cycles remain as explicit overrides; --ram-gb for what-if. Verified: 250 GB → N=2 (mergesort), 64 GB → N=47 + atomic-floor warning (its biggest block won't fit), 40 GB refused.
The recursive min-cut partitioner now returns its bisection tree (AggNode: leaves are shard ids, internal nodes are min-cut splits) via partition_with_tree; rec_bisect builds it as it recurses, so the tree mirrors the partition exactly. agg_plan lowers the binary tree to an arity-bounded post-order fold plan (FoldOp::Leaf/Agg), collapsing whole subtrees so each agg call folds up to `arity` children — agg-call count stays ~N/(arity-1), same as a flat fold, while children remain subtree-aligned and in shard order (keeping the subject merkle-fold left-associative). The tree is serialized as an optional trailing .ixes section: pre-tree manifests decode to tree=None and provers fall back to a flat fold. partition_for_cycle_cap and shard_esp populate it; the shard_plan example writes it (omitted for --naive baselines). Folding along this tree lets aggregation discharge each cross-shard assumption at the lowest common ancestor of the shards that share it, where the assumption sets are smallest.
The agg guest gains a mode word. Mode 0 is the legacy O(1) merkle_join
fold of opaque subject/assumption roots (used by the flat fallback,
range, and reuse paths — unchanged semantics). Mode 1 implements set
discharge: the host supplies each child's subject/assumption preimage
address lists as untrusted witness; the guest re-derives every list's
canonical merkle root and asserts it equals the child's committed root,
then commits canonical roots over union(subjects) and
union(assumptions) \ union(subjects). Assumptions proven by any sibling
are discharged in-circuit at the lowest level of the tree where
producer and consumer meet; when the whole env folds, the final
assumptions root collapses to the zero sentinel and the aggregate claim
is unconditional. Mode-1 outputs are canonical set commitments, so
agg-of-agg witness verification is uniform at every level, and the mode
is committed in field_b.
The shard-plan host path folds along the manifest's bisection tree
(pruned to the shards actually proven), executes the arity-collapsed
plan, supplies the discharge witness per agg call, mirrors the
union/discharge host-side, binds the root subject to the canonical root
over the covered union, and binds the root assumptions to the mirror —
reporting closure ("unconditional") or the outstanding count. Each leaf
additionally checks the host's assumption mirror against the guest's
committed root at prove time, so a rule divergence fails fast rather
than as a witness panic deep in the agg.
Measured on mergesort (8 shards, identical leaves, warm GPU): discharge
adds 2.5s to the single arity-16 agg call (19.0s -> 21.5s, +0.9% total
run) and resolves all 2439 cross-shard assumptions to zero.
Also: the Assembly executor is now the default (--emulator opts out).
It is ~4-6x faster at trace generation and a prerequisite for the hints
stream. The historical multi-program breakage (second setup() clobbers
the first program's ASM ROM histogram -> rom.rs:178 panic) is fixed by
ordering: the agg program is set up only after the leaf loop, never
before; the Emulator keeps the upfront setup. Validated end to end on
single-leaf, balanced (4-leaf), unbalanced mixed-vk (3-leaf), and
8-shard runs.
Simplifications, no behavior change (validated: the 3-shard discharge prove reproduces the same leaf steps, discharge counts, and final subject root as before): - Drop the flat mode-0 fallback fold from run_shard_plan. It was reachable only via manifests written before the tree section existed, which regenerate in seconds; aggregation now requires the bisection tree and says so. One fold path instead of two. - Move tree pruning into the kernel as AggNode::prune (where the type lives), with the unit test it was missing. - Drop the leaf_subjects vector (redundant with leaf_subject_sets) and the duplicate cover->Address conversion in the leaf loop. - Extract discharge() — the host mirror of the agg guest's union/resolve — out of the fold loop. - Name build_inputs' return triple (LeafInputs) and tighten build_plan's frontier expansion. - Zero compiler and clippy warnings across ix-kernel and zisk-host (was 12): unused mut, documented dead-code allow on the publics layout field, redundant to_string in format args, indexed loop, then -> then_some. Audited and kept: every CLI flag is read; subject_of_cover, fold_subjects, and tree_partition remain in use by the only-const, range, and reuse paths; the agg guest's two-mode structure is already minimal.
…euse path
--store-dir now composes with --shard-plan (and requires it). A manifest
shard whose certified targets are all covered by stored proofs is not
re-proven: the covering stored proofs are folded into the aggregate as
extra children, where the agg guest verifies them (vk-pinned,
witness-root-checked) and discharges assumptions against their subjects
— reuse by verification, never by trust. Partially-covered shards are
re-proven whole (the shard is the proof unit). Freshly proven shards are
banked per-shard (proofs/<n>.{proof,cover,asm}) as they complete, so an
interrupted run resumes where it left off; .asm carries the assumption
preimages mode-1 discharge needs, and the proofs directory is the
store's single source of truth (proven.bin is gone).
The fold is one op-list executor: novel leaves fold along the pruned
bisection tree as before; stored proofs join as extra slots; a
synthesized tail fold reduces the forest to one verified root, with an
agg-of-1 when a lone stored proof would otherwise go unverified.
Mode-1 commitments are canonical sets, so fold shape never changes the
final claim — validated: fresh-store, full-reuse (0 leaves proven), and
mixed (novel tree + stored tail) runs all commit the identical subject
root and discharge to an unconditional claim.
The legacy standalone reuse path (mode-0 join fold, naive count/byte
chunking, proven.bin) is deleted, along with its private flags
(--closure-inject, --limit-items) and now-orphaned helpers
(fold_subjects, plan_chunks_by_closure, addr_bytes, load_store,
save_store). --require-closed derives the covered set from the proof
index. Net -392 lines; zero compiler/clippy warnings.
…reuse The planner (shard_plan --store-dir) now diffs the env against the proof store before profiling: work items whose targets are all store-covered are skipped entirely — not typechecked (planning gets faster), not partitioned, and their blocks excluded from the hypergraph (a novel→covered delta edge is an assumption discharged at aggregation, not a cut to minimize). The manifest covers only novel work. The host accepts such manifests: grouping requires every work item to be either manifest-assigned or store-covered, and the covering-proof selection generalizes from "reused manifest shards" to "every needed target the novel shards won't certify" — subsuming reused shards and unmanifested covered work. A fixpoint additionally folds stored proofs that cover the assumptions of already-included stored proofs, so a closed store yields an unconditional claim even cross-env. This fixes the cross-env conservatism where the min-cut objective pulled already-proven library constants into novel shards (clustering dependents with dependencies) and one novel constant forced a whole mixed shard to re-prove. Validated: with nataddcomm proven into a store, planning natmulcomm skips 34 of 47 work items, proves 2 novel shards (13 items), folds the 3 stored proofs (verified in-circuit, assumption-closure fixpoint), and discharges to an unconditional claim — 72% of the work reused across envs. Same-env full reuse regression unchanged (identical subject root).
The open-assumption filter scanned stored×subjects per assumption per round — O(assumptions × proofs × subjects), seconds-to-minutes of host CPU at init-store scale (~206 proofs × 51k subjects). One HashSet of included-proof subjects per round makes it linear and reads better. Re-validated: full-reuse fold of the cross-env store commits the identical subject root, outstanding 0, unconditional.
…_utf8EncodeChar_append`
554d76d to
735d07c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request adds RISC-V zkVM proving backends for the Rust kernel — Zisk as the primary target (single process, single GPU), SP1 as an experimental alternative — together with the host pipeline that makes whole-environment Ix proving practical: RAM-aware sharding, tree aggregation with in-circuit assumption discharge, and cross-run proof reuse. Guest runs are validated by the shard's committed
failurespublics word; cycle counts are deterministicziskemustep counts.Key Results
.ixesmanifest) → leaf proofs → tree-aligned aggregation to a single proof with no dangling assumptionsByteArray.utf8DecodeChar?_utf8EncodeChar_append— the most expensive proof inInit— provable via theNatSuccMode::Stuckcache