Skip to content

feat: Add SP1 and Zisk kernel proving backends#411

Draft
samuelburnham wants to merge 18 commits into
mainfrom
kernel-riscv
Draft

feat: Add SP1 and Zisk kernel proving backends#411
samuelburnham wants to merge 18 commits into
mainfrom
kernel-riscv

Conversation

@samuelburnham

@samuelburnham samuelburnham commented May 15, 2026

Copy link
Copy Markdown
Member

This pull request adds RISC-V zkVM proving backends for the Rust kernel — Zisk as the primary target (single process, single GPU), SP1 as an experimental alternative — together with the host pipeline that makes whole-environment Ix proving practical: RAM-aware sharding, tree aggregation with in-circuit assumption discharge, and cross-run proof reuse. Guest runs are validated by the shard's committed failures publics word; cycle counts are deterministic ziskemu step counts.

Key Results

  • Whole-environment proving works end to end on a single GPU: profile → shard plan (.ixes manifest) → leaf proofs → tree-aligned aggregation to a single proof with no dangling assumptions
  • Shard count is auto-sized from machine RAM and profiled heartbeats — no manual budget tuning
  • Cross-run reuse: store-aware planning re-proves only novel shards; a shard whose certified targets are already covered by stored proofs is skipped entirely
  • Guest checks an explicit check-list against a closure-injected sub-env (never the whole env); 112-byte shard publics bind the env hash, range/count, failures word, and assumption roots
  • ByteArray.utf8DecodeChar?_utf8EncodeChar_append — the most expensive proof in Init — provable via the NatSuccMode::Stuck cache
  • Init is provable in principle in 288 shards on a 256GB RAM machine, ~0.35 constants per second out of ~52k total constants = ~37-45 hours on a single GPU.

@samuelburnham samuelburnham force-pushed the kernel-riscv branch 2 times, most recently from d2ea3d9 to 554d76d Compare June 12, 2026 15:39
Move Rust sources from src/ into per-crate trees under crates/
(aiur, common, compile, ffi, ixon, kernel) and convert the root
Cargo.toml into a workspace manifest with shared dependencies, lints,
and profiles.
Fix CI

Fix licenses

Updates

Fixups
The `#[quickcheck]` attribute macro lives in `quickcheck_macros`, not
`quickcheck`. Without this import the ixon test target fails to
compile.
Re-home jcb's monolithic `crate::ix::{profile,shard}` into `crates/kernel`
(ProfileSink must be reachable from KEnv) and rewrite imports to the
workspace convention (ix_common::address, ixon::merkle, crate::{profile,shard},
ix_kernel::{profile,shard} from the FFI). No logic changes to the ported code.

ix-kernel builds; all 12 shard partitioner tests + the profile-sink
delta-edge test pass. (Pre-existing, unrelated: ix-ffi fails to build on
kernel-riscv too — private IOBuffer fields in aiur/protocol.rs from the
workspace split.)
…olding

Replace the static range/byte partition heuristics with the offline
profiler/partitioner plan as the primary sharding path:

* --shard-plan <manifest.ixes>: prove a `.ixes` partition (from `ix profile`
  + `ix shard`, i.e. real per-block heartbeats balanced + cross-shard
  delta-ingress minimized). Each manifest shard → one closure-injected leaf
  proof; leaves tree-fold into one aggregate whose committed subject is bound
  to the env. Maps manifest block addresses to work items by ingress-block
  (a Muts work item's `primary` is a member projection, not the block addr),
  validates full coverage, and composes with --execute and --only-shard.
* New `ix-kernel` example `shard_plan`: profile a .ixe + partition → .ixes,
  out of circuit, without the Lean/FFI layer (handy until the pre-existing
  ix-ffi build break is fixed). Validated end-to-end in --execute on
  nataddcomm: 34 work items → 4 balanced shards (heartbeats 78–120), full
  coverage, 0 failures.

Removed as superseded by the planner:
* --topo + topo_order_items (the graph cut subsumes within-env scheduling)
* --closure-stats / closure_report (manifest reports exact cross_ingress)
* --overlap-stats / overlap_report (Design-C measurement scaffolding)

The host `shard` partitioner uses rayon + std::time, so the module is gated
to non-riscv targets (it runs in the host planner, never the guest); `profile`
stays available everywhere since KEnv holds a ProfileSink (always None on the
guest). --shard-consts/--shard-bytes remain as no-profile fallbacks.
Add `partition_for_cycle_cap(profile, max_cycles, cycles_per_heartbeat, eps)`:
pick the smallest N whose heaviest shard fits a Zisk guest-cycle budget, then
partition. Converts the cap to a heartbeat cap (cycles ÷ cycles/hb), estimates
N from total load, and grows N proportionally to any overshoot — re-partitioning
each round (cheap) to converge on the minimal feasible N. Detects the
atomic-block floor: when the largest mutual block alone exceeds the cap, flags
`infeasible_atomic_floor` (no N can fit it; split upstream or raise the cap).

The cap is the leaf prover's trace size, which sets peak prover RAM; doc gives
the conversion from a host-RAM budget (peak_GB ≈ 45 + 32·cycles_B, measured on
this prover) so users think in RAM, pass cycles, and get N.

Wire it into the `shard_plan` example: `--max-cycles C` (alternative to
`--shards N`), `--cycles-per-heartbeat K` (default 208k, mergesort-calibrated;
recalibrate per env via one --execute). Verified on mergesort: 4.5e9 → N=2
(heaviest 3.9B cyc), 2e9 → N=5 (1.76B cyc); largest atomic block 2553 hb.
…→215k)

Measured whole-env Zisk cycles ÷ profiler heartbeats across 12 envs:
- large shardable envs (>20k hb) cluster 194-208k whole-env
- per-shard ratio runs ~5-8% higher (a shard memoizes less than the whole env);
  mergesort's heaviest shard is ~216k
- tiny arithmetic envs ~130-160k; heavy-def-eq envs (array/string-assoc) ~258k

The budget needs the per-shard heaviest ratio (so it never under-predicts a
leaf's cycles → never OOMs). Set the default to 215k (matches the measured
mergesort heaviest shard; predicts 4.03B vs 4.042B actual). Genuinely
workload-dependent, so the doc + --cycles-per-heartbeat recalibration guidance
stand; the cap's RAM headroom absorbs the residual. mergesort still → N=2.
Add cycle_cap_for_ram(ram_gb): invert the measured single-leaf prover model
(peak_GB ≈ 45 + 32·cycles_B) at ~78% of RAM to a per-shard cycle cap; 0 if the
box can't hold the ~45 GB prover base. Compose with the existing
partition_for_cycle_cap so RAM → cap → N from total heartbeats.

shard_plan example: default mode (no flags) reads /proc/meminfo and sizes N
automatically — the user just passes the .ixe. --shards / --max-cycles remain
as explicit overrides; --ram-gb for what-if. Verified: 250 GB → N=2 (mergesort),
64 GB → N=47 + atomic-floor warning (its biggest block won't fit), 40 GB refused.
The recursive min-cut partitioner now returns its bisection tree
(AggNode: leaves are shard ids, internal nodes are min-cut splits) via
partition_with_tree; rec_bisect builds it as it recurses, so the tree
mirrors the partition exactly. agg_plan lowers the binary tree to an
arity-bounded post-order fold plan (FoldOp::Leaf/Agg), collapsing whole
subtrees so each agg call folds up to `arity` children — agg-call count
stays ~N/(arity-1), same as a flat fold, while children remain
subtree-aligned and in shard order (keeping the subject merkle-fold
left-associative).

The tree is serialized as an optional trailing .ixes section: pre-tree
manifests decode to tree=None and provers fall back to a flat fold.
partition_for_cycle_cap and shard_esp populate it; the shard_plan
example writes it (omitted for --naive baselines).

Folding along this tree lets aggregation discharge each cross-shard
assumption at the lowest common ancestor of the shards that share it,
where the assumption sets are smallest.
The agg guest gains a mode word. Mode 0 is the legacy O(1) merkle_join
fold of opaque subject/assumption roots (used by the flat fallback,
range, and reuse paths — unchanged semantics). Mode 1 implements set
discharge: the host supplies each child's subject/assumption preimage
address lists as untrusted witness; the guest re-derives every list's
canonical merkle root and asserts it equals the child's committed root,
then commits canonical roots over union(subjects) and
union(assumptions) \ union(subjects). Assumptions proven by any sibling
are discharged in-circuit at the lowest level of the tree where
producer and consumer meet; when the whole env folds, the final
assumptions root collapses to the zero sentinel and the aggregate claim
is unconditional. Mode-1 outputs are canonical set commitments, so
agg-of-agg witness verification is uniform at every level, and the mode
is committed in field_b.

The shard-plan host path folds along the manifest's bisection tree
(pruned to the shards actually proven), executes the arity-collapsed
plan, supplies the discharge witness per agg call, mirrors the
union/discharge host-side, binds the root subject to the canonical root
over the covered union, and binds the root assumptions to the mirror —
reporting closure ("unconditional") or the outstanding count. Each leaf
additionally checks the host's assumption mirror against the guest's
committed root at prove time, so a rule divergence fails fast rather
than as a witness panic deep in the agg.

Measured on mergesort (8 shards, identical leaves, warm GPU): discharge
adds 2.5s to the single arity-16 agg call (19.0s -> 21.5s, +0.9% total
run) and resolves all 2439 cross-shard assumptions to zero.

Also: the Assembly executor is now the default (--emulator opts out).
It is ~4-6x faster at trace generation and a prerequisite for the hints
stream. The historical multi-program breakage (second setup() clobbers
the first program's ASM ROM histogram -> rom.rs:178 panic) is fixed by
ordering: the agg program is set up only after the leaf loop, never
before; the Emulator keeps the upfront setup. Validated end to end on
single-leaf, balanced (4-leaf), unbalanced mixed-vk (3-leaf), and
8-shard runs.
Simplifications, no behavior change (validated: the 3-shard discharge
prove reproduces the same leaf steps, discharge counts, and final
subject root as before):

- Drop the flat mode-0 fallback fold from run_shard_plan. It was
  reachable only via manifests written before the tree section existed,
  which regenerate in seconds; aggregation now requires the bisection
  tree and says so. One fold path instead of two.
- Move tree pruning into the kernel as AggNode::prune (where the type
  lives), with the unit test it was missing.
- Drop the leaf_subjects vector (redundant with leaf_subject_sets) and
  the duplicate cover->Address conversion in the leaf loop.
- Extract discharge() — the host mirror of the agg guest's
  union/resolve — out of the fold loop.
- Name build_inputs' return triple (LeafInputs) and tighten
  build_plan's frontier expansion.
- Zero compiler and clippy warnings across ix-kernel and zisk-host
  (was 12): unused mut, documented dead-code allow on the publics
  layout field, redundant to_string in format args, indexed loop,
  then -> then_some.

Audited and kept: every CLI flag is read; subject_of_cover,
fold_subjects, and tree_partition remain in use by the only-const,
range, and reuse paths; the agg guest's two-mode structure is already
minimal.
…euse path

--store-dir now composes with --shard-plan (and requires it). A manifest
shard whose certified targets are all covered by stored proofs is not
re-proven: the covering stored proofs are folded into the aggregate as
extra children, where the agg guest verifies them (vk-pinned,
witness-root-checked) and discharges assumptions against their subjects
— reuse by verification, never by trust. Partially-covered shards are
re-proven whole (the shard is the proof unit). Freshly proven shards are
banked per-shard (proofs/<n>.{proof,cover,asm}) as they complete, so an
interrupted run resumes where it left off; .asm carries the assumption
preimages mode-1 discharge needs, and the proofs directory is the
store's single source of truth (proven.bin is gone).

The fold is one op-list executor: novel leaves fold along the pruned
bisection tree as before; stored proofs join as extra slots; a
synthesized tail fold reduces the forest to one verified root, with an
agg-of-1 when a lone stored proof would otherwise go unverified.
Mode-1 commitments are canonical sets, so fold shape never changes the
final claim — validated: fresh-store, full-reuse (0 leaves proven), and
mixed (novel tree + stored tail) runs all commit the identical subject
root and discharge to an unconditional claim.

The legacy standalone reuse path (mode-0 join fold, naive count/byte
chunking, proven.bin) is deleted, along with its private flags
(--closure-inject, --limit-items) and now-orphaned helpers
(fold_subjects, plan_chunks_by_closure, addr_bytes, load_store,
save_store). --require-closed derives the covered set from the proof
index. Net -392 lines; zero compiler/clippy warnings.
…reuse

The planner (shard_plan --store-dir) now diffs the env against the proof
store before profiling: work items whose targets are all store-covered
are skipped entirely — not typechecked (planning gets faster), not
partitioned, and their blocks excluded from the hypergraph (a
novel→covered delta edge is an assumption discharged at aggregation,
not a cut to minimize). The manifest covers only novel work.

The host accepts such manifests: grouping requires every work item to
be either manifest-assigned or store-covered, and the covering-proof
selection generalizes from "reused manifest shards" to "every needed
target the novel shards won't certify" — subsuming reused shards and
unmanifested covered work. A fixpoint additionally folds stored proofs
that cover the assumptions of already-included stored proofs, so a
closed store yields an unconditional claim even cross-env.

This fixes the cross-env conservatism where the min-cut objective pulled
already-proven library constants into novel shards (clustering
dependents with dependencies) and one novel constant forced a whole
mixed shard to re-prove. Validated: with nataddcomm proven into a
store, planning natmulcomm skips 34 of 47 work items, proves 2 novel
shards (13 items), folds the 3 stored proofs (verified in-circuit,
assumption-closure fixpoint), and discharges to an unconditional claim
— 72% of the work reused across envs. Same-env full reuse regression
unchanged (identical subject root).
The open-assumption filter scanned stored×subjects per assumption per
round — O(assumptions × proofs × subjects), seconds-to-minutes of host
CPU at init-store scale (~206 proofs × 51k subjects). One HashSet of
included-proof subjects per round makes it linear and reads better.
Re-validated: full-reuse fold of the cross-env store commits the
identical subject root, outstanding 0, unconditional.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant