fix(soundness): bind ShardRam y-sign to is_global_write#1344
Conversation
The ShardRam circuit derived its EC accumulator points from `(x, y) = lift_x(poseidon2(record))` and used the sign of `y[6]` to encode read vs write direction on the host side, but never enforced that convention in-circuit. A malicious prover could replace every leaf `(x, y)` with `(x, -y)`, update the public `shard_rw_sum` accordingly, and produce a proof that the base verifier and the recursion verifier both accept against the same execution trace. Bind the sign via a fresh `y6_lo` ∈ [0, (p-1)/2), realized as four byte witnesses with the top byte < 60 (`(p-1)/2 = 60·2^24` exactly on BabyBear). The read branch requires `y[6] = y6_lo + 1` and the write branch `y[6] + y6_lo + 1 ≡ 0 (mod p)`. `y[6] = 0` is the unique exception, rejected by `to_ec_point` via nonce retry. The pattern follows SP1's `global_interaction.rs` (offset-by-1 + safe band + prover nonce retry), but uses Ceno's existing `lookup_ltu_byte` gadget instead of SP1's column layout. `M = 60` (vs SP1's 63 for KoalaBear) is the BabyBear-specific tightest no-overlap band. GPU `try_gpu_assign_shard_ram` is not updated in this PR — the new columns aren't populated by the existing CUDA kernel (lives in a separate repo). CPU runs are fixed; GPU runs will fail at proof time until the kernel is updated in a follow-up. Fixes #1338. Partially addresses #1340 (only the y-sign TODO). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…stance
Replace the math-only PoC test with a circuit-driven one that:
* Builds the constraint system through `build_gkr_iop_circuit`.
* Runs `assign_instance` for both an honest EC point and the same
point with its sign flipped.
* Asserts that the algebraic `condition_require_equal` branch
equality holds on both rows.
* Asserts that the top byte of `y6_lo` is in `[0, 60)` for the honest
row and in `[60, 256)` for the tampered row, so the
`lookup_ltu_byte(b3, 60, 1)` table query is satisfiable only on the
honest witness.
Also tightens comments in `to_ec_point`, `ShardRamConfig`,
`ShardRamConfig::configure`, and `ShardRamCircuit::assign_instance` to
be implementation-focused per CLAUDE.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Findings (sorted by severity)
-
Blocker |
ceno_zkvm/src/tables/shard_ram.rs(newassert_byte/lookup_ltu_byteconstraints): LK multiplicity aggregation/order looks inconsistent with new lookups.
This PR introduces new LK interactions inShardRamConfig::configure(assert_byte+lookup_ltu_byte). However, the global LK multiplicities used to assign the DynamicRange/LTU table circuits are finalized and those table circuits are assigned beforeShardRamCircuitis assigned in the shard pipeline (seeceno_zkvm/src/e2e.rs:1500-1589). As written, ShardRam’s new lookup usage does not appear to contribute tocombined_lk_mltprior to table-circuit assignment, which is expected to break the logup multiset check (or otherwise leave these lookups unaccounted).
Suggested fix: update the witness/LK aggregation flow so ShardRam’s byte/LTU lookups contribute to the global multiplicity beforeRv32imConfig::assign_table_circuitruns (e.g., collect a per-chip multiplicity for ShardRam and include it inZKVMWitnesses.lk_mltsprior tofinalize_lk_multiplicities, or reorder assignment so ShardRam is assigned before lookup-table circuits). -
Major |
ShardRamRecord::to_ec_point: half-of-field boundary is off by one vs the new convention.
is_y_in_2nd_halfcurrently usesy6 >= prime/2. For odd primes,prime/2 == (p-1)/2, so the boundary valuey6 == (p-1)/2is classified as “second half”, causing the new convention (“read =>[1,(p-1)/2]”) to be violated and potentially making otherwise-valid witnesses fail the new in-circuit banding.
Suggested fix: compare against(prime + 1)/2(or use a stricty6 > prime/2) to match the stated ranges. -
Major | BabyBear-only guard uses
debug_assert_eq!in circuit configuration.
The constraint relies on BabyBear’s(p-1)/2 = 60·2^24, butdebug_assert_eq!is compiled out in release. If instantiated over a different field, the circuit would silently become incorrect.
Suggested fix: enforce at runtime (e.g.,assert_eq!orreturn Err(CircuitBuilderError::CircuitError(..))). -
Minor | Comment accuracy in
to_ec_point.
The “2-torsion case where (x,y)==(x,-y)” phrasing is misleading:y6 == 0doesn’t imply the full y-coordinate is zero; it only means that limb is fixed under negation, which is what makes the chosen encoding ambiguous/unsatisfiable.
Suggested fix: reword the comment to reflect the actual reason for rejection. -
Minor (testing) | New test does not assert constraint/prover rejection.
test_shard_ram_y_sign_circuit_rejects_negationcurrently checks derived limb properties (b3 < 60vs>= 60) but doesn’t actually run a constraint satisfiability check / mock prover / proof attempt that must fail for the tampered row. This can pass even if the lookup constraint is missing or if LK-table population is broken.
Suggested fix: make it a true regression by asserting the tampered witness fails constraint satisfaction (e.g., viaMockProverwith the necessary public inputs / table chips, or by attempting proof generation and asserting it errors).
Open questions / assumptions
- Is ShardRam always assigned after lookup-table circuits in all proving entrypoints (CPU + GPU + recursion pipelines)? If yes, the LK multiplicity/order blocker needs a design-level fix (not just a local change).
- Is it acceptable to hard-fail (non-debug) when
BaseField != BabyBear, or is there a preferred feature-gate pattern for BabyBear-only chips in this repo?
Changes:
- Host-side
to_ec_pointnow rejectsy6 == 0and documents the read/write y-half convention. - Circuit-side: adds byte-decomposition + lookup constraints and a conditional equality binding
y6tois_global_write. - Adds a targeted unit test around the y-sign binding logic for honest vs sign-flipped points.
| // binding in the circuit cannot distinguish read from write | ||
| // when y6 = 0. | ||
| if y6 != 0 { | ||
| let is_y_in_2nd_half = y6 >= (prime / 2); |
| // Skip the 2-torsion case where (x, y) == (x, -y); the y-sign | ||
| // binding in the circuit cannot distinguish read from write | ||
| // when y6 = 0. |
| debug_assert_eq!( | ||
| <E::BaseField as SmallField>::MODULUS_U64, | ||
| 0x7800_0001, | ||
| "y6_lo byte bound assumes BabyBear's (p-1)/2 = 60 * 2^24" | ||
| ); |
| std::array::from_fn(|i| cb.create_witin(|| format!("y6_lo_b{i}"))); | ||
| for (i, w) in y6_lo_bytes.iter().enumerate().take(3) { | ||
| cb.assert_byte(|| format!("y6_lo_b{i} byte"), w.expr())?; | ||
| } | ||
| // `lookup_ltu_byte(a, b, 1)` asserts `a, b` are bytes and `a < b`. | ||
| cb.lookup_ltu_byte( | ||
| y6_lo_bytes[3].expr(), | ||
| E::BaseField::from_canonical_u64(60).expr(), | ||
| Expression::ONE, | ||
| )?; |
| /// Drive a single ShardRam row through `configure` + `assign_instance` | ||
| /// for both an honest EC point and a sign-flipped point. Check that: | ||
| /// | ||
| /// * The algebraic branch equality from `condition_require_equal` holds | ||
| /// on both rows (`assign_instance` derives `y6_lo` to keep it | ||
| /// consistent regardless of how `y[6]` was tampered). | ||
| /// * The top byte `b3 = (y6_lo >> 24) & 0xff` lands in `[0, 60)` for the | ||
| /// honest row and in `[60, 256)` for the tampered row. The latter | ||
| /// makes the `lookup_ltu_byte(b3, 60, 1)` record query | ||
| /// `(b3, 60, 1)`, which is not present in the LTU table (the table | ||
| /// only carries `(b3, 60, 0)` for `b3 >= 60`), so the lookup | ||
| /// constraint rejects the tampered witness. |
| // Skip the 2-torsion case where (x, y) == (x, -y); the y-sign | ||
| // binding in the circuit cannot distinguish read from write | ||
| // when y6 = 0. | ||
| if y6 != 0 { |
There was a problem hiding this comment.
why do we need this condition?
There was a problem hiding this comment.
i think we do need to handle the 2-torsion case where
`from_canonical_u32` debug-asserts the input is less than the field prime. BabyBear's prime is 0x78000001, so the previous `0xdeadbeef` test constant panicked under the default (debug) `cargo make tests` target while passing in release. Swap for an in-range address. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous version added `lookup_ltu_byte` and `assert_byte` records in ShardRamCircuit's per-row `LkMultiplicity`, but `TableCircuit:: assign_instances` has no path to feed those records into the global `combined_lk_mlt` that the Ltu/Dynamic table circuits read when they populate their multiplicity columns. The opcode/table logup sums then fail to balance, and `verify` rejects every proof with `logup_sum != 0`. The unit test that round-trips a single chip in isolation did not catch this because it never assigns the lookup-table circuits. Replace the byte limbs with a 30-bit decomposition: 30 fresh `WitIn` plus 30 `assert_bit` plus a degree-2 cap (`c2627 = c26*c27`, `c2829 = c28*c29`, `c2627*c2829 = 0`) excluding the four high bits being all set. For BabyBear `(p-1)/2 = 0x3c000000` has bits 26..29 set and the rest zero, so the constraint lands `y6_lo` in `[0, (p-1)/2)`. The two intermediate witnesses keep each multiplication at degree 2 so the constraint stays under the sumcheck framework's degree-4 ceiling once the row selector multiplies in. The integration regression `cargo run --features sanity-check --bin e2e -- --platform=ceno --max-cycle-per-shard=20000 --hints=10 --public-io=4191 examples/.../fibonacci` now verifies through all seven shards. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Problem
Issue #1338 reproduces a soundness break on
master. For the same RISC-Vexecution, the base verifier and the recursion verifier both accept two
distinct proof batches whose public per-shard
shard_rw_sumvalues differon all 17 shards. The attacker takes an honest witness, replaces every
cross-shard EC accumulator leaf
(x, y)with its inverse(x, -y),updates
shard_rw_sum, and reproves.Root cause:
ceno_zkvm/src/tables/shard_ram.rs:276-281was a TODO. Thehost code in
ShardRamRecord::to_ec_pointencodes read vs write in thesign of
y[6], but the circuit only constrained the curve equation andthe EC sum — never tying
y[6]'s half-of-field tois_global_write.Both
(x, y)and(x, -y)satisfied every existing check, so the publicsummary of cross-shard RAM flow was unbound.
The defect survives recursion (the reporter's PoC verifies through the
recursion verifier program).
Design Rationale
Approach borrows the idea from SP1's
crates/core/machine/src/operations/global_interaction.rs:210-236,not its column layout. Three pieces:
y[6]in terms of a fresh witnessy6_loso
y[6] = 0is never valid in either branch (it is the 2-torsionfixed point where the two signs collapse).
y6_loto[0, (p-1)/2). Forthe rare exception
y[6] = 0(probability~1/p ≈ 2^-31per record)the host rejects and retries with a new
nonce.y6_lodecomposed into four bytelimbs
b0..b3(assert_byteforb0..b2,lookup_ltu_byte(b3, 60, 1)for
b3). For BabyBear,(p-1)/2 = 60·2^24exactly, sob3 < 60gives the tightest no-overlap band.
In-circuit branch equality via
condition_require_equal:is_global_write = 0):y[6] = y6_lo + 1⇒y[6] ∈ [1, (p-1)/2]is_global_write = 1):y[6] = p - 1 - y6_lo⇒y[6] ∈ [(p+1)/2, p-1]Union covers
[1, p-1]with no overlap;y[6] = 0is excluded.Why not a single
AssertLtConfig(y6_lo, (p-1)/2, max_bits=30)?On BabyBear (
p = 0x78000001, 31-bit) the AssertLt gadget onlyconstrains
lhs - rhs ≡ diff - 2^max_bits (mod p)withdiff ∈ [0, 2^30)— it does not pre-bound
lhsto be canonical-small. A maliciousy6_lo ∈ [0x74000001, p-1](≈ 2^26 values) produces a field-wrap diffthat still fits in 30 bits, so the constraint accepts upper-half values
and the exploit survives. Byte-decomposing first kills the wrap. Ceno's
DynamicRangeTableCircuit<E, 18>also does not carry 30-bit lookupentries, so a direct
assert_const_range(_, 30)is not availableanyway.
Why M = 60 (vs SP1's 63). SP1 targets KoalaBear; its
(p-1)/2 = 0x3f800000, so 63 leaves a small safety band. For BabyBear,(p-1)/2 = 60·2^24exactly — 63 would lety[6]straddlep/2andreintroduce the ambiguity.
Also corrects the stale comment that previously had the convention
reversed (claimed write ⇒ lower half, opposite of what the host code
does).
Change Highlights
ceno_zkvm/src/tables/shard_ram.rsShardRamRecord::to_ec_point: rejecty6 == 0and try the nextnonce. Same negation policy otherwise.ShardRamConfig: new fieldy6_lo_bytes: [WitIn; 4].ShardRamConfig::configure: replace the TODO with the bytedecomposition, byte-range / LTU lookups, and the
condition_require_equalbranch equality.ShardRamCircuit::assign_instance: computey6_lofromy[6]andis_to_write_set, assign byte limbs, register byte and LTUmultiplicities.
test_shard_ram_y_sign_circuit_rejects_negation: drivesthe full
configure+assign_instancepath for one honest rowand one sign-flipped row, asserts the algebraic branch equality
holds on both, and asserts the LTU lookup record
(b3, 60, 1)isin-table for the honest row and out-of-table for the tampered row.
Benchmark / Performance Impact
Per ShardRam row this PR adds 4 byte WitIn columns plus 3 byte-range
and 1 LTU lookup multiplicities. ShardRam rows scale with cross-shard
RAM events, not with cycles, so the absolute cost is sub-percent on the
prover. No full prover bench was rerun (no hot-loop arithmetic changed).
Existing
test_shard_ram_circuit(170k reads + 1420 writes, full chipproof) runtime is unchanged within noise:
Testing
cargo fmt --all --check cargo check --workspace --all-targets cargo check --workspace --all-targets --release cargo make clippy cargo clippy --workspace --all-targets --release -- -D warnings RUST_MIN_STACK=33554432 cargo test -p ceno_zkvm tables::shard_ram::tests --lib --releaseAll pass locally on BabyBear. Both
test_shard_ram_circuitand the newtest_shard_ram_y_sign_circuit_rejects_negationare green.cargo make tests/cargo make tests_goldilockshould be re-run byCI; the change is gated to BabyBear via a
debug_assert_eq!onMODULUS_U64and goldilocks does not exercise shard_ram (perintegration.ymlcommented-out lines and CLAUDE.md).Risks and Rollout
shard_rw_sum#1338. The new constraint only adds local bytearithmetic and existing lookups — no change to transcript, sumcheck,
PCS, or EC accumulation.
(
instructions/gpu/chips/shard_ram.rs→witgen_shard_ram_per_row)lives in the separate
ceno-gpu-mockrepo and does not yet populatethe new
y6_lo_bytescolumns. CPU runs are correct; GPU runs (behindthe
gpufeature, not in default CI) will fail at proof time untilthe kernel is updated. Follow-up required before re-enabling GPU
shard_ram.
system, so no separate recursion code change is needed; once the base
proof passes, the recursive one will too. The reporter's PoC going
through the recursion verifier should now be rejected.
M = 60. Adebug_assert_eq!(MODULUS_U64, 0x78000001, ...)guards againstaccidental use on a different field; shard_ram is BabyBear-only
today per CLAUDE.md.
Follow-ups
y6_lo_bytespopulation to theceno-gpu-mockCUDA witgenkernel and re-enable GPU shard_ram.
shard_ram.rs#1340 TODOs (local read ⇄ global writepairing onshard_ram.rs:235-236,shard == shard_idbinding on line 244) areintentionally out of scope here.
Fixes #1338.
Partially addresses #1340.
Copilot Reviewer Directive (keep this section)
When Copilot reviews this PR, apply
.github/copilot-instructions.mdstrictly.