fix(soundness): bind ShardRam y-sign to is_global_write by spherel · Pull Request #1344 · scroll-tech/ceno

spherel · 2026-05-25T01:23:33Z

Problem

Issue #1338 reproduces a soundness break on master. For the same RISC-V
execution, the base verifier and the recursion verifier both accept two
distinct proof batches whose public per-shard shard_rw_sum values differ
on all 17 shards. The attacker takes an honest witness, replaces every
cross-shard EC accumulator leaf (x, y) with its inverse (x, -y),
updates shard_rw_sum, and reproves.

Root cause: ceno_zkvm/src/tables/shard_ram.rs:276-281 was a TODO. The
host code in ShardRamRecord::to_ec_point encodes read vs write in the
sign of y[6], but the circuit only constrained the curve equation and
the EC sum — never tying y[6]'s half-of-field to is_global_write.
Both (x, y) and (x, -y) satisfied every existing check, so the public
summary of cross-shard RAM flow was unbound.

The defect survives recursion (the reporter's PoC verifies through the
recursion verifier program).

Design Rationale

Approach borrows the idea from SP1's
crates/core/machine/src/operations/global_interaction.rs:210-236,
not its column layout. Three pieces:

Offset by +1. Express y[6] in terms of a fresh witness y6_lo
so y[6] = 0 is never valid in either branch (it is the 2-torsion
fixed point where the two signs collapse).
Safe band + prover retry. Restrict y6_lo to [0, (p-1)/2). For
the rare exception y[6] = 0 (probability ~1/p ≈ 2^-31 per record)
the host rejects and retries with a new nonce.
Byte-decomposition range check. y6_lo decomposed into four byte
limbs b0..b3 (assert_byte for b0..b2, lookup_ltu_byte(b3, 60, 1)
for b3). For BabyBear, (p-1)/2 = 60·2^24 exactly, so b3 < 60
gives the tightest no-overlap band.

In-circuit branch equality via condition_require_equal:

read (is_global_write = 0): y[6] = y6_lo + 1 ⇒ y[6] ∈ [1, (p-1)/2]
write (is_global_write = 1): y[6] = p - 1 - y6_lo ⇒ y[6] ∈ [(p+1)/2, p-1]

Union covers [1, p-1] with no overlap; y[6] = 0 is excluded.

Why not a single AssertLtConfig(y6_lo, (p-1)/2, max_bits=30)?
On BabyBear (p = 0x78000001, 31-bit) the AssertLt gadget only
constrains lhs - rhs ≡ diff - 2^max_bits (mod p) with diff ∈ [0, 2^30)
— it does not pre-bound lhs to be canonical-small. A malicious
y6_lo ∈ [0x74000001, p-1] (≈ 2^26 values) produces a field-wrap diff
that still fits in 30 bits, so the constraint accepts upper-half values
and the exploit survives. Byte-decomposing first kills the wrap. Ceno's
DynamicRangeTableCircuit<E, 18> also does not carry 30-bit lookup
entries, so a direct assert_const_range(_, 30) is not available
anyway.

Why M = 60 (vs SP1's 63). SP1 targets KoalaBear; its (p-1)/2 = 0x3f800000, so 63 leaves a small safety band. For BabyBear,
(p-1)/2 = 60·2^24 exactly — 63 would let y[6] straddle p/2 and
reintroduce the ambiguity.

Also corrects the stale comment that previously had the convention
reversed (claimed write ⇒ lower half, opposite of what the host code
does).

Change Highlights

ceno_zkvm/src/tables/shard_ram.rs
- ShardRamRecord::to_ec_point: reject y6 == 0 and try the next
  nonce. Same negation policy otherwise.
- ShardRamConfig: new field y6_lo_bytes: [WitIn; 4].
- ShardRamConfig::configure: replace the TODO with the byte
  decomposition, byte-range / LTU lookups, and the
  condition_require_equal branch equality.
- ShardRamCircuit::assign_instance: compute y6_lo from y[6] and
  is_to_write_set, assign byte limbs, register byte and LTU
  multiplicities.
- New test test_shard_ram_y_sign_circuit_rejects_negation: drives
  the full configure + assign_instance path for one honest row
  and one sign-flipped row, asserts the algebraic branch equality
  holds on both, and asserts the LTU lookup record (b3, 60, 1) is
  in-table for the honest row and out-of-table for the tampered row.

Benchmark / Performance Impact

Per ShardRam row this PR adds 4 byte WitIn columns plus 3 byte-range
and 1 LTU lookup multiplicities. ShardRam rows scale with cross-shard
RAM events, not with cycles, so the absolute cost is sub-percent on the
prover. No full prover bench was rerun (no hot-loop arithmetic changed).

Existing test_shard_ram_circuit (170k reads + 1420 writes, full chip
proof) runtime is unchanged within noise:

master   : ~5.0 s
this PR  : ~5.0 s

Testing

cargo fmt --all --check
cargo check --workspace --all-targets
cargo check --workspace --all-targets --release
cargo make clippy
cargo clippy --workspace --all-targets --release -- -D warnings
RUST_MIN_STACK=33554432 cargo test -p ceno_zkvm tables::shard_ram::tests --lib --release

All pass locally on BabyBear. Both test_shard_ram_circuit and the new
test_shard_ram_y_sign_circuit_rejects_negation are green.

cargo make tests / cargo make tests_goldilock should be re-run by
CI; the change is gated to BabyBear via a debug_assert_eq! on
MODULUS_U64 and goldilocks does not exercise shard_ram (per
integration.yml commented-out lines and CLAUDE.md).

Risks and Rollout

Soundness. Closes ShardRam read/write direction is not bound to the public shard_rw_sum #1338. The new constraint only adds local byte
arithmetic and existing lookups — no change to transcript, sumcheck,
PCS, or EC accumulation.
GPU. The CUDA witness generator
(instructions/gpu/chips/shard_ram.rs → witgen_shard_ram_per_row)
lives in the separate ceno-gpu-mock repo and does not yet populate
the new y6_lo_bytes columns. CPU runs are correct; GPU runs (behind
the gpu feature, not in default CI) will fail at proof time until
the kernel is updated. Follow-up required before re-enabling GPU
shard_ram.
Recursion. The recursive verifier consumes the same constraint
system, so no separate recursion code change is needed; once the base
proof passes, the recursive one will too. The reporter's PoC going
through the recursion verifier should now be rejected.
Field support. Hardcodes the BabyBear constant M = 60. A
debug_assert_eq!(MODULUS_U64, 0x78000001, ...) guards against
accidental use on a different field; shard_ram is BabyBear-only
today per CLAUDE.md.

Follow-ups

Port the y6_lo_bytes population to the ceno-gpu-mock CUDA witgen
kernel and re-enable GPU shard_ram.
The remaining finish the TODOs in shard_ram.rs #1340 TODOs (local read ⇄ global write pairing on
shard_ram.rs:235-236, shard == shard_id binding on line 244) are
intentionally out of scope here.

Fixes #1338.
Partially addresses #1340.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

The ShardRam circuit derived its EC accumulator points from `(x, y) = lift_x(poseidon2(record))` and used the sign of `y[6]` to encode read vs write direction on the host side, but never enforced that convention in-circuit. A malicious prover could replace every leaf `(x, y)` with `(x, -y)`, update the public `shard_rw_sum` accordingly, and produce a proof that the base verifier and the recursion verifier both accept against the same execution trace. Bind the sign via a fresh `y6_lo` ∈ [0, (p-1)/2), realized as four byte witnesses with the top byte < 60 (`(p-1)/2 = 60·2^24` exactly on BabyBear). The read branch requires `y[6] = y6_lo + 1` and the write branch `y[6] + y6_lo + 1 ≡ 0 (mod p)`. `y[6] = 0` is the unique exception, rejected by `to_ec_point` via nonce retry. The pattern follows SP1's `global_interaction.rs` (offset-by-1 + safe band + prover nonce retry), but uses Ceno's existing `lookup_ltu_byte` gadget instead of SP1's column layout. `M = 60` (vs SP1's 63 for KoalaBear) is the BabyBear-specific tightest no-overlap band. GPU `try_gpu_assign_shard_ram` is not updated in this PR — the new columns aren't populated by the existing CUDA kernel (lives in a separate repo). CPU runs are fixed; GPU runs will fail at proof time until the kernel is updated in a follow-up. Fixes #1338. Partially addresses #1340 (only the y-sign TODO). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…stance Replace the math-only PoC test with a circuit-driven one that: * Builds the constraint system through `build_gkr_iop_circuit`. * Runs `assign_instance` for both an honest EC point and the same point with its sign flipped. * Asserts that the algebraic `condition_require_equal` branch equality holds on both rows. * Asserts that the top byte of `y6_lo` is in `[0, 60)` for the honest row and in `[60, 256)` for the tampered row, so the `lookup_ltu_byte(b3, 60, 1)` table query is satisfiable only on the honest witness. Also tightens comments in `to_ec_point`, `ShardRamConfig`, `ShardRamConfig::configure`, and `ShardRamCircuit::assign_instance` to be implementation-focused per CLAUDE.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Findings (sorted by severity)

Blocker | ceno_zkvm/src/tables/shard_ram.rs (new assert_byte / lookup_ltu_byte constraints): LK multiplicity aggregation/order looks inconsistent with new lookups.
This PR introduces new LK interactions in ShardRamConfig::configure (assert_byte + lookup_ltu_byte). However, the global LK multiplicities used to assign the DynamicRange/LTU table circuits are finalized and those table circuits are assigned before ShardRamCircuit is assigned in the shard pipeline (see ceno_zkvm/src/e2e.rs:1500-1589). As written, ShardRam’s new lookup usage does not appear to contribute to combined_lk_mlt prior to table-circuit assignment, which is expected to break the logup multiset check (or otherwise leave these lookups unaccounted).
Suggested fix: update the witness/LK aggregation flow so ShardRam’s byte/LTU lookups contribute to the global multiplicity before Rv32imConfig::assign_table_circuit runs (e.g., collect a per-chip multiplicity for ShardRam and include it in ZKVMWitnesses.lk_mlts prior to finalize_lk_multiplicities, or reorder assignment so ShardRam is assigned before lookup-table circuits).
Major | ShardRamRecord::to_ec_point: half-of-field boundary is off by one vs the new convention.
is_y_in_2nd_half currently uses y6 >= prime/2. For odd primes, prime/2 == (p-1)/2, so the boundary value y6 == (p-1)/2 is classified as “second half”, causing the new convention (“read => [1,(p-1)/2]”) to be violated and potentially making otherwise-valid witnesses fail the new in-circuit banding.
Suggested fix: compare against (prime + 1)/2 (or use a strict y6 > prime/2) to match the stated ranges.
Major | BabyBear-only guard uses debug_assert_eq! in circuit configuration.
The constraint relies on BabyBear’s (p-1)/2 = 60·2^24, but debug_assert_eq! is compiled out in release. If instantiated over a different field, the circuit would silently become incorrect.
Suggested fix: enforce at runtime (e.g., assert_eq! or return Err(CircuitBuilderError::CircuitError(..))).
Minor | Comment accuracy in to_ec_point.
The “2-torsion case where (x,y)==(x,-y)” phrasing is misleading: y6 == 0 doesn’t imply the full y-coordinate is zero; it only means that limb is fixed under negation, which is what makes the chosen encoding ambiguous/unsatisfiable.
Suggested fix: reword the comment to reflect the actual reason for rejection.
Minor (testing) | New test does not assert constraint/prover rejection.
test_shard_ram_y_sign_circuit_rejects_negation currently checks derived limb properties (b3 < 60 vs >= 60) but doesn’t actually run a constraint satisfiability check / mock prover / proof attempt that must fail for the tampered row. This can pass even if the lookup constraint is missing or if LK-table population is broken.
Suggested fix: make it a true regression by asserting the tampered witness fails constraint satisfaction (e.g., via MockProver with the necessary public inputs / table chips, or by attempting proof generation and asserting it errors).

Open questions / assumptions

Is ShardRam always assigned after lookup-table circuits in all proving entrypoints (CPU + GPU + recursion pipelines)? If yes, the LK multiplicity/order blocker needs a design-level fix (not just a local change).
Is it acceptable to hard-fail (non-debug) when BaseField != BabyBear, or is there a preferred feature-gate pattern for BabyBear-only chips in this repo?

Changes:

Host-side to_ec_point now rejects y6 == 0 and documents the read/write y-half convention.
Circuit-side: adds byte-decomposition + lookup constraints and a conditional equality binding y6 to is_global_write.
Adds a targeted unit test around the y-sign binding logic for honest vs sign-flipped points.

+                // binding in the circuit cannot distinguish read from write
+                // when y6 = 0.
+                if y6 != 0 {
+                    let is_y_in_2nd_half = y6 >= (prime / 2);


+                // Skip the 2-torsion case where (x, y) == (x, -y); the y-sign
+                // binding in the circuit cannot distinguish read from write
+                // when y6 = 0.


+        debug_assert_eq!(
+            <E::BaseField as SmallField>::MODULUS_U64,
+            0x7800_0001,
+            "y6_lo byte bound assumes BabyBear's (p-1)/2 = 60 * 2^24"
+        );


+            std::array::from_fn(|i| cb.create_witin(|| format!("y6_lo_b{i}")));
+        for (i, w) in y6_lo_bytes.iter().enumerate().take(3) {
+            cb.assert_byte(|| format!("y6_lo_b{i} byte"), w.expr())?;
+        }
+        // `lookup_ltu_byte(a, b, 1)` asserts `a, b` are bytes and `a < b`.
+        cb.lookup_ltu_byte(
+            y6_lo_bytes[3].expr(),
+            E::BaseField::from_canonical_u64(60).expr(),
+            Expression::ONE,
+        )?;


+    /// Drive a single ShardRam row through `configure` + `assign_instance`
+    /// for both an honest EC point and a sign-flipped point. Check that:
+    ///
+    /// * The algebraic branch equality from `condition_require_equal` holds
+    ///   on both rows (`assign_instance` derives `y6_lo` to keep it
+    ///   consistent regardless of how `y[6]` was tampered).
+    /// * The top byte `b3 = (y6_lo >> 24) & 0xff` lands in `[0, 60)` for the
+    ///   honest row and in `[60, 256)` for the tampered row. The latter
+    ///   makes the `lookup_ltu_byte(b3, 60, 1)` record query
+    ///   `(b3, 60, 1)`, which is not present in the LTU table (the table
+    ///   only carries `(b3, 60, 0)` for `b3 >= 60`), so the lookup
+    ///   constraint rejects the tampered witness.


kunxian-xia · 2026-05-25T03:09:50Z

+                // Skip the 2-torsion case where (x, y) == (x, -y); the y-sign
+                // binding in the circuit cannot distinguish read from write
+                // when y6 = 0.
+                if y6 != 0 {


why do we need this condition?

i think we do need to handle the 2-torsion case where $y = \vec{\mathbf{0}}$ in which we cannot distinguish read from write.

`from_canonical_u32` debug-asserts the input is less than the field prime. BabyBear's prime is 0x78000001, so the previous `0xdeadbeef` test constant panicked under the default (debug) `cargo make tests` target while passing in release. Swap for an in-range address. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The previous version added `lookup_ltu_byte` and `assert_byte` records in ShardRamCircuit's per-row `LkMultiplicity`, but `TableCircuit:: assign_instances` has no path to feed those records into the global `combined_lk_mlt` that the Ltu/Dynamic table circuits read when they populate their multiplicity columns. The opcode/table logup sums then fail to balance, and `verify` rejects every proof with `logup_sum != 0`. The unit test that round-trips a single chip in isolation did not catch this because it never assigns the lookup-table circuits. Replace the byte limbs with a 30-bit decomposition: 30 fresh `WitIn` plus 30 `assert_bit` plus a degree-2 cap (`c2627 = c26*c27`, `c2829 = c28*c29`, `c2627*c2829 = 0`) excluding the four high bits being all set. For BabyBear `(p-1)/2 = 0x3c000000` has bits 26..29 set and the rest zero, so the constraint lands `y6_lo` in `[0, (p-1)/2)`. The two intermediate witnesses keep each multiplication at degree 2 so the constraint stays under the sumcheck framework's degree-4 ceiling once the row selector multiplies in. The integration regression `cargo run --features sanity-check --bin e2e -- --platform=ceno --max-cycle-per-shard=20000 --hints=10 --public-io=4191 examples/.../fibonacci` now verifies through all seven shards. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

spherel and others added 2 commits May 24, 2026 17:55

spherel requested review from Copilot and hero78119 May 25, 2026 01:23

Copilot started reviewing on behalf of spherel May 25, 2026 01:23 View session

Copilot AI reviewed May 25, 2026

View reviewed changes

kunxian-xia reviewed May 25, 2026

View reviewed changes

spherel and others added 2 commits May 24, 2026 22:34

spherel marked this pull request as draft May 25, 2026 13:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(soundness): bind ShardRam y-sign to is_global_write#1344

fix(soundness): bind ShardRam y-sign to is_global_write#1344
spherel wants to merge 4 commits into
masterfrom
fix/issue-1338-shard-ram-y-sign

spherel commented May 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

kunxian-xia May 25, 2026

Uh oh!

kunxian-xia May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

spherel commented May 25, 2026

Problem

Design Rationale

Change Highlights

Benchmark / Performance Impact

Testing

Risks and Rollout

Follow-ups

Copilot Reviewer Directive (keep this section)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Findings (sorted by severity)

Open questions / assumptions

Uh oh!

kunxian-xia May 25, 2026

Choose a reason for hiding this comment

Uh oh!

kunxian-xia May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants