Skip to content

fix(soundness): bind ShardRam y-sign to is_global_write#1344

Draft
spherel wants to merge 4 commits into
masterfrom
fix/issue-1338-shard-ram-y-sign
Draft

fix(soundness): bind ShardRam y-sign to is_global_write#1344
spherel wants to merge 4 commits into
masterfrom
fix/issue-1338-shard-ram-y-sign

Conversation

@spherel
Copy link
Copy Markdown
Member

@spherel spherel commented May 25, 2026

Problem

Issue #1338 reproduces a soundness break on master. For the same RISC-V
execution, the base verifier and the recursion verifier both accept two
distinct proof batches whose public per-shard shard_rw_sum values differ
on all 17 shards. The attacker takes an honest witness, replaces every
cross-shard EC accumulator leaf (x, y) with its inverse (x, -y),
updates shard_rw_sum, and reproves.

Root cause: ceno_zkvm/src/tables/shard_ram.rs:276-281 was a TODO. The
host code in ShardRamRecord::to_ec_point encodes read vs write in the
sign of y[6], but the circuit only constrained the curve equation and
the EC sum — never tying y[6]'s half-of-field to is_global_write.
Both (x, y) and (x, -y) satisfied every existing check, so the public
summary of cross-shard RAM flow was unbound.

The defect survives recursion (the reporter's PoC verifies through the
recursion verifier program).

Design Rationale

Approach borrows the idea from SP1's
crates/core/machine/src/operations/global_interaction.rs:210-236,
not its column layout. Three pieces:

  1. Offset by +1. Express y[6] in terms of a fresh witness y6_lo
    so y[6] = 0 is never valid in either branch (it is the 2-torsion
    fixed point where the two signs collapse).
  2. Safe band + prover retry. Restrict y6_lo to [0, (p-1)/2). For
    the rare exception y[6] = 0 (probability ~1/p ≈ 2^-31 per record)
    the host rejects and retries with a new nonce.
  3. Byte-decomposition range check. y6_lo decomposed into four byte
    limbs b0..b3 (assert_byte for b0..b2, lookup_ltu_byte(b3, 60, 1)
    for b3). For BabyBear, (p-1)/2 = 60·2^24 exactly, so b3 < 60
    gives the tightest no-overlap band.

In-circuit branch equality via condition_require_equal:

  • read (is_global_write = 0): y[6] = y6_lo + 1y[6] ∈ [1, (p-1)/2]
  • write (is_global_write = 1): y[6] = p - 1 - y6_loy[6] ∈ [(p+1)/2, p-1]

Union covers [1, p-1] with no overlap; y[6] = 0 is excluded.

Why not a single AssertLtConfig(y6_lo, (p-1)/2, max_bits=30)?
On BabyBear (p = 0x78000001, 31-bit) the AssertLt gadget only
constrains lhs - rhs ≡ diff - 2^max_bits (mod p) with diff ∈ [0, 2^30)
— it does not pre-bound lhs to be canonical-small. A malicious
y6_lo ∈ [0x74000001, p-1] (≈ 2^26 values) produces a field-wrap diff
that still fits in 30 bits, so the constraint accepts upper-half values
and the exploit survives. Byte-decomposing first kills the wrap. Ceno's
DynamicRangeTableCircuit<E, 18> also does not carry 30-bit lookup
entries, so a direct assert_const_range(_, 30) is not available
anyway.

Why M = 60 (vs SP1's 63). SP1 targets KoalaBear; its (p-1)/2 = 0x3f800000, so 63 leaves a small safety band. For BabyBear,
(p-1)/2 = 60·2^24 exactly — 63 would let y[6] straddle p/2 and
reintroduce the ambiguity.

Also corrects the stale comment that previously had the convention
reversed (claimed write ⇒ lower half, opposite of what the host code
does).

Change Highlights

  • ceno_zkvm/src/tables/shard_ram.rs
    • ShardRamRecord::to_ec_point: reject y6 == 0 and try the next
      nonce. Same negation policy otherwise.
    • ShardRamConfig: new field y6_lo_bytes: [WitIn; 4].
    • ShardRamConfig::configure: replace the TODO with the byte
      decomposition, byte-range / LTU lookups, and the
      condition_require_equal branch equality.
    • ShardRamCircuit::assign_instance: compute y6_lo from y[6] and
      is_to_write_set, assign byte limbs, register byte and LTU
      multiplicities.
    • New test test_shard_ram_y_sign_circuit_rejects_negation: drives
      the full configure + assign_instance path for one honest row
      and one sign-flipped row, asserts the algebraic branch equality
      holds on both, and asserts the LTU lookup record (b3, 60, 1) is
      in-table for the honest row and out-of-table for the tampered row.

Benchmark / Performance Impact

Per ShardRam row this PR adds 4 byte WitIn columns plus 3 byte-range
and 1 LTU lookup multiplicities. ShardRam rows scale with cross-shard
RAM events, not with cycles, so the absolute cost is sub-percent on the
prover. No full prover bench was rerun (no hot-loop arithmetic changed).

Existing test_shard_ram_circuit (170k reads + 1420 writes, full chip
proof) runtime is unchanged within noise:

master   : ~5.0 s
this PR  : ~5.0 s

Testing

cargo fmt --all --check
cargo check --workspace --all-targets
cargo check --workspace --all-targets --release
cargo make clippy
cargo clippy --workspace --all-targets --release -- -D warnings
RUST_MIN_STACK=33554432 cargo test -p ceno_zkvm tables::shard_ram::tests --lib --release

All pass locally on BabyBear. Both test_shard_ram_circuit and the new
test_shard_ram_y_sign_circuit_rejects_negation are green.

cargo make tests / cargo make tests_goldilock should be re-run by
CI; the change is gated to BabyBear via a debug_assert_eq! on
MODULUS_U64 and goldilocks does not exercise shard_ram (per
integration.yml commented-out lines and CLAUDE.md).

Risks and Rollout

  • Soundness. Closes ShardRam read/write direction is not bound to the public shard_rw_sum #1338. The new constraint only adds local byte
    arithmetic and existing lookups — no change to transcript, sumcheck,
    PCS, or EC accumulation.
  • GPU. The CUDA witness generator
    (instructions/gpu/chips/shard_ram.rswitgen_shard_ram_per_row)
    lives in the separate ceno-gpu-mock repo and does not yet populate
    the new y6_lo_bytes columns. CPU runs are correct; GPU runs (behind
    the gpu feature, not in default CI) will fail at proof time until
    the kernel is updated. Follow-up required before re-enabling GPU
    shard_ram.
  • Recursion. The recursive verifier consumes the same constraint
    system, so no separate recursion code change is needed; once the base
    proof passes, the recursive one will too. The reporter's PoC going
    through the recursion verifier should now be rejected.
  • Field support. Hardcodes the BabyBear constant M = 60. A
    debug_assert_eq!(MODULUS_U64, 0x78000001, ...) guards against
    accidental use on a different field; shard_ram is BabyBear-only
    today per CLAUDE.md.

Follow-ups

  • Port the y6_lo_bytes population to the ceno-gpu-mock CUDA witgen
    kernel and re-enable GPU shard_ram.
  • The remaining finish the TODOs in shard_ram.rs #1340 TODOs (local read ⇄ global write pairing on
    shard_ram.rs:235-236, shard == shard_id binding on line 244) are
    intentionally out of scope here.

Fixes #1338.
Partially addresses #1340.

Copilot Reviewer Directive (keep this section)

When Copilot reviews this PR, apply .github/copilot-instructions.md strictly.

spherel and others added 2 commits May 24, 2026 17:55
The ShardRam circuit derived its EC accumulator points from
`(x, y) = lift_x(poseidon2(record))` and used the sign of `y[6]` to
encode read vs write direction on the host side, but never enforced
that convention in-circuit. A malicious prover could replace every
leaf `(x, y)` with `(x, -y)`, update the public `shard_rw_sum`
accordingly, and produce a proof that the base verifier and the
recursion verifier both accept against the same execution trace.

Bind the sign via a fresh `y6_lo` ∈ [0, (p-1)/2), realized as four
byte witnesses with the top byte < 60 (`(p-1)/2 = 60·2^24` exactly
on BabyBear). The read branch requires `y[6] = y6_lo + 1` and the
write branch `y[6] + y6_lo + 1 ≡ 0 (mod p)`. `y[6] = 0` is the unique
exception, rejected by `to_ec_point` via nonce retry.

The pattern follows SP1's `global_interaction.rs` (offset-by-1 + safe
band + prover nonce retry), but uses Ceno's existing `lookup_ltu_byte`
gadget instead of SP1's column layout. `M = 60` (vs SP1's 63 for
KoalaBear) is the BabyBear-specific tightest no-overlap band.

GPU `try_gpu_assign_shard_ram` is not updated in this PR — the new
columns aren't populated by the existing CUDA kernel (lives in a
separate repo). CPU runs are fixed; GPU runs will fail at proof
time until the kernel is updated in a follow-up.

Fixes #1338.
Partially addresses #1340 (only the y-sign TODO).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…stance

Replace the math-only PoC test with a circuit-driven one that:
  * Builds the constraint system through `build_gkr_iop_circuit`.
  * Runs `assign_instance` for both an honest EC point and the same
    point with its sign flipped.
  * Asserts that the algebraic `condition_require_equal` branch
    equality holds on both rows.
  * Asserts that the top byte of `y6_lo` is in `[0, 60)` for the honest
    row and in `[60, 256)` for the tampered row, so the
    `lookup_ltu_byte(b3, 60, 1)` table query is satisfiable only on the
    honest witness.

Also tightens comments in `to_ec_point`, `ShardRamConfig`,
`ShardRamConfig::configure`, and `ShardRamCircuit::assign_instance` to
be implementation-focused per CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Findings (sorted by severity)

  • Blocker | ceno_zkvm/src/tables/shard_ram.rs (new assert_byte / lookup_ltu_byte constraints): LK multiplicity aggregation/order looks inconsistent with new lookups.
    This PR introduces new LK interactions in ShardRamConfig::configure (assert_byte + lookup_ltu_byte). However, the global LK multiplicities used to assign the DynamicRange/LTU table circuits are finalized and those table circuits are assigned before ShardRamCircuit is assigned in the shard pipeline (see ceno_zkvm/src/e2e.rs:1500-1589). As written, ShardRam’s new lookup usage does not appear to contribute to combined_lk_mlt prior to table-circuit assignment, which is expected to break the logup multiset check (or otherwise leave these lookups unaccounted).
    Suggested fix: update the witness/LK aggregation flow so ShardRam’s byte/LTU lookups contribute to the global multiplicity before Rv32imConfig::assign_table_circuit runs (e.g., collect a per-chip multiplicity for ShardRam and include it in ZKVMWitnesses.lk_mlts prior to finalize_lk_multiplicities, or reorder assignment so ShardRam is assigned before lookup-table circuits).

  • Major | ShardRamRecord::to_ec_point: half-of-field boundary is off by one vs the new convention.
    is_y_in_2nd_half currently uses y6 >= prime/2. For odd primes, prime/2 == (p-1)/2, so the boundary value y6 == (p-1)/2 is classified as “second half”, causing the new convention (“read => [1,(p-1)/2]”) to be violated and potentially making otherwise-valid witnesses fail the new in-circuit banding.
    Suggested fix: compare against (prime + 1)/2 (or use a strict y6 > prime/2) to match the stated ranges.

  • Major | BabyBear-only guard uses debug_assert_eq! in circuit configuration.
    The constraint relies on BabyBear’s (p-1)/2 = 60·2^24, but debug_assert_eq! is compiled out in release. If instantiated over a different field, the circuit would silently become incorrect.
    Suggested fix: enforce at runtime (e.g., assert_eq! or return Err(CircuitBuilderError::CircuitError(..))).

  • Minor | Comment accuracy in to_ec_point.
    The “2-torsion case where (x,y)==(x,-y)” phrasing is misleading: y6 == 0 doesn’t imply the full y-coordinate is zero; it only means that limb is fixed under negation, which is what makes the chosen encoding ambiguous/unsatisfiable.
    Suggested fix: reword the comment to reflect the actual reason for rejection.

  • Minor (testing) | New test does not assert constraint/prover rejection.
    test_shard_ram_y_sign_circuit_rejects_negation currently checks derived limb properties (b3 < 60 vs >= 60) but doesn’t actually run a constraint satisfiability check / mock prover / proof attempt that must fail for the tampered row. This can pass even if the lookup constraint is missing or if LK-table population is broken.
    Suggested fix: make it a true regression by asserting the tampered witness fails constraint satisfaction (e.g., via MockProver with the necessary public inputs / table chips, or by attempting proof generation and asserting it errors).

Open questions / assumptions

  • Is ShardRam always assigned after lookup-table circuits in all proving entrypoints (CPU + GPU + recursion pipelines)? If yes, the LK multiplicity/order blocker needs a design-level fix (not just a local change).
  • Is it acceptable to hard-fail (non-debug) when BaseField != BabyBear, or is there a preferred feature-gate pattern for BabyBear-only chips in this repo?

Changes:

  • Host-side to_ec_point now rejects y6 == 0 and documents the read/write y-half convention.
  • Circuit-side: adds byte-decomposition + lookup constraints and a conditional equality binding y6 to is_global_write.
  • Adds a targeted unit test around the y-sign binding logic for honest vs sign-flipped points.

Comment thread ceno_zkvm/src/tables/shard_ram.rs Outdated
// binding in the circuit cannot distinguish read from write
// when y6 = 0.
if y6 != 0 {
let is_y_in_2nd_half = y6 >= (prime / 2);
Comment thread ceno_zkvm/src/tables/shard_ram.rs Outdated
Comment on lines +136 to +138
// Skip the 2-torsion case where (x, y) == (x, -y); the y-sign
// binding in the circuit cannot distinguish read from write
// when y6 = 0.
Comment on lines +293 to +297
debug_assert_eq!(
<E::BaseField as SmallField>::MODULUS_U64,
0x7800_0001,
"y6_lo byte bound assumes BabyBear's (p-1)/2 = 60 * 2^24"
);
Comment on lines +299 to +308
std::array::from_fn(|i| cb.create_witin(|| format!("y6_lo_b{i}")));
for (i, w) in y6_lo_bytes.iter().enumerate().take(3) {
cb.assert_byte(|| format!("y6_lo_b{i} byte"), w.expr())?;
}
// `lookup_ltu_byte(a, b, 1)` asserts `a, b` are bytes and `a < b`.
cb.lookup_ltu_byte(
y6_lo_bytes[3].expr(),
E::BaseField::from_canonical_u64(60).expr(),
Expression::ONE,
)?;
Comment thread ceno_zkvm/src/tables/shard_ram.rs Outdated
Comment on lines +1006 to +1017
/// Drive a single ShardRam row through `configure` + `assign_instance`
/// for both an honest EC point and a sign-flipped point. Check that:
///
/// * The algebraic branch equality from `condition_require_equal` holds
/// on both rows (`assign_instance` derives `y6_lo` to keep it
/// consistent regardless of how `y[6]` was tampered).
/// * The top byte `b3 = (y6_lo >> 24) & 0xff` lands in `[0, 60)` for the
/// honest row and in `[60, 256)` for the tampered row. The latter
/// makes the `lookup_ltu_byte(b3, 60, 1)` record query
/// `(b3, 60, 1)`, which is not present in the LTU table (the table
/// only carries `(b3, 60, 0)` for `b3 >= 60`), so the lookup
/// constraint rejects the tampered witness.
Comment thread ceno_zkvm/src/tables/shard_ram.rs Outdated
// Skip the 2-torsion case where (x, y) == (x, -y); the y-sign
// binding in the circuit cannot distinguish read from write
// when y6 = 0.
if y6 != 0 {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this condition?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we do need to handle the 2-torsion case where $y = \vec{\mathbf{0}}$ in which we cannot distinguish read from write.

spherel and others added 2 commits May 24, 2026 22:34
`from_canonical_u32` debug-asserts the input is less than the field
prime. BabyBear's prime is 0x78000001, so the previous `0xdeadbeef`
test constant panicked under the default (debug) `cargo make tests`
target while passing in release. Swap for an in-range address.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous version added `lookup_ltu_byte` and `assert_byte` records
in ShardRamCircuit's per-row `LkMultiplicity`, but `TableCircuit::
assign_instances` has no path to feed those records into the global
`combined_lk_mlt` that the Ltu/Dynamic table circuits read when they
populate their multiplicity columns. The opcode/table logup sums then
fail to balance, and `verify` rejects every proof with
`logup_sum != 0`. The unit test that round-trips a single chip in
isolation did not catch this because it never assigns the lookup-table
circuits.

Replace the byte limbs with a 30-bit decomposition: 30 fresh `WitIn`
plus 30 `assert_bit` plus a degree-2 cap (`c2627 = c26*c27`,
`c2829 = c28*c29`, `c2627*c2829 = 0`) excluding the four high bits
being all set. For BabyBear `(p-1)/2 = 0x3c000000` has bits 26..29 set
and the rest zero, so the constraint lands `y6_lo` in
`[0, (p-1)/2)`. The two intermediate witnesses keep each multiplication
at degree 2 so the constraint stays under the sumcheck framework's
degree-4 ceiling once the row selector multiplies in.

The integration regression
`cargo run --features sanity-check --bin e2e -- --platform=ceno
 --max-cycle-per-shard=20000 --hints=10 --public-io=4191
 examples/.../fibonacci`
now verifies through all seven shards.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@spherel spherel marked this pull request as draft May 25, 2026 13:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ShardRam read/write direction is not bound to the public shard_rw_sum

3 participants