
perf: delegate G2 MSM to generic XYZZ Pippenger #86

Merged
MatteoMer merged 2 commits into main from worktree-perf+g2-msm-xyzz-buckets on Apr 18, 2026

Conversation

@MatteoMer
Owner

Summary

  • Replace hand-written G2 Pippenger (Jacobian buckets, per-window parallelism, ~339 lines) with a thin wrapper that delegates to the generic MSM(F, Fp2).computeWithPool() (~80 lines of new code)
  • G2 MSM now gets the same optimizations as G1: XYZZ bucket coordinates (7M+2S vs 7M+4S per mixed add), batch window normalization (Montgomery's trick), and chunk-based parallelism (better cache locality)
  • Zero-cost type bridging via @ptrCast with comptime layout assertions — G2Point and AffinePoint(Fp2) have identical memory layout
  • Public API (msmG2, msmG2Bench) unchanged — no caller modifications needed
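
For readers unfamiliar with the bucket method that the summary refers to, here is a minimal sketch in Python. It is not the project's code: the toy group (integers modulo `q` under addition) stands in for curve points in XYZZ coordinates, and the names `msm_pippenger`, `c`, and `scalar_bits` are illustrative assumptions.

```python
def msm_naive(scalars, points, q):
    """Reference result: sum of s_i * P_i in the toy group (Z_q, +)."""
    return sum(s * p for s, p in zip(scalars, points)) % q


def msm_pippenger(scalars, points, q, c=4, scalar_bits=16):
    """Bucket-method MSM over the toy group (Z_q, +).

    Assumes every scalar is non-negative and fits in `scalar_bits` bits.
    """
    num_windows = (scalar_bits + c - 1) // c
    mask = (1 << c) - 1
    window_sums = []
    for w in range(num_windows):
        # buckets[d] accumulates every point whose w-th c-bit digit equals d.
        buckets = [0] * (1 << c)
        for s, p in zip(scalars, points):
            d = (s >> (w * c)) & mask
            if d:
                buckets[d] = (buckets[d] + p) % q
        # Running-sum trick: computes sum(d * buckets[d]) with additions only.
        running = acc = 0
        for d in range(mask, 0, -1):
            running = (running + buckets[d]) % q
            acc = (acc + running) % q
        window_sums.append(acc)
    # Combine windows from most significant to least: c doublings per window.
    result = 0
    for acc in reversed(window_sums):
        for _ in range(c):
            result = (result + result) % q
        result = (result + acc) % q
    return result
```

In a real implementation the bucket additions are the hot loop, which is why the cost of a mixed addition in the bucket coordinate system matters.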

Test plan

  • zig build test — all unit tests pass, including the arkworks-validated G2 MSM fixture vectors
  • Benchmark G2 MSM at various sizes to measure speedup
  • End-to-end timing comparison: cargo run --release -p zolt -- prove examples/sha256_2048.elf

🤖 Generated with Claude Code

MatteoMer and others added 2 commits April 18, 2026 10:01
Replace the hand-written G2 Pippenger (Jacobian buckets, per-window
parallelism) with a thin wrapper that delegates to the generic
MSM(F, Fp2).computeWithPool(). This gives G2 the same optimizations
as G1: XYZZ bucket coordinates (7M+2S vs 7M+4S per mixed add),
batch window normalization via Montgomery's trick, and chunk-based
parallelism for better cache locality.
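
Montgomery's trick, mentioned above, replaces many field inversions with a single one plus a few multiplications per element. The sketch below shows the idea over a prime field in Python; how the generic MSM applies it to window normalization is not shown in this PR, so the function name and interface are illustrative assumptions.

```python
def batch_inverse(values, p):
    """Invert every value modulo the prime p using one modular inversion.

    Assumes all values are nonzero modulo p. Requires Python 3.8+ for
    pow(x, -1, p).
    """
    n = len(values)
    # prefix[i] holds the product of values[0..i-1].
    prefix = [1] * (n + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = prefix[i] * v % p
    # The only inversion: inverse of the product of all values.
    inv_running = pow(prefix[n], -1, p)
    out = [0] * n
    for i in range(n - 1, -1, -1):
        # inv_running is the inverse of values[0..i]; strip off values[0..i-1].
        out[i] = prefix[i] * inv_running % p
        inv_running = inv_running * values[i] % p
    return out
```

Each element costs roughly three multiplications, which is far cheaper than a separate inversion per element.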

G2Point and AffinePoint(Fp2) share identical memory layout, so the
bridging is a zero-cost @ptrCast with comptime layout assertions.
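
The idea of a layout-checked reinterpretation can be illustrated with Python's `ctypes`, which here stands in for Zig's `@ptrCast` and comptime assertions (the check runs at call time rather than at compile time). The field names, the limb count, and the `infinity` flag are hypothetical; the actual definitions of G2Point and AffinePoint(Fp2) are not shown in this PR.

```python
import ctypes

LIMBS = 4  # assumption: four 64-bit limbs per base-field element


class Fp2(ctypes.Structure):
    _fields_ = [("c0", ctypes.c_uint64 * LIMBS), ("c1", ctypes.c_uint64 * LIMBS)]


class G2Point(ctypes.Structure):  # hypothetical stand-in for the G2 type
    _fields_ = [("x", Fp2), ("y", Fp2), ("infinity", ctypes.c_bool)]


class AffinePointFp2(ctypes.Structure):  # hypothetical stand-in for the generic type
    _fields_ = [("x", Fp2), ("y", Fp2), ("infinity", ctypes.c_bool)]


def same_layout(a, b):
    """True if both structs have equal size and identical field offsets and sizes."""
    def fields(t):
        return [(name, getattr(t, name).offset, getattr(t, name).size)
                for name, _ in t._fields_]
    return ctypes.sizeof(a) == ctypes.sizeof(b) and fields(a) == fields(b)


def as_generic_points(points):
    """Reinterpret an array of G2Point as AffinePointFp2 without copying."""
    assert same_layout(G2Point, AffinePointFp2), "layout mismatch"
    n = len(points)
    return ctypes.cast(points, ctypes.POINTER(AffinePointFp2 * n)).contents
```

The returned array aliases the original memory, so no conversion pass over the points is needed before calling the generic routine.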

339 → 118 lines, public API (msmG2, msmG2Bench) unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The generic MSM's parallel path previously fell back to sequential
Pippenger when n < num_threads*256 (chunks too small). This caused a
regression for G2 MSM at sizes 256-2048, where the old hand-written
code used per-window parallelism.

Add pippengerMSMWindowParallel: each thread processes a subset of
windows over all points, using XYZZ buckets + batch normalization.
Used when n < num_threads*256; chunk-based parallelism still used
for larger inputs where cache locality matters more.

Benefits both G1 and G2. G2 parallel MSM at N=1024: 6.9ms → 4.4ms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
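
The strategy selection described in this commit can be sketched as follows. This is a Python illustration, not the project's code: only the `n < num_threads*256` threshold comes from the commit message. The function names, the single-thread fallback, and the contiguous assignment of windows to threads are assumptions.

```python
def choose_strategy(n, num_threads, min_chunk=256):
    """Pick a parallelization strategy for an MSM over n points."""
    if num_threads <= 1:
        return "sequential"  # assumption: nothing to parallelize
    if n < num_threads * min_chunk:
        # Chunks would be too small: split the work by window instead.
        return "window-parallel"
    return "chunk-parallel"


def split_windows(num_windows, num_threads):
    """Assign contiguous window ranges [start, end) to threads as evenly as possible.

    Assumes num_windows >= 1 and num_threads >= 1.
    """
    workers = min(num_threads, num_windows)
    base, extra = divmod(num_windows, workers)
    ranges, start = [], 0
    for t in range(workers):
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

With 8 threads, N=1024 falls below the 2048-point threshold and takes the window-parallel path, while N=4096 uses chunk-based parallelism.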
@MatteoMer MatteoMer merged commit 8d5bf91 into main Apr 18, 2026
17 checks passed