RISC-V: Optimize decompression throughput by mirroring AVX fast-path for RVV short memcpy (+15%) #235
anthony-zy wants to merge 1 commit into
Conversation
camel-cdr: Again, how about utilizing the full vector register: https://godbolt.org/z/MovxdEzzv
anthony-zy: @camel-cdr Compared to the current PR, using the logic from the previous Godbolt link results in an 8% performance drop. To balance portability and performance, I see two options. Option A (current): stick with LMUL=2. It keeps the code simple and ensures we process 32 bytes per loop on VLEN=128 systems, despite the minor 0.6% performance trade-off.
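For reference (a standard RVV identity, not something stated in the thread): the maximum elements per vector operation is VLMAX = VLEN × LMUL / SEW. With SEW = 8 and LMUL = 2 on a VLEN = 128 machine, VLMAX = 128 × 2 / 8 = 32, i.e. 32 bytes per loop iteration; LMUL = 1 would give only 16.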
```c
while (remaining_bytes > 0) {
  // Set vector configuration: e8 (8-bit elements), m2 (LMUL=2).
  // Use e8m2 configuration to maximize throughput.
  size_t vl = VSETVL_E8M2(remaining_bytes);
```
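The hunk above shows only the head of the loop. For readers without the full diff, here is a minimal self-contained sketch of such an e8m2 strip-mining copy; the function name is mine, and it assumes `VSETVL_E8M2` wraps the standard `__riscv_vsetvl_e8m2` intrinsic:

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Generic strip-mining copy: each iteration processes up to
// VLEN * 2 / 8 bytes (32 bytes when VLEN = 128).
static void rvv_memcpy_loop(uint8_t* dst, const uint8_t* src,
                            size_t remaining_bytes) {
  while (remaining_bytes > 0) {
    size_t vl = __riscv_vsetvl_e8m2(remaining_bytes);
    vuint8m2_t v = __riscv_vle8_v_u8m2(src, vl);
    __riscv_vse8_v_u8m2(dst, v, vl);
    src += vl;
    dst += vl;
    remaining_bytes -= vl;
  }
}
```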
I liked the macro version before; that's no longer the case. If you can fix it, I'd be grateful.
anthony-zy force-pushed the branch from c306c2b to acfa760
PR Title
Optimize RVV memcpy path to mirror AVX fast-path for short copies
Summary
Refactors the RISC-V Vector (RVV) acceleration path in the short memcpy helper to mirror the existing AVX implementation. Instead of a generic `vsetvl` loop, this version performs a fixed 32-byte vector copy (with an optional second 32-byte segment), matching the profile-driven assumption that nearly all copies are short. This eliminates loop overhead in the hot path and aligns RVV behavior with the well-tuned AVX code path.
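As a concrete illustration, a minimal sketch of the fixed-copy shape described above. The helper name is hypothetical, and it assumes, as the summary implies, that writing a full 32-byte segment is safe even when `size` is smaller (i.e. the destination has slop, as in the AVX fast path):

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch; kShortMemCopy mirrors the constant named in this PR.
enum { kShortMemCopy = 32 };

static inline void RvvShortMemCopy(uint8_t* dst, const uint8_t* src,
                                   size_t size) {
  // One fixed 32-byte segment: e8/m2 yields vl = 32 even on VLEN = 128.
  size_t vl = __riscv_vsetvl_e8m2(kShortMemCopy);
  __riscv_vse8_v_u8m2(dst, __riscv_vle8_v_u8m2(src, vl), vl);
  // The second segment is rare; the PR guards it with SNAPPY_PREDICT_FALSE.
  if (size > kShortMemCopy) {
    __riscv_vse8_v_u8m2(dst + kShortMemCopy,
                        __riscv_vle8_v_u8m2(src + kShortMemCopy, vl), vl);
  }
}
```

Because the copy length is fixed at compile time, there is no loop-carried dependency on `vl`, which is the source of the loop overhead the PR removes.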
Performance
Tested with lzbench on a Spacemit(R) X60 (RISC-V, RVV 1.0); decompression throughput improves by about 15%.
Implementation Details
- Uses `__riscv_vsetvl_e8m2(32)` to configure a 32-byte vector operation, matching the `kShortMemCopy` constant used in the AVX path.
- Takes the second 32-byte segment (marked `SNAPPY_PREDICT_FALSE`) only when `size > kShortMemCopy`, since profiling shows long copies are rare.
- With `e8m1` and `VLEN < 256` (e.g., `VLEN=128`, `vl=16`), two segments per 32 B would be required; using `e8m2` ensures 32 B per op on common configurations.

Verification
- Built with `-march=rv64gcv`.
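As a usage note (an illustrative invocation, not taken from the PR): a typical cross-compile would look like `riscv64-unknown-linux-gnu-g++ -O2 -march=rv64gcv -c snappy.cc`; the `v` in `-march` is what enables the RVV intrinsics used above.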