RISC-V: Optimize decompression throughput by mirroring AVX fast-path for RVV short memcpy (+15%) #235
anthony-zy wants to merge 1 commit into
Conversation
camel-cdr: Again, how about utilizing the full vector register: https://godbolt.org/z/MovxdEzzv
anthony-zy: @camel-cdr Compared to the current PR, using the logic from the previous Godbolt link results in an 8% performance drop. To balance portability and performance, I see two options. Option A (current): stick with LMUL=2. It keeps the code simple and ensures we process 32 bytes per loop on VLEN=128 systems, despite the minor 0.6% performance trade-off.
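For reference (a standard RVV identity, not something stated in the thread): the maximum elements per vector operation is VLMAX = VLEN × LMUL / SEW. With SEW = 8 and LMUL = 2 on a VLEN = 128 machine, VLMAX = 128 × 2 / 8 = 32, i.e. 32 bytes per loop iteration; LMUL = 1 would give only 16.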
```c
while (remaining_bytes > 0) {
  // Set vector configuration: e8 (8-bit elements), m2 (LMUL=2).
  // Use e8m2 configuration to maximize throughput.
  size_t vl = VSETVL_E8M2(remaining_bytes);
```
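The hunk above shows only the head of the loop. For readers without the full diff, here is a minimal self-contained sketch of such an e8m2 strip-mining copy; the function name is mine, and it assumes `VSETVL_E8M2` wraps the standard `__riscv_vsetvl_e8m2` intrinsic:

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Generic strip-mining copy: each iteration processes up to
// VLEN * 2 / 8 bytes (32 bytes when VLEN = 128).
static void rvv_memcpy_loop(uint8_t* dst, const uint8_t* src,
                            size_t remaining_bytes) {
  while (remaining_bytes > 0) {
    size_t vl = __riscv_vsetvl_e8m2(remaining_bytes);
    vuint8m2_t v = __riscv_vle8_v_u8m2(src, vl);
    __riscv_vse8_v_u8m2(dst, v, vl);
    src += vl;
    dst += vl;
    remaining_bytes -= vl;
  }
}
```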
I liked the macro version before; that's no longer the case. If you can fix it, I'd be grateful.
anthony-zy force-pushed the branch from c306c2b to acfa760
PR Title
Optimize RVV memcpy path to mirror AVX fast-path for short copies
Summary
Refactors the RISC-V Vector (RVV) acceleration path in the short memcpy helper to mirror the existing AVX implementation. Instead of a generic `vsetvl` loop, this version performs a fixed 32-byte vector copy (with an optional second 32-byte segment), matching the profile-driven assumption that nearly all copies are short. This eliminates loop overhead in the hot path and aligns RVV behavior with the well-tuned AVX code path.
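As a concrete illustration, a minimal sketch of the fixed-copy shape described above. The helper name is hypothetical, and it assumes, as the summary implies, that writing a full 32-byte segment is safe even when `size` is smaller (i.e. the destination has slop, as in the AVX fast path):

```c
#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch; kShortMemCopy mirrors the constant named in this PR.
enum { kShortMemCopy = 32 };

static inline void RvvShortMemCopy(uint8_t* dst, const uint8_t* src,
                                   size_t size) {
  // One fixed 32-byte segment: e8/m2 yields vl = 32 even on VLEN = 128.
  size_t vl = __riscv_vsetvl_e8m2(kShortMemCopy);
  __riscv_vse8_v_u8m2(dst, __riscv_vle8_v_u8m2(src, vl), vl);
  // The second segment is rare; the PR guards it with SNAPPY_PREDICT_FALSE.
  if (size > kShortMemCopy) {
    __riscv_vse8_v_u8m2(dst + kShortMemCopy,
                        __riscv_vle8_v_u8m2(src + kShortMemCopy, vl), vl);
  }
}
```

Because the copy length is fixed at compile time, there is no loop-carried dependency on `vl`, which is the source of the loop overhead the PR removes.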
Performance
Tested with lzbench on a Spacemit(R) X60 (RISC-V, RVV 1.0); decompression throughput improves by about 15%.
Implementation Details
- Uses `__riscv_vsetvl_e8m2(32)` to configure a 32-byte vector operation, matching the `kShortMemCopy` constant used in the AVX path.
- Takes the second 32-byte segment (marked `SNAPPY_PREDICT_FALSE`) only when `size > kShortMemCopy`, since profiling shows long copies are rare.
- With `e8m1` and `VLEN < 256` (e.g., `VLEN=128`, `vl=16`), two segments per 32 B would be required; using `e8m2` ensures 32 B per op on common configurations.

Verification
- Built with `-march=rv64gcv`.
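As a usage note (an illustrative invocation, not taken from the PR): a typical cross-compile would look like `riscv64-unknown-linux-gnu-g++ -O2 -march=rv64gcv -c snappy.cc`; the `v` in `-march` is what enables the RVV intrinsics used above.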