AVX2 decoder: N-way codeblock interleave for the step-1 VLC/MEL decode #285
Draft
osamu620 wants to merge 2 commits into
Block-level SIMD decodes one codeblock at a time, but the first step of that
decode -- recovering the VLC and MEL segments of the HTJ2K cleanup pass, what
this decoder calls "step 1" -- is a serial dependency chain that leaves the
core's execution units under-utilised (IPC well below peak). The way to fill
those idle slots is to decode several *independent* codeblocks together and let
the out-of-order engine overlap their chains. That interleaving cannot live
inside the per-block SIMD decoder, which by definition sees only one codeblock;
the code that knows about "a row of codeblocks in a subband" sits one level up,
in subband::pull_line. This commit adds the batching seam there, kept
ISA-neutral so that shared decode code is touched in just one place and any
decoder can opt in.
- A cb_decoder_batch_fun32 typedef and an optional decode_cb32_batch function
pointer in codeblock_fun, alongside the existing decode_cb32/decode_cb64 and
chosen by the same runtime CPU-feature dispatch. NULL by default; an ISA
sets it only if it has a batch kernel.
- codeblock::decode_row(blocks, count) gathers a subband row's decodable
32-bit codeblocks into parallel arrays and calls the batch in windows of 64,
marking zero-blocks directly. It falls back to per-block decode() when no
batch is registered (NULL) or the row is 64-bit, reproducing the exact 1-way
behaviour and error handling.
- subband::pull_line decodes the row through decode_row instead of looping
decode() per block.
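The batching seam described in the list above can be sketched as follows. This is a minimal illustration, not the OpenJPH code: demo_block, demo_batch_fun, demo_decode_one, demo_batch_kernel and demo_decode_row are hypothetical names, and the real signatures of decode_cb32_batch and codeblock::decode_row may differ. Only the window of 64, the NULL fallback, the 64-bit fallback and the direct marking of zero-blocks come from the text.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; the real OpenJPH types and signatures may differ.
struct demo_block {
  bool zero_block;  // nothing to decode, only needs marking
  bool is_64bit;    // 64-bit rows have no batch kernel in this sketch
  int  decoded_by;  // 0 = untouched, 1 = 1-way, 2 = batch, 3 = marked zero
};

// Batch entry: decodes `count` blocks, returns false on failure.
using demo_batch_fun = bool (*)(demo_block** blocks, std::size_t count);

// Placeholder for the existing per-block decode().
static bool demo_decode_one(demo_block& b) {
  b.decoded_by = b.zero_block ? 3 : 1;
  return true;
}

// Placeholder batch kernel that an ISA would register.
static bool demo_batch_kernel(demo_block** blocks, std::size_t count) {
  for (std::size_t i = 0; i < count; ++i) blocks[i]->decoded_by = 2;
  return true;
}

// Row decode: use the batch kernel when one is registered, else fall back.
static bool demo_decode_row(std::vector<demo_block>& row, demo_batch_fun batch) {
  constexpr std::size_t window = 64;  // window size stated in the commit message
  bool any_64 = false;
  for (const demo_block& b : row) any_64 = any_64 || b.is_64bit;

  if (batch == nullptr || any_64) {  // reproduce the 1-way behaviour
    for (demo_block& b : row)
      if (!demo_decode_one(b)) return false;
    return true;
  }

  demo_block* pending[window];
  std::size_t n = 0;
  for (demo_block& b : row) {
    if (b.zero_block) { b.decoded_by = 3; continue; }  // mark directly
    pending[n++] = &b;
    if (n == window) {               // flush a full window
      if (!batch(pending, n)) return false;
      n = 0;
    }
  }
  return n == 0 || batch(pending, n);  // flush the remainder
}
```

Passing nullptr as the batch pointer sends every block down demo_decode_one, which is the property that makes the plumbing commit a no-op on its own.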
Keeping this half ISA-neutral means a SIMD kernel plugs in with just its own
batch function and one dispatch line, with no further changes to that shared
code (the AVX2 kernel follows in the next commit; AVX-512/NEON could reuse it
unchanged). By itself it is a no-op -- with decode_cb32_batch NULL every block
still takes the existing 1-way decode() -- so it carries zero risk to current
output and can be reviewed on its own.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The first step of decoding a codeblock -- recovering the VLC and MEL segments of
the HTJ2K cleanup pass ("step 1") -- is a serial per-quad dependency chain and
the single hottest decode kernel (~42% of 8-bit AVX2 decode, execution-bound at
IPC ~2.8). Codeblocks are independent, so interleaving several of them lets the
out-of-order engine overlap the otherwise-serial chains.
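To illustrate the interleaving idea, here is a toy sketch in which each "stream" is a serial chain (every step depends on the previous state) and N such chains advance in lockstep. The update function is an FNV-1a style hash chosen only because it is a simple dependent recurrence; it is not the HTJ2K VLC/MEL logic, and demo_step, demo_chain_1way and demo_chain_nway are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy serial chain step: the new state depends on the old one, standing in
// for the per-quad dependency of step 1 (NOT the real decode logic).
static inline std::uint32_t demo_step(std::uint32_t state, std::uint8_t in) {
  return (state ^ in) * 16777619u;
}

// 1-way reference: a single chain, strictly serial.
static std::uint32_t demo_chain_1way(const std::uint8_t* data, std::size_t len) {
  std::uint32_t s = 2166136261u;
  for (std::size_t i = 0; i < len; ++i) s = demo_step(s, data[i]);
  return s;
}

// N-way: N independent chains of equal length advanced in lockstep. Within
// one outer iteration the N updates do not depend on each other, so an
// out-of-order core is free to overlap them.
template <int N>
static std::array<std::uint32_t, N> demo_chain_nway(
    const std::array<const std::uint8_t*, N>& data, std::size_t len) {
  std::array<std::uint32_t, N> s;
  s.fill(2166136261u);
  for (std::size_t i = 0; i < len; ++i)
    for (int k = 0; k < N; ++k)
      s[k] = demo_step(s[k], data[k][i]);
  return s;
}
```

Each lane of the N-way version produces exactly the value the 1-way version gives for the same input, which mirrors the byte-identical requirement placed on the real kernel.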
Add a templated N-way step-1 kernel (decode_cb_step1_vlc_nway<N>) that decodes
N same-dimension codeblocks in lockstep, driven by a new batched entry
(ojph_decode_codeblock_avx2_batch) that groups a subband row's same-width blocks
N-at-a-time and falls back to the 1-way path for edge/odd/invalid blocks. The
per-block validation (cb_decode_setup) and the second step -- MagSgn decode plus
the optional SgnProp/MagRef passes (cb_decode_finish) -- are factored into shared
helpers so the 1-way and batched entries stay in sync. Wiring: set
decode_cb32_batch on the AVX2 dispatch (it hooks into the ISA-agnostic
codeblock::decode_row added previously).
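The grouping step could look roughly like the sketch below. It assumes that groups are formed from consecutive blocks in the row and that a block is batchable only if it is valid and has the same width as the first block of its group; the actual selection rules in ojph_decode_codeblock_avx2_batch are not given in the text, and demo_cb, demo_plan and demo_group_row are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-block summary used only for planning.
struct demo_cb {
  int  width;
  bool valid;  // passed per-block validation
};

struct demo_plan {
  std::vector<std::size_t> batched;  // start index of each N-wide group
  std::vector<std::size_t> single;   // blocks sent down the 1-way path
};

// Walk the row; take N consecutive valid same-width blocks as one group,
// otherwise hand the current block to the 1-way path and move on by one.
template <int N>
static demo_plan demo_group_row(const std::vector<demo_cb>& row) {
  demo_plan plan;
  std::size_t i = 0;
  while (i < row.size()) {
    bool ok = i + N <= row.size();  // enough blocks left for a full group
    for (int k = 0; ok && k < N; ++k)
      ok = row[i + k].valid && row[i + k].width == row[i].width;
    if (ok) { plan.batched.push_back(i); i += N; }
    else    { plan.single.push_back(i);  i += 1; }
  }
  return plan;
}
```

With N = 2 and a row of widths 64, 64, 64, 64, 32 (all valid), this yields groups starting at indices 0 and 2, and the narrower edge block at index 4 takes the 1-way path.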
The step-1 kernel is pure scalar; its per-stream loops carry an OJPH_UNROLL
hint, a small portable macro added to ojph_arch.h next to OJPH_FORCE_INLINE:
_Pragma("GCC unroll 8") on GCC/Clang, empty on MSVC (which has no such pragma
and would otherwise warn C4068).
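A macro along these lines would match the description above. DEMO_UNROLL is a hypothetical name, and the actual definition of OJPH_UNROLL in ojph_arch.h may be guarded differently; the sketch only shows the pattern of expanding to a _Pragma on GCC/Clang and to nothing elsewhere.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical DEMO_UNROLL; the real OJPH_UNROLL may differ in its guards.
#if defined(__clang__) || defined(__GNUC__)
  #define DEMO_UNROLL _Pragma("GCC unroll 8")
#else
  #define DEMO_UNROLL  // expands to nothing, so no unknown-pragma warning
#endif

// The hint goes immediately before the loop it applies to. It only affects
// code generation; the result is the same with or without it.
static std::uint32_t demo_sum(const std::uint8_t* p, std::size_t n) {
  std::uint32_t acc = 0;
  DEMO_UNROLL
  for (std::size_t i = 0; i < n; ++i) acc += p[i];
  return acc;
}
```

Using _Pragma rather than #pragma is what allows the hint to be wrapped in a macro at all, since a #pragma directive cannot appear inside a macro expansion.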
N defaults to 2 (OJPH_DEC_BATCH_N). On Zen 5 N=2 is the AVX2 sweet spot: the
bulkier AVX2 step-2 runs between batches, so N=4's ~48KB step-1 working set
saturates the 48KB L1d and thrashes; N=2 also keeps the unrolled per-stream
state in registers for the 32-bit (i386) build this file serves.
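A compile-time default of this kind is commonly written as a guarded define, so that a build can override it with a -D flag. The sketch below uses the hypothetical name DEMO_DEC_BATCH_N and assumes this pattern; how OJPH_DEC_BATCH_N is actually declared is not shown in the text.

```cpp
// Hypothetical guarded default; override with e.g. -DDEMO_DEC_BATCH_N=4.
#ifndef DEMO_DEC_BATCH_N
#define DEMO_DEC_BATCH_N 2
#endif

// Reject nonsensical overrides at compile time.
static_assert(DEMO_DEC_BATCH_N >= 1, "batch width must be at least 1");

// Usable as a template argument for an N-way kernel.
constexpr int demo_batch_n = DEMO_DEC_BATCH_N;
```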
Whole-decode, best-of-15 ojph_expand to tmpfs on Zen 5, single-thread, vs a
1-way reference build: 8-bit lossless -10%, 16-bit lossless -8.7%, 8-bit lossy
-8.3% (wall clock -9.5% / -8.6% / -8.6%). Byte-identical to the 1-way path on
8/16-bit lossless, lossy, the 512x8 streaming shape, and a Kakadu multipass
stream (SgnProp/MagRef coverage); 82/82 ctest pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
Dear @osamu620, Thank you for this PR. Kind regards,
Contributor
@osamu620 lossless benchmark round-trip passes!
Follow-up to #276. The first step of decoding a codeblock — recovering the VLC and MEL segments of the HTJ2K cleanup pass (this decoder's "step 1") — is a serial dependency chain and the single hottest decode kernel (~42% of 8-bit AVX2 decode, IPC ~2.8). Codeblocks are independent, so decoding several in lockstep lets the out-of-order engine overlap their chains.
Two commits
- ISA-agnostic plumbing. That interleaving can't live inside the per-block SIMD decoder, which by definition sees only one codeblock; the code that knows about "a row of codeblocks in a subband" sits one level up, in subband::pull_line, so the seam belongs there. This adds an optional decode_cb32_batch pointer (next to decode_cb32/decode_cb64, selected by the same CPU-feature dispatch) and a codeblock::decode_row that gathers a row and either calls the batch or falls back to the exact 1-way decode() (when no batch is registered, or for 64-bit blocks). No functional change on its own: that shared code is touched in just one place, future ISAs plug in with only a kernel + one dispatch line, and current output can't change.
- AVX2 kernel. A templated N-way step-1 kernel (decode_cb_step1_vlc_nway<N>) that decodes N same-width codeblocks in lockstep, driven by ojph_decode_codeblock_avx2_batch, which groups a subband row's same-width blocks N-at-a-time and falls back to 1-way for edge/odd/invalid blocks. Per-block validation and the second step (MagSgn + optional SgnProp/MagRef) are factored into shared helpers so the 1-way and batched paths stay in sync.
Performance
Single-thread, ojph_expand → tmpfs, taskset -c0, best-of-15, vs a 1-way reference build (Zen 5).
Whole-process includes the PPM write, so the decode-kernel gain is larger.
OJPH_DEC_BATCH_N defaults to 2, the measured Zen 5 sweet spot (the bulkier AVX2 step-2 between batches makes N=4's larger step-1 working set thrash L1; N=2 also keeps the unrolled per-stream state in registers for 32-bit builds). It's compile-overridable, and blocks that are not batched cannot regress: they take the unchanged 1-way path.
Correctness
Byte-identical to the 1-way path on 8/16-bit lossless, lossy, the 512×8 streaming shape, and a Kakadu multipass stream (SgnProp/MagRef coverage); 82/82 ctest.
Built clean (0 warnings) locally across GCC on Linux (AVX2-only, AVX-512-enabled, and no-SIMD configs), MSVC VS2022 (x64 + Win32/32-bit), and MinGW UCRT64 GCC 16.
@palemieux — would you mind running this against the lossless benchmark round-trip, as you did on #276? Marking draft until that and CI are green.
🤖 Generated with Claude Code