AVX2 decoder: N-way codeblock interleave for the step-1 VLC/MEL decode by osamu620 · Pull Request #285 · aous72/OpenJPH

osamu620 · 2026-06-13T10:48:04Z

Follow-up to #276. The first step of decoding a codeblock — recovering the VLC and MEL segments of the HTJ2K cleanup pass (this decoder's "step 1") — is a serial dependency chain and the single hottest decode kernel (~42% of 8-bit AVX2 decode, IPC ~2.8). Codeblocks are independent, so decoding several in lockstep lets the out-of-order engine overlap their chains.

Two commits

1. ISA-agnostic plumbing. The interleaving can't live inside the per-block SIMD decoder, which by definition sees only one codeblock; the code that knows about "a row of codeblocks in a subband" sits one level up, in subband::pull_line, so the seam belongs there. This adds an optional decode_cb32_batch pointer (next to decode_cb32/decode_cb64, selected by the same CPU-feature dispatch) and a codeblock::decode_row that gathers a row and either calls the batch or falls back to the exact 1-way decode() (when no batch is registered, or for 64-bit blocks). No functional change on its own: the shared code is touched in just one place, future ISAs plug in with only a kernel and one dispatch line, and current output can't change.
2. AVX2 kernel. A templated N-way step-1 kernel (decode_cb_step1_vlc_nway<N>) that decodes N same-width codeblocks in lockstep, driven by ojph_decode_codeblock_avx2_batch, which groups a subband row's same-width blocks N-at-a-time and falls back to 1-way for edge/odd/invalid blocks. Per-block validation and the second step (MagSgn + optional SgnProp/MagRef) are factored into shared helpers so the 1-way and batched paths stay in sync.
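The grouping rule described for the batched entry (same-width blocks N at a time, everything else on the 1-way path) can be sketched roughly as follows. This is a toy illustration, not the PR's code: `Block`, `decode_one`, `decode_nway` and `decode_row_grouped` are hypothetical names, and the "decoders" only record which path handled each block.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a codeblock; only the fields the grouping needs.
struct Block {
  int  width;       // codeblock width in samples
  bool valid;       // false if per-block validation rejected it
  int  decoded_by;  // 0 = untouched, 1 = 1-way path, N = N-way batch
};

// Placeholder 1-way decoder: records that the per-block path ran.
inline void decode_one(Block& b) { b.decoded_by = 1; }

// Placeholder N-way decoder: records that the batch path ran on all N blocks.
template <int N>
void decode_nway(Block** group) {
  for (int k = 0; k < N; ++k) {
    assert(group[k] != nullptr);
    group[k]->decoded_by = N;
  }
}

// Walk a row; batch runs of N consecutive valid blocks of equal width,
// and send anything else (odd tail, width change, invalid block) to 1-way.
template <int N>
void decode_row_grouped(std::vector<Block>& row) {
  static_assert(N >= 2, "a batch needs at least two blocks");
  std::size_t i = 0;
  while (i < row.size()) {
    bool can_batch = (i + N <= row.size());
    for (int k = 0; can_batch && k < N; ++k)
      can_batch = row[i + k].valid && row[i + k].width == row[i].width;
    if (can_batch) {
      Block* group[N];
      for (int k = 0; k < N; ++k) group[k] = &row[i + k];
      decode_nway<N>(group);
      i += N;
    } else {
      decode_one(row[i]);
      i += 1;
    }
  }
}
```

With N = 2 and widths 64, 64, 64, 32, 32, the first two blocks are batched, the third falls back to 1-way because its neighbour has a different width, and the last two are batched again.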

Performance

Single-thread, ojph_expand → tmpfs, taskset -c0, best-of-15, vs a 1-way reference build (Zen 5):

| Condition | Whole-process wall time |
|---|---|
| 8-bit lossless | −9.5% |
| 16-bit lossless | −8.6% |
| 8-bit lossy | −8.6% |

Whole-process includes the PPM write, so the decode-kernel gain is larger. OJPH_DEC_BATCH_N defaults to 2 — the measured Zen 5 sweet spot (the bulkier AVX2 step-2 between batches makes N=4's larger step-1 working set thrash L1; N=2 also keeps the unrolled per-stream state in registers for 32-bit builds). It's compile-overridable, and it can never regress: anything not batched takes the unchanged 1-way path.
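For readers unfamiliar with compile-time overrides, one common way to express "defaults to 2, compile-overridable" is an `#ifndef` guard. The PR text does not show how OJPH_DEC_BATCH_N is actually defined, so the form below is an assumption, and `dec_batch_width` is a hypothetical helper added only so the value can be inspected.

```cpp
// Assumed pattern: honour a value supplied by the build
// (for example -DOJPH_DEC_BATCH_N=4 on GCC/Clang), otherwise default to 2.
#ifndef OJPH_DEC_BATCH_N
#define OJPH_DEC_BATCH_N 2
#endif

// Reject nonsensical overrides at compile time rather than at run time.
static_assert(OJPH_DEC_BATCH_N >= 1, "batch width must be at least 1");

// Hypothetical helper exposing the chosen batch width.
constexpr int dec_batch_width() { return OJPH_DEC_BATCH_N; }
```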

Correctness

Byte-identical to the 1-way path on 8/16-bit lossless, lossy, the 512×8 streaming shape, and a Kakadu multipass stream (SgnProp/MagRef coverage); 82/82 ctest.

Built clean (0 warnings) locally across GCC on Linux (AVX2-only, AVX-512-enabled, and no-SIMD configs), MSVC VS2022 (x64 + Win32/32-bit), and MinGW UCRT64 GCC 16.

@palemieux — would you mind running this against the lossless benchmark round-trip, as you did on #276? Marking draft until that and CI are green.

🤖 Generated with Claude Code

Block-level SIMD decodes one codeblock at a time, but the first step of that decode -- recovering the VLC and MEL segments of the HTJ2K cleanup pass, what this decoder calls "step 1" -- is a serial dependency chain that leaves the core's execution units under-utilised (IPC well below peak). The way to fill those idle slots is to decode several *independent* codeblocks together and let the out-of-order engine overlap their chains.

That interleaving cannot live inside the per-block SIMD decoder, which by definition sees only one codeblock; the code that knows about "a row of codeblocks in a subband" sits one level up, in subband::pull_line. This commit adds the batching seam there, kept ISA-neutral so that shared decode code is touched in just one place and any decoder can opt in.

- A cb_decoder_batch_fun32 typedef and an optional decode_cb32_batch function pointer in codeblock_fun, alongside the existing decode_cb32/decode_cb64 and chosen by the same runtime CPU-feature dispatch. NULL by default; an ISA sets it only if it has a batch kernel.
- codeblock::decode_row(blocks, count) gathers a subband row's decodable 32-bit codeblocks into parallel arrays and calls the batch in windows of 64, marking zero-blocks directly. It falls back to per-block decode() when no batch is registered (NULL) or the row is 64-bit, reproducing the exact 1-way behaviour and error handling.
- subband::pull_line decodes the row through decode_row instead of looping decode() per block.

Keeping this half ISA-neutral means a SIMD kernel plugs in with just its own batch function and one dispatch line, with no further changes to that shared code (the AVX2 kernel follows in the next commit; AVX-512/NEON could reuse it unchanged). By itself it is a no-op -- with decode_cb32_batch NULL every block still takes the existing 1-way decode() -- so it carries zero risk to current output and can be reviewed on its own.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
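The seam this commit message describes (an optional batch pointer, a window of 64, direct marking of zero-blocks, and a per-block fallback when the pointer is NULL) might look roughly like the toy below. It is a sketch under assumptions: `ToyBlock`, `ToyDispatch`, `decode_single`, `toy_batch_kernel` and `batch_calls` are all hypothetical, and the real decode_row signature and data layout are not shown in the PR text.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical codeblock record for the sketch.
struct ToyBlock {
  bool zero_block;  // nothing to decode; output is all zeros
  int  decoded_by;  // 0 = untouched, 1 = per-block, 2 = batch, 3 = marked zero
};

using batch_fun = void (*)(ToyBlock** blocks, std::size_t count);

// Dispatch table: the batch pointer stays null unless an ISA registers one.
struct ToyDispatch {
  batch_fun decode_batch = nullptr;
};

static int batch_calls = 0;  // counts batch invocations, for illustration only

// Placeholder per-block decoder.
inline void decode_single(ToyBlock& b) { b.decoded_by = b.zero_block ? 3 : 1; }

// Placeholder batch kernel: records that the batch path handled each block.
inline void toy_batch_kernel(ToyBlock** blocks, std::size_t count) {
  ++batch_calls;
  for (std::size_t i = 0; i < count; ++i) blocks[i]->decoded_by = 2;
}

inline void decode_row(const ToyDispatch& d, ToyBlock* row, std::size_t count) {
  assert(row != nullptr || count == 0);
  if (d.decode_batch == nullptr) {  // no batch registered: plain 1-way loop
    for (std::size_t i = 0; i < count; ++i) decode_single(row[i]);
    return;
  }
  constexpr std::size_t kWindow = 64;  // window size quoted in the commit message
  ToyBlock* window[kWindow];
  std::size_t used = 0;
  for (std::size_t i = 0; i < count; ++i) {
    if (row[i].zero_block) { row[i].decoded_by = 3; continue; }  // marked directly
    window[used++] = &row[i];
    if (used == kWindow) { d.decode_batch(window, used); used = 0; }
  }
  if (used > 0) d.decode_batch(window, used);  // flush the partial window
}
```

Leaving `decode_batch` null exercises only the first branch, which is the sense in which the plumbing commit is a no-op until a kernel is registered.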

The first step of decoding a codeblock -- recovering the VLC and MEL segments of the HTJ2K cleanup pass ("step 1") -- is a serial per-quad dependency chain and the single hottest decode kernel (~42% of 8-bit AVX2 decode, execution-bound at IPC ~2.8). Codeblocks are independent, so interleaving several of them lets the out-of-order engine overlap the otherwise-serial chains.

Add a templated N-way step-1 kernel (decode_cb_step1_vlc_nway<N>) that decodes N same-dimension codeblocks in lockstep, driven by a new batched entry (ojph_decode_codeblock_avx2_batch) that groups a subband row's same-width blocks N-at-a-time and falls back to the 1-way path for edge/odd/invalid blocks. The per-block validation (cb_decode_setup) and the second step -- MagSgn decode plus the optional SgnProp/MagRef passes (cb_decode_finish) -- are factored into shared helpers so the 1-way and batched entries stay in sync.

Wiring: set decode_cb32_batch on the AVX2 dispatch (it hooks into the ISA-agnostic codeblock::decode_row added previously).

The step-1 kernel is pure scalar; its per-stream loops carry an OJPH_UNROLL hint, a small portable macro added to ojph_arch.h next to OJPH_FORCE_INLINE: _Pragma("GCC unroll 8") on GCC/Clang, empty on MSVC (which has no such pragma and would otherwise warn C4068).

N defaults to 2 (OJPH_DEC_BATCH_N). On Zen 5 N=2 is the AVX2 sweet spot: the bulkier AVX2 step-2 runs between batches, so N=4's ~48KB step-1 working set saturates the 48KB L1d and thrashes; N=2 also keeps the unrolled per-stream state in registers for the 32-bit (i386) build this file serves.

Whole-decode, best-of-15 ojph_expand to tmpfs on Zen 5, single-thread, vs a 1-way reference build: 8-bit lossless -10%, 16-bit lossless -8.7%, 8-bit lossy -8.3% (wall clock -9.5% / -8.6% / -8.6%).

Byte-identical to the 1-way path on 8/16-bit lossless, lossy, the 512x8 streaming shape, and a Kakadu multipass stream (SgnProp/MagRef coverage); 82/82 ctest pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
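To make the lockstep idea concrete, here is a minimal sketch of an N-way scalar loop over independent serial chains, with an unroll hint shaped like the one the commit message describes. It is not the decoder: `chain_step` is an arbitrary stand-in for a per-quad state update, and `TOY_UNROLL`, `run_1way` and `run_nway` are hypothetical names. Only the pragma string and the GCC/Clang versus MSVC split come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Unroll hint modelled on the description of OJPH_UNROLL: a pragma on
// GCC/Clang, nothing elsewhere so MSVC does not warn about an unknown pragma.
#if defined(__GNUC__) || defined(__clang__)
  #define TOY_UNROLL _Pragma("GCC unroll 8")
#else
  #define TOY_UNROLL
#endif

// One link of a serial chain: the next state depends on the previous state,
// so a single stream cannot be parallelised across iterations.
inline std::uint32_t chain_step(std::uint32_t state, std::uint8_t in) {
  return (state * 2654435761u) ^ (state >> 15) ^ in;
}

// 1-way reference: one stream, one chain.
inline std::uint32_t run_1way(const std::uint8_t* data, std::size_t len) {
  std::uint32_t s = 1;
  for (std::size_t i = 0; i < len; ++i) s = chain_step(s, data[i]);
  return s;
}

// N-way lockstep: each outer iteration advances all N chains by one step.
// The chains do not depend on each other, so an out-of-order core is free
// to overlap their latencies.
template <int N>
void run_nway(const std::uint8_t* const* data, std::size_t len,
              std::uint32_t* out) {
  assert(data != nullptr && out != nullptr);
  std::uint32_t s[N];
  for (int k = 0; k < N; ++k) s[k] = 1;
  for (std::size_t i = 0; i < len; ++i) {
    TOY_UNROLL
    for (int k = 0; k < N; ++k) s[k] = chain_step(s[k], data[k][i]);
  }
  for (int k = 0; k < N; ++k) out[k] = s[k];
}
```

Each stream's result must match the 1-way reference exactly, which is the toy analogue of the byte-identical check reported above. Both streams share one length here, mirroring the kernel's requirement that batched blocks have the same dimensions.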

aous72 · 2026-06-13T11:51:05Z

Dear @osamu620,

Thank you for this PR.
Perhaps, we can talk.

Kind regards,
Aous.

palemieux · 2026-06-19T21:31:45Z

@osamu620 lossless benchmark round-trip passes!

osamu620 and others added 2 commits June 13, 2026 18:41

osamu620 wants to merge 2 commits into aous72:master from osamu620:feature/avx2-batch-decode-pr
