
AVX2 decoder: N-way codeblock interleave for the step-1 VLC/MEL decode #285

Draft
osamu620 wants to merge 2 commits into aous72:master from osamu620:feature/avx2-batch-decode-pr


@osamu620

Contributor

Follow-up to #276. The first step of decoding a codeblock — recovering the VLC and MEL segments of the HTJ2K cleanup pass (this decoder's "step 1") — is a serial dependency chain and the single hottest decode kernel (~42% of 8-bit AVX2 decode, IPC ~2.8). Codeblocks are independent, so decoding several in lockstep lets the out-of-order engine overlap their chains.
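
To illustrate why lockstep decoding helps, here is a minimal sketch in which a toy state update stands in for the real VLC/MEL step. The names `chain_step`, `run_chain`, and `run_chains_2way` are hypothetical and are not OpenJPH functions. In the 2-way loop the two updates never depend on each other, so an out-of-order core is free to overlap them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy serial update: each new state depends on the previous one.
// This only stands in for the per-quad decode state; it is NOT the real decoder.
inline uint32_t chain_step(uint32_t state, uint8_t byte) {
  return (state * 1664525u + 1013904223u) ^ byte;
}

// 1-way: a single dependency chain, one update must finish before the next starts.
uint32_t run_chain(const uint8_t* data, std::size_t len) {
  assert(data != nullptr || len == 0);
  uint32_t s = 0;
  for (std::size_t i = 0; i < len; ++i) s = chain_step(s, data[i]);
  return s;
}

// 2-way lockstep: two independent chains advance in the same loop iteration.
// Both inputs share one length, mirroring the "same-width" batching requirement.
void run_chains_2way(const uint8_t* a, const uint8_t* b, std::size_t len,
                     uint32_t out[2]) {
  assert((a != nullptr && b != nullptr) || len == 0);
  uint32_t sa = 0, sb = 0;
  for (std::size_t i = 0; i < len; ++i) {
    sa = chain_step(sa, a[i]);  // chain A
    sb = chain_step(sb, b[i]);  // chain B, independent of chain A
  }
  out[0] = sa;
  out[1] = sb;
}
```

The interleaved version produces exactly the values the two separate runs would, which is the property the byte-identical checks in this PR rely on.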

Two commits

  1. ISA-agnostic plumbing. The interleaving can't live inside the per-block SIMD decoder, which by definition sees only one codeblock. The code that knows about "a row of codeblocks in a subband" sits one level up, in subband::pull_line, so the seam belongs there. This commit adds an optional decode_cb32_batch pointer (next to decode_cb32/decode_cb64, selected by the same CPU-feature dispatch) and a codeblock::decode_row that gathers a row and either calls the batch function or falls back to the exact 1-way decode() (when no batch is registered, or for 64-bit blocks). On its own this is no functional change: the shared code is touched in just one place, future ISAs plug in with only a kernel plus one dispatch line, and current output can't change.

  2. AVX2 kernel. A templated N-way step-1 kernel (decode_cb_step1_vlc_nway<N>) that decodes N same-width codeblocks in lockstep, driven by ojph_decode_codeblock_avx2_batch, which groups a subband row's same-width blocks N-at-a-time and falls back to 1-way for edge/odd/invalid blocks. Per-block validation and the second step (MagSgn + optional SgnProp/MagRef) are factored into shared helpers so the 1-way and batched paths stay in sync.
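
The grouping rule in item 2 can be sketched as follows. This is an illustration under assumptions, not the code in ojph_decode_codeblock_avx2_batch: `block_info`, `batch_plan`, and `plan_row` are hypothetical names, and the sketch assumes only adjacent blocks are grouped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-block summary: width plus the outcome of per-block validation.
struct block_info {
  int width;
  bool valid;
};

// One planned unit of work: `count` is N for a batched group, 1 for the 1-way path.
struct batch_plan {
  std::size_t first;
  std::size_t count;
};

// Walk a subband row and group N adjacent, valid, same-width blocks at a time.
// Anything that cannot fill a group (row edge, odd block out, width change,
// failed validation) is emitted as a single block for the 1-way path.
template <std::size_t N>
std::vector<batch_plan> plan_row(const std::vector<block_info>& row) {
  static_assert(N >= 1, "batch width must be at least 1");
  std::vector<batch_plan> plan;
  std::size_t i = 0;
  while (i < row.size()) {
    bool can_batch = (i + N <= row.size());
    for (std::size_t k = 0; can_batch && k < N; ++k)
      can_batch = row[i + k].valid && row[i + k].width == row[i].width;
    const std::size_t count = can_batch ? N : 1;
    plan.push_back({i, count});
    i += count;
  }
  return plan;
}
```

For a row with widths 64, 64, 64, 32 and N = 2, the plan is one batched pair followed by two single blocks.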

Performance

Single-thread, ojph_expand → tmpfs, taskset -c0, best-of-15, vs a 1-way reference build (Zen 5):

| Condition | Whole-process wall time |
|---|---|
| 8-bit lossless | −9.5% |
| 16-bit lossless | −8.6% |
| 8-bit lossy | −8.6% |

The whole-process figure includes the PPM write, so the gain in the decode kernel alone is larger. OJPH_DEC_BATCH_N defaults to 2, the measured Zen 5 sweet spot (the bulkier AVX2 step-2 between batches makes N=4's larger step-1 working set thrash L1; N=2 also keeps the unrolled per-stream state in registers for 32-bit builds). It can be overridden at compile time, and the fallback cannot regress: anything not batched takes the unchanged 1-way path.
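
A compile-time default of this kind is commonly written as a guarded macro. The sketch below assumes that pattern; only the name OJPH_DEC_BATCH_N and its default of 2 come from this PR, while `dec_batch_n` and `fills_batch` are hypothetical helpers added for illustration.

```cpp
#include <cassert>

// Assumed override pattern: a build could pass e.g. -DOJPH_DEC_BATCH_N=4;
// otherwise the default of 2 applies.
#ifndef OJPH_DEC_BATCH_N
#define OJPH_DEC_BATCH_N 2
#endif

static_assert(OJPH_DEC_BATCH_N >= 1, "batch width must be at least 1");

// Hypothetical accessor so the rest of the code does not touch the macro directly.
constexpr int dec_batch_n() { return OJPH_DEC_BATCH_N; }

// True when `available` same-width blocks are enough to fill one batch;
// otherwise the caller would take the 1-way path.
inline bool fills_batch(int available) {
  assert(available >= 0);
  return available >= dec_batch_n();
}
```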

Correctness

Byte-identical to the 1-way path on 8/16-bit lossless, lossy, the 512×8 streaming shape, and a Kakadu multipass stream (SgnProp/MagRef coverage); 82/82 ctest.

Built clean (0 warnings) locally across GCC on Linux (AVX2-only, AVX-512-enabled, and no-SIMD configs), MSVC VS2022 (x64 + Win32/32-bit), and MinGW UCRT64 GCC 16.


@palemieux — would you mind running this against the lossless benchmark round-trip, as you did on #276? Marking draft until that and CI are green.

🤖 Generated with Claude Code

osamu620 and others added 2 commits June 13, 2026 18:41
Block-level SIMD decodes one codeblock at a time, but the first step of that
decode -- recovering the VLC and MEL segments of the HTJ2K cleanup pass, what
this decoder calls "step 1" -- is a serial dependency chain that leaves the
core's execution units under-utilised (IPC well below peak). The way to fill
those idle slots is to decode several *independent* codeblocks together and let
the out-of-order engine overlap their chains. That interleaving cannot live
inside the per-block SIMD decoder, which by definition sees only one codeblock;
the code that knows about "a row of codeblocks in a subband" sits one level up,
in subband::pull_line. This commit adds the batching seam there, kept
ISA-neutral so that shared decode code is touched in just one place and any
decoder can opt in.

  - A cb_decoder_batch_fun32 typedef and an optional decode_cb32_batch function
    pointer in codeblock_fun, alongside the existing decode_cb32/decode_cb64 and
    chosen by the same runtime CPU-feature dispatch. NULL by default; an ISA
    sets it only if it has a batch kernel.
  - codeblock::decode_row(blocks, count) gathers a subband row's decodable
    32-bit codeblocks into parallel arrays and calls the batch in windows of 64,
    marking zero-blocks directly. It falls back to per-block decode() when no
    batch is registered (NULL) or the row is 64-bit, reproducing the exact 1-way
    behaviour and error handling.
  - subband::pull_line decodes the row through decode_row instead of looping
    decode() per block.
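
The three points above can be sketched with toy types. Everything in this block is a stand-in: the real `codeblock_fun`, `cb_decoder_batch_fun32`, and `codeblock::decode_row` have different signatures, and the toy "decode" just doubles a payload. The sketch only shows the control flow: an optional batch pointer, windows of 64, direct handling of zero-blocks, and the per-block fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins; none of these are the real OpenJPH types or signatures.
struct toy_block {
  int payload;   // pretend coded data
  bool is_zero;  // block with nothing to decode
  int decoded;   // pretend output
};

using decode_fun = bool (*)(toy_block&);
using decode_batch_fun = bool (*)(toy_block** blocks, std::size_t count);

struct toy_dispatch {
  decode_fun decode_one = nullptr;
  decode_batch_fun decode_batch = nullptr;  // optional: null without a batch kernel
};

int g_batch_calls = 0;  // counts batch invocations, for illustration only

bool toy_decode_one(toy_block& b) {
  b.decoded = b.is_zero ? 0 : b.payload * 2;
  return true;
}

bool toy_decode_batch(toy_block** blocks, std::size_t count) {
  ++g_batch_calls;
  for (std::size_t i = 0; i < count; ++i)
    blocks[i]->decoded = blocks[i]->payload * 2;
  return true;
}

constexpr std::size_t kWindow = 64;  // window size stated in the commit message

bool decode_row(const toy_dispatch& d, std::vector<toy_block>& row, bool is_64bit) {
  assert(d.decode_one != nullptr);
  // Fallback: no batch kernel registered, or a 64-bit row. Same as the old loop.
  if (d.decode_batch == nullptr || is_64bit) {
    for (toy_block& b : row)
      if (!d.decode_one(b)) return false;
    return true;
  }
  toy_block* window[kWindow];
  std::size_t n = 0;
  for (toy_block& b : row) {
    if (b.is_zero) { b.decoded = 0; continue; }  // zero-blocks are marked directly
    window[n++] = &b;
    if (n == kWindow) {  // flush a full window
      if (!d.decode_batch(window, n)) return false;
      n = 0;
    }
  }
  return n == 0 || d.decode_batch(window, n);  // flush the partial tail, if any
}
```

With 129 decodable blocks in a row, the batch function is called three times (64, 64, and 1 blocks), and leaving `decode_batch` null produces the same output through the per-block path.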

Keeping this half ISA-neutral means a SIMD kernel plugs in with just its own
batch function and one dispatch line, with no further changes to that shared
code (the AVX2 kernel follows in the next commit; AVX-512/NEON could reuse it
unchanged). By itself it is a no-op -- with decode_cb32_batch NULL every block
still takes the existing 1-way decode() -- so it carries zero risk to current
output and can be reviewed on its own.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The first step of decoding a codeblock -- recovering the VLC and MEL segments of
the HTJ2K cleanup pass ("step 1") -- is a serial per-quad dependency chain and
the single hottest decode kernel (~42% of 8-bit AVX2 decode, execution-bound at
IPC ~2.8). Codeblocks are independent, so interleaving several of them lets the
out-of-order engine overlap the otherwise-serial chains.

Add a templated N-way step-1 kernel (decode_cb_step1_vlc_nway<N>) that decodes
N same-dimension codeblocks in lockstep, driven by a new batched entry
(ojph_decode_codeblock_avx2_batch) that groups a subband row's same-width blocks
N-at-a-time and falls back to the 1-way path for edge/odd/invalid blocks. The
per-block validation (cb_decode_setup) and the second step -- MagSgn decode plus
the optional SgnProp/MagRef passes (cb_decode_finish) -- are factored into shared
helpers so the 1-way and batched entries stay in sync. Wiring: set
decode_cb32_batch on the AVX2 dispatch (it hooks into the ISA-agnostic
codeblock::decode_row added previously).
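
One way to picture how shared helpers keep the two entries in sync is the sketch below. It is a toy under stated assumptions: `toy_setup` and `toy_finish` only play the roles that cb_decode_setup and cb_decode_finish have in this commit, their bodies are invented, and the step-1 arithmetic is a placeholder for the real VLC/MEL decode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy codeblock; not the real OpenJPH structure.
struct toy_cb {
  const uint8_t* data;
  std::size_t len;
  uint32_t state;
  uint32_t result;
};

// Shared per-block validation (the role of cb_decode_setup; body is invented).
inline bool toy_setup(toy_cb& cb) {
  if (cb.data == nullptr || cb.len == 0) return false;
  cb.state = 0;
  return true;
}

// Shared second step (the role of cb_decode_finish; body is invented).
inline void toy_finish(toy_cb& cb) { cb.result = cb.state ^ 0xA5A5A5A5u; }

// N-way step 1: the N streams advance together, one element per stream per pass.
template <int N>
void toy_step1_nway(toy_cb** cbs, std::size_t len) {
  for (std::size_t i = 0; i < len; ++i)
    for (int s = 0; s < N; ++s)
      cbs[s]->state = cbs[s]->state * 31u + cbs[s]->data[i];
}

// 1-way entry: setup, step 1 with N = 1, finish.
bool toy_decode_1way(toy_cb& cb) {
  if (!toy_setup(cb)) return false;
  toy_cb* one[1] = {&cb};
  toy_step1_nway<1>(one, cb.len);
  toy_finish(cb);
  return true;
}

// Batched entry: the same helpers around the N-way kernel. If any block fails
// validation or the lengths differ, every block takes the 1-way entry instead.
template <int N>
bool toy_decode_batch(toy_cb** cbs) {
  assert(cbs != nullptr);
  bool lockstep = true;
  for (int s = 0; s < N; ++s)
    lockstep = lockstep && toy_setup(*cbs[s]) && cbs[s]->len == cbs[0]->len;
  if (!lockstep) {
    bool ok = true;
    for (int s = 0; s < N; ++s) ok = toy_decode_1way(*cbs[s]) && ok;
    return ok;
  }
  toy_step1_nway<N>(cbs, cbs[0]->len);
  for (int s = 0; s < N; ++s) toy_finish(*cbs[s]);
  return true;
}
```

Because both entries call the same setup and finish code, a change to validation or to the second step cannot make the batched and 1-way outputs diverge.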

The step-1 kernel is pure scalar; its per-stream loops carry an OJPH_UNROLL
hint, a small portable macro added to ojph_arch.h next to OJPH_FORCE_INLINE:
_Pragma("GCC unroll 8") on GCC/Clang, empty on MSVC (which has no such pragma
and would otherwise warn C4068).
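
A macro of the kind described could look like the sketch below. It deliberately uses a different name, `SKETCH_UNROLL`, because the exact definition in ojph_arch.h is not shown here; `sum_streams` is a hypothetical loop used only to show where the hint goes.

```cpp
#include <cassert>

// Unroll hint: a GCC loop pragma where the compiler advertises GNU
// compatibility, and nothing elsewhere, so MSVC never sees an unknown pragma.
#if defined(__GNUC__) || defined(__clang__)
  #define SKETCH_UNROLL _Pragma("GCC unroll 8")
#else
  #define SKETCH_UNROLL
#endif

// Hypothetical per-stream loop. The hint must sit directly before the loop.
template <int N>
int sum_streams(const int (&state)[N]) {
  int total = 0;
  SKETCH_UNROLL
  for (int s = 0; s < N; ++s) total += state[s];
  return total;
}
```

The hint only affects code generation; the loop computes the same result whether or not the compiler honours it.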

N defaults to 2 (OJPH_DEC_BATCH_N). On Zen 5 N=2 is the AVX2 sweet spot: the
bulkier AVX2 step-2 runs between batches, so N=4's ~48KB step-1 working set
saturates the 48KB L1d and thrashes; N=2 also keeps the unrolled per-stream
state in registers for the 32-bit (i386) build this file serves.

Whole-decode, best-of-15 ojph_expand to tmpfs on Zen 5, single-thread, vs a
1-way reference build: 8-bit lossless -10%, 16-bit lossless -8.7%, 8-bit lossy
-8.3% (wall clock -9.5% / -8.6% / -8.6%). Byte-identical to the 1-way path on
8/16-bit lossless, lossy, the 512x8 streaming shape, and a Kakadu multipass
stream (SgnProp/MagRef coverage); 82/82 ctest pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aous72

aous72 commented Jun 13, 2026

Owner

Dear @osamu620,

Thank you for this PR.
Perhaps, we can talk.

Kind regards,
Aous.

@palemieux

Contributor

@osamu620 lossless benchmark round-trip passes!
