Implement shared exponent helper for cluster encoding #1052

Merged: zeux merged 10 commits into master from mlc-dxr, May 13, 2026
Conversation

@zeux
Owner

@zeux zeux commented May 12, 2026

When compressing cluster positions, it is convenient to use shared
exponent encoding where all cluster vertices are quantized to the same
integer grid. This exponent needs to be shared across all transitively
connected clusters to avoid gaps in the mesh.

The resulting integers can be stored directly as 24-bit values (9 bytes per
vertex), as fixed-bit-count offsets from an anchor (if max_bits is 16,
this adds up to 6 bytes per vertex plus a per-cluster or per-mesh
anchor), or as variable-bit-count offsets from a per-cluster anchor.

The latter format is part of an upcoming DXR2 specification, exposed as
D3D12_VERTEX_FORMAT_COMPRESSED1 (for which max_bits would be set to 16).

In addition to meshopt_computePositionExponent, which computes the
exponent that ensures anchors and deltas fit within the specified limits, this change
also adds an example (encodeMeshletsDXR) that uses the meshlet encoder
to encode topology and the Compressed1 format to encode positions.

This contribution is sponsored by Valve.

zeux added 6 commits May 12, 2026 10:38
Because we are working with a 24-bit signed grid, and floating point
values are using a 24-bit unsigned mantissa (implied), we have to
truncate a bit. However, since the input values are rounded with an
uncertain direction, this could lead to the anchor overflowing; in this
case we need to bump the exponent range by 1.

1 - FLT_EPSILON / 2 is the frexp equivalent of an all-ones (1.1111...)
mantissa, since frexp remaps the mantissa to the [0.5, 1) range; this
would be cleaner with direct bit manipulation.
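
A minimal illustration of the frexp identity above (the helper name is mine, not library code):

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* The largest float strictly below 1.0 has all 24 mantissa bits set;
 * in frexp's [0.5, 1) normalization that value is exactly
 * 1 - FLT_EPSILON / 2 (FLT_EPSILON is 2^-23, so this is 1 - 2^-24).
 * Comparing the frexp mantissa against this threshold detects an
 * all-ones mantissa, which is the case that can overflow on rounding. */
static int mantissa_all_ones(float x)
{
    int e;
    float m = frexpf(fabsf(x), &e);
    return m >= 1.f - FLT_EPSILON / 2;
}
```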
For best performance and good size for raytracing friendly geometry, we
can encode meshlet topology (no vertex references) using meshopt codec,
and meshlet vertex positions using DXR2 Compressed1 format, where each
cluster has its own anchor and a shared exponent which prevents gaps.

This encoding method uses meshopt_computeClusterPositionExponent to
establish the exponent that fits all clusters.

Here we pack each cluster independently with a DXR2-compatible header
(plus one word for our own cluster data). The decoding process is
meshopt_decodeMeshlet + memcpy. Note that for optimal post-deflate size,
a different data layout may be preferable which might require a little
more memcpy/reshuffling to decode.
This acts as a benchmark and as an example of how to destructure the
encoded data; it runs at ~12-13 GB/s on one Zen4 core.
This is consistent with how DGF stores the data and is also optimal for
RT: since we do not preserve original indices here, there is no need to
worry about attribute seams and the like, so we want a deduplicated
position stream.
Even if the range at a given scale fits in the given bit count, the
caller independently quantizes the endpoints to the new exponent, which
can push the quantized range such that the delta does not fit in
max_bits.

What we want is a check similar to maxc_off: increase the exponent by 1
if we are past the quantization boundary. However, here the boundary is
defined by the input bit count restriction, not the floating point
precision, so we need to synthesize it from the input bits instead.

Additionally, max_bits=1 is not a well-formed constraint in general;
restricting the deltas to 0 requires picking an impossibly large
exponent in practice that collapses the entire cluster into a point,
which does not seem useful. The bit count may be 1 along a given axis,
but it must be >1 on at least one axis, so restrict it to 2 to avoid misuse.
@zeux
Owner Author

zeux commented May 12, 2026

Note that the example code stores the individual cluster bits sequentially for simplicity: each cluster stores a 16-byte header (including DXR2 Compressed1 header), followed by packed positions, followed by encoded topology.

For the best post-compression size with deflate/Zstd, you would want to store all headers, followed by all positions, followed by all topology, as this maximizes reuse. An even better option would store the individual axis bits consecutively (all X bits => all Y bits => all Z bits for all clusters), but this would require bit repacking, as DXR2 doesn't allow splitting the axes this way. As it stands, this format can be decoded into a runtime-friendly representation for direct DXR2 consumption with meshopt_decodeMeshlet + memcpy per cluster, running at a cumulative ~12-13 GB/sec decode on one core; buddha.obj takes 0.45 msec to decode.

Before deflate, the results are smaller than DGFS when a matching exponent is used for encoding. With the existing layout, the results are a little larger than DGFS post-deflate, as it uses a fully deinterleaved layout; a partially interleaved layout (without bit interleaving) ends up a little smaller than DGFS. A fully deinterleaved layout is noticeably smaller than DGFS, but it requires custom SIMD bit deinterleaving code to decode quickly and I don't feel like writing this now :)

In general the demo format (encodeMeshletsDXR) is not intended as some sort of "standard", and instead works as a sketch that can be adapted to the needs of each application. In some cases, DXR bit width restriction (16 bits per axis) is too limiting anyhow - Nanite uses up to 21 bits per axis because hierarchical clustering imposes more severe limits to avoid gaps - in which case this code can be trivially adapted for a larger bit count or a different anchoring strategy.

@zeux
Owner Author

zeux commented May 12, 2026

Some data for a few meshes, using exponent -14 (to match quality). The size is the raw number reported by the demo program; the Zstd size is computed from the raw stream as well as a partially-deinterleaved format (headers, positions, topology - this is still as fast to decode as raw). Not including the fully bit deinterleaved data here as it would be slower to decode without more specialized processing.

| Mesh | size (MB) | bytes/triangle | Δ vs DGFS | Δ vs DGFS (Zstd) | Δ vs DGFS (Zstd, 3 streams) |
| --- | --- | --- | --- | --- | --- |
| armadillo | 0.75 | 3.70 | −28.5% | −8.5% | −11.0% |
| buddha | 3.08 | 2.97 | −13.4% | +4.8% | +0.5% |
| bunny | 0.26 | 3.93 | −22.4% | −5.1% | −7.4% |
| dragon | 2.55 | 3.06 | −19.9% | +5.8% | +1.5% |
| pig | 2.85 | 3.68 | −26.7% | −5.2% | −8.4% |
| roadbike | 4.24 | 2.65 | −23.1% | −1.4% | −4.7% |

DGFS uses sequential byte-aligned deltas to store positional components, which takes more space but compresses better in certain cases. The demo format is not AMD dependent and does not require "transcoding" in a meaningful sense; it only requires decoding topology into 3 bytes per triangle for the input to be directly compatible with DXR2 APIs.

zeux added 3 commits May 13, 2026 07:20
…nExponent

The name is a little bit too long; after evaluating two alternatives,
computeClusterExponent and computePositionExponent, the latter won: the
encoding here is position specific but could be used outside of clusters
in theory too, and the input to the function is either a mesh AABB or a
cluster AABB depending on the use case.

Also clean up header documentation.
We test a variety of configurations including some corner cases; most of
these are synthetic. For each configuration we also check the general
expected properties: endpoints should be representable as 24-bit signed
integers, and endpoints minus anchor should be representable as the
unsigned K-bit value (max_bits = K).
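
The two invariants can be written down as trivial predicates (a sketch, not the library's actual test code):

```c
#include <assert.h>

/* Sketch of the invariants the tests described above check:
 * endpoints fit the 24-bit signed grid, and endpoint-minus-anchor
 * deltas fit an unsigned K-bit value (K = max_bits). */
static int fits_signed24(int v)
{
    return v >= -(1 << 23) && v < (1 << 23);
}

static int fits_unsigned_bits(int delta, int max_bits)
{
    return delta >= 0 && delta < (1 << max_bits);
}
```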
Add documentation for cluster position quantization based on the new
helper function.
@zeux
Owner Author

zeux commented May 13, 2026

Since the deltas are 16-bit there's also a variety of other schemes possible here that aren't DXR2 specific and don't rely on Compressed1 runtime layout. For example, you could store the deltas from anchors as 16-bit per axis, and compress the result using meshopt_encodeVertexBuffer, while still storing the topology using encoded meshlet format. This decodes at ~6 GB/s (vs 12-13 GB/s in the demo; buddha.obj takes ~1.4ms) with the following size characteristics:

| Mesh | size (MB) | bytes/triangle | Δ vs DGFS | Δ vs DGFS (Zstd) |
| --- | --- | --- | --- | --- |
| armadillo | 0.71 | 3.50 | −32.3% | −18.9% |
| buddha | 2.98 | 2.87 | −16.2% | −15.5% |
| bunny | 0.25 | 3.73 | −26.5% | −17.7% |
| dragon | 2.43 | 2.93 | −23.5% | −16.3% |
| pig | 2.71 | 3.49 | −30.4% | −15.1% |
| roadbike | 3.98 | 2.49 | −27.7% | −23.1% |

In this case the cluster encoding is not independent, as encodeVertexBuffer needs to compress across clusters for maximum efficiency; but this is a nice encoding if you have a whole page of clusters to encode, and need something that's rasterization friendly. Of course this trades off some decoding speed, and decompresses to a larger representation compared to Compressed1 layout.

(note, data above is with each vertex padded to 8 bytes to conform to encodeVertexBuffer alignment restrictions; you can also encode packed 6 byte per vertex data with stride 12, which compresses a little worse but decompresses to less data - lots of viable options to consider!)
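
A sketch of the per-vertex packing described above, assuming the shared-exponent quantized positions and per-cluster anchor are already computed (names are illustrative, and meshopt_encodeVertexBuffer itself is not called here):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustration only: each vertex stores three 16-bit offsets from the
 * cluster anchor, padded with one zero word to 8 bytes per vertex so
 * the stream conforms to meshopt_encodeVertexBuffer alignment
 * restrictions. q[] holds quantized positions, anchor[] the cluster
 * minimum; all deltas are assumed to fit in 16 bits. */
static void pack_deltas16(uint8_t* dst, const int q[][3], const int anchor[3], size_t vertex_count)
{
    for (size_t i = 0; i < vertex_count; ++i)
    {
        uint16_t v[4] = {
            (uint16_t)(q[i][0] - anchor[0]),
            (uint16_t)(q[i][1] - anchor[1]),
            (uint16_t)(q[i][2] - anchor[2]),
            0 /* padding word */
        };
        memcpy(dst + i * 8, v, 8);
    }
}
```

The resulting 8-byte-stride buffer would then be handed to meshopt_encodeVertexBuffer across all clusters at once, per the comment above.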

Slightly clean up the code sample and adjust wording w.r.t. position
encoding, highlighting fixed precision storage further.

Also add Markdown files to .editorconfig to fix indentation issues.
@zeux zeux force-pushed the mlc-dxr branch 2 times, most recently from 15a4253 to 53698f3 Compare May 13, 2026 17:45
@zeux zeux merged commit 75b96fe into master May 13, 2026
26 checks passed
@zeux zeux deleted the mlc-dxr branch May 13, 2026 19:23
@zeux
Owner Author

zeux commented May 14, 2026

After some experiments I've also arrived at a decent further improvement to DXR2 C1 encoding: instead of encoding each offset for each vertex independently, we take the resulting value as a single up-to-48-bit value and encode the delta with the previous value in the same cluster. The delta wraps around and is encoded in the same bit count; this results in the same size, but allows Zstd to compress better, with even stronger gains vs DGFS:

| Mesh | size (MB) | bytes/triangle | Δ vs DGFS | Δ vs DGFS (Zstd) |
| --- | --- | --- | --- | --- |
| armadillo | 0.75 | 3.70 | −28.5% | −14.9% |
| buddha | 3.08 | 2.97 | −13.4% | −8.0% |
| bunny | 0.26 | 3.93 | −22.4% | −10.9% |
| dragon | 2.55 | 3.06 | −19.9% | −8.9% |
| pig | 2.85 | 3.68 | −26.7% | −12.6% |
| roadbike | 4.24 | 2.65 | −23.1% | −18.2% |

This requires unpacking the deltas during deserialization; because the deltas are merged into a single value the cost is relatively small: a simple scalar loop (unaligned read of a 64-bit value; shift into alignment; accumulate delta; append result to output 64-bit value; shift bytes into the output using an unaligned 64-bit store) takes ~1.04ms to decode buddha.obj (vs ~0.45ms for raw Compressed1 offsets).

```c
// vc = vertex count in cluster, tb = bitsx + bitsy + bitsz (<= 48)
// input/output buffers should have memory safety padding for 64-bit unaligned loads/stores
static void decode_xyzdp(uint8_t* dst, const uint8_t* src, int vc, int tb)
{
    uint64_t mask = (1ull << tb) - 1;
    uint64_t prev = 0, wacc = 0;
    int wbits = 0;
    size_t rp = 0;
    uint8_t* pw = dst;

    for (int j = 0; j < vc; ++j)
    {
        uint64_t word = *(const uint64_t*)(src + (rp >> 3));
        uint64_t delta = (word >> (rp & 7)) & mask;
        rp += tb;

        uint64_t curr = (prev + delta) & mask;
        prev = curr;

        wacc |= curr << wbits;
        wbits += tb;

        if (wbits >= 64 - tb)
        {
            *(uint64_t*)pw = wacc;
            int nbytes = wbits >> 3;
            pw += nbytes;
            wacc >>= nbytes * 8;
            wbits -= nbytes * 8;
        }
    }

    *(uint64_t*)pw = wacc;
}
```
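
For reference, a matching encoder sketch (my code, not part of the PR): it writes wrap-around deltas into the same tb-bit slots that decode_xyzdp consumes. The destination must be zero-initialized and padded for unaligned 64-bit stores, and input values must already fit in tb bits.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// values = unpacked per-vertex tb-bit values, tb <= 48;
// dst must be zeroed and padded for 64-bit unaligned read-modify-writes
static void encode_xyzdp(uint8_t* dst, const uint64_t* values, int vc, int tb)
{
    uint64_t mask = (1ull << tb) - 1;
    uint64_t prev = 0;
    size_t wp = 0; // write position in bits

    for (int j = 0; j < vc; ++j)
    {
        // wrap-around delta in the same tb-bit budget as the raw value
        uint64_t delta = (values[j] - prev) & mask;
        prev = values[j] & mask;

        // merge into the bit stream; (wp & 7) + tb <= 55, so one
        // unaligned 64-bit read-modify-write per value suffices
        uint64_t word;
        memcpy(&word, dst + (wp >> 3), 8);
        word |= delta << (wp & 7);
        memcpy(dst + (wp >> 3), &word, 8);
        wp += tb;
    }
}
```

With tb = 8 the stream is byte-aligned, which makes the wrap-around behavior easy to see: values 5, 3, 250 encode as bytes 5, 254, 247.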

I've also experimented with an interleaved layout with delta encoding; it's possible to decode fairly quickly (~1.38ms) too but it requires PDEP/PEXT & AVX to decode 4 values at a time, which makes it less portable as PDEP/PEXT are unavailable on ARM and are slow on earlier Zen CPUs. It does get further gains after Zstd but honestly at this point this is enough options for people to consider :) Feel free to reach out if any of the above is of interest for production use.

@zeux
Owner Author

zeux commented May 14, 2026

Finally, because all the encodings above are layered and require several decompression steps, it could be instructive to look at bytes/triangle over the three different stages: let's call them raw (ready for BVH build), encoded (stored in the engine file representation) and compressed (stored on disk when Zstd etc. is used on top).

For DGFS these map to DGF, DGFS and DGFS+Zstd. For the encoding in the previous comment, these map to DXR2 Compressed1 positions + 3-byte triangle indices, compressed1+meshopt encoding, and compressed1+meshopt encoding+Zstd. The sizes here are all in bytes/triangle, smallest entry in each respective category is bolded.

| Mesh | raw (DXR2) | encoded | compressed | compressed (+deltas) | raw (DGF) | encoded | compressed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| armadillo | **5.83** | **3.70** | 3.19 | **3.12** | 6.50 | 5.17 | 3.66 |
| buddha | 5.07 | **2.97** | 2.53 | **2.35** | **4.76** | 3.42 | 2.55 |
| bunny | **6.06** | **3.93** | 3.51 | **3.39** | 7.14 | 5.07 | 3.81 |
| dragon | 5.17 | **3.06** | 2.57 | **2.37** | **5.02** | 3.83 | 2.60 |
| pig | **5.78** | **3.68** | 3.30 | **3.23** | 6.32 | 5.01 | 3.70 |
| roadbike | 4.81 | **2.65** | 2.10 | **1.82** | **4.31** | 3.44 | 2.23 |

Note that "raw" storage here is not the BVH itself; it's the inputs given to the BVH builder. On newer AMD RDNA hardware I would of course expect the BVH build to be more straightforward if DGF blocks are given as an input, but any other hardware, including earlier AMD chips, would need to unpack DGF blocks and repack the data, which will not be faster or better than raw DXR2 encoding.

buddha.obj takes ~1ms to decode using the variant with deltas, or <0.5ms using the variant without. DGFS takes ~24ms. These numbers are not inclusive of Zstd if it's used for either, as you can swap Zstd for any other high quality codec, including HW Kraken if you're so lucky. Software Zstd would add ~1.5ms to either option, so 2-2.5ms for meshopt vs ~25ms for DGFS.

My conclusion overall based on this and other limitations is that DGF is fine as a leaf BVH encoding for RDNA5 but overall DGF-derived encodings are not Pareto optimal and superior alternatives exist. Compressed1 position encoding though is quite nice and I recommend other APIs like Vulkan adopt it.
