When compressing cluster positions, it is convenient to use shared exponent encoding where all cluster vertices are quantized to the same integer grid. This exponent needs to be shared across all transitively connected clusters to avoid gaps in the mesh. The resulting integers can be stored directly as 24-bit (9 bytes per vertex), as fixed bit count offsets from the anchor (if max_bits is 16 this would add up to 6 bytes per vertex plus a per-cluster or per-mesh anchor), or as variable bit count offsets from the per-cluster anchor. The latter format is part of an upcoming DXR2 specification, exposed as D3D12_VERTEX_FORMAT_COMPRESSED1 (for which max_bits would be set to 16).
Because we are working with a 24-bit signed grid, and floating point values use a 24-bit unsigned mantissa (including the implied leading bit), we have to truncate one bit. However, since the input values are rounded with an uncertain direction, this could lead to the anchor overflowing; in this case we need to bump the exponent range by 1. 1 - FLT_EPSILON / 2 is the frexp equivalent of the 1.1111... mantissa, as frexp remaps the mantissa to the [0.5, 1) range; this would be cleaner with direct bit manipulation.
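The mantissa boundary mentioned above can be checked directly: frexpf maps any positive float to m * 2^e with m in [0.5, 1), so 1.0f already "overflows" into the next exponent (m = 0.5, e = 1), while 1 - FLT_EPSILON/2 is the largest float strictly below 1 and corresponds to the all-ones mantissa at e = 0. A small demonstration (helper names are illustrative):

```c
#include <float.h>
#include <math.h>

/* Exponent part of frexpf: x == mantissa * 2^result, mantissa in [0.5, 1). */
static int frexp_exponent(float x)
{
    int e;
    frexpf(x, &e);
    return e;
}

/* Mantissa part of frexpf, in [0.5, 1) for positive finite x. */
static float frexp_mantissa(float x)
{
    int e;
    return frexpf(x, &e);
}
```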
For best performance and good size for raytracing friendly geometry, we can encode meshlet topology (no vertex references) using meshopt codec, and meshlet vertex positions using DXR2 Compressed1 format, where each cluster has its own anchor and a shared exponent which prevents gaps. This encoding method uses meshopt_computeClusterPositionExponent to establish the exponent that fits all clusters. Here we pack each cluster independently with a DXR2-compatible header (plus one word for our own cluster data). The decoding process is meshopt_decodeMeshlet + memcpy. Note that for optimal post-deflate size, a different data layout may be preferable which might require a little more memcpy/reshuffling to decode.
This acts as a benchmark and as an example of how to destructure the encoded data; it runs at ~12-13 GB/s on one Zen4 core.
This is consistent with how DGF stores the data and is also optimal for RT: since we do not preserve original indices here, there's no need to worry about attribute seams and the like, and as such we want a deduplicated position stream.
Even if the range at a given scale fits in the given bit count, because the caller independently quantizes the endpoints to the new exponent, this can push the quantized range such that the delta does not fit in max_bits. What we want is a check similar to maxc_off: increase the exponent by 1 if we are past the quantization boundary; however, here the boundary is defined by the input bit count restriction, not the floating point precision, so we need to synthesize it from the input bits instead. Additionally, max_bits=1 is not a well formed constraint in general; restricting the deltas to be 0 requires picking an impossibly large exponent in practice that collapses the entire cluster into a point, which does not seem useful. The bit count may be 1 along a given axis, but it must be >1 on at least one axis, so restrict it to 2 to avoid misuse.
Note that the example code stores the individual cluster bits sequentially for simplicity: each cluster stores a 16-byte header (including the DXR2 Compressed1 header), followed by packed positions, followed by encoded topology. For best post-compression with deflate/Zstd, you would want to store all headers, followed by all positions, followed by all topology, as this maximizes reuse. An even better option would store individual axis bits consecutively (all X bits => all Y bits => all Z bits for all clusters), but this would require bit repacking, as DXR2 doesn't allow splitting the axes this way. As it stands, this format can be decoded into a runtime-friendly representation for direct DXR2 consumption with meshopt_decodeMeshlet + memcpy. Before deflate, the results are smaller than DGFS when a matching exponent is used for encoding; with the existing layout, the results are a little larger than DGFS post-deflate, as it uses a fully deinterleaved layout; a partially interleaved layout (without bit interleaving) ends up being a little smaller than DGFS. A fully deinterleaved layout is noticeably smaller than DGFS, but it requires custom SIMD bit deinterleaving code to decode quickly and I don't feel like writing this now :)
Some data for a few meshes, using exponent -14 (to match quality). The size is the raw number reported by the demo program; the Zstd size is computed from the raw stream as well as a partially-deinterleaved format (headers, positions, topology - this is still as fast to decode as raw). Not including the fully bit deinterleaved data here as it would be slower to decode without more specialized processing.
DGFS uses sequential byte-aligned deltas to store positional components, which takes more space but compresses better in certain cases. The demo format is not AMD-dependent and does not require "transcoding" in a meaningful sense; it only requires decoding topology into 3 bytes per triangle for the input to be directly compatible with DXR2 APIs.
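The 3-bytes-per-triangle representation mentioned above amounts to one byte per meshlet-local index, since meshlets reference at most 256 vertices; a sketch with a hypothetical helper:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: flatten meshlet-local triangle indices (values < 256) into a
   3-bytes-per-triangle stream: one byte per index, three indices per
   triangle, stored sequentially. */
static void pack_triangles_3b(uint8_t* dst, const unsigned* idx, size_t tris)
{
    for (size_t i = 0; i < tris * 3; ++i)
        dst[i] = (uint8_t)idx[i];
}
```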
…nExponent: the name is a little too long; after evaluating two alternatives, computeClusterExponent and computePositionExponent, the latter won: the encoding here is position specific but could in theory be used outside of clusters too, and the input to the function is either a mesh AABB or a cluster AABB depending on the use case. Also clean up the header documentation.
We test a variety of configurations including some corner cases; most of these are synthetic. For each configuration we also check the general expected properties: endpoints should be representable as 24-bit signed integers, and endpoints minus anchor should be representable as the unsigned K-bit value (max_bits = K).
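Those expected properties can be expressed directly as a predicate over the quantized endpoints (a sketch, not the actual test code):

```c
#include <stdbool.h>

/* Property check mirroring the expectations described above: quantized
   endpoints must fit in a signed 24-bit integer, and (endpoint - anchor)
   must fit in an unsigned max_bits-wide value. The anchor here is the
   minimum endpoint; qmin/qmax are already-quantized integers. */
static bool check_cluster_bounds(long qmin, long qmax, int max_bits)
{
    const long lim24 = 1l << 23; /* signed 24-bit range is [-2^23, 2^23) */
    if (qmin < -lim24 || qmax >= lim24)
        return false;
    return (qmax - qmin) < (1l << max_bits);
}
```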
Add documentation for cluster position quantization based on the new helper function.
Since the deltas are 16-bit there's also a variety of other schemes possible here that aren't DXR2 specific and don't rely on Compressed1 runtime layout. For example, you could store the deltas from anchors as 16-bit per axis, and compress the result using
In this case the cluster encoding is not independent, as (note, data above is with each vertex padded to 8 bytes to conform to
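One such non-DXR2 scheme from the paragraph above, sketched with a hypothetical helper: store three 16-bit unsigned per-axis deltas from the per-cluster anchor, assuming the quantized cluster range fits in 16 bits per axis.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: encode quantized positions (3 ints per vertex) as three 16-bit
   unsigned deltas from the per-cluster anchor. anchor[k] must not exceed
   any q[i*3+k], and the per-axis range must fit in 16 bits. */
static void encode_deltas16(uint16_t* dst, const int32_t* q,
                            const int32_t anchor[3], size_t vertex_count)
{
    for (size_t i = 0; i < vertex_count; ++i)
        for (int k = 0; k < 3; ++k)
            dst[i * 3 + k] = (uint16_t)(q[i * 3 + k] - anchor[k]);
}
```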
Slightly clean up the code sample and adjust wording wrt position encoding, highlighting fixed-precision storage further. Also add Markdown files to .editorconfig to fix indentation issues.
After some experiments I've also arrived at a decent further improvement to DXR2 C1 encoding: instead of encoding each offset for each vertex independently, we take the resulting value as a single up-to-48-bit value and encode the delta from the previous value in the same cluster. The delta wraps around and is encoded in the same bit count; this results in the same size, but allows Zstd to compress better, with even stronger gains vs DGFS:
This requires unpacking the deltas during deserialization; because the deltas are merged into a single value, the cost is relatively small: a simple scalar loop (unaligned read of a 64-bit value; shift into alignment; accumulate delta; append result to output 64-bit value; shift bytes into the output using an unaligned 64-bit store) takes ~1.04ms to decode:

```c
// vc = vertex count in cluster, tb = bitsx + bitsy + bitsz (<= 48)
// input/output buffers should have memory safety padding for 64-bit unaligned loads/stores
static void decode_xyzdp(uint8_t* dst, const uint8_t* src, int vc, int tb)
{
    uint64_t mask = (1ull << tb) - 1;
    uint64_t prev = 0, wacc = 0;
    int wbits = 0;
    size_t rp = 0;
    uint8_t* pw = dst;

    for (int j = 0; j < vc; ++j)
    {
        // unaligned 64-bit read; shift so the next tb-bit delta sits in the low bits
        uint64_t word = *(const uint64_t*)(src + (rp >> 3));
        uint64_t delta = (word >> (rp & 7)) & mask;
        rp += tb;

        // prefix sum with wraparound modulo 2^tb recovers the original value
        uint64_t curr = (prev + delta) & mask;
        prev = curr;

        // append tb bits to the output accumulator
        wacc |= curr << wbits;
        wbits += tb;

        // flush whole bytes once the next value might not fit in the accumulator
        if (wbits >= 64 - tb)
        {
            *(uint64_t*)pw = wacc;
            int nbytes = wbits >> 3;
            pw += nbytes;
            wacc >>= nbytes * 8;
            wbits -= nbytes * 8;
        }
    }

    *(uint64_t*)pw = wacc;
}
```

I've also experimented with an interleaved layout with delta encoding; it's possible to decode it fairly quickly (~1.38ms) too, but it requires PDEP/PEXT & AVX to decode 4 values at a time, which makes it less portable, as PDEP/PEXT are unavailable on ARM and slow on earlier Zen CPUs. It does get further gains after Zstd, but honestly at this point this is enough options for people to consider :) Feel free to reach out if any of the above is of interest for production use.
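For reference, a matching encoder for this wrapped-delta scheme could look like the following (hypothetical encode_xyzdp, mirroring the decoder above; memcpy is used here for unaligned access, and buffers are assumed to carry the same safety padding):

```c
#include <stdint.h>
#include <string.h>

// Hypothetical encoder matching the decoder above: reads vc packed tb-bit
// values from src and emits each one as a wrapped tb-bit delta from the
// previous value. Buffers need padding for 64-bit unaligned loads/stores.
static void encode_xyzdp(uint8_t* dst, const uint8_t* src, int vc, int tb)
{
    uint64_t mask = (1ull << tb) - 1;
    uint64_t prev = 0;
    size_t rp = 0, wp = 0;

    for (int j = 0; j < vc; ++j)
    {
        // unaligned 64-bit read; shift so the next tb-bit value is in the low bits
        uint64_t word;
        memcpy(&word, src + (rp >> 3), 8);
        uint64_t curr = (word >> (rp & 7)) & mask;
        rp += tb;

        // wrapped delta modulo 2^tb; the decoder's wrapping add inverts this
        uint64_t delta = (curr - prev) & mask;
        prev = curr;

        // read-modify-write: keep bits already written below wp in this byte
        memcpy(&word, dst + (wp >> 3), 8);
        word &= (1ull << (wp & 7)) - 1;
        word |= delta << (wp & 7);
        memcpy(dst + (wp >> 3), &word, 8);
        wp += tb;
    }
}
```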
Finally, because all the encodings above are layered and require several decompression steps, it could be instructive to look at bytes/triangle over the three different stages: let's call them raw (ready for BVH build), encoded (stored in the engine file representation), and compressed (stored on disk when Zstd/etc. is used on top). For DGFS these map to DGF, DGFS, and DGFS+Zstd. For the encoding in the previous comment, these map to DXR2 Compressed1 positions + 3-byte triangle indices, compressed1+meshopt encoding, and compressed1+meshopt encoding+Zstd. The sizes here are all in bytes/triangle; the smallest entry in each category is bolded.
Note that "raw" storage here is not the BVH itself; it's the inputs given to the BVH builder. On newer AMD RDNA hardware I would of course expect the BVH build to be more straightforward if DGF blocks are given as an input, but any other hardware, including earlier AMD chips, would need to unpack DGF blocks and repack the data, which will not be faster or better than raw DXR2 encoding.
My conclusion overall, based on this and other limitations, is that DGF is fine as a leaf BVH encoding for RDNA5, but DGF-derived encodings are not Pareto optimal overall and superior alternatives exist. Compressed1 position encoding, though, is quite nice, and I recommend that other APIs like Vulkan adopt it.
In addition to meshopt_computePositionExponent, which computes the exponent that ensures anchors/deltas fit within the specified limits, this change also adds an example (encodeMeshletsDXR) which uses the meshlet encoder to encode topology and Compressed1 encoding to encode positions.

This contribution is sponsored by Valve.