hello! i'm a spack developer and really invested in software build and packaging performance!
proposal: BLAKE3 🤝 AMD
so BLAKE3 is an unambiguously great checksum approach (see the fantastic paper). some specific reasons why i think it's worth the aocl-crypto project's time to look into:
issues with the reference implementation
those are the fun reasons (i have my own reasons for this, see later section). but there are some other points that would seem relevant to bring to your team's attention as well:
...and there are other reasons i believe the reference implementation is something we should be able to improve on pretty readily. why?
- from what i understand, oneTBB's work-stealing approach is wildly misaligned with the extremely regular computations performed to construct a hash chain. it's way too high-level for a computation like this.
- the (very skilled) cryptographers behind it don't seem to have a terribly effective understanding of how to pipeline across threads, or how to schedule parallel work more generally (https://github.com/BLAKE3-team/BLAKE3/tree/master/c#less-common-api-functions):
Hashing large files with this function usually requires memory-mapping, since reading a file into memory in a single-threaded loop takes longer than hashing the resulting buffer. Note that hashing a memory-mapped file with this function produces a "random" pattern of disk reads, which can be slow on spinning disks.
they also pretty clearly decribe recursive calls and stack space as free real estate in order to avoid analyzing the memory bounds. it turns out the memory bounds for tree hashing are actually highly practical and optimizable!
striping, splitting, merging, for filesystem-like data
there are a few very interesting and unique dataflows that BLAKE3's cryptography enables. this is especially where i think your team's approach (high-quality open-source libraries like hyperscan leveraging microarchitectural task-data co-scheduling) should be ideal.
first, a note on task dependencies:
- i don't know why they don't just say this, but (i'm pretty sure?) the asymptotic runtime is quadratic (note that the hash tree can be etched into an upper triangular matrix).
- it's correct that a tree of binary operations to compress two fixed-size data chunks into a parent chunk can operate in-place with constant space, but mentioning "heap allocation" in 5.2 Multi-threading is misleading imho:
This recursive approach is good fit for a fork-join concurrency model, like those provided by OpenMP, Cilk, and Rayon (Rust). Each left and right part becomes a separate task, and a work-stealing runtime parallelizes those tasks across however many threads are available. This can work without heap allocation, because the runtime can make a fixed-size stack allocation at each recursive call site.
unless i'm mistaken, this recursive "fixed-size" stack allocation will result in linear stack growth, and while work-stealing is great for irregular task-data scheduling, the hash tree is in fact immensely structured...
beyond BLAKE3: extending reuse of memoized substructure to filesystem operations
...and the authors speak to this in 6.5 Incremental Updates:
Another new use case supported by BLAKE3 is incrementally updating the root hash when an input is edited in place. A serial hash function can efficiently append new bytes to the end of an input, but editing bytes at the beginning or in the middle requires re-hashing everything to the right of the edit. With BLAKE3, an application can edit bytes anywhere in its input and re-hash just the modified chunks and their transitive parent nodes.
in particular, they analogize this to a filesystem:
This is similar to the constraints of a typical filesystem, where appends and in-place edits are efficient, but insertions and removals within a file are either slow or unsupported.
in fact, many (most) filesystems compose a "file" from noncontiguous extents, and instead perform allocate-on-flush to merge the ordered sequence of extents into a contiguous allocated region.
this is more than pedantry, as 7.5 Chunk Counter throws up a hurdle in our path:
The main job of the compression function’s 64-bit counter parameter, t, is to support extendable output. By incrementing t (see §2.6), the caller can extract as much output as needed, without imposing any extra steps on the default-length case. However, we also use the t parameter during the input phase, to indicate the chunk number (see §2.4). This means that each chunk has a distinct CV with high probability, even if two chunks have the same input bytes. This is not strictly necessary for security, but it discourages a class of dangerous optimizations.
i believe we (or perhaps your team, at least) could figure out a way to recompute this efficiently with SIMD traversal of all intermediate hash tree nodes, to recreate the t counter upon permutation, then to fully recompute the tree. but i will argue that we should be able to store only the top-level hash nodes of each extent, plus some additional data, in order to compute our permuted tree hash.
the dangerous optimization:
Consider an application that hashes sparse files, which are mostly filled with zeros. The majority of this application’s input chunks could be the all-zero chunk. This application might try to compute the CV of the zero chunk only once, and then check each chunk before compressing it to see whether it can use that precomputed CV. This check is cheap, and for this application it could be a big speedup. But this optimization is dangerous, because it is not constant-time. The runtime of the hash function would leak information about its input.
arguably, the same application could have initially performed a trivial run-length encoding to compress the zero sectors, and then hashed the result, which would result in the same speedup. to flip it around: the application recomputing the zero checksum repeatedly is also leaking information about its input. the constant-time nature of the computation is not a cryptographic security guarantee of BLAKE3, and it certainly wasn't proven in the paper.
*I had an extremely lengthy digression at this point that will turn into a separate research paper.*
i believe the "scratch space" construct also maps to filesystem-like interfaces quite naturally, as well as more general tree computations. here are two examples of more complex data structures i have very much desired to be able to hash efficiently in the past:
- filesystem snapshots, which can be used to virtualize i/o from subprocesses in a build system or other supervisor process,
- the cord data structure from hans boehm at xerox parc (https://hboehm.info/gc/gc_source/cordh.txt) may cheaply prepend or append bytes to its internal tree structure, but splitting off a chunk from one end like the incremental BLAKE3 is as expensive as traversing the entire tree to serialize its contents. we can do better!
Conclusion: learn from hyperscan!
I think my basic point is: this concept of a "scratch space" has worked very effectively for hyperscan, and it's a way to achieve locality that the BLAKE3 authors haven't considered yet.
I'm interested in working on this and would like to know how likely it is for your team to accept it. No pressure, but I am interested in approaching this as a research project. the gnu coreutils team ended up implementing it pretty quickly on their own after i raised this to them a few months ago, and i feel it's potentially a very powerful showcase of AMD processing power with a relatively small amount of effort.
love your work!
*a post-script elaboration on the kind of work I've done previously that led me to reach out here. in particular the multithreaded SIMD parsing of zip files is the precise case where i want something just like this (a more powerful and flexible tree hashing framework which works with the processor and operating system instead of continuously thrashing)*
libmem applications
i previously worked on the pants build tool, which did a lot of highly-structured i/o-centric workflow scheduling: https://www.pantsbuild.org/stable/docs/writing-plugins/the-rules-api/file-system
lots of prior work with pipelined i/o and CPU:
proposal: BLAKE3 ❤️ AMD
so BLAKE3 is an unambiguously great checksum approach (see the fantastic paper). some specific reasons why i think it's worth the aocl-crypto project's time to look into:
issues with the reference implementation
those are the fun reasons (i have my own reasons for this, see later section). but there are some other points that would seem relevant to bring to your team's attention as well:
...but the reference implementation is something i believe we should be able to improve on pretty readily. why?
- from what i understand, oneTBB's work-stealing approach is wildly misaligned with the extremely regular computations performed to construct a hash chain. it's way too high-level for a computation like this.
- the (very skilled) cryptographers behind it don't seem to have a terribly effective understanding of how to pipeline across threads, or how to schedule parallel work more generally (https://github.com/BLAKE3-team/BLAKE3/tree/master/c#less-common-api-functions):
Hashing large files with this function usually requires memory-mapping, since reading a file into memory in a single-threaded loop takes longer than hashing the resulting buffer. Note that hashing a memory-mapped file with this function produces a "random" pattern of disk reads, which can be slow on spinning disks.
striping, splitting, merging, for filesystem-like data
there are a few very interesting and unique dataflows that BLAKE3's cryptography enables. this is especially where i think your team's approach (high-quality open-source libraries like hyperscan leveraging microarchitectural task-data co-scheduling) should be ideal.
first, a note on task dependencies:
- i don't know why they don't just say this, but (i'm pretty sure?) the asymptotic runtime is quadratic (note that the hash tree can be etched into an upper triangular matrix).
- it's correct that a tree of binary operations to compress two fixed-size data chunks into a parent chunk can operate in-place with constant space, but mentioning "heap allocation" in 5.2 Multi-threading is misleading imho:
This recursive approach is good fit for a fork-join concurrency model, like those provided by OpenMP, Cilk, and Rayon (Rust). Each left and right part becomes a separate task, and a work-stealing runtime parallelizes those tasks across however many threads are available. This can work without heap allocation, because the runtime can make a fixed-size stack allocation at each recursive call site.
unless i'm mistaken, this recursive "fixed-size" stack allocation will result in linear stack growth, and while work-stealing is great for irregular task-data scheduling, the hash tree is in fact immensely structured...
beyond BLAKE3: extending reuse of memoized substructure to filesystem operations
...and the authors speak to this in 6.5 Incremental Updates:
Another new use case supported by BLAKE3 is incrementally updating the root hash when an input is edited in place. A serial hash function can efficiently append new bytes to the end of an input, but editing bytes at the beginning or in the middle requires re-hashing everything to the right of the edit. With BLAKE3, an application can edit bytes anywhere in its input and re-hash just the modified chunks and their transitive parent nodes.
in particular, they analogize this to a filesystem (which is my ulterior motive, see later section):
This is similar to the constraints of a typical filesystem, where appends and in-place edits are efficient, but insertions and removals within a file are either slow or unsupported.
in fact, many (most) filesystems compose a "file" from noncontiguous extents, and instead perform allocate-on-flush to merge the ordered sequence of extents into a contiguous allocated region.
this is more than pedantry, as 7.5 Chunk Counter throws up a hurdle in our path:
The main job of the compression function’s 64-bit counter parameter, t, is to support extendable output. By incrementing t (see §2.6), the caller can extract as much output as needed, without imposing any extra steps on the default-length case. However, we also use the t parameter during the input phase, to indicate the chunk number (see §2.4). This means that each chunk has a distinct CV with high probability, even if two chunks have the same input bytes. This is not strictly necessary for security, but it discourages a class of dangerous optimizations.
i believe we (or perhaps your team, at least) could figure out a way to recompute this efficiently with SIMD traversal of all intermediate hash tree nodes, to recreate the t counter upon permutation, then to fully recompute the tree. but i will argue that we should be able to store only the top-level hash nodes of each extent, plus some additional data, in order to compute our permuted tree hash.
the dangerous optimization:
Consider an application that hashes sparse files, which are mostly filled with zeros. The majority of this application’s input chunks could be the all-zero chunk. This application might try to compute the CV of the zero chunk only once, and then check each chunk before compressing it to see whether it can use that precomputed CV. This check is cheap, and for this application it could be a big speedup. But this optimization is dangerous, because it is not constant-time. The runtime of the hash function would leak information about its input.
arguably, the same application could have initially performed a trivial run-length encoding to compress the zero sectors, and then hashed the result, which would result in the same speedup. to flip it around: the application recomputing the zero checksum repeatedly is also leaking information about its input. the constant-time nature of the computation is not a cryptographic security guarantee of BLAKE3, and it certainly wasn't proven in the paper.
I had an extremely lengthy digression at this point that will turn into a separate research paper.
i believe the "scratch space" construct also maps to filesystem-like interfaces quite naturally, as well as more general tree computations. here are two examples of more complex data structures i have very much desired to be able to hash efficiently in the past:
- filesystem snapshots, which can be used to virtualize i/o from subprocesses in a build system or other supervisor process,
- the cord data structure from hans boehm at xerox parc (https://hboehm.info/gc/gc_source/cordh.txt) may cheaply prepend or append bytes to its internal tree structure, but splitting off a chunk from one end like the incremental BLAKE3 is as expensive as traversing the entire tree to serialize its contents. we can do better!
Conclusion: learn from hyperscan!
I think my basic point is: this concept of a "scratch space" has worked very effectively for hyperscan, and it's a way to achieve locality that the BLAKE3 authors haven't considered yet.
I'm interested in working on this and would like to know how likely it is for your team to accept it. No pressure, but I am interested in approaching this as a research project. the gnu coreutils team ended up implementing it pretty quickly on their own after i raised this to them a few months ago, and i feel it's potentially a very powerful showcase of AMD processing power with a relatively small amount of effort.
love your work!
this is a post-script elaboration on the kind of work I've done previously that led me to reach out here. in particular the multithreaded SIMD parsing of zip files (over network and local filesystem operations) was the precise case where i wanted something just like this (a more powerful and flexible tree hashing framework which works with the processor and operating system instead of continuously thrashing)
libmem applications
i previously worked on the pants build tool, which did a lot of highly-structured i/o-centric workflow scheduling: https://www.pantsbuild.org/stable/docs/writing-plugins/the-rules-api/file-system
lots of prior work with pipelined i/o and CPU:
after spending a lot of time on:
- pipelined zip extraction, and
- rewriting the entirety of rust's
std::{fs,path,io,env} (i can show diffs),
- rewriting the entire of pip,
...i realized we definitely need some better primitive operations, and i believe this is one of them!
after spending a lot of time on:
- pipelined zip extraction, and
- rewriting the entirety of rust's
std::{fs,path,io,env} (i can show diffs),
- rewriting the entire of pip,
...i realized we definitely need some better primitive operations, and i believe this is one of them!
hello! i'm a spack developer and really invested in software build and packaging performance!
proposal: BLAKE3 🤝 AMD
so BLAKE3 is an unambiguously great checksum approach (see the fantastic paper). some specific reasons why i think it's worth the aocl-crypto project's time to look into:
issues with the reference implementation
those are the fun reasons (i have my own reasons for this, see later section). but there are some other points that would seem relevant to bring to your team's attention as well:
...and there are other reasons i believe the reference implementation is something we should be able to improve on pretty readily. why?
they also pretty clearly decribe recursive calls and stack space as free real estate in order to avoid analyzing the memory bounds. it turns out the memory bounds for tree hashing are actually highly practical and optimizable!
striping, splitting, merging, for filesystem-like data
there are a few very interesting and unique dataflows that BLAKE3's cryptography enables. this is especially where i think your team's approach (high-quality open-source libraries like hyperscan leveraging microarchitectural task-data co-scheduling) should be ideal.
first, a note on task dependencies:
unless i'm mistaken, this recursive "fixed-size" stack allocation will result in linear stack growth, and while work-stealing is great for irregular task-data scheduling, the hash tree is in fact immensely structured...
beyond BLAKE3: extending reuse of memoized substructure to filesystem operations
...and the authors speak to this in 6.5 Incremental Updates:
in particular, they analogize this to a filesystem:
in fact, many (most) filesystems compose a "file" from noncontiguous extents, and instead perform allocate-on-flush to merge the ordered sequence of extents into a contiguous allocated region.
this is more than pedantry, as 7.5 Chunk Counter throws up a hurdle in our path:
i believe we (or perhaps your team, at least) could figure out a way to recompute this efficiently with SIMD traversal of all intermediate hash tree nodes, to recreate the
tcounter upon permutation, then to fully recompute the tree. but i will argue that we should be able to store only the top-level hash nodes of each extent, plus some additional data, in order to compute our permuted tree hash.the dangerous optimization:
arguably, the same application could have initially performed a trivial run-length encoding to compress the zero sectors, and then hashed the result, which would result in the same speedup. to flip it around: the application recomputing the zero checksum repeatedly is also leaking information about its input. the constant-time nature of the computation is not a cryptographic security guarantee of BLAKE3, and it certainly wasn't proven in the paper.
*I had an extremely lengthy digression at this point that will turn into a separate research paper.*
i believe the "scratch space" construct also maps to filesystem-like interfaces quite naturally, as well as more general tree computations. here are two examples of more complex data structures i have very much desired to be able to hash efficiently in the past:Conclusion: learn from hyperscan!
I think my basic point is: this concept of a "scratch space" has worked very effectively for hyperscan, and it's a way to achieve locality that the BLAKE3 authors haven't considered yet.
I'm interested in working on this and would like to know how likely it is for your team to accept it. No pressure, but I am interested in approaching this as a research project. the gnu coreutils team ended up implementing it pretty quickly on their own after i raised this to them a few months ago, and i feel it's potentially a very powerful showcase of AMD processing power with a relatively small amount of effort.
love your work!
*a post-script elaboration on the kind of work I've done previously that led me to reach out here. in particular the multithreaded SIMD parsing of zip files is the precise case where i want something just like this (a more powerful and flexible tree hashing framework which works with the processor and operating system instead of continuously thrashing)*
libmem applications
i previously worked on the pants build tool, which did a lot of highly-structured i/o-centric workflow scheduling: https://www.pantsbuild.org/stable/docs/writing-plugins/the-rules-api/file-system
lots of prior work with pipelined i/o and CPU:
native-imagefor parallel i/o workloads on the jvm,proposal: BLAKE3 ❤️ AMD
so BLAKE3 is an unambiguously great checksum approach (see the fantastic paper). some specific reasons why i think it's worth the aocl-crypto project's time to look into:
issues with the reference implementation
those are the fun reasons (i have my own reasons for this, see later section). but there are some other points that would seem relevant to bring to your team's attention as well:
...but the reference implementation is something i believe we should be able to improve on pretty readily. why?
striping, splitting, merging, for filesystem-like data
there are a few very interesting and unique dataflows that BLAKE3's cryptography enables. this is especially where i think your team's approach (high-quality open-source libraries like hyperscan leveraging microarchitectural task-data co-scheduling) should be ideal.
first, a note on task dependencies:
unless i'm mistaken, this recursive "fixed-size" stack allocation will result in linear stack growth, and while work-stealing is great for irregular task-data scheduling, the hash tree is in fact immensely structured...
beyond BLAKE3: extending reuse of memoized substructure to filesystem operations
...and the authors speak to this in 6.5 Incremental Updates:
in particular, they analogize this to a filesystem (which is my ulterior motive, see later section):
in fact, many (most) filesystems compose a "file" from noncontiguous extents, and instead perform allocate-on-flush to merge the ordered sequence of extents into a contiguous allocated region.
this is more than pedantry, as 7.5 Chunk Counter throws up a hurdle in our path:
i believe we (or perhaps your team, at least) could figure out a way to recompute this efficiently with SIMD traversal of all intermediate hash tree nodes, to recreate the
tcounter upon permutation, then to fully recompute the tree. but i will argue that we should be able to store only the top-level hash nodes of each extent, plus some additional data, in order to compute our permuted tree hash.the dangerous optimization:
arguably, the same application could have initially performed a trivial run-length encoding to compress the zero sectors, and then hashed the result, which would result in the same speedup. to flip it around: the application recomputing the zero checksum repeatedly is also leaking information about its input. the constant-time nature of the computation is not a cryptographic security guarantee of BLAKE3, and it certainly wasn't proven in the paper.
I had an extremely lengthy digression at this point that will turn into a separate research paper.
i believe the "scratch space" construct also maps to filesystem-like interfaces quite naturally, as well as more general tree computations. here are two examples of more complex data structures i have very much desired to be able to hash efficiently in the past:Conclusion: learn from hyperscan!
I think my basic point is: this concept of a "scratch space" has worked very effectively for hyperscan, and it's a way to achieve locality that the BLAKE3 authors haven't considered yet.
I'm interested in working on this and would like to know how likely it is for your team to accept it. No pressure, but I am interested in approaching this as a research project. the gnu coreutils team ended up implementing it pretty quickly on their own after i raised this to them a few months ago, and i feel it's potentially a very powerful showcase of AMD processing power with a relatively small amount of effort.
love your work!
this is a post-script elaboration on the kind of work I've done previously that led me to reach out here. in particular the multithreaded SIMD parsing of zip files (over network and local filesystem operations) was the precise case where i wanted something just like this (a more powerful and flexible tree hashing framework which works with the processor and operating system instead of continuously thrashing)
libmem applications
i previously worked on the pants build tool, which did a lot of highly-structured i/o-centric workflow scheduling: https://www.pantsbuild.org/stable/docs/writing-plugins/the-rules-api/file-system
lots of prior work with pipelined i/o and CPU:
native-imagefor parallel i/o workloads on the jvm,after spending a lot of time on:
std::{fs,path,io,env}(i can show diffs),...i realized we definitely need some better primitive operations, and i believe this is one of them!
after spending a lot of time on:
std::{fs,path,io,env}(i can show diffs),...i realized we definitely need some better primitive operations, and i believe this is one of them!