We have already run into this issue for low-rank computations. On the sender side, things are fine: variable-sized data can be transmitted. On the receiver side, however, we currently rely on a single, sufficiently large (unified) arena to hold the incoming data.
There are at least two cases where the maximum size required on the receiver side is hard to predict: (1) the lattice matrices in Yang Liu (LBL)’s work, and (2) scenarios where we use lossy compression. The low-rank case is already painful for the same reason, but it remains manageable because the maximum rank is somewhat application-dependent.
This needs to be taken into account in the GPU-direct PR as well.
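To make the receiver-side problem concrete, here is a minimal sketch of one possible alternative to a worst-case-sized arena: the sender prefixes each payload with its size, and the receiver grows its buffer only when a message does not fit. Everything in it (`GrowableArena`, `pack`, `receive`, the 8-byte header) is hypothetical and not taken from our code base; the header simply stands in for whatever mechanism the runtime offers to learn the incoming size before the data lands.

```python
from dataclasses import dataclass, field

HEADER_BYTES = 8  # assumed fixed-width size prefix, for illustration only


@dataclass
class GrowableArena:
    """Hypothetical receive buffer that grows on demand."""
    capacity: int
    buf: bytearray = field(init=False)
    regrow_count: int = field(default=0, init=False)

    def __post_init__(self):
        self.buf = bytearray(self.capacity)

    def reserve(self, nbytes: int) -> None:
        """Ensure the buffer can hold nbytes; old contents are discarded."""
        if nbytes <= self.capacity:
            return
        # Grow geometrically so repeated small increases stay cheap.
        self.capacity = max(nbytes, 2 * self.capacity)
        self.buf = bytearray(self.capacity)
        self.regrow_count += 1


def pack(payload: bytes) -> bytes:
    """Sender side: prepend the payload size so the receiver can prepare."""
    return len(payload).to_bytes(HEADER_BYTES, "little") + payload


def receive(arena: GrowableArena, message: bytes) -> bytes:
    """Receiver side: read the size first, then make room, then copy."""
    nbytes = int.from_bytes(message[:HEADER_BYTES], "little")
    arena.reserve(nbytes)
    arena.buf[:nbytes] = message[HEADER_BYTES:HEADER_BYTES + nbytes]
    return bytes(arena.buf[:nbytes])


if __name__ == "__main__":
    arena = GrowableArena(capacity=16)
    receive(arena, pack(b"small tile"))   # fits, no reallocation
    receive(arena, pack(bytes(100)))      # exceeds capacity, arena grows
    print(arena.capacity, arena.regrow_count)
```

The trade-off is that the receiver pays for an occasional reallocation instead of reserving the worst case up front, which matters most when the worst case cannot be bounded in advance, as in the two cases listed above.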