try linear attention

The flash-linear-attention library (https://github.com/fla-org/flash-linear-attention) collects implementations of various linear attention mechanisms.
Try a few of the common variants from it to see whether any gives a better trade-off between physics performance and computational cost, in particular one suited to the CPU or legacy-GPU inference that is useful for CERN.
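
For orientation, here is a minimal sketch of what "linear attention" means computationally. It is pure Python and does not use the flash-linear-attention API. The function names (`feature_map`, `linear_attention`, `quadratic_reference`) are made up for illustration, and the `elu(x) + 1` feature map is an assumed choice of kernel, not necessarily what any given variant in the library uses. The point is that keys and values are summarised once into a small `d x d_v` state, so cost grows linearly with sequence length instead of quadratically.

```python
import math


def feature_map(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so always positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Non-causal linear attention: O(n * d * d_v) work, no n x n matrix."""
    d, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d)]  # sum over keys of phi(k) v^T
    z = [0.0] * d                        # sum over keys of phi(k)
    for k, v in zip(K, V):
        fk = feature_map(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = feature_map(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm
                    for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V):
    """Same result computed the O(n^2) way, used only as a cross-check."""
    out = []
    for q in Q:
        fq = feature_map(q)
        w = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        total = sum(w)
        out.append([sum(w[j] * V[j][b] for j in range(len(V))) / total
                    for b in range(len(V[0]))])
    return out


# Tiny example: 3 tokens, key/query dimension 2, value dimension 2.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
result = linear_attention(Q, K, V)
```

Because the state `S` and normaliser `z` have a fixed size regardless of sequence length, this form is the reason linear attention is a candidate for CPU or older-GPU inference. Whether the physics performance holds up is exactly what the experiments would need to show.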