During hard-pruning inference, tokens below the threshold are discarded and should not enter the feed-forward layer's computation. But by the time the feed-forward layer is reached, after normalization and other operations, the positions of the pruned tokens are no longer equal to 0, so the computation is still carried out for them. The same happens in the next layer when the Q and K matrices are computed. So where does the accelerated inference actually show up?
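To make the concern concrete, here is a minimal pure-Python sketch, assuming pruning is implemented by multiplying hidden states with a 0/1 mask. All names and values (`hidden`, `keep`, `gamma`, `beta`, `weight`, `bias`) are hypothetical and not taken from any particular implementation. It shows that a zeroed token comes out of layer normalization equal to the bias `beta`, so it is no longer 0, and it contrasts this with physically dropping the pruned rows before the linear layer.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard layer normalization over one token's feature vector.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def linear(x, weight, bias):
    # One output value per weight row: y = W x + b.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

# Hypothetical toy values: 3 tokens, 4 features, token 1 is pruned.
hidden = [[0.5, -1.0, 2.0, 0.1],
          [1.5, 0.3, -0.7, 0.9],
          [-0.2, 0.8, 0.4, -1.1]]
keep = [1, 0, 1]
gamma = [1.0, 1.0, 1.0, 1.0]
beta = [0.1, -0.2, 0.3, 0.05]
weight = [[0.2, -0.1, 0.4, 0.3],
          [0.5, 0.1, -0.3, 0.2]]
bias = [0.01, -0.02]

# Path A (assumed masking): zero the pruned token but keep its row.
masked = [[v * k for v in row] for row, k in zip(hidden, keep)]
normed = [layer_norm(row, gamma, beta) for row in masked]
dense_out = [linear(row, weight, bias) for row in normed]

# Path B (assumed gathering): drop pruned rows, so less work is done.
kept_rows = [row for row, k in zip(hidden, keep) if k]
gathered_out = [linear(layer_norm(row, gamma, beta), weight, bias)
                for row in kept_rows]

print("pruned token after layer norm:", normed[1])
print("rows computed when masking:", len(dense_out))
print("rows computed when gathering:", len(gathered_out))
```

Under these assumptions, masking alone leaves the number of rows processed unchanged, so a speedup would only appear if the pruned rows are actually removed from the tensor, as in the second path. Whether a given codebase does this is something the sketch cannot confirm.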