Inference about hard pruning

During hard-pruning inference, tokens whose scores fall below the threshold are supposed to be discarded so that they do not enter the feed-forward layer. However, after normalization and other operations, the positions of the pruned tokens are no longer 0 by the time they reach the feed-forward layer, so the computation is still carried out for them. The same happens in the next layer when the Q and K matrices are computed. So where does the inference speedup actually show up?
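To make the question concrete, here is a minimal sketch in plain Python. It is not taken from this repository: `layer_norm`, `prune_by_mask`, `prune_by_gather`, the scalar `gamma`/`beta` values, and the toy scores are all hypothetical and chosen only for illustration. It contrasts zeroing pruned tokens in place (the sequence length stays the same, and a normalization bias makes the zeroed rows non-zero again) with physically dropping them (later layers see fewer rows).

```python
import math

def layer_norm(x, gamma=1.0, beta=0.1, eps=1e-5):
    """LayerNorm over one token vector, with scalar affine parameters (hypothetical values)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

def prune_by_mask(tokens, scores, threshold):
    """Zero out pruned tokens but keep their rows, so the sequence length is unchanged."""
    return [t if s >= threshold else [0.0] * len(t) for t, s in zip(tokens, scores)]

def prune_by_gather(tokens, scores, threshold):
    """Physically drop pruned tokens, so later layers see a shorter sequence."""
    return [t for t, s in zip(tokens, scores) if s >= threshold]

# Toy input: 4 tokens of width 3, with made-up importance scores.
tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [0.5, 0.1, 0.3]]
scores = [0.9, 0.2, 0.7, 0.1]
threshold = 0.5

masked = [layer_norm(t) for t in prune_by_mask(tokens, scores, threshold)]
gathered = [layer_norm(t) for t in prune_by_gather(tokens, scores, threshold)]

print(len(masked), len(gathered))  # 4 2
print(masked[1])                   # [0.1, 0.1, 0.1]: the zeroed row is non-zero after the bias
```

Under these assumptions, only the gather variant reduces the number of rows that the feed-forward layer and the next layer's Q/K projections have to process; the mask variant does the same amount of work as no pruning at all, which is the behavior described in the question.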


Inference about hard pruning #13
