Does the attention mask reduce computation cost?

Hey there,
After reading the code, I am confused about how masking more tokens can reduce the computation cost. Did I miss anything?

PS. I see that the FLOPs are calculated from the number of tokens retained at each layer, which is counted during inference.
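
To make the question concrete, here is a rough sketch of the gap I have in mind. It assumes a simplified per-layer cost model (the four projections plus the two attention matmuls), and the function names and numbers are hypothetical, not taken from the repository. FLOPs counted from the retained token lengths drop as more tokens are masked, while a dense implementation that only applies a mask would still process the full sequence length in every layer.

```python
from typing import Sequence


def attention_flops(num_tokens: int, hidden_dim: int) -> int:
    """Rough multiply-add count for one self-attention layer.

    Assumed cost model (a simplification, not the repository's formula):
      - Q, K, V and output projections: 4 * n * d^2
      - QK^T and attention-weighted V:  2 * n^2 * d
    """
    projections = 4 * num_tokens * hidden_dim * hidden_dim
    attention = 2 * num_tokens * num_tokens * hidden_dim
    return projections + attention


def flops_from_retained_tokens(retained_per_layer: Sequence[int], hidden_dim: int) -> int:
    """FLOPs if each layer only processed the tokens it retains."""
    return sum(attention_flops(n, hidden_dim) for n in retained_per_layer)


def flops_with_dense_masking(seq_len: int, num_layers: int, hidden_dim: int) -> int:
    """FLOPs if masked tokens stay in the tensors and are merely masked out."""
    return num_layers * attention_flops(seq_len, hidden_dim)


# Illustrative values only: 4 layers, 128 input tokens, fewer retained per layer.
retained = [128, 96, 64, 32]
counted = flops_from_retained_tokens(retained, hidden_dim=64)
dense = flops_with_dense_masking(seq_len=128, num_layers=len(retained), hidden_dim=64)
print(f"counted from retained tokens: {counted}")
print(f"dense computation with mask:  {dense}")
```

If the masked tokens are not physically removed from the tensors, I would expect the measured cost to track the second number rather than the first.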

Do you have inference latency metrics?