The current name is image-latent-transformer, which we abbreviate as ILT.
This is not so cool, and a better name would be LIT (latent-image-transformer).
But really, what is latent here? It is words, not specifically images. The input representation can be a byte sequence, an image, an audio recording of a word, etc. So maybe this is really a word-latent-transformer, in contrast to byte-latent-transformer, and perhaps we should go with WLT?