Hi Yutong,
Thanks, this is really cool work! I can't quite wrap my head around the choice of a CNN-based VQGAN instead of a ViT-based ViT-VQGAN encoder/decoder in your model. Are there any insights or reasons behind this design choice?