Question about the use of CNNs in the encoder and decoder

Hi Yutong, 

Thanks, this is really cool work! I can't quite wrap my head around the choice of a CNN-based VQGAN rather than a ViT-based ViT-VQGAN encoder/decoder in your model. Are there any insights or reasons behind this design choice?