Text-to-Image Synthesis using Multimodal (VQGAN + CLIP) Architectures
Updated Nov 14, 2024 - Jupyter Notebook
A multimodal AI search engine built from scratch using a CLIP-style architecture (ViT + MPNet). It searches images via text or image queries and reaches 27.6% Recall@1 on Flickr8k.
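For readers unfamiliar with the Recall@1 metric quoted above, here is a minimal sketch of how text-to-image Recall@K is commonly computed over a shared embedding space. It is not the repository's evaluation code: the function names, the toy 2-D embeddings, and the `ground_truth` mapping are all hypothetical, and it assumes each text query has exactly one correct image.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def recall_at_k(text_embeddings, image_embeddings, ground_truth, k=1):
    """Fraction of text queries whose correct image ranks in the top k.

    ground_truth[i] is the index of the correct image for text query i
    (an assumption of this sketch: one correct image per query).
    """
    hits = 0
    for query_idx, query in enumerate(text_embeddings):
        scores = [cosine_similarity(query, image) for image in image_embeddings]
        # Rank image indices from most to least similar to the query.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if ground_truth[query_idx] in ranked[:k]:
            hits += 1
    return hits / len(text_embeddings)


# Toy data standing in for encoder outputs (hypothetical values).
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embeddings = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1], [0.1, 1.0]]
ground_truth = [0, 1, 2, 1]

recall_1 = recall_at_k(text_embeddings, image_embeddings, ground_truth, k=1)
recall_2 = recall_at_k(text_embeddings, image_embeddings, ground_truth, k=2)
print(f"Recall@1 = {recall_1:.2f}, Recall@2 = {recall_2:.2f}")
```

In a real CLIP-style system the vectors would come from the image encoder and the text encoder after projection into the shared space, and the ranking would normally be done with batched matrix operations rather than Python loops.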