Hi! I am not in this research field, but I want to try applying ModalFormer to other scenarios. I would like to ask: are the features of other modalities mentioned in the paper (such as those from CLIP, ImageBind, etc.) provided in the dataset, or are they extracted from images by the model?