modality image

Hi! I'm not from this research field, but I'd like to try applying ModalFormer to other scenarios. Are the features of the other modalities mentioned in the paper (such as those from CLIP, ImageBind, etc.) provided with the dataset, or does the model extract them from the images?