I sincerely appreciate your outstanding work! I recently used the adversarially fine-tuned ViT-L/14 CLIP model ($FARE^4$) that you provided as the vision_encoder_pretrained model and ran an evaluation on the Flickr30k dataset with the APGD attack, using llava_eval.sh.
However, the CIDEr score I obtained differs significantly from the results presented in Table 1. This discrepancy has left me somewhat puzzled, and I would greatly appreciate any insight into what might cause it.
Looking forward to your response. Thank you for your time and assistance!