Thanks for your great work. As mentioned in the paper, the multi-modal transformer encoder is randomly initialized. Why not initialize the encoder with the pre-trained weights of BERT instead? Would that hurt performance?