What do you think about combining your architecture with existing pre-trained encoders? Could using BERT as a **prior_encoder** help achieve better results?