🖥 Benchmarking transformers
Hi there,
When I run one of the examples in the text classification folder and pass max_seq_length=1024 to the model, I get the following warning: WARNING - main - The max_seq_length passed (1024) is larger than the maximum length for the model (512). Using max_seq_length=512.
Set-up
I'm running on a GPU node with the following command.
python ./examples/text-classification/run_glue.py \
  --model_name_or_path bert-base-cased \
  --task_name mrpc \
  --do_train \
  --do_eval \
  --max_seq_length 1024 \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 1 \
  --overwrite_output_dir \
  --output_dir /tmp/mrpc/
The run still completes and produces output, but it uses max_seq_length=512 instead of the 1024 I passed.
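To make the behaviour concrete, here is a minimal sketch of the clamping I believe is happening. The helper name effective_max_seq_length is my own and is not taken from run_glue.py, and the 512 limit is simply the value reported in the warning, so treat this as an assumption about what the script does rather than its actual code.

```python
import logging

logger = logging.getLogger(__name__)


def effective_max_seq_length(requested: int, model_max: int) -> int:
    """Hypothetical helper: clamp the requested length to the model's limit."""
    if requested > model_max:
        # Mirrors the wording of the warning quoted above.
        logger.warning(
            "The max_seq_length passed (%d) is larger than the maximum length "
            "for the model (%d). Using max_seq_length=%d.",
            requested, model_max, model_max,
        )
    return min(requested, model_max)


# 512 is the limit reported in the warning for bert-base-cased.
print(effective_max_seq_length(1024, 512))
print(effective_max_seq_length(128, 512))
```

If this is right, any value above the model's limit is silently reduced to that limit, while smaller values pass through unchanged.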
I'm wondering whether this is because the model is still limited to a maximum length of 512 tokens by memory requirements, like most transformer and BERT-based models, or whether it is caused by the default configuration used in the pre-training process. Also, in the paper, the authors mention two settings, one of which is 1024, so how can I get the pretrained model with max_seq_length=1024? Thanks!