3rd place solution by F.H.S.D.Y. of Jigsaw Unintended Bias in Toxicity Classification (https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification)
Please see https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/97471#latest-582610 for more information.
apex
attrdict==2.0.1
nltk==3.4.4
numpy==1.16.4
optuna==0.13.0
pandas==0.24.2
pytorch-pretrained-bert==0.6.2
scikit-learn==0.21.2
torch==1.1.0
tqdm==4.32.1
apex from https://github.com/NVIDIA/apex
If you use BERT, you have to install pytorch-pretrained-bert from pip.
pip install pytorch-pretrained-bert
If you use GPT2, you have to install pytorch-pretrained-bert from git.
pip install git+https://github.com/pronkinnikita/pytorch-pretrained-BERT
Hyper-parameters are managed by JSON files in config directory.
lm_model_name: required, type = str. Which language model to use.max_len: default =220, type = int. Maximum length of tokens used as input. In case of BERT, this contains[CLS]and[SEP].max_head_len: default =128, type = int. Maximum length of first tokens used as input. This doesn't contain[CLS]or[SEP]. For example, whenmax_len= 220,max_head_len= 128, and using BERT, the input will be first 128 tokens and last 90 tokens.epochs: default =2, type = int. Training epochs. This must be 2 or less.down_sample_frac: default =0.5, type = float. Rate of dropped sample when negative down sampling.lr: default =1.5e-5, type = float. Learning rate.batch_size: deafult =16, type = int. Batch size.accumulation_steps: default =4, type = int.gradient_accumulation_stepsinpytorch_pretrained_bert.warmup: default =0.05, type = float. Learning rate linearly increases from 0 tolroverwarmuprate of training steps, and linearly decreases fromlrto 0 over the rest steps.old_data: default =false, type = bool. Whether you use old toxic competition data as training data or not.old_fine_tuned: default =false, type = bool. Settrueif you use further fine-tuned weight with old toxic competition data as pre-trained weight,falseotherwise.device: default =cuda, type = str, options:cuda,cpu. Device used for running.seed: default =1234, type = int. The desired seed.dropout_rate: default =0.1, type = float. Dropout rate for GPT2.
max_len: default =220, type = int. Maximum length of tokens used as input.max_features: default =100000, type = int. Maximum number of tokens used as input overall.batch_size: default =512, type = int. Batch size.train_epochs: default =10, type = int. Training epochs.tolerance: default =10, type = int. When score does not improve overtoleranceepochs, training is aborted.num_folds: default =5, type = int. Number of folds of cross validation.lr: default =1e-3, type = float. Learning rate.loss_alpha: default =0.1, type = float. Coefficient for training weight.loss_beta: default =1.0, type = float. Coefficient for training weight.device: default =cuda, type = str, options:cuda,cpu. Device used for running.seed: default =1234, type = int. The desired seed.
max_len: default =220, type = int. Maximum length of tokens used as input.max_features: default =100000, type = int. Maximum number of tokens used as input overall.batch_size: default =512, type = int. Batch size.train_epochs: default =6, type = int. Training epochs.n_splits: default =5, type = int. Number of folds of cross validation.start_lr: default =1e-4, type = float. Initial learning rate of training.max_lr: default =5e-3, type = float. Maximum learning rate of training.last_lr: default =1e-3, type = float. Last learning rate of traiing.warmup: default =0.2, type = float. Learning rate increases fromstart_lrtomax_lroverwarmuprate of training steps following a cosine curve, and decreases frommax_lrtolast_lrover rest steps following a cosine curve.pseudo_label: default =true, type = bool. Whether you use pseudo labeling for training or not.mu: default =0.9, type = float. Rate of new weights in EMA.updates_per_epoch: default =10, type = int. How many times you update weights in EMA.lstm_gru: default ={}, type = dict. Hyper-parameters used in LSTM-GRU model.lstm_capsule_atten: default ={}, type = dict. Hyper-parameters used in LSTM-Capsule-Attention model.lstm_conv: default ={}, type = dict. Hyper-parameters used in LSTM-Conv model.device: default =cuda, type = str, options:cuda,cpu. Device used for running.seed: default =1234, type = int. The desired seed.
n_folds: default = 10, type = int. How many times you split validation data and run optuna.n_trials: default = 300, type = int. Number of trials of each running.threshold: default = 0.03, type = float. Acceptable error between train and valid score.
You need to specify which configuration JSON file to use.
$ python fine_tune_lm.py \
--config_file ./config/bert_large_cased.json
If you want to use further fine-tuned weights with old toxic competition data as pre-trained weights, first you need to run the script of further fine-tuning.
$ python fune_tune_lm.py \
--config_file ./config/bert_base_cased_old_fine_tune.json
And then run the script of main fine-tuning.
$ python fine_tune_lm.py \
--config_file ./config/bert_base_cased.json
You can run the script using
$ python train_lstm_f.py
You need to specify which model to use.
$ python train_lstm_s.py \
--lstm_model lstm_gru
You can choose from
lstm_grulstm_capsule_attenlstm_conv
First you need to train models and predict labels of validation data with argument --valid
$ python fine_tune_lm.py \
--config_file ./config/bert_large_cased.json
--valid
Then run a command
$ python compute_blending_weights.py