In get_s2s_data, the training files are first scanned to build the vocabulary, and only afterwards are they tokenized, with the vocabulary used to convert the tokens into indices.

    src_dict, inv_src_dict = top_words_train_valid(src_train_path, src_valid_path)

    print("Tokenizing src_train_path")
    src_train_sent = tokenize_text(src_train_path, vocab=src_dict)

I think the order should be: tokenization -> vocabulary -> conversion into indices.
mxnet_seq2seq/utils.py
Lines 99 to 102 in c57d892
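
To make the suggested order concrete, here is a rough sketch of what I have in mind. The helper names (`tokenize`, `build_vocab`, `to_indices`) and the `<unk>` handling are my own assumptions for illustration; they are not the functions currently in utils.py, and the whitespace tokenizer is only a stand-in for whatever tokenizer the project uses.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not the repository's API.

def tokenize(lines):
    """Step 1: split each line into lowercase, whitespace-separated tokens."""
    return [line.lower().split() for line in lines]

def build_vocab(tokenized, max_size=None, unk="<unk>"):
    """Step 2: build token -> index from already-tokenized text.

    Index 0 is reserved for the unknown token; max_size (if given)
    caps the total vocabulary size, including the unknown token.
    """
    counts = Counter(tok for sent in tokenized for tok in sent)
    budget = None if max_size is None else max(max_size - 1, 0)
    vocab = {unk: 0}
    for tok, _count in counts.most_common(budget):
        vocab[tok] = len(vocab)
    return vocab

def to_indices(tokenized, vocab, unk="<unk>"):
    """Step 3: replace each token by its index, falling back to unk."""
    return [[vocab.get(tok, vocab[unk]) for tok in sent] for sent in tokenized]

# Toy data standing in for the contents of the training file.
train_lines = ["the cat sat", "the dog sat down"]

train_tokens = tokenize(train_lines)             # 1. tokenization
src_vocab = build_vocab(train_tokens)            # 2. vocabulary
train_ids = to_indices(train_tokens, src_vocab)  # 3. conversion into indices
```

With this order the vocabulary is counted over exactly the tokens that are later converted, so the tokenizer runs once per file and the two steps cannot disagree about what a token is.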