Vocabularies created before tokenization

In get_s2s_data, the training files are first scanned to build the vocabulary, and only afterwards are they tokenized, with the vocabulary used to convert the tokens into indices.
https://github.com/mkolod/mxnet_seq2seq/blob/c57d8920e2d7a07faa6c518f427b3dd2d90ef7a3/utils.py#L99-L102

I think the order should be: tokenization -> vocabulary -> conversion into indices
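
A minimal sketch of that order, to make the suggestion concrete. This is not the repository's code: tokenize, build_vocab, to_indices and the special tokens "<pad>" and "<unk>" are hypothetical names chosen for illustration, and whitespace splitting stands in for whatever tokenizer the project actually uses.

```python
from collections import Counter


def tokenize(line):
    # Hypothetical tokenizer: plain whitespace splitting.
    return line.strip().split()


def build_vocab(tokenized_sentences, max_size=None, specials=("<pad>", "<unk>")):
    # Count tokens that have ALREADY been tokenized, so the vocabulary
    # is built from exactly the units that will later be looked up.
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in counts.most_common(max_size):
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def to_indices(tokenized_sentences, vocab, unk="<unk>"):
    # Map each token to its index; unseen tokens fall back to <unk>.
    unk_id = vocab[unk]
    return [[vocab.get(tok, unk_id) for tok in sent] for sent in tokenized_sentences]


# Step 1: tokenize, step 2: build the vocabulary, step 3: convert to indices.
lines = ["the cat sat", "the dog sat down"]
tokens = [tokenize(line) for line in lines]
vocab = build_vocab(tokens)
ids = to_indices(tokens, vocab)
```

With this ordering, every token produced in step 1 is guaranteed to be a candidate for the vocabulary in step 2, so a mismatch between how the vocabulary was collected and how the text is split cannot occur.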

The lines referenced above (utils.py#L99-L102):

	src_dict, inv_src_dict = top_words_train_valid(src_train_path, src_valid_path)

	print("Tokenizing src_train_path")
	src_train_sent = tokenize_text(src_train_path, vocab=src_dict)