* Python 2.7
* NumPy
* [DyNet]
The vocabulary can be rebuilt from the training sentence file on every run, or loaded from a JSON file, which is much faster. To learn the vocabulary from a training sentence file and write it to a JSON file, run the following command:
python src/main.py --train data/ctb/ctb.train.seg.append --write-vocab data/vocab.json
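The build-once, load-afterwards pattern described above can be sketched as follows. This is only an illustration: the function name `load_or_build_vocab`, the whitespace tokenization, and the token-to-count JSON layout are assumptions, not the format that `src/main.py` actually writes.

```python
import collections
import io
import json
import os


def load_or_build_vocab(train_path, vocab_path):
    """Return a token -> count mapping, cached as JSON (hypothetical layout)."""
    # Fast path: reuse the cached JSON file if it already exists.
    if os.path.exists(vocab_path):
        with open(vocab_path) as f:
            return json.load(f)

    # Slow path: scan the whole training file once and count tokens.
    counts = collections.Counter()
    with io.open(train_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

    vocab = dict(counts)
    # Non-ASCII tokens are escaped by json.dump, so a plain text file is enough.
    with open(vocab_path, "w") as f:
        json.dump(vocab, f)
    return vocab
```

The second and later calls skip the scan of the training file entirely, which is where the speed-up comes from.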
Training requires a file containing training sentences (--train) and a file containing validation sentences (--dev). The validation sentences are parsed four times per training epoch to determine which model to keep. A file name must also be provided to store the saved model (--model). The following example command trains a model for three epochs with the remaining options left at their default values:
python src/main.py --train data/ctb/ctb.train.seg.append --dynet-mem 2000 --dev data/ctb/ctb.dev.seg.append --vocab data/vocab.json --model data/my_model --epoch 3
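The model selection described above (four validation checks per epoch, keeping the best model) can be sketched roughly as below. The names `check_points`, `train_step`, `evaluate`, and `save` are hypothetical placeholders, and the exact spacing of the checks within an epoch is an assumption; the actual loop in `src/main.py` may differ.

```python
def check_points(num_batches, checks_per_epoch=4):
    """Return the 1-based batch indices after which the dev set is evaluated."""
    return sorted({
        max(1, (k * num_batches) // checks_per_epoch)
        for k in range(1, checks_per_epoch + 1)
    })


def train(batches, num_epochs, train_step, evaluate, save):
    """Run training and save the model whenever the dev score improves."""
    best = float("-inf")
    points = set(check_points(len(batches)))
    for _ in range(num_epochs):
        for i, batch in enumerate(batches, start=1):
            train_step(batch)
            if i in points:
                score = evaluate()   # e.g. F-score on the --dev file
                if score > best:     # keep only the best model so far
                    best = score
                    save()           # e.g. write to the --model path
    return best
```

With 10 batches per epoch, `check_points(10)` yields evaluations after batches 2, 5, 7, and 10.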
The following table provides an overview of additional training options:
| Argument | Description | Default |
|---|---|---|
| --dynet-mem | Memory (MB) to allocate for DyNet | 2000 |
| --dynet-l2 | L2 regularization factor | 0 |
| --dynet-seed | Seed for random parameter initialization | random |
| --bigrams-dims | Bigram embedding dimensions | 50 |
| --unigrams-dims | Unigram embedding dimensions | 20 |
| --lstm-units | LSTM units (per direction, for each of 2 layers) | 200 |
| --hidden-units | Units for ReLU FC layer (each of 2 action types) | 200 |
| --epochs | Number of training epochs | 10 |
| --batch-size | Number of sentences per training update | 10 |
| --droprate | Dropout probability | 0.5 |
| --unk-param | Parameter z for random UNKing | 0.8375 |
| --np-seed | Seed for shuffling and softmax sampling | random |
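
The table does not spell out how the --unk-param value z is used. A minimal sketch, assuming the common formulation in which a token seen c times in training is replaced by an unknown symbol with probability z / (z + c), is shown below. The function names and the `<UNK>` symbol are hypothetical, and the formula itself is an assumption rather than something confirmed by this README.

```python
import random


def unk_probability(count, z=0.8375):
    """Probability of replacing a token seen `count` times (assumed formula)."""
    return z / (z + count)


def maybe_unk(token, counts, z=0.8375, rng=random, unk="<UNK>"):
    """Randomly replace rare tokens with `unk` during training."""
    count = counts.get(token, 0)
    # Unseen tokens are always unknown; seen tokens are replaced at random,
    # less often the more frequent they are.
    if count == 0 or rng.random() < unk_probability(count, z):
        return unk
    return token
```

Under this assumption, frequent tokens are almost never replaced, while rare tokens are replaced often enough for the model to learn a useful representation of the unknown symbol.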
A trained model can also be evaluated directly against a reference corpus by supplying the --test argument:
python src/main.py --test data/ctb/ctb.test.seg.append --vocab data/vocab.json --model data/my_model2