An Encoder-Decoder Transformer-based translator from English to Russian.
This repository contains a Transformer implementation for the machine translation task.
The model is trained on a dataset of English ↔ Russian sentence pairs.
The dataset is located in the data/raw_data folder and consists of two files:
- data.en: English sentences
- data.ru: Russian sentences
The files are line-aligned: line N of data.en and line N of data.ru together form one translation pair.
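Because the pairing is purely positional, it can be worth confirming that both files have the same number of lines before training. A minimal sketch, assuming UTF-8 encoded files; `load_pairs` is a hypothetical helper and not part of this repository:

```python
def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_pairs(en_path, ru_path):
    """Read two line-aligned files and return a list of (english, russian) pairs."""
    en_lines = read_lines(en_path)
    ru_lines = read_lines(ru_path)
    if len(en_lines) != len(ru_lines):
        # A mismatch means at least one sentence has lost its translation.
        raise ValueError(
            f"line count mismatch: {len(en_lines)} English vs {len(ru_lines)} Russian"
        )
    return list(zip(en_lines, ru_lines))
```

For the layout described above, this would be called as `load_pairs("data/raw_data/data.en", "data/raw_data/data.ru")`.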
For training you will need:
- Tokenizers (English and Russian)
- Files split into train/val/test in the format:
train.en, train.ru, val.en, val.ru, test.en, test.ru
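One way to produce these six files from a list of sentence pairs is a seeded shuffle followed by a ratio-based cut. The sketch below is illustrative only: `split_pairs`, `write_split`, and the 90/5/5 ratio are assumptions, and the repository's data_prepare.py may split the data differently.

```python
import random
from pathlib import Path


def split_pairs(pairs, val_frac=0.05, test_frac=0.05, seed=0):
    """Shuffle pairs deterministically and cut them into train/val/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "val": shuffled[:n_val],
        "test": shuffled[n_val:n_val + n_test],
        "train": shuffled[n_val + n_test:],
    }


def write_split(splits, output_dir):
    """Write each split as two line-aligned files, e.g. train.en and train.ru."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        en_path = out / f"{name}.en"
        ru_path = out / f"{name}.ru"
        with open(en_path, "w", encoding="utf-8") as f_en, \
             open(ru_path, "w", encoding="utf-8") as f_ru:
            for en, ru in rows:
                f_en.write(en + "\n")
                f_ru.write(ru + "\n")
```

Shuffling the pairs as tuples, rather than shuffling each file separately, keeps every English sentence on the same line number as its Russian translation.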
git clone https://github.com/Lexus-FAMCS/Neural-Machine-Translation.git
cd Neural-Machine-Translation
python3 -m venv your_env
source your_env/bin/activate
pip install -r requirements.txt
Download the pretrained model weights from the link and run translation generation:
python3 translate.py --model_path path_to_pretrained_model --tokenizers_path your_repo/data/processed_data --model_max_len 64 --device your_device --num_layers 8 --d_model 1024 --num_heads 8 --d_hid 4096 --temperature 0.5
You can view the training results of the pretrained model with:
tensorboard --logdir your_repo/runs/pretrained_model
You can use the existing processed data and tokenizers in the data/processed_data folder.
Or, if you have your own dataset with English ↔ Russian sentence pairs, you can preprocess it yourself:
python3 data_prepare.py --data_dir your_repo/data/your_raw_data --output_dir your_repo/data/your_processed_data
Run training:
python3 train.py --data_dir your_repo/data/your_processed_data --batch_size 64 --epochs 10 --learning_rate 1e-4 --device your_device --num_layers 6 --d_model 512 --num_heads 8 --d_hid 2048 --dropout 0.1 --train_log_interval 150 --eval_log_interval 500 --output_dir your_output_dir
The trained model and logs will be saved to your_repo/runs/your_output_dir.
Launch TensorBoard to view training progress:
tensorboard --logdir your_repo/runs/your_output_dir
For help on each script:
python3 script.py --help
Translations are saved in the translations.txt file.
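The --temperature flag in the translate.py command above presumably controls how sharply the decoder samples each next token. The repository's decoding code is not shown here, so the following is only a sketch of standard temperature sampling over a vector of logits; `softmax_with_temperature` and `sample_with_temperature` are hypothetical names.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Temperatures below 1, such as the 0.5 used above, concentrate probability on the highest-scoring tokens, while values above 1 flatten the distribution and make the output more varied.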
The model struggles with individual words and short phrases (3–5 words). I believe this is because it was trained on sentences with a mean length of ~7 words. In addition, the model has only ~265M parameters and was trained on a small dataset.
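As a rough check on the ~265M figure, the parameters outside the embeddings can be counted from the translate.py settings above (8 layers, d_model 1024, d_hid 4096). The sketch assumes the standard layout from the original Transformer paper: equal encoder and decoder depth, linear layers with biases, two LayerNorms per encoder layer and three per decoder layer. The actual model code may differ, and the vocabulary size is not stated here, so embeddings are left out.

```python
def attention_params(d_model):
    """Q, K, V and output projections, each a d_model x d_model matrix plus bias."""
    return 4 * (d_model * d_model + d_model)


def feed_forward_params(d_model, d_hid):
    """Two linear layers (d_model -> d_hid -> d_model) with biases."""
    return (d_model * d_hid + d_hid) + (d_hid * d_model + d_model)


def layer_norm_params(d_model):
    """One scale vector and one shift vector."""
    return 2 * d_model


def transformer_body_params(num_layers, d_model, d_hid):
    """Parameters in the encoder and decoder stacks, excluding embeddings."""
    attn = attention_params(d_model)
    ffn = feed_forward_params(d_model, d_hid)
    norm = layer_norm_params(d_model)
    encoder_layer = attn + ffn + 2 * norm      # self-attention + feed-forward
    decoder_layer = 2 * attn + ffn + 3 * norm  # self- and cross-attention + feed-forward
    return num_layers * (encoder_layer + decoder_layer)


print(transformer_body_params(num_layers=8, d_model=1024, d_hid=4096))
# prints 235143168
```

Under these assumptions the two stacks account for about 235M parameters, which would leave roughly 30M for the token embeddings and the output projection.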
Possible improvements:
- Increase the model size
- Use a larger and more diverse dataset
- Pre-train the model on dictionary word pairs
📄 More about the architecture: Attention Is All You Need