Neural Machine Translation 🇬🇧➡️🇷🇺

An encoder-decoder Transformer that translates from English to Russian.

📚 Description

This repository contains a Transformer implementation for the machine translation task.
The model is trained on a dataset of parallel English-Russian sentence pairs.

📂 Dataset

The dataset is located in the data/raw_data folder and consists of two files:

data.en — English sentences
data.ru — Russian sentences

The two files are line-aligned: line N of data.en and line N of data.ru together form one translation pair.
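
As an illustration of the line-aligned layout, here is a minimal sketch that reads both files into a list of pairs. The function name load_pairs is hypothetical and is not part of this repository; the length check is an extra safeguard added for the example.

```python
def load_pairs(en_path, ru_path):
    """Read two line-aligned files and return a list of (english, russian) tuples.

    Hypothetical helper for illustration; not taken from the repository scripts.
    """
    with open(en_path, encoding="utf-8") as f:
        en_lines = [line.rstrip("\n") for line in f]
    with open(ru_path, encoding="utf-8") as f:
        ru_lines = [line.rstrip("\n") for line in f]

    # The files only make sense as a parallel corpus if they have equal length.
    if len(en_lines) != len(ru_lines):
        raise ValueError(
            f"Line count mismatch: {len(en_lines)} English vs {len(ru_lines)} Russian"
        )
    return list(zip(en_lines, ru_lines))
```

For the layout described above, it would be called as load_pairs("data/raw_data/data.en", "data/raw_data/data.ru").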

🗂️ Preparing the dataset

For training you will need:

Tokenizers (English and Russian)
Files split into train/val/test in the format: train.en, train.ru, val.en, val.ru, test.en, test.ru
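
To make the expected file layout concrete, here is a rough sketch of how sentence pairs could be shuffled and written out as train/val/test files. It is not the code in data_prepare.py; the function names, the split fractions, and the seed are all assumptions chosen for the example.

```python
import random
from pathlib import Path


def split_pairs(pairs, val_frac=0.05, test_frac=0.05, seed=0):
    """Shuffle (en, ru) pairs and split them into train/val/test.

    The fractions and seed are illustrative defaults, not the repository's values.
    """
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_val = int(len(pairs) * val_frac)
    n_test = int(len(pairs) * test_frac)
    return {
        "val": pairs[:n_val],
        "test": pairs[n_val:n_val + n_test],
        "train": pairs[n_val + n_test:],
    }


def write_splits(splits, output_dir):
    """Write each split as <name>.en and <name>.ru, one sentence per line."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, split in splits.items():
        en_file = out / f"{name}.en"
        ru_file = out / f"{name}.ru"
        with open(en_file, "w", encoding="utf-8") as f_en, \
                open(ru_file, "w", encoding="utf-8") as f_ru:
            for en, ru in split:
                f_en.write(en + "\n")
                f_ru.write(ru + "\n")
```

Because each pair is kept together through the shuffle, the .en and .ru files of every split stay line-aligned.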

🚀 Quick Start

1️⃣ Clone the repository

git clone https://github.com/Lexus-FAMCS/Neural-Machine-Translation.git
cd Neural-Machine-Translation

2️⃣ Create an environment and install dependencies

python3 -m venv your_env
source your_env/bin/activate
pip install -r requirements.txt

3️⃣ Download the pretrained model weights

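Download the pretrained model weights from the link and run translation generation. The architecture flags (--num_layers, --d_model, --num_heads, --d_hid) presumably need to match the values the checkpoint was trained with: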

python3 translate.py --model_path path_to_pretrained_model --tokenizers_path your_repo/data/processed_data --model_max_len 64 --device your_device --num_layers 8 --d_model 1024 --num_heads 8 --d_hid 4096 --temperature 0.5
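
The --temperature flag is not explained here. Assuming it works the usual way (logits are divided by the temperature before the softmax, then the next token is sampled), the effect can be sketched as follows. This is a generic illustration with hypothetical function names, not the code in translate.py.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Sample a token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Under this assumption, a temperature below 1 (such as 0.5) sharpens the distribution toward the most likely token, while a temperature above 1 flattens it and makes the output more varied.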

You can view the training results of the pretrained model with:

tensorboard --logdir your_repo/runs/pretrained_model

⚙️ Advanced: Train on your own dataset

You can use the existing processed data and tokenizers in the data/processed_data folder.

Or, if you have your own dataset with English ↔ Russian sentence pairs, you can preprocess it yourself:

python3 data_prepare.py --data_dir your_repo/data/your_raw_data --output_dir your_repo/data/your_processed_data

Run training:

python3 train.py --data_dir your_repo/data/your_processed_data --batch_size 64 --epochs 10 --learning_rate 1e-4 --device your_device --num_layers 6 --d_model 512 --num_heads 8 --d_hid 2048 --dropout 0.1 --train_log_interval 150 --eval_log_interval 500 --output_dir your_output_dir

The trained model and logs will be saved to your_repo/runs/your_output_dir.

Launch TensorBoard to view training progress:

tensorboard --logdir your_repo/runs/your_output_dir

🔍 More about each script

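To see the available options for a script (data_prepare.py, train.py, or translate.py), run it with --help, replacing script.py with the script name: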

python3 script.py --help

✅ Results

Translation results are saved in the translations.txt file.

⚡ Conclusions

The model struggles with individual words and short phrases (3–5 words). I believe this is because the training sentences have a mean length of about 7 words, so very short inputs are underrepresented. In addition, the model has only ~265M parameters and was trained on a small dataset.
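
As a rough sanity check on the parameter count, the sketch below estimates the weights in the encoder and decoder layers from the hyperparameters passed to translate.py. It assumes the standard layer layout from Attention Is All You Need, that --num_layers applies to both the encoder and the decoder, and it ignores embeddings, the output projection, biases, and normalization parameters. The function name is hypothetical.

```python
def estimate_layer_params(num_layers, d_model, d_hid):
    """Approximate weight count of the encoder and decoder stacks.

    Assumes a standard Transformer layout; embeddings, biases, and
    normalization parameters are deliberately left out.
    """
    attention = 4 * d_model * d_model      # Q, K, V, and output projections
    feed_forward = 2 * d_model * d_hid     # two linear layers
    encoder_layer = attention + feed_forward
    decoder_layer = 2 * attention + feed_forward  # self- plus cross-attention
    return num_layers * (encoder_layer + decoder_layer)


if __name__ == "__main__":
    # Hyperparameters from the pretrained-model command in Quick Start.
    print(estimate_layer_params(num_layers=8, d_model=1024, d_hid=4096))
```

With the pretrained settings (8 layers, d_model 1024, d_hid 4096) this gives 234,881,024, or about 235M. Under these assumptions, the rest of the ~265M would come from the embeddings and the other omitted parameters, whose size depends on vocabulary sizes not stated here.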

🔭 Next steps

Increase the model size
Use a larger and more diverse dataset
Pre-train the model on dictionary word pairs

📄 More about the architecture: Attention Is All You Need
