ayshushus/autocomplete

Learned Translation Autocomplete for Mozilla Pontoon

Comparing fuzzy, recurrent, and Transformer models for interactive translation autocomplete, trained on Mozilla Pontoon's translation memory.

APS360 — Applied Fundamentals of Deep Learning, University of Toronto.

Overview

Localizers on Pontoon spend most of their time translating large volumes of short, repetitive UI strings. Pontoon already surfaces exact and fuzzy translation-memory (TM) matches, but it has no learned, interactive autocomplete that predicts the rest of a translation as the localizer types.

This project builds and compares such a model. Given an English source string and the partial target typed so far, the system predicts the most likely completion:

source  = "Save your changes"
prefix  = "Enregistrer vos"
predict → "modifications"

The contribution is a controlled comparison of model families on the same task and data pipeline, in both a high-resource (en→fr) and a low-resource (en→hi) setting.

Models compared

| Model | Type | Role |
|---|---|---|
| Fuzzy TM match | Edit-distance retrieval (no training) | Baseline: what Pontoon does today |
| GRU seq2seq + attention | Recurrent encoder–decoder | Learned |
| LSTM seq2seq + attention | Recurrent encoder–decoder | Learned |
| Transformer | Self-attention encoder–decoder | Learned |
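
As a rough illustration of the fuzzy baseline row, retrieval can be sketched as a nearest-neighbour search over TM sources. This is a minimal sketch, not Pontoon's matcher or this repo's implementation: it assumes `difflib.SequenceMatcher.ratio()` from the standard library as a stand-in for an edit-distance similarity, and `fuzzy_suggest`, `min_score`, and the toy TM are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a normalized edit-distance score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_suggest(source, prefix, tm, min_score=0.7):
    """Return the suffix of the best TM match that agrees with the typed prefix, or None."""
    best_score, best_target = 0.0, None
    for tm_source, tm_target in tm:
        score = similarity(source, tm_source)
        # Keep only matches that are similar enough and consistent with what was typed.
        if score >= min_score and score > best_score and tm_target.startswith(prefix):
            best_score, best_target = score, tm_target
    if best_target is None:
        return None
    return best_target[len(prefix):].lstrip()


# Toy translation memory (hypothetical entries).
tm = [
    ("Save your changes", "Enregistrer vos modifications"),
    ("Discard your changes", "Annuler vos modifications"),
]
print(fuzzy_suggest("Save your changes", "Enregistrer vos", tm))
```

A linear scan like this is only practical for small TMs; a real baseline would likely index the TM first.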

All learned models are standard NMT seq2seq networks. Interactive autocomplete is obtained via prefix-constrained decoding — the decoder is force-decoded on the typed prefix and then generates the continuation — so a single trained model supports both full-string suggestion and autocomplete with no extra network.
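
Prefix-constrained decoding as described above can be sketched without any deep-learning framework. The sketch below is illustrative only: `StepFn` is a hypothetical interface standing in for one decoder step of a trained model, `toy_step` is a lookup table rather than a network, and decoding is greedy over whole words, whereas the real `decode.py` presumably works on BPE tokens and may use beam search.

```python
from typing import Callable, Dict, List

BOS, EOS = "<s>", "</s>"

# Hypothetical interface: (source tokens, target tokens so far) -> next-token scores.
StepFn = Callable[[List[str], List[str]], Dict[str, float]]


def complete(step_fn: StepFn, source: List[str], prefix: List[str], max_len: int = 20) -> List[str]:
    """Greedy prefix-constrained decoding: force the typed prefix, then generate."""
    # Force-decoding: the typed prefix becomes the decoder history unchanged,
    # so the model's own predictions for those positions are never used.
    target = [BOS] + list(prefix)
    suffix: List[str] = []
    while len(target) - 1 < max_len:
        scores = step_fn(source, target)
        token = max(scores, key=scores.get)
        if token == EOS:
            break
        target.append(token)
        suffix.append(token)
    return suffix


def toy_step(source: List[str], target: List[str]) -> Dict[str, float]:
    """Toy stand-in for a trained decoder: a lookup on the last target token."""
    table = {
        BOS: {"Enregistrer": 0.9, "Sauvegarder": 0.1},
        "Enregistrer": {"vos": 0.8, "les": 0.2},
        "vos": {"modifications": 0.7, "changements": 0.3},
        "modifications": {EOS: 1.0},
    }
    return table.get(target[-1], {EOS: 1.0})


source = ["Save", "your", "changes"]
print(complete(toy_step, source, ["Enregistrer", "vos"]))  # autocomplete
print(complete(toy_step, source, []))                      # full-string suggestion
```

With an empty prefix the same function produces a full-string suggestion, which is why no separate network is needed for autocomplete.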

Repository structure

.
├── data/
│   ├── raw/              # Pontoon SQL dump (not committed — see Data)
│   └── processed/        # cleaned (source, prefix) → suffix triples
├── src/
│   ├── data/             # cleaning, placeable stripping, BPE, splits
│   ├── models/           # fuzzy baseline, GRU, LSTM, Transformer
│   ├── train.py          # training loop (PyTorch)
│   ├── evaluate.py       # top-k accuracy, keystrokes saved, BLEU/chrF
│   └── decode.py         # prefix-constrained autocomplete decoding
├── notebooks/            # exploration + result plots
├── configs/              # per-model / per-locale hyperparameters
├── requirements.txt
└── README.md

Data

Source data is a SQL dump of Pontoon's database (source entities, translations, target locale, approval/quality flags), used with access and domain guidance from the Pontoon engineering team. The dump is not committed to this repo (size and licensing); see src/data/ for the pipeline that reproduces the processed dataset.

Cleaning pipeline:

  1. Extract approved (source, target) pairs per locale; drop rejected, empty, and duplicate rows.
  2. Strip / normalize Pontoon placeables — HTML tags, printf tokens (%s), variables ({$var}) — while preserving placeholder slots.
  3. Unicode-normalize and apply byte-pair-encoding subword tokenization.
  4. Split train/validation/test by source string to prevent near-duplicate leakage.
  5. Build autocomplete examples by truncating each target at random cut points into (source, typed-prefix) → remaining-suffix triples.
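
Step 2 can be illustrated with a small masking function that replaces each placeable with a numbered slot. This is a sketch under stated assumptions, not the pipeline in src/data/: the regular expression covers only simple HTML tags, `{$var}`-style variables, and a few printf forms, and the `[PHn]` slot format and `mask_placeables` name are hypothetical. Unicode normalization (NFC is assumed here) is applied first.

```python
import re
import unicodedata

# Assumed placeable patterns; a real pipeline may need to cover more cases.
PLACEABLE = re.compile(
    r"""
    <[^<>]+>                 # HTML tags, opening or closing
    | \{\s*\$[\w-]+\s*\}     # variables written as {$var}
    | %(?:\d+\$)?[sd]        # printf tokens: %s, %d, %1$s
    """,
    re.VERBOSE,
)


def mask_placeables(text: str):
    """Replace each placeable with a numbered slot; return (masked_text, originals)."""
    originals = []

    def to_slot(match):
        originals.append(match.group(0))
        return f"[PH{len(originals) - 1}]"

    masked = PLACEABLE.sub(to_slot, unicodedata.normalize("NFC", text))
    return masked, originals


masked, originals = mask_placeables("Hello <b>{$name}</b>, you have %d new messages")
print(masked)
print(originals)
```

Slots here are numbered by order of appearance, so source and target may number the same placeable differently when word order changes; aligning them would need extra logic.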

Locales: en→fr (high-resource) and en→hi (low-resource), to study how each architecture degrades as data shrinks.
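
Steps 4 and 5 of the cleaning pipeline can be sketched as follows. The details are assumptions for illustration: the split is derived from a hash of the lowercased, whitespace-collapsed source, the split percentages are arbitrary, and cuts fall on word boundaries, whereas the actual pipeline may cut on BPE tokens or characters. `split_of` and `make_examples` are hypothetical names.

```python
import hashlib
import random


def split_of(source: str, val_pct: int = 5, test_pct: int = 5) -> str:
    """Assign a split from a hash of the normalized source, so that all pairs
    sharing a source string land in the same split."""
    key = " ".join(source.lower().split())
    bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


def make_examples(source: str, target: str, n_cuts: int = 2, seed: int = 0):
    """Cut the target at random word boundaries into (source, prefix, suffix) triples."""
    words = target.split()
    rng = random.Random(seed)
    # Cut positions 0..len(words)-1 keep the suffix non-empty; 0 means an empty prefix.
    cuts = sorted(rng.sample(range(len(words)), k=min(n_cuts, len(words))))
    return [(source, " ".join(words[:c]), " ".join(words[c:])) for c in cuts]


for example in make_examples("Save your changes", "Enregistrer vos modifications", n_cuts=3):
    print(split_of(example[0]), example)
```

Hashing the source rather than shuffling rows keeps the assignment stable across reruns, and applying it before truncation keeps every triple derived from one source string in a single split.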

Setup

git clone <this-repo-url>
cd <repo>
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Requires Python 3.10+ and PyTorch. A GPU is recommended for the Transformer and RNN runs.

Usage

# 1. Build the processed dataset from the raw Pontoon dump
python -m src.data.build --dump data/raw/pontoon.sql --locale fr --out data/processed/fr

# 2. Train a model
python src/train.py --config configs/transformer_fr.yaml

# 3. Evaluate against the fuzzy baseline
python src/evaluate.py --model transformer --locale fr

# 4. Try interactive autocomplete
python src/decode.py --model transformer --locale fr \
    --source "Save your changes" --prefix "Enregistrer vos"

Evaluation

Each model is reported against the fuzzy baseline using:

  • Top-k suffix accuracy — is the correct continuation in the model's top-k predictions?
  • Keystrokes saved — fraction of characters the localizer avoids typing.
  • BLEU / chrF — full-string translation quality.
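
The first two metrics can be made concrete with a short sketch. The acceptance model here is an assumption, not necessarily what evaluate.py does: a suggestion counts as saving keystrokes only when the top prediction equals the reference suffix exactly, and the cost of the accept key is ignored. Function names and the toy inputs are hypothetical; BLEU and chrF are not shown.

```python
def top_k_accuracy(ranked_predictions, references, k=5):
    """Fraction of examples whose reference suffix is among the first k predictions."""
    if not references:
        return 0.0
    hits = sum(ref in preds[:k] for preds, ref in zip(ranked_predictions, references))
    return hits / len(references)


def keystrokes_saved(examples):
    """examples: iterable of (target, typed_prefix, predicted_suffix).

    Returns the fraction of target characters that did not have to be typed,
    counting a suggestion as accepted only on an exact match with the true suffix.
    """
    saved = total = 0
    for target, prefix, predicted in examples:
        total += len(target)
        if predicted == target[len(prefix):]:
            saved += len(target) - len(prefix)
    return saved / total if total else 0.0


ranked = [["modifications", "changements"], ["fichier"]]
refs = ["changements", "dossier"]
print(top_k_accuracy(ranked, refs, k=1), top_k_accuracy(ranked, refs, k=2))

target = "Enregistrer vos modifications"
print(keystrokes_saved([(target, "Enregistrer vos ", "modifications")]))
```

Partial credit for suggestions that are correct only up to some character is a common refinement, but it requires modelling how the localizer edits a suggestion and is left out here.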

This repo accompanies the APS360 project. Results are presented in the project's final report, and metrics will be added here as experiments complete.

Ethical considerations

The TM is built from work contributed by volunteer localizers, so consent, attribution, and licensing are respected, and personal data in strings is scrubbed before training. Autocomplete can introduce automation bias — anchoring localizers and homogenizing translations, which is especially risky for low-resource locales. The model only learns from past translations and will reproduce biases already in the TM. A self-hosted, offline model avoids sending community strings to external commercial APIs.

Acknowledgements

Thanks to the Mozilla Pontoon engineering team for data access and domain guidance. All model design, data cleaning, and training in this repository are the author's own work.

References

Key prior work: seq2seq learning (Sutskever et al., 2014), attention (Bahdanau et al., 2015), LSTM (Hochreiter & Schmidhuber, 1997), GRU (Cho et al., 2014), the Transformer (Vaswani et al., 2017), and fuzzy translation memory (Koehn & Senellart, 2010). Full citations are in references.bib.
