This repository accompanies the paper “Evaluating the Impact of Verbal Multiword Expressions on Machine Translation.”
It provides:
- the processed artifacts used in the paper, released under preset/
- the scripts required to rebuild datasets from source materials
- the pipelines to rerun the VMWE and WMT experiments
The paper studies how verbal multiword expressions (VMWEs) affect machine translation, with a focus on:
- VID — verbal idioms, e.g. spill the beans
- VPC — verb-particle constructions, e.g. give up
- LVC — light verb constructions, e.g. take a walk
Main finding: VMWEs consistently reduce translation quality, and a substantial portion of that degradation is attributable to the VMWE itself rather than to overall sentence difficulty.
- Quick Start
- Environment Setup
- Required Credentials
- Experiment Scope
- Released Artifacts in preset/
- Reproducing the Pipelines
- Citation
- License
If you only want to inspect the released paper artifacts and do not need to run the code, start in preset/.
Recommended entry points:
- Main VMWE dataset results: preset/VMWE/MT_eval/
- Paraphrase support analysis: preset/VMWE/MT_para_eval/
- WMT summary tables
- Per-example WMT subsets
For a file-by-file mapping to the paper, see Released Artifacts in preset/.
We recommend using a uv-managed virtual environment.
uv venv .venv
source .venv/bin/activate
uv pip install --torch-backend=auto -r requirements.txt
uv pip install --no-build-isolation flash-attn==2.7.4.post1
If you prefer not to activate the environment manually, you can also use uv run.
Some resources are downloaded automatically on first use, including:
- Hugging Face model weights
- COMET / MetricX resources
- nltk data
- spaCy's en_core_web_sm
| Stage / Model | Recommended Minimum Hardware |
|---|---|
| Main MT models | 1 × NVIDIA RTX 6000 Ada (48 GB) per active translation job |
| Paraphrase (Llama-3.3-70B) | 3 × NVIDIA RTX 6000 Ada (48 GB each) |
| MetricX evaluation | 2 × NVIDIA RTX 6000 Ada |
| XCOMET evaluation | 1 × NVIDIA RTX 6000 Ada |
| Dataset construction | CPU-only is sufficient |
Notes
- Memory requirements may vary by checkpoint, transformers version, and parallelization strategy.
- If heavy stages are run sequentially, the same 3-GPU machine can usually be reused across paraphrase, MT, and QE by reallocating devices.
- If you cannot host Llama-3.3-70B, use --paraphrase-model-id to substitute another paraphrasing backend.
Some experiments require external credentials or gated access.
The Google MT backend requires either:
- GOOGLE_CLOUD_PROJECT, or
- --google-project-id
You must also have valid Google Cloud credentials configured. See the official Google Cloud authentication guide.
API-based WMT candidate classification requires:
- OPENAI_API_KEY
Some checkpoints require:
- accepted model licenses
- an authenticated Hugging Face session
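Before launching a long run, it can help to check that the environment variables named above are set. The following is a minimal sketch, not code from the repository: the function name `missing_credentials` and the grouping of variables by feature are assumptions made for illustration. It does not account for passing --google-project-id on the command line instead of setting GOOGLE_CLOUD_PROJECT.

```python
import os

# Environment variables named in this README. The grouping by feature is an
# assumption made for illustration, not something the repository defines.
REQUIRED_ENV = {
    "Google MT backend": ["GOOGLE_CLOUD_PROJECT"],
    "API-based WMT classification": ["OPENAI_API_KEY"],
}


def missing_credentials(env=None):
    """Return {feature: [missing variable names]} for unset or empty variables."""
    env = os.environ if env is None else env
    missing = {}
    for feature, names in REQUIRED_ENV.items():
        absent = [name for name in names if not env.get(name)]
        if absent:
            missing[feature] = absent
    return missing


if __name__ == "__main__":
    for feature, names in missing_credentials().items():
        print(f"{feature}: missing {', '.join(names)}")
```

An empty result only means the variables are present; it does not verify that the credentials themselves are valid.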
Datasets
- LVC
- VPC
- VID
- Non_VMWE (contrast set)
Language directions
- en-cs
- en-de
- en-es
- en-ja
- en-ru
- en-tr
- en-zh
MT systems
- Google
- GemmaX2
- LLaMAX
- phi4
- Madlad
- M2M100
- opus
- seamless
QE models
- MetricX
- xCOMET
Years
- WMT 2017–2024
Language pairs
- en-cs
- en-de
- en-ru
- en-zh
Comparison types
- Human comparisons
- MT system comparisons
GPT-4.1 and GPT-5.1 appear only as support experiments in preset/VMWE/MT_eval/. They are included in the released artifacts but are not exposed as runnable public backends in the reproduction scripts.
The preset/ directory contains the processed outputs used in the paper, including dataset derivatives and experiment outputs produced by the MT, evaluation, extraction, and summary pipelines.
| Paper section | Released artifact |
|---|---|
| VMWE MT + QE results (Sec. 4, 6) | preset/VMWE/MT_eval/<model>/*.csv |
| Paraphrase analysis (Sec. 7) | preset/VMWE/MT_para_eval/<model>/*_{original,para,mixed}.csv |
| WMT candidate classification (Sec. 5) | preset/WMT/WMT_{LVC,VPC,VID}_Classified_2017_to_2024.csv |
| WMT MT summary results (Sec. 5, 6) | preset/WMT/WMT_MT.csv |
| WMT human summary (Sec. 5, 6) | preset/WMT/WMT_Human.csv |
| Per-example WMT MT | preset/WMT/MT/*.csv |
| Per-example WMT human | preset/WMT/Human/*.csv |
| Ranked MT systems | preset/WMT/WMT_system_rankings.csv |
Files follow:
<DATASET>_<PAIR>.csv
Example:
LVC_en-cs.csv
These files contain fields such as:
- src
- VMWE candidate columns
- mt
- metricx_score
- xcomet_score
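To get a quick per-file summary of the two score columns, something like the sketch below may be enough. It uses only the standard library and assumes that metricx_score and xcomet_score hold plain numeric values with no missing cells; the helper name `mean_scores` and the sample rows are illustrative and are not taken from the released files.

```python
import csv
import io
from statistics import mean


def mean_scores(csv_file, columns=("metricx_score", "xcomet_score")):
    """Average the given numeric columns over all rows of an open CSV file."""
    rows = list(csv.DictReader(csv_file))
    return {col: mean(float(row[col]) for row in rows) for col in columns}


# Toy rows for illustration only; these are NOT values from the released files.
sample = io.StringIO(
    "src,mt,metricx_score,xcomet_score\n"
    "He gave up.,Er gab auf.,2.0,0.75\n"
    "She took a walk.,Sie ging spazieren.,4.0,0.5\n"
)
print(mean_scores(sample))
```

For a released file, pass a handle opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample. Keep in mind that the two metrics point in opposite directions: MetricX is an error score (lower is better), while xCOMET is a quality score (higher is better).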
Files follow:
<DATASET>_<PAIR>_<VIEW>.csv
Where <VIEW> is one of:
- original: original source → original MT
- para: paraphrased source → paraphrased MT
- mixed: original source → paraphrased MT
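When iterating over the released files, the two naming patterns (with and without a view suffix) can be parsed with one regular expression. This is a sketch under stated assumptions: the dataset and view names come from this README, while the pattern, the two-letter language-code assumption, and the function name `parse_artifact_name` are illustrative and not part of the repository.

```python
import re

# Dataset and view names are taken from this README; the regular expression
# itself is an illustrative assumption, not code from the repository.
FILENAME_RE = re.compile(
    r"^(?P<dataset>LVC|VPC|VID|Non_VMWE)"
    r"_(?P<pair>[a-z]{2}-[a-z]{2})"
    r"(?:_(?P<view>original|para|mixed))?\.csv$"
)


def parse_artifact_name(filename):
    """Split '<DATASET>_<PAIR>[_<VIEW>].csv' into its parts, or raise ValueError."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized artifact name: {filename!r}")
    return match.groupdict()


print(parse_artifact_name("LVC_en-cs.csv"))
print(parse_artifact_name("Non_VMWE_en-de.csv"))
print(parse_artifact_name("VID_en-ja_mixed.csv"))
```

An explicit pattern is used rather than splitting on underscores because the dataset name Non_VMWE itself contains an underscore. For files without a view suffix, the `view` entry is `None`.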
If you want to rebuild the datasets and rerun the experiments from the provided scripts rather than using the released preset/ artifacts, follow the steps below.
Downloads source resources and constructs the core CSV datasets:
- LVC
- VPC
- VID
- Non_VMWE
- VID_dictionary
python scripts/build_vmwe_datasets.py
Runs the primary MT and evaluation pipeline.
python scripts/reproduce_vmwe_mt_eval.py \
--stage all \
--models GemmaX2 LLaMAX phi4 Madlad M2M100 opus seamless Google \
--pairs en-cs en-de en-es en-ja en-ru en-tr en-zh \
--datasets LVC VPC VID Non_VMWE \
--metrics metricx xcomet
Useful resource-management flags include:
- --translation-gpus
- --metricx-gpus
- --xcomet-gpus
- --parallel-jobs
- --google-project-id
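For a smaller trial run, the same command can be assembled programmatically with fewer models, pairs, or datasets. The sketch below only builds the argument list; it assumes the script accepts any subset of the values shown in the full command, which this README does not state explicitly. The helper name `build_eval_command` is hypothetical, and the resource-management flags are left out because their value formats are not documented here.

```python
import shlex


def build_eval_command(models, pairs, datasets,
                       metrics=("metricx", "xcomet"), stage="all"):
    """Assemble the argv list for scripts/reproduce_vmwe_mt_eval.py."""
    return [
        "python", "scripts/reproduce_vmwe_mt_eval.py",
        "--stage", stage,
        "--models", *models,
        "--pairs", *pairs,
        "--datasets", *datasets,
        "--metrics", *metrics,
    ]


# Assumption: a single model and pair is a valid subset of the full run.
cmd = build_eval_command(["opus"], ["en-de"], ["VID", "Non_VMWE"])
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.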
python scripts/reproduce_vmwe_para_mt_eval.py \
--stage all \
--models GemmaX2 LLaMAX phi4 Madlad M2M100 opus seamless Google \
--pairs en-cs en-de en-es en-ja en-ru en-tr en-zh \
--datasets LVC VPC VID \
--metrics metricx xcomet
Downloads and restructures WMT data into:
datasets/WMT/<year>/<Human|MT>/<pair>/<system>.csv
python scripts/build_wmt_datasets.py --years 2017 2018 2019 2020 2021 2022 2023 2024
To reproduce the released paper-style outputs using the shipped preset classifications:
python scripts/build_wmt_vmwe_pipeline.py \
--classification preset \
--use-preset-final-summary
To rerun WMT candidate classification from scratch using an API model:
python scripts/build_wmt_vmwe_pipeline.py \
--classification api \
--openai-model gpt-4o
This step requires OPENAI_API_KEY.
Outputs from API-based reclassification may not be byte-identical to the released preset files.
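The restructured WMT layout, datasets/WMT/<year>/<Human|MT>/<pair>/<system>.csv, is regular enough to build and parse with pathlib. The helpers below are a sketch written for this README, not functions from the repository, and the system name `example-system` is a made-up placeholder.

```python
from pathlib import PurePosixPath


def wmt_csv_path(year, kind, pair, system, root="datasets/WMT"):
    """Build datasets/WMT/<year>/<Human|MT>/<pair>/<system>.csv."""
    if kind not in ("Human", "MT"):
        raise ValueError("kind must be 'Human' or 'MT'")
    return PurePosixPath(root) / str(year) / kind / pair / f"{system}.csv"


def parse_wmt_csv_path(path):
    """Recover (year, kind, pair, system) from a path in the layout above."""
    parts = PurePosixPath(path).parts
    year, kind, pair, filename = parts[-4:]
    return int(year), kind, pair, PurePosixPath(filename).stem


# 'example-system' is a hypothetical system name used only for illustration.
path = wmt_csv_path(2022, "MT", "en-de", "example-system")
print(path)
print(parse_wmt_csv_path(path))
```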
If you use this repository, please cite:
@misc{liu2025evaluatingimpactverbalmultiword,
title={Evaluating the Impact of Verbal Multiword Expressions on Machine Translation},
author={Linfeng Liu and Saptarshi Ghosh and Tianyu Jiang},
year={2025},
eprint={2508.17458},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.17458},
}
This project is released under the MIT License.