Errors in speech translations reduce the trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet there is currently no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol and a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and the multimodal LLM Qwen2.5-Omni perform the STEL task at roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.
By Dominik Macháček, Maike Züfle, and Ondřej Klejch. This project is a part of Live Credible Translation.
This repository contains the annotation protocol, dataset, baselines, and evaluation code for STEL: error span annotation + direct assessment for speech translation output, as described in our paper.
- Dataset: under data/, also available on Hugging Face
- Baselines: XCOMET and Qwen2.5-Omni, under src/baselines/
- Evaluation: meta-evaluation of automatic systems against human annotations, under src/meta-eval/
- Annotation protocol and UI: a fork of pearmut
This is a work in progress. Don't hesitate to contact the authors about any bugs or missing documentation.
pip install -e .

The Qwen2.5-Omni baseline requires a separate environment:

pip install -e ".[qwen]"

src/ ← source code
  align/ ← word alignment setup (awesome-align)
  baselines/ ← baseline models (xcomet, qwen25omni)
  data/ ← data preprocessing
  meta-eval/ ← meta-evaluation
  pretty-print/ ← pretty-printing for visual inspection
scripts/ ← pipeline entry points (run from project root)
  01_prepare_data/ ← build annotated data files
  02_run_baselines/ ← run XCOMET and Qwen2.5-Omni baselines
  03_analysis/ ← meta-evaluation and result tables
  04_statistics/ ← summary statistics and plots
  05_results/ ← paper table/plot post-processing
data/ ← processed data files
  <lang>_annotated_data.json ← base annotations (gold transcript)
  <lang>_annotated_data_asr.json ← ASR input + ASR-aligned spans
  <lang>_annotated_data_asr+spans-wer.json ← ASR data with per-span WER
  <lang>_audio/, acl6060_short-audio/ ← source speech audio
  pearmut/ ← raw pearmut annotation exports
  robothon_asr/, acl6060.111_asr/ ← raw ASR transcripts per system
outputs/ ← model outputs and evaluation results
  xcomet/ ← XCOMET predictions
  qwen25omni/ ← Qwen2.5-Omni predictions (text, audio, textaudio)
  meta-eval/ ← evaluation results
  pretty-print/ ← visual inspection outputs
Languages: cs_en, en_cs, en_de, en_he.
A second annotation round (annotation2) is available for en_cs and en_de only.
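The file-naming scheme in the data/ listing can be turned into a small path helper. The following is only a sketch: `data_files`, `LANGS`, `SUFFIXES`, and the short keys `gold`, `asr`, `asr_wer` are hypothetical names introduced here for illustration and are not part of the repository's code. It only builds the expected paths and does not check that the files exist or assume anything about their JSON contents.

```python
from pathlib import Path

# Language pairs listed in this README.
LANGS = ["cs_en", "en_cs", "en_de", "en_he"]

# File-name suffixes taken from the data/ listing; the dictionary keys
# are illustrative labels, not names used by the repository.
SUFFIXES = {
    "gold": "_annotated_data.json",
    "asr": "_annotated_data_asr.json",
    "asr_wer": "_annotated_data_asr+spans-wer.json",
}


def data_files(lang: str, data_dir: str = "data") -> dict:
    """Return the expected data file paths for one language pair."""
    if lang not in LANGS:
        raise ValueError(f"unknown language pair: {lang}")
    return {
        kind: Path(data_dir) / f"{lang}{suffix}"
        for kind, suffix in SUFFIXES.items()
    }
```

For example, `data_files("en_de")["gold"]` points at `data/en_de_annotated_data.json`, assuming the helper is called from the project root.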
Data preparation is already done. Re-run it only if the source pearmut data or annotations change. The resulting data files are also available pre-built on Hugging Face.
bash scripts/01_prepare_data/01_prepare_pearmut.sh # build base annotated data files
bash scripts/01_prepare_data/02_merge_annotations.sh # merge second annotation round

All scripts run all four language pairs. Each modality has variants for gold vs ASR input, with/without severity labels, and context sizes 0/1/2/5 (*_ctx_all.sh covers ctx=1/2/5).
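To get a feel for how many configurations the variant description implies, the three axes can be enumerated as a cross product. This is a sketch under one loud assumption: that every combination of input type, severity setting, and context size is meaningful. The README does not state that a script exists for each combination, and the names `INPUTS`, `SEVERITY`, `CONTEXT_SIZES`, and `variants` are hypothetical.

```python
from itertools import product

# Axes as described in the README; the full cross product is an assumption.
INPUTS = ["gold", "asr"]
SEVERITY = [True, False]      # with / without severity labels
CONTEXT_SIZES = [0, 1, 2, 5]


def variants():
    """Enumerate every (input, severity, ctx) combination as a dict."""
    return [
        {"input": inp, "severity": sev, "ctx": ctx}
        for inp, sev, ctx in product(INPUTS, SEVERITY, CONTEXT_SIZES)
    ]
```

Under that assumption there are 2 × 2 × 4 = 16 configurations per modality, of which the ctx=1/2/5 ones correspond to what the `*_ctx_all.sh` scripts are said to cover.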
# XCOMET
bash scripts/02_run_baselines/comet/run_xcomet*.sh
# Qwen2.5-Omni — text input
bash scripts/02_run_baselines/qwen/text/run_qwen25omni*.sh
# Qwen2.5-Omni — audio input
bash scripts/02_run_baselines/qwen/audio/run_qwen25omni_audio*.sh
# Qwen2.5-Omni — text+audio input
bash scripts/02_run_baselines/qwen/textaudio/run_qwen25omni_textaudio*.sh

Set ROUND=annotation1 or ROUND=annotation2 at the top of each script. annotation2 is restricted to en_cs and en_de. ASR model outputs use *_annotated_data_asr+spans-wer.json as the annotation source; non-ASR outputs use *_annotated_data.json.
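The rules for rounds and annotation sources can be summarised in a few lines of Python. This is a sketch only, not the repository's implementation: `ROUND_LANGS` and `annotation_source` are hypothetical names, and the code assumes that annotation1 covers all four language pairs and that both rounds read from the same file names under data/ (the README does not say otherwise, but it does not confirm it either).

```python
# Language pairs per annotation round, as described in the README.
# Assumption: annotation1 covers all four listed pairs.
ROUND_LANGS = {
    "annotation1": ["cs_en", "en_cs", "en_de", "en_he"],
    "annotation2": ["en_cs", "en_de"],
}


def annotation_source(lang, asr, round_name="annotation1"):
    """Return the annotation file a model output should be scored against."""
    if round_name not in ROUND_LANGS:
        raise ValueError(f"unknown round: {round_name}")
    if lang not in ROUND_LANGS[round_name]:
        raise ValueError(f"{lang} is not available in {round_name}")
    # ASR outputs use the per-span-WER file; non-ASR outputs use the base file.
    suffix = "_annotated_data_asr+spans-wer.json" if asr else "_annotated_data.json"
    return f"data/{lang}{suffix}"
```

Raising an error for an unsupported pair, rather than silently skipping it, makes a misconfigured ROUND visible early.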
bash scripts/03_analysis/01_run_meta_eval.sh # compute F1 and correlations
bash scripts/03_analysis/02_print_results_table.sh # print summary table + CSVs
bash scripts/03_analysis/03_run_prettyprint.sh # visual inspection of predictions
bash scripts/03_analysis/04_run_wer_analysis.sh # WER-split meta-eval

Results are written to outputs/meta-eval/ (annotation1) or outputs/meta-eval/annotation2/ (annotation2).
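When post-processing results in your own scripts, the round-to-directory mapping can be captured in a small helper. This is a minimal sketch: `meta_eval_dir` is a hypothetical function, not part of the repository, and it only encodes the two locations named above.

```python
from pathlib import Path


def meta_eval_dir(round_name, outputs_dir="outputs"):
    """Return the meta-evaluation results directory for an annotation round."""
    base = Path(outputs_dir) / "meta-eval"
    if round_name == "annotation1":
        return base
    if round_name == "annotation2":
        return base / "annotation2"
    raise ValueError(f"unknown round: {round_name}")
```

For instance, `meta_eval_dir("annotation2")` yields `outputs/meta-eval/annotation2`, relative to the project root.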
@misc{macháček2026automaticlabellingspeechtranslation,
title={Automatic Labelling of Speech Translation Errors},
author={Dominik Macháček and Maike Züfle and Ondrej Klejch},
year={2026},
eprint={2606.06047},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2606.06047},
}