Speech Translation Error Labelling (STEL)

Errors in speech translations reduce the trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet there is currently no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol and a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and the multimodal LLM Qwen2.5-Omni perform the STEL task at roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.

By Dominik Macháček, Maike Züfle, and Ondřej Klejch. This project is a part of Live Credible Translation.

Overview

This repository contains the annotation protocol, dataset, baselines, and evaluation code for STEL: error span annotation + direct assessment for speech translation output, as described in our paper.

Dataset: under data/, also available on Hugging Face
Baselines: XCOMET and Qwen2.5-Omni, under src/baselines/
Evaluation: meta-evaluation of automatic systems against human annotations, under src/meta-eval/
Annotation protocol and UI: a fork of pearmut

This is a work in progress. Don't hesitate to contact the authors about any bugs or missing documentation.

Installation

pip install -e .

The Qwen2.5-Omni baseline requires a separate environment:

pip install -e ".[qwen]"

Project structure

src/                                        ← source code
  align/                                    ← word alignment setup (awesome-align)
  baselines/                                ← baseline models (xcomet, qwen25omni)
  data/                                     ← data preprocessing
  meta-eval/                                ← meta-evaluation
  pretty-print/                             ← pretty-printing for visual inspection

scripts/                                    ← pipeline entry points (run from project root)
  01_prepare_data/                          ← build annotated data files
  02_run_baselines/                         ← run XCOMET and Qwen2.5-Omni baselines
  03_analysis/                              ← meta-evaluation and result tables
  04_statistics/                            ← summary statistics and plots
  05_results/                               ← paper table/plot post-processing

data/                                       ← processed data files
  <lang>_annotated_data.json                ← base annotations (gold transcript)
  <lang>_annotated_data_asr.json            ← ASR input + ASR-aligned spans
  <lang>_annotated_data_asr+spans-wer.json  ← ASR data with per-span WER
  <lang>_audio/, acl6060_short-audio/       ← source speech audio
  pearmut/                                  ← raw pearmut annotation exports
  robothon_asr/, acl6060.111_asr/           ← raw ASR transcripts per system

outputs/                                    ← model outputs and evaluation results
  xcomet/                                   ← XCOMET predictions
  qwen25omni/                               ← Qwen2.5-Omni predictions (text, audio, textaudio)
  meta-eval/                                ← evaluation results
  pretty-print/                             ← visual inspection outputs

Language pairs: cs_en, en_cs, en_de, en_he. A second annotation round (annotation2) is available for en_cs and en_de only.
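As a quick sanity check of the layout above, the following sketch lists the three JSON files the naming scheme implies for each language pair and reports any that are missing. The helper names `expected_data_files` and `check_data_files` and the `DATA_DIR` variable are hypothetical and not part of the repository. The sketch also assumes that every language pair ships all three file variants, which the layout suggests but does not state.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and DATA_DIR are assumptions, not repository code.

LANG_PAIRS="cs_en en_cs en_de en_he"

# Print the data files implied by the <lang>_annotated_data*.json naming scheme.
expected_data_files() {
  local lang="$1" data_dir="${2:-data}" suffix
  for suffix in "" "_asr" "_asr+spans-wer"; do
    echo "${data_dir}/${lang}_annotated_data${suffix}.json"
  done
}

# Report every expected file that does not exist, then print a total.
# Note: paths containing whitespace are not handled by this simple loop.
check_data_files() {
  local data_dir="${1:-data}" lang f missing=0
  for lang in $LANG_PAIRS; do
    for f in $(expected_data_files "$lang" "$data_dir"); do
      if [ ! -f "$f" ]; then
        echo "missing: $f"
        missing=$((missing + 1))
      fi
    done
  done
  echo "missing files: $missing"
}

# Run from the project root; DATA_DIR overrides the default data/ directory.
check_data_files "${DATA_DIR:-data}"
```

If the check reports missing files, the pre-built files on Hugging Face or the data preparation step are the places to look.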

Pipeline

1. Prepare data

Already done: the prepared data files are included under data/ and are also available pre-built on Hugging Face. Re-run this step only if the source pearmut data or annotations change.

bash scripts/01_prepare_data/01_prepare_pearmut.sh   # build base annotated data files
bash scripts/01_prepare_data/02_merge_annotations.sh  # merge second annotation round

2. Run baselines

All scripts run all four language pairs. Each modality has variants for: gold vs ASR input, with/without severity labels, and context sizes 0/1/2/5 (*_ctx_all.sh covers ctx=1/2/5).

# XCOMET
bash scripts/02_run_baselines/comet/run_xcomet*.sh

# Qwen2.5-Omni — text input
bash scripts/02_run_baselines/qwen/text/run_qwen25omni*.sh

# Qwen2.5-Omni — audio input
bash scripts/02_run_baselines/qwen/audio/run_qwen25omni_audio*.sh

# Qwen2.5-Omni — text+audio input
bash scripts/02_run_baselines/qwen/textaudio/run_qwen25omni_textaudio*.sh
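The `*` in the commands above stands for the variant suffix. If such a glob is typed literally and matches several files, `bash` executes only the first match and passes the remaining file names to it as positional arguments. To run every variant of one baseline in turn, a loop along the following lines may help. The helper `run_variants` is hypothetical and not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch only: run_variants is an assumed helper, not repository code.

# Run every script in directory $1 whose name starts with prefix $2,
# one after another, stopping at the first failure.
run_variants() {
  local dir="$1" prefix="$2" script
  for script in "$dir"/"$prefix"*.sh; do
    # An unmatched glob stays literal, so check that the file exists.
    if [ ! -e "$script" ]; then
      echo "no scripts match $dir/$prefix*.sh"
      return 1
    fi
    echo "running $script"
    bash "$script" || return 1
  done
}

# Example, from the project root:
# run_variants scripts/02_run_baselines/comet run_xcomet
# run_variants scripts/02_run_baselines/qwen/text run_qwen25omni
```

Running all variants sequentially can take a long time; calling a single script by its full name remains the simpler option when only one configuration is needed.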

3. Analysis

Set ROUND=annotation1 or ROUND=annotation2 at the top of each script. annotation2 is restricted to en_cs and en_de. ASR model outputs use *_annotated_data_asr+spans-wer.json as the annotation source; non-ASR outputs use *_annotated_data.json.

bash scripts/03_analysis/01_run_meta_eval.sh        # compute F1 and correlations
bash scripts/03_analysis/02_print_results_table.sh  # print summary table + CSVs
bash scripts/03_analysis/03_run_prettyprint.sh      # visual inspection of predictions
bash scripts/03_analysis/04_run_wer_analysis.sh     # WER-split meta-eval

Results are written to outputs/meta-eval/ (annotation1) or outputs/meta-eval/annotation2/ (annotation2).
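The rules above (which language pairs a round covers, which annotation file is used, and where results land) can be summarised in a small sketch. The function names and the `asr`/`gold` labels are hypothetical. The sketch also assumes that annotation1 covers all four language pairs, and the actual analysis scripts may organise this differently.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the asr/gold labels are assumptions.

# Language pairs available in each annotation round.
langs_for_round() {
  case "$1" in
    annotation1) echo "cs_en en_cs en_de en_he" ;;  # assumed: all four pairs
    annotation2) echo "en_cs en_de" ;;
    *) echo "unknown round: $1" >&2; return 1 ;;
  esac
}

# Directory that receives the meta-evaluation results for a round.
outdir_for_round() {
  case "$1" in
    annotation1) echo "outputs/meta-eval" ;;
    annotation2) echo "outputs/meta-eval/annotation2" ;;
    *) echo "unknown round: $1" >&2; return 1 ;;
  esac
}

# Annotation file for a language pair ($1) and input type ($2: asr or gold).
annotation_source() {
  case "$2" in
    asr)  echo "data/${1}_annotated_data_asr+spans-wer.json" ;;
    gold) echo "data/${1}_annotated_data.json" ;;
    *) echo "unknown input type: $2" >&2; return 1 ;;
  esac
}

# Print the plan for one round without running anything.
ROUND="${ROUND:-annotation2}"
for lang in $(langs_for_round "$ROUND"); do
  echo "$ROUND $lang: gold=$(annotation_source "$lang" gold) asr=$(annotation_source "$lang" asr) out=$(outdir_for_round "$ROUND")"
done
```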

Citation

@misc{macháček2026automaticlabellingspeechtranslation,
      title={Automatic Labelling of Speech Translation Errors},
      author={Dominik Macháček and Maike Züfle and Ondřej Klejch},
      year={2026},
      eprint={2606.06047},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.06047},
}
