CSTR-Edinburgh/STEL

Speech Translation Error Labelling (STEL)


Errors in speech translations reduce the trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet there is currently no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol and a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that the text-only XCOMET and the multimodal LLM Qwen2.5-Omni perform the STEL task at roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.

By Dominik Macháček, Maike Züfle, and Ondřej Klejch. This project is a part of Live Credible Translation.

Overview

This repository contains the annotation protocol, dataset, baselines, and evaluation code for STEL: error span annotation + direct assessment for speech translation output, as described in our paper.

This is a work in progress. Please contact the authors about any bugs or missing documentation.

Installation

pip install -e .

The Qwen2.5-Omni baseline requires a separate environment:

pip install -e ".[qwen]"

Project structure

src/                                        ← source code
  align/                                    ← word alignment setup (awesome-align)
  baselines/                                ← baseline models (xcomet, qwen25omni)
  data/                                     ← data preprocessing
  meta-eval/                                ← meta-evaluation
  pretty-print/                             ← pretty-printing for visual inspection

scripts/                                    ← pipeline entry points (run from project root)
  01_prepare_data/                          ← build annotated data files
  02_run_baselines/                         ← run XCOMET and Qwen2.5-Omni baselines
  03_analysis/                              ← meta-evaluation and result tables
  04_statistics/                            ← summary statistics and plots
  05_results/                               ← paper table/plot post-processing

data/                                       ← processed data files
  <lang>_annotated_data.json                ← base annotations (gold transcript)
  <lang>_annotated_data_asr.json            ← ASR input + ASR-aligned spans
  <lang>_annotated_data_asr+spans-wer.json  ← ASR data with per-span WER
  <lang>_audio/, acl6060_short-audio/       ← source speech audio
  pearmut/                                  ← raw pearmut annotation exports
  robothon_asr/, acl6060.111_asr/           ← raw ASR transcripts per system

outputs/                                    ← model outputs and evaluation results
  xcomet/                                   ← XCOMET predictions
  qwen25omni/                               ← Qwen2.5-Omni predictions (text, audio, textaudio)
  meta-eval/                                ← evaluation results
  pretty-print/                             ← visual inspection outputs

Languages: cs_en, en_cs, en_de, en_he. A second annotation round (annotation2) is available for en_cs and en_de only.
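The file naming pattern in the tree above can be expanded per language pair. The following is a minimal sketch, not part of the repository: the function name `data_files` and the dictionary keys `gold`, `asr`, and `asr_wer` are hypothetical labels chosen for illustration, and the default `data` directory assumes you run from the project root.

```python
from pathlib import Path

# Language pairs listed in this README.
LANG_PAIRS = ["cs_en", "en_cs", "en_de", "en_he"]


def data_files(lang: str, data_dir: str = "data") -> dict[str, Path]:
    """Return the three processed data files expected for one language pair.

    The keys are hypothetical labels; only the file names come from the README.
    """
    if lang not in LANG_PAIRS:
        raise ValueError(f"unknown language pair: {lang}")
    root = Path(data_dir)
    return {
        "gold": root / f"{lang}_annotated_data.json",
        "asr": root / f"{lang}_annotated_data_asr.json",
        "asr_wer": root / f"{lang}_annotated_data_asr+spans-wer.json",
    }


# Report which of the expected files are present locally.
for lang in LANG_PAIRS:
    for kind, path in data_files(lang).items():
        print(f"{lang:6} {kind:8} {path} exists={path.exists()}")
```

Running this from the project root gives a quick check that the pre-built data files are in place before starting the baselines.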

Pipeline

1. Prepare data

This step is already done. Re-run it only if the source pearmut data or annotations change. The resulting data files are also available pre-built on Hugging Face.

bash scripts/01_prepare_data/01_prepare_pearmut.sh   # build base annotated data files
bash scripts/01_prepare_data/02_merge_annotations.sh  # merge second annotation round

2. Run baselines

Each script runs all four language pairs. Each modality has variants for gold vs ASR input, with or without severity labels, and context sizes 0/1/2/5 (*_ctx_all.sh covers ctx=1/2/5).
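To get a sense of the size of the experiment space, the variant dimensions can be enumerated as a grid. This sketch assumes, hypothetically, that the three options combine freely; the repository may provide scripts for only a subset of these combinations, and the field names `input`, `severity`, and `context` are illustrative rather than taken from the code.

```python
from itertools import product

# Variant dimensions as described in this README.
INPUTS = ["gold", "asr"]
SEVERITY = [True, False]
CONTEXT_SIZES = [0, 1, 2, 5]


def variant_grid() -> list[dict]:
    """Enumerate every combination of input type, severity flag, and context size."""
    return [
        {"input": inp, "severity": sev, "context": ctx}
        for inp, sev, ctx in product(INPUTS, SEVERITY, CONTEXT_SIZES)
    ]


grid = variant_grid()
print(f"{len(grid)} candidate variants per modality")
for variant in grid[:3]:
    print(variant)
```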

# XCOMET
bash scripts/02_run_baselines/comet/run_xcomet*.sh

# Qwen2.5-Omni — text input
bash scripts/02_run_baselines/qwen/text/run_qwen25omni*.sh

# Qwen2.5-Omni — audio input
bash scripts/02_run_baselines/qwen/audio/run_qwen25omni_audio*.sh

# Qwen2.5-Omni — text+audio input
bash scripts/02_run_baselines/qwen/textaudio/run_qwen25omni_textaudio*.sh

3. Analysis

Set ROUND=annotation1 or ROUND=annotation2 at the top of each script. annotation2 is restricted to en_cs and en_de. ASR model outputs use *_annotated_data_asr+spans-wer.json as the annotation source; non-ASR outputs use *_annotated_data.json.
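The rules in this paragraph can be summarised in a few lines. The helper below is a hypothetical sketch, not code from the repository: `annotation_source` and its parameters are made-up names, and it only encodes the file choice and the language restriction stated above. How the second round is stored inside those files is not covered here.

```python
# Language pairs that have a second annotation round, per this README.
SECOND_ROUND_PAIRS = {"en_cs", "en_de"}
ROUNDS = ("annotation1", "annotation2")


def annotation_source(lang: str, is_asr_output: bool,
                      round_name: str = "annotation1") -> str:
    """Pick the annotation file to evaluate a model output against."""
    if round_name not in ROUNDS:
        raise ValueError(f"unknown round: {round_name}")
    if round_name == "annotation2" and lang not in SECOND_ROUND_PAIRS:
        raise ValueError(f"annotation2 is not available for {lang}")
    # ASR outputs are scored against the ASR-aligned spans with per-span WER.
    suffix = "_asr+spans-wer" if is_asr_output else ""
    return f"data/{lang}_annotated_data{suffix}.json"


print(annotation_source("en_de", is_asr_output=True))
print(annotation_source("cs_en", is_asr_output=False))
```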

bash scripts/03_analysis/01_run_meta_eval.sh        # compute F1 and correlations
bash scripts/03_analysis/02_print_results_table.sh  # print summary table + CSVs
bash scripts/03_analysis/03_run_prettyprint.sh      # visual inspection of predictions
bash scripts/03_analysis/04_run_wer_analysis.sh     # WER-split meta-eval
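For readers unfamiliar with span-level scoring, here is a generic sketch of character-level precision, recall, and F1 between gold and predicted error spans. This is an assumption about what such a metric can look like, not the repository's implementation: the meta-evaluation code may match spans differently (for example per word or with severity weighting). Spans are assumed to be `(start, end)` character offsets with an exclusive end.

```python
def span_chars(spans: list[tuple[int, int]]) -> set[int]:
    """Collect the character positions covered by (start, end) spans, end exclusive."""
    covered: set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def span_prf(gold_spans: list[tuple[int, int]],
             pred_spans: list[tuple[int, int]]) -> tuple[float, float, float]:
    """Character-level precision, recall, and F1 of predicted spans against gold spans."""
    gold = span_chars(gold_spans)
    pred = span_chars(pred_spans)
    overlap = len(gold & pred)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Gold covers characters 0-4 and 10-14; the prediction covers 3-7,
# so 2 of the 5 predicted characters and 2 of the 10 gold characters overlap.
p, r, f1 = span_prf([(0, 5), (10, 15)], [(3, 8)])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
```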

Results are written to outputs/meta-eval/ (annotation1) or outputs/meta-eval/annotation2/ (annotation2).
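When post-processing results in your own scripts, the output location can be derived from the round name. A small sketch, with `results_dir` as a hypothetical helper that mirrors the two paths named above:

```python
from pathlib import Path


def results_dir(round_name: str, outputs_dir: str = "outputs") -> Path:
    """Map an annotation round to the meta-evaluation results directory."""
    base = Path(outputs_dir) / "meta-eval"
    if round_name == "annotation1":
        return base
    if round_name == "annotation2":
        return base / "annotation2"
    raise ValueError(f"unknown round: {round_name}")


for name in ("annotation1", "annotation2"):
    print(name, "->", results_dir(name))
```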

Citation

@misc{macháček2026automaticlabellingspeechtranslation,
      title={Automatic Labelling of Speech Translation Errors},
      author={Dominik Macháček and Maike Züfle and Ondrej Klejch},
      year={2026},
      eprint={2606.06047},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.06047},
}
