Rana Faraz edited this page Jun 23, 2026 · 1 revision

VoxTutor

License: MIT

A pronunciation-assessment benchmark for spoken-language tutoring.

VoxTutor synthesizes learner utterances from a known phoneme inventory, injects mispronunciations at known positions, and then measures how well each of four scorers recovers those positions. The result is a clean 2x2 dissociation: forced alignment (DTW) and Goodness-of-Pronunciation (GOP) normalization each buy robustness to a different real-world distortion, and a scorer needs both to be robust to both distortions.

  • Speaking-rate variation breaks naive fixed-window segmentation; DTW forced alignment fixes it.
  • Channel noise inflates raw frame-to-template distances; GOP normalization (variance subtraction) fixes it.
  • A scrambled-label null (ground-truth labels shuffled, so AUROC should sit near chance, 0.5) confirms that each reported AUROC reflects real signal.
  • No acoustic model, no speech corpus, no API keys -- numpy only.
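
The two fixes in the list above can be illustrated with a toy sketch. It is not VoxTutor's implementation: the function names (`dtw_cost`, `framewise_cost`, `competitor_normalized`) and the toy data are hypothetical, channel noise is modelled as a uniform offset added to every template distance, and the normalization shown is the classic "target minus best competitor" form of GOP, which may differ from the variance subtraction VoxTutor uses.

```python
import numpy as np


def dtw_cost(x: np.ndarray, y: np.ndarray) -> float:
    """Total cost of the best monotonic alignment of two (frames, dims) arrays."""
    n, m = len(x), len(y)
    # Pairwise Euclidean distance between every frame of x and every frame of y.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor cells.
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m])


def framewise_cost(x: np.ndarray, y: np.ndarray) -> float:
    """Naive comparison: frame i of x against frame i of y, with no warping."""
    return float(np.linalg.norm(x - y, axis=-1).sum())


def competitor_normalized(dists: np.ndarray, target: int) -> float:
    """Target-template distance minus the best competing template's distance."""
    return float(dists[target] - np.delete(dists, target).min())


# Speaking-rate variation: same two phonemes, but the first one is held longer.
template = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)[:, None]
slow_start = np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=float)[:, None]
naive_rate = framewise_cost(template, slow_start)  # penalizes the timing shift
aligned_rate = dtw_cost(template, slow_start)      # warping absorbs the shift

# Channel noise (assumed here to inflate every template distance equally).
clean = np.array([0.1, 1.0, 1.2])  # distances to templates; index 0 is the target
noisy = clean + 2.0
raw_shift = noisy[0] - clean[0]    # raw score moves with the noise level
norm_shift = competitor_normalized(noisy, 0) - competitor_normalized(clean, 0)
```

In this toy setup the frame-by-frame cost is nonzero for a correctly pronounced but slower utterance while the DTW cost is zero, and the raw distance shifts by the full noise offset while the normalized score does not move. Real channel noise is unlikely to be a perfectly uniform offset, so the cancellation would only be approximate in practice.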
flowchart LR
    subgraph input["Utterance synthesis"]
        ph["Phoneme inventory\n(known templates)"]
        mp["Mispronunciation injection\n(known positions)"]
    end
    subgraph scorers["Scorers = segmentation x normalization"]
        naive["naive\nfixed + raw"]
        aligned["aligned\nDTW + raw"]
        normalized["normalized\nfixed + GOP"]
        gop["gop\nDTW + GOP"]
    end
    subgraph eval["Evaluation"]
        auroc["AUROC vs ground truth"]
        gate["Dissociation gate (CI)"]
    end
    ph --> mp --> scorers --> auroc --> gate
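
The evaluation step in the flowchart (AUROC against ground truth, checked against a scrambled-label null) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's harness: the `auroc` helper, the synthetic scores, and the permutation count are all hypothetical, and AUROC is computed with the pairwise (Mann-Whitney) formulation rather than any particular library routine.

```python
import numpy as np


def auroc(scores, labels) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


rng = np.random.default_rng(0)

# Hypothetical data: True marks a position with an injected mispronunciation,
# and the scorer output is higher on average at those positions.
labels = rng.random(200) < 0.3
scores = labels.astype(float) + rng.normal(0.0, 0.5, size=200)

# AUROC against the true injection positions.
real = auroc(scores, labels)

# Scrambled-label null: shuffle the labels so any score/label link is destroyed.
null = float(np.mean([auroc(scores, rng.permutation(labels)) for _ in range(200)]))
```

If a scorer carries real signal, `real` should sit well above `null`, and `null` should stay close to 0.5. A null that drifts far from 0.5 would point to a problem in the evaluation rather than in the scorer.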

Quick start

pip install -e ".[dev]"
voxtutor compare --regime warped
voxtutor compare --regime noisy
python -m evals.harness
pytest -q

Wiki pages

  • Architecture -- utterance synthesis, scorer implementations, 2x2 design
  • Evaluation -- benchmark setup, results table, regime descriptions
  • Configuration -- env vars, .env.example
  • Development -- setup, tests, how to add a scorer or regime
