Home

VoxTutor

A pronunciation-assessment benchmark for spoken-language tutoring.

VoxTutor synthesizes learner utterances from a known phoneme inventory, injects mispronunciations at known positions, and then evaluates four scorers on whether they recover those positions. The result is a clean 2x2 dissociation: forced alignment (DTW) and Goodness-of-Pronunciation (GOP) normalization each buy robustness to a different real-world distortion, and you need both to be robust to both distortions at once.
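As a rough illustration of the synthesis idea, here is a minimal numpy sketch. It is not VoxTutor's actual generator: the function name `synthesize_utterance`, its parameters, and the choice to model a mispronunciation as a swap to a different phoneme's template are all assumptions made for this example.

```python
import numpy as np

def synthesize_utterance(templates, phoneme_ids, error_rate, rng,
                         frames_per_phone=10, noise_std=0.05):
    """Build a frame sequence from phoneme templates, mispronouncing some positions.

    templates: (n_phonemes, dim) array, one hypothetical template vector per phoneme.
    Returns (frames, labels); labels[i] is True where position i was mispronounced.
    """
    phoneme_ids = np.asarray(phoneme_ids)
    n_phonemes = len(templates)
    # Ground truth: which positions receive an injected mispronunciation.
    labels = rng.random(len(phoneme_ids)) < error_rate
    # A non-zero offset modulo the inventory size always yields a different phoneme.
    offsets = rng.integers(1, n_phonemes, size=len(phoneme_ids))
    realized = np.where(labels, (phoneme_ids + offsets) % n_phonemes, phoneme_ids)
    # Hold each realized template for a fixed number of frames, then add jitter.
    frames = np.repeat(templates[realized], frames_per_phone, axis=0)
    frames = frames + rng.normal(0.0, noise_std, size=frames.shape)
    return frames, labels

rng = np.random.default_rng(0)
templates = rng.normal(size=(8, 4))   # toy inventory: 8 phonemes, 4-dim features
frames, labels = synthesize_utterance(templates, [0, 3, 5, 1, 7, 2], 0.3, rng)
```

Because the injected positions are returned alongside the frames, any scorer's per-phoneme output can be compared directly against `labels`.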

Speaking-rate variation breaks naive fixed-window segmentation; DTW forced alignment fixes it.
Channel noise inflates raw frame-to-template distances; GOP normalization (variance subtraction) fixes it.
A scrambled-label null control confirms the AUROC signal is real: once the ground-truth labels are shuffled, every scorer should fall to chance.
No acoustic model, no speech corpus, no API keys -- numpy only.
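To make the speaking-rate point concrete, the sketch below shows textbook dynamic time warping in plain numpy. It is an illustrative version only, not VoxTutor's scorer; the function name `dtw_path` and the Euclidean frame distance are assumptions. An utterance spoken at half speed still aligns to its reference at zero cost, whereas fixed windows would drift out of sync.

```python
import numpy as np

def dtw_path(observed, reference):
    """Align two (frames, dim) sequences with classic DTW; return (cost, path)."""
    n, m = len(observed), len(reference)
    # Pairwise Euclidean distance between every observed and reference frame.
    dist = np.linalg.norm(observed[:, None, :] - reference[None, :, :], axis=-1)
    # acc[i, j] = cheapest cost of aligning the first i observed and j reference frames.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return acc[n, m], path

reference = np.arange(5.0)[:, None]          # 5 reference frames, 1-dim features
observed = np.repeat(reference, 2, axis=0)   # same content at half the speaking rate
cost, path = dtw_path(observed, reference)
```

The returned `path` pairs each observed frame index with a reference frame index, which is what lets per-phoneme boundaries follow the speaker's actual timing.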

flowchart LR
    subgraph input["Utterance synthesis"]
        ph["Phoneme inventory\n(known templates)"]
        mp["Mispronunciation injection\n(known positions)"]
    end
    subgraph scorers["Scorers = segmentation x normalization"]
        naive["naive\nfixed + raw"]
        aligned["aligned\nDTW + raw"]
        normalized["normalized\nfixed + GOP"]
        gop["gop\nDTW + GOP"]
    end
    subgraph eval["Evaluation"]
        auroc["AUROC vs ground truth"]
        gate["Dissociation gate (CI)"]
    end
    ph --> mp --> scorers --> auroc --> gate
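The evaluation stage in the diagram can be sketched as follows. This is a minimal numpy illustration of AUROC against ground truth plus a scrambled-label null, not the project's harness; the function names `auroc` and `scrambled_null` and the permutation count are assumptions for this example.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative label")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def scrambled_null(scores, labels, rng, n_permutations=200):
    """AUROC values obtained after shuffling the labels, which breaks any real signal."""
    labels = np.asarray(labels, dtype=bool)
    return np.array([auroc(scores, rng.permutation(labels))
                     for _ in range(n_permutations)])

rng = np.random.default_rng(0)
labels = np.arange(200) % 2 == 0                 # toy ground truth, half positive
scores = labels + rng.normal(0.0, 0.5, 200)      # toy scorer: informative but noisy
observed_auroc = auroc(scores, labels)
null_aurocs = scrambled_null(scores, labels, rng)
```

If `observed_auroc` sits well above the spread of `null_aurocs`, which should centre on 0.5, the scorer is tracking the injected mispronunciations rather than an artifact of the pipeline.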

Quick start

pip install -e ".[dev]"
voxtutor compare --regime warped
voxtutor compare --regime noisy
python -m evals.harness
pytest -q

Wiki pages

Architecture -- utterance synthesis, scorer implementations, 2x2 design
Evaluation -- benchmark setup, results table, regime descriptions
Configuration -- env vars, .env.example
Development -- setup, tests, how to add a scorer or regime
