▶ Live demo: voxtutor.dexdevs.com — run it in your browser, free offline backend. Browse all 10 portfolio demos via the all demos link.
A pronunciation-assessment benchmark for spoken-language tutoring. VoxTutor synthesizes learner utterances from a known phoneme inventory with mispronunciations injected at known positions, then scores pronunciation scorers on whether they recover them. The result is a clean 2×2 dissociation: forced alignment and Goodness-of-Pronunciation normalization each buy robustness to a different real-world distortion, and you need both — speaking-rate variation breaks naive segmentation, channel noise breaks raw distance, and a control regime proves each failure comes from the missing ingredient, not the scorer in general.
Why it matters: a computer-assisted pronunciation tutor has to tell a genuine mispronunciation apart from a learner who simply spoke fast or recorded on a noisy mic. VoxTutor turns that folklore into a measured, reproducible result against a known answer — no acoustic model to train, no labelled speech corpus, no API keys.
$ voxtutor compare --regime warped # speaking-rate variation
scorer ingredients auroc
------------------------------------------
random baseline 0.4731
naive fixed+raw 0.6163
aligned align+raw 1.0000 # forced alignment is unaffected
normalized fixed+gop-norm 0.6486
gop align+gop-norm 1.0000
$ voxtutor compare --regime noisy # heteroscedastic channel noise
naive fixed+raw 0.7936
aligned align+raw 0.7932 # raw distance now collapses too
normalized fixed+gop-norm 0.9988 # GOP normalization is unaffected
gop align+gop-norm 0.9987Each utterance is a sequence of canonical phonemes rendered as acoustic feature frames around per-phoneme templates. A subset of positions is mispronounced (the template is nudged by a small, confusable amount). A scorer must flag those positions. It is built from two independent ingredients, giving a 2×2:
flowchart LR
subgraph regimes["regimes (known ground truth)"]
clean["clean<br/>(control)"]
warped["warped<br/>(speaking rate)"]
noisy["noisy<br/>(channel noise)"]
end
subgraph scorers["scorers = segmentation x normalization"]
naive["naive<br/>fixed + raw"]
aligned["aligned<br/>DTW + raw"]
normalized["normalized<br/>fixed + GOP"]
gop["gop<br/>DTW + GOP"]
end
regimes -->|score each<br/>position| scorers
scorers -->|AUROC vs<br/>ground truth| gate["eval gate<br/>asserts the<br/>dissociation"]
- segmentation —
fixedsplits the frames into equal chunks (correct only at constant speaking rate);DTWforced alignment re-segments the frames against the canonical sequence, so it survives rate variation. - normalization —
rawis the mean frame-to-template distance; by the bias–variance identity that equals the real error plus the within-phone variance, so channel noise inflates it.GOPscores the distance of the position's centroid to the template, subtracting the variance — a likelihood-ratio / Goodness-of-Pronunciation style correction.
Mean AUROC over 12 seeds, 40 utterances/cell, against the known mispronunciations
(1.0 = perfect, 0.5 = chance). Reproduce with python -m evals.harness.
| scorer | ingredients | clean (control) | warped | noisy |
|---|---|---|---|---|
| random | baseline | 0.491 | 0.509 | 0.508 |
| naive | fixed + raw | 1.000 | 0.620 ⤵ | 0.730 ⤵ |
| aligned | align + raw | 1.000 | 1.000 | 0.728 ⤵ |
| normalized | fixed + gop-norm | 1.000 | 0.662 ⤵ | 0.999 |
| gop | align + gop-norm | 1.000 | 1.000 | 0.999 |
- Effect 1 — forced alignment beats speaking-rate variation. The fixed-segmentation
scorers drop on
warped(naive1.000 → 0.620,normalized1.000 → 0.662); the forced-alignment scorers stay1.000. - Effect 2 — GOP normalization beats channel noise. The raw scorers drop on
noisy(naive1.000 → 0.730,aligned1.000 → 0.728); the normalized scorers stay~1.0. - Each ablation fails only its own regime, and
naive(neither ingredient) fails on both — so onlygopis robust everywhere. The collapse is the missing ingredient, not the scorer. - Scrambled-label null: shuffle the ground truth and every scorer falls to ~0.50,
confirming the AUROC is real (full table in
evals/RESULTS.md).
pip install -e ".[dev]" # numpy only; no API keys, no downloads
voxtutor compare --regime warped # all scorers on one regime
voxtutor compare --regime noisy
voxtutor score --method gop --regime noisy
voxtutor regimes # describe the distortion regimes
python -m evals.harness # write evals/RESULTS.md
python -m evals.gate # assert the dissociation (CI gate)
pytest -q # 56 testsConfigure via env vars (see .env.example): VOXTUTOR_METHOD,
VOXTUTOR_REGIME, VOXTUTOR_LABELS, VOXTUTOR_SAMPLES, VOXTUTOR_SEED.
docker build -t voxtutor .
docker run --rm voxtutor # runs the full offline benchmarkpip install -e ".[scipy]" # then the skipped tests runRecomputes the DTW frame-template cost matrix with scipy.spatial.distance.cdist and the
AUROC with scipy.stats.mannwhitneyu, and asserts both match the hand-rolled numpy core.
- Offline & deterministic. numpy is the only runtime dependency; every number is produced
from
np.random.default_rngwith a fixed salt, so CI reproduces the table bit-for-bit across Python 3.10–3.12. - Synthesized, not recorded. Utterances are generated from a known phoneme inventory with mispronunciations injected at known positions — the ground truth is exact, no speech corpus and no "is this label right?" ambiguity.
- The experiment is tuned, never the scorer. Distortion strengths are chosen to make the failure visible; the scorers are textbook (uniform segmentation, DTW forced alignment, GOP).
See docs/ARCHITECTURE.md and docs/DECISIONS.md
for the full design.
MIT — see LICENSE.
