Home
Rana Faraz edited this page Jun 23, 2026
A pronunciation-assessment benchmark for spoken-language tutoring.
VoxTutor synthesizes learner utterances from a known phoneme inventory, injects mispronunciations at known positions, and then measures how well each of four scorers recovers those positions. The result is a clean 2x2 dissociation: forced alignment (DTW) and Goodness-of-Pronunciation (GOP) normalization each buy robustness to a different real-world distortion, and you need both to be robust everywhere.
- Speaking-rate variation breaks naive fixed-window segmentation; DTW forced alignment fixes it.
- Channel noise inflates raw frame-to-template distances; GOP normalization (variance subtraction) fixes it.
- A scrambled-label null confirms every AUROC signal is real.
- No acoustic model, no speech corpus, no API keys -- numpy only.
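The first two bullets can be illustrated with a toy numpy sketch. This is not VoxTutor's code: every name here (`fixed_cost`, `dtw_cost`, `gop_score`), the sinusoidal template, and the noise levels are hypothetical. The normalization shown is one common GOP-style form (distance to the target template minus the distance to the best-matching template), which may differ from the variance subtraction the project actually uses.

```python
import numpy as np

def fixed_cost(seq, template):
    # Naive scorer: frame i is compared with template frame i (equal lengths assumed).
    return float(np.linalg.norm(seq - template, axis=-1).sum())

def dtw_cost(seq, template):
    # Classic dynamic time warping: cheapest monotone alignment of the two sequences.
    dist = np.linalg.norm(seq[:, None, :] - template[None, :, :], axis=-1)
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev
    return float(acc[n, m])

def gop_score(segment, templates, target):
    # Assumed GOP-style normalization: squared distance to the target template
    # minus the squared distance to the closest template of any phoneme.
    d = ((templates - segment) ** 2).sum(axis=1)
    return float(d[target] - d.min())

# Speaking-rate variation: the same trajectory, traversed at a non-uniform rate.
t = np.linspace(0.0, 1.0, 20)
template = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
warped = np.stack([np.sin(2 * np.pi * t**2), np.cos(2 * np.pi * t**2)], axis=1)
fixed_total = fixed_cost(warped, template)
dtw_total = dtw_cost(warped, template)

# Channel noise: a correctly pronounced segment at two noise levels.
rng = np.random.default_rng(0)
templates = 10.0 * np.eye(4, 8)  # four hypothetical phoneme templates
clean = templates[0] + 0.1 * rng.standard_normal(8)
noisy = templates[0] + 1.0 * rng.standard_normal(8)
wrong = templates[1] + 1.0 * rng.standard_normal(8)  # phoneme 1 produced, 0 expected

raw_clean = float(((clean - templates[0]) ** 2).sum())
raw_noisy = float(((noisy - templates[0]) ** 2).sum())
gop_clean = gop_score(clean, templates, target=0)
gop_noisy = gop_score(noisy, templates, target=0)
gop_wrong = gop_score(wrong, templates, target=0)
```

In this toy setup the DTW cost of the warped sequence is lower than its frame-by-frame cost, because the diagonal path is only one of the alignments DTW may choose. The raw distance of a correct segment grows with the noise level, while the normalized score stays at zero as long as the target is still the closest template and becomes positive only for the substituted phoneme.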
```mermaid
flowchart LR
    subgraph input["Utterance synthesis"]
        ph["Phoneme inventory\n(known templates)"]
        mp["Mispronunciation injection\n(known positions)"]
    end
    subgraph scorers["Scorers = segmentation x normalization"]
        naive["naive\nfixed + raw"]
        aligned["aligned\nDTW + raw"]
        normalized["normalized\nfixed + GOP"]
        gop["gop\nDTW + GOP"]
    end
    subgraph eval["Evaluation"]
        auroc["AUROC vs ground truth"]
        gate["Dissociation gate (CI)"]
    end
    ph --> mp --> scorers --> auroc --> gate
```
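The evaluation stage of the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's harness: the function names (`auroc`, `bootstrap_ci`), the synthetic scores, and the sample sizes are hypothetical, and the gate is assumed here to mean that the lower bound of a bootstrap confidence interval clears chance level (0.5). The actual gate may instead compare scorers against each other.

```python
import numpy as np

def auroc(scores, labels):
    # Probability that a random positive outscores a random negative; ties count half.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

def bootstrap_ci(scores, labels, rng, n_boot=500, alpha=0.05):
    # Percentile bootstrap over positions; resamples containing one class only are skipped.
    n = len(scores)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if labels[idx].all() or not labels[idx].any():
            continue
        stats.append(auroc(scores[idx], labels[idx]))
    low, high = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)

rng = np.random.default_rng(0)
labels = rng.random(400) < 0.3                      # True = mispronounced position
scores = rng.standard_normal(400) + 2.0 * labels    # synthetic scorer output

real = auroc(scores, labels)
null = auroc(scores, rng.permutation(labels))       # scrambled-label null
ci_low, ci_high = bootstrap_ci(scores, labels, rng)
passes_gate = ci_low > 0.5                          # assumed gate: CI excludes chance
```

Shuffling the labels breaks any link between score and ground truth, so the null AUROC should sit near 0.5. A real AUROC whose interval stays clear of that level is evidence that the scorer is tracking the injected mispronunciations rather than an artifact of the setup.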
```bash
pip install -e ".[dev]"
voxtutor compare --regime warped
voxtutor compare --regime noisy
python -m evals.harness
pytest -q
```

- Architecture -- utterance synthesis, scorer implementations, 2x2 design
- Evaluation -- benchmark setup, results table, regime descriptions
- Configuration -- env vars, .env.example
- Development -- setup, tests, how to add a scorer or regime