This is the official repository for the paper
"VOSSA: Voiceprint Optimization for Streaming Speech Architectures"
by Mu-Ruei Tseng, Waris Quamer, Ghady Nasrallah, Ricardo Gutierrez-Osuna
Department of Computer Science & Engineering, Texas A&M University
2026: VOSSA accepted at Interspeech 2026.
Real-time voice conversion (VC) systems commonly rely on pretrained speaker embeddings from automatic speaker verification (ASV) models. While effective for speaker discrimination, these embeddings are trained to remain stable across a speaker's phonetic and prosodic variations, which may conflict with frame-level acoustic generation under streaming constraints. To address this issue, we propose VOSSA, a speaker representation framework that extracts speaker information from intermediate content encoder layers and aggregates it using attentive statistics pooling. The speaker embedding is trained jointly with the VC objectives, removing the need for a separate speaker encoder.
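The attentive statistics pooling step described above can be illustrated with a minimal NumPy sketch. This is not the VOSSA implementation (the source code is not yet released). The function name `attentive_stats_pooling`, the parameters `W`, `b`, `v`, and all dimensions are hypothetical, and the single-layer tanh scoring is the common formulation of the technique, not something the paper summary confirms. How VOSSA selects or combines intermediate encoder layers is not shown.

```python
import numpy as np

def attentive_stats_pooling(frames, W, b, v):
    """Pool a (T, D) frame sequence into a (2*D,) utterance-level vector.

    frames: (T, D) frame-level features (assumed to come from an encoder layer)
    W: (D, H), b: (H,), v: (H,) are hypothetical attention parameters
    """
    # Scalar attention score per frame: e_t = v^T tanh(W^T h_t + b)
    scores = np.tanh(frames @ W + b) @ v           # (T,)
    # Softmax over time, shifted by the max for numerical stability
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # (T,)
    # Attention-weighted mean and standard deviation per feature dimension
    mean = weights @ frames                        # (D,)
    var = weights @ (frames ** 2) - mean ** 2      # (D,)
    std = np.sqrt(np.clip(var, 1e-8, None))
    return np.concatenate([mean, std]), weights

# Toy usage with random data; sizes are illustrative only
rng = np.random.default_rng(0)
T, D, H = 50, 8, 4
frames = rng.standard_normal((T, D))
W = rng.standard_normal((D, H))
b = np.zeros(H)
v = rng.standard_normal(H)
embedding, weights = attentive_stats_pooling(frames, W, b, v)
```

The output concatenates a weighted mean and a weighted standard deviation, so a `(T, D)` input yields a `2*D`-dimensional embedding regardless of the number of frames `T`. With all-zero scores the weights become uniform and the result reduces to ordinary mean and standard deviation pooling.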
For more information, please check out our Demo Page.
- 19% fewer parameters than TVTSyn (132.4M vs. 162.8M) by eliminating the external speaker encoder
- Real-time streaming with RTF ≈ 0.25 and end-to-end latency ≈ 73 ms
- Best normalized target-speaker similarity across six evaluation datasets
- Improved pitch accuracy: lowest pitch MAE and highest Pearson CC among all baselines
- Better vowel preservation: lowest Wasserstein distance to ground-truth F1 distributions for high, mid, and low vowels
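As a quick check on the numbers in the list above, the snippet below recomputes the parameter reduction from the two reported model sizes and shows what a real-time factor (RTF) of 0.25 implies. It assumes the usual definition of RTF as processing time divided by audio duration; the 10-second clip length is an illustrative value, not a figure from the paper.

```python
# Reported model sizes in millions of parameters
vossa_params = 132.4
tvtsyn_params = 162.8

# Relative reduction with respect to the larger baseline
reduction_pct = 100.0 * (tvtsyn_params - vossa_params) / tvtsyn_params
print(f"Parameter reduction: {reduction_pct:.1f}%")

# RTF = processing time / audio duration (assumed standard definition)
rtf = 0.25
audio_seconds = 10.0  # illustrative clip length
compute_seconds = rtf * audio_seconds
print(f"{audio_seconds:.0f} s of audio takes about {compute_seconds:.1f} s to process")
```

An RTF below 1 means the system processes audio faster than it arrives, which is the basic requirement for streaming use.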
Source code coming soon.
BibTeX will be available upon official publication at Interspeech 2026.
