Tools and frameworks for multimodal speech generation and dialogue
SwanBench-Speech is a comprehensive benchmark designed to evaluate the performance of long-form speech generation models. SwanBench-Speech has three key properties:
1) Rich speech scenarios; 2) Comprehensive evaluation dimensions; 3) Valuable insights
DiTReducio is a training-free acceleration framework that compresses computations in DiT-based TTS models through a progressive calibration process.
Make-An-Audio is a prompt-enhanced diffusion model for text-to-audio generation that 1) introduces pseudo prompt enhancement with a distill-then-reprogram approach and 2) uses a spectrogram autoencoder to predict self-supervised audio representations instead of waveforms.
Code: https://github.com/Text-to-Audio/Make-An-Audio
FastSpeech proposes a novel feed-forward network based on the Transformer that generates mel-spectrograms in parallel for TTS.
Code: https://github.com/ming024/FastSpeech2
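Parallel mel-spectrogram generation requires expanding the phoneme-level sequence to frame length before decoding. The FastSpeech paper does this with a length regulator driven by per-phoneme durations. The following is a minimal sketch of that expansion step only, in plain Python. The names `length_regulate`, `phoneme_states`, and `durations` are hypothetical and are not taken from the linked repository, and the durations are hard-coded here rather than predicted by a model.

```python
from typing import List, Sequence


def length_regulate(
    hidden: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level vector durations[i] times (illustrative helper)."""
    if len(hidden) != len(durations):
        raise ValueError("expected exactly one duration per phoneme vector")
    frames: List[List[float]] = []
    for vec, dur in zip(hidden, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector once per output frame so frames do not share storage.
        frames.extend(list(vec) for _ in range(dur))
    return frames


# Toy input: three phoneme vectors of dimension 2 (assumed values).
phoneme_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Assumed durations in frames; a duration of 0 drops that phoneme.
durations = [2, 0, 3]

frames = length_regulate(phoneme_states, durations)
print(len(frames))  # total number of frames equals sum(durations)
```

Because every output frame is determined once the durations are known, a feed-forward decoder can process all frames at once instead of generating them one by one.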
MRSAudio is a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios.
Code: https://github.com/MRSAudio/MRSAudio_Main