VoiceHub is a local-first speech toolkit that combines Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) in a clean Gradio app. It supports file uploads and live microphone streaming, console-style progress readouts, an in-app Log Panel, and optional Ollama integration to pre-chunk and punctuate text for smoother TTS.
Version 0.2.0 keeps XTTS as the safe default TTS family, upgrades ASR to faster-whisper + whisper-large-v3-turbo, and adds Qwen3-TTS 1.7B / 0.6B as optional backends with automatic fallback to XTTS when Qwen-TTS is unavailable or the detected language is not supported.
Version 0.1.5 is kept in a separate branch in case you prefer to fall back to the previous version.
- ASR default upgraded to faster-whisper + turbo.
- Two TTS families:
  - XTTS-v2 as the default and fallback backend.
  - Qwen3-TTS as an optional TTS backend.
  - Qwen3-TTS 1.7B / 0.6B support.
- Voice clone cache for Qwen:
  - ASR transcript is generated from the uploaded reference audio.
  - transcript is saved as `.txt`.
  - metadata is saved as `.json`.
  - repeated use of the same reference audio reuses the cached transcript.
- Backend-aware voice dropdown such as `Ryan (Qwen)` and `Aaron Dreschner (XTTS)`.
- Model-aware chunking:
  - XTTS keeps strict conservative chunking.
  - Qwen uses softer chunking.
- Lazy loading:
  - Qwen is not loaded or downloaded until you select Qwen.
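
The voice clone cache described in the list above can be pictured roughly as follows. This is only a minimal sketch, not the code in `voice_clone_cache.py`: the function name `cached_transcript`, the content-hash file naming, and the metadata fields are all assumptions made for illustration, and `transcribe` stands in for whatever ASR call produces the transcript.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_transcript(
    ref_audio: Path,
    cache_dir: Path,
    transcribe: Callable[[Path], str],
) -> str:
    """Return the transcript for ref_audio, running ASR only on a cache miss."""
    # Assumption: key the cache on file content, so a renamed copy still hits.
    digest = hashlib.sha256(ref_audio.read_bytes()).hexdigest()[:16]
    cache_dir.mkdir(parents=True, exist_ok=True)
    txt_path = cache_dir / f"{digest}.txt"
    json_path = cache_dir / f"{digest}.json"

    if txt_path.exists():
        # Cache hit: reuse the stored transcript and skip ASR entirely.
        return txt_path.read_text(encoding="utf-8")

    text = transcribe(ref_audio)
    txt_path.write_text(text, encoding="utf-8")
    # Hypothetical metadata layout; the real .json fields are not documented here.
    json_path.write_text(
        json.dumps({"source": ref_audio.name, "sha256_prefix": digest}, indent=2),
        encoding="utf-8",
    )
    return text
```

Calling it twice with the same reference file runs `transcribe` once; the second call reads the `.txt` file.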
Personal note: I still think XTTS is the better choice if you want speed. Qwen-TTS can sound better in some cases, but it is considerably slower. Depending on your application, XTTS might be more than enough and much faster.
Features from VoiceHub 0.1.5 and 0.2.0.
- Two-way speech pipeline.
  - Speech → Text (ASR) via faster-whisper (GPU/CPU) with VAD and streaming mic capture; OpenAI Whisper is still available as an alternative backend.
  - Text → Speech (TTS) via Coqui XTTS-v2 with speaker discovery, speed control, optional reference-voice cloning, and optional Qwen3-TTS backends.
- Preferences stored in a normal JSON config file with migration from the older `~/.voicehub/config.json` layout.
- Config tab for per-model defaults (ASR, TTS, Ollama) with Save and Reset to recommended defaults.
- Log Panel tab that mirrors stdout/stderr into an in-app textbox.
- Console-style progress bars (single-line, printed to the log/prompt). I avoid multiple Gradio progress widgets to keep the UI clean.
- Optional Ollama integration:
  - Pre-chunker for TTS: refine punctuation and split long text into TTS-friendly segments.
  - Translator for ASR: translate recognized text into another language directly from the ASR tab.
  - UI to refresh models, test connectivity, and Set as default model. Public fallback model is `gemma3:12b`, and you can persist a different choice.
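
To give an idea of what a console-style progress readout from the feature list can look like, here is a small sketch. The helper names `render_bar` and `print_progress` are hypothetical and are not taken from `progress_utils.py`; the bar layout is an assumption as well.

```python
import sys


def render_bar(done: int, total: int, width: int = 30) -> str:
    """Format one progress line, for example '[#####-----] 5/10 (50%)'."""
    total = max(total, 1)              # avoid division by zero
    done = min(max(done, 0), total)    # clamp into [0, total]
    filled = width * done // total
    pct = 100 * done // total
    return f"[{'#' * filled}{'-' * (width - filled)}] {done}/{total} ({pct}%)"


def print_progress(done: int, total: int) -> None:
    # '\r' moves back to the start of the line so each update overwrites the last.
    sys.stdout.write("\r" + render_bar(done, total))
    if done >= total:
        sys.stdout.write("\n")
    sys.stdout.flush()
```

Because the output goes through plain `stdout`, anything that mirrors the console (such as a log panel) sees the same text without an extra UI widget.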
- A fresh Python environment is strongly recommended for 0.2.0.
- Recommended Python: 3.12 for VoiceHub 0.2.0.
- GPU is optional, but strongly recommended for a smoother experience.
- VoiceHub 0.2.0 now has split requirement files so Qwen stays optional:
  - `requirements.txt` → lightweight / XTTS-first install.
  - `requirements_xtts.txt` → XTTS-only install.
  - `requirements_full.txt` → XTTS + optional Qwen install.
- Keep version 0.1.5 in a separate branch / separate environment if you want a safe fallback path.

The `requirements*.txt` files intentionally do not include PyTorch. Install PyTorch first (GPU or CPU), then install the rest of the dependencies.
Conda (recommended)
# from repository root
conda create --name voicehub_020 python=3.12 -y
conda activate voicehub_020
OR: venv (pip)
python -m venv .venv
# Windows: .venv\Scripts\activate
source .venv/bin/activate
On Windows, my own workflow to make this project work with faster Qwen inference was:
- install a prebuilt FlashAttention wheel from the community Windows wheel page:
  - https://huggingface.co/ussoewwin/Flash-Attention-2_for_Windows
- then install the wheel locally, for example:
pip install flash_attn-2.8.2%2Bcu129torch2.8.0cxx11abiTRUE-cp312-cp312-win_amd64.whl
Otherwise, the usual direct attempt is:
pip install flash-attn --no-build-isolation
This is optional and only relevant if you want Qwen-TTS to run faster. XTTS does not need it.
# CUDA 12.8
pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu128
# CUDA 12.6
pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu126
# CPU only
pip install --index-url https://download.pytorch.org/whl/cpu torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1
Notes
- A CUDA-capable GPU is recommended for a smoother experience.
- CPU mode should work, but it will be much slower.
- Apple Silicon (M1/M2/M3): use the CPU wheels; on macOS these builds include the Metal (MPS) backend.
- If you already installed a different Torch build in your env, these commands will reinstall the specified version.
Quick check
python - << 'PY'
import torch
print("torch:", torch.__version__, "cuda:", torch.cuda.is_available())
PY
Here you have two main options: XTTS only, or XTTS + Qwen.
XTTS-focused install
pip install -r requirements_xtts.txt
Full install with optional Qwen
pip install -r requirements_full.txt
requirements.txt is also kept as a lightweight XTTS-first install if you want the simpler default path.
Qwen may require SoX to be available on your system PATH.
For audio conversion utilities, install ffmpeg. A simple option is:
conda install -c conda-forge ffmpeg
conda install -c conda-forge sox
Either use python app.py or run.sh / run.bat.
python app.py
# windows
run.bat
# linux
bash run.sh
By default the app binds to 127.0.0.1:7870. You can override with:
SERVER_NAME=127.0.0.1 SERVER_PORT=7860 python app.py
VoiceHub can be customized at launch via environment variables.
Set them temporarily in your shell, or permanently in your run.sh / run.bat.
- `SERVER_NAME` – interface to bind the Gradio app. Default: `127.0.0.1`
  - set to `0.0.0.0` for LAN access.
- `SERVER_PORT` – port number. Default: `7870`
- `MAX_FILE_SIZE` – max file upload size. Default: `300mb`
- `VOICEHUB_PREFS_DIR` – folder where preferences (for example `config.json`) are stored.
  - default: `~/.voicehub/preferences/`
- `ASR_MODEL` – default is `turbo`
- `ASR_INT8` – set to `1` to use `int8_float16`
- `ASR_BACKEND` – UI still exposes faster-whisper / OpenAI Whisper choices directly
- `TTS_MODEL` – override XTTS model id if needed
- `QWEN_CUSTOM_MODEL` – defaults to the selected Qwen model size, for example `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`
- `QWEN_CLONE_MODEL` – defaults to the selected Qwen model size, for example `Qwen/Qwen3-TTS-12Hz-1.7B-Base`
- `QWEN_TTS_MAX_CHARS` – default shared Qwen chunk cap is `512`
- `QWEN_MAX_NEW_TOKENS` – default internal generation guard is `1024`
- `OLLAMA_ENABLE` – `1` to enable Ollama integration (default `0`)
- `OLLAMA_MODEL` / `OLLAMA_MODEL_DEFAULT` – which Ollama model to use
- `OLLAMA_HOST` – default `http://127.0.0.1:11434`
- `OLLAMA_TIMEOUT` – timeout in seconds (default: `30`)
- `OLLAMA_MAX_SEG_CHARS` – max characters per segment Ollama should return (default: `200`)
Linux / macOS (bash):
SERVER_PORT=7860 OLLAMA_ENABLE=1 python app.py
Windows (cmd / run.bat):
@echo off
set SERVER_PORT=7860
set OLLAMA_ENABLE=1
python app.py
Windows (PowerShell):
$env:SERVER_PORT=7860
$env:OLLAMA_ENABLE=1
python app.py
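
For readers who want to see how launch variables like these are typically consumed on the Python side, here is a hedged sketch. The `LaunchSettings` dataclass and `load_launch_settings` function are hypothetical names, not part of VoiceHub; only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LaunchSettings:
    server_name: str
    server_port: int
    max_file_size: str
    ollama_enable: bool
    ollama_host: str
    ollama_timeout: float


def load_launch_settings(env: Mapping[str, str] = os.environ) -> LaunchSettings:
    """Read launch settings from the environment, applying documented defaults."""
    return LaunchSettings(
        server_name=env.get("SERVER_NAME", "127.0.0.1"),
        server_port=int(env.get("SERVER_PORT", "7870")),
        max_file_size=env.get("MAX_FILE_SIZE", "300mb"),
        # Assumption: only the exact string "1" switches Ollama on.
        ollama_enable=env.get("OLLAMA_ENABLE", "0") == "1",
        ollama_host=env.get("OLLAMA_HOST", "http://127.0.0.1:11434"),
        ollama_timeout=float(env.get("OLLAMA_TIMEOUT", "30")),
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to try out without touching your shell.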
- Bootstrap happens in `app.py`. It makes `src/` importable, queues Gradio for streaming events, increases max upload size, and mutes specific non-critical errors/warnings.
- UI lives in `src/voicehub/ui.py`: builds tabs, wires buttons, and manages component state. Mic streaming uses `Audio.start_recording/stream/stop_recording`.
- Preferences are JSON under the user prefs directory (default `~/.voicehub/preferences/config.json`).
- User settings (beam size, XTTS/Qwen chunk caps, clone reference caps, mic hard cap, etc.) are centralized and persisted via the Config tab.
- Log Panel tees stdout/stderr so you can glance at everything from inside the UI.
- Stored under `~/.voicehub/preferences/config.json` by default.
- You can relocate them with the `VOICEHUB_PREFS_DIR` env var.
- A legacy `~/.voicehub/config.json` is migrated automatically on first run.
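
The path resolution and legacy migration described above could be sketched like this. It is an illustration under stated assumptions, not the code in `prefs.py`: the function name `resolve_prefs_file` is hypothetical, and copying (rather than moving) the legacy file is my own choice for the example.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping


def resolve_prefs_file(home: Path, env: Mapping[str, str] = os.environ) -> Path:
    """Return the config.json path, migrating a legacy file if one exists."""
    override = env.get("VOICEHUB_PREFS_DIR")
    prefs_dir = Path(override).expanduser() if override else home / ".voicehub" / "preferences"
    prefs_dir.mkdir(parents=True, exist_ok=True)

    target = prefs_dir / "config.json"
    legacy = home / ".voicehub" / "config.json"
    # Assumption: copy instead of move, so the legacy file survives as a backup.
    # The migration only runs while the new file does not exist yet.
    if not target.exists() and legacy.is_file():
        shutil.copy2(legacy, target)
    return target
```

A typical call would be `resolve_prefs_file(Path.home())`; once the new file exists, later calls leave it untouched.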
Update global defaults for:
- Whisper (ASR): temperature, beam size, condition on previous text, microphone stream hard cap.
- TTS (XTTS / Qwen): model family, Qwen model size, chunk-size controls, clone reference-audio caps, default max output minutes, and Qwen style prompt.
- Ollama (optional): temperature, top-p, token cap, optional stop sequences.
- Backends:
  - faster-whisper (recommended): GPU/CPU, supports STOP and progress.
  - OpenAI Whisper: available as an alternative backend.
- Mic streaming: the browser streams chunks; VoiceHub buffers them and enforces a hard cap by minutes (configurable), trimming the last chunk precisely when the cap is hit. On stop, it saves one WAV for preview and runs the normal transcription path.
- Upload mode: provide audio and hit Transcribe.
- Translate (optional): use Ollama to translate the transcript from the ASR Advanced accordion. Includes Refresh models and Test Ollama.
- ASR STOP button: best experience is with faster-whisper.
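
The mic hard cap from the list above (buffer incoming chunks, trim the last one so the total never exceeds the cap) can be sketched as follows. `CappedBuffer` is a hypothetical class written for this README, and it stores samples in a plain Python list to stay dependency-free; the real buffer in `asr.py` may well use arrays instead.

```python
from typing import List


class CappedBuffer:
    """Accumulate audio samples up to a hard cap given in minutes."""

    def __init__(self, sample_rate: int, cap_minutes: float) -> None:
        self.max_samples = int(sample_rate * 60 * cap_minutes)
        self.samples: List[float] = []

    def push(self, chunk: List[float]) -> bool:
        """Append a chunk; return False once the cap has been reached."""
        room = self.max_samples - len(self.samples)
        # Keep only the part of the chunk that still fits under the cap.
        self.samples.extend(chunk[:max(room, 0)])
        return len(self.samples) < self.max_samples
```

When `push` returns `False`, the caller can stop recording and hand `samples` to the normal transcription path.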
- Engines:
  - Coqui XTTS-v2 (default / fallback)
  - Qwen3-TTS (optional)
- Language & voice: choose TTS language, pick a backend-aware voice, adjust speed, and optionally provide a reference audio file to clone or bias the voice depending on the backend.
- Voice cloning caps: clone reference audio is automatically trimmed if it exceeds the configured backend cap.
  - XTTS default cap: 300 seconds
  - Qwen default cap: 50 seconds
- Chunking: the TTS chunker uses a library-backed sentence splitter with the legacy in-repo chunker kept as fallback. Optional Ollama pre-chunker can refine punctuation first, but VoiceHub rejects refinements that change the original content/order. Progress is printed line-by-line; output audio is concatenated through a hardened join path that validates sample rates and smooths chunk boundaries.
- Qwen routing: when Qwen is selected, VoiceHub tries Qwen first and falls back to XTTS if Qwen is unavailable or the target language is unsupported.
- Warnings: if a backend can’t move to GPU, VoiceHub falls back as safely as it can and keeps going.
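
The Qwen routing rule in the list above (try Qwen first, fall back to XTTS when Qwen is unavailable or the language is unsupported) boils down to something like the sketch below. The function name, its parameters, and the two callables are placeholders for illustration; the actual helpers in `tts_router.py` may be shaped differently.

```python
from typing import Callable, Set, Tuple


def synthesize_with_fallback(
    text: str,
    language: str,
    qwen_languages: Set[str],
    qwen_synth: Callable[[str, str], bytes],
    xtts_synth: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Return (backend_used, audio), preferring Qwen when it can handle the request."""
    if language in qwen_languages:
        try:
            return "qwen", qwen_synth(text, language)
        except Exception as exc:  # for example: model missing or failed to load
            print(f"Qwen unavailable ({exc}); falling back to XTTS")
    # Unsupported language or Qwen failure: use the default XTTS backend.
    return "xtts", xtts_synth(text, language)
```

Returning the backend name alongside the audio makes it easy to log which path actually produced the output.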
- Enable it from TTS › Advanced and ASR › Advanced.
- You can Refresh models, Test Ollama, and Set as default model from the UI.
- Default model precedence:
  - Saved user preference (`ollama_model_default`) if present.
  - `OLLAMA_MODEL` or `OLLAMA_MODEL_DEFAULT` env vars.
  - Public fallback `gemma3:12b`.
- Pre-chunk prompt: helps punctuation, splitting, and cleanup before TTS.
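
The default model precedence above can be expressed as a short lookup. This is a sketch only: `resolve_ollama_model` is a hypothetical name, `prefs` stands for the loaded preferences dict, and checking `OLLAMA_MODEL` before `OLLAMA_MODEL_DEFAULT` is an assumption, since the list does not say which of the two wins.

```python
import os
from typing import Mapping

PUBLIC_FALLBACK = "gemma3:12b"


def resolve_ollama_model(
    prefs: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the Ollama model: saved preference, then env vars, then the fallback."""
    candidates = (
        prefs.get("ollama_model_default"),
        env.get("OLLAMA_MODEL"),          # assumed to win over the next one
        env.get("OLLAMA_MODEL_DEFAULT"),
        PUBLIC_FALLBACK,
    )
    # The first non-empty candidate wins; the fallback guarantees a result.
    return next(c.strip() for c in candidates if c and c.strip())
```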
- Log Panel tab mirrors the real console and includes Clear logs.
- Debug (dev) tab (hidden unless `DEBUG_TOOLS=1`): inspect the full TTS chunking pipeline (raw → optional Ollama → sentences → chunks) plus language detection output.
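
Mirroring the console into an in-app panel usually comes down to a small "tee" object that writes to the real stream and to a second sink. The sketch below shows the idea with a hypothetical `Tee` class and a plain list as the sink; it is not the implementation in `log_panel.py`.

```python
import contextlib
import sys
from typing import List, TextIO


class Tee:
    """File-like object that forwards writes to a stream and records them."""

    def __init__(self, original: TextIO, sink: List[str]) -> None:
        self.original = original
        self.sink = sink

    def write(self, text: str) -> int:
        self.original.write(text)   # still show the text in the real console
        self.sink.append(text)      # keep a copy for the in-app log view
        return len(text)

    def flush(self) -> None:
        self.original.flush()


log_lines: List[str] = []
with contextlib.redirect_stdout(Tee(sys.stdout, log_lines)):
    print("model loaded")
# "".join(log_lines) now holds the same text that reached the console.
```

A "Clear logs" action then only needs to empty the sink.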
.
├─ app.py                     # entrypoint; Gradio launch; startup filters
├─ run.sh / run.bat           # convenience launchers
├─ requirements.txt           # lightweight / XTTS-first install
├─ requirements_xtts.txt      # XTTS-only install
├─ requirements_full.txt      # XTTS + optional Qwen install
├─ environment.yml
├─ data/                      # example samples
├─ docs/                      # screenshots
└─ src/voicehub/
   ├─ ui.py                   # UI, tabs, wiring, STOP buttons, mic streaming
   ├─ asr.py                  # faster-whisper / Whisper backends; stream buffer & hard cap
   ├─ tts.py                  # XTTS + Qwen synth orchestration; chunking; progress; STOP
   ├─ tts_router.py           # backend routing helpers
   ├─ qwen_backend.py         # Qwen model wrappers / loaders
   ├─ voice_clone_cache.py    # Qwen transcript / metadata cache
   ├─ config.py               # language lists, model names, backends & defaults
   ├─ config_ui.py            # Config tab (save/reset)
   ├─ user_settings.py        # persisted defaults (per model)
   ├─ prefs.py                # user prefs path + migration helpers
   ├─ ollama_config.py        # Ollama defaults + preference helpers
   ├─ ollama_utils.py         # list models, test link, refine/translate
   ├─ chunking.py             # chunking helpers
   ├─ audio_utils.py          # robust audio concat helpers
   ├─ progress_utils.py       # console-style progress helpers
   ├─ log_panel.py            # in-app log tee with Clear
   ├─ debug_ui.py             # developer pipeline inspector
   ├─ lang_detect.py          # TTS language auto-detect helper
   └─ __init__.py
Useful knobs when launching:
- Server: `SERVER_NAME` (default `127.0.0.1`), `SERVER_PORT` (default `7870`).
- Uploads: `MAX_FILE_SIZE` (for example `300mb`).
- Preferences dir: `VOICEHUB_PREFS_DIR` (defaults to `~/.voicehub/preferences/`).
- Ollama: `OLLAMA_ENABLE`, `OLLAMA_MODEL`, `OLLAMA_MODEL_DEFAULT`, `OLLAMA_HOST`, `OLLAMA_TIMEOUT`, `OLLAMA_MAX_SEG_CHARS`.
- Debug tab: `DEBUG_TOOLS=1` to show the developer tab.
- XTTS voice cloning complains about TorchCodec: install `torchcodec` in the same env.
- Qwen is too slow: XTTS is still the safe default. If you really want Qwen speedups, use a CUDA setup and optionally FlashAttention.
- Qwen needs SoX / ffmpeg: install them and make sure they are on your PATH.
- XTTS won’t use GPU: VoiceHub tries GPU first and can fall back to CPU.
- ASR mic recording stops early: increase ASR microphone (minutes) in Config. A hard cap is enforced; the last chunk is trimmed to fit.
- STOP is not equally strong on every backend: faster-whisper and XTTS are the better-supported paths. Qwen stop is more best-effort and still depends on the underlying generation call.
- Progress bars: console-style only (by design) to avoid UI clutter.
- Chunking: sentence-first assembly with conservative caps; Ollama pre-chunker is optional and tunable.
- Qwen: optional and slower; best treated as an extra backend, not the only reason to use the app.
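
"Sentence-first assembly with conservative caps" can be illustrated with a few lines of Python. This is a simplified stand-in, not the code in `chunking.py`: it uses a naive regex splitter where VoiceHub uses a library-backed one, and `chunk_text` with its `max_chars` parameter is a hypothetical helper.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the cap.
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate      # the piece still fits in this chunk
            else:
                chunks.append(current)   # close the chunk and start a new one
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

A stricter backend would simply call this with a smaller `max_chars`; a softer one (such as the Qwen cap of `512`) with a larger value.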
- Download: sample_audio_1.wav
- Download: sample_audio_2.wav
- Text: sample_text.txt
This project is licensed under the MIT License.
conda create -n voicehub_020 python=3.12 -y
conda activate voicehub_020
pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
run.bat
# open http://127.0.0.1:7870
- TTS: paste text → pick language/voice → Synthesize → optional Ollama pre-chunker → STOP if needed.
- ASR: upload audio or use Microphone → Transcribe → STOP → optional Translate via Ollama.



