Refinery helps find the best reference clip combinations for cloning a voice with Fish Audio and Fish-Speech voice cloning models.
Refinery uses an intuitive, human-in-the-loop iterative process: it generates candidate ref combinations, renders them against the same phrases/styles, lets you favorite the best outputs, and then biases the next round toward those refs.
Same speaker (the bundled public-domain LJSpeech voice), same phrase, same model — only the reference combination differs between rounds. Listen to round 1 (naive random pick) versus round 3 (refined favorite-weighted pick) on the project site.
- Random K-of-N ref search
- Favorite-weighted refinement
- Side-by-side audio comparison
- Customizable test phrases
- Bracket-style prompt matrix (S2 tags)
- Model, latency, sampling, prosody, chunk, and long-form controls
- In-memory TTS cache
- JSON recipe export
Python 3.11+ with uv and/or Docker Compose.
The preferred local Python workflow uses a project-local virtual environment at .venv/.
Clone and create a local env file:

    git clone https://github.com/mikeharty/refinery.git
    cd refinery
    cp .env.example .env

Create the local virtual environment and install dev dependencies:

    uv sync --dev

VS Code is configured to use `.venv/bin/python` automatically.
| Variable | Default | Used by | Notes |
|---|---|---|---|
| `REFERENCE_ROOT` | `./refs` | Native Refinery | Reference-set root |
| `FISH_TTS_URL` | `http://fish-speech:8080/v1/tts` | Both | TTS endpoint used by both native and Docker runs |
| `FISH_API_KEY` | unset | Both | Optional bearer token |
| `FISH_MODEL` | `s2-pro` | Both | Model header sent with TTS requests |
| `FISH_TTS_TIMEOUT_SECONDS` | `0` | Both | Generation/read timeout; `0` waits indefinitely |
| `FISH_CONNECT_TIMEOUT_SECONDS` | `10` | Both | Connect timeout for unreachable Fish endpoints |
| `MAX_TTS_CACHE_ITEMS` | `256` | Both | Set `0` to disable in-memory audio cache |
| `PORT` | `5055` | Native Refinery | Web UI port for `python app.py` |
| `REFINERY_PORT` | `5055` | Docker Refinery | Host port mapped to container port 5055 |
See .env.example for more options and details.
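As a rough illustration of how the variables in the table might be consumed, here is a minimal sketch. The `Settings` class and the `load_settings` and `request_headers` helpers are hypothetical names, not Refinery's actual code, and the `model` and `Authorization` header names are assumptions based on the "Model header" and "bearer token" notes in the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    tts_url: str
    api_key: Optional[str]
    model: str
    connect_timeout: float
    read_timeout: Optional[float]  # None means "wait indefinitely"
    max_cache_items: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    read_seconds = float(env.get("FISH_TTS_TIMEOUT_SECONDS", "0"))
    return Settings(
        tts_url=env.get("FISH_TTS_URL", "http://fish-speech:8080/v1/tts"),
        api_key=env.get("FISH_API_KEY") or None,
        model=env.get("FISH_MODEL", "s2-pro"),
        connect_timeout=float(env.get("FISH_CONNECT_TIMEOUT_SECONDS", "10")),
        # A value of 0 is documented as "waits indefinitely", so map it to None.
        read_timeout=read_seconds if read_seconds > 0 else None,
        max_cache_items=int(env.get("MAX_TTS_CACHE_ITEMS", "256")),
    )


def request_headers(settings: Settings) -> dict[str, str]:
    """Build request headers; the header names here are assumptions."""
    headers = {"model": settings.model}
    if settings.api_key:
        headers["Authorization"] = f"Bearer {settings.api_key}"
    return headers
```

Most HTTP clients accept a separate connect and read timeout, which is why the sketch keeps the two values apart instead of collapsing them into one number.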
You can choose between hosted Fish Audio or self-hosted Fish-Speech. The latter can run in Docker or natively on Apple Silicon.
| Backend | Use when | Endpoint |
|---|---|---|
| Hosted Fish Audio | You have an API key and want the simplest path | https://api.fish.audio/v1/tts |
| Docker Fish-Speech | Linux/WSL with NVIDIA CUDA | http://fish-speech:8080/v1/tts inside Compose |
| Native macOS Fish-Speech | Apple Silicon with MPS | http://127.0.0.1:8080/v1/tts |
Set `FISH_TTS_URL` to one of the endpoints above, or to a custom endpoint if you have a different setup. If you use the hosted API, set `FISH_API_KEY` to your API key and optionally set `FISH_MODEL` if you want a model other than the default `s2-pro`:

    FISH_TTS_URL=https://api.fish.audio/v1/tts
    FISH_API_KEY=your_api_key_here
    FISH_MODEL=s2-pro

The Docker Fish-Speech backend is for Linux/WSL with a CUDA-capable NVIDIA GPU.
Local S2-Pro inference is heavy; Fish's docs currently recommend at least 24GB VRAM.

    docker compose --profile download run --rm fish-models
    docker compose --profile fish up --build

This downloads `fishaudio/s2-pro` into `./fish-checkpoints/s2-pro`, starts Fish-Speech on port 8080, and starts Refinery on port 5055.
To run only Fish-Speech in Docker and run Refinery natively:

    docker compose --profile fish up fish-speech

If Fish-Speech is running on your Mac, keep this in `.env`:

    FISH_TTS_URL=http://127.0.0.1:8080/v1/tts

Docker Desktop on macOS does not expose Apple Metal/MPS to Linux containers. On Apple Silicon, use the project-local native scripts:

    scripts/install-fish-macos.sh --install-brew-deps
    scripts/start-fish-macos.sh

The installer clones Fish-Speech into `.local/fish-speech`, installs it with uv, and downloads `fishaudio/s2-pro` into that project-local directory. The `--install-brew-deps` flag installs missing native audio dependencies (ffmpeg, sox, and portaudio) with Homebrew; omit it if you already have them. The installer does not touch global Fish-Speech clones or system Python environments.
Useful install/start options:

    scripts/install-fish-macos.sh --update
    FISH_SPEECH_API_PORT=8081 scripts/start-fish-macos.sh
    scripts/start-fish-macos.sh --cpu
    scripts/start-fish-macos.sh --no-half
    FISH_SPEECH_EXTRA_ARGS="--workers 1" scripts/start-fish-macos.sh

Remove only the project-local Fish-Speech install:

    scripts/uninstall-fish-macos.sh

Native:

    uv run uvicorn app:app --host 0.0.0.0 --port 5055 --reload

Docker:

    docker compose up --build refinery

If the Refinery container needs to call Fish-Speech on your host machine, set:

    FISH_TTS_URL=http://host.docker.internal:8080/v1/tts

Open http://localhost:5055.
Each reference set is any folder under refs/ that directly contains paired .wav and .lab files:

    refs/
      ljspeech_linda_johnson/
        LJ001-0001.wav
        LJ001-0001.lab
        LJ001-0003.wav
        LJ001-0003.lab
      my_voice/
        clip_01.wav
        clip_01.lab
      bender-moods/
        angry/
          angry_01.wav
          angry_01.lab
        tired/
          tired_01.wav
          tired_01.lab

Rules:
- `.lab` must contain the exact transcript for its paired `.wav`.
- Nested folders are supported. A grouping folder such as `refs/bender-moods/` does not need clips of its own; each mood folder below it appears as a selectable reference set such as `bender-moods/angry`.
- Mood folders are not merged into one parent pool. Choose one mood set at a time from the reference-set picker.
- Local audio and `.lab` transcript files are ignored by git so private refs are not committed accidentally.
- `refs/ljspeech_linda_johnson` is bundled public-domain sample material from The LJ Speech Dataset.
- Only use voices you have permission to clone.
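The folder rules above can be sketched as a small discovery function. This is an illustrative reading of the rules, not Refinery's loader: `discover_reference_sets` is a hypothetical name, and the choice to skip a `.wav` with no matching `.lab` is an assumption.

```python
from pathlib import Path


def discover_reference_sets(root: Path) -> dict[str, list[str]]:
    """Map each reference-set name to the clip stems it directly contains."""
    sets: dict[str, list[str]] = {}
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        # Only clips directly inside the folder count, so a grouping
        # folder is never merged with the mood folders below it.
        stems = sorted(
            wav.stem
            for wav in folder.glob("*.wav")
            if wav.with_suffix(".lab").is_file()
        )
        if stems:
            sets[folder.relative_to(root).as_posix()] = stems
    return sets
```

With the layout shown earlier, `discover_reference_sets(Path("refs"))` would list `bender-moods/angry` and `bender-moods/tired` as separate sets and would not list `bender-moods` itself, because that folder holds no clips of its own.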
Fish S2 style tags are useful for prompting, but they are not always enough. If a cloned voice becomes unstable, loses the original speaker, or turns cartoonish when asked for a mood, use mood-specific reference sets instead: group clips by the same original speaker in that actual mood, then let Refinery search combinations inside that folder. The tag can still describe the desired delivery, but the refs carry the acoustic evidence.
For reference clips that do not already have trusted transcripts, use the batch transcription script. It scans .wav files by default because Refinery loads .wav/.lab pairs, writes each transcript to the matching .lab, and stores backups plus a JSONL manifest under output/transcriptions/.
Preview the files that would be processed:

    scripts/transcribe-ref-labs-local.sh refs/my_voice --dry-run

The local wrapper installs what it needs on first run. On Apple Silicon macOS it uses mlx-whisper; elsewhere it uses faster-whisper. Local transcription is free after the model download and keeps audio on your machine. Local providers default to large Whisper models.

Run local speech-to-text:

    scripts/transcribe-ref-labs-local.sh refs/my_voice --language en

Override the local model when you want a different quality/runtime tradeoff:

    scripts/transcribe-ref-labs-local.sh refs/my_voice --language en --mlx-model mlx-community/whisper-small.en-mlx
    scripts/transcribe-ref-labs-local.sh refs/my_voice --language en --provider faster-whisper --local-model medium.en

OpenAI transcription is available as an equivalent hosted provider:

    OPENAI_API_KEY=your_api_key_here uv run python scripts/transcribe_ref_labs.py refs/my_voice --provider openai --language en

The underlying Python script still supports `--provider auto`, `--provider faster-whisper`, `--provider mlx-whisper`, `--provider whisper-cli`, and `--provider openai`. Use `--missing-only` to fill only missing labels or `--limit 5` for a small trial run.
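To check by hand which clips still lack a transcript before running the script, something like the following may help. It is a standalone sketch: `clips_missing_labels` is a hypothetical helper, and scanning subfolders recursively is an assumption rather than documented behavior of the transcription script.

```python
from pathlib import Path


def clips_missing_labels(ref_dir: Path) -> list[Path]:
    """Return every .wav under ref_dir that has no sibling .lab file."""
    return sorted(
        wav
        for wav in ref_dir.rglob("*.wav")
        if not wav.with_suffix(".lab").is_file()
    )
```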
Edit texts.json or use the UI.

    [
      "Your first test phrase here.",
      "Another phrase to compare voices with.",
      "A third phrase for good measure."
    ]

Keep phrase and style counts small when using a paid endpoint. Refinery renders every variant against every phrase/style combination.
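To see why small counts matter, the render matrix can be sketched as a plain Cartesian product. `render_jobs` is a hypothetical helper, and prefixing the bracket tag to the phrase is an assumption about how style tags are applied.

```python
from itertools import product


def render_jobs(variants, phrases, styles=("",)):
    """Pair every variant with every phrase/style sample."""
    samples = [
        f"{style} {phrase}".strip()  # an empty style leaves the phrase as-is
        for phrase, style in product(phrases, styles)
    ]
    return [(variant, sample) for variant in variants for sample in samples]
```

The total is `len(variants) * len(phrases) * len(styles)`, so 6 variants, 3 phrases, and 2 styles already come to 36 renders per round.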
Generate variants:
- Load `.wav`/`.lab` pairs from the selected reference set.
- Create random K-of-N reference combinations.
- Expand phrases across optional S2 style tags.
- Render audio for each variant/sample pair.
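The random K-of-N step above could look roughly like this. It is a minimal sketch, not Refinery's implementation: `random_combinations` is a hypothetical name, and avoiding duplicate combinations is an assumption.

```python
import random


def random_combinations(clips, k, count, seed=None):
    """Draw up to `count` distinct k-clip combinations from a list of clips."""
    rng = random.Random(seed)
    seen = set()
    combos = []
    attempts = 0
    # Cap the attempts so the loop ends even when fewer than `count`
    # distinct combinations exist.
    while len(combos) < count and attempts < count * 50:
        attempts += 1
        combo = tuple(sorted(rng.sample(clips, k)))
        if combo not in seen:
            seen.add(combo)
            combos.append(combo)
    return combos
```

Sorting each combination makes `("a", "b")` and `("b", "a")` count as the same variant, which seems reasonable if clip order does not matter to the model. If order does matter, the `sorted` call can be dropped.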
Refine:
- Give references from favorited variants 2x selection weight.
- Ensure at least half of new variants include a favorited reference.
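One way to read the two refinement rules is sketched below. The function names are hypothetical, and the repair step (swapping one clip for a favorite until half the variants qualify) is an assumed way of meeting the "at least half" rule, not necessarily how Refinery enforces it. The sketch expects `1 <= k <= len(clips)` and does not remove duplicate combinations.

```python
import random


def weighted_pick(clips, weights, k, rng):
    """Draw k distinct clips with probability proportional to weight."""
    pool = list(clips)
    picked = []
    for _ in range(k):
        choice = rng.choices(pool, weights=[weights[c] for c in pool], k=1)[0]
        picked.append(choice)
        pool.remove(choice)  # sample without replacement
    return tuple(sorted(picked))


def refine(clips, favorite_refs, k, count, seed=None):
    """Build the next round, biased toward refs from favorited variants."""
    rng = random.Random(seed)
    favorites = set(favorite_refs) & set(clips)
    # Favorited refs get 2x selection weight; everything else gets 1x.
    weights = {c: 2.0 if c in favorites else 1.0 for c in clips}
    combos = [weighted_pick(clips, weights, k, rng) for _ in range(count)]
    if not favorites:
        return combos
    needed = (count + 1) // 2  # at least half, rounded up
    have = sum(1 for combo in combos if favorites & set(combo))
    for i, combo in enumerate(combos):
        if have >= needed:
            break
        if not favorites & set(combo):
            # Swap one clip for a favorite so this variant qualifies.
            members = list(combo)
            members[rng.randrange(k)] = rng.choice(sorted(favorites))
            combos[i] = tuple(sorted(members))
            have += 1
    return combos
```

For example, `refine(clips, {"clip_00"}, k=3, count=8)` returns eight three-clip variants, at least four of which contain `clip_00`.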
Repeat until the best combination is clear enough to export.
Recipe export and caching make it easy to compare future ranking approaches, pairwise scoring, or audio-quality preflight checks.
Contributions are welcome!
Start with CONTRIBUTING.md.
Security reports should follow SECURITY.md.
Release steps live in docs/RELEASE_CHECKLIST.md.
Community visibility notes live in docs/OSS_LAUNCH.md.
If you use Refinery in research, cite it via CITATION.cff.