Kokoro Style Optimization

This project implements Style Optimization for the Kokoro-82M TTS model. It allows you to generate speech with specific emotional tones (e.g., anger, happiness) by optimizing the style vector during inference.

Features

Emotion Steering: Generate speech that matches a target emotion embedding.
Optimization Method: Uses Particle Swarm Optimization (PSO) to search for the best style vector.
Dual Output: Generates both a baseline (neutral) version and the steered (emotional) version for comparison.

Installation

git clone https://github.com/eryawww/kokoro_hack.git
cd kokoro_hack
pip install -r requirements.txt

Usage

The main entry point is main.py.

Basic Usage

Generate audio with a specific emotion. This will output two files:

I am very angry right now!.wav (Baseline, Zero-shot)
I am very angry right now!_anger.wav (Steered)

python main.py --text "I am very angry right now!" --emotion anger

Advanced Options

python main.py \
  --text "I am feeling very sleepy." \
  --emotion sleepiness \
  --iters 100 \
  --early_stopping 15 \
  --stft_loss_weight 0.5
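To compare several emotions for the same sentence, the command above can be wrapped in a small driver script. The following is a minimal sketch, not part of the repository: `build_command` and `run_all` are hypothetical helper names, and it assumes `main.py` is in the current working directory and accepts exactly the flags documented in this README.

```python
import shlex
import subprocess
import sys

# Emotions listed as supported in the Arguments section.
EMOTIONS = ["amused", "anger", "disgust", "neutral", "sleepiness"]


def build_command(text, emotion, iters=80, early_stopping=10, stft_loss_weight=0.7):
    """Build the argv list for one main.py run (hypothetical helper)."""
    return [
        sys.executable, "main.py",
        "--text", text,
        "--emotion", emotion,
        "--iters", str(iters),
        "--early_stopping", str(early_stopping),
        "--stft_loss_weight", str(stft_loss_weight),
    ]


def run_all(text, dry_run=True, **options):
    """Run (or, with dry_run=True, only print) one command per emotion."""
    commands = [build_command(text, emotion, **options) for emotion in EMOTIONS]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if main.py exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_all("I am feeling very sleepy.", dry_run=True, iters=100, early_stopping=15)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to check quoting before starting long optimization runs.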

Arguments

| Argument | Description | Default |
| --- | --- | --- |
| `--text` | Text to speak (required). Also used as the output filename prefix. | - |
| `--emotion` | Target emotion. Supported: `amused`, `anger`, `disgust`, `neutral`, `sleepiness`. | - |
| `--iters` | Number of PSO iterations. | `80` |
| `--early_stopping` | Stop optimization if there is no improvement for N iterations. | `10` |
| `--stft_loss_weight` | Weight of the STFT loss term relative to the cosine-similarity term. Higher values favor more realistic audio. | `0.7` |
| `--embedding_path` | Path to the emotion embeddings file. | `per_emotion_embedding_centroid.pt` |
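For reference, the documented interface can be mirrored with `argparse` as below. This is a sketch reconstructed from the list of arguments above, not the actual contents of `main.py`. Treating `--emotion` as required and restricting it with `choices` are assumptions, and `output_paths` is a hypothetical helper that only reproduces the naming shown under Basic Usage.

```python
import argparse

EMOTIONS = ["amused", "anger", "disgust", "neutral", "sleepiness"]


def build_parser():
    """Parser mirroring the documented flags and defaults (sketch only)."""
    parser = argparse.ArgumentParser(description="Kokoro style optimization (sketch)")
    parser.add_argument("--text", required=True,
                        help="Text to speak; also the output filename prefix.")
    # Assumption: the README does not say whether --emotion is mandatory.
    parser.add_argument("--emotion", required=True, choices=EMOTIONS,
                        help="Target emotion.")
    parser.add_argument("--iters", type=int, default=80,
                        help="Number of PSO iterations.")
    parser.add_argument("--early_stopping", type=int, default=10,
                        help="Stop after N iterations without improvement.")
    parser.add_argument("--stft_loss_weight", type=float, default=0.7,
                        help="Weight of the STFT loss term.")
    parser.add_argument("--embedding_path", default="per_emotion_embedding_centroid.pt",
                        help="Path to the emotion embeddings file.")
    return parser


def output_paths(text, emotion):
    """Return (baseline, steered) filenames following the Basic Usage example."""
    return f"{text}.wav", f"{text}_{emotion}.wav"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(output_paths(args.text, args.emotion))
```

Because the text is used directly as the filename prefix, very long sentences or characters that are invalid in filenames on your platform may need handling that this sketch does not attempt.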

How It Works

Emotion Encoder: Uses a pre-trained emotion encoder (Wav2Vec2-based) to extract embeddings from audio.
Style Optimization:
- The Kokoro model accepts a style vector.
- PSO maintains a swarm of particles (style vectors) that explore the style space.
- It minimizes a loss that combines a cosine-similarity term (which pulls the output's emotion embedding toward the target emotion centroid) with a Multi-Resolution STFT loss (which preserves audio quality relative to the baseline).
Output: Produces the final audio using the best found style vector.
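The loop described above can be illustrated with a small, self-contained NumPy sketch. It is not the repository's implementation. `toy_synthesize` and `toy_embed` are hypothetical stand-ins for Kokoro and the emotion encoder, the swarm hyperparameters are arbitrary, and the weighting `w * stft + (1 - w) * cosine_distance` is an assumed reading of `--stft_loss_weight`. The STFT term also assumes both signals have the same length.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity, so lower means closer to the target."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram from Hann-windowed frames (no padding)."""
    window = np.hanning(n_fft)
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_res_stft_loss(x, y, resolutions=((256, 64), (512, 128), (1024, 256))):
    """Mean L1 gap between magnitude spectrograms at several (n_fft, hop) pairs."""
    gaps = [np.mean(np.abs(stft_magnitude(x, n, h) - stft_magnitude(y, n, h)))
            for n, h in resolutions]
    return float(np.mean(gaps))


def pso_minimize(objective, x0, n_particles=16, iters=80, patience=10,
                 inertia=0.7, c1=1.5, c2=1.5, init_scale=0.1, seed=0):
    """Global-best PSO with early stopping; returns (best_x, best_loss)."""
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    pos = x0 + init_scale * rng.standard_normal((n_particles, x0.shape[0]))
    pos[0] = x0  # keep the baseline style itself in the swarm
    vel = np.zeros_like(pos)
    pbest, pbest_loss = pos.copy(), np.array([objective(p) for p in pos])
    g = int(np.argmin(pbest_loss))
    gbest, gbest_loss, stale = pbest[g].copy(), float(pbest_loss[g]), 0
    for _ in range(iters):
        r1, r2 = rng.random((2,) + pos.shape)
        # Pull each particle toward its own best and the swarm's best position.
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        loss = np.array([objective(p) for p in pos])
        improved = loss < pbest_loss
        pbest[improved], pbest_loss[improved] = pos[improved], loss[improved]
        g = int(np.argmin(pbest_loss))
        if pbest_loss[g] < gbest_loss:
            gbest, gbest_loss, stale = pbest[g].copy(), float(pbest_loss[g]), 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` iterations
                break
    return gbest, gbest_loss


# --- Toy stand-ins (hypothetical; swap in the real model and encoder) ---
T = np.arange(2048) / 16000.0


def toy_synthesize(style):
    """Pretend TTS: each style dimension scales one harmonic of 220 Hz."""
    return sum(s * np.sin(2 * np.pi * 220.0 * (k + 1) * T) for k, s in enumerate(style))


def toy_embed(audio):
    """Pretend emotion encoder: low-frequency magnitude spectrum."""
    return np.abs(np.fft.rfft(audio))[:64]


def make_objective(baseline_audio, target_emb, stft_weight=0.7):
    """Combined loss; the weighting scheme is an assumption, see the text above."""
    def objective(style):
        audio = toy_synthesize(style)
        emotion_term = cosine_distance(toy_embed(audio), target_emb)
        quality_term = multi_res_stft_loss(audio, baseline_audio)
        return stft_weight * quality_term + (1.0 - stft_weight) * emotion_term
    return objective


if __name__ == "__main__":
    baseline_style = np.array([1.0, 0.2, 0.1, 0.0])
    target_emb = toy_embed(toy_synthesize(np.array([0.2, 0.2, 1.0, 0.5])))
    objective = make_objective(toy_synthesize(baseline_style), target_emb)
    best_style, best_loss = pso_minimize(objective, baseline_style)
    print(f"baseline loss={objective(baseline_style):.4f}  best loss={best_loss:.4f}")
```

Seeding the swarm with the baseline style means the returned loss can never be worse than the baseline's own loss. PSO is a natural fit here because it only needs loss values, not gradients through the synthesizer or the encoder.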

Acknowledgements

Based on Kokoro-82M by hexgrad.
StyleTTS 2 architecture.
