eryawww/kokoro_hack

Kokoro Style Optimization

This project implements Style Optimization for the Kokoro-82M TTS model. It lets you generate speech with a specific emotional tone (e.g., anger, amusement) by optimizing the style vector at inference time.

Features

  • Emotion Steering: Generate speech that matches a target emotion embedding.
  • Optimization Method: Uses Particle Swarm Optimization (PSO) to search for the best style vector.
  • Dual Output: Generates both a baseline (neutral) version and the steered (emotional) version for comparison.

Installation

  1. git clone https://github.com/eryawww/kokoro_hack.git

  2. cd kokoro_hack

  3. pip install -r requirements.txt

Usage

The main entry point is main.py.

Basic Usage

Generate audio with a specific emotion. This will output two files:

  1. I am very angry right now!.wav (Baseline, Zero-shot)
  2. I am very angry right now!_anger.wav (Steered)

python main.py --text "I am very angry right now!" --emotion anger

Advanced Options

python main.py \
  --text "I am feeling very sleepy." \
  --emotion sleepiness \
  --iters 100 \
  --early_stopping 15 \
  --stft_loss_weight 0.5
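
To compare several emotions for the same sentence, the commands above can be scripted. The following is a minimal sketch that only assumes the flags shown in this README and the output naming described under Basic Usage; `build_command`, `expected_outputs`, and `run_all` are hypothetical helpers, not part of the repository.

```python
import subprocess
import sys

# Emotions listed as supported in the Arguments section of this README.
SUPPORTED_EMOTIONS = ["amused", "anger", "disgust", "neutral", "sleepiness"]


def build_command(text, emotion, **options):
    """Build the argument list for one main.py run.

    Extra keyword options are passed through as --name value pairs,
    e.g. iters=100 becomes "--iters 100".
    """
    if emotion not in SUPPORTED_EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    cmd = [sys.executable, "main.py", "--text", text, "--emotion", emotion]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


def expected_outputs(text, emotion):
    """File names as described under Basic Usage (assumed naming pattern)."""
    return [f"{text}.wav", f"{text}_{emotion}.wav"]


def run_all(text, **options):
    """Run main.py once per supported emotion, stopping on the first failure."""
    for emotion in SUPPORTED_EMOTIONS:
        subprocess.run(build_command(text, emotion, **options), check=True)
```

Calling `run_all("I am feeling very sleepy.", iters=100, early_stopping=15)` from the repository root would then run the optimizer once per emotion. Passing the command as a list avoids shell quoting problems with punctuation in the text.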

Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --text | Text to speak (required). Used as the filename prefix. | - |
| --emotion | Target emotion. Supported: amused, anger, disgust, neutral, sleepiness. | - |
| --iters | Number of PSO iterations. | 80 |
| --early_stopping | Stop optimization if no improvement for N iterations. | 10 |
| --stft_loss_weight | Weight of the STFT loss component (vs. cosine similarity). Use a higher value for more realistic audio. | 0.7 |
| --embedding_path | Path to emotion embeddings file. | per_emotion_embedding_centroid.pt |
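
For reference, the arguments above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the list of arguments, not the actual parser in main.py; the argument types (`int`, `float`) and the use of `choices` are assumptions inferred from the listed defaults and supported emotions.

```python
import argparse


def build_parser():
    """Parser mirroring the documented arguments (types are assumed)."""
    parser = argparse.ArgumentParser(description="Kokoro style optimization (sketch)")
    parser.add_argument("--text", required=True,
                        help="Text to speak; also used as the filename prefix.")
    parser.add_argument("--emotion",
                        choices=["amused", "anger", "disgust", "neutral", "sleepiness"],
                        help="Target emotion.")
    parser.add_argument("--iters", type=int, default=80,
                        help="Number of PSO iterations.")
    parser.add_argument("--early_stopping", type=int, default=10,
                        help="Stop if there is no improvement for N iterations.")
    parser.add_argument("--stft_loss_weight", type=float, default=0.7,
                        help="Weight of the STFT loss component.")
    parser.add_argument("--embedding_path", default="per_emotion_embedding_centroid.pt",
                        help="Path to the emotion embeddings file.")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--text", "Hello", "--emotion", "anger"])
print(args.iters, args.early_stopping, args.stft_loss_weight)  # 80 10 0.7
```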

How It Works

  1. Emotion Encoder: Uses a pre-trained emotion encoder (Wav2Vec2-based) to extract embeddings from audio.
  2. Style Optimization:
    • The Kokoro model accepts a style vector.
    • PSO maintains a swarm of particles (style vectors) that explore the style space.
    • It minimizes a loss that combines a cosine-similarity term (rewarding closeness to the target emotion centroid) with a Multi-Resolution STFT Loss (penalizing drift from the baseline audio, to preserve audio quality).
  3. Output: Produces the final audio using the best found style vector.
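
The optimization loop described above can be illustrated with a toy, dependency-free sketch. It is not the repository's implementation: the particles here are plain lists of floats rather than Kokoro style vectors, the emotion term is a cosine distance computed directly on those vectors (in the real pipeline it would be computed on emotion-encoder embeddings of synthesized audio), and a mean squared distance to the baseline vector stands in for the Multi-Resolution STFT Loss. The weighting scheme, the PSO coefficients (0.7, 1.5, 1.5), and all function names are assumptions.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity, so minimizing it maximizes similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-12)


def make_loss(target, baseline, quality_weight):
    """Weighted sum of an emotion term and a stay-near-baseline term."""
    def loss(style):
        emotion_term = cosine_distance(style, target)
        # Stand-in for the STFT loss (assumption): distance to the baseline.
        quality_term = sum((s - b) ** 2 for s, b in zip(style, baseline)) / len(style)
        return (1.0 - quality_weight) * emotion_term + quality_weight * quality_term
    return loss


def pso(loss, start, iters=80, patience=10, n_particles=16, seed=0):
    """Basic PSO with early stopping; returns (best_vector, best_loss)."""
    rng = random.Random(seed)
    dim = len(start)
    # The first particle is the start point itself; the rest are perturbed copies.
    pos = [list(start)] + [[x + rng.gauss(0.0, 0.1) for x in start]
                           for _ in range(n_particles - 1)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_loss = [loss(p) for p in pos]
    g = min(range(n_particles), key=pbest_loss.__getitem__)
    gbest, gbest_loss = pbest[g][:], pbest_loss[g]
    stale = 0
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            current = loss(pos[i])
            if current < pbest_loss[i]:
                pbest[i], pbest_loss[i] = pos[i][:], current
        g = min(range(n_particles), key=pbest_loss.__getitem__)
        if pbest_loss[g] < gbest_loss:
            gbest, gbest_loss = pbest[g][:], pbest_loss[g]
            stale = 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` iterations
                break
    return gbest, gbest_loss


rng = random.Random(1)
baseline = [rng.gauss(0.0, 1.0) for _ in range(8)]  # toy "neutral" style vector
target = [rng.gauss(0.0, 1.0) for _ in range(8)]    # toy "emotion centroid"
loss = make_loss(target, baseline, quality_weight=0.7)
best, best_loss = pso(loss, baseline)
print(best_loss <= loss(baseline))  # True: the start point is part of the swarm
```

In this sketch, raising `quality_weight` keeps the result closer to the baseline, while lowering it lets the swarm move further toward the target direction, which mirrors the trade-off controlled by --stft_loss_weight.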

Acknowledgements

  • Based on Kokoro-82M by hexgrad.
  • StyleTTS 2 architecture.
