BadrinathanTV/OpenVoice-AI

🎙️ OpenVoice AI

A real-time, multi-agent voice assistant built on the LangGraph Swarm architecture. Speak naturally to specialized AI agents, each with its own personality, voice, and expertise, through an animated web interface.


🌟 Features

| Feature | Description |
| --- | --- |
| 🗣️ Real-time Voice | Speak and listen in real time in the browser over WebRTC: VAD detects speech, ASR transcribes, the LLM responds, and TTS speaks back |
| 🤖 Multi-Agent Swarm | 3 specialized agents (Customer Care, Shopper, Order Ops) with seamless handoffs via LangGraph |
| 🎨 Animated Agent Orbs | Each agent has a unique color identity with an animated orb that glows when the agent speaks |
| 💬 Dual Mode | Switch between Voice mode (mic) and Text mode (typing) on the fly |
| 🔄 Live Agent Handoffs | Agents transfer conversations to each other based on context; the orb morphs colors during a handoff |
| 🧠 Conversation Memory | In-memory checkpointer (MemorySaver) keeps conversation state for the duration of the session, with no database round-trip |
| ⚡ GPU Accelerated | VAD (Silero), ASR (Qwen3-ASR-0.6B), and TTS (Piper) all run on CUDA when available |
| 🎤 Interrupt Support | Interrupt the AI mid-sentence; it remembers where it was cut off |
| 🔇 Noise Reduction | Real-time audio denoising using the DeepFilterNet3 ONNX model (toggleable via the UI) |
| 🌐 Language Guard | English language enforcement gate using langid to drop spurious noise and unintended translations |
| ⏱️ Latency Tracking | Granular latency metrics: time to first token (TTFT), time to first audio (TTFA), and agent switching |

🏗️ Architecture

graph TB
    subgraph Frontend["React Frontend (Vite)"]
        UI[Animated UI Orbs]
        Audio[Web Audio API]
        RTC_Client[WebRTC Client]
        WS_Client["WebSocket Client (Text Chat)"]
        
        Audio <--> RTC_Client
        RTC_Client <--> UI
        WS_Client <--> UI
    end

    subgraph Backend["FastAPI Backend"]
        WS_Server["WebSocket Endpoint"]
        WebRTC["WebRTC Endpoint"]
        Pipeline[WebVoicePipeline]
        
        subgraph Models["ML Models"]
            VAD["Silero VAD"]
            ASR["Qwen3 ASR"]
            TTS["Piper TTS"]
        end
        
        subgraph Agents["LangGraph Swarm"]
            State[("MemorySaver Checkpointer")]
            CC["CustomerCare (Alpha)"]
            Shop["Shopper (Gamma)"]
            Ops["OrderOps (Beta)"]
            
            CC <--> State
            Shop <--> State
            Ops <--> State
        end
        
        WS_Server <--> Pipeline
        WebRTC <--> Pipeline
        Pipeline <--> Models
        Pipeline <--> Agents
    end

    WS_Client <-->|ws://chat| WS_Server
    RTC_Client <-->|WebRTC Data Channels| WebRTC

🤖 The Agents

| Agent | Voice | Color | Specialization | Tools |
| --- | --- | --- | --- | --- |
| Customer Care (Alpha) | 🇬🇧 Alba (British) | 🟣 #4B8DFF | Returns, refunds, policies, general help | lookup_policy, transfer tools |
| Shopper (Gamma) | 🇺🇸 Bryce (American) | 🟢 #00C9A7 | Product search, recommendations, catalog | search_catalog, transfer tools |
| Order Ops (Beta) | 🇺🇸 HFC Female | 🔴 #FF6FAE | Order tracking, delivery status, operations | check_order_status, transfer tools |

Each agent can transfer the conversation to another agent via LangGraph tool calls. The handoff does not interrupt the conversation; the orb simply morphs to the new agent's color.
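To make the handoff idea concrete, here is a minimal, framework-free sketch. It is not the LangGraph Swarm API and not the project's code: `Agent`, `Swarm`, `transfer_to`, and `build_swarm` are hypothetical names, and starting with CustomerCare as the active agent is an assumption. Only the agent names, tool names, and colors come from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical agent record; values mirror the agents table."""
    name: str
    color: str
    tools: tuple[str, ...]


@dataclass
class Swarm:
    """Tracks which agent currently owns the conversation."""
    agents: dict[str, Agent]
    active: str
    history: list[tuple[str, str]] = field(default_factory=list)

    def transfer_to(self, target: str) -> Agent:
        """Hand the conversation to `target` and record the handoff."""
        if target not in self.agents:
            raise KeyError(f"unknown agent: {target}")
        if target != self.active:
            self.history.append((self.active, target))
            self.active = target
        return self.agents[target]


def build_swarm() -> Swarm:
    agents = [
        Agent("CustomerCare", "#4B8DFF", ("lookup_policy",)),
        Agent("Shopper", "#00C9A7", ("search_catalog",)),
        Agent("OrderOps", "#FF6FAE", ("check_order_status",)),
    ]
    # Assumption: the session starts with CustomerCare as the active agent.
    return Swarm({a.name: a for a in agents}, active="CustomerCare")


swarm = build_swarm()
new_agent = swarm.transfer_to("Shopper")
# A UI would use this color as the target of the orb's morph.
print(new_agent.name, new_agent.color)
```

In the real system the transfer is a tool call made by the LLM; the sketch only shows the state change that a UI would react to.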


📂 Project Structure

OpenVoice AI/
├── backend/
│   ├── src/
│   │   ├── api/                    # 🌐 FastAPI + WebSocket layer
│   │   │   ├── server.py           # FastAPI app, WS endpoints, REST routes
│   │   │   └── web_pipeline.py     # WebSocket-adapted voice pipeline
│   │   ├── agents/                 # 🤖 LangGraph agent system
│   │   │   ├── session.py          # VoiceSession: LangGraph graph builder
│   │   │   ├── state.py            # VoiceState TypedDict
│   │   │   └── specialized/        # Individual agent definitions
│   │   │       ├── customer_care.py
│   │   │       ├── shopper.py
│   │   │       └── order_ops.py
│   │   ├── asr/                    # 🎤 Automatic Speech Recognition
│   │   │   └── whisper.py          # Qwen3-ASR-0.6B model wrapper
│   │   ├── audio/                  # 🔊 Audio processing & I/O
│   │   │   ├── io.py               # sounddevice mic/speaker (CLI only)
│   │   │   └── denoiser.py         # DeepFilterNet3 ONNX audio denoiser
│   │   ├── core/                   # ⚙️ Core abstractions
│   │   │   ├── interfaces.py       # IVAD, IASR, ILLM, ITTS interfaces
│   │   │   └── pipeline.py         # Original CLI voice pipeline
│   │   ├── llm/                    # 🧠 LLM client
│   │   │   └── client.py           # LLMModel: wraps VoiceSession
│   │   ├── tts/                    # 🗣️ Text-to-Speech
│   │   │   └── piper.py            # Piper TTS (ONNX, GPU-accelerated)
│   │   ├── utils/                  # 🛠️ Utilities
│   │   │   ├── chunker.py          # SentenceChunker for TTS streaming
│   │   │   └── language_guard.py   # English language enforcement gate
│   │   └── vad/                    # 🎯 Voice Activity Detection
│   │       └── silero.py           # Silero VAD (PyTorch, GPU)
│   ├── models/                     # 📦 Downloaded TTS voice models
│   └── .env                        # API keys (not committed)
│
├── frontend/
│   ├── src/
│   │   ├── components/             # ⚛️ React components
│   │   │   ├── VoiceOrb.jsx        # Animated agent orb + particles
│   │   │   ├── AgentLabel.jsx      # Agent name + status badge
│   │   │   ├── MicButton.jsx       # Mic toggle with pulse animation
│   │   │   ├── TranscriptPanel.jsx # Conversation sidebar
│   │   │   ├── TextInputBar.jsx    # Text chat input
│   │   │   ├── ConnectionStatus.jsx # WebSocket status dot
│   │   │   ├── NoiseReductionToggle.jsx # UI toggle for audio denoiser
│   │   │   └── ModeToggle.jsx      # Voice ↔ Text switch
│   │   ├── hooks/                  # 🪝 Custom React hooks
│   │   │   ├── useWebSocket.js     # WebSocket connection management (Text)
│   │   │   ├── useWebRTC.js        # WebRTC connection management (Voice)
│   │   │   ├── useAudio.js         # Mic capture + TTS playback
│   │   │   └── useVoicePipeline.js # Orchestration hook
│   │   ├── config/
│   │   │   └── agents.js           # Agent metadata constants
│   │   ├── App.jsx                 # Root component
│   │   ├── main.jsx                # React entry point
│   │   └── index.css               # Full design system + animations
│   ├── index.html
│   ├── vite.config.js
│   └── package.json
│
├── pyproject.toml                  # Python dependencies (uv)
├── uv.lock                         # Locked dependency versions
│
└── .gitignore

🚀 Getting Started

Prerequisites

| Tool | Version | Purpose |
| --- | --- | --- |
| Python | 3.12.x | Backend runtime |
| uv | Latest | Python package manager |
| Node.js | 18+ | Frontend tooling |
| CUDA | 11.8+ | GPU acceleration (optional) |

1. Clone the Repository

git clone https://github.com/BadrinathanTV/OpenVoice-AI.git
cd OpenVoice-AI

2. Backend Setup

cd backend

# Create .env from the example
cp .env.example .env

# Edit .env with your API keys
nano .env

Example .env values (per the Configuration table, only OPENAI_API_KEY is required):

OPENAI_API_KEY=sk-your-openai-key
GROQ_API_KEY=gsk_your-groq-key
ASR_MODEL_PATH=/path/to/Qwen3-ASR-0.6B
ASR_BACKEND=transformers
ASR_STREAMING_CHUNK_SIZE_SEC=0.64
DATABASE_URL=mongodb://localhost:27017/

Install dependencies and start the backend using uv:

# Run this from the repository root (cd .. first if you are still in backend/)
uv sync

# Then start the backend from backend/
cd backend
uv run --project .. uvicorn src.api.server:app --host 0.0.0.0 --port 8000

The .python-version file pins the repo to Python 3.12, so cloud and client machines should use the same interpreter version for the most reliable install.

The first run will:

  • Download Qwen3-ASR-0.6B model
  • Download Piper TTS voice models (~50MB each)
  • Install all Python dependencies

To enable Qwen streaming ASR, switch to the vLLM backend:

ASR_BACKEND=vllm
ASR_STREAMING_CHUNK_SIZE_SEC=0.64

Install the streaming stack with uv before starting the backend:

uv sync --extra streaming-asr

The streaming-asr extra is intended for Linux GPU environments, which matches the current CUDA-based deployment path for this project.

3. Frontend Setup

cd frontend

# Install Node.js dependencies
npm install

# Start development server
npm run dev

4. Open in Browser

Navigate to http://localhost:5173 to see the animated orb UI.

  • Click the 🎤 mic button to start talking
  • Or switch to 💬 Text mode to type messages
  • Watch the orb glow and pulse when the AI speaks

🎨 Frontend Animation States

The agent orb transitions through visual states:

| State | Animation | When |
| --- | --- | --- |
| Idle | Gentle breathing pulse | Waiting for user input |
| Listening | Concentric ring ripples | Mic active, capturing audio |
| Processing | Spinning orbital rings | ASR transcribing speech |
| Thinking | Color desaturation + spin | Waiting for LLM response |
| Speaking | Full glow burst + particles | TTS audio playing back |
| Handoff | Color morph crossfade | Agent transferring to another |

🔌 API Endpoints

REST

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/health | Health check; reports whether the models are loaded |
| GET | /api/agents | Returns the list of available agents with metadata |

Realtime Transport (WebRTC / WebSocket)

| Path | Mode | Protocol |
| --- | --- | --- |
| /api/webrtc/offer | Voice mode (primary) | WebRTC Data Channels (Binary PCM + JSON) |
| /ws/voice | Voice mode (fallback) | WebSocket (Binary PCM + JSON) |
| /ws/chat | Text mode | WebSocket (JSON only) |

Control message types (server → client):

{"type": "session", "threadId": "...", "agent": "CustomerCare"}
{"type": "status", "value": "recording|processing|thinking|speaking|idle"}
{"type": "agent", "name": "Shopper"}
{"type": "transcript", "role": "user|ai", "text": "...", "agent": "...", "partial": true|false}
{"type": "audio", "data": "<base64 PCM>", "sampleRate": 22050}
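As a rough illustration of how a client might consume these messages, the sketch below parses one JSON message and dispatches on its `type` field. The function name `handle_control_message` and the returned summary shape are hypothetical. The duration calculation assumes 16-bit mono PCM, which the message format above does not state.

```python
import base64
import json


def handle_control_message(raw: str) -> dict:
    """Parse one server -> client JSON message into a small summary dict."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "session":
        return {"event": "session", "thread_id": msg["threadId"], "agent": msg["agent"]}
    if kind == "status":
        return {"event": "status", "value": msg["value"]}
    if kind == "agent":
        return {"event": "agent", "name": msg["name"]}
    if kind == "transcript":
        return {
            "event": "transcript",
            "role": msg["role"],
            "text": msg["text"],
            # Treat a missing "partial" flag as a final transcript.
            "final": not msg.get("partial", False),
        }
    if kind == "audio":
        pcm = base64.b64decode(msg["data"])
        # Assumption: 16-bit mono PCM, i.e. 2 bytes per sample.
        seconds = len(pcm) / 2 / msg["sampleRate"]
        return {"event": "audio", "pcm": pcm, "seconds": seconds}
    # Unknown types are reported rather than raising, so new message
    # types do not break older clients.
    return {"event": "unknown", "type": kind}


# Example: one second of silence at the sample rate shown above.
silence = base64.b64encode(b"\x00\x00" * 22050).decode("ascii")
raw = json.dumps({"type": "audio", "data": silence, "sampleRate": 22050})
print(handle_control_message(raw)["seconds"])
```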

🛠️ Tech Stack

Backend

| Component | Technology |
| --- | --- |
| Runtime | Python 3.12 |
| Web Server | FastAPI + Uvicorn |
| WebRTC | aiortc |
| Agent Framework | LangGraph Swarm |
| LLM | OpenAI GPT-4o-mini |
| ASR | Qwen3-ASR-0.6B (GPU) |
| TTS | Piper TTS (ONNX, GPU) |
| VAD | Silero VAD (PyTorch, GPU) |
| Denoising | DeepFilterNet3 (ONNX) |
| Language ID | langid |
| Database | In-memory (MemorySaver) |
| Package Manager | uv |

Frontend

| Component | Technology |
| --- | --- |
| Framework | React 19 + Vite |
| Styling | Vanilla CSS (design tokens) |
| Audio | Web Audio API |
| Communication | WebRTC (Data Channels) / WebSocket |
| Font | Inter (Google Fonts) |

🧱 SOLID Principles

The codebase follows SOLID design principles:

  • Single Responsibility: Each component, hook, and module has one job (e.g., VoiceOrb only renders, useAudio only handles audio).
  • Open/Closed: Agent config in agents.js is extendable without modifying components. Add a new agent by adding an entry.
  • Liskov Substitution: All backend modules implement abstract interfaces (IVAD, IASR, ILLM, ITTS). Swap implementations freely.
  • Interface Segregation: useVoicePipeline exposes a clean API without leaking WebRTC/WebSocket or Audio internals to components.
  • Dependency Inversion: React components receive data via props from hooks, not from globals. The backend pipeline depends on interfaces, not concrete classes.
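The Liskov Substitution and Dependency Inversion points can be sketched as follows. The interface names IASR, ILLM, and ITTS come from the list above, but the method names (`transcribe`, `reply`, `synthesize`), the `Pipeline` class, and the stand-in implementations are assumptions made for illustration; the actual signatures in interfaces.py may differ.

```python
from abc import ABC, abstractmethod


class IASR(ABC):
    @abstractmethod
    def transcribe(self, pcm: bytes) -> str:
        """Turn audio bytes into text."""


class ILLM(ABC):
    @abstractmethod
    def reply(self, text: str) -> str:
        """Produce a response to the user's text."""


class ITTS(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Turn text into audio bytes."""


class Pipeline:
    """Depends only on the interfaces, so implementations can be swapped."""

    def __init__(self, asr: IASR, llm: ILLM, tts: ITTS) -> None:
        self.asr, self.llm, self.tts = asr, llm, tts

    def run_turn(self, pcm: bytes) -> bytes:
        text = self.asr.transcribe(pcm)
        answer = self.llm.reply(text)
        return self.tts.synthesize(answer)


# Stand-in implementations, useful for tests that must run without a GPU.
class FakeASR(IASR):
    def transcribe(self, pcm: bytes) -> str:
        return pcm.decode("utf-8")


class EchoLLM(ILLM):
    def reply(self, text: str) -> str:
        return f"You said: {text}"


class FakeTTS(ITTS):
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


pipeline = Pipeline(FakeASR(), EchoLLM(), FakeTTS())
print(pipeline.run_turn(b"hello"))
```

Because `Pipeline` never names a concrete class, replacing `FakeASR` with a real model wrapper requires no change to the pipeline itself.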

📄 Voice Pipeline Flow

flowchart TD
    subgraph Browser["Browser Client"]
        Mic["Microphone API"]
        Speak["Audio Playback"]
        Mic -->|32ms PCM chunks| RTC["WebRTC Data Channels"]
        RTC -->|Base64 PCM| Speak
    end

    subgraph Backend["FastAPI Backend"]
        RTC --> VAD{Silero VAD}
        VAD -->|Noise| Drop[Discard]
        VAD -->|"speech (vol > 0.005)"| Buf[Audio Buffer]
        
        Buf -->|Complete Phrase| ASR["Qwen3 ASR"]
        ASR -->|Text| Swarm{LangGraph Swarm}
        
        Swarm -->|Token Stream| Chunker[Sentence Chunker]
        Chunker -->|Complete Sentences| TTS["Piper TTS"]
        TTS -->|Audio Bytes| RTC
    end
    
    subgraph Agents["LangGraph Agents"]
        Swarm <--> CC["Customer Care (Alpha)"]
        Swarm <--> SH["Shopper (Gamma)"]
        Swarm <--> OO["Order Ops (Beta)"]
    end
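The "Token Stream → Sentence Chunker → TTS" step in the flow above is what lets speech start before the LLM has finished its full reply. The following is a minimal sketch of that idea, not the project's chunker.py: the `feed` and `flush` method names and the boundary rule (terminal punctuation followed by whitespace) are assumptions.

```python
import re


class SentenceChunker:
    """Buffers streamed tokens and releases text one sentence at a time."""

    # A sentence ends at ., ! or ? followed by whitespace. Requiring the
    # whitespace avoids splitting inside numbers such as "3.14".
    _BOUNDARY = re.compile(r"[.!?]+\s+")

    def __init__(self) -> None:
        self._buf = ""

    def feed(self, token: str) -> list[str]:
        """Add a token; return any sentences completed by it."""
        self._buf += token
        done = []
        while (m := self._BOUNDARY.search(self._buf)) is not None:
            sentence = self._buf[: m.end()].strip()
            self._buf = self._buf[m.end():]
            if sentence:
                done.append(sentence)
        return done

    def flush(self) -> list[str]:
        """Return whatever is left once the token stream has ended."""
        rest, self._buf = self._buf.strip(), ""
        return [rest] if rest else []


chunker = SentenceChunker()
for token in ["Your order ", "shipped. ", "It arrives", " Friday!"]:
    for sentence in chunker.feed(token):
        print("to TTS:", sentence)
for sentence in chunker.flush():
    print("to TTS:", sentence)
```

A simple rule like this will also split after abbreviations such as "e.g. ", which is usually acceptable for speech output but worth knowing about.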

🔧 Configuration

Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| OPENAI_API_KEY | ✅ | - | OpenAI API key for GPT-4o-mini |
| GROQ_API_KEY | ❌ | - | Groq API key (alternative LLM provider) |
| ASR_MODEL_PATH | ❌ | Qwen3-ASR-0.6B | Path to local ASR model |
| ASR_BACKEND | ❌ | transformers | ASR backend: transformers or vllm |
| ASR_STREAMING_CHUNK_SIZE_SEC | ❌ | 0.64 | Streaming ASR decode chunk size in seconds |
| ASR_ALLOW_BACKEND_FALLBACK | ❌ | true | Fall back to transformers if vLLM ASR fails to initialize |
| ASR_ENFORCE_ENGLISH | ❌ | true | Enable backend language enforcement gate |
| ASR_ENGLISH_CONFIDENCE_THRESHOLD | ❌ | 0.80 | Language ID confidence threshold |
| DATABASE_URL | ❌ | mongodb://localhost:27017/ | MongoDB connection URL |
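One way to read these variables with the defaults from the table is sketched below. `Settings`, `load_settings`, and `_as_bool` are hypothetical names, and the set of strings accepted as "true" is an assumption; the project may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as true; anything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    groq_api_key: Optional[str]
    asr_model_path: str
    asr_backend: str
    asr_streaming_chunk_size_sec: float
    asr_allow_backend_fallback: bool
    asr_enforce_english: bool
    asr_english_confidence_threshold: float
    database_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is required")
    backend = env.get("ASR_BACKEND", "transformers")
    if backend not in {"transformers", "vllm"}:
        raise ValueError(f"ASR_BACKEND must be transformers or vllm, got {backend!r}")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        groq_api_key=env.get("GROQ_API_KEY"),
        asr_model_path=env.get("ASR_MODEL_PATH", "Qwen3-ASR-0.6B"),
        asr_backend=backend,
        asr_streaming_chunk_size_sec=float(env.get("ASR_STREAMING_CHUNK_SIZE_SEC", "0.64")),
        asr_allow_backend_fallback=_as_bool(env.get("ASR_ALLOW_BACKEND_FALLBACK", "true")),
        asr_enforce_english=_as_bool(env.get("ASR_ENFORCE_ENGLISH", "true")),
        asr_english_confidence_threshold=float(env.get("ASR_ENGLISH_CONFIDENCE_THRESHOLD", "0.80")),
        database_url=env.get("DATABASE_URL", "mongodb://localhost:27017/"),
    )


settings = load_settings({"OPENAI_API_KEY": "sk-your-openai-key", "ASR_BACKEND": "vllm"})
print(settings.asr_backend, settings.asr_streaming_chunk_size_sec)
```

Passing a plain dict instead of reading `os.environ` directly keeps the loader easy to test.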

Adding a New Agent

  1. Create backend/src/agents/specialized/your_agent.py with a system prompt, tools, and get_your_agent() function
  2. Register it in backend/src/agents/session.py (add node + routing)
  3. Add TTS voice in backend/src/api/web_pipeline.py
  4. Add agent metadata in frontend/src/config/agents.js

📜 License

This project is for educational and research purposes.


Built with ❤️ by The Three!
