A real-time, multi-agent voice assistant powered by the LangGraph Swarm architecture. Speak naturally to specialized AI agents, each with its own personality, voice, and expertise, through a stunning animated web interface.
| Feature | Description |
|---|---|
| Real-time Voice | Speak and listen in real time via the browser using WebRTC: VAD detects speech, ASR transcribes, the LLM responds, and TTS speaks back |
| Multi-Agent Swarm | 3 specialized agents (Customer Care, Shopper, Order Ops) with seamless handoffs via LangGraph |
| Animated Agent Orbs | Each agent has a unique color identity with animated orb visualizations that glow when speaking |
| Dual Mode | Switch between Voice mode (mic) and Text mode (typing) on the fly |
| Live Agent Handoffs | Agents transfer conversations to each other based on context; the orb morphs colors during a handoff |
| Conversation Memory | In-memory checkpointer (MemorySaver) keeps conversation state for the duration of the session, avoiding database round-trips |
| GPU Accelerated | VAD (Silero), ASR (Qwen3-ASR-0.6B), and TTS (Piper) all run on CUDA when available |
| Interrupt Support | Interrupt the AI mid-sentence; it remembers where it was cut off |
| Noise Reduction | Real-time audio denoising using the DeepFilterNet3 ONNX model (toggleable via UI) |
| Language Guard | Production-grade English-language enforcement gate using langid to drop spurious noise and unintended translations |
| Latency Tracking | Granular latency metrics (TTFT, TTFA, and agent switching) |
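
As an illustration of the latency metrics named in the last row, the sketch below records per-turn timings relative to the moment the user stops speaking. It assumes TTFT means time to first LLM token and TTFA means time to first audio chunk; the `LatencyTracker` class and its method names are hypothetical and are not taken from the project's code.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class LatencyTracker:
    """Hypothetical per-turn latency recorder (not the project's actual class)."""

    clock: Callable[[], float] = time.perf_counter
    start: Optional[float] = None
    marks: Dict[str, float] = field(default_factory=dict)

    def begin_turn(self) -> None:
        """Call when the user's utterance ends; resets all marks."""
        self.start = self.clock()
        self.marks.clear()

    def mark(self, name: str) -> None:
        """Record elapsed seconds for `name`; only the first call per turn counts."""
        if self.start is None:
            raise RuntimeError("begin_turn() must be called first")
        self.marks.setdefault(name, self.clock() - self.start)


tracker = LatencyTracker()
tracker.begin_turn()
tracker.mark("ttft")  # assumed: first LLM token arrived
tracker.mark("ttfa")  # assumed: first TTS audio chunk ready
print({name: round(seconds, 6) for name, seconds in tracker.marks.items()})
```

Injecting the clock keeps the tracker easy to test with a fake time source.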
graph TB
subgraph Frontend["React Frontend (Vite)"]
UI[Animated UI Orbs]
Audio[Web Audio API]
RTC_Client[WebRTC Client]
WS_Client["WebSocket Client (Text Chat)"]
Audio <--> RTC_Client
RTC_Client <--> UI
WS_Client <--> UI
end
subgraph Backend["FastAPI Backend"]
WS_Server["WebSocket Endpoint"]
WebRTC["WebRTC Endpoint"]
Pipeline[WebVoicePipeline]
subgraph Models["ML Models"]
VAD["Silero VAD"]
ASR["Qwen3 ASR"]
TTS["Piper TTS"]
end
subgraph Agents["LangGraph Swarm"]
State[("MemorySaver Checkpointer")]
CC["CustomerCare (Alpha)"]
Shop["Shopper (Gamma)"]
Ops["OrderOps (Beta)"]
CC <--> State
Shop <--> State
Ops <--> State
end
WS_Server <--> Pipeline
WebRTC <--> Pipeline
Pipeline <--> Models
Pipeline <--> Agents
end
WS_Client <-->|ws://chat| WS_Server
RTC_Client <-->|WebRTC Data Channels| WebRTC
| Agent | Voice | Color | Specialization | Tools |
|---|---|---|---|---|
| Customer Care (Alpha) | Alba (British) | #4B8DFF | Returns, refunds, policies, general help | `lookup_policy`, transfer tools |
| Shopper (Gamma) | Bryce (American) | #00C9A7 | Product search, recommendations, catalog | `search_catalog`, transfer tools |
| Order Ops (Beta) | HFC Female | #FF6FAE | Order tracking, delivery status, operations | `check_order_status`, transfer tools |
Each agent can transfer seamlessly to another via LangGraph tool calls. The user never notices the handoff; the orb simply morphs its color.
OpenVoice AI/
├── backend/
│   ├── src/
│   │   ├── api/                      # FastAPI + WebSocket layer
│   │   │   ├── server.py             # FastAPI app, WS endpoints, REST routes
│   │   │   └── web_pipeline.py       # WebSocket-adapted voice pipeline
│   │   ├── agents/                   # LangGraph agent system
│   │   │   ├── session.py            # VoiceSession: LangGraph graph builder
│   │   │   ├── state.py              # VoiceState TypedDict
│   │   │   └── specialized/          # Individual agent definitions
│   │   │       ├── customer_care.py
│   │   │       ├── shopper.py
│   │   │       └── order_ops.py
│   │   ├── asr/                      # Automatic Speech Recognition
│   │   │   └── whisper.py            # Qwen3-ASR-0.6B model wrapper
│   │   ├── audio/                    # Audio processing & I/O
│   │   │   ├── io.py                 # sounddevice mic/speaker (CLI only)
│   │   │   └── denoiser.py           # DeepFilterNet3 ONNX audio denoiser
│   │   ├── core/                     # Core abstractions
│   │   │   ├── interfaces.py         # IVAD, IASR, ILLM, ITTS interfaces
│   │   │   └── pipeline.py           # Original CLI voice pipeline
│   │   ├── llm/                      # LLM client
│   │   │   └── client.py             # LLMModel: wraps VoiceSession
│   │   ├── tts/                      # Text-to-Speech
│   │   │   └── piper.py              # Piper TTS (ONNX, GPU-accelerated)
│   │   ├── utils/                    # Utilities
│   │   │   ├── chunker.py            # SentenceChunker for TTS streaming
│   │   │   └── language_guard.py     # English language enforcement gate
│   │   └── vad/                      # Voice Activity Detection
│   │       └── silero.py             # Silero VAD (PyTorch, GPU)
│   ├── models/                       # Downloaded TTS voice models
│   └── .env                          # API keys (not committed)
│
├── frontend/
│   ├── src/
│   │   ├── components/               # React components
│   │   │   ├── VoiceOrb.jsx          # Animated agent orb + particles
│   │   │   ├── AgentLabel.jsx        # Agent name + status badge
│   │   │   ├── MicButton.jsx         # Mic toggle with pulse animation
│   │   │   ├── TranscriptPanel.jsx   # Conversation sidebar
│   │   │   ├── TextInputBar.jsx      # Text chat input
│   │   │   ├── ConnectionStatus.jsx  # WebSocket status dot
│   │   │   ├── NoiseReductionToggle.jsx # UI toggle for audio denoiser
│   │   │   └── ModeToggle.jsx        # Voice/Text switch
│   │   ├── hooks/                    # Custom React hooks
│   │   │   ├── useWebSocket.js       # WebSocket connection management (Text)
│   │   │   ├── useWebRTC.js          # WebRTC connection management (Voice)
│   │   │   ├── useAudio.js           # Mic capture + TTS playback
│   │   │   └── useVoicePipeline.js   # Orchestration hook
│   │   ├── config/
│   │   │   └── agents.js             # Agent metadata constants
│   │   ├── App.jsx                   # Root component
│   │   ├── main.jsx                  # React entry point
│   │   └── index.css                 # Full design system + animations
│   ├── index.html
│   ├── vite.config.js
│   └── package.json
│
├── pyproject.toml                    # Python dependencies (uv)
├── uv.lock                           # Locked dependency versions
│
└── .gitignore
| Tool | Version | Purpose |
|---|---|---|
| Python | 3.12.x | Backend runtime |
| uv | Latest | Python package manager |
| Node.js | 18+ | Frontend tooling |
| CUDA | 11.8+ | GPU acceleration (optional) |
git clone https://github.com/BadrinathanTV/OpenVoice-AI.git
cd OpenVoice-AI
cd backend
# Create .env from the example
cp .env.example .env
# Edit .env with your API keys
nano .env

Required .env values:
OPENAI_API_KEY=sk-your-openai-key
GROQ_API_KEY=gsk_your-groq-key
ASR_MODEL_PATH=/path/to/Qwen3-ASR-0.6B
ASR_BACKEND=transformers
ASR_STREAMING_CHUNK_SIZE_SEC=0.64
DATABASE_URL=mongodb://localhost:27017/

Install dependencies and start the backend using uv:
# Run this from the repository root
uv sync
# Then start the backend from backend/
cd backend
uv run --project .. uvicorn src.api.server:app --host 0.0.0.0 --port 8000

`.python-version` pins the repo to Python 3.12, so cloud and client machines should use that same interpreter version for the most reliable install.
The first run will:
- Download Qwen3-ASR-0.6B model
- Download Piper TTS voice models (~50MB each)
- Install all Python dependencies
To enable Qwen streaming ASR, switch to the vLLM backend:
ASR_BACKEND=vllm
ASR_STREAMING_CHUNK_SIZE_SEC=0.64

Install the streaming stack with uv before starting the backend:
uv sync --extra streaming-asr

The `streaming-asr` extra is intended for Linux GPU environments, which matches the current CUDA-based deployment path for this project.
cd frontend
# Install Node.js dependencies
npm install
# Start development server
npm run dev

Navigate to http://localhost:5173 and you'll see the animated orb UI.
- Click the mic button to start talking
- Or switch to Text mode to type messages
- Watch the orb glow and pulse when the AI speaks
The agent orb transitions through visual states:
| State | Animation | When |
|---|---|---|
| Idle | Gentle breathing pulse | Waiting for user input |
| Listening | Concentric ring ripples | Mic active, capturing audio |
| Processing | Spinning orbital rings | ASR transcribing speech |
| Thinking | Color desaturation + spin | Waiting for LLM response |
| Speaking | Full glow burst + particles | TTS audio playing back |
| Handoff | Color morph crossfade | Agent transferring to another |
| Method | Path | Description |
|---|---|---|
| GET | `/api/health` | Health check; shows whether models are loaded |
| GET | `/api/agents` | Returns the list of available agents with metadata |
| Path | Mode | Protocol |
|---|---|---|
| `/api/webrtc/offer` | Voice mode (primary) | WebRTC Data Channels (Binary PCM + JSON) |
| `/ws/voice` | Voice mode (fallback) | WebSocket (Binary PCM + JSON) |
| `/ws/chat` | Text mode | WebSocket (JSON only) |
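
These endpoints carry JSON control messages alongside the audio. As a rough sketch of how a client might dispatch them, the example below handles the documented `type` values (`session`, `status`, `agent`, `transcript`, `audio`) and decodes an `audio` payload. It assumes the PCM is 16-bit signed little-endian mono, which this README does not state; `handle_message` and `decode_pcm16` are hypothetical helpers, not part of the project.

```python
import base64
import json
import struct


def decode_pcm16(b64_data: str) -> list:
    """Decode base64 audio into samples, ASSUMING 16-bit little-endian mono PCM."""
    raw = base64.b64decode(b64_data)
    count = len(raw) // 2  # two bytes per sample; ignore a trailing odd byte
    return list(struct.unpack(f"<{count}h", raw[: count * 2]))


def handle_message(raw_json: str) -> tuple:
    """Return (kind, payload) for one server-to-client control message."""
    msg = json.loads(raw_json)
    kind = msg.get("type")
    if kind == "session":
        return kind, {"threadId": msg["threadId"], "agent": msg["agent"]}
    if kind == "status":
        return kind, msg["value"]
    if kind == "agent":
        return kind, msg["name"]
    if kind == "transcript":
        return kind, (msg["role"], msg["text"], bool(msg.get("partial", False)))
    if kind == "audio":
        samples = decode_pcm16(msg["data"])
        # Duration follows from the sample count and the advertised sample rate.
        return kind, {"samples": samples, "seconds": len(samples) / msg["sampleRate"]}
    raise ValueError(f"unknown message type: {kind!r}")


# Round-trip a tiny synthetic audio message.
chunk = base64.b64encode(struct.pack("<4h", 0, 1000, -1000, 0)).decode("ascii")
kind, payload = handle_message(
    json.dumps({"type": "audio", "data": chunk, "sampleRate": 22050})
)
print(kind, payload["samples"])
```

Raising on unknown types makes protocol drift between server and client visible early instead of silently dropping messages.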
Control message types (server to client):
{"type": "session", "threadId": "...", "agent": "CustomerCare"}
{"type": "status", "value": "recording|processing|thinking|speaking|idle"}
{"type": "agent", "name": "Shopper"}
{"type": "transcript", "role": "user|ai", "text": "...", "agent": "...", "partial": true|false}
{"type": "audio", "data": "<base64 PCM>", "sampleRate": 22050}

| Component | Technology |
|---|---|
| Runtime | Python 3.12 |
| Web Server | FastAPI + Uvicorn |
| WebRTC | aiortc |
| Agent Framework | LangGraph Swarm |
| LLM | OpenAI GPT-4o-mini |
| ASR | Qwen3-ASR-0.6B (GPU) |
| TTS | Piper TTS (ONNX, GPU) |
| VAD | Silero VAD (PyTorch, GPU) |
| Denoising | DeepFilterNet3 (ONNX) |
| Language ID | langid |
| Database | In-memory (MemorySaver) |
| Package Manager | uv |
| Component | Technology |
|---|---|
| Framework | React 19 + Vite |
| Styling | Vanilla CSS (design tokens) |
| Audio | Web Audio API |
| Communication | WebRTC (Data Channels) / WebSocket |
| Font | Inter (Google Fonts) |
The codebase follows SOLID design principles:
- Single Responsibility: Each component, hook, and module has one job (e.g., `VoiceOrb` only renders, `useAudio` only handles audio).
- Open/Closed: Agent config in `agents.js` is extensible without modifying components. Add a new agent by adding an entry.
- Liskov Substitution: All backend modules implement abstract interfaces (`IVAD`, `IASR`, `ILLM`, `ITTS`), so implementations can be swapped freely.
- Interface Segregation: `useVoicePipeline` exposes a clean API without leaking WebRTC/WebSocket or audio internals to components.
- Dependency Inversion: React components receive data via props from hooks, not from globals. The backend pipeline depends on interfaces, not concrete classes.
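
To make the Liskov Substitution and Dependency Inversion points concrete, here is a minimal sketch of a pipeline that depends only on abstract interfaces. The interface names `IASR` and `ITTS` come from `interfaces.py`, but the method names (`transcribe`, `synthesize`) and the `EchoASR`, `SilentTTS`, and `Pipeline` classes are assumptions for illustration; the project's actual signatures may differ.

```python
from abc import ABC, abstractmethod


class IASR(ABC):
    @abstractmethod
    def transcribe(self, pcm: bytes) -> str:
        """Turn raw audio into text."""


class ITTS(ABC):
    @abstractmethod
    def synthesize(self, text: str) -> bytes:
        """Turn text into raw audio."""


class EchoASR(IASR):
    """Stand-in ASR that treats the audio bytes as UTF-8 text."""

    def transcribe(self, pcm: bytes) -> str:
        return pcm.decode("utf-8")


class SilentTTS(ITTS):
    """Stand-in TTS that emits two zero bytes per character."""

    def synthesize(self, text: str) -> bytes:
        return b"\x00\x00" * len(text)


class Pipeline:
    """Depends on the interfaces only, so any conforming model can be injected."""

    def __init__(self, asr: IASR, tts: ITTS) -> None:
        self.asr = asr
        self.tts = tts

    def run(self, pcm: bytes) -> bytes:
        text = self.asr.transcribe(pcm)
        return self.tts.synthesize(text)


audio_out = Pipeline(EchoASR(), SilentTTS()).run(b"hello")
print(len(audio_out))
```

Swapping in a real model wrapper would then only require a class that implements the same abstract methods.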
flowchart TD
subgraph Browser["Browser Client"]
Mic["Microphone API"]
Speak["Audio Playback"]
Mic -->|32ms PCM chunks| RTC["WebRTC Data Channels"]
RTC -->|Base64 PCM| Speak
end
subgraph Backend["FastAPI Backend"]
RTC --> VAD{Silero VAD}
VAD -->|Noise| Drop[Discard]
VAD -->|"speech (vol > 0.005)"| Buf[Audio Buffer]
Buf -->|Complete Phrase| ASR["Qwen3 ASR"]
ASR -->|Text| Swarm{LangGraph Swarm}
Swarm -->|Token Stream| Chunker[Sentence Chunker]
Chunker -->|Complete Sentences| TTS["Piper TTS"]
TTS -->|Audio Bytes| RTC
end
subgraph Agents["LangGraph Agents"]
Swarm <--> CC["Customer Care (Alpha)"]
Swarm <--> SH["Shopper (Gamma)"]
Swarm <--> OO["Order Ops (Beta)"]
end
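
The "Token Stream" to "Complete Sentences" step in the flow above can be sketched as a small generator. This is a simplified stand-in under stated assumptions, not the project's `SentenceChunker`: it treats `.`, `!`, or `?` followed by whitespace as a sentence boundary, and the function name `chunk_sentences` is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that directly follows terminal punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last ended with punctuation plus whitespace,
        # so it is complete and can be handed to TTS immediately.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


stream = ["Your order", " shipped", ". It", " arrives", " Friday!", " Anything", " else?"]
for sentence in chunk_sentences(stream):
    print(sentence)
```

Emitting sentence by sentence lets speech synthesis start before the LLM has finished its full reply. A boundary rule this simple will also split after abbreviations such as "Dr.", so a production chunker would likely need extra handling.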
| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | Yes | none | OpenAI API key for GPT-4o-mini |
| `GROQ_API_KEY` | No | none | Groq API key (alternative LLM provider) |
| `ASR_MODEL_PATH` | No | `Qwen3-ASR-0.6B` | Path to local ASR model |
| `ASR_BACKEND` | No | `transformers` | ASR backend: `transformers` or `vllm` |
| `ASR_STREAMING_CHUNK_SIZE_SEC` | No | `0.64` | Streaming ASR decode chunk size in seconds |
| `ASR_ALLOW_BACKEND_FALLBACK` | No | `true` | Fall back to `transformers` if vLLM ASR fails to initialize |
| `ASR_ENFORCE_ENGLISH` | No | `true` | Enable backend language enforcement gate |
| `ASR_ENGLISH_CONFIDENCE_THRESHOLD` | No | `0.80` | Language ID confidence threshold |
| `DATABASE_URL` | No | `mongodb://localhost:27017/` | MongoDB connection URL |
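
As a sketch of how some of these variables might be read and applied, the example below loads the ASR settings with the documented defaults and uses the threshold in a simple language gate. The `AsrSettings`, `load_asr_settings`, and `accept_transcript` names are hypothetical, and the gate logic (accept only English at or above the threshold) is an assumption about how the enforcement works. The classifier is passed in as a callable so the sketch stays independent of langid.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AsrSettings:
    backend: str
    chunk_size_sec: float
    enforce_english: bool
    english_threshold: float


def load_asr_settings(env: Mapping[str, str] = os.environ) -> AsrSettings:
    """Read ASR settings, falling back to the defaults documented in the table."""
    backend = env.get("ASR_BACKEND", "transformers")
    if backend not in {"transformers", "vllm"}:
        raise ValueError(f"unsupported ASR_BACKEND: {backend!r}")
    return AsrSettings(
        backend=backend,
        chunk_size_sec=float(env.get("ASR_STREAMING_CHUNK_SIZE_SEC", "0.64")),
        enforce_english=_as_bool(env.get("ASR_ENFORCE_ENGLISH", "true")),
        english_threshold=float(env.get("ASR_ENGLISH_CONFIDENCE_THRESHOLD", "0.80")),
    )


def accept_transcript(
    text: str,
    settings: AsrSettings,
    classify: Callable[[str], Tuple[str, float]],
) -> bool:
    """Assumed gate: keep a transcript only if it is confidently English."""
    if not settings.enforce_english:
        return True
    lang, confidence = classify(text)
    return lang == "en" and confidence >= settings.english_threshold


settings = load_asr_settings({})  # empty mapping, so every default applies
print(settings)
print(accept_transcript("hello there", settings, lambda _text: ("en", 0.97)))
```

Passing a mapping instead of reading `os.environ` directly makes the defaults easy to check without touching the real environment.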
- Create `backend/src/agents/specialized/your_agent.py` with a system prompt, tools, and a `get_your_agent()` function
- Register it in `backend/src/agents/session.py` (add node + routing)
- Add a TTS voice in `backend/src/api/web_pipeline.py`
- Add agent metadata in `frontend/src/config/agents.js`
This project is for educational and research purposes.
Built with ❤️ by The Three!