conversational-ai

A terminal-first TTS/STT platform built on mlx-audio for Apple Silicon. The primary interface is the cai CLI — speak text, transcribe speech, watch files, and run two-way voice dialogues from the terminal. A companion HTTP API (cai serve) exposes the same TTS/STT models to browser-based clients that can't invoke the CLI directly.

Requirements

macOS with Apple Silicon (M1/M2/M3/M4)
Python 3.12+
uv package manager

Quick start

# Clone and enter the project
git clone <repo-url>
cd conversational_ai

# Install dependencies (creates .venv automatically)
uv sync

# Try the CLI (downloads models on first run ~500 MB)
uv run python cli.py speak "Hello, world!"

# Or start the HTTP API for browser clients
uv run python cli.py serve

The HTTP server (when cai serve is running) listens at http://127.0.0.1:4114. Visit /docs for the interactive OpenAPI UI.

Installation (persistent `cai` command)

install.sh copies the app to ~/.local/share/conversational_ai/ and creates a cai launcher at ~/.local/bin/cai.

# mlx-audio must be checked out as a sibling directory: ../mlx-audio
bash install.sh

Ensure ~/.local/bin is in your PATH (add to ~/.zshrc or ~/.bashrc if not):

export PATH="$HOME/.local/bin:$PATH"

Updating: re-run install.sh at any time to sync the latest source and dependency changes to the installed copy.

Configuration

The config file lives at ~/.config/conversational_ai/config.toml and is auto-created with the default template on first run. Edit it to change defaults:

[server]
host = "127.0.0.1"
port = 4114

[tts]
model     = "mlx-community/Kokoro-82M-bf16"
voice     = "af_heart"
speed     = 1.0
lang_code = "a"

[stt]
model = "mlx-community/whisper-large-v3-turbo-asr-fp16"

[models]
models_dir = "~/.lmstudio/models"

[dialogue]
speak_file  = "~/.local/share/conversational_ai/speak.txt"
listen_file = "~/.local/share/conversational_ai/listen.txt"
barge_in    = true   # VAD rising edge cancels in-flight TTS
full_duplex = true   # mic stays hot while TTS is playing

[mic]
rms_threshold          = 0.01   # RMS above which a chunk counts as speech
silence_seconds        = 1.5    # trailing silence that ends an utterance
min_speech_seconds     = 0.15   # sustained speech required to latch
calibrate_noise        = false  # sample room tone at startup (opt-in)
calibration_seconds    = 1.0
calibration_multiplier = 3.0

[wake_word]
enabled         = false       # gate STT output on a trigger word (listen/dialogue)
word            = "computer"  # trigger; must be followed by punctuation or EOL
include_trigger = false       # keep the trigger word in the emitted line
timeout_seconds = 30.0        # re-arm after this much silence
alert_sound     = true        # play a short chime on activation

[limits]
max_text_length     = 5000      # characters
max_audio_file_size = 26214400  # bytes (25 MB)

[log]
log_dir      = "~/.local/state/conversational_ai"
max_age_days = 7

See Dialogue duplex modes for the barge_in / full_duplex matrix.

Any value can be overridden at launch with a CLI flag:

cai --voice af_sky --speed 1.2 speak "Good morning"
cai --tts-model mlx-community/Kokoro-82M-bf16 \
    --stt-model mlx-community/whisper-large-v3-turbo-asr-fp16 \
    transcribe

Run cai --help for the full global flag list, or cai <subcommand> --help for per-subcommand options.

API

All responses include X-Limit-Max-Text-Length and X-Limit-Max-Audio-File-Size headers so clients always know the active limits.

`POST /v1/tts`

Convert text to speech. Returns a WAV audio file.

Request (application/json):

{
  "text": "Hello, can you hear me?",
  "voice": "af_heart",
  "speed": 1.0,
  "lang_code": "a"
}

voice, speed, and lang_code are optional — server defaults apply when omitted.

Response: audio/wav binary

curl -X POST http://127.0.0.1:4114/v1/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, world!"}' \
  -o speech.wav

`POST /v1/stt`

Transcribe an audio file. Returns JSON with the transcribed text.

Request: multipart form upload, field name file.
Accepted types: WAV, MP3, MP4, OGG, FLAC, WebM, AAC.

Response (application/json):

{
  "text": "Hello, world!",
  "segments": [...],
  "language": "en"
}

curl -X POST http://127.0.0.1:4114/v1/stt \
  -F "file=@speech.wav;type=audio/wav"

`GET /v1/health`

{"status": "ok", "tts_loaded": true, "stt_loaded": true}

status is "ok" when both models are loaded, "degraded" when one is, "unavailable" when neither is.

`GET /v1/models`

{
  "tts": {"name": "mlx-community/Kokoro-82M-bf16", "loaded": true},
  "stt": {"name": "mlx-community/whisper-large-v3-turbo-asr-fp16", "loaded": true}
}

CORS

The server allows requests from http://localhost:* and http://127.0.0.1:* only. External origins are blocked.

Development

# Run tests (220 as of 2026-04-18)
uv run pytest

# Lint / format
uv run ruff check src tests
uv run ruff format src tests

See CONTRIBUTING.md for architecture details and contribution guidelines.

CLI usage

cai is the unified entry point for both the API server and direct terminal TTS/STT.

Subcommands

Command	Description
`cai serve`	Start the HTTP API server
`cai speak [TEXT]`	Speak text via TTS (arg, `--file`, or stdin)
`cai transcribe`	Record from mic → print transcription
`cai watch FILE`	Watch a file — speak any new content appended to it
`cai listen FILE`	Continuous mic → append transcriptions to FILE
`cai dialogue --speak-file A --listen-file B`	Watch + listen simultaneously

Examples

# Speak text directly
cai speak "Hello, world!"

# Speak a file
cai speak --file notes.txt

# Transcribe one utterance to stdout
cai transcribe

# Transcribe to a file
cai transcribe -o transcript.txt

# Speak whatever gets appended to a file (Ctrl+C to stop)
cai watch /tmp/tts.txt

# Append mic transcriptions to a file (Ctrl+C to stop)
cai listen /tmp/stt.txt

# Two-way: speak from a.txt, transcribe mic to b.txt
cai dialogue --speak-file a.txt --listen-file b.txt

# Start the HTTP server
cai serve

Dialogue duplex modes

cai dialogue runs TTS (file → speaker) and STT (mic → file) at the same time. Two orthogonal flags in the [dialogue] section of config.toml cover the four useful combinations:

[dialogue]
speak_file  = "~/.local/share/conversational_ai/speak.txt"
listen_file = "~/.local/share/conversational_ai/listen.txt"
barge_in    = true   # VAD rising edge cancels in-flight TTS
full_duplex = true   # mic stays hot while TTS is playing

`barge_in`	`full_duplex`	Mode	When to use it
`true`	`true`	Full-duplex + barge-in (default)	Headphones. Natural conversation — start talking and TTS stops mid-sentence.
`true`	`false`	Speaker-safe half-duplex	Open speakers, no headphones. Mic is gated while TTS plays so the model never hears itself; your next utterance still interrupts the following TTS reply.
`false`	`true`	Loopback / self-dialogue	Intentional TTS → mic → STT chains. The agent speaks, transcribes its own output, and continues — the feedback loop is the feature.
`false`	`false`	Walkie-talkie	Predictable turn-taking. Strict half-duplex, TTS always finishes, no interrupts. Simplest model when you want zero surprises.

Example — running dialogue in speaker-safe mode on a laptop without headphones:

# ~/.config/conversational_ai/config.toml
[dialogue]
barge_in    = true
full_duplex = false

cai dialogue --speak-file a.txt --listen-file b.txt
# Startup banner shows the active mode:
#   Dialogue active [barge_in=True full_duplex=False] — watching …

Loopback mode for agent-talks-to-itself workflows — point both files at the same path and let the agent drive its own conversation:

[dialogue]
barge_in    = false
full_duplex = true

cai dialogue --speak-file scratch.txt --listen-file scratch.txt

Global options

All subcommands accept these options before the subcommand name:

--config PATH        Path to TOML config file (overrides XDG path)
--tts-model MODEL    Override TTS model
--stt-model MODEL    Override STT model
--voice VOICE        Override TTS voice
--speed SPEED        Override TTS speed (0.1–5.0)
--lang-code CODE     Override TTS language code
--models-dir DIR     Local models directory (default: ~/.lmstudio/models)
--no-tts             Skip loading the TTS model
--no-stt             Skip loading the STT model

Mic flags (transcribe / listen / dialogue)

These subcommands share a common set of per-command flags for tuning the voice-activity detector and opting into noise calibration:

--mic-threshold FLOAT                   Override [mic].rms_threshold
--mic-silence SECONDS                   Override [mic].silence_seconds
--mic-min-speech SECONDS                Override [mic].min_speech_seconds
--calibrate-noise / --no-calibrate-noise
                                        Sample room tone at startup

Example:

# Tighten the gate and calibrate before a noisy-kitchen dictation session
cai listen --mic-threshold 0.03 --mic-min-speech 0.25 --calibrate-noise out.txt

Wake-word flags (listen / dialogue)

By default every transcribed utterance flows through to the sink file. Pass --wake-word WORD to require a trigger (followed by punctuation or end-of-utterance) before anything is written. Silence past --wake-timeout seconds re-arms the gate.

--wake-word WORD                        Enable gating; forces [wake_word].enabled=true
--no-wake-word                          Disable gating regardless of config
--wake-timeout SECONDS                  Override [wake_word].timeout_seconds
--include-trigger / --strip-trigger     Keep or strip the trigger word
--wake-alert / --no-wake-alert          Play or suppress the activation chime

The trigger must be distinct from normal sentence use — Whisper adds punctuation on pauses, so "Computer, hello" opens the gate and emits "hello", while "Computer science is cool" is rejected. On trigger, a [wake] 'computer' heard — listening line goes to stderr; if --wake-alert is set (default), a short two-tone chime also plays.

Example — wake-word dictation to a file:

cai listen --wake-word computer out.txt
# "Computer, take a note" → "take a note\n"
# "continue writing"      → "continue writing\n"   (window still open)
# …30 s of silence…
# "computer science rocks" → rejected, gate re-armed

Project layout

conversational_ai/
├── cli.py               # `cai` entry point — re-exports src.cli.cli
├── main.py              # FastAPI app factory (used by `cai serve`)
├── src/
│   ├── config.py        # XDG TOML + CLI settings (Pydantic)
│   ├── models.py        # ModelManager — TTS/STT loader and inference
│   ├── audio.py         # WAV encoding, upload validation, temp files
│   ├── schemas.py       # Pydantic request/response models
│   ├── middleware.py    # X-Limit-* response headers
│   ├── logging_setup.py # Log rotation + setup
│   ├── routes/
│   │   ├── tts.py       # POST /v1/tts
│   │   ├── stt.py       # POST /v1/stt
│   │   └── system.py    # GET /v1/health, GET /v1/models
│   └── cli/
│       ├── __init__.py  # Click group, shared startup (config + models)
│       ├── audio_io.py  # Streaming TTS playback + mic recording (VAD, calibration)
│       ├── serve.py     # `cai serve`
│       ├── speak.py     # `cai speak`
│       ├── transcribe.py# `cai transcribe`
│       ├── watch.py     # `cai watch`
│       ├── listen.py    # `cai listen`
│       ├── dialogue.py  # `cai dialogue`
│       └── wake_word.py # WakeWordGate + build_wake_gate helper
└── tests/               # pytest test suite (220 tests)

License

MIT — see LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.claude/skills		.claude/skills
skills		skills
src		src
tasks		tasks
tests		tests
.DS_Store		.DS_Store
.gitignore		.gitignore
.python-version		.python-version
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
PRD.md		PRD.md
README.md		README.md
cli.py		cli.py
config.toml		config.toml
input.txt		input.txt
install.sh		install.sh
main.py		main.py
pyproject.toml		pyproject.toml
speak-dialogue.md		speak-dialogue.md
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

conversational-ai

Requirements

Quick start

Installation (persistent `cai` command)

Configuration

API

`POST /v1/tts`

`POST /v1/stt`

`GET /v1/health`

`GET /v1/models`

CORS

Development

CLI usage

Subcommands

Examples

Dialogue duplex modes

Global options

Mic flags (transcribe / listen / dialogue)

Wake-word flags (listen / dialogue)

Project layout

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

conversational-ai

Requirements

Quick start

Installation (persistent cai command)

Configuration

API

POST /v1/tts

POST /v1/stt

GET /v1/health

GET /v1/models

CORS

Development

CLI usage

Subcommands

Examples

Dialogue duplex modes

Global options

Mic flags (transcribe / listen / dialogue)

Wake-word flags (listen / dialogue)

Project layout

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Installation (persistent `cai` command)

`POST /v1/tts`

`POST /v1/stt`

`GET /v1/health`

`GET /v1/models`

Packages