Autonomous Conversational Voice Agent. A local, production-grade C++ voice assistant designed for multi-hour conversations on a single workstation. All inference runs locally; no audio or transcripts leave the machine.
Code-complete through M8C. All eleven implementation milestones (M0 → M1 → M2 → M3 → M4 → M4B → M5 → M6 → M6B → M7 → M7B → M8A → M8B → M8C) have shipped. The full speech-to-speech loop runs end-to-end against the Compose stack: PortAudio → resample → AEC → VAD → streaming STT → dialogue FSM → LLM (SSE) → sentence splitter → TTS → playback queue → speakers. Barge-in cancels mid-speech; SQLite-backed turn lifecycle survives crashes; Prometheus + Grafana observability stack ships alongside; warm restart preserves session continuity. Test suite: 376 unit cases (no external deps) + 13 integration cases (real Silero + live Speaches), all green.
Stretch packaging (AUR PKGBUILD, .deb, image-digest pinning, observability project-label cleanup, man page, fresh-VM bare-metal acceptance) is folded into plans/m13_distribution_hardening/p13.0_m8c_packaging_stretch/ — none MVP-blocking. Full milestone history in docs/history/MVP/milestones/; architectural revisions and resolved questions in docs/history/MVP/open_questions.md §L. Post-MVP roadmap in plans/.
The product target is always-listening conversational mode, not press-button-and-ask. Wake-word (M8C) ships off-by-default; the conversational gate ("is the user addressing me?") moves to M10 address detection — see docs/history/MVP/open_questions.md §L8.
A configurable voice assistant that:
- runs entirely on local hardware (RTX 4060 8 GB target)
- speaker mode primary, with system-AEC for reliable barge-in
- supports multilingual conversation with per-utterance language detection
- streams partial STT and starts the LLM speculatively for sub-2-second latency (streaming session shipped in M5; speculation lifted to M9)
- holds 30-minute to multi-hour conversations without crashes, leaks, or latency drift
- recovers cleanly from mid-turn crashes and survives warm restarts
- is observable from day one — structured JSON logs, per-turn traces, Prometheus metrics, opt-in OTLP, Grafana dashboard
Mic → Resample → APM (AEC/NS/AGC) → VAD → Endpointer → Utterance Buffer → STT (streaming + final)
│
▼
Dialogue FSM ↔ Memory (SQLite WAL)
│
▼
LLM (SSE)
│
▼
Sentence Splitter → TTS → Playback Queue → Resample → Speaker
▲
└── Loopback (AEC reference, when in-process APM is on)
- C++ orchestrator runs as a host CLI binary and owns audio I/O, dialogue state, memory, cancellation, and observability.
- Model backends run as separate processes, never forked from the orchestrator:
llama.cpp/llama-serverfor LLM, Speaches for STT + TTS (one OpenAI-API-compatible surface that consolidates faster-whisper + Piper). Default deployment is Docker Compose; systemd units are an alternative production path. - Realtime audio path is isolated: lock-free SPSC ring between the audio callback and the processing thread; no blocking, no allocation, no I/O on the audio thread.
- Cancellation is structural: every long-running operation carries a turn ID; barge-in invalidates the turn ID and stale work is rejected at every queue boundary.
- AEC is system-level by default: PipeWire
module-echo-cancelupstream of acva delivers 25–46 dB of speech-band cancellation on the dev workstation. The in-process WebRTC APM stays compiled but disabled; seedocs/reports/aec_report.mdfor the full M6 + M6B analysis.
See docs/project_design.md for the complete architecture and docs/diagrams/observability.svg for the runtime topology.
| Concern | Choice |
|---|---|
| Language | C++23 (no modules, no Cobalt — see docs/history/MVP/open_questions.md §L2) |
| Async / threads | std::thread + dedicated workers; SPSC ring across audio boundary |
| Audio I/O | PortAudio + soxr (resample) |
| AEC / NS / AGC | PipeWire module-echo-cancel (default) + WebRTC APM (in-process, opt-in) |
| VAD | Silero VAD v5 via ONNX Runtime |
| Wake-word (off-by-default) | openWakeWord 3-stage ONNX (mel + embedding + classifier) |
| LLM | llama.cpp / llama-server (CUDA build) + Qwen 7-8B family |
| STT | Speaches (faster-whisper-large-v3-turbo, OpenAI-API + WebRTC realtime) |
| TTS | Speaches → Piper voices |
| Storage | SQLite (WAL mode) on a dedicated memory thread |
| HTTP — non-streaming | cpp-httplib (vendored single-header) |
| HTTP — SSE / WebRTC | libcurl (LLM SSE), libdatachannel (Speaches realtime) |
| Config / JSON | glaze (JSON + YAML) |
| Logs / metrics | spdlog (JSON sink) + prometheus-cpp |
| Dashboard | Grafana 11.3 + Prometheus 2.55 (auto-provisioned) |
| Tracing (opt-in) | opentelemetry-cpp (OTLP/HTTP) |
| Tests | doctest |
| Build | CMake + presets, Ninja |
src/ C++ source. Per-subsystem subdirs: audio/, cli/, config/,
demos/, dialogue/, event/, http/, llm/, log/, memory/,
metrics/, observability/, orchestrator/, pipeline/,
playback/, stt/, supervisor/, tts/.
src/main.cpp ~546 lines of linear orchestration: parse args → load
config → demo dispatch → build per-subsystem stacks via
orchestrator/ helpers → main loop with reload + warm-
restart drain → orderly shutdown.
src/orchestrator/ host-side glue, organised into 5 subdirs:
stacks/ tts/capture/stt/dialogue stacks (RAII)
boot/ bootstrap, system_aec, model-controller
handoff, startup_runner
admin/ hot-reload, privacy commands, restart,
session resume
observability/ event_tracer, status_extra,
barge_in_metrics, supervisor_setup
io/ stdin_reader (text-driven sessions)
src/demos/ 23 `acva demo <name>` smoke / debug commands.
tests/ doctest suites: acva_unit_tests (376 cases, no deps)
+ acva_integration_tests (13 cases, real Silero + Speaches).
config/default.yaml runtime config + `models:` registry (LLM/STT/TTS/VAD/
wake-word aliases) + per-personality overrides.
cmake/ Dependencies.cmake, Warnings.cmake.
third_party/cpp-httplib/ vendored single-header HTTP server.
scripts/ one-shot dev shell scripts (dev-up, dev-down, soak,
aec_analyze, barge-in-probe, validate-bargein,
wake-word-offline, ...).
tools/acva-models Python CLI driving the model registry: list / install /
sync / select / verify / status.
tools/train-wake-word Python driver for custom openwakeword classifiers.
packaging/
compose/ docker-compose.yml — `llama` + `speaches`.
systemd/ user-systemd units — `acva-llama`, `acva-speaches`,
`acva.service`, `acva.target`.
observability/ docker-compose.yml for Prometheus + Grafana.
grafana/ auto-provisioned dashboard (acva.json).
model-controller/ Go sidecar — owns the docker socket, swaps llama models
via `POST /llm/load`. Operator opt-in.
plans/
project_design.md source of truth for architecture, milestones, risks.
open_questions.md resolved / open decisions; §L holds implementation-
driven revisions that supersede earlier sections.
milestones/m{0..8}_*.md per-milestone implementation plans.
docs/
troubleshooting.md symptom-first guide; routes failures to the right demo.
aec_report.md M6 + M6B AEC analysis.
observability.{dot,svg,pdf} runtime topology diagram.
build.sh ./build.sh [dev|debug|release].
run_tests.sh unit suite (no external deps).
run_integration_tests.sh integration suite (Silero + live Speaches).
CMakeLists.txt, CMakePresets.json
README.md, CLAUDE.md, LICENSE, .editorconfig, .gitignore
- GPU: RTX 4060 8 GB (CUDA 12+). Tested on 595 driver / CUDA 13.2.
- CPU: modern x86_64 (≥ 8 cores).
- OS: Linux. Manjaro/Arch is primary, Ubuntu 24.04 LTS secondary.
- Audio: USB or built-in mic + speakers. PipeWire ≥ 1.0 recommended for the system-AEC path.
VRAM budget on the 4060: llama-7B Q4_K_M (~5 GB) + faster-whisper-large-v3-turbo (~1.6 GB) + Piper voices (~250 MB) ≈ 6.8 GB resident, leaving ~1.2 GB headroom. The Speaches WHISPER__TTL=-1 pin is non-negotiable on this card — the default 5-min auto-evict combined with faster-whisper's #992 leaks ~300 MB per unload cycle and OOMs after a few reloads. See docs/history/MVP/open_questions.md §L7 for the rationale.
The orchestrator runs on the host as a CLI binary; backends run in containers.
Manjaro / Arch:
# Build toolchain + system libs
sudo pacman -S --needed \
base-devel cmake ninja pkgconf git \
sqlite curl portaudio soxr \
onnxruntime spdlog fmt \
webrtc-audio-processing \
libdatachannel \
nvidia-container-toolkit
# AUR (yay / paru)
yay -S --needed prometheus-cpp opentelemetry-cpp glaze cpp-httplib doctestDebian / Ubuntu 24.04+:
sudo apt install --no-install-recommends \
build-essential gcc-13 g++-13 cmake ninja-build pkg-config git \
libsqlite3-dev libcurl4-openssl-dev portaudio19-dev libsoxr-dev \
libonnxruntime-dev libspdlog-dev libfmt-dev \
libwebrtc-audio-processing-dev \
libdatachannel-dev
# Build from source under ~/src or vendor as submodules:
# prometheus-cpp, opentelemetry-cpp, glaze (header-only), cpp-httplib (header-only)NVIDIA Container Toolkit needs a CDI spec for GPU passthrough; regenerate after every driver upgrade:
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yamlIf cuInit returns "unknown error" inside the container even though nvidia-smi works, the /dev/nvidia-uvm major drifted out of sync with the CDI spec — bounce dockerd after regenerating. Same playbook covers the post-upgrade case where the kernel module reloaded with a new dynamic major.
./build.sh # = dev preset; output under _build/dev/
./build.sh release # -DNDEBUG, tests off
./run_tests.sh # unit suite, ~5 s
./run_integration_tests.sh # integration suite, needs Speaches upThe build.sh wrapper caps parallelism so it doesn't saturate the box. Raw cmake --preset dev && cmake --build --preset dev works too.
./scripts/dev-up.shThis wraps cd packaging/compose && docker compose -p acva up -d, polls until acva-llama and acva-speaches are both healthy, and runs tools/acva-models status to surface any missing model assets. On first run it materializes .env from .env.example (replacing the __SET_HOME__ sentinel).
tools/acva-models sync # installs every alias the active config referencesDefault footprint ≈ 6.5 GB:
| File | Size | Source |
|---|---|---|
llama.cpp/<active LLM>.gguf |
~4.7 GB | HuggingFace via direct download |
speaches/... faster-whisper-large-v3-turbo |
~1.6 GB | HuggingFace via Speaches POST /v1/models/<id> |
speaches/... Piper voices (en + ru × 4) |
~250 MB | HuggingFace via Speaches |
silero/silero_vad.onnx |
~2 MB | snakers4/silero-vad |
The downloader is idempotent / resumable. To swap an LLM:
tools/acva-models select llm dialog # alias from config/default.yaml./_build/dev/acva
# or
./_build/release/acvaacva resolves config + DB paths via XDG (~/.config/acva/default.yaml, ~/.local/share/acva/acva.db) — see CLAUDE.md for full path resolution rules. Control plane:
curl -sS http://127.0.0.1:9876/status | jq
curl -sS http://127.0.0.1:9876/metrics | headDemos covering individual paths:
./_build/dev/acva demo # list every demo
./_build/dev/acva demo health # probe each backend's /health
./_build/dev/acva demo chat # text-in → speech-out smoke
./_build/dev/acva demo capture # mic + VAD endpointing report
./_build/dev/acva demo transcribe # mic + streaming STT
./_build/dev/acva demo bargein # auto-injected interrupt cascade./scripts/dev-down.sh # stop, keep volumes
./scripts/dev-down.sh --wipe # also clear anonymous volumesThe ACVA_MODELS_DIR host bind-mount is never wiped automatically; remove it manually if you really want a fresh model store.
cd packaging/observability && docker compose -p acva-obs up -d
# Grafana on http://127.0.0.1:3000 (anonymous viewer auth; admin/admin first login)
# Prometheus on http://127.0.0.1:9090The dashboard auto-loads from packaging/grafana/acva.json — 7 panels covering FSM state, Speaches VRAM + wedged classifier, TTS first-audio P50/P95, service health, playback queue + underruns, watchdog, and pipeline state.
For unattended deployments without Docker, four user-systemd units ship in packaging/systemd/:
| File | Role |
|---|---|
acva-llama.service |
LLM backend (llama.cpp llama-server) on 127.0.0.1:8081 |
acva-speaches.service |
STT + TTS backend (Speaches) on 127.0.0.1:8090 (Compose) / :8000 (bare-metal) |
acva.service |
Orchestrator (control plane on 127.0.0.1:9876) |
acva.target |
Convenience target — brings up all three in dependency order |
Install:
./scripts/install-systemd.sh # copies units to ~/.config/systemd/user/
systemctl --user daemon-reload
systemctl --user start acva.targetpackaging/systemd/README.md has the full runbook including bare-metal binary layout, system-wide install variant (root-owned), and the loginctl enable-linger step for headless workstations. The optional sd-bus extension to the supervisor is gated by -DACVA_ENABLE_SDBUS=ON at build time and cfg.supervisor.bus_kind: user at runtime.
nvidia-cdi-refresh.service is provided as a one-shot helper that regenerates /etc/cdi/nvidia.yaml on boot — useful on rolling-release distros where the spec drifts after kernel-module dynamic-major changes.
Two suites, two binaries, two scripts:
./run_tests.sh # unit — no external deps, ~5 s
./run_tests.sh dev --test-case='paths*' # filter pass-through to doctest
./run_integration_tests.sh # integration — Silero + live SpeachesThe integration suite resolves model paths via the same XDG defaults acva itself uses, so on the dev workstation no env vars are needed. Missing assets cause a clean per-case skip, never a fail.
For longer runs:
./_build/dev/acva demo soak-mini # 60 s FakeDriver smoke
./scripts/soak.sh # 4-hour acceptance run with CSV trace + reportdocs/guide/troubleshooting.md is the symptom-first guide — routes failures to the right acva demo <name> and reads its output. Common entry points:
| Symptom | Start with |
|---|---|
| Can't connect to a backend | acva demo health |
| Mic captures silence | scripts/wake-word-offline.sh (records raw + AEC, runs both through STT + wake-word) |
| TTS audio cuts in/out | playback queue depth + underruns metrics; acva demo tts |
| Barge-in doesn't fire | scripts/barge-in-probe.py, acva demo bargein-validation |
| AEC isn't cancelling | acva demo aec-record + scripts/aec_analyze.py |
| Speaches stuck at full VRAM | acva demo wedge reproduces the CUDA-OOM wedge case |
cuInit "unknown error" inside container |
regenerate CDI spec + bounce dockerd; see Quickstart §1 |
docs/project_design.md— architecture source of truth (sections referenced throughout the codebase comments).docs/history/MVP/open_questions.md— resolved + open decisions; §L holds implementation-driven revisions.docs/history/MVP/milestones/m{0..8}_*.md— per-milestone implementation history.docs/reports/aec_report.md— M6 + M6B AEC analysis.docs/guide/troubleshooting.md— symptom-first guide.docs/diagrams/observability.{svg,pdf}— runtime topology diagram.CLAUDE.md— guidance for Claude Code working in this repo (also useful as a deeper reading-list for humans).
See LICENSE.