Fully on-device, live bilingual speech-to-text for church outreach at Stark Road Gospel Hall (Farmington Hills, MI). English/Spanish, real-time mic input, browser display. No cloud APIs, no internet required at runtime.
Two-Pass Pipeline
================
Mic (48kHz) ──> Resample 16kHz (<1ms) ──> Silero VAD (<1ms) ──┐
│
┌─────────────────────────────────────────────────────┘
│
├─ PARTIAL (every 0.6s of new speech, while speaker is talking)
│ Whisper Turbo + W16 LoRA via CT2 (~353ms p50 / ~413ms p95)
│ MarianMT EN↔ES PyTorch (~250ms) ← italic in UI
│ Total: ~600ms p50 / ~660ms p95
│
└─ FINAL (on 0.5s silence gap or 8s max utterance)
Whisper Turbo + W16 LoRA via CT2 (~353ms p50 / ~413ms p95)
Gemma 4 E4B Q4_K_M via llama.cpp (~470ms) ← replaces partial
├─ Piper TTS (~40ms/word EN, --tts) ← audio output
Gemma 4 E2B Q4_K_M via llama.cpp (~280ms) ← low-VRAM fallback
Total: ~820ms (E4B) / ~630ms (E2B)
STT measurements: A2000 Ada Mobile (165 GB/s BW), beam_size=1, 41-clip
bench on 1-15s utterances. WER 11.00% overall / 8.70% on theological
terms (W16 fine-tune drops WER 19% / 43% relative vs off-the-shelf).
Full bench: docs/archive/v2026.7/STT_BENCHMARK.md.
Faster GPUs scale ~linearly with memory bandwidth (RTX 3060 12GB ≈ ~190ms p95).
Pipeline overlap: translation runs on utterance N
while STT runs on utterance N+1, hiding translation latency.
│
▼
WebSocket (0.0.0.0:8765)
HTTP (0.0.0.0:8080)
│
┌──────────┬───────────┼───────────┐
▼ ▼ ▼ ▼
Audience A/B/C Mobile CSV +
Display Compare Display Diagnostics
(projector) (operator) (QR code) (JSONL)
Runs on Apple Silicon (MLX) or NVIDIA GPUs (CUDA) via --backend auto|mlx|cuda.
# Mac (Apple Silicon)
brew install ffmpeg portaudio
uv tool install 'stark-translate[mlx]' # or: pip install '.[mlx]'
huggingface-cli login # Required for TranslateGemma
stark-translate setup # Download all models
stark-translate operator # Open the operator UI
# NVIDIA (Linux)
uv tool install 'stark-translate[cuda]' # or: pip install '.[cuda]'
stark-translate setup
stark-translate operatorThe legacy requirements-mac.txt / requirements-nvidia.txt install path is
still supported for the WSL training environment but deprecated for inference;
see CLAUDE-windows.md for training-side instructions.
Key flags: --lang es (Spanish speaker mode), --tts (audio output), --ab (A/B comparison), --dry-run-text "test" (no mic), --vad-threshold 0.3, --log-level DEBUG.
| Component | Model | Size | Latency (CUDA) |
|---|---|---|---|
| STT default (v2026.7) | Whisper Large-V3-Turbo + W16 LoRA → CTranslate2 int8_float16 | ~1.5 GB | see docs/archive/v2026.7/STT_BENCHMARK.md |
| STT (off-the-shelf fallback) | faster-whisper large-v3-turbo (CT2) | ~1.5 GB | ~500ms (pre-W16) |
| STT (alt path) | HF Whisper + distil-large-v3.5 spec decode | ~3 GB | varies |
| Translation (partials) | MarianMT opus-mt-en-es / es-en | ~298 MB | ~250ms |
| Translation default (CUDA, finals) | Gemma 4 E4B Q4_K_M (llama.cpp) | 5.0 GB GGUF / 4.9 GB VRAM | ~470ms |
| Translation low-VRAM (CUDA, finals) | Gemma 4 E2B Q4_K_M (llama.cpp) | 3.2 GB GGUF / 3.5 GB VRAM | ~280ms |
| Translation (Mac MLX) | TranslateGemma 4B / 12B 4-bit | ~2.5 GB / ~7 GB | ~550ms / ~2.1s |
| TTS | Piper EN/ES (ONNX) | ~63 MB | ~40ms/word |
| VAD | Silero VAD | ~2 MB | <1ms |
CUDA path now uses llama.cpp via engines/llamacpp_engine.py (v2026.5+) for ~5–9× speedup and ~4× VRAM reduction vs HF NF4. Caller starts llama-server (see start_server.sh). MLX/Mac path unchanged. Pipeline overlap hides translation latency by running translation(N) concurrent with STT(N+1). See docs/archive/v2026.5/BENCHMARK.md for the full Phase 1A benchmark across TG4B/TG12B/E2B/E4B × HF/GGUF.
Five browser-based displays served over LAN. Phones connect via QR code on the audience display.
| Display | Purpose |
|---|---|
audience_display.html |
Projector: EN/ES side-by-side, fading context, fullscreen, QR overlay |
ab_display.html |
Operator: A (default) / MarianMT / B comparison with latency stats |
mobile_display.html |
Phone/tablet: responsive, model toggle, Spanish-only mode |
church_display.html |
Simplified church layout |
obs_overlay.html |
Transparent overlay for OBS Studio streaming |
Fine-tuning runs on Windows/WSL (A2000 Ada 16GB). Adapters transfer to Mac for inference.
Translation (TranslateGemma QLoRA): S1-S9 ablation sweep complete. S6 winner: balanced 1:1 verse/sermon ratio, COMET proximity to 12B base = -0.0002. Trained on ~155K biblical verse pairs (public domain KJV/ASV/WEB/BBE/YLT paired with RVR1909) + DeepL-augmented sermon pairs.
STT (Whisper LoRA): W12 data scaling run on 198K Deepgram-aligned chunks from 328 sermons. Baseline WER on fresh eval: 21.41%. W15 hard example mining pipeline for curriculum learning: mine chunks by WER, filter by difficulty bounds, train with --init-from for adapter weight initialization.
Data pipeline: Deepgram Nova-3 oracle transcription (50 theological keyterms), tiered glossary (50 boost + 229 master terms), sharded Arrow alignment (12GB memory cap), SHA-256 data lockfile, stratified eval sets.
| Target | RAM/VRAM | Config |
|---|---|---|
| Mac (M1-M4) 8 GB+ | ~4.3 GB | TranslateGemma 4B-only (--no-ab) |
| Mac (M1-M4) 18 GB+ | ~11.3 GB | Full A/B (--ab) |
| NVIDIA 6 GB+ (RTX 3060/4060) | ~5 GB | Whisper + Gemma 4 E2B Q4_K_M (llama.cpp) |
| NVIDIA 8 GB+ | ~6 GB | Whisper + Gemma 4 E4B Q4_K_M (llama.cpp) |
| Training (A2000 Ada 16GB) | ~8-12 GB | LoRA/QLoRA fine-tuning |
CUDA HF NF4 not recommended: Gemma 4 E2B/E4B HF NF4 occupy 14–15 GB on GPU due to bf16 Per-Layer Embeddings. Use llama.cpp Q4_K_M instead (4× less VRAM). See
docs/archive/v2026.5/BENCHMARK.md.
pytest tests/ -v # 855 tests, no GPU required
ruff check . && ruff format --check .
mypy engines/ settings.pySeven CI workflows: lint, test (3.11 + 3.12), security (pip-audit), release, label, commitlint, stale. CalVer versioning (YYYY.M.W.PATCH), Codecov coverage, Dependabot.
dry_run_ab.py Main pipeline: mic → VAD → STT → translate → display
settings.py Unified config (pydantic-settings, STARK_ prefix)
(deleted in v2026.7 cleanup; replaced by `stark-translate setup`)
engines/ STT + translation + TTS engine layer
base.py ABCs and result dataclasses
mlx_engine.py Apple Silicon (MLX) implementations
cuda_engine.py NVIDIA CUDA HF implementations (streaming, prompt cache)
llamacpp_engine.py NVIDIA CUDA via llama.cpp HTTP (recommended for production, v2026.5+)
factory.py Auto-detect backend and create engines
start_server.sh Launch llama-server with default Gemma 4 GGUFs
displays/ 5 browser display modes (static HTML/CSS/JS)
training/ Windows/WSL training scripts
train_whisper.py Whisper LoRA (curriculum learning, --init-from)
train_gemma.py TranslateGemma QLoRA
align_deepgram_chunks.py Deepgram-Whisper alignment (sharded Arrow)
mine_hard_examples.py Hard example mining (batched fp16, Tier 1 detection)
build_hard_subset.py WER-bounded filtering with stratified caps
benchmark_gemma4.py TranslateGemma vs Gemma 4 comparison (HF only)
scripts/benchmarks/bench_translate_t1_t4.py Phase 1A benchmark: HF vs llama.cpp, all 4 models, VRAM sampler
tools/ Monitoring & validation
live_caption_monitor.py YouTube caption comparison
translation_qe.py 3-tier translation quality estimation
validate_session.py Post-session validation pipeline
manage_adapters.py Adapter lifecycle (register, activate, rollback)
health_check.py 5-canary-sentence adapter verification
glossary.py Tiered glossary (50 boost + 229 master terms)
features/ Post-processing (not yet integrated with live pipeline)
diarize.py Speaker diarization
extract_verses.py Bible verse reference extraction
summarize_sermon.py Post-sermon summary
| Doc | Contents |
|---|---|
CLAUDE.md |
Project overview, architecture, CI/CD, phase checklist |
CLAUDE-macbook.md |
Mac inference environment |
CLAUDE-windows.md |
Windows/WSL training environment |
engines/CLAUDE.md |
Engine layer: MLX thread safety, CUDA streaming, VRAM tiers |
training/CLAUDE.md |
Fine-tuning: data pipeline, LoRA/QLoRA configs, ablation results |
tools/CLAUDE.md |
Monitoring: YouTube comparison, translation QE, adapter deployment |
displays/CLAUDE.md |
Display modes, WebSocket protocol, operator SPA |
features/CLAUDE.md |
Diarization, sermon summary, verse extraction |
docs/operator_runbook.md |
Day-of-event workflow for non-technical operators |
docs/roadmap.md |
Full project roadmap and metrics |
Done: Bidirectional EN/ES inference (MLX + CUDA), two-pass pipeline with overlap, 5 audience display modes, Piper TTS, TranslateGemma S1-S9 ablation (S6 winner), Whisper W12 data scaling (198K chunks), W15 hard mining pipeline, W16 corrective run (7.25% WER), Deepgram oracle (35 sermons), data integrity pipeline. v2026.5: llama.cpp engine (5–9× faster CUDA, 4× less VRAM), Gemma 4 E4B Q4_K_M as production default, Phase 1D wired into dry_run_ab.py. v2026.6: Operator control plane at http://host:9000/operator/ — FastAPI + vanilla JS, pre-flight gating, mid-session controls, live observability sparklines, audio device hotplug, verse highlights, summary trigger, systemd/launchd/bootstrap.sh. 935 tests, 7 CI workflows.
Next: Deferred patches 9.4.1 (multi-channel TTS routing) and 9.6.1 (live diarization). Then: deploy v2026.6 adapters to Mac for live A/B (Phase 5/7), active learning feedback loop (Phase 6/8), Hindi & Chinese adapters (Phase 8). Phase 1C EAGLE-3 deferred — single-GPU spec decode showed no benefit on this hardware.
Private project. All Bible translation training data uses public domain or CC-licensed sources only.