Speaking is just easier.
Speak freely, type instantly on Wayland (X11 compatible) — 100% local voice dictation for Linux with 25+ languages, 5 translation backends, speaker diarization, and real-time visual feedback. Text appears right where your cursor is.
What is dictee? • System requirements • Quick start • Features • Installation • Configuration • Usage • Post-processing • Limitations • Roadmap • Wiki
dictee is a complete voice dictation system for Linux. Press a shortcut, speak, and the text is typed directly into the active application — any application, any window, any text field.
Transcription is performed 100% locally by default: no audio ever leaves your machine unless you explicitly choose a cloud translation backend.
- 100% local processing by default — no audio leaves the machine unless you explicitly enable a cloud translation backend. Frozen ONNX models, no training on your data.
- 4 ASR backends to choose from — Parakeet-TDT and Canary run as native Rust binaries (ONNX Runtime, low GPU latency), while faster-whisper (99 languages) and Vosk (lightweight CPU) run in Python. Switching is transparent via a Unix socket, so you can pick per language, latency budget or hardware. → 4 ASR backends
- 5 translation backends to choose from — from fully local (Canary, LibreTranslate, Ollama) to cloud (Google, Bing), with an explicit privacy table for each option. → Translation backends
- No duration limit on audio files — the chunked pipeline shipped in v1.3 (`dictee-transcribe`) diarizes a 54-min keynote in 122 s on an 8 GB GPU, where direct mel loading caps at 10-15 min. Ideal for meeting minutes and long interviews.
- Native Linux integration — KDE Plasma 6 plasmoid + PyQt6 system tray (compatible with GNOME, XFCE, Sway via AppIndicator fallback).
| Backend | Min RAM | CPU mode | GPU | Disk |
|---|---|---|---|---|
| Parakeet-TDT (default) | 4 GB | Yes — ~0.8 s per utterance (recent CPU) | NVIDIA 4 GB+ VRAM (~5× faster) | 3 GB |
| Canary-1B v2 | 6 GB | No — encoder too heavy | NVIDIA 6 GB+ VRAM required | 6 GB |
| faster-whisper | 4 GB | Yes — turbo or small | NVIDIA 4 GB+ VRAM (large-v3) | 3 GB |
| Vosk | 2 GB | Yes — by design | — | 50 MB |
Distributions tested: Ubuntu 22.04 / 24.04 · Debian 12 · Fedora 40 / 44 · openSUSE Tumbleweed · Arch Linux · KDE Neon.
Desktop environments: KDE Plasma 6 (full integration via native plasmoid widget) · GNOME, Xfce, Cinnamon (system tray only — GNOME requires the AppIndicator extension).
Three steps to go from zero to dictation in under two minutes:
1. Install
```bash
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash
```

Prefer to audit before running? `install.sh` and `install.sh.sha256` are published as release assets — download both, verify with `sha256sum -c install.sh.sha256`, read the script, then run it.
2. Configure
The first-run wizard walks you through backend selection, model download and keyboard-shortcut binding. Re-run anytime with `dictee --setup`.
3. Speak
Press your shortcut (default F9), speak, release. The transcription appears at your cursor.
For detailed install paths (manual .deb/.rpm, GPU prerequisites, AUR, from source), see Installation below or the wiki's Installation and GPU-Setup pages.
| Backend | Languages | Model size | Warm latency | Notes |
|---|---|---|---|---|
| Parakeet-TDT 0.6B v3 | 25 | ~2.5 GB | ~0.8s CPU · ~0.16s GPU | Default, native punctuation |
| Canary-1B v2 | 25 | ~5 GB | ~0.7s GPU | Built-in translation (25 ↔ EN, 48 pairs) |
| faster-whisper | 99 | ~500 MB–3 GB | ~0.3s | Wide language coverage |
| Vosk | 20+ | ~50 MB | ~1.5s | Lightweight, strict offline |
Each backend runs as a systemd user service with the same Unix socket protocol — switching is transparent. → ASR-Backends wiki
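To illustrate why switching is transparent, here is a minimal Unix-socket round trip in Python. The socket path, wire format and `fake_backend` daemon below are hypothetical stand-ins for illustration, not dictee's actual protocol — the point is that the client side never changes when a different daemon takes over the socket:

```python
import os, socket, tempfile, threading

# Hypothetical socket path; dictee's real path and wire format may differ.
SOCK = os.path.join(tempfile.mkdtemp(), "asr.sock")
ready = threading.Event()

def fake_backend():
    """Stand-in for an ASR daemon: accept one request, reply with a transcript."""
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK)
    srv.listen(1)
    ready.set()                        # socket is now accepting connections
    conn, _ = srv.accept()
    request = conn.recv(4096)          # in reality: audio data or a file reference
    conn.sendall(b"transcript for " + request)
    conn.close()
    srv.close()

threading.Thread(target=fake_backend, daemon=True).start()
ready.wait()

# The client side is identical whichever backend daemon owns the socket.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(SOCK)
cli.sendall(b"utterance-42")
reply = cli.recv(4096)
cli.close()
print(reply.decode())
```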
dictee uses Parakeet-TDT 0.6B v3 by default. On the Open ASR Leaderboard, it outperforms Whisper-large-v3 on multilingual transcription while being significantly smaller and faster:
| Model | Size | English WER | FLEURS multilingual (avg) | Relative speed |
|---|---|---|---|---|
| Parakeet-TDT 0.6B v3 (dictee default) | 600M | ~6.5 % | 12.0 % | ~10× Whisper-large-v3 |
| Whisper-large-v3 | 1.55B | 7.4 % | 12.6 % | baseline |
| Canary-1B v2 (also bundled) | 1B | 7.2 % | – | ~5× Whisper-large-v3 |
| Whisper-large-v3-turbo | 809M | ~7.8 % | – | ~3-4× |
| Vosk (CPU fallback) | 50 MB | ~12-18 % | – | – |
Parakeet-TDT v3 is particularly strong on French, Greek, Estonian and Maltese. For maximum language coverage (99 languages), switch to faster-whisper; for built-in translation, switch to Canary-1B.
Sources: NVIDIA Parakeet-TDT v3 · Open ASR Leaderboard 2025.
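For reference, the WER figures in the table count word-level substitutions, deletions and insertions against a reference transcript, divided by the reference word count. A minimal scorer (a generic sketch, not the leaderboard's exact tooling) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / N,
    computed via a classic dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

# One substitution ("sat" -> "sit") plus one deletion ("the"): 2 / 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```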
| Backend | Privacy | Speed | Quality | Languages |
|---|---|---|---|---|
| Canary-1B | 🔒 Local | Built-in | Excellent | 4 |
| LibreTranslate | 🔒 Local | 0.1–0.3s | Good | 30+ |
| Ollama | 🔒 Local | 2–3s | Excellent | Any (LLM) |
| Google Translate | 🌐 Cloud | 0.2–0.7s | Excellent | 130+ |
| Bing Translator | 🌐 Cloud | 1.7–2.2s | Very good | 100+ |
→ Translation wiki · Ollama-Setup
A 12-step configurable pipeline transforms raw ASR output before it hits your cursor:
- Regex rules + dictionary — 7 languages, ASR variants, voice commands → Rules-and-Dictionary
- LLM correction — optional fluency polish via local Ollama (first / last / hybrid position) → LLM-Correction
- Numbers & dates — cardinal, ordinal, versions, decimals, French times → Numbers-Dates-Continuation
- Continuation buffer — continue a sentence across dictations with last-word memory
- Short-text keepcaps — per-language exceptions for acronyms and names (new in v1.3)
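The continuation-buffer idea can be shown with a toy sketch — the `ContinuationBuffer` class and its end-of-sentence heuristic below are hypothetical simplifications for illustration, not dictee's actual rules:

```python
class ContinuationBuffer:
    """Toy sketch: remember whether the previous dictation left a sentence
    open, and if so, join the next dictation without a fresh capital."""

    def __init__(self):
        self.last_word = None
        self.open_sentence = False

    def merge(self, text: str) -> str:
        out = text
        if self.open_sentence and out:
            out = out[0].lower() + out[1:]   # continuing: no fresh capital
        words = out.split()
        if words:
            self.last_word = words[-1]       # last-word memory
        self.open_sentence = not out.rstrip().endswith((".", "!", "?"))
        return out

buf = ContinuationBuffer()
print(buf.merge("The quick brown fox"))       # The quick brown fox
print(buf.merge("Jumps over the lazy dog."))  # jumps over the lazy dog.
```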
Answer "who spoke when?" in multi-speaker recordings via NVIDIA's Sortformer model. Up to 4 speakers, ideal for meeting notes and interviews. Triggered via Meeting mode or dictee --meeting. → Diarization wiki
Push-to-talk is dictee's main flow, but the bundled dictee-transcribe window also handles offline transcription of any audio or video file you already have. Multi-tab interface, audio player synchronised with the timeline, per-tab translation and LLM analysis, export to PDF / SRT / JSON / Markdown.
- Any input format (mp3, mp4, wav, opus, flac, mkv…) — auto-converted via ffmpeg
- Multi-tab — keep the original transcription side-by-side with translations and LLM analyses (summary, chapters, ASR cleanup…)
- Speaker diarization built-in — toggle on, get up to 4 speakers labelled and renamable
- LLM analysis — 14 providers configurable side by side (Ollama, OpenAI, Claude, Gemini, Mistral, DeepSeek, Groq, Cerebras, OpenRouter…)
- Per-tab translation — Canary / LibreTranslate / Ollama / Google / Bing
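The ffmpeg auto-conversion typically means normalising every input to mono 16 kHz 16-bit PCM WAV, the format most ASR models expect — the exact parameters dictee passes are an assumption here. Building the equivalent command line:

```python
from pathlib import Path

def ffmpeg_args(src: str, rate: int = 16000) -> list[str]:
    """ffmpeg invocation converting any input to mono 16 kHz PCM WAV.
    Sample rate and channel layout are typical ASR defaults, not
    necessarily dictee's exact settings."""
    dst = str(Path(src).with_suffix(".wav"))
    return ["ffmpeg", "-y", "-i", src,
            "-ac", "1",            # downmix to mono
            "-ar", str(rate),      # resample to 16 kHz
            "-c:a", "pcm_s16le",   # 16-bit PCM
            dst]

print(" ".join(ffmpeg_args("keynote.mkv")))
```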
- KDE Plasma 6 widget — native QML plasmoid, 5 animation styles, live state → Plasmoid-Widget
- System tray icon — PyQt6, works on GNOME/XFCE/Sway (AppIndicator fallback) → Tray-Icon
- animation-speech (external) — fullscreen overlay on `wlr-layer-shell` compositors
All three share state via a filesystem watcher — any change is reflected instantly across interfaces (multi-user safe with UID suffix).
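A minimal sketch of the shared-state mechanism (the file name and JSON schema below are hypothetical — dictee's actual state file may differ): every UI reads the same per-user file, the UID suffix keeps users from colliding, and an atomic replace guarantees watchers never see a half-written state.

```python
import json, os, tempfile

# Hypothetical location and schema, for illustration only.
STATE = os.path.join(tempfile.gettempdir(), f"dictee-state-{os.getuid()}.json")

def write_state(recording: bool, backend: str) -> None:
    """Write to a temp file, then atomically replace: a filesystem watcher
    always sees either the old state or the new one, never a partial write."""
    tmp = STATE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"recording": recording, "backend": backend}, f)
    os.replace(tmp, STATE)   # atomic rename on POSIX

def read_state() -> dict:
    with open(STATE) as f:
        return json.load(f)

write_state(True, "parakeet")
print(read_state())
```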
animation-speech is a standalone project that provides a fullscreen visual animation during recording, with cancellation via the Escape key. It works on any Wayland compositor supporting wlr-layer-shell (KDE Plasma, Sway, Hyprland…).
```bash
sudo dpkg -i animation-speech_1.2.0_all.deb
```

Download: animation-speech releases
Note: animation-speech is not compatible with GNOME (no `wlr-layer-shell` support). GNOME users can rely on `dictee-tray` for visual feedback. Contributions for a GNOME Shell extension are welcome — see the plasmoid source for reference architecture.
Auto-detects distro and GPU, adds the NVIDIA CUDA repo if needed, installs the right package:
```bash
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash
```

Supported: Ubuntu, Debian, Fedora, openSUSE, Arch Linux. Other distros fall back to the tarball path.
Options (after `--`):

```bash
# Force CPU (skip GPU detection)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --cpu

# Force GPU (CUDA)
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --gpu

# Pin a specific version
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --version 1.3.4

# Non-interactive
curl -fsSL https://raw.githubusercontent.com/rcspam/dictee/master/install.sh | bash -s -- --non-interactive
```

Download from Releases.
Ubuntu / Debian (CPU):

```bash
sudo apt install ./dictee-cpu_1.3.4_amd64.deb
```

Ubuntu / Debian (GPU): requires the NVIDIA CUDA APT repo — see GPU-Setup for the one-time setup, then:

```bash
sudo apt install ./dictee-cuda_1.3.4_amd64.deb
```

Fedora / openSUSE (CPU):

```bash
sudo dnf install ./dictee-cpu-1.3.4-1.x86_64.rpm
```

Fedora / openSUSE (GPU): add the CUDA repo first (see GPU-Setup), then `dictee-cuda-1.3.4-1.x86_64.rpm`.
Arch Linux (AUR): PKGBUILD in the repo root (x86_64 + aarch64). Clone, then `makepkg -si`.
aarch64 / Jetson: no pre-built package — build from source. CUDA limited to NVIDIA Jetson boards.
Other distros (tarball):

```bash
tar xzf dictee-1.3.4_amd64.tar.gz
cd dictee-1.3.4
sudo ./install.sh
```

The tarball ships binaries only — system dependencies must be installed beforehand via your distro's package manager. Names vary by distro; pick the equivalents in yours:

- `python3` (≥ 3.10), `python3-pip`, `python3-venv`
- `python3-evdev`, `python3-pyqt6` (+ qtmultimedia + qtsvg), `python3-numpy`
- `pulseaudio-utils`, `pipewire` (or `alsa-utils`), `libnotify(-bin)`, `sox`
- `wl-clipboard` (Wayland), `xclip` (X11)
- `translate-shell`, `curl`
Example for Debian-derived distros (Mint, MX, Pop!_OS…):

```bash
sudo apt install python3 python3-pip python3-venv python3-evdev \
  python3-pyqt6 python3-pyqt6.qtmultimedia python3-pyqt6.qtsvg python3-numpy \
  pulseaudio-utils pipewire libnotify-bin sox wl-clipboard xclip \
  translate-shell curl
```

From source: `cargo build --release --features sortformer`, then `sudo ./install.sh`. See Developer-Guide for full Cargo features and package build scripts.
First launch triggers a setup wizard (backend, model, shortcuts).
Reconfigure anytime from the application menu, tray icon, Plasma widget, or by running:
```bash
dictee --setup
```

```bash
# Show current backends
dictee-switch-backend status

# Switch ASR (parakeet · canary · whisper · vosk)
dictee-switch-backend asr canary

# Switch translation (canary · libretranslate · ollama · google · bing)
dictee-switch-backend translate ollama
```

The tray and plasmoid include backend sub-menus — no terminal required.
For detailed configuration (all ASR backends, translation matrix, plasmoid settings, keyboard shortcuts on tiling WMs), see the wiki:
- ASR-Backends · Translation
- Plasmoid-Widget · Tray-Icon
- Keyboard-Shortcuts (KDE/GNOME/Sway/i3/Hyprland)
```bash
# Simple dictation — transcribe and type
dictee

# Dictate + translate (default: system language → English)
dictee --translate
dictee --translate --ollama        # 100% local via Ollama

# Change target language
DICTEE_LANG_TARGET=es dictee --translate   # → Spanish

# Meeting mode (diarization, up to 4 speakers)
dictee --meeting

# Cancel ongoing dictation
dictee --cancel

# Test post-processing rules live
dictee-test-rules                  # interactive
dictee-test-rules --loop           # continuous loop
dictee-test-rules --wav file.wav   # from audio file
```

→ Full command reference: CLI-Reference wiki
dictee runs a configurable 12-step pipeline after transcription and before paste:
1. ASR variants normalization
2. Dictionary substitution
3. Numbers & dates conversion
4. Continuation buffer merge
5. Regex rules (pre-LLM)
6. LLM correction (optional, first position)
7. Regex rules (post-LLM)
8. Short-text exceptions (keepcaps)
9. Extended match mode
10. Final capitalization
11. Translation (optional)
12. Paste / inject
Configure via `dictee --setup` → Post-processing tab, or test rules live with `dictee-test-rules`.
→ Deep dives: Post-Processing-Overview · Rules-and-Dictionary · LLM-Correction · Numbers-Dates-Continuation
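The ordered-pipeline idea can be sketched as a list of functions applied in sequence — the step implementations below are toy stand-ins for illustration, not dictee's actual rules:

```python
def normalize_variants(text: str) -> str:
    # Toy stand-in for ASR-variant normalization.
    return text.replace("gonna", "going to")

def dictionary_subst(text: str) -> str:
    # Toy stand-in for user-dictionary substitution.
    return text.replace("git hub", "GitHub")

def final_capitalization(text: str) -> str:
    return text[:1].upper() + text[1:] if text else text

# Steps run in a fixed, configurable order; disabling one just removes it.
PIPELINE = [normalize_variants, dictionary_subst, final_capitalization]

def postprocess(text: str) -> str:
    for step in PIPELINE:
        text = step(text)
    return text

print(postprocess("gonna push this to git hub"))
# Going to push this to GitHub
```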
- Long-file diarization: the chunked pipeline shipped in v1.3 (used by `dictee-transcribe`) lifts the VRAM cap (54-min keynote diarized in 122 s on 8 GB). In continuous live dictation (push-to-talk held without releasing), a single utterance > 10-15 min on an 8 GB GPU may still OOM — rare in practice; split the file or switch to the CPU backend. → Diarization wiki
- AMD / Intel GPUs are not currently supported — dictee falls back to CPU.
- No real-time streaming — Parakeet-TDT and Canary require the full utterance; only Nemotron (EN-only, via Rust binary) streams natively.
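The chunked pipeline boils down to transcribing fixed-length windows and stitching the results. A sketch of the window computation — 180 s matches the chunk size `dictee-transcribe` uses, while the 5 s overlap here is a hypothetical illustration of how boundary words can be caught by both chunks and de-duplicated when stitching:

```python
def chunk_spans(duration_s: float, chunk_s: float = 180.0,
                overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Fixed-size windows over a file's duration, each overlapping the
    previous one slightly so no word is lost at a chunk boundary."""
    spans, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s   # step back for the overlap
    return spans

# A 54-minute keynote (3240 s) splits into 19 windows of <= 180 s each.
print(len(chunk_spans(54 * 60)))
```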
For bug reports and workarounds, see Troubleshooting.
v1.3.4 (current) — Universal chunked transcription + dictee-transcribe UX hardening:
- Universal chunked transcription in `dictee-transcribe` — files of any length now split automatically into 180 s chunks on every host (CPU and GPU). New per-backend cap on live-dictation recording duration (Canary 2:30, Parakeet 4:30; Whisper / Vosk uncapped) to prevent silent crashes.
- Five-site target-tab UI hardening in `dictee-transcribe` — text editor, rename panel, timeline markers, audio player swap, and transcription render now only update the global UI when the target tab is visible. No more cross-tab corruption when transcribing one file while reviewing another.
- Translate skip surfacing — silent translate-skip cases now show a colored status message (i18n in 6 languages: fr / de / es / it / pt / uk).
- Diarize falls back to standalone Parakeet + Sortformer when the PTT daemon is Canary — avoids silent mistranscription on files whose language ≠ `DICTEE_LANG_SOURCE` (the Canary daemon is locked at startup). The standalone binary costs ~5–10 s extra model load.
- Socket read timeout bumped 30 → 120 s for large files.
- GPU-fallback warning suppressed when stderr is piped.
- Default cheatsheet shortcut now "Same key + Shift" (was Disabled).
- PTT / dictee stale-state cleanup on next F9 keypress (recovers from a daemon killed mid-flight, OOM, or signal).
v1.3.0 → v1.3.3 — The v1.3 series. Major additions over v1.2:
- `dictee-transcribe` — dedicated window for offline transcription of audio/video files (timeline player, multi-tab, per-tab translation and LLM analysis, export to PDF / SRT / JSON / Markdown).
- Speaker diarization up to 4 speakers via NVIDIA Sortformer, plus a chunked pipeline that lifts the VRAM cap on long files (54-min keynote diarized in 122 s).
- LLM analysis on diarized transcripts — synthesis, chapters, ASR cleanup; 14 providers configurable side by side (Ollama, OpenAI, Claude, Gemini, Mistral, DeepSeek, Groq, Cerebras, OpenRouter…).
- Canary-1B v2 ASR backend (NVIDIA AED) with built-in translation on 12 native pairs.
- Portable CUDA libs via pip venv at postinst — no NVIDIA repo required.
- CUDA → CPU runtime fallback + `DICTEE_FORCE_CPU=1` override (v1.3.2).
- Cross-distro packaging consistency — Arch `.install` group hooks, `python-evdev` hard depends, `sg docker` / `sg input` wrappers, udev `0660` direct (v1.3.3, closes #5 + #6).
- Short-text keepcaps exceptions (7 languages), extended match mode, version-number dictation, multi-user safe state, plasmoid cross-process toggles, 682 postprocess + 148 pipeline tests (v1.3.0).
v1.4+ (planned)
- Hotword boosting — bias ASR decoding toward custom names (shallow fusion on TDT logits, Parakeet only)
- Whisper translate — multi-target translation via `task="translate"` (EN-only, offline)
- Moonshine CPU backend
- CLI speech-to-text — pipe audio, get text
- VAD — hands-free dictation without push-to-talk
- Streaming transcription with live text display
- Built-in overlay — replace external `animation-speech`
- AppImage / Flatpak packaging
- COSMIC / GNOME Shell applets (contributions welcome!)
→ Full history: Changelog wiki
The transcription engine builds on parakeet-rs by Enes Altun — Rust library for NVIDIA Parakeet inference via ONNX Runtime. The Rust Canary implementation was originally ported from onnx-asr by Ivan Stupakov and is now fully self-contained. Parakeet and Canary ONNX models are provided by NVIDIA (downloaded separately from HuggingFace, not redistributed by this project).
Keyboard input simulation uses dotool by geb (GPL-3.0).
This project is distributed under the GPL-3.0-or-later license (see LICENSE).
The original parakeet-rs code by Enes Altun is under the MIT license (see LICENSE-MIT). dotool is bundled under GPL-3.0.