Tap Caps Lock, speak any language, get any other. Whisper.cpp, no cloud, no GPU rental.
You speak in any language. Whatever language you picked in the menu bar is what comes out. Real examples:
| You said (any language goes) | Active mode | What got pasted at the cursor |
|---|---|---|
| 🇷🇺 «Привет, как у тебя дела?» | 🇬🇧 en | Hello, how are you doing? |
| 🇬🇧 "Let's ship it on Friday" | 🇷🇺 ru | Давай выкатим в пятницу |
| 🇩🇪 "Können wir morgen reden?" | 🇯🇵 ja | 明日話せますか？ |
| 🇰🇷 "안녕하세요, 만나서 반갑습니다" | 🇸🇦 ar | مرحبًا، تشرفت بلقائك |
| 🇯🇵 「コードレビューありがとう」 | 🇺🇦 uk | Дякую за рев'ю коду |
| anything | → English | always English (flagship mode) |
Why this works at all. Whisper's encoder produces a language-agnostic representation of audio: meaning, not words. The decoder writes that meaning down in whichever language you asked for. Swap the language token, get a different output language. Same speech in, different writing out.
The native task=translate flag can't do this on large-v3-turbo: that model was fine-tuned without translation data, so the flag does not work reliably there (and it only ever targets English). We sidestep it.
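As a rough sketch of the idea (not this project's actual client code): the output language is selected purely by the `language` form field sent with each request, and no `translate` field is sent at all. The helper names `build_fields`, `encode_multipart`, and `transcribe` are hypothetical. The `file`, `language`, `response_format`, and `temperature` fields and the `text` key in the JSON reply follow whisper.cpp's server example, but treat them as assumptions and check them against the server version you built.

```python
import json
import urllib.request
import uuid

WHISPER_URL = "http://localhost:8080/inference"


def build_fields(target_lang: str) -> dict:
    """Form fields for one request. Only `language` changes between modes."""
    return {
        "language": target_lang,  # language to write in, e.g. "en", "ru", "ja"
        "response_format": "json",
        "temperature": "0.0",
        # Deliberately no "translate" field: the language swap replaces it.
    }


def encode_multipart(fields: dict, wav: bytes) -> tuple:
    """Encode the fields plus the WAV payload as multipart/form-data."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    file_header = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="audio.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    )
    chunks.append(file_header.encode() + wav + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def transcribe(wav: bytes, target_lang: str, url: str = WHISPER_URL) -> str:
    """POST the audio and return the text written in `target_lang`."""
    body, content_type = encode_multipart(build_fields(target_lang), wav)
    request = urllib.request.Request(url, data=body, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())["text"].strip()
```

With a server running, `transcribe(wav_bytes, "en")` and `transcribe(wav_bytes, "ru")` on the same recording would be expected to return English and Russian text respectively; the audio is identical, only the requested language differs.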
18 modes in the menu bar: 17 target languages + the flagship → English (from any) shortcut. Click to switch; the next dictation lands in the new language.
- Multilingual teams: speak Russian to your dev chat, English to your PR description, German to your designer Slack, all without changing keyboard layouts.
- Coding while talking: narrate the logic out loud, get clean prose in your PR, RFC, commit message, or Notion doc.
- Faster than typing for non-English natives: your brain composes in your native language, the text lands in whichever language the chat needs.
- Voice notes during meetings: instant text in Notes / Obsidian, no "record now, transcribe later" loop.
- Translating quotes / tweets / headlines: read out loud in any language, get any other language back.

    Caps Lock (tap) → 🎙️ recording…
    Caps Lock (tap) → whisper.cpp (localhost:8080) → clipboard → Cmd+V

Tap Caps Lock → speak → tap again → text appears at your cursor in any app: Slack, Notes, VS Code, your browser, terminal. The previous clipboard contents are saved and restored automatically.
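The tap-talk-tap behavior is a two-state toggle. A minimal sketch, assuming the guard against accidental taps is a simple duration check against MIN_RECORDING_SEC; the class name `TapToggle` and the return values are hypothetical, not the project's API:

```python
import time
from typing import Callable, Optional

MIN_RECORDING_SEC = 0.3  # same value as in the configuration section


class TapToggle:
    """First tap starts recording, second tap stops it."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started_at: Optional[float] = None

    @property
    def recording(self) -> bool:
        return self._started_at is not None

    def tap(self) -> str:
        """Return "start", "stop", or "too_short" for this tap."""
        now = self._clock()
        if self._started_at is None:
            self._started_at = now
            return "start"
        elapsed = now - self._started_at
        self._started_at = None
        # Recordings shorter than the threshold are dropped as accidental taps.
        return "stop" if elapsed >= MIN_RECORDING_SEC else "too_short"
```

Injecting the clock keeps the logic testable without real key events or sleeping.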
| Stage | What happens |
|---|---|
| Hotkey | Caps Lock (keycode 57) via Quartz CGEventTap. Toggle: 1st tap starts, 2nd tap stops. Doesn't block input to other applications. |
| Capture | sounddevice at 16 kHz mono float32. WAV stays in memory (io.BytesIO) and is never written to disk. |
| STT | POST to localhost:8080/inference (whisper.cpp). The large-v3-turbo-q5_0 model runs on Metal GPU. |
| Paste | pbcopy + Cmd+V via CGEvent. The previous clipboard is saved and restored. |
Latency: ~0.3-0.5 s for 10 s of speech on Apple Silicon (Metal GPU). Privacy: zero network egress at inference time; audio never leaves your Mac. Cost: zero; the model is downloaded once (~550 MB) and inference is free after that.
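To illustrate the Capture stage, here is one way to turn float samples into a WAV that lives only in memory. It is a standard-library sketch under two assumptions: samples arrive as floats in [-1.0, 1.0] (as sounddevice's float32 streams produce), and the server accepts 16-bit PCM. The function name `floats_to_wav_bytes` is hypothetical, and the real recorder.py may encode differently.

```python
import io
import sys
import wave
from array import array
from typing import Iterable

SAMPLE_RATE = 16000  # whisper.cpp expects 16 kHz


def floats_to_wav_bytes(samples: Iterable[float], sample_rate: int = SAMPLE_RATE) -> bytes:
    """Encode mono float samples as a 16-bit PCM WAV held entirely in memory."""
    # Clamp to [-1, 1] and scale to the signed 16-bit range.
    pcm = array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return buffer.getvalue()  # nothing touches the disk
```

The returned bytes can be sent straight to the server as the uploaded file.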
Platform: macOS (Apple Silicon flagship; Intel Mac works without Metal). Windows / Linux backends: TBD.
Requires Python 3.10+, cmake, and git. Everything else (whisper.cpp + the model) is installed by one command:

    cd cream_typer
    make setup

What make setup does:

- Creates venv/ and installs the package in editable mode with the macOS and dev extras (pip install -e '.[macos,dev]').
- Clones whisper.cpp into vendor/whisper.cpp and builds whisper-server via cmake (Metal is enabled automatically on Apple Silicon).
- Downloads the ggml-large-v3-turbo-q5_0.bin model (~550 MB) into vendor/whisper.cpp/models/.

Already have whisper.cpp installed elsewhere? Override the paths via env (export them or pass them to make):

    WHISPER_DIR=~/code/whisper.cpp make whisper
    # or per-file:
    WHISPER_SERVER=/path/to/whisper-server WHISPER_MODEL=/path/to/ggml.bin make whisper

Subtargets if something specific failed: make install, make whisper-build, make whisper-model. Full wipe: make distclean.
In two terminals:

    # Terminal 1: whisper server (run once, keep it up)
    make whisper
    # Terminal 2: the app itself
    make run

An icon appears in the menu bar. Press Caps Lock, speak, press again, and the text is pasted wherever the cursor is.

    make lint   # ruff check + format check
    make fmt    # ruff format + ruff check --fix
    make test   # pytest

CI runs the same commands on every PR (see .github/workflows/ci.yml). Architecture notes and how to add your own backend live in CONTRIBUTING.md.
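For orientation before opening CONTRIBUTING.md, backend selection by sys.platform (the job the project layout assigns to src/backend/__init__.py) could look roughly like the sketch below. The mapping and the function names are assumptions for illustration; only the module names `_macos`, `_windows`, and `_linux` come from the project layout.

```python
import importlib
import sys

# sys.platform prefix -> backend module name
_BACKENDS = {
    "darwin": "_macos",
    "win32": "_windows",
    "linux": "_linux",
}


def backend_module_name(platform: str = sys.platform) -> str:
    """Map a sys.platform value to the backend module that should handle it."""
    for prefix, module in _BACKENDS.items():
        if platform.startswith(prefix):
            return module
    raise RuntimeError(f"Unsupported platform: {platform!r}")


def load_backend(package: str = "cream_typer.backend"):
    """Import the adapter for the current platform (hypothetical helper)."""
    return importlib.import_module(f"{package}.{backend_module_name()}")
```

Keeping the lookup in a pure function means an unsupported platform fails early with a clear message, before any platform-specific import is attempted.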
| Permission | Where to enable | Why |
|---|---|---|
| Input Monitoring | Settings → Privacy → Input Monitoring | CGEventTap (Caps Lock interception) |
| Microphone | Settings → Privacy → Microphone | Audio capture |
| Accessibility | Settings → Privacy → Accessibility | CGEventPost (Cmd+V simulation) |
Add Terminal (or iTerm), not Python itself, since the app inherits permissions from its parent process.
macOS will pop up the permission dialogs automatically the first time you run the app.

    cream_typer/
    ├── src/                      # imported as `cream_typer` (see pyproject.toml)
    │   ├── __init__.py           # __version__
    │   ├── __main__.py           # `python -m cream_typer`
    │   ├── app.py                # business logic, NO platform-specific code
    │   ├── config.py             # constants and transcription modes
    │   ├── recorder.py           # sounddevice → WAV in memory (io.BytesIO)
    │   ├── transcriber.py        # HTTP client for the whisper.cpp server
    │   └── backend/              # platform adapters (hotkey / paste / tray)
    │       ├── __init__.py       # dispatch by sys.platform
    │       ├── _base.py          # Protocol contracts for contributors
    │       ├── _macos.py         # Quartz CGEventTap + Cmd+V + rumps ✅
    │       ├── _windows.py       # pynput + pystray 🚧 TBD
    │       └── _linux.py         # pynput + pystray (X11) 🚧 TBD
    ├── tests/                    # pytest smoke + transcriber mocks
    ├── scripts/
    │   └── whisper_server.sh     # alternative to `make whisper`
    ├── .github/
    │   ├── workflows/ci.yml      # lint + tests on macOS
    │   ├── ISSUE_TEMPLATE/       # bug / feature
    │   └── PULL_REQUEST_TEMPLATE.md
    ├── pyproject.toml            # build / deps / ruff / pytest config
    ├── Makefile                  # setup / run / lint / fmt / test / whisper / distclean
    ├── CHANGELOG.md              # Keep a Changelog
    ├── CONTRIBUTING.md           # how to contribute
    └── LICENSE                   # MIT

Pick a mode by clicking inside the Languages submenu in the menu bar. The active one is checkmarked. The current set:

    → English (from any): flagship "translate anything to English" shortcut
    🇬🇧 English      🇺🇦 Українська   🇪🇸 Español
    🇩🇪 Deutsch      🇫🇷 Français     🇮🇹 Italiano
    🇵🇹 Português    🇳🇱 Nederlands   🇵🇱 Polski
    🇯🇵 日本語        🇨🇳 中文          🇰🇷 한국어
    🇹🇷 Türkçe       🇹🇭 ไทย          🇻🇳 Tiếng Việt
    🇸🇦 العربية       🇷🇺 Русский

Want a different set? Edit MODES, MODE_LABELS, and MENU_MODES in src/config.py. Whisper supports 99 languages; adding any of them takes one line in each of the three dicts, no UI code required.
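As an illustration of the "one line per dict" edit, here is what adding Swedish might look like. The shapes of the three dicts are hypothetical (the real ones in src/config.py may store different values), and `missing_entries` is an illustrative helper, not part of the project.

```python
# Hypothetical shapes: the real dicts in src/config.py may be structured differently.
MODES = {  # mode key -> Whisper language code sent to the server
    "en": "en",
    "ru": "ru",
    "sv": "sv",  # new line 1: Swedish
}
MODE_LABELS = {  # mode key -> label shown in the menu
    "en": "🇬🇧 English",
    "ru": "🇷🇺 Русский",
    "sv": "🇸🇪 Svenska",  # new line 2
}
MENU_MODES = {  # mode key -> position in the Languages submenu
    "en": 0,
    "ru": 1,
    "sv": 2,  # new line 3
}


def missing_entries() -> dict:
    """Per dict name, list the mode keys that the other dicts define but it lacks."""
    tables = {"MODES": MODES, "MODE_LABELS": MODE_LABELS, "MENU_MODES": MENU_MODES}
    all_keys = set().union(*tables.values())
    gaps = {name: sorted(all_keys - set(table)) for name, table in tables.items()}
    return {name: keys for name, keys in gaps.items() if keys}
```

A check like `missing_entries()` is cheap insurance against updating two dicts and forgetting the third.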
Note on Thai (th): the large-v3-turbo model has noticeably degraded performance on Thai compared to large-v3 (Whisper's original quality matrix). It still works, but expect more errors. Vietnamese is fine.
Everything lives in src/config.py:

    HOTKEY_KEYCODE = 57             # Caps Lock. 60=Right Shift, 61=Right Option, 54=Right Cmd
    SAMPLE_RATE = 16000             # whisper.cpp expects 16 kHz
    WHISPER_URL = "http://localhost:8080/inference"
    DEFAULT_MODE = "en"             # "en" / "ru" / "translate" / "uk" / ...
    MIN_RECORDING_SEC = 0.3         # shorter taps are ignored
    CLIPBOARD_RESTORE_DELAY = 0.15  # delay before the previous clipboard is restored

The whisper-server's default language lives in the Makefile under WHISPER_LANG, but it's only a fallback: the client always passes language explicitly from MODES.
| Symptom | Cause | Fix |
|---|---|---|
| Hotkey doesn't fire | No Input Monitoring permission | Settings → Privacy → Input Monitoring → Terminal |
| Text isn't pasted | No Accessibility permission | Settings → Privacy → Accessibility → Terminal |
| ⚠️ Whisper not running in the menu | Server isn't up | make whisper in a separate terminal |
| Empty transcription / ⚠️ Silence | Too quiet, or shorter than MIN_RECORDING_SEC | Speak louder / hold the tap longer than 0.3 s |
| ⚠️ Too short | Caps Lock tap shorter than MIN_RECORDING_SEC | Hold longer; this guards against accidental taps |
| Double-fire on Shift+Caps | macOS clears AlphaShift on shift+caps | Already handled in src/backend/_macos.py: events with Shift are ignored |
| Wrong output language | Wrong mode active | Menu bar → Languages → pick the right one |
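When the menu reports that Whisper is not running, a quick reachability probe can confirm whether anything is listening on the server port. This is a sketch with a hypothetical function name; it only checks that a TCP connection can be opened, not that the model has loaded.

```python
import socket
from urllib.parse import urlparse


def whisper_is_up(url: str = "http://localhost:8080/inference", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections at the URL's host and port."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    port = parsed.port or 80
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False, start the server with make whisper in a separate terminal and try again.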
The space is crowded; here's what I'd pick depending on what you need:
| Project | Stack | Hotkey | Backend | Notable |
|---|---|---|---|---|
| cream_typer (this) | Python + rumps | Caps Lock toggle | whisper.cpp HTTP | On-the-fly translation by swapping language instead of translate=true |
| foges/whisper-dictation | Python + rumps | Cmd+Option (toggle) | openai-whisper (PyTorch) | The well-known reference, but it loads the model into RAM every time |
| pindrop | Swift native | hold-to-talk | WhisperKit (Core ML) | Fully native, the best perf/battery story |
| vocamac | Swift | hold-to-talk | WhisperKit | Tiny model bundled, works out of the box |
| open-wispr | Electron | hold Globe (🌐) | whisper.cpp | Friendly onboarding, but Electron |
| openwhispr | Electron, cross-platform | hold | local + cloud (BYOK) | Mac/Windows/Linux in one binary |
| GoWhisper | Go | hold | whisper.cpp | Built around terminal / Claude Code |
| AudioWhisper | Swift | hold | OpenAI API / Gemini | Not local: sends audio to the cloud |
Pick cream_typer when:
- You want a toggle (tap-talk-tap) instead of holding a key.
- You already have a whisper.cpp build and don't want yet another process bundled.
- You want dictation in any language with auto-translation to English (and back).
- You care about codebase size: this is ~300 lines, and the whole thing reads in 10 minutes.
Pick something else when:
- You want a native macOS app without Python and cmake: pindrop or vocamac.
- You need Windows/Linux today: openwhispr.
- You'd rather not deal with config: open-wispr (better onboarding).
This project stands on the shoulders of others:
- ggerganov/whisper.cpp: the C++ inference engine that makes local Whisper fast enough on consumer hardware. The entire reason this project is possible.
- OpenAI Whisper: the speech-recognition model architecture and the language-agnostic encoder we exploit for translation.
- rumps: Pythonic macOS menu-bar bindings.
- The dictation OSS scene (foges/whisper-dictation, pindrop, open-wispr, vocamac) for showing the path.
Built by NeCL, an AI engineering studio shipping local-first AI: production RAG, real-time voice agents, on-prem deployment.
Need something custom? neclco.com · Telegram @ownerai · neclcompany@gmail.com