Skip to content

omachala/diction

Repository files navigation

Diction

Speak to Type and Edit

Voice keyboard for iOS. Works in every app.
On-device, cloud, or self-hosted - no limits.

Download on the App Store

WebsiteSelf-Hosting GuidePrivacy Policy

License: MIT Coverage


You talk. We type.  No limits. No word caps. No catch.  What you say stays with you.  Self-host. Your server, your rules.

Why Diction?

  • Works in every app. Tap the mic, speak, watch text land in whatever app you're in - Telegram, Mail, Notes, the search bar, anywhere a keyboard appears.
  • Self-hosted in minutes. docker compose up -d and paste your server's IP. Your hardware, your models, your data.
  • Works with any Whisper-compatible server. The gateway speaks the OpenAI transcription API (POST /v1/audio/transcriptions). Point it at any endpoint that implements it.
  • On-device. Whisper runs locally on your iPhone via WhisperKit. No network, no server, nothing leaves the device.
  • AI transcript cleanup. Wire any OpenAI-compatible LLM - OpenAI, Groq, Ollama, Anthropic - into the gateway to strip filler words and fix punctuation before text reaches the app. BYO prompt.
  • End-to-end encrypted. AES-256-GCM with X25519 key exchange between the app and the gateway. Same primitives used by Signal and WireGuard.
  • Zero tracking in the app. No analytics, no telemetry, no data collection. Audit the source yourself.
  • Free and unlimited. On-device and self-hosted modes have no caps, no word limits, no expiry.

Self-Hosting

The Diction app streams audio over a WebSocket connection, so you need the Diction Gateway in front of whatever speech model you run. The gateway handles the WebSocket protocol, end-to-end encryption, optional LLM cleanup, and model routing.

Full walkthrough with screenshots: How to Set Up Diction - the self-hosted speech-to-text alternative to Wispr Flow

Requirements:

  • Any machine that can run Docker: Mac, Linux box, NUC, home server, VPS. Apple Silicon works (via Rosetta).
  • iPhone running iOS 17.0 or later.

Step 1 - Write the Compose File

Create a folder for the stack and save this as docker-compose.yml:

services:
  whisper-small:
    image: fedirz/faster-whisper-server:latest-cpu
    container_name: diction-whisper-small
    restart: unless-stopped
    volumes:
      - whisper-models:/root/.cache/huggingface
    environment:
      WHISPER__MODEL: Systran/faster-whisper-small
      WHISPER__INFERENCE_DEVICE: cpu

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - whisper-small
    environment:
      DEFAULT_MODEL: small

volumes:
  whisper-models:

The whisper-models volume persists the model weights (~500 MB for small) so they survive container rebuilds. DEFAULT_MODEL: small maps to the service named whisper-small - see Swap the Speech Model if you change the model.

Step 2 - Start the Stack

docker compose up -d

First run pulls the images and downloads model weights - give it 2–3 minutes.

docker compose logs -f          # watch progress
docker compose ps               # check status

Expected:

NAME                     STATUS
diction-gateway          Up 30 seconds
diction-whisper-small    Up 2 minutes (healthy)
Error Fix
pull access denied on gateway image docker logout ghcr.io and retry
exec format error on Apple Silicon Enable Rosetta in Docker Desktop → Settings → General
health: starting for > 3 minutes Model still downloading - docker compose logs -f whisper-small
Gateway exits immediately Whisper container failed - check its logs

Step 3 - Test the Server

Generate a test audio file (macOS):

say -o test.aiff "Hello from my home server"

Or record a voice memo on your phone and AirDrop it over.

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@test.aiff" \
  -F "model=small"
{"text":"Hello from my home server."}
# Check timing headers
curl -sS -D - -o /dev/null \
  -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@test.aiff" -F "model=small" | grep -i diction

X-Diction-Whisper-Ms shows the speech model's inference latency.

Response Cause
Connection refused Gateway not running - docker compose ps
504 Gateway Timeout Whisper still loading - wait 60s
404 Not Found URL typo - path must be exactly /v1/audio/transcriptions
OOM / container crash Model too large for available RAM

Step 4 - Find Your Server's IP

macOS:

ipconfig getifaddr en0
# or
ifconfig | grep 'inet ' | grep -v 127.0.0.1

Linux:

hostname -I | awk '{print $1}'

Windows:

ipconfig | findstr IPv4

Pick the 192.168.x.x or 10.x.x.x address. Ignore anything starting with 100. - that's Tailscale.

Set a DHCP reservation in your router so the IP doesn't change on reboot. Or use Tailscale for a stable address that follows the machine anywhere.

Step 5 - Connect the App

Install Diction on your iPhone. On first launch:

  1. Settings → General → Keyboard → Keyboards → Add New Keyboard → Diction
  2. Tap Diction in the list → enable Allow Full Access
  3. Grant microphone access when prompted

Point it at your server:

  1. Open Diction → PreferencesModeSelf-Hosted
  2. Enter your endpoint: http://192.168.1.42:8080 (your IP from Step 4)
  3. Tap Test connection - you should get a green check within a second

To dictate: open any app, tap a text field, long-press the globe icon (bottom-left of the iOS keyboard), pick Diction, tap the mic, speak, release.

Reach From Anywhere

Tailscale (recommended)

Tailscale creates a private WireGuard mesh between your devices. Install it on the server and iPhone, sign in to the same account, and use the 100.x.x.x Tailscale IP as your Diction endpoint. Works on cellular, café WiFi, anywhere. Free for personal use.

Cloudflare Tunnel (public URL, no port forwarding)

Add to your compose file:

  cloudflared:
    image: cloudflare/cloudflared:latest
    container_name: diction-cloudflared
    restart: unless-stopped
    command: tunnel --no-autoupdate run
    environment:
      TUNNEL_TOKEN: "${CLOUDFLARE_TUNNEL_TOKEN}"

Create a tunnel in the Cloudflare Zero Trust dashboard, grab the token, add it to .env, route the public hostname to http://gateway:8080. Free tier. Note: transcripts pass through Cloudflare's network (HTTPS-encrypted, but a third party is in the path).

ngrok (quick testing)

ngrok http 8080

Free tier URLs change on restart - good for a demo, not daily use.


Swap the Speech Model

Change two lines in your compose file:

DEFAULT_MODEL Service name WHISPER__MODEL RAM Notes
small whisper-small Systran/faster-whisper-small ~850 MB Best for CPU
medium whisper-medium Systran/faster-whisper-medium ~2.1 GB More accurate, slower on CPU
large-v3-turbo whisper-large-turbo deepdml/faster-whisper-large-v3-turbo-ct2 ~2.3 GB Best with NVIDIA GPU
parakeet-v3 parakeet - (baked into image) ~2 GB NVIDIA GPU, 25 European languages

Both DEFAULT_MODEL and the service name must match the table - the gateway resolves backends by Docker hostname. A mismatch returns 404 on every request.

docker compose up -d   # recreates only the changed container

NVIDIA GPU

Install the NVIDIA Container Toolkit on the host first.

Option A - Parakeet TDT 0.6B v3 (fastest, 25 European languages)

Parakeet transcribes a 5-second clip in well under a second on a consumer GPU.

Whisper Large-v3 Parakeet TDT 0.6B v3
WER (English) 7.4% ~6.3%
Latency (GPU) Under 2s Sub-second
VRAM (INT8) ~2.3 GB ~2 GB
Languages 99 25 European

Supported languages: English, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, Ukrainian.

For languages outside this list, use Option B.

services:
  parakeet:
    image: ghcr.io/achetronic/parakeet:latest-int8
    container_name: diction-parakeet
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - parakeet
    environment:
      DEFAULT_MODEL: parakeet-v3

Model weights are baked into the image - no download on first start. Or use the profile from this repo:

docker compose --profile parakeet up -d

Option B - large-v3-turbo (multilingual, 99 languages)

services:
  whisper-large-turbo:
    image: fedirz/faster-whisper-server:latest-cuda
    container_name: diction-whisper-large-turbo
    restart: unless-stopped
    volumes:
      - whisper-models:/root/.cache/huggingface
    environment:
      WHISPER__MODEL: deepdml/faster-whisper-large-v3-turbo-ct2
      WHISPER__INFERENCE_DEVICE: cuda
      WHISPER__COMPUTE_TYPE: float16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - whisper-large-turbo
    environment:
      DEFAULT_MODEL: large-v3-turbo

volumes:
  whisper-models:

First boot downloads ~1.6 GB of model weights into the volume. Subsequent starts are instant.


Already Have a Voice Server?

Keep it. Use CUSTOM_BACKEND_URL to put the Diction Gateway in front of your existing server for WebSocket streaming and end-to-end encryption:

services:
  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      CUSTOM_BACKEND_URL: http://your-existing-server:8000
      CUSTOM_BACKEND_MODEL: Systran/faster-whisper-small
Variable Description
CUSTOM_BACKEND_AUTH Authorization header forwarded to your backend, e.g. Bearer sk-xxx
CUSTOM_BACKEND_NEEDS_WAV Set to "true" if your backend only accepts WAV - the gateway converts with ffmpeg
CUSTOM_BACKEND_CANONICAL_ID HuggingFace-style ID advertised via /v1/models (default: CUSTOM_BACKEND_MODEL)

AI Cleanup (BYO LLM)

The gateway passes transcripts through any OpenAI-compatible LLM before returning them. You say "so um basically the meeting went well and uh they agreed to the timeline." The LLM returns "The meeting went well. They agreed to the timeline."

Enable the AI Companion toggle in the app. The gateway forwards the transcript to {LLM_BASE_URL}/chat/completions with your prompt, then returns the cleaned text. If the LLM fails, the raw transcript is returned - dictation never breaks.

Variable Required Description
LLM_BASE_URL Yes OpenAI-compatible endpoint, e.g. https://api.openai.com/v1
LLM_MODEL Yes Model identifier, e.g. gpt-4o-mini
LLM_API_KEY No Bearer token. Not needed for local Ollama.
LLM_PROMPT No System prompt string, or a file path starting with / (mount via volume)

Both LLM_BASE_URL and LLM_MODEL must be set or the feature stays off.

Option A - Cloud LLM (OpenAI, Groq, etc.)

echo "OPENAI_API_KEY=sk-your-key-here" > .env
  gateway:
    environment:
      DEFAULT_MODEL: small
      LLM_BASE_URL: "https://api.openai.com/v1"
      LLM_API_KEY: "${OPENAI_API_KEY}"
      LLM_MODEL: "gpt-4o-mini"
      LLM_PROMPT: "Clean up this voice transcription. Remove filler words (um, uh, like). Fix punctuation and capitalization. Return only the cleaned text, nothing else."

Docker Compose reads ${OPENAI_API_KEY} from .env automatically. Works with any OpenAI-compatible provider - Groq, Together, Fireworks, Mistral, OpenRouter - swap LLM_BASE_URL and LLM_MODEL.

Option B - Local Ollama (zero cost, fully private)

  ollama:
    image: ollama/ollama:latest
    container_name: diction-ollama
    restart: unless-stopped
    volumes:
      - ollama-models:/root/.ollama

  gateway:
    environment:
      DEFAULT_MODEL: small
      LLM_BASE_URL: "http://ollama:11434/v1"
      LLM_MODEL: "gemma2:9b"
      LLM_PROMPT: "Clean up this voice transcription. Remove filler words. Fix punctuation and capitalization. Return only the cleaned text, nothing else."

volumes:
  whisper-models:
  ollama-models:
docker compose up -d
docker exec diction-ollama ollama pull gemma2:9b
Model Memory Notes
gemma2:9b ~6 GB Best cleanup quality at this size
qwen2.5:7b ~5 GB Strong instruction following
llama3.1:8b ~5 GB Most popular, well-tested
gemma3:4b ~3 GB For tighter machines

Models under 7B tend to answer questions about the transcript instead of cleaning it up. 7B or larger recommended.

Testing cleanup

curl -X POST "http://localhost:8080/v1/audio/transcriptions?enhance=true" \
  -F "file=@test.aiff" \
  -F "model=small"
# Confirm LLM fired - look for X-Diction-LLM-Ms in the output
curl -sS -D - -o /dev/null \
  -X POST "http://localhost:8080/v1/audio/transcriptions?enhance=true" \
  -F "file=@test.aiff" -F "model=small" | grep -i diction

Prompt file

Mount a file and point LLM_PROMPT at the path:

  gateway:
    volumes:
      - ./cleanup-prompt.txt:/config/prompt.txt:ro
    environment:
      LLM_PROMPT: "/config/prompt.txt"

If LLM_PROMPT starts with /, the gateway reads it as a file. Otherwise it uses the string directly.


NixOS

The repo ships a flake with a hardened systemd module - no Docker needed.

nix run github:omachala/diction#diction-gateway

Enable as a service:

{
  inputs.diction.url = "github:omachala/diction";

  outputs = { nixpkgs, diction, ... }: {
    nixosConfigurations.your-host = nixpkgs.lib.nixosSystem {
      modules = [
        diction.nixosModules.default
        {
          services.diction-gateway = {
            enable = true;
            openFirewall = true;
            # customBackend.url = "http://127.0.0.1:8000";
            # llm.baseUrl = "http://127.0.0.1:11434/v1";
            # llm.model = "gemma2:9b";
            # environmentFile = "/run/secrets/diction-gateway.env";
          };
        }
      ];
    };
  };
}

The unit runs under DynamicUser with ProtectSystem=strict, NoNewPrivileges, and a narrow syscall filter. Use environmentFile for secrets - they don't end up in the world-readable Nix store. Full option list: nix/module.nix.


OpenAI API Compatibility

The gateway implements the OpenAI audio transcription API - any client that works against api.openai.com/v1/audio/transcriptions works against a Diction gateway.

from openai import OpenAI

client = OpenAI(
    base_url="http://your-server:8080/v1",
    api_key="anything",  # not checked when AUTH_ENABLED=false
)

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        file=f,
        model="small",            # or "Systran/faster-whisper-small"
        response_format="text",
    )
print(result)

Works with the Node SDK, LangChain, Flowise, n8n, or any tool that expects OpenAI's speech API.

Supported:

  • POST /v1/audio/transcriptions - file, model, language, prompt, response_format=json|text
  • GET /v1/models - returns an OpenAI-compatible data[] array plus a providers[] grouping consumed by the iOS app. Both HuggingFace IDs (Systran/faster-whisper-small, nvidia/parakeet-tdt-0.6b-v3) and short aliases (small, medium, large-v3-turbo, parakeet-v3) are accepted.
  • WebSocket /v1/audio/stream - used by the Diction app for low-latency streaming

Not supported:

  • TTS (/v1/audio/speech)
  • response_format=verbose_json|srt|vtt (no word-level timestamps)
  • SSE streaming on REST (use WebSocket /v1/audio/stream instead)
  • Model download/delete (POST/DELETE /v1/models/{id})
  • OpenAI Realtime API (/v1/realtime)

Authentication is off by default (AUTH_ENABLED=false). Pass any non-empty string as the API key from the client - the gateway doesn't check it. To lock down a public-facing deployment, set AUTH_ENABLED=true and configure tokens in the gateway env.

Error shape: errors return {"error":"<message>"}, not OpenAI's nested {"error":{"message":"...","type":"..."}}. Most SDKs surface these as HTTPError rather than APIError.


Privacy

  • On-device: Everything stays on your phone. No network connection is made.
  • Self-hosted: Audio goes to your server only. Neither the gateway nor faster-whisper-server persists audio - it's transcribed and discarded.
  • AI cleanup enabled: The transcript (plain text, no audio) goes to your configured LLM. If you use Ollama locally, nothing leaves your machine.
  • Diction One (cloud): Audio is transcribed and immediately discarded. Not stored, not used for training.
  • Zero third-party SDKs in the app. No analytics, no tracking, no telemetry.
  • Full Access is required by iOS for any keyboard that makes network requests. Diction has no QWERTY input - the only data that leaves the app is the audio recording, sent to the endpoint you configured.

Read the full Privacy Policy.


Diction One

On-device and self-hosted are completely free with no word limits.

If you don't want to run a server, Diction One gives you a fine-tuned cloud model with advanced audio filtering - without the setup. Audio is sent to the Diction endpoint, transcribed, and immediately discarded. Pricing and trial details are in the app.


Contributing

Contributions are welcome. See CONTRIBUTING.md.

License

MIT. See LICENSE.

About

iOS keyboard that transcribes speech to text in any app

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages