GitHub - omachala/diction: iOS keyboard that transcribes speech to text in any app

Speak to Type and Edit

Voice keyboard for iOS. Works in every app.
On-device, cloud, or self-hosted - no limits.

Website • Self-Hosting Guide • Privacy Policy

Why Diction?

Works in every app. Tap the mic, speak, watch text land in whatever app you're in - Telegram, Mail, Notes, the search bar, anywhere a keyboard appears.
Self-hosted in minutes. docker compose up -d and paste your server's IP. Your hardware, your models, your data.
Works with any Whisper-compatible server. The gateway speaks the OpenAI transcription API (POST /v1/audio/transcriptions). Point it at any endpoint that implements it.
On-device. Whisper runs locally on your iPhone via WhisperKit. No network, no server, nothing leaves the device.
AI transcript cleanup. Wire any OpenAI-compatible LLM - OpenAI, Groq, Ollama, Anthropic - into the gateway to strip filler words and fix punctuation before text reaches the app. BYO prompt.
End-to-end encrypted. AES-256-GCM with X25519 key exchange between the app and the gateway. Same primitives used by Signal and WireGuard.
Zero tracking in the app. No analytics, no telemetry, no data collection. Audit the source yourself.
Free and unlimited. On-device and self-hosted modes have no caps, no word limits, no expiry.

Self-Hosting

The Diction app streams audio over a WebSocket connection, so you need the Diction Gateway in front of whatever speech model you run. The gateway handles the WebSocket protocol, end-to-end encryption, optional LLM cleanup, and model routing.

Full walkthrough with screenshots: How to Set Up Diction - the self-hosted speech-to-text alternative to Wispr Flow

Requirements:

Any machine that can run Docker: Mac, Linux box, NUC, home server, VPS. Apple Silicon works (via Rosetta).
iPhone running iOS 17.0 or later.

Step 1 - Write the Compose File

Create a folder for the stack and save this as docker-compose.yml:

services:
  whisper-small:
    image: fedirz/faster-whisper-server:latest-cpu
    container_name: diction-whisper-small
    restart: unless-stopped
    volumes:
      - whisper-models:/root/.cache/huggingface
    environment:
      WHISPER__MODEL: Systran/faster-whisper-small
      WHISPER__INFERENCE_DEVICE: cpu

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - whisper-small
    environment:
      DEFAULT_MODEL: small

volumes:
  whisper-models:

The whisper-models volume persists the model weights (~500 MB for small) so they survive container rebuilds. DEFAULT_MODEL: small maps to the service named whisper-small - see Swap the Speech Model if you change the model.

Step 2 - Start the Stack

docker compose up -d

First run pulls the images and downloads model weights - give it 2–3 minutes.

docker compose logs -f          # watch progress
docker compose ps               # check status

Expected:

NAME                     STATUS
diction-gateway          Up 30 seconds
diction-whisper-small    Up 2 minutes (healthy)

Error	Fix
`pull access denied` on gateway image	`docker logout ghcr.io` and retry
`exec format error` on Apple Silicon	Enable Rosetta in Docker Desktop → Settings → General
`health: starting` for > 3 minutes	Model still downloading - `docker compose logs -f whisper-small`
Gateway exits immediately	Whisper container failed - check its logs

Step 3 - Test the Server

Generate a test audio file (macOS):

say -o test.aiff "Hello from my home server"

Or record a voice memo on your phone and AirDrop it over.

curl -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@test.aiff" \
  -F "model=small"

{"text":"Hello from my home server."}

# Check timing headers
curl -sS -D - -o /dev/null \
  -X POST http://localhost:8080/v1/audio/transcriptions \
  -F "file=@test.aiff" -F "model=small" | grep -i diction

X-Diction-Whisper-Ms shows the speech model's inference latency.

Response	Cause
Connection refused	Gateway not running - `docker compose ps`
504 Gateway Timeout	Whisper still loading - wait 60s
404 Not Found	URL typo - path must be exactly `/v1/audio/transcriptions`
OOM / container crash	Model too large for available RAM

Step 4 - Find Your Server's IP

macOS:

ipconfig getifaddr en0
# or
ifconfig | grep 'inet ' | grep -v 127.0.0.1

Linux:

hostname -I | awk '{print $1}'

Windows:

ipconfig | findstr IPv4

Pick the 192.168.x.x or 10.x.x.x address. Ignore anything starting with 100. - that's Tailscale.

Set a DHCP reservation in your router so the IP doesn't change on reboot. Or use Tailscale for a stable address that follows the machine anywhere.

Step 5 - Connect the App

Install Diction on your iPhone. On first launch:

Settings → General → Keyboard → Keyboards → Add New Keyboard → Diction
Tap Diction in the list → enable Allow Full Access
Grant microphone access when prompted

Point it at your server:

Open Diction → Preferences → Mode → Self-Hosted
Enter your endpoint: http://192.168.1.42:8080 (your IP from Step 4)
Tap Test connection - you should get a green check within a second

To dictate: open any app, tap a text field, long-press the globe icon (bottom-left of the iOS keyboard), pick Diction, tap the mic, speak, release.

Reach From Anywhere

Tailscale (recommended)

Tailscale creates a private WireGuard mesh between your devices. Install it on the server and iPhone, sign in to the same account, and use the 100.x.x.x Tailscale IP as your Diction endpoint. Works on cellular, café WiFi, anywhere. Free for personal use.

Cloudflare Tunnel (public URL, no port forwarding)

Add to your compose file:

  cloudflared:
    image: cloudflare/cloudflared:latest
    container_name: diction-cloudflared
    restart: unless-stopped
    command: tunnel --no-autoupdate run
    environment:
      TUNNEL_TOKEN: "${CLOUDFLARE_TUNNEL_TOKEN}"

Create a tunnel in the Cloudflare Zero Trust dashboard, grab the token, add it to .env, route the public hostname to http://gateway:8080. Free tier. Note: transcripts pass through Cloudflare's network (HTTPS-encrypted, but a third party is in the path).

ngrok (quick testing)

ngrok http 8080

Free tier URLs change on restart - good for a demo, not daily use.

Swap the Speech Model

Change two lines in your compose file:

`DEFAULT_MODEL`	Service name	`WHISPER__MODEL`	RAM	Notes
`small`	`whisper-small`	`Systran/faster-whisper-small`	~850 MB	Best for CPU
`medium`	`whisper-medium`	`Systran/faster-whisper-medium`	~2.1 GB	More accurate, slower on CPU
`large-v3-turbo`	`whisper-large-turbo`	`deepdml/faster-whisper-large-v3-turbo-ct2`	~2.3 GB	Best with NVIDIA GPU
`parakeet-v3`	`parakeet`	- (baked into image)	~2 GB	NVIDIA GPU, 25 European languages

Both DEFAULT_MODEL and the service name must match the table - the gateway resolves backends by Docker hostname. A mismatch returns 404 on every request.

docker compose up -d   # recreates only the changed container

NVIDIA GPU

Install the NVIDIA Container Toolkit on the host first.

Option A - Parakeet TDT 0.6B v3 (fastest, 25 European languages)

Parakeet transcribes a 5-second clip in well under a second on a consumer GPU.

	Whisper Large-v3	Parakeet TDT 0.6B v3
WER (English)	7.4%	~6.3%
Latency (GPU)	Under 2s	Sub-second
VRAM (INT8)	~2.3 GB	~2 GB
Languages	99	25 European

Supported languages: English, Bulgarian, Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, Ukrainian.

For languages outside this list, use Option B.

services:
  parakeet:
    image: ghcr.io/achetronic/parakeet:latest-int8
    container_name: diction-parakeet
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - parakeet
    environment:
      DEFAULT_MODEL: parakeet-v3

Model weights are baked into the image - no download on first start. Or use the profile from this repo:

docker compose --profile parakeet up -d

Option B - large-v3-turbo (multilingual, 99 languages)

services:
  whisper-large-turbo:
    image: fedirz/faster-whisper-server:latest-cuda
    container_name: diction-whisper-large-turbo
    restart: unless-stopped
    volumes:
      - whisper-models:/root/.cache/huggingface
    environment:
      WHISPER__MODEL: deepdml/faster-whisper-large-v3-turbo-ct2
      WHISPER__INFERENCE_DEVICE: cuda
      WHISPER__COMPUTE_TYPE: float16
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    depends_on:
      - whisper-large-turbo
    environment:
      DEFAULT_MODEL: large-v3-turbo

volumes:
  whisper-models:

First boot downloads ~1.6 GB of model weights into the volume. Subsequent starts are instant.

Already Have a Voice Server?

Keep it. Use CUSTOM_BACKEND_URL to put the Diction Gateway in front of your existing server for WebSocket streaming and end-to-end encryption:

services:
  gateway:
    image: ghcr.io/omachala/diction-gateway:latest
    platform: linux/amd64
    container_name: diction-gateway
    restart: unless-stopped
    ports:
      - "8080:8080"
    environment:
      CUSTOM_BACKEND_URL: http://your-existing-server:8000
      CUSTOM_BACKEND_MODEL: Systran/faster-whisper-small

Variable	Description
`CUSTOM_BACKEND_AUTH`	Authorization header forwarded to your backend, e.g. `Bearer sk-xxx`
`CUSTOM_BACKEND_NEEDS_WAV`	Set to `"true"` if your backend only accepts WAV - the gateway converts with ffmpeg
`CUSTOM_BACKEND_CANONICAL_ID`	HuggingFace-style ID advertised via `/v1/models` (default: `CUSTOM_BACKEND_MODEL`)

AI Cleanup (BYO LLM)

The gateway passes transcripts through any OpenAI-compatible LLM before returning them. You say "so um basically the meeting went well and uh they agreed to the timeline." The LLM returns "The meeting went well. They agreed to the timeline."

Enable the AI Companion toggle in the app. The gateway forwards the transcript to {LLM_BASE_URL}/chat/completions with your prompt, then returns the cleaned text. If the LLM fails, the raw transcript is returned - dictation never breaks.

Variable	Required	Description
`LLM_BASE_URL`	Yes	OpenAI-compatible endpoint, e.g. `https://api.openai.com/v1`
`LLM_MODEL`	Yes	Model identifier, e.g. `gpt-4o-mini`
`LLM_API_KEY`	No	Bearer token. Not needed for local Ollama.
`LLM_PROMPT`	No	System prompt string, or a file path starting with `/` (mount via volume)

Both LLM_BASE_URL and LLM_MODEL must be set or the feature stays off.

Option A - Cloud LLM (OpenAI, Groq, etc.)

echo "OPENAI_API_KEY=sk-your-key-here" > .env

  gateway:
    environment:
      DEFAULT_MODEL: small
      LLM_BASE_URL: "https://api.openai.com/v1"
      LLM_API_KEY: "${OPENAI_API_KEY}"
      LLM_MODEL: "gpt-4o-mini"
      LLM_PROMPT: "Clean up this voice transcription. Remove filler words (um, uh, like). Fix punctuation and capitalization. Return only the cleaned text, nothing else."

Docker Compose reads ${OPENAI_API_KEY} from .env automatically. Works with any OpenAI-compatible provider - Groq, Together, Fireworks, Mistral, OpenRouter - swap LLM_BASE_URL and LLM_MODEL.

Option B - Local Ollama (zero cost, fully private)

  ollama:
    image: ollama/ollama:latest
    container_name: diction-ollama
    restart: unless-stopped
    volumes:
      - ollama-models:/root/.ollama

  gateway:
    environment:
      DEFAULT_MODEL: small
      LLM_BASE_URL: "http://ollama:11434/v1"
      LLM_MODEL: "gemma2:9b"
      LLM_PROMPT: "Clean up this voice transcription. Remove filler words. Fix punctuation and capitalization. Return only the cleaned text, nothing else."

volumes:
  whisper-models:
  ollama-models:

docker compose up -d
docker exec diction-ollama ollama pull gemma2:9b

Model	Memory	Notes
`gemma2:9b`	~6 GB	Best cleanup quality at this size
`qwen2.5:7b`	~5 GB	Strong instruction following
`llama3.1:8b`	~5 GB	Most popular, well-tested
`gemma3:4b`	~3 GB	For tighter machines

Models under 7B tend to answer questions about the transcript instead of cleaning it up. 7B or larger recommended.

Testing cleanup

curl -X POST "http://localhost:8080/v1/audio/transcriptions?enhance=true" \
  -F "file=@test.aiff" \
  -F "model=small"

# Confirm LLM fired - look for X-Diction-LLM-Ms in the output
curl -sS -D - -o /dev/null \
  -X POST "http://localhost:8080/v1/audio/transcriptions?enhance=true" \
  -F "file=@test.aiff" -F "model=small" | grep -i diction

Prompt file

Mount a file and point LLM_PROMPT at the path:

  gateway:
    volumes:
      - ./cleanup-prompt.txt:/config/prompt.txt:ro
    environment:
      LLM_PROMPT: "/config/prompt.txt"

If LLM_PROMPT starts with /, the gateway reads it as a file. Otherwise it uses the string directly.

NixOS

The repo ships a flake with a hardened systemd module - no Docker needed.

nix run github:omachala/diction#diction-gateway

Enable as a service:

{
  inputs.diction.url = "github:omachala/diction";

  outputs = { nixpkgs, diction, ... }: {
    nixosConfigurations.your-host = nixpkgs.lib.nixosSystem {
      modules = [
        diction.nixosModules.default
        {
          services.diction-gateway = {
            enable = true;
            openFirewall = true;
            # customBackend.url = "http://127.0.0.1:8000";
            # llm.baseUrl = "http://127.0.0.1:11434/v1";
            # llm.model = "gemma2:9b";
            # environmentFile = "/run/secrets/diction-gateway.env";
          };
        }
      ];
    };
  };
}

The unit runs under DynamicUser with ProtectSystem=strict, NoNewPrivileges, and a narrow syscall filter. Use environmentFile for secrets - they don't end up in the world-readable Nix store. Full option list: nix/module.nix.

OpenAI API Compatibility

The gateway implements the OpenAI audio transcription API - any client that works against api.openai.com/v1/audio/transcriptions works against a Diction gateway.

from openai import OpenAI

client = OpenAI(
    base_url="http://your-server:8080/v1",
    api_key="anything",  # not checked when AUTH_ENABLED=false
)

with open("audio.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        file=f,
        model="small",            # or "Systran/faster-whisper-small"
        response_format="text",
    )
print(result)

Works with the Node SDK, LangChain, Flowise, n8n, or any tool that expects OpenAI's speech API.

Supported:

POST /v1/audio/transcriptions - file, model, language, prompt, response_format=json|text
GET /v1/models - returns an OpenAI-compatible data[] array plus a providers[] grouping consumed by the iOS app. Both HuggingFace IDs (Systran/faster-whisper-small, nvidia/parakeet-tdt-0.6b-v3) and short aliases (small, medium, large-v3-turbo, parakeet-v3) are accepted.
WebSocket /v1/audio/stream - used by the Diction app for low-latency streaming

Not supported:

TTS (/v1/audio/speech)
response_format=verbose_json|srt|vtt (no word-level timestamps)
SSE streaming on REST (use WebSocket /v1/audio/stream instead)
Model download/delete (POST/DELETE /v1/models/{id})
OpenAI Realtime API (/v1/realtime)

Authentication is off by default (AUTH_ENABLED=false). Pass any non-empty string as the API key from the client - the gateway doesn't check it. To lock down a public-facing deployment, set AUTH_ENABLED=true and configure tokens in the gateway env.

Error shape: errors return {"error":"<message>"}, not OpenAI's nested {"error":{"message":"...","type":"..."}}. Most SDKs surface these as HTTPError rather than APIError.

Privacy

On-device: Everything stays on your phone. No network connection is made.
Self-hosted: Audio goes to your server only. Neither the gateway nor faster-whisper-server persists audio - it's transcribed and discarded.
AI cleanup enabled: The transcript (plain text, no audio) goes to your configured LLM. If you use Ollama locally, nothing leaves your machine.
Diction One (cloud): Audio is transcribed and immediately discarded. Not stored, not used for training.
Zero third-party SDKs in the app. No analytics, no tracking, no telemetry.
Full Access is required by iOS for any keyboard that makes network requests. Diction has no QWERTY input - the only data that leaves the app is the audio recording, sent to the endpoint you configured.

Read the full Privacy Policy.

Diction One

On-device and self-hosted are completely free with no word limits.

If you don't want to run a server, Diction One gives you a fine-tuned cloud model with advanced audio filtering - without the setup. Audio is sent to the Diction endpoint, transcribed, and immediately discarded. Pricing and trial details are in the app.

Contributing

Contributions are welcome. See CONTRIBUTING.md.

License

MIT. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 164 Commits
.githooks		.githooks
.github		.github
assets		assets
docs		docs
gateway		gateway
nix		nix
.env.example		.env.example
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
docker-compose.yml		docker-compose.yml
flake.lock		flake.lock
flake.nix		flake.nix

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Why Diction?

Self-Hosting

Step 1 - Write the Compose File

Step 2 - Start the Stack

Step 3 - Test the Server

Step 4 - Find Your Server's IP

Step 5 - Connect the App

Reach From Anywhere

Swap the Speech Model

NVIDIA GPU

Option A - Parakeet TDT 0.6B v3 (fastest, 25 European languages)

Option B - large-v3-turbo (multilingual, 99 languages)

Already Have a Voice Server?

AI Cleanup (BYO LLM)

Option A - Cloud LLM (OpenAI, Groq, etc.)

Option B - Local Ollama (zero cost, fully private)

Testing cleanup

Prompt file

NixOS

OpenAI API Compatibility

Privacy

Diction One

Contributing

License

About

Uh oh!

Releases 12

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Why Diction?

Self-Hosting

Step 1 - Write the Compose File

Step 2 - Start the Stack

Step 3 - Test the Server

Step 4 - Find Your Server's IP

Step 5 - Connect the App

Reach From Anywhere

Swap the Speech Model

NVIDIA GPU

Option A - Parakeet TDT 0.6B v3 (fastest, 25 European languages)

Option B - large-v3-turbo (multilingual, 99 languages)

Already Have a Voice Server?

AI Cleanup (BYO LLM)

Option A - Cloud LLM (OpenAI, Groq, etc.)

Option B - Local Ollama (zero cost, fully private)

Testing cleanup

Prompt file

NixOS

OpenAI API Compatibility

Privacy

Diction One

Contributing

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 12

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages