AI-powered speech-to-text workflows for desktop and web.
Live dictation, YouTube transcription, file transcription, transcript management, summaries, and export.
Status • Features • Quick Start • Usage • Architecture • API • Configuration • Development • Troubleshooting
Last verified: 2026-06-01
Scriber is a local-first transcription app with a Python backend, a React web UI, and a legacy Tkinter fallback UI. The current primary runtime is Windows with tray integration, global hotkeys, microphone device monitoring, and local SQLite persistence.
Current implementation highlights:
- Live microphone transcription with WebSocket status/audio/transcript events.
- YouTube and file transcription with persistent jobs, retry scheduling, and resume support.
- Multi-provider STT support, including cloud providers and local ONNX/NeMo paths.
- SQLite transcript storage with WAL mode, metadata list loading, pagination, and FTS5 search.
- DeviceMonitor for microphone hotplug handling with native Windows endpoint events where available and polling fallback.
- Recording-aware PortAudio refresh: device refreshes are deferred while a recording stream is active and run once after the stream becomes idle.
- Short-lived microphone device-resolution cache for selected/favorite mic lookup.
- Route-level frontend lazy loading for non-default pages and a single shared WebSocket connection.
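The recording-aware refresh in the list above can be pictured as a small coalescing gate. This is a minimal sketch, not the project's `DeviceMonitor` code: the class name `DeferredRefresher` and its methods are hypothetical, and the `refresh` callable stands in for whatever actually re-initializes PortAudio.

```python
import threading


class DeferredRefresher:
    """Hypothetical helper: postpone a device refresh while a stream is active."""

    def __init__(self, refresh):
        self._refresh = refresh        # callable that performs the real refresh
        self._lock = threading.Lock()
        self._recording = False
        self._pending = False

    def recording_started(self):
        with self._lock:
            self._recording = True

    def request_refresh(self):
        """Refresh now, or remember the request if a recording is active."""
        with self._lock:
            if self._recording:
                self._pending = True   # many requests collapse into one
                return False
        self._refresh()
        return True

    def recording_stopped(self):
        """Mark the stream idle and run at most one deferred refresh."""
        with self._lock:
            self._recording = False
            run_now, self._pending = self._pending, False
        if run_now:
            self._refresh()
```

Several hotplug events during one recording result in a single refresh after the stream becomes idle, which matches the "run once" behavior described above.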
Known limits:
- SCRIBER_MIC_ALWAYS_ON exists as a setting, but it is not a true app-level always-on/prewarmed microphone stream yet. Per-session streams are closed during cleanup to avoid orphaned PortAudio resources.
- Frontend transcript-list virtualization/infinite loading is still open.
- Vite production build can still warn about an initial chunk over 500 kB; manual vendor chunking is still open.
- Some upload preprocessing and export generation still run synchronously in async request paths.
- Very long live sessions can still hit O(n^2)-style string growth when appending final transcript chunks.
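One common way to avoid the quadratic growth mentioned in the last item is to keep final chunks in a list and join them only when the full text is needed. The sketch below is illustrative; `TranscriptBuffer` is a hypothetical name and not a class from the Scriber codebase.

```python
class TranscriptBuffer:
    """Hypothetical accumulator for final transcript chunks."""

    def __init__(self):
        self._chunks = []

    def append_final(self, text):
        # Appending to a list is amortized O(1) and copies no earlier text.
        text = text.strip()
        if text:
            self._chunks.append(text)

    def text(self):
        # A single join is linear in the total transcript length.
        return " ".join(self._chunks)
```

The trade-off is that callers reading the full text after every chunk still pay for a join each time, so this helps most when the complete string is only needed at session end or at a throttled interval.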
- Global hotkey, default Ctrl+Alt+S.
- Modes:
  - toggle: press once to start, press again to stop.
  - push_to_talk: record while the hotkey is held.
- Live WebSocket events for state, status, audio level, warnings, transcripts, session lifecycle, history updates, and errors.
- Favorite microphone selection with fallback to selected/default device.
- Device hotplug detection via DeviceMonitor.
- Low input-level warning flow for muted/quiet microphones.
- Recording overlay with preparing/recording/transcribing states.
- Text injection into the active app through auto, sendinput, paste, or type.
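The two hotkey modes listed above differ only in how press and release events map to the recording state. A minimal sketch follows, assuming the hotkey layer delivers `"press"` and `"release"` events; the function name and event strings are hypothetical.

```python
def next_recording_state(mode, event, recording):
    """Return the new recording flag for a hotkey event.

    mode:      "toggle" or "push_to_talk"
    event:     "press" or "release" (hypothetical event names)
    recording: current recording flag
    """
    if event not in ("press", "release"):
        raise ValueError(f"unknown event: {event}")
    if mode == "toggle":
        # Only presses matter; each press flips the state.
        return (not recording) if event == "press" else recording
    if mode == "push_to_talk":
        # Record exactly while the key is held down.
        return event == "press"
    raise ValueError(f"unknown mode: {mode}")
```

In toggle mode, key-repeat presses would flip the state repeatedly, which is one reason a dispatch debounce such as SCRIBER_HOTKEY_DISPATCH_DEBOUNCE_SEC is useful.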
- YouTube search and video lookup through the YouTube Data API.
- Download and audio extraction through yt-dlp and ffmpeg.
- Persistent job lifecycle with retry/resume support.
- Transcript entries are saved as youtube records.
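Job retries are bounded by SCRIBER_JOB_MAX_ATTEMPTS, SCRIBER_JOB_RETRY_BASE_SEC, and SCRIBER_JOB_RETRY_MAX_SEC (defaults 3, 5, and 120). The sketch below shows how such settings are typically combined, assuming exponential doubling without jitter. That growth curve is an assumption; the real retry scheduler may use a different schedule.

```python
def retry_delays(max_attempts=3, base_sec=5.0, max_sec=120.0):
    """Delays (seconds) before each retry; the first attempt has no delay.

    Assumes doubling backoff capped at max_sec (not confirmed by the source).
    """
    return [min(max_sec, base_sec * 2 ** i) for i in range(max_attempts - 1)]
```

With the defaults, a job would be retried after 5 and 10 seconds before it is marked failed. With more attempts, the delay stops growing once it reaches the 120-second cap.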
- Multipart upload through POST /api/file/transcribe.
- Supported audio formats: .mp3, .wav, .m4a, .flac, .aac, .ogg.
- Supported video formats: .mp4, .mov, .webm, .avi, .mkv, .m4v.
- Video audio extraction through ffmpeg.
- Default audio upload limit: 200 MB.
- Raw video upload hard limit: 2048 MB.
- Extracted/compressed audio is limited by the final audio/provider limit.
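The type and size rules above can be checked before an upload is attempted. This is a client-side sketch under stated assumptions: `check_upload` is a hypothetical helper, and "MB" is interpreted as 1024 * 1024 bytes, which the backend may or may not do.

```python
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".aac", ".ogg"}
VIDEO_EXTS = {".mp4", ".mov", ".webm", ".avi", ".mkv", ".m4v"}
MB = 1024 * 1024            # assumption: limits are counted in MiB
AUDIO_LIMIT = 200 * MB      # default audio upload limit
VIDEO_LIMIT = 2048 * MB     # raw video hard limit


def check_upload(filename, size_bytes):
    """Return "audio" or "video", or raise ValueError if the file is rejected."""
    ext = Path(filename).suffix.lower()
    if ext in AUDIO_EXTS:
        kind, limit = "audio", AUDIO_LIMIT
    elif ext in VIDEO_EXTS:
        kind, limit = "video", VIDEO_LIMIT
    else:
        raise ValueError(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > limit:
        raise ValueError(f"{kind} file exceeds {limit // MB} MB")
    return kind
```

A video that passes this check can still be rejected later, because the extracted audio is subject to the final audio/provider limit.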
Provider coverage includes:
- Soniox realtime and async
- Mistral realtime and async
- AssemblyAI Universal-3-Pro async
- Deepgram
- OpenAI
- Azure Speech
- Azure MAI Transcribe
- Gladia
- Groq
- Speechmatics
- ElevenLabs
- AWS Transcribe
- Smallest
- ONNX local models
- NeMo local models
Provider routing, retry scheduling, and circuit-breaker logic exist in the backend. Verify provider-specific behavior in code before changing a provider contract.
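For readers unfamiliar with the pattern, a circuit breaker can be sketched in a few lines. This is a generic illustration, not the implementation under `src/core/`: the class and method names are hypothetical, and the defaults mirror SCRIBER_BREAKER_FAILURE_THRESHOLD=3 and SCRIBER_BREAKER_COOLDOWN_SEC=30.

```python
import time


class CircuitBreaker:
    """Generic consecutive-failure breaker (illustrative only)."""

    def __init__(self, failure_threshold=3, cooldown_sec=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_sec
        self._clock = clock            # injectable for tests
        self._failures = 0
        self._opened_at = None         # None means the circuit is closed

    def allow(self):
        """True if a call to the provider may be attempted."""
        if self._opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open state).
        return self._clock() - self._opened_at >= self._cooldown

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()   # open, or re-open after a failed trial
```

A router would typically skip a provider whose breaker returns `False` from `allow()` and move on to the next fallback.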
- SQLite persistence in transcripts.db.
- Transcript list pagination with offset/limit.
- Type filtering by mic, youtube, or file.
- FTS5-backed search.
- Detail view with full content and summary.
- Delete, cancel, summarize, export.
- Export as PDF or DOCX.
- Optional automatic summarization after job completion.
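Offset/limit pagination with a type filter maps directly onto a parameterized SQLite query. The sketch below uses an in-memory database and a hypothetical three-column `transcripts` table; the real schema in `src/database.py` is not shown in this document and will differ.

```python
import sqlite3


def list_transcripts(conn, offset=0, limit=50, type_filter=None):
    """Return one page of (id, type, title) rows, newest first."""
    sql = "SELECT id, type, title FROM transcripts"
    params = []
    if type_filter is not None:
        sql += " WHERE type = ?"          # bound parameter, never string-formatted
        params.append(type_filter)
    sql += " ORDER BY id DESC LIMIT ? OFFSET ?"
    params += [limit, offset]
    return conn.execute(sql, params).fetchall()


# Hypothetical demo data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transcripts (id INTEGER PRIMARY KEY, type TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO transcripts (type, title) VALUES (?, ?)",
    [("mic", "a"), ("youtube", "b"), ("file", "c"), ("mic", "d")],
)
```

Calling `list_transcripts(conn, limit=2)` returns the two newest rows, and passing `type_filter="mic"` restricts the page to live-mic entries.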
- ONNX model list, download, status, delete.
- Quantization options: int8, fp16, fp32.
- Optional ONNX GPU flag.
- NeMo model list, download, delete.
- Python 3.10+
- Node.js 20+ for the web UI
- ffmpeg available on PATH for YouTube/file audio extraction
- Windows recommended for tray, global hotkey, overlay, and microphone device monitoring
git clone https://github.com/MyButtermilk/Scriber.git
cd Scriber
start.bat
start.bat handles:
- Python check
- virtual environment setup
- dependency installation when needed
- initial .env creation if missing
- tray/web startup when Node and Frontend/ are available
- Tkinter fallback when the web UI cannot be started
- backend health check at http://127.0.0.1:8765/api/health
- browser open at http://localhost:5000
./start.sh
The shell script sets up dependencies and starts the Tkinter path. The full tray/hotkey/device-monitor experience is Windows-focused.
# Backend only
python -m src.web_api

# Frontend client only
cd Frontend
npm install
npm run dev:client

# Frontend Express/Vite dev host
cd Frontend
npm run dev

# Frontend production build and start
cd Frontend
npm run build
npm start

Default URLs:
- Backend: http://127.0.0.1:8765
- Web UI: http://localhost:5000
- WebSocket: ws://127.0.0.1:8765/ws
Additional entrypoints:
python -m src.tray
python -m src.main
- /: Live Mic
- /youtube: YouTube transcription
- /file: File transcription
- /transcript/:id: Transcript detail
- /settings: Settings
- Select the STT provider and microphone in Settings.
- Optional: set a favorite microphone. It is preferred when available.
- Start from the UI or with the configured hotkey.
- Wait for the overlay/state to switch from preparing to recording before speaking.
- Stop recording through UI or hotkey.
- The final transcript is saved as a mic entry and can be summarized/exported.
Important microphone behavior:
- DeviceMonitor keeps the frontend microphone list updated after USB/dock changes.
- PortAudio refresh is deferred during active recording to avoid native races.
- Mic selection is cached briefly to avoid repeated device scans on consecutive starts.
- SCRIBER_MIC_ALWAYS_ON=1 does not yet keep a reusable app-level mic stream alive.
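The brief mic-selection cache described above behaves like a time-to-live cache with explicit invalidation. The sketch below assumes the 10-second default from SCRIBER_MIC_DEVICE_CACHE_TTL_SEC; the class name, method names, and `resolve` callable are hypothetical and do not reflect the code in `src/pipeline.py`.

```python
import time


class DeviceResolutionCache:
    """Hypothetical TTL cache for resolved microphone devices."""

    def __init__(self, ttl_sec=10.0, clock=time.monotonic):
        self._ttl = ttl_sec
        self._clock = clock            # injectable for tests
        self._entries = {}             # key -> (stored_at, value)

    def get(self, key, resolve):
        """Return the cached value for key, calling resolve(key) when stale."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = resolve(key)           # e.g. an expensive device scan
        self._entries[key] = (now, value)
        return value

    def invalidate(self):
        """Drop everything, e.g. after a settings change or hotplug event."""
        self._entries.clear()
```

Two consecutive recording starts within the TTL trigger only one device scan, while a hotplug event forces a fresh lookup on the next start.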
- Set YOUTUBE_API_KEY.
- Search or paste a video URL/ID.
- Start transcription.
- Track job progress in the UI and transcript history.
- Open /file.
- Drop or select an audio/video file.
- The backend validates size/type, extracts audio for videos, and starts a transcription job.
- Results appear in transcript history.
The backend settings API manages:
- hotkey and recording mode
- STT provider and provider-specific models
- language
- microphone and favorite microphone
- injection method
- API keys
- ONNX/NeMo local models
- summarization model, prompt, and auto-summary setting
- visualizer bar count
AWS credentials are not fully managed through apiKeys; use the standard AWS environment variables.
flowchart LR
User["Browser / Hotkey / Tray"] -->|"HTTP + WebSocket"| Backend["Python Backend\nsrc.web_api"]
Backend --> Controller["ScriberWebController"]
Controller --> Pipeline["ScriberPipeline\nProviderRouter"]
Controller --> DB[("SQLite\ntranscripts.db")]
Controller --> Jobs["JobStore\nRetryScheduler"]
Controller --> Monitor["DeviceMonitor\nMic Resolution Cache"]
Pipeline --> Providers["STT Providers\nCloud + Local"]
Pipeline --> Mic["MicrophoneInput\nsounddevice"]
Backend <--> Frontend["React UI\nFrontend/client"]
- Live Mic:
  - POST /api/live-mic/start|stop|toggle
  - microphone stream
  - Pipecat/STT pipeline
  - WebSocket events
  - transcript persistence
  - optional text injection
- YouTube:
  - YouTube Data API lookup
  - yt-dlp download
  - ffmpeg audio extraction
  - STT pipeline/direct provider path
  - job persistence and retry/resume
- File:
  - multipart upload
  - size/type validation
  - optional ffmpeg extraction/compression
  - STT pipeline/direct provider path
  - transcript persistence
- Frontend:
  - REST for commands and data
  - single shared WebSocket for live events
  - React Query for server state
- src/web_api.py: REST, WebSocket, settings, jobs, transcript API.
- src/pipeline.py: provider creation, STT pipeline, analyzer cache, mic resolution.
- src/microphone.py: sounddevice transport and audio callback.
- src/audio_devices.py: deduplication, host API priority, compatibility.
- src/device_monitor.py: hotplug detection and PortAudio refresh.
- src/database.py: SQLite persistence and FTS.
- src/runtime/: provider router and retry scheduler.
- src/core/: state machine, circuit breaker, error taxonomy, event contracts, tracing.
- Vite 7 + React 19 + TypeScript.
- Wouter routing.
- TanStack Query for API data.
- Single WebSocketProvider.
- LiveMic is eagerly loaded for the default route.
- YouTube, File, Settings, TranscriptDetail, and NotFound are lazy-loaded chunks.
- Tailwind v4 CSS-first setup through Frontend/client/src/index.css.
- Radix/shadcn-style primitives and existing neumorphic classes.
- GET /api/health
- GET /api/state
- GET /api/metrics/hot-path?limit=n
limit for hot-path metrics is clamped to 1..500.
GET /ws
Core event types:
- state
- status
- transcript
- audio_level
- input_warning
- transcribing
- session_started
- session_finished
- history_updated
- error
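A client usually routes these events by type. The sketch below assumes each WebSocket message is a JSON object with a `type` field holding one of the names above. That envelope shape is an assumption, so check the event contracts under `src/core/` before relying on it. `EventDispatcher` is a hypothetical name.

```python
import json

EVENT_TYPES = {
    "state", "status", "transcript", "audio_level", "input_warning",
    "transcribing", "session_started", "session_finished",
    "history_updated", "error",
}


class EventDispatcher:
    """Hypothetical router from event type to handler callables."""

    def __init__(self):
        self._handlers = {}

    def on(self, event_type, handler):
        if event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, raw):
        """Parse one raw JSON message and call matching handlers.

        Returns the number of handlers invoked; unknown types are ignored.
        """
        message = json.loads(raw)
        if not isinstance(message, dict):
            return 0
        handlers = self._handlers.get(message.get("type"), [])
        for handler in handlers:
            handler(message)
        return len(handlers)
```

Ignoring unknown types keeps an older client working when the backend adds new events.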
- POST /api/live-mic/start
- POST /api/live-mic/stop
- POST /api/live-mic/toggle
- GET /api/transcripts?offset=0&limit=50&type={mic|youtube|file}&q={query}
- GET /api/transcripts/{id}
- DELETE /api/transcripts/{id}
- POST /api/transcripts/{id}/summarize
- POST /api/transcripts/{id}/cancel
- GET /api/transcripts/{id}/export/{format}
limit defaults to 50 and is clamped to 1..100. Export format is pdf or docx.
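A client can build the list URL with the same bounds the server applies. This is a sketch: `transcripts_url` is a hypothetical helper, the base URL is the documented default, and clamping on the client is optional because the server clamps anyway.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:8765"   # documented default backend address


def transcripts_url(offset=0, limit=50, type_filter=None, query=None):
    """Build a GET /api/transcripts URL with limit clamped to 1..100."""
    params = {
        "offset": max(0, int(offset)),
        "limit": max(1, min(100, int(limit))),
    }
    if type_filter is not None:
        if type_filter not in ("mic", "youtube", "file"):
            raise ValueError(f"unknown type: {type_filter}")
        params["type"] = type_filter
    if query:
        params["q"] = query          # urlencode handles escaping
    return f"{BASE_URL}/api/transcripts?{urlencode(params)}"
```

For example, `transcripts_url(limit=500, type_filter="mic", query="team sync")` yields `http://127.0.0.1:8765/api/transcripts?offset=0&limit=100&type=mic&q=team+sync`.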
- GET /api/youtube/search?q={query}&maxResults={n}&pageToken={token}
- GET /api/youtube/video?id={id}
- GET /api/youtube/video?url={url}
- POST /api/youtube/transcribe
POST /api/file/transcribe
Expected body: multipart/form-data with field file.
- GET /api/settings
- PUT /api/settings
- GET /api/microphones
- GET /api/autostart
- POST /api/autostart
- GET /api/onnx/models
- GET /api/onnx/models/{model_id}
- POST /api/onnx/download
- DELETE /api/onnx/models/{model_id}
- GET /api/nemo/models
- POST /api/nemo/download
- DELETE /api/nemo/models/{model_id}
ONNX model status/delete can use an optional quantization query parameter.
Configuration is loaded from environment variables and .env. Multi-line summarization prompt state can also be stored in settings.json.
Do not commit .env, settings.json, transcripts.db, downloads/, or generated local artifacts.
SCRIBER_WEB_HOST=127.0.0.1
SCRIBER_WEB_PORT=8765
SCRIBER_ALLOWED_ORIGINS=
Default CORS allows localhost, 127.0.0.1, and ::1. SCRIBER_ALLOWED_ORIGINS=* allows all origins.
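The origin rule can be sketched as follows. Two details are assumptions not confirmed here: that SCRIBER_ALLOWED_ORIGINS is a comma-separated list, and that the loopback defaults remain allowed when it is set. The function name matches the idea behind the `origin_allowed` tests but is not copied from the backend.

```python
from urllib.parse import urlsplit

DEFAULT_HOSTS = {"localhost", "127.0.0.1", "::1"}


def origin_allowed(origin, allowed_origins=""):
    """Decide whether a browser Origin header value is acceptable."""
    # Assumption: the setting is a comma-separated list of exact origins.
    configured = [o.strip() for o in allowed_origins.split(",") if o.strip()]
    if "*" in configured or origin in configured:
        return True
    # Assumption: loopback hosts stay allowed on any port and scheme.
    return urlsplit(origin).hostname in DEFAULT_HOSTS
```

`urlsplit` strips the brackets from IPv6 literals, so `http://[::1]:5000` resolves to the host `::1`.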
VITE_BACKEND_URL=http://127.0.0.1:8765
PORT=5000
SCRIBER_HOTKEY=ctrl+alt+s
SCRIBER_MODE=toggle
SCRIBER_DEFAULT_STT=soniox
SCRIBER_STT_FALLBACKS=
SCRIBER_LANGUAGE=auto
SCRIBER_DEBUG=0
SCRIBER_CUSTOM_VOCAB=
SCRIBER_SONIOX_MODE=realtime
SCRIBER_SONIOX_ASYNC_MODEL=stt-async-v4
SCRIBER_SONIOX_RT_MODEL=stt-rt-v4
SCRIBER_MISTRAL_RT_MODEL=voxtral-mini-transcribe-realtime-2602
SCRIBER_MISTRAL_ASYNC_MODEL=voxtral-mini-2602
SCRIBER_OPENAI_STT_MODEL=gpt-4o-mini-transcribe-2025-12-15
SCRIBER_AZURE_MAI_REGION=northeurope
SCRIBER_MIC_DEVICE=default
SCRIBER_FAVORITE_MIC=
SCRIBER_MIC_ALWAYS_ON=0
SCRIBER_MIC_BLOCK_SIZE=512
SCRIBER_MIC_DEVICE_CACHE_TTL_SEC=10.0
SCRIBER_MIC_LOW_RMS_THRESHOLD=0.001
SCRIBER_MIC_LOW_RMS_CLEAR_THRESHOLD=0.0025
SCRIBER_MIC_LOW_RMS_WARN_AFTER_SECS=6.0
SCRIBER_INJECT_METHOD=auto
SCRIBER_PASTE_PRE_DELAY_MS=80
SCRIBER_PASTE_RESTORE_DELAY_MS=1500
SCRIBER_MIC_ALWAYS_ON is currently not a real persistent prewarm stream. Leave it off unless you are testing the surrounding setting flow.
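The three SCRIBER_MIC_LOW_RMS_* values suggest a hysteresis scheme: warn after the level has stayed below the lower threshold for a while, and clear only once it rises above the higher one. The sketch below illustrates that reading. The class is hypothetical and the backend's exact rules may differ, for instance in how it treats levels between the two thresholds.

```python
class LowInputWarning:
    """Hypothetical low-level detector with separate warn/clear thresholds."""

    def __init__(self, warn_below=0.001, clear_above=0.0025, warn_after_secs=6.0):
        self.warn_below = warn_below
        self.clear_above = clear_above
        self.warn_after_secs = warn_after_secs
        self.active = False
        self._quiet_since = None

    def update(self, rms, now):
        """Feed one RMS sample taken at time `now` (seconds); return warning state."""
        if rms >= self.clear_above:
            # Clearly audible again: reset the timer and clear the warning.
            self._quiet_since = None
            self.active = False
        elif rms < self.warn_below:
            if self._quiet_since is None:
                self._quiet_since = now
            if now - self._quiet_since >= self.warn_after_secs:
                self.active = True
        # Between the thresholds the previous state is kept (hysteresis).
        return self.active
```

The gap between 0.001 and 0.0025 prevents the warning from flickering when the input hovers around a single threshold.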
SCRIBER_UPLOAD_MAX_MB=200
SCRIBER_UPLOAD_MAX_BYTES=
SCRIBER_DOWNLOADS_DIR=downloads
SCRIBER_JOB_MAX_ATTEMPTS=3
SCRIBER_JOB_RETRY_BASE_SEC=5
SCRIBER_JOB_RETRY_MAX_SEC=120
SCRIBER_TIMEOUT_FILE_TRANSCRIBE_SEC=600
SCRIBER_TIMEOUT_YOUTUBE_TRANSCRIBE_SEC=600
SCRIBER_TIMEOUT_YOUTUBE_DOWNLOAD_SEC=300
SCRIBER_BREAKER_FAILURE_THRESHOLD=3
SCRIBER_BREAKER_COOLDOWN_SEC=30
SCRIBER_VALIDATE_WS_CONTRACTS=0
SCRIBER_HOTKEY_DISPATCH_DEBOUNCE_SEC=0.25
SCRIBER_LOG_STDERR=1
SCRIBER_SUMMARIZATION_MODEL=gemini-flash-latest
SCRIBER_AUTO_SUMMARIZE=0
SCRIBER_SUMMARY_MIN_WORDS=180
SCRIBER_SUMMARY_MAX_WORDS=2200
SCRIBER_SUMMARIZATION_PROMPT=...
Current default summarization model: gemini-flash-latest.
SONIOX_API_KEY=...
MISTRAL_API_KEY=...
ASSEMBLYAI_API_KEY=...
DEEPGRAM_API_KEY=...
OPENAI_API_KEY=...
AZURE_SPEECH_KEY=...
AZURE_SPEECH_REGION=...
GLADIA_API_KEY=...
GROQ_API_KEY=...
SPEECHMATICS_API_KEY=...
ELEVENLABS_API_KEY=...
GOOGLE_API_KEY=...
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
YOUTUBE_API_KEY=...
AWS uses standard SDK environment variables:
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=...
SCRIBER_ONNX_MODEL=nemo-parakeet-tdt-0.6b-v3
SCRIBER_ONNX_QUANTIZATION=int8
SCRIBER_ONNX_USE_GPU=0
SCRIBER_NEMO_MODEL=parakeet-primeline
SCRIBER_VISUALIZER_BAR_COUNT=60
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python check_imports.py
python -m src.web_api
cd Frontend
npm install
npm run dev:client
npm run check
npm run build
npm start
Do not run npm run dev:client and npm run dev at the same time on the default port.
pytest
pytest tests/test_device_monitor.py
pytest tests/test_microphone_device_resolution.py tests/test_microphone_callback.py
pytest tests/test_web_api_security.py::test_origin_allowed_defaults
pytest -k origin_allowed
Current test layout includes backend, runtime, core, data, contract, and perf tests under tests/.
Useful focused tests:
- Device monitor and mic selection:
pytest tests/test_device_monitor.py tests/test_microphone_device_resolution.py
- Microphone callback/channel handling:
pytest tests/test_microphone_channel_selection.py tests/test_microphone_callback.py
- Pipeline lifecycle:
pytest tests/test_pipeline_stop.py tests/test_web_api_lifecycle.py
- WebSocket contracts:
pytest tests/contract/test_ws_events.py
- Provider routing/circuit breaker:
pytest tests/runtime/test_provider_router.py tests/core/test_provider_circuit_breaker.py
python -m py_compile src\microphone.py src\pipeline.py src\web_api.py
git diff --check
cd Frontend
npm run check
npm run build
Scriber/
├── src/
│ ├── web_api.py # aiohttp REST + WebSocket API
│ ├── pipeline.py # STT pipeline and provider factory
│ ├── microphone.py # sounddevice input transport
│ ├── audio_devices.py # mic normalization/dedup/compatibility
│ ├── device_monitor.py # hotplug detection and PortAudio refresh
│ ├── audio_file_input.py # ffmpeg file input transport
│ ├── config.py # env + settings.json configuration
│ ├── database.py # SQLite persistence and FTS
│ ├── injector.py # text injection
│ ├── summarization.py # Gemini/OpenAI summaries
│ ├── youtube_api.py # YouTube Data API
│ ├── youtube_download.py # yt-dlp + ffmpeg extraction
│ ├── export.py # PDF/DOCX export
│ ├── overlay.py # recording overlay
│ ├── tray.py # tray lifecycle
│ ├── main.py # Tkinter fallback
│ ├── core/ # state, contracts, tracing, breakers
│ ├── data/ # job and metrics stores
│ └── runtime/ # provider routing and retry scheduling
├── Frontend/
│ ├── client/ # React app
│ ├── server/ # Express/Vite host
│ └── shared/ # shared TS schema/types
├── tests/ # pytest suite
├── docs/ # architecture and status docs
├── start.bat
├── start.sh
├── requirements.txt
└── README.md
Run:
python -m src.web_api
Then check latest.log / structured logs if present. Also run:
python check_imports.py
Check:
- backend health: http://127.0.0.1:8765/api/health
- frontend port: http://localhost:5000
- VITE_BACKEND_URL if backend host/port is customized
- CORS via SCRIBER_ALLOWED_ORIGINS
Check:
- GET /api/microphones
- Windows microphone privacy settings
- selected/favorite mic in Settings
- dock/USB reconnect
The DeviceMonitor should pick up hotplug changes. During active recording, PortAudio refresh is intentionally deferred until after stop.
- Confirm the device label in GET /api/microphones.
- Clear or update SCRIBER_FAVORITE_MIC.
- Device resolution is cached briefly; a mic settings change or a hotplug event invalidates the cache.
- Wait until the overlay/state switches from preparing to recording.
- SCRIBER_MIC_ALWAYS_ON is not true app-level prewarming yet.
- Check docs/Mic-Performance-Enhancement.md for current mic latency status.
- Set YOUTUBE_API_KEY.
- Verify yt-dlp and ffmpeg availability.
- Check timeout settings and provider API keys.
- Verify extension and size limits.
- For video, ensure ffmpeg can extract audio.
- Check provider-specific upload limits in backend logs/settings.
- Check ONNX/NeMo dependencies.
- Use the Settings UI or /api/onnx/models and /api/nemo/models.
- Ensure model directories are writable.
- Real app-level microphone prewarming for SCRIBER_MIC_ALWAYS_ON.
- Frontend transcript-list virtualization or infinite query.
- Vite manual vendor chunking for smaller initial chunks.
- WebSocket no-client fast path before JSON serialization and task scheduling.
- Background/off-thread upload preprocessing and export generation.
- O(n^2) live transcript content append behavior in very long sessions.
- More hardware regression tests for dock/USB mic add/remove and favorite fallback.
- Stronger typed API contract between backend and frontend.
- Smaller backend modules by splitting src/web_api.py into domains.
MIT license metadata is used by the project. A standalone root LICENSE file is not currently present.
Efficient, resumable, multi-provider speech-to-text workflows.




