Scriber

AI-powered speech-to-text workflows for desktop and web.
Live dictation, YouTube transcription, file transcription, transcript management, summaries, and export.

Status | Features | Quick Start | Usage | Architecture | API | Configuration | Development | Troubleshooting


Status

Last verified: 2026-06-01

Scriber is a local-first transcription app with a Python backend, a React web UI, and a legacy Tkinter fallback UI. The current primary runtime is Windows with tray integration, global hotkeys, microphone device monitoring, and local SQLite persistence.

Current implementation highlights:

  • Live microphone transcription with WebSocket status/audio/transcript events.
  • YouTube and file transcription with persistent jobs, retry scheduling, and resume support.
  • Multi-provider STT support, including cloud providers and local ONNX/NeMo paths.
  • SQLite transcript storage with WAL mode, metadata list loading, pagination, and FTS5 search.
  • DeviceMonitor for microphone hotplug handling with native Windows endpoint events where available and polling fallback.
  • Recording-aware PortAudio refresh: device refreshes are deferred while a recording stream is active and run once after the stream becomes idle.
  • Short-lived microphone device-resolution cache for selected/favorite mic lookup.
  • Route-level frontend lazy loading for non-default pages and a single shared WebSocket connection.
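
The recording-aware refresh rule above can be pictured as a small gate. This is a minimal sketch, not Scriber's actual DeviceMonitor code: the class name DeferredRefreshGate and its method names are hypothetical, and the sketch is single-threaded (a real implementation would likely need a lock).

```python
from typing import Callable


class DeferredRefreshGate:
    """Hypothetical helper: refresh now when idle, or once after recording ends."""

    def __init__(self, refresh: Callable[[], None]) -> None:
        self._refresh = refresh
        self._recording = False
        self._pending = False

    def stream_started(self) -> None:
        self._recording = True

    def request_refresh(self) -> None:
        if self._recording:
            # Coalesce any number of requests into one deferred refresh.
            self._pending = True
        else:
            self._refresh()

    def stream_stopped(self) -> None:
        self._recording = False
        if self._pending:
            self._pending = False
            self._refresh()
```

Several hotplug events during one recording collapse into a single refresh after the stream becomes idle, which matches the "run once" behavior described in the list.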

Known limits:

  • SCRIBER_MIC_ALWAYS_ON exists as a setting, but it is not a true app-level always-on/prewarmed microphone stream yet. Per-session streams are closed during cleanup to avoid orphaned PortAudio resources.
  • Frontend transcript-list virtualization/infinite loading is still open.
  • Vite production build can still warn about an initial chunk over 500 kB; manual vendor chunking is still open.
  • Some upload preprocessing and export generation still run synchronously in async request paths.
  • Very long live sessions can still hit O(n^2)-style string growth when appending final transcript chunks.
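
To illustrate the last limit: repeatedly concatenating onto one growing string can copy the accumulated text on every append. A common way around this is to collect chunks in a list and join them only when the full text is needed. The sketch below shows that pattern; TranscriptBuffer is a hypothetical name and is not taken from the Scriber code base.

```python
class TranscriptBuffer:
    """Hypothetical accumulator for final transcript chunks."""

    def __init__(self) -> None:
        self._chunks: list[str] = []

    def append_final(self, text: str) -> None:
        # Appending to a list is amortized O(1) and does not copy earlier chunks.
        text = text.strip()
        if text:
            self._chunks.append(text)

    def render(self) -> str:
        # A single join builds the full text in one pass.
        return " ".join(self._chunks)
```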

Features

Live Microphone Dictation

  • Global hotkey, default Ctrl+Alt+S.
  • Modes:
    • toggle: press once to start, press again to stop.
    • push_to_talk: record while the hotkey is held.
  • Live WebSocket events for state, status, audio level, warnings, transcripts, session lifecycle, history updates, and errors.
  • Favorite microphone selection with fallback to selected/default device.
  • Device hotplug detection via DeviceMonitor.
  • Low input-level warning flow for muted/quiet microphones.
  • Recording overlay with preparing/recording/transcribing states.
  • Text injection into the active app through auto, sendinput, paste, or type.

YouTube Transcription

  • YouTube search and video lookup through the YouTube Data API.
  • Download and audio extraction through yt-dlp and ffmpeg.
  • Persistent job lifecycle with retry/resume support.
  • Transcript entries are saved as youtube records.

File Transcription

  • Multipart upload through POST /api/file/transcribe.
  • Supported audio formats: .mp3, .wav, .m4a, .flac, .aac, .ogg.
  • Supported video formats: .mp4, .mov, .webm, .avi, .mkv, .m4v.
  • Video audio extraction through ffmpeg.
  • Default audio upload limit: 200 MB.
  • Raw video upload hard limit: 2048 MB.
  • Extracted/compressed audio is limited by the final audio/provider limit.
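
The extension and size rules above can be sketched as a pre-upload check. This is an illustration under assumptions: classify_upload is a hypothetical helper, and 1 MB is taken to be 1024 * 1024 bytes, which the README does not specify. The backend remains the authority on what is accepted.

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".aac", ".ogg"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".webm", ".avi", ".mkv", ".m4v"}
MB = 1024 * 1024               # assumption: binary megabytes
AUDIO_LIMIT_BYTES = 200 * MB   # default audio upload limit
VIDEO_LIMIT_BYTES = 2048 * MB  # raw video hard limit


def classify_upload(filename: str, size_bytes: int) -> str:
    """Return "audio" or "video", or raise ValueError if the file is rejected."""
    suffix = Path(filename).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        kind, limit = "audio", AUDIO_LIMIT_BYTES
    elif suffix in VIDEO_EXTENSIONS:
        kind, limit = "video", VIDEO_LIMIT_BYTES
    else:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > limit:
        raise ValueError(f"{kind} upload exceeds {limit // MB} MB")
    return kind
```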

STT Providers

Provider coverage includes:

  • Soniox realtime and async
  • Mistral realtime and async
  • AssemblyAI Universal-3-Pro async
  • Deepgram
  • OpenAI
  • Azure Speech
  • Azure MAI Transcribe
  • Gladia
  • Groq
  • Speechmatics
  • ElevenLabs
  • Google
  • AWS Transcribe
  • Smallest
  • ONNX local models
  • NeMo local models

Provider routing, retry scheduling, and circuit-breaker logic exist in the backend. Verify provider-specific behavior in code before changing a provider contract.

Transcript Management

  • SQLite persistence in transcripts.db.
  • Transcript list pagination with offset/limit.
  • Type filtering by mic, youtube, or file.
  • FTS5-backed search.
  • Detail view with full content and summary.
  • Delete, cancel, summarize, export.
  • Export as PDF or DOCX.
  • Optional automatic summarization after job completion.
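
For readers unfamiliar with FTS5, the sketch below shows how WAL mode, an FTS5 index, type filtering, and offset/limit pagination fit together using Python's built-in sqlite3 module. The table name transcripts_fts, its columns, and both function names are hypothetical; the real schema lives in src/database.py. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is active (no effect on :memory:).
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS transcripts_fts "
        "USING fts5(title, content, type UNINDEXED)"
    )
    return conn


def search(conn, query, type_filter=None, offset=0, limit=50):
    sql = "SELECT rowid, title FROM transcripts_fts WHERE transcripts_fts MATCH ?"
    params = [query]
    if type_filter is not None:
        sql += " AND type = ?"
        params.append(type_filter)
    # "rank" orders FTS5 matches by relevance, best first.
    sql += " ORDER BY rank LIMIT ? OFFSET ?"
    params += [limit, offset]
    return conn.execute(sql, params).fetchall()
```

Note that the MATCH argument is parsed as FTS5 query syntax, so raw user input containing quotes or operators may need escaping before it is passed in.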

Local Models

  • ONNX model list, download, status, delete.
  • Quantization options: int8, fp16, fp32.
  • Optional ONNX GPU flag.
  • NeMo model list, download, delete.

Screenshots

  • Live Mic: Live Mic Interface
  • YouTube: YouTube Transcription
  • File Upload
  • Transcript Detail
  • Settings


Quick Start

Prerequisites

  • Python 3.10+
  • Node.js 20+ for the web UI
  • ffmpeg available on PATH for YouTube/file audio extraction
  • Windows recommended for tray, global hotkey, overlay, and microphone device monitoring

Windows

git clone https://github.com/MyButtermilk/Scriber.git
cd Scriber
start.bat

start.bat handles:

  • Python check
  • virtual environment setup
  • dependency installation when needed
  • initial .env creation if missing
  • tray/web startup when Node and Frontend/ are available
  • Tkinter fallback when the web UI cannot be started
  • backend health check at http://127.0.0.1:8765/api/health
  • browser open at http://localhost:5000

Linux/macOS

./start.sh

The shell script sets up dependencies and starts the Tkinter path. The full tray/hotkey/device-monitor experience is Windows-focused.

Manual Backend and Frontend

# Backend only
python -m src.web_api

# Frontend client only
cd Frontend
npm install
npm run dev:client

# Frontend Express/Vite dev host
cd Frontend
npm run dev

# Frontend production build and start
cd Frontend
npm run build
npm start

Default URLs:

  • Backend: http://127.0.0.1:8765
  • Web UI: http://localhost:5000
  • WebSocket: ws://127.0.0.1:8765/ws
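
A quick way to confirm the backend is reachable at its default URL is to probe the health endpoint. The sketch below uses only the standard library and checks the HTTP status; the shape of the response body is not documented here, so it is ignored. The function name backend_is_healthy is hypothetical.

```python
import urllib.request

# Bypass any system proxy settings: the backend listens on the local machine.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def backend_is_healthy(base_url: str = "http://127.0.0.1:8765",
                       timeout: float = 2.0) -> bool:
    """Return True if GET /api/health answers with HTTP 200."""
    try:
        with _opener.open(f"{base_url}/api/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```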

Additional entrypoints:

python -m src.tray
python -m src.main

Usage

Web Routes

  • /: Live Mic
  • /youtube: YouTube transcription
  • /file: File transcription
  • /transcript/:id: Transcript detail
  • /settings: Settings

Live Mic

  1. Select the STT provider and microphone in Settings.
  2. Optional: set a favorite microphone. It is preferred when available.
  3. Start from the UI or with the configured hotkey.
  4. Wait for the overlay/state to switch from preparing to recording before speaking.
  5. Stop recording through the UI or with the hotkey.
  6. The final transcript is saved as a mic entry and can be summarized/exported.

Important microphone behavior:

  • DeviceMonitor keeps the frontend microphone list updated after USB/dock changes.
  • PortAudio refresh is deferred during active recording to avoid native races.
  • Mic selection is cached briefly to avoid repeated device scans on consecutive starts.
  • SCRIBER_MIC_ALWAYS_ON=1 does not yet keep a reusable app-level mic stream alive.
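
The short-lived selection cache can be thought of as a time-to-live (TTL) cache with explicit invalidation. The sketch below is illustrative only: MicResolutionCache and its methods are hypothetical names, and the 10-second default simply mirrors the documented default of SCRIBER_MIC_DEVICE_CACHE_TTL_SEC.

```python
import time
from typing import Callable, Optional


class MicResolutionCache:
    """Hypothetical TTL cache around an expensive device lookup."""

    def __init__(self, ttl_sec: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_sec
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self, resolve: Callable[[], str]) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Cache miss or expired entry: run the real device scan.
            self._value = resolve()
            self._expires_at = now + self._ttl
        return self._value

    def invalidate(self) -> None:
        # Call on hotplug events or when mic settings change.
        self._value = None
```

Injecting the clock keeps the class testable without sleeping.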

YouTube

  1. Set YOUTUBE_API_KEY.
  2. Search or paste a video URL/ID.
  3. Start transcription.
  4. Track job progress in the UI and transcript history.

File Upload

  1. Open /file.
  2. Drop or select an audio/video file.
  3. The backend validates size/type, extracts audio for videos, and starts a transcription job.
  4. Results appear in transcript history.

Settings

The backend settings API manages:

  • hotkey and recording mode
  • STT provider and provider-specific models
  • language
  • microphone and favorite microphone
  • injection method
  • API keys
  • ONNX/NeMo local models
  • summarization model, prompt, and auto-summary setting
  • visualizer bar count

AWS credentials are not fully managed through apiKeys; use the standard AWS environment variables.


Architecture

flowchart LR
    User["Browser / Hotkey / Tray"] -->|"HTTP + WebSocket"| Backend["Python Backend\nsrc.web_api"]
    Backend --> Controller["ScriberWebController"]
    Controller --> Pipeline["ScriberPipeline\nProviderRouter"]
    Controller --> DB[("SQLite\ntranscripts.db")]
    Controller --> Jobs["JobStore\nRetryScheduler"]
    Controller --> Monitor["DeviceMonitor\nMic Resolution Cache"]
    Pipeline --> Providers["STT Providers\nCloud + Local"]
    Pipeline --> Mic["MicrophoneInput\nsounddevice"]
    Backend <--> Frontend["React UI\nFrontend/client"]

Runtime Paths

  • Live Mic:
    • POST /api/live-mic/start|stop|toggle
    • microphone stream
    • Pipecat/STT pipeline
    • WebSocket events
    • transcript persistence
    • optional text injection
  • YouTube:
    • YouTube Data API lookup
    • yt-dlp download
    • ffmpeg audio extraction
    • STT pipeline/direct provider path
    • job persistence and retry/resume
  • File:
    • multipart upload
    • size/type validation
    • optional ffmpeg extraction/compression
    • STT pipeline/direct provider path
    • transcript persistence
  • Frontend:
    • REST for commands and data
    • single shared WebSocket for live events
    • React Query for server state

Backend Modules

  • src/web_api.py: REST, WebSocket, settings, jobs, transcript API.
  • src/pipeline.py: provider creation, STT pipeline, analyzer cache, mic resolution.
  • src/microphone.py: sounddevice transport and audio callback.
  • src/audio_devices.py: deduplication, host API priority, compatibility.
  • src/device_monitor.py: hotplug detection and PortAudio refresh.
  • src/database.py: SQLite persistence and FTS.
  • src/runtime/: provider router and retry scheduler.
  • src/core/: state machine, circuit breaker, error taxonomy, event contracts, tracing.

Frontend Architecture

  • Vite 7 + React 19 + TypeScript.
  • Wouter routing.
  • TanStack Query for API data.
  • Single WebSocketProvider.
  • LiveMic is eagerly loaded for the default route.
  • YouTube, File, Settings, TranscriptDetail, and NotFound are lazy-loaded chunks.
  • Tailwind v4 CSS-first setup through Frontend/client/src/index.css.
  • Radix/shadcn-style primitives and existing neumorphic classes.

API

System

  • GET /api/health
  • GET /api/state
  • GET /api/metrics/hot-path?limit=n

limit for hot-path metrics is clamped to 1..500.

WebSocket

  • GET /ws

Core event types:

  • state
  • status
  • transcript
  • audio_level
  • input_warning
  • transcribing
  • session_started
  • session_finished
  • history_updated
  • error
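
A client usually routes these events by type. The sketch below assumes each WebSocket message is a JSON object with a "type" field holding one of the names above; that envelope is an assumption, since the payload shape is not documented in this README (see the contract tests under tests/contract/ for the real format). EventDispatcher is a hypothetical name.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], None]

CORE_EVENT_TYPES = {
    "state", "status", "transcript", "audio_level", "input_warning",
    "transcribing", "session_started", "session_finished",
    "history_updated", "error",
}


class EventDispatcher:
    """Hypothetical router from event type to handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def on(self, event_type: str, handler: Handler) -> None:
        if event_type not in CORE_EVENT_TYPES:
            raise ValueError(f"unknown event type: {event_type}")
        self._handlers[event_type] = handler

    def dispatch(self, raw: str) -> bool:
        """Decode one message; return True if a handler consumed it."""
        message = json.loads(raw)
        if not isinstance(message, dict):
            return False
        handler = self._handlers.get(message.get("type"))
        if handler is None:
            return False
        handler(message)
        return True
```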

Live Mic

  • POST /api/live-mic/start
  • POST /api/live-mic/stop
  • POST /api/live-mic/toggle

Transcripts

  • GET /api/transcripts?offset=0&limit=50&type={mic|youtube|file}&q={query}
  • GET /api/transcripts/{id}
  • DELETE /api/transcripts/{id}
  • POST /api/transcripts/{id}/summarize
  • POST /api/transcripts/{id}/cancel
  • GET /api/transcripts/{id}/export/{format}

limit defaults to 50 and is clamped to 1..100. Export format is pdf or docx.
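
As an illustration of the list query parameters, the sketch below builds a request URL and mirrors the documented limit clamp on the client side. The helper names are hypothetical; the server performs its own clamping regardless.

```python
from typing import Optional
from urllib.parse import urlencode

TRANSCRIPT_TYPES = {"mic", "youtube", "file"}


def clamp(value: int, low: int, high: int) -> int:
    return max(low, min(high, value))


def transcripts_url(base_url: str = "http://127.0.0.1:8765",
                    offset: int = 0, limit: int = 50,
                    type_filter: Optional[str] = None,
                    query: Optional[str] = None) -> str:
    """Build a GET /api/transcripts URL with limit clamped to 1..100."""
    params = {"offset": offset, "limit": clamp(limit, 1, 100)}
    if type_filter is not None:
        if type_filter not in TRANSCRIPT_TYPES:
            raise ValueError(f"unknown transcript type: {type_filter}")
        params["type"] = type_filter
    if query:
        params["q"] = query
    return f"{base_url}/api/transcripts?{urlencode(params)}"
```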

YouTube

  • GET /api/youtube/search?q={query}&maxResults={n}&pageToken={token}
  • GET /api/youtube/video?id={id}
  • GET /api/youtube/video?url={url}
  • POST /api/youtube/transcribe

File

  • POST /api/file/transcribe

Expected body: multipart/form-data with field file.
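
For clients that do not use an HTTP library with multipart support, the request body can be assembled by hand. The following is a minimal standard-library sketch that builds, but does not send, such a request. Both function names are hypothetical, and the response format is not covered here.

```python
import urllib.request
import uuid
from typing import Tuple


def encode_file_field(filename: str, data: bytes,
                      content_type: str = "application/octet-stream") -> Tuple[bytes, str]:
    """Encode one part named "file"; return (body, Content-Type header value)."""
    boundary = uuid.uuid4().hex
    safe_name = filename.replace('"', "_")  # keep the header line well-formed
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{safe_name}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def build_transcribe_request(filename: str, data: bytes,
                             base_url: str = "http://127.0.0.1:8765") -> urllib.request.Request:
    body, content_type = encode_file_field(filename, data)
    return urllib.request.Request(
        f"{base_url}/api/file/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned request to urllib.request.urlopen would perform the upload. This sketch holds the whole file in memory, so it is only suitable for small files.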

Settings, Devices, Autostart

  • GET /api/settings
  • PUT /api/settings
  • GET /api/microphones
  • GET /api/autostart
  • POST /api/autostart

Local Models

  • GET /api/onnx/models
  • GET /api/onnx/models/{model_id}
  • POST /api/onnx/download
  • DELETE /api/onnx/models/{model_id}
  • GET /api/nemo/models
  • POST /api/nemo/download
  • DELETE /api/nemo/models/{model_id}

ONNX model status/delete can use an optional quantization query parameter.


Configuration

Configuration is loaded from environment variables and .env. Multi-line summarization prompt state can also be stored in settings.json.

Do not commit .env, settings.json, transcripts.db, downloads/, or generated local artifacts.

Web/API

SCRIBER_WEB_HOST=127.0.0.1
SCRIBER_WEB_PORT=8765
SCRIBER_ALLOWED_ORIGINS=

Default CORS allows localhost, 127.0.0.1, and ::1. SCRIBER_ALLOWED_ORIGINS=* allows all origins.
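
The origin rule can be sketched as follows. This is not the backend's implementation: the function signature is hypothetical, and two details are assumptions, namely that SCRIBER_ALLOWED_ORIGINS is a comma-separated list and that the local defaults stay allowed when the variable is set. Check tests/test_web_api_security.py for the actual behavior.

```python
from urllib.parse import urlsplit

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def origin_allowed(origin: str, allowed_origins: str = "") -> bool:
    """Decide whether an Origin header value should be accepted."""
    # Assumption: comma-separated list of exact origins, or "*".
    configured = {item.strip() for item in allowed_origins.split(",") if item.strip()}
    if "*" in configured or origin in configured:
        return True
    try:
        host = urlsplit(origin).hostname
    except ValueError:
        # Malformed origin, for example an unterminated IPv6 literal.
        return False
    return host in LOCAL_HOSTS
```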

Frontend

VITE_BACKEND_URL=http://127.0.0.1:8765
PORT=5000

Recording and Provider Selection

SCRIBER_HOTKEY=ctrl+alt+s
SCRIBER_MODE=toggle
SCRIBER_DEFAULT_STT=soniox
SCRIBER_STT_FALLBACKS=
SCRIBER_LANGUAGE=auto
SCRIBER_DEBUG=0
SCRIBER_CUSTOM_VOCAB=

Provider Models

SCRIBER_SONIOX_MODE=realtime
SCRIBER_SONIOX_ASYNC_MODEL=stt-async-v4
SCRIBER_SONIOX_RT_MODEL=stt-rt-v4
SCRIBER_MISTRAL_RT_MODEL=voxtral-mini-transcribe-realtime-2602
SCRIBER_MISTRAL_ASYNC_MODEL=voxtral-mini-2602
SCRIBER_OPENAI_STT_MODEL=gpt-4o-mini-transcribe-2025-12-15
SCRIBER_AZURE_MAI_REGION=northeurope

Microphone and Injection

SCRIBER_MIC_DEVICE=default
SCRIBER_FAVORITE_MIC=
SCRIBER_MIC_ALWAYS_ON=0
SCRIBER_MIC_BLOCK_SIZE=512
SCRIBER_MIC_DEVICE_CACHE_TTL_SEC=10.0
SCRIBER_MIC_LOW_RMS_THRESHOLD=0.001
SCRIBER_MIC_LOW_RMS_CLEAR_THRESHOLD=0.0025
SCRIBER_MIC_LOW_RMS_WARN_AFTER_SECS=6.0
SCRIBER_INJECT_METHOD=auto
SCRIBER_PASTE_PRE_DELAY_MS=80
SCRIBER_PASTE_RESTORE_DELAY_MS=1500

SCRIBER_MIC_ALWAYS_ON is currently not a real persistent prewarm stream. Leave it off unless you are testing the surrounding setting flow.
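
The three SCRIBER_MIC_LOW_RMS_* values suggest a warning with hysteresis: raise it after the level stays below one threshold for some time, and clear it only once the level rises above a higher threshold. The sketch below implements that interpretation. The exact semantics in Scriber are an assumption, and LowInputWarning is a hypothetical name.

```python
from typing import Optional


class LowInputWarning:
    """Hypothetical low-level detector with separate warn/clear thresholds."""

    def __init__(self, warn_below: float = 0.001, clear_above: float = 0.0025,
                 warn_after_sec: float = 6.0) -> None:
        self.warn_below = warn_below
        self.clear_above = clear_above
        self.warn_after_sec = warn_after_sec
        self._quiet_since: Optional[float] = None
        self.active = False

    def update(self, rms: float, now: float) -> bool:
        """Feed one RMS sample with its timestamp; return the warning state."""
        if rms < self.warn_below:
            if self._quiet_since is None:
                self._quiet_since = now
            if now - self._quiet_since >= self.warn_after_sec:
                self.active = True
        else:
            self._quiet_since = None
            # Levels between the two thresholds keep the current state.
            if rms >= self.clear_above:
                self.active = False
        return self.active
```

The gap between the two thresholds keeps the warning from flickering when the input level hovers near a single cutoff.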

Uploads, Jobs, Timeouts

SCRIBER_UPLOAD_MAX_MB=200
SCRIBER_UPLOAD_MAX_BYTES=
SCRIBER_DOWNLOADS_DIR=downloads
SCRIBER_JOB_MAX_ATTEMPTS=3
SCRIBER_JOB_RETRY_BASE_SEC=5
SCRIBER_JOB_RETRY_MAX_SEC=120
SCRIBER_TIMEOUT_FILE_TRANSCRIBE_SEC=600
SCRIBER_TIMEOUT_YOUTUBE_TRANSCRIBE_SEC=600
SCRIBER_TIMEOUT_YOUTUBE_DOWNLOAD_SEC=300
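
The three SCRIBER_JOB_* retry values are consistent with a capped exponential backoff. The sketch below assumes a doubling schedule; the actual formula used by the retry scheduler in src/runtime/ is not stated in this README and may differ (for example, it may add jitter). Both function names are hypothetical.

```python
def retry_delay(attempt: int, base_sec: float = 5.0, max_sec: float = 120.0) -> float:
    """Delay after the given failed attempt (1-based), doubling up to max_sec."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(max_sec, base_sec * 2 ** (attempt - 1))


def retry_schedule(max_attempts: int = 3, base_sec: float = 5.0,
                   max_sec: float = 120.0) -> list[float]:
    """Delays between consecutive attempts; no delay follows the last one."""
    return [retry_delay(n, base_sec, max_sec) for n in range(1, max_attempts)]
```

With the defaults shown above, this schedule waits 5 seconds after the first failure and 10 seconds after the second, then gives up after the third attempt.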

Circuit Breaker and Diagnostics

SCRIBER_BREAKER_FAILURE_THRESHOLD=3
SCRIBER_BREAKER_COOLDOWN_SEC=30
SCRIBER_VALIDATE_WS_CONTRACTS=0
SCRIBER_HOTKEY_DISPATCH_DEBOUNCE_SEC=0.25
SCRIBER_LOG_STDERR=1
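
To show how the two breaker settings interact, here is a minimal circuit-breaker sketch: after a number of consecutive failures the provider is skipped until a cooldown has elapsed. It is an illustration only; the class and method names are hypothetical and the real logic lives in src/core/.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold: int = 3, cooldown_sec: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_sec
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cooldown, let a trial request through.
        return self._clock() - self._opened_at >= self._cooldown

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            # Open the breaker, or restart the cooldown if a trial failed.
            self._opened_at = self._clock()
```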

Summarization

SCRIBER_SUMMARIZATION_MODEL=gemini-flash-latest
SCRIBER_AUTO_SUMMARIZE=0
SCRIBER_SUMMARY_MIN_WORDS=180
SCRIBER_SUMMARY_MAX_WORDS=2200
SCRIBER_SUMMARIZATION_PROMPT=...

Current default summarization model: gemini-flash-latest.

API Keys

SONIOX_API_KEY=...
MISTRAL_API_KEY=...
ASSEMBLYAI_API_KEY=...
DEEPGRAM_API_KEY=...
OPENAI_API_KEY=...
AZURE_SPEECH_KEY=...
AZURE_SPEECH_REGION=...
GLADIA_API_KEY=...
GROQ_API_KEY=...
SPEECHMATICS_API_KEY=...
ELEVENLABS_API_KEY=...
GOOGLE_API_KEY=...
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
YOUTUBE_API_KEY=...

AWS uses standard SDK environment variables:

AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=...

Local Models and UI

SCRIBER_ONNX_MODEL=nemo-parakeet-tdt-0.6b-v3
SCRIBER_ONNX_QUANTIZATION=int8
SCRIBER_ONNX_USE_GPU=0
SCRIBER_NEMO_MODEL=parakeet-primeline
SCRIBER_VISUALIZER_BAR_COUNT=60

Development

Backend Commands

python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python check_imports.py
python -m src.web_api

Frontend Commands

cd Frontend
npm install
npm run dev:client
npm run check
npm run build
npm start

Do not run npm run dev:client and npm run dev at the same time on the default port.

Tests

pytest
pytest tests/test_device_monitor.py
pytest tests/test_microphone_device_resolution.py tests/test_microphone_callback.py
pytest tests/test_web_api_security.py::test_origin_allowed_defaults
pytest -k origin_allowed

Current test layout includes backend, runtime, core, data, contract, and perf tests under tests/.

Useful focused tests:

  • Device monitor and mic selection:
    • pytest tests/test_device_monitor.py tests/test_microphone_device_resolution.py
  • Microphone callback/channel handling:
    • pytest tests/test_microphone_channel_selection.py tests/test_microphone_callback.py
  • Pipeline lifecycle:
    • pytest tests/test_pipeline_stop.py tests/test_web_api_lifecycle.py
  • WebSocket contracts:
    • pytest tests/contract/test_ws_events.py
  • Provider routing/circuit breaker:
    • pytest tests/runtime/test_provider_router.py tests/core/test_provider_circuit_breaker.py

Quality Checks

python -m py_compile src\microphone.py src\pipeline.py src\web_api.py
git diff --check
cd Frontend
npm run check
npm run build

Project Structure

Scriber/
├── src/
│   ├── web_api.py                  # aiohttp REST + WebSocket API
│   ├── pipeline.py                 # STT pipeline and provider factory
│   ├── microphone.py               # sounddevice input transport
│   ├── audio_devices.py            # mic normalization/dedup/compatibility
│   ├── device_monitor.py           # hotplug detection and PortAudio refresh
│   ├── audio_file_input.py         # ffmpeg file input transport
│   ├── config.py                   # env + settings.json configuration
│   ├── database.py                 # SQLite persistence and FTS
│   ├── injector.py                 # text injection
│   ├── summarization.py            # Gemini/OpenAI summaries
│   ├── youtube_api.py              # YouTube Data API
│   ├── youtube_download.py         # yt-dlp + ffmpeg extraction
│   ├── export.py                   # PDF/DOCX export
│   ├── overlay.py                  # recording overlay
│   ├── tray.py                     # tray lifecycle
│   ├── main.py                     # Tkinter fallback
│   ├── core/                       # state, contracts, tracing, breakers
│   ├── data/                       # job and metrics stores
│   └── runtime/                    # provider routing and retry scheduling
├── Frontend/
│   ├── client/                     # React app
│   ├── server/                     # Express/Vite host
│   └── shared/                     # shared TS schema/types
├── tests/                          # pytest suite
├── docs/                           # architecture and status docs
├── start.bat
├── start.sh
├── requirements.txt
└── README.md

Troubleshooting

Backend does not start

Run:

python -m src.web_api

Then check latest.log / structured logs if present. Also run:

python check_imports.py

Web UI does not load

Check:

  • backend health: http://127.0.0.1:8765/api/health
  • frontend port: http://localhost:5000
  • VITE_BACKEND_URL if backend host/port is customized
  • CORS via SCRIBER_ALLOWED_ORIGINS

No microphone appears

Check:

  • GET /api/microphones
  • Windows microphone privacy settings
  • selected/favorite mic in Settings
  • dock/USB reconnect

The DeviceMonitor should pick up hotplug changes. During active recording, PortAudio refresh is intentionally deferred until after stop.

Favorite microphone is not used

  • Confirm the device label in GET /api/microphones.
  • Clear or update SCRIBER_FAVORITE_MIC.
  • Device resolution is cached briefly; changing mic settings or a hotplug event invalidates the cache.

First words are cut off

  • Wait until the overlay/state switches from preparing to recording.
  • SCRIBER_MIC_ALWAYS_ON is not true app-level prewarming yet.
  • Check docs/Mic-Performance-Enhancement.md for current mic latency status.

YouTube transcription fails

  • Set YOUTUBE_API_KEY.
  • Verify yt-dlp and ffmpeg availability.
  • Check timeout settings and provider API keys.

File upload fails

  • Verify extension and size limits.
  • For video, ensure ffmpeg can extract audio.
  • Check provider-specific upload limits in backend logs/settings.

Local models are missing

  • Check ONNX/NeMo dependencies.
  • Use the Settings UI or /api/onnx/models and /api/nemo/models.
  • Ensure model directories are writable.

Roadmap / Open Engineering Work

  • Real app-level microphone prewarming for SCRIBER_MIC_ALWAYS_ON.
  • Frontend transcript-list virtualization or infinite query.
  • Vite manual vendor chunking for smaller initial chunks.
  • WebSocket no-client fast path before JSON serialization and task scheduling.
  • Background/off-thread upload preprocessing and export generation.
  • Eliminate the O(n^2) live transcript append behavior in very long sessions.
  • More hardware regression tests for dock/USB mic add/remove and favorite fallback.
  • Stronger typed API contract between backend and frontend.
  • Smaller backend modules by splitting src/web_api.py into domains.

License

The project metadata declares the MIT license. A standalone LICENSE file is not currently present in the repository root.


Efficient, resumable, multi-provider speech-to-text workflows.
