feat(whisper): native AVAudioEngine capture bypasses renderer getUserMedia #448
Merged
shobhit99 merged 1 commit on May 28, 2026
…Media

Add a native Swift audio-capturer helper that uses AVAudioEngine to capture microphone audio directly from the main process, eliminating 200-500ms of browser getUserMedia/AudioContext negotiation latency.

The main process now starts recording immediately on hotkey press, in parallel with opening the renderer overlay. The renderer's SuperCmdWhisper component detects the already-running native capture and hooks into it for the visualizer and transcription, falling back to the getUserMedia path if native capture isn't available.

Also includes prior optimizations:
- Persistent whisper.cpp server (model stays loaded in memory)
- setTimeout-chain for first transcription at 1s (vs 3.5s setInterval)
- Non-blocking AI transcript refinement (paste raw immediately, refine async)
- Parallel getUserMedia + resolveSessionConfig
- Session config caching (10s TTL)
- whisper.cpp server warmup on component mount
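To illustrate the setTimeout-chain item, here is a minimal sketch of a timer chain that fires the first partial transcription after 1s and only schedules the next run once the previous one has settled. The names `startTranscriptionChain`, `transcribeOnce`, `isActive`, and `schedule` are hypothetical, and the 3.5s delay between later runs is an assumption (the commit message only states the 1s first run and the old 3.5s interval).

```typescript
// Hypothetical scheduler type so the chain can be driven by a fake timer in tests.
type Scheduler = (fn: () => void, ms: number) => void;

function startTranscriptionChain(
  transcribeOnce: () => Promise<void>, // hypothetical: runs one partial transcription
  isActive: () => boolean,             // hypothetical: false once recording has stopped
  firstDelayMs: number = 1000,         // first run at 1s, as described in the commit message
  nextDelayMs: number = 3500,          // assumption: cadence of later runs
  schedule: Scheduler = (fn, ms) => { setTimeout(fn, ms); },
): void {
  const tick = (): void => {
    if (!isActive()) return;
    // Schedule the next run only after this one settles, so slow
    // transcriptions never overlap (unlike a fixed setInterval).
    const next = (): void => {
      if (isActive()) schedule(tick, nextDelayMs);
    };
    transcribeOnce().then(next, next);
  };
  schedule(tick, firstDelayMs);
}
```

Compared with `setInterval`, the chain lets the first delay differ from the later ones and stops rescheduling as soon as `isActive()` returns false.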
Overview
Eliminate the 200–500 ms `getUserMedia` startup latency for whisper dictation by capturing microphone audio natively via `AVAudioEngine` in a persistent Swift helper process, bypassing the browser audio stack entirely.

The Problem
When the whisper dictation hotkey is pressed, the startup path goes through the Electron renderer:
The biggest bottleneck is `getUserMedia`: the browser has to negotiate with the OS audio subsystem through multiple abstraction layers, re-validate permissions, create an `AudioContext`, set up the PCM pipeline, and so on. This adds 200–500 ms before audio capture even begins.

For context, Hex, a native Swift STT transcriber alternative, uses `AVAudioEngine` directly and starts recording in ~10–30 ms because it bypasses all browser abstractions.

The Solution
A new native Swift helper (`audio-capturer.swift`) that talks to Core Audio directly. The main process starts it on hotkey press, in parallel with opening the renderer overlay. By the time the renderer mounts and checks the capturer status, the mic is already recording.

Total time from hotkey press to audio capture: ~30 ms (down from 200–500 ms).
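The "start capture in parallel with opening the overlay" idea could look roughly like the sketch below. It is not the code from `main.ts`: `handleWhisperHotkey` and both injected callbacks are hypothetical stand-ins for whatever the main process actually calls.

```typescript
// Minimal sketch: both operations are started before either is awaited,
// so overlay creation does not delay the microphone.
async function handleWhisperHotkey(
  startNativeCapture: () => Promise<boolean>, // hypothetical: resolves true once recording
  openOverlay: () => Promise<void>,           // hypothetical: shows the renderer overlay
): Promise<{ nativeCapture: boolean }> {
  // Kick off capture first. A rejection is mapped to false so the
  // renderer can still fall back to the getUserMedia path.
  const capture = startNativeCapture().catch(() => false);
  // Start opening the overlay without waiting for capture to finish.
  const overlay = openOverlay();
  const [nativeCapture] = await Promise.all([capture, overlay]);
  return { nativeCapture };
}
```

The returned flag is one possible way to tell the renderer which path is active; the PR itself describes the renderer checking the capturer status after it mounts.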
The mic green dot appears when the hotkey is pressed and disappears when the whisper overlay closes: the `stopEngine` command is sent to the capturer to release the AVAudioEngine.

New File: `src/native/audio-capturer.swift`

A persistent CLI process that communicates via JSON-over-stdin/stdout (same pattern as `whisper-transcriber` serve mode):

| Command | Response |
|---|---|
| `warmup` | `{"ready":true}` |
| `start` | `{"recording":true}` |
| `stop` | `{"file":"...","duration":2}` |
| `snapshot` | `{"file":"...","duration":1}` |
| `stopEngine` | `{"stopped":true}` |
| `meter` | `{"meter":{"average":0.3,"peak":0.5}}` |
| `exit` | |

Key design decisions:
Main Process Changes (`src/main/main.ts`)

- `AudioCapturer` module: Process lifecycle management (`warmAudioCapturer`, `killAudioCapturer`, `startNativeAudioCapture`, `stopNativeAudioCapture`, `takeNativeAudioSnapshot`), same pattern as the whisper.cpp server manager
- `audio-capturer-warmup`, `audio-capturer-start`, `audio-capturer-stop`, `audio-capturer-snapshot`, `audio-capturer-meter`, `audio-capturer-status`
- `whisper-transcribe-file`: Transcribes a WAV file by path. For whisper.cpp, sends the file path directly to the persistent server (avoids reading into a Node buffer then writing again)
- `killAudioCapturer()` in `will-quit`; `stopEngine` command when whisper overlay closes

Renderer Changes (`src/renderer/src/SuperCmdWhisper.tsx`)

- `startListening`: Checks `audioCapturerStatus()`. If the native capturer is already recording (started by main process on hotkey press), skips `getUserMedia` entirely
- `startNativeVisualizer` / `stopNativeVisualizer`: Polls `audioCapturerMeter()` for wave bar animation instead of Web Audio `AnalyserNode`
- `startNativePeriodicTranscription`: Uses `audioCapturerSnapshot()` + `whisperTranscribeFile()` for live partial transcriptions while the user is still speaking
- `finalizeAndClose`: Stops the native capturer, gets the WAV file, transcribes it, and pastes the result with the same paste-and-refine logic
- Falls back to the `getUserMedia` path if the native capturer isn't available or fails

Prior Optimizations (Included in This PR)
These were implemented in earlier iterations and are part of this diff:
- `serve` subcommand with JSON-over-stdin/stdout protocol
- `setTimeout`-chain fires at 1s (vs 3.5s `setInterval`)

Files Changed
- `src/native/audio-capturer.swift`
- `src/main/main.ts`: `whisper-transcribe-file`, `whisper-transcriber` serve mode
- `src/main/preload.ts`
- `src/renderer/types/electron.d.ts`
- `src/renderer/src/SuperCmdWhisper.tsx`
- `src/native/whisper-transcriber.swift`: `serve` subcommand
- `scripts/build-native.mjs`: `audio-capturer` to build list
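For readers unfamiliar with the JSON-over-stdin/stdout pattern used by `audio-capturer.swift`, the sketch below shows one way the main process could decode the helper's responses from stdout chunks. The response shapes are taken from the examples in this description. Everything else is an assumption: the PR does not state how messages are framed, so newline-delimited JSON is assumed here, and `CapturerResponse` and `LineJsonDecoder` are hypothetical names.

```typescript
// Response shapes as shown in this PR description; the union name is hypothetical.
type CapturerResponse =
  | { ready: true }
  | { recording: true }
  | { file: string; duration: number }
  | { stopped: true }
  | { meter: { average: number; peak: number } };

// Assumption: the helper writes one JSON object per line on stdout.
class LineJsonDecoder {
  private buffer = "";

  // Feed a raw stdout chunk; returns every complete message it contained.
  push(chunk: string): CapturerResponse[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""), so keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    const messages: CapturerResponse[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        messages.push(JSON.parse(trimmed) as CapturerResponse);
      } catch {
        // Ignore lines that are not valid JSON (for example stray log output).
      }
    }
    return messages;
  }
}
```

Buffering the trailing partial line matters because a pipe can split a message across chunks, so a `meter` reply may arrive in two pieces.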