feat(whisper): native AVAudioEngine capture bypasses renderer getUserMedia #448

Merged
shobhit99 merged 1 commit into SuperCmdLabs:main from monotykamary:feat/native-audio-capture on May 28, 2026
@monotykamary monotykamary commented May 26, 2026

Overview

Eliminate the 200–500 ms getUserMedia startup latency for whisper dictation by capturing microphone audio natively via AVAudioEngine in a persistent Swift helper process, bypassing the browser audio stack entirely.

The Problem

When the whisper dictation hotkey is pressed, the startup path goes through the Electron renderer:

Hotkey → main.ts → IPC to renderer → React render → window.open()
→ React render → SuperCmdWhisper mounts → getUserMedia (200-500ms)
→ AudioContext → PCM capture → resolveSessionConfig IPC → warmup IPC
→ first transcription at 3.5s

The biggest bottleneck is getUserMedia: the browser has to negotiate with the OS audio subsystem through multiple abstraction layers, re-validate permissions, create an AudioContext, and set up the PCM pipeline. This adds 200-500ms before audio capture even begins.

For context, Hex — a native Swift STT transcriber alternative — uses AVAudioEngine directly and starts recording in ~10-30ms because it bypasses all browser abstractions.

The Solution

A new native Swift helper (audio-capturer.swift) that talks to Core Audio directly. The main process starts it on hotkey press, in parallel with opening the renderer overlay. By the time the renderer mounts and checks the capturer status, the mic is already recording:

Hotkey pressed
  ├── main.ts: warmAudioCapturer() + startNativeAudioCapture() (~30ms)
  │     AVAudioEngine starts → ring buffer capturing PCM
  │
  └── main.ts → IPC to renderer → React render → window.open()
        → React render → SuperCmdWhisper mounts
        → checks audioCapturerStatus().recording → TRUE
        → hooks into native capture (no getUserMedia needed!)
        → native meter polling for visualizer
        → key released → audioCapturerStop() → WAV file
        → whisperTranscribeFile() → paste result

Total time from hotkey press to audio capture: ~30ms (down from 200-500ms).
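
The parallel start described above can be sketched roughly as follows. This is a minimal illustration, not the code in main.ts: the HotkeyDeps interface and onHotkeyPress are hypothetical names, and the stubs only record call order instead of touching Electron or the Swift helper.

```typescript
// Hypothetical dependency shape; the real main.ts functions may differ.
interface HotkeyDeps {
  // Stands in for warmAudioCapturer() + startNativeAudioCapture().
  startCapture: () => Promise<void>;
  // Stands in for the IPC + React render + window.open() path.
  openOverlay: () => Promise<void>;
}

// Kick off both branches before awaiting either one, so the slow
// overlay path cannot delay the start of audio capture.
function onHotkeyPress(deps: HotkeyDeps): Promise<void> {
  const capture = deps.startCapture();
  const overlay = deps.openOverlay();
  return Promise.all([capture, overlay]).then(() => undefined);
}

// Usage with stub dependencies that only record the call order.
const callLog: string[] = [];
const hotkeyDone = onHotkeyPress({
  startCapture: () => {
    callLog.push("startCapture");
    return Promise.resolve();
  },
  openOverlay: () => {
    callLog.push("openOverlay");
    return Promise.resolve();
  },
});
```

Both branches are invoked synchronously inside onHotkeyPress, so capture is already in flight by the time the overlay path begins its renders.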

The mic green dot appears when the hotkey is pressed and disappears when the whisper overlay closes — the stopEngine command is sent to the capturer to release the AVAudioEngine.

New File: src/native/audio-capturer.swift

A persistent CLI process that communicates via JSON-over-stdin/stdout (same pattern as whisper-transcriber serve mode):

Command    | Description                               | Response
warmup     | Start AVAudioEngine (mic hot)             | {"ready":true}
start      | Begin capturing to ring buffer            | {"recording":true}
stop       | Stop and write WAV file                   | {"file":"...","duration":2}
snapshot   | Write ring buffer to WAV (keep recording) | {"file":"...","duration":1}
stopEngine | Stop AVAudioEngine (mic cold)             | {"stopped":true}
meter      | Current audio level                       | {"meter":{"average":0.3,"peak":0.5}}
exit       | Clean shutdown                            |
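
As a rough sketch of how the main process might talk to this helper, the snippet below encodes commands and parses responses from stdout chunks. The response fields come from the command table above; the request shape ({"command": ...}), the one-object-per-line framing, and the names encodeCommand and ResponseLineParser are assumptions made for illustration, not taken from the PR.

```typescript
type CapturerCommand =
  | "warmup" | "start" | "stop" | "snapshot" | "stopEngine" | "meter" | "exit";

// Response fields as listed in the command table.
interface CapturerResponse {
  ready?: boolean;
  recording?: boolean;
  stopped?: boolean;
  file?: string;
  duration?: number;
  meter?: { average: number; peak: number };
}

// Assumed request framing: one JSON object per line.
function encodeCommand(command: CapturerCommand): string {
  return JSON.stringify({ command }) + "\n";
}

// Accumulates stdout chunks and returns one parsed response per complete line.
class ResponseLineParser {
  private buffer = "";

  push(chunk: string): CapturerResponse[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line, or "" if the chunk ended on "\n".
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as CapturerResponse);
  }
}

// Usage: a response split across two stdout chunks is still parsed once.
const parser = new ResponseLineParser();
const firstBatch = parser.push('{"ready":true}\n{"meter":{"aver');
const secondBatch = parser.push('age":0.3,"peak":0.5}}\n');
```

Buffering partial lines matters because a pipe can deliver a response in several chunks, or several responses in one chunk.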

Key design decisions:

  • Ring buffer: 30 seconds of 16kHz mono PCM in memory — enables pre-roll capture and snapshots
  • WAV output: 16-bit PCM at 16kHz, directly consumable by whisper.cpp server
  • AVAudioConverter: Handles arbitrary input sample rates/channels → 16kHz mono
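
To make the ring buffer and WAV decisions concrete, here is a small TypeScript sketch of the same idea. The actual helper is written in Swift and its internals are not shown in this PR, so PcmRingBuffer and encodeWav are hypothetical names and the layout is only one plausible way to do it. At 16kHz mono, 30 seconds is 480,000 samples (about 960 KB as 16-bit PCM).

```typescript
const SAMPLE_RATE = 16000; // 16kHz mono, per the design notes
const BUFFER_SECONDS = 30;

// Fixed-capacity ring buffer of 16-bit samples; the oldest audio is overwritten.
class PcmRingBuffer {
  private readonly data: Int16Array;
  private writeIndex = 0;
  private filled = 0;

  constructor(capacity: number = SAMPLE_RATE * BUFFER_SECONDS) {
    this.data = new Int16Array(capacity);
  }

  write(samples: Int16Array): void {
    for (let i = 0; i < samples.length; i++) {
      this.data[this.writeIndex] = samples[i];
      this.writeIndex = (this.writeIndex + 1) % this.data.length;
    }
    this.filled = Math.min(this.filled + samples.length, this.data.length);
  }

  // Copy out buffered samples in chronological order without clearing them,
  // which is what lets a snapshot be taken while recording continues.
  snapshot(): Int16Array {
    const out = new Int16Array(this.filled);
    const start = (this.writeIndex - this.filled + this.data.length) % this.data.length;
    for (let i = 0; i < this.filled; i++) {
      out[i] = this.data[(start + i) % this.data.length];
    }
    return out;
  }
}

// Wrap mono 16-bit PCM in a canonical 44-byte RIFF/WAVE header.
function encodeWav(samples: Int16Array, sampleRate: number = SAMPLE_RATE): Uint8Array {
  const dataBytes = samples.length * 2;
  const view = new DataView(new ArrayBuffer(44 + dataBytes));
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // remaining file size
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // format 1 = PCM
  view.setUint16(22, 1, true);              // channels = 1 (mono)
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * channels * 2
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataBytes, true);
  samples.forEach((sample, i) => view.setInt16(44 + i * 2, sample, true));
  return new Uint8Array(view.buffer);
}

// Usage with a tiny capacity so the wrap-around is easy to see.
const ring = new PcmRingBuffer(4);
ring.write(Int16Array.from([1, 2, 3]));
ring.write(Int16Array.from([4, 5, 6]));
const latest = ring.snapshot(); // holds the four most recent samples
const wavBytes = encodeWav(latest);
```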

Main Process Changes (src/main/main.ts)

  • AudioCapturer module: Process lifecycle management (warmAudioCapturer, killAudioCapturer, startNativeAudioCapture, stopNativeAudioCapture, takeNativeAudioSnapshot), same pattern as the whisper.cpp server manager
  • Speak-toggle hotkey: Starts native audio capture immediately on press (both standard and Fn-only paths)
  • New IPC handlers: audio-capturer-warmup, audio-capturer-start, audio-capturer-stop, audio-capturer-snapshot, audio-capturer-meter, audio-capturer-status
  • whisper-transcribe-file: Transcribes a WAV file by path — for whisper.cpp, sends the file path directly to the persistent server (avoids reading the file into a Node buffer and then writing it out again)
  • Cleanup: killAudioCapturer() in will-quit; stopEngine command when whisper overlay closes

Renderer Changes (src/renderer/src/SuperCmdWhisper.tsx)

  • Native capture fast path in startListening: Checks audioCapturerStatus() — if the native capturer is already recording (started by main process on hotkey press), skips getUserMedia entirely
  • startNativeVisualizer/stopNativeVisualizer: Polls audioCapturerMeter() for wave bar animation instead of Web Audio AnalyserNode
  • startNativePeriodicTranscription: Uses audioCapturerSnapshot() + whisperTranscribeFile() for live partial transcriptions while the user is still speaking
  • Native capture finalize path in finalizeAndClose: Stops the native capturer, gets the WAV file, transcribes it, and pastes the result with the same paste-and-refine logic
  • Full backward compatibility: Falls back to the getUserMedia path if the native capturer isn't available or fails
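
The fast-path-with-fallback decision in startListening could look roughly like the sketch below. The CapturerStatus shape is an assumption (the PR only shows that .recording is checked), and selectCapturePath and resolveCapturePath are hypothetical names rather than functions from SuperCmdWhisper.tsx.

```typescript
type CapturePath = "native" | "getUserMedia";

// Assumed status shape: the PR only shows that `.recording` is checked.
interface CapturerStatus {
  recording: boolean;
}

// Pure decision: use native capture only when it is confirmed to be recording.
function selectCapturePath(status: CapturerStatus | null): CapturePath {
  return status !== null && status.recording ? "native" : "getUserMedia";
}

// Treat a failed status query the same as "native capture unavailable",
// so any error falls back to the getUserMedia path.
async function resolveCapturePath(
  getStatus: () => Promise<CapturerStatus>,
): Promise<CapturePath> {
  try {
    return selectCapturePath(await getStatus());
  } catch {
    return selectCapturePath(null);
  }
}

// Usage: the three cases the renderer has to handle.
const pathWhenRecording = selectCapturePath({ recording: true });
const pathWhenIdle = selectCapturePath({ recording: false });
const pathWhenUnavailable = selectCapturePath(null);
```

Keeping the decision in a pure function makes the fallback rule easy to test without a running capturer.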

Prior Optimizations (Included in This PR)

These were implemented in earlier iterations and are part of this diff:

  • Persistent whisper.cpp server: Model stays loaded in memory; serve subcommand with JSON-over-stdin/stdout protocol
  • Immediate first periodic transcription: setTimeout-chain fires at 1s (vs 3.5s setInterval)
  • Non-blocking AI transcript refinement: Raw transcript pasted immediately; AI refinement runs async and replaces in-place if different
  • Parallel getUserMedia + resolveSessionConfig: Both run concurrently instead of serially
  • Session config caching: 10s TTL avoids redundant IPC round-trips
  • Whisper.cpp server warmup: Kicked off on component mount so the model is loaded by the time transcription runs

Files Changed

File                                 | Change
src/native/audio-capturer.swift      | New — Native AVAudioEngine capture helper
src/main/main.ts                     | AudioCapturer module, IPC handlers, hotkey integration, whisper-transcribe-file, whisper-transcriber serve mode
src/main/preload.ts                  | Bridge for new IPCs
src/renderer/types/electron.d.ts     | Type declarations
src/renderer/src/SuperCmdWhisper.tsx | Native capture fast path, visualizer, periodic transcription, finalize
src/native/whisper-transcriber.swift | Persistent serve subcommand
scripts/build-native.mjs             | Added audio-capturer to build list

…Media

Add a native Swift audio-capturer helper that uses AVAudioEngine to
capture microphone audio directly from the main process, eliminating
200-500ms of browser getUserMedia/AudioContext negotiation latency.

The main process now starts recording immediately on hotkey press,
in parallel with opening the renderer overlay. The renderer's
SuperCmdWhisper component detects the already-running native capture
and hooks into it for the visualizer and transcription, falling back
to the getUserMedia path if native capture isn't available.

Also includes prior optimizations:
- Persistent whisper.cpp server (model stays loaded in memory)
- setTimeout-chain for first transcription at 1s (vs 3.5s setInterval)
- Non-blocking AI transcript refinement (paste raw immediately, refine async)
- Parallel getUserMedia + resolveSessionConfig
- Session config caching (10s TTL)
- whisper.cpp server warmup on component mount

shobhit99 merged commit 14967c0 into SuperCmdLabs:main on May 28, 2026
1 check failed