feat(whisper): native AVAudioEngine capture bypasses renderer getUserMedia by monotykamary · Pull Request #448 · SuperCmdLabs/SuperCmd

monotykamary · 2026-05-26T02:36:02Z

Overview

Eliminate the 200–500 ms getUserMedia startup latency for whisper dictation by capturing microphone audio natively via AVAudioEngine in a persistent Swift helper process, bypassing the browser audio stack entirely.

The Problem

When the whisper dictation hotkey is pressed, the startup path goes through the Electron renderer:

Hotkey → main.ts → IPC to renderer → React render → window.open()
→ React render → SuperCmdWhisper mounts → getUserMedia (200-500ms)
→ AudioContext → PCM capture → resolveSessionConfig IPC → warmup IPC
→ first transcription at 3.5s

The biggest bottleneck is getUserMedia — the browser has to negotiate with the OS audio subsystem through multiple abstraction layers, re-validate permissions, create an AudioContext, set up the PCM pipeline, and so on. This adds 200-500ms before audio capture even begins.

For context, Hex — a native Swift STT transcriber alternative — uses AVAudioEngine directly and starts recording in ~10-30ms because it bypasses all browser abstractions.

The Solution

A new native Swift helper (audio-capturer.swift) that talks to Core Audio directly. The main process starts it on hotkey press, in parallel with opening the renderer overlay. By the time the renderer mounts and checks the capturer status, the mic is already recording:

Hotkey pressed
  ├── main.ts: warmAudioCapturer() + startNativeAudioCapture() (~30ms)
  │     AVAudioEngine starts → ring buffer capturing PCM
  │
  └── main.ts → IPC to renderer → React render → window.open()
        → React render → SuperCmdWhisper mounts
        → checks audioCapturerStatus().recording → TRUE
        → hooks into native capture (no getUserMedia needed!)
        → native meter polling for visualizer
        → key released → audioCapturerStop() → WAV file
        → whisperTranscribeFile() → paste result

Total time from hotkey press to audio capture: ~30ms (down from 200-500ms).
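
The parallel start shown in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `HotkeyDeps`, `openWhisperOverlay`, and `onWhisperHotkey` are hypothetical names, and only `startNativeAudioCapture` is taken from the description. The point is that both branches are kicked off before either is awaited, so the microphone does not wait behind window creation.

```typescript
// Hypothetical dependency shape (assumption); the real functions live in src/main/main.ts.
interface HotkeyDeps {
  startNativeAudioCapture(): Promise<void>;
  openWhisperOverlay(): Promise<void>;
}

// Start both branches before awaiting either one, so capture begins while
// the renderer overlay is still being created and mounted.
async function onWhisperHotkey(
  deps: HotkeyDeps,
): Promise<{ captureOk: boolean; overlayOk: boolean }> {
  // Each branch is converted to a boolean so one failure cannot reject the other.
  const capture = deps.startNativeAudioCapture().then(() => true, () => false);
  const overlay = deps.openWhisperOverlay().then(() => true, () => false);
  const [captureOk, overlayOk] = await Promise.all([capture, overlay]);
  return { captureOk, overlayOk };
}
```

If `captureOk` comes back `false`, the renderer would be expected to take the getUserMedia fallback path described later in this PR.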

The mic green dot appears when the hotkey is pressed and disappears when the whisper overlay closes — the stopEngine command is sent to the capturer to release the AVAudioEngine.

New File: `src/native/audio-capturer.swift`

A persistent CLI process that communicates via JSON-over-stdin/stdout (same pattern as whisper-transcriber serve mode):

| Command | Description | Response |
| --- | --- | --- |
| `warmup` | Start AVAudioEngine (mic hot) | `{"ready":true}` |
| `start` | Begin capturing to ring buffer | `{"recording":true}` |
| `stop` | Stop and write WAV file | `{"file":"...","duration":2}` |
| `snapshot` | Write ring buffer to WAV (keep recording) | `{"file":"...","duration":1}` |
| `stopEngine` | Stop AVAudioEngine (mic cold) | `{"stopped":true}` |
| `meter` | Current audio level | `{"meter":{"average":0.3,"peak":0.5}}` |
| `exit` | Clean shutdown | (none) |
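
For readers unfamiliar with the JSON-over-stdin/stdout pattern, here is a minimal sketch of the client side. Two things are assumptions, since the description does not specify them: that messages are newline-delimited, and that a request looks like `{"command":"warmup"}`. `encodeCommand` and `JsonLineDecoder` are hypothetical names. The response shapes in the table are the only part taken from the PR.

```typescript
type CapturerCommand =
  | "warmup" | "start" | "stop" | "snapshot" | "stopEngine" | "meter" | "exit";

// Assumed request envelope: one JSON object per line on the helper's stdin.
function encodeCommand(command: CapturerCommand): string {
  return JSON.stringify({ command }) + "\n";
}

// stdout arrives in arbitrary chunks, so buffer until a full line is available.
class JsonLineDecoder {
  private buffered = "";

  // Returns every complete JSON message contained in the data seen so far.
  push(chunk: string): unknown[] {
    this.buffered += chunk;
    const lines = this.buffered.split("\n");
    // The last element is an incomplete line (or ""), so keep it for later.
    this.buffered = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as unknown);
  }
}
```

A caller would write `encodeCommand("start")` to the child's stdin and feed each stdout chunk to `push`. Matching responses to requests (for example with a FIFO queue) is left out of this sketch.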

Key design decisions:

Ring buffer: 30 seconds of 16kHz mono PCM in memory — enables pre-roll capture and snapshots
WAV output: 16-bit PCM at 16kHz, directly consumable by whisper.cpp server
AVAudioConverter: Handles arbitrary input sample rates/channels → 16kHz mono
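
To make the ring buffer and WAV decisions concrete: 30 seconds at 16 kHz mono is 480,000 samples, or 960,000 bytes if stored as 16-bit integers. The sketch below illustrates the technique in TypeScript for consistency with the rest of the codebase. The actual helper is written in Swift, and its internal sample format and layout are not described here, so `PcmRingBuffer` and `encodeWav` are hypothetical names, not a port.

```typescript
const SAMPLE_RATE = 16000;
const CAPACITY_SECONDS = 30;

// Fixed-size circular buffer: once full, new samples overwrite the oldest.
class PcmRingBuffer {
  private readonly data: Int16Array;
  private writeIndex = 0;
  private filled = 0;

  constructor(capacitySamples: number = SAMPLE_RATE * CAPACITY_SECONDS) {
    this.data = new Int16Array(capacitySamples);
  }

  write(samples: Int16Array): void {
    for (let i = 0; i < samples.length; i++) {
      this.data[this.writeIndex] = samples[i];
      this.writeIndex = (this.writeIndex + 1) % this.data.length;
    }
    this.filled = Math.min(this.filled + samples.length, this.data.length);
  }

  // Copies the buffered audio in chronological order without stopping capture.
  snapshot(): Int16Array {
    const out = new Int16Array(this.filled);
    const start = (this.writeIndex - this.filled + this.data.length) % this.data.length;
    for (let i = 0; i < this.filled; i++) {
      out[i] = this.data[(start + i) % this.data.length];
    }
    return out;
  }
}

// Wraps 16-bit mono PCM in a canonical 44-byte RIFF/WAVE header.
function encodeWav(samples: Int16Array, sampleRate: number = SAMPLE_RATE): Uint8Array {
  const dataBytes = samples.length * 2;
  const view = new DataView(new ArrayBuffer(44 + dataBytes));
  const ascii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };
  ascii(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // remaining file size
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true);            // fmt chunk size
  view.setUint16(20, 1, true);             // format 1 = integer PCM
  view.setUint16(22, 1, true);             // channels = mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = rate * channels * 2
  view.setUint16(32, 2, true);             // block align
  view.setUint16(34, 16, true);            // bits per sample
  ascii(36, "data");
  view.setUint32(40, dataBytes, true);
  for (let i = 0; i < samples.length; i++) view.setInt16(44 + i * 2, samples[i], true);
  return new Uint8Array(view.buffer);
}
```

In this sketch a `snapshot` is just a chronological copy of the buffer, which is why it can run while capture continues.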

Main Process Changes (`src/main/main.ts`)

AudioCapturer module: Process lifecycle management (warmAudioCapturer, killAudioCapturer, startNativeAudioCapture, stopNativeAudioCapture, takeNativeAudioSnapshot), same pattern as the whisper.cpp server manager
Speak-toggle hotkey: Starts native audio capture immediately on press (both standard and Fn-only paths)
New IPC handlers: audio-capturer-warmup, audio-capturer-start, audio-capturer-stop, audio-capturer-snapshot, audio-capturer-meter, audio-capturer-status
whisper-transcribe-file: Transcribes a WAV file by path — for whisper.cpp, sends the file path directly to the persistent server (avoids reading into Node buffer then writing again)
Cleanup: killAudioCapturer() in will-quit; stopEngine command when whisper overlay closes
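
Because both the hotkey path and the `audio-capturer-warmup` IPC handler can request a warm helper, the lifecycle manager presumably has to avoid spawning it twice. One common way to do that is to share a single in-flight promise. The sketch below shows that pattern only. `createWarmOnce` is a hypothetical helper, and whether `warmAudioCapturer` is actually implemented this way is an assumption.

```typescript
// Wraps a spawn function so that concurrent callers share one startup attempt.
function createWarmOnce<T>(spawn: () => Promise<T>): {
  warm(): Promise<T>;
  reset(): void;
} {
  let inflight: Promise<T> | null = null;

  return {
    warm(): Promise<T> {
      if (inflight !== null) return inflight;
      const attempt: Promise<T> = spawn().catch((err: unknown) => {
        // Forget a failed attempt so the next call can retry.
        if (inflight === attempt) inflight = null;
        throw err;
      });
      inflight = attempt;
      return attempt;
    },
    // Call after the helper exits or is killed so the next warm() respawns it.
    reset(): void {
      inflight = null;
    },
  };
}
```

In this shape, a `killAudioCapturer`-style cleanup would call `reset()` after terminating the child process.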

Renderer Changes (`src/renderer/src/SuperCmdWhisper.tsx`)

Native capture fast path in startListening: Checks audioCapturerStatus() — if the native capturer is already recording (started by main process on hotkey press), skips getUserMedia entirely
startNativeVisualizer/stopNativeVisualizer: Polls audioCapturerMeter() for wave bar animation instead of Web Audio AnalyserNode
startNativePeriodicTranscription: Uses audioCapturerSnapshot() + whisperTranscribeFile() for live partial transcriptions while the user is still speaking
Native capture finalize path in finalizeAndClose: Stops the native capturer, gets the WAV file, transcribes it, and pastes the result with the same paste-and-refine logic
Full backward compatibility: Falls back to the getUserMedia path if the native capturer isn't available or fails
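
The fast path and its fallback reduce to one decision, sketched below under stated assumptions. `audioCapturerStatus` and its `recording` field come from the description. `CaptureDeps`, `startGetUserMedia`, `selectCapturePath`, and `startListeningSketch` are hypothetical names, and the real `startListening` does considerably more.

```typescript
interface CapturerStatus {
  recording: boolean;
}

type CapturePath = "native" | "getUserMedia";

// Pure decision: use native capture only when the helper is already recording.
function selectCapturePath(status: CapturerStatus | null): CapturePath {
  return status !== null && status.recording ? "native" : "getUserMedia";
}

// Hypothetical dependencies standing in for the preload bridge and browser APIs.
interface CaptureDeps {
  audioCapturerStatus(): Promise<CapturerStatus>;
  startGetUserMedia(): Promise<void>;
}

async function startListeningSketch(deps: CaptureDeps): Promise<CapturePath> {
  let status: CapturerStatus | null = null;
  try {
    status = await deps.audioCapturerStatus();
  } catch {
    // A missing or crashed helper is treated the same as "not recording".
    status = null;
  }
  const path = selectCapturePath(status);
  if (path === "getUserMedia") {
    await deps.startGetUserMedia();
  }
  return path;
}
```

Treating a thrown status check the same as "not recording" means users without the native helper would see the pre-existing behavior.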

Prior Optimizations (Included in This PR)

These were implemented in earlier iterations and are part of this diff:

Persistent whisper.cpp server: Model stays loaded in memory; serve subcommand with JSON-over-stdin/stdout protocol
Immediate first periodic transcription: setTimeout-chain fires at 1s (vs 3.5s setInterval)
Non-blocking AI transcript refinement: Raw transcript pasted immediately; AI refinement runs async and replaces in-place if different
Parallel getUserMedia + resolveSessionConfig: Both run concurrently instead of serially
Session config caching: 10s TTL avoids redundant IPC round-trips
Whisper.cpp server warmup: Kicked off on component mount so the model is loaded by the time transcription runs
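
Two of these optimizations are easy to show in isolation: the TTL cache and the setTimeout chain. The sketch below is illustrative rather than the code in this diff. `TtlCache` and `startTranscriptionChain` are hypothetical names, and reusing 3.5 s as the delay between later ticks is an assumption. The description only states that the first tick fires at 1 s.

```typescript
// Caches one value for ttlMs milliseconds; the clock is injectable for testing.
class TtlCache<T> {
  private readonly ttlMs: number;
  private readonly now: () => number;
  private entry: { value: T; expiresAt: number } | null = null;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(load: () => T): T {
    const t = this.now();
    if (this.entry !== null && t < this.entry.expiresAt) return this.entry.value;
    const value = load();
    this.entry = { value, expiresAt: t + this.ttlMs };
    return value;
  }
}

// Unlike setInterval, a chain lets the first delay differ from later ones and
// never overlaps ticks, because the next tick is scheduled after the previous ends.
function startTranscriptionChain(
  tick: () => Promise<void>,
  firstDelayMs: number = 1000,
  repeatDelayMs: number = 3500, // assumption: later cadence is not stated in the PR
  schedule: (fn: () => void, ms: number) => unknown = (fn, ms) => setTimeout(fn, ms),
): () => void {
  let cancelled = false;
  const run = async (): Promise<void> => {
    if (cancelled) return;
    try {
      await tick();
    } catch {
      // A failed partial transcription should not stop later attempts.
    }
    if (!cancelled) schedule(() => { void run(); }, repeatDelayMs);
  };
  schedule(() => { void run(); }, firstDelayMs);
  return () => { cancelled = true; };
}
```

For the session config, `T` would typically be a promise, for example `new TtlCache<Promise<SessionConfig>>(10000)` with a hypothetical `SessionConfig` type. A real implementation would also need to evict a cached promise that rejects, which this sketch does not do.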

Files Changed

| File | Change |
| --- | --- |
| `src/native/audio-capturer.swift` | New: native AVAudioEngine capture helper |
| `src/main/main.ts` | AudioCapturer module, IPC handlers, hotkey integration, `whisper-transcribe-file`, `whisper-transcriber` serve mode |
| `src/main/preload.ts` | Bridge for new IPCs |
| `src/renderer/types/electron.d.ts` | Type declarations |
| `src/renderer/src/SuperCmdWhisper.tsx` | Native capture fast path, visualizer, periodic transcription, finalize |
| `src/native/whisper-transcriber.swift` | Persistent `serve` subcommand |
| `scripts/build-native.mjs` | Added `audio-capturer` to build list |

…Media

Add a native Swift audio-capturer helper that uses AVAudioEngine to capture microphone audio directly from the main process, eliminating 200-500ms of browser getUserMedia/AudioContext negotiation latency.

The main process now starts recording immediately on hotkey press, in parallel with opening the renderer overlay. The renderer's SuperCmdWhisper component detects the already-running native capture and hooks into it for the visualizer and transcription, falling back to the getUserMedia path if native capture isn't available.

Also includes prior optimizations:
- Persistent whisper.cpp server (model stays loaded in memory)
- setTimeout-chain for first transcription at 1s (vs 3.5s setInterval)
- Non-blocking AI transcript refinement (paste raw immediately, refine async)
- Parallel getUserMedia + resolveSessionConfig
- Session config caching (10s TTL)
- whisper.cpp server warmup on component mount

shobhit99 merged commit 14967c0 into SuperCmdLabs:main May 28, 2026
1 check failed

shobhit99 merged 1 commit into SuperCmdLabs:main from monotykamary:feat/native-audio-capture
