Local-first voice processing pipeline that turns audio recordings into speaker-labeled transcripts and AI-powered meeting summaries. Built for Apple Silicon, with CUDA fallback.
The entire pipeline runs locally by default. Transcription, diarization, speaker identification, summarization, and voice analysis all run on your hardware. Audio files are never uploaded anywhere. No cloud account required.
Optionally, you can enable cloud AI (Claude) for higher-quality speaker identification and summaries. Even then, only text is sent — never audio.
- Fully local pipeline — every stage runs on your machine out of the box. Ollama handles speaker inference and summaries locally. No API keys needed for the core workflow. Your recordings never leave your personal computer — Mac or PC.
- Multilingual Whisper — auto-detects language per speaker. Multiple languages in the same meeting? Each speaker gets transcribed in their own language and merged chronologically.
- Full-day recordings — handles any length — from a few-second voice memo to 8+ hour all-day recordings. No minimum duration. Auto-splits long recordings into individual meetings at natural boundaries.
- Multi-meeting detection — a single recording containing back-to-back meetings gets split, each with its own transcript and summary.
- Every audio format: `.m4a`, `.mp3`, `.wav`, `.ogg`, `.flac`, `.mp4`, `.mov`, `.webm`, `.opus`, `.aac`, and anything else ffmpeg supports.
- Speaker diarization: pyannote.audio 3.1 separates who said what. Works with 2-10+ speakers.
- AI speaker identification: maps generic `SPEAKER_00` labels to real names using dialogue context, calendar integration, and a knowledge base. Runs locally via Ollama, or optionally via Claude for higher accuracy.
- Advanced summarization: structured meeting summaries with an executive overview, key decisions, action items with owners and deadlines, discussion points, and emotional sentiment analysis.
- Voice normalization & pause reduction — Silero-VAD strips silence and non-speech segments before transcription. Reduces processing time and eliminates Whisper hallucinations caused by long pauses.
- Transcription speed — on Apple M1, MLX-Whisper large-v3 transcribes at ~6-10x real-time speed. A 1-hour meeting transcribes in 6-10 minutes. Medium model runs at ~15-20x real-time.
- Optional cloud AI — enable Claude (Sonnet/Opus) for premium speaker inference and summaries when you want the best quality. Only transcript text is sent — never audio.
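The feature list does not say how "natural boundaries" between meetings are found. One plausible approach, sketched here purely as an assumption and not as the project's actual logic, is to split wherever the silence between VAD speech segments exceeds a threshold. The name `split_on_gaps` and the 5-minute `min_gap` default are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of detected speech


def split_on_gaps(segments: List[Segment], min_gap: float = 300.0) -> List[List[Segment]]:
    """Group speech segments into meetings, splitting where silence >= min_gap."""
    meetings: List[List[Segment]] = []
    for seg in sorted(segments):
        # Start a new meeting for the first segment, or when the pause
        # since the previous segment's end is long enough.
        if not meetings or seg[0] - meetings[-1][-1][1] >= min_gap:
            meetings.append([seg])
        else:
            meetings[-1].append(seg)
    return meetings


speech = [(0.0, 600.0), (605.0, 1800.0), (2400.0, 3000.0)]
meetings = split_on_gaps(speech)
print(len(meetings))  # 2
```

The 5-second pause stays inside the first meeting, while the 10-minute gap starts a second one.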
By default, nothing leaves your machine. The entire pipeline is local.
| Component | Runs locally | Cloud option |
|---|---|---|
| Transcription (MLX-Whisper / OpenAI Whisper) | Yes | — |
| Speaker diarization (pyannote.audio) | Yes | — |
| Voice Activity Detection (Silero-VAD) | Yes | — |
| Audio analysis (ffprobe + librosa) | Yes | — |
| Speaker inference | Yes (Ollama) | Claude Sonnet (sends transcript text, opt-in) |
| Summary generation | Yes (Ollama) | Claude Opus (sends transcript text, opt-in) |
Default (fully local):
ollama pull qwen2.5:7b
python unified_pipeline.py meeting.m4a --local-summary
With optional cloud AI (higher-quality summaries):
# Set ANTHROPIC_API_KEY in .env, then:
python unified_pipeline.py meeting.m4a
- Python 3.10+
- ffmpeg (`brew install ffmpeg` on macOS)
git clone https://github.com/mavliev/myVoiceNotes.git
cd myVoiceNotes
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your tokens
| Variable | Required | Purpose |
|---|---|---|
| `HF_TOKEN` | No | Hugging Face token for pyannote speaker diarization. Without it, diarization is skipped (transcription-only mode). Get one from huggingface.co/settings/tokens and accept the licenses for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0. |
| `ANTHROPIC_API_KEY` | No | Claude API key for speaker inference (Sonnet) and summaries. Skip with `--skip-summary`. |
| `RECORDING_OWNER` | No | Your name; helps speaker inference identify the recorder. |
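As a rough illustration of how these variables gate optional stages, the sketch below reports which features a given environment allows. It assumes the `.env` values have already been loaded into the process environment (for example by a dotenv loader, which is an assumption here), and `available_features` is a hypothetical helper, not part of the project.

```python
import os
from typing import Dict, Mapping


def available_features(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which optional stages the configured variables allow."""
    return {
        # Without HF_TOKEN, diarization is skipped (transcription-only mode).
        "diarization": bool(env.get("HF_TOKEN")),
        # Claude is opt-in; without a key only the local Ollama path applies.
        "cloud_ai": bool(env.get("ANTHROPIC_API_KEY")),
        # RECORDING_OWNER only helps speaker inference; it is never required.
        "owner_hint": bool(env.get("RECORDING_OWNER")),
    }


features = available_features({"HF_TOKEN": "hf_example"})
print(features)
```

Passing a plain dict instead of `os.environ` keeps the check easy to try without touching your real configuration.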
python unified_pipeline.py --preflight
# Basic usage: fully local, auto-detect language
python unified_pipeline.py meeting.m4a --local-summary
# Specify language
python unified_pipeline.py meeting.m4a -l en --local-summary
# Enable cloud AI for higher quality (requires ANTHROPIC_API_KEY in .env)
python unified_pipeline.py meeting.m4a
# Transcription only (fastest, no diarization or summary)
python unified_pipeline.py meeting.m4a --skip-diarization --skip-summary
python unified_pipeline.py meeting.m4a -l en
Pipeline: VAD -> Transcription -> Audio Analysis -> Diarization -> Speaker ID -> Rich Transcript -> Summary
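To make the stage order concrete, here is a minimal sketch of how the skip flags might prune that pipeline. It is not the orchestrator's real code: `planned_stages` is a hypothetical helper, and dropping Speaker ID whenever diarization is skipped is an assumption.

```python
from typing import List

# Stage order as listed in the pipeline line above.
STAGES = [
    "VAD",
    "Transcription",
    "Audio Analysis",
    "Diarization",
    "Speaker ID",
    "Rich Transcript",
    "Summary",
]


def planned_stages(no_vad: bool = False,
                   skip_diarization: bool = False,
                   skip_summary: bool = False) -> List[str]:
    """Return the stages that would run for a given set of flags."""
    skipped = set()
    if no_vad:
        skipped.add("VAD")
    if skip_diarization:
        # Assumption: with no speaker turns there is nothing to name.
        skipped.update({"Diarization", "Speaker ID"})
    if skip_summary:
        skipped.add("Summary")
    return [stage for stage in STAGES if stage not in skipped]


print(" -> ".join(planned_stages(skip_diarization=True, skip_summary=True)))
# VAD -> Transcription -> Audio Analysis -> Rich Transcript
```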
For meetings with speakers in different languages:
python unified_pipeline.py meeting.m4a --multilingual
Diarizes first, detects each speaker's language, then transcribes each speaker in their own language and merges the results chronologically.
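The final merge step can be pictured as interleaving per-speaker segment lists by start time. The sketch below shows one way to do that with the standard library. The `Segment` type and `merge_chronologically` are illustrative assumptions, not the project's data model.

```python
import heapq
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    start: float   # seconds from the start of the recording
    speaker: str
    language: str
    text: str


def merge_chronologically(per_speaker: Dict[str, List[Segment]]) -> List[Segment]:
    """Merge per-speaker lists (each already sorted by start) into one timeline."""
    return list(heapq.merge(*per_speaker.values(), key=lambda seg: seg.start))


per_speaker = {
    "SPEAKER_00": [Segment(5.0, "SPEAKER_00", "en", "Good morning."),
                   Segment(20.0, "SPEAKER_00", "en", "Let's begin.")],
    "SPEAKER_01": [Segment(12.0, "SPEAKER_01", "fr", "Bonjour.")],
}
merged = merge_chronologically(per_speaker)
for seg in merged:
    print(f"{seg.start:>5.1f}s {seg.speaker} [{seg.language}] {seg.text}")
```

`heapq.merge` relies on each input list being sorted already, which holds when every speaker's segments come out of transcription in time order.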
| Model | Speed (M1) | Quality | RAM |
|---|---|---|---|
| `tiny` | ~50x real-time | Low | 2 GB |
| `small` | ~25x real-time | Good | 4 GB |
| `medium` | ~15x real-time | Very good | 8 GB |
| `large-v3` (default) | ~8x real-time | Best | 16 GB |
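The real-time factors translate into wall-clock estimates by simple division. The sketch below copies the approximate M1 figures from the table. `estimated_minutes` and `largest_model_for` are hypothetical helpers for back-of-the-envelope planning, not part of the pipeline.

```python
# Approximate figures copied from the model table (Apple M1).
MODELS = {
    "tiny":     {"speed": 50, "ram_gb": 2},
    "small":    {"speed": 25, "ram_gb": 4},
    "medium":   {"speed": 15, "ram_gb": 8},
    "large-v3": {"speed": 8,  "ram_gb": 16},
}


def estimated_minutes(audio_minutes: float, model: str) -> float:
    """Transcription time = audio length / real-time factor."""
    return audio_minutes / MODELS[model]["speed"]


def largest_model_for(ram_gb: int) -> str:
    """Pick the highest-quality model whose RAM figure fits."""
    fitting = [name for name, spec in MODELS.items() if spec["ram_gb"] <= ram_gb]
    if not fitting:
        raise ValueError(f"no model fits in {ram_gb} GB")
    return fitting[-1]  # dicts keep insertion order: smallest to largest


print(estimated_minutes(60, "large-v3"))  # 7.5
print(estimated_minutes(60, "medium"))    # 4.0
print(largest_model_for(8))               # medium
```

These are estimates only; actual throughput depends on the chip, thermal state, and how much speech remains after VAD.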
python unified_pipeline.py meeting.m4a -m medium # faster
python unified_pipeline.py meeting.m4a -m large-v3 # best quality (default)
# Interactive speaker naming from existing transcript
python name_speakers.py transcript.txt
# Generate summary from existing transcript
python create_summary.py transcript.txt
# Legacy CUDA pipeline (Windows/Linux)
python process_audio.py meeting.m4a -l ru
Use asitop to monitor GPU, CPU, and memory utilization in real time while processing:
# Install
pip install asitop
# Run (requires sudo for powermetrics access)
sudo asitop
In a split terminal, run your transcription in one pane and `sudo asitop` in another. You'll see:
- GPU — Metal GPU utilization during MLX-Whisper transcription (typically 60-90%)
- ANE — Apple Neural Engine usage
- CPU — CPU cluster utilization during diarization (typically 400-700% across efficiency + performance cores)
- Memory — Unified memory bandwidth and pressure
This helps you choose the right Whisper model size for your hardware: if GPU utilization is near 100% and memory pressure is high, step down to the medium model (`-m medium`).
For input meeting.m4a:
| File | Description |
|---|---|
| `meeting_transcript.txt` | Timestamped transcript with speaker labels and metadata header |
| `meeting_detailed.json` | Full metadata: segments, speaker stats, audio info |
| `meeting_SUMMARY.md` | AI-generated summary (if not skipped) |
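The naming pattern in the table can be reproduced with `pathlib`, which is handy when a script needs to locate the results afterwards. This is a sketch that assumes `-o` and `-b` simply swap the directory and the stem. `output_paths` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, Optional


def output_paths(audio: str,
                 output_dir: Optional[str] = None,
                 basename: Optional[str] = None) -> Dict[str, Path]:
    """Predict the three output files for a given input recording."""
    src = Path(audio)
    out_dir = Path(output_dir) if output_dir else src.parent  # default: same as input
    stem = basename or src.stem                               # -b overrides the stem
    return {
        "transcript": out_dir / f"{stem}_transcript.txt",
        "detailed": out_dir / f"{stem}_detailed.json",
        "summary": out_dir / f"{stem}_SUMMARY.md",
    }


paths = output_paths("recordings/meeting.m4a")
for name, path in paths.items():
    print(name, path)
```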
================================================================================
TRANSCRIPTION METADATA
================================================================================
AUDIO INFORMATION
-----------------
File: meeting.m4a
Duration: 00:45:12
Detected Language: Russian (ru, 97.3% confidence)
SPEAKER STATISTICS
------------------
Speaker Time % Words Turns
Alice 18:24 41% 3,241 47
Bob 14:51 33% 2,108 38
================================================================================
[00:00:05] Alice:
Good morning everyone, let's start with the agenda.
[00:00:12] Bob:
Sure, I wanted to raise the deployment timeline first.
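If you want to post-process the transcript, the turn markers are easy to parse. The sketch below assumes every turn starts with a `[HH:MM:SS] Name:` line exactly as in the sample above; the real file may contain variations this regex does not cover, and `parse_turns` is a hypothetical helper.

```python
import re
from typing import List

# Matches turn markers such as "[00:00:05] Alice:".
TURN = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\] (.+):$")


def parse_turns(transcript: str) -> List[dict]:
    """Extract speaker turns; lines before the first marker (the header) are ignored."""
    turns: List[dict] = []
    for raw in transcript.splitlines():
        line = raw.strip()
        match = TURN.match(line)
        if match:
            hours, minutes, seconds, speaker = match.groups()
            start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            turns.append({"start": start, "speaker": speaker, "text": ""})
        elif turns and line:
            # Continuation line: append to the current turn's text.
            turns[-1]["text"] = (turns[-1]["text"] + " " + line).strip()
    return turns


sample = """\
[00:00:05] Alice:
Good morning everyone, let's start with the agenda.
[00:00:12] Bob:
Sure, I wanted to raise the deployment timeline first.
"""
turns = parse_turns(sample)
for turn in turns:
    print(turn["start"], turn["speaker"], "-", turn["text"])
```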
The AI summary includes structured sections:
- Executive Summary — 2-3 sentence overview
- Key Decisions — what was decided, by whom
- Action Items — tasks with owners, deadlines, and priority
- Discussion Points — main topics with positions taken
- Participant Analysis — speaking time, engagement level, sentiment
- Open Questions — unresolved items for follow-up
python unified_pipeline.py <audio_file> [options]
Options:
-m, --model MODEL Whisper model: tiny, base, small, medium, large-v3 (default: large-v3)
-l, --language LANG Language code (auto-detect if not set)
-o, --output-dir DIR Output directory (default: same as input)
-b, --basename NAME Override output file basename
--skip-diarization Skip speaker separation
--skip-summary Skip AI summary generation
--skip-ollama Skip AI speaker inference (no transcript sent to cloud)
--no-vad Disable Silero-VAD preprocessing
--multilingual Per-speaker language detection mode
--local-summary Use local Ollama for summaries instead of Claude
--local-model MODEL Ollama model for summaries (default: qwen2.5:72b)
--preflight Validate dependencies only, don't process
--skip-preflight Skip dependency validation
--resume Resume from last checkpoint
--no-auto-split Disable auto-splitting of recordings >2h
--json Output pipeline result as JSON
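When driving the pipeline from another script, it can help to assemble the argument list programmatically. The sketch below uses only flags from the option list above. `build_command` is a hypothetical wrapper, and the bare `python` executable name is an assumption about your environment.

```python
import shlex
from typing import List, Optional


def build_command(audio: str,
                  model: str = "large-v3",
                  language: Optional[str] = None,
                  local_summary: bool = False,
                  skip_diarization: bool = False) -> List[str]:
    """Assemble an argv list for unified_pipeline.py."""
    cmd = ["python", "unified_pipeline.py", audio, "--model", model, "--json"]
    if language:
        cmd += ["--language", language]  # omit to let the pipeline auto-detect
    if local_summary:
        cmd.append("--local-summary")
    if skip_diarization:
        cmd.append("--skip-diarization")
    return cmd


cmd = build_command("meeting.m4a", model="medium", language="en", local_summary=True)
print(shlex.join(cmd))
# python unified_pipeline.py meeting.m4a --model medium --json --language en --local-summary
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. The shape of the `--json` output is not documented here, so the sketch makes no assumptions about parsing it.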
| File | Role |
|---|---|
| `unified_pipeline.py` | Main orchestrator: all stages, CLI entry point |
| `_mlx_transcribe_worker.py` | MLX-Whisper subprocess worker (isolated from torch) |
| `transcribe_hybrid_ane.py` | MLX-Whisper transcription engine (Apple Silicon) |
| `audio_analyzer.py` | Audio metadata extraction (duration, SNR, noise level) |
| `transcript_formatter.py` | Rich transcript builder with hallucination filtering |
| `claude_speaker_inference.py` | Speaker identification via Claude Sonnet |
| `ollama_speaker_inference.py` | Speaker identification via local Ollama (no API key) |
| `meeting_context.py` | Context aggregation (calendar, photos, knowledge base) |
| `create_summary.py` | Meeting summary generation via Claude |
| `generate_summary.py` | Standalone summary tool |
| `name_speakers.py` | Interactive CLI for manual speaker naming |
| `process_audio.py` | Legacy pipeline (OpenAI Whisper + CUDA) |
| Platform | Transcription | Diarization | Recommended RAM |
|---|---|---|---|
| Apple Silicon M1/M2/M3/M4 | MLX + Metal GPU | Metal (MPS) | 8 GB+ (16 GB for large-v3) |
| NVIDIA GPU | OpenAI Whisper + CUDA | CUDA | 8 GB+ VRAM |
| CPU only | OpenAI Whisper | CPU | 16 GB+ |
Any format supported by ffmpeg: .m4a, .mp3, .wav, .ogg, .flac, .mp4, .mov, .webm, .opus, .aac, and more.
MIT License — Copyright (c) 2026 Andrew Mavliev. See LICENSE.