myVoiceNotes

Local-first voice processing pipeline that turns audio recordings into speaker-labeled transcripts and AI-powered meeting summaries. Built for Apple Silicon, with CUDA fallback.

Why myVoiceNotes

The entire pipeline runs locally by default. Transcription, diarization, speaker identification, summarization, and voice analysis all run on your hardware. Audio files are never uploaded anywhere. No cloud account required.

Optionally, you can enable cloud AI (Claude) for higher-quality speaker identification and summaries. Even then, only text is sent — never audio.

Key capabilities

Fully local pipeline — every stage runs on your machine out of the box. Ollama handles speaker inference and summaries locally. No API keys needed for the core workflow. Your recordings never leave your personal computer — Mac or PC.
Multilingual Whisper — auto-detects language per speaker. Multiple languages in the same meeting? Each speaker gets transcribed in their own language and merged chronologically.
Full-day recordings — handles any length, from a few-second voice memo to 8+ hour all-day recordings. No minimum duration. Auto-splits long recordings (over 2 hours; see --no-auto-split) into individual meetings at natural boundaries.
Multi-meeting detection — a single recording containing back-to-back meetings gets split, each with its own transcript and summary.
Every audio format — .m4a, .mp3, .wav, .ogg, .flac, .mp4, .mov, .webm, .opus, .aac — anything ffmpeg supports.
Speaker diarization — pyannote.audio 3.1 separates who said what. Works with 2-10+ speakers.
AI speaker identification — maps generic SPEAKER_00 labels to real names using dialogue context, calendar integration, and knowledge base. Runs locally via Ollama, or optionally via Claude for higher accuracy.
Advanced summarization — structured meeting summaries with executive overview, key decisions, action items with owners and deadlines, discussion points, and emotional sentiment analysis.
Voice normalization & pause reduction — Silero-VAD strips silence and non-speech segments before transcription. Reduces processing time and eliminates Whisper hallucinations caused by long pauses.
Transcription speed — on Apple M1, MLX-Whisper large-v3 transcribes at ~6-10x real-time speed. A 1-hour meeting transcribes in 6-10 minutes. Medium model runs at ~15-20x real-time.
Optional cloud AI — enable Claude (Sonnet/Opus) for premium speaker inference and summaries when you want the best quality. Only transcript text is sent — never audio.

Privacy

By default, nothing leaves your machine. The entire pipeline is local.

Component	Runs locally	Cloud option
Transcription (MLX-Whisper / OpenAI Whisper)	Yes	—
Speaker diarization (pyannote.audio)	Yes	—
Voice Activity Detection (Silero-VAD)	Yes	—
Audio analysis (ffprobe + librosa)	Yes	—
Speaker inference	Yes (Ollama)	Claude Sonnet (sends transcript text, opt-in)
Summary generation	Yes (Ollama)	Claude Opus (sends transcript text, opt-in)

Default (fully local):

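# Pull the model you plan to use; pass it with --local-model if it differs from the default (qwen2.5:72b)
ollama pull qwen2.5:7b
python unified_pipeline.py meeting.m4a --local-summary --local-model qwen2.5:7b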

With optional cloud AI (higher-quality summaries):

# Set ANTHROPIC_API_KEY in .env, then:
python unified_pipeline.py meeting.m4a

Quick Start

Prerequisites

Python 3.10+
ffmpeg (brew install ffmpeg on macOS)

Install

git clone https://github.com/mavliev/myVoiceNotes.git
cd myVoiceNotes

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

Configure

cp .env.example .env
# Edit .env with your tokens

Variable	Required	Purpose
`HF_TOKEN`	No	HuggingFace token for pyannote speaker diarization. Without it, diarization is skipped (transcription-only mode). Get one from huggingface.co/settings/tokens and accept the licenses for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0.
`ANTHROPIC_API_KEY`	No	Claude API key for cloud speaker inference (Sonnet) and summaries (Opus). Leave it unset and pass `--local-summary` to stay fully local, or `--skip-summary` to skip summaries.
`RECORDING_OWNER`	No	Your name — helps speaker inference identify the recorder.
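
The table above implies which optional features are available for a given environment. A minimal sketch of that logic, assuming the variables have already been loaded into the process environment (for example from .env). `describe_run_mode` is a hypothetical helper for illustration and is not part of the repository:

```python
import os
from typing import Mapping


def describe_run_mode(env: Mapping[str, str] = os.environ) -> dict:
    """Summarize which optional features the documented variables enable.

    Hypothetical helper: it only mirrors the table above and does not
    reproduce the checks unified_pipeline.py performs.
    """
    def is_set(name: str) -> bool:
        return bool(env.get(name, "").strip())

    return {
        # Without HF_TOKEN the README says diarization is skipped.
        "diarization": is_set("HF_TOKEN"),
        # Claude is opt-in and needs an API key.
        "cloud_ai_available": is_set("ANTHROPIC_API_KEY"),
        # Optional hint for speaker inference; None when unset.
        "recording_owner": env.get("RECORDING_OWNER", "").strip() or None,
    }
```

Calling `describe_run_mode()` with no arguments inspects the current environment; passing a plain dict makes the function easy to try out without touching real tokens.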

Verify setup

python unified_pipeline.py --preflight

Process a recording

# Basic usage — fully local, auto-detect language
python unified_pipeline.py meeting.m4a --local-summary

# Specify language
python unified_pipeline.py meeting.m4a -l en --local-summary

# Enable cloud AI for higher quality (requires ANTHROPIC_API_KEY in .env)
python unified_pipeline.py meeting.m4a

# Transcription only (fastest, no diarization or summary)
python unified_pipeline.py meeting.m4a --skip-diarization --skip-summary

Usage

Standard mode

python unified_pipeline.py meeting.m4a -l en

Pipeline: VAD -> Transcription -> Audio Analysis -> Diarization -> Speaker ID -> Rich Transcript -> Summary
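
To see which stages remain for a given set of flags, the stage order above can be combined with the skip flags from the CLI Reference. This is only a sketch: the flag-to-stage mapping is an assumption based on the flag descriptions, `planned_stages` is a hypothetical name, and the README does not say whether Speaker ID still runs when diarization is skipped, so the flags are treated independently here.

```python
# Stage order as listed in the README.
STAGES = [
    "VAD",
    "Transcription",
    "Audio Analysis",
    "Diarization",
    "Speaker ID",
    "Rich Transcript",
    "Summary",
]

# Assumed mapping from documented CLI flags to the stage each one disables.
FLAG_DISABLES = {
    "--no-vad": "VAD",
    "--skip-diarization": "Diarization",
    "--skip-ollama": "Speaker ID",
    "--skip-summary": "Summary",
}


def planned_stages(flags):
    """Return the stages that would still run, in pipeline order."""
    disabled = {FLAG_DISABLES[flag] for flag in flags if flag in FLAG_DISABLES}
    return [stage for stage in STAGES if stage not in disabled]
```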

Multilingual mode

For meetings with speakers in different languages:

python unified_pipeline.py meeting.m4a --multilingual

Diarizes first, detects each speaker's language, then transcribes each speaker in their own language and merges chronologically.
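
The final merge step can be pictured as interleaving per-speaker segment lists by start time. A minimal sketch, assuming each speaker's segments are already sorted; the `Segment` fields and the `merge_chronologically` name are illustrative and not taken from the pipeline's code:

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float   # seconds from the start of the recording
    end: float
    speaker: str
    language: str
    text: str


def merge_chronologically(per_speaker):
    """Merge per-speaker segment lists into one list ordered by start time.

    Each input list is assumed to be sorted by start time already, which
    lets heapq.merge combine them without re-sorting everything.
    """
    return list(heapq.merge(*per_speaker.values(), key=lambda seg: seg.start))
```

For example, merging `{"SPEAKER_00": [...], "SPEAKER_01": [...]}` yields a single list in which turns from both speakers alternate according to their timestamps, each still carrying its own language code.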

Whisper model selection

Model	Speed (M1)	Quality	RAM
`tiny`	~50x real-time	Low	2 GB
`small`	~25x real-time	Good	4 GB
`medium`	~15x real-time	Very good	8 GB
`large-v3` (default)	~8x real-time	Best	16 GB

python unified_pipeline.py meeting.m4a -m medium   # faster
python unified_pipeline.py meeting.m4a -m large-v3  # best quality (default)
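
As a rough planning aid, the table can be turned into a small lookup that picks a model for the available RAM and estimates transcription time. The figures are copied from the table (approximate, measured on M1); the function names are hypothetical, and real throughput will vary with hardware and audio content.

```python
# Figures copied from the table above (approximate, measured on M1).
MODELS = {
    "tiny": {"speed": 50, "ram_gb": 2},
    "small": {"speed": 25, "ram_gb": 4},
    "medium": {"speed": 15, "ram_gb": 8},
    "large-v3": {"speed": 8, "ram_gb": 16},
}


def pick_model(available_ram_gb):
    """Return the highest-quality model whose RAM figure fits, or None."""
    # Dicts preserve insertion order, so the last fitting entry is the best.
    fitting = [name for name, spec in MODELS.items() if spec["ram_gb"] <= available_ram_gb]
    return fitting[-1] if fitting else None


def estimated_minutes(audio_minutes, model):
    """Rough transcription time: audio length divided by the real-time factor."""
    return audio_minutes / MODELS[model]["speed"]
```

With these numbers, a 60-minute recording works out to about 7.5 minutes on large-v3 and 4 minutes on medium.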

Post-processing tools

# Interactive speaker naming from existing transcript
python name_speakers.py transcript.txt

# Generate summary from existing transcript
python create_summary.py transcript.txt

# Legacy CUDA pipeline (Windows/Linux)
python process_audio.py meeting.m4a -l ru

Monitoring Apple Silicon GPU Load

Use asitop to monitor GPU, CPU, and memory utilization in real time while processing:

# Install
pip install asitop

# Run (requires sudo for powermetrics access)
sudo asitop

In a split terminal, run your transcription in one pane and sudo asitop in another. You'll see:

GPU — Metal GPU utilization during MLX-Whisper transcription (typically 60-90%)
ANE — Apple Neural Engine usage
CPU — CPU cluster utilization during diarization (typically 400-700% across efficiency + performance cores)
Memory — Unified memory bandwidth and pressure

This helps you choose the right Whisper model size for your hardware: if GPU utilization is near 100% and memory pressure is high, step down to the medium model.

Output Files

For input meeting.m4a:

File	Description
`meeting_transcript.txt`	Timestamped transcript with speaker labels and metadata header
`meeting_detailed.json`	Full metadata — segments, speaker stats, audio info
`meeting_SUMMARY.md`	AI-generated summary (if not skipped)
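
When scripting around the pipeline it can help to compute these paths up front. A minimal sketch of the naming scheme in the table; `expected_outputs` is a hypothetical helper, and the way `output_dir` and `basename` stand in for the -o and -b options is an assumption about how the pipeline resolves them:

```python
from pathlib import Path


def expected_outputs(audio_file, output_dir=None, basename=None):
    """Return the three output paths the README lists for a given input.

    output_dir and basename stand in for the -o and -b options; how the
    real pipeline resolves them is assumed, not confirmed.
    """
    audio = Path(audio_file)
    directory = Path(output_dir) if output_dir else audio.parent
    stem = basename or audio.stem
    return {
        "transcript": directory / f"{stem}_transcript.txt",
        "detailed": directory / f"{stem}_detailed.json",
        "summary": directory / f"{stem}_SUMMARY.md",
    }
```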

Transcript format

================================================================================
                         TRANSCRIPTION METADATA
================================================================================

AUDIO INFORMATION
-----------------
File: meeting.m4a
Duration: 00:45:12
Detected Language: English (en, 97.3% confidence)

SPEAKER STATISTICS
------------------
Speaker          Time      %    Words   Turns
Alice            18:24    41%    3,241     47
Bob              14:51    33%    2,108     38

================================================================================

[00:00:05] Alice:
Good morning everyone, let's start with the agenda.

[00:00:12] Bob:
Sure, I wanted to raise the deployment timeline first.
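
The turn layout shown above (a `[HH:MM:SS] Speaker:` header followed by the spoken text) is simple enough to parse for downstream use. A minimal sketch, assuming every turn follows that layout exactly; `parse_turns` is a hypothetical helper and not one of the repository's tools:

```python
import re

# Matches turn headers such as "[00:00:05] Alice:".
TURN_HEADER = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\] (.+):$")


def parse_turns(transcript_text):
    """Return a list of (seconds, speaker, text) tuples from a transcript.

    Lines before the first turn header (the metadata block) are ignored.
    """
    turns = []
    current = None
    for line in transcript_text.splitlines():
        match = TURN_HEADER.match(line.strip())
        if match:
            hours, minutes, seconds, speaker = match.groups()
            offset = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            current = [offset, speaker, []]
            turns.append(current)
        elif current is not None and line.strip():
            # Non-empty lines after a header belong to that turn.
            current[2].append(line.strip())
    return [(offset, speaker, " ".join(parts)) for offset, speaker, parts in turns]
```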

Summary output

The AI summary includes structured sections:

Executive Summary — 2-3 sentence overview
Key Decisions — what was decided, by whom
Action Items — tasks with owners, deadlines, and priority
Discussion Points — main topics with positions taken
Participant Analysis — speaking time, engagement level, sentiment
Open Questions — unresolved items for follow-up

CLI Reference

python unified_pipeline.py <audio_file> [options]

Options:
  -m, --model MODEL       Whisper model: tiny, base, small, medium, large-v3 (default: large-v3)
  -l, --language LANG     Language code (auto-detect if not set)
  -o, --output-dir DIR    Output directory (default: same as input)
  -b, --basename NAME     Override output file basename
  --skip-diarization      Skip speaker separation
  --skip-summary          Skip AI summary generation
  --skip-ollama           Skip AI speaker inference (no transcript sent to cloud)
  --no-vad                Disable Silero-VAD preprocessing
  --multilingual          Per-speaker language detection mode
  --local-summary         Use local Ollama for summaries instead of Claude
  --local-model MODEL     Ollama model for summaries (default: qwen2.5:72b)
  --preflight             Validate dependencies only, don't process
  --skip-preflight        Skip dependency validation
  --resume                Resume from last checkpoint
  --no-auto-split         Disable auto-splitting of recordings >2h
  --json                  Output pipeline result as JSON
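
For batch jobs, the documented flags can be assembled programmatically and passed to the script one file at a time. This is a sketch under assumptions: `build_command` and `process_folder` are hypothetical helpers, the suffix list is a subset of the formats named in this README, and the script is assumed to be in the current working directory.

```python
import subprocess
import sys
from pathlib import Path

# A subset of the formats listed in this README.
AUDIO_SUFFIXES = {".m4a", ".mp3", ".wav", ".ogg", ".flac"}


def build_command(audio_file, model="large-v3", language=None, local_summary=True):
    """Build an argument list for unified_pipeline.py from documented flags."""
    cmd = [sys.executable, "unified_pipeline.py", str(audio_file), "-m", model]
    if language:
        cmd += ["-l", language]
    if local_summary:
        cmd.append("--local-summary")
    return cmd


def process_folder(folder, **options):
    """Run the pipeline once per audio file; stop on the first failure."""
    for audio in sorted(Path(folder).iterdir()):
        if audio.suffix.lower() in AUDIO_SUFFIXES:
            subprocess.run(build_command(audio, **options), check=True)
```

Building the command as a list avoids shell quoting problems with file names that contain spaces.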

Architecture

File	Role
`unified_pipeline.py`	Main orchestrator — all stages, CLI entry point
`_mlx_transcribe_worker.py`	MLX-Whisper subprocess worker (isolated from torch)
`transcribe_hybrid_ane.py`	MLX-Whisper transcription engine (Apple Silicon)
`audio_analyzer.py`	Audio metadata extraction (duration, SNR, noise level)
`transcript_formatter.py`	Rich transcript builder with hallucination filtering
`claude_speaker_inference.py`	Speaker identification via Claude Sonnet
`ollama_speaker_inference.py`	Speaker identification via local Ollama (no API key)
`meeting_context.py`	Context aggregation (calendar, photos, knowledge base)
`create_summary.py`	Meeting summary generation via Claude
`generate_summary.py`	Standalone summary tool
`name_speakers.py`	Interactive CLI for manual speaker naming
`process_audio.py`	Legacy pipeline (OpenAI Whisper + CUDA)

Hardware Requirements

Platform	Transcription	Diarization	Recommended RAM
Apple Silicon M1/M2/M3/M4	MLX + Metal GPU	Metal (MPS)	8 GB+ (16 GB for large-v3)
NVIDIA GPU	OpenAI Whisper + CUDA	CUDA	8 GB+ VRAM
CPU only	OpenAI Whisper	CPU	16 GB+
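
To check which row of the table applies to a machine, a small probe along these lines may help. It is a sketch, not the pipeline's own detection logic: `pick_backend` is a hypothetical name, the CUDA check assumes PyTorch is installed, and a Python interpreter running under Rosetta reports x86_64 and would therefore not be recognized as Apple Silicon.

```python
import platform


def pick_backend(system=None, machine=None, cuda_available=None):
    """Map the current machine to a row of the table above.

    Arguments default to live detection; pass them explicitly for testing.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "apple-silicon"
    if cuda_available is None:
        try:
            import torch  # optional: only used to probe for CUDA
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    return "cuda" if cuda_available else "cpu"
```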

Supported audio formats

Any format supported by ffmpeg: .m4a, .mp3, .wav, .ogg, .flac, .mp4, .mov, .webm, .opus, .aac, and more.
