myVoiceNotes

Local-first voice processing pipeline that turns audio recordings into speaker-labeled transcripts and AI-powered meeting summaries. Built for Apple Silicon, with CUDA fallback.


Why myVoiceNotes

The entire pipeline runs locally by default. Transcription, diarization, speaker identification, summarization, and voice analysis all run on your hardware. Audio files are never uploaded anywhere. No cloud account required.

Optionally, you can enable cloud AI (Claude) for higher-quality speaker identification and summaries. Even then, only text is sent — never audio.

Key capabilities

  • Fully local pipeline — every stage runs on your machine out of the box. Ollama handles speaker inference and summaries locally. No API keys needed for the core workflow. Your recordings never leave your personal computer — Mac or PC.
  • Multilingual Whisper — auto-detects language per speaker. Multiple languages in the same meeting? Each speaker gets transcribed in their own language and merged chronologically.
  • Full-day recordings — handles any length — from a few-second voice memo to 8+ hour all-day recordings. No minimum duration. Auto-splits long recordings into individual meetings at natural boundaries.
  • Multi-meeting detection — a single recording containing back-to-back meetings gets split, each with its own transcript and summary.
  • Every audio format: .m4a, .mp3, .wav, .ogg, .flac, .mp4, .mov, .webm, .opus, .aac, and anything else ffmpeg supports.
  • Speaker diarization — pyannote.audio 3.1 separates who said what. Works with 2-10+ speakers.
  • AI speaker identification — maps generic SPEAKER_00 labels to real names using dialogue context, calendar integration, and knowledge base. Runs locally via Ollama, or optionally via Claude for higher accuracy.
  • Advanced summarization — structured meeting summaries with executive overview, key decisions, action items with owners and deadlines, discussion points, and emotional sentiment analysis.
  • Voice normalization & pause reduction — Silero-VAD strips silence and non-speech segments before transcription. Reduces processing time and eliminates Whisper hallucinations caused by long pauses.
  • Transcription speed — on Apple M1, MLX-Whisper large-v3 transcribes at ~6-10x real-time speed. A 1-hour meeting transcribes in 6-10 minutes. Medium model runs at ~15-20x real-time.
  • Optional cloud AI — enable Claude (Sonnet/Opus) for premium speaker inference and summaries when you want the best quality. Only transcript text is sent — never audio.
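
How the pipeline picks "natural boundaries" when it splits a long recording is not spelled out here. One plausible heuristic, sketched below purely as an assumption, treats a long silence between speech segments as the end of one meeting and the start of the next. The function name, the `(start, end)` segment format, and the 10-minute threshold are all hypothetical and are not taken from the project's code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one speech region, e.g. from a VAD


def split_on_long_gaps(segments: List[Segment], min_gap_sec: float = 600.0) -> List[List[Segment]]:
    """Group speech segments into meetings.

    A new meeting starts whenever the silence since the previous
    segment ended is at least min_gap_sec (hypothetical threshold).
    """
    meetings: List[List[Segment]] = []
    for seg in sorted(segments):
        previous_end = meetings[-1][-1][1] if meetings else None
        if previous_end is not None and seg[0] - previous_end < min_gap_sec:
            meetings[-1].append(seg)   # short pause: same meeting
        else:
            meetings.append([seg])     # first segment or long silence: new meeting
    return meetings
```

With segments `[(0, 100), (110, 1800), (3000, 3600)]` and the default threshold, the 20-minute silence before the third segment produces two groups.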

Privacy

By default, nothing leaves your machine. The entire pipeline is local.

| Component | Runs locally | Cloud option |
|---|---|---|
| Transcription (MLX-Whisper / OpenAI Whisper) | Yes | |
| Speaker diarization (pyannote.audio) | Yes | |
| Voice Activity Detection (Silero-VAD) | Yes | |
| Audio analysis (ffprobe + librosa) | Yes | |
| Speaker inference | Yes (Ollama) | Claude Sonnet (sends transcript text, opt-in) |
| Summary generation | Yes (Ollama) | Claude Opus (sends transcript text, opt-in) |

Default (fully local):

ollama pull qwen2.5:7b
python unified_pipeline.py meeting.m4a --local-summary

With optional cloud AI (higher quality summaries):

# Set ANTHROPIC_API_KEY in .env, then:
python unified_pipeline.py meeting.m4a

Quick Start

Prerequisites

  • Python 3.10+
  • ffmpeg (brew install ffmpeg on macOS)

Install

git clone https://github.com/mavliev/myVoiceNotes.git
cd myVoiceNotes

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

Configure

cp .env.example .env
# Edit .env with your tokens

| Variable | Required | Purpose |
|---|---|---|
| HF_TOKEN | No | HuggingFace token for pyannote speaker diarization. Without it, diarization is skipped (transcription-only mode). Get from huggingface.co/settings/tokens and accept licenses for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0. |
| ANTHROPIC_API_KEY | No | Claude API for speaker inference (Sonnet) and summaries. Skip with --skip-summary. |
| RECORDING_OWNER | No | Your name. Helps speaker inference identify the recorder. |
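
Since all three variables are optional, it can help to check which features a given environment unlocks before starting a long run. The sketch below is illustrative only: `available_features` is a hypothetical helper, not part of the project, and it assumes the values from `.env` have already been loaded into the process environment (for example by the shell or a dotenv loader).

```python
import os
from typing import Dict, Mapping


def available_features(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which optional features the documented variables enable.

    Hypothetical helper: the mapping follows the table above, not the
    project's actual preflight logic.
    """
    def is_set(name: str) -> bool:
        return bool(env.get(name, "").strip())

    return {
        "diarization": is_set("HF_TOKEN"),        # pyannote needs the HuggingFace token
        "cloud_ai": is_set("ANTHROPIC_API_KEY"),  # Claude speaker inference and summaries
        "owner_hint": is_set("RECORDING_OWNER"),  # helps identify the recorder
    }
```

The project's own check is `python unified_pipeline.py --preflight`; this sketch only mirrors the table.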

Verify setup

python unified_pipeline.py --preflight

Process a recording

# Basic usage — fully local, auto-detect language
python unified_pipeline.py meeting.m4a --local-summary

# Specify language
python unified_pipeline.py meeting.m4a -l en --local-summary

# Enable cloud AI for higher quality (requires ANTHROPIC_API_KEY in .env)
python unified_pipeline.py meeting.m4a

# Transcription only (fastest, no diarization or summary)
python unified_pipeline.py meeting.m4a --skip-diarization --skip-summary

Usage

Standard mode

python unified_pipeline.py meeting.m4a -l en

Pipeline: VAD -> Transcription -> Audio Analysis -> Diarization -> Speaker ID -> Rich Transcript -> Summary

Multilingual mode

For meetings with speakers in different languages:

python unified_pipeline.py meeting.m4a --multilingual

Diarizes first, detects each speaker's language, then transcribes each speaker in their own language and merges chronologically.
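
The final merge step can be pictured as interleaving per-speaker segment lists by start time. The following is a minimal sketch under assumed data shapes: the `Segment` fields and the `merge_chronologically` name are hypothetical, and each speaker's list is assumed to be sorted by start time already.

```python
import heapq
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    start: float    # seconds from the beginning of the recording
    end: float
    speaker: str    # diarization label, e.g. "SPEAKER_00"
    language: str   # language detected for this speaker
    text: str


def merge_chronologically(per_speaker: Dict[str, List[Segment]]) -> List[Segment]:
    """Interleave per-speaker segment lists into a single timeline.

    Each input list must already be sorted by start time.
    """
    return list(heapq.merge(*per_speaker.values(), key=lambda seg: seg.start))
```

For example, an English speaker with segments at 0 s and 9 s and a Russian speaker with one segment at 5 s yield a timeline ordered en, ru, en.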

Whisper model selection

| Model | Speed (M1) | Quality | RAM |
|---|---|---|---|
| tiny | ~50x real-time | Low | 2 GB |
| small | ~25x real-time | Good | 4 GB |
| medium | ~15x real-time | Very good | 8 GB |
| large-v3 (default) | ~8x real-time | Best | 16 GB |

python unified_pipeline.py meeting.m4a -m medium   # faster
python unified_pipeline.py meeting.m4a -m large-v3  # best quality (default)
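
To get a rough feel for the trade-off, the approximate M1 speeds quoted for each model can be turned into a time estimate. This is back-of-the-envelope arithmetic only: it covers transcription, not diarization or summarization, and `estimate_minutes` is a hypothetical helper.

```python
# Approximate real-time factors on an Apple M1, as quoted in this README.
SPEED_X_REALTIME = {"tiny": 50, "small": 25, "medium": 15, "large-v3": 8}


def estimate_minutes(audio_minutes: float, model: str = "large-v3") -> float:
    """Estimate transcription time in minutes for a recording of the given length."""
    return audio_minutes / SPEED_X_REALTIME[model]


print(estimate_minutes(60))             # 1-hour meeting, large-v3 -> 7.5
print(estimate_minutes(480, "medium"))  # 8-hour recording, medium -> 32.0
```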

Post-processing tools

# Interactive speaker naming from existing transcript
python name_speakers.py transcript.txt

# Generate summary from existing transcript
python create_summary.py transcript.txt

# Legacy CUDA pipeline (Windows/Linux)
python process_audio.py meeting.m4a -l ru

Monitoring Apple Silicon GPU Load

Use asitop to monitor GPU, CPU, and memory utilization in real time while processing:

# Install
pip install asitop

# Run (requires sudo for powermetrics access)
sudo asitop

In a split terminal, run your transcription in one pane and sudo asitop in another. You'll see:

  • GPU — Metal GPU utilization during MLX-Whisper transcription (typically 60-90%)
  • ANE — Apple Neural Engine usage
  • CPU — CPU cluster utilization during diarization (typically 400-700% across efficiency + performance cores)
  • Memory — Unified memory bandwidth and pressure

This helps you choose the right Whisper model size for your hardware. If GPU utilization is near 100% and memory pressure is high, step down to the medium model.


Output Files

For input meeting.m4a:

| File | Description |
|---|---|
| meeting_transcript.txt | Timestamped transcript with speaker labels and metadata header |
| meeting_detailed.json | Full metadata: segments, speaker stats, audio info |
| meeting_SUMMARY.md | AI-generated summary (if not skipped) |
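
For scripting around the pipeline, the naming pattern above can be reproduced with `pathlib`. This sketch assumes that `--output-dir` and `--basename` simply replace the directory and the file stem, which is inferred from the CLI reference rather than confirmed against the code; `expected_outputs` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict, Optional


def expected_outputs(audio_file: str,
                     output_dir: Optional[str] = None,
                     basename: Optional[str] = None) -> Dict[str, Path]:
    """Predict output paths from the documented naming pattern (assumed behavior)."""
    src = Path(audio_file)
    out_dir = Path(output_dir) if output_dir else src.parent  # default: same dir as input
    stem = basename or src.stem                               # default: input name without extension
    return {
        "transcript": out_dir / f"{stem}_transcript.txt",
        "detailed": out_dir / f"{stem}_detailed.json",
        "summary": out_dir / f"{stem}_SUMMARY.md",
    }
```

A caller can then test `expected_outputs("meeting.m4a")["summary"].exists()` to see whether a summary was produced.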

Transcript format

================================================================================
                         TRANSCRIPTION METADATA
================================================================================

AUDIO INFORMATION
-----------------
File: meeting.m4a
Duration: 00:45:12
Detected Language: Russian (ru, 97.3% confidence)

SPEAKER STATISTICS
------------------
Speaker          Time      %    Words   Turns
Alice            18:24    41%    3,241     47
Bob              14:51    33%    2,108     38

================================================================================

[00:00:05] Alice:
Good morning everyone, let's start with the agenda.

[00:00:12] Bob:
Sure, I wanted to raise the deployment timeline first.
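
The body of the transcript is regular enough to parse with a small script. The sketch below assumes every turn starts with a `[HH:MM:SS] Speaker:` line exactly as in the sample and that the following non-empty lines belong to that turn. `parse_turns` is a hypothetical helper, not something shipped with the project.

```python
import re
from typing import List, Tuple

# Matches a turn header such as "[00:00:05] Alice:"
TURN_HEADER = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+(.+?):\s*$")


def parse_turns(transcript: str) -> List[Tuple[int, str, str]]:
    """Return (start_seconds, speaker, text) for each turn.

    Everything before the first turn header (the metadata block) is ignored.
    """
    turns = []
    current = None
    for line in transcript.splitlines():
        match = TURN_HEADER.match(line)
        if match:
            hours, minutes, seconds, speaker = match.groups()
            start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            current = [start, speaker, []]
            turns.append(current)
        elif current is not None and line.strip():
            current[2].append(line.strip())  # text line of the current turn
    return [(start, speaker, " ".join(text)) for start, speaker, text in turns]
```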

Summary output

The AI summary includes structured sections:

  • Executive Summary — 2-3 sentence overview
  • Key Decisions — what was decided, by whom
  • Action Items — tasks with owners, deadlines, and priority
  • Discussion Points — main topics with positions taken
  • Participant Analysis — speaking time, engagement level, sentiment
  • Open Questions — unresolved items for follow-up

CLI Reference

python unified_pipeline.py <audio_file> [options]

Options:
  -m, --model MODEL       Whisper model: tiny, base, small, medium, large-v3 (default: large-v3)
  -l, --language LANG     Language code (auto-detect if not set)
  -o, --output-dir DIR    Output directory (default: same as input)
  -b, --basename NAME     Override output file basename
  --skip-diarization      Skip speaker separation
  --skip-summary          Skip AI summary generation
  --skip-ollama           Skip AI speaker inference (no transcript sent to cloud)
  --no-vad                Disable Silero-VAD preprocessing
  --multilingual          Per-speaker language detection mode
  --local-summary         Use local Ollama for summaries instead of Claude
  --local-model MODEL     Ollama model for summaries (default: qwen2.5:72b)
  --preflight             Validate dependencies only, don't process
  --skip-preflight        Skip dependency validation
  --resume                Resume from last checkpoint
  --no-auto-split         Disable auto-splitting of recordings >2h
  --json                  Output pipeline result as JSON
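
When driving the pipeline from another program, the documented flags can be assembled into an argument list. This is a sketch that uses only the options listed above. `build_command` is a hypothetical wrapper, and it assumes `unified_pipeline.py` is in the current working directory.

```python
import sys
from typing import List, Optional


def build_command(audio_file: str,
                  model: Optional[str] = None,
                  language: Optional[str] = None,
                  local_summary: bool = False,
                  skip_diarization: bool = False,
                  skip_summary: bool = False) -> List[str]:
    """Build an argv list for unified_pipeline.py from the documented flags."""
    cmd = [sys.executable, "unified_pipeline.py", audio_file]
    if model:
        cmd += ["-m", model]
    if language:
        cmd += ["-l", language]
    if local_summary:
        cmd.append("--local-summary")
    if skip_diarization:
        cmd.append("--skip-diarization")
    if skip_summary:
        cmd.append("--skip-summary")
    cmd.append("--json")  # request the pipeline result as JSON
    return cmd
```

The list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. The shape of the JSON result is not documented here, so inspect the output before relying on specific keys.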

Architecture

| File | Role |
|---|---|
| unified_pipeline.py | Main orchestrator: all stages, CLI entry point |
| _mlx_transcribe_worker.py | MLX-Whisper subprocess worker (isolated from torch) |
| transcribe_hybrid_ane.py | MLX-Whisper transcription engine (Apple Silicon) |
| audio_analyzer.py | Audio metadata extraction (duration, SNR, noise level) |
| transcript_formatter.py | Rich transcript builder with hallucination filtering |
| claude_speaker_inference.py | Speaker identification via Claude Sonnet |
| ollama_speaker_inference.py | Speaker identification via local Ollama (no API key) |
| meeting_context.py | Context aggregation (calendar, photos, knowledge base) |
| create_summary.py | Meeting summary generation via Claude |
| generate_summary.py | Standalone summary tool |
| name_speakers.py | Interactive CLI for manual speaker naming |
| process_audio.py | Legacy pipeline (OpenAI Whisper + CUDA) |

Hardware Requirements

| Platform | Transcription | Diarization | Recommended RAM |
|---|---|---|---|
| Apple Silicon M1/M2/M3/M4 | MLX + Metal GPU | Metal (MPS) | 8 GB+ (16 GB for large-v3) |
| NVIDIA GPU | OpenAI Whisper + CUDA | CUDA | 8 GB+ VRAM |
| CPU only | OpenAI Whisper | CPU | 16 GB+ |

Supported audio formats

Any format supported by ffmpeg: .m4a, .mp3, .wav, .ogg, .flac, .mp4, .mov, .webm, .opus, .aac, and more.
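
The pipeline accepts these formats directly, so manual conversion should not normally be needed. If you do want to pre-convert a file, for example to strip the video track from a .mov, the sketch below builds a standard ffmpeg command for 16 kHz mono WAV, the sample rate Whisper models work with. `ffmpeg_to_wav_command` and the `_16k` suffix are hypothetical; whether the pipeline uses the same settings internally is not stated here.

```python
from pathlib import Path
from typing import List


def ffmpeg_to_wav_command(src: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg argv that converts src to a mono WAV next to it."""
    source = Path(src)
    # A distinct suffix avoids overwriting the input when it is already a .wav
    dst = source.with_name(source.stem + "_16k.wav")
    return [
        "ffmpeg", "-y",            # overwrite the output if it exists
        "-i", src,
        "-vn",                     # drop any video stream
        "-ac", "1",                # mono
        "-ar", str(sample_rate),   # resample
        str(dst),
    ]
```

Run it with `subprocess.run(ffmpeg_to_wav_command("clip.mov"), check=True)`.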


License

MIT License — Copyright (c) 2026 Andrew Mavliev. See LICENSE.
