A local, privacy-first toolkit for transcribing sensitive meetings and generating executive summaries, all without sending data to the cloud.
Security professionals often need to transcribe sensitive meetings (strategy discussions, incident reviews, classified briefings) and generate executive summaries. Cloud-based transcription services pose unacceptable risks for this content.
Secure Speech-to-Text provides an easy, fully local way to transcribe audio and generate executive summaries, keeping sensitive information off third-party servers. All processing happens on your machine using open-source models and local LLMs.
- End-to-End Workflow: Interactive script guides you through transcription, summary, and secure deletion
- Docker Support: Single-command transcription with GPU or CPU containers
- Local Processing: All transcription runs on your machine; no data leaves your network
- Executive Summaries: Generate summaries via any OpenAI-compatible API (LM Studio, Ollama, vLLM)
- Speaker Diarization: Identify and label different speakers in conversations
- Word-Level Timestamps: Accurate timing for each word using alignment models
- Multiple Output Formats: SRT, VTT, TXT, JSON
- GPU Acceleration: CUDA support for fast inference (CPU fallback available)
- Best-Effort Secure Deletion: Overwrite and remove source audio after transcription
The easiest way to run Secure Speech-to-Text is with Docker.
Copy .env.example to .env and add your Hugging Face token (required for speaker diarization):
cp .env.example .env
# Edit .env and set HUGGINGFACE_HUB_TOKEN=your_token_hereGPU (NVIDIA CUDA):
# Build the GPU image
docker compose build gpu
# Place your audio file in the input/ folder, then run:
docker compose run --rm gpu input/meeting.m4a -yCPU Only:
# Build the CPU image
docker compose build cpu
# Place your audio file in the input/ folder, then run:
docker compose run --rm cpu input/meeting.m4a -yResults appear in the output/ folder.
python secure_speech_to_text.py [OPTIONS] <audio_file>| Flag | Description |
|---|---|
-y, --no-interactive |
Skip prompts, run full pipeline |
--no-summary |
Skip executive summary generation |
--no-delete |
Skip secure deletion of source audio |
--no-diarize |
Disable speaker diarization |
--output-dir PATH |
Override output directory (default: output/) |
# Interactive mode (prompts at each step)
python secure_speech_to_text.py input/meeting.m4a
# Non-interactive mode (runs full pipeline)
python secure_speech_to_text.py input/meeting.m4a -y
# Skip summary generation
python secure_speech_to_text.py input/meeting.m4a -y --no-summary
# Custom output directory
python secure_speech_to_text.py meeting.m4a --output-dir ./my-transcripts- Place audio files in
input/ - Results appear in
output/<filename>_<timestamp>/:*.txt: Plain text transcript*.srt,*.vtt: Subtitle formats*.json: Detailed word-level dataexecutive_summary.md: LLM-generated summary
- Python 3.9 to 3.13 (3.14+ not supported by WhisperX)
- FFmpeg installed and on PATH
- A local LLM server for executive summaries (optional)
# Windows (Chocolatey)
choco install ffmpeg
# macOS (Homebrew)
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install -y ffmpeg- Create and activate a virtual environment:
python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# macOS/Linux
source .venv/bin/activate-
(GPU only) Install CUDA Toolkit 12.8 before WhisperX. Skip this step if using CPU only.
- Linux: Follow the CUDA Installation Guide for Linux
- Windows: Download and install from CUDA Downloads
-
Install dependencies:
pip install -r requirements.txt- Configure your LLM API (copy
.env.exampleto.envand edit):
cp .env.example .envCreate a .env file (or copy from .env.example):
# OpenAI-compatible API endpoint
API_BASE_URL=http://localhost:1234/v1
# API key (use any value for local servers)
API_KEY=lm-studio
# Model name as shown in your LLM server
MODEL_NAME=local-modelSupported servers:
- LM Studio:
http://localhost:1234/v1 - Ollama:
http://localhost:11434/v1 - vLLM:
http://localhost:8000/v1
WhisperX uses pyannote for speaker diarization. To enable diarization:
-
Create a Hugging Face account and generate a User Access Token with "Read" permissions at https://huggingface.co/settings/tokens
-
Accept the model conditions for both:
-
Provide your token via one of these methods:
Option A: Add to .env file (recommended for Docker):
HUGGINGFACE_HUB_TOKEN=your_token_hereOption B: Login via CLI (for local installation):
huggingface-cli loginOption C: Set environment variable:
# Windows PowerShell
setx HUGGINGFACE_HUB_TOKEN "<YOUR_TOKEN>"
$env:HUGGINGFACE_HUB_TOKEN = "<YOUR_TOKEN>" # for current session# macOS/Linux
export HUGGINGFACE_HUB_TOKEN="<YOUR_TOKEN>"Determine token count of transcript files for sizing your LLM's context window:
python -m utils.token_counter transcript.txtOptions:
--method: Choose tokenizer (tiktokenortransformers, default:tiktoken)--model: Specify model name (default:gpt-4for tiktoken,gpt2for transformers)
# Use transformers library with specific model
python -m utils.token_counter transcript.txt --method transformers --model mistralai/Mistral-7B-v0.1sl5-speech-to-text/
├── README.md # This file
├── requirements.txt # Python dependencies
├── .env.example # API configuration template
├── secure_speech_to_text.py # Main workflow script
├── best_effort_delete.py # Secure deletion helper
├── Dockerfile # GPU container (CUDA 12.8)
├── Dockerfile.cpu # CPU container
├── docker-compose.yml # Docker services
├── input/ # Place audio files here
├── output/ # Transcripts and summaries appear here
└── utils/
└── token_counter.py # Token counting for LLM context sizing
| Problem | Solution |
|---|---|
| "ffmpeg not found" | Ensure ffmpeg is installed and on PATH (see Prerequisites) |
| GPU not used | Check your PyTorch install matches your CUDA version |
| Module not found | Run pip install -r requirements.txt inside your venv |
| Diarization fails | Ensure you've accepted model conditions on Hugging Face |
| LLM summary fails | Check your .env configuration and that your LLM server is running |
| Docker GPU error | Ensure NVIDIA Container Toolkit is installed |
WeightsUnpickler error |
Set env var: TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1 (PyTorch 2.6+ issue) |
- Speaker Identification: Ability to label which speaker is whom (e.g., "SPEAKER_00 is John")
| Component | Dependency |
|---|---|
| Transcription | WhisperX, PyTorch, FFmpeg |
| Diarization | pyannote (via Hugging Face) |
| Executive Summary | openai, python-dotenv |
| Token Counting | tiktoken, transformers (optional) |
| Docker GPU | NVIDIA Container Toolkit |
Created by the SL5 Task Force for the security community.
- WhisperX: Fast Whisper with word-level timestamps
- OpenAI Whisper: Original Whisper model
- pyannote: Speaker diarization toolkit
- LM Studio: Run local LLMs with a GUI
- Ollama: Run local LLMs from the command line
- vLLM: High-throughput LLM inference
- PyTorch: GPU-accelerated deep learning
- NVIDIA Container Toolkit: Docker GPU support