WhisperTransscribe is a local transcription CLI for audio and video files.
It uses:
- faster-whisper with the large-v3 model for transcription
- pyannote.audio speaker diarization to separate speakers
- ffmpeg to normalize input media into a clean mono 16 kHz WAV before inference
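The normalization step can be pictured as a single ffmpeg call. The following is a minimal sketch, not the project's actual code: the function names are hypothetical, and the 16-bit PCM codec is an assumption, since the README only specifies mono 16 kHz WAV.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(source: Path, target: Path) -> list[str]:
    """Build an ffmpeg command that writes a mono 16 kHz WAV file."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the target if it already exists
        "-i", str(source),    # input audio or video file
        "-vn",                # ignore any video stream
        "-ac", "1",           # downmix to a single channel (mono)
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # 16-bit PCM; assumed, the README only says "WAV"
        str(target),
    ]


def normalize_media(source: Path, target: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(source, target), check=True)
```

Feeding the same normalized file to both diarization and transcription keeps their timestamps on a common timeline.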
The program generates:
- an .srt subtitle file
- a .txt transcript file
Language detection and speaker detection are used internally during transcription. They are not written into the final output. The only special annotation written to output is [Simultaneous speech] when pyannote detects real overlapping speech.
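One way to decide when a segment gets the [Simultaneous speech] prefix is to check whether two different speakers have diarization turns that intersect inside the segment. The sketch below illustrates that idea under assumptions: the `Turn` type and both function names are hypothetical, and the project's real overlap logic may differ.

```python
from dataclasses import dataclass

OVERLAP_PREFIX = "[Simultaneous speech]"


@dataclass
class Turn:
    """One diarization turn: a speaker label and a time span in seconds."""
    speaker: str
    start: float
    end: float


def has_overlap(seg_start: float, seg_end: float, turns: list) -> bool:
    """Return True if two different speakers talk at once inside the segment."""
    # Keep only turns that intersect the segment, clipped to its bounds.
    clipped = [
        (max(t.start, seg_start), min(t.end, seg_end), t.speaker)
        for t in turns
        if t.start < seg_end and t.end > seg_start
    ]
    for i, (a_start, a_end, a_speaker) in enumerate(clipped):
        for b_start, b_end, b_speaker in clipped[i + 1:]:
            if a_speaker != b_speaker and a_start < b_end and b_start < a_end:
                return True
    return False


def annotate(text: str, seg_start: float, seg_end: float, turns: list) -> str:
    """Prefix the text only when overlapping speech is found."""
    if has_overlap(seg_start, seg_end, turns):
        return f"{OVERLAP_PREFIX} {text}"
    return text
```

Speaker labels are used only for the comparison and never reach the returned text, which matches the rule that speaker names are not written to the output.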
- Accepts audio or video input by file path
- Works locally after models are prepared on disk
- Uses Whisper automatic language detection by default
- Supports optional language override with --language
- Separates speaker turns with pyannote diarization
- Marks overlapping speech when multiple people talk at the same time
- Stores the Hugging Face token in a local .env file during setup
Install these before using the project:
- Python 3.12
- uv
- ffmpeg
- git
- git-lfs
You also need a Hugging Face account and token for the pyannote offline clone workflow.
The project expects Python 3.12.x.
Install uv from the official Astral instructions, then verify:

    uv --version

ffmpeg must be available in PATH.
Verify:

    ffmpeg -version

Verify:

    git --version

pyannote/speaker-diarization-community-1 is cloned through Git LFS.
Verify:

    git lfs version

From the repository root:

    uv sync

If you use CUDA 12.6, install the matching dependency set you already configured in pyproject.toml.
Run the model setup command once:

    uv run python .\setup_models.py

What happens:
- the script downloads Systran/faster-whisper-large-v3
- the script clones pyannote/speaker-diarization-community-1 as a local offline checkout
- if no token is found, the script prompts once for a Hugging Face token
- the token is stored in .env as HF_TOKEN=...
The .env file is ignored by git.
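The token handling can be sketched as a lookup followed by a save. This is an illustration under assumptions, not the setup script itself: the helper names are hypothetical, and reading the `HF_TOKEN` environment variable is assumed from the key name used in .env.

```python
import os
from pathlib import Path
from typing import Optional


def read_env_file(env_path: Path) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_hf_token(env_path: Path, cli_token: Optional[str] = None) -> Optional[str]:
    """Return the first token found: explicit value, environment, then .env."""
    return (
        cli_token
        or os.environ.get("HF_TOKEN")  # assumed variable name
        or read_env_file(env_path).get("HF_TOKEN")
    )


def save_hf_token(env_path: Path, token: str) -> None:
    """Write HF_TOKEN into .env while keeping any other entries."""
    values = read_env_file(env_path)
    values["HF_TOKEN"] = token
    lines = [f"{key}={value}\n" for key, value in values.items()]
    env_path.write_text("".join(lines), encoding="utf-8")
```

If `resolve_hf_token` returns `None`, a caller would prompt once and pass the answer to `save_hf_token`, so later runs find the token without asking again.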
To force a refresh of local model folders:

    uv run python .\setup_models.py --force

Basic usage:

    uv run python .\main.py "C:\path\to\file.mp4"

PowerShell wrapper:

    .\whispertransscribe.ps1 "C:\path\to\file.mp4"

The wrapper forwards all arguments to main.py, so this also works:

    .\whispertransscribe.ps1 "C:\path\to\file.mp4" --output-dir "C:\out" --language de

The output files are written next to the input file by default:
- file.srt
- file.txt
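The default output location follows from the input path. A minimal sketch of that rule, with a hypothetical function name and no claim to match the code in main.py:

```python
from pathlib import Path
from typing import Optional


def output_paths(media_path: Path, output_dir: Optional[Path] = None) -> tuple:
    """Return the (.srt, .txt) paths for a given input file."""
    # Fall back to the folder of the input file when no directory is given.
    base_dir = output_dir if output_dir is not None else media_path.parent
    stem = media_path.stem  # file name without its last extension
    return base_dir / f"{stem}.srt", base_dir / f"{stem}.txt"
```

The new names are built from the stem instead of `with_suffix`, so a name such as `talk.part1.mp4` yields `talk.part1.srt` without further special handling.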
These parameters are supported by main.py:
- media_path: Path to the input audio or video file.
- -o, --output-dir: Directory where the generated .srt and .txt files will be written. If omitted, the files are written next to the input media.
- --language: Optional Whisper language hint such as en or de. If omitted, Whisper detects language automatically.
- --whisper-model-path: Path to the local faster-whisper model directory. Default: models/faster-whisper-large-v3.
- --diarization-model-path: Path to the local pyannote diarization checkout. Default: models/pyannote-speaker-diarization-community-1.
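The parameter list above maps naturally onto an argparse definition. The sketch below mirrors the documented names and defaults; the help strings and the `build_parser` name are illustrative, and the real main.py may be organized differently.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented main.py parameters."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("media_path", type=Path,
                        help="Path to the input audio or video file.")
    parser.add_argument("-o", "--output-dir", type=Path, default=None,
                        help="Directory for the .srt and .txt files.")
    parser.add_argument("--language", default=None,
                        help="Optional Whisper language hint such as en or de.")
    parser.add_argument("--whisper-model-path", type=Path,
                        default=Path("models/faster-whisper-large-v3"),
                        help="Local faster-whisper model directory.")
    parser.add_argument("--diarization-model-path", type=Path,
                        default=Path("models/pyannote-speaker-diarization-community-1"),
                        help="Local pyannote diarization checkout.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Leaving `--language` and `--output-dir` at `None` lets later code tell "not provided" apart from an explicit value, which is what the automatic language detection and the next-to-input default rely on.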
Examples:

    uv run python .\main.py "C:\media\sample.mp3" --language de

    uv run python .\main.py "C:\media\sample.mp4" --output-dir "C:\media\out"

    uv run python .\main.py "C:\media\sample.wav" --whisper-model-path "D:\models\faster-whisper-large-v3" --diarization-model-path "D:\models\pyannote-speaker-diarization-community-1"

These parameters are supported by setup_models.py:
- --models-dir: Base directory where the local model folders are created. Default: models.
- --hf-token: Optional Hugging Face token. If omitted, the script checks environment variables and .env, then prompts interactively if needed.
- --force: Re-downloads Whisper and re-clones the pyannote checkout even if model folders already exist.
- --skip-whisper: Skip the Whisper model download step.
- --skip-diarization: Skip the pyannote clone step.
Examples:

    uv run python .\setup_models.py --force

    uv run python .\setup_models.py --skip-whisper

    uv run python .\setup_models.py --models-dir "D:\models"

The .srt file contains:
- subtitle timestamps in SRT format
- plain transcript text
- [Simultaneous speech] prefix only when overlap is detected

The .txt file contains:
- plain transcript text, one segment per line
- [Simultaneous speech] prefix only when overlap is detected
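For reference, an SRT cue consists of a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line, and the text. The helpers below are a small sketch of that formatting with hypothetical names; they are not taken from the project.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def format_srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Render one cue: index line, time range line, then the text."""
    time_line = f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}"
    return f"{index}\n{time_line}\n{text}\n"
```

Cues are joined with a blank line between them. The .txt output would carry only the text line of each segment.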
The output does not include:
- speaker names
- detected language labels
Those are used internally only.
- ffmpeg is still required even though faster-whisper can decode many formats on its own. This project uses ffmpeg explicitly for stable normalization and to give both diarization and transcription the same audio input.
- pyannote runs on the local model checkout, not a remote inference API.
- Whisper language detection can fail on extremely short segments. In that case the code lets Whisper auto-detect instead of forcing an invalid language code.
Your pyannote checkout is incomplete. Re-run:

    uv run python .\setup_models.py --force --skip-whisper

Install ffmpeg and make sure it is available in PATH.
Use a valid Hugging Face token. The setup script can prompt for it and save it to .env.
The setup script now handles read-only Git files during cleanup. Retry the same command:

    uv run python .\setup_models.py --force