Drop a meeting recording onto NVIDIA's Video Search & Summarization (VSS) Blueprint and get back a clean transcript or a structured summary. Spectator is a thin CLI wrapper that handles the install, deployment, and lifecycle so you can focus on work — not IT infrastructure.
Copyright (c) 2026 Mikhail Yurasov. Licensed under the Apache License 2.0.
Just want to transcribe a meeting on a Mac? No GPU host required. Apple Silicon runs Whisper locally, on CPU by default at roughly real-time speed (MPS is currently opt-in; see the MPS note further down):
git clone https://github.com/myurasov/Spectator.git spectator && cd spectator
# local audio venv (~5 min, one-time)
./spectator audio install
# auto-detects the device: CUDA when available, otherwise CPU (MPS is opt-in via --device mps)
./spectator audio transcribe meeting.mp3
Output lands at ~/.spectator/audio-out/<stem>/ (.txt, .srt, .vtt, .json, .tsv). See Transcribe locally on a Mac for the full path.
Want video summarization or Q&A? That needs the VSS Blueprint, which needs a GPU host:
git clone https://github.com/myurasov/Spectator.git spectator && cd spectator
# install local venv + deps
./spectator install
# add your GPU host to ~/.ssh/config (see step 2 below)
# from https://org.ngc.nvidia.com/setup/api-keys
export NGC_CLI_API_KEY="nvapi-..."
# from https://build.nvidia.com (Get API Key)
export NVIDIA_API_KEY="nvapi-..."
# push the tool to the GPU host (~5 min)
./spectator deploy --target <gpu-machine>
# bring up the VSS stack (~30–45 min first time)
./spectator up --target <gpu-machine>
# everyday use:
# transcribe a meeting recording (uses GPU host)
./spectator audio transcribe meeting.mp3 --target <gpu-machine>
# summarize a video
./spectator process video.mp4 --target <gpu-machine>
# ask follow-up questions about indexed videos
./spectator query "What did Alice say about ...?" --target <gpu-machine>
For details on each step, read on. For deeper reference (full SSH config, hardware profiles, all subcommands, gotchas), see REFERENCE.md.
Three pipelines, one CLI:
- Audio transcription (Whisper): upload a meeting / call / interview, get back a clean transcript with timestamps. Auto-detects bilingual recordings; quality presets for clean / standard / phone / very-noisy audio. Auto-detects the best available device (cuda > cpu; MPS is skipped for now, see the MPS note below), so a Mac transcribes locally on CPU without any GPU host.
- Speaker diarization (pyannote.audio): figures out who is talking and when, the question Whisper alone can't answer. Standalone (spectator audio diarize) or chained onto transcribe (--diarize); the merged output assigns each Whisper segment a speaker label via maximum-overlap voting against pyannote turns. Installed alongside Whisper in the same audio-venv; Hugging Face access is needed at run time only, not install time.
- Video summarization + Q&A (VSS): upload a recording, get back a structured summary with timestamps and action items, then ask follow-up questions in plain English. Requires the VSS Blueprint stack, which runs on an NVIDIA GPU host.
Spectator does not reimplement any of these — it automates the install, deployment, and lifecycle steps for you. For audio: device auto-detection means the same command works on your Mac and on a remote Spark. For video: your laptop drives the GPU host over SSH; the VLM and Whisper run on the GPU; the LLM is called remotely on build.nvidia.com.
Audio-only on a Mac (Whisper transcripts, no video / Q&A):
- A macOS laptop (Apple Silicon recommended: roughly real-time on CPU, where an Intel Mac runs 5-15× slower than real-time) with uv installed (brew install uv).
- That's it. No SSH, no API keys, no GPU host.
Full stack (audio + video summarization + Q&A):
- A macOS or Linux laptop with uv installed (brew install uv on macOS, or curl -LsSf https://astral.sh/uv/install.sh | sh).
- A GPU host you can SSH into. Default target is DGX Spark (GB10); the same workflow runs on H100, L40S, RTX PRO 6000, and Jetson THOR. Your team's hardware lead can point you at one.
- Two NVIDIA credentials (free, takes ~2 min to set up):
  - NGC API key: https://org.ngc.nvidia.com/setup/api-keys (used to pull docker images)
  - NVIDIA API key: https://build.nvidia.com → "Get API Key" (used by the remote LLM endpoint VSS calls during summarization)
- ~50 GB free disk space on the GPU host (one-time, for the docker image cache).
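One way to check the disk-space prerequisite before the first bring-up is sketched below. This is a minimal sketch, not part of Spectator: `free_gb` is a hypothetical helper that parses POSIX `df -Pk` output, and `/var/lib/docker` is Docker's default data root on Linux, which may not be where your host keeps its image cache.

```shell
# free_gb: read `df -Pk <path>` output on stdin and print the available
# space in whole GB. Hypothetical helper, not a Spectator command.
free_gb() {
  # Line 1 is the df header; on line 2, field 4 is "Available",
  # measured in 1024-byte blocks because of the -k flag.
  awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Example (assumes /var/lib/docker holds the image cache on your host):
#   ssh <gpu-machine> df -Pk /var/lib/docker | free_gb
```

A result comfortably above 50 leaves room for the first image pull.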
git clone https://github.com/myurasov/Spectator.git spectator
cd spectator
./spectator install
# confirms it's working — should print a curated overview
./spectator help
Every command takes --target <gpu-machine>, so pick a short alias for your host and put it in ~/.ssh/config. Use whatever alias name fits: spark, dgx-1, lab-box. A minimal entry (example values; the alias, IP address, username, and key will be your own):
Host <gpu-machine>
    HostName 10.0.0.42
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519
Smoke-test it:
ssh <gpu-machine> "nvidia-smi --query-gpu=name --format=csv,noheader"
If you see your GPU's name printed back, you're good. The recommended config (connection multiplexing plus keepalives, which make deploy ~10× faster and survive flaky networks) is in REFERENCE.md → SSH access. Use that on your real working setup.
The rest of this README and REFERENCE.md use <gpu-machine> as the placeholder for whatever alias name you picked — substitute your own when you copy commands.
Add the two API keys to your shell profile (~/.zshrc or ~/.bashrc):
export NGC_CLI_API_KEY="nvapi-..."
export NVIDIA_API_KEY="nvapi-..."
Reload (source ~/.zshrc) and confirm: echo $NGC_CLI_API_KEY should print your key.
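If you would rather not echo a secret into your terminal scrollback, a small check like the one below reports only whether each variable is set. It is a sketch under assumptions: `check_keys` is a hypothetical helper, not a Spectator command, and the `nvapi-` prefix test is inferred from the placeholder values shown above rather than from any documented key format.

```shell
# check_keys: report whether both NVIDIA credentials are exported,
# without printing their values. Hypothetical helper.
check_keys() {
  rc=0
  for name in NGC_CLI_API_KEY NVIDIA_API_KEY; do
    # Indirect lookup of the variable called $name (names are fixed above).
    eval "value=\${$name:-}"
    case "$value" in
      nvapi-?*) echo "$name: set" ;;
      "")       echo "$name: missing"; rc=1 ;;
      *)        echo "$name: set, but does not start with nvapi-"; rc=1 ;;
    esac
  done
  return "$rc"
}
```

Run `check_keys` after reloading your profile; a non-zero exit status means at least one key needs attention.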
# check driver / CUDA / docker / NGC reachability
./spectator preflight --target <gpu-machine>
# rsync + uv sync + install (~5 min)
./spectator deploy --target <gpu-machine>
If preflight flags a missing piece (e.g. user not in the docker group, or the NVIDIA Container Toolkit not registered with docker), run:
./spectator install --apply-system --target <gpu-machine>
This is the only command Spectator runs that touches anything outside ~/.spectator/ and ~/.docker/config.json on the host; it asks for sudo over SSH for each system change.
First run pulls multi-GB images and takes 30–45 minutes. The stack runs in tmux on the host, so you can close your laptop and come back later:
./spectator up --target <gpu-machine>
# watch progress; Ctrl-C to detach (the tmux job keeps running)
./spectator logs --target <gpu-machine> --follow
# quick health check any time
./spectator status --target <gpu-machine>
When status shows the agent UI on port 3030 and the API on port 8000, you're ready to use the stack.
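Because the first bring-up can take 30-45 minutes, you may prefer to poll rather than re-run status by hand. The sketch below is a generic retry loop, not a Spectator feature: `wait_until` is a hypothetical helper, and the commented usage assumes `nc` is installed on the host and that ports 8000 and 3030 are reachable on the host's localhost.

```shell
# wait_until <max_tries> <delay_seconds> <command...>
# Re-runs the command until it succeeds or the tries run out.
# Returns 0 on the first success, 1 if every try failed.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep between tries, but not after the last one.
    if [ "$n" -lt "$tries" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}

# Example: up to 90 tries, 30 s apart (about 45 minutes in total):
#   wait_until 90 30 ssh <gpu-machine> 'nc -z localhost 8000 && nc -z localhost 3030'
```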
# one-time: install Whisper on the host (separate from VSS — runs in its own venv)
./spectator audio install --target <gpu-machine>
# transcribe: uploads, runs in tmux, ~5× faster than real-time on a Spark
./spectator audio transcribe meeting.mp3 --target <gpu-machine> --quality meeting
# pull the transcript back to your laptop
./spectator audio fetch --target <gpu-machine> -o ./transcripts/
Quality presets:
| Preset | Use case |
|---|---|
| studio | Clean studio mic, podcast feed |
| meeting (default) | Typical video-conferencing recordings (clear voice, mixed quality) |
| phone | Voice-coded / low-bitrate phone calls |
| extreme | Distant mic, lots of noise, heavy crosstalk |
For non-English recordings, bilingual / code-switched audio, or "translate to English" mode, see REFERENCE.md → Audio language handling.
Diarization figures out who is talking, which Whisper alone can't tell you. Spectator drives pyannote.audio in the same audio-venv as Whisper.
The audio-venv install (./spectator audio install) bundles pyannote by default, so no Hugging Face account is required to install. You only need a Hugging Face token when you actually run diarization: pyannote downloads the model weights from HF at first use. One-time setup, only needed before the first diarize call:
# 1. Accept the model licenses in the HF web UI. Pyannote's gate is a
# multi-field form (Company, Website, Country, Use case) — not just
# a checkbox. Fill all fields AND submit on each page. Just ticking
# "I accept" leaves the gate locked. Three repos because 4.x reuses
# the community-1 embedding inside every pipeline (including 3.1):
# https://huggingface.co/pyannote/speaker-diarization-3.1
# https://huggingface.co/pyannote/segmentation-3.0
# https://huggingface.co/pyannote/speaker-diarization-community-1
#
# 2. Create a read-scope token at https://huggingface.co/settings/tokens
#
# 3. Pass the token once with --hf-token; it's persisted to
# $workdir/.creds on the target so subsequent runs don't need it.
./spectator audio diarize meeting.mp3 --target <gpu-machine> \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
After that, no --hf-token needed:
# standalone diarize (writes <stem>.diar.{rttm,json})
./spectator audio diarize meeting.mp3 --target <gpu-machine>
# chained: whisper + diarize + merge in one tmux session
./spectator audio transcribe meeting.mp3 --target <gpu-machine> --diarize
The chained form produces <stem>.diarized.{json,txt} alongside the regular Whisper output: each Whisper segment gets a speaker field via maximum-overlap voting with pyannote's turns. Use --num-speakers N if you know the count in advance.
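To make maximum-overlap voting concrete, here is a minimal sketch of the idea. It is not Spectator's merge code: `assign_speakers` is a hypothetical helper, the whitespace-separated input files are an illustrative stand-in for the real RTTM / JSON outputs, and the `UNKNOWN` label for segments that overlap no turn is an assumption of this sketch.

```shell
# assign_speakers <turns_file> <segments_file>
#   turns_file:    "<start> <end> <speaker>"  per line, times in seconds
#   segments_file: "<start> <end> <text...>"  per line, times in seconds
# Prints each segment prefixed with the speaker whose turns overlap it most.
# Assumes the turns file is non-empty.
assign_speakers() {
  awk '
    # First file: remember every speaker turn.
    NR == FNR { n++; ts[n] = $1 + 0; te[n] = $2 + 0; who[n] = $3; next }
    # Second file: total the overlap per speaker, keep the largest total.
    {
      split("", total)                 # reset the per-segment tally
      s = $1 + 0; e = $2 + 0
      best = "UNKNOWN"; best_overlap = 0
      for (i = 1; i <= n; i++) {
        lo = (s > ts[i]) ? s : ts[i]   # later of the two starts
        hi = (e < te[i]) ? e : te[i]   # earlier of the two ends
        if (hi > lo) {
          total[who[i]] += hi - lo
          if (total[who[i]] > best_overlap) {
            best_overlap = total[who[i]]
            best = who[i]
          }
        }
      }
      print best "\t" $0
    }
  ' "$1" "$2"
}
```

A segment that straddles a speaker change goes to whoever holds the larger share of it, so a 4.0-8.0 s segment against turns 0-5 s (speaker A) and 5-9 s (speaker B) is labelled B: 3 s of overlap against 1 s.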
On a DGX Spark (GB10), diarization runs ~50-150× faster than real-time — a 1 h recording diarizes in roughly 30-60 s once the model is loaded. See REFERENCE.md → audio diarize for the full reference.
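The arithmetic behind those figures is simply recording length divided by the speed-up factor. The sketch below uses a hypothetical helper, `est_seconds`, with integer division; it shows that the quoted 50-150× range brackets the "roughly 30-60 s" estimate for a one-hour recording.

```shell
# est_seconds <duration_seconds> <speedup>
# Estimated wall-clock time, rounded down. Hypothetical helper.
est_seconds() {
  echo $(( $1 / $2 ))
}

est_seconds 3600 50    # slow end of the quoted range: prints 72
est_seconds 3600 150   # fast end of the quoted range: prints 24
```

Model load time comes on top of this, as the text notes.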
Apple Silicon Macs can transcribe audio entirely locally — no SSH, no remote host. Spectator auto-detects the torch device and uses CPU on Apple Silicon by default (see the MPS note below for why):
# one-time: install Whisper + torch into a local audio-venv (~5 min)
./spectator audio install
# transcribe with auto-detected device (cuda > cpu; mps skipped — see below)
./spectator audio transcribe meeting.mp3 --quality meeting
# force a specific device if needed
./spectator audio transcribe meeting.mp3 --device cpu # explicit CPU
./spectator audio transcribe meeting.mp3 --device mps   # opt-in to MPS (see caveat)
Output lands at ~/.spectator/audio-out/<stem>/ (.txt, .srt, .vtt, .json, .tsv).
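A quick way to confirm that a run produced all five formats is sketched below. `list_outputs` is a hypothetical helper, and the per-file naming `<stem>.<ext>` inside the output directory is an assumption; the text above only gives the directory and the extensions.

```shell
# list_outputs <recording> [out_root]
# Reports which of the expected transcript files exist for a recording.
# Assumes files are named <stem>.<ext> inside <out_root>/<stem>/.
list_outputs() {
  stem=$(basename "$1")
  stem=${stem%.*}                          # meeting.mp3 -> meeting
  root=${2:-$HOME/.spectator/audio-out}
  for ext in txt srt vtt json tsv; do
    f="$root/$stem/$stem.$ext"
    if [ -f "$f" ]; then
      echo "ok       $f"
    else
      echo "missing  $f"
    fi
  done
}
```

For example, `list_outputs meeting.mp3` checks the default location, and a second argument points it at another output root.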
Rough performance on a 1-hour meeting-quality recording:
| Hardware | Real-time factor |
|---|---|
| Apple Silicon via CPU | real-time to 2× slower |
| Apple Silicon (M-series) via MPS | 2-4× faster (when working — see below) |
| Intel Mac via CPU | 5-15× slower than real-time |
| NVIDIA GPU via CUDA | 10-30× faster than real-time |
Why MPS isn't the auto-detect default: openai-whisper × torch ≥ 2.x is currently broken on Apple Silicon GPU for every Whisper model Spectator could ship. large-v3 / large-v3-turbo (Spectator's preset models) crash with "Cannot convert a MPS Tensor to float64"; small does the same; base hits a different -inf logits failure; medium exits zero but writes an empty transcript. Tracked upstream as openai/whisper#2151. v0.4.1's auto-detect skips MPS and falls back to CPU. The override knobs (--device mps, SPECTATOR_ALLOW_MPS_AUTO=1) are still there, but they currently produce crashes or empty output regardless of --model. See REFERENCE.md → MPS limitation for the full per-model breakdown. CPU on Apple Silicon works fine (~real-time for meeting-quality recordings); CUDA via --target <gpu-machine> is dramatically faster (10-30×) for anything heavier.
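The selection order described above can be summarized in a few lines. This is an illustrative sketch, not Spectator's implementation: `pick_device` is a hypothetical function, the availability flags are passed in by hand rather than probed from torch, and the position of MPS between CUDA and CPU when SPECTATOR_ALLOW_MPS_AUTO=1 is an assumption.

```shell
# pick_device <has_cuda: 0|1> <has_mps: 0|1>
# Prints the device that auto-detect would choose under the rules above.
pick_device() {
  if [ "$1" = "1" ]; then
    echo cuda
  elif [ "$2" = "1" ] && [ "${SPECTATOR_ALLOW_MPS_AUTO:-0}" = "1" ]; then
    # Opt-in only: MPS is skipped unless explicitly allowed.
    echo mps
  else
    echo cpu
  fi
}
```

On an Apple Silicon Mac with no opt-in, `pick_device 0 1` prints `cpu`, which matches the default behaviour; opting in currently leads to the crashes or empty output listed above.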
The video has to live on the GPU host (where VSS can read it). The simplest workflow: SSH in, drop the file under ~/.spectator/, run process:
ssh <gpu-machine>
cd ~/.spectator/Spectator
./spectator process /home/ubuntu/.spectator/meeting.mp4 \
    --prompt "Summarize the meeting; list action items with timestamps."
--prompt is free-form. Useful starters:
- "List every action item with the owner's name and a timestamp."
- "Summarize each slide as a single bullet."
- "Pull out every customer requirement, grouped by topic."
Once a video is processed (and indexed), you can ask questions about it from any terminal:
./spectator query "What did Alice say about Isaac Sim?" --target <gpu-machine>
./spectator ui --target <gpu-machine>
The ui command prints an ssh -L recipe; paste it into another terminal, then open http://localhost:3030 in your browser. The VSS agent UI gives you drag-and-drop upload, video Q&A, and a timeline view.
Spectator also ships its own persistent web UI that wraps the CLI: drag-drop upload, live progress with rt-factor + ETA, VSS lifecycle controls, output download, and a query box for both video (VSS) and audio (transcript) Q&A. Localhost-only by default.
# start the server (runs detached; survives shell exit)
./spectator ui-server start [--port 7777] [--target <gpu-machine>]
# open in your browser
open http://localhost:7777/
# tail the server log
./spectator ui-server logs --follow
# stop when done
./spectator ui-server stop
Pass --target <gpu-machine> if you want jobs submitted via the UI to run on a remote host. Without it, all jobs run locally on this machine. See REFERENCE.md → Web UI for the full HTTP / WebSocket API surface.
Stops the docker stack and frees the GPU. The image cache stays, so the next up is fast:
./spectator down --target <gpu-machine>
REFERENCE.md covers everything else:
- Full SSH config (multiplexing, keepalives, multi-host setup)
- Architecture, topology, and containment policy (what writes where, what doesn't)
- Full subcommand reference table
- Hardware profiles (H100 / L40S / RTX PRO 6000 / Jetson THOR)
- Audio language handling (bilingual, translate-to-English)
- Self-hosted LLM endpoints (override the default build.nvidia.com)
- Notes & caveats (cloud-synced filesystem interactions, port conflicts, bring-up timing)
- Iterative development (the rsync-only flow for code edits)
- Project layout
For AI agents / IDE assistants working in the codebase, see AGENTS.md.
Issues and PRs welcome at https://github.com/myurasov/Spectator. See CONTRIBUTING.md for the contribution workflow, coding conventions, and the DCO sign-off requirement (git commit -s).
Dev workflow at a glance:
# rebuild the venv from a clean state
./spectator install --force
# pytest
./spectator test
# ruff check
./spectator lint
# ruff check --fix + ruff format
./spectator fmt
For security-sensitive issues, please follow the responsible-disclosure process in SECURITY.md (do not open a public issue). Third-party dependencies and their licenses are listed in THIRD_PARTY_NOTICES.md.