Spectator

Drop a meeting recording onto NVIDIA's Video Search & Summarization (VSS) Blueprint and get back a clean transcript or a structured summary. Spectator is a thin CLI wrapper that handles the install, deployment, and lifecycle so you can focus on work — not IT infrastructure.

TL;DR

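Just want to transcribe a meeting on a Mac? No GPU host required. Whisper runs locally; on Apple Silicon it currently runs on CPU at roughly real-time for meeting-quality audio (auto-detect skips MPS for now, see Transcribe locally on a Mac):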

git clone https://github.com/myurasov/Spectator.git spectator && cd spectator

# local audio venv (~5 min, one-time)
./spectator audio install

# auto-detects CUDA on a Linux box, otherwise runs on CPU (MPS is opt-in via --device mps)
./spectator audio transcribe meeting.mp3

Output lands at ~/.spectator/audio-out/<stem>/ (.txt, .srt, .vtt, .json, .tsv). See Transcribe locally on a Mac for the full walkthrough.

Want video summarization or Q&A? That needs the VSS Blueprint, which needs a GPU host:

git clone https://github.com/myurasov/Spectator.git spectator && cd spectator

# install local venv + deps
./spectator install

# add your GPU host to ~/.ssh/config (see step 2 below)

# from https://org.ngc.nvidia.com/setup/api-keys
export NGC_CLI_API_KEY="nvapi-..."

# from https://build.nvidia.com (Get API Key)
export NVIDIA_API_KEY="nvapi-..."

# push the tool to the GPU host (~5 min)
./spectator deploy --target <gpu-machine>

# bring up the VSS stack (~30–45 min first time)
./spectator up --target <gpu-machine>

# everyday use:

# transcribe a meeting recording (uses GPU host)
./spectator audio transcribe meeting.mp3 --target <gpu-machine>

# summarize a video
./spectator process video.mp4 --target <gpu-machine>

# ask follow-up questions about indexed videos
./spectator query "What did Alice say about ...?" --target <gpu-machine>

For details on each step, read on. For deeper reference (full SSH config, hardware profiles, all subcommands, gotchas), see REFERENCE.md.

What it does

Three pipelines, one CLI:

Audio transcription (Whisper) — upload a meeting / call / interview, get back a clean transcript with timestamps. Auto-detects bilingual recordings; quality presets for clean / standard / phone / very-noisy audio. Auto-detects the best available device (cuda > cpu; MPS is currently skipped, see Transcribe locally on a Mac), so a Mac transcribes locally on CPU without any GPU host.
Speaker diarization (pyannote.audio) — figures out who is talking and when, the question Whisper alone can't answer. Standalone (spectator audio diarize) or chained onto transcribe (--diarize); the merged output assigns each Whisper segment a speaker label via maximum-overlap voting against pyannote turns. Installed alongside Whisper in the same audio-venv; Hugging Face access is needed at run time only, not install time.
Video summarization + Q&A (VSS) — upload a recording, get back a structured summary with timestamps and action items, then ask follow-up questions in plain English. Requires the VSS Blueprint stack, which runs on an NVIDIA GPU host.

Spectator does not reimplement any of these — it automates the install, deployment, and lifecycle steps for you. For audio: device auto-detection means the same command works on your Mac and on a remote Spark. For video: your laptop drives the GPU host over SSH; the VLM and Whisper run on the GPU; the LLM is called remotely on build.nvidia.com.

What you'll need

Audio-only on a Mac (Whisper transcripts, no video / Q&A):

A macOS laptop (Apple Silicon recommended: it transcribes on CPU at about real-time, while Intel Macs run 5-15× slower than real-time) with uv installed (brew install uv).
That's it. No SSH, no API keys, no GPU host.

Full stack (audio + video summarization + Q&A):

A macOS or Linux laptop with uv installed (brew install uv on macOS, or curl -LsSf https://astral.sh/uv/install.sh | sh).
A GPU host you can SSH into. Default target is DGX Spark (GB10); the same workflow runs on H100, L40S, RTX PRO 6000, and Jetson THOR. Your team's hardware lead can point you at one.
Two NVIDIA credentials (free, takes ~2 min to set up):
- NGC API key — https://org.ngc.nvidia.com/setup/api-keys (used to pull docker images)
- NVIDIA API key — https://build.nvidia.com → "Get API Key" (used by the remote LLM endpoint VSS calls during summarization)
~50 GB free disk space on the GPU host (one-time, for the docker image cache).

Setup

1. Install Spectator on your laptop

git clone https://github.com/myurasov/Spectator.git spectator
cd spectator
./spectator install

# confirms it's working — should print a curated overview
./spectator help

2. Set up an SSH alias for your GPU host

Every command takes --target <gpu-machine>, so pick a short alias for your host and put it in ~/.ssh/config. Use whatever alias name fits: spark, dgx-1, lab-box. The minimum entry looks like this (the IP address, username, and key path are examples; substitute your own):

Host <gpu-machine>
    HostName 10.0.0.42
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519

Smoke-test it:

ssh <gpu-machine> "nvidia-smi --query-gpu=name --format=csv,noheader"

If you see your GPU's name printed back, you're good. The recommended config (connection multiplexing + keepalives — they make deploy ~10× faster and survive flaky networks) is in REFERENCE.md → SSH access. Use that on your real working setup.
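If you would rather script the smoke test, the sketch below wraps the same ssh + nvidia-smi command in Python. The function name gpu_names and the injectable run parameter are illustrative and not part of Spectator; it assumes the alias is already defined in ~/.ssh/config.

```python
import subprocess


def gpu_names(alias, run=subprocess.run):
    """Return the GPU names reported by nvidia-smi on the SSH host `alias`.

    Hypothetical helper, not part of Spectator. `run` is injectable so the
    function can be exercised without a real host.
    """
    cmd = [
        "ssh",
        alias,
        "nvidia-smi --query-gpu=name --format=csv,noheader",
    ]
    # check=True raises CalledProcessError if ssh or nvidia-smi exits non-zero.
    result = run(cmd, capture_output=True, text=True, check=True, timeout=30)
    # With csv,noheader, nvidia-smi prints one GPU name per line.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

Calling gpu_names("<gpu-machine>") with your own alias should return a non-empty list when the host is reachable and the driver is working.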

The rest of this README and REFERENCE.md use <gpu-machine> as the placeholder for whatever alias name you picked — substitute your own when you copy commands.

3. Set your API keys

Add the two API keys to your shell profile (~/.zshrc or ~/.bashrc):

export NGC_CLI_API_KEY="nvapi-..."
export NVIDIA_API_KEY="nvapi-..."

Reload (source ~/.zshrc) and confirm: echo $NGC_CLI_API_KEY should print your key.
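To confirm both keys are visible without echoing their values into your terminal history, a small check like the following can help. This is a minimal sketch; missing_api_keys is a hypothetical helper, and only the two variable names come from the steps above.

```python
import os

REQUIRED_KEYS = ("NGC_CLI_API_KEY", "NVIDIA_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required keys that are unset or blank.

    Hypothetical helper. Only names are returned, never the key values.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]
```

An empty list from missing_api_keys() means both variables are set in the current shell environment. It does not verify that the keys are valid.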

4. Deploy to your GPU host

# check driver / CUDA / docker / NGC reachability
./spectator preflight --target <gpu-machine>

# rsync + uv sync + install (~5 min)
./spectator deploy --target <gpu-machine>

If preflight flags a missing piece (e.g. user not in the docker group, or the NVIDIA Container Toolkit not registered with docker), run:

./spectator install --apply-system --target <gpu-machine>

This is the only command Spectator runs that touches anything outside ~/.spectator/ and ~/.docker/config.json on the host — it asks for sudo over SSH for each system change.

5. Bring the VSS stack up

First run pulls multi-GB images and takes 30–45 minutes. The stack runs in tmux on the host, so you can close your laptop and come back later:

./spectator up --target <gpu-machine>

# watch progress; Ctrl-C to detach (the tmux job keeps running)
./spectator logs --target <gpu-machine> --follow

# quick health check any time
./spectator status --target <gpu-machine>

When status shows the agent UI on port 3030 and the API on port 8000, you're ready to use the stack.
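For scripted readiness checks, a plain TCP probe of the two ports is one option. The sketch below is not how ./spectator status works internally; it assumes the ports are reachable from where the script runs (on the host itself, or through an SSH port forward), and an open port only shows that something is listening, not that the service is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def stack_ports_ready(host, ports=(3030, 8000)):
    """Map each port (agent UI 3030, API 8000 by default) to its open state."""
    return {port: port_is_open(host, port) for port in ports}
```

For example, stack_ports_ready("localhost") returns a dict keyed by port number, with True for each port that accepted a connection.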

Common tasks

Transcribe a meeting recording

# one-time: install Whisper on the host (separate from VSS — runs in its own venv)
./spectator audio install --target <gpu-machine>

# transcribe: uploads, runs in tmux, ~5× faster than real-time on a Spark
./spectator audio transcribe meeting.mp3 --target <gpu-machine> --quality meeting

# pull the transcript back to your laptop
./spectator audio fetch --target <gpu-machine> -o ./transcripts/

Quality presets:

Preset	Use case
`studio`	Clean studio mic, podcast feed
`meeting` (default)	Typical video-conferencing recordings (clear voice, mixed quality)
`phone`	Voice-coded / low-bitrate phone calls
`extreme`	Distant mic, lots of noise, heavy crosstalk

For non-English recordings, bilingual / code-switched audio, or "translate to English" mode, see REFERENCE.md → Audio language handling.

Add speaker labels (diarization)

Diarization figures out who is talking, which Whisper alone can't tell you. Spectator drives pyannote.audio in the same audio-venv as Whisper.

The audio-venv install (./spectator audio install) bundles pyannote by default, so no Hugging Face account is required to install. You only need a Hugging Face token when you actually run diarization: pyannote downloads the model weights from HF at first use. One-time setup, only needed before the first diarize call:

# 1. Accept the model licenses in the HF web UI. Pyannote's gate is a
#    multi-field form (Company, Website, Country, Use case) — not just
#    a checkbox. Fill all fields AND submit on each page. Just ticking
#    "I accept" leaves the gate locked. Three repos because 4.x reuses
#    the community-1 embedding inside every pipeline (including 3.1):
#    https://huggingface.co/pyannote/speaker-diarization-3.1
#    https://huggingface.co/pyannote/segmentation-3.0
#    https://huggingface.co/pyannote/speaker-diarization-community-1
#
# 2. Create a read-scope token at https://huggingface.co/settings/tokens
#
# 3. Pass the token once with --hf-token; it's persisted to
#    $workdir/.creds on the target so subsequent runs don't need it.
./spectator audio diarize meeting.mp3 --target <gpu-machine> \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

After that, no --hf-token needed:

# standalone diarize (writes <stem>.diar.{rttm,json})
./spectator audio diarize meeting.mp3 --target <gpu-machine>

# chained: whisper + diarize + merge in one tmux session
./spectator audio transcribe meeting.mp3 --target <gpu-machine> --diarize

The chained form produces <stem>.diarized.{json,txt} alongside the regular Whisper output. Each Whisper segment gets a speaker field via maximum-overlap voting against pyannote's turns. Use --num-speakers N if you know the count in advance.
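The merge step can be pictured with a short sketch. This is not Spectator's code: the function names, the input shapes (segments as dicts with start/end/text, turns as (start, end, speaker) tuples), and the choice to sum overlap per speaker before picking the winner are all assumptions made for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker who overlaps it most.

    segments: list of dicts with "start", "end", "text" (times in seconds).
    turns:    list of (start, end, speaker) tuples from diarization.
    """
    labeled = []
    for seg in segments:
        totals = {}
        for t_start, t_end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], t_start, t_end)
            if shared > 0:
                # Accumulate overlap per speaker across all of their turns.
                totals[speaker] = totals.get(speaker, 0.0) + shared
        # None marks a segment that no diarization turn overlaps.
        best = max(totals, key=totals.get) if totals else None
        labeled.append({**seg, "speaker": best})
    return labeled
```

A segment that straddles a speaker change goes to whichever speaker covers more of it, which is why short interjections inside a long segment can be absorbed into the dominant speaker's label.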

On a DGX Spark (GB10), diarization runs ~50-150× faster than real-time — a 1 h recording diarizes in roughly 30-60 s once the model is loaded. See REFERENCE.md → audio diarize for the full reference.

Transcribe locally on a Mac (no GPU host needed)

Apple Silicon Macs can transcribe audio entirely locally — no SSH, no remote host. Spectator auto-detects the torch device and uses CPU on Apple Silicon by default (see the MPS note below for why):

# one-time: install Whisper + torch into a local audio-venv (~5 min)
./spectator audio install

# transcribe with auto-detected device (cuda > cpu; mps skipped — see below)
./spectator audio transcribe meeting.mp3 --quality meeting

# force a specific device if needed
./spectator audio transcribe meeting.mp3 --device cpu     # explicit CPU
./spectator audio transcribe meeting.mp3 --device mps     # opt-in to MPS (see caveat)

Output lands at ~/.spectator/audio-out/<stem>/ (.txt, .srt, .vtt, .json, .tsv).
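If a script needs to locate the results, the expected paths can be derived from the recording name. The sketch below assumes each file is named <stem> plus its extension inside the <stem>/ directory; that naming is a guess based on the layout described above, so check it against a real run. expected_outputs is a hypothetical helper.

```python
from pathlib import Path

OUTPUT_SUFFIXES = (".txt", ".srt", ".vtt", ".json", ".tsv")


def expected_outputs(recording, out_root=None):
    """Return the transcript paths expected for `recording`.

    Assumes files are named <stem><suffix> inside <out_root>/<stem>/.
    """
    if out_root is None:
        out_root = Path.home() / ".spectator" / "audio-out"
    stem = Path(recording).stem
    out_dir = Path(out_root) / stem
    return [out_dir / f"{stem}{suffix}" for suffix in OUTPUT_SUFFIXES]
```

Filtering the result with Path.exists() gives a quick way to see which formats a run actually produced.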

Rough performance on a 1-hour meeting-quality recording:

Hardware	Real-time factor
Apple Silicon via CPU	real-time to 2× slower
Apple Silicon (M-series) via MPS	2-4× faster (when working — see below)
Intel Mac via CPU	5-15× slower than real-time
NVIDIA GPU via CUDA	10-30× faster than real-time
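To turn the factors in the table into wall-clock estimates, divide the audio length by a "faster" factor or multiply it by a "slower" one. The helper below is a hypothetical illustration of that arithmetic for a 1-hour recording, using the table's own ranges.

```python
def wall_clock_seconds(audio_seconds, factor, slower=False):
    """Estimate processing time from a real-time factor.

    factor=10 means "10x faster than real-time"; pass slower=True
    to read it as "10x slower than real-time" instead.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds * factor if slower else audio_seconds / factor


ONE_HOUR = 3600  # seconds of audio

# NVIDIA GPU via CUDA: 10-30x faster than real-time.
cuda_best = wall_clock_seconds(ONE_HOUR, 30)   # 120.0 s (2 min)
cuda_worst = wall_clock_seconds(ONE_HOUR, 10)  # 360.0 s (6 min)

# Intel Mac via CPU: 5-15x slower than real-time.
intel_best = wall_clock_seconds(ONE_HOUR, 5, slower=True)    # 18000 s (5 h)
intel_worst = wall_clock_seconds(ONE_HOUR, 15, slower=True)  # 54000 s (15 h)
```

These are rough bounds derived from the table, not measurements; model load time and upload time are not included.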

Why MPS isn't the auto-detect default: openai-whisper × torch ≥ 2.x is currently broken on Apple Silicon GPU for every Whisper model Spectator could ship. Tracked upstream as openai/whisper#2151.

- large-v3 / large-v3-turbo (Spectator's preset models) crash with "Cannot convert a MPS Tensor to float64"; small does the same.
- base hits a different -inf logits failure.
- medium exits zero but writes an empty transcript.

v0.4.1's auto-detect skips MPS and falls back to CPU. The override knobs (--device mps, SPECTATOR_ALLOW_MPS_AUTO=1) are still there, but they currently produce crashes or empty output regardless of --model. See REFERENCE.md → MPS limitation for the full per-model breakdown.

CPU on Apple Silicon works fine (~real-time for meeting-quality recordings); CUDA via --target <gpu-machine> is dramatically faster (10-30×) for anything heavier.
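The selection order described above (cuda first, MPS only on explicit opt-in, CPU otherwise) can be sketched as follows. This is an illustration, not Spectator's implementation: pick_device is a hypothetical name, and treating the value "1" as the only opt-in for SPECTATOR_ALLOW_MPS_AUTO is an assumption. Availability is passed in as booleans so the sketch runs without torch installed.

```python
import os


def pick_device(cuda_available, mps_available, env=None):
    """Choose a torch device name: cuda > (opt-in mps) > cpu."""
    env = os.environ if env is None else env
    if cuda_available:
        return "cuda"
    # MPS is skipped unless the user explicitly opts in via the env var.
    if mps_available and env.get("SPECTATOR_ALLOW_MPS_AUTO") == "1":
        return "mps"
    return "cpu"
```

With torch installed, the two flags would typically come from torch.cuda.is_available() and torch.backends.mps.is_available().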

Summarize a video

The video has to live on the GPU host (where VSS can read it). The simplest workflow: SSH in, drop the file under ~/.spectator/, run process:

ssh <gpu-machine>
cd ~/.spectator/Spectator
./spectator process /home/ubuntu/.spectator/meeting.mp4 \
    --prompt "Summarize the meeting; list action items with timestamps."

--prompt is free-form. Useful starters:

"List every action item with the owner's name and a timestamp."
"Summarize each slide as a single bullet."
"Pull out every customer requirement, grouped by topic."

Ask follow-up questions

Once a video is processed (and indexed), you can ask questions about it from any terminal:

./spectator query "What did Alice say about Isaac Sim?" --target <gpu-machine>

Open the VSS agent's web UI

./spectator ui --target <gpu-machine>

The command prints an ssh -L recipe — paste it into another terminal, then open http://localhost:3030 in your browser. The VSS agent UI gives you drag-and-drop upload, video Q&A, and a timeline view.

Use the Spectator Web UI (v0.2.0+)

Spectator also ships its own persistent web UI that wraps the CLI: drag-drop upload, live progress with rt-factor + ETA, VSS lifecycle controls, output download, and a query box for both video (VSS) and audio (transcript) Q&A. Localhost-only by default.

# start the server (runs detached; survives shell exit)
./spectator ui-server start [--port 7777] [--target <gpu-machine>]

# open in your browser
open http://localhost:7777/

# tail the server log
./spectator ui-server logs --follow

# stop when done
./spectator ui-server stop

Pass --target <gpu-machine> if you want jobs submitted via the UI to run on a remote host. Without it, all jobs run locally on this machine. See REFERENCE.md → Web UI for the full HTTP / WebSocket API surface.

Tear down when you're done

Stops the docker stack and frees the GPU. The image cache stays, so the next up is fast:

./spectator down --target <gpu-machine>

Where to go next

REFERENCE.md covers everything else:

Full SSH config (multiplexing, keepalives, multi-host setup)
Architecture, topology, and containment policy (what writes where, what doesn't)
Full subcommand reference table
Hardware profiles (H100 / L40S / RTX PRO 6000 / Jetson THOR)
Audio language handling (bilingual, translate-to-English)
Self-hosted LLM endpoints (override the default build.nvidia.com)
Notes & caveats (cloud-synced filesystem interactions, port conflicts, bring-up timing)
Iterative development (the rsync-only flow for code edits)
Project layout

For AI agents / IDE assistants working in the codebase, see AGENTS.md.

Contributing

Issues and PRs welcome at https://github.com/myurasov/Spectator. See CONTRIBUTING.md for the contribution workflow, coding conventions, and the DCO sign-off requirement (git commit -s).

Dev workflow at a glance:

# rebuild the venv from a clean state
./spectator install --force

# pytest
./spectator test

# ruff check
./spectator lint

# ruff check --fix + ruff format
./spectator fmt

For security-sensitive issues, please follow the responsible-disclosure process in SECURITY.md (do not open a public issue). Third-party dependencies and their licenses are listed in THIRD_PARTY_NOTICES.md.
