Spectator

Drop a meeting recording onto NVIDIA's Video Search & Summarization (VSS) Blueprint and get back a clean transcript or a structured summary. Spectator is a thin CLI wrapper that handles the install, deployment, and lifecycle so you can focus on work — not IT infrastructure.

Copyright (c) 2026 Mikhail Yurasov. Licensed under the Apache License 2.0.

TL;DR

Just want to transcribe a meeting on a Mac? No GPU host required. Whisper runs locally on the CPU, at roughly real-time speed on Apple Silicon (MPS is opt-in for now; see the MPS note under Transcribe locally on a Mac):

git clone https://github.com/myurasov/Spectator.git spectator && cd spectator

# local audio venv (~5 min, one-time)
./spectator audio install

# auto-detects CUDA when available, otherwise falls back to CPU (MPS is opt-in: --device mps)
./spectator audio transcribe meeting.mp3

Output lands at ~/.spectator/audio-out/<stem>/ (.txt, .srt, .vtt, .json, .tsv). See Transcribe locally on a Mac for the full path.

Want video summarization or Q&A? That needs the VSS Blueprint, which needs a GPU host:

git clone https://github.com/myurasov/Spectator.git spectator && cd spectator

# install local venv + deps
./spectator install

# add your GPU host to ~/.ssh/config (see step 2 below)

# from https://org.ngc.nvidia.com/setup/api-keys
export NGC_CLI_API_KEY="nvapi-..."

# from https://build.nvidia.com (Get API Key)
export NVIDIA_API_KEY="nvapi-..."

# push the tool to the GPU host (~5 min)
./spectator deploy --target <gpu-machine>

# bring up the VSS stack (~30–45 min first time)
./spectator up --target <gpu-machine>

# everyday use:

# transcribe a meeting recording (uses GPU host)
./spectator audio transcribe meeting.mp3 --target <gpu-machine>

# summarize a video
./spectator process video.mp4 --target <gpu-machine>

# ask follow-up questions about indexed videos
./spectator query "What did Alice say about ...?" --target <gpu-machine>

For details on each step, read on. For deeper reference (full SSH config, hardware profiles, all subcommands, gotchas), see REFERENCE.md.

What it does

Three pipelines, one CLI:

  • Audio transcription (Whisper): upload a meeting / call / interview, get back a clean transcript with timestamps. Auto-detects bilingual recordings; quality presets for clean / standard / phone / very-noisy audio. Auto-detects the best available device (cuda > cpu; mps is opt-in because of an upstream bug), so a Mac transcribes locally on the CPU at roughly real-time speed without any GPU host.
  • Speaker diarization (pyannote.audio): figures out who is talking and when, the question Whisper alone can't answer. Standalone (spectator audio diarize) or chained onto transcribe (--diarize); the merged output assigns each Whisper segment a speaker label via maximum-overlap voting against pyannote turns. Installed alongside Whisper in the same audio-venv; Hugging Face access is needed at run time only, not install time.
  • Video summarization + Q&A (VSS): upload a recording, get back a structured summary with timestamps and action items, then ask follow-up questions in plain English. Requires the VSS Blueprint stack, which runs on an NVIDIA GPU host.

Spectator does not reimplement any of these — it automates the install, deployment, and lifecycle steps for you. For audio: device auto-detection means the same command works on your Mac and on a remote Spark. For video: your laptop drives the GPU host over SSH; the VLM and Whisper run on the GPU; the LLM is called remotely on build.nvidia.com.

What you'll need

Audio-only on a Mac (Whisper transcripts, no video / Q&A):

  • A macOS laptop (Apple Silicon recommended; Intel works too, but its CPU is considerably slower) with uv installed (brew install uv).
  • That's it. No SSH, no API keys, no GPU host.

Full stack (audio + video summarization + Q&A):

  • A macOS or Linux laptop with uv installed (brew install uv on macOS, or curl -LsSf https://astral.sh/uv/install.sh | sh).
  • A GPU host you can SSH into. Default target is DGX Spark (GB10); the same workflow runs on H100, L40S, RTX PRO 6000, and Jetson THOR. Your team's hardware lead can point you at one.
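  • Two NVIDIA credentials (free, takes ~2 min to set up): an NGC API key (NGC_CLI_API_KEY) from https://org.ngc.nvidia.com/setup/api-keys, and a build.nvidia.com API key (NVIDIA_API_KEY) from https://build.nvidia.com (Get API Key).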
  • ~50 GB free disk space on the GPU host (one-time, for the docker image cache).
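
To check the disk requirement before deploying, here is a minimal sketch that can be run on the GPU host. The helper name has_free_gb is hypothetical (it is not a Spectator command), and checking $HOME is an assumption: Docker may keep its image cache on a different filesystem, such as the one holding /var/lib/docker.

```shell
# Hypothetical helper, not part of Spectator.
# has_free_gb <path> <gb>: succeed if the filesystem holding <path>
# has at least <gb> GiB available.
has_free_gb() {
    # df -P prints one POSIX-format line per filesystem; -k reports KiB.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

if has_free_gb "$HOME" 50; then
    echo "at least 50 GiB free"
else
    echo "less than 50 GiB free"
fi
```

Pass whichever path actually backs the Docker image cache on your host.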

Setup

1. Install Spectator on your laptop

git clone https://github.com/myurasov/Spectator.git spectator
cd spectator
./spectator install

# confirms it's working — should print a curated overview
./spectator help

2. Set up an SSH alias for your GPU host

Every command takes --target <gpu-machine>, so pick a short alias for your host and put it in ~/.ssh/config. Use whatever alias name fits, such as spark, dgx-1, or lab-box. The minimum entry looks like this (the IP address, username, and key path below are examples; substitute your own):

Host <gpu-machine>
    HostName 10.0.0.42
    User ubuntu
    IdentityFile ~/.ssh/id_ed25519

Smoke-test it:

ssh <gpu-machine> "nvidia-smi --query-gpu=name --format=csv,noheader"

If you see your GPU's name printed back, you're good. The recommended config (connection multiplexing + keepalives — they make deploy ~10× faster and survive flaky networks) is in REFERENCE.md → SSH access. Use that on your real working setup.

The rest of this README and REFERENCE.md use <gpu-machine> as the placeholder for whatever alias name you picked — substitute your own when you copy commands.

3. Set your API keys

Add the two API keys to your shell profile (~/.zshrc or ~/.bashrc):

export NGC_CLI_API_KEY="nvapi-..."
export NVIDIA_API_KEY="nvapi-..."

Reload (source ~/.zshrc) and confirm: echo $NGC_CLI_API_KEY should print your key.
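
Echoing a key leaves the secret in your terminal scrollback. As an alternative, here is a minimal sketch that reports only whether each variable is set. The check_keys helper is hypothetical and not part of Spectator.

```shell
# Hypothetical helper (not part of Spectator): report whether each key
# is set without printing its value.
check_keys() {
    missing=0
    for name in NGC_CLI_API_KEY NVIDIA_API_KEY; do
        # eval-based indirect lookup keeps this POSIX sh compatible
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name: set"
        else
            echo "$name: MISSING"
            missing=1
        fi
    done
    return "$missing"
}

check_keys || echo "export the missing keys, then reload your shell profile"
```

The function returns non-zero when either key is missing, so it can also gate a script that goes on to call deploy or up.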

4. Deploy to your GPU host

# check driver / CUDA / docker / NGC reachability
./spectator preflight --target <gpu-machine>

# rsync + uv sync + install (~5 min)
./spectator deploy --target <gpu-machine>

If preflight flags a missing piece (e.g. user not in the docker group, or the NVIDIA Container Toolkit not registered with docker), run:

./spectator install --apply-system --target <gpu-machine>

This is the only command Spectator runs that touches anything outside ~/.spectator/ and ~/.docker/config.json on the host — it asks for sudo over SSH for each system change.

5. Bring the VSS stack up

First run pulls multi-GB images and takes 30–45 minutes. The stack runs in tmux on the host, so you can close your laptop and come back later:

./spectator up --target <gpu-machine>

# watch progress; Ctrl-C to detach (the tmux job keeps running)
./spectator logs --target <gpu-machine> --follow

# quick health check any time
./spectator status --target <gpu-machine>

When status shows the agent UI on port 3030 and the API on port 8000, you're ready to use the stack.

Common tasks

Transcribe a meeting recording

# one-time: install Whisper on the host (separate from VSS — runs in its own venv)
./spectator audio install --target <gpu-machine>

# transcribe — uploads, runs in tmux, ~5× real-time on a Spark
./spectator audio transcribe meeting.mp3 --target <gpu-machine> --quality meeting

# pull the transcript back to your laptop
./spectator audio fetch --target <gpu-machine> -o ./transcripts/

Quality presets:

Preset    Use case
studio    Clean studio mic, podcast feed
meeting   (default) Typical video-conferencing recordings (clear voice, mixed quality)
phone     Voice-coded / low-bitrate phone calls
extreme   Distant mic, lots of noise, heavy crosstalk

For non-English recordings, bilingual / code-switched audio, or "translate to English" mode, see REFERENCE.md → Audio language handling.

Add speaker labels (diarization)

Diarization figures out who is talking, which Whisper alone can't tell you. Spectator drives pyannote.audio in the same audio-venv as Whisper.

The audio-venv install (./spectator audio install) bundles pyannote by default, and no Hugging Face account is required to install it. You only need a Hugging Face token when you actually run diarization, because pyannote downloads the model weights from HF on first use. The following one-time setup is only needed before the first diarize call:

# 1. Accept the model licenses in the HF web UI. Pyannote's gate is a
#    multi-field form (Company, Website, Country, Use case) — not just
#    a checkbox. Fill all fields AND submit on each page. Just ticking
#    "I accept" leaves the gate locked. Three repos because 4.x reuses
#    the community-1 embedding inside every pipeline (including 3.1):
#    https://huggingface.co/pyannote/speaker-diarization-3.1
#    https://huggingface.co/pyannote/segmentation-3.0
#    https://huggingface.co/pyannote/speaker-diarization-community-1
#
# 2. Create a read-scope token at https://huggingface.co/settings/tokens
#
# 3. Pass the token once with --hf-token; it's persisted to
#    $workdir/.creds on the target so subsequent runs don't need it.
./spectator audio diarize meeting.mp3 --target <gpu-machine> \
    --hf-token hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

After that, no --hf-token needed:

# standalone diarize (writes <stem>.diar.{rttm,json})
./spectator audio diarize meeting.mp3 --target <gpu-machine>

# chained: whisper + diarize + merge in one tmux session
./spectator audio transcribe meeting.mp3 --target <gpu-machine> --diarize

The chained form produces <stem>.diarized.{json,txt} alongside the regular Whisper output. Each Whisper segment gets a speaker field via maximum-overlap voting against pyannote's turns: the speaker whose turns overlap the segment for the longest total time wins. Use --num-speakers N if you know the count in advance.
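
To make maximum-overlap voting concrete, here is a small sketch using awk. It illustrates the technique only and is not Spectator's implementation. The assign_speakers name, the two plain-text input formats, and the UNKNOWN fallback label are all assumptions made for this example.

```shell
# Illustration only: the function name and file formats are assumptions.
# turns file:    "<start> <end> <speaker>" per line (diarization turns)
# segments file: "<start> <end> <text...>" per line (transcript segments)
assign_speakers() {
    awk '
        # First file: remember every diarization turn.
        NR == FNR { ts[NR] = $1 + 0; te[NR] = $2 + 0; spk[NR] = $3; n = NR; next }
        # Second file: vote by total overlap per speaker.
        {
            split("", total)               # reset per-segment totals
            s = $1 + 0; e = $2 + 0
            best = "UNKNOWN"; best_ov = 0
            for (i = 1; i <= n; i++) {
                lo = (s > ts[i]) ? s : ts[i]
                hi = (e < te[i]) ? e : te[i]
                if (hi > lo) {
                    total[spk[i]] += hi - lo
                    if (total[spk[i]] > best_ov) {
                        best_ov = total[spk[i]]; best = spk[i]
                    }
                }
            }
            text = $0
            sub(/^[^ ]+ +[^ ]+ +/, "", text)   # strip the two timestamps
            printf "%s %s [%s] %s\n", $1, $2, best, text
        }
    ' "$1" "$2"
}
```

Call it as assign_speakers turns.txt segments.txt. A segment spanning 3.0 to 8.0 s that overlaps one speaker for 1 s and another for 4 s is labeled with the second speaker. A segment that overlaps no turn keeps the UNKNOWN label.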

On a DGX Spark (GB10), diarization runs ~50-150× faster than real-time — a 1 h recording diarizes in roughly 30-60 s once the model is loaded. See REFERENCE.md → audio diarize for the full reference.
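
As a quick sanity check on those figures, processing time is the audio duration divided by the speedup factor. The est_seconds helper below is hypothetical and exists only to show the arithmetic.

```shell
# Hypothetical helper: wall-clock estimate from a "faster than real-time" factor.
# est_seconds <audio_seconds> <speedup>
est_seconds() {
    echo $(( $1 / $2 ))      # integer division, whole seconds
}

est_seconds 3600 50     # 72  (slow end of the quoted range)
est_seconds 3600 150    # 24  (fast end of the quoted range)
```

A one-hour recording therefore lands between about 24 s and 72 s, in line with the rough 30-60 s figure. Model load time is not included.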

Transcribe locally on a Mac (no GPU host needed)

Apple Silicon Macs can transcribe audio entirely locally — no SSH, no remote host. Spectator auto-detects the torch device and uses CPU on Apple Silicon by default (see the MPS note below for why):

# one-time: install Whisper + torch into a local audio-venv (~5 min)
./spectator audio install

# transcribe with auto-detected device (cuda > cpu; mps skipped — see below)
./spectator audio transcribe meeting.mp3 --quality meeting

# force a specific device if needed
./spectator audio transcribe meeting.mp3 --device cpu     # explicit CPU
./spectator audio transcribe meeting.mp3 --device mps     # opt-in to MPS (see caveat)

Output lands at ~/.spectator/audio-out/<stem>/ (.txt, .srt, .vtt, .json, .tsv).
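
To confirm a run produced all five formats, here is a small sketch that lists the expected paths. The expected_outputs helper is hypothetical, and it assumes each file is named <stem>.<ext> inside the output directory, which the text above does not state explicitly.

```shell
# Hypothetical helper, not part of Spectator. Assumes each file is named
# <stem>.<ext> inside the <stem>/ output directory.
expected_outputs() {
    stem=$(basename "$1")
    stem=${stem%.*}          # meeting.mp3 -> meeting
    for ext in txt srt vtt json tsv; do
        echo "$HOME/.spectator/audio-out/$stem/$stem.$ext"
    done
}

# Report which of the expected files exist after a run.
expected_outputs meeting.mp3 | while read -r path; do
    if [ -f "$path" ]; then echo "ok       $path"; else echo "missing  $path"; fi
done
```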

Rough performance on a 1-hour meeting-quality recording:

Hardware                           Real-time factor
Apple Silicon via CPU              real-time to 2× slower than real-time
Apple Silicon (M-series) via MPS   2-4× faster than real-time (when working; see below)
Intel Mac via CPU                  5-15× slower than real-time
NVIDIA GPU via CUDA                10-30× faster than real-time

Why MPS isn't the auto-detect default: the combination of openai-whisper and torch ≥ 2.x is currently broken on the Apple Silicon GPU for every Whisper model Spectator could ship. large-v3 and large-v3-turbo (Spectator's preset models) crash with "Cannot convert a MPS Tensor to float64"; small does the same; base hits a different -inf logits failure; medium exits zero but writes an empty transcript. The issue is tracked upstream as openai/whisper#2151.

As of v0.4.1, auto-detect skips MPS and falls back to CPU. The override knobs (--device mps, SPECTATOR_ALLOW_MPS_AUTO=1) are still there, but they currently produce crashes or empty output regardless of --model. See REFERENCE.md → MPS limitation for the full per-model breakdown. CPU on Apple Silicon works fine (roughly real-time for meeting-quality recordings); CUDA via --target <gpu-machine> is dramatically faster (10-30× faster than real-time) for anything heavier.
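
The selection policy described above can be sketched as a small decision function. This is an illustration, not Spectator's code: the pick_device helper and its two 0/1 arguments are hypothetical, the real detection happens inside Spectator via torch, and the exact effect of SPECTATOR_ALLOW_MPS_AUTO=1 on auto-detection is assumed from the description.

```shell
# Sketch of the described priority: cuda > cpu, with mps only on opt-in.
# pick_device <has_cuda 0|1> <has_mps 0|1>   (hypothetical helper)
pick_device() {
    has_cuda=$1
    has_mps=$2
    if [ "$has_cuda" = 1 ]; then
        echo cuda
    elif [ "$has_mps" = 1 ] && [ "${SPECTATOR_ALLOW_MPS_AUTO:-0}" = 1 ]; then
        echo mps      # assumed opt-in path; currently unreliable upstream
    else
        echo cpu
    fi
}

pick_device 1 0     # NVIDIA host
pick_device 0 1     # Apple Silicon: cpu unless the opt-in variable is set
```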

Summarize a video

The video has to live on the GPU host (where VSS can read it). The simplest workflow: SSH in, drop the file under ~/.spectator/, run process:

ssh <gpu-machine>
cd ~/.spectator/Spectator
./spectator process /home/ubuntu/.spectator/meeting.mp4 \
    --prompt "Summarize the meeting; list action items with timestamps."

--prompt is free-form. Useful starters:

  • "List every action item with the owner's name and a timestamp."
  • "Summarize each slide as a single bullet."
  • "Pull out every customer requirement, grouped by topic."

Ask follow-up questions

Once a video is processed (and indexed), you can ask questions about it from any terminal:

./spectator query "What did Alice say about Isaac Sim?" --target <gpu-machine>

Open the VSS agent's web UI

./spectator ui --target <gpu-machine>

The command prints an ssh -L recipe — paste it into another terminal, then open http://localhost:3030 in your browser. The VSS agent UI gives you drag-and-drop upload, video Q&A, and a timeline view.
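
For orientation, an ssh -L recipe for this port generally has the shape produced below. This is a sketch of standard OpenSSH local port forwarding, not the exact text Spectator prints. The tunnel_cmd helper is hypothetical, and spark stands in for your own alias.

```shell
# Hypothetical helper: build a standard OpenSSH local-forward command.
# -N: run no remote command, only forward.
# -L: local_port:destination_host:destination_port
tunnel_cmd() {
    printf 'ssh -N -L %s:localhost:%s %s\n' "$2" "$2" "$1"
}

tunnel_cmd spark 3030    # prints: ssh -N -L 3030:localhost:3030 spark
```

While that ssh process runs, connections to port 3030 on your laptop are forwarded to port 3030 on the GPU host.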

Use the Spectator Web UI (v0.2.0+)

Spectator also ships its own persistent web UI that wraps the CLI: drag-and-drop upload, live progress with real-time factor and ETA, VSS lifecycle controls, output download, and a query box for both video (VSS) and audio (transcript) Q&A. It listens on localhost only by default.

# start the server (runs detached; survives shell exit)
./spectator ui-server start [--port 7777] [--target <gpu-machine>]

# open in your browser
open http://localhost:7777/

# tail the server log
./spectator ui-server logs --follow

# stop when done
./spectator ui-server stop

Pass --target <gpu-machine> if you want jobs submitted via the UI to run on a remote host. Without it, all jobs run locally on this machine. See REFERENCE.md → Web UI for the full HTTP / WebSocket API surface.

Tear down when you're done

Stops the docker stack and frees the GPU. The image cache stays, so the next up is fast:

./spectator down --target <gpu-machine>

Where to go next

REFERENCE.md covers everything else:

  • Full SSH config (multiplexing, keepalives, multi-host setup)
  • Architecture, topology, and containment policy (what writes where, what doesn't)
  • Full subcommand reference table
  • Hardware profiles (H100 / L40S / RTX PRO 6000 / Jetson THOR)
  • Audio language handling (bilingual, translate-to-English)
  • Self-hosted LLM endpoints (override the default build.nvidia.com)
  • Notes & caveats (cloud-synced filesystem interactions, port conflicts, bring-up timing)
  • Iterative development (the rsync-only flow for code edits)
  • Project layout

For AI agents / IDE assistants working in the codebase, see AGENTS.md.

Contributing

Issues and PRs welcome at https://github.com/myurasov/Spectator. See CONTRIBUTING.md for the contribution workflow, coding conventions, and the DCO sign-off requirement (git commit -s).

Dev workflow at a glance:

# rebuild the venv from a clean state
./spectator install --force

# pytest
./spectator test

# ruff check
./spectator lint

# ruff check --fix + ruff format
./spectator fmt

For security-sensitive issues, please follow the responsible-disclosure process in SECURITY.md (do not open a public issue). Third-party dependencies and their licenses are listed in THIRD_PARTY_NOTICES.md.
