From 246933e9a16f931108de907fdb8ea0077cf204b0 Mon Sep 17 00:00:00 2001 From: Nika Siradze Date: Thu, 11 Jun 2026 11:46:32 +0400 Subject: [PATCH 01/41] Add parity harness to lock feature correctness through native-CLI migration Two-layer correctness model around the existing transcript JSON contract: - Layer 1 (committed, CI-ready): synthetic neutral transcript + caption goldens for all 4 styles, word-text normalization + zero-duration-floor assertions. - Layer 2 (local/gitignored): capture_baseline.py freezes current openai-whisper output; compare.py gates a candidate engine on WER + word-timestamp drift. Also adds plans/native-cli.md (full migration plan). --- plans/native-cli.md | 137 +++++++++++++++++++++++ tests/parity/.gitignore | 6 + tests/parity/README.md | 84 ++++++++++++++ tests/parity/capture_baseline.py | 110 ++++++++++++++++++ tests/parity/compare.py | 129 +++++++++++++++++++++ tests/parity/golden/branded.ass.expected | 52 +++++++++ tests/parity/golden/hormozi.ass.expected | 18 +++ tests/parity/golden/karaoke.ass.expected | 16 +++ tests/parity/golden/subtle.ass.expected | 16 +++ tests/parity/test_caption_parity.py | 110 ++++++++++++++++++ tests/parity/transcript_synthetic.json | 22 ++++ 11 files changed, 700 insertions(+) create mode 100644 plans/native-cli.md create mode 100644 tests/parity/.gitignore create mode 100644 tests/parity/README.md create mode 100644 tests/parity/capture_baseline.py create mode 100644 tests/parity/compare.py create mode 100644 tests/parity/golden/branded.ass.expected create mode 100644 tests/parity/golden/hormozi.ass.expected create mode 100644 tests/parity/golden/karaoke.ass.expected create mode 100644 tests/parity/golden/subtle.ass.expected create mode 100644 tests/parity/test_caption_parity.py create mode 100644 tests/parity/transcript_synthetic.json diff --git a/plans/native-cli.md b/plans/native-cli.md new file mode 100644 index 0000000..32b9d0e --- /dev/null +++ b/plans/native-cli.md @@ -0,0 +1,137 @@ +# podcli → Native CLI (codex-style) + +> Goal: turn podcli from a git-clone + `setup.sh` + venv/npm hybrid into a **native CLI you install once and that auto-updates** — like `openai/codex`. Users run `npm i -g podcli` (or `bun add -g podcli`) and `podcli process video.mp4` just works, everywhere, with no Python/Node/FFmpeg setup. + +## North star + +``` +npm i -g podcli # or: bun add -g podcli +podcli process pod.mp4 --top 5 + → first run: silently provisions a hermetic runtime (one time) + → 9:16 clips with burned captions +podcli # auto-updates itself on launch +``` + +No `setup.sh`. No venv. No `pip`. No `npm install` of the engine. No "is the right Python/FFmpeg on PATH?" The system environment becomes irrelevant. + +--- + +## Why this is hard (the core tension) + +codex is a single static Rust binary with **zero** runtime deps. podcli is the opposite — a **three-runtime hybrid**: + +- **Python engine** (`backend/cli.py`, ~188KB): Whisper (→ PyTorch ~2GB), OpenCV face-crop, Pillow, FFmpeg, Google API. +- **Node/TS**: MCP server (`src/server.ts`), React web UI, Remotion → headless **Chromium** (studio bookends). +- **Bash launcher** routing PodStack AI commands to Claude/Codex, everything else to Python. + +You can't fold PyTorch + Chromium + FFmpeg into one static binary. So we **package the hybrid**: a tiny native launcher that provisions and drives hermetic runtimes, and we **kill the single worst dependency (PyTorch) by swapping Whisper → whisper.cpp.** + +--- + +## Locked decisions + +| Area | Decision | +|---|---| +| **Target artifact** | Package the hybrid. Thin Go launcher provisions + drives hermetic runtimes; self-updates. Not a rewrite. | +| **Launcher language** | **Go.** One `go build` → 5 static binaries. Replaces both bash `podcli` and `install.cmd`. | +| **Runtimes** | **Fully hermetic.** Launcher downloads pinned standalone CPython, static FFmpeg, whisper.cpp (+ Node/Chromium later). System python/node/ffmpeg ignored. | +| **Transcription** | **whisper.cpp** replaces `openai-whisper`/PyTorch. GGUF models. Metal on Apple Silicon, CUDA/CPU elsewhere. ~145MB vs ~2GB. | +| **Bundle model** | Tiny launcher; **first run provisions the full core stack** (download-once, like today's `setup.sh` but automatic + cross-platform). | +| **Storage** | **Global** managed dir for runtimes + model cache (`%LOCALAPPDATA%\podcli` / `~/Library/Application Support/podcli` / `~/.local/share/podcli`). **Per-project** `.podcli/` (knowledge, output, history) stays in cwd — podcli stays project-scoped like git. | +| **Distribution** | **npm + bun only.** Thin wrapper package fetches the platform binary on install (codex-style). **No code signing, no brew/winget/curl\|sh/.exe.** | +| **Auto-update** | On launch: fast (~250ms, short-timeout) check against GitHub Releases. Newer → update then load. Offline/slow → proceed on current version (never blocks). Self-replace the managed binary in `~/.podcli/bin/`; if that's impossible, print the matching upgrade command (`npm i -g podcli` / `bun add -g podcli`). | +| **Update opt-out** | Persistent off switch: `podcli config set update.auto off` + `PODCLI_NO_UPDATE=1`. When off: no checks, runs installed version. `podcli update` still works on demand. | +| **AI features** | API key preferred → AI-CLI fallback → core works without. If a key is set, call the Claude/OpenAI API directly (self-contained); else shell to installed Claude/Codex CLI (today's behavior); else the video pipeline still works and AI features print how to enable them. | +| **Platforms** | macOS arm64, macOS x64, Linux x64, Linux arm64, Windows x64. | +| **First milestone** | **Thin vertical slice** — `process` pipeline only, fully hermetic, whisper.cpp, npm/bun, self-update, all 5 platforms. Studio / AI-API / MCP come after. | + +--- + +## Target architecture + +``` +┌─ podcli (Go launcher, ~8MB, per-platform) ──────────────────────────┐ +│ • on-launch self-update (GitHub Releases, throttle-free fast check) │ +│ • first-run provisioning → global managed dir │ +│ • subcommand routing: process/studio/thumbnails… → hermetic python │ +│ studio render → hermetic node │ +│ • config, version pinning, rollback │ +└──────────────────────────────────────────────────────────────────────┘ + │ provisions (pinned versions) + ▼ + Global managed dir (~/.local/share/podcli, etc.) + bin/ podcli- (the real engine binary, self-updatable) + runtime/ cpython-standalone/ ffmpeg whisper.cpp (+ node/ later) + models/ ggml-base-q5_1.bin … (fetched/cached) + venv/ hermetic pip env for backend/ deps (opencv, Pillow, …) + + Per-project (cwd)/.podcli/ + knowledge/ output/ history/ presets/ cache/ (unchanged) +``` + +**Subcommand routing (MVP):** `process` and friends → `runtime/cpython/python backend/cli.py …` with all paths pointing at the hermetic runtime. The Go launcher sets `PYTHON`, `FFMPEG`, model paths, and env so `cli.py` never touches the system. + +--- + +## Transcription swap (the keystone engine change) + +Clean seam: `backend/services/transcription.py::transcribe_file()` returns a fixed dict (`segments`, word-level `words`, `duration`, `language`, speaker fields). Only the engine behind it changes. + +- Replace `import whisper; model.transcribe(..., word_timestamps=True)` with a subprocess call to the vendored `whisper-cli` (whisper.cpp) emitting JSON, then map its output → the existing dict shape. +- **Validation risk to prove early:** whisper.cpp word-level timestamps must be good enough for the karaoke/word-highlight captions. Build a parity test (same clip, compare word timings old vs new) before committing. +- Diarization is already optional/off by default (Claude handles speakers; paste-transcript supports `Speaker (MM:SS)`), so it's not a blocker. +- Models: ship/fetch `ggml-base-q5_1` (~57MB) by default; allow `--model small/medium/large` to lazily fetch bigger GGUFs into `models/`. + +--- + +## Roadmap + +### Phase 0 — Foundation spike (de-risk everything) +- Go launcher skeleton: arg parse, subcommand passthrough to a hand-placed python. +- Managed-dir layout + OS-appropriate paths. +- Hermetic provisioning: download pinned **CPython standalone**, **static FFmpeg**, **whisper.cpp** binary + base-q5 model for the **current** platform; create hermetic venv; `pip install` backend deps into it. +- Prove `go run . process sample.mp4` produces a clip using **only** hermetic components (rename/hide system python+ffmpeg to verify). + +### Phase 1 — whisper.cpp engine swap +- Reimplement `transcribe_file()` on whisper.cpp behind the existing dict contract. +- Word-timestamp parity test vs `openai-whisper` on a fixture; tune `--max-len`/token-timestamp flags. +- Remove `openai-whisper` from `requirements.txt`; confirm captions (karaoke/Hormozi/subtle) still render correctly. + +### Phase 2 — Distribution + self-update (→ first installable release) +- CI matrix builds the Go launcher for all **5 targets**. +- **npm + bun wrapper package**: `postinstall` downloads the platform binary into `~/.podcli/bin/`; `bin` shim execs it. Publish to npm (bun consumes the same registry). +- GitHub Release per version carries: the 5 binaries + a **version manifest** pinning required runtime versions (so an update knows what to re-provision). +- Self-update: fast on-launch check, atomic self-replace of the managed binary, npm/bun fallback message, `PODCLI_NO_UPDATE` + `config set update.auto off`, `podcli update`, keep-previous-binary for `podcli rollback`. +- **Ship.** `npm i -g podcli` → `podcli process` works hermetic on all 5 platforms and auto-updates. ← *this is the MVP gate.* + +### Phase 3 — Lazy tiers + studio +- Demote OpenCV face-crop to lazy (center-crop default offline; fetch opencv on first smart crop — the center-crop fallback already exists at `cli.py:621`). +- Lazy bigger Whisper models. +- **Studio**: provision hermetic **Node + Remotion + Chromium** on first `studio` use; route `studio` render through it. + +### Phase 4 — AI goes native +- Port PodStack prompt files (`.claude/commands/*.md`) into the engine as direct **Claude API** calls. +- `podcli config set api-key …`; precedence: API key → installed Claude/Codex CLI → "enable AI" hint. +- Keep core pipeline fully functional without any AI. + +### Phase 5 — MCP / web UI (or deprecate) +- Decide whether the MCP server still matters once AI is native via API, or it's only for "use podcli from inside Claude/Codex." +- If kept: `podcli serve` (MCP stdio) + `podcli ui` (web dashboard) provisioned on demand via hermetic Node. + +--- + +## Risks / open questions + +- **whisper.cpp timestamp quality** for word-highlight captions — *prove in Phase 1 before deleting PyTorch path.* +- **Hermetic Node + Chromium on Windows** for studio (Phase 3) — Remotion's Chromium download + headless render is the heaviest non-ML surface; expect platform-specific pain. +- **First-run download size/time** — set expectations with a clear progress UI; cache aggressively in the global dir. +- **npm self-update vs package-manager ownership** — managed-binary-in-`~/.podcli/bin` sidesteps it; fallback message covers the rest. +- **GPU acceleration** — whisper.cpp Metal (mac) is automatic; CUDA (linux/win) needs the right prebuilt — decide CPU-only baseline + optional CUDA fetch. +- **Versioning** — single version for launcher + manifest of pinned runtime versions; SemVer; changelog drives the "update available" line. +- **MCP/web UI fate** — genuinely open; resolve at Phase 5. + +--- + +## Immediate next step + +Start **Phase 0** on a branch: stand up the Go launcher + hermetic provisioning for the current platform (darwin/arm64) and get `process` running end-to-end against only hermetic components. That single spike validates the launcher, the managed-dir model, hermetic provisioning, and the whisper.cpp integration surface in one shot. diff --git a/tests/parity/.gitignore b/tests/parity/.gitignore new file mode 100644 index 0000000..7ea24b7 --- /dev/null +++ b/tests/parity/.gitignore @@ -0,0 +1,6 @@ +# Real-audio fixtures and captured baselines may contain podcast content — +# never commit them. The committable parity surface is the synthetic transcript, +# the goldens (*.ass.expected), and the harness code. +local/ +baseline/ +candidate/ diff --git a/tests/parity/README.md b/tests/parity/README.md new file mode 100644 index 0000000..87ea258 --- /dev/null +++ b/tests/parity/README.md @@ -0,0 +1,84 @@ +# Parity harness — keeping every feature correct through the native-CLI migration + +This harness is the safety net for `plans/native-cli.md` (Go launcher, hermetic +runtimes, **whisper.cpp** replacing openai-whisper/PyTorch). Its job: prove that +swapping the transcription engine and relocating the runtime does **not** change +what podcli produces. + +## The correctness model: two layers split by a contract that already exists + +The transcript JSON `{words, segments, ...}` is already a stable, multi-producer +contract (produced by `transcribe_file`, `parse_speaker_transcript`, raw JSON +import, **and** the on-disk cache; consumed by corrections, cropping, moment +selection, and captions). That seam lets correctness decompose into two layers +that are verified independently. + +### Layer 1 — everything *downstream* of the transcript JSON +Moments → crop → captions → normalize → encode. **This code does not change** in +the migration; only the runtime relocates (system → hermetic). So the rule is +absolute: a fixed transcript must yield identical output. Any difference is a +runtime *pinning* bug, never a logic change. + +- `transcript_synthetic.json` — neutral, no podcast content; packs the word-text + edge cases (leading-space token, number+symbol, punctuation, apostrophe, + whitespace-only token, zero-duration token, speaker change). +- `test_caption_parity.py` — renders all four caption styles from that transcript + and diffs against committed goldens (`golden/*.ass.expected`). Also pins the + **word-text normalization** the whisper.cpp boundary must reproduce exactly + (the single highest-risk integration detail) and the 50ms zero-duration floor. + +Run it (fast, no media, CI-friendly): + +``` +venv/bin/python3 -m pytest tests/parity/test_caption_parity.py -q +``` + +Intentionally update goldens after a deliberate change: + +``` +UPDATE_GOLDENS=1 venv/bin/python3 -m pytest tests/parity/test_caption_parity.py -q +``` + +### Layer 2 — the engine (`audio → transcript JSON`) +The only real change. Verified by comparing the new engine's output to a frozen +openai-whisper baseline, with **forgiving** tolerances — because the caption +pipeline already runs in production on evenly-spaced *synthetic* word timings +(`transcript_parser.py:306`), so absolute timestamp fidelity is a quality nicety, +not a correctness requirement. + +1. Capture the baseline **now**, while openai-whisper still works. Drop a few + short representative clips into `tests/parity/local/` (single speaker, two + speakers, music-heavy, fast speech), then: + + ``` + venv/bin/python3 tests/parity/capture_baseline.py + ``` + + Writes `baseline//` = transcript.json + metrics.json + captions per style. + +2. Later, run whisper.cpp into `candidate//transcript.json` and gate: + + ``` + venv/bin/python3 tests/parity/compare.py + ``` + + Checks WER and word-timestamp drift (median/p95) against thresholds + (`PARITY_MAX_WER`, `PARITY_MAX_MEDIAN_DRIFT`, `PARITY_MAX_P95_DRIFT`). Nonzero + exit = regression; wire it into CI as the swap gate. + +## Why this makes "everything still works" tractable + +- **Layer 1 is identical by construction** — pinned runtime + frozen-transcript + goldens. The boring 80% can't drift silently. +- **Layer 2 is bounded against a floor the app already ships** — whisper.cpp only + has to beat evenly-spaced timings, which it does trivially. +- **The cache protects existing work** — already-transcribed videos reuse their + openai-whisper JSON, so they produce byte-identical output under the new binary. +- **Dual-engine release** (`--engine whisper-py`, planned) gives an instant + real-world fallback while whisper.cpp proves itself. + +## What is committed vs local + +Committed: this README, `transcript_synthetic.json`, `golden/*.ass.expected`, +the harness scripts. **Never committed** (`.gitignore`): `local/` fixtures, +`baseline/`, `candidate/` — they can contain podcast content. diff --git a/tests/parity/capture_baseline.py b/tests/parity/capture_baseline.py new file mode 100644 index 0000000..2fe593e --- /dev/null +++ b/tests/parity/capture_baseline.py @@ -0,0 +1,110 @@ +"""Layer-2 engine baseline capture. + +Runs the CURRENT transcription engine (openai-whisper) on real-audio fixtures +and freezes its output as ground truth. When the engine is later swapped to +whisper.cpp, `compare.py` measures the candidate against this baseline with +explicit tolerances. + +The baseline is the transcript JSON contract (`words` + `segments`) plus a +metrics summary and the captions rendered from those real words — i.e. exactly +the things whisper.cpp must reproduce. + +Usage: + venv/bin/python3 tests/parity/capture_baseline.py FIXTURE [FIXTURE ...] + venv/bin/python3 tests/parity/capture_baseline.py # scans tests/parity/local/ + +Outputs (gitignored) under tests/parity/baseline//: + transcript.json full {words, segments, duration, language} + metrics.json n_words, n_segments, duration, language, engine, model + captions_