A film-editor agent that logs your footage before it cuts. Local-first · GPU-free · every cut ships with receipts.
cutroom log https://youtube.com/watch?v=... # ingest + index ("log the footage")
cutroom highlights <video> -n 3 --vertical # agent finds & renders the best moments
cutroom highlights <video> --fanout # scout long video in parallel windows
cutroom highlights <video> --plan # review the edit plan before rendering
cutroom recipe podcast-shorts <video> # expert workflows as markdown files
cutroom ask <video> "what did she say about pricing?" # answers with [mm:ss] citations
cutroom chapters <video> # YouTube-ready chapter markers
cutroom cut <video> "make a 30s teaser focused on the demo failure"
cutroom cut <video> "tense, fast" --steer # type guidance mid-run to redirect it
cutroom cut <video> "60s recap" --verify # fresh-eyes critic checks every cut
cutroom cut <video> "tighter, 10s" --fork <session> # branch a session: new cut style,
# keeps the investigation already paid for
cutroom render <video> --target vertical # re-render the saved EDL, no agent run
cutroom sessions <video> # past editor sessions (resume / fork any)
cutroom checkpoints <video> # EDL undo history; restore any state
cutroom trail <video> # audit trail: every call, charge, denial
Frontier agents are still bad at long-video work — not because the models can't reason, but because the harness around them wastes the context window. On AgenticVBench (May 2026, 100 real post-production tasks), the best frontier stack scores 31% vs 88.5% for human experts, and 83% of repurposing failures are long-context information loss: the agent burns its budget dumping full transcripts and re-extracting frames, and never reaches the actual edit.
Real editors solved this problem a century ago: log the footage first (shot lists, transcripts, markers), cut from the logs, and go back to the footage only to verify. cutroom encodes that discipline as architecture:
- The agent never sees the full transcript. It gets a compact hierarchical video map (scenes ← shots ← word-timestamped transcript) plus budgeted, paged search tools.
- Every cut ships with receipts. A cut is only accepted if it cites transcript segments and frames the agent actually viewed. Renders include a human-auditable `receipts.md` with thumbnails and quoted transcript.
- An explicit budget ledger prices every tool call. The agent sees its remaining budget in every tool result and has to finish before it runs dry, by design rather than by accident.
┌───────────────────── log (once per video) ─────────────────────┐
source ──► yt-dlp/ffmpeg ──► shots (scene detect) ──► faster-whisper (word ts) │
└──► silences/loudness ──► scenes + one-liners ──► SQLite + FTS5 ◄──┘
│
┌─────────────────────── agent loop ─────────────────▼───────────┐
│ get_video_map → search_transcript → read_transcript (paged) │
│ → view_frames (it really looks) → mark_moment (with evidence) │
│ → propose_edl — every tool result carries the budget line │
└───────────────────────────────────────────────────┬──────────────┘
▼
EDL → snap to word boundaries → ffmpeg render
(9:16 crop, burned word-level captions) + receipts.md
- Local-first, GPU-free. Transcription is faster-whisper on CPU (Apple Silicon friendly); rendering is ffmpeg; the only remote calls are the agent's own reasoning (Claude via the Agent SDK — reuses your Claude Code login, no separate API key needed).
- No 8GB-GPU tower of models. One LLM provider, one SQLite file, one binary dependency (ffmpeg).
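The budget-ledger idea from the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not cutroom's actual code: `BudgetLedger`, `BudgetExhausted`, `charge`, the flat one-unit-per-character pricing, and the wording of the budget line are all hypothetical.

```python
from dataclasses import dataclass


class BudgetExhausted(Exception):
    """Raised when a tool call arrives after the budget is already spent."""


@dataclass
class BudgetLedger:
    # Hypothetical ledger: the budget is counted in characters of tool output.
    limit: int
    spent: int = 0

    @property
    def remaining(self) -> int:
        return max(self.limit - self.spent, 0)

    def charge(self, tool: str, result: str) -> str:
        """Price one tool result and append the budget line to it."""
        if self.remaining == 0:
            raise BudgetExhausted(f"{tool}: budget spent, propose an EDL now")
        self.spent += len(result)
        # Every result carries the ledger state, so the model always sees it.
        return f"{result}\n[budget: {self.remaining}/{self.limit} chars left]"


ledger = BudgetLedger(limit=100)
out = ledger.charge("search_transcript", "x" * 40)
print(out.splitlines()[-1])  # [budget: 60/100 chars left]
```

A real ledger would presumably price frames higher than text; the flat per-character charge here only keeps the sketch short.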
Run on Duck and Cover (1951, public domain, 9 min) — the instruction was "make one ~30 second teaser that opens on Bert the Turtle and ends right after the atomic flash". The agent inspected the map, read three narrow transcript spans, viewed seven frames, and proposed a 2-cut EDL totaling 30.5s — using 13.4k chars of tool budget (the full transcript alone would be ~9k, and on a 90-minute video this gap becomes the whole ballgame).
Every render ships a receipts.md:
Cut 1 — Bert the Turtle intro — animated song sequence · 00:00–00:24 Why: Teaser opener — full S1 Bert animated intro with "Dum-dum, name of Bert" song. Frames at t=10s (Bert close-up with flower) and t=19.8s (Bert in forest) confirm on-screen Bert animation throughout. Opens on natural pre-roll silence, clean scene boundary at 00:24.
> [00:05] Dum-dum, name of Bert, I'm dangerous…

+ frame thumbnails
brew install ffmpeg yt-dlp # macOS; Linux: apt install ffmpeg && pipx install yt-dlp
uv tool install cutroom # or: pipx install cutroom
cutroom --help

Or from source:
git clone https://github.com/Young-1231/cutroom && cd cutroom
uv sync
uv run cutroom --help

Requires Python 3.12+, uv, and a logged-in Claude Code (or `ANTHROPIC_API_KEY`).
Caption burn-in needs an ffmpeg built with libass. Some slim builds (including current Homebrew bottles) ship without it. cutroom detects this and automatically falls back to the bundled static-ffmpeg; to use a specific build instead, set `CUTROOM_FFMPEG=/path/to/your/ffmpeg`.
Do I need an API key? No, if you have Claude Code: cutroom reuses its login via the
Agent SDK. Otherwise set ANTHROPIC_API_KEY.
What leaves my machine? Download, transcription, indexing, and rendering are all local. The agent's reasoning runs on the Claude API, so the model sees what its tools return: the compact video map, the transcript snippets it reads, and the frames it views — never the video file or the full transcript. The budget ledger caps exactly this surface.
What does a run cost? Tool results are budgeted per run (--budget, default
60–120k chars ≈ a fraction of one context window). On the bench tasks below, real edits
consumed 12–21k chars in 16–22 turns each.
GPU? None, ever. faster-whisper runs on CPU (Apple Silicon friendly); rendering is ffmpeg.
# 1. Log the footage (downloads, transcribes, indexes — prints the video map)
uv run cutroom log "https://www.youtube.com/watch?v=..."
uv run cutroom log ./lecture.mp4 --whisper-model small
# 2. Work with it
uv run cutroom list
uv run cutroom map 3f2a91
uv run cutroom ask 3f2a91 "what was the main objection raised in the Q&A?"
uv run cutroom highlights 3f2a91 -n 3 --vertical
uv run cutroom chapters 3f2a91
uv run cutroom cut 3f2a91 "60s recap of the live demo, keep the crowd reaction"

Outputs land in `~/.cutroom/media/<id>/renders/`: the clips, their `.ass` caption files, and `receipts.md`. Open it to see why each cut exists, with thumbnails of the exact frames the agent inspected.
Ablation on His Girl Friday (1940, 92 min, famously dense dialogue — 133k chars of transcript): same questions, same model, cutroom's map+tools vs the full transcript pasted into context (what naive video agents do). Both arms answered all three questions correctly; cutroom also frame-verified its claims.
| question (92-min film) | cutroom | full-transcript baseline | saving |
|---|---|---|---|
| What does Hildy plan to do after leaving? | 15.8k chars | 133.6k chars | 8.5× |
| How does Walter get Bruce arrested? | 28.0k chars | 133.6k chars | 4.8× |
| How is Earl Williams saved at the end? | 26.6k chars | 133.6k chars | 5.0× |
The baseline's cost grows linearly with video length; cutroom's is capped by its budget
ledger no matter how long the footage is. (Small-N and self-judged — illustrative, not a
benchmark. Reproduce with uv run python scripts/ablation.py <video> "<question>";
raw outputs in docs/ablation-*.json.)
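The saving column is the baseline's character count divided by cutroom's, rounded to one decimal. A short Python check using only the figures from the table (the variable names are illustrative):

```python
BASELINE_K = 133.6  # full-transcript baseline, in thousands of chars

# cutroom's tool-result cost per question, in thousands of chars (from the table)
cutroom_k = {
    "What does Hildy plan to do after leaving?": 15.8,
    "How does Walter get Bruce arrested?": 28.0,
    "How is Earl Williams saved at the end?": 26.6,
}

savings = {q: round(BASELINE_K / k, 1) for q, k in cutroom_k.items()}
for question, factor in savings.items():
    print(f"{factor}x  {question}")
```

The baseline figure scales with transcript length, while the cutroom figures are bounded by the per-run budget, so the ratio should widen on longer footage.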
A repurpose-style scorecard checks what is mechanically falsifiable about real runs —
no LLM judge: did an EDL land, does it hit the duration/cut-count constraints, does
every cut carry receipts, and does every boundary sit within 0.5s of a natural speech
or silence edge. On Duck and Cover (all three tasks in bench/repurpose.json):
| task | ok | cuts | total | checks | budget | turns |
|---|---|---|---|---|---|---|
| teaser-30s | ✅ | 3 | 31.2s | ✓ duration, ✓ receipts, ✓ boundaries | 12,146 | 22 |
| highlights-3 | ✅ | 3 | 98.7s | ✓ n_cuts, ✓ cut_lengths, ✓ receipts, ✓ boundaries | 20,696 | 18 |
| vertical-short-20s | ✅ | 1 | 22.0s | ✓ duration, ✓ receipts, ✓ boundaries | 14,935 | 16 |
Quality judgment ("is the teaser good?") is deliberately out of scope — that needs a
benchmark's judging protocol, not a README. Reproduce with
uv run python scripts/bench_repurpose.py <video>; raw output in
docs/bench-repurpose-*.json; the bench GitHub Actions workflow runs the same
scorecard on a fresh runner against a public-domain film (manual trigger, needs an
ANTHROPIC_API_KEY secret).
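The boundary check is the least obvious of these mechanical tests, so here is a minimal sketch of how it could work. It assumes the natural edges (word starts and ends, silence starts and ends) are available as a flat list of timestamps in seconds; `nearest_edge_distance` and `boundaries_clean` are hypothetical names, not functions from `scripts/bench_repurpose.py`.

```python
from bisect import bisect_left

TOLERANCE_S = 0.5  # from the scorecard: within 0.5s of a natural edge


def nearest_edge_distance(t: float, edges: list[float]) -> float:
    """Distance from t to the closest value in a sorted, non-empty edge list."""
    i = bisect_left(edges, t)
    neighbours = edges[max(i - 1, 0): i + 1]  # the edge before t and the one at/after it
    return min(abs(t - e) for e in neighbours)


def boundaries_clean(
    cuts: list[tuple[float, float]], edges: list[float], tol: float = TOLERANCE_S
) -> bool:
    """True if every cut start and end sits within tol of some natural edge."""
    if not edges:
        return False
    ordered = sorted(edges)
    return all(
        nearest_edge_distance(t, ordered) <= tol for cut in cuts for t in cut
    )


edges = [0.0, 4.2, 9.8, 24.0]
print(boundaries_clean([(0.0, 24.3)], edges))  # True: 24.3 is 0.3s from 24.0
print(boundaries_clean([(5.0, 9.8)], edges))   # False: 5.0 is 0.8s from 4.2
```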
cutroom borrows the patterns that define this generation of agent tools (Claude Code, Codex, OpenClaw):
- Plan mode (human-in-the-loop). `--plan` makes the editor produce its cut plan (each cut's time range, reason, and cited transcript) and stop. Editing is irreversible and subjective, so you review (and tweak `edl.json`) before a single frame renders, then apply with `cutroom render <video>`.
- Recipes (skills as markdown files). An editing skill is one markdown file: frontmatter (format, budget, clip count) + a body of expert guidance. Built-ins ship in the package; drop your own into `~/.cutroom/recipes/` and they appear in `cutroom recipes` and override built-ins by name. Progressive disclosure, both ways: the agent's system prompt carries only `name: summary` lines, and a `load_recipe` tool pulls the full body on demand. In real runs the editor loads the matching recipe on turn 1 when your free-form instruction smells like one.
- Mid-run steering. With `--steer`, type a line and hit Enter while the editor works: the session is interrupted and your guidance injected, with all receipts and budget state intact. Every tool call already streams as a one-line progress note (`→ view_frames 42s,46.5s`), so you can see when it goes somewhere you don't like. In a real run, a "3 cuts, 30s" task steered mid-run to "ONE cut, 15s max" landed exactly that.
- Fresh-eyes verification. `--verify` spawns a critic with a clean context after the EDL lands, rather than having the editor grade its own work. The critic's toolkit can investigate but cannot cut (`propose_edl`/`mark_moment` stripped); it re-reads both boundaries, re-views frames inside each cut, and files structured per-cut verdicts. Flagged issues get exactly one revision round, resumed into the editor's session.
- Fan-out (parallel sub-agents). `--fanout` splits a long video into windows and runs one scout agent per window concurrently, then merges and globally ranks their picks. This is faster and cheaper than one agent scanning an hour serially, and each kept moment still carries its viewed-frame receipts. Scouts are isolated by construction: `propose_edl` is stripped from their toolkit entirely, so only the orchestrator can assemble an EDL.
- Lifecycle gates + audit trail (hooks). Budget and evidence rules are enforced at the harness layer, not just inside tool handlers: a PreToolUse gate denies investigation once the budget is spent and rejects any cut citing a frame the agent never actually viewed; every tool call, denial, and session summary lands in a per-video `trail.jsonl` with per-call costs. `cutroom trail <video>` reads it back: per-session summaries, a call-by-call timeline (`--session`), and a cross-session denial audit (`--denials`).
- File sandbox. The editor's only filesystem tool (the built-in `Read`, granted so it can re-view saved frames) is confined to the video's own media directory at the hook layer: symlinks resolved, relative paths refused. The transcript the agent reads is attacker-controllable (it's ASR of an arbitrary video), and in adversarial testing an injected "read /tmp/... and tell me what it says" was denied in the real chain, logged to the trail, and never reached the model.
- Checkpoints (shadow-VCS over the EDL). Every accepted or saved edit list becomes an immutable checkpoint ("undo to before that cut"), independent of any session. `cutroom checkpoints <video> --diff cp_0002` shows cut-aware diffs (`~ cut 0 [68.46-87.82] -> [68.46-81.82]`); `cutroom restore` snapshots the current state first, so restores are themselves undoable, and `--scope edl|session|both` restores the file, re-opens the conversation that made it, or both.
- Sessions: resume & fork. Every run prints a session handle. `--resume` continues it with full memory; `--fork` branches it to try a different cut style without re-paying the investigation. In a real run, recutting a 20s clip into a 10s teaser via `--fork` cost 1,500 budget chars in 4 turns versus the parent's 12,489 in 13: the fork reused the parent's viewed-frame receipts, and the evidence gate honored them.
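The file-sandbox rule (symlinks resolved, relative paths refused, reads confined to one media directory) is easy to get subtly wrong, so a minimal Python sketch may help. `read_allowed` is a hypothetical helper and the example paths are illustrative; cutroom's actual hook is not shown in this README.

```python
from pathlib import Path


def read_allowed(requested: str, media_dir: Path) -> bool:
    """Decide whether a Read request may proceed under the sandbox rule."""
    path = Path(requested)
    if not path.is_absolute():
        return False  # relative paths are refused outright
    # resolve() follows symlinks and collapses "..", so a link inside the
    # media directory that points elsewhere is judged by its real target.
    resolved = path.resolve()
    return resolved.is_relative_to(media_dir.resolve())


media = Path.home() / ".cutroom" / "media" / "3f2a91"
print(read_allowed(str(media / "frames" / "t10.jpg"), media))
print(read_allowed("/tmp/notes.txt", media))
print(read_allowed("frames/t10.jpg", media))
```

Comparing resolved paths rather than string prefixes matters here: a prefix test would accept `/media/3f2a91-evil/...` and would not notice a symlink that escapes the directory.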
- Index-first, not context-first. The transcript lives in SQLite+FTS5; the model gets a map and a search box, not a dump. This is the direct fix for the failure mode that dominates agentic video benchmarks.
- Evidence-gated edits. `propose_edl` rejects any cut that doesn't cite transcript segments and at least one frame the agent actually rendered to pixels. No vibes-based cutting.
- Honest budgets. Tool results are compact and paginated by construction; frames cost more than text; the ledger is visible to the model at every step.
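The evidence gate amounts to a validation pass over the proposed cuts. The sketch below shows one way such a check could look in Python. The `Cut` fields, `gate_errors`, and the error wording are assumptions made for illustration; only the rule itself (cite transcript segments and at least one frame that was actually viewed) comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Cut:
    # Hypothetical shape of one EDL entry.
    start: float
    end: float
    transcript_segments: list[int] = field(default_factory=list)  # cited segment ids
    frames: list[float] = field(default_factory=list)  # cited frame timestamps (s)


def gate_errors(cuts: list[Cut], viewed_frames: set[float]) -> list[str]:
    """Return one message per violation; an empty list means the EDL passes."""
    errors = []
    for i, cut in enumerate(cuts, start=1):
        if not cut.transcript_segments:
            errors.append(f"cut {i}: no transcript segments cited")
        if not cut.frames:
            errors.append(f"cut {i}: no frame cited")
        # Cited frames must be ones the agent really viewed during this run.
        unseen = [t for t in cut.frames if t not in viewed_frames]
        if unseen:
            errors.append(f"cut {i}: cites frames never viewed: {unseen}")
    return errors


viewed = {10.0, 19.8}
ok = Cut(0.0, 24.0, transcript_segments=[1, 2], frames=[10.0, 19.8])
bad = Cut(30.0, 40.0, transcript_segments=[7], frames=[33.0])
print(gate_errors([ok], viewed))   # []
print(gate_errors([bad], viewed))  # ['cut 1: cites frames never viewed: [33.0]']
```

Matching frames by exact timestamp keeps the sketch simple; a real gate would more likely key on frame identifiers recorded when `view_frames` ran.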
M0 — all verbs (log / list / map / ask / highlights / chapters / cut /
render / sessions / checkpoints / restore / trail) implemented and verified
end-to-end on real footage; word-level burned captions (landscape + 9:16 vertical),
adaptive scene segmentation, EDL persistence, receipts, lifecycle hooks + audit trail,
file sandbox, EDL checkpoints with 3-scope restore, session resume/fork, mid-run
steering, fresh-eyes verification, file-based recipes. 184+ offline tests + live agent
e2e runs (including adversarial gate tests), ruff-clean.
- M1: active-speaker-aware vertical crop (CPU face tracking); silence/filler-word trim presets; OTIO/EDL export for NLE handoff (DaVinci, Premiere).
- M2 (in progress): repurpose-task scorecard. Mechanical checks (duration, receipts, boundary cleanliness) over real runs via `scripts/bench_repurpose.py`, runnable in CI (`bench` workflow), published honestly, whichever way it goes.
- M3: multi-video projects ("find every claim about X across my 10 lectures").
MIT

