YouTube visual analysis pipeline — chapter-aware frame extraction, OCR-first slide preservation, quality filtering, and LLM-ready manifest synthesis.
The engine still lives in the yt_vision_v2 package during migration, but the primary tool name is now yt-vision-pro. The legacy ytv2 console alias is still available.
- Downloads a YouTube video + captions via yt-dlp
- Parses chapters (or generates synthetic 15-min chunks for unchaptered videos)
- Detects scene boundaries with ContentDetector or AdaptiveDetector
- Extracts scene-start frames plus optional within-scene samples
- Runs OCR on each frame before deduplication
- Filters black/blurry frames and deduplicates with slide-aware pHash thresholds
- Aligns captions to frames (YouTube VTT or whisper fallback)
- Generates chunked Markdown manifests with density metadata for LLM synthesis
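
The synthetic chunking mentioned in the list above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the package's actual code: the `Chunk` dataclass, the `synthetic_chunks` function, and the "Part N" titles are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int
    title: str
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def synthetic_chunks(duration: float, chunk_seconds: float = 15 * 60) -> list[Chunk]:
    """Split [0, duration) into fixed-length chunks; the last one may be shorter.

    Hypothetical helper: the real pipeline may name or align chunks differently.
    """
    if duration <= 0:
        return []
    chunks: list[Chunk] = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        chunks.append(Chunk(len(chunks), f"Part {len(chunks) + 1}", start, end))
        start = end
    return chunks
```

Under these assumptions, a 2 h 20 min video (8400 s) yields ten chunks, the last covering only the final five minutes, while a 6 min video yields a single chunk.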
Video from: https://www.youtube.com/watch?v=24t04HzoIXY (2 h 20 min video). Rough estimate: ~100-120K tokens for the full deep-research run.

Video from: https://youtu.be/KZPo15M2DbM (6 min video). Rough estimate: ~50-70K tokens for the full deep-research run.

- Python 3.10+
- ffmpeg on PATH (winget install ffmpeg on Windows)
pip install -e ".[dev]"
# Optional: whisper fallback for videos without captions
pip install -e ".[whisper]"

# Basic: process a YouTube video
yt-vision-pro <youtube-url>
# Custom cache directory
yt-vision-pro <youtube-url> --cache-dir ./my-cache
# Skip OCR (faster)
yt-vision-pro <youtube-url> --no-ocr
# Skip quality filters
yt-vision-pro <youtube-url> --no-filter
# Use the adaptive detector instead of content-based detection
yt-vision-pro <youtube-url> --detector adaptive
# Sample the first hour densely for lecture-heavy videos
yt-vision-pro <youtube-url> --dense-until 01:00:00
# Force specific chapters to high density
yt-vision-pro <youtube-url> --dense-chapters 0,1,2
# Re-run from scratch
yt-vision-pro <youtube-url> --force
# Resume from a specific stage (fetch, extract, ocr, dedup-with-ocr-context, align, manifest)
yt-vision-pro <youtube-url> --from-stage dedup-with-ocr-context
# Legacy alias still works
ytv2 <youtube-url>

- high: 3s within-scene sampling, loose near-duplicate removal, strongest slide preservation
- normal: 5s within-scene sampling, balanced deduplication
- low: 15s within-scene sampling, aggressive deduplication for conversational videos
Use --density to set the default tier, --dense-chapters to promote specific chapter indices, and --dense-until to promote everything before a time cutoff.
| Stage | Description | Sentinel |
|---|---|---|
| fetch | Download video, captions, info.json via yt-dlp | .stages/fetch.done |
| extract | Scene detection + raw frame extraction | .stages/extract.done |
| ocr | RapidOCR on each frame | .stages/ocr.done |
| dedup-with-ocr-context | Quality filtering + slide-aware dedup | .stages/dedup-with-ocr-context.done |
| align | Parse captions (YouTube VTT or whisper fallback) | .stages/align.done |
| manifest | Generate chunked Markdown manifests | .stages/manifest.done |
Each stage writes a sentinel file. On re-run, completed stages are skipped. Use --force to clear all sentinels or --from-stage <name> to re-run from a specific point.
- Single-chapter videos: cache/manifest.md
- Multi-chapter videos: cache/manifests/manifest-00-intro.md, etc.
Feed the manifest(s) to an LLM (Copilot Chat, Claude Code) for synthesis into research notes.
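
If you want to gather the manifests programmatically before handing them to an LLM, something like the following sketch may help. It assumes the output layout described above and that per-chapter files sort correctly by their numeric prefix; `collect_manifests` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def collect_manifests(cache_dir: Path) -> list[Path]:
    """Return manifest files in reading order for a given cache directory.

    Assumption: a single-chapter run writes manifest.md, while a multi-chapter
    run writes manifests/manifest-NN-<slug>.md files.
    """
    single = cache_dir / "manifest.md"
    if single.exists():
        return [single]
    # Zero-padded indices make lexical order match chapter order.
    return sorted((cache_dir / "manifests").glob("manifest-*.md"))


def combined_prompt(cache_dir: Path) -> str:
    """Concatenate all manifests into one text blob, separated by blank lines."""
    return "\n\n".join(p.read_text(encoding="utf-8") for p in collect_manifests(cache_dir))
```

For long videos it may be better to feed the manifests one at a time rather than concatenated, since the combined text can be large.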
pytest tests/ -v