-
Notifications
You must be signed in to change notification settings - Fork 0
Home
FrameRead (VideoAnalyzer) is a local, fully offline AI pipeline that transforms any video file into a comprehensive, multi-page written analysis — combining audio transcription with visual frame understanding.
| Page | Description |
|---|---|
| Home | You are here — project overview, quick start, and navigation |
| Project Specification | Full technical blueprint — architecture, pipeline stages, models, config, error handling |
| README | Quick-start guide, installation, and usage examples |
FrameRead takes a video file as input and produces:
-
A timestamped transcription of all spoken audio (via
distil-whisper/distil-large-v3) -
Detailed visual descriptions of every significant scene change (via
Qwen2-VL-2B-Instruct) -
A master synthesis — a dense, multi-page analysis that weaves together what was said and what was shown (via
qwen3.5:9b) - (Optional) A targeted answer to a specific question about the video, backed by evidence from the summary
All processing happens locally on your machine. No API keys, no cloud calls, no data leaves your device.
┌─────────────┐
│ Video File │
└──────┬──────┘
│
┌──────────────┼──────────────┐
▼ ▼
┌────────────────┐ ┌──────────────────┐
│ Audio Pipeline │ │ Video Pipeline │
│ │ │ │
│ ffmpeg → WAV │ │ OpenCV → Keyframe │
│ Whisper → Text │ │ Qwen2-VL → Desc. │
└───────┬────────┘ └────────┬─────────┘
│ │
└──────────┬──────────────────┘
▼
┌───────────────────┐
│ Synthesis (LLM) │
│ qwen3.5:9b │
│ via Ollama │
└────────┬──────────┘
│
┌───────────┴───────────┐
▼ ▼
Master Summary Prompt Answer
(always generated) (if user asked)
📐 Deep dive: See the Project Specification for the full system architecture, data flow, and module-level breakdowns.
| Dependency | Purpose | Install |
|---|---|---|
| Python ≥ 3.10 | Runtime | python.org |
| ffmpeg | Audio extraction | ffmpeg.org |
| Ollama | LLM inference (summary/Q&A) | ollama.com |
| CUDA GPU (recommended) | Accelerated inference | 8+ GB VRAM recommended |
git clone https://github.com/s-pra1ham/FrameRead.git
cd FrameRead
python -m venv venv && venv\Scripts\activate # Windows
pip install -r requirements.txt# Full summary
python run.py video.mp4
# With a question
python run.py video.mp4 --prompt "What is being demonstrated?"📘 Full usage guide: See the README for all usage modes including Python module import and Google Colab.
| Module | File | Role |
|---|---|---|
| Audio Extractor | src/audio/extractor.py |
Strips audio from video via ffmpeg into 16kHz mono WAV |
| Transcriber | src/audio/transcriber.py |
Runs distil-large-v3 via faster-whisper to produce timestamped text |
| Module | File | Role |
|---|---|---|
| Frame Extractor | src/video/frame_extractor.py |
Scene-change detection using histogram χ² + SSIM, saves keyframe JPEGs |
| Frame Analyzer | src/video/frame_analyzer.py |
Sends keyframes to Qwen2-VL-2B-Instruct (local HuggingFace) for visual description |
| Module | File | Role |
|---|---|---|
| Ollama Manager | src/llm/ollama_manager.py |
Ensures Ollama is running, pulls required models |
| Summarizer | src/llm/summarizer.py |
Generates master summary and handles prompt Q&A via qwen3.5:9b
|
| Module | File | Role |
|---|---|---|
| Hardware Detection | src/utils/hardware.py |
GPU/CPU survey → HardwareConfig dataclass |
| Logger | src/utils/logger.py |
Centralized [TIMESTAMP] [MODULE] message logging |
| Model Manager | src/utils/model_manager.py |
Whisper model cache verification & download |
| Cleanup | src/utils/cleanup.py |
TempDirManager context manager for ephemeral temp directories |
| Module | File | Role |
|---|---|---|
| Analyzer | src/analyzer.py |
Top-level pipeline orchestrator — ties all modules together |
| Config | src/config.py |
All tunable constants: thresholds, model names, prompts |
| Public API | src/__init__.py |
Exports analyze() and AnalysisResult
|
-
Model:
distil-whisper/distil-large-v3 -
Runtime:
faster-whisper(CTranslate2 backend) -
GPU:
float16· CPU:int8 -
Cache:
~/.cache/huggingface/ - Auto-downloaded on first run. Managed by
src/utils/model_manager.py.
-
Model:
Qwen/Qwen2-VL-2B-Instruct - Runtime: HuggingFace Transformers (loaded directly into GPU/CPU memory)
-
GPU dtype:
bfloat16(≥20GB VRAM) orfloat16(<20GB) - Batching: Dynamic — batch size is computed at runtime via a 3-phase VRAM probe protocol
-
Cache:
~/.cache/huggingface/hub/ -
Critical: Model is explicitly unloaded after use (
del model+torch.cuda.empty_cache()) to free VRAM for Ollama.
-
Model:
qwen3.5:9b -
Runtime: Ollama (
/api/chatendpoint) -
Context window:
32768tokens (summary) ·16384tokens (Q&A) - Auto-pulled by
src/llm/ollama_manager.pyif not found.
Rather than using hardcoded batch sizes, the vision analyzer measures actual VRAM consumption at runtime:
- Baseline — Record VRAM usage after loading model weights
- Probe — Run inference on 1 frame, measure peak VRAM delta
-
Calculate —
batch_size = floor((total - baseline - 22.5% safety buffer) / per_frame_cost)
The probe frame's result is preserved (not wasted). On CPU, batch size is always 1.
Keyframes are extracted using a dual-metric system — a frame is saved when either threshold is breached:
| Metric | Threshold | What It Detects |
|---|---|---|
| Histogram χ² distance | > 0.28 |
Broad color palette shifts |
| SSIM (Structural Similarity) | < 0.89 |
Layout and structure changes |
A minimum interval of 8 frames between saves prevents redundant keyframes during gradual transitions.
The pipeline carefully manages GPU memory across stages:
Whisper loads → transcribes → unloads (automatic via faster-whisper)
↓
Qwen2-VL loads → analyzes frames → explicitly unloaded (del + empty_cache)
↓
Ollama loads qwen3.5:9b → synthesizes summary → managed by Ollama process
This sequencing ensures models don't compete for VRAM.
@dataclass
class AnalysisResult:
summary: str # Multi-page master summary
prompt_answer: str | None # Answer (if prompt provided)
keyframe_count: int # Keyframes extracted
transcription: str # Timestamped transcript
duration_seconds: float # Pipeline wall-clock time
video_path: str # Absolute input pathThe master summary is structured into sections:
- OVERVIEW — Video purpose, genre, subject
- DETAILED NARRATIVE — Chronological walkthrough fusing audio + visuals
- KEY POINTS & CONCEPTS — Every idea, claim, or demonstration
- VISUAL HIGHLIGHTS — Notable visual elements, UI, on-screen text
- SPEAKERS & PARTICIPANTS — Who appears and their roles
- TONE & STYLE — Pacing, presentation style, intended audience
The pipeline is designed to be resilient:
| Scenario | Behavior |
|---|---|
| No audio track | Continues with empty transcription — does not crash |
| 0 keyframes extracted | Skips vision, summarizes from transcript only |
| Vision batch fails | Logs warning, writes placeholder, continues |
| Single corrupt frame | Skips it, processes the rest |
| Q&A fails | Returns placeholder text — does not crash |
| Cleanup fails | Logs warning only — result is already returned |
| ffmpeg missing | Immediate RuntimeError with install instructions |
| Ollama missing | Immediate RuntimeError with install URL |
📋 Full error matrix: See §13 of the Project Specification.
| Resource | Link |
|---|---|
| 📖 Full Technical Specification | VideoAnalyzer_ProjectSpec.md |
| 🚀 Quick Start & Usage | README.md |
| ☁️ Run on Google Colab (T4 GPU) | Open in Colab |
| 🤗 Qwen2-VL Model Card | HuggingFace |
| 🤗 Distil-Whisper Model Card | HuggingFace |
| 🦙 Ollama | ollama.com |
Built by S. Pratham · Fully local · Fully offline · No API keys required