Home

📖 FrameRead Wiki — Home

FrameRead (VideoAnalyzer) is a local, fully offline AI pipeline that transforms any video file into a comprehensive, multi-page written analysis — combining audio transcription with visual frame understanding.

🗂️ Wiki Navigation

Page	Description
Home	You are here — project overview, quick start, and navigation
Project Specification	Full technical blueprint — architecture, pipeline stages, models, config, error handling
README	Quick-start guide, installation, and usage examples

🧠 What is FrameRead?

FrameRead takes a video file as input and produces:

A timestamped transcription of all spoken audio (via distil-whisper/distil-large-v3)
Detailed visual descriptions of every significant scene change (via Qwen2-VL-2B-Instruct)
A master synthesis — a dense, multi-page analysis that weaves together what was said and what was shown (via qwen3.5:9b)
(Optional) A targeted answer to a specific question about the video, backed by evidence from the summary

All processing happens locally on your machine. No API keys, no cloud calls, no data leaves your device.

🏛️ Architecture at a Glance

                         ┌─────────────┐
                         │  Video File  │
                         └──────┬──────┘
                                │
                 ┌──────────────┼──────────────┐
                 ▼                             ▼
        ┌────────────────┐           ┌──────────────────┐
        │ Audio Pipeline │           │  Video Pipeline   │
        │                │           │                   │
        │ ffmpeg → WAV   │           │ OpenCV → Keyframe │
        │ Whisper → Text │           │ Qwen2-VL → Desc.  │
        └───────┬────────┘           └────────┬─────────┘
                │                             │
                └──────────┬──────────────────┘
                           ▼
                 ┌───────────────────┐
                 │  Synthesis (LLM)  │
                 │  qwen3.5:9b       │
                 │  via Ollama       │
                 └────────┬──────────┘
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
      Master Summary            Prompt Answer
      (always generated)       (if user asked)

📐 Deep dive: See the Project Specification for the full system architecture, data flow, and module-level breakdowns.

⚡ Quick Start

Prerequisites

Dependency	Purpose	Install
Python ≥ 3.10	Runtime	python.org
ffmpeg	Audio extraction	ffmpeg.org
Ollama	LLM inference (summary/Q&A)	ollama.com
CUDA GPU (recommended)	Accelerated inference	8+ GB VRAM recommended

Install

git clone https://github.com/s-pra1ham/FrameRead.git
cd FrameRead
python -m venv venv && venv\Scripts\activate   # Windows
pip install -r requirements.txt

Run

# Full summary
python run.py video.mp4

# With a question
python run.py video.mp4 --prompt "What is being demonstrated?"

📘 Full usage guide: See the README for all usage modes including Python module import and Google Colab.

🔩 Core Modules

Audio Pipeline

Module	File	Role
Audio Extractor	`src/audio/extractor.py`	Strips audio from video via `ffmpeg` into 16kHz mono WAV
Transcriber	`src/audio/transcriber.py`	Runs `distil-large-v3` via `faster-whisper` to produce timestamped text

Video Pipeline

Module	File	Role
Frame Extractor	`src/video/frame_extractor.py`	Scene-change detection using histogram χ² + SSIM, saves keyframe JPEGs
Frame Analyzer	`src/video/frame_analyzer.py`	Sends keyframes to `Qwen2-VL-2B-Instruct` (local HuggingFace) for visual description

Synthesis & LLM

Module	File	Role
Ollama Manager	`src/llm/ollama_manager.py`	Ensures Ollama is running, pulls required models
Summarizer	`src/llm/summarizer.py`	Generates master summary and handles prompt Q&A via `qwen3.5:9b`

Utilities

Module	File	Role
Hardware Detection	`src/utils/hardware.py`	GPU/CPU survey → `HardwareConfig` dataclass
Logger	`src/utils/logger.py`	Centralized `[TIMESTAMP] [MODULE] message` logging
Model Manager	`src/utils/model_manager.py`	Whisper model cache verification & download
Cleanup	`src/utils/cleanup.py`	`TempDirManager` context manager for ephemeral temp directories

Orchestration

Module	File	Role
Analyzer	`src/analyzer.py`	Top-level pipeline orchestrator — ties all modules together
Config	`src/config.py`	All tunable constants: thresholds, model names, prompts
Public API	`src/__init__.py`	Exports `analyze()` and `AnalysisResult`

🤖 Models

Whisper (Audio → Text)

Model: distil-whisper/distil-large-v3
Runtime: faster-whisper (CTranslate2 backend)
GPU: float16 · CPU: int8
Cache: ~/.cache/huggingface/
Auto-downloaded on first run. Managed by src/utils/model_manager.py.

Qwen2-VL (Frame → Description)

Model: Qwen/Qwen2-VL-2B-Instruct
Runtime: HuggingFace Transformers (loaded directly into GPU/CPU memory)
GPU dtype: bfloat16 (≥20GB VRAM) or float16 (<20GB)
Batching: Dynamic — batch size is computed at runtime via a 3-phase VRAM probe protocol
Cache: ~/.cache/huggingface/hub/
Critical: Model is explicitly unloaded after use (del model + torch.cuda.empty_cache()) to free VRAM for Ollama.

Qwen3.5 (Synthesis + Q&A)

Model: qwen3.5:9b
Runtime: Ollama (/api/chat endpoint)
Context window: 32768 tokens (summary) · 16384 tokens (Q&A)
Auto-pulled by src/llm/ollama_manager.py if not found.

🔬 Key Technical Details

Dynamic VRAM-Probed Batching

Rather than using hardcoded batch sizes, the vision analyzer measures actual VRAM consumption at runtime:

Baseline — Record VRAM usage after loading model weights
Probe — Run inference on 1 frame, measure peak VRAM delta
Calculate — batch_size = floor((total - baseline - 22.5% safety buffer) / per_frame_cost)

The probe frame's result is preserved (not wasted). On CPU, batch size is always 1.

Scene-Change Detection

Keyframes are extracted using a dual-metric system — a frame is saved when either threshold is breached:

Metric	Threshold	What It Detects
Histogram χ² distance	`> 0.28`	Broad color palette shifts
SSIM (Structural Similarity)	`< 0.89`	Layout and structure changes

A minimum interval of 8 frames between saves prevents redundant keyframes during gradual transitions.

VRAM Lifecycle

The pipeline carefully manages GPU memory across stages:

Whisper loads → transcribes → unloads (automatic via faster-whisper)
     ↓
Qwen2-VL loads → analyzes frames → explicitly unloaded (del + empty_cache)
     ↓
Ollama loads qwen3.5:9b → synthesizes summary → managed by Ollama process

This sequencing ensures models don't compete for VRAM.

📊 Pipeline Output

AnalysisResult

@dataclass
class AnalysisResult:
    summary: str                  # Multi-page master summary
    prompt_answer: str | None     # Answer (if prompt provided)
    keyframe_count: int           # Keyframes extracted
    transcription: str            # Timestamped transcript
    duration_seconds: float       # Pipeline wall-clock time
    video_path: str               # Absolute input path

Summary Structure

The master summary is structured into sections:

OVERVIEW — Video purpose, genre, subject
DETAILED NARRATIVE — Chronological walkthrough fusing audio + visuals
KEY POINTS & CONCEPTS — Every idea, claim, or demonstration
VISUAL HIGHLIGHTS — Notable visual elements, UI, on-screen text
SPEAKERS & PARTICIPANTS — Who appears and their roles
TONE & STYLE — Pacing, presentation style, intended audience

🛡️ Error Handling

The pipeline is designed to be resilient:

Scenario	Behavior
No audio track	Continues with empty transcription — does not crash
0 keyframes extracted	Skips vision, summarizes from transcript only
Vision batch fails	Logs warning, writes placeholder, continues
Single corrupt frame	Skips it, processes the rest
Q&A fails	Returns placeholder text — does not crash
Cleanup fails	Logs warning only — result is already returned
ffmpeg missing	Immediate `RuntimeError` with install instructions
Ollama missing	Immediate `RuntimeError` with install URL

📋 Full error matrix: See §13 of the Project Specification.

🔗 Links

Resource	Link
📖 Full Technical Specification	VideoAnalyzer_ProjectSpec.md
🚀 Quick Start & Usage	README.md
☁️ Run on Google Colab (T4 GPU)	Open in Colab
🤗 Qwen2-VL Model Card	HuggingFace
🤗 Distil-Whisper Model Card	HuggingFace
🦙 Ollama	ollama.com

Built by S. Pratham · Fully local · Fully offline · No API keys required

Provide feedback

Saved searches

Use saved searches to filter your results more quickly