Skip to content
S. Pratham edited this page Apr 11, 2026 · 1 revision

📖 FrameRead Wiki — Home

FrameRead (VideoAnalyzer) is a local, fully offline AI pipeline that transforms any video file into a comprehensive, multi-page written analysis — combining audio transcription with visual frame understanding.


🗂️ Wiki Navigation

Page Description
Home You are here — project overview, quick start, and navigation
Project Specification Full technical blueprint — architecture, pipeline stages, models, config, error handling
README Quick-start guide, installation, and usage examples

🧠 What is FrameRead?

FrameRead takes a video file as input and produces:

  1. A timestamped transcription of all spoken audio (via distil-whisper/distil-large-v3)
  2. Detailed visual descriptions of every significant scene change (via Qwen2-VL-2B-Instruct)
  3. A master synthesis — a dense, multi-page analysis that weaves together what was said and what was shown (via qwen3.5:9b)
  4. (Optional) A targeted answer to a specific question about the video, backed by evidence from the summary

All processing happens locally on your machine. No API keys, no cloud calls, no data leaves your device.


🏛️ Architecture at a Glance

                         ┌─────────────┐
                         │  Video File  │
                         └──────┬──────┘
                                │
                 ┌──────────────┼──────────────┐
                 ▼                             ▼
        ┌────────────────┐           ┌──────────────────┐
        │ Audio Pipeline │           │  Video Pipeline   │
        │                │           │                   │
        │ ffmpeg → WAV   │           │ OpenCV → Keyframe │
        │ Whisper → Text │           │ Qwen2-VL → Desc.  │
        └───────┬────────┘           └────────┬─────────┘
                │                             │
                └──────────┬──────────────────┘
                           ▼
                 ┌───────────────────┐
                 │  Synthesis (LLM)  │
                 │  qwen3.5:9b       │
                 │  via Ollama       │
                 └────────┬──────────┘
                          │
              ┌───────────┴───────────┐
              ▼                       ▼
      Master Summary            Prompt Answer
      (always generated)       (if user asked)

📐 Deep dive: See the Project Specification for the full system architecture, data flow, and module-level breakdowns.


⚡ Quick Start

Prerequisites

Dependency Purpose Install
Python ≥ 3.10 Runtime python.org
ffmpeg Audio extraction ffmpeg.org
Ollama LLM inference (summary/Q&A) ollama.com
CUDA GPU (recommended) Accelerated inference 8+ GB VRAM recommended

Install

git clone https://github.com/s-pra1ham/FrameRead.git
cd FrameRead
python -m venv venv && venv\Scripts\activate   # Windows
pip install -r requirements.txt

Run

# Full summary
python run.py video.mp4

# With a question
python run.py video.mp4 --prompt "What is being demonstrated?"

📘 Full usage guide: See the README for all usage modes including Python module import and Google Colab.


🔩 Core Modules

Audio Pipeline

Module File Role
Audio Extractor src/audio/extractor.py Strips audio from video via ffmpeg into 16kHz mono WAV
Transcriber src/audio/transcriber.py Runs distil-large-v3 via faster-whisper to produce timestamped text

Video Pipeline

Module File Role
Frame Extractor src/video/frame_extractor.py Scene-change detection using histogram χ² + SSIM, saves keyframe JPEGs
Frame Analyzer src/video/frame_analyzer.py Sends keyframes to Qwen2-VL-2B-Instruct (local HuggingFace) for visual description

Synthesis & LLM

Module File Role
Ollama Manager src/llm/ollama_manager.py Ensures Ollama is running, pulls required models
Summarizer src/llm/summarizer.py Generates master summary and handles prompt Q&A via qwen3.5:9b

Utilities

Module File Role
Hardware Detection src/utils/hardware.py GPU/CPU survey → HardwareConfig dataclass
Logger src/utils/logger.py Centralized [TIMESTAMP] [MODULE] message logging
Model Manager src/utils/model_manager.py Whisper model cache verification & download
Cleanup src/utils/cleanup.py TempDirManager context manager for ephemeral temp directories

Orchestration

Module File Role
Analyzer src/analyzer.py Top-level pipeline orchestrator — ties all modules together
Config src/config.py All tunable constants: thresholds, model names, prompts
Public API src/__init__.py Exports analyze() and AnalysisResult

🤖 Models

Whisper (Audio → Text)

  • Model: distil-whisper/distil-large-v3
  • Runtime: faster-whisper (CTranslate2 backend)
  • GPU: float16 · CPU: int8
  • Cache: ~/.cache/huggingface/
  • Auto-downloaded on first run. Managed by src/utils/model_manager.py.

Qwen2-VL (Frame → Description)

  • Model: Qwen/Qwen2-VL-2B-Instruct
  • Runtime: HuggingFace Transformers (loaded directly into GPU/CPU memory)
  • GPU dtype: bfloat16 (≥20GB VRAM) or float16 (<20GB)
  • Batching: Dynamic — batch size is computed at runtime via a 3-phase VRAM probe protocol
  • Cache: ~/.cache/huggingface/hub/
  • Critical: Model is explicitly unloaded after use (del model + torch.cuda.empty_cache()) to free VRAM for Ollama.

Qwen3.5 (Synthesis + Q&A)

  • Model: qwen3.5:9b
  • Runtime: Ollama (/api/chat endpoint)
  • Context window: 32768 tokens (summary) · 16384 tokens (Q&A)
  • Auto-pulled by src/llm/ollama_manager.py if not found.

🔬 Key Technical Details

Dynamic VRAM-Probed Batching

Rather than using hardcoded batch sizes, the vision analyzer measures actual VRAM consumption at runtime:

  1. Baseline — Record VRAM usage after loading model weights
  2. Probe — Run inference on 1 frame, measure peak VRAM delta
  3. Calculatebatch_size = floor((total - baseline - 22.5% safety buffer) / per_frame_cost)

The probe frame's result is preserved (not wasted). On CPU, batch size is always 1.

Scene-Change Detection

Keyframes are extracted using a dual-metric system — a frame is saved when either threshold is breached:

Metric Threshold What It Detects
Histogram χ² distance > 0.28 Broad color palette shifts
SSIM (Structural Similarity) < 0.89 Layout and structure changes

A minimum interval of 8 frames between saves prevents redundant keyframes during gradual transitions.

VRAM Lifecycle

The pipeline carefully manages GPU memory across stages:

Whisper loads → transcribes → unloads (automatic via faster-whisper)
     ↓
Qwen2-VL loads → analyzes frames → explicitly unloaded (del + empty_cache)
     ↓
Ollama loads qwen3.5:9b → synthesizes summary → managed by Ollama process

This sequencing ensures models don't compete for VRAM.


📊 Pipeline Output

AnalysisResult

@dataclass
class AnalysisResult:
    summary: str                  # Multi-page master summary
    prompt_answer: str | None     # Answer (if prompt provided)
    keyframe_count: int           # Keyframes extracted
    transcription: str            # Timestamped transcript
    duration_seconds: float       # Pipeline wall-clock time
    video_path: str               # Absolute input path

Summary Structure

The master summary is structured into sections:

  • OVERVIEW — Video purpose, genre, subject
  • DETAILED NARRATIVE — Chronological walkthrough fusing audio + visuals
  • KEY POINTS & CONCEPTS — Every idea, claim, or demonstration
  • VISUAL HIGHLIGHTS — Notable visual elements, UI, on-screen text
  • SPEAKERS & PARTICIPANTS — Who appears and their roles
  • TONE & STYLE — Pacing, presentation style, intended audience

🛡️ Error Handling

The pipeline is designed to be resilient:

Scenario Behavior
No audio track Continues with empty transcription — does not crash
0 keyframes extracted Skips vision, summarizes from transcript only
Vision batch fails Logs warning, writes placeholder, continues
Single corrupt frame Skips it, processes the rest
Q&A fails Returns placeholder text — does not crash
Cleanup fails Logs warning only — result is already returned
ffmpeg missing Immediate RuntimeError with install instructions
Ollama missing Immediate RuntimeError with install URL

📋 Full error matrix: See §13 of the Project Specification.


🔗 Links

Resource Link
📖 Full Technical Specification VideoAnalyzer_ProjectSpec.md
🚀 Quick Start & Usage README.md
☁️ Run on Google Colab (T4 GPU) Open in Colab
🤗 Qwen2-VL Model Card HuggingFace
🤗 Distil-Whisper Model Card HuggingFace
🦙 Ollama ollama.com

Built by S. Pratham · Fully local · Fully offline · No API keys required