pr1hm/FrameRead

🎬 FrameRead: VideoAnalyzer

Turn any video into a detailed, multi-page written analysis, fully offline and fully local.

FrameRead is an AI-powered video analysis pipeline that extracts audio transcriptions and visual frame descriptions from any video file, then synthesizes them into an exhaustive natural-language summary. Optionally, ask it a specific question and get a precise, evidence-backed answer.

Everything runs locally on your machine: no API keys, no cloud services, and no data leaves your device.


✨ Features

  • Dual-Pipeline Analysis: Processes both audio (speech) and video (frames) in parallel pipelines, then fuses them into a unified summary.
  • Scene-Change Keyframe Extraction: Detects visual scene changes using histogram + SSIM comparison rather than naive interval sampling.
  • Dynamic GPU Batching: Profiles your GPU's VRAM at runtime and calculates the batch size for vision inference. No manual tuning needed.
  • Fully Offline: All models run locally. No internet required after initial model downloads.
  • Importable as a Module: Use it from the command line or import it into your own Python project.
  • Rich Structured Logging: Every pipeline stage emits timestamped, module-tagged logs for full observability.
  • Automatic Cleanup: All temporary files (audio, frames, intermediate text) are deleted after each run.
  • Hardware-Adaptive: Detects GPU/CPU and selects dtypes, batch sizes, and compute strategies to match.
  • Prompt Q&A: Optionally pass a question to get a targeted answer grounded in the video content.
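
To make the scene-change idea concrete, here is a minimal pure-Python sketch of a histogram check followed by an SSIM check. It is an illustration under stated assumptions, not FrameRead's actual code: frames are flat lists of 0-255 grayscale values, the SSIM is a simplified single-window version (standard SSIM averages over local windows), and the function names, thresholds, and the "both tests must agree" rule are all hypothetical.

```python
from math import fsum

def histogram(pixels, bins=16):
    """Normalized intensity histogram of a flat list of 0-255 grayscale values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def hist_similarity(a, b, bins=16):
    """Histogram intersection in [0, 1]; 1.0 means identical distributions."""
    ha, hb = histogram(a, bins), histogram(b, bins)
    return fsum(min(x, y) for x, y in zip(ha, hb))

def global_ssim(a, b, dynamic_range=255):
    """Single-window SSIM over whole frames (a simplification of windowed SSIM)."""
    n = len(a)
    mu_a, mu_b = fsum(a) / n, fsum(b) / n
    var_a = fsum((x - mu_a) ** 2 for x in a) / n
    var_b = fsum((y - mu_b) ** 2 for y in b) / n
    cov = fsum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1, c2 = (0.01 * dynamic_range) ** 2, (0.03 * dynamic_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )

def select_keyframes(frames, hist_threshold=0.7, ssim_threshold=0.6):
    """Return indices of frames that differ enough from the last kept keyframe."""
    if not frames:
        return []
    keep = [0]  # the first frame is always a keyframe
    for i in range(1, len(frames)):
        last = frames[keep[-1]]
        # Cheap histogram test first; SSIM then confirms the candidate change.
        # Requiring both to agree is an assumption made for this sketch.
        if (hist_similarity(last, frames[i]) < hist_threshold
                and global_ssim(last, frames[i]) < ssim_threshold):
            keep.append(i)
    return keep

dark, dark2, bright = [10] * 64, [12] * 64, [240] * 64
print(select_keyframes([dark, dark2, bright, bright]))
```

Comparing against the last kept keyframe, rather than the immediately preceding frame, keeps slow fades from slipping through one small step at a time; whether FrameRead does this is not stated in this README.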

🏗️ Architecture

The pipeline follows a three-stage architecture:

Video File ──┬── Audio Pipeline ──→ Transcription (faster-whisper)
             │
             └── Video Pipeline ──→ Frame Descriptions (Qwen2-VL, local HF)
                                          │
                                          ▼
                                  Synthesis Layer (qwen3.5:9b via Ollama)
                                          │
                              ┌───────────┴───────────┐
                              ▼                       ▼
                       Master Summary           Q&A Answer
                       (always)              (if prompt given)
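
The data flow in the diagram can be sketched in a few lines of Python. This is only an illustration of the shape: the three stage functions are hypothetical stand-ins that return placeholder strings, and the use of a thread pool is an assumption, since this README does not say how the two pipelines are scheduled.

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_audio(video_path):
    # Hypothetical stand-in for the audio pipeline (ffmpeg + faster-whisper).
    return f"[transcript of {video_path}]"

def describe_frames(video_path):
    # Hypothetical stand-in for the video pipeline (keyframes + Qwen2-VL).
    return [f"[description of keyframe 1 in {video_path}]"]

def synthesize(transcript, descriptions, prompt=None):
    # Hypothetical stand-in for the synthesis layer (LLM served by Ollama).
    summary = "SUMMARY(" + transcript + " | " + "; ".join(descriptions) + ")"
    # The answer is derived from the summary, and only when a prompt is given.
    answer = f"ANSWER({prompt!r} using {summary})" if prompt else None
    return summary, answer

def run_pipeline(video_path, prompt=None):
    # The two branches read the same file and do not depend on each other,
    # so they can run concurrently; synthesis waits for both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(transcribe_audio, video_path)
        video_future = pool.submit(describe_frames, video_path)
        transcript = audio_future.result()
        descriptions = video_future.result()
    return synthesize(transcript, descriptions, prompt)

summary, answer = run_pipeline("demo.mp4")
print(answer is None)
```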

📖 For the full architecture diagram, pipeline details, and developer blueprint, see docs/VideoAnalyzer_ProjectSpec.md.


🤖 Models Used

| Model | Purpose | Runtime | Size |
|---|---|---|---|
| distil-whisper/distil-large-v3 | Audio transcription | faster-whisper (CTranslate2) | ~1.5 GB |
| Qwen/Qwen2-VL-2B-Instruct | Visual frame analysis | HuggingFace Transformers (local) | ~4.5 GB |
| qwen3.5:9b | Summary synthesis & Q&A | Ollama (local) | ~6 GB |

All models are downloaded automatically on first run and cached locally for future use.


📋 Prerequisites

  • Python ≥ 3.10
  • ffmpeg installed and on PATH (install guide)
  • Ollama installed (ollama.com)
  • CUDA GPU recommended (8+ GB VRAM); CPU mode works but is significantly slower
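
Before installing, the first three prerequisites can be checked with the standard library alone. The helper below is a hypothetical convenience script, not part of FrameRead; it assumes the executables are named `ffmpeg` and `ollama` and does not probe the GPU.

```python
import shutil
import sys

def check_prerequisites(min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    for tool in ("ffmpeg", "ollama"):
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems

if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```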

🚀 Installation

# Clone the repository
git clone https://github.com/s-pra1ham/FrameRead.git
cd FrameRead

# Create and activate a virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

💻 Usage

Option 1: Command Line

Generate a full summary (no prompt)

python run.py video.mp4

This processes the entire video and prints a comprehensive multi-page summary covering:

  • Overview & purpose
  • Detailed chronological narrative
  • Key points & concepts
  • Visual highlights
  • Speakers & participants
  • Tone & style

Ask a specific question (with prompt)

python run.py video.mp4 --prompt "What tools or technologies are mentioned?"
python run.py lecture.mp4 -p "Summarize the main argument in 3 bullet points."

This generates the full summary internally, then uses it to answer your question with cited evidence.
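
For readers wiring up something similar, the command-line shape shown above (one positional video path plus an optional `--prompt` / `-p`) maps onto `argparse` as follows. This is a sketch of the interface only; the real `run.py` may be written differently, and the help strings here are invented.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="run.py",
        description="Analyze a video and print a summary, optionally answering a question.",
    )
    parser.add_argument("video", help="path to the video file")
    # default=None lets downstream code tell "no prompt" apart from an empty one.
    parser.add_argument("-p", "--prompt", default=None, help="optional question")
    return parser

args = build_parser().parse_args(["lecture.mp4", "-p", "Summarize the main argument."])
print(args.video, "|", args.prompt)
```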


Option 2: Import as a Python Module

Use FrameRead programmatically in your own scripts or projects:

Basic summary (no prompt)

from src import analyze

result = analyze(video_path="path/to/video.mp4")

print(result.summary)            # Full multi-page analysis
print(result.transcription)      # Timestamped transcript
print(result.keyframe_count)     # Number of keyframes extracted
print(result.duration_seconds)   # Total pipeline time in seconds

With a prompt

from src import analyze

result = analyze(
    video_path="path/to/video.mp4",
    prompt="What products are shown in this video?"
)

print(result.prompt_answer)      # Direct answer to your question
print(result.summary)            # Full summary is still available

AnalysisResult fields

| Field | Type | Description |
|---|---|---|
| summary | str | Complete multi-page master summary (always present) |
| prompt_answer | str \| None | Answer to your prompt (only if prompt was provided) |
| keyframe_count | int | Number of scene-change keyframes extracted |
| transcription | str | Full timestamped transcript of spoken audio |
| duration_seconds | float | Total wall-clock pipeline time |
| video_path | str | Absolute path to the analyzed video |
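
The field table suggests a simple record type. The dataclass below mirrors those fields so calling code can be type-checked or tested without running the pipeline; it is an assumed reconstruction (the real class in `src` may be defined differently), and `report` is a hypothetical helper showing how to handle the optional `prompt_answer`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnalysisResult:
    summary: str
    keyframe_count: int
    transcription: str
    duration_seconds: float
    video_path: str
    prompt_answer: Optional[str] = None  # None when no prompt was provided

def report(result: AnalysisResult) -> str:
    """Format a result, including the answer only when one exists."""
    lines = [
        f"{result.video_path}: {result.keyframe_count} keyframes, "
        f"{result.duration_seconds:.1f}s"
    ]
    if result.prompt_answer is not None:
        lines.append("Answer: " + result.prompt_answer)
    lines.append(result.summary)
    return "\n".join(lines)

example = AnalysisResult(
    summary="S", keyframe_count=8, transcription="T",
    duration_seconds=72.4, video_path="/videos/demo.mp4",
)
print(report(example))
```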

Option 3: Google Colab (Free T4 GPU)

Run FrameRead on Google Colab's free T4 GPU; no local GPU is required:

Open in Colab


📂 Project Structure

FrameRead/
├── src/                          ← Main package
│   ├── __init__.py               ← Public API: analyze()
│   ├── analyzer.py               ← Pipeline orchestrator
│   ├── config.py                 ← All tunable constants
│   ├── audio/
│   │   ├── extractor.py          ← ffmpeg audio extraction
│   │   └── transcriber.py        ← Whisper transcription
│   ├── video/
│   │   ├── frame_extractor.py    ← Scene-change keyframe extraction
│   │   └── frame_analyzer.py     ← Qwen2-VL vision inference
│   ├── llm/
│   │   ├── ollama_manager.py     ← Ollama process & model management
│   │   └── summarizer.py         ← Summary + Q&A generation
│   └── utils/
│       ├── hardware.py           ← GPU/CPU detection
│       ├── logger.py             ← Centralized logging
│       ├── model_manager.py      ← Whisper model cache management
│       └── cleanup.py            ← Temp directory lifecycle
├── docs/
│   └── VideoAnalyzer_ProjectSpec.md  ← Full technical specification
├── run.py                        ← CLI entry point
├── requirements.txt
└── setup.py
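
The tree describes `utils/cleanup.py` as managing the temp directory lifecycle. One common way to get that behaviour in Python is a context manager around `tempfile.TemporaryDirectory`, sketched below. The function name, the `frameread_` prefix, and the `audio/` and `frames/` subfolders are assumptions for illustration and may not match the real module.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def run_workspace(prefix="frameread_"):
    """Yield a scratch directory that is deleted on exit, even after an error."""
    with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
        root = Path(tmp)
        (root / "audio").mkdir()   # extracted audio would go here
        (root / "frames").mkdir()  # saved keyframes would go here
        yield root

with run_workspace() as work:
    (work / "audio" / "track.wav").write_bytes(b"")
    saved = work  # remember the path to check it afterwards

print(saved.exists())  # False: the directory is gone once the block exits
```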

📖 For the complete developer specification, see docs/VideoAnalyzer_ProjectSpec.md.


⚙️ How It Works

  1. Input: You provide a video file path and an optional prompt.
  2. Audio Extraction: ffmpeg strips the audio track into a 16 kHz mono WAV.
  3. Transcription: faster-whisper transcribes every spoken word with timestamps.
  4. Keyframe Extraction: OpenCV reads every frame; histogram + SSIM comparison detects scene changes and saves only the meaningful keyframes.
  5. Vision Analysis: Each keyframe is described in detail by Qwen2-VL-2B-Instruct running locally. On GPU, batch size is dynamically calculated via a VRAM probe to maximize throughput without OOM.
  6. Synthesis: The full transcript + all frame descriptions are fed to qwen3.5:9b (via Ollama), which produces the master summary.
  7. Q&A (optional): If a prompt was given, the summary is used as context to answer the question.
  8. Cleanup: All temporary files are automatically deleted.
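
Step 5 mentions sizing the vision batch from a VRAM probe. The arithmetic behind that kind of decision can be sketched as below. The safety margin, the cap, and the per-frame memory figure are hypothetical, and the probe itself is left out; with PyTorch, the free-byte figure could come from `torch.cuda.mem_get_info()`, though this README does not say how FrameRead measures it.

```python
def pick_batch_size(free_vram_bytes, per_frame_bytes, safety_margin=0.8, max_batch=16):
    """Largest batch that fits in a fraction of free VRAM, never below 1."""
    if per_frame_bytes <= 0:
        raise ValueError("per_frame_bytes must be positive")
    # Leave headroom so activation spikes do not trigger an out-of-memory error.
    usable = free_vram_bytes * safety_margin
    return max(1, min(max_batch, int(usable // per_frame_bytes)))

GIB = 1024 ** 3
# Hypothetical numbers: 3 GiB free after loading the model, ~1.1 GiB per frame.
print(pick_batch_size(3 * GIB, int(1.1 * GIB)))  # prints 2
```

Falling back to a batch of 1 rather than failing keeps the pipeline usable on small GPUs, at the cost of throughput.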

📊 Example Output

[00:00:00.000] [INIT]       Starting VideoAnalyzer pipeline for: C:\videos\demo.mp4
[00:00:00.012] [HARDWARE]   ── Hardware Survey ──────────────────────────
[00:00:00.013] [HARDWARE]   Device:       CUDA (GPU)
[00:00:00.013] [HARDWARE]   GPU:          NVIDIA GeForce RTX 4060
[00:00:00.013] [HARDWARE]   VRAM:         8.0 GB
[00:00:00.014] [HARDWARE]   Torch dtype:  float16
[00:00:00.014] [HARDWARE]   ─────────────────────────────────────────────
[00:00:01.220] [AUDIO]      Extracting audio from video...
[00:00:03.891] [AUDIO]      ✓ Audio extracted in 2.7s
[00:00:07.441] [TRANSCRIBE] ✓ Transcription complete -- 12 segments, 143 words, 3.5s
[00:00:07.500] [FRAMES]     ✓ Extraction complete -- 8 keyframes from 900 total frames (0.5s)
[00:00:08.100] [VISION]     Vision Mode: Local Inference (Dynamic Batch size: 2)
[00:00:45.200] [VISION]     All frames analyzed -- 37.1s total (Dynamic Batch Size: 2)
[00:01:12.300] [SUMMARY]    ✓ Summary generated -- 2847 words, 3891 tokens (27.1s)
[00:01:12.400] [CLEANUP]    ✓ Cleaned up 11 files
[00:01:12.401] [DONE]       ✨ Total pipeline time: 72.4s
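
Because every line follows the same `[elapsed] [TAG] message` layout, the log is easy to post-process. The parser below is a sketch whose pattern is inferred only from the example output above, so the real logger may emit lines it does not match; `parse_line` and `stage_starts` are hypothetical helper names.

```python
import re

# [HH:MM:SS.mmm] [TAG] message, as seen in the example output.
LOG_LINE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\] \[([A-Z]+)\]\s+(.*)$")

def parse_line(line):
    """Return (elapsed_seconds, tag, message), or None if the line does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in m.groups()[:4])
    elapsed = hours * 3600 + minutes * 60 + seconds + millis / 1000
    return elapsed, m.group(5), m.group(6)

def stage_starts(lines):
    """Map each tag to the elapsed time of its first log line."""
    first_seen = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[1] not in first_seen:
            first_seen[parsed[1]] = parsed[0]
    return first_seen

sample = [
    "[00:00:01.220] [AUDIO]      Extracting audio from video...",
    "[00:00:08.100] [VISION]     Vision Mode: Local Inference (Dynamic Batch size: 2)",
    "[00:00:45.200] [VISION]     All frames analyzed -- 37.1s total (Dynamic Batch Size: 2)",
]
print(sorted(stage_starts(sample)))
```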

📄 License

This project is for personal and educational use.


👤 Author

S. Pratham (GitHub)
