Turn any video into a detailed, multi-page written analysis, fully offline and fully local.
FrameRead is an AI-powered video analysis pipeline that extracts audio transcriptions and visual frame descriptions from any video file, then synthesizes them into an exhaustive natural-language summary. Optionally, ask it a specific question and get a precise, evidence-backed answer.
Everything runs locally on your machine: no API keys, no cloud services, and no data leaves your device.
- Dual-Pipeline Analysis: Processes both audio (speech) and video (frames) in parallel pipelines, then fuses them into a unified summary.
- Scene-Change Keyframe Extraction: Detects visual scene changes using histogram + SSIM comparison rather than naive interval sampling.
- Dynamic GPU Batching: Automatically profiles your GPU's VRAM at runtime and calculates a batch size for vision inference. No manual tuning needed.
- Fully Offline: All models run locally. No internet is required after the initial model downloads.
- Importable as a Module: Use it from the command line or `import` it into your own Python project.
- Rich Structured Logging: Every pipeline stage emits timestamped, module-tagged logs for full observability.
- Automatic Cleanup: All temporary files (audio, frames, intermediate text) are deleted after each run.
- Hardware-Adaptive: Automatically detects GPU/CPU and selects dtypes, batch sizes, and compute strategies to match.
- Prompt Q&A: Optionally pass a question to get a targeted answer grounded in the video content.
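To make the Dynamic GPU Batching idea concrete, here is a minimal sketch of how a batch size could be derived from free VRAM. The function name `pick_batch_size`, the per-frame memory estimate, the safety margin, and the upper cap are all assumptions for illustration, not FrameRead's actual values. In a real run the free-memory figure could come from `torch.cuda.mem_get_info()`.

```python
def pick_batch_size(free_vram_bytes: int,
                    per_frame_bytes: int,
                    safety_margin: float = 0.8,
                    max_batch: int = 16) -> int:
    """Return the largest batch that fits in a fraction of free VRAM.

    All parameters are illustrative assumptions: per_frame_bytes would
    normally come from a measured probe pass, not a hard-coded constant.
    """
    if per_frame_bytes <= 0:
        raise ValueError("per_frame_bytes must be positive")
    usable = int(free_vram_bytes * safety_margin)
    # Always process at least one frame, even when memory is tight.
    return max(1, min(max_batch, usable // per_frame_bytes))


GIB = 1024 ** 3
# Example: 4 GiB free and an estimated 1.5 GiB per frame.
print(pick_batch_size(4 * GIB, int(1.5 * GIB)))  # 2
```

Leaving headroom through a safety margin is one simple way to reduce the risk of out-of-memory errors when the per-frame estimate is imprecise.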
The pipeline follows a three-stage architecture:
Video File ──┬── Audio Pipeline ──► Transcription (faster-whisper)
             │
             └── Video Pipeline ──► Frame Descriptions (Qwen2-VL, local HF)
                                  │
                                  ▼
               Synthesis Layer (qwen3.5:9b via Ollama)
                                  │
                      ┌───────────┴───────────┐
                      ▼                       ▼
               Master Summary            Q&A Answer
                  (always)            (if prompt given)
For the full architecture diagram, pipeline details, and developer blueprint, see `VideoAnalyzer_ProjectSpec.md`.
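The fan-out and fan-in shape of this architecture can be sketched with two concurrent tasks whose results feed a synthesis step. This is only an illustration: the function names and stub bodies are hypothetical, and whether FrameRead uses threads, processes, or another mechanism for its parallel pipelines is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_audio(video_path: str) -> str:
    # Stand-in for the audio pipeline (audio extraction + transcription).
    return f"[transcript of {video_path}]"


def describe_frames(video_path: str) -> list[str]:
    # Stand-in for the video pipeline (keyframes + vision model).
    return [f"[frame description 1 of {video_path}]"]


def synthesize(transcript: str, frames: list[str]) -> str:
    # Stand-in for the synthesis layer (an LLM call in a real pipeline).
    return transcript + "\n" + "\n".join(frames)


def run_pipeline(video_path: str) -> str:
    # Run both branches concurrently, then fuse their outputs.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(transcribe_audio, video_path)
        video_future = pool.submit(describe_frames, video_path)
        return synthesize(audio_future.result(), video_future.result())


print(run_pipeline("demo.mp4"))
```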
| Model | Purpose | Runtime | Size |
|---|---|---|---|
| `distil-whisper/distil-large-v3` | Audio transcription | faster-whisper (CTranslate2) | ~1.5 GB |
| `Qwen/Qwen2-VL-2B-Instruct` | Visual frame analysis | HuggingFace Transformers (local) | ~4.5 GB |
| `qwen3.5:9b` | Summary synthesis & Q&A | Ollama (local) | ~6 GB |
All models are downloaded automatically on first run and cached locally for future use.
- Python ≥ 3.10
- ffmpeg installed and on PATH (install guide)
- Ollama installed (ollama.com)
- CUDA GPU recommended (8+ GB VRAM); CPU mode works but is significantly slower
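Before installing, it can help to confirm the prerequisites are in place. The sketch below is not part of FrameRead; `missing_prerequisites` is a hypothetical helper that checks the Python version and looks for the `ffmpeg` and `ollama` executables on `PATH`.

```python
import shutil
import sys


def missing_prerequisites(which=shutil.which, version=sys.version_info):
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if tuple(version[:2]) < (3, 10):
        problems.append("Python 3.10 or newer is required")
    for tool in ("ffmpeg", "ollama"):
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


for problem in missing_prerequisites():
    print("Missing:", problem)
```

The `which` and `version` parameters exist only so the check can be exercised without touching the real environment.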
# Clone the repository
git clone https://github.com/s-pra1ham/FrameRead.git
cd FrameRead
# Create and activate a virtual environment
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

To analyze a video, run:

python run.py video.mp4

This processes the entire video and prints a comprehensive multi-page summary covering:
- Overview & purpose
- Detailed chronological narrative
- Key points & concepts
- Visual highlights
- Speakers & participants
- Tone & style
To ask a specific question, pass `--prompt` (or `-p`):

python run.py video.mp4 --prompt "What tools or technologies are mentioned?"

python run.py lecture.mp4 -p "Summarize the main argument in 3 bullet points."

This generates the full summary internally, then uses it to answer your question with cited evidence.
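For readers curious how a command line of this shape is typically parsed, here is a minimal `argparse` sketch that accepts one positional video path and an optional `-p`/`--prompt`. It mirrors the commands shown above, but it is an assumption about the interface, not the actual contents of `run.py`; the argument name `video` and the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="run.py",
        description="Analyze a video and print a written summary.",
    )
    parser.add_argument("video", help="path to the video file")
    parser.add_argument(
        "-p", "--prompt",
        default=None,  # None means: produce the summary only
        help="optional question to answer from the video content",
    )
    return parser


args = build_parser().parse_args(["lecture.mp4", "-p", "Who is speaking?"])
print(args.video, "|", args.prompt)
```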
Use FrameRead programmatically in your own scripts or projects:
from src import analyze

result = analyze(video_path="path/to/video.mp4")
print(result.summary)           # Full multi-page analysis
print(result.transcription)     # Timestamped transcript
print(result.keyframe_count)    # Number of keyframes extracted
print(result.duration_seconds)  # Total pipeline time in seconds

To ask a question, pass `prompt` as well:

from src import analyze

result = analyze(
    video_path="path/to/video.mp4",
    prompt="What products are shown in this video?"
)
print(result.prompt_answer)  # Direct answer to your question
print(result.summary)        # Full summary is still available

The result exposes the following fields:

| Field | Type | Description |
|---|---|---|
| `summary` | `str` | Complete multi-page master summary (always present) |
| `prompt_answer` | `str \| None` | Answer to your prompt (only if a prompt was provided) |
| `keyframe_count` | `int` | Number of scene-change keyframes extracted |
| `transcription` | `str` | Full timestamped transcript of spoken audio |
| `duration_seconds` | `float` | Total wall-clock pipeline time |
| `video_path` | `str` | Absolute path to the analyzed video |
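As a mental model for the fields above, the result can be pictured as a small dataclass. The class name `AnalysisResult` is hypothetical (the table only lists the fields and their types), and the values below are placeholders, so treat this as a sketch of the shape rather than the library's actual definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalysisResult:  # hypothetical name; only the fields come from the table
    summary: str
    keyframe_count: int
    transcription: str
    duration_seconds: float
    video_path: str
    prompt_answer: Optional[str] = None  # stays None when no prompt is given


result = AnalysisResult(
    summary="(summary text)",
    keyframe_count=8,
    transcription="(transcript text)",
    duration_seconds=72.4,
    video_path="/videos/demo.mp4",
)

# Callers should handle the no-prompt case explicitly.
if result.prompt_answer is None:
    print("No prompt was given; only the summary is available.")
```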
Run FrameRead on Google Colab's free T4 GPU; no local GPU is required.
FrameRead/
├── src/                              ← Main package
│   ├── __init__.py                   ← Public API: analyze()
│   ├── analyzer.py                   ← Pipeline orchestrator
│   ├── config.py                     ← All tunable constants
│   ├── audio/
│   │   ├── extractor.py              ← ffmpeg audio extraction
│   │   └── transcriber.py            ← Whisper transcription
│   ├── video/
│   │   ├── frame_extractor.py        ← Scene-change keyframe extraction
│   │   └── frame_analyzer.py         ← Qwen2-VL vision inference
│   ├── llm/
│   │   ├── ollama_manager.py         ← Ollama process & model management
│   │   └── summarizer.py             ← Summary + Q&A generation
│   └── utils/
│       ├── hardware.py               ← GPU/CPU detection
│       ├── logger.py                 ← Centralized logging
│       ├── model_manager.py          ← Whisper model cache management
│       └── cleanup.py                ← Temp directory lifecycle
├── docs/
│   └── VideoAnalyzer_ProjectSpec.md  ← Full technical specification
├── run.py                            ← CLI entry point
├── requirements.txt
└── setup.py
For the complete developer specification, see `docs/VideoAnalyzer_ProjectSpec.md`.
- Input: You provide a video file path and an optional prompt.
- Audio Extraction: `ffmpeg` strips the audio track into a 16 kHz mono WAV.
- Transcription: `faster-whisper` transcribes every spoken word with timestamps.
- Keyframe Extraction: OpenCV reads every frame; histogram + SSIM comparison detects scene changes and saves only the meaningful keyframes.
- Vision Analysis: Each keyframe is described in detail by `Qwen2-VL-2B-Instruct` running locally. On GPU, batch size is dynamically calculated via a VRAM probe to maximize throughput without OOM.
- Synthesis: The full transcript + all frame descriptions are fed to `qwen3.5:9b` (via Ollama), which produces the master summary.
- Q&A (optional): If a prompt was given, the summary is used as context to answer the question.
- Cleanup: All temporary files are automatically deleted.
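The keyframe step above combines a cheap histogram check with a structural (SSIM) check. The dependency-free sketch below shows one way the two can work together on grayscale frames given as flat lists of 0-255 values. Everything in it is an assumption for illustration: the function names, the thresholds, the choice to run SSIM only when histograms disagree, and the single-window SSIM (library implementations average SSIM over local windows). FrameRead itself uses OpenCV and may combine the two signals differently.

```python
def histogram(frame, bins=16):
    """Normalized intensity histogram of a flat list of 0-255 pixel values."""
    counts = [0] * bins
    for pixel in frame:
        counts[min(pixel * bins // 256, bins - 1)] += 1
    return [c / len(frame) for c in counts]


def hist_intersection(h1, h2):
    """1.0 for identical histograms, 0.0 for fully disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def global_ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM over whole frames (a simplification)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def is_scene_change(prev, curr, hist_threshold=0.7, ssim_threshold=0.5):
    """Flag a change only when both checks say the frames differ."""
    if hist_intersection(histogram(prev), histogram(curr)) >= hist_threshold:
        return False  # histograms similar: skip the costlier SSIM
    return global_ssim(prev, curr) < ssim_threshold


def select_keyframes(frames):
    """Indices of frames that start a new scene (frame 0 is always kept)."""
    if not frames:
        return []
    keep = [0]
    for i in range(1, len(frames)):
        # Compare against the last kept keyframe, not the previous frame.
        if is_scene_change(frames[keep[-1]], frames[i]):
            keep.append(i)
    return keep


dark, dark2, bright = [10] * 64, [12] * 64, [200] * 64
print(select_keyframes([dark, dark2, bright, bright]))  # [0, 2]
```

Running the histogram test first keeps the per-frame cost low, since most consecutive frames in a video are nearly identical.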
[00:00:00.000] [INIT] Starting VideoAnalyzer pipeline for: C:\videos\demo.mp4
[00:00:00.012] [HARDWARE] ── Hardware Survey ──────────────────────────
[00:00:00.013] [HARDWARE] Device: CUDA (GPU)
[00:00:00.013] [HARDWARE] GPU: NVIDIA GeForce RTX 4060
[00:00:00.013] [HARDWARE] VRAM: 8.0 GB
[00:00:00.014] [HARDWARE] Torch dtype: float16
[00:00:00.014] [HARDWARE] ─────────────────────────────────────────────
[00:00:01.220] [AUDIO] Extracting audio from video...
[00:00:03.891] [AUDIO] ✓ Audio extracted in 2.7s
[00:00:07.441] [TRANSCRIBE] ✓ Transcription complete -- 12 segments, 143 words, 3.5s
[00:00:07.500] [FRAMES] ✓ Extraction complete -- 8 keyframes from 900 total frames (0.5s)
[00:00:08.100] [VISION] Vision Mode: Local Inference (Dynamic Batch size: 2)
[00:00:45.200] [VISION] All frames analyzed -- 37.1s total (Dynamic Batch Size: 2)
[00:01:12.300] [SUMMARY] ✓ Summary generated -- 2847 words, 3891 tokens (27.1s)
[00:01:12.400] [CLEANUP] ✓ Cleaned up 11 files
[00:01:12.401] [DONE] ✨ Total pipeline time: 72.4s
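The log lines above appear to use a timestamp measured from the start of the run, followed by a module tag. A formatter for that layout could look like the sketch below; `format_log` is a hypothetical helper, and the elapsed-time interpretation is an inference from the sample output rather than something the logger module is confirmed to do.

```python
def format_log(elapsed_seconds: float, tag: str, message: str) -> str:
    """Render one line as [HH:MM:SS.mmm] [TAG] message."""
    total_ms = round(elapsed_seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    stamp = f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
    return f"[{stamp}] [{tag}] {message}"


print(format_log(72.401, "DONE", "Total pipeline time: 72.4s"))
# [00:01:12.401] [DONE] Total pipeline time: 72.4s
```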
This project is for personal and educational use.
S. Pratham (GitHub)