A Retrieval-Augmented Generation (RAG) system that enables users to converse with YouTube videos and retrieve timestamp-grounded insights across multiple videos.
The application combines transcript extraction, semantic search, conversational memory, and LLM-powered reasoning to provide explainable answers linked directly to relevant video segments.
- Chat with any YouTube video using its transcript
- Multi-turn conversations with chat history awareness
- Follow-up question handling through query rewriting
- Transcript-grounded responses, with an optional fallback to the LLM's own reasoning when the retrieved context is insufficient
- Search YouTube using a natural language query
- Automatically discover and retrieve relevant videos
- Aggregate information across multiple transcripts
- Return timestamp-grounded answers from different videos
- Surface direct video references for further exploration
Unlike traditional transcript chunking approaches, this project preserves temporal information throughout the retrieval pipeline.
- Custom time-based transcript chunking
- Timestamp metadata attached to every chunk
- Source-grounded retrieval
- Direct navigation to relevant video segments
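As a rough illustration of time-based chunking with timestamp metadata, the sketch below groups transcript entries into fixed-length time windows. It is not the project's actual chunker: the function names, the 60-second window, and the entry shape (dicts with `text`, `start`, and `duration` keys) are assumptions made for the example.

```python
from typing import Dict, List


def _finalize(group: List[Dict], video_id: str) -> Dict:
    """Merge a group of transcript entries into one chunk with time metadata."""
    last = group[-1]
    return {
        "video_id": video_id,
        "text": " ".join(e["text"].strip() for e in group),
        "start": group[0]["start"],
        "end": last["start"] + last["duration"],
    }


def chunk_by_time(entries: List[Dict], video_id: str,
                  window_seconds: float = 60.0) -> List[Dict]:
    """Group entries so that each chunk spans at most `window_seconds`.

    A single entry longer than the window still becomes its own chunk.
    """
    chunks: List[Dict] = []
    current: List[Dict] = []
    for entry in entries:
        entry_end = entry["start"] + entry["duration"]
        # Close the current chunk if adding this entry would exceed the window.
        if current and entry_end - current[0]["start"] > window_seconds:
            chunks.append(_finalize(current, video_id))
            current = []
        current.append(entry)
    if current:
        chunks.append(_finalize(current, video_id))
    return chunks


if __name__ == "__main__":
    demo = [
        {"text": "intro", "start": 0.0, "duration": 20.0},
        {"text": "setup", "start": 20.0, "duration": 20.0},
        {"text": "main idea", "start": 40.0, "duration": 30.0},
    ]
    for chunk in chunk_by_time(demo, video_id="VIDEO_ID"):
        print(chunk)
```

Because every chunk carries `video_id`, `start`, and `end`, any retrieved chunk can be traced back to a specific segment of a specific video.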
User Question → Query Rewriting → Transcript Retrieval → FAISS Similarity Search → LLM Response Generation → Conversational Memory Update
User Query → YouTube Video Discovery → Transcript Extraction → Timestamp-Aware Chunking → Embedding Generation → FAISS Vector Search → Multi-Video Retrieval → LLM Summarization → Timestamp-Grounded Answers
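The similarity-search step in both pipelines can be pictured with a brute-force stand-in. The sketch below ranks chunk vectors by cosine similarity in plain Python; it is not FAISS and does not use Sentence Transformers, and the toy two-dimensional vectors and the names `cosine` and `top_k` are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as having no similarity.
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: List[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(top_k([1.0, 0.0], vectors, k=2))
```

A vector index serves the same purpose as this loop but avoids scanning every chunk in Python, which matters once transcripts from many videos are indexed together.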
- HuggingFace Inference API
- Meta Llama 3 8B Instruct
- Google Gemini 2.5 Flash
- LangChain
- FAISS Vector Store
- Sentence Transformers
- YouTube Transcript API
- yt-dlp
- Streamlit
- Python
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
python -m venv myenv
Windows:
myenv\Scripts\activate
macOS/Linux:
source myenv/bin/activate
pip install -r requirements.txt
Create a .env file in the project root:
HUGGINGFACEHUB_API_TOKEN=YOUR_TOKEN
Run the app:
streamlit run app.py
Traditional text chunking loses temporal information, making it difficult to trace answers back to the source video.
A custom timestamp-aware chunker was implemented to:
- preserve transcript timing information
- maintain source attribution
- enable timestamp-grounded responses
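Once a chunk carries its start time, turning it into a clickable reference is straightforward. The sketch below formats a start offset for display and builds a YouTube link that opens at that offset using the `t` query parameter; the helper names and the `VIDEO_ID` placeholder are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Render a second offset as M:SS, or H:MM:SS for offsets of an hour or more."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def timestamp_url(video_id: str, start_seconds: float) -> str:
    """Build a watch URL that starts playback at the given offset."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start_seconds)}s"


if __name__ == "__main__":
    print(format_timestamp(83.4), timestamp_url("VIDEO_ID", 83.4))
```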
Responses are consolidated across multiple videos while limiting duplication and prioritizing the most relevant evidence from each source.
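One plausible way to consolidate results is to walk the retrieved chunks from best to worst score, skip near-identical text, and cap how many chunks any single video contributes. The sketch below assumes each hit is a dict with `video_id`, `text`, and `score` keys; that shape, the `consolidate` name, and the default cap of two chunks per video are illustrative choices, not the project's confirmed logic.

```python
from collections import defaultdict
from typing import Dict, List


def consolidate(hits: List[Dict], per_video: int = 2) -> List[Dict]:
    """Keep the highest-scoring, non-duplicate chunks, capped per video."""
    seen_texts = set()
    kept_per_video = defaultdict(int)
    selected: List[Dict] = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        # Normalize case and whitespace so trivially different copies match.
        key = " ".join(hit["text"].lower().split())
        if key in seen_texts:
            continue
        if kept_per_video[hit["video_id"]] >= per_video:
            continue
        seen_texts.add(key)
        kept_per_video[hit["video_id"]] += 1
        selected.append(hit)
    return selected


if __name__ == "__main__":
    demo_hits = [
        {"video_id": "a", "text": "Vector search basics", "score": 0.9},
        {"video_id": "b", "text": "vector  search basics", "score": 0.8},
        {"video_id": "b", "text": "Chunking strategies", "score": 0.6},
    ]
    for hit in consolidate(demo_hits):
        print(hit["video_id"], hit["text"])
```

Exact-match deduplication only catches verbatim repeats; paraphrased overlap across videos would need a similarity threshold instead.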
Follow-up questions are rewritten into standalone queries before retrieval, improving retrieval quality and enabling natural conversations over video content.
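Query rewriting is typically done by asking the LLM to restate the follow-up using recent chat history. The sketch below only builds such a prompt; the wording, the `build_rewrite_prompt` name, and the three-turn history limit are assumptions, and the call to the model itself is left out.

```python
from typing import List, Tuple


def build_rewrite_prompt(history: List[Tuple[str, str]],
                         question: str,
                         max_turns: int = 3) -> str:
    """Build a prompt asking an LLM to turn a follow-up into a standalone query.

    `history` holds (user_message, assistant_message) pairs, oldest first.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = []
    for user_msg, assistant_msg in recent:
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {assistant_msg}")
    conversation = "\n".join(lines)
    return (
        "Rewrite the final question as a standalone question, resolving "
        "pronouns and references using the conversation.\n\n"
        f"Conversation:\n{conversation}\n\n"
        f"Final question: {question}\n"
        "Standalone question:"
    )


if __name__ == "__main__":
    chat = [("What is FAISS?", "FAISS is a library for vector similarity search.")]
    print(build_rewrite_prompt(chat, "How is it used in this video?"))
```

The rewritten question, rather than the raw follow-up, is what gets embedded for retrieval, so a query like "How is it used?" no longer depends on earlier turns to be understood.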
- Hybrid Search (Dense + BM25)
- Cross-Encoder Reranking
- Persistent Vector Database (Chroma / Qdrant)
- Streaming Responses
- Whisper-based Audio Transcription
- Multi-language Support
- Agentic Video Research Workflow
- Evaluation Pipeline for Retrieval Quality
rag genai langchain faiss huggingface streamlit youtube chatbot retrieval-augmented-generation semantic-search vector-database llm conversational-ai