Skip to content

variang/chat-with-pdf

Repository files navigation

Multi-modal RAG with LangChain

A complete implementation of multi-modal Retrieval-Augmented Generation (RAG) that processes PDFs containing text, tables, and images. This project demonstrates how to extract, summarize, index, and query multi-modal content using LangChain and modern LLMs.

Features

  1. Document Processing: Uses unstructured to extract text, tables, and images from PDFs
  2. Summarization:
    • Text/Tables: Groq's Llama 3.1 (free)
    • Images: OpenAI GPT-4o-mini with vision
  3. Vector Storage: ChromaDB with HuggingFace embeddings
  4. Retrieval: MultiVectorRetriever (searches summaries, returns originals)
  5. Generation: OpenAI GPT-4o-mini for multi-modal question answering

Architecture

┌─────────┐
│  User   │
└────┬────┘
     │
     ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        Streamlit App                                 │
│  ┌────────────┐                                  ┌────────────┐     │
│  │ PDF Upload │──────────────────────────────────▶│  Chat UI   │     │
│  └─────┬──────┘                                  └─────▲──────┘     │
└────────┼──────────────────────────────────────────────┼─────────────┘
         │                                               │
         ▼                                               │
    ┌────────────────────────────────────────────┐      │
    │   Extract (text/tables/images)              │      │
    │   • unstructured partition_pdf              │      │
    └────────────────┬───────────────────────────┘      │
                     │                                   │
                     ▼                                   │
    ┌────────────────────────────────────────────┐      │
    │   Summarize Content                         │      │
    │   • Text/Tables: Groq Llama 3.1            │      │
    │   • Images: OpenAI GPT-4o-mini             │      │
    └────────────────┬───────────────────────────┘      │
                     │                                   │
                     ▼                                   │
    ┌────────────────────────────────────────────┐      │
    │   MultiVectorRetriever                      │      │
    │   ┌──────────────────┐  ┌───────────────┐ │      │
    │   │   Vectorstore    │  │   Docstore    │ │      │
    │   │   (summaries)    │  │  (originals)  │ │      │
    │   │   ChromaDB       │  │               │ │      │
    │   │   HuggingFace    │  │  • Full text  │ │      │
    │   │   Embeddings     │  │  • Tables     │ │      │
    │   └──────────────────┘  │  • Images     │ │      │
    │                         └───────────────┘ │      │
    └────────────────┬───────────────────────────┘      │
                     │                                   │
                     ▼                                   │
    ┌────────────────────────────────────────────┐      │
    │   RAG Chain                                 │      │
    │   1. Retrieve relevant summaries            │      │
    │   2. Fetch original content                 │      │
    │   3. Build multi-modal prompt               │      │
    │   4. Generate answer (GPT-4o-mini)          │      │
    └────────────────┬───────────────────────────┘      │
                     │                                   │
                     └───────────────────────────────────┘

Prerequisites

System Dependencies (macOS)

brew install poppler tesseract libmagic

Python Requirements

  • Python 3.11 or higher

Setup

  1. Clone the repository
git clone git@github.com:variang/chat-with-pdf.git
cd chat-with-pdf
  1. Create and activate virtual environment
python3.11 -m venv venv
source venv/bin/activate
  1. Install dependencies
pip install -r requirements.txt
  1. Set up environment variables

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_api_key
GROQ_API_KEY=your_groq_api_key
LANGCHAIN_API_KEY=your_langchain_api_key
LANGCHAIN_TRACING_V2=true

Usage

Running the Streamlit App

source venv/bin/activate && streamlit run app.py

The app will open in your browser at http://localhost:8501

Running the Notebook

jupyter notebook chat-with-pdf.ipynb

Key Steps

1. Extract PDF Content

from unstructured.partition.pdf import partition_pdf

chunks = partition_pdf(
    filename="./content/attention.pdf",
    infer_table_structure=True,
    strategy="hi_res",
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,
    chunking_strategy="by_title",
    max_characters=10000,
    combine_text_under_n_chars=2000,
    new_after_n_chars=6000,
)

2. Separate Content Types

tables = []
texts = []
images = []

for chunk in chunks:
    if "Table" in str(type(chunk)):
        tables.append(chunk)
    elif "CompositeElement" in str(type(chunk)):
        texts.append(chunk)

3. Generate Summaries

from langchain_groq import ChatGroq

model = ChatGroq(model="llama-3.1-8b-instant")
text_summaries = summarize_chain.batch(texts)

from langchain_openai import ChatOpenAI

image_chain = prompt | ChatOpenAI(model="gpt-4o-mini")
image_summaries = image_chain.batch(images)

4. Build Vector Store

from langchain_community.vectorstores import Chroma
from langchain_classic.retrievers import MultiVectorRetriever

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vectorstore = Chroma(
    collection_name="multi_modal_rag",
    embedding_function=embeddings
)

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    id_key="doc_id"
)

5. Query the System

response = chain.invoke("What is the attention mechanism?")
print(response)

response = chain_with_sources.invoke("Explain multi-head attention")
print("Answer:", response['response'])
print("Sources:", response['context'])

How It Works

MultiVectorRetriever Pattern

The system uses a two-layer storage approach:

  1. Vectorstore: Stores embeddings of summaries (lightweight, searchable)
  2. Docstore: Stores original content (full text, HTML tables, base64 images)

When you query:

  1. System searches summaries semantically
  2. Retrieves corresponding original content
  3. Passes originals (including images) to the LLM
  4. LLM generates context-aware answer

This approach enables:

  • Better retrieval (summaries match queries better)
  • Richer context (LLM sees full details)
  • Multi-modal support (search text descriptions, return images)

Limitations

  • Image quality: Depends on PDF extraction and LLM vision capabilities
  • Cost: OpenAI API calls required for vision features
  • Context limits: Large documents may exceed token limits
  • Accuracy: Summaries may lose details from original content

Troubleshooting

Rate Limit Errors

If you hit API rate limits:

  • Reduce max_concurrency in batch operations
  • Add delays between requests
  • Switch to free alternatives (Gemini, Ollama)

Memory Issues

If processing large PDFs:

  • Reduce max_characters in chunking
  • Process fewer images at once
  • Use smaller embedding models

Poor Retrieval Results

To improve retrieval:

  • Adjust chunking parameters
  • Use more detailed summarization prompts
  • Increase k parameter in retrieval
  • Try different embedding models

References

About

Multi modal RAG that allows users to chat with PDFs

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors