Multi-modal RAG with LangChain

A complete implementation of multi-modal Retrieval-Augmented Generation (RAG) that processes PDFs containing text, tables, and images. This project demonstrates how to extract, summarize, index, and query multi-modal content using LangChain and modern LLMs.

Features

Document Processing: Uses unstructured to extract text, tables, and images from PDFs
Summarization:
- Text/Tables: Groq's Llama 3.1 (free)
- Images: OpenAI GPT-4o-mini with vision
Vector Storage: ChromaDB with HuggingFace embeddings
Retrieval: MultiVectorRetriever (searches summaries, returns originals)
Generation: OpenAI GPT-4o-mini for multi-modal question answering

Architecture

┌─────────┐
│  User   │
└────┬────┘
     │
     ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        Streamlit App                                 │
│  ┌────────────┐                                  ┌────────────┐     │
│  │ PDF Upload │──────────────────────────────────▶│  Chat UI   │     │
│  └─────┬──────┘                                  └─────▲──────┘     │
└────────┼──────────────────────────────────────────────┼─────────────┘
         │                                               │
         ▼                                               │
    ┌────────────────────────────────────────────┐      │
    │   Extract (text/tables/images)              │      │
    │   • unstructured partition_pdf              │      │
    └────────────────┬───────────────────────────┘      │
                     │                                   │
                     ▼                                   │
    ┌────────────────────────────────────────────┐      │
    │   Summarize Content                         │      │
    │   • Text/Tables: Groq Llama 3.1            │      │
    │   • Images: OpenAI GPT-4o-mini             │      │
    └────────────────┬───────────────────────────┘      │
                     │                                   │
                     ▼                                   │
    ┌────────────────────────────────────────────┐      │
    │   MultiVectorRetriever                      │      │
    │   ┌──────────────────┐  ┌───────────────┐ │      │
    │   │   Vectorstore    │  │   Docstore    │ │      │
    │   │   (summaries)    │  │  (originals)  │ │      │
    │   │   ChromaDB       │  │               │ │      │
    │   │   HuggingFace    │  │  • Full text  │ │      │
    │   │   Embeddings     │  │  • Tables     │ │      │
    │   └──────────────────┘  │  • Images     │ │      │
    │                         └───────────────┘ │      │
    └────────────────┬───────────────────────────┘      │
                     │                                   │
                     ▼                                   │
    ┌────────────────────────────────────────────┐      │
    │   RAG Chain                                 │      │
    │   1. Retrieve relevant summaries            │      │
    │   2. Fetch original content                 │      │
    │   3. Build multi-modal prompt               │      │
    │   4. Generate answer (GPT-4o-mini)          │      │
    └────────────────┬───────────────────────────┘      │
                     │                                   │
                     └───────────────────────────────────┘

Prerequisites

System Dependencies (macOS)

brew install poppler tesseract libmagic

Python Requirements

Python 3.11 or higher

Setup

Clone the repository

git clone git@github.com:variang/chat-with-pdf.git
cd chat-with-pdf

Create and activate virtual environment

python3.11 -m venv venv
source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Set up environment variables

Create a .env file in the project root:

OPENAI_API_KEY=your_openai_api_key
GROQ_API_KEY=your_groq_api_key
LANGCHAIN_API_KEY=your_langchain_api_key
LANGCHAIN_TRACING_V2=true

Usage

Running the Streamlit App

source venv/bin/activate && streamlit run app.py

The app will open in your browser at http://localhost:8501

Running the Notebook

jupyter notebook chat-with-pdf.ipynb

Key Steps

1. Extract PDF Content

from unstructured.partition.pdf import partition_pdf

chunks = partition_pdf(
    filename="./content/attention.pdf",
    infer_table_structure=True,
    strategy="hi_res",
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,
    chunking_strategy="by_title",
    max_characters=10000,
    combine_text_under_n_chars=2000,
    new_after_n_chars=6000,
)

2. Separate Content Types

tables = []
texts = []
images = []

for chunk in chunks:
    if "Table" in str(type(chunk)):
        tables.append(chunk)
    elif "CompositeElement" in str(type(chunk)):
        texts.append(chunk)

3. Generate Summaries

from langchain_groq import ChatGroq

model = ChatGroq(model="llama-3.1-8b-instant")
text_summaries = summarize_chain.batch(texts)

from langchain_openai import ChatOpenAI

image_chain = prompt | ChatOpenAI(model="gpt-4o-mini")
image_summaries = image_chain.batch(images)

4. Build Vector Store

from langchain_community.vectorstores import Chroma
from langchain_classic.retrievers import MultiVectorRetriever

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

vectorstore = Chroma(
    collection_name="multi_modal_rag",
    embedding_function=embeddings
)

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    id_key="doc_id"
)

5. Query the System

response = chain.invoke("What is the attention mechanism?")
print(response)

response = chain_with_sources.invoke("Explain multi-head attention")
print("Answer:", response['response'])
print("Sources:", response['context'])

How It Works

MultiVectorRetriever Pattern

The system uses a two-layer storage approach:

Vectorstore: Stores embeddings of summaries (lightweight, searchable)
Docstore: Stores original content (full text, HTML tables, base64 images)

When you query:

System searches summaries semantically
Retrieves corresponding original content
Passes originals (including images) to the LLM
LLM generates context-aware answer

This approach enables:

Better retrieval (summaries match queries better)
Richer context (LLM sees full details)
Multi-modal support (search text descriptions, return images)

Limitations

Image quality: Depends on PDF extraction and LLM vision capabilities
Cost: OpenAI API calls required for vision features
Context limits: Large documents may exceed token limits
Accuracy: Summaries may lose details from original content

Troubleshooting

Rate Limit Errors

If you hit API rate limits:

Reduce max_concurrency in batch operations
Add delays between requests
Switch to free alternatives (Gemini, Ollama)

Memory Issues

If processing large PDFs:

Reduce max_characters in chunking
Process fewer images at once
Use smaller embedding models

Poor Retrieval Results

To improve retrieval:

Adjust chunking parameters
Use more detailed summarization prompts
Increase k parameter in retrieval
Try different embedding models

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
content		content
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
README.md		README.md
app.py		app.py
chat-with-pdf.ipynb		chat-with-pdf.ipynb
pdf_rag.py		pdf_rag.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multi-modal RAG with LangChain

Features

Architecture

Prerequisites

System Dependencies (macOS)

Python Requirements

Setup

Usage

Running the Streamlit App

Running the Notebook

Key Steps

1. Extract PDF Content

2. Separate Content Types

3. Generate Summaries

4. Build Vector Store

5. Query the System

How It Works

MultiVectorRetriever Pattern

Limitations

Troubleshooting

Rate Limit Errors

Memory Issues

Poor Retrieval Results

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Multi-modal RAG with LangChain

Features

Architecture

Prerequisites

System Dependencies (macOS)

Python Requirements

Setup

Usage

Running the Streamlit App

Running the Notebook

Key Steps

1. Extract PDF Content

2. Separate Content Types

3. Generate Summaries

4. Build Vector Store

5. Query the System

How It Works

MultiVectorRetriever Pattern

Limitations

Troubleshooting

Rate Limit Errors

Memory Issues

Poor Retrieval Results

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages