A complete implementation of multi-modal Retrieval-Augmented Generation (RAG) that processes PDFs containing text, tables, and images. This project demonstrates how to extract, summarize, index, and query multi-modal content using LangChain and modern LLMs.
- Document Processing: Uses
unstructuredto extract text, tables, and images from PDFs - Summarization:
- Text/Tables: Groq's Llama 3.1 (free)
- Images: OpenAI GPT-4o-mini with vision
- Vector Storage: ChromaDB with HuggingFace embeddings
- Retrieval: MultiVectorRetriever (searches summaries, returns originals)
- Generation: OpenAI GPT-4o-mini for multi-modal question answering
┌─────────┐
│ User │
└────┬────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Streamlit App │
│ ┌────────────┐ ┌────────────┐ │
│ │ PDF Upload │──────────────────────────────────▶│ Chat UI │ │
│ └─────┬──────┘ └─────▲──────┘ │
└────────┼──────────────────────────────────────────────┼─────────────┘
│ │
▼ │
┌────────────────────────────────────────────┐ │
│ Extract (text/tables/images) │ │
│ • unstructured partition_pdf │ │
└────────────────┬───────────────────────────┘ │
│ │
▼ │
┌────────────────────────────────────────────┐ │
│ Summarize Content │ │
│ • Text/Tables: Groq Llama 3.1 │ │
│ • Images: OpenAI GPT-4o-mini │ │
└────────────────┬───────────────────────────┘ │
│ │
▼ │
┌────────────────────────────────────────────┐ │
│ MultiVectorRetriever │ │
│ ┌──────────────────┐ ┌───────────────┐ │ │
│ │ Vectorstore │ │ Docstore │ │ │
│ │ (summaries) │ │ (originals) │ │ │
│ │ ChromaDB │ │ │ │ │
│ │ HuggingFace │ │ • Full text │ │ │
│ │ Embeddings │ │ • Tables │ │ │
│ └──────────────────┘ │ • Images │ │ │
│ └───────────────┘ │ │
└────────────────┬───────────────────────────┘ │
│ │
▼ │
┌────────────────────────────────────────────┐ │
│ RAG Chain │ │
│ 1. Retrieve relevant summaries │ │
│ 2. Fetch original content │ │
│ 3. Build multi-modal prompt │ │
│ 4. Generate answer (GPT-4o-mini) │ │
└────────────────┬───────────────────────────┘ │
│ │
└───────────────────────────────────┘
brew install poppler tesseract libmagic- Python 3.11 or higher
- Clone the repository
git clone git@github.com:variang/chat-with-pdf.git
cd chat-with-pdf- Create and activate virtual environment
python3.11 -m venv venv
source venv/bin/activate- Install dependencies
pip install -r requirements.txt- Set up environment variables
Create a .env file in the project root:
OPENAI_API_KEY=your_openai_api_key
GROQ_API_KEY=your_groq_api_key
LANGCHAIN_API_KEY=your_langchain_api_key
LANGCHAIN_TRACING_V2=truesource venv/bin/activate && streamlit run app.pyThe app will open in your browser at http://localhost:8501
jupyter notebook chat-with-pdf.ipynbfrom unstructured.partition.pdf import partition_pdf
chunks = partition_pdf(
filename="./content/attention.pdf",
infer_table_structure=True,
strategy="hi_res",
extract_image_block_types=["Image"],
extract_image_block_to_payload=True,
chunking_strategy="by_title",
max_characters=10000,
combine_text_under_n_chars=2000,
new_after_n_chars=6000,
)tables = []
texts = []
images = []
for chunk in chunks:
if "Table" in str(type(chunk)):
tables.append(chunk)
elif "CompositeElement" in str(type(chunk)):
texts.append(chunk)from langchain_groq import ChatGroq
model = ChatGroq(model="llama-3.1-8b-instant")
text_summaries = summarize_chain.batch(texts)
from langchain_openai import ChatOpenAI
image_chain = prompt | ChatOpenAI(model="gpt-4o-mini")
image_summaries = image_chain.batch(images)from langchain_community.vectorstores import Chroma
from langchain_classic.retrievers import MultiVectorRetriever
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = Chroma(
collection_name="multi_modal_rag",
embedding_function=embeddings
)
retriever = MultiVectorRetriever(
vectorstore=vectorstore,
docstore=InMemoryStore(),
id_key="doc_id"
)response = chain.invoke("What is the attention mechanism?")
print(response)
response = chain_with_sources.invoke("Explain multi-head attention")
print("Answer:", response['response'])
print("Sources:", response['context'])The system uses a two-layer storage approach:
- Vectorstore: Stores embeddings of summaries (lightweight, searchable)
- Docstore: Stores original content (full text, HTML tables, base64 images)
When you query:
- System searches summaries semantically
- Retrieves corresponding original content
- Passes originals (including images) to the LLM
- LLM generates context-aware answer
This approach enables:
- Better retrieval (summaries match queries better)
- Richer context (LLM sees full details)
- Multi-modal support (search text descriptions, return images)
- Image quality: Depends on PDF extraction and LLM vision capabilities
- Cost: OpenAI API calls required for vision features
- Context limits: Large documents may exceed token limits
- Accuracy: Summaries may lose details from original content
If you hit API rate limits:
- Reduce
max_concurrencyin batch operations - Add delays between requests
- Switch to free alternatives (Gemini, Ollama)
If processing large PDFs:
- Reduce
max_charactersin chunking - Process fewer images at once
- Use smaller embedding models
To improve retrieval:
- Adjust chunking parameters
- Use more detailed summarization prompts
- Increase
kparameter in retrieval - Try different embedding models