DocuMind is a multi-modal AI agent that extracts information from PDFs, images, and other documents to answer questions. It combines vision models with a RAG architecture for document understanding.
- PDF document processing and text extraction
- Image analysis using vision models
- Multi-modal document understanding
- RAG (Retrieval-Augmented Generation) architecture
- Question answering over uploaded documents
Requirements:
- Python 3.9+
- OpenAI API key
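The two requirements listed above (Python 3.9+ and an OpenAI API key) can be checked before running anything else. The following is a minimal sketch, not part of DocuMind itself: `check_requirements` is a hypothetical helper, and it assumes the key is exposed as `OPENAI_API_KEY`, the variable the official OpenAI Python SDK reads by default. Whether this project's `.env` uses the same name is an assumption.

```python
import os
import sys


def check_requirements(env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration only, not a DocuMind API.
    """
    env = os.environ if env is None else env
    version = sys.version_info if version is None else version
    problems = []
    if tuple(version[:2]) < (3, 9):
        problems.append(f"Python 3.9+ required, found {version[0]}.{version[1]}")
    # Assumption: the key is stored under OPENAI_API_KEY.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


# Report any problems found in the current process environment.
for problem in check_requirements():
    print(f"warning: {problem}")
```

Note that this only inspects the process environment. If the key lives solely in a `.env` file, it has to be loaded into the environment first by whatever loader the project uses.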
Install the dependencies:
pip install -r requirements.txt
Copy .env.example to .env and add your API keys:
cp .env.example .env
Use DocuMind from Python:
from src.main import DocuMind
# Initialize DocuMind
app = DocuMind(use_cache=True)
# Process a PDF document
result = app.process_pdf("path/to/document.pdf")
print(f"Processed {result['chunks_processed']} chunks")
# Query the knowledge base
answer = app.query("What is this document about?")
print(f"Answer: {answer['answer']}")
print(f"Sources: {answer['num_sources']}")
# Get application statistics
stats = app.get_stats()
print(f"Documents in store: {stats['documents_in_store']}")
Start the API server:
python -m uvicorn src.api:app --reload
Upload a document:
curl -X POST "http://localhost:8000/upload" \
-F "file=@document.pdf"
Query documents:
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{"query": "What is the main topic?", "top_k": 5}'
Run the tests:
pytest tests/
DocuMind uses a multi-modal RAG architecture that processes documents in five stages:
- Document Processing: Extract text and images from PDFs
- Vision Analysis: Analyze images and diagrams using vision models
- Embedding Generation: Create vector embeddings for text and visual content
- Retrieval: Find relevant content based on user queries
- Generation: Generate accurate answers using retrieved context
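The retrieval and generation stages listed above can be illustrated with a small, self-contained sketch. This is not DocuMind's implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, and `retrieve` and `build_prompt` are hypothetical names chosen for illustration. The final prompt is what a language model would receive in the generation stage; no model is called here.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': token counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks ranked by similarity to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Example chunks, as might come out of document processing and vision analysis.
chunks = [
    "The invoice total is 420 dollars, due on March 1.",
    "Figure 2 shows a bar chart of quarterly revenue.",
    "The contract may be terminated with 30 days notice.",
]
question = "When is the invoice due?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real deployment the count vectors would be replaced by dense embeddings held in a vector store, but the ranking step (score every candidate against the query, keep the `top_k` best) has the same shape.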
License: MIT