atahabilder1/DocuMind

DocuMind

Multi-modal AI agent that extracts information from PDFs, images, and documents to answer questions. Combines vision models with RAG architecture for intelligent document understanding.

Features

  • PDF document processing and text extraction
  • Image analysis using vision models
  • Multi-modal document understanding
  • RAG (Retrieval-Augmented Generation) architecture
  • Question answering over uploaded documents

Getting Started

Prerequisites

  • Python 3.9+
  • OpenAI API key

Installation

pip install -r requirements.txt

Configuration

Copy .env.example to .env and add your API keys:

cp .env.example .env
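
As a rough illustration of how keys in .env could be picked up at startup, here is a minimal stdlib-only sketch. The helper names `load_env_file` and `require_key` are hypothetical, and the variable names that DocuMind actually reads are defined by .env.example, which is not reproduced here.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def require_key(name, env_values):
    """Return the value from the process environment or the parsed file, else raise."""
    value = os.environ.get(name) or env_values.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env")
    return value
```

Failing early with a clear message when a key is missing tends to be easier to debug than an authentication error raised later by the model provider.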

Usage

Python API

from src.main import DocuMind

# Initialize DocuMind
app = DocuMind(use_cache=True)

# Process a PDF document
result = app.process_pdf("path/to/document.pdf")
print(f"Processed {result['chunks_processed']} chunks")

# Query the knowledge base
answer = app.query("What is this document about?")
print(f"Answer: {answer['answer']}")
print(f"Sources: {answer['num_sources']}")

# Get application statistics
stats = app.get_stats()
print(f"Documents in store: {stats['documents_in_store']}")
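
To ingest a whole folder, the same `process_pdf` call can be wrapped in a loop. The helper below is a hypothetical sketch, not part of DocuMind; it assumes only that `app.process_pdf(path)` returns a dict with a `chunks_processed` key, as in the example above.

```python
from pathlib import Path


def process_directory(app, directory):
    """Call app.process_pdf on every *.pdf in directory; return chunk counts by file name."""
    counts = {}
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        result = app.process_pdf(str(pdf_path))
        # 'chunks_processed' is the key used in the usage example.
        counts[pdf_path.name] = result["chunks_processed"]
    return counts
```

Calling `process_directory(app, "docs/")` with the `app` object created earlier would return a mapping from file name to chunk count.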

REST API

Start the API server:

python -m uvicorn src.api:app --reload

Upload a document:

curl -X POST "http://localhost:8000/upload" \
  -F "file=@document.pdf"

Query documents:

curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the main topic?", "top_k": 5}'
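
The query call can also be made from Python. This is a minimal sketch using only the standard library; it mirrors the URL, header, and JSON body of the curl example. The function names are hypothetical, and the response shape is not documented here, so `ask` simply returns whatever JSON the server sends back.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same address as in the curl examples


def build_query_request(query, top_k=5, base_url=BASE_URL):
    """Build a POST request for /query with the same JSON body as the curl example."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query, top_k=5, base_url=BASE_URL, timeout=30):
    """Send the query and decode the JSON response (requires a running server)."""
    request = build_query_request(query, top_k, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.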

Running Tests

pytest tests/

Architecture

DocuMind uses a multi-modal RAG architecture to process various document types:

  1. Document Processing: Extract text and images from PDFs
  2. Vision Analysis: Analyze images and diagrams using vision models
  3. Embedding Generation: Create vector embeddings for text and visual content
  4. Retrieval: Find relevant content based on user queries
  5. Generation: Generate answers grounded in the retrieved context
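
The retrieval and generation steps above can be sketched with a toy example. This is not DocuMind's implementation: word-count vectors stand in for model embeddings, an in-memory list stands in for the vector store, and every function name is hypothetical. It only illustrates the flow of chunking, embedding, ranking by similarity, and assembling a prompt from the top results.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: a sparse word-count vector (a real system would call a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return up to top_k chunks with non-zero similarity, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score > 0]


def build_prompt(query, context_chunks):
    """Assemble the text a language model would receive in the generation step."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In a multi-modal setup, descriptions produced by the vision step could be added to `chunks` alongside extracted text so that both are searchable through the same retrieval call.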

License

MIT

About

Upload any file and chat with your documents. Built with LangChain, vision APIs, and vector embeddings.
